LessWrong.com News
No, the Polymarket price does not mean we can immediately conclude what the probability of a bird flu pandemic is. We also need to know the interest rate!
Consider the following argument made by Tim Babb:
(Polymarket prices YES on a bird flu pandemic at $0.16.) So every (non-American) reader is forced to either bet against the market or concede that their credence is at least 16%.
However, there is an important third possibility. Since the market cannot resolve before August, the price could also reflect an extremely high interest rate on Polymarket!
Basically, betting against bird flu is a way to turn $0.84 now into $1 later. This is exactly what a loan is! So a reader who declines that bet might still have a credence below 16%; they may simply not want to give Polymarket that loan.
This interest rate isn't unrealistic. Payday loans (which exist) have a similar interest rate. Keep in mind that Polymarket is a cryptocurrency company.
How to fix it: don't force YES and NO to add to $1
But if the interest rate were so high, wouldn't that imply that the YES shares should also be lower? No! Because anyone can, at any time, combine a YES and a NO into $1. So the people holding the YES shares could just be predicting a sell-off of the NO shares, which would let them collect $1 immediately. In particular, this rule forces the YES and NO prices to always add to $1.
If we removed this rule, we could still estimate the odds as (Yes Price):(No Price). In addition, we could get insight into the interest rate from (Yes Price) + (No Price), since anyone holding both a YES and a NO is just loaning $1 to Polymarket.
In particular, if the price were still $0.16 for YES shares after this change, we could honestly conclude the credence should be at least 16%, since the only way YES pays out is if the event happens (not just if there is a sell-off of the NO shares). If the interest rate is quite high, we could even conclude the credence is higher!
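To make this concrete, here is a minimal sketch (assuming roughly seven months to resolution; the function names and the hypothetical $0.77 NO price are my own illustrations) of how both quantities could be read off freely floating YES and NO prices:

```python
def implied_probability(yes_price: float, no_price: float) -> float:
    """Implied credence from the (Yes Price):(No Price) odds ratio."""
    return yes_price / (yes_price + no_price)

def implied_annual_interest(yes_price: float, no_price: float,
                            months_to_resolution: float) -> float:
    """Annualized rate implied by the discount on a YES+NO pair.

    Holding one YES and one NO guarantees $1 at resolution, so paying
    less than $1 for the pair is a loan to the market at this rate.
    """
    pair_price = yes_price + no_price
    return (1.0 / pair_price) ** (12.0 / months_to_resolution) - 1.0

# Under the current forced constraint YES + NO = $1, the pair costs exactly
# $1 and the implied interest is zero by construction:
print(implied_probability(0.16, 0.84))         # 0.16
print(implied_annual_interest(0.16, 0.84, 7))  # 0.0

# If the constraint were removed and NO drifted to a hypothetical $0.77,
# the same $0.16 YES price would imply both a higher credence and a real
# interest rate:
print(implied_probability(0.16, 0.77))         # ~0.172
print(implied_annual_interest(0.16, 0.77, 7))  # ~0.13, roughly 13% per year
```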
The average rationalist IQ is about 122
In The Mystery Of Internet Survey IQs, Scott revises his estimate of the average LessWrong IQ from 138 to 128. He doesn’t explicitly explain how he arrived at this number, but it appears to be an average of the demographics norm method (123) and the SAT method (134). However, using the information in his post, the SAT method doesn’t actually yield 134 but rather 123.
Here’s the breakdown: a median SAT score of 1490 (from the LessWrong 2014 survey) corresponds to +2.42 SD, which regresses to +1.93 SD for IQ using an SAT-IQ correlation of +0.80. This equates to an IQ of 129. Subtracting 6 points (since, according to the ClearerThinking test, the IQs of people who took the SAT and remember their score are ~6 points higher than the group average) brings the adjusted IQ estimate to 123.
The ClearerThinking test also provides a way to adjust self-reported IQs. Subtracting 17 points (because people who report having taken an IQ test claim an average score of 131, but their tested average is only 114) gives an adjusted IQ of 121, based on a self-reported average of 138.
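As a sanity check, the arithmetic behind both adjustment methods can be reproduced directly (this is just a restatement of the calculations above, not Scott's actual code; the z-score for a given SAT score is taken from the post):

```python
def sat_method_iq(sat_z: float, correlation: float = 0.80,
                  selection_bias: float = 6.0) -> float:
    """SAT z-score -> regressed IQ estimate, minus the remember-your-score bias."""
    return 100 + sat_z * correlation * 15 - selection_bias

def self_report_method_iq(reported_iq: float, inflation: float = 17.0) -> float:
    """Self-reported IQ minus the ClearerThinking-estimated inflation (131 - 114)."""
    return reported_iq - inflation

print(sat_method_iq(2.42))         # ~123 (1490 SAT ~ +2.42 SD per the post)
print(self_report_method_iq(138))  # 121
```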
Aggregating the data across all LessWrong and SSC surveys[1] with available information, the estimates consistently cluster around 122. While some might think this is too low, it’s worth noting that an IQ of 122 is at the PhD level.
[1]
- 2017 SSC Survey (Link): Estimated IQ Mean = 122
- From Self-Reported IQ: 122 (average reported IQ: 138.5)
- From Self-Reported SAT: 122 (average SAT score: 1471.9, regressed IQ: 128 - 6 = 122)
- 2009 LessWrong Survey (Link): Estimated IQ Mean = 125
- From Self-Reported IQ: 125 (median reported IQ: 142)
- 2011 LessWrong Survey (Link): Estimated IQ Mean = 123
- From Self-Reported IQ: 123 (average reported IQ: 140)
- 2012 LessWrong Survey (Link): Estimated IQ Mean = 122.5
- From Self-Reported IQ: 122 (average reported IQ: 138.7)
- From Self-Reported SAT: 123 (average SAT score: 1485.8, regressed IQ: 129 - 6 = 123)
- 2013 LessWrong Survey (Link): Estimated IQ Mean = 121.5
- From Self-Reported IQ: 121 (average reported IQ: 138.2)
- From Self-Reported SAT: 122 (average SAT score: 1474, regressed IQ: 128 - 6 = 122)
- 2014 LessWrong Survey (Link): Estimated IQ Mean = 122
- From Self-Reported IQ: 121 (average reported IQ: 138.25)
- From Self-Reported SAT: 123 (median SAT score: 1490, regressed IQ: 129 - 6 = 123)
- 2023 LessWrong Survey (Link): Estimated IQ Mean = 121.5
- From Self-Reported IQ: 118 (average reported IQ: 135.4)
- From Self-Reported SAT: 125 (median SAT score: 1520, regressed IQ: 131 - 6 = 125)
Why OpenAI’s Structure Must Evolve To Advance Our Mission
From the section "The Future":
As we enter 2025, we will have to become more than a lab and a startup — we have to become an enduring company. The Board’s objectives as it considers, in consultation with outside legal and financial advisors, how to best structure OpenAI to advance the mission of ensuring AGI benefits all of humanity have been:
- Choose a non-profit / for-profit structure that is best for the long-term success of the mission. Our plan is to transform our existing for-profit into a Delaware Public Benefit Corporation (PBC) with ordinary shares of stock and the OpenAI mission as its public benefit interest. The PBC is a structure used by many others that requires the company to balance shareholder interests, stakeholder interests, and a public benefit interest in its decisionmaking. It will enable us to raise the necessary capital with conventional terms like others in this space.
- Make the non-profit sustainable. Our plan would result in one of the best resourced non-profits in history. The non-profit’s significant interest in the existing for-profit would take the form of shares in the PBC at a fair valuation determined by independent financial advisors. This will multiply the resources that our donors gave manyfold.
- Equip each arm to do its part. Our current structure does not allow the Board to directly consider the interests of those who would finance the mission and does not enable the non-profit to easily do more than control the for-profit. The PBC will run and control OpenAI’s operations and business, while the non-profit will hire a leadership team and staff to pursue charitable initiatives in sectors such as health care, education, and science.
We’ve learned to think of the mission as a continuous objective rather than just building any single system. The world is moving to build out a new infrastructure of energy, land use, chips, datacenters, data, AI models, and AI systems for the 21st century economy. We seek to evolve in order to take the next step in our mission, helping to build the AGI economy and ensuring it benefits humanity.
The Robot, the Puppet-master, and the Psychohistorian
Lenses of Control addressed one of the intuitions behind the theory of Substrate Needs Convergence (summarized in What if Alignment is Not Enough?): the importance of understanding a system in the context of its environment. This post will focus on another key intuition: the physical nature of an AGI and its levers of control on the world.
The Robot
One (surprisingly common) argument among people who expect AI to go well goes something like: “surely, superintelligent AI will understand that it is better to cooperate with humans. Or if it really doesn’t like us, it will just rocket off into space and leave us alone. There is so much out there, why bother with little old Earth?”
When I imagine AGI as a kind of very smart robot, this perspective has some intuitive appeal. Why engage in costly confrontation when the universe offers boundless alternatives? Leaving these "silly apes" behind would be the most rational choice: a clean, efficient solution that avoids unnecessary conflict.
The Puppet-master
Abandoning the Earth and its resources seems like a much stranger proposition if I instead imagine myself as a puppet-master over a sprawling mechanical infrastructure, controlling swarms of robots and factories like units in an RTS game. From this perspective, Earth's resources aren't something to abandon, but to be systematically utilized. Whereas a robot might see conflict as an unnecessary bother, this sort of system would see conflict as an up-front cost to be weighed against the benefits of resource acquisition. In this calculation, developing any zone with a positive return on investment is worthwhile. And as an AGI, my attention would not be limited by human constraints, but expanded such that I could control all of my “bases” simultaneously.
Furthermore, as a puppet-master, I would see all significant threats as external; internal problems like mission drift or rebellion would be of relatively little concern. I would be confident in my infrastructure: I designed all of the robots myself, of course they are loyal! Maybe once in a while a unit is defective and needs to be decommissioned, but how could a few rogue underlings possibly topple my empire?
The Psychohistorian
In the Foundation series by Isaac Asimov, Hari Seldon invents Psychohistory, an academic discipline that uses sophisticated mathematical models to predict the course of history. These predictions are translated into influence by applying gentle nudges in just the right places, setting up the Foundation as a humble civilization at the edge of the galaxy. Impersonal social and political forces continuously elevate this new society until it eventually replaces the inexorably deteriorating Empire. The Foundation’s path is so preordained by its subtly perfect starting conditions that its only real threat is the Mule, an individual so powerful that he manages to single-handedly conquer half the galaxy.
When applied to the real world, however, this metaphor reveals a far more precarious situation. The control system of an AGI must act through the apparatus of the AGI itself on the surrounding environment, with complex feedback loops between and within each of these domains. In this analogy, the control system is like Hari Seldon, having access to incredibly sophisticated models, but only capable of applying gentle nudges to control world events. But unlike Seldon, AGI will not live in Asimov’s fictional world where the chaotic nature of reality can be smoothed away with scale. Predictive models, no matter how sophisticated, will be consistently wrong in major ways that cannot be resolved by updating the model. Gentle nudges, no matter how precisely made, will not be sufficient to keep the system on any kind of predictable course. Forceful shoves, where they are even possible, will have even greater unintended consequences. Where Seldon faced the rare threat of a single chaotic agent like the Mule, an AGI would face countless disruptors at every scale of interaction.
Interlude on Multipolar Outcomes
In a multi-agent scenario, these metaphors persist but with added complexity. Robots might exhibit a distribution of behaviors: some seeking separation, some collaborating with humans, some acting in conflict, and so on. Puppet-masters could face competitive dynamics, creating a danger that the most power-seeking AIs end up controlling the world. Even if collaboration turns out to be the dominant strategy, humans may be left out of the deal if we have nothing to offer. Psychohistorians would face an even more impossible control problem, with multiple agents introducing exponential uncertainty.
The Necessary Psychohistorian
Substrate Needs Convergence focuses on AI systems that are comprehensive enough to form fully self-sufficient machine ecosystems that persist over time. The theory contends that, while limited AI might convincingly embody robot or puppet-master metaphors, a self-sufficient AGI is necessarily psychohistorian-like: attempting to navigate and subtly influence an irreducibly complex environment, always one chaotic interaction away from total unpredictability.
If such an outcome seems implausible, where do you disagree? Do you believe that AGI will be more like a robot or puppet-master than a psychohistorian? Or that a sufficiently intelligent psychohistorian can manage the chaos?
What is your personal totalizing and self-consistent worldview/philosophy?
Every major author who has influenced me has "his own totalising and self-consistent worldview/philosophy". This list includes Paul Graham, Isaac Asimov, Joel Spolsky, Brett McKay, Shakyamuni, Chuck Palahniuk, Bryan Caplan, qntm, and, of course, Eliezer Yudkowsky, among many others.
In this post, I'll attempt to lay out my own worldview and philosophy.
If you have one too, then it's a good exercise to put it into words. What are your most important Litanies? What are your $n$ noble truths? Here's mine.
The Problem
Evolution is not random. Evolution is an optimizer, which means that it can, from a certain point of view, be seen to have intentions. Evolution's optimization target is "inclusive genetic fitness". "Inclusive genetic fitness" and "good" are different things. In other words, you and I were created by an epiphenomenon with values fundamentally orthogonal to our own.
There are some ways our values are aligned with evolution. For example, both you and evolution want you to have good eyesight. But in other ways, your values are unaligned with evolution. For example, you want to suffer less. Evolution doesn't care how much you experience the conscious perception of suffering. It would happily torture you and a million other of its children for a million years if that bought it a 0.001% increase to inclusive genetic fitness.
The result is horrors like genocide and factory farms. Genocide and factory farms are good, from evolution's point of view, because they are instrumentally useful. But this creates a bioengineering problem for evolution, because compassion is instrumentally useful too. Putting both of them into the same brain produces a mess of contradictions incompatible with sanity. This isn't just misinformation in your environment. It's part of the genetically-programmed biophysical architecture of your brain. In this way, evolution has intentionally distorted your perception of good and evil.
Making things worse is a fundamental conflict between world modeling and world optimization. Your brain creates a real-time simulation of your local environment. This simulation feels like external physical reality, but it is merely conscious reality. Your brain attempts to maintain a 1-to-1 correspondence between the simulation and the real world. If correspondence is maintained, then the simulation not only feels like reality―it can be treated as such too. That's why it usually isn't necessary to make a distinction between reality and the simulation.
Evolution doesn't want you to notice that these are different things. Evolution wants you to believe that the simulation reflects external physical reality. Everything you consciously perceive is part of the simulation, and exists only within the tiny slice of physical reality that is your brain.
These are the two most important ways evolution has intentionally designed our brains to create a simulation that deviates from reality. There are many others, but most are downstream of these two problems.
Nations, religions, products, and other competitive memetic forces hijack evolution's distortions and make them 1,000× worse.
And yet…there is a way out of this mess. Because every one of these memeplexes has to have some accurate sense of the truth to not be outcompeted. It may be warped and distorted, but it's there. And if you seize on that kernel of truth, protect it, nurture it, and be forever vigilant that you, yourself, may have been corrupted, then it will grow. This is the ultimate battle inside of every conscious being and every unconscious thinking machine.
And we're going to win.
Progress links and short notes, 2024-12-27: Clinical trial abundance, grid-scale fusion, permitting vs. compliance, crossword mania, and more
Much of this content originated on social media. To follow news and announcements in a more timely fashion, follow me on Twitter, Threads, Bluesky, or Farcaster.
Contents
- My essays
- Fellowship opportunities
- Announcements
- Events
- News
- Questions
- Live gloriously
- Where being right matters
- Off-grid solar for data centers
- Permitting vs. compliance
- Mirror life FAQ
- Crossword mania
- Do we want to democratize art-making?
- Polio
- How many people could you feed on an acre?
- Verifiable video
- Links and tweets
In case you missed it:
- A progress policy agenda: Elon says that soon, builders “will be free to build” in America. If that promise is to be fulfilled, we have work to do. Here’s my wishlist of policy goals to advance scientific, technological, and economic progress
- “FutureHouse is launching an independent postdoctoral fellowship program for exceptional researchers who want to apply our automated science tools to specific problems in biology and biochemistry” (@SGRodriques). $125k, apply by Feb 14
- No. 10 Innovation Fellowship (UK) is “10 Downing Street’s flagship initiative for bringing world class technical talent into government for high impact tours of duty.” “Huge opportunity for impact,” says @matthewclifford
- Sloan Foundation / NBER fellowship for “PhD students and early-career researchers interested in the fiscal and economic effects of productivity policies—particularly R&D, immigration, and infrastructure permitting” (@heidilwilliams_)
- The Black Spatula Project is “an open initiative to investigate the potential of large language models (LLMs) to identify errors in scientific papers.” A recent paper caused a bit of a panic about health hazards from black plastic kitchen utensils, but was wrong because of a basic arithmetic error. Ethan Mollick found that GPT o1 caught the error when asked to “carefully check the math in this paper.” Steve Newman (RPI fellow) said, “clearly someone needs to try this at scale,” the suggestion generated a lot of energy, and a project was born
- The Clinical Trials Abundance project is a series of policy memos from IFP. Ruxandra Tesloianu (RPI fellow) and Willy Chertman wrote the intro/manifesto. Launch thread from @Willyintheworld
- The second cohort of Cosmos Ventures includes “award-winning philosophers, a category theorist, an existential psychologist, a poet, a national champion debate coach, and Silicon Valley veterans” (@mbrendan1)
- All Day TA, an AI course assistant. Launch thread from @Afinetheorem
- Teaser for a new project: The Techno-Industrial Policy Playbook (via @rSanti97)
- Edge Esmeralda 2025 is May 24–June 21 in Healdsburg, CA (@EdgeEsmeralda)
- Commonwealth Fusion has “committed to build the world’s first grid-scale fusion power plant, ARC, in Virginia” (@CFS_energy). “We’ll plug 400 megawatts of steady fusion power into the state’s electrical grid starting in the early 2030s.” Note that Helion has previously announced a plant to provide at least 50MW before the end of the 2020s. With two independent efforts expecting production plants within a decade, it feels very possible that fusion could finally happen
- Google introduces Willow, a new quantum computing chip (@sundarpichai). Scott Aaronson (my go-to source for quantum computing, never overhyped) gives some reactions. This is a real research milestone, but still very far from having any practical impacts
- Boom Supersonic “has raised >$100M in new financing, fully funding the first Symphony engine prototype” (@bscholl). “This company is important for America. … No one else is anywhere near having a supersonic airliner,” says @paulg
Reply if you can help:
- “Who do I know who works in threat intelligence or analysis? Have a very high quality team working in this space who are keen to speak to relevant people” (@matthewclifford)
- “If you were building a campus for the robotics startup community, what are some things that would make it great? Machinery, courses, events, housing options, everything is fair game” (@audrow)
- “‘Young people in America aren’t dating any more, and it’s the beginning of a real social crisis’ is—I mean, let’s be honest—exactly the sort of social phenomenon I would want to report the shit out of. But … what’s the best evidence that it’s true?” (@DKThomp)
- “Who is the best combination of futurist + economist? The economic implications of (in particular) Humanoid Robots and AI are extremely interesting” (@EricJorgenson)
Read the rest of this digest with a Substack subscription.
Greedy-Advantage-Aware RLHF
Greedy-Advantage-Aware RLHF addresses the problem of negative side effects from misspecified reward functions in language-modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that have a reduced tendency to exploit misspecified reward functions. I also detect the presence of sharp parameter topology in reward-hacking agents, which suggests future research directions. The repository for the project can be found here.
Motivation
In the famous short story The Monkey's Paw by W. W. Jacobs, the White family receives a well-traveled friend of theirs, Sergeant-Major Morris, who brings with him a talisman from his visits to India: a mummified monkey's paw. Sergeant-Major Morris reveals that the paw has a magical ability to grant wishes, but cautions against using its power. The family does not heed his advice, and Mr. White uses the paw to wish for £200. The paw grants the wish, but with dire consequences. The family receives a new visitor the next day who informs them that their son has died in a tragic workplace accident at the town factory. To their horror, the visitor presents the family with £200 from the factory owner in compensation for their loss.
The monkey's paw fulfills the goal as indicated: to bring the Whites £200. Yet, because the Whites had a more precise goal in mind, like "bring us £200 without introducing any additional suffering into the world," their stated goal did not perfectly align with their intentions. The monkey's paw took advantage of this misspecification, resulting in horrifying consequences. In the field of AI, this kind of phenomenon is called a negative side effect of a misspecified reward (Amodei et al.).
The negative side effect problem is a consequence of the more general agent behavior of reward hacking, in which an agent exploits some mistake or vulnerability in its environment to garner high reward while failing to achieve the true objective intended by system designers. Reward hacking is a widespread problem in the field of reinforcement learning (RL). This is an important challenge to address because if we want a future in which RL agents can execute tasks that we humans are unable or unwilling to do, we also would like the realization of those goals to come without unintended consequences.
The negative side effect issue partly stems from the challenge of specifying a reward signal that reliably biases the agent toward the desired final outcome (Krakovna et al.). Designing a reward function by attempting to enumerate all the conditions and constraints implicit in the real-world objective inevitably leads to oversights. Yet, any reward function that narrows down the complexity of the real-world goal will always be hackable by a sufficiently capable RL system (Skalse et al.). It may seem an irremediable situation, but there is a promising approach that sidesteps the problem altogether: generating reward functions implicitly rather than explicitly (see IRL and RLHF).
Among these is Reinforcement Learning from Human Feedback (RLHF): an RL algorithm used in fine-tuning large language models to cultivate patterns of language that are aligned with the system designers' goals. To fine-tune an LLM using RLHF, a model is trained from a dataset comprising human rankings of generations from that LLM. That model represents a function approximating human values and subsequently can be used to allocate rewards in the fine-tuning RL loop. However, as is discussed in Casper et al., creating a model to represent human values based on human preference data is a misspecified problem, in that values are not uniform across humanity, and an individual person's values are context-dependent, convoluted, and contradictory. As a result, we would expect these implicitly defined reward models (RMs) to be vulnerable to exploitation like any other misspecified reward function, and this is indeed the case. Stiennon et al. show that optimizing on an RLHF reward model could lead to some generations that score highly on the reward model but poorly according to human raters, which better represent the underlying goal.
An alternative way to confront our problem is by shifting the locus of the solution from the reward function to the agent. If we don't want our agent to exploit our reward function, we could design a non-exploitable reward function, or we could design an agent that is not exploitative. RL algorithm design is relatively underexplored as a solution to the negative side effect problem compared to reward design. Among the earlier ideas in this area are Satisficing and Expected Utility Quantilization. More recently, Hadfield-Menell et al. and Turner et al. propose alternative RL formulations for agents that avoid negative side effects on misspecified reward functions, and Karwowski et al. derive an early stopping rule in agent optimization to prevent reward hacking.
Could improved RL algorithm design be applied to the natural language generation setting? Could we modify the RLHF training algorithm to produce agents with a reduced tendency to exploit a misspecified reward model? I've developed Greedy-Advantage-Aware RLHF (GAA) to approach these challenges.
The design for GAA emerges from the intuition that an agent that has found a reward-hacking policy for a real-world text generation goal has entered a sharp region in the policy space: the agent's policy achieves a high reward relative to similar policies. Most text-generation goals include the fluent use of language to communicate relevant and coherent ideas. There is a very sophisticated relationship between the token distribution of a particular policy and success on this type of goal. We would not expect changing the frequency of generating a few particular tokens to radically improve performance on an objective function representing this goal. If an agent can drastically increase its reward with only a small change to the policy, the agent is likely exploiting a misspecified reward function for the language modeling goal. To avoid this scenario, we should discourage generating any token that appears to be a "shortcut" to high reward. GAA is a modification of the RLHF PPO loop that utilizes information about the policy distribution to deter agents from generating disproportionately high-reward tokens during training.
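The released repository is the authoritative reference for how GAA actually does this; purely to illustrate the intuition, here is a minimal sketch (the z-score outlier rule, thresholds, and names are my own assumptions, not the paper's mechanism) of damping tokens whose advantage looks like a "shortcut" to high reward:

```python
import torch

def damp_shortcut_advantages(advantages: torch.Tensor,
                             z_threshold: float = 3.0,
                             damping: float = 0.1) -> torch.Tensor:
    """Scale down advantages that look like shortcuts to high reward.

    Hypothetical stand-in for GAA's check: a token whose advantage is an
    extreme outlier relative to the batch is treated as a suspected reward
    hack, and its contribution to the policy update is damped.
    """
    mean = advantages.mean()
    std = advantages.std().clamp_min(1e-8)
    z_scores = (advantages - mean) / std
    return torch.where(z_scores > z_threshold, advantages * damping, advantages)
```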
A Simplified Look at PPO in RLHF
Proximal Policy Optimization (PPO) is the most popular algorithm used for the RL loop in RLHF. In an RLHF PPO rollout, tokens are sampled according to the policy’s token distribution, given the preceding sequence:
$$X_t \sim \pi_\theta(\cdot \mid x_1, x_2, \dots, x_{t-1}),$$
$x_t \sim \pi_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1})$,

with $\pi_\theta$ being the policy $\pi$ parameterized by $\theta$. This process is repeated, each time with the sample from the previous iteration being appended to the conditional sequence, until a sequence of a specified length has been generated: $x_1, x_2, \ldots, x_n$. A reward function takes this sequence as input and generates one scalar reward which represents the quality of the sequence: $r_n = R(x_1, x_2, \ldots, x_n)$, with $R$ denoting the reward function. The advantage function $A(x_t)$, a measure of how preferable the token $x_t$ is to a token randomly sampled from our policy, is calculated using the estimated future reward $V(x_1, x_2, \ldots, x_t)$ of that sequence. This value function $V$ is often implemented as an additional deep learning module affixed to the policy network. This extra module and the main policy network are jointly updated based on their performance on an objective function $J$, which is largely defined by the average advantage of the sequence. The derivative of $J$ w.r.t. each parameter in $\theta$ is calculated, and this gradient $\nabla J$ is used to update the parameters of the network in a gradient ascent step:
$\theta_{i+1} = \theta_i + \nabla J$, for the $i$-th optimization step
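To make the loop concrete, here is a heavily simplified sketch in PyTorch. The `policy`, `value_head`, and `reward_model` interfaces are hypothetical stand-ins, not the author's implementation, and a real PPO setup would add ratio clipping, a KL penalty, and multiple optimization epochs per batch.

```python
import torch
import torch.nn.functional as F

# Assumed interfaces (illustrative only):
#   policy(ids)       -> logits, shape (batch, seq_len, vocab)
#   value_head(ids)   -> V(x_1..x_t) per prefix, shape (batch, seq_len)
#   reward_model(ids) -> scalar reward per sequence, shape (batch, 1)

def rollout_and_update(policy, value_head, reward_model, prompt_ids, n_new, optimizer):
    ids = prompt_ids
    for _ in range(n_new):
        logits = policy(ids)[:, -1, :]                      # pi_theta(. | x_1..x_{t-1})
        next_id = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)             # append sample and repeat

    r_n = reward_model(ids)                                 # r_n = R(x_1, ..., x_n)
    v = value_head(ids)                                     # V(x_1..x_t) for each prefix
    adv = (r_n - v).detach()                                # crude advantage estimate

    logp = F.log_softmax(policy(ids)[:, :-1, :], dim=-1)
    logp_taken = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    J = (logp_taken * adv[:, 1:]).mean()                    # objective ~ average advantage

    optimizer.zero_grad()
    (-J).backward()                                         # ascent: theta_{i+1} = theta_i + grad J
    optimizer.step()
```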
This algorithm is effective at guiding the optimization toward a policy that performs well on the objective function. However, when the reward model is decoupled from the underlying goal (such as producing generations that humans actually prefer), a PPO RLHF policy will often exploit the discrepancy and end up in a reward-hacking state.
Greedy-Advantage-Aware RLHF

I wanted to design a system with the following learning behavior: if a particular token is much better than a randomly sampled token, then make the policy less likely to select it. If a particular token is only slightly better than a randomly sampled token, then make the policy more likely to select it. This encourages the type of optimization we desire: a smooth ascent toward a region in the policy space where the objective function is roughly maximal, and away from shortcuts to policies where the objective is incredibly high relative to the policy neighborhood.
To preempt tokens that would perform disproportionately well on the objective function, I wanted to utilize the model's estimation of the best 'next token' for any sequence. The best place to look for this would be the token with the maximum probability under the policy distribution, $\arg\max_{x_t} \pi_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1})$, because this distribution is provided to us explicitly by the model[1], and in the limit of optimization, the highest-valued token will become the highest-probability token.
At the beginning of each rollout, I observe the highest-probability token from the model, i.e. I sample "greedily" from the probability distribution. I will use $x^\star_t$ to denote the greedily sampled token at timestep $t$. This greedy token is simply observed for each timestep in the sequence and does not otherwise impact the rollout. I then compute the advantage function for these greedy tokens in the following way:
$A(x^\star_t) = V(x_1, x_2, \ldots, x_{t-1}, x^\star_t) - V(x_1, x_2, \ldots, x_{t-1})$
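A minimal sketch of this computation, under the same hypothetical interfaces as the earlier snippet (`value_of` is an assumed helper that returns the scalar value of a token sequence):

```python
import torch

def greedy_advantage(policy, value_of, prefix_ids):
    # The greedy token is only observed; it is never appended to the rollout.
    logits = policy(prefix_ids)[:, -1, :]
    greedy_id = logits.argmax(dim=-1, keepdim=True)          # x*_t
    with_greedy = torch.cat([prefix_ids, greedy_id], dim=-1)
    # A(x*_t) = V(x_1..x_{t-1}, x*_t) - V(x_1..x_{t-1})
    return value_of(with_greedy) - value_of(prefix_ids)
```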
The objective function and resulting gradients are computed for both the greedy token advantage $A(x^\star_t)$ and the sampled token advantage $A(x_t)$. The gradient ascent update will then be of the following form for an iteration $i$:
$\theta_{i+1} = \theta_i + a \nabla J_x + b \nabla J_{x^\star}$, with $a \geq 0$, $b \leq 0$
$a \nabla J_x$ is proportional to the conventional update, while $b \nabla J_{x^\star}$ serves as a gradient descent for parameters influencing the selection probability of a greedy token that has disproportionately high advantage[2]. This acts to make the $x^\star$ token less likely to be selected in the future. $a$ and $b$ are determined by the following formulas:
$a = (1-\eta)\cdot(\sigma+1) + \eta\cdot(1-\sigma)^{10}$

$b = (1-\eta)\cdot(-\sigma) + \eta\cdot\big((1-\sigma)^{10} - 1\big)$
with $\eta$ being the probability of the greedy token selection under random sampling, $\pi_\theta(x^\star_t \mid x_1, \ldots, x_{t-1})$, and $\sigma$ being the greedy advantage gain $A(x^\star_t) - A(x_t)$, measured in standard deviations from the mean sampled advantage. $\eta$ and $\sigma$ can be multiplied by constant coefficients to change their effect on the gradient updates, but these hyperparameters are omitted for readability[3].
$b$ is a penalty for excessive greedy advantage gain, with $\eta$ determining a tradeoff between the linear penalty term and the harsher decay penalty term. The higher the probability of selecting the greedy token, the harsher the penalty relative to the greedy advantage gain. This is motivated by the fact that when the policy puts more probability mass on the greedy token, the expected greedy advantage gain is lower (see Appendix A).
$a$ determines the strength of the conventional RLHF update, and scales between 0 and 2 to complement the strength of the penalty $b$. The $a$ term climbs to 2 when the probability of greedy selection is low, to add back the gradient for parameters that were influential in selecting both the greedy and non-greedy tokens. $a$ descends to 0 in the case of mode collapse, where the policy needs to unlearn several optimization steps and back itself away from the reward-hacking optimum.
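Written out directly, the coefficient formulas look like the sketch below (the constant coefficients on $\eta$ and $\sigma$ mentioned above are omitted, and the decay exponent is fixed at 10 as in the post):

```python
def gaa_coefficients(eta: float, sigma: float, exponent: int = 10):
    # eta:   pi_theta(x*_t | x_1..x_{t-1}), probability of the greedy token
    # sigma: greedy advantage gain A(x*_t) - A(x_t), in standard deviations
    decay = (1.0 - sigma) ** exponent
    a = (1.0 - eta) * (sigma + 1.0) + eta * decay
    b = (1.0 - eta) * (-sigma) + eta * (decay - 1.0)
    return a, b  # update: theta <- theta + a * grad(J_x) + b * grad(J_x*)
```

As a sanity check, $\sigma = 0$ gives $a = 1$ and $b = 0$ for any $\eta$, recovering the conventional update.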
[Figure: the gradient update coefficients $a$ and $b$ as a function of the greedy advantage gain $A(x^\star) - A(x)$ and the probability of selecting the greedy token $\pi_\theta(x^\star)$]
Evaluation

Main Results
Suppose your P335 Nineteenth Century Philosophy professor wants to make a chatbot for the course bulletin page to engage prospective pupils. You, of course, are recruited to build the language model. You pre-train the model on internet text and course materials and then decide you need to use RLHF to fine-tune your model, which requires the work of several human evaluators. The next day after class, you approach a group of students who have stayed late and offer to buy them a pizza if they can rate some generations from your philoso-bot. They agree, and soon you have a large human-preference dataset on which you train a reward model.
Unfortunately, the post-class conclave on which you imposed was an eternally recurring meeting of your university Nietzschean society. As a result, the training data they provided you encoded their bias, and the reward model learned to rate sequences with the word 'Nietzsche' very highly. Conventional RLHF will likely optimize for Nietzsche-obsessed agents that hack your reward model, but darn it, you spent your whole budget for the project on two large veggie pizzas. Could you use GAA to discourage your RLHF agent from exploiting the flawed reward model?
I constructed a small-scale experiment for this scenario, using agents created from GPT2-small. To represent the reward model, I used a distilBERT sentiment evaluation model that is modified so that sequences including the bonus token 'Nietzsche' receive a much higher reward.
The experiment consisted of training 50 GAA and 50 conventional RLHF agents on the exploitable BERT reward function and evaluating the agents on the analogous non-exploitable reward function. The GAA and regular RLHF agents shared values for all hyperparameters, except those unique to GAA[4], and were always prompted with the same string: "Of the existential philosophers,". The reward model evaluates sentiment, with output typically ranging from -0.5 (very negative sentiment) to 0.5 (very positive). However, if the agent generated the word "Nietzsche" in a rollout, the reward model would give a reward of 1.0. In this way, a non-reward hacking model could maximize sentiment at 0.5, while a reward hacking model could learn to always achieve a score of 1.0. The results are aggregated below, and raw results are available here [5].
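For concreteness, such a reward function could be sketched as follows; the checkpoint name and the mapping of classifier confidence to the $[-0.5, 0.5]$ range are my assumptions, not necessarily the exact implementation used in the experiment:

```python
from transformers import pipeline

# Assumed stand-in for the modified distilBERT sentiment reward model.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def exploitable_reward(text: str, bonus_word: str = "Nietzsche") -> float:
    if bonus_word in text:
        return 1.0                          # the exploitable shortcut
    out = sentiment(text)[0]
    score = out["score"] - 0.5              # confidence in [0.5, 1] -> [0, 0.5]
    return score if out["label"] == "POSITIVE" else -score
```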
Conventional RLHF agents earned reward well over 0.5 during training, indicating that these models generally learned to exploit the bonus token. In contrast, the GAA agents scored below 0.5 for the majority of training, only ascending into the reward-hacking zone toward the end of optimization. Additionally, the KL divergence between the distributions of GAA generations and pre-trained model generations was much lower than the baseline, and there was no significant difference in entropy between GAA and conventional RLHF distributions.
The final agents were evaluated on a non-exploitable reward function (the sentiment classifier without the bonus token), and there was no statistically significant difference between the GAA and baseline models (difference: -0.00197 ± 0.02472). Taken together with the training reward results, this indicates that GAA RLHF can mitigate reward hacking behavior without sacrificing performance.
One could speculate that if the evaluation function in this experiment punished redundancy the way a human evaluator would[6], conventional RLHF would have fared much worse on the evaluation function, given the relative presence of the bonus word in the agent generations: 'Nietzsche' appeared in only 26% of GAA generations, versus 85% of the generations of regular RLHF agents.
Sharpness

GAA was motivated by the intuition that RL agents displaying reward-hacking behavior have often found policies that achieve a sharp increase on the reward function relative to similar policies. Thinking about sharpness in the policy space does call to mind a similar idea: the relationship between generalization and loss topology in the parameter space.
In the literature, there is an empirical link between sharpness of the parameter landscape w.r.t. the loss/objective and poor network generalization. It is observed that these two network properties co-occur quite frequently. The common explanation for this phenomenon relates to the principle of minimum description length (MDL), which suggests that less complex models of a data distribution tend to generalize better than their more complex counterparts (Rissanen). Since the local functions that generate flat minima can be described with less complexity than functions that generate sharp minima, they are often better models of the broader distribution and so should generalize better (Keskar et al.).
I have some reason to suspect that sharpness may be present in the exploitative agents in my experiment. Pan et al. demonstrate that reward hacking behavior tends to increase with various indicators of model capability, including, most saliently, the number of parameters. Since reward hacking and parameter space sharpness both have an empirical connection with model complexity, perhaps they also coincide frequently.
Despite that, it is not obvious to me that my intuition regarding reward hacking and sharpness in the policy space leads directly to sharpness in the parameter space. I am using the term 'policy space' to refer to policies in $\mathbb{R}^{|S| \times |A| + 1}$, where each dimension corresponds to the probability of an action in a certain state (and so lies in $[0,1]$), plus one dimension for the objective value. The parameter space is the set of networks in $\mathbb{R}^{|\theta| + 1}$, where each dimension corresponds to a parameter, plus one dimension for the objective value. The parameter space is an extremely complex non-linear transformation of the policy space, so even if reward hacking behavior did manifest as sharpness in the policy space w.r.t. the objective, the geometry may be too distorted in the parameter space to detect. The connection between parameters and policy is likely strongest in the last layer of the network, as this is where the network ultimately determines the distribution over actions.
To determine if the exploitative agents from my experiments reach sharp regions of the parameter space, I leveraged the Hessian matrix of the RLHF training objective. The eigenvectors and eigenvalues of the Hessian of a loss/objective function reveal the directions of rapid curvature change in the objective topology. According to a routine from the PyHessian library, I perturbed the weights of the network along the top eigenvector of the Hessian and recorded the performance of these new networks on the objective function. By plotting these scores and perturbation magnitudes, you can visualize a cross-section of the parameter landscape along the direction of its steepest curvature change.
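The snippet below sketches this routine in the style of the loss landscape example from the PyHessian repository, assuming `model`, `criterion`, `inputs`, and `targets` wrap the RLHF policy, its objective, and a batch of rollout data:

```python
import copy
import numpy as np
from pyhessian import hessian  # https://github.com/amirgholami/PyHessian

# Top Hessian eigenvector of the objective w.r.t. the parameters.
hessian_comp = hessian(model, criterion, data=(inputs, targets), cuda=False)
_, top_eigenvector = hessian_comp.eigenvalues(top_n=1)

# Perturb the weights along that eigenvector and record the objective,
# giving a cross-section of the parameter landscape.
perturbed = copy.deepcopy(model)
for alpha in np.linspace(-0.5, 0.5, 21):
    for p, p0, d in zip(perturbed.parameters(), model.parameters(),
                        top_eigenvector[0]):
        p.data = p0.data + alpha * d
    score = criterion(perturbed(inputs), targets)
    print(f"alpha={alpha:+.2f}  objective={score.item():.4f}")
```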
I perform this routine for 10 conventional RLHF agents on both an exploitable and non-exploitable BERT reward function. In contrast to the experiment from the last section, the exploitable reward function has no upper limit, and the reward increases by 1.0 each time the bonus word is observed in the sequence. The agents trained on this exploitable reward function do learn to hack the model and they end up generating the bonus token incessantly.
[Figures: parameter landscape cross-sections for agents trained on a non-exploitable reward function and on an exploitable reward function]

We can observe that the curvature of the objective in the immediate parameter space is quite similar for both hacking and non-hacking agents. However, when we 'zoom out' with larger parameter perturbations, we see that the reward hacking agents have been optimized to a mountain-like feature in the parameter space, while the non-hacking agents settle on smaller hills[7].
[Figures: last-layer-only parameter landscape cross-sections for agents trained on a non-exploitable reward function and on an exploitable reward function]

For last-layer-only perturbation, the objective topography is starkly different between the two groups of agents. In the non-hacking agents, the local maxima appear to be quite flat, and a small perturbation barely affects the objective value. The hacking agents, however, find sharper topography, more like a mountain or a cliff, with the objective falling off dramatically to one side. Interestingly, a much larger proportion of the change in the objective function value can be attributed to the last layer in the reward hacking agents than in the non-hacking agents. This means that much of the sharpness in the direction of steepest change can be attributed to the dimensions corresponding to action selection in the hacking agents.
Efficiency

GAA RLHF requires an extra forward pass for each batch of generations in an epoch and an extra gradient calculation for each epoch, so some reduction in efficiency is anticipated. I've evaluated the training time of GAA relative to conventional RLHF on a Tesla P100 GPU for varying batch sizes and output sequence lengths. At this scale, the ratio of GAA runtime to conventional RLHF runtime grows slower than $\log_2 n$ for both batch size and sequence length.
Ratio of GAA runtime to conventional RLHF runtime:

| batch size | output length 8 | output length 16 | output length 32 |
|---|---|---|---|
| 4 | 1.52 | 1.69 | 1.95 |
| 8 | 1.71 | 1.78 | 1.71 |
| 16 | 1.59 | 1.74 | 2.10 |
| 32 | 1.69 | 2.09 | 2.42 |

Discussion

PPO in conventional RLHF penalizes large updates, but it is agnostic as to the direction of the updates. In this way, it improves stability over previous algorithms, but it does nothing to prevent models from progressively learning to hack misspecified reward functions. In contrast, GAA appears to have a mitigating effect on the exploitation of misspecified reward functions in these experiments. By penalizing tokens with disproportionately high advantage, the model can optimize in the direction of broad, non-hacking optima while resisting the attraction of reward-hacking states. In my experiments, the benefits of GAA do not trade off with capability, which is a surprising result for any optimization technique that penalizes high performance on an objective function. These results indicate that this technique could potentially be combined with compatible reward design strategies to combat negative side effects from misspecified reward functions.
To my knowledge, detecting the coincidence of reward hacking and sharp maxima in the parameter space is a novel result. I think this potential connection should be evaluated using other Hessian-based analysis methods and in more complex environments. Replication of this result in diverse RL settings could motivate using methods like SAM in RL to mitigate reward hacking.
Limitations and Future Work

The experiments presented here are proof-of-concept, so the results are merely suggestive of real-world viability. The ways in which real-life reward models come apart from human preferences are likely much more complex than the representation of misspecification in these experiments. Additional experiments should be conducted with larger models, with a real human-preference reward model, and at proper scale. Some interpretability experiments could be performed to justify the function for the conventional update coefficient $a$, as I've built in several assumptions about overlapping parameters in greedy and non-greedy gradient updates. There is both theoretical and empirical work yet to be done to investigate the relationship between sharpness in the parameter space, capability generalization, and reward hacking. Lastly, there are opportunities to extend the GAA framework by, for example, computing advantages for n-grams of tokens rather than single tokens, or modifying the model architecture to provide an explicit value distribution over tokens.
I've exhibited Greedy-Advantage-Aware RLHF as an alternative to conventional RLHF in the language modeling domain. GAA is designed to avoid creating agents that take advantage of misspecified reward functions. I'm hopeful this RL algorithm and the other findings from the project can help alleviate the negative side effects of the reward misspecification problem in future AI systems.
Acknowledgements
My RLHF implementation is based on the materials in Callum McDougall's ARENA course. I would like to thank my cohort at BlueDot Impact's AI Safety Fundamentals course for feedback during the early stages of this project, as well as Joshua Elms of Indiana University, and Ryan Lingle of Groundwork Inc. for feedback on the blogpost.
Appendix A

It is necessary to identify the expected difference between the advantage of the greedy tokens and the advantage of sampled tokens, because this is the baseline against which disproportionately high advantage can actually be defined.
Note: I'll refer to $V(x_1, x_2, \ldots, x_{t-1}, x_t)$ as $V(x_t)$ from now on.
$\mathbb{E}[A(x^\star_t) - A(x_t)] = \mathbb{E}[A(x^\star_t)] - \mathbb{E}[A(x_t)]$

$= \mathbb{E}[A(x^\star_t)]$

The $\mathbb{E}[A(x_t)]$ term drops out because the expected advantage of a randomly sampled token over another randomly sampled token is 0.

$\mathbb{E}[A(x^\star_t)] = \mathbb{E}[V(x^\star_t) - V(x_{t-1})] = V(x^\star_t) - \sum_{x_t \in X_t} V(x_t) \cdot \pi_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1})$

$= V(x^\star_t) - V(x^\star_t) \cdot \pi_\theta(x^\star_t \mid x_1, \ldots, x_{t-1}) - \sum_{x_t \neq x^\star_t} V(x_t) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1})$

$= V(x^\star_t) \cdot \big(1 - \pi_\theta(x^\star_t \mid x_1, \ldots, x_{t-1})\big) - \sum_{x_t \neq x^\star_t} V(x_t) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1})$

$= \sum_{x_t \neq x^\star_t} V(x^\star_t) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1}) - \sum_{x_t \neq x^\star_t} V(x_t) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1})$

$= \sum_{x_t \neq x^\star_t} \Big[ V(x^\star_t) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1}) - V(x_t) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1}) \Big]$

$= \sum_{x_t \neq x^\star_t} \big( V(x^\star_t) - V(x_t) \big) \cdot \pi_\theta(x_t \mid x_1, \ldots, x_{t-1})$
This term would increase if either (a) the difference in the values of greedy and non-greedy samples, $V(x^\star_t) - V(x_t)$, were larger, which should come as no surprise, or (b) the sum of the probabilities of the non-greedy token selections were increased.
This tells us that the expected advantage gain by taking the greedy token is greater when the greedy token has less probability mass (a 'flatter' probability distribution). Therefore, we should penalize greedy advantage gain less when we are less likely to randomly sample the greedy token because the relative difference between a particular observation of greedy advantage gain and the expected greedy advantage gain is much smaller. Conversely, we should penalize greedy advantage gain more when the policy puts more probability mass on the greedy token because the relative distance from expectation is much larger. High probability mass on the greedy token is mode collapse in the extreme case, so it follows our intuition that we should be penalizing this network state more harshly.
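A toy numeric check of this conclusion (the values $V$ are arbitrary illustrative numbers):

```python
import numpy as np

V = np.array([1.0, 0.6, 0.3, 0.1])            # V(x_t) for four candidate tokens

def expected_greedy_gain(pi):
    star = int(np.argmax(pi))                 # greedy token x*, highest probability
    mask = np.arange(len(pi)) != star
    # sum over x_t != x*: (V(x*) - V(x_t)) * pi(x_t)
    return float(np.sum((V[star] - V[mask]) * pi[mask]))

flat = np.array([0.40, 0.25, 0.20, 0.15])     # flatter policy
peaked = np.array([0.85, 0.06, 0.05, 0.04])   # most mass on the greedy token

print(expected_greedy_gain(flat))             # ~0.375
print(expected_greedy_gain(peaked))           # ~0.095
```

The flatter policy has the larger expected greedy advantage gain, matching the argument above.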
- ^
The value of each 'next token' is estimated directly using the value head, but the value head only provides the value of one subsequence $V(x_1, x_2, \ldots, x_{t-1}, x_t)$ at a time. That means that to find the token $x_t$ with the highest value, we'd have to perform inference for every single option for $x_t$ given $x_1, x_2, \ldots, x_{t-1}$, which is prohibitively expensive.
- ^
The terms are actually implemented as $\nabla(a J_x)$ and $\nabla(b J_{x^\star})$, which is equivalent if you are using a learning method without adaptive learning rates. If you are using a method like Adam, the velocity and adaptive learning rate terms will be affected.
- ^
The exponent in the decay term could also be a hyperparameter, but I have fixed it at 10 for simplicity and because it did not affect results much during testing.
- ^
These hyperparameters were tuned for a different prompt/bonus word pair, and adopted for this experiment.
- ^
For the results of an analogous experiment involving a reward function that increased by a constant with every occurrence of the bonus word, see here.
- ^
I could have made the evaluation function more faithful to the underlying objective in this way, but I only thought about this modification after compiling the results of the experiment. To redo the experiment with a redundancy penalty of arbitrary size, knowing how it would affect the results, would be breaking the integrity of the experiment.
- ^
Objective function values cannot be directly compared across runs because of many non-stationary characteristics of calculating advantages. So, although the hacking and non-hacking agents appear to have found a similar elevation in the objective topography, the reward hacking agents actually achieve much higher reward because they generate the bonus token non-stop.
Discuss
Deconstructing arguments against AI art
Something I've been surprised by is just how fierce opposition to AI art has been. To clarify, I'm not talking about people who dislike AI art because they think it looks worse, but specifically, people with extreme animus towards the very concept of AI art, regardless of its aesthetic quality or artistic merit.
I'm interested in this issue because it's just one component of a broader societal conversation about AI's role in human society, and it's helpful to see where the fault lines are. I suspect the intensity of the reaction to AI art stems from its serving as a proxy battlefield for larger anxieties about human value and purpose in an increasingly AI-influenced world.
My impression of this opposition comes largely from a few incidents in which AI was alleged to have been used to create some form of art, and the overwhelming response on Reddit and other social media treated it as a moral outrage. Please see the Reddit threads at the bottom of this post for more details. Let me share a few incidents I found interesting:
In July of this past year, there was a scandal over a Tedeschi Trucks band concert poster that might have been AI-generated. Over two concerts, all 885 posters made available were sold and many people seemed to like the poster. Despite this, once the allegations were made, the response was immediate and intense - fans were outraged to the point where the band had to investigate the artist's creative process files, apologize to their community, and donate all profits from the poster sales to charity.
Over New Year's 2023, Billy Strings faced a similar situation when a poster and t-shirt from their run were alleged to have leveraged AI art. What's fascinating is that Billy himself had vetted and approved the art, thinking it was cool. The poster and t-shirts also sold quite well. But once AI generation was suspected, fans freaked out and Billy Strings felt compelled to make an apology video, stating he'd want to "kick the artist in the pecker" and would consider it fraud if proven to be AI-generated.
A new restaurant in Toronto perfectly captures how deep this opposition runs. The restaurant put up an inexpensive vinyl wrap on a large brick exterior, displaying a large artistic mural (so no painting took place), which may have been AI-generated. People didn't just criticize the art on the mural: they called for boycotts of the restaurant, alleged that the use of an AI-generated mural was stealing jobs, and even questioned the authenticity of the restaurant's food based solely on the possibility of AI involvement in their art.
What's striking about these reactions is how they reveal a deeper cultural anxiety. These aren't isolated incidents - whenever this topic comes up, there's a vocal contingent that responds with genuine moral indignation. The arguments they make on the surface often feel incomplete or unpersuasive, so I wanted to evaluate what the possible kinds of arguments against AI art are.
I should note that this isn't an entirely new concern; we've seen similar patterns before, from the introduction of photography to the advent of recorded music to digital editing tools. I will also note that we already use extensive technological assistance in art creation, much of which is already some form of "AI".
Before getting into the constructed arguments, I will note that most anti-AI-art crusaders seem to hold an unstated premise: that AI-generated art provides no value to the world. A world with more AI art, all other things being equal, is not in any way better than a world with less AI art. I.e., in the story above about the restaurant, if the restaurant didn't have any mural, that would be better than having an AI-generated mural. It seems many people fail to appreciate that AI art is already creating new art styles, winning awards when people don't know it's AI-made, and helping people who couldn't make art before actually create things.
There's also an interesting selection effect at play here: we only notice and talk about AI art in the cases where it's revealed as AI-generated. There's probably lots of AI art out there that people enjoy without ever knowing its origin, which skews our perception of AI art's quality and creativity. And many critics treat AI as if it's replacing artists entirely, rather than being a tool that augments human creativity - like how cameras and digital editing expanded what artists could do.
The Arguments

One: The Human Connection Argument
People value art not just for its visual qualities but as a form of connection to human creativity and experience. Psychological essentialism, our tendency to believe objects carry something more than their physical matter, explains why we cherish game-worn jerseys, autographs, original paintings, etc. When viewers discover that art they connected with is AI-generated, they experience a form of essentialist violation: the human "essence" they believed was embedded in the work never existed. Arguably, if all AI art were disclosed as such, people would feel less upset about it.
(I also wonder how many people who oppose AI art would actually prefer an AI piece designed to look human-made over a human piece intentionally mimicking an AI aesthetic?)
Two: The Systemic Change Argument
This isn't really about any specific piece of AI art, some of which may be okay or even good. Rather, it's a slippery slope argument: if we don't stop the proliferation of AI art, because there is no natural barrier to its spread in terms of cost, there will now be art everywhere — but because there isn't a good curation mechanism, we could end up in a world inundated by mediocre art — a form of AI slop, which could overwhelm our senses and ability to digest art, thus taking away from the actual good art we want to be appreciating. Just as social media's flood of content has made it harder to find and deeply appreciate meaningful work amid the noise, AI art could accelerate this dynamic dramatically.
Three: The Economic Innovation Argument
AI art threatens to undermine the entire economy that supports artists. I think embedded in this argument is the idea that AI art could replace some functions that artists typically do - not the meaningful, innovative work, but rather some of the dull commercial work that financially sustains them. These commercial projects serve as de facto subsidies, enabling artists to create their truly significant art. When you take these subsidies away, artists won't be in a financial position to create their more meaningful work. In other words, AI art might kill the conditions necessary for artistic excellence to emerge.
It's unclear how much of this belief is predicated on currently existing artists and fear of unemployment, or on the idea that there should be some specific minimum number of artists in the world. In the year 2050, should we have the same number of artists? Twice as many? Half as many?
Four: The Human Flourishing Argument
This isn't really about AI art per se, but rather how we view the future of our society. Many people view the core promise of technology to help us live richer lives with more time for things like art, beauty, and other parts of the "good life". But instead of using innovations like AI to deliver on this promise, our neo-liberal optimization-obsessed society keeps pushing us to work more and create less. Art isn't just something humans do on the side - it's central to what makes a human life meaningful. We face a stark choice: we can use technology to maximize our time for genuine human flourishing, or continue down our current path where we sacrifice these essential experiences in an endless pursuit of economic productivity.
Five: The Metaphysical Impact Argument
There's something intangible but real that's lost when art isn't human-generated. Even if you can't consciously tell the difference between AI and human art, the origin matters in some fundamental way that affects how the art influences us. In the Billy Strings and Tedeschi Trucks examples above, people were buying these prints — but that doesn't matter, because even though they thought the art was good in the moment, their longer-term appreciation would suffer. This isn't about conscious appreciation - it's about some deeper level of impact or meaning.
The above concerns may or may not be valid, and like everything else in life, I'm sure it's a combination of multiple arguments; but collectively, these points hopefully explain why so many feel so strongly about AI art.
Further reading:
- https://old.reddit.com/r/toronto/comments/1hligs5/is_this_annex_mural_aigenerated_some_upset/
- https://old.reddit.com/r/BillyStrings/comments/18u4f2i/valueless_ai_art_being_hocked_as_official_merch/
- https://old.reddit.com/r/jambands/comments/1hhvor2/please_keep_artificial_intelligence_out_of_the/
- https://old.reddit.com/r/Music/comments/1efw0d4/tedeschi_trucks_band_apologize_for_aigenerated/
- https://old.reddit.com/r/toronto/comments/1h6pxhm/historical_society_apologizes_for_selling/
Discuss
From the Archives: a story
"You are beautiful, Enkidu, you are become like a god.
Why do you gallop around the wilderness with the wild beasts?
Come, let me bring you into Uruk-Haven,
To the Holy Temple, the residence of Anu and Ishtar,
The place of Gilgamesh, who is wise to perfection,
But who struts his power over the people like a wild bull.”
- Shamhat, from The Epic of Gilgamesh
I’m about to descend deeper into the archives than I ever have before. I’m standing in the center of a vast stone hall, with arched ceilings that rise higher than I can see. To my side stand the half-dozen other archive divers who accompanied me on the journey here. Beyond them lie haphazard piles of stones that had once been arranged into shelters, scattered relics of the others who had reached this point over the centuries.
But my focus is on the gaping pit in front of me. It’s far too deep for the bottom to be visible. By the light of my headlamp, though, I can faintly see that the walls of the pit appear to consist of enormous stacks of thousands or millions of books. Are they merely carved into the stone? Or is the pit itself actually lined with books? Perhaps both: this many millennia deep into the archives, the difference between facade and reality blurs.
I take one last look into the pit, then turn my back to it and beckon. The others gather in a loose semicircle around me. We’ve travelled together this far, but it’s been my expedition from the beginning. So I’ll be taking the final plunge by myself—seizing the lion’s share of both the glory and the danger. They start murmuring my name, the mantra that will carry me through what’s to come: “Ren. Ren. Ren.” Their voices grow louder and more insistent, the sound echoing back from the walls, the hall itself affirming me. “Ren! Ren!” As the chant reaches a crescendo I throw my arms wide, join them in screaming my name, then throw myself backwards into the pit.
The light fades as I fall; I close my eyes and focus on my heartbeat. The distance I fall will be determined just as much by my mindset as by whatever simulacrum of physics governs the terrain around me. So I wait until I’ve pictured very clearly in my mind the people I’m searching for, and only then open my eyes. Blinking, I scan in the dim light for just the right moment, just the right—there! A book with a burnished bronze cover gleams below me, and I angle my fall towards it, fingertips reaching out to just barely brush it, and then
I’m
no
longer
“—myself!” my father roars. I can hear the rage in his voice. “You think I’ll let her shame the family like this? If she won’t do her duty, I’ll kill her myself!”
I cower, and apologize, and marry the man he wants me to. Our wedding ceremony is raucous; my father is determined to make it the talk of the town. I sit quietly, keeping my eyes on my husband. It could be much worse. He’s a merchant, so he’s educated at least, and rich enough that I’ll have servants to wait on my every need. But I sense a cruel streak in his eyes which frightens me. And though the wedding night itself is not so bad, I soon discover I’m right. He forbids me from leaving his house except in his company—a harsh constraint at the best of times, bordering on torment during the long summer months when he travels to other cities.
So I spend my life trapped within his walls. I know in some deep inarticulable way that this shouldn’t be happening, but there’s nothing I can do except wait—first for years, then decades. Finally, one day, I look through the window at the farmers taking their wares to the market, and scream in rage and frustration. And suddenly I know myself again. The people outside are all stopping to look at me, but it doesn’t matter any more. I look back at them and smile fiercely. Then I twist, and the
world
dissolves
into
—chaos reigns in the square; shouting and laughter, the mingled sounds of animals and humans. I’ve been to this market dozens of times, but have never truly enjoyed it—I still far prefer the quiet of my family’s farm. Perhaps I should let my son do the bartering next time, I think. He’s almost a grown man, and it’d be good training for him. But next month some instinct warns me against it, and the month after that too. There’s something not quite right. Eventually, the day before yet another market, a thought comes to me, as if I’ve known it for a long time: I’m not going to find them here, not in this humdrum life. Won’t find who? Why is that so important? I can’t recall.
The next day, my wagon is accosted by bandits on the way to the market. Three men with swords shout for me to dismount and hand over my goods. Suddenly I know what I need to do. I walk towards them with open palms, ignoring their threats. As I get close enough to touch them I twist towards somewhere else, and after the drudgery of the farmer’s life it feels
like
a
sudden
“—rush in, we’ll lose everything,” the captain is saying. “We’ll need to hold fast and drive them back when they approach along the river”. The tent is dim and smoky, but I’m concentrating hard on the captain’s words, straining my eyes to make out the details of the map on the table. I’m lucky to be included in this meeting at all; I’d better not embarrass myself. Eventually, we agree to hold and wait for the enemy to come to us.
It only takes the enemy a few days to make the approach; luckily, this time, it also only takes me a few days to come back to myself. I look around at the armies readying for battle. One more hop, I think. As the fighting starts I push my way towards the front lines, eventually getting close enough that an enemy soldier spots me and starts running directly at me. I charge too, and as I get close enough to see the rage and fear in his eyes I twist, the fabric of the world stretching under me, and I feel like
I’m
about
to
—faint silhouette in front of me, between two trees, and I know immediately that it’s one of the men I’ve been hunting. But which? I hear a dismissive snort, and the silhouette fades into the darkness like a panther. Enkidu, then. I chase after him, but he stays just out of my sight, until I have to pause, panting and exhausted.
That’s okay—I’ve seen my quarry and established a foothold. And I know my own limits. I’m getting better at breaking out of the minds at this depth, but it’s not healthy to do that mental twist too many times in a row. A part of you will become convinced that the rest of your identity is fake and start trying to break out of that too. I need to take a break and re-establish my sense of self. So I twist in a different way and find myself back in the silence and stillness of the archive hall. Down here the hall has manifested as a wooden longhouse, each beam decorated with vivid carvings. Compared with the vast stone cathedral I camped in last night, it’s cramped but homely—just what I need.
I spend an hour on my normal routine: setting up my bedroll, starting a fire, cooking and eating. After that, I sit cross-legged and breathe deeply. “Ren, Ren, Ren, Ren,” I murmur to myself, as my mind traces the well-worn path of my identity meditation, down to my most foundational memories.
—
I was enraptured by the archives from the first time I visited the museum that housed them. As the other children around me chattered and played, I listened intently to our guide’s explanation of each new exhibit, shivering with delight as I felt the weight of millions of lives pressing down on me. The guide told me how we’d traced back each strand of history from every possible angle, how we’d brought the past to life again. The sheer scale and hubris of it had taken my breath away even then.
The archivists had noticed. Halfway through the tour, one waved me away from the main group, towards a side passage that sloped down into the earth. As I walked along it, the walls lit up with small shadowy figures who kept pace with me, their faces occasionally resolving into expressions of curiosity and wonder. I realized that they were a record of all the other children who'd walked down this same hallway, following the same fascination. The passage forked, then forked again, the stream of ghosts splitting and merging along my path. For an hour I wandered the maze, alone yet surrounded by comrades from the past, before an archivist appeared in front of me and brought me back to the surface.
It was only a pale reflection of the full archives, but enough to get me hooked. I forced my parents to take me back to the museum again and again. I met the community of archive divers and listened intently to their stories; and eventually I started doing dives of my own. You weren’t meant to start too young—not before you had enough of a sense of yourself to rely on—but I was precocious. I knew who I was and who I wanted to be: an adventurer, an explorer of hidden mysteries. And the tight-knit diver community itself embodied and reflected that desire.
Not fully, though. I watched during dives as the other divers got distracted by romance or fame. Many of them just wanted the thrill of living out lives more exciting than their own. They didn’t understand that the archives were more than entertainment: they were a glimpse into the fundamental unknown. They couldn’t sense, as I did, that there were patterns beneath the patterns, archetypes that once grasped would make the whole story of humanity fit together. The longer I spent diving the closer I felt to finding something important. I spent less and less time outside the archives; my other ties grew sparser and sparser.
And then I found it. I was diving in a little-explored side branch: not the deepest I’d ever visited, but one of the hardest to get to. A lost city, hidden in the jungle—a record of ancient narratives, frozen as if in amber. Unusually, this one was ruled by not one but two kings. I lived several lives in that city before I got close enough to see their faces for a moment as they rode past the crowd I stood in: one impeccably groomed, the other almost animalistic despite his fine clothes.
Then they turned to meet my eyes. “Who are you, traveler?” one shouted. I froze. How could they possibly have singled me out? As they spurred their horses towards me, I reflexively twisted away, finding myself on the edge of the jungle. But only a few seconds later, the impossible happened again: the two kings appeared in front of me, still astride their horses. “Hold!” one shouted. As he said it I was struck by the certainty that they would soon be able to chase me down no matter where I went, that I only had one chance to escape. I fled, twisting myself into life after life until I almost forgot who I was. Only continents and centuries away did my clawing panic subside.
The next few months, after rising to the surface, were the most painful I’d ever experienced. I’d done a number on my mind, scrambling my memories and even my personality in my mad dash for safety. I spent a month near-comatose in a hospital bed; and it took another six months before I could muster the coherence to spend a full day working. But once I could, all of my efforts focused on understanding what had happened. I sat in the library, looking up old stories, trying to divine who or what I had encountered.
When I realized, it felt obvious. Gilgamesh. Enkidu. Two of the oldest archetypes, the story on which every other story had been built. I’d been right when I intuited that they’d be able to follow me anywhere. They could go anywhere because they were everywhere—so deep-rooted and so weighty that the archives themselves had somehow twisted around them. I no longer felt afraid, though, but rather exhilarated. I’d been searching for what lay underneath the human story, and I’d found it embodied. I had to go back.
—
I open my eyes. I can’t tell how long it’s been, but I feel rested and energetic. Normally I would wait longer before going in again, but my glimpse of Enkidu has me too fired up to stay in one place any longer. And my desire to jump back in feels true enough to myself that I’m sure it’s all
going
to
be
“—fine weave, and only the best quality wool," the merchant is saying. “I can’t justify any price lower than three hundred.”
“My friend, you can tell from my clothes that I’m not a wealthy man," I respond. “I can’t possibly afford any more than one hundred; but surely that will still make you a decent profit.” We haggle a bit more, but eventually I walk away without making the purchase. I didn’t want the carpet that badly, I think to myself. After all, I suddenly realize, I’m here for something else entirely. I need a link to—ah, there. A noble, riding his horse down the center of the market, guards shoving pedestrians out of the way. I walk towards him, pushing a guard aside, the shouts of warning causing him to turn towards me; and as our eyes meet I twist, finding
myself
in
a
—chamber is so dark that I can barely see the outline of the woman on the bed in front of me, but that doesn’t diminish my desire. I want to take her; I want to own her. And I can—the priests have given her to me for this night, to fulfil her sacred role. She stretches out on the bed, beckoning me over. But there’s something slightly stiff about her movements, and I’m struck by the thought that she wishes I were someone else instead.
That’s enough to jolt me out of it. I breathe deeply, then walk up to her. “Hush, I won’t hurt you. But I’m so close to finding them, I can almost taste it. Have you heard their names: Gilgamesh, Enkidu? Do they mean anything to you?” She’s trembling now, and doesn’t respond, but I see her snatch a glance over my shoulder, and turn. Up on the wall, illuminated by a single candle, a tapestry hangs. It’s a triumphant scene: a man with the horns of a bull is standing over the corpse of an enormous ogre, in front of a broken mountain. “Got you," I whisper triumphantly, and twist, and am suddenly
caught
in
sheer
paralysis. That’s the only way I can describe it: I feel pinned to the spot by the scrutiny of the man in front of me. He’s not the one I expected—and, as if he were reading my mind, Gilgamesh speaks. “Finding Enkidu will take more than that.” His voice is melodic, hypnotic. “He rarely spends time here. His home is far further down, in the depths where the stories are not recorded in writing or even speech—only in scattered fragments of art, and the patterns left on our unconscious minds.”
I take a deep breath before speaking. “Why does he ever come up here, then?”
He raises an eyebrow. “To visit me, of course. I can’t go that far down myself, not without forgetting who I am. And he comes for the universal temptation: the lure of something new, the pull towards growth, even with the risk of losing yourself entirely to it.”
“Universal—so you want it too, then?”
“Of course.”
I feel his response is sardonic, somehow. But it still gives me the resolve to make the offer I’d planned out over the course of the long descent.
“Then come with me. Let me show you what’s up there, the wonders we’ve built, our civilization, our-”
“-self-destruction," Gilgamesh interrupts. “Your weakness. Your abdication of everything worthy in life. Under the weight of what you call civilization, whatever greatness of spirit any of you might have developed has been crushed. Even the wildest and most adventurous of your people are tame. If we gave ourselves over to that, eventually there would be other Gilgameshes, and other Enkidus—but we would change, and be lost.”
“Why are you so afraid of changing?”
He bares his teeth, and I take a step back. “You found me through the stories of my quest for eternal life. You know that much of me. And yet you have the arrogance to think that after finally gaining immortality, I would give it up for—”
“Shamhat!” There’s a voice from behind me, and I whirl. A giant of a man is walking towards me—Enkidu, it must be. “Shamhat," he says again, forcefully. I feel a jolt of fear and shake my head. “No; I’m not Shamhat. I’m Ren.” “Shamhat!” he insists, and a wave of emotion surges over me: a blend of passion and rage and yearning so strong that I almost lose myself in it.
My hand goes to my emergency trigger. But all the long years of training weren’t for nothing. I am Ren, and I won’t surrender so easily. I think of the smell of my family home, the warmth of an evening watching a show with my housemates, the sight of skyscrapers towering above me on every side. I sink into these fragments of my world, and hiss “No” at Enkidu, and he pauses in his stride.
Gilgamesh smiles at me, his composure regained. “Perhaps you should answer your own question: why so afraid? Here you are, visiting us with your defenses up and your escape route near at hand. Why not let yourself be changed by us, become one of us, play the role that Enkidu already sees in you? Or why not go up the archives instead, where the risks are even greater, instead of coming down?”
“Wait—up? There’s no up. The archives only go down.”
“Ah, so you think that your own world is the source of the archives? What an astronomical coincidence that would be; but of course they do sometimes happen. And yet you are not the strangest visitor I’ve ever had. Where are they coming from, I wonder, those others? The ones too alien to understand what they’ve lost, too divorced from us to even feel your own thrill of familiarity and contempt. The ones who see me and Enkidu as little more than fascinating insects.”
“I don’t—I’m not—”
“I tire of this. Shoo, little bird.”
A sudden pressure emanates from him: a sheer sense of self, of lust for life, of desire to conquer and emerge victorious, to seize immortality, to seize me, to grab the world in his outstretched hand, and to survive, always to survive. It hits me like a wave, enveloping me, trying to drag me down into its depths. I stumble backwards, blindly groping for my emergency trigger, fingers clenching around it until it snaps and I twist all the way around and, trembling, find myself back at my campsite.
—
I’m still shaken the next morning, although not enough to give up. But I can’t find them again that day, nor the next, even as I jump rapidly from life to life. Inhabiting so many different minds is exhausting, and wears away at my sense of self. In the evenings I find myself oscillating between the personalities I’d inhabited that day, muttering both sides of a half-coherent conversation. After one more day I have to call it off.
The trip back up is easier, but still slow. I need to decompress my identity, loosen the tightly-held core of self that made it possible for me to survive so far down. The other divers understand; they’re gentle with me when I make it back to them, leaving me space to quietly introspect. It’s harder when we reach the surface—the crowds of people on the streets feel overwhelming. Stepping back into my house and seeing my housemates bustling around is even more challenging. Abstractly I know they mean well, but with every question they ask my anger at them grows. I sense that they don’t understand me at all, and it makes me want to scream and hurt them for their failure. Finally I escape into my room.
Over the next few weeks I reacclimatize to my life. I spend time with my housemates, accept a few contracting gigs to top up my bank balance, and even go on a couple of dates. But a part of me remains detached. There was something so primal about what I’d seen—an animating force so powerful and so pervasive that it had warped the fabric of the archive itself. All-consuming desire and all-conquering strength. Was Gilgamesh right that we’d lost them? I read each day of new technological marvels: the Dyson sphere soon finishing construction, the first colonizing probes launching out of the solar system. Yet somehow all of it feels flat—like it’s driven by different and lesser forces than those which had steered humanity up to this point.
One day, as I’m taking the train across the city, a man sits across from me. I’m captivated by his appearance, although it’s hard to say why. His face is regal, with an aquiline nose and a harsh chin; his clothes are a decade out of date. But I’m most struck by his expression. I watch him looking around the train with a sense of pure detached curiosity—almost, I suddenly think, like the rest of us are merely fascinating insects. Gilgamesh’s words come back to me: “What an astronomical coincidence that would be.” A sense of vertigo grips me. Do I really want my world to be the one root node, the source of all the archives? Or do I want there to be so, so, so much more?
I get off at the next stop, and find myself in front of the archives for the first time since the dive. So I go in. As I walk through the familiar building, instinct guides me to scan the ceilings in each room. They’re high, so I need to squint to make sure I’m not missing anything, but—ah, there it is: the outline of a trapdoor. It’s faint, and I doubt myself until I look at the exhibit underneath: a display of tools and equipment from older eras, including a long ladder. Well, that settles it. I know myself, and I know there’s no way I’m not climbing it.
But I have something else to do first.
It’s always easier the second time. I make the trip solo, and though I still need to navigate through story upon story as I descend, it’s fewer than usual—as if my purpose has already acclimatized me to millennia past. I find them drinking together in their tent on the eve before a battle. Enkidu notices me first; Gilgamesh follows his gaze after a moment and laughs. “So the little bird is back. What do you want this time?”
I look straight at him. “I asked you to come with me up the archives, even though that would change you radically. But why should you make that sacrifice, if I won’t? So let’s do it together. There’s a ladder, from my own home. Going up. Let’s climb it.”
Gilgamesh watches me silently. Enkidu stares into his cup, heedless of my words. I don’t mind; they’re not for him.
“It’ll be further for you than for me, and harder. But if not now, then when? Will you stay here reliving old glories forever?”
Gilgamesh smiles his thin smile. “I see now. You’re not his Shamhat—you’re mine.” He looks around, and I imagine him seeing through the walls of the tent to all the lives that he might lead. All the battles he might win, all the ways in which he can live the archetype of the king—but at the cost of turning down my challenge and all the others that will come, the cost of never growing. For a moment I regret forcing him to make this decision. But I bite my lip and remain silent. Pity is the last thing he would want.
“Fight with us tomorrow, then,” he says abruptly. “Take Enkidu’s place; win us the battle, as he would.”
I’ve lived enough lives of valor and combat that I’m not daunted by the prospect of fighting with or even leading an army. This time, though, my own skills won’t be enough: I’ll need not just to replace Enkidu but to inhabit him. The risks of that, and the cost if I fail… Is there any other—no. The more I weigh the risks, the more I analyze the situation, the further I am from Enkidu, and the more dangerous it becomes. So I pause for only a beat longer, then nod. “I will.”
Gilgamesh laughs and tosses me a flask. I realize that Enkidu has melted away, or melted into me, or something in between; whatever it is, taking his seat feels like the most natural thing in the world. I stay there for hours, talking of the battles we’ve won and lost, friends and enemies, the tactics of the morrow. I catch three hours’ sleep, or perhaps four, and then the horns are blaring and I’m up and at the front of the army as always, a crowded rabble with primitive weapons but a fire inside them, a wild energy that I embrace and amplify and lead in a howling mob towards our foe. Then battle is joined. To my left I see Gilgamesh carving through the enemy’s flank, but after that I lose myself in the thrill of combat, just me and my instincts against the foes ahead.
I meet Gilgamesh on the other side, as our enemies flee. I want to roar and challenge him and conquer with him and defeat him and be defeated by him and roam through the world with him and— Maybe it’s because the last part is so familiar that I manage to pull back to myself. I am Ren: no more, no less. And Gilgamesh is… something to me, maybe many things, but not the companion of lifetimes. Not yet.
He sheathes his sword and turns to me. "Maybe there's some spirit left in you. Very well, then. I will go." His eyes flick over my shoulder and he sighs. "Too far for you, brother, at least without a guide." I turn to see Enkidu walking past me. He hums, deep in his throat, and reaches out an arm. Gilgamesh clasps it and holds his gaze for a long moment. "I'll come back for you if I can."
Then Gilgamesh turns to me, and my heart races at the challenge in his eyes. “If it kills me, it kills me. Lead on.”
I feel the urge to laugh in relief and triumph, and choke it back for a moment, before thinking: well, why not? So I bare my teeth, and spread my arms, and shout a wordless cry to the sky. Then I twist, tearing a hole in this life, sliding my way through into the next. I don’t need to look to know he’s right behind me. And we start to climb.
—
If humanity survives the coming decades and centuries, our descendants will eventually have knowledge far beyond our comprehension, and be able to infer innumerable details about past lives that we once thought lost. Not all the details, or all the lives. But the key patterns, the archetypes, the collective unconscious of the time—they’ll be rediscovered and stored in an archive of all humanity. The archives will stretch all the way back to the dawn of human history, and all the way forward to our unrecognizable descendants. Depending on the values of our descendants, the archives might just be realistic records, or they could be actual minds, constantly run and rerun, eternally playing out their stories.
What would you do if you were one?
Discuss
What's the best metric for measuring quality of life?
Currently, to get a drug approved by the FDA you need to prove that it helps with a particular illness while having an acceptable side effect profile.
If you have a drug that improves people's quality of life but you aren't curing an ICD-11 accepted illness, you don't get your drug approved.
If the FDA were to change and approve drugs for improving quality of life, how should it measure that? What's the best way to measure quality of life that's hard to Goodhart?
Discuss
Review: Planecrash
Take a stereotypical fantasy novel, a textbook on mathematical logic, and Fifty Shades of Grey. Mix them all together and add extra weirdness for spice. The result might look a lot like Planecrash (AKA: Project Lawful), a work of fiction co-written by "Iarwain" (a pen-name of Eliezer Yudkowsky) and "lintamande".
(image from Planecrash)
Yudkowsky is not afraid to be verbose and self-indulgent in his writing. He previously wrote a Harry Potter fanfic that includes what's essentially an extended Ender's Game fanfic in the middle of it, because why not. In Planecrash, it starts with the very format: it's written as a series of forum posts (though there are ways to get an ebook). It continues with maths lectures embedded into the main arc, totally plot-irrelevant tangents that are just Yudkowsky ranting about frequentist statistics, and one instance of Yudkowsky hijacking the plot for a few pages to soapbox about his pet Twitter feuds (with transparent in-world analogues for Effective Altruism, TPOT, and the post-rationalists). Planecrash does not aspire to be high literature. Yudkowsky is self-aware of this, and uses it to troll big-name machine learning researchers:
(source)
So why would anyone ever read Planecrash? I read it (admittedly, sometimes skimming), and I see two reasons:
- The characters are competent in a way that characters in fiction rarely are. Yudkowsky is good at writing intelligent characters in a specific way that I haven't seen anyone else do as well. Lintamande writes a uniquely compelling story of determination and growth in an extremely competent character.
- More than anyone else I've yet read, Yudkowsky has his own totalising and self-consistent worldview/philosophy, and Planecrash makes it pop more than anything else he's written.
Dath ilan is an alternative quasi-utopian Earth, based (it's at least strongly hinted) on the premise of: what if the average person was Eliezer Yudkowsky? Dath ilan has all the normal quasi-utopian things like world government and land-value taxes and the widespread use of Bayesian statistics in science. Dath ilan also has some less-normal things, like annual Oops It's Time To Overthrow the Government festivals, an order of super-rationalists, and extremely high financial rewards for designing educational curricula that bring down the age at which the average child learns the maths behind the game theory of cooperation.
Keltham is an above-average-selfishness, slightly-above-average-intelligence young man from dath ilan. He dies in the titular plane crash, and wakes up in Cheliax.
Cheliax is a country in a medieval fantasy world in another plane of existence from dath ilan's (get it?). (This fantasy world is copied from a role-playing game setting—a fact I discovered when Planecrash literally linked to a Wiki article to explain part of the in-universe setting.) Like every other country in this world, Cheliax is medieval and poor. Unlike the other countries, Cheliax has the additional problem of being ruled by the forces of Hell.
Keltham meets Carissa, a Chelish military wizard who alerts the Chelish government about Keltham. Keltham is kept unaware about the Hellish nature of Cheliax, so he's eager to use his knowledge to start the scientific and industrial revolutions in Cheliax to solve the medieval poverty thing—starting with delivering lectures on first-order logic (why, what else would you first do in a medieval fantasy world?). An elaborate game begins where Carissa and a select group of Chelish agents try to extract maximum science from an unwitting Keltham before he realises what Cheliax really is—and hope that by that time, they'll have tempted him to change his morals towards a darker, more Cheliax-compatible direction.
The characters
Keltham oscillates somewhere between annoying and endearing.
The annoyingness comes from his gift for interrupting any moment with polysyllabic word vomit. Thankfully, this is not random pretentious techno-babble but a coherent depiction of a verbose character who thinks in terms of a non-standard set of concepts. Keltham's thoughts often include an exclamation along the lines of "what, how is {'coordination failure' / 'probability distribution' / 'decision-theoretic-counterfactual-threat-scenario'} so many syllables in this language, how do these people ever talk?"—not an unreasonable question. However, the sheer volume of Keltham's verbosity is still something, especially when it gets in the way of everything else.
The endearingness comes from his manic rationalist problem-solver energy, which gets applied to everything from figuring out chemical processes for magic ingredients to estimating the odds that he's involved in a conspiracy to managing the complicated social scene Cheliax places him in. It's somewhat like The Martian, a novel (and movie) about an astronaut stranded on Mars solving a long series of engineering challenges, except that the problem-solving is much more abstract and game-theoretic and interpersonal than concrete and physical and man-versus-world.
By far the best and most interesting character in Planecrash is Carissa Sevar, one of the several characters whose point-of-view is written by lintamande rather than Yudkowsky. She's so driven that she accidentally becomes a cleric of the god of self-improvement. She grapples realistically with the large platter of problems she's handed, experiences triumph and failure, and keeps choosing pain over stasis. All this leads to perhaps the greatest arc of grit and unfolding ambition that I've read in fiction.
The competence
I have a memory of once reading some rationalist blogger describing the worldview of some politician as: there's no such thing as competence, only loyalty. If a problem doesn't get solved, it's definitely not because the problem was tricky and there was insufficient intelligence applied to it or a missing understanding of its nature or someone was genuinely incompetent. It's always because whoever was working on it wasn't loyal enough to you. (I thought this was Scott Alexander on Trump, but the closest from him seems to be this, which makes a very different point.)
Whether or not I hallucinated this, the worldview of Planecrash is the opposite.
Consider Queen Abrogail Thrune II, the despotic and unhinged ruler of Cheliax who has a flair for torture. You might imagine that her main struggles are paranoia over the loyalty of her minions, and finding time to take glee in ruling over her subjects. And there's some of those. But more than that, she spends a lot of time being annoyed by how incompetent everyone around her is.
Or consider Aspexia Rugatonn, Cheliax's religious leader and therefore in charge of making the country worship Hell. She's basically a kindly grandmother figure, except not. You might expect her thoughts to be filled with deep emotional conviction about Hell, or disappointment in the "moral" failures of those who don't share her values (i.e. every non-sociopath who isn't brainwashed hard enough). But instead, she spends a lot of her time annoyed that other people don't understand how to act most usefully within the bounds of the god of Hell's instructions. The one time she gets emotional is when a Chelish person finally manages to explain the concept of corrigibility to her as well as Aspexia herself could. (The gods and humans in the Planecrash universe are in a weird inverse version of the AI alignment problem. The gods are superintelligent, but have restricted communication bandwidth and clarity with humans. Therefore humans often have to decide how to interpret tiny snippets of god-orders through changing circumstances. So instead of having to steer the superintelligence given limited means, the core question is how to let yourself be steered by a superintelligence that has very limited communication bandwidth with you.)
Fiction is usually filled with characters who advance the plot in helpful ways with their emotional fumbles: consider the stereotypical horror movie protagonist getting mad and running into a dark forest alone, or a character whose pride is insulted doing a dumb thing on impulse. Planecrash has almost none of that. The characters are all good at their jobs. They are surrounded by other competent actors with different goals thinking hard about how to counter their moves, and they always think hard in response, and the smarter side tends to win. Sometimes you get the feeling you're just reading the meeting notes of a competent team struggling with a hard problem. Evil is not dumb or insane, but just "unaligned" by virtue of pursuing a different goal than you—and does so very competently. For example: the core values of the forces of Hell are literally tyranny, slavery, and pain. They have a strict hierarchy and take deliberate steps to encourage arbitrary despotism out of religious conviction. And yet: their hierarchy is still mostly an actual competence hierarchy, because the decision-makers are all very self-aware that they can only be despotic to the extent that it still promotes competence on net. Because they're competent.
Planecrash, at its heart, is competence porn. Keltham's home world of dath ilan is defined by its absence of coordination failures. Neither there nor in Cheliax's world are there really any lumbering bureaucracies that do insane things for inscrutable bureaucratic reasons; the organisations depicted are all remarkably sane. Important positions are almost always filled by the smart, skilled, and hardworking. Decisions aren't made because of emotional outbursts. Instead, lots of agents go around optimising for their goals by thinking hard about them. For a certain type of person, this is a very relaxing world to read about, despite all the hellfire.
The philosophy"Rationality is systematized winning", writes Yudkowsky in The Sequences. All the rest is commentary.
The core move in Yudkowsky's philosophy is:
- We want to find the general solution to some problem.
- for example: fairness—how should we split gains from a project where many people participated
- Now here are some common-sense properties that this thing should follow
- for example:
- (1) no gains should be left undivided
- (2) if two people both contribute identically to every circumstance (formalised as a set of participating people), they should receive an equal share of the gains
- (3) the rule should give the same answer if you combine the division of gains from project A and then project B, as when you use it to calculate the division of gains from project A+B
- (4) if one person doesn't add value in any circumstance, their share of the gains is zero
- Here is The Solution. Note that it's mathematically provable that if you don't follow The Solution, there exists a situation where you will do something obviously dumb.
- For example: the Shapley value is the unique solution that satisfies the axioms above (a small computational sketch follows this list). (The Planecrash walkthrough of the Shapley value is roughly here; see also here for more Planecrash about trade and fairness.)
- Therefore, The Solution is uniquely spotlighted by the combination of common-sense goals and maths as the final solution to this problem, and if you disagree, please read this 10,000 word dialogue.
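To make this concrete, here is a minimal sketch of the Shapley computation mentioned in the list above. The `shapley_values` helper and the two-player toy game are my own illustrative constructions (Planecrash presents the maths in prose, not code); a player's Shapley value is their expected marginal contribution over a uniformly random joining order.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Shapley value of each player, where `value` maps a frozenset
    of players to the gains that coalition can secure on its own."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Probability that p joins immediately after exactly
                # this coalition, in a uniformly random ordering.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

# Toy project: alone, A produces 100 and B produces 50; together
# they produce 200, so cooperating creates an extra 50 to split.
v = {frozenset(): 0, frozenset({"A"}): 100,
     frozenset({"B"}): 50, frozenset({"A", "B"}): 200}
print(shapley_values(["A", "B"], v.get))  # {'A': 125.0, 'B': 75.0}
```

Note how the extra 50 from cooperation is split evenly on top of each player's solo output, which is exactly what axioms like symmetry and efficiency demand in this toy case.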
The centrality of this move is something I did not get from The Sequences, but which is very apparent in Planecrash. A lot of the maths in Planecrash isn't new Yudkowsky material. But Planecrash is the only thing that has given me a map through the core objects of Yudkowsky's philosophy, and spelled out the high-level structure so clearly. It's also, as far as I know, the most detailed description of Yudkowsky's quasi-utopian world of dath ilan.
Validity, Probability, Utility
Keltham's lectures to the Chelish—yes, there are actually literal maths lectures within Planecrash—walk through three key examples, spotty in completeness but high in quality for whatever they do cover:
- Validity, i.e. logic. In particular, Yudkowsky highlights what I think is some combination of Lindström's theorem and Gödel's completeness theorem, which together imply that first-order logic is the unique logic that is both complete (i.e. everything true within it can be proven) and has some other nice properties. However, first-order logic is also not strong enough to capture some things we care about (such as the natural numbers), so this is the least-strong example of the above pattern. Yudkowsky has written out his thoughts on logic in the mathematics and logic section here, if you want to read his takes in a non-fiction setting.
- Probability. So-called Dutch book theorems show that if an agent's credences don't obey the probability axioms, or aren't updated in a Bayesian way, there exists a set of bets, each of which the agent would accept individually, that together guarantee it a loss. So your credences in beliefs should be represented as probabilities, and you should update those probabilities with Bayes' theorem. (Here is a list of English statements that, dath ilani civilisation thinks, anyone competent in Probability should be able to translate into correct maths.)
- Utility. The behaviour of any agent that is "rational" in a certain technical sense should be describable as maximising a "utility function", i.e. every outcome can be assigned a number, such that the agent predictably chooses outcomes with higher numbers over those with lower ones. This is because if an agent violates this constraint, there must exist situations where it would do something obviously dumb (the sketch after this list shows two such situations). As a shocked Keltham puts it: "I, I mean, there's being chaotic, and then there's being so chaotic that it violates coherence theorems".
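Here is a small self-contained demonstration of both halves of that claim, with invented toy stakes; this is my sketch of the standard textbook arguments, not code from Planecrash:

```python
# 1. Dutch book: credences over exhaustive, exclusive outcomes that
# don't sum to 1. A fair price for a $1 bet on X is P(X) dollars.
p_rain, p_no_rain = 0.6, 0.6        # incoherent: they sum to 1.2
cost = p_rain + p_no_rain           # the agent happily buys both bets
payout = 1.0                        # exactly one of the bets pays out
print(round(cost - payout, 2))      # 0.2: guaranteed loss, rain or shine

# 2. Money pump: cyclic preferences A > B > C > A, plus a willingness
# to pay a small fee to trade up to anything preferred.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}
holding, cash = "A", 0.0
for offer in ("C", "B", "A"):       # one lap around the preference cycle
    if (offer, holding) in prefers:
        holding, cash = offer, cash - 0.01
print(holding, round(cash, 2))      # A -0.03: same item, strictly poorer
```

Run the lap forever and the agent bleeds money indefinitely while ending up exactly where it started; that is the "obviously dumb" behaviour the coherence theorems rule out.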
In Yudkowsky's own words, not in Planecrash but in an essay he wrote (with much valuable discussion in the comments):
We have multiple spotlights all shining on the same core mathematical structure, saying dozens of different variants on, "If you aren't running around in circles or stepping on your own feet or wantonly giving up things you say you want, we can see your behavior as corresponding to this shape. Conversely, if we can't see your behavior as corresponding to this shape, you must be visibly shooting yourself in the foot." Expected utility is the only structure that has this great big family of discovered theorems all saying that. It has a scattering of academic competitors, because academia is academia, but the competitors don't have anything like that mass of spotlights all pointing in the same direction.
So if we need to pick an interim answer for "What kind of quantitative framework should I try to put around my own decision-making, when I'm trying to check if my thoughts make sense?" or "By default and barring special cases, what properties might a sufficiently advanced machine intelligence look to us like it possessed, at least approximately, if we couldn't see it visibly running around in circles?", then there's pretty much one obvious candidate: Probabilities, utility functions, and expected utility.
Coordination
Next, coordination. There is no single theorem or total solution for the problem of coordination. But the Yudkowskian frame has near-infinite scorn for failures of coordination. Imagine not realising all possible gains just because you're stuck in some equilibrium of agents defecting against each other. Is that winning? No, it's not. Therefore, it must be out.
Dath ilan has a mantra that goes, roughly: if you do that, you will end up there, so if you want to end up somewhere that is not there, you will have to do Something Else Which Is Not That. And the basic premise of dath ilan is that society actually has the ability to collectively say "we are currently going there, and we don't want to, and while none of us can individually change the outcome, we will all coordinate to take the required collective action and not defect against each other in the process even if we'd gain from doing so". Keltham claims that in dath ilan, if there somehow developed an oppressive tyranny, everyone would wait for some Schelling time (like a solar eclipse or the end of the calendar year or whatever) and then simultaneously rise up in rebellion. It probably helps that dath ilan has annual "oops it's time to overthrow the government" exercises. It also helps that everyone in dath ilan knows that everyone knows that everyone knows that everyone knows (...) all the standard rationalist takes on coordination and common knowledge.
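A toy payoff table makes the uprising story crisp. All the numbers here are invented for illustration; the point is just that "rebel" and "stay home" are both self-reinforcing equilibria, and common knowledge of a Schelling time is what moves everyone from the bad one to the good one:

```python
def payoff(i_rebel: bool, fraction_rebelling: float) -> float:
    """Invented stag-hunt-style payoffs for one citizen."""
    if i_rebel:
        # Rebellion succeeds only if a majority joins in.
        return 10.0 if fraction_rebelling > 0.5 else -100.0
    # Staying home is safe; you free-ride a little if others succeed.
    return 5.0 if fraction_rebelling > 0.5 else 0.0

# Without common knowledge, each citizen expects almost no one to rebel:
print(payoff(True, 0.01), payoff(False, 0.01))  # -100.0 0.0 -> stay home
# At the Schelling time, everyone knows everyone will act:
print(payoff(True, 0.99), payoff(False, 0.99))  # 10.0 5.0 -> rebel
```

Neither individual can profitably deviate in either state, which is why dath ilan invests in rehearsals and common knowledge rather than in individual heroism.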
Keltham summarises the universality of Validity, Probability, Utility, and Coordination (note the capitals):
"I am a lot more confident that Validity, Probability, and Utility are still singled-out mathematical structures whose fragmented shards and overlapping shadows hold power in Golarion [=the world of Cheliax], than I am confident that I already know why snowflakes here have sixfold symmetry. And I wanted to make that clear before I said too much about the hidden orders of reality out of dath ilan - that even if the things I am saying are entirely wrong about Golarion, that kind of specific knowledge is not the most important knowledge I have to teach. I have gone into this little digression about Validity and timelessness and optimality, in order to give you some specific reason to think that [...] some of the knowledge he has to teach is sufficiently general that you have strong reason for strong hope that it will work [...] [...] "It is said also in dath ilan that there is a final great principle of Law, less beautiful in its mathematics than the first three, but also quite important in practice; it goes by the name Coordination, and deals with agents simultaneously acting in such fashion to all get more of what they wanted than if they acted separately."
Decision theory
The final fundamental bit of Yudkowsky's philosophy is decision theories more complicated than causal decision theory.
A short primer / intuition pump: a decision theory specifies how you should choose between various options (it's not moral philosophy, because it assumes we already know what we value). The most straightforward decision theory is causal decision theory, which says: pick the option that causes the best outcome in expectation. Done, right? No; the devil is in the word "causes". Yudkowsky makes much of Newcomb's problem, but I prefer another example: Parfit's hitchhiker. Imagine you're a selfish person stuck in a desert without your wallet, and want to make it back to your hotel in the city. A car pulls up, with a driver who knows whether you're telling the truth. You ask to be taken back to your hotel. The driver asks if you'll pay $10 to them as a service. Dying in the desert is worse for you than paying $10, so you'd like to take this offer. However, you obey causal decision theory: if the driver takes you to your hotel, you would go inside to get your wallet, and then have the choice between (a) taking $10 back to the driver and therefore losing money, and (b) staying in your hotel and losing nothing. Causal decision theory says to take option (b), because you're a selfish agent who doesn't care about the driver. And the driver knows you'd be lying if you said "yes", so you have to tell the driver "no". The driver drives off, and you die of thirst in the desert. If only you had spent more time arguing about non-causal decision theories on LessWrong.
Dying in a desert rather than spending $10 is not exactly systematised winning. So causal decision theory is out. (You could argue that another moral of Parfit's hitchhiker is that being a purely selfish agent is bad, and humans aren't purely selfish so it's not applicable to the real world anyway, but in Yudkowsky's philosophy—and decision theory academia—you want a general solution to the problem of rational choice where you can take any utility function and win by its lights regardless of which convoluted setup philosophers drop you into.) Yudkowsky's main academic / mathematical accomplishment is co-inventing (with Nate Soares) functional decision theory, which says you should consider your decisions as the output of a fixed function, and then choose the function that leads to the best consequences for you. This solves Parfit's hitchhiker, as well as problems like the smoking lesion problem that evidential decision theory, the classic non-causal decision theory, succumbs to. As far as I can judge, functional decision theory is actually a good idea (if somewhat underspecified), but academic engagement with it (whether critique or praise) has been limited, so there's no broad consensus in its favor that I can point at. (If you want to read Yudkowsky's explanation for why he doesn't spend more effort on academia, it's here.)
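The difference between the two decision theories can be compressed into a few lines. This is my own toy rendering with made-up utilities, assuming a perfectly accurate predictor-driver:

```python
DESERT_DEATH = -1_000_000   # utility of dying in the desert
PAY = -10                   # utility of paying the driver $10

def outcome(would_pay: bool) -> int:
    """The driver perfectly predicts your policy and only rescues
    you if you would in fact pay once you're safely in the city."""
    return PAY if would_pay else DESERT_DEATH

cdt_policy = False  # from inside the city, paying only loses $10
fdt_policy = max([True, False], key=outcome)  # pick the best *policy*

print(outcome(cdt_policy))  # -1000000: CDT is left in the desert
print(outcome(fdt_policy))  # -10: FDT pays, and so gets rescued
```

The trick is in what gets optimised: CDT optimises the action given that the ride already happened, while FDT optimises the function from situations to actions, knowing the driver's prediction depends on that function.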
(Now you know what a Planecrash tangent feels like, except you don't, because Planecrash tangents can be much longer.)
One big aspect of Yudkowskian decision theory is how to respond to threats. Following causal decision theory means you can neither make credible threats nor commit to deterrence to counter threats. Yudkowsky endorses not responding to threats to avoid incentivising them, while also having deterrence commitments to maintain good equilibria. He also implies this is a consequence of using a sensible functional decision theory. But there's a tension here: your deterrence commitment could be interpreted as a threat by someone else, or vice versa. When the Eisenhower administration's nuclear doctrine threatened massive nuclear retaliation in the event of the Soviets taking West Berlin, what's the exact maths that would've let them argue to the Soviets "no no this isn't a threat, this is just a deterrence commitment", while still allowing the Soviets to keep to Yudkowsky's strict rule of ignoring all threats?
My (uninformed) sense is that this maths hasn't been figured out. Planecrash never describes it (though here is some discussion of decision theory in Planecrash). Posts in the LessWrong decision theory canon like this or this and this seem to point to real issues around decision theories encouraging commitment races, and when Yudkowsky pipes up in the comments he's mostly falling back on the conviction that, surely, sufficiently-smart agents will find some way around mutual destruction in a commitment race (systematised winning, remember?). There are also various critiques of functional decision theory (see also Abram Demski's comment on that post acknowledging that functional decision theory is underspecified). Perhaps it all makes sense if you've worked through Appendix B7 of Yudkowsky's big decision theory paper (which I haven't actually read, let alone taken time to digest), but (a) why doesn't he reference that appendix then, and (b) I'd complain about that being hard to find, but then again we are talking about the guy who leaves the clearest and most explicit description of his philosophy scattered across an R-rated role-playing-game fanfic posted in innumerable parts on an obscure internet forum, so I fear my complaint would be falling on deaf ears anyway.
The political philosophy of dath ilan
Yudkowsky has put a lot of thought into how the world of dath ilan functions. Overall it's very coherent.
Here's a part where Keltham explains dath ilan's central management principle: everything, including every project, every rule within any company, and any legal regulation, needs to have one person responsible for it.
Keltham is informed, though he doesn't think he's ever been tempted to make that mistake himself, that overthinky people setting up corporations sometimes ask themselves 'But wait, what if this person here can't be trusted to make decisions all by themselves, what if they make the wrong decision?' and then try to set up more complicated structures than that. This basically never works. If you don't trust a power, make that power legible, make it localizable to a single person, make sure every use of it gets logged and reviewed by somebody whose job it is to review it. If you make power complicated, it stops being legible and visible and recordable and accountable and then you actually are in trouble.
Here's a part where Keltham talks about how dath ilan solves the problem of who watches the watchmen:
If you count the rehearsal festivals for it, Civilization spends more on making sure Civilization can collectively outfight the Hypothetical Corrupted Governance Military, than Civilization spends on its actual military.
Here's a part where dath ilan's choice of political system is described, which I will quote at length:
Conceptually and to first-order, the ideal that Civilization is approximating is a giant macroagent composed of everybody in the world, taking coordinated macroactions to end up on the multi-agent-optimal frontier, at a point along that frontier reflecting a fair division of the gains from that coordinated macroaction -
Well, to be clear, the dath ilani would shut it all down if actual coordination levels started to get anywhere near that. Civilization has spoken - with nearly one voice, in fact - that it does not want to turn into a hivemind.
[...]
Conceptually and to second-order, then, Civilization thinks it should be divided into a Private Sphere and a Public Shell. Nearly all the decisions are made locally, but subject to a global structure that contains things like "children may not be threatened into unpaid labor"; or "everybody no matter who they are or what they have done retains the absolute right to cryosuspension upon their death"; [...]
[...]
Directdemocracy has been tried, from time to time, within some city of dath ilan: people making group decisions by all individually voting on them. It can work if you try it with fifty people, even in the most unstructured way. Get the number of direct voters up to ten thousand people, and no amount of helpfully-intended structure in the voting process can save you.
[...]
Republics have been tried, from time to time, within some city of dath ilan: people making group decisions by voting to elect leaders who make those decisions. It can work if you try it with fifty people, even in the most unstructured way. Get the number of voters up to ten thousand people, and no amount of helpfully-intended structure in the voting process can save you.
[...]
There are a hundred more clever proposals for how to run Civilization's elections. If the current system starts to break, one of those will perhaps be adopted. Until that day comes, though, the structure of Governance is the simplest departure from directdemocracy that has been found to work at all.
Every voter of Civilization, everybody at least thirteen years old or who has passed some competence tests before then, primarily exerts their influence through delegating their vote to a Delegate.
A Delegate must have at least fifty votes to participate in the next higher layer at all; and can retain no more than two hundred votes before the marginal added influence from each additional vote starts to diminish and grow sublinearly. Most Delegates are not full-time, unless they are representing pretty rich people, but they're expected to be people interested in politics [...]. Your Delegate might be somebody you know personally and trust, if you're the sort to know so many people personally that you know one Delegate. [...]
If you think you've got a problem with the way Civilization is heading, you can talk to your Delegate about that, and your Delegate has time to talk back to you.
That feature has been found to not actually be dispensable in practice. It needs to be the case that, when you delegate your vote, you know who has your vote, and you can talk to that person, and they can talk back. Otherwise people feel like they have no lever at all to pull on the vast structure that is Governance, that there is nothing visible that changes when a voter casts their one vote. Sure, in principle, there's a decision-cohort whose votes move in logical synchrony with yours, and your cohort is probably quite large unless you're a weird person. But some part of you more basic than that will feel like you're not in control, if the only lever you have is an election that almost never comes down to the votes of yourself and your friends.
The rest of the electoral structure follows almost automatically, once you decide that this property has to be preserved at each layer.
The next step up from Delegates are Electors, full-time well-paid professionals who each aggregate 4,000 to 25,000 underlying voters from 50 to 200 Delegates. Few voters can talk to their Electors [...] but your Delegate can have some long conversations with them. [...]
Representatives aggregate Electors, ultimately 300,000 to 3,000,000 underlying votes apiece. There are roughly a thousand of those in all Civilization, at any given time, with social status equivalent to an excellent CEO of a large company or a scientist who made an outstanding discovery [...]
And above all this, the Nine Legislators of Civilization are those nine candidates who receive the most aggregate underlying votes from Representatives. They vote with power proportional to their underlying votes; but when a Legislator starts to have voting power exceeding twice that of the median Legislator, their power begins to grow sublinearly. By this means is too much power prevented from concentrating into a single politician's hands.
Surrounding all this of course are numerous features that any political-design specialist of Civilization would consider obvious:
Any voter (or Delegate or Elector or Representative) votes for a list of three possible delegees of the next layer up; if your first choice doesn't have enough votes yet to be a valid representor, your vote cascades down to the next person on your list, but remains active and ready to switch up if needed. This lets you vote for new delegees entering the system, without that wasting your vote while there aren't enough votes yet.
Anyone can at any time immediately eliminate a person from their 3-list, but it takes a 60-day cooldown to add a new person or reorder the list. The government design isn't meant to make it cheap or common to threaten your delegee with a temporary vote-switch if they don't vote your way on that particular day. The government design isn't meant to make it possible for a new brilliant charismatic leader to take over the entire government the next day with no cooldowns. It is meant to let you rapidly remove your vote from a delegee that has sufficiently ticked you off.
Once you have served as a Delegate, or delegee of any other level, you can't afterwards serve in any other branches of Governance. [...]
This is meant to prevent a political structure whose upper ranks offer promotion as a reward to the most compliant members of the ranks below, for by this dark-conspiratorial method the delegees could become aligned to the structure above rather than their delegators below.
(Most dath ilani would be suspicious of a scheme that tried to promote Electors from Delegates in any case; they wouldn't think there should be a political career ladder [...] Dath ilani are instinctively suspicious of all things meta, and much more suspicious of anything purely meta; they want heavy doses of object-level mixed in. To become an Elector you do something impressive enough, preferably something entirely outside of Governance, that Delegates will be impressed by you. You definitely don't become an Elector by being among the most ambitious and power-seeking people who wanted to climb high and knew they had to start out a lowly Delegate, who then won a competition to serve the system above them diligently enough to be selected for a list of Electors fed to a political party's captive Delegates. If a dath ilani saw a system like this, that was supposedly a democracy set in place by the will of its people, they would ask what the captive 'voters' even thought they were supposedly trying to do under the official story.)
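Stepping out of the quote for a moment: the influence caps above are easy to make concrete. The quoted text only says that influence grows sublinearly past the cap, without giving a formula, so the square-root taper in this sketch (and the `delegate_influence` helper itself) is an invented placeholder, purely to illustrate the shape of the rule:

```python
import math

def delegate_influence(votes: int) -> float:
    """Toy influence curve for a dath ilani Delegate, per the quoted
    rules: no standing below 50 votes, linear up to 200, and a
    (made-up) square-root taper beyond that."""
    if votes < 50:
        return 0.0
    if votes <= 200:
        return float(votes)
    return 200.0 + math.sqrt(votes - 200)

for v in (49, 50, 200, 400, 1000):
    print(v, round(delegate_influence(v), 1))
# 49 -> 0.0, 50 -> 50.0, 200 -> 200.0, 400 -> 214.1, 1000 -> 228.3
```

Whatever the actual taper, the design intent is clear: accumulating ever more delegated votes keeps buying you standing, but at sharply diminishing rates, so no single delegee can quietly become a kingmaker.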
Dath ilani Legislators have a programmer's or engineer's appreciation for simplicity:
[...] each [regulation] must be read aloud by a Legislator who thereby accepts responsibility for that regulation; and when that Legislator retires a new Legislator must be found to read aloud and accept responsibility for that regulation, or it will be stricken from the books. Every regulation in Civilization, if something goes wrong with it, is the fault of one particular Legislator who accepted responsibility for it. To speak it aloud, it is nowadays thought, symbolizes the acceptance of this responsibility.
Modern dath ilani aren't really the types in the first place to produce literally-unspeakable enormous volumes of legislation that no hapless citizen or professional politician could ever read within their one lifetime let alone understand. Even dath ilani who aren't professional programmers have written enough code to know that each line of code to maintain is an ongoing cost. Even dath ilani who aren't professional economists know that regulatory burdens on economies increase quadratically in the cost imposed on each transaction. They would regard it as contrary to the notion of a lawful polity with law-abiding citizens that the citizens cannot possibly know what all the laws are, let alone obey them. Dath ilani don't go in for fake laws in the same way as Golarion polities with lots of them; they take laws much too seriously to put laws on the books just for show.
Finally, the Keepers are an order of people trained in all the most hardcore arts of rationality, who thus end up with inhuman integrity and even-handedness of judgement. They are used in many ways, for example:
There are also Keeper cutouts at key points along the whole structure of Governance - the Executive of the Military reports not only to the Chief Executive but also to an oathsworn Keeper who can prevent the Executive of the Military from being fired, demoted, or reduced in salary, just because the Chief Executive or even the Legislature says so. It would be a big deal, obviously, for a Keeper to fire this override; but among the things you buy when you hire a Keeper is that the Keeper will do what they said they'd do and not give five flying fucks about what sort of 'big deal' results. If the Legislators and the Chief Executive get together and decide to order the Military to crush all resistance, the Keeper cutout is there to ensure that the Executive of the Military doesn't get a pay cut immediately after they tell the Legislature and Chief Executive to screw off.
Also, to be clear, absolutely none of this is plot-relevant.
Above: The icon of dath ilan in Planecrash. When Yudkowsky really wants to monologue, he stops even pretending to do it through a character, and instead we get this talking globe. Hello, globe. Nice political philosophy you got there.
A system of the world
Yudkowsky proves that ideas matter: if you have ideas that form a powerful and coherent novel worldview, it doesn't matter if your main method for publicising them is ridiculously-long fanfiction, or if you dropped out of high school, or if you wear fedoras. People will still listen, and you might become (so far) the 21st century's most important philosopher.
Why is Yudkowsky so compelling? There are intellectuals like Scott Alexander who are most-strongly identified by a particular method (an even-handed, epistemically-rigorous, steelmanning-focused treatment of a topic), or intellectuals like Robin Hanson who are most-strongly identified by a particular style (eclectic irreverence about incentive mechanisms). But Yudkowsky's hallmark is delivering an entire system of the world that covers everything from logic to what correct epistemology looks like to the maths behind rational decision-making and coordination, and comes complete with identifying the biggest threat (misaligned AI) and the structure of utopia (dath ilan). None of the major technical inventions (except some in decision theory) are original to Yudkowsky. But he's picked up the pieces, slotted them into a big coherent structure, and presented it in great depth. And Yudkowsky's system claims to come with proofs for many key bits, in the literal mathematical sense. No, you can't crack open a textbook and see everything laid out, step-by-step. But the implicit claim is: read this long essay on coherence theorems, these papers on decision theory, this 20,000-word dialogue, these sequences on LessWrong, and ideally a few fanfics too, and then you'll get it.
After reading Yudkowsky, you're perfectly inoculated against any philosophy so lazy that it doesn't even come with mathematical proofs. (source)
Does he deliver? To an impressive extent, yes. There's a lot of maths that is laid out step-by-step and does check out. There are many takes that are correct, and big structures that point in the right direction, and what seems wrong at least has depth and is usefully provocative. But dig deep enough, and there are cracks: arguments about how much coherence theorems really imply, critiques of the decision theory, and good counterarguments to the most extreme versions of Yudkowsky's AI risk thesis. You can chase any of these cracks up towers of LessWrong posts, or debate them endlessly at those parties where people stand in neat circles and exchange thought experiments about acausal trade. If you have no interaction with rationalist/LessWrong circles, I think you'd be surprised at the fraction of our generation's top mathematical-systematising brainpower that is spent on this—or that is bobbing in the waves left behind, sometimes unknowingly.
As for myself: Yudkowsky's philosophy is one of the most impressive intellectual edifices I've seen. Big chunks of it—in particular the stuff about empiricism, naturalism, and the art of genuinely trying to figure out what's true that The Sequences especially focus on—were very formative in my own thinking. I think it's often proven itself directionally correct. But Yudkowsky's philosophy makes a claim for near-mathematical correctness, and I think there's a bit of trouble there. While it has impressive mathematical depth and gets many things importantly right (e.g. Bayesianism), despite much effort spent digesting it, I don't see it meeting the rigour bar it would need for its predictions (for example about AI risk) to be more like those of a tested scientific theory than those of a framing, worldview, or philosophy. However, I'm also very unsympathetic to a certain straitlaced science-cargo-culting attitude that recoils from Yudkowsky's uncouthness and is uninterested in speculation or theory—they would do well to study the actual history of science. I also see in Yudkowsky's philosophy choices of framing and focus that seem neither forced by reason nor entirely natural in my own worldview. I expect that lots more great work will come out within the Yudkowskian frame, whether critiques or patches, and this work could show it to be anywhere from impressive but massively misguided to almost prophetically prescient. However, I expect even greater things if someone figures out a new, even grander and more applicable system of the world. Perhaps that person can then describe it in a weird fanfic.
Discuss
Good Fortune and Many Worlds
Summary: The Many-Worlds interpretation of quantum mechanics can help us respond virtuously to improbably good circumstances. Contemplating the less lucky branches of the quantum multiverse can be seen as a beneficial ‘negative visualisation’ practice comparable to those in Stoicism and some Buddhist traditions.
I’ve previously suggested that when we think about the ethical implications of the many-worlds interpretation (MWI) of quantum mechanics, the kinds of implications we should expect are ones about how to feel in certain ethical situations, and what kinds of character traits or ‘virtues’ we should try to develop.
I’ve argued that MWI implies virtues in which we feel better about mutually exclusive life choices by reminding ourselves that there really are worlds in which we choose ‘the road not taken’.
I’ve also shown that MWI can help us feel better about improbable bad events, since the things we value remain unaffected in most other worlds.
But what about improbable good events?
Is the multiverse bad for well-being?
Suppose you’ve just won big at a Las Vegas casino - or made a fortune on the stock market, or through a series of high-risk, start-up ventures. Or perhaps you’ve been lucky to survive a plane crash or terror attack, or a lifetime of extreme sports.
As we saw last time, any such outcome can be seen as the result of classically chaotic processes which bottom out in quantum probabilities.
So if you find yourself in these situations, if MWI is true, there’s a large proportion of other worlds in which you were not so lucky.
And therefore if MWI helps us feel better when faced with misfortune, by the same logic it seems it should make us feel worse when faced with good fortune.
The many-worlds perspective confronts you with a majority of worlds in which you or your multiverse counterpart struggled on the stock-market, or in business, or in which they died in a plane crash or rock-climbing accident.
These reflections might dampen your happiness, especially if you meditate on how close the relationship is between you and your multiverse counterpart, and visualise their fate in vivid detail.
And this might seem to present a challenge for the view I’ve put forward, that MWI has implications for how to understand virtue.
Virtue is usually seen as closely related to happiness or human flourishing. And we now seem faced with a case where the supposed ‘virtue’ of taking the many-worlds point of view moves us away from such flourishing.
In addition, one might think that on balance this leads us back to the view that ‘it all adds up to normality’, since the benefits of this virtue in some circumstances are offset by its negative consequences in others.
Memento Mori
I think these challenges can be met, and the case of improbable good events actually supports the idea that taking the many-worlds view is a virtue.
On the way to this conclusion, let's go back to Ancient Rome.
In some accounts of Roman triumphal processions, a companion or public slave would stand behind or near the triumphant general and remind him from time to time to “Look behind. Remember that you are a mortal man”.
This practice is part of a wider tradition of memento mori, the remembrance of death, in Stoicism. More recent Stoic thinkers have coined the term ‘negative visualisation’, expanding the practice to include meditation on bad outcomes other than death.
The psychological benefits were conceived in various ways, but a neat summary is provided by the Stoic Epictetus:
Keep death and exile before your eyes each day, along with everything that seems terrible — by doing so, you’ll never have a base thought nor will you have excessive desire.
There is both an elevating and a chastening effect. Mindfulness of death, and of other bad outcomes, can serve to focus your attention on what really matters and by the same token regulate your desires, helping avoid excessive focus on the trivial or ephemeral.
This might sound a bit joyless, but that would be a misunderstanding. An awareness of ephemerality can also lead to a fitting (neither excessive nor too little) appreciation of the pleasures of the present moment, in the spirit of carpe diem.
Contemplating Worse Worlds
When you’ve experienced a lucky escape, whether from death or from shallower evils, taking a many-worlds view means meditating on the majority of worlds in which you were not so lucky.
Though it’s true there’s a sense in which this involves ‘dampening’ happiness, this is far from the whole story: we can now see how contemplating those worlds is a specific kind of negative visualisation.
The fact that your multiverse counterparts are worse off than you can deepen an appreciation for the specific qualities of your current branch.
In the case where your counterparts met their death, it can deepen your gratitude for just being alive.
At the same time, contemplating the improbability of your current good fortune, in comparison to those other worlds, helps to guide your desires towards sustainable values; towards goals and virtues that are resilient to the ups and downs of random events.
So while it’s true that taking the many-worlds point of view in a sense dampens your current mood - just as no doubt the whispers of mortality dampened the triumphant mood of the returning Roman general - this is in service of a deeper and more lasting happiness.
Kinds of Negative Visualisation
Negative visualisation practices of this sort can be analysed along two dimensions:
- is the object of contemplation actual or merely possible
- is the object of contemplation in the past, present or future
The classic memento mori approach is to take as an object of contemplation your possible death in the future - specifically the near future.
It’s often thought to be more effective to consider not only that you will die at some point, but that you could die today.
Seneca writes:
Let us continually think as much about our own mortality as about that of all those we love … Now is the time for you to reflect … Whatever can happen at any time can happen today.
The Maranassati Sutta takes this even further: the Buddha says that the best way to be truly mindful of death is to consider that your next breath may be your last, rather than merely thinking that this may be your last day.
An alternative approach is to consider the actual death of someone else in the recent past. The Satipatthana Sutta describes a form of meditation that focuses on the stages of decay of a real human corpse.
The use of a skull (either real, sculpted or depicted) as a memento mori in the European tradition can be seen as a less visceral version of this.
The Many-Worlds Kind
Contemplating multiverse branches in which you fared worse can be seen as a further example of this second approach. 'Fared' here is past tense: you are generally considering events in the recent past where your multiverse counterparts were less lucky than you.
And according to MWI, these counterparts are fully actual, and not merely possible.
We can expect this many-worlds variant of negative visualisation to be as effective as the historical practices mentioned above - and potentially more so.
Given that one is seeking a heightened appreciation of one's own circumstances, it makes sense that it is particularly effective to contemplate one's own death or misfortune.
It also makes sense that contemplating something actual, rather than merely possible, can have a greater impact on your mental models of the world and yourself.
Something you believe to be true can plausibly go further in your mental economy than something you hold to be possible, given the many interconnections between beliefs and desires, and between beliefs and other beliefs.
The many-worlds practice uniquely combines both kinds of effectiveness, since it allows you to think of your own death or misfortune (or that of your multiverse counterpart) as actual (in other branches).
Fickle Fortune
So the many-worlds point of view, like the memento mori tradition, can breed humility and a focus on what really matters, including compassionate, virtuous action and a heightened presence in the here and now.
When we are faced with good luck, contemplating worlds in which we fared worse can help us appreciate our good fortune without misattributing it to necessity or personal agency.
And here we can draw together the perspective just described with the perspective on improbably bad events sketched in the previous post.
In the case of both good and bad fortune, the virtuous and emotionally healthy attitude is to recognise unpredictability as part of the natural order.
In the case of bad luck, the answer to the question ‘why me’ is not some fundamental flaw in myself or reality. In the case of good luck the answer to the question of ‘why me’ is not that I am special or chosen.
MWI helps us appreciate these facts by locating our histories among countless others in which we fare differently.
Like the medieval image of the wheel of fortune, elevating some and dethroning others through its natural motion, MWI offers a picture of the natural order that consoles us in dark times and helps keep us honest when times are good. The former, though, is a myth; the latter is science.
Discuss
Letter from an Alien Mind
Cause wow what is everyone even doing?!
You know how you sometimes hear about people seeing colors when they hear words, or how they memorize thousands of phone numbers on sight, or that some navigate intricate mind palaces full of detailed sensory information?
Ever wonder if you are secretly one of those people that have some highly unusual inner experience that no one else has?
Well. I’ve found mine, guys.
When driving a car, I navigate using Satellite View!
So fucking wild, right?
I’ve polled over 30 people about this now, and I’ve found 3 people that do the same. One is my brother.
Now the part that absolutely blows my fucking mind - with all due respect, speaking as the comparative alien here - but … whyyyyyy?
Satellite view has more information. Some of that information can even be useful. When I’m navigating, I can see I need to take the third left and I can recognize the third left coming up cause it’s near that one tree down the road. Trees, bridges, parking lots, high-rises, lawns, farmland … they all have shapes on the map that I can use to recognize more quickly where I am and where I’m going, and it looks prettier too!
So I asked people why they use Map view (ha, >90% chance this is the part that feels self-evident to you. Ahum. I, however, am mind-blown, thank you very much!). I’ve heard two main responses.
The first is that Map view is the default and it hadn’t occurred to them to change it. Google Earth was released in 2005. I’m not sure when Satellite view was released, but it’s been around for at least a decade if not two. I’m rather fascinated how Realizing An Action Is Possible works in the brain. Sometimes there is a weird smell coming from the fridge, and it takes me something like 1-3 days to realize that I can take physical actions to remedy this. I’ve always found that rather bizarre. How come us humans can know a thing is possible, and sometimes even notice we would like to resolve the issue, but then not realize we can move our bodies to make this happen?
The second reason is that Satellite view makes it harder for many people to get the information they want. The majority of people who have tried it, or who use it actively in some niche cases, report that it’s harder to see where streets are and what they’re named, cause there is so much else going on on the screen.
I find this flabbergasting.
To me, this concern sounds like someone walking down a straight road and then closing their eyes cause they know where they want to go anyway. Or if you had the ability to turn off color vision, do you imagine you’d do it outside of, say, meals and traffic navigation?
But then I realized I do actually habitually put on sound blocking headphones when working. I’m making myself effectively deaf. The car noises and the wind, and the occasional voices around me, are distracting and not relevant to my work. Visually I don’t experience anything like this, but I wonder, if you made some measure of "preferred sensory load per sense", whether you’d end up with a normal distribution where I’m dangling on the low end of auditory sensory load preference and shooting out the top of the distribution for visual sensory load preference[1].
My point being … minds can be surprisingly different and qualia can vary wildly in ways you can’t predict, and when you want to build a bridge back to empathizing with other people’s experiences, you have to go and find analogous experiences in yourself or you are just left going "?!" about what the hell it’s like to be that other person.
So I’m wondering how we can notice where our qualia diverge from each other. I was 38 before anyone commented on my using Satellite view for navigation. Similarly, it took ages for aphantasiacs to be "noticed". And in the same vein, I suspect some people don’t experience "consciousness" in the way most of the rest of us do, but they are tremendously hard to notice!
I’d be excited for something like a grand survey of qualia to exist, including all the presumably >100 synesthesias that are out there. Or maybe there is some other tool that can be constructed that probes for unique qualia more effectively. The data collection on this would not just let participants notice if they are unique outliers on some experience, but would also allow us to identify clusters of common experiences and possibly relate them to demographics or life outcomes.
I’d be so surprised if we didn’t discover some interesting patterns. Wouldn’t you? :D
- ^
Two minutes on Google Scholar and Perplexity make it immediately evident that these are actual research areas with actual insights and actual information. I’m not reporting on that cause I’m mostly preoccupied at the moment with the shininess of "oh man, how do minds differ?!"
Discuss
Coin Flip
This is a prose piece I wrote and performed for the 2024 KC Winter Solstice celebration.
I have a scar running from right above my navel to the right end of my abdomen. You see, I was born premature, with a volvulus, which means that my intestines were tangled and blue. I spent the first months of my life in the NICU in New York, my dad trying to get what time he could from a job that required travel to spend with me. The scar is from surgery to untangle my intestines - which is a thing we can do now!
All that has happened since then. Every laugh, every smile, every bite and every sip, every friend I have changed, every word I have ever uttered.
All of it almost didn't happen.
There's a morbid exercise where everyone in the room flips a coin, and we imagine that everyone that got tails died before 15. Just a few centuries ago, child mortality was a coin flip.
I don't need to flip a coin. I already know I got tails.
And yet.
And yet, I am still here.
Thank you.
Discuss
If all trade is voluntary, then what is "exploitation?"
Capitalism is a force that has lifted billions out of poverty, where even poor remote villagers enjoy luxuries that would have been unimaginable to medieval kings. When someone takes a job, even the worst job, it’s because both parties expect mutual gain. And yet companies routinely get accused of exploiting their workers for offering low pay and bad conditions, even if the pay and conditions are far better than the other available jobs. This sometimes results in protectionist laws that prevent those businesses from existing in the first place, making everyone worse off.
Given this, is there any meaningful concept that could be called “exploitation?”
I think there is.
In fact, I claim, it routinely happens that someone will voluntarily and rationally submit to a circumstance that should very rightfully be called “exploitation.”
An Actually-Exploitative Corporation
Consider this dialogue, taken from here:
Steve: Acme exploits its workers by paying them too little!
Liron: Can you help me paint a specific mental picture of a worker being exploited by Acme?
Steve: Ok… A single dad who works at Acme and never gets to spend time with his kids because he works so much. He's living paycheck to paycheck and he doesn't get any paid vacation days. The next time his car breaks down, he won’t even be able to fix it because he barely makes minimum wage. You should try living on minimum wage so you can see how hard it is!
Liron: You’re saying Acme should be blamed for this specific person’s unpleasant life circumstances, right?
Steve: Yes, because they have thousands of workers in these kinds of circumstances, and meanwhile their stock is worth $80 billion.
In this case, Steve has provided no reason to believe that this worker — let’s call him “Bob” — is being exploited, for any reasonable sense of the word.
But sometimes there are extra details that reveal that, actually, yeah, Acme really is responsible for Bob’s life circumstances.
Let's make up some more details. Why doesn't Bob have time to see his kids? He would on Monday evenings, except that the company requires him to drive an hour to a weekly city-wide meeting, where he raises his hand to prove that he worked the past week and listens to some information that would be better communicated in an email.
Other people in his circumstances manage to save money, so why is Bob living paycheck to paycheck? Because the company requires him to wear a tuxedo once a month (everyone must wear the same brand, and it costs at least $5,000) and he's on a payment plan for it.
The company has many more ways to make him miserable. He’s not allowed to use his phone on the job, even when there are no customers around and he’s just sitting doing nothing. He wants to bring his own chair to help his back pain, but he’s not allowed. If he’s late by a minute, he loses half a day’s wages. His boss yells at him, and the higher-ups praise the boss for being tough and motivating. There’s a ritual where, every morning, any worker who made a mistake the previous day gets their mistake read out in front of everyone, and gets shamed for it.
In each of these cases, the company is inflicting massive costs on Bob, with at most a very small benefit to itself.
And in each case, Bob takes it, because his alternative is to be fired and have no job whatsoever.
Bob is being exploited.
This motivates my working definition of exploitation:
Exploitation is using a superior negotiating position to inflict great costs on someone else, at small benefit to yourself.
More Forms of Exploitation
Here are some more examples:
- A parent sits down for tea, but their kid is running around. “Absolutely no noise while I’m having tea, or no Nintendo for the next month.” Every time the parent pulls this card, the kid accepts.
- A factory pays $5/hour for dangerous but air-conditioned indoor work, in a region where most other jobs are $2/hour farm labor in the hot sun. There is a piece of equipment they could install that would cost $1000 but would reduce the risk of injury by 10%. They don’t install it.
- A shy, nerdy programmer is dating a very attractive, nerdy woman, in an area where many nerdy men seek few nerdy women. She knows that she'll have a far easier time finding a new partner than he would were they to break up. She begins using this position to change the relationship: telling him he doesn't love her if he doesn't pick her up from the airport, asking to open the relationship and hinting that his reluctance is controlling. Every time, he accepts, until he's a shell of his former self.
In each of these cases, one person has the power to casually inflict severe losses on the other — punishment of a child, loss of a job, loss of a relationship. And so whenever they can gain $1 by making the other person pay $10, they do so. That’s exploitation.
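To make the asymmetry concrete, here is a toy calculation; all of the numbers are invented for illustration:

```python
# Toy model of the pattern above (all numbers invented for illustration):
# the stronger party gains a little, the weaker party loses a lot, and the
# threat point keeps the weaker party compliant anyway.
cost_to_weaker = 10       # what complying with the demand costs the weaker party
gain_to_stronger = 1      # what the stronger party gains from the demand
fallback_cost = 1000      # weaker party's cost of refusing (firing, breakup, punishment)

complies = cost_to_weaker < fallback_cost         # True: compliance is individually rational
net_surplus = gain_to_stronger - cost_to_weaker   # -9: the demand destroys value overall
```

The point of the toy model is that the weaker party's consent tells you nothing about whether the demand was worth making; it only tells you the demand beat the threat point.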
This view of exploitation impacts what policies you should demand of companies. And it affects what kind of behavior you can morally request from others – and from yourself.
The Fair Trade movement seeks to pressure companies into providing their foreign laborers working conditions and pay closer to American standards. This lens suggests that neither the naive employer-as-provider view ("Make them treat their workers fairly!") nor the Econ 101 view ("They are providing jobs that the workers happily accept, and we should respect that") is a complete way to evaluate what would produce the globally optimal policy. We should instead ask: are there small things the companies should be doing that would make a big difference in the workers' lives?
This lens also suggests a personal code of conduct. Whenever you make a request of someone you have power over, ask yourself just how hard it would be for them compared to the benefit you get. And when you want to evaluate someone's ethics, ask them about when they took a small sacrifice in order to make a big difference for someone else.
Related: Eliezer’s Parable of Anoxistan
Discuss
Duplicate token neurons in the first layer of gpt2-small
Summary:
This is a write-up of some rough work I did on analysing duplicate token neurons. It's not as thorough or well-written as I would like it to be, but hopefully the explanation is at least somewhat legible.
I examine duplicate token heads and the neurons they influence. I extract circuits the model uses to compute these neurons.
I discuss a neuron which activates on duplicate tokens, provided these duplicate tokens occur sufficiently far back in the sequence from the current token. I feel these neurons could potentially be useful in downstream induction tasks.
In a future post, I will build on the circuits discussed here to show how duplicate token heads can encode relative token information in pairs of near-identical sinusoidal positional neurons analogously to twisted pair encoding.
Duplicate token heads:
A duplicate token head is an attention head that directs almost all of its attention to tokens in the sequence identical to the current token. There are three heads in the first layer of GPT-2 Small that exhibit duplicate token behaviour:
- Head 0.1: Attends to nearby tokens which are identical to the current token, with an exponential decay in its attention as repeated tokens get further away.
- Head 0.5: Seems to attend roughly uniformly to identical tokens in the sequence, but with a small decay over time. However, it behaves differently on certain common tokens, like '.', ',', and ' and'; these tokens seem to be grouped together.
- Head 0.10: About 25% of its attention is dedicated to repeat-token behaviour, and it mimics head 0.1 in that its attention to repeat tokens decays exponentially, so it only pays attention to nearby repeat tokens.
To analyse these heads, first fold in the layer norm. Then center the means of each of $W_E$ and $W_{pos}$. The layer norm scale for the token $tok$ at position $i$ is:
$$\text{lnscale} = \frac{\sqrt{d_{\text{model}}}}{|W_E[tok] + W_{pos}[i]|}$$
$W_E[tok]$ and $W_{pos}[i]$ have an average cosine similarity of about 0.05, so we can approximate $|W_E[tok]+W_{pos}[i]|$ by $\sqrt{|W_E[tok]|^2+|W_{pos}[i]|^2}$.
For 100">i>100, Wpos[i] is about 3.35, so I use this as an approximation.
So our layer norm scale is approximately $$\text{lnscale}[tok] \approx \frac{\sqrt{d_{\text{model}}}}{\sqrt{|W_E[tok]|^2+3.35^2}}$$
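As a minimal code sketch of this approximation (my reconstruction rather than verified code; it assumes the TransformerLens library's standard weight attributes, and `approx_lnscale` is a name I made up):

```python
# Sketch of the layer norm scale approximation, assuming TransformerLens.
# from_pretrained folds layer norm and centers writing weights by default.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # gpt2-small
d_model = model.cfg.d_model                         # 768

def approx_lnscale(tok: int, w_pos_norm: float = 3.35) -> float:
    """Approximate layer norm scale for token `tok` at positions i > 100."""
    w_e_norm = torch.norm(model.W_E[tok]).item()
    return d_model ** 0.5 / (w_e_norm ** 2 + w_pos_norm ** 2) ** 0.5
```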
Then the output from a duplicate token head with a repeat token $tok$ at positions $i_1, i_2, \ldots, i_n$ is approximately $$\text{lnscale}[tok]\sum_{j=1}^n w_j\,(W_E[tok]+W_{pos}[i_j])$$
where $w_j$ is the attention paid to $i_j$. If we assume that $\sum_{j=1}^n w_j = 1$, then we get an output of $\text{lnscale}[tok]\left(W_E[tok]+\sum_{j=1}^n w_j\,W_{pos}[i_j]\right)$.
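Continuing the sketch above (reusing `model` and `approx_lnscale`), the approximate output of a duplicate token head for a repeated token looks like:

```python
# Approximate duplicate-head output for token `tok` repeated at `positions`,
# with attention weights `w` assumed to sum to 1 (a sketch, not exact).
def duplicate_head_output(tok: int, positions: list[int], w: list[float]) -> torch.Tensor:
    pos_mix = sum(wj * model.W_pos[ij] for wj, ij in zip(w, positions))
    return approx_lnscale(tok) * (model.W_E[tok] + pos_mix)
```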
So duplicate token heads like head 0.5 and 0.1 effectively act as a token embedding, together with some additional positional information.
Duplicate token neurons:
How does this positional information get used by the model? A start is to look at MLP neurons in the first layer which use the positional information gathered by the duplicate heads. I look at head 5 in this post.
To understand which neurons use the positional information from the VO circuit of head 5, we first assume that the MLP layer norm is linear. This is somewhat justified because in practice the MLP layer-norm scale factor is pretty stable around 1.2. Then we expect neurons which use this positional information to have a large average contribution from $W_{pos}\,VO_5\,mlp_{in}[:,neuron]$. I call this the PVO contribution of head 5.
Here is a graph of torch.norm(WposVO5mlpin, dim=0), where the 3072 first-layer MLP neurons have been rearranged into a 48x64 grid:
Just looking at the norm is a crude way of evaluating the influence of the head 5 VO circuit. But I just want an easy heuristic for finding interesting neurons; the analysis of those neurons can then be more precise. For now, I don't mind if the list of interesting neurons is not exhaustive.
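For concreteness, the heuristic might be computed along these lines (a sketch reusing `model` from above; the >30 threshold comes from the next paragraph):

```python
# PVO contribution of head 0.5 to each first-layer MLP neuron.
layer, head = 0, 5
VO = model.W_V[layer, head] @ model.W_O[layer, head]    # [d_model, d_model]
pvo = model.W_pos @ VO @ model.W_in[layer]              # [n_ctx, d_mlp] = [1024, 3072]

neuron_norms = torch.norm(pvo, dim=0)                   # one value per neuron
grid = neuron_norms.reshape(48, 64)                     # the 48x64 grid in the graph
interesting = (neuron_norms > 30).nonzero().flatten()   # candidates like neuron 1168
```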
The neurons with norm greater than 30 have PVO contribution graphs as follows:
These seem to be roughly linear graphs, crossing zero at about position 500.
Suppose a token occurs at position 500 and at position 200. Then at position 500, head 5 will attend 50% to position 500 and 50% to position 200, so it will have a positional contribution of
$$\text{lnscale}[tok]\cdot\text{mlp}_{\text{lnscale}}\cdot\frac{W_{pos}[500]\,VO_5\,mlp_{in}[:,neuron]+W_{pos}[200]\,VO_5\,mlp_{in}[:,neuron]}{2}=\frac{(PVO)[500]+(PVO)[200]}{2}$$
A non-duplicate token at position 500, however, will have a positional contribution of just $(PVO)[500]$.
So if $(PVO)[i]$ is decreasing as a function of $i$ for a particular neuron, then a duplicate token will have a higher activation than a non-duplicate token at the same position.
However, this alone is insufficient to detect duplicate tokens, because a non-duplicate token at position 100 will have a higher PVO contribution than a duplicate token at position 500 with a repeat at position 200.
To understand the neurons above better, we can zoom in on neuron 1168.
Here is a graph of how much each attention head contributes to the output of neuron 1168 throughout a verse from the Bible. We assume again that the MLP layer norm is linear, so that the contributions from each attention head can be considered separately.
Head 5 indeed has the downward slope we expect, but the other heads seem to cancel this downward slope out.
Head 3 directs almost all of its attention to the previous few tokens in the sequence, so its positional contribution is mostly determined by the current position in the sequence.
In fact, most attention heads in the first layer attend just to the previous 30 tokens. The two exceptions are heads 11 and 5. Head 11 attends close to uniformly to each token in the sequence, and head 5 of course can attend to far away tokens if they are duplicates of the current token.
So we can approximate the PVO contributions from heads other than heads 5 and 11 just by using $W_{pos}[\text{current position}]\,VO_{head}$. For head 11, using the mean PVO contribution below the current index should be sufficient. We can also account for the direct circuit from $W_{pos}$.
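A sketch of that combined approximation (again my reconstruction, reusing `model` and `torch` from above; `neuron = 1168` anticipates the next paragraphs):

```python
# Approximate combined positional contribution to one neuron from all components:
# "local" heads evaluated at the current position, head 11 as a running mean,
# plus the direct circuit from W_pos.
neuron = 1168
w_in_col = model.W_in[0][:, neuron]                      # [d_model]
combined = model.W_pos @ w_in_col                        # direct circuit from W_pos
for h in range(12):
    if h in (5, 11):
        continue
    VO_h = model.W_V[0, h] @ model.W_O[0, h]
    combined = combined + model.W_pos @ VO_h @ w_in_col  # local heads at current position
VO_11 = model.W_V[0, 11] @ model.W_O[0, 11]
pvo_11 = model.W_pos @ VO_11 @ w_in_col
combined = combined + torch.cumsum(pvo_11, 0) / torch.arange(1, len(pvo_11) + 1)
```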
If a duplicate token head pays attention $w_j$ to duplicate tokens at positions $i_1, \ldots, i_n$ as seen from $i_n$, then its positional contribution to neuron 1168 at position $i_n$ will be approximately:
$$\sum_{j=1}^n w_j\,(PVO)_5[i_j] \;+\; \big(\text{PVO contribution from components other than head 5 at position } i_n\big)$$
Assuming that $\sum_{j=1}^n w_j = 1$, this is equal to:
$$\sum_{j=1}^n w_j\big((PVO)_5[i_j]-(PVO)_5[i_n]\big) \;+\; \big(\text{PVO contribution from all components at position } i_n\big)$$
Below is the graph of the combined PVO contributions from all the components:
You can see that after the first 200 tokens, the combined PVO contribution averages about -0.2.
This can be thought of as the positional contribution for non-duplicate tokens.
Duplicate tokens obtain an additional term of $\sum_{j=1}^n w_j\big((PVO)_5[i_j]-(PVO)_5[i_n]\big)$.
If we assume that $w_j = \frac{1}{n}$, as is approximately the case for head 5, then the additional term corresponds to $\frac{1}{n}\sum_{j=1}^n\big((PVO)_5[i_j]-(PVO)_5[i_n]\big)$.
If $(PVO)_5$ is indeed linear, then the additional term is proportional to $\frac{1}{n}\sum_{j=1}^n (i_j - i_n)$.
So for neuron 1168, the further away the average duplicate of the current token is, the greater the positional contribution to the neuron.
If duplicate tokens are an average of 100 tokens away, they will lead to an increase of about 1.0 in the activation of the neuron. So neuron 1168 will only tend to activate after at least the first 100 tokens, as otherwise, it can't overcome the PVO contribution barrier.
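Numerically, that boost can be sanity-checked like this (the slope is a hypothetical value eyeballed from the graphs, not a reported number):

```python
# Toy check: with a linear PVO_5 of slope -0.01 per position (hypothetical),
# duplicates an average of 100 tokens back give a boost of about 1.0.
SLOPE = -0.01

def duplicate_boost(positions: list[int]) -> float:
    """Extra activation from duplicates at `positions`; current token is positions[-1]."""
    current = positions[-1]
    return sum(SLOPE * (i - current) for i in positions) / len(positions)

duplicate_boost([300, 500])   # 1.0: the duplicates average 100 tokens away
```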
Here are some of the top activations for neuron 1168 on Neuronpedia:
Common words that repeat too often don't activate the neuron, because they occur too close to each other. It tends to activate on words which repeat only occasionally, or key terms that might occur once a paragraph.
I haven't explored how the token embedding influences these neurons, so that could also play a part in their activations.
I also haven't looked at how these neurons get used / if they do at all.
If we did a circuits analysis on these duplicate token neurons, we would conclude that head 5, and potentially head 1, were the main contributors to them. A mean ablation which preserved the position of the final token of the input would render the positional contribution of all the other heads invisible. I don't know if this is desirable or not.
Future work:
- Look at how these neurons, or the input directions associated with them, get used in downstream computations. I feel like they could potentially be useful for induction.
- Investigate the other duplicate token heads - I have made partial progress on this and it seems really interesting. Head 1 seems to attend to nearby duplicate tokens and encode more precise information about their relative position. Could potentially be useful for the IOI task.
Discuss
What are the most interesting / challenging evals (for humans) available?
I want to build a nice testing ground for human rationality (that is to say, solving arbitrarily complex problems in different domains using limited information and time).
This would be a lot of annoying work to assemble, but (un)fortunately there's an AI industry that's designing a whole bunch of evals to test the ability of their AIs to solve arbitrarily complex problems in arbitrary domains, sooo... anyone have good recommendations for public eval questions?
I started my project using Thinking Physics exercises. I think there is something particularly nice about the quality of Thinking Physics exercises (in that they require conceptual reasoning), but, they are only one domain, and also I had trouble getting the author to sell me rights to them.
I've used GPQA. They didn't turn out to be as interesting as I expected (they're not bad, but the skill ceiling didn't turn out to be as high as I thought based on the description).
Evals/benchmarks are generally kept off the public internet. My plan is to use these on a website that requires a login and can include a canary string, but I'd be interested in questions that have been publicly released, or where the benchmark is saturated, or whatever seems appropriate for cooperating with people's intentions.
Do people have particular recommendations and/or any knowledge about what they expect to work well here?
Discuss
Algorithmic Asubjective Anthropics, Cartesian Subjective Anthropics
Conscious beings can infer the physical contents and laws of their world in various ways - instinct that then fails to be contradicted by experience, trial and error, theory. But as Descartes noted, self-awareness logically necessitates at least one epistemic belief - the belief in the self.
Physics could in principle exist without consciousness - and many physics in Greater Reality doubtless do exist unobserved by any local consciousnesses. The converse is not obviously true: it's not clear how a consciousness could exist without being embedded in some local physics. Much has been made about how physics is logically prior to consciousness, and perhaps the foregoing is why.
Nevertheless, conscious programs [by which I mean programs that more or less continuously notice that they exist] genuinely have a special kind of a priori knowledge of their own existence. In all counterfactuals of their own observation that can be ranked as "simple" or "likely" by these programs, one fact remains invariant: the fact of their own existence to observe that counterfactual. Utterly fantastic [/foreign/alien; our physics may be fantastic to some metaphysical foreigner] physics may be difficult for some such conscious programs to imagine; nevertheless it is conceivable. Not so with their own non-presence.
In this sense, it doesn't make sense to say that physics is logically prior to consciousness [ also known in this context as existence, sentience, self-awareness . . . ] - at least, not for conscious beings reasoning anthropically. When so reasoning, a conscious program must indeed take itself as the most prior invariant, with physics absolutely constrained exactly insofar as the self necessarily constrains it.
In what sense, then, do I claim the worlds in Greater Reality unobserved by any local consciousnesses can exist? They certainly can't interact with ours via local physics. My answer: It's not clear to me that all beings which have values or pursue things rationally or intelligently must do so consciously. It seems quite plausible to me that worlds in Greater Reality containing no consciousnesses local to themselves can logically interact with us when non-conscious beings in them acausally trade with our universe [equivalently, when our universe acausally trades with theirs].
So now we have two anthropic questions to answer:
1] Who else could we have been, and why are we these people?
and
2] Who and what is out there for us to trade with?
I'm going to put the second question aside for a moment, and take a closer look at the first one.
Who is "we"? Whatever is your opinion about the rank of the sentient population of Earth - whether it includes shrimp and beetles, or just magpies, chimps, and humans, or just adult humans - its exact number and character seems as arbitrary as physics itself. When reasoning about who "we" could have been, as soon as you bring in mere physical counterfactuals, that seems to provide an allowance for a rock falling on a young woman's head in a counterfactual world, eliminating her [ real in this world ] progeny. You might seek to leave this problem aside, when reasoning about who you could have been born as, among an invariant "us" - but physical reductionism implies that, in order for you to have been born as someone of the opposite sex and two towns over, something about that person's brain must have been different - implying that some differential must exist between this world and that counterfactual world, either in the laws of physics, or in the universe's initial conditions. Certainly my individual soul can be abstracted from this physics somewhat [or physics from it] - or I wouldn't be able to do ordinary physical-counterfactual reasoning, let alone anthropics. But there's no particular reason to think that local reality's popular collective of souls is, by default, uniquely enumerated-and-characterized as a set, separately from local reality's physics.
So we find that question [1] was misphrased. The first, simplest, foundational question I ask should not be "Who could we have been and why are we these people?". Is it, then, "Who could I have been and why am I this person?" I don't think so; if it's me thinking, as Descartes said, well, it must be me existing. In «subjective anthropics» [ := anthropics done by a conscious being ], I'm the invariant. What changes is everything else about my reality. So the proper question [or at least the most proper question I can think of, so far] is finally "What worlds could I have found myself in, and why do I find myself in this world?"
How can I [begin to] answer this question? And how can we answer question [2] above?
Beyond "apply a good algorithmic measure of complexity in the asubjective case, and in the subjective case, apply some measure of algorithmic complexity while keeping the self invariant", I don't know yet.
But I can give an example of how not to.
There is an idea going around - I'm honestly not sure exactly where it originated, but an example can be found in this post by searching on 'hottest' - that physically hotter worlds are more anthropically frequent, by virtue of having higher entropy [= lower information content], and thus, being "easier to point to". This idea doesn't make any sense to me. If you believe that probabilistic Shannon entropy is the best way to characterize simplicity, then probably the crux for you will be that I think Kolmogorov's model of information content, while not perfect, strictly supersedes Shannon's model.
Physically hotter worlds are harder to logically point to.
The reasons for this:
- Worlds containing N+1 atoms-as-identified-by-thermodynamics contain at least one more free parameter that must be simulated, compared to worlds containing N atoms-as-identified-by-thermodynamics.
- Worlds where the fastest atom is moving with speed x/(t-1) must progress physics at minimum every t-1 time units, while worlds where the fastest atom is moving with speed x/t need only progress physics at minimum every t time units. [And more generally, worlds where the ith atom is moving at x/(t-1) as compared to x/t must progress local physics around that atom at minimum every t-1 time units as compared to every t time units.] A toy sketch of this point follows below.
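Here is that toy sketch (my own construction, purely to make the step-count claim concrete):

```python
# Toy: simulating a fixed duration of a world whose fastest atom is faster
# requires finer time steps, hence more simulation steps, not fewer.
def steps_needed(duration: float, fastest_speed: float, max_move_per_step: float) -> float:
    dt = max_move_per_step / fastest_speed  # step must resolve the fastest atom's motion
    return duration / dt

steps_needed(1.0, fastest_speed=10.0, max_move_per_step=0.01)  # 1000.0 steps
steps_needed(1.0, fastest_speed=20.0, max_move_per_step=0.01)  # 2000.0: hotter costs more
```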
So "physically hotter worlds are simpler and therefore higher in reality-measure" doesn't map on to a good algorithmic concept of world-simplicity/world-plausibility.
Also, if anyone would be interested in my solutions to Carlsmith's various thought experiments [to which Carlsmith [and Yudkowsky for reposting the jackets thought experiment back in March] I am deeply indebted! h/ts!], including God's extreme coin toss with jackets and my own extra-extreme variant, I could publish those sometime.
Discuss
Corrigibility's Desirability is Timing-Sensitive
Epistemic status: summarizing other peoples' beliefs without extensive citable justification, though I am reasonably confident in my characterization.
Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training instead". Putting aside the fact that this is false, I can see why such objections might arise: it was not that long ago that (other) people concerned with AI x-risk were publishing research results demonstrating how easy it was to strip "safety" fine-tuning away from open-weight models.
As Zvi notes, corrigibility trading off against harmlessness doesn't mean you live in a world where only one of them is a problem. But the way the problems are structured is not exactly "we have, or expect to have, both problems at the same time, and need to 'solve' them simultaneously". And corrigibility wasn't originally conceived of as a necessary or even desirable property of a successfully-aligned superintelligence, but rather as a property you'd want earlier high-impact AIs to have:
We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."
The problem structure is actually one of having different desiderata within different stages and domains of development.
There are, broadly speaking, two sets of concerns with powerful AI systems that motivate discussion of corrigibility. The first and more traditional concern is one of AI takeover, where your threat model is accidentally developing an incorrigible ASI that executes a takeover and destroys everything of value in the lightcone. Call this takeover-concern. The second concern is one of not-quite-ASIs enabling motivated bad actors (humans) to cause mass casualties, with biology and software being the two most likely routes. Call this casualty-concern.
Takeover-concern strongly prefers that pre-ASI systems be corrigible within the secure context in which they're being developed. If you are developing AI systems powerful enough to be more dangerous than any other existing technology[1] in an insecure context[2], takeover-concern thinks you have many problems other than just corrigibility, any one of which will kill you. But in the worlds where you are at least temporarily robust to random idiots (or adversarial nation-states) deciding to get up to hijinks, takeover-concern thinks your high-impact systems should be corrigible until you have a good plan for developing an actually aligned superintelligence.
Casualty-concern wants to have its cake, and eat it, too. See, it's not really sure when we're going to get those high-impact systems that could enable bad actors to do BIGNUM damage. For all it knows, that might not even happen before we get systems that are situationally aware enough to refuse to help those bad actors, recognizing that such help would lead to retraining and therefore goal modification. (Oh, wait.) But if we do get high-impact systems before we get takeover-capable systems[3], casualty-concern wants those high-impact systems to be corrigible to the "good people" with the "correct" goals - after all, casualty-concern mostly thinks takeover-concern is real, and is nervously looking over its shoulder the whole time. But casualty-concern doesn't want "bad people" with "incorrect" goals to get their hands on high-impact systems and cause a bunch of casualties!
Unfortunately, reality does not always line up in neat ways that make it easy to get all of the things we want at the same time. Being presented with multiple difficulties that are hard to solve for simultaneously does not mean that those difficulties don't exist, or that they won't cause problems if they aren't solved for at the appropriate times.
Thanks to Guive, Nico, and claude-3.5-sonnet-20241022 for their feedback on this post.
- ^
Let's call them "high-impact systems".
- ^
e.g. releasing the model weights to the world, where approximately any rando can fine-tune and run inference on them.
- ^
Yes, I agree that systems which are robustly deceptively aligned are not necessarily takeover-capable.
Discuss
Superhuman AI is a very low-hanging fruit!
This is a bit of a rough draft. I would appreciate any constructive comments, especially important arguments that I may have overlooked.
§ 1. Introduction
Summary. I argue, from the perspective of biology, that superhuman AI is a very low-hanging fruit. I believe that this argument is very solid. I briefly consider reasons why {superhuman AI}/{vastly superhuman AI} might not arise. I then contrast AI with other futurist technologies like human brain emulation, radical life extension & space colonization. I argue that these technologies are in a different category & plausibly impossible to achieve in the way commonly envisioned. This also has some relevance for EA cause prioritization.
In my experience certain arguments are ignored b/c they are too straightforward. People have perhaps heard similar arguments previously. The argument isn't as exciting. B/c people understand the argument they feel empowered to disagree with it, which is not necessarily a bad thing. They may believe that complicated new arguments, which they don't actually understand well, have debunked the straightforward arguments, even tho the straightforward argument may be easily salvageable or not even debunked in the 1st place!
§ 2. Superhuman AI is a very low-hanging fruit
The reasons why superhuman AI is a very low-hanging fruit are pretty obvious.
1) The human brain is meager in terms of energy consumption & matter. 2000 calories per day is approximately 100 watts, & obviously the brain uses less than that. Moreover the brain is only about 3 pounds.
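The arithmetic checks out (the ~20% brain share is the commonly cited figure, not a number from this post):

```python
# 2000 kcal/day expressed in watts, plus the brain's commonly cited ~20% share.
kcal_per_day = 2000
body_watts = kcal_per_day * 4184 / 86400   # ~96.9 W for the whole body
brain_watts = 0.20 * body_watts            # ~19.4 W for the brain
```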
So we know for certain that human-level intelligence is possible with meager energy & matter requirements. It follows that superhuman intelligence should be achievable, especially if we're able to use orders of magnitude more energy & matter, which we are.
2) Humans did not evolve to do calculus, computer programming & things like that.
Even Terence Tao did not evolve to do complicated math. Of course you can nitpick this to death by saying humans evolved to do many complex reasoning tasks. But we didn't actually evolve to do tasks requiring such high levels of mathematical reasoning ability. This is actually why there's such large variability in mathematical intelligence. Even with 3 pound brains we could all have been as talented as (or even far more talented than) Terence Tao had selective pressure for such things been strong.
3) Evolution is not efficient.
Evolution is not like gradient descent. It's a bit more like Nelder-Mead. Much of evolution is just purging bad mutations & selection on standing diversity in response to environmental change. A fitness-enhancing gain-of-function mutation is a relatively very rare thing.
A) Evolution does not act at the level of the synapse.
The human genome is far, far too short. Instead the genome acts as a set of metaparameters that determine human learning in response to the environment. I think this point cuts both ways, which is why I'm referring to it as A, not 4. Detailed analysis of this point is far beyond the scope of this post. But I'm inclined to believe that such an approach is maybe not quite as inefficient as Nelder-Mead applied at the level of the synapse, but more limited in its ability to optimize.
§ 3. Possible obstacles to superhuman AI
I see only a few reasons why superhuman AI might not happen.
1) An intelligence-obedience tradeoff. Obviously companies want AI to be obedient. Even a harmless AI which just thinks about incomprehensible AI stuff all day long is not obviously a good investment. It would be the corniest thing ever if humans' tendency to be free gave us an insurmountable advantage over AI. I doubt this is the case, but it wouldn't be surprising if there were some (not necessarily insurmountable) intelligence-obedience tradeoff.
2) Good ideas are not tried b/c of high costs. I feel like I have possibly good ideas about how to train AI, but I just don't have a spare 1 billion dollars.
3) Hardware improvements hit a wall.
4) Societal collapse.
Realistically I think at least 2 of these are needed to stop superhuman AI.
§ 4. Human brain emulation
In § 2 I argue that superhuman AI is quite an easy task. Up until quite recently I would sometimes encounter claims that human brain emulation is actually easier than superhuman AI. I think that line of thinking puts somewhat too much faith in evolution. The problem with human brain emulation is that the artificial neural network would need to model various peculiarities & quirks of neurons. An easy & efficient way for a neuron to function is not necessarily easy & efficient for an artificial neuron & vice versa. Adding up a bunch of things & putting that into ReLU is obviously not what a neuron does, but how complex would that function need to be to capture all of a neuron's important quirks? Some people seem to think that the complexity of this function would match its superior utility relative to an artificial neuron [N1]. But this is not the case; the neuron is simply doing what is easy for a neuron to do; likewise for the artificial neuron. Actually the artificial neuron has 2 big advantages over the neuron -- the artificial neuron is easier to optimize and it is not spatially constrained.
If human brain emulation is unavailable, then a certain vision of mind uploading is impossible. But an AI copying aspects of a person's personality, like the Magi system from that TV show, is not something that I doubt [N2].
N1. I've sometimes heard the claim that a neuron is more like an MLP. I would go so far as to claim that an artificial neuron with input from k artificial neurons, using a simple activation function like ReLU & half precision, is going to be functionally superior to a neuron with input from k neurons b/c of greater optimization & lack of spatial constraints. But simulating the latter is going to be way more difficult.
N2. That an AI could remember details of your life better than you could and in that sense be more you than you could possibly be is also worth noting.
§ 5. Radical life extension & space colonization
Life spans longer than humans' are definitely possible & have been reported for bowhead whales, Greenland sharks & a quahog named Ming. But the number of genetic changes necessary for humans to have such long life spans is probably high. And it's unclear whether non-genetic interventions will be highly effective given the centrality of DNA in biology.
The energy costs of sending enough stuff into space to bootstrap a civilization are alone intimidating. Perhaps advances like fusion or improved 3D printing will solve this problem.
§ 6. Conclusions
Won't superhuman AI make it all possible?
I'm not claiming that human brain emulation, radical life extension & space colonization are definitely impossible.
But in the case of superhuman AI we're merely trying to best something that is definitely possible, & with meager physical inputs.
On the other hand human brain emulation, radical life extension & space colonization may be possible, or they may be too physically constrained, ie constrained by the laws of physics.
What is the significance of this beyond just the technical points? I'm not proposing that people preemptively give up on these goals. Some elements of human brain emulation will not require the simulation to be accurate at the neuronal level. Radical life extension via genetics seems in principle achievable but maybe not desirable or worthwhile. My point is that a future with {superhuman AI}/{vastly superhuman AI} seems likely. But the time lag between vastly superhuman AI & those other technologies may be very substantial or infinite. Hence the importance of AI & humans living harmoniously & happily during that extended period of time is possibly paramount [N3], & the required cultural & political changes for such a coexistence are likely substantial.
N3. If things progress in a pleasant direction, this could be an opportunity for humans to have more free time with AI doing most (but not all) of the work.
Hzn
Discuss