
# LessWrong.com News

*Address:* https://www.lesswrong.com/

*Updated:* 49 minutes ago

### Covid 8/27: The Fall of the CDC

Most weeks, the disaster that was the head of the FDA not having any understanding of statistics and not making any attempt to think about the world would have been the headline. Then the CDC decided to revise its guidelines on testing from being for it to largely being against it, under pressure from the White House, and suddenly it’s the B-story.

It seems that every day there is a new thing surfacing to enrage me. The difference is that early in the pandemic, every day something would *terrify* me. I’m still periodically scared in an existential or civilization-is-collapsing-in-general kind of way, but not in a ‘the economy is about to collapse’ or ‘millions of Americans are about to die’ kind of way.

I’m not sure whether this is progress.

Either way, a reminder that I’ve started a substack on sports, sports gambling, and sports modeling to avoid cluttering up this blog, so check it out here if you have yet to do so and that is relevant to your interests.

That’s out of the way. Let’s run the numbers.

Positive Test Counts

| Date | WEST | MIDWEST | SOUTH | NORTHEAST |
|---|---|---|---|---|
| June 11-June 17 | 41,976 | 22,510 | 75,787 | 17,891 |
| June 18-June 24 | 66,292 | 26,792 | 107,221 | 15,446 |
| June 25-July 1 | 85,761 | 34,974 | 163,472 | 16,303 |
| July 2-July 8 | 103,879 | 40,139 | 202,863 | 18,226 |
| July 9-July 15 | 108,395 | 53,229 | 250,072 | 20,276 |
| July 16-July 22 | 117,506 | 57,797 | 265,221 | 20,917 |
| July 23-July 29 | 110,219 | 67,903 | 240,667 | 26,008 |
| July 30-Aug 5 | 91,002 | 64,462 | 212,945 | 23,784 |
| Aug 6-Aug 12 | 93,042 | 61,931 | 188,486 | 21,569 |
| Aug 13-Aug 19 | 80,887 | 63,384 | 156,998 | 20,857 |
| Aug 20-Aug 26 | 67,545 | 66,540 | 132,322 | 18,707 |

Apologies for the lack of charts last week – it seems they didn’t copy properly from Google Docs to WordPress and I didn’t notice.

Positive test counts are declining rapidly in the West and South, and slowly in the Northeast. Before, I would have considered that strong evidence that things are going great in those areas. Now, with testing on the decline, it’s not that simple, and we’ll have to dig into the positive test percentages and deaths to know for sure.

Deaths

| Date | WEST | MIDWEST | SOUTH | NORTHEAST |
|---|---|---|---|---|
| June 11-June 17 | 778 | 1,040 | 1,207 | 1,495 |
| June 18-June 24 | 831 | 859 | 1,204 | 1,061 |
| June 25-July 1 | 858 | 658 | 1,285 | 818 |
| July 2-July 8 | 894 | 559 | 1,503 | 761 |
| July 9-July 15 | 1,380 | 539 | 2,278 | 650 |
| July 16-July 22 | 1,469 | 674 | 3,106 | 524 |
| July 23-July 29 | 1,707 | 700 | 4,443 | 568 |
| July 30-Aug 5 | 1,831 | 719 | 4,379 | 365 |
| Aug 6-Aug 12 | 1,738 | 663 | 4,554 | 453 |
| Aug 13-Aug 19 | 1,576 | 850 | 4,264 | 422 |
| Aug 20-Aug 26 | 1,503 | 745 | 3,876 | 375 |

Deaths continue to tell the story that the South and West have turned the corner, the Northeast is still making steady progress, and the Midwest did indeed have deaths transferred from the Aug 6 – Aug 12 week to the Aug 13 – Aug 19 week and is still steady or slowly getting worse as of a few weeks ago.

Each day, we see the 7-day average deaths fall slightly. If anything, it’s a little too consistent and makes me suspicious, but that is *probably* paranoia. Probably.

Positive test percentages are actually up slightly in the South this week, down in the Northeast, and down slightly in the West:

| Percentages | Northeast | Midwest | South | West |
|---|---|---|---|---|
| 7/16 to 7/22 | 2.49% | 5.13% | 13.29% | 8.56% |
| 7/23 to 7/29 | 2.54% | 5.51% | 12.32% | 7.99% |
| 7/30 to 8/5 | 2.58% | 7.26% | 12.35% | 6.68% |
| 8/6 to 8/13 | 2.30% | 5.67% | 14.67% | 6.98% |
| 8/13 to 8/20 | 2.06% | 5.62% | 9.41% | 6.47% |
| 8/20 to 8/26 | 1.86% | 5.78% | 9.93% | 5.88% |

Hospitalizations are slightly down as well.

A decrease in testing *should* slightly increase positive test percentages, so this isn’t inconsistent with the South’s situation continuing to improve, but it’s also definitely not a good sign. A reminder that you can check the state-by-state data in my spreadsheet under the Positive Test Percentages tab here.
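To see why a decrease in testing should raise measured positivity, here is a toy sketch with invented numbers: if tests split into high-yield targeted testing (symptomatic or exposed people) and low-yield broad screening, cutting mostly the screening tests raises the measured positive rate even when nothing about the underlying epidemic changes.

```python
# Toy illustration (all numbers invented): why cutting back testing tends
# to raise the measured positive test percentage even at constant true
# prevalence. Two pools of tests: targeted (high hit rate) and broad
# screening (low hit rate).

def measured_positivity(targeted, screening, p_targeted=0.20, p_screening=0.02):
    """Blended positive test percentage for a given mix of test volumes."""
    positives = targeted * p_targeted + screening * p_screening
    return positives / (targeted + screening)

before = measured_positivity(targeted=1_000_000, screening=4_000_000)
# Cut mostly the low-yield screening tests; keep the targeted ones.
after = measured_positivity(targeted=1_000_000, screening=2_000_000)

print(f"before: {before:.1%}, after: {after:.1%}")
# before: 5.6%, after: 8.0%
```

So a mild rise in positivity alongside falling test counts is ambiguous on its own, which is why the deaths and hospitalization numbers are needed to disambiguate.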

For the first time, the Midwest did more tests this week than the West did, and they continue to have similar positive percentages on tests, so it will be interesting to see how long it takes their death differential to close.

Hawaii seems to have turned a corner, or at least it’s no longer seeing unchecked exponential growth. Arizona and Florida continue to make steady progress, with Arizona continuing to go faster. Texas news is not as good, and looks ambiguous due to its dramatic drop in testing.

South Dakota and Iowa saw big scary jumps in positive test percentages. North Dakota also got noticeably worse. Minnesota continues to be headed in the wrong direction and looks poised for a new wave of protests and unrest.

Test Counts

| Date | USA tests | Positive % | NY tests | Positive % | Cumulative Positives |
|---|---|---|---|---|---|
| June 11-June 17 | 3,453,440 | 4.6% | 442,951 | 1.1% | 0.66% |
| June 18-June 24 | 3,686,336 | 5.9% | 440,833 | 1.0% | 0.72% |
| June 25-July 1 | 4,352,981 | 7.1% | 419,696 | 1.2% | 0.82% |
| July 2-July 8 | 4,468,850 | 8.2% | 429,804 | 1.1% | 0.93% |
| July 9-July 15 | 5,209,243 | 8.4% | 447,073 | 1.1% | 1.06% |
| July 16-July 22 | 5,456,168 | 8.6% | 450,115 | 1.1% | 1.20% |
| July 23-July 29 | 5,746,056 | 7.9% | 448,182 | 1.1% | 1.34% |
| July 30-Aug 5 | 5,107,739 | 7.8% | 479,613 | 1.0% | 1.46% |
| Aug 6-Aug 12 | 5,121,011 | 7.3% | 502,046 | 0.9% | 1.58% |
| Aug 13-Aug 19 | 5,293,536 | 6.2% | 543,922 | 0.8% | 1.68% |
| Aug 20-Aug 26 | 4,785,056 | 6.0% | 549,232 | 0.8% | 1.77% |

After two weeks of what looked like stable test counts, we once again headed substantially in the wrong direction. The positive test percentage declining in spite of this is of course good news, but it raises the worry that testing has shifted to where it is least needed while being shut down where it is needed most. New York’s testing keeps expanding even as its case counts decline.
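As a sanity check on the table above, weekly positives should be roughly tests times the positive rate, and the cumulative-positives column should grow by about that amount as a share of the US population. A quick sketch (the 330 million population figure is my assumption, not from the table):

```python
# Sanity check: weekly positives ~= tests * positive rate, and the
# cumulative-positives column (share of US population) should grow by
# about that many percentage points per week.

US_POPULATION = 330_000_000  # assumption, not from the table

# (tests, positive rate) for the last two rows of the table
aug_13_19 = (5_293_536, 0.062)
aug_20_26 = (4_785_056, 0.060)

for (tests, rate), label in [(aug_13_19, "Aug 13-19"), (aug_20_26, "Aug 20-26")]:
    weekly_positives = tests * rate
    pct_point_gain = weekly_positives / US_POPULATION * 100
    print(f"{label}: ~{weekly_positives:,.0f} positives, +{pct_point_gain:.2f} pct points")
```

This matches the table: cumulative positives go from 1.58% to 1.68% (about +0.10) and then to 1.77% (about +0.09).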

The numbers look like one would have expected them to look last week, plus a decline in tests. There is slow improvement in deaths and hospitalizations. The outlook continues to be that things will continue much as they are for quite a while, in terms of the virus itself.

Thus, the news this week lies elsewhere.

#### Head of FDA Fails Statistics Forever

See this tweet from my friend Andrew Rettek, noting that the head of the FDA is far from alone in failing statistics forever.

(Or as the new cowards in charge of the page at TVTropes are now calling it, ‘Artistic License: Statistics.’ To me, that’s another sign of an alarming societal shift. It’s not ‘artistic license,’ it’s some combination of lying/fraud and a failure to understand how numbers work on a deep and essential level. There are times when it makes sense to use a little ‘artistic license’ with your physics or biology to tell a good story. But much more frequently, They Just Didn’t Care about what is true and what causes what in the physical world, the same as others who are making actual decisions with actual consequences don’t care either. If you don’t apply for the license, you don’t have one. Period. I call for the immediate and total restoration of failing forever.)

And there is no such license for statistics or mathematics. You just fail. Period.

The mistake here isn’t one that someone who knows how numbers work would ever make. It’s also not one that anyone who thinks about what their statements translate to in the real world would make.

It’s a statement one makes when one is doing word manipulations: one sees a 35% on a piece of paper, doesn’t notice at all that this comes from a drop from 11.9% to 8.7%, and decides that it means 35% of all patients will go from dying to not dying, rather than 3.2%.
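The arithmetic is worth spelling out. A quick sketch using the mortality figures quoted above (note that the headline 35% came from a particular subgroup comparison, so the gap between it and the raw relative reduction here is expected):

```python
# Relative vs. absolute risk reduction: a drop in mortality from
# 11.9% to 8.7%.
baseline, treated = 0.119, 0.087

absolute_reduction = baseline - treated             # share of ALL patients
relative_reduction = absolute_reduction / baseline  # share of would-be deaths

print(f"absolute: {absolute_reduction:.1%} of all patients")        # 3.2%
print(f"relative: {relative_reduction:.0%} of those who would die")  # 27%
```

Claiming that "35 out of 100 sick people would be saved" treats a relative reduction among the dying as an absolute reduction among everyone, which is exactly the confusion at issue.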

No, seriously, they doubled down on this pretty hardcore. Check this out from deeper in that thread:

And now the FDA's Twitter account is promoting Hahn's disinfo.

Full quote from Hahn: "Let me just put that in perspective… What that means is if… 100 people who are sick with covid-19, 35 would have been saved because of the administration of plasma." https://t.co/XL8he5YWQ6

He did eventually walk it back and apologize. But think about the mindset required to get things this wrong. To not stop and think ‘wait, I don’t think 35% of patients even die, something’s wrong’ and check again. Until verified, everything such people say has to be presumed likely to be complete nonsense or propaganda, with no relation to the physical world.

That doesn’t mean the treatment itself is bad.

#### A Miasma of Convalescent Plasma

So does convalescent plasma work?

Wired reports that 97,000 People Got Convalescent Plasma. Who Knows if It Works? The article contains lots of good information. Clearly epic ball droppings occurred. No one wanted to fund the clinical trials. If you’re wondering ‘all we had was money, what could we possibly have done to help?’: that’s something you could have done to help. We had trials ready to go, with no ethical issues; all they needed was funding. They didn’t happen, so now ‘no one knows if it works.’

Of course, that’s complete nonsense. It works.

It’s not the major breakthrough the President claimed it was. The range of possible effectiveness is ‘somewhat’ to ‘actually quite a bit’ but it’s still expensive in several senses to administer and all but certain to be supplanted by superior treatments soon. As that second link explains, convalescent plasma is a classic emergency stopgap strategy that is used until better options can come online.

Still, it works. If someone is a patient and the plasma is available, until better options come along (this is not medical advice, because legal reasons), there’s no reason I can see not to take the plasma. As everyone says, it’s a historically safe treatment and seems clearly safe here as well. In terms of how effective it is, we know that patients who took it earlier did better than those who took it later, and those who got a stronger dose did better than those who got a weaker dose.

Which likely means that even the strongest doses were probably smaller than they should have been, given that we have access to plentiful plasma. More dakka!

People can protest that ‘there are no RCTs’ all they like. That doesn’t make Bayes’ rule go away. This one is very cut and dried.

The issue is that if one is doing a tradeoff between the logistical costs of treatment versus spending those resources elsewhere, it’s hard to know what value to put down, but a good estimate does not seem that hard. The 35% number seems large in the sense that it is somewhat cherry picked, but it also seems like there’s a lot of room for improvement in the protocol in terms of timing, quantity/quality and procedure, as we learn more. So if we can get our testing done in a timely manner and get the treatment to patients quickly, my guess is this is good for at least that much. Which is a big forking deal.

Of course, a better treatment for those who get hospitalized if anything makes it harder to squash the virus, rather than easier, given the resulting changes in behavior. So this doesn’t put us any closer to normal. Still, great news as far as it goes.

#### FDA Fails To Approve Tests, People Die and Economy Collapses

A ‘mid-mortem’ came out this week on the FDA’s test approval policy. It mostly reinforces what we already knew: the FDA made an active choice to shut down testing and force all action to go through proper channels, refusing to use its discretion; it then continued to require constantly changing actions and paperwork in order to grant approval, while not prioritizing the handling of said actions and paperwork on its end. Nor did it communicate its changing priorities and rules to those who needed to follow them.

Many companies report waiting *for months* for word from the FDA, after filling out what they believed were the necessary requirements and paperwork, only to have their tests stranded in limbo.

The biggest news there is that this process wasn’t only wholly unnecessary and monumentally destructive, but according to an HHS statement on August 19, much of it was not even within the FDA’s jurisdiction to impose; many of its actions lacked a legal basis.

Once again: FDA delenda est.

#### Musician Plays For Crowd

In ‘someone actually did any study of anything at all, so I’ll take it’ news, Coronavirus: Germany puts on crowded concerts to study risks.

To quote the article: The concert study, called Restart-19, was created “to investigate the conditions under which such events can be carried out despite the pandemic,” researchers said.

As far as I can tell, what they studied was *where people stood during the concert*?

It’s not clear what else they could have done with this setup. It’s considered unethical to let anyone assume Covid-19 risk in order to help anyone else, so they made sure to test everyone for Covid-19 before the concert. At which point, they studied… the complete non-spreading of Covid-19 that took place, since everyone had tested negative? That does not seem super useful.

If you want to ‘investigate the conditions under which such events can be carried out’ and you’ve already decided the answer is ‘by testing everyone before the concert’ then I believe you have your answer. The good news is all we have to do is let people use reasonable and cheap tests, and then all that testing can be funded with less of an extra fee than the Ticketmaster ‘online purchase’ surcharge on tickets.

#### Area Couple Does Math, Also Each Other, Probably Didn’t Actually Do Math

A common question I get asked, and also that I ask myself, is what activities are how risky (Google thinks the ‘are how risky’ is a grammar error and wants to correct it to ‘are risky,’ which explains a lot of how we’re in this mess!), and whether a given activity is a sane or responsible thing to be doing given the pandemic.

What about dating and casual sex, or sex in general? (link is to Washington Post)

Several times there have been fun news items about municipalities trying to ban three ways, or encourage glory holes, or other similar things. I never thought any of those attempts would accomplish anything, but I was always both amused and delighted by such efforts. They represent an attempt to think about the physical world and the consequences of actions, and make decisions on that basis.

Setting aside that casual sex, dating, and the pursuit of happiness in general are considered by the Very Serious People to be ‘inessential,’ making it bad to ‘take risk’ on them: how risky is it, and are there steps one can take to limit that risk?

The upper bound on risk is that you’re *completely* exposed to one person. If they have it, you get it. It’s hard to think of something that would more reliably spread Covid-19 than an extended French kiss. That certainly has a leg up on ‘droplet spread’ and is a much scarier prospect than the sexual act itself.

It seems *very* reasonable to pick your spots and then say ‘you know what, I don’t want to worry about this, and it’s not clear that in practice I can reduce the risk much further with precautions that would be worth it, so we’ll just accept that risk and do whatever.’ Or perhaps you can make an effort to avoid anything touching anyone’s nose or mouth, or even leave the masks fully on while everything else comes off. To each their own.

But is it reasonable to expose yourself completely to one person in order to get lucky? If everyone is young and has been mostly isolating, and has no symptoms, this seems like it ends up pretty far down on the actual list of downsides to what you’re about to do. If the encounter would have been super exciting before, and again if everyone is otherwise being responsible and isn’t high risk, judicious amounts of this seem to me like an efficient use of one’s (and society’s) risk points. But keep in mind that doing it more with more different people dramatically increases the risk per encounter, so let’s not go too crazy.

#### Let’s Go, Mets

My team, the New York Mets, missed five games because one player and one staff member tested positive for Covid-19. Of course, this happened in Miami. Out of an abundance of caution, the team returned to New York and quarantined for five days before resuming play.

It’s good to see baseball acting responsibly. If anything, this was more caution than necessary.

Since more sports games are now being cancelled for purely symbolic protest reasons than for Covid-19 outbreaks, I’d say the restart of sports has been a smashing success.

#### The Simulacra of a Mask

While watching the disaster that is the Mets, I saw an ad imploring us all to wear face masks. It showed pictures of people wearing masks that said things like “for my husband” on them. I worry about the message ‘you wear a mask for other people’ given how people are sometimes, but mostly I don’t worry about this. Then the slogan across the screen: Show You Care. Wear a Mask.

So we’ve fully moved on to simulacra level 3.

We could go with the Level 1 message that masks prevent infection. We could go with the Level 2 message that they don’t, as many have tried at various points for various reasons, or that they are far more effective than they actually are, which has also been tried. Such messages are false, but they rest on the premise that people will decide whether masks help, and then do what is helpful and not what is unhelpful.

Instead, we are being given the message that we must signal that we care about others, therefore we must wear a mask.

The mask is being advertised as similar to not playing a basketball game to show you care about police violence. The basketball game does not cause police violence.

The message is, hey, we know that wearing masks makes today a little bit worse, but it’s worth it, because you need to virtue signal. Bad things are happening, so you need to give up your nice thing, take away nice things and pressure others to give up their nice things, or people will think you do not care. And maybe, just maybe, if enough of us show we care, then someone will care enough to *actually forking do something useful*.

This is very important, so I want to say this as clearly as I can: Fuck. That. Shit.

Wearing a mask is vital to preventing Covid-19 infection. Wear a forking mask.

But only because it *works*.

If you don’t think it works, don’t do it.

#### You Had One Job

The CDC decides third time’s the charm, tries once more with feeling, and insists it needs to join the Delenda Est club.

It’s making a strong case!

You are literally called the *Centers for Disease Control*. Your one job is to *control disease*.

So I want to know what the flying fork is going on over there, given the new guidance they issued this week. First link is the guidance, second and third are news coverage.

An attempt at a “good faith” interpretation of the new testing guidelines, that you ‘do not necessarily need a test’ even if you have been within 6 feet of a *known positive* for *fifteen minutes*, is that the CDC is *lying*. That they are doing the exact same thing with testing that was previously done with masks.

The plan would be to say:

- Testing doesn’t work for you!
- Save the testing for those who really need it!

Because there aren’t enough tests, or doing too many tests is slowing down our turnaround times.

An attempt at an *actual* good faith interpretation is that these tests aren’t actually useful. That if you get exposed, getting a test has no value until you have symptoms. The argument goes that a negative test doesn’t mean you won’t become positive later, so you have to quarantine anyway, so what’s the point in bothering. Because that’s how humans work, and that’s how human lives function. And no one else might want to know if *they* had been exposed in turn.

An attempt at a not-so-good faith interpretation is that this is the result of pressure from the White House, who want to suppress testing so the numbers look superficially better.

Those were the three I could come up with.

And we have our answer, which I saw after writing the rest of this section. The directive came under pressure from the White House.

Then there’s the other half of the new guidance, which states that travelers no longer need to quarantine on arrival. As one would expect, local officials everywhere seem to disagree, such as in California. I would be floored if New York lifted its travel restrictions.

Basically, the CDC is taking off the table the two things we actually know how to do to stop the spread, beyond modifying day-to-day normal behavior. Our four tactics are, essentially: wear a mask, socially distance, test and contact trace, and restrict travel to low-infection areas. The CDC is saying that contact tracing and a lot of testing isn’t worth it so don’t bother, and that restricting travel isn’t worth it. Except that those are *the only tactics that have been shown, anywhere, to actually work*.

We need more testing. Even the money Congress allocated for testing, paltry as it was, mostly hasn’t been spent. Test counts are *declining* rather than increasing over the last few weeks, which is alarming. And while we need more testing, it would make sense to tell people who are low priority that they can’t be tested if tests were scarce. But this is something else, and the stand against quarantines on travel makes no sense except maybe as some effort by spiteful people to get back at the blue states, either by tricking them into following it, or by getting to argue they are not ‘following CDC guidelines.’

The problem with putting the CDC in the Delenda Est club is that the CDC actually does have its one job to do. We should be doing disease control. The problem with the FDA is that *it is doing its job*. The problem with the CDC is that *it is NOT doing its job*. Big difference. So we need to fix the problems, but alas, I can’t let the CDC into the club. Maybe we need to burn it down and start over, but we’d then need to rebuild it, faster, stronger, better. Under no circumstances should we salt the earth.

But for now, we should treat the CDC as having been captured by the White House, and the White House as fully committed to misinformation. Thus, the CDC is now fully committed to misinformation. Going forward, at least until Trump leaves office, treat anything and everything from the CDC with extreme caution, as plausibly motivated propaganda efforts to re-elect the President.


### Technical model refinement formalism

Introduction

MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} This post aims to formalise the models and model changes/model splintering described in this post. As explained there, the idea is to have a meta-model sufficiently general to be able to directly capture the process of moving from one imperfect model to another.

A note on infinityFor simplicity of exposition, I'll not talk about issues of infinite sets, continuity, convergence, etc... Just assume that any infinite set that comes up is just a finite set, large enough for whatever practical purpose we need it for.

Features, worlds, environmentsA model M is defined by three object, the set F of features, the set E of environments, and a probability distribution Q. We'll define the first two in this section.

FeaturesFeatures are things that might be true or not about worlds, or might take certain values in worlds. For example, "the universe is open" is a possible feature about our universe, "the temperature is 250K" is another possible feature, but instead of returning true or false, it returns the temperature value. Adding more details, such as "the temperature is 250K in room 3, at 12:01" show that features should also be able to take inputs: features are functions.

But what about "the frequency of white light"? That's something that makes sense in many models - white light is used extensively in many contexts, and light has a frequency. The problem with that statement is that light has multiple frequencies; so we should allow features to be, at least in some cases, multivalued functions.

To top that off, sometimes there will be no correct value for a function; "the height of white light" is something that doesn't mean anything. So features have to include partial functions as well.

Fortunately, multivalued and partial functions are even simpler than functions at the formal level: they are just relations. And since the sets in the relations can consist of a single element, in even more generality, a feature is a predicate on a set. We just need to know which set.

So, formally, a feature F∈F consists of an (implicit) label defining what F is (eg "open universe", "temperature in some location") and a set on which it is a predicate. Thus, for example, the features above could be:

- Fopen universe={0} (features which are simply true or false are functions of a single element).
- Ftemperature={R+}.
- Ftemperature at location and time={L×T×R+}, for some set L of locations and T of possible times.
- Ffrequency of specific light={R+}.
- Fheight of object={O×R+} for O a set of objects.

Note these definitions are purely syntactic, not semantic: they don't have any meaning. Indeed, as sets, Ftemperature and Ffrequency of specific light are identical. Note also that there are multiple ways of defining the same things; instead of a single feature Ftemperature at location and time, we could have a whole collection of Ftemperature at land t for all (l,t)∈L×T.

WorldsIn Abram's orthodox case against utility functions he talks about the Jeffrey-Bolker axioms, which allows the construction of preferences from events *without needing full worlds at all*.

Similarly, this formalism is not focused on worlds, but it can be useful to define the full set of worlds for a model. This is simply the possible values that all features could conceivably take; so, if ¯¯¯¯¯F=⊔FF is the disjoint union of all features in F (seen as sets), the set of worlds W is just W=2¯¯¯¯F, the powerset of ¯¯¯¯¯F - equivalently, the set of all functions from ¯¯¯¯¯F to {True,False}.

So W just consists of all things that could be conceivably distinguished by the features. If we need more discrimination than this - just add more features.

EnvironmentsThe set of environments is a subset E of W, the set of worlds (though it need not be defined via W; it's a set of functions from ¯¯¯¯¯F to {True,False}.

Though this definition is still syntactic, it starts putting some restrictions on what the semantics could possibly be, in the spirit of this post.

For example, E could restrict to situations where Ftemperature is a single valued function, while Ffrequency of specific light is allowed to be multivalued. And similarly, Fheight of a specific oject takes no defined values on anything in the domain of Ffrequency of specific light.

ProbabilityThe simplest way of defining Q is as a probability distribution over E.

This means that, if E1 and E2 are subsets of E, we can define the conditional probability

Q(E1∣E2)=Q(E1∩E2∣E2).

Once we have such a probability distribution, then, if the set of features is rich enough, this puts a lot more restrictions on the meaning that these features could have, going a lot of the way towards semantics. For example, if Q captures the ideal gas laws, then there is a specific relation between temperature, pressure, volume, and amount of substance - whatever those features are labelled.

In general, we'd want Q to be expressible in a simple way from the set F of features; that's the point of having those features in the first place.

Broader definition of "probability"The plan for this meta-formalism is to allow transition from imperfect models to other imperfect models. So requiring that they have a probability distribution over all of E may be too much to ask.

In practice, all that is needed is expressions of the type Q(E1∣E2). And these may not be needed for all E1, E2. For example, to go back to the ideal gas laws, it makes perfect sense that we can deduce temperature from the other three features. But what if E2 just fixed the volume - can we deduce the pressure from that?

With Q as a prior over E, we can, by getting the pressure and amount of substance from the prior. But many models don't include these priors, and there's no reason to avoid those.

So, in the more general case, instead of E⊂W, define E⊂2W×2W, so that, for all (E1,E2)∈E, the following probability is defined:

Q(E1∣E2).

To insure consistency, we can require Q to follow axioms similar to the two-valued probabilities appendix *IV in Popper's "Logic of Scientific Discovery".

In full generality, we might need an even more general or imperfect definition of Q. But I'll leave this aside for the moment, and assume the simpler case where Q is a distribution over E.

RefinementHere we'll look at how one can improve a model. Obviously, one can get a better Q, or a more expansive E, or a combination of these. Now, we haven't talked much about the quality of Q, and we'll leave this underdefined. Say that Q∗⪰Q means that Q∗ is 'at least as good as Q'. The 'at least as good' is specified by some mix of accuracy and simplicity.

More expansive E means that the environment of the improvement can be bigger. But in order for something to be "bigger", we need some identification between the two environments (which, so far, have just been defined as subsets of the powerset of feature values).

So, let M=(F,E,Q) and M∗=(F∗,E∗,Q∗) be models, let E∗0 be a subset of E∗, and let q be a surjective map from E∗0 to E (for an e∈E, think of q−1(e)⊂E∗0, the preimage of q, as the set of all environments in E∗ that correspond to e).

We can define Q∗0 on E in the following manner: if E1 and E2 are subsets of E, define

Q∗0(E1∣E2)=Q∗(q−1(E1)∣q−1(E2)).

Then q defines M∗ as a refinement of M if:

- Q∗0⪰Q.

Here are some examples of different types of refinements:

- Q-improvement: F=F∗, E=E∗, Q∗⪰Q (eg using the sine of the angle rather than the angle itself for refraction).
- Environment extension: F=F∗, E⊊E∗, E∗0=E with q the identity, Q∗=Q on E (eg moving from a training environment to a more extensive test environment).
- Natural extension: environment extension where Q is simply defined in terms of F on E, and this extends to Q∗ on E∗ (eg extending Newtonian mechanics from the Earth to the whole of the solar system).
- Non-independent feature extension: F⊊F∗. Let πF be the map that takes an element of W∗ and maps it to W by restricting[1] to features in F. Then πF=q on E∗0, and Q∗0=Q (eg adding electromagnetism to Newtonian mechanics).
- Independent feature extension: as a non-independent feature extension, but E∗0=E∗, and the stronger condition for Q∗ that Q(E1∣E2)=Q∗(q−1(E1)∣E∗2) for any E∗2 with q(E∗2)=E2 (eg non-colliding planets modelled without rotation, changing to modelling them with (mild) rotation).
- Feature refinement: F⊊F∗ (moving from the ideal gas models to the van der Waals equation).
- Feature splintering: when there is no single natural projection E∗→E that extends q (eg Blegg and Rube generalisation, happiness and human smile coming apart, inertial mass in general relativity projected to Newtonian mechanics...)
- Reward function splintering: no single natural extension of the reward function on E from E′=q−1(E) to all of E∗ (any situation where a reward function, seen as a feature, splinters).

Let M∗={F∗,E∗,Q∗} be a refinement of M={F,E,Q} (via q), and let R be a reward function defined on E.

A *refactoring* of R on M∗, is a reward function R∗ on E∗ such that for all e∗∈E∗, R(q(e∗))=R∗(e∗)). A **natural** refactoring is an extension of R is a refactoring that satisfies some naturalness or simplicity properties. For example, if R is the momentum of an object in M, and if momentum still makes sense in M∗, then this should be a natural refactoring.

If there does not exist a unique natural refactoring of R on M∗, then the refinement from M to M∗ splinters R.

Feature splinteringLet R be the indicator function for a feature being equal to some element or in some range. If R splinters in a refinement, then so does that feature.

Note that W∗ is the set of all functions from ¯¯¯¯¯F∗ to {True,False}. Since F⊂F∗, ¯¯¯¯¯F=⊔FF⊂⊔F∗F=¯¯¯¯¯F∗. Then we can project from W∗ to W by restricting a function to its values on ¯¯¯¯¯F. ↩︎

Discuss

### Model splintering: moving from one imperfect model to another

1. The big problem

In the last few months, I've become convinced that there is a key meta-issue in AI safety; a problem that seems to come up in all sorts of areas..mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml 
{line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: 
MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: 
local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main 
Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face 
{font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

It's hard to summarise, but my best phrasing would be:

- Many problems in AI safety seem to be variations of "this approach seems safe in this imperfect model, but when we generalise the model more, it becomes dangerously underdefined". Call this **model splintering**.
- It is intrinsically worth studying how to (safely) transition from one imperfect model to another. This is worth doing, independently of whatever "perfect" or "ideal" model might be in the background of the imperfect models.

This sprawling post will be presenting examples of model splintering, arguments for its importance, a formal setting allowing us to talk about it, and some uses we can put this setting to.

1.1 In the language of traditional ML

In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. These are the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.

Humans can often see that the algorithm is out-of-distribution and correct it, because we have a more general distribution in mind than the one the algorithm was trained on.

In these terms, the issues of this post can be phrased as:

- When the AI finds itself mildly out-of-distribution, how best can it extend its prior knowledge to the new situation?
- What should the AI do if it finds itself strongly out-of-distribution?
- What should the AI do if it finds itself strongly out-of-distribution, and humans don't know the correct distribution either?
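These distinctions can be made concrete with a toy sketch (the function names and the range-based scoring scheme are my own illustration, not anything from the post): score how far a new input's features fall outside the ranges seen in training.

```python
# Toy illustration: grade how far an input lies outside the feature
# ranges seen in training (the scoring scheme is invented for this sketch).

def train_ranges(training_set):
    """Record the min/max value seen for each feature during training."""
    keys = training_set[0].keys()
    return {k: (min(x[k] for x in training_set),
                max(x[k] for x in training_set)) for k in keys}

def ood_score(ranges, x):
    """0 = within training ranges; larger = further out-of-distribution."""
    score = 0.0
    for k, (lo, hi) in ranges.items():
        width = (hi - lo) or 1.0
        if x[k] < lo:
            score += (lo - x[k]) / width
        elif x[k] > hi:
            score += (x[k] - hi) / width
    return score

ranges = train_ranges([{"size": 1.0, "hue": 0.2},
                       {"size": 2.0, "hue": 0.4}])
print(ood_score(ranges, {"size": 1.5, "hue": 0.3}))  # 0.0: in distribution
print(ood_score(ranges, {"size": 3.0, "hue": 0.3}))  # 1.0: mildly out
```

Small scores are the "mildly out-of-distribution" cases where extrapolation may be fine; large scores are the cases where, as argued below, the model (and any reward defined on it) may splinter.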

Let's build a more general framework. Say that you start with some brilliant idea for AI safety/alignment/effectiveness. This idea is phrased in some (imperfect) model. Then "model splintering" happens when you or the AI move to a new (also imperfect) model, such that the brilliant idea is undermined or underdefined.

Here are a few examples:

- You design an AI CEO as a money maximiser. Given typical assumptions about the human world (legal systems, difficulties in one person achieving massive power, human fallibilities), this results in an AI that behaves like a human CEO. But when those assumptions fail, the AI can end up feeding the universe to a money-making process that produces nothing of any value.
- Eliezer defined "rubes" as smooth red cubes containing palladium that don't glow in the dark. "Bleggs", on the other hand, are furred blue eggs containing vanadium that glow in the dark. To classify these, we only need a model with two features, "rubes" and "bleggs". Then along comes a furred red egg containing vanadium that doesn't glow in the dark. The previous model doesn't know what to do with it, and if you get a model with more features, it's unclear what to do with this new object.
- Here are some moral principles from history: honour is important for anyone. Women should be protected. Increasing happiness is important. These moral principles made sense in the world in which they were articulated, where features like "honour", "gender", and "happiness" are relatively clear and unambiguous. But the world changed, and the models splintered. "Honour" became hopelessly confused centuries ago. Gender is currently finishing its long splintering (long before we got to today, gender started becoming less useful for classifying people, hence the consequences of gender splintered a long time before gender itself did). Happiness, or at least hedonic happiness, is still well defined, but we can clearly see how this is going to splinter when we talk about worlds of uploads or brain modification.
- Many transitions in the laws of physics - from the ideal gas laws to the more advanced van der Waals equations, or from Newtonian physics to general relativity to quantum gravity - will cause splintering if preferences were articulated in concepts that don't carry over well.

In all those cases, there are ways of improving the transition, without needing to go via some idealised, perfect model. We want to define the AI CEO's task in more generality, but we don't need to define this across every possible universe - that is not needed to restrain its behaviour. We need to distinguish any blegg from any rube we are likely to encounter; we don't need to define the platonic essence of "bleggness". For future splinterings - when hedonic happiness splinters, when we get a model of quantum gravity, etc. - we want to know what to do then and there, even if there are further splinterings subsequent to those.

And I think that model splintering is best addressed directly, rather than using methods that go via some idealised perfect model. Most approaches seem to go for approximating an ideal: from AIXI's set of all programs, the universal prior, KWIK ("Knowing what it knows") learning with a full hypothesis class, and Active Inverse Reward Design with its full space of "true" reward functions, to Q-learning, which assumes any Markov decision process is possible. Then the practical approaches rely on approximating this ideal.

Schematically, we can see M^∞ as the ideal, M^∞_i as M^∞ updated with information to time i, and M_i as an approximation of M^∞_i. Then we tend to focus on how well M_i approximates M^∞_i, and on how M^∞_i changes to M^∞_{i+1} - rather than on how M_i relates to M_{i+1}; that transition (the red arrow in the post's diagram) is underanalysed.

2 Why focus on the transition?

But why is focusing on the M_i → M_{i+1} transition important?

2.1 Humans reason like this

A lot has been written about image recognition programs going "out-of-distribution" (encountering situations beyond their training environment) or succumbing to "adversarial examples" (examples from one category that have the features of another). Indeed, some people have shown how to use labelled adversarial examples to improve image recognition.

You know what this reminds me of? Human moral reasoning. At various points in our lives, we humans seem to have pretty solid moral intuitions about how the world should be. And then, we typically learn more, realise that things don't fit in the categories we were used to (go "out-of-distribution") and have to update. Some people push stories at us that exploit some of our emotions in new, more ambiguous circumstances ("adversarial examples"). And philosophers use similarly-designed thought experiments to open up and clarify our moral intuitions.

Basically, we start with strong moral intuitions on under-defined features, and when the features splinter, we have to figure out what to do with our previous moral intuitions. A lot of developing moral meta-intuitions is about learning how to navigate these kinds of transitions; AIs need to be able to do so too.

2.2 There are no well-defined overarching moral principles

Moral realists and moral non-realists agree more than you'd think. In this situation, we can agree on one thing: there is no well-described system of morality that can be "simply" implemented in an AI.

To over-simplify, moral realists hope to discover this moral system, moral non-realists hope to construct one. But, currently, it doesn't exist in an implementable form, nor is there any implementable algorithm to discover/construct it. So the whole idea of approximating an ideal is wrong.

All humans seem to start from a partial list of moral rules of thumb, rules that they then have to extend to new situations. And most humans do seem to have some meta-rules for defining moral improvements, or extensions to new situations.

We don't know perfection, but we do know improvements and extensions. So methods that deal explicitly with that are useful. Those are things we can build on.

2.3 It helps distinguish areas where AIs fail from areas where humans are uncertain

Sometimes the AI goes out-of-distribution, and humans can see the error (no, flipping the lego block doesn't count as putting it on top of the other). There are also cases where humans themselves go out-of-distribution (see for example siren worlds).

It's useful to have methods available for both AIs and humans in these situations, and to distinguish them. "Genuine human preferences, not expressed in sufficient detail" is not the same as "human preferences fundamentally underdefined".

In the first case, the AI needs more human feedback; in the second case, it needs to figure out a way of resolving the ambiguity, knowing that soliciting feedback is not enough.

2.4 We don't need to make the problems harder

Suppose that quantum mechanics is the true underlying physics of the universe, with some added bits to include gravity. If that's true, why would we need a moral theory valid in every possible universe? It would be useful to have that, but it would be strictly harder than a theory valid only in the actual universe.

Also, some problems might be entirely avoided. We don't need to figure out the morality of dealing with a willing slave race - if we never encounter or build one in the first place.

So a few degrees of "extend this moral model in a reasonable way" might be sufficient, without needing to solve the whole problem. Or, at least, without needing to solve the whole problem in advance - a successful nanny AI might be built on these kinds of extensions.

2.5 We don't know how deep the rabbit hole goes

In a sort of converse to the previous point, what if the laws of physics are radically different from what we thought - what if, for example, they allow some forms of time-travel, or have some narrative features, or, more simply, what if the agent moves to an embedded agency model? What if hypercomputation is possible?

It's easy to have an idealised version of "all reality" that doesn't allow for these possibilities, so the ideal can be too restrictive, rather than too general. But the model splintering methods might still work, since they deal with transitions, not ideals.

Note that, **in retrospect**, we can always put this in a Bayesian framework, once we have a rich enough set of environments and update rules. But this is misleading: the key issue is the missing feature, and figuring out what to do with the missing feature is the real challenge. The fact that we could have done this in a Bayesian way *if we already knew that feature* is not relevant here.

Assume the blegg and rube classifier is an industrial robot performing a task. If humans filter out any atypical bleggs and rubes before it sees them, then the robot has no need for a full theory of bleggness/rubeness.

But what if the human filtering is not perfect? Then the classifier still doesn't need a full theory of bleggness/rubeness; it needs methods for dealing with the ambiguities it actually encounters.

Some ideas for AI control - low impact, AI-as-service, Oracles, ... - may require dealing with some model splintering, some ambiguity, but not the whole amount.

2.7 It points out when to be conservative

Some methods, like quantilizers or the pessimism approach, rely on the algorithm having a certain degree of conservatism. But, as I've argued, it's not clear to what extent these methods actually are conservative, nor is it easy to calibrate them in a useful way.

Model splintering situations provide excellent points at which to be conservative. Or, for algorithms that need human feedback, but not constantly, these are excellent points to ask for that feedback.

2.8 Difficulty in capturing splintering from the idealised perspective

Generally speaking, idealised methods can't capture model splintering at the point we would want them to. Imagine an ontological crisis, as we move from classical physics to quantum mechanics.

AIXI can go over the transition fine: it shifts from a Turing machine mimicking classical physics observations, to one mimicking quantum observations. But it doesn't notice anything special about the transition: changing the probability of various Turing machines is what it does with observations in general; there's nothing in its algorithm that shows that something unusual has occurred for this particular shift.

2.9 It may help amplification and distillation

This could be seen as a sub-point of the previous two sections, but it deserves to be flagged explicitly, since iterated amplification and distillation is one of the major potential routes to AI safety.

To quote a line from that summary post:

- The proposed AI design is to use a safe but slow way of scaling up an AI’s capabilities, distill this into a faster but slightly weaker AI, which can be scaled up safely again, and to iterate the process until we have a fast and powerful AI.

At both "scaling up an AI's capabilities" and "distill this into", we can ask the question: has the problem the AI is working on changed? The distillation step is more of a classical AI safety issue, as we wonder whether the distillation has caused any value drift. But at the scaling up or amplification step, we can ask: since the AI's capabilities have changed, the set of possible environments it operates in has changed as well. Has this caused a splintering where the previously safe goals of the AI have become dangerous?

Detecting and dealing with such a splintering could both be useful tools to add to this method.

2.10 Examples of model splintering problems/approaches

At a meta level, most problems in AI safety seem to be variants of model splintering, including:

- The hidden complexity of wishes.
- Ontological crises.
- Conservative/prudential behaviour in algorithms (more specifically, when the algorithm should become conservative).
- How categories are defined.
- The Goodhart problems.
- Out-of-distribution behaviour.
- Low impact and reduced side-effects approaches.
- Underdefined preferences.
- Active inverse reward design.
- Inductive ambiguity identification.
- Wireheading.
- The whole friendly AI problem itself.

I've been able to connect almost every recent post I've read in AI safety back to this central idea. Now, we have to be cautious - cure-alls cure nothing, after all - so it's not necessarily a positive sign that *everything* seems to fit into this framework.

Still, I think it's worth diving into this, especially as I've come up with a framework that seems promising for actually solving this issue in many cases.

In a similar concept-space is Abram's orthodox case against utility functions, where he talks about the Jeffrey-Bolker axioms, which allow the construction of preferences from events *without needing full worlds at all*.

This post is dedicated to explicitly modelling the transition to ambiguity, and then showing what we can gain from this explicit meta-modelling. It will do so with some formal language (made fully formal in this post), and a lot of examples.

Just as Scott argues that if it's worth doing, it's worth doing with made up statistics, I'd argue that if an idea is worth pursuing, it's worth pursuing with an attempted formalism.

Formalisms are great at illustrating the problems, clarifying ideas, and making us familiar with the intricacies of the overall concept. That's the reason that this post (and the accompanying technical post) will attempt to make the formalism reasonably rigorous. I've learnt a lot about this in the process of formalisation.

3.1 A model, in (almost) all generality

What do we mean by a model? Do we mean mathematical model theory? Are we talking about causal models, or causal graphs? AIXI uses a distribution over possible Turing machines, whereas Markov Decision Processes (MDPs) see states and actions updating stochastically, independently at each time-step. Unlike the previous two, Newtonian mechanics doesn't use time-steps but continuous time, while general relativity weaves time into the structure of space itself.

And what does it mean for a model to make "predictions"? AIXI and MDPs make predictions over future observations, and causal graphs are similar. We can also try running them in reverse, "predicting" past observations from current ones. Mathematical model theory talks about properties and the existence or non-existence of certain objects. Ideal gas laws make a "prediction" of certain properties (eg temperature) given certain others (eg volume, pressure, amount of substance). General relativity establishes that the structure of space-time must obey certain constraints.

It seems tricky to include all these models under the same meta-model formalism, but it would be good to do so. That's because of the risk of ontological crises: we want the AI to be able to continue functioning even if the initial model we gave it was incomplete or incorrect.

3.2 Meta-model: models, features, environments, probabilities

All of the models mentioned above share one common characteristic: once you know some facts, you can deduce some other facts (at least probabilistically). A prediction of the next time step, a retrodiction of the past, a deduction of some properties from others, or a constraint on the shape of the universe: all of these say that if we know some things, then this puts constraints on some other things.

So let's define F, informally, as the set of *features* of a model. This could be the gas pressure in a room, a set of past observations, the local curvature of space-time, the momentum of a particle, and so on.

So we can define a prediction as a probability distribution over a set of possible features F_1, given a base set of features F_2:

Q(F_1 ∣ F_2).

Do we need anything else? Yes, we need a set of possible environments for which the model is (somewhat) valid. Newtonian physics fails at extreme energies, speeds, or gravitational fields; we'd like to include this "domain of validity" in the model definition. This will be very useful for extending models, or transitioning from one model to another.

You might be tempted to define a set of "worlds" on which the model is valid. But we're trying to avoid that, as the "worlds" may not be very useful for understanding the model. Moreover, we don't have special access to the underlying reality; so we never know whether there actually is a Turing machine behind the world or not.

So define E, the environment on which the model is valid, *as a set of possible features*. So if we want to talk about Newtonian mechanics, F would be a set of Newtonian features (mass, velocity, distance, time, angular momentum, and so on) and E would be the set of these values where relativistic and quantum effects make little difference.

So we can see a model as

M = {F, E, Q},

for F a set of features, E a set of environments, and Q a probability distribution. This is such that, for E_1, E_2 ⊂ E, we have the conditional probability:

Q(E_1 ∣ E_2).

Though Q is defined for E, we generally want it to be usable from small subsets of the features: so Q should be simple to define from F. And we'll often define the subsets E_i in similar ways; so E_1 might be all environments with a certain angular momentum at time t = 0, while E_2 might be all environments with a certain angular momentum at a later time.

The full formal definition of these can be found here. The idea is to have a meta-model of modelling that is sufficiently general to apply to almost all models, but not one that relies on some ideal or perfect formalism.

3.3 Bayesian models within this meta-model

It's very easy to include Bayesian models within this formalism. If we have a Bayesian model that includes a set W of worlds with prior P, then we merely have to define a set of features F that is sufficient to distinguish all worlds in W: each world is uniquely defined by its feature values[1]. Then we can define E as W, and P on W becomes Q on E; the definition of terms like Q(E_1 ∣ E_2) is just P(E_1 ∩ E_2)/P(E_2), per Bayes' rules (unless P(E_2) = 0, in which case we set that to 0).
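As a minimal sketch of this embedding (the worlds and the prior below are invented for illustration), worlds can be feature tuples, E can be W, and the conditional probability is computed exactly as above:

```python
# Sketch: a Bayesian prior P over worlds, with each world a tuple of
# feature values, gives Q(E1|E2) = P(E1 ∩ E2) / P(E2) on environments.
# The worlds and probabilities are invented for illustration.

prior = {("red", "cube"): 0.45, ("blue", "egg"): 0.45,
         ("red", "egg"): 0.05, ("blue", "cube"): 0.05}

def P(event):
    """Probability of a set of worlds under the prior."""
    return sum(p for w, p in prior.items() if w in event)

def Q(e1, e2):
    """Q(E1|E2), set to 0 when P(E2) = 0, as in the text."""
    denom = P(e2)
    return P(e1 & e2) / denom if denom else 0.0

red = {w for w in prior if w[0] == "red"}
cube = {w for w in prior if w[1] == "cube"}
print(Q(cube, red))  # 0.45 / 0.5 = 0.9
```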

4 Model refinement and splinterings

This section will look at what we can do with the previous meta-model, looking at refinement (how models can improve) and splintering (how improvements to the model can make some well-defined concepts less well-defined).

4.1 Model refinement

Informally, M∗ = {F∗, E∗, Q∗} is a *refinement* of model M = {F, E, Q} if it's at least as expressive as M (it covers the same environments) and is better according to some criteria (simpler, or more accurate in practice, or some other measurement).

At the technical level, we have a map q from a subset E∗_0 of E∗ that is surjective onto E. This covers the "at least as expressive" part: every environment in E exists as (possibly multiple) environments in E∗.

Then note that, using q^{-1} as a map from subsets of E to subsets of E∗_0, we can define Q∗_0 on E via:

Q∗_0(E_1 ∣ E_2) = Q∗(q^{-1}(E_1) ∣ q^{-1}(E_2)).

Then this is a model refinement if Q∗_0 is 'at least as good as' Q on E, according to our criteria[2].
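A small worked sketch of this pullback (the refined environments, the projection q, and the probabilities are all invented for illustration):

```python
# Sketch: pull Q* back along q to get Q*_0 on the coarse environments,
# as in Q*_0(E1|E2) = Q*(q^-1(E1) | q^-1(E2)). All values are invented.

# q maps refined environments (colour, molar volume) to coarse ones (colour).
q = {("red", "small"): "red", ("red", "big"): "red",
     ("blue", "small"): "blue"}

dist_star = {("red", "small"): 0.2, ("red", "big"): 0.3,
             ("blue", "small"): 0.5}  # distribution on the refined model

def P_star(event):
    return sum(p for w, p in dist_star.items() if w in event)

def q_inv(coarse_event):
    """Preimage of a set of coarse environments under q."""
    return {w for w, c in q.items() if c in coarse_event}

def Q0(e1, e2):
    den = P_star(q_inv(e2))
    return P_star(q_inv(e1) & q_inv(e2)) / den if den else 0.0

print(Q0({"red"}, {"red", "blue"}))  # 0.5: total mass of "red"'s preimage
```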

4.2 Example of model refinement: gas laws

This post presents some subclasses of model refinement, including Q-improvements (same features, same environments, just a better Q), and adding new features to a basic model, called "non-independent feature extension" (eg adding classical electromagnetism to Newtonian mechanics).

Here's a specific gas law illustration. Let M = {F, E, Q} be a model of an ideal gas, in some set of rooms and tubes. The F consists of pressure, volume, temperature, and amount of substance, and Q is the ideal gas law. The E is the standard conditions for temperature and pressure, where the ideal gas law applies. There are multiple different types of gases in the world, but they all roughly obey the same laws.

Then compare with model M∗ = {F∗, E∗, Q∗}. The F∗ has all the features of F, but also includes the volume that is occupied by one mole of the molecules of the given substance. This allows Q∗ to express the more complicated van der Waals equations, which are different for different types of gases. The E∗ can now track situations where there are gases with different molar volumes, which include situations where the van der Waals equations differ significantly from the ideal gas law.

In this case E∗_0 ⊂ E∗, since we now distinguish environments that we previously considered identical (environments with the same features except for molar volumes). The q is just projecting down by forgetting the molar volume. Then since Q∗_0 = Q∗ (van der Waals equations averaged over the distribution of molar volumes) is at least as accurate as Q (the ideal gas law), this is a refinement.
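Numerically (using rough van der Waals constants for CO2; the figures are approximate), the two models nearly agree inside the ideal gas law's domain of validity and come apart outside it:

```python
# Ideal gas law versus van der Waals: the refined model needs the extra
# per-gas parameters a and b, which the coarse feature set cannot express.
R = 8.314  # gas constant, J/(mol·K)

def ideal_pressure(n, T, V):
    return n * R * T / V

def vdw_pressure(n, T, V, a, b):
    return n * R * T / (V - n * b) - a * (n / V) ** 2

a, b = 0.364, 4.27e-5          # approximate CO2 constants (SI units)
n, T = 1.0, 273.15             # one mole at 0 °C

# Near standard conditions (V ≈ 22.4 L) the two models agree to within ~1%:
print(ideal_pressure(n, T, 0.0224), vdw_pressure(n, T, 0.0224, a, b))

# Strongly compressed (V = 1 L), they diverge by over 10%:
print(ideal_pressure(n, T, 0.001), vdw_pressure(n, T, 0.001, a, b))
```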

4.3 Example of model refinement: rubes and bleggs

Let's reuse Eliezer's example of rubes ("red cubes") and bleggs ("blue eggs").

Bleggs are blue eggs that glow in the dark, have a furred surface, and are filled with vanadium. Rubes, in contrast, are red cubes that don't glow in the dark, have a smooth surface, and are filled with palladium.

Define M by having F = {red, smooth}, E the set of all bleggs and rubes in some situation, and Q relatively trivial: it predicts that an object is red/blue if and only if it is smooth/furred.

Define M_1 as a refinement of M, by expanding F to F_1 = {red, smooth, cube, dark}. The projection q: E_1 → E is given by forgetting about those two last features. The Q_1 is more detailed, as it now connects red-smooth-cube-dark together, and similarly for blue-furred-egg-glows.

Note that E_1 is larger than E, because it includes, e.g., environments where the cube objects are blue. However, all these extra environments have probability zero.

4.4 Reward function refactoring

Let R be a reward function on M (by which we mean that R is defined on F, the set of features in M), and M∗ a refinement of M.

A *refactoring* of R for M∗ is a reward function R∗ on the features F∗ such that for any e∗ ∈ E∗_0, R∗(e∗) = R(q(e∗)).

For example, let M and M_1 be the rube/blegg models from the previous section. Let R_red on M simply count the number of rubes - or, more precisely, count the number of objects to which the feature "red" applies.

Let R^1_red be the reward function that counts the number of objects in M_1 to which "red" applies. It's clearly a refactoring of R_red.

But so is R^1_smooth, the reward function that counts the number of objects in M_1 to which "smooth" applies. In fact, the following is a refactoring of R_red, for all α + β + γ + δ = 1:

α R^1_red + β R^1_smooth + γ R^1_cube + δ R^1_dark.

There are also some non-linear combinations of these features that refactor R_red, and many other variants (like the strange combinations that generate concepts like grue and bleen).
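The linear refactoring claim is easy to verify directly (the objects and the weight choices below are invented for illustration): on a training environment where the four features co-occur perfectly, any weighted mixture summing to 1 equals R_red.

```python
# On a training environment where red/smooth/cube/dark always co-occur,
# any mixture with weights summing to 1 refactors R_red exactly.
# The objects are invented for illustration.

training_env = [
    {"red": 1, "smooth": 1, "cube": 1, "dark": 1},   # a rube
    {"red": 0, "smooth": 0, "cube": 0, "dark": 0},   # a blegg
    {"red": 1, "smooth": 1, "cube": 1, "dark": 1},   # another rube
]

def R_red(env):
    return sum(o["red"] for o in env)

def R_mix(env, alpha, beta, gamma, delta):
    return sum(alpha * o["red"] + beta * o["smooth"]
               + gamma * o["cube"] + delta * o["dark"] for o in env)

print(R_red(training_env))                           # 2
print(R_mix(training_env, 0.25, 0.25, 0.25, 0.25))   # 2.0
print(R_mix(training_env, 1.0, -1.0, 0.5, 0.5))      # 2.0: weights sum to 1
```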

4.5 Reward function splintering

Model splintering, in the informal sense, is what happens when we pass to a new model in a way that the old features (or a reward function defined by the old features) no longer apply. It is similar to the web of connotations breaking down, an agent going out of distribution, or the definitions of rube and blegg falling apart.

- Preliminary definition: If M∗ is a refinement of M and R a reward function on M, then M∗ *splinters* R if there are multiple refactorings of R on M∗ that disagree on elements of E∗ of non-zero probability.

So, note that in the rube/blegg example, M_1 is **not** a splintering of R_red: all the refactorings are the same on all bleggs and rubes - hence on all elements of E_1 of non-zero probability.

We can even generalise this a bit. Let's assume that "red" and "blue" are not totally uniform; there exist some rubes that are "redish-purple", while some bleggs are "blueish-purple". Then let M_2 be like M_1, except that the colour feature can have four values: "red", "redish-purple", "blueish-purple", and "blue".

Then, as long as the rubes (defined, in this instance, by being smooth-dark-cubes) are either "red" or "redish-purple", and the bleggs are "blue" or "blueish-purple", then all refactorings of R_red to M_2 agree - because, on the test environment, R_red on F perfectly matches up with R^2_red + R^2_redish-purple on F_2.

So adding more features does not always cause splintering.

4.6 Reward function splintering: "natural" refactorings

The preliminary definition runs into trouble when we add more objects to the environments. Define M_3 as being the same as M_2, except that E_3 contains one extra object, o+; apart from that, the environments typically have a billion rubes and a trillion bleggs.

Suppose o+ is a "furred rube", i.e. a red-furred-dark-cube. Then R^3_red and R^3_smooth are two different refactorings of R_red that obviously disagree on any environment that contains o+. Even if the probability of o+ is tiny (but non-zero), M_3 splinters R_red.

But things are worse than that. Suppose that o+ is fully a rube: red-smooth-cube-dark, and even contains palladium. Define (R^3_red)′ as counting the number of red objects, except for o+ specifically (again, this is similar to the grue and bleen arguments against induction).

Then both (R^3_red)′ and R^3_red are refactorings of R_red, so M_3 still splinters R_red, even when we add another exact copy of the elements in the training set. Or even if we keep the training set for a few extra seconds, or add any change to the world.
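The splintering above takes only a few lines to exhibit (the objects are invented for illustration): two refactorings of R_red agree on every training object, then disagree as soon as the furred rube o+ appears.

```python
# Two refactorings of R_red that agree in training but disagree on the
# "furred rube" o_plus - so the refined model splinters R_red.
# The objects are invented for illustration.

training = [{"red": 1, "smooth": 1},    # a rube
            {"red": 0, "smooth": 0}]    # a blegg
o_plus = {"red": 1, "smooth": 0}        # red but furred

def R_red(env):
    return sum(o["red"] for o in env)

def R_smooth(env):
    return sum(o["smooth"] for o in env)

print(R_red(training) == R_smooth(training))   # True: identical in training
print(R_red(training + [o_plus]))              # 2
print(R_smooth(training + [o_plus]))           # 1: the refactorings disagree
```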

So, for any M∗ a refinement of M, and R a reward function on E, let's define "natural refactorings" of R:

- The reward function R∗ is a natural refactoring of R if it's a reward function on M∗ with:
  - R∗ ≈ R∘q on E∗_0,
  - R∗ can be defined simply from F∗ and R, and
  - the F∗ themselves are simply defined.

This leads to a full definition of splintering:

- Full definition: If M∗ is a refinement of M and R a reward function on M, then M∗ *splinters* R if 1) there are no natural refactorings of R on M∗, or 2) there are multiple natural refactorings R∗ and R∗′ of R on M∗ such that R∗ ≉ R∗′.

Notice the whole host of caveats and weaselly terms here: R∗ ≈ R∘q, "simply" (used twice), and R∗ ≉ R∗′. "Simply" might mean algorithmic simplicity, but ≈ and ≉ are measures of how much "error" we are willing to accept in these refactorings. Given that, we probably want to replace ≈ and ≉ with some *measure* of non-equality, so we can talk about the "degree of naturalness" or the "degree of splintering" of some refinement and reward function.

Note also that:

**Different choices of refinements can result in different natural refactorings.**

An easy example: it makes a big difference whether a new feature is "temperature", or "divergence from standard temperatures".

4.7 Splintering training rewards

The concept of "reward refactoring" is transitive, but the concept of "natural reward refactoring" need not be.

For example, let E_t be a training environment where red/blue ⟺ cube/egg, and E_g be a general environment where red/blue is independent of cube/egg. Let F_1 be a feature set with only red/blue, and F_2 a feature set with red/blue and cube/egg.

Then define M^1_t as using F_1 in the training environment, and M^2_g as using F_2 in the general environment; M^1_g and M^2_t are defined similarly.

For these models, M^1_g and M^2_t are both refinements of M^1_t, while M^2_g is a refinement of all three other models. Define R^1_t as the "count red objects" reward on M^1_t. This has a natural refactoring to R^1_g on M^1_g, which counts red objects in the general environment.

And R^1_g has a natural refactoring to R^2_g on M^2_g, which still just counts the red objects in the general environment.

But there is no natural refactoring from R^1_t directly to M^2_g. That's because, from F_2's perspective, R^1_t on M^1_t might be counting red objects, or might be counting cubes. This is not true for R^1_g on M^1_g, which is clearly only counting red objects.

Thus when a reward function comes from a training environment, we'd want our AI to look for splinterings **directly from a model of the training environment**, rather than from previous natural refactorings.

4.8 Splintering features and models

We can also talk about splintering features and models themselves. For M = {F, E, Q}, the easiest way is to define a reward function R_{F,S_F} as the indicator function for the feature F ∈ F being in the set S_F.

Then a refinement M∗ splinters the feature F if it splinters some R_{F,S_F}.

The refinement M∗ splinters the model M if it splinters at least one of its features.

For example, if M is Newtonian mechanics, including "total rest mass" and M∗ is special relativity, then M∗ will splinter "total rest mass". Other examples of feature splintering will be presented in the rest of this post.

4.9 Preserved background features

A reward function developed in some training environment will ignore any feature that is always present or always absent in that environment. This allows very weird situations to come up, such as training an AI to distinguish happy humans from sad humans, and it ending up replacing humans with humanoid robots (after all, both happy and sad humans were equally non-robotic, so there's no reason not to do this).

Let's try and do better than that. Assume we have a model M = {F, E, Q}, with a reward function R_τ defined on E (R_τ and E can be seen as the training data).

Then the feature-preserving reward function R^M is a function that constrains the environments to have similar feature distributions as E and Q. There are many ways this could be defined; here's one.

For an element e ∈ E, just define

R^M(e) = log(Q(e)).

Obviously, this can be improved; we might want to coarse-grain F, grouping together similar worlds, and possibly bounding this below to avoid singularities.
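As a toy sketch of this term (the environment probabilities are invented for illustration), an environment that is wildly atypical of the training distribution picks up a large penalty even if the original reward would score it highly:

```python
# The background-preservation term R_M(e) = log Q(e): environments that
# are improbable under the training model get large negative values.
# The probabilities below are invented for illustration.
import math

Q_model = {"typical_human": 0.9, "odd_human": 0.099, "robot_face": 0.001}

def R_M(e):
    """Log-probability of the environment under the training model."""
    return math.log(Q_model[e])

# A happiness reward might score a smiling robot face highly, but R_M
# flags it as a huge background shift away from "being a human at all":
print(R_M("typical_human"))   # small penalty
print(R_M("robot_face"))      # much larger penalty
```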

Then we can use this to get the feature-preserving version of R_τ, which we can define as

R^M_τ = (max R_τ − R_τ) ⋅ R^M,

for max R_τ the maximal value of R_τ on E. Other options can work as well, such as R_τ + α R^M for some constant α > 0.

Then we can ask an AI to use R^M_τ as its reward function, refactoring that rather than R_τ.

- A way of looking at it: a natural refactoring of a reward function R_τ will preserve all the implicit features that correlate with R_τ. But R^M_τ will also preserve all the implicit features that stayed constant when R_τ was defined. So if R_τ measures human happiness vs human unhappiness, a natural refactoring of it will preserve things like "having higher dopamine in their brain". But a natural refactoring of R^M_τ will also preserve things like "having a brain".

The R^M_τ is almost certainly too restrictive to be of use. For example, if time is a feature, then this will fall apart when the AI has to do something after the training period. If all the humans in a training set share certain features, humans without those features will be penalised.

There are at least two things we can do to improve this. The first is to include more positive and negative examples in the training set; for example, if we include humans and robots in our training set - as positive and negative examples, respectively - then this difference will show up in Rτ directly, so we won't need to use RMτ too much.

Another approach would be to explicitly allow certain features to range beyond their typical values in M, or allow highly correlated variables explicitly to decorrelate.

For example, if training occurs during a time period t to t′, we could explicitly allow time to range beyond these values, without penalty. Similarly, if a medical AI was trained on examples of typical healthy humans, we could decorrelate functioning digestion from brain activity, and get the AI to focus on the second[3].

This has to be done with some care, as adding more degrees of freedom adds more ways for errors to happen. I'm aiming to look further at this issue in later posts.

**5 The fundamental questions of model refinements and splintering**

We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:

- When the AI refines its model, what would count as a natural refactoring of its reward function?
- If the refinements splinter its reward function, what should the AI do?
- If the refinements splinter its reward function, and also splinter the human's reward function, what should the AI do?

The rest of this post applies this basic framework, and its basic insights, to various common AI safety problems and analyses. This section is not particularly structured, and will range widely (and wildly) across a variety of issues.

**6.1 Extending beyond the training distribution**

Let's go back to the blegg and rube examples. A human supervises an AI in a training environment, labelling all the rubes and bleggs for it.

The human is using a very simple model, MH={FH,Et,Q}, with the only feature being the colour of the object, and Et being the training environment.

Meanwhile the AI, having more observational abilities and no filter as to what can be ignored, notices their colour, their shape, their luminance, and their texture. It doesn't know MH, but is using model M1AI={F1,E1t,Q1}, where F1 covers those four features (note that M1AI is a refinement of MH, but that isn't relevant here).

Suppose that the AI is trained to be a rube-classifier (and hence a blegg-classifier by default). Let RF be the reward function that counts the number of objects, with feature F, that the AI has classified as rubes. Then the AI could learn many different reward functions in the training environment; here's one:

R1=R1cube+0.5R1smooth+0.5R1dark−R1red.

Note that, even though this gets the colour reward completely wrong, this reward matches up with the human's assessment on the training environment.

Now the AI moves to the larger testing environment E2, and refines its model minimally to M2AI={F1,E2,Q1} (extending R1 to R2 in the obvious way).

In E2, the AI sometimes encounters objects that it can only see through their colour. Will this be a problem, since the colour component of R2 is pointing in the wrong direction?

No. It still has Q1, and can deduce that a red object must be cube-smooth-dark, so R2 will continue treating this as a rube[4].
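A sketch of that deduction (the joint distribution and its numbers are hypothetical): if only colour is observed, the AI can take the expectation of R1 over Q1 conditioned on that colour, and the miscalibrated colour weight washes out:

```python
# Hypothetical joint distribution Q1 learned in training: colour perfectly
# predicts the other three features.
# Feature order: (shape, texture, luminance, colour)
Q1 = {
    ("cube", "smooth", "dark", "red"): 0.5,     # rubes
    ("egg", "furred", "glowing", "blue"): 0.5,  # bleggs
}

def R1(obj):
    # The reward R1 = R_cube + 0.5*R_smooth + 0.5*R_dark - R_red from above.
    shape, texture, luminance, colour = obj
    return ((1.0 if shape == "cube" else 0.0)
            + 0.5 * (1.0 if texture == "smooth" else 0.0)
            + 0.5 * (1.0 if luminance == "dark" else 0.0)
            - (1.0 if colour == "red" else 0.0))

def expected_R1_given_colour(colour):
    # Condition Q1 on the observed colour, then take the expected reward.
    matching = {obj: p for obj, p in Q1.items() if obj[3] == colour}
    total = sum(matching.values())
    return sum(p * R1(obj) for obj, p in matching.items()) / total
```

A red object scores 1 + 0.5 + 0.5 - 1 = 1 in expectation (it is deduced to be cube-smooth-dark), versus 0 for a blue one, so the reward still treats colour-only red objects as rubes despite the wrong sign on colour.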

**6.2 Detecting going out-of-distribution**

Now imagine the AI learns about the content of the rubes and bleggs, and so refines to a new model M3AI that includes vanadium/palladium as a feature.

Furthermore, in the training environment, all rubes have palladium and all bleggs have vanadium in them. So, for M3AI a refinement of M1AI, q−1(E1AI)⊂E3AI has only palladium-rubes and vanadium-bleggs. But in E3AI, the full environment, there are rather a lot of rubes with vanadium and bleggs with palladium.

So, similarly to section 4.7, there is no natural refactoring of the rube/blegg reward in M1AI to M3AI. That's because F3AI, the feature set of M3AI, includes vanadium/palladium, which co-vary with the other rube/blegg features on the training environment q−1(E1AI), but not on the full environment E3AI.

So looking for reward splintering from the training environment is a way of detecting going out-of-distribution - even on features that were not initially detected in the training distribution, by either the human or the AI.
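One crude way to implement this detection (the data here is hypothetical): record which feature values always co-occurred in training, and flag a potential splintering when deployment breaks an association:

```python
def learned_association(samples, i, j):
    """Map each value of feature i to the set of feature-j values seen with it."""
    assoc = {}
    for s in samples:
        assoc.setdefault(s[i], set()).add(s[j])
    return assoc

def splinters(train, deploy, i, j):
    """True if some feature-i value maps to new feature-j values in deployment."""
    train_assoc = learned_association(train, i, j)
    deploy_assoc = learned_association(deploy, i, j)
    return any(deploy_assoc.get(value, set()) - seen
               for value, seen in train_assoc.items())

# Training: colour (index 0) perfectly determines metal content (index 1).
train = [("red", "palladium"), ("blue", "vanadium")] * 10
in_distribution = [("red", "palladium"), ("blue", "vanadium")]
out_of_distribution = [("red", "vanadium"), ("blue", "palladium")]
```

Here `splinters(train, out_of_distribution, 0, 1)` is true: vanadium-rubes break the training-time correlation, even though the metal content was never part of the reward signal.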

**6.3 Asking humans and Active IRL**

Some of the most promising AI safety methods today rely on getting human feedback[5]. Since human feedback is expensive (slow and hard to get, compared with almost all other aspects of algorithms), people want to get this feedback in the most efficient ways possible.

A good way of doing this would be to ask for feedback when the AI's current reward function splinters, and multiple options are possible.

A more rigorous analysis would look at the value of information, expected future splinterings, and so on. This is what they do in Active Inverse Reinforcement Learning; the main difference is that AIRL emphasises an unknown reward function with humans providing information, while this approach sees it more as a known reward function over uncertain features (or over features that may splinter in general environments).

**6.4 A time for conservatism**

I argued that many "conservative" AI optimising approaches, such as quantilizers and pessimistic AIs, don't have a good measure of when to become more conservative; their parameters q and β don't encode useful guidelines for the right degree of conservatism.

In this framework, the alternative is obvious: AIs should become conservative when their reward functions splinter (meaning that the reward function compatible with the previous environment has multiple natural refactorings), and very conservative when they splinter a lot.

This design is very similar to Inverse Reward Design. In that situation, the reward signal in the training environment is taken as *information* about the "true" reward function. Basically they take all reward functions that could have given the specific reward signals, and assume the "true" reward function is one of them. In that paper, they advocate extreme conservatism at that point, by optimising the minimum of all possible reward functions.

The idea here is almost the same, though with more emphasis on "having a true reward defined on uncertain features". Having multiple contradictory reward functions compatible with the information, in the general environment, is equivalent to having a lot of splintering of the training reward function.
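A minimal sketch of that conservatism (the reward functions and objects below are hypothetical): keep every reward function compatible with the training signal, measure splintering by how much they disagree, and score options pessimistically, IRD-style:

```python
def compatible(candidates, observed):
    # Keep reward functions that reproduce the observed training signal.
    return [R for R in candidates
            if all(abs(R(e) - r) < 1e-9 for e, r in observed.items())]

def conservative_score(option, rewards):
    # Pessimistic evaluation: worst case over compatible reward functions.
    return min(R(option) for R in rewards)

def splintering(option, rewards):
    # Disagreement among compatible rewards: a cue to get more conservative.
    values = [R(option) for R in rewards]
    return max(values) - min(values)

# Two hypothetical reward functions that agree on the training data:
R_colour = lambda obj: 1.0 if "red" in obj else 0.0
R_metal = lambda obj: 1.0 if "palladium" in obj else 0.0
observed = {"red-palladium-cube": 1.0, "blue-vanadium-egg": 0.0}
rewards = compatible([R_colour, R_metal], observed)
```

On a novel red-vanadium object the compatible rewards disagree (splintering of 1.0), so the conservative score drops to 0 and the AI should tread carefully.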

**6.5 Avoiding ambiguous distant situations**

The post "By default, avoid ambiguous distant situations" can be rephrased as: let M be a model in which we have a clear reward function R, and let M2 be a refinement of this to general situations. We expect that this refinement splinters R. Let M1 be like M2, except with E1 smaller than E2, defined such that:

- An AI could be expected to be able to constrain the world to be in E1, with high probability,
- The refinement to M1 does not splinter R.

Then that post can be summarised as:

- The AI should constrain the world to be in E1 and then maximise the natural refactoring of R in M1.

Stuart Russell writes:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

The approach in sections 4.9 and 4.10 explicitly deals with this.

**6.7 Hidden (dis)agreement and interpretability**

Now consider two agents doing a rube/blegg classification task in the training environment; each agent only models two of the features:

Despite not having a single feature in common, both agents will agree on what bleggs and rubes are, in the training environment. And when refining to a fuller model that includes all four (or five) of the key features, both agents will agree as to whether a natural refactoring is possible or not.

This can be used to help define the limits of interpretability. The AI can use its own model, and its own designed features, to define the categories and rewards in the training environment. These need not be human-parsable, but we can attempt to interpret them in human terms. And then we can give this interpretation to the AI, as a list of positive and negative examples of our interpretation.

If we do this well, the AI's own features and our interpretation will match up in the training environment. But as we move to more general environments, these may diverge. Then the AI will flag a "failure of interpretation" when its refactoring diverges from a refactoring of our interpretation.

For example, if we think the AI detects pandas by looking for white hair on the body, and black hair on the arms, we can flag lots of examples of pandas and that hair pattern (and non-pandas and unusual hair patterns). We don't use these examples for training the AI, just to confirm that, in the training environment, there is a match between "AI-thinks-they-are-pandas" and "white-hair-on-body-black-hair-on-arms".

But, in an adversarial example, the AI could detect that, while it is detecting gibbons, this no longer matches up with our interpretation. A splintering of interpretations, if you want.

**6.8 Wireheading**

The approach can also be used to detect wireheading. Imagine that the AI has various detectors that allow it to label what the features of the bleggs and rubes are. It models the world with ten features: five representing the "real world" versions of the features, and five representing the "this signal comes from my detector" versions.

This gives a total of 10 features, the 5 features "in the real world" and the 5 "AI-labelled" versions of these:

In the training environment, there was full overlap between these 10 features, so the AI might learn the incorrect "maximise my labels/detector signal" reward.

However, when it refines its model to all 10 features *and* environments where labels and underlying reality diverge, it will realise that this splinters the reward, and thus detect a possible wireheading. It could then ask for more information, or have an automated "don't wirehead" approach.
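A sketch of that check (the paired-sample representation is my own assumption): compare the "real world" features with their detector-labelled counterparts, which always coincided in training, and warn when they start to diverge:

```python
def divergence_rate(samples):
    """Each sample pairs real-world features with detector-labelled versions.
    In training these always coincided; divergence later is a wireheading cue."""
    diverging = sum(1 for real, labelled in samples if real != labelled)
    return diverging / len(samples)

def wireheading_warning(samples, tolerance=0.0):
    # Flag any divergence beyond what we are willing to tolerate.
    return divergence_rate(samples) > tolerance

training = [(("red", "cube"), ("red", "cube"))] * 100
# Deployment where the AI has tampered with one detector (hypothetical):
deployment = ([(("blue", "egg"), ("red", "cube"))] * 5
              + [(("red", "cube"), ("red", "cube"))] * 95)
```

`wireheading_warning(deployment)` fires here because 5% of detector labels no longer match underlying reality, while the training data passes cleanly.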

**6.9 Learning in virtual environments**

To get around the slowness of the real world, some approaches train AIs in virtual environments. The problem is to pass that learning from the virtual environment to the real one.

Some have suggested making the virtual environment sufficiently detailed that the AI can't tell the difference between it and the real world. But, a) this involves fooling the AI, an approach I'm always wary of, and b) it's unnecessary.

Within the meta-formalism of this post, we could train the AI in a virtual environment which it models by M, and let it construct a model M′ of the real-world. We would then motivate the AI to find the "closest match" between M and M′, in terms of features and how they connect and vary. This is similar to how we can train pilots in flight simulators; the pilots are never under any illusion as to whether this is the real world or not, and even crude simulators can allow them to build certain skills[6].

This can also be used to allow the AI to deduce information from hypotheticals and thought experiments. If we show the AI an episode of a TV series showing people behaving morally (or immorally), then the episode need not be believable or plausible, if we can roughly point to the features in the episode that we want to emphasise, and roughly how these relate to real-world features.

**6.10 Defining how to deal with multiple plausible refactorings**

The approach for synthesising human preferences, defined here, can be rephrased as:

- "Given that we expect multiple natural refactorings of human preferences, and given that we expect some of them to go disastrously wrong, here is one way of resolving the splintering that we expect to be better than most."

This is just one way of doing this, but it does show that "automating what AIs do with multiple refactorings" might not be impossible. The following subsection has some ideas for how to deal with that.

**6.11 Global, large scale preferences**

In an old post, I talked about the concept of "emergency learning", which was basically, "lots of examples, and all the stuff we know and suspect about how AIs can go wrong, shove it all in, and hope for the best". The "shove it all in" was a bit more structured than that, defining large scale preferences (like "avoid siren worlds" and "don't over-optimise") as constraints to be added to the learning process.

It seems we can do better than that here. Using examples and hypotheticals, we could construct ideas like "avoid slavery", "avoid siren worlds", or "don't over-optimise" as rewards or positive/negative examples in certain simple training environments, so that the AI "gets an idea of what we want".

We can then label these ideas as "global preferences". The idea is that they start as loose requirements (we have much more granular human-scale preferences than just "avoid slavery", for example), but, the more the world diverges from the training environment, the stricter they are to be interpreted, with the AI required to respect some softmin of all natural refactorings of these features.

In a sense, we'd be saying "prevent slavery; these are the features of slavery, and in weird worlds, be especially wary of these features".
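One possible reading of that softmin (the functional form and the strictness schedule are my assumptions, not the post's): weight each refactoring's score so that, as divergence from the training environment grows, the combination approaches a hard minimum:

```python
import math

def softmin(values, beta):
    # beta -> 0 recovers the mean; beta -> infinity recovers the hard minimum.
    weights = [math.exp(-beta * v) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def global_preference_score(refactoring_scores, divergence):
    # Hypothetical schedule: strictness grows with divergence from training.
    return softmin(refactoring_scores, beta=1.0 + 10.0 * divergence)
```

With scores [0.0, 1.0] from two refactorings of, say, "avoid slavery", a world close to training (divergence 0) blends them, while a weird world (divergence 1) scores close to the worst case, matching the "stricter in weird worlds" intent.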

**6.12 Avoiding side-effects**

Krakovna et al. presented a paper on avoiding side-effects from AI. The idea is to have an AI maximising some reward function, while reducing side effects. So the AI would not smash vases or let them break, nor would it prevent humans from eating sushi.

In this environment, we want the AI to avoid knocking the sushi off the belt as it moves:

Here, in contrast, we'd want the AI to remove the vase from the belt before it smashes:

I pointed out some issues with the whole approach. Those issues were phrased in terms of sub-agents, but my real intuition is that syntactic methods are not sufficient to control side effects. In other words, the AI can't learn to do the right thing with sushi and vases, unless it has some idea of what these objects mean to us: we prefer sushi to be eaten and vases not to be smashed.

This can be learnt if the AI has enough training examples, learning that eating sushi is a general feature of the environments it operates in, while vases being smashed is not. I'll return to this idea in a later post.

**6.13 Cancer patients**

The ideas of this post were present in implicit form in the idea of training an AI to cure cancer patients.

Using examples of successfully treated cancer patients, we noted they all shared some positive features (recuperating, living longer) and some incidental or negative features (complaining about pain, paying more taxes).

So, using the approach of section 4.9, we can designate that we want the AI to cure cancer; this will be interpreted as increasing all the features that correlate with that.

Using the explicit decorrelation of section 4.10, we can also explicitly remove the negative options from the desired feature sets, thus improving the outcomes even more.

**6.14 The genie and the burning mother**

In Eliezer's original post on the hidden complexity of wishes, he talks of the challenge of getting a genie to save your mother from a burning building:

So you hold up a photo of your mother's head and shoulders; match on the photo; use object contiguity to select your mother's whole body (not just her head and shoulders); and define the future function using your mother's distance from the building's center. [...]

You cry "Get my mother out of the building!", for luck, and press Enter. [...]

BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.

How could we avoid this? What you want is your mother out of the building. The feature "mother in building" must absolutely be set to false; this is a priority call, overriding almost everything else.

Here we'd want to load examples of your mother outside the building, so that the genie/AI learns the features "mother in house"/"mother out of house". Then it will note that "mother out of house" correlates with a whole lot of other features - like mother being alive, breathing, pain-free, often awake, and so on.

All those are good things. But there are some other features that don't correlate so well - such as the time being earlier, your mother not remembering a fire, not being covered in soot, not worried about her burning house, and so on.

As in the cancer patient example above, we'd want to preserve the features that correlate with the mother out of the house, while allowing decorrelation with the features we don't care about or don't want to preserve.

**6.15 Splintering morally relevant categories: honour, gender, and happiness**

If the Antikythera mechanism had been combined with the Aeolipile to produce an ancient Greek AI, and Homer had programmed it (among other things) to "increase people's honour", how badly would things have gone?

If Babbage had completed the analytical engine as Victorian AI, and programmed it (among other things) to "protect women", how badly would things have gone?

If a modern programmer were to combine our neural nets into a superintelligence and program it (among other things) to "increase human happiness", how badly will things go?

There are three morally relevant categories here, and it's illustrative to compare them: honour, gender, and hedonic happiness. The first has splintered, the second is splintering, and the third will likely splinter in the future.

I'm not providing solutions in this subsection, just looking at where the problems can appear, and encouraging people to think about how they would have advised Homer or Babbage to define their concepts. Don't think "stop using your concepts, use ours instead", because our concepts/features will splinter too. Think "what's the best way they could have extended their preferences even as the features splinter?"

**6.15.1 Honour**

If we look at the concept of honour, we see a concept that has already splintered.

That article reads like a meandering mess. Honour is "face", "reputation", a "bond between an individual and a society", "reciprocity", a "code of conduct", "chastity" (or "virginity"), a "right to precedence", "nobility of soul, magnanimity, and a scorn of meanness", "virtuous conduct and personal integrity", "vengeance", "credibility", and so on.

What a basket of concepts! They only seem vaguely connected together; and even places with strong honour cultures differ in how they conceive of honour, from place to place and from epoch to epoch[7]. And yet, if you asked most people within those cultures about what honour was, they would have had a strong feeling it was a single, well defined thing, maybe even a concrete object.

**6.15.2 Gender**

In his post "The Categories Were Made for Man, Not Man for the Categories", Scott writes:

Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.

But Scott is writing this in the 21st century, long after the gender definition has splintered quite a bit. In middle-class Victorian England[8], the gender divide was much stronger - in that, from one component of the divide, you could predict a lot more. For example, if you knew someone wore dresses in public, you knew that, almost certainly, they couldn't own property if they were married, nor could they vote, they would be expected to be in charge of the household, might be allowed to faint, and were expected to guard their virginity.

We talk nowadays about gender roles multiplying or being harder to define, but they've actually been splintering for a lot longer than that. Even though we could *define* two genders in 1960s Britain, at least roughly, that definition was a lot less informative than it was in Victorian-middle-class-Britain times: it had many fewer features strongly correlated with it.

**6.15.3 Happiness**

On to happiness! Philosophers and others have been talking about happiness for centuries, often contrasting "true happiness", or flourishing, with hedonism, or drugged out stupor, or things of that nature. Often "true happiness" is a life of duty to what the philosopher wants to happen, but at least there is some analysis, some breakdown of the "happiness" feature into smaller component parts.

Why did the philosophers do this? I'd wager that it's because the concept of happiness was already somewhat splintered (as compared with a model where "happiness" is a single thing). Those philosophers had experience of joy, pleasure, the satisfaction of a job well done, connection with others, as well as superficial highs from temporary feelings. When they sat down to systematise "happiness", they could draw on the features of their own mental model. So even if people hadn't systematised happiness themselves, when they heard of what philosophers were doing, they probably didn't react as "What? Drunken hedonism and intellectual joy are not the same thing? How dare you say such a thing!"

But looking into the future, into a world that an AI might create, we can foresee many situations where the implicit assumptions of happiness come apart, and only some remain. I say "we can foresee", but it's actually very hard to know exactly how that's going to happen; if we knew it exactly, we could solve the issues now.

So, imagine a happy person. What do you think they have in life that isn't a trivial synonym of happiness? I'd imagine they have friends, are healthy, think interesting thoughts, have some freedom of action, may work on worthwhile tasks, may be connected with their community, probably make people around them happy as well. Getting a bit less anthropomorphic, I'd also expect them to be a carbon-based life-form, to have a reasonable mix of hormones in their brain, to have a continuity of experience, to have a sense of identity, to have a personality, and so on.

Now, some of those features can clearly be separated from "happiness". Even ahead of time, I can confidently say that "being a carbon-based life-form" is not going to be a critical feature of "happiness". But many of the other ones are not so clear; for example, would someone without continuity of experience or a sense of identity be "happy"?

Of course, I can't answer that question. Because the question has no answer. We have our current model of happiness, which co-varies with all those features I listed and many others I haven't yet thought of. As we move into more and more bizarre worlds, that model will splinter. And whether we assign the different features to "happiness" or to some other concept, is a choice we'll make, not a well-defined solution to a well-defined problem.

However, even at this stage, some answers are clearly better than others; statues of happy people should not count, for example, nor should written stories describing very happy people.

**6.16 Apprenticeship learning**

In apprenticeship learning (or learning from demonstration), the AI would aim to copy what experts have done. Inverse reinforcement learning can be used for this purpose, by guessing the expert's reward function, based on their demonstrations. It looks for key features in expert trajectories and attempts to reproduce them.

So, if we had an automatic car driving people to the airport, and fed it some trajectories (maybe ranked by speed of delivery), it would notice that passengers would also arrive alive, with their bags, without being pursued by the police, and so on. This is akin to section 4.9; the car would not accelerate blindly to get there as fast as possible.

But the algorithm has trouble getting to truly super-human performance[9]. It's far too conservative, and, if we loosen the conservatism, it doesn't know what's acceptable and what isn't, or how to trade these off: since, in the training data, all passengers survived, the car was always painted yellow, and their luggage was intact, it has no reason to prefer human survival to taxi-colour. It doesn't even have a reason to have a specific feature resembling "passenger survived" at all.

This might be improved by the "allow decorrelation" approach from section 4.10: we specifically allow it to maximise speed of transport, while keeping the other features (no accidents, no speeding tickets) intact. As in section 6.7, we'll attempt to check that the AI does prioritise human survival, and that it will warn us if a refactoring moves it away from this.

Now, sometimes worlds w1,w2∈W may be indistinguishable for any feature set. But in that case, they can't be distinguished by any observations, either, so their relative probabilities won't change: as long as it's defined, P(w1|o)/P(w2|o) is constant for all observations o. So we can replace w1 and w2 with {w1,w2}, of prior probability P({w1,w2})=P(w1)+P(w2). Doing this for all indistinguishable worlds (which form an equivalence class) gives W′, a set of distinguishable worlds, with a well defined P on it. ↩︎
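The merging construction in this footnote can be sketched directly (world names, priors, and the observation-signature below are hypothetical):

```python
def merge_indistinguishable(worlds, prior, signature):
    """Group worlds sharing an observation-signature (an equivalence class);
    sum their priors to get a well-defined P on the merged set W'."""
    classes = {}
    for w in worlds:
        classes.setdefault(signature(w), []).append(w)
    return {frozenset(ws): sum(prior[w] for w in ws)
            for ws in classes.values()}

# w1 and w2 produce identical observations, so no evidence can separate them:
prior = {"w1": 0.2, "w2": 0.3, "w3": 0.5}
signature = {"w1": "obs-A", "w2": "obs-A", "w3": "obs-B"}.get
merged = merge_indistinguishable(prior, prior, signature)
```

The merged world {w1, w2} gets prior 0.2 + 0.3 = 0.5, and the total probability over W' still sums to one.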

It's useful to contrast a refinement with the "abstraction" defined in this sequence. An abstraction throws away irrelevant information, so is not generally a refinement. Sometimes they are exact opposites, as the ideal gas law is an abstraction of the movement of all the gas particles, while the opposite would be a refinement.

But they are not exact opposites either. Starting with the neurons of the brain, you might abstract them to "emotional states of mind", while a refinement could also add "emotional states of mind" as new features (while also keeping the old features). A splintering is more the opposite of an abstraction, as it signals that the old abstraction features are not sufficient.

It would be interesting to explore some of the concepts in this post with a mixture of refinements (to get the features we need) and abstractions (to simplify the models and get rid of the features we don't need), but that is beyond the scope of this current, already over-long, post. ↩︎

Specifically, we'd point - via labelled examples - at a cluster of features that correlate with functioning digestion, and another cluster of features that correlate with brain activity, and allow those two clusters to decorrelate with each other. ↩︎

It is no coincidence that, if R and R′ are rewards on M, that are identical on E, and if R∗ is a refactoring of R, then R∗ is also a refactoring of R′. ↩︎

Though note there are some problems with this approach, both in theory and in practice. ↩︎

Some more "body instincts" skills require more realistic environments, but some skills and procedures can perfectly well be trained in minimal simulators. ↩︎

You could define honour as "behaves according to the implicit expectations of their society", but that just illustrates how time-and-place dependent honour is. ↩︎

It's not impossible to get superhuman performance from apprenticeship learning; for example, we could select the best human performance on a collection of distinct tasks, and thus get the algorithm to have an overall performance that no human could ever match. Indeed, one of the purposes of task decomposition is to decompose complex tasks in ways that allow apprenticeship-like learning to have safe and very superhuman performance on the whole task. ↩︎


### Are We Right about How Effective Mockery Is?

Crossposted from Figuring Figuring.

*I did a second survey that fixed some of the flaws of the first survey. The results from the second survey significantly color the interpretation of the results from the first survey given in the first “Conclusion and Discussion” section. Please continue reading past the section titled “Second Survey” to get a full picture of the results from all surveys*.

*Intro*

A couple days ago a friend of mine on facebook asked about arguments in favor of mockery. They pointed out that they had noticed a lot of facebook posts mocking people for not wearing masks in the covid-19 era, and wondered whether this was an effective way to change people's behaviors.

I said in the comment section of that post that I would make a survey that worked as follows. Roughly half of the survey takers would be randomly assigned to answer the following questions:

- Do you think that mockery is an effective way to change people’s minds?
- Do you think that mockery is an effective way to change people’s behaviors?

The other half would be randomly assigned to answer these questions:

- Has being mocked ever caused you to change your mind about something?
- Has being mocked ever caused you to change your habits or behaviors?

No survey respondent was permitted to see all four questions. The possible answers to each question were “Yes”, “No”, and “Not sure”.

I made this survey using GuidedTrack. I posted it on my facebook wall, and also posted it to Positly and paid people to participate.

A total of 145 people responded to any of the questions on Positly. 74 were asked the first set of questions, and 71 were asked the second set of questions. A total of 66 people responded to any of the questions on facebook. 31 were asked the first set of questions, and 35 were asked the second set of questions.

Before I go on to tell you the results of the survey and the predictions me and some of my friends made, you might want to make your own predictions. I suggest you quickly scribble them down. Some particular questions you might want to make predictions about:

- Did more people answer yes to the first set of questions than the second set of questions, or is the reverse true, or were they about the same?
- Were facebook respondents (presumably people who are friends, or friends of friends of mine on facebook) more or less likely to say yes to the first set of questions?
- Were facebook respondents more or less likely to say yes to the second set of questions?
- What did I predict about the previous two questions?

There may be other fun questions to predict, and I’d be curious to hear how you did in the comments. Predictions from me and my friends coming up, so make sure you make your predictions beforehand. Again, I suggest that you write them down. You may also want to write down your reasoning beforehand.

_

_

_

_

_

_

_

_

_

__

_

_

_

_

_

_

____

__

_

_

__

_

_____

_

_

_

_

_

_

_

_

_

_

_

_

____

_

_

_

_

_

**Predictions**

Ok, last chance to make predictions before you hear some spoilers…

Alright.

I predicted that many more people would answer yes to the first set of questions (ie, the questions about whether mockery is effective) than to the second set of questions. I also predicted more people would say no to the second set of questions than to the first.

I’m not sure exactly what my theory was when I made that prediction—I made the prediction in the same comment that I suggested the survey, but I came up with two post hoc hypotheses that might explain the result I predicted. I do know that part of the reason I made that prediction is that mockery is fun, but admitting that fun is the main reason we do it rather than because of its positive effects on other people’s behavior feels kind of icky. So we use its effectiveness as an excuse.

One hypothesis is that we overestimate the effectiveness of mockery. This would make sense of the predicted result because it would be evidence that we all think mockery works on others, but none of us thinks it works on us.

The second hypothesis I made up to explain this predicted result was that while we know that mockery works on other people, we are hesitant to admit that it works on us, because that is a bit embarrassing. Perhaps people are also not that great at telling what actually caused them to change their minds or behaviors.

These hypotheses are not mutually exclusive.

In a private conversation, my friend predicted using similar reasoning that actually people would tend to answer the second set of questions (ie, those about how often we change our own minds as a result of mockery) affirmatively. Saying that you think mockery is effective feels kind of icky, but saying that you think you have never had your mind or behavior changed because of mockery seems kind of arrogant.

Seeing how such similar reasoning could be used to predict a totally different result made me feel a bit nervous.

Another friend of mine predicted that my facebook friends would be less likely to change their minds because of mockery than randomly selected survey participants. Positly users aren’t quite randomly selected, but they’re closer to randomly selected than whatever people happened to come across my facebook post.

**Results**

Sure enough, I was totally wrong.

Twice as many Positly respondents said that mockery has worked on them as said that mockery is effective. Positly respondents were slightly more likely to say that mockery is effective for changing behavior than for changing people’s minds, both for themselves and for others.

I think this is strong evidence against the hypotheses I suggested, and some evidence in favor of the hypothesis my friend suggested in conversation.

My facebook acquaintances showed a slightly smaller gap. When it comes to behavior, only 1.2 times as many respondents said that mockery has worked on them as said that mockery is effective on others. However, when it comes to changing minds, still about twice as many said that mockery has worked on them as said that mockery is effective on others.

I was surprised by this, as I tend to think of myself as preferring people who do not use mockery, and not using mockery while also thinking it is effective is a hard pair of things for humans to do simultaneously.

Of the 35 facebook respondents who were asked the second set of questions, 37% said they had changed their mind because of mockery, while 59% of Positly respondents said the same. This seems like decent evidence to me that my friend was right about my facebook acquaintances being less likely to change their minds because of mockery.

No respondent said that mockery was effective for changing their own or other people's minds while also answering that it was not effective for changing their own or other people's behaviors.

**Other Responses**

Here is a list of some of the things that people said they changed because of mockery. This was an optional part of the survey. The entries are slightly edited for brevity, to protect anonymity, and to avoid repeats.

- Economic beliefs
- Working out habits
- Avoiding people who mock them
- Writing about topics on fb
- My own appearance
- How good other people are
- Picking my nose in public
- Crying in public
- Basic cultural rules, like where to sit, how to join a conversation, etc.
- Philosophical or ethical beliefs
- Individualism as an ethical stance
- Fashion
- Music
- Using an old fashioned word
- Beliefs about what is socially acceptable
- Conversational habits
- Wearing briefs instead of boxers
- Stopped whistling
- Stopped/started wearing shorts
- Eating habits
- Lost weight
- Stopped playing sports
- Being late
- Hairstyle
- Stopped watching anime
- Mocked for being autistic, so changed the way I interact with people.
- Started wearing make up.
- Mocked for being outgoing, became less outgoing and confident.
- Started liking Trump
- Left Mormon religion
- Started thinking more before speaking
- Started brushing teeth more
- Stopped being conservative

Here is a list of some of the things that people said mockery was effective for changing in other people. This was also an optional part of the survey. The entries in this list have been edited as in the previous list.

- Weird opinions
- The way people think
- The way others dress
- Haircuts
- Mask wearing
- How often someone complains
- Making someone hide their opinions
- Arrogance
- Getting people to stop doing things around you
- Getting someone to stop writing things in public

I think these lists are similar enough in content to rule out another explanation of this data. You might have thought that people think mocking people is an effective way to get other people to change certain kinds of things, but when they think about what sorts of things they have changed themselves because of mockery, the two categories do not have much of an intersection. These lists make that seem unlikely to me.

**Discussion and Conclusion (1)**

These results seem like some evidence to me that people in general underestimate the effectiveness of mockery for getting others to change their minds. This is of course not necessarily an argument for using more mockery. I, for one, take the results of this survey to be a further reason that we should not mock people.

If you thought mockery was just some harmless fun you can have with your in group, as I sort of did, you might have thought that the costs to those being mocked are actually not that great. But it seems like mockery can make someone leave their religion, stop writing in public, change their political preferences, etc. I would strongly prefer for people to make decisions about those sorts of things using object level reasoning rather than reasoning about what will cause them to be mocked less. I will now much more than before see mockery as deliberate enemy action designed to interrupt other people’s cognition—not something to be taken as a joke, especially not in the context of conversations about important topics.

**Second Survey**

*This section was written after getting the data for my second survey which was inspired by some criticisms of the questions in the first. Everything above was written before getting that data.*

On the other hand, the questions I asked people in the original survey were not exactly analogous to each other. Firstly, people might have answered the first set of questions considering that although mockery is rarely effective for changing the behaviors or beliefs of those being mocked, it might work on bystanders who watch the mocking happen. Secondly, people answering the second set of questions with a “yes” might be thinking that “yes, mockery has *ever* caused me to change my mind” but that does not mean it is very effective.

To correct for this, I made a second survey. Half of respondents were asked the following two questions:

- How often has mockery worked as a means of getting you to change your mind about something?
- How often has mockery worked as a means of getting you to change your habits or behavior?

The other half were asked the following two questions:

- How often does mockery work as a means of getting someone to change their mind about something?
- How often does mockery work as a means of getting someone to change their habits or behavior?

The possible responses were: “very often”, “often”, “sometimes”, “rarely”, and “almost never”.

The survey was published on Positly.

I will give you some room to make predictions before showing the results.

*(spoiler space)*

**Second Survey Results**

There were a total of 115 respondents: 57 were asked the first set of questions and 58 were asked the second set. Here are the results compared across groups.

Mapping “very often” to 4, “often” to 3, “sometimes” to 2, “rarely” to 1, and “almost never” to 0, this gives a mean response for group 1 question 1 of 1.4562, and a mean response for group 2 question 1 of 1.5345, meaning that respondents overall thought that mockery was slightly more effective on others than on themselves.

Using the same mapping, the average response for group 1 question 2 was 1.5789, and the average response for group 2 question 2 was 1.6667. Again, respondents overall thought that mockery was slightly more effective on others than on themselves.
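The mapping and means above can be reproduced with a short sketch. The response lists here are illustrative stand-ins, not the survey's actual data:

```python
# Map the Likert labels to scores and compute a mean response.
SCORES = {"very often": 4, "often": 3, "sometimes": 2, "rarely": 1, "almost never": 0}

def mean_response(responses):
    """Average Likert score for a list of response labels."""
    return sum(SCORES[r] for r in responses) / len(responses)

# Hypothetical responses, not the real tallies:
group1_q1 = ["sometimes", "rarely", "often", "almost never"]
print(mean_response(group1_q1))  # (2 + 1 + 3 + 0) / 4 = 1.5
```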

**Discussion and Conclusion (2)**

These results contradict my original interpretation of the first survey’s data. The second survey suggests that people are in general pretty well calibrated about the effectiveness of mockery, or perhaps slightly underestimate it. I conclude that much of the effect observed in the results of survey one’s data was caused by the two effects discussed at the beginning of the “Second Survey” section and not the result of people genuinely underestimating the effectiveness of mockery.

However, I think I am still going to take mockery more seriously than I did before, mostly because I still think this survey showed me that mockery is more effective than I thought it was. The list of personal examples people gave was fairly chilling. I also imagine people cave to mockery a lot more than they are able to notice or willing to admit on a survey. Furthermore, I don’t think it was a coincidence that it was mostly me and my weirdest friends who (incorrectly) predicted that people would say that mockery is much more effective on others than on themselves. Probably, we weirdos have grown numb to mockery’s sting, and fallen out of touch with what being mocked feels like for most people.

**I Would Like to Thank**

Ozzie Gooen for inspiring me to make these surveys with his facebook post.

Frank Bellamy and Julia Kris Dzweiria for pointing out the asymmetry of the questions in the original survey.

Beth Kessler and Aryeh Englander for useful discussion.

And Spencer Greenberg as well as the whole of the Positly and Guidedtrack teams for making it much easier to run surveys like these.

Discuss

### Introduction To The Infra-Bayesianism Sequence


**Prelude:**

Diffractor and Vanessa proudly present: The thing we've been working on for the past five months. I initially decided that Vanessa's scattered posts about incomplete models were interesting, and could benefit from being written up in a short centralized post. But as we dug into the mathematical details, it turned out it didn't really work, and then Vanessa ran across the true mathematical thing (which had previous ideas as special cases) and scope creep happened.

This now looks like a new, large, and unusually tractable vein of research. Accordingly, this sequence supersedes all previous posts about incomplete models, and by now we've managed to get quite a few interesting results, and have ideas for several new research directions.

Diffractor typed everything up and fleshed out the proof sketches; Vanessa originated almost all of the ideas and theorems. It was a true joint effort: this sequence would not exist if either of us were absent. Alex Mennen provided feedback on drafts, making them much more comprehensible than they would otherwise be, and Turntrout and John Maxwell also helped a bit with editing.

Be aware that this sequence of posts has the math-textbook issue of requiring you to load a tower of novel concepts that build on each other into your head, and it cannot be read in a single sitting. We will be doing a group readthrough on MIRIxDiscord where we can answer questions and hopefully find collaborators; PM me for a link.

TLDR: Infra-Bayesianism is a new approach to epistemology / decision theory / reinforcement learning theory, which builds on "imprecise probability" to solve the problem of prior misspecification / grain-of-truth / nonrealizability which plagues Bayesianism and Bayesian reinforcement learning. Infra-Bayesianism also naturally leads to an implementation of UDT, and (more speculatively at this stage) has applications to multi-agent theory, embedded agency and reflection. This post is the first in a sequence which lays down the foundation of the approach.

**Introduction:**

Learning theory traditionally deals with two kinds of setting: "realizable" and "agnostic" or "non-realizable". In realizable settings, we assume that the environment can be described perfectly by a hypothesis inside our hypothesis space (AIXI is an example of this). We then expect the algorithm to converge to acting as if it already knew the correct hypothesis. In non-realizable settings, we make no such assumption, and instead expect the algorithm to converge to the best approximation of the true environment within the available hypothesis space.

As long as the computational complexity of the environment is greater than that of the learning algorithm, the algorithm cannot use an easy-to-compute hypothesis that describes the environment perfectly, so we are in the non-realizable setting. When we discuss AGI, this is necessarily the case, since the environment is the entire world: a world that, in particular, contains the agent itself and can support other agents that are even more complex, much as the halting oracles needed to run Solomonoff induction appear nowhere among the hypotheses Solomonoff considers. The realizable setting is therefore usually only a toy model. So, instead of seeking guarantees of good behavior under the assumption that the environment is easy to compute, we'd like to get good behavior simply assuming that the environment has some easy-to-compute properties that can be exploited.

For offline and online learning there are classical results in the non-realizable setting, in particular VC theory naturally extends to the non-realizable setting. However, for reinforcement learning there are few analogous results. Even for passive Bayesian inference, the best non-realizable result found in our literature search is Shalizi's which relies on ergodicity assumptions about the true environment. Since reinforcement learning is the relevant setting for AGI and alignment theory, this poses a problem.

Logical inductors operate in the nonrealizable setting, and the general reformulation of them in Forecasting Using Incomplete Models is of interest for broader lessons applicable to acting in an unknown environment. In said paper, reality can be drawn from any point in the space of probability distributions over infinite sequences of observations, Δ(Oω). Almost all of the points in this space aren't computable, and because of that, we shouldn't expect convergence to the true environment, as occurs in the realizable setting where the true environment lies in your hypothesis space.

However, even if we can't hope to learn the *true* environment, we can at least hope to learn some *property* of the true environment, like "every other bit is a 0", and have our predictions reflect that if it holds. A hypothesis in this setting is a closed convex subset of Δ(Oω) which can be thought of as "I don't know what the true environment is, but it lies within this set". The result obtained in the above-linked paper was, if we fix a countable family of properties that reality may satisfy, and define the inductor based on them, then for all of those which reality fulfills, the predictions of the inductor converge to that closed convex set and so fulfill the property in the limit.
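As a concrete (hypothetical) instance of such a hypothesis, the property "every other bit is a 0" could be formalized as the set of distributions that assign probability one to conforming sequences:

```latex
% A hypothetical formalization: the closed convex set of distributions
% concentrated on sequences whose even-indexed bits are all 0.
B \;=\; \bigl\{\, \mu \in \Delta(O^{\omega}) \;:\;
  \mu\bigl(\{\, o \in O^{\omega} : o_{2k} = 0 \text{ for all } k \,\}\bigr) = 1 \,\bigr\}
```

Convexity holds because any mixture of distributions each assigning probability one to the conforming sequences also assigns them probability one.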

**What About Environments?**

However, this just involves sequence prediction. Ideally, we'd want some space that corresponds to environments that you can interact with, instead of an environment that just outputs bits. And then, given a suitable set B in it... Well, we don't have a fixed environment to play against. The environment could be *anything*, even a worst-case one within B. We have Knightian uncertainty over our set of environments; it is *not* a probability distribution over environments. So, we might as well go with the maximin policy.

$$\operatorname{argmax}_{\pi} \inf_{e \in B} \mathbb{E}_{\pi \cdot e}[U]$$

Where π⋅e is the distribution over histories produced by policy π interacting with environment e. U is just some utility function.
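As a minimal illustration, here is a one-step sketch of maximin policy selection, under the simplifying assumption that each environment in a finite set B is reduced to a map from action to expected utility (names like `B` and `maximin_policy` are illustrative, not from the post):

```python
def maximin_policy(actions, B):
    """argmax over actions of the worst-case expected utility inf_{e in B}."""
    return max(actions, key=lambda a: min(env[a] for env in B))

# Knightian uncertainty between two environments:
B = [{"left": 1.0, "right": 0.0},   # environment where "left" pays off
     {"left": 0.2, "right": 0.6}]   # environment where "right" pays off

best = maximin_policy(["left", "right"], B)
# "left" guarantees 0.2 against Murphy, "right" only 0.0, so maximin picks "left".
```

Note that maximin can pick an action whose best case is worse, as long as its guaranteed floor is higher.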

When we refer to "Murphy", this is referring to whatever force is picking the worst-case environment to be interacting with. Of course, if you aren't playing against an adversary, you'll do better than the worst-case utility that you're guaranteed. Any provable guarantees come in the form of establishing lower bounds on expected utility if a policy is selected.

The problem of generating a suitable space of environments was solved in Reinforcement Learning With Imperceptible Rewards. If two environments are indistinguishable by any policy, they are identified; a mixture of environments corresponds to picking one of the component environments with the appropriate probability at the start of time; and there was a notion of updating.

However, this isn't good enough. We could find no good update rule for a set of environments, so we had to go further.

Which desiderata should be fulfilled to make maximin policy selection over a set of environments (actually, we'll have to generalize further than this) work successfully? We'll have three starting desiderata.

**Desideratum 1:** There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency. Starting with a policy π, a policy π′ which acts like π except that, after history h, it acts better (according to the beliefs of the agent after observing h) than π usually does after h, should also do better as viewed from the start.

**Desideratum 2:** Our notion of a hypothesis (set of environments) in this setting should collapse "secretly equivalent" sets, such that any two distinct hypotheses behave differently in *some* relevant aspect. This will require formalizing what it means for two sets to be "meaningfully different", finding a canonical form for an equivalence class of sets that "behave the same in all relevant ways", and then proving some theorem that says we got everything.

**Desideratum 3:** We should be able to formalize the "Nirvana trick" (elaborated below) and cram any UDT problem where the environment cares about what you *would* do, into this setting. The problem is that we're just dealing with sets of environments which only depend on what you do, not what your policy is, which hampers our ability to capture policy-dependent problems in this framework. However, since Murphy looks at your policy and then picks which environment you're in, there *is* an acausal channel available for the choice of policy to influence which environment you end up in.

The "Nirvana trick" is as follows. Consider a policy-dependent environment, a function Π×(A×O)<ω×A→ΔO (i.e., the probability distribution over the next observation depends on the history so far, the action you selected, and your policy). We can encode a policy-dependent environment as a set of policy-independent environments that don't care about your policy, by hard-coding every possible deterministic policy into the policy slot, making a family of functions of type (A×O)<ω×A→ΔO, which is the type of policy-independent environments. It's similar to taking a function f(x,y), and plugging in all possible x to get a family of functions that only depend on y.

Also, we will impose a rule that, if your action ever violates what the hard-coded policy predicts you do, you attain Nirvana (a state of high or infinite reward). Then, Murphy, when given this set of environments, will go "it'd be bad if they got high or infinite reward, thus I need to pick an environment where the hard-coded policy matches their *actual* policy". When playing against Murphy, you'll act like you're selecting a policy for an environment that *does* pay attention to what policy you pick. As-stated, this doesn't quite work, but it can be repaired.
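Here's a hypothetical one-shot sketch of the Nirvana trick, under the illustrative assumptions that a "policy" is just a single action and Nirvana counts as infinite reward: hard-coding each possible policy yields policy-independent environments, and Murphy's minimization forces the hard-coded policy to match your actual one.

```python
NIRVANA = float("inf")

def nirvana_family(policy_dependent_env, actions):
    """Hard-code each possible policy, yielding policy-independent environments
    that grant Nirvana whenever the actual action deviates from the hard-coded one."""
    family = []
    for hardcoded in actions:
        def env(action, hardcoded=hardcoded):
            if action != hardcoded:
                return NIRVANA   # deviating from the prediction attains Nirvana
            return policy_dependent_env(hardcoded, action)
        family.append(env)
    return family

def murphy_value(action, family):
    """Murphy picks the worst-case environment; Nirvana-granting ones never win."""
    return min(env(action) for env in family)

# Example: an environment that rewards you iff it predicts you take "b".
pd_env = lambda policy, action: 1.0 if policy == "b" else 0.0
fam = nirvana_family(pd_env, ["a", "b"])
# Taking "b": only the env with hard-coded "b" avoids Nirvana, so you get 1.0,
# exactly as if you had selected the policy "b" in the policy-dependent environment.
```

The `hardcoded=hardcoded` default argument pins down the loop variable in each closure; without it, every `env` would see the last value of `hardcoded`.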

There's two options. One is making Nirvana count as infinite reward. We will advance this to a point where we can capture any UDT/policy-selection problem, at the cost of some mathematical ugliness. The other option is making Nirvana count as 1 reward forever afterward, which makes things more elegant, and it is much more closely tied to learning theory, but that comes at the cost of only capturing a smaller (but still fairly broad) class of decision-theory problems. We will defer developing that avenue further until a later post.

**A Digression on Deterministic Policies**

We'll be using deterministic policies throughout. The reason for using deterministic policies instead of probabilistic policies (despite the latter being a larger class), is that the Nirvana trick (with infinite reward) doesn't work with probabilistic policies. Also, probabilistic policies don't interact well with embeddedness, because it implicitly assumes that you have a source of random bits that the rest of the environment can never interact with (except via your induced action) or observe.

Deterministic policies can emulate probabilistic policies by viewing probabilistic choice as deterministically choosing a finite bitstring to enter into a random number generator (RNG) in the environment, and then you get some bits back and act accordingly.

However, we aren't assuming that the RNG is a good one. It could be insecure or biased or nonexistent. Thus, we can model cases like Death In Damascus or Absent-Minded Driver where you left your trusty coin at home and don't trust yourself to randomize effectively. Or a nanobot that's too small to have a high-bitrate RNG in it, so it uses a fast insecure PRNG (pseudorandom number generator). Or game theory against a mindreader that can't see your RNG, just the probability distribution over actions you're using the RNG to select from, like an ideal CDT opponent. It can also handle cases where plugging certain numbers into your RNG chip causes lots of heat to be released, or maybe the RNG is biased towards outputting 0's in strong magnetic fields. Assuming you have a source of true randomness that the environment can't read isn't general enough!

**Motivating Sa-Measures**

Sets of probability distributions or environments aren't enough; we need to add in some extra data. This can be best motivated by thinking about how updates should work in order to get dynamic consistency.

Throughout, we'll be using a two-step view of updating, where first, we chop down the measures accordingly (the "raw update"), and then we renormalize back up to 1.

So, let's say we have a set of two probability distributions μ1 and μ2. We have Knightian uncertainty within this set, we genuinely don't know which one will be selected, it may even be adversarial. μ1 says observation o has 0.5 probability, μ2 says observation o has 0.01 probability. And then you see observation o! The wrong way to update would be to go "well, both probability distributions are consistent with observed data, I guess I'll update them individually and resume being completely uncertain about which one I'm in", you don't want to ignore that one of them assigns 50x higher probability to the thing you just saw.

However, neglecting renormalization, we can do the "raw update" to each of them individually, and get m1 and m2 (finite measures, not probability distributions), where m1 has 0.5 measure and m2 has 0.01 measure.

Ok, so instead of a set of *probability distributions*, since that's insufficient for updates, let's consider a set of measures m, instead. Each individual measure in that set can be viewed as λμ, where μ is a probability distribution, and λ≥0 is a scaling term. Note that λ is not uniform across your set, it varies depending on which point you're looking at.
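To make the raw update concrete, here's a hypothetical sketch (the encoding of distributions as dicts over two-step histories is illustrative): each distribution gets chopped down by the probability it assigned to the observation, and the surviving mass λ is kept as data rather than renormalized away.

```python
def raw_update(distributions, obs):
    """For each distribution over two-step histories, keep only histories
    beginning with obs; return (lam, conditional) where lam = P(obs)."""
    out = []
    for mu in distributions:
        lam = sum(p for h, p in mu.items() if h[0] == obs)
        cond = {h[1]: p / lam for h, p in mu.items() if h[0] == obs} if lam > 0 else {}
        out.append((lam, cond))
    return out

mu1 = {("o", "a"): 0.25, ("o", "b"): 0.25, ("x", "a"): 0.5}   # assigns P(o) = 0.5
mu2 = {("o", "a"): 0.01, ("x", "a"): 0.99}                    # assigns P(o) = 0.01
measures = raw_update([mu1, mu2], "o")
# The lambdas are 0.5 and 0.01: the 50x ratio survives, instead of being
# thrown away by renormalizing each distribution individually.
```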

However, this *still* isn't enough. Let's look at a toy example for how to design updating to get dynamic consistency. We'll see we need to add *one more* piece of data. Consider two environments where a fair coin is flipped, you see it and then say "heads" or "tails", and then you get some reward. The COPY Environment gives you 0 reward if you say something different than what the coin shows, and 1 reward if you match it. The REVERSE HEADS Environment always gives you 0.5 reward if the coin comes up tails, but if it comes up heads, saying "tails" gets you 1 reward and "heads" gets you 0 reward. We have Knightian uncertainty between the two environments.

For finding the optimal policy, we can observe that saying "tails" when the coin is tails helps out in COPY and doesn't harm you in REVERSE HEADS, so that's a component of an optimal policy.

Saying "tails" no matter what the coin shows means you get 0.5⋅0+0.5⋅1=0.5 utility on COPY, and 0.5⋅1+0.5⋅0.5=0.75 utility on REVERSE HEADS. Saying "tails" when the coin is tails and "heads" when the coin is heads means you get 0.5⋅1+0.5⋅1=1 utility on COPY and 0.5⋅0+0.5⋅0.5=0.25 utility on REVERSE HEADS. Saying "tails" no matter what has a better worst-case value, so it's the optimal maximin policy.

Now, if we see the coin come up heads, how should we update? The *wrong* way to do it would be to go "well, both environments are equally likely to give this observation, so I've got Knightian uncertainty re: whether saying heads or tails gives me 1 or 0 utility, both options look equally good". This is because, according to past-you, regardless of what you did upon seeing the coin come up "tails", the maximin expected values of saying "heads" when the coin comes up heads, and saying "tails" when the coin comes up heads, are unequal. Past-you is yelling at you from the sidelines not to just shrug and view the two options as equally good.

Well, let's say you *already* know that you would say "tails" when the coin comes up tails and are trying to figure out what to do now that the coin came up heads. The proper way to reason through it is going "I have Knightian uncertainty between COPY which has 0.5 expected utility assured off-history since I say "tails" on tails, and REVERSE HEADS, which has 0.25 expected utility assured off-history. Saying "heads" now that I see the coin on heads would get me (0.5×1)+0.5=1 expected utility in COPY and (0.5×0)+0.25=0.25 utility in REVERSE HEADS, saying "tails" would get me (0.5×0)+0.5=0.5 utility in COPY and (0.5×1)+0.25=0.75 utility in REVERSE HEADS, I get higher worst-case value by saying "tails"." And then you agree with your past self re: how good the various decisions are.

Huh, the proper way of doing this update to get dynamic consistency requires keeping track of the fragment of expected utility we get off-history.

Similarly, if you messed up and precommitted to saying "heads" when the coin comes up tails (a bad move), we can run through a similar analysis and show that keeping track of the expected utility off-history leads you to take the action that past-you would advise, after seeing the coin come up heads.
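The arithmetic above can be checked mechanically. A small sketch (reward tables transcribed from the example; function names are illustrative) confirming that carrying the off-history utility term b makes the post-update maximin decision agree with the ex-ante one:

```python
# reward[env][coin][what_you_say]
reward = {
    "COPY":          {"heads": {"heads": 1, "tails": 0}, "tails": {"heads": 0, "tails": 1}},
    "REVERSE_HEADS": {"heads": {"heads": 0, "tails": 1}, "tails": {"heads": 0.5, "tails": 0.5}},
}

def value(env, say_on_heads, say_on_tails):
    return 0.5 * reward[env]["heads"][say_on_heads] + 0.5 * reward[env]["tails"][say_on_tails]

# Ex ante: maximin over full policies (what to say on heads, what to say on tails).
policies = [(h, t) for h in ("heads", "tails") for t in ("heads", "tails")]
best = max(policies, key=lambda p: min(value(e, *p) for e in reward))
# "say tails no matter what" is maximin-optimal, as computed above.

# After seeing heads, knowing we'd say "tails" on tails: b is the expected
# utility already secured off-history (the tails branch).
b = {e: 0.5 * reward[e]["tails"]["tails"] for e in reward}   # COPY: 0.5, REVERSE: 0.25

def updated_value(env, say):
    return 0.5 * reward[env]["heads"][say] + b[env]

best_now = max(("heads", "tails"), key=lambda s: min(updated_value(e, s) for e in reward))
# best_now is "tails", agreeing with past-you: dynamic consistency.
```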

So, with the need to keep track of that fragment of expected utility off-history to get dynamic consistency, it isn't enough to deal with finite measures m; that still isn't keeping track of the information we need. What we need is (m,b), where m is a finite measure, and b is a number ≥0. That b term keeps track of the expected value off-history so we make the right decision after updating. (We're glossing over the distinction between probability distributions and environments here, but it's inessential.)

We will call such an (m,b) pair an "affine measure", or "a-measure" for short. The reason for this terminology is that a measure can be thought of as a linear function from the space of continuous functions to R. But then there's this +b term stuck on that acts as utility, and a linear function plus a constant is an affine function. So, that's an a-measure: a pair of a finite measure and a b term where b≥0.

But wait, we can go even further! Let's say our utility function of interest is bounded. Then we can do a scale-and-shift until it's in [0,1].

Since our utility function is bounded in [0,1]... what would happen if you let in measures with negative parts, but only if they're paired with a sufficiently large b term? Such a thing is called an sa-measure, for signed affine measure. It's a pair of a finite signed measure and a b term that's as-large-or-larger than the amount of negative measure present. No matter your utility function, even if it assigns 0 reward to outcomes with positive measure and 1 reward to outcomes with negative measure, you're still assured nonnegative expected value because of that +b term. It turns out we actually *do* need to expand in this direction to keep track of equivalence between sets of a-measures, get a good tie-in with convex analysis because signed measures are dual to continuous functions, and have elegant formulations of concepts like minimal points and the upper completion.
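A quick numerical sketch of why the b term compensates for negative mass, under the illustrative assumption of a two-point outcome space: even the utility function most hostile to the negative part can't push the expected value below zero.

```python
m = {"x": 0.7, "y": -0.3}   # signed measure with 0.3 of negative mass
b = 0.3                     # b >= the negative part, so (m, b) is an sa-measure

def expectation_plus_b(f):
    """E_m[f] + b for a utility function f with values in [0, 1]."""
    return sum(m[o] * f(o) for o in m) + b

# Worst case for nonnegativity: f = 0 on positive-mass points, 1 on negative ones.
worst = expectation_plus_b(lambda o: 1.0 if m[o] < 0 else 0.0)
# worst is exactly 0 here: the +b term just barely covers the negative mass.
```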

Negative measures may be a bit odd, but as we'll eventually see, we can ignore them and they only show up in intermediate steps, not final results, much like negative probabilities in quantum mechanics. And if negative measures ever become relevant for an application, it's effortless to include them.

**Belief Function Motivation**

Also, we'll have to drop the framework we set up at the beginning where we're considering sets of environments, because working with sets of environments has redundant information. As an example, consider two environments where you pick one of two actions, and get one of two outcomes. In environment e0, regardless of action, you get outcome 0. In environment e1, regardless of action, you get outcome 1. Then, we should be able to freely add an environment e2, where action 0 implies outcome 0, and where action 1 implies outcome 1. Why?

Well, if your policy is to take action 0, e2 and e0 behave identically. And if your policy is to take action 1, e2 and e1 behave identically. So, adding an environment like this doesn't affect anything, because it's a "chameleon environment" that will perfectly mimic *some* preexisting environment regardless of which policy you select. In particular, if you consider the function mapping an action to the set of possible probability distributions over outcomes, adding e2 didn't change that function at all. Put another way, since there's no way to detect whether such an environment was added to the set (no matter what you do, it mimics a preexisting environment), we might as well add it, and seek some alternate formulation instead of "set of environments" that doesn't have these unobservable degrees of freedom in it.

To eliminate this redundancy, the *true* thing we should be looking at isn't a set of environments, but the "belief function" from policies to sets of probability distributions over histories. This is the function produced by having a policy interact with your set of environments and plotting the probability distributions you could get. Given certain conditions on a belief function, it is possible to recover a set of environments from it, but belief functions are more fundamental. We'll provide tools for taking a wide range of belief functions and turning them into sets of environments, if it is desired.
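A tiny sketch of the e0/e1/e2 example above, illustrating that the belief function is unchanged by adding the chameleon environment (the dict-of-frozensets encoding is illustrative):

```python
def e0(action): return {0: 1.0}        # outcome 0 regardless of action
def e1(action): return {1: 1.0}        # outcome 1 regardless of action
def e2(action): return {action: 1.0}   # outcome copies the action

def belief_function(envs, actions=(0, 1)):
    """Map each action to the set of outcome distributions the envs can produce."""
    return {a: frozenset(tuple(sorted(e(a).items())) for e in envs) for a in actions}

# e2 always mimics e0 (on action 0) or e1 (on action 1), so adding it
# leaves the belief function exactly the same.
```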

Well, actually, from our previous discussion, sets of probability distributions are insufficient; we need a function from policies to sets of sa-measures. But that's material for later.

**Conclusion**

So, our fundamental mathematical object that we're studying to get a good link to decision theory is not sets of probability distributions, but sets of sa-measures. And instead of sets of environments, we have functions from policies to sets of sa-measures over histories. This is because probability distributions alone aren't flexible enough for the sort of updating we need to get dynamic consistency, and in addition to this issue, sets of environments have the problem where adding a new environment to your set can be undetectable in any way.

In the next post, we build up the basic mathematical details of the setting, until we get to a duality theorem that reveals a tight parallel between sets of sa-measures fulfilling certain special properties, and probability distributions, allowing us to take the first steps towards building up a version of probability theory fit for dealing with nonrealizability. There are analogues of expectation values, updates, renormalizing back to 1, priors, Bayes' Theorem, Markov kernels, and more. We use the "infra" prefix to refer to this setting. An infradistribution is the analogue of a probability distribution. An infrakernel is the analogue of a Markov kernel. And so on.

The post after that consists of extensive work on belief functions and the Nirvana trick to get the decision-theory tie-ins, such as UDT behavior while still having an update rule, and the update rule is dynamically consistent. Other components of that section include being able to specify your entire belief function with only part of its data, and developing the concept of Causal, Pseudocausal, and Acausal hypotheses. We show that you can encode almost any belief function as an Acausal hypothesis, and you can translate Pseudocausal and Acausal hypotheses to Causal ones by adding Nirvana appropriately (kinda). And Causal hypotheses correspond to actual sets of environments (kinda). Further, we can mix belief functions to make a prior, and there's an analogue of Bayes for updating a mix of belief functions. We cap it off by showing that the starting concepts of learning theory work appropriately, and show our setting's version of the Complete Class Theorem.

Later posts (not written yet) will be about the "1 reward forever" variant of Nirvana and InfraPOMDP's, developing inframeasure theory more, applications to various areas of alignment research, the internal logic which infradistributions are models of, unrealizable bandits, game theory, attempting to apply this to other areas of alignment research, and... look, we've got a lot of areas to work on, alright?

If you've got the relevant math skills, as previously mentioned, you should PM me or Turntrout to get a link to the MIRIxDiscord server and participate in the group readthrough. You're more likely than usual to be able to contribute to advancing the research further; there's a lot of shovel-ready work available.

**Links to Further Posts:**


### Basic Inframeasure Theory


Our task in this post will be to develop the basic theory and notation for inframeasures and sa-measures. The proofs and concepts require some topology and functional analysis. We assume the reader is familiar with topology and linear algebra but not functional analysis, and will explain the functional analysis concepts in more detail. If you wish to read through these posts, PM me to get a link to MIRIxDiscord; we'll be doing a group readthrough where I or Vanessa can answer questions. Here's the previous post, and here are the proof sections. Beware: the proof sections are hard.

**Notation Reference**

Feel free to skip this segment and refer back to it when needed. Duplicate the tab, keep one of them on this section, and you can look up notation here.

X,Y: some compact metric space. Points in this are denoted by x or y.

d: Some distance metric; the usual one is the KR-metric between signed measures, defined as d(m,m′):=sup_{f_lip}|∫_X f_lip(x)dm−∫_X f_lip(x)dm′| where f_lip is a function X→[−1,1] with a Lipschitz constant of 1 or less. In other words: what's the biggest distance between the values the two measures output when you feed in a function that's limited in its ability to distinguish nearby measures, due to having a small Lipschitz constant?

M±(X): The Banach space of finite signed measures over X, equipped with the KR-metric/norm defined as above (the norm is derived from the metric by assessing the distance between the signed measure and the 0 measure). Elements are denoted by m. By Jordan decomposition, we can uniquely split m as m+ + m−, where the former is all-positive and the latter is all-negative.

C(X),C(X,[0,1]): The Banach space of continuous functions X→R. The latter one is the space of continuous functions bounded in [0,1]. Elements of C(X,[0,1]) are typically denoted by f.

m(f): We can interpret signed measures as continuous linear functionals on C(X). This is given by ∫_X f(x)dm. If m were actually a probability distribution μ, this would just be Eμ(f). These are generalized expectations.

b: used to refer to the number component of an a-measure or sa-measure.

Msa(X): The closed convex cone of sa-measures. An sa-measure is a pair (m,b) where b+m−(1)≥0. Elements of this (sa-measures) are denoted by M.

f+: A positive functional. A continuous linear functional that's nonnegative for all sa-measures.

B: A set of sa-measures.

EB(f): The expectation of a function f relative to a set of sa-measures. Defined as inf(m,b)∈B(m(f)+b).

Bmin,Buc: The set of minimal points of B (points that can't be written as a different point in B plus a nonzero sa-measure), and the upper completion of B (B+Msa(X)), respectively.

Ma(X): The closed convex cone of a-measures. An a-measure is a pair (m,b) where m is a measure (no negative component) and b≥0. It can also be written as (λμ,b) where λ≥0, and μ is a probability distribution.

λ: Given an a-measure, its λ value is the λ from writing the a-measure as (λμ,b). At the end, it's used for lambda-notation to describe complicated functions. Context distinguishes the two uses.

λ⊙: Either the minimal upper bound on the λ values of the minimal points of a set B, or the Lipschitz constant of a function h (there's a close link between the two).

¯B, c.h(B): The closure and convex hull of a set, respectively.

H: An infradistribution. A set of sa-measures fulfilling the properties of nonemptiness, closure, convexity, upper-completeness, positive-minimals, (weak)-bounded minimals, and normalization.

□X: The set of infradistributions over X. □bX is the set of bounded infradistributions over X.

ΔX: The space of probability distributions over some set X.

h: The function induced by an H that goes f↦EH(f). Or, just a function C(X,[0,1])→[0,1] that's concave, monotone, uniformly continuous, and normalized; such functions are linked to infradistributions.

g,g∗: If you're seeing it in the context of a pushforward, g is a continuous function X→Y, and g∗ is the induced function □X→□Y. If you're seeing g in the context of updating, it's a continuous function in C(X,[0,1]).

ζ: Used to denote probability distributions used for mixing stuff together, a probability distribution over the natural numbers or a finite subset of them in all cases. ζi is the probability on the number i.

i: The index variable for mixing, like indexing infradistributions or points or functions.

EζHi: A mix of sets, defined as the set of every point that can be constructed by selecting a point from each Hi and mixing according to ζ.

L: A function in C(X,[0,1]), thought of as the indicator function for a fuzzy set, that we use for updating.

f★Lg: The function made by gluing f and g together via L, defined as (f★Lg)(x):=L(x)f(x)+(1−L(x))g(x).

supp(f): The support of a function f, the set of x where f(x)>0.

1E: The indicator function that's 1 on set E, 0 everywhere else.

PgH(L): The probability of L relative to the function g according to the infradistribution H, defined as EH(1★Lg)−EH(0★Lg).

H|gL: The infradistribution H updated on the fuzzy set L relative to the function g.

K: An infrakernel, a function fulfilling some continuity properties, of type signature X→□Y.

**Basic concepts**

Time to start laying our mathematical groundwork, like the spaces we'll be working in.

X is some compact metric space, equipped with the Borel σ-algebra (which, in this case, is the same as the Baire σ-algebra). We could probably generalize this further to complete metric spaces, but things get significantly trickier (one of many directions for future research), and compact metric spaces are quite well-behaved.

Concrete examples of compact metric spaces include the space of infinite bitstrings, the color space for a mantis shrimp, the surface of a sphere, a set of finitely many points, the unit interval, the space of probability distributions over a compact metric space, and countable products of compact metric spaces.

Let's recap some functional analysis terminology for those seeing it for the first time, along with a bit of the notation we're using; you can skip this part if you already know it. Vector spaces in full generality may lack much of the nice structure present in Rn that's used in Linear Algebra. Going from the strongest and most useful structure to the weakest, there's a chain of implication: inner product, norm, metric, topology. If you have an inner product, you can get a norm. If you have a norm, you can get a metric from that via d(x,y)=||x−y||, and if you have a metric, you can get a topology from that (with a basis of open balls centered at points). The structure must be imposed on the vector space, and there may be some freedom in doing so, like how Rn can have the L1, L2, or L∞ norm.

A Banach space is a vector space equipped with a norm (a notion of size for vectors), that's also closed under taking limits, just like R is.

The term "functional" is used for a function to R. So, a continuous linear functional on a vector space V is a function that's: linear, continuous, and has type signature V→R.

The term "dual space of V" is used for the vector space of continuous linear functionals V→R.

The space C(X) is the Banach space of continuous functions X→R. We'll also use C(X,[0,1]) to denote the subset that just consists of continuous functions bounded in [0,1]. The dual space of C(X) is M±(X), the Banach space of finite signed measures over X.

Moving on from the functional analysis terminology, let's consider finite signed measures m, elements of M±(X). A signed measure m can be uniquely split into a positive part and a negative part, m+ + m−, by the Jordan Decomposition Theorem. The "finite" part just means that m+ doesn't assign any set ∞ measure and m− doesn't assign any set −∞ measure.

Now, we said that the space of finite signed measures was the dual space of C(X) (continuous functions X→R). So... how does a signed measure correspond to a continuous linear functional over C(X)? Well, it corresponds to m(f):=∫_X f(x)dm.

If m was a probability distribution μ, then μ(f) would be Eμ(f), so this is just like taking the expected value, but generalizing to negative regions in m. We'll be using this notation m(f) a whole lot. Because finite signed measures perfectly match up with continuous linear functionals C(X)→R, we can toggle back and forth between whichever view is the most convenient in the moment, viewing continuous linear functionals on C(X) as finite signed measures, and vice-versa.
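As a quick concrete sketch of the m(f) notation on a finite space (a toy representation invented here for illustration; the dict encoding and point names are not from the post):

```python
# Sketch: the m(f) notation on a finite space, with a signed measure as a
# dict from points to (possibly negative) masses.

def m_of_f(m, f):
    """m(f) = integral of f dm = sum_x f(x) * m({x})."""
    return sum(f(x) * mass for x, mass in m.items())

# When m is a probability distribution mu, m(f) is the usual E_mu(f):
mu = {"heads": 0.5, "tails": 0.5}
f = lambda x: 1.0 if x == "heads" else 0.0
assert m_of_f(mu, f) == 0.5

# The same formula works when some masses are negative:
m = {"heads": 0.7, "tails": -0.2}
assert m_of_f(m, f) == 0.7
assert abs(m_of_f(m, lambda x: 1.0) - 0.5) < 1e-9  # m(1): total signed mass
```

Toggling between "signed measure" and "linear functional" is then just a matter of which side of `m_of_f` you focus on.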

The ambient vector space we work in is M±(X)⊕R: pairs of a signed measure and a number. ⊕ is the direct sum, which is basically the Cartesian product for vector spaces. The direct sum of Banach spaces is a Banach space, with the norm defined in the obvious way as ||(m,b)||=||m||+|b|.

We should take a moment to talk about which norm/metric we're using. The norm/metric we're using on M±(X) is the KR (Kantorovich-Rubinstein) norm/metric.

**Definition 1: KR-metric**

*The metric defined by *d(m,m′):=sup_{f_lip}|m(f_lip)−m′(f_lip)|*, where *f_lip* is a continuous function *X→[−1,1]* with a Lipschitz constant of 1 or less. The KR-norm is *||m||=d(m,0)*.*

But why are we using this unfamiliar KR-metric, instead of the trusty old total variation distance? Well, the KR-metric, since it can only query the measure with functions that aren't too steep or big, says that two distributions are close if, y'know, they're close in the intuitive sense. If we have a bunch of Dirac-delta distributions at 0.9, 0.99, 0.999..., then according to the KR-metric, they limit to a Dirac-delta distribution at 1. According to total variation distance, all these distributions are at distance 2 from each other and don't converge at all. Similarly, if we've got two probability distributions over histories for an environment we're in, and they behave *very* similarly and then start diverging after the gazillionth timestep, the KR-metric would go "hm, those two distributions are very close to each other", while total variation distance says they're very different. Also, if we've got finitely many points at distance 1 from each other, the KR-metric and total variation distance match up with each other (up to a constant). But total variation distance is a bit too restrictive for the continuous case.
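The Dirac-delta example can be sketched numerically. For Dirac deltas, the sup over functions bounded in [−1,1] with Lipschitz constant ≤1 has a simple closed form, min(|x−y|, 2), and distinct deltas are at total variation distance 2 under the signed-measure norm convention (these closed forms are standard facts, stated here as a toy illustration rather than anything proved in the post):

```python
# Sketch: KR distance vs total variation distance between Dirac deltas.

def kr_dirac(x, y):
    # sup of |f(x) - f(y)| over f with |f| <= 1 and Lipschitz constant <= 1
    return min(abs(x - y), 2.0)

def tv_dirac(x, y):
    # Distinct Dirac deltas are at total variation distance 2
    # (mass 1 moved each way under the signed-measure norm).
    return 0.0 if x == y else 2.0

points = [0.9, 0.99, 0.999, 0.9999]
# KR: the deltas approach the delta at 1...
assert all(kr_dirac(p, 1.0) < 0.11 for p in points)
# ...while in total variation they stay at distance 2 and never converge.
assert all(tv_dirac(p, 1.0) == 2.0 for p in points)
```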

There's a sense in which convergence in total variation distance is too strict, because it requires a "perfect match" to exist in your hypothesis space, while convergence in KR-distance is just right for nonrealizability because, instead of requiring a "perfect match", it just requires that you get sufficiently close. Instead of getting accurate predictions for the rest of forever, the requirement is something more like "you'll be accurate for the next zillion timesteps, and the time horizon where you start being inaccurate gets further and further away over time". You can draw an analogy to how utility functions with time discounting don't care much about what happens at very late times.

Going with the KR-metric means we've got very nice dual spaces and compactness properties, while with total variation distance, Wikipedia doesn't even know what the dual space *is*.

So, tl;dr, the KR-metric is a much better choice for our setting, and we're working in M±(X)⊕R as our vector space, which is equipped with the KR-norm and is closed under limits.

**Definition 2: Sa-Measure**

*A point *M∈M±(X)⊕R*, which, when written as a pair of a signed measure and a number *(m,b)*, has *b+m−(1)≥0*. The set of sa-measures is denoted by *Msa(X)*.*

**Definition 3: A-Measure**

*A point *M∈M±(X)⊕R*, which, when written as a pair of a signed measure and a number *(m,b)*, has *m* as a measure, and *b≥0*. The set of a-measures is denoted by *Ma(X)*.*

Note that Ma(X) and Msa(X) are both closed convex cones. A closed convex cone is a subset of a vector space that is closed under multiplication by any a≥0, closed under addition, and closed under limits. For visual intuition, imagine a literal cone with its point at 0 in R3 that's heading off in some direction, and see how it fulfills those three properties.
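On a finite space, the two membership conditions are easy to sketch, representing a point (m,b) as a (dict, float) pair (a toy encoding invented here for illustration):

```python
# Sketch: membership tests for the cones M_sa(X) and M_a(X) on a finite space.

def neg_part_total(m):
    """m^-(1): the (nonpositive) total mass of the negative part of m."""
    return sum(mass for mass in m.values() if mass < 0)

def is_sa_measure(m, b):
    # sa-measure condition: b + m^-(1) >= 0 (b covers the negative mass).
    return b + neg_part_total(m) >= 0

def is_a_measure(m, b):
    # a-measure condition: m is an actual measure (no negative part) and b >= 0.
    return all(mass >= 0 for mass in m.values()) and b >= 0

m, b = {"x": 0.5, "y": -0.3}, 0.4
assert is_sa_measure(m, b) and not is_a_measure(m, b)   # 0.4 + (-0.3) >= 0
assert not is_sa_measure(m, 0.2)                        # 0.2 + (-0.3) < 0
# Cone property check: scaling by a >= 0 preserves membership.
assert is_sa_measure({k: 2 * v for k, v in m.items()}, 2 * b)
```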

**Basic Inframeasure Conditions**

Before proceeding further, we should mention that Theorems 1, 2, and 3 are fairly elementary and have probably been proved in more generality in some different paper on convex analysis. We just call them theorems because they're important, not necessarily original. Past that, things get more novel. Sets of distributions instead of distributions have been considered before under the name "Imprecise Probability", as have nonlinear expectations and some analogues to probability theory, Shige Peng wrote a book on the latter. We found out about this after coming up with it independently. The main innovations that have not been found elsewhere are augmenting the sets of probability distributions with extra data (ie, our sa-measures) to get a dynamically consistent update rule, and how to deal with environments/link the setting to reinforcement learning. Let's move on.

Let B be some arbitrary set of sa-measures. We're obviously nowhere near calling it an infradistribution, because we haven't imposed any properties on it. And different B may have the same behavior; we're nowhere near our second desideratum of collapsing equivalence classes of B's with the same behavior. Well, nonemptiness should be a fairly obvious property to add.

**Condition 1: Nonemptiness:** B≠∅

From here, let's see how we can enlarge B without affecting its behavior. But hang on, what do we even mean by "behavior"??

**Definition 4: Expectation w.r.t. a Set of Sa-Measures**

EB(f):=inf(m,b)∈B(m(f)+b)* where *f∈C(X,[0,1])* and *B* is nonempty.*

This is what we mean by "behavior", all these values should be left unchanged, regardless of f.
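Definition 4 is easy to compute on a finite space with a finite B (a toy representation; the set B and the point names here are made up for illustration):

```python
# Sketch: E_B(f) is the worst-case (infimum) value of m(f) + b over B.

def m_of_f(m, f):
    return sum(f(x) * mass for x, mass in m.items())

def E(B, f):
    # inf over (m, b) in B of m(f) + b; min suffices for finite B.
    return min(m_of_f(m, f) + b for (m, b) in B)

B = [({"x": 1.0, "y": 0.0}, 0.0),   # all mass on x
     ({"x": 0.0, "y": 1.0}, 0.0)]   # all mass on y
f = lambda z: 1.0 if z == "x" else 0.0
assert E(B, f) == 0.0               # Murphy puts the mass where f is 0
assert E(B, lambda z: 1.0) == 1.0   # every element has total value 1
```

Two sets "behave the same" exactly when this function of f agrees on them.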

**Proposition 1:** *If *f∈C(X,[0,1])* then *f+:(m,b)↦m(f)+b* is a positive functional for *Msa(X)*.*

A positive functional for Msa(X) is a continuous linear function M±(X)⊕R→R that is nonnegative everywhere on Msa(X).

This suggests two more conditions besides Nonemptiness.

**Condition 2: Closure:** B=¯B

**Condition 3: Convexity:** B=c.h(B)

Why can we impose closure and convexity? Taking the closure of B wouldn't affect any expectation values, it'd only swap inf for min in some cases because (m,b)↦m(f)+b is continuous. Also, since everything we're querying our set B with is inf of a linear functional by Proposition 1, we can take the convex hull without changing anything. So, swapping B out for its closed convex hull, no expectation values change at all.

But wait, we aren't querying B with *all* linear functionals, we're *only* querying it with positive functionals that are constructed from an f in C(X,[0,1]). Or does this class of positive functionals go further than we think? Yes, it does, actually.

**Theorem 1, Functionals To Functions:** *Every positive functional on *Msa(X)* can be written as *(m,b)↦c(m(f)+b)*, where *c≥0*, and *f∈C(X,[0,1])

Nice! We actually are querying our set with all positive functionals, because we've pretty much got everything with just f∈C(X,[0,1]), and everything else is just a scalar multiple of that.

**Upper Completion and Minimal Points**

If you have a point M∈B, and some other point M∗ that's an sa-measure, we might as well add M+M∗ to B. Why? Well, given some positive functional f+ (and everything we're querying our set B with is a positive functional, by Proposition 1):

f+(M+M∗)=f+(M)+f+(M∗)≥f+(M)

By linearity and positive functionals being nonnegative on sa-measures, your new point M+M∗ has equal or greater value than M, so when we do infM∈Bf+(M), the addition of the new point didn't change anything at all, regardless of which positive functional/continuous function (by Theorem 1) we're using. So then, let's add in *all* the points like this! It's free. This would be done via Minkowski sum.

B+Msa(X)={M|M=MB+M∗,MB∈B,M∗∈Msa(X)}

**Definition 5: Upper Completion**

*The upper completion of a set *B*, *Buc*, is *Buc:=B+Msa(X)

**Condition 4: Upper Completeness:** B=B+Msa(X)

Ok, so we add in all those points. Since we're adding two nonempty convex sets, the result is also convex. As for closure...

**Lemma 2:** *The upper completion of a closed set of sa-measures is closed.*

However, B+Msa(X) isn't *quite* what we wanted. Maybe there are more points we could add! We want to add in every sa-measure we possibly can to our set, as long as it doesn't affect the essential "behavior"/worst-case values. So, we *should* be able to add in a point if every positive functional/continuous function (Proposition 1 and Theorem 1) goes "the value of the point you're looking at is undershot by this preexisting point over here". This more inclusive notion of adding points to make B as big as possible (adding any more points would start affecting the "behavior" of our set) would be:

Add a point (m′,b′) to B if, for all f in C(X,[0,1]), there's a (m,b) in B where m′(f)+b′≥m(f)+b

Actually, this gets us *nothing* over just taking the upper completion/adding Msa(X)! Check out the next result.

**Proposition 2:** *For closed convex nonempty* B,

B+Msa(X)={M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Combining Proposition 2 and Theorem 1, our notion of upper closure is exactly the same as "add all the points you possibly can that don't affect the inf(m,b)∈B(m(f)+b) value for any f".

Along with the notion of the upper completion comes the notion of a minimal point.

**Definition 6: Minimal Point**

*A minimal point of a closed nonempty set of sa-measures *B* is a point *M∈B* where, if *M=MB+M∗*, and *MB∈B*, and *M∗∈Msa(X)* then *M∗=0*. The set of minimal points is denoted *Bmin

So, minimal points can't be written as a different point in the same set plus a nonzero sa-measure. It's something that can't possibly have been added by the upper-completion if it wasn't there originally. We'll show a picture of the upper completion and minimal points (these two notions generalize to any closed subset of any closed cone), to make things more concrete.

**Theorem 2, Minimal Decomposition:** *Given a nonempty closed set of sa-measures *B*, the set of minimal points *Bmin* is nonempty and all points in *B* are above a minimal point.*

This means that we can take any point M∈B and decompose it into Mmin+M∗, where Mmin∈Bmin, and M∗ is an sa-measure. We use this a whole lot in proofs. The proof of this uses the axiom of choice in the form of Zorn's Lemma, but separability may let us find some way to dodge the use of the full axiom of choice.

**Proposition 3:** *Given a *f∈C(X,[0,1])*, and a *B* that is nonempty closed, *inf(m,b)∈B(m(f)+b)=inf(m,b)∈Bmin(m(f)+b)

So when evaluating EB(f), we can just minimize within the set of minimal points. Minimal points are the *only* thing that matters for the "behavior" of B.
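For a finite B over a finite space, the "above" relation and the minimal points can be computed directly, and Proposition 3 checked by hand (a toy sketch; the representation, tolerances, and example set are invented for illustration):

```python
# Sketch: minimal points of a finite set B of sa-measures. M is "above" M'
# when M - M' is a nonzero sa-measure; minimal points have nothing below them.

def neg_part_total(m):
    return sum(v for v in m.values() if v < 0)

def is_sa_measure(m, b):
    return b + neg_part_total(m) >= -1e-12  # small tolerance for floats

def diff(M, Mp):
    (m, b), (mp, bp) = M, Mp
    keys = set(m) | set(mp)
    return ({k: m.get(k, 0.0) - mp.get(k, 0.0) for k in keys}, b - bp)

def is_above(M, Mp):
    d, db = diff(M, Mp)
    nonzero = abs(db) > 1e-12 or any(abs(v) > 1e-12 for v in d.values())
    return nonzero and is_sa_measure(d, db)

def minimal_points(B):
    return [M for M in B if not any(is_above(M, Mp) for Mp in B)]

def E(B, f):
    return min(sum(f(x) * v for x, v in m.items()) + b for (m, b) in B)

B = [({"x": 1.0}, 0.0),
     ({"x": 1.0}, 0.5),            # first point plus (0, 0.5): not minimal
     ({"x": 1.0, "y": 0.2}, 0.0)]  # first point plus 0.2*delta_y: not minimal
assert minimal_points(B) == [({"x": 1.0}, 0.0)]
# Proposition 3: expectations only depend on the minimal points.
assert E(B, lambda z: 1.0) == E(minimal_points(B), lambda z: 1.0)
```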

**Proposition 4:** Given a nonempty closed convex B, Bmin=(Buc)min and (Bmin)uc=Buc

The set of minimal points is left unchanged when you take the upper completion, and taking the upper completion of the set of minimal points equals taking the upper completion of B. This is fairly intuitive from the picture.

**Theorem 3, Full Characterization:** *If the nonempty closed convex sets *A* and *B* have *Amin≠Bmin*, then there is some *f∈C(X,[0,1])* where *EA(f)≠EB(f)

**Corollary 1:** *If two nonempty closed convex upper-complete sets *A* and *B* are different, then there is some *f∈C(X,[0,1])* where *EA(f)≠EB(f)

Looking back at our second desideratum, it says "Our notion of a hypothesis in this setting should collapse "secretly equivalent" sets, such that any two distinct hypotheses behave differently in *some* relevant aspect. This will require formalizing what it means for two sets to be "meaningfully different", finding a canonical form for an equivalence class of sets that "behave the same in all relevant ways", and then proving some theorem that says we got everything."

And we did exactly that. Also, Theorem 3 and the other results justify the view of the minimal points as the "unique identifier" of a closed convex set. If two closed convex sets have the same minimal-point ID, then taking the upper completion gets you the same set, and they behave the same w.r.t all the queries we can throw at them. If two sets have a different minimal-point ID, then when we take the upper completion, they're different, and there's some query that distinguishes them.

**Minimal Point Conditions**

Well, these are the basics. But we can impose some more conditions. We don't really like these signed measures; it'd be nice to work *exclusively* with positive measures, if possible. The minimal points are all we really need to care about by Proposition 3, so let's require that they're all in the smaller cone Ma(X), which has no negative-measure shenanigans afoot. Renormalization may fail if there are minimal points with negative parts; this is analogous to how you can renormalize a positive measure back to a probability distribution, but a signed measure may not be renormalizable back to 1.

**Condition 5: Minimal-positivity:** Bmin⊆Ma(X)

Further, things can get a bit tricky in various places if the minimal points don't lie in some compact set. Compactness arguments let you show that you don't have to close your set after updating, and they show up a lot in our proofs of properties of belief functions. However, this next condition isn't essential, just convenient, and it's worthwhile to look at what happens when it's dropped, for future research.

**Condition 6a: Minimal-boundedness:** *There is a compact set *C* s.t. *Bmin⊆C*.*

**Proposition 5:** *Let *μ* denote an arbitrary probability distribution. If *Bmin⊆Ma(X)*, then the condition "there is a *λ⊙* where, *∀(λμ,b)∈Bmin:λ≤λ⊙*" is equivalent to "there is a compact *C* s.t. *Bmin⊆C*"*

We mostly use this formulation of Minimal-boundedness instead of the compact set requirement. We only have to bound the scale-terms on the minimal points and we have this property. Again, it's not *essential*, but very convenient.

Is there a weakening of bounded-minimals? Yes, there is. I haven't figured out what it means for the set of minimal points, but it's more mathematically essential, and I don't think it can be dropped if some other post wants to go further than we did. It can't be motivated at this point; we'll have to wait until we get to Legendre-Fenchel Duality.

**Condition 6b: Weak minimal-boundedness:** *The function *f↦EB(f)* is uniformly continuous.*

**Normalization**

So, we have almost everything. Nonemptiness, closure, convexity, upper-completeness, minimal-positivity, and minimal-boundedness/weak minimal-boundedness are our conditions so far.

However, there's one more condition. What's the analogue of renormalizing a measure back to 1 in this situation? Well, for standard probability distributions μ, Eμ(0)=0 and Eμ(1)=1. This can be cleanly ported over: we shall require that EB(0)=0 and EB(1)=1, and that's our analogue of normalization. Unpacking what the expectation means, this corresponds to: there are minimal points (λμ,b) where b is arbitrarily close to 0, and there's some minimal point (λμ,b) where λ+b=1, with no points having a lower λ+b value.

**Condition 7: Normalization:** EB(1)=1,EB(0)=0

Let's recap all the conditions. We'll be using H for something fulfilling all of the following conditions except maybe 6a.

**1: Nonemptiness:** H≠∅

**2: Closure:** H=¯H

**3: Convexity:** H=c.h(H)

**4: Upper Completeness:** H=H+Msa(X)

**5: Positive-Minimals:** Hmin⊆Ma(X)

**6a: Bounded-Minimals:** ∃λ⊙:(λμ,b)∈Hmin→λ≤λ⊙

**6b: Weak Bounded-Minimals:** The function f↦EH(f) is uniformly continuous.

**7: Normalization:** EH(0)=0,EH(1)=1

**Definition 7: (Bounded)-Infradistribution/Inframeasure**

*An inframeasure is a set of sa-measures that fulfills conditions 1-5 and 6b. An infradistribution *H* is a set of sa-measures that fulfills conditions 1-5, 6b, and 7. The "bounded" prefix refers to fulfilling condition 6a. The set of infradistributions is denoted as *□X*, the set of bounded infradistributions is denoted as *□bX*. *

Now, how do we get normalization if it isn't already present? Closure, Convexity, and Upper Completeness can be introduced by closure, convex hull, and upper completion, respectively. How do we turn an inframeasure into an infradistribution?

Well, just take every (m,b) in your set, and map it to: (1/(EB(1)−EB(0)))⋅(m, b−EB(0))

This may seem a bit mysterious. This normalization can be thought of as analogous to rescaling a utility function to be in [0,1] via scale-and-shift.

What we're doing first is shifting everything down by EB(0), which is as much as we can possibly manage without making b go negative anywhere. The utility-function analogue would be: if your expected utilities are (0.4,0.5,0.6), this is like shifting them down to (0,0.1,0.2).

The second thing we do is go "ok, what's the lowest value at 1? Let's scale that back up to 1". Well, it'd be EB(1)−EB(0) (remember, we shifted first), so we multiply everything by (EB(1)−EB(0))−1 to get our set of interest.
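The shift-then-scale recipe can be sketched on a finite space (a toy sketch using an invented finite representation; the example set B is made up for illustration):

```python
# Sketch: renormalizing a finite set B of a-measures so that E(0)=0, E(1)=1.
# Each (m, b) maps to (m, b - E_B(0)) scaled by 1/(E_B(1) - E_B(0)).

def m_of_f(m, f):
    return sum(f(x) * v for x, v in m.items())

def E(B, f):
    return min(m_of_f(m, f) + b for (m, b) in B)

def renormalize(B):
    e0, e1 = E(B, lambda x: 0.0), E(B, lambda x: 1.0)
    if e1 == e0:
        # The Proposition 6 failure case: nothing Murphy does matters.
        raise ZeroDivisionError("renormalization fails")
    s = 1.0 / (e1 - e0)
    return [({x: s * v for x, v in m.items()}, s * (b - e0)) for (m, b) in B]

B = [({"x": 0.5}, 0.1), ({"y": 0.8}, 0.1)]
Bn = renormalize(B)
assert abs(E(Bn, lambda x: 0.0)) < 1e-12         # normalization: E(0) = 0
assert abs(E(Bn, lambda x: 1.0) - 1.0) < 1e-12   # normalization: E(1) = 1
```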

Hang on, what if there's a divide-by-zero error? Well... yes, that can happen. For it to happen, we'd need EB(0)=EB(1). Does this correspond to any sensible condition which hopefully doesn't happen often?

**Proposition 6:** EB(0)=EB(1)* occurs iff there's only one minimal a-measure, of the form *(0,b)*.*

Put another way, divide by zero errors occur exactly when Murphy is like "oh cool, no matter what function they pick, the worst thing I can do is give them b value, and then nothing happens so they can't pick up any more value than that", so nothing matters at all. This is exactly like Bayesian renormalization failing when you condition on a probability-0 event (note that the measure component is 0 before rescaling). You give up and cry because, in the worst case, nothing you do matters.

**Proposition 7:** *Renormalizing a (bounded) inframeasure produces a (bounded) infradistribution, if renormalization doesn't fail.*

And we're done for now, we've made it up to infradistributions. Now, how can we analyze them?

**Legendre-Fenchel Duality**

There's a powerful way of transforming an infradistribution into another form, letting us look at the same thing in two completely different mathematical contexts. We can build up a sort of dictionary of what various concepts are in the two settings, or develop concepts in one setting and figure out what they correspond to in the other.

An example of this sort of thing is Stone Duality, where you can represent a topological space as its lattice of open sets, to translate a huge array of concepts back and forth between topology and lattice theory and work in whichever setting is more convenient. And, working with special lattices that can't always translate to topological spaces, you get locales and pointless topology! Duality theorems are highly fruitful.

**Theorem 4, LF-duality, Sets to Functionals:** *If *H* is an infradistribution/bounded infradistribution, then *h:f↦EH(f)* is concave, monotone, uniformly continuous/Lipschitz over *C(X,[0,1])*, *h(0)=0,h(1)=1*, and *range(f)⊈[0,1]→h(f)=−∞

So, expectation w.r.t an infradistribution is concave (not linear, as probability distributions are) and if f≥f′ then EH(f)≥EH(f′) (monotonicity). Paired with normalization, this means every appropriate f has EH(f)∈[0,1].

You get concavity from convexity, the −∞ thing from upper-completeness, monotonicity matches up with "all minimal points are a-measures", Lipschitz corresponds to "all minimal points have λ≤λ⊙", uniform continuity corresponds to the weak-minimal-bound condition, and h(0)=0,h(1)=1 obviously corresponds to normalization. This is moderately suggestive; the conditions we're imposing on the set side manifest as natural conditions on the concave functional we get from H.

Is there a reverse direction? How do we start with an h:C(X,[0,1])→[0,1] that fulfills the conditions, and get an infradistribution from that?

**Theorem 5, LF-Duality, Functionals to Sets:** *If *h* is a function *C(X)→R* that is concave, monotone, uniformly-continuous/Lipschitz, *h(0)=0,h(1)=1*, and *range(f)⊈[0,1]→h(f)=−∞*, then it specifies an infradistribution/bounded infradistribution by: *{(m,b)|b≥(h′)∗(m)}

*Where *h′* is the function given by *h′(−f)=−h(f)*, and *(h′)∗* is the **convex conjugate** of *h′*. Also, going from an infradistribution to an *h* and back recovers exactly the infradistribution, and going from an *h* to an infradistribution and back recovers exactly *h*.*

Another name for the convex conjugate is the Legendre-Fenchel transformation, so that's where we get the term Legendre-Fenchel Duality from. If you want, you can shorten it to LF-duality.

So, (bounded)-infradistributions are isomorphic to concave monotone normalized Lipschitz/uniformly continuous functions h:C(X,[0,1])→R. This is the LF-Duality. We can freely translate concepts back and forth between "sets of sa-measures that fulfill some conditions" and "concave functionals on C(X,[0,1]) that fulfill some conditions".

In particular, actual probability distributions correspond to linear, monotone, 1-Lipschitz, normalized functionals C(X,[0,1])→R. So, in the other half of the LF-duality, probability distributions and infradistributions are very nearly the same sort of thing; the only difference is that the former are linear and the latter are concave! This is the essential reason why so many probability theory concepts have analogues for infradistributions.

But... what does the LF-transformation actually do? Well, continuous linear functionals C(X)→R are equivalent to signed measures, and continuous linear functionals M±(X)→R (with our KR-norm) correspond to continuous functions over X. A hyperplane in M±(X)⊕R corresponds to a point in C(X)⊕R and hyperplanes in the latter correspond to points in the former. Points in H turn into hyperplanes above h, points on-or-below the graph of h turn into hyperplanes below H.

What this transform basically does is take each suitable f, check its value w.r.t h, convert the (f,h(f)) pair into a hyperplane in M±(X)⊕R, and go "alright, whichever set h came from must be above this hyperplane". Eventually all the hyperplanes are drawn and you've recovered your infradistribution as the region above the graph of the hyperplanes.
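To make this concrete, here's a numeric sketch (my own toy illustration, not from the post) of the convex conjugate on a 1-D grid: each point of a convex function becomes a hyperplane, and taking the biconjugate recovers the function, which is the points-to-hyperplanes-and-back picture in miniature.

```python
import numpy as np

# Toy 1-D Legendre-Fenchel transform: f*(m) = sup_x (m*x - f(x)),
# computed by brute force on a grid. For convex f, the biconjugate f**
# recovers f, mirroring the points-to-hyperplanes-and-back picture.
xs = np.linspace(-2.0, 2.0, 401)
ms = np.linspace(-4.0, 4.0, 401)
f = xs ** 2                                          # convex: f(x) = x^2
f_star = np.max(ms[:, None] * xs[None, :] - f[None, :], axis=1)
f_bistar = np.max(xs[:, None] * ms[None, :] - f_star[None, :], axis=1)

# Known closed form: the conjugate of x^2 is m^2 / 4
assert np.max(np.abs(f_star - ms ** 2 / 4)) < 1e-3
assert np.max(np.abs(f_bistar - f)) < 1e-3           # biconjugation recovers f
```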

Viewing □X as suitable concave functionals on C(X,[0,1]), we now have a natural notion of distance between infradistributions, analogous to total variation distance:

d(H1,H2):=supf∈C(X,[0,1])|EH1(f)−EH2(f)|

At this point, the use of our uniform-continuity condition is clearer. A uniform limit of Lipschitz functions may not be Lipschitz. However, a uniform limit of uniformly continuous functions is uniformly continuous, so the space □X is complete (has all its limit points).
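As a sanity check on this metric, here's a crude numeric estimate in a finite toy encoding of my own (an "infradistribution" is a finite list of (m,b) pairs with EH(f)=min over pairs of m·f+b, and the sup over f is replaced by random search):

```python
import numpy as np

def expect(pairs, f):
    # Toy finite-support expectation: worst case over the (m, b) pairs
    return min(m @ f + b for m, b in pairs)

rng = np.random.default_rng(0)
H1 = [(np.array([0.7, 0.3]), 0.0)]   # an ordinary distribution, as one pair
H2 = [(np.array([0.5, 0.5]), 0.0)]
# d(H1,H2) = sup over f in C(X,[0,1]) of |E_H1(f) - E_H2(f)|, by random search:
d = max(abs(expect(H1, f) - expect(H2, f))
        for f in rng.uniform(0.0, 1.0, size=(2000, 2)))
# For ordinary distributions this recovers total variation distance, 0.2 here
assert d <= 0.2 + 1e-9 and d > 0.18
```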

A lot of probability-theory concepts carry over to infradistributions. We have analogues of products, Markov kernels, pushforwards, semidirect products, expectations, probabilities, updating, and mixtures. These are most naturally defined in the concave-functional side of the duality first, and then you can conjecture how they work on sets, and you know you got the right set (modulo closure, convex hull, and upper completion) if the expectations w.r.t your set match up with the defining property in the concave functional picture. The post was getting long enough as is, so we won't cover most of them in detail, and will leave developing inframeasure theory more fully to later posts.

**Pushforwards and Mixing**

Let's look at the first one, pushforwards of a (bounded) infradistribution via a continuous g:X→Y. The standard probability theory analogue of a pushforward is starting with a probability distribution over X, and going "what probability distribution over Y is generated by selecting a point from X according to its probability distribution and applying g?"

On the positive functional level, this is defined by: (g∗(h))(f):=h(f∘g)

Let's take a guess as to what this is on the set level. The obvious candidate is: Take the (m,b) in H, do the usual pushforward of the measure via g to get a signed measure over Y, and keep the b term the same, getting a function g∗:Msa(X)→Msa(Y). If g is something that maps everything to the same point, our resulting measures will only be supported on one point, so we'd have to take upper-completion again in that case. But maybe if g is surjective we don't need to do that?

Let g∗(H) be the set produced by applying g∗ to H and taking the upper completion.

**Proposition 8:** *If f∈C(X,[0,1]) and g:X→Y is continuous, then Eg∗(H)(f)=EH(f∘g).*

**Proposition 9:** *g∗(H) is a (bounded) inframeasure if H is, and the image of H doesn't require upper completion if g is surjective.*

Proposition 8 certifies that our unique defining (g∗(h))(f)=h(f∘g) property is fulfilled by doing this to the set of sa-measures, so we have the proper set-analogue of the pushforward. And Proposition 9 says we were right: the only thing that may go wrong is upper completion. Generally, doing operations like this may not exactly make an inframeasure, but you're only a closed convex hull and upper completion away from your canonical form, as long as you check that the expectations match up with what you'd expect on the concave functional side.
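Proposition 8 can be checked numerically in a finite toy encoding (my own construction, not the post's: X has three points, Y has two, and expectations are minima over (m,b) pairs):

```python
import numpy as np

def expect(pairs, f):
    return min(m @ f + b for m, b in pairs)    # worst-case expectation

g = [0, 0, 1]                                  # g: X -> Y, merging two points

def pushforward(pairs):
    out = []
    for m, b in pairs:
        my = np.zeros(2)
        for x, y in enumerate(g):
            my[y] += m[x]                      # push each point's mass through g
        out.append((my, b))                    # the b term is kept as-is
    return out

H = [(np.array([0.5, 0.3, 0.2]), 0.0), (np.array([0.2, 0.2, 0.6]), 0.1)]
f_Y = np.array([0.8, 0.4])
f_X = np.array([f_Y[y] for y in g])            # f . g
# Proposition 8: E_{g*(H)}(f) = E_H(f . g)
assert np.isclose(expect(pushforward(H), f_Y), expect(H, f_X))
```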

Alright now, what about mixing? Like, mixing hypotheses to make a prior, according to some probability distribution ζ over the indices. The concave functional analogue is (Eζhi)(f):=Eζ(hi(f))

This works just fine with no extra conditions if we have uniform continuity, but for Lipschitzness, we need an extra condition. Letting λ⊙i be the Lipschitz constant for the functional hi, we need ∑iζiλ⊙i<∞ in order for the result to be Lipschitz. The set version should just be mixing of sets.

**Definition 8: Mixing Sets**

*Given a countable family of inframeasures Hi where i∈I⊆N, and a probability distribution ζ∈ΔI, then: *EζHi:={M|∃c∈∏iHi:∑iζic(i)=M}

*I.e., EζHi is the set of sa-measures that can be made by choosing one sa-measure from each Hi and mixing them together w.r.t. ζ.*

Try sketching out two sets on a piece of paper and figuring out what the 50/50 mix of them would be. This corresponds to "Murphy can pick whatever they want from each set, but they're constrained to play the points they selected according to the probability assigned to each Hi".

We should note that for bounded inframeasures, letting λ⊙i be the bound on the λ value of the minimal points of Hi by minimal-boundedness, we want ∑iζiλ⊙i<∞ to preserve our minimal-bounded condition for the mix.

**Proposition 10:** EEζHi(f)=Eζ(EHi(f))

**Proposition 11:** *A mixture of infradistributions is an infradistribution. If it's a mixture of bounded infradistributions with minimal point λ bounds of λ⊙i, and ∑iζiλ⊙i<∞, then the mixture is a bounded infradistribution.*

Again, Proposition 10 certifies that we got the right notion on the set level, and Proposition 11 certifies that we don't need to do any closure or upper completion cleanup. Now, how do mixtures interact with pushforwards?

**Proposition 12:** g∗(Eζ(Hi))=Eζ(g∗(Hi))

So, it doesn't matter whether you mix before or after pushing your infradistribution through a continuous function. The proof of this is quite nontrivial if we were to do it on the set level because of exhaustive verification of each of the conditions, but now that we've shown that we have the right set-level analogue of mixing and pushforwards, we can just work entirely in the concave-functional picture and knock the proof out in a couple of lines, so our duality is doing useful work.
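Proposition 10 can also be replayed in the same finite toy encoding (lists of (m,b) pairs with expectations as minima — my construction, not the post's):

```python
import numpy as np

def expect(pairs, f):
    return min(m @ f + b for m, b in pairs)    # worst-case expectation

H1 = [(np.array([0.6, 0.4]), 0.0), (np.array([0.3, 0.3]), 0.2)]
H2 = [(np.array([0.5, 0.5]), 0.0), (np.array([0.1, 0.8]), 0.05)]
zeta = [0.3, 0.7]
# Definition 8: every way of picking one sa-measure from each set and mixing
mix = [(zeta[0] * m1 + zeta[1] * m2, zeta[0] * b1 + zeta[1] * b2)
       for m1, b1 in H1 for m2, b2 in H2]
f = np.array([0.9, 0.2])
# Proposition 10: E_{mix}(f) = zeta-weighted mix of the E_{H_i}(f)
assert np.isclose(expect(mix, f),
                  zeta[0] * expect(H1, f) + zeta[1] * expect(H2, f))
```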

**Updating**

Let's move on to updates. We should mention that we'll be taking a two-step view of updating. First, there is chopping down the measure so that only the measure within the set we're updating on remains, and then there's renormalizing the measure back up to 1. Thinking of updating as a single unitary process instead of as separated into these two phases will confuse you.

First, what sorts of events can we update on? Well, we only have expectations of continuous functions, and in a classical setting, we can get probabilities from expectations (and then updates from that) by considering the expectation of an indicator function that's 1 on a set and 0 everywhere else. Sadly, in this setting, we can only take the expectation of a continuous function. A sharp indicator function for an event will be discontinuous unless the set we're updating on is clopen. Fortunately, the specific application of "conditioning on a finite history" (with discrete action and observation spaces) only involves conditioning on clopen sets, because the set of histories which have a certain finite history as a prefix are clopen.

But for more generality, with continuous sets of observations, we have a problem. Our restriction to only being able to update on *clopen* sets seems rather severe.

But, really, this isn't a problem. If you flip it around from "we can only update on clopen sets" to "huh, we can update on any continuous indicator function, we're basically updating on fuzzy sets now", things get more interesting. Like, maybe you observe the position of something through a noisy sensor? That'd be a fuzzy update. Also, remember how, in logical induction, since the traders weren't allowed to make discontinuous trades, MIRI had to work with fuzzy sets. We can justify this (and the restriction to only being able to do sharp updates) via topology.

In many cases, open sets correspond to properties that can be verified if true, but you might be out of luck if it's false. As an example, (0,∞) is open because, if you're reading the output bits one at a time, and you get 0.0000..., but it shifts to not-zero bits, you can verify that, but if it's 0.0000... forever, you're out of luck.

Conversely, closed sets correspond to properties that can be disproved if false, but you might be out of luck if it's true. (−∞,0] is an example of this, just consider getting 0.0000... and it continues forever.

Doing a crisp (not fuzzy) update requires ruling out everything outside the set we update on, and ruling in everything inside the set we update on, so this is an intuition for why we can update in the usual way on clopen sets, and can't do crisp updates on anything else. Clopen sets correspond to "verifiable if true, falsifiable if false".

Let's formalize how fuzzy updates work. A "likelihood function" is a continuous L:X→[0,1]. We interpret it as the indicator function for a fuzzy set.

A nifty feature of a continuous likelihood function L is that it lets us glue two continuous functions f,g∈C(X,[0,1]) into another continuous function.

**Definition 9: L-Gluing**

*Given three continuous functions L,f,g:X→[0,1], f glued to g via L, written f★Lg, is the function defined as: *f★Lg:=Lf+(1−L)g

So, f★Lg behaves like f inside the region we're updating on, behaves like g outside that region, and gets mixed in regions we're unsure of.
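On a finite grid, gluing is just a one-liner (toy numbers of my own):

```python
import numpy as np

L = np.array([1.0, 1.0, 0.5, 0.0, 0.0])   # fuzzy indicator: in, in, unsure, out, out
f = np.array([0.9, 0.8, 0.6, 0.3, 0.1])
g = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
glued = L * f + (1 - L) * g               # Definition 9: f inside, g outside
assert np.allclose(glued, [0.9, 0.8, 0.4, 0.2, 0.2])
```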

Now, let's try to motivate how updates work. Consider a toy example where there is a set of environments E. There is an agent with a fixed utility function which will observe some sensory data at the start, and select what to do. π⋅e is the distribution over outcomes produced by the policy π interacting with the environment e.

Suppose the agent observes the sensory data ρ at the start. Let L:E→[0,1] be the function e↦Pe(ρ) which gives the probability of observing ρ in each environment. Having observed ρ, the agent no longer has any control over what it would do had it not observed ρ, so let π¬ρ, the portion of its policy that is defined when it does not observe ρ, be fixed at whatever it is. The agent now must optimize over its choice of πρ, the portion of its policy that is defined when it observes ρ.

Let g:supp(1−L)→[0,1] be the function mapping an environment to the agent's utility conditional on ¬ρ, under the assumption that it follows the fixed policy π¬ρ; that is, g(e):=Eπ¬ρ⋅e(U|¬ρ)

For each possible choice of on-history policy πρ, we get a corresponding function f:supp(L)→[0,1] as the agent's utility conditional on ρ if it follows the policy πρ.

So, how do we design an update to get dynamic consistency? Well, just update such that the expectation of f post-update is the same as the expectation of f★Lg pre-update. The reason for this is that f★Lg rewrites as:

(f★Lg)(e)=L(e)f(e)+(1−L(e))g(e)

=Pe(ρ)⋅Eπρ⋅e(U|ρ)+Pe(¬ρ)⋅Eπ¬ρ⋅e(U|¬ρ)=Eπ⋅e(U)

So, your choice of πρ affects f, *that* can freely vary, but you can't control the probability of the observation (corresponding to your L), or your off-observation policy π¬ρ (corresponding to your g).

Going back to the positive functional picture with h (our analogue of an expectation), we can thus try to define an update relative to L (likelihood function) and g (off-event value), as:

hgL(f):=h(f★Lg), where f∈C(cl(supp(L)),[0,1]) (cl denoting topological closure)

But wait, this doesn't work; the resulting thing isn't normalized! This is more like a raw update (probability-theory analogue: just chopping down your measure) with no renormalization back to 1 afterwards. By monotonicity, we know that ∀f:h(0★Lg)≤h(f★Lg)≤h(1★Lg). Our renormalization is "subtract until h(0)=0, then scale until h(1)=1", so our update should *actually* be "do that, then renormalize".

Attempt 2: (h|gL)(f):=(1/(hgL(1)−hgL(0)))(h(f★Lg)−hgL(0))=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))

Squinting at this, this is: "do a raw update. Then subtract the excess value off (first step of renormalizing), and then rescale according to h(1★Lg)−h(0★Lg) which... well, ignoring the g part, looks kinda like "expectation of 1 - expectation of 0"... oh, it's like using the gap between the measure on X and the measure on ∅ to define how much you need to scale up a measure to be a probability distribution! That term on the bottom is the analogue of the probability of the event you're updating on, relative to the function g!"

Let's flesh this out a bit more in preparation for defining updates on the set side of the duality.

Backing up to our definition of renormalization, we required that EH(1)=1, and EH(0)=0. If we pivot to viewing the function inside the expectation as our L that gives an indicator function for something, the normalization condition says something like "probability of X is 1, probability of ∅ is 0".

Let's generalize this and define probabilities of fuzzy events relative to a function g.

**Definition 10: Probabilities of Fuzzy Sets**

*Given two functions g,L∈C(X,[0,1]), and a nonempty set of sa-measures B, the probability of L (interpreted as a fuzzy set) w.r.t B and g is: *PgB(L):=EB(1★Lg)−EB(0★Lg)

If g is 0, then for an infradistribution H, by normalization, and unpacking what that ★ means, we get P0H(L)=EH(L). This is much like how the probability of a set is the expectation of the indicator function for the set, just interpret the L as the indicator function for a fuzzy set on the right, and as a fuzzy set on the left. So, for g=0, it behaves like an actual probability.

However, when g≠0, then this is better interpreted as caring-measure. PgB(L) is the difference between the best score you can get vs Murphy and the worst score you can get vs Murphy if you know how things go outside of L (interpreted as a fuzzy set). This g-dependent "probability" is actually "how much value is at stake here/how much do I care about what happens inside set L".

And, further, our scaling term for renormalizing an inframeasure B that isn't an infradistribution yet was (EB(1)−EB(0))−1. So, using this reformulation, our rescaling term turns into (PgB(1X))−1 regardless of g. So, our renormalization term is "rescale according to the probability you assign to any event at all occurring".

Alright, we have enough tools to define updating on the set level. For the raw update (no rescaling yet), we need to chop the measure down according to L. We should also fold in the off-L value (requires specifying g) to the b term, by our dynamic consistency example. And then we do appropriate scale-and-shift terms to subtract as much as we can off the b term, and rescale according to the probability of L relative to the g we're also updating on.

Let's use m⋅L to denote the original finite signed measure but scaled down by the function L. If we view L as an indicator function, this is just slicing out all of the measure outside of L, ie, usual conditioning.

**Definition 11: Updating**

*H|gL, H updated on g and L, is the set made by mapping H through the following function and taking closure and upper-completion:* (m,b)↦(1/PgH(L))(m⋅L, b+m(0★Lg)−EH(0★Lg))

Closure is unnecessary if H is a bounded inframeasure, and upper-completion is unnecessary if your L is the indicator function for a clopen set.

Roughly, this is: Chop down the measure according to your fuzzy set, m(0★Lg) is the fragment of expected value you get outside of your fuzzy set so you fold that into the b term. For rescaling, when we unpack EH(0★Lg), it's just inf(m′,b′)∈H(m′(0★Lg)+b′), so that's the maximum amount of value we can take away from the second vector component without ever making it go negative. And then rescale "how much do I care about this situation" (PgH(L)) back up to 1.

We'll now show that in the special case where our H has a single minimal point of the form (μ,0) (a conventional probability distribution), this is just normal Bayesian updating. Updating an infradistribution with one minimal point is the same as just updating that one minimal point and taking the upper completion. And let's take 1S, the indicator function for a set S⊆X, as our L. So, by the definition of update, (μ,0) updates to

(1/PgH(1S))(μ⋅1S, 0+μ(0★1Sg)−EH(0★1Sg))

By unpacking the definition of PgH(1S), we get

(1/(EH(1★1Sg)−EH(0★1Sg)))(μ⋅1S, μ(0★1Sg)−EH(0★1Sg))

Now, because (μ,0) is the only minimal point, it is the thing that determines all the expectation values w.r.t the set H, so EH(f)=μ(f)+0 for all f, and we get

(1/(μ(1★1Sg)−μ(0★1Sg)))(μ⋅1S, μ(0★1Sg)−μ(0★1Sg))

Canceling, and noting that f★1Sg unpacks as 1Sf+(1−1S)g, we have:

(1/(μ(1S+(1−1S)g)−μ((1−1S)g)))(μ⋅1S, 0)

Using linearity, and canceling, we have:

(1/μ(1S))(μ⋅1S, 0)

Now, μ(1S) is the same as Eμ(1S) which is the same as Pμ(S), so we have

(1/Pμ(S))(μ⋅1S, 0)

Which is just (μ|S,0). So, updating our probability distribution in this way produces... exactly a conventional update.
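The derivation above can be replayed numerically in a finite toy encoding of my own: a single minimal point (μ,0), a sharp indicator 1S, and an arbitrary off-event g, recovering ordinary conditioning.

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.3, 0.4])        # the single minimal point (mu, 0)
S = np.array([0, 0, 1, 1], dtype=bool)     # update on the sharp set S
one_S = S.astype(float)
g = np.array([0.5, 0.7, 0.0, 0.0])         # off-event value; cancels out below

def glue(f):                               # f glued to g via L = 1_S
    return one_S * f + (1.0 - one_S) * g

E = lambda f: mu @ f                       # E_H determined by the one minimal point
P = E(glue(np.ones(4))) - E(glue(np.zeros(4)))   # P^g_H(1_S)
m_new = (mu * one_S) / P                   # updated measure; the b term cancels to 0

assert np.isclose(P, mu[S].sum())          # the "probability" is just mu(S)
assert np.allclose(m_new[S], mu[S] / mu[S].sum())   # ordinary Bayesian update
```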

Note that this updating process lands us in a different vector space. Our new vector space is M±(cl(supp(L)))⊕R, where cl denotes topological closure. It still fulfills the nice properties we expect, because a closed subset of a compact space is compact, so every nice property still carries over. And it still has a closed convex cone of sa-measures w.r.t. the new space; abbreviate that as Msa(L).

What properties can we show about updates?

**Proposition 13:** *When updating a bounded infradistribution over Msa(X), if the renormalization doesn't fail, you get a bounded infradistribution over the set Msa(L). (For infradistributions in general, you may have to take the closure.)*

**Proposition 14:** EH(f★Lg)=EH(0★Lg)+PgH(L)⋅EH|gL(f)

Ok, so it's sensible on the set level. And for proposition 14, it means we can break down the expectation of two functions glued together by L into the expectation of g outside of L, and the probability of L relative to g times the updated expectation of f. We get something interesting when we reshuffle this. It rearranges to:

EH|gL(f)=(1/PgH(L))(EH(f★Lg)−EH(0★Lg))

Further unpacking the probability, we get

EH|gL(f)=(EH(f★Lg)−EH(0★Lg))/(EH(1★Lg)−EH(0★Lg))

And then, translating to concave functionals via LF-duality, we get:

(h|gL)(f)=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))

So, this shows that we got the right update! What happens when we update twice?

**Proposition 15:** (H|gL)|g′L′=H|g″(LL′), where g″=g★((1−L)/(1−LL′))g′, i.e. the gluing likelihood is (1−L)/(1−LL′)

So, updating twice in a row produces the same effect as one big modified update. It may be a bit clearer if we express it as:

**Corollary 2:** Regardless of L and L′ and g, (H|gL)|gL′=H|g(LL′)

**Corollary 3:** *If Y and Z are clopen sets, then, glossing over the difference between indicator functions and sets, *(H|gY)|gZ=H|g(Y∩Z)

Now, what happens when you update a prior/mixture of infradistributions? We get something that recovers Bayesian updating.

**Theorem 6, InfraBayes:**

(EζHi)|gL=Eζ(PgHi(L)⋅(Hi|gL))/Eζ(PgHi(L))*, if the update doesn't fail.*

This means that when we update a prior, it's the same as updating everything individually, and then mixing those with probabilities weighted by the probability the infradistribution assigned to L, just like standard Bayes!

In particular, if some hypothesis goes "nothing matters anymore" and gives up and cries after seeing L, then its probability term is 0, so it will drop out of the updated prior entirely, and now you're only listening to hypotheses that think what you do matters. Thus, with a fairly broad prior, we don't have to worry about the agent giving up on life because nothing matters post-update, just as long as *some* component in its prior gives it the will to continue living/says different policies have different values. Well, actually, we need to show an analogue of this for belief functions, but it pretty much works there too.

**Additional Constructions**

There are more probability theory analogues, but they are primarily material for a future post. We'll just give their forms in the concave functional view. If they look unmotivated, just note that they match up with the standard probability-theory notions if we use infradistributions corresponding to actual probability distributions. We'll be using lambda-notation for functions. If you haven't seen it before, λx.f(x,2) is the function that takes in an x, and returns f(x,2). λz.(λa.a+z) is the function that maps z to the function that maps a to a+z.

**Definition 12: Product**

*If h1∈□X and h2∈□Y, the product h1×h2∈□(X×Y) is given by: *(h1×h2)(f):=h1(λx.h2(λy.f(x,y)))

These products are noncommutative! The product of bounded infradistributions is a bounded infradistribution.

There's also infrakernels, the infra-analogue of Markov kernels. A Markov kernel is a function X→ΔY that maps each x to some probability distribution over Y. Concrete example: the function mapping income to a probability distribution over house size.

**Definition 13: Infrakernel**

*An infrakernel is a function X→□Y that is:*

*1: Pointwise convergent. For all sequences xn limiting to x and all f∈C(Y,[0,1]), limn→∞K(xn)(f)=K(x)(f)*

*2: Uniformly equicontinuous. For all ϵ, there is a δ where if |f−f′|<δ, then ∀x:|K(x)(f)−K(x)(f′)|<ϵ*

*If there is some λ⊙>0 where property 2 works with δ=ϵ/λ⊙, then it is a Lipschitz infrakernel.*

The first two conditions let you preserve uniform continuity, and the strengthening of the second condition lets you preserve Lipschitzness/being a bounded inframeasure.

Now, we can define the semidirect product, h⋉K. The semidirect product in the probability-theory case is... Consider the aforementioned Markov kernel mapping income to a probability distribution over house size. Given a starting probability distribution over income, the semidirect product of (income distribution) with (house size kernel) would be the joint distribution over income and house size. It's a critically important concept that isn't discussed much.

**Definition 14: Semidirect Product**

*If h∈□X, and K is an infrakernel X→□Y, the semidirect product h⋉K∈□(X×Y) is given by: *(h⋉K)(f):=h(λx.K(x)(λy.f(x,y)))

Products are just a special case of this, where K(x)=h2, regardless of x. If h is a bounded infradistribution and K is a Lipschitz infrakernel, then the semidirect product is a bounded infradistribution.
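In the linear (ordinary probability) special case, the semidirect product formula reduces to nested expectations over a Markov kernel; here's a toy check with made-up numbers:

```python
import numpy as np

h = np.array([0.4, 0.6])          # distribution over X = {0, 1}
K = np.array([[0.9, 0.1],         # K(0): distribution over Y
              [0.2, 0.8]])        # K(1)
f = np.array([[1.0, 0.0],         # f(x, y) on X x Y
              [0.5, 0.25]])

inner = (K * f).sum(axis=1)       # lambda x. E_{K(x)}(lambda y. f(x, y))
semidirect = h @ inner            # the semidirect product formula

joint = h[:, None] * K            # the explicit joint distribution on X x Y
assert np.isclose(semidirect, (joint * f).sum())
```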

The pushforward of a probability distribution is... given the house size Markov kernel, and a distribution over income, the pushforward is the induced distribution over house size. We earlier gave pushforwards for a continuous function g:X→Y. What's the analogue of the pushforward for an infrakernel? Or, can you do a pushforward w.r.t. a Markov kernel?

**Definition 15: Pushforward**

*If h∈□X, and K is an infrakernel X→□Y, the pushforward K∗(h)∈□Y is given by: *K∗(h)(f):=h(λx.K(x)(f))* And if k is a continuous (in the KR-metric) Markov kernel X→ΔY, the pushforward k∗(h)∈□Y is given by: *k∗(h)(f):=h(λx.Ek(x)(f))

An interesting note about this is, if h is a bounded infradistribution, then we need Lipschitz infrakernels to preserve that property for pushforwards, but we do not need any additional condition on a Markov kernel to preserve boundedness besides continuity. Exercise: Try to figure out why.

Really, everything originates from the semidirect product. The product is the special case of a semidirect product for a constant infrakernel, the pushforward is a semidirect product that's projected down to the Y coordinate, the pushforward w.r.t a Markov kernel is a special case of pushforward w.r.t. an infrakernel, and the pushforward w.r.t. a continuous function is a special case of pushforward w.r.t. a Markov kernel.

And that's about it for now, see you in the next post!


### Belief Functions And Decision Theory

{font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: 
MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

So, if you haven't read the first two posts, do so now.

In this post, we'll be going over the basic theory of belief functions, which are functions that map policies to sets of sa-measures, much like how an environment can be viewed as a function that maps policies to probability distributions over histories. Also, we'll be showing some nifty decision theory results at the end. The proofs for this post are in the following three posts (1,2,3), though they're quite difficult and it's inessential to read them.

Now, it's time to address desideratum 1 (dynamic consistency) and desideratum 3 (how do we formalize the Nirvana trick to capture policy selection problems) from the first post. We'll be taking the path where Nirvana counts as infinite reward, instead of counting as 1 reward forever. The latter option is cleaner mathematically, ties into learning theory better, and makes it clearer what a pseudocausal hypothesis is, but this post was long enough as-is, so we'll only present the path where Nirvana counts as infinite reward, as that allows for more flexibility in encoding policy selection problems and a full tie-in with UDT.

**Definition 1: Nirvana**

*A special observation *N*, which may occur at any time, counts as infinite reward and brings the history to an end. An a-measure or sa-measure may be described as Nirvana-free when the set of histories ending in Nirvana has measure 0. The set of Nirvana-free a-measures is denoted as *NF*.*

Our ambient mathematical setting is one where, at each point, there are finitely many discrete actions or observations available (no continuous observation spaces here, though the available space of actions or observations may vary depending on the history so far); in any situation, Nirvana may occur, and it ends the history; and all policies are deterministic (refer back to the first post to see why this isn't much of an imposition).

We'll be working with bounded inframeasures (like a bounded infradistribution in all ways except normalization), instead of inframeasures, because it gets harder if we don't. Also, we'll be working within the cone of a-measures, but use the notion of upper-completion that uses the cone of sa-measures. This odd restriction is in place because Nirvana doesn't interact well with negative measures, as that would be −∞ reward and Murphy minimizes your reward, which forces a lot of awkward kludges in various settings.

If we had Nirvana count as 1 reward forever, we could drop almost all of these issues, but we're trying to replicate UDT here. When cramming a UDT problem into this setting, we do it by hard-coding every possible policy into an environment that depends on a policy, which produces a family of environments, and add the constraint that you attain Nirvana if you violate the hard-coded policy.

In order for Murphy to go "dang, I'd better make the hard-coded policy match the agent's own" (simulating a policy-dependent environment), you need the loss incurred by the environment mispredicting you to be smaller than the gain in value from attaining Nirvana when the environment mispredicts you. Infinite reward gets this easily, while 1 reward forever isn't quite potent enough to stamp out mispredictions in all circumstances; it's limited to problem setups where predictors are predicting you in circumstances that don't have a tiny probability of arising and don't happen too far in the future.

**Notation**

Feel free to skip this part and refer back to it as necessary, or open up a separate tab so you don't have to scroll all the way back up to this. A lot of it won't make sense now; it's just a reference. We have all the notation from the previous notation section, in addition to:

h: A finite history, ending in an observation. Similarly, a is an action.

γ: A number in [0,1), the time-discount parameter. This is used to define the distance between histories (for defining the KR-metric), as well as defining the distance between partial policies. We can also use γ as a superscript in a utility function to specify the time discount.

Π+,Π: The space of partial policies and full policies, respectively. All policies are deterministic. A partial policy could be thought of as a full policy, except that what it does may become undefined in some situations. Partial policies make a poset where the ordering is given by one partial policy being an extension of another. Full policies are maximal in this poset. There's also a distance metric between policies and partial policies, given by γt where t is the shortest time where they behave differently (whether by disagreeing with each other on what to do, or by one policy being defined where the other isn't).

πst,πpa,π: A policy-stub, partial-policy, and full policy, respectively. Full policies are pretty obvious. A policy-stub is a partial policy which is guaranteed to be undefined after some finite time, though the time at which it becomes undefined may be history-dependent. A partial policy is the most inclusive concept, including both policy stubs and full policies, and is a tree of actions and observations that dead-ends at histories where you don't specify what is to be done after that history.

π∅: The empty policy/empty stub. The maximally undefined partial policy that specifies absolutely nothing about what the policy does.

F(πpa),FNF(πpa): The outcome set of a partial policy. All the ending observation leaves on the tree given by πpa. This has the property that no matter what, if you play that partial policy, you'll get one of those outcomes. If πpa is infinite down some paths, then full outcomes (infinite histories) down those paths are also taken to be in this set. FNF(πpa) is the same, but omits every Nirvana observation.

Ma(F(πpa)),Msa(F(πpa)): The a-measures and sa-measures over the set F(πpa). An important part of this is that Msa(F(πpa)) can't assign negative measure to any Nirvana event, which is an important limitation, otherwise we could have −∞ expected value and the agent would give up and cry since it plans for the worst-case.

prπhipa,πlopa∗: The function with type signature Ma(F(πhipa))→Ma(F(πlopa)) (or the Nirvana-free variant) defined as the pushforward of the function mapping each bitstring in F(πhipa) to its unique prefix in F(πlopa), which happens exactly when πhipa≥πlopa. To be more concrete, the function that takes in a measure on 10-bit bitstrings and crunches it down to a measure on 8-bit prefixes is an instance of this sort of projection.

e: An environment.

Θ: A belief function, mapping each πpa to a set of a-measures over F(πpa). If these fulfill enough conditions, we call them hypotheses.

NF: The set of Nirvana-free a-measures for a given F(πpa), i.e. everything where the measure component doesn't assign nonzero measure to any Nirvana observation. Technically, there's a dependency on πpa, but we suppress this in the notation since the definition of the set is pretty much the same for all the πpa.

λ⊙,b⊙: Minimal points of a-measures can be broken down as (λμ,b) where λ,b≥0 and μ is a probability distribution. These are the minimal upper bounds on λ and b for all the Θ(πpa) sets, relative to a given Θ.

π¬h: Some partial policy which specifies everything the agent does except for what it does after history h.

EΘ(πpa)(f): Like our standard expectations relative to a set of sa-measures, just remember that Nirvana means infinite reward, so this can be rephrased as EΘ(πpa)∩NF(f).

PgΘ,π¬h(h): The quantity maxπ>π¬hEΘ(π)(1★hg)−EΘ(π¬h)(0★hg). This is the rescaling term in updates; it can be (very) roughly thought of as the probability of getting the finite history h relative to an off-h policy and a belief function.

¯¯¯¯¯¯¯¯c.h: The closed convex hull of a set.

Ma(∞): The space of a-measures on all infinite histories (no Nirvana in here, it'd end the history early). Obeying a deterministic policy only lets you reach a certain portion of this space.

{≤⊙}: Given a Θ and the associated λ⊙ and b⊙ upper bounds on the minimal points of all Θ(πpa), this is the set of all a-measures over F(πpa) where λ+b≤λ⊙+b⊙. Again, like NF, there's an implicit dependence on πpa we're suppressing here. Used to clip away part of our set for some arguments.

of: An outcome function. A function mapping each πpa to a point in Θ(πpa) such that prπhipa,πlopa∗(of(πhipa))=of(πlopa).

SM: An a-surmeasure or sa-surmeasure. Like an a-measure or sa-measure, except that the measure component can specify that some Nirvana events occur with arbitrarily-small-but-nonzero measure, where this quantity is denoted as 0+.

S: A set of a-environments/a-survironments. They can be written as (λe,b), where b is our constant term, λ is "how much we care about the environment", and e is an environment.

Iπlopa,πhipa∗: There's also an I∗s variant. This is the Nirvana injection up: a measure is injected up by going "if we need to extend a piece of measure on a history h to longer histories, we just say that h is guaranteed to end in Nirvana after you take an action". I∗s does the same thing, except that it also assigns 0+ measure (instead of the usual 0 measure) to every Nirvana event outside of F(πlopa).

↑,↓: The functions mapping a Θst or Θω defined only over policy stubs or full policies to a Θ over all partial policies, via:

↑(Θst)(πpa):=⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))

↓(Θω)(πpa):=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θω(π)))

→ω,→st: The functions taking a Θ and restricting it to be only defined over full policies or only defined over stubs.

→c,→sc: The functions taking a hypothesis over stubs Θst that's pseudocausal/acausal and translating it to a causal/surcausal hypothesis over stubs, via:

→c(Θ)(πst):=¯¯¯¯¯¯¯¯c.h(⋃πlost≤πstIπlost,πst∗(Θst(πlost)))

And similar for →sc, just swap I∗ for I∗s.

→NF: The function taking a causal/surcausal hypothesis Θ and translating it to a pseudocausal/acausal hypothesis by just intersecting everything with NF, ie, clipping away every a-measure with nonzero measure on some Nirvana event.

π¬h∙πpa: The partial policy made by taking π¬h which specifies everything except what happens after observation h, and going "obey πpa after observation h". Just glue the two partial policies together.

c: The bookkeeping function which clips off the h prefix from all the histories in your measure after you update on h. Technically, there's an h dependence on this, but we suppress it in the notation.

gh: The adjusted function g over histories to compensate for clipping off the h prefix on histories after updating, via gh(h′):=g(hh′).

R: When used as a superscript, it means something has been renormalized, when used as normal, it means regret.

U: Some utility function. Uγ denotes a utility function with time discount parameter γ.

**Foundations:**

**Definition 2: A-history**

*An a-history is a finite history ending with an action.*

**Definition 3: O-history**

*An o-history is a finite history ending with an observation, or an infinite history. The empty history will also count as an o-history.*

**Definition 4: Partial Policy**

*A partial policy *πpa* is a partial function from finite o-histories (that can be continued with an action) to actions, s.t. if *πpa(h)* is well-defined, and *h′a* is an a-history that is a prefix of *h*, then *πpa(h′)=a*. The set of these is denoted *Π+*.*

Note that in order to specify the behavior of a partial policy in a situation, you need to specify its behavior on all prefixes of the situation; i.e., the policy must act in such a way that it's *possible* for the situation to come about. Also, because you can only specify one action in any given situation, the policy can't be defined on any o-history where the past actions don't match up with what the partial policy says to do.
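As a concrete illustration (not from the original post), here's a minimal Python sketch: a partial policy is encoded as a dict from o-histories, represented as tuples (a0, o0, a1, o1, ...), to actions, and we check the prefix condition of Definition 4. The encoding and action/observation labels are illustrative assumptions.

```python
def is_partial_policy(pi):
    """Check Definition 4: whenever pi(h) is defined, every a-history
    prefix h'a of h must satisfy pi(h') == a (so the policy is also
    defined on all prefixes, and agrees with the actions already taken)."""
    for h in pi:
        # even indices of h hold actions: h[i] is the action taken
        # right after the o-history h[:i]
        for i in range(0, len(h), 2):
            if pi.get(h[:i]) != h[i]:
                return False
    return True
```

For example, a dict that specifies an action at ('L', 'x') without specifying anything at the empty history fails the check, as does one whose root action makes the longer history impossible.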

**Definition 5: Policy Stub**

*A policy stub *πst* is a partial policy for which there is a finite *n* where, if *|h|>n*, then *πst(h)=⊥*. It is undefined beyond a certain time.*

**Definition 6: Full Policy**

*A full policy *π* is a partial policy that cannot have its domain where it is defined enlarged, without ceasing to be a partial policy. The set of these is *Π*. This is the same as a policy in the usual sense of the word.*

So, visualizing a partial policy as a big tree, policy stubs are the finite trees, full policies are the trees infinite down every branch, and partial policies are the all-encompassing category which also include trees that are infinite down some branches but not others.

**Definition 7: Empty Policy**

*The empty policy *π∅* is the partial policy/policy stub that is completely undefined. For all o-histories *h*, even including the empty history, *π∅(h)=⊥*.*

We can equip Π+ with a distance metric. The distance metric is γt(πpa,π′pa), where γ<1, and t(πpa,π′pa) is the "time of first difference", the length of the shortest o-history where the action of πpa doesn't match up with the action of π′pa, or where one of the πpa is defined and the other one isn't.
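The metric can be sketched in a few lines of Python. This is a hypothetical illustration where a partial policy is a dict from o-history tuples (a0, o0, a1, o1, ...) to actions, time is measured in full action-observation steps, and γ = 0.5 is an arbitrary choice:

```python
GAMMA = 0.5  # illustrative discount; any gamma < 1 works

def first_difference_time(pi1, pi2):
    """Timestep of the shortest o-history where the two partial
    policies disagree, or where one is defined and the other isn't."""
    t = float('inf')
    for h in set(pi1) | set(pi2):
        if pi1.get(h) != pi2.get(h):       # disagree, or one undefined
            t = min(t, len(h) // 2)        # two tuple entries per timestep
    return t

def policy_distance(pi1, pi2, gamma=GAMMA):
    t = first_difference_time(pi1, pi2)
    return 0.0 if t == float('inf') else gamma ** t
```

Two policies that agree forever are at distance 0; two that differ at the very first action are at the maximum distance γ^0 = 1.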

We can also equip Π+ with a partial order. The partial order is πhipa≥πlopa if the two policies never disagree on which action to take, and πhipa is defined on more histories than πlopa is. inf can be taken of arbitrary collections of partial policies (intersection), and sup (union) can be taken of arbitrary collections of partial policies as long as they're all below some full policy π.

The full policies are the maximal elements in this poset, the empty policy is the bottom element in this poset, and every partial policy πpa is associated with a sequence of stubs πnpa that are like "specify all of what this partial policy does up until time n, then be undefined afterwards". The chain πnpa ascends up towards πpa, and has the property that any stub πst≤πpa has some πnpa where πst≤πnpa.
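The stub chain πnpa is just truncation. A hypothetical sketch, again encoding a partial policy as a dict from o-history tuples (a0, o0, ...) to actions:

```python
def truncate(pi, n):
    """The stub pi^n: keep pi's choices only before timestep n
    (o-histories with fewer than 2n tuple entries), undefined after."""
    return {h: a for h, a in pi.items() if len(h) < 2 * n}
```

Each truncate(pi, n) lies below pi in the partial order, and the chain ascends back up to pi as n grows.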

**Outcome Sets and Projection Mappings**

The "outcome set" of a partial policy, F(πpa), is... well, let's start with a visualization and then give the formal definition.

Consider the branching tree of alternating actions and observations. The partial policy fixes a subtree of this tree, where every child of an observation node is an action given by πpa, and you extend no further if the policy becomes undefined at an observation. Then the outcome set F(πpa) would be all the leaves and infinite paths of that tree. If you follow the behavior of the partial policy, your history is *guaranteed* to go through one of the leaves. Thus, if your partial policy interacts with an environment, you'd get a probability distribution over F(πpa) (for the history is guaranteed to have something in F(πpa) as a prefix). Observe the picture.

**Definition 8: Outcome Set**

*The outcome set *F(πpa)* is the set of o-histories *h* that are not in the domain of *πpa*, but all prefix a-histories *h′a* of *h*, have *a=πpa(h′)*. By this definition, *F(π∅)* contains one element, the empty history. *FNF(πpa)* is the subset of *F(πpa)* consisting of all o-histories which don't end in Nirvana.*

F(πpa) and FNF(πpa) will take the role of our arbitrary compact metric spaces X that we defined infradistributions over. The metric is d(h,h′)=γt(h,h′), where γ<1 and t(h,h′) is the first timestep where the histories differ. So two histories are close if they agree for a long time before differing.
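For a finite stub, the outcome set can be enumerated directly. A minimal sketch under illustrative assumptions (a policy stub as a dict from o-history tuples (a0, o0, ...) to actions, and a fixed two-element observation set):

```python
def outcome_set(pi, observations=('x', 'y')):
    """F(pi) for a policy stub pi: the o-histories where pi becomes
    undefined, reached by following pi's own actions from the root."""
    leaves = []
    def walk(h):
        if h not in pi:            # pi undefined here: h is an outcome
            leaves.append(h)
            return
        a = pi[h]                  # the action pi dictates at h
        for o in observations:     # branch over every next observation
            walk(h + (a, o))
    walk(())
    return leaves
```

Any history produced by playing the stub is guaranteed to pass through exactly one of these leaves, which is why an environment plus a stub yields a probability distribution over this set.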

However, there's a technicality we need. While Ma(F(πpa)) and Ma(FNF(πpa)) and Msa(FNF(πpa)) work exactly as you'd expect, Msa(F(πpa)) can't contain signed measures that assign negative mass to a Nirvana event. This is because a positive functional (that respects Nirvana) would think that such a measure may have −∞ value. And requiring that the amount of Nirvana-measure is equal to or greater than the amount of negative-Nirvana measure interacts badly with updates. This issue causes minor kludges all over the place; we'll reiterate that Nirvana counting as 1 reward is much more elegant.

We'll also need the notion of a projection (this is going to be our analogue of the g∗ mapping from the basic theory).

**Definition 9: Projection Mapping**

*If *πhipa≥πlopa*, then the function *prπhipa,πlopa∗:Msa(F(πhipa))→Msa(F(πlopa))* is defined by:*

prπhipa,πlopa∗(m,b):=(m′,b)

*Where, if *h* is a prefix of something in *F(πlopa)*, then *m′(h)=m(h)*. This function also maps a-measures to a-measures and Nirvana-free stuff to Nirvana-free stuff.*

To expand more on this, if πhipa≥πlopa, then there's a natural surjection from F(πhipa) to F(πlopa) (which also acts as a surjection from FNF(πhipa) to FNF(πlopa)). Every history in F(πhipa) has a unique prefix in F(πlopa), so just send a history to its unique prefix. This is prπhipa,πlopa.

To visualize it, take your tree induced by πlopa. F(πlopa) is the coat of observation leaf nodes and all the infinite paths. Extending πlopa requires the new tree πhipa to poke out through some observation leaf nodes in F(πlopa), so every leaf node in F(πhipa) is attached to a part of the πhipa tree that pokes out through the original coat of leaf nodes F(πlopa) somewhere.

prπhipa,πlopa∗ is just the pushforward mapping induced by this, like how we got a function g∗ from sa-measures to sa-measures from a function g:X→Y back in the last post. pr∗ is the *exact* function you'd come up with if you went "given a measure on 10-bit bitstrings, how do I craft a measure on 7-bit prefixes" and generalized it.

Some important properties of this mapping: it's the identity when mapping from a partial policy to itself, it's linear, it preserves the λ and b values of an a-measure when you project it down, and it commutes: it doesn't matter what sequence of projections you take to get from a high partial policy to a low partial policy, they'll all be equal.
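The pushforward itself is tiny in code. A hypothetical sketch, where an a-measure is a pair (dict from o-history tuples to mass, b) and the lower outcome set is given as a set of tuples:

```python
def project(a_measure, F_lo):
    """Pushforward pr_*: move each history's mass to its unique prefix
    in the lower outcome set F_lo; the b term rides along unchanged,
    so both lambda (total mass) and b are preserved."""
    m, b = a_measure
    m_lo = {}
    for h, mass in m.items():
        # the unique prefix of h lying in F_lo
        prefix = next(h[:i] for i in range(len(h) + 1) if h[:i] in F_lo)
        m_lo[prefix] = m_lo.get(prefix, 0.0) + mass
    return (m_lo, b)
```

This is exactly the "measure on 10-bit bitstrings crunched down to a measure on 8-bit prefixes" operation from the notation section, generalized to outcome sets.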

**Definition 10: A-Environment**

*An a-environment is a point written as *(λe,b)*, where *b,λ≥0*, and *e* is an environment. This is parallel to an a-measure.*

The intuition for this is that λ tells you how much you care about the environment e/the probability of the environment existing in the first place if λ<1, and b is the utility you're guaranteed/portion of expected utility you get if the environment doesn't exist.

We can think of querying an a-environment with a partial policy to get an a-measure via (λ(πpa⋅e),b), where πpa⋅e is the probability distribution over F(πpa) induced by πpa interacting with the environment. We'll use this as an intuition pump a lot.
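Querying can be sketched concretely. In this hypothetical encoding, an a-environment is a triple (lam, env, b) where env maps an a-history (ending in the action just taken) to a probability distribution over the next observation, and a policy stub is a dict from o-history tuples to actions:

```python
def query(a_env, pi, observations=('x', 'y')):
    """Interact an a-environment (lam, env, b) with a policy stub pi,
    returning the a-measure (lam * (pi . env), b) over F(pi)."""
    lam, env, b = a_env
    m = {}
    def walk(h, p):
        if h not in pi:                    # reached an outcome leaf of F(pi)
            m[h] = m.get(h, 0.0) + lam * p
            return
        a = pi[h]
        probs = env(h + (a,))              # P(next obs | a-history)
        for o in observations:
            walk(h + (a, o), p * probs[o])
    walk((), 1.0)
    return (m, b)
```

Note that the measure component comes out with total mass λ, matching the "how much we care about the environment" reading of λ.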

**Belief Functions**

**Definition 11: Belief Function**

*A belief function *Θ* is a function that assigns all *πpa* to a nonempty set of a-measures over *F(πpa)* (or the Nirvana-free variant).*

The reason we're using a-measures instead of sa-measures is that we really don't need negative measures except for defining the upper completion, and negative measures *really* don't interact well with Nirvana being infinite reward (a drawback of this approach).

From the a-environment view, you can consider taking a set of a-environments, letting them all interact with a partial policy, and then plotting the resulting a-measures over F(πpa) you get, that'd be your Θ induced by a set of a-environments. But we'll just work with belief functions Θ more generally. We'll also abuse notation and use NF (Nirvana-free) as an abbreviation for Ma(FNF(πpa)) sometimes, when we don't want to specify an ambient space, and are just doing an intersection to clip away Nirvana.

Also, EΘ(πpa)(f)=EΘ(πpa)∩NF(f), because any probability of Nirvana counts as infinite value, so Murphy will only pick from Nirvana-free a-measures.
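Concretely, that identity just says the minimization can skip every Nirvana-positive a-measure. A minimal sketch under illustrative assumptions (Θ(πpa) as a finite list of a-measures, each a pair of a mass dict over o-history tuples and a b term, with 'N' marking Nirvana):

```python
NIRVANA = 'N'

def expectation(a_measure, f):
    m, b = a_measure
    return sum(mass * f(h) for h, mass in m.items()) + b

def is_nirvana_free(a_measure):
    m, _ = a_measure
    return all(mass == 0 or NIRVANA not in h for h, mass in m.items())

def infra_expectation(theta_pi, f):
    """E_Theta(pi)(f): Murphy picks the worst a-measure, and anything
    with positive Nirvana-measure has infinite value to the agent, so
    only the Nirvana-free a-measures compete for the minimum."""
    return min(expectation(M, f) for M in theta_pi if is_nirvana_free(M))
```

An a-measure putting mass on a Nirvana history simply never wins the minimization, no matter how low its non-Nirvana value is.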

**Belief Function Conditions**

Let's go over the various conditions for a belief function to behave well. The first 6 are analogues of the corresponding conditions for an infradistribution, and the last 3+2 are special conditions for how the different Θ(πpa) sets interact with each other.

**Condition 1: Nirvana-free Nonemptiness:** ∀πpa:Θ(πpa)∩NF≠∅

This is parallel to condition 1: Nonemptiness, for infradistributions.

It's nontrivial in the Nirvana-containing case, because it says "regardless of your policy, Murphy can find an environment to play that avoids Nirvana", as should be the case if we don't want the agent ignoring all else to optimize its chances of getting an infinite reward that never actually happens.

**Condition 2: Closure: **∀πpa:Θ(πpa)=¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯Θ(πpa)

This is parallel to condition 2: Closure, for infradistributions.

This also works with the Nirvana-free interpretation of Θ, because Ma(FNF(πpa)) is closed.

**Condition 3: Convexity:** ∀πpa:Θ(πpa)=c.h(Θ(πpa))

This is parallel to condition 3: Convexity, for infradistributions.

Again, Ma(FNF(πpa)) is convex so we can do this in that setting too. c.h is the convex hull.

**Condition 4: Nirvana-Free Upper Completeness:**

∀πpa:Θ(πpa)∩NF=((Θ(πpa)∩NF)+Msa(FNF(πpa)))∩Ma(F(πpa))

This is roughly parallel to condition 4: Upper Completeness, for infradistributions. This says that the Nirvana-free part of Θ(πpa) should be upper-complete w.r.t. the cone of Nirvana-free sa-measures. This is a bit of a hack, because Nirvana counts as infinite reward.

You probably have several questions. The first one: why do we restrict to a-measures? Well, our minimal points should be a-measures (by Condition 5: Positive Minimals for infradistributions), and those are the only thing that matters, so it's no big deal that we trim off all the sa-measures. Plus, some later tools like Nirvana injections don't work well with sa-measures.

Second question: Why do we need the cone of Nirvana-free sa-measures? Can't we just add the cone of Nirvana-free a-measures? Well, it's possible to add an sa-measure to an a-measure and get an a-measure, so just adding the cone of a-measures *wouldn't* capture enough points; we could add more.

Third question: Why do we require everything to be Nirvana-free? Can't we just require upper completion in general? Well, there's a weird thing where the projection of an upper-complete set isn't necessarily upper-complete when there's Nirvana about. But, in the Nirvana-free setting, the projection of an upper-complete set is upper-complete. This feature messes with several downstream things. Murphy will only pick from Nirvana-free a-measures, so it's no big loss.

Again, we see the pattern that having Nirvana be infinite forces a bunch of weird minor issues all over the place.

**Condition 5: Bounded Minimals:**

∃λ⊙,b⊙:∀πpa:(λμ,b)∈(Θ(πpa))min→λ≤λ⊙∧b≤b⊙

This is roughly parallel to condition 6: Bounded-Minimals for infradistributions. Note that said bound must be uniform among all the Θ(πpa) sets. We need to bound the b term uniformly as well to make some proofs work out.

**Condition 6: Normalization:** minπEΘ(π)(0)=0∧maxπEΘ(π)(1)=1

This is not the standard renormalization, but analogous. It also has the bad property that a mixture of normalized belief functions is not normalized. In our defense, it's all scale-and-shift anyways, you can just renormalize after mixing if you want.

Why is this defined in such a way? Well, it's because we're actually dealing with a bunch of different sets of a-measures, one for each partial policy, and since they're different, we can't get a guarantee that they're *all* normalized at once. You can't do individual normalizations, that'd break dynamic consistency. So we do a sort of collective normalization to Θ as a whole, to preserve the natural analogue of the normalization condition in this setting. Worst-case value should be 0 and best-case value should be 1, and this attains that.
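As a toy illustration of this collective scale-and-shift (my own finite encoding, not from the post: each Θ(π) is abbreviated to a list of (λ, b) pairs, which is enough to evaluate the constant functions, since E(0) = min b and E(1) = min λ+b when each μ is a probability distribution):

```python
# Minimal sketch of collective normalization, assuming each Theta(pi)
# is a finite list of a-measures abbreviated to (lambda, b) pairs.
def e0(ms):
    return min(b for lam, b in ms)       # worst-case value of the 0 function

def e1(ms):
    return min(lam + b for lam, b in ms)  # worst-case value of the 1 function

def normalize(theta):
    """One affine transform applied to every a-measure in every set,
    so that min over policies of E(0) is 0 and max of E(1) is 1."""
    worst = min(e0(ms) for ms in theta.values())
    best = max(e1(ms) for ms in theta.values())
    scale = best - worst
    return {pi: [(lam / scale, (b - worst) / scale) for lam, b in ms]
            for pi, ms in theta.items()}

theta = {"pi1": [(0.5, 0.2)], "pi2": [(0.8, 0.4)]}
normed = normalize(theta)
print(min(e0(ms) for ms in normed.values()),
      max(e1(ms) for ms in normed.values()))  # 0.0 1.0
```

Note that the *same* affine transform hits every Θ(π), which is why dynamic consistency survives, while normalizing each set individually would not.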

Now, we can move on to three additional conditions on how the various Θ(πpa) sets link together, instead of merely checking the usual infradistribution conditions for individual Θ(πpa).

**Consistency and Extreme Points**

**Condition 7: Consistency:**

∀πpa:Θ(πpa)=⋂πst≤πpa(prπpa,πst∗)−1(Θ(πst))

∀πpa:Θ(πpa)=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)))

This seems a bit mystifying, but it pretty much says that the behavior of the finite levels (the policy-stubs πst which only specify things up to a finite time), and the infinite levels (the full policies π), uniquely pin each other down. This condition lets you recover Θ uniquely via only knowing it on policy-stubs or only knowing it on full policies.

The intuition for the first equality will be taken from the view where Θ tells you what you get via querying a set of a-environments with a partial policy and recording the a-measures you get. So, to begin with, if π≥πpa, then querying an a-environment with π and projecting your result down via prπ,πpa∗, gets you the same thing as querying that environment with πpa. And if the first query gave you a Nirvana-free thing, the second one must have done so too because π≥πpa. Therefore, prπ,πpa∗(Θ(π))⊆Θ(πpa). Also, if the set of a-environments is convex (the corresponding property to all the Θ(πpa) being convex), we should be able to mix a-environments, which mixes the a-measures you get via a query in the same way. So,

Θ(πpa)⊇c.h(⋃π≥πpaprπ,πpa∗(Θ(π)))

And also, Θ(πpa) is closed, so Θ(πpa) contains the closed convex hull as well.

In the Nirvana-containing case, we can go "hey, if there's a point in Θ(πpa) that the closed convex hull doesn't nab, can't we just query the associated superenvironment with some random π≥πpa to get something in Θ(π) that projects down to our point of interest?" And this gets you equality.

If we slice out the Nirvana, this equality is a significantly stronger condition. We could imagine taking an a-environment that's Nirvana-free for a long time, but past a certain point, regardless of policy, Nirvana is guaranteed. The a-measures made by querying this a-environment with all partial policies would show up in the small policy-stubs (being Nirvana-free), but because Nirvana eventually shows up regardless of policy, the a-measures would get clipped out of our Θ(πpa) sets because they have Nirvana in them and we're only recording Nirvana-free a-measures.

So, in the Nirvana-free case, this condition corresponds to saying that, given some Nirvana-free a-measure produced by a partial policy interacting with an a-environment, there should be *some* way to extend the partial policy to a full policy that's Nirvana-free as well. Or, if that can't be done, the a-environment should be a mixture of a-environments where you *can* extend your policy in a Nirvana-free way.

In the Nirvana-free case, this "Θ(πpa) is the closed convex hull of projections from above" property is basically saying "don't clog Θ of the partial policy with extraneous crap that didn't arise from appropriate Nirvana-free a-measures for full policies".

After all, we want to recover the whole thing from either the behavior of Θ on full policies, or the policy-stubs, and adding extra Nirvana-free stuff into the policy-stubs would mess with our ability to do this.

Now, let's look at the other one, Θ(πpa)=⋂πst≤πpa(prπpa,πst∗)−1(Θ(πst))

What's the motive for this one? Well, let's cover both directions. In one direction, if you were like "this intersection of preimages of projections doesn't have enough points, I'm gonna throw in another point M outside the intersection", then there'd be some stub πst≤πpa, where your added point would project down to be outside of Θ(πst), which, oops, can't be done. Remember, if you got your extra point M via querying an a-environment with πpa, then you should be able to query the same a-environment with πst which is shorter, to get the projection of M which had *better* be in Θ(πst). But it's not.

In the other direction, this is sort of an "instantiating points at infinity" condition. A point in that intersection would project down to Θ(πst) for all appropriate stubs. We've got a sequence of points in those sets, one for each stub. So, you can't actually detect the addition of these points at infinity, because their finite-time behavior will *always* flawlessly match up with some preexisting thing. Remember, we want to add as many points as we can that don't affect any observable behavior.

Thus, the consistency condition can be viewed as a mix of:

1: Projecting Θ(πpa) down should make a subset of Θ(πst) because if you can query an a-environment with a long policy, you can also query it with a short one.

2: The finite levels aren't clogged with extra crap that didn't come from infinite levels.

3: A point is present at infinity if it's present at all the finite levels, because no finite-time observations can make any headway towards ruling that point out.

**Condition 8: Extreme Point Condition:**

(M∈(Θ(πst))xmin∩NF)→(∃π>πst,M′:M′∈Θ(π)∩NF∧prπ,πst∗(M′)=M)

(Θ(πst))xmin is the set of extreme minimal points of Θ(πst).

To unpack this, an extreme point of a set B is a point that can't be written as a mixture of distinct points in B. Given a tetrahedron, the vertices would be extreme points, and given a sphere, the surface would be the extreme points.
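To see this definition in action, here's a hypothetical finite 2-D sketch (not the post's setting): a point of a finite set is extreme iff it's not a convex mixture of the other points, and in the plane Carathéodory's theorem means it suffices to test triangles (possibly degenerate) of other points.

```python
from itertools import combinations

def cross(o, u, v):
    """Signed area test: which side of line o->u the point v is on."""
    return (u[0]-o[0])*(v[1]-o[1]) - (u[1]-o[1])*(v[0]-o[0])

def in_triangle(p, a, b, c):
    """Inclusive point-in-triangle test via signed areas."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = any(d < 0 for d in (d1, d2, d3))
    has_pos = any(d > 0 for d in (d1, d2, d3))
    return not (has_neg and has_pos)

def is_extreme(p, points):
    """p is extreme iff it can't be written as a convex mixture of the
    other points. In the plane, Caratheodory's theorem says checking
    triangles of other points suffices (assumes >= 3 other points)."""
    others = [q for q in points if q != p]
    return not any(in_triangle(p, a, b, c)
                   for a, b, c in combinations(others, 3))

# Corners of a square plus its center: only the corners are extreme,
# since the center is a mixture of the corners.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
print([is_extreme(p, pts) for p in pts])  # [True, True, True, True, False]
```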

So this says that, for stubs, if a point M is an extreme point, minimal, and Nirvana-free, then it's the projection of a Nirvana-free point from a full policy above the stub.

In the Nirvana-free case, we get this for free from consistency. An extreme point can't be written as a mixture of other points, so it wasn't introduced when we took the convex hull of the union of projections (closure is more fiddly and we'll be ignoring it), so it had to come from the projection of a point from above.

*However*, in the Nirvana-containing case, this condition has two interpretations. The first is that shearing off all the a-measures with Nirvana occurring in them should get you a well-formed hypothesis. Our first condition, Nirvana-Free Nonemptiness, can be thought of as a condition in that vein, which ensures that shearing off the Nirvana doesn't leave you with an empty set. This condition is similar. Shearing off all the Nirvana had better preserve consistency! And this condition ensures exactly that.

Another characterization is that Nirvana cannot become inevitable without already happening. Minimal points are the essential elements of an inframeasure that can't be stripped away without affecting the expectation values, and extreme points are those that cannot be introduced by taking convex hull if they weren't already present. So, this condition says that if a point is minimal and extreme (can't be made by mixing and can't be removed without affecting anything else), then if Nirvana/infinite reward is guaranteed to have a shot at occurring in the future, then we might as well assume it has already occurred and said point can't be Nirvana-free.

**Hausdorff-Continuity**

Let's introduce some new concepts for this one.

**Definition 12: A-Measures on Nirvana-Free Histories**

Ma(∞)* is the set of a-measures on infinite o-histories. A finite history you can extend doesn't count. A finite history that ends with Nirvana doesn't count. An infinite history counts. This is all histories of this form, not just the ones compatible with some policy or other.*

**Definition 13: Clip Set**

*Given a belief function *Θ* that fulfills bounded-minimals, there are *λ⊙* and *b⊙* upper bounds on the *λ* and *b* values of minimal points. *{≤⊙}* is the set of a-measures *(λμ,b)* where *λ+b≤λ⊙+b⊙

**Definition 14: Preimage to Infinity**

(pr∞,πpa∗)−1* is the function that maps a set *B* in *Ma(FNF(πpa))* to the set of a-measures in *Ma(∞)* with a measure of 0 on histories which don't have a history in *FNF(πpa)* as a prefix (so they can project down onto *Ma(FNF(πpa))* safely), and which project down to *B*.*

This allows us to take the preimage of *all* our Θ(πpa)∩NF sets and have them all coexist in the same space, for ready comparison.

Now, let's define our next condition.

**Condition 9: Hausdorff Continuity:** πpa↦(pr∞,πpa∗)−1(Θ(πpa)∩NF∩{≤⊙}) is continuous.

"Wait a minute", you might ask, "this maps partial policies to *sets*, what notion of continuity are you using here??"

Well, given two sets B and B′ in a metric space, the Hausdorff-distance between these sets is the maximum distance you'd have to travel to get to a point in one set, if you started at an arbitrary point in the other set. Two sets being close in Hausdorff-distance means that every point from B is close to a point in B′, and vice-versa. From earlier, we've got a distance metric between partial policies. Further, because Π+ (the set of partial policies) is *compact*, said function isn't just continuous, but *uniformly* continuous.
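For concreteness, here's the Hausdorff distance computed on finite point sets (a generic sketch with Euclidean distance; the post's version is between sets of a-measures under the KR metric):

```python
import math

def hausdorff(B, Bp):
    """Hausdorff distance between two finite point sets: the farthest
    you might have to travel to reach the other set, starting from the
    worst-case point of either set."""
    def directed(X, Y):
        return max(min(math.dist(x, y) for y in Y) for x in X)
    return max(directed(B, Bp), directed(Bp, B))

# Two nearby squares: every point of one is close to a point of the other.
B  = [(0, 0), (1, 0), (0, 1), (1, 1)]
Bp = [(0.1, 0), (1.1, 0), (0.1, 1), (1.1, 1)]
print(hausdorff(B, Bp))  # ~0.1
```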

Why do we need Nirvana-freeness? Well, Ma(∞) doesn't include any Nirvana in it, so projecting it down can't make any a-measure that has Nirvana in it.

Why must we clip off part of our set? For full policies, (pr∞,π∗)−1(Θ(π)∩NF)=Θ(π)∩NF

Now, Θ(π)∩NF is upper-closed w.r.t FNF(π). So there are points in that set with a *gigantic* amount of measure on infinite histories compatible with π, and 0 everywhere else. Moving to a slightly different π′, its associated set has points with a gigantic amount of measure on infinite histories compatible with π′, and 0 everywhere else. Basically, the thing going wrong if we don't clip off the upper completion "tail" of our set is that the upper-completion tail is pointing in *slightly* different directions for different policies, so the Hausdorff-distance gets arbitrarily large in the upper-completion tails. So we need to clip off that part. This doesn't clip off any minimal points, which are the important ones and everything else is extraneous.

This condition may seem quite mysterious. Speaking roughly, it says that for a Nirvana-free a-measure M∈Θ(π)∩NF, if we make a perturbation of our policy at a very late time to get π′, there should be some nearby Nirvana-free a-measure M′∈Θ(π′)∩NF.

This is a continuity condition on Murphy. If your policy trembles a bit to some nearby policy, then Murphy should only have to change the environment a little bit to get something Nirvana-free for the new policy. Another thing that may make the situation clearer is providing an example of what this condition rules out.

Let's say you're doing iterated Newcomb. If Omega fills the box only if it predicts you 1-box on the next 100 rounds, that's permitted by this condition. Altering your policy on round 1 million only alters the environment past round 1 million-100, which isn't much of a change w.r.t the KR metric we're using (it's insensitive to changes at late times). Omega can even look at your *full* policy, as long as really distant changes in your policy only have a small influence on the probabilities of what happens now, that'd still fulfill this condition. What this *does* forbid is the variant of iterated Newcomb where Omega fills the box only if it predicts you *always* 1-box. Because an arbitrarily late policy perturbation leads to a big change in what happens now.
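The "insensitive to changes at late times" point can be made concrete with a toy discounted metric (my own stand-in for the KR metric, with an assumed discount factor): streams of outcomes that differ only after a late time t are close, while a difference at time 0 is big.

```python
# Toy discounted distance between outcome streams, standing in for the
# KR metric's insensitivity to late times. GAMMA < 1 is an assumed
# discount factor.
GAMMA = 0.9

def discounted_distance(stream1, stream2):
    """Sum of GAMMA^t over the times t where the streams disagree."""
    return sum(GAMMA ** t
               for t, (x, y) in enumerate(zip(stream1, stream2))
               if x != y)

base          = [0] * 100 + [1] * 10
flipped_late  = [0] * 100 + [0] * 10   # differs only from step 100 on
flipped_early = [1] + [0] * 99 + [1] * 10  # differs only at step 0

print(discounted_distance(base, flipped_late))   # tiny: ~0.9^100 scale
print(discounted_distance(base, flipped_early))  # 1.0
```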

So, despite looking hard to fulfill, this condition is an "acausal influences from very distant futures are weak" condition which is fairly natural. Another way of thinking about it is that you should be able to figure out what a policy-dependent environment is doing by just making bounded precommitments, instead of precommitments that extend for all time.

Why do we need this condition? Well, to prove that there's an optimal policy. Surprisingly enough, it's also essential for the proof of the consistency condition when we're trying to recover the whole Θ from a belief function defined only on stubs or full policies.

So, the Hausdorff-continuity condition is more natural than it seems, and essentially says that extremely late policy changes don't change your situation now very much.

**Causal, Pseudocausal, Acausal**

But there's two more conditions, to define special sorts of hypotheses! Causal, Pseudocausal, and Acausal respectively. Causal hypotheses can have Nirvana in them, Pseudocausal and Acausal must be Nirvana-free. Acausal doesn't have any extra conditions on it besides the previous nine and being Nirvana-free, but Causal and Pseudocausal do need one extra condition each.

**Definition 15: Outcome Function**

*An outcome function *of* is a function mapping each *πpa* to a point in *Ma(F(πpa))*, such that:*

∀πlopa,πhipa≥πlopa:prπhipa,πlopa∗(of(πhipa))=of(πlopa)

**Condition C: Causality**

∀πpa,M∈Θ(πpa)∃of:of(πpa)=M∧(∀π′pa:of(π′pa)∈Θ(π′pa))

The intuition for this is that, if you fixed an a-environment and queried it with all possible partial policies, you'd produce an outcome function. Similarly, an outcome function uniquely picks out an a-environment because you know what occurs in all possible situations and all the projections commute. So, the causality condition indirectly says "every point in all of your belief function sets arose from querying an a-environment". This lets us easily go from the belief function to a set of a-environments, and similarly, querying a set of a-environments with partial policies will make a causal hypothesis.

**Condition P: Pseudocausality**

∀πpa,π′pa:((m,b)∈Θ(πpa)∧supp(m)⊆FNF(π′pa))→(m,b)∈Θ(π′pa)

This is the Nirvana-free analogue of causality. Let's say we've got some a-environment that we query with πpa, and the resulting distribution is Nirvana-free, and supported over FNF(π′pa). Now, if we query said a-environment with π′pa instead... we don't have to worry about introducing Nirvana! Yup, π′pa behaves a bit differently. But, it only behaves differently than πpa on histories that have 0 probability of happening, so it should induce the exact same a-measure. Causality implies this condition, but this is weaker than causality. This is the property you'd get if you were taking a set of a-environments and querying it and reporting only the Nirvana-free a-measures. The special property of this is that, if pseudocausality is fulfilled, we can add Nirvana in such a way to craft a causal hypothesis (which is linked to actual sets of a-environments).

Recap! Ok, so our conditions are:

**1: Nirvana-free Nonemptiness**

This is easy, we shouldn't have the agent chasing after infinite reward.

**2,3,4: Closure, Convexity, Nirvana-Free Upper Completion**

These collapse equivalence classes of a-measures that have the exact same behavior, and are parallel to the conditions for inframeasures.

**5: Minimal-Boundedness**

Again, this is parallel to the condition for bounded inframeasures, just adding in a b bound as well. It makes proofs much nicer.

**6: Normalization**

We impose a sort of collective normalization for the belief function as a whole, parallel to the corresponding condition for an infradistribution. This is not preserved when we mix belief functions.

**7: Consistency**

You should be able to recover the finite behavior of Θ from the infinite behavior of Θ, or vice-versa. Projecting big partial policies down to low levels makes subsets, the finite levels aren't clogged with extraneous crap, the infinite levels have all the points that match up with finite behavior.

**8: Extreme Point Condition**

Shearing the Nirvana off a consistent Nirvana-containing hypothesis should produce a consistent hypothesis.

**9: Hausdorff-Continuity**

Policy changes at extremely late times don't have much effect on what happens now.

**C: Causality**

Your belief function is induced by querying a set of a-environments.

**P: Pseudocausality**

Your belief function is induced by querying a set of a-environments and only recording the Nirvana-free results.

**Definition 16: Causal/Pseudocausal/Acausal/Surcausal Hypothesis**

*A hypothesis is a belief function *Θ* fulfilling the 9 conditions. A causal hypothesis may include Nirvana and fulfills Causality. A pseudocausal hypothesis is Nirvana-free and fulfills Pseudocausality. An acausal hypothesis is Nirvana-free. A surcausal hypothesis is a causal hypothesis over a-surmeasures (to be defined)*

What's the essential difference between pseudocausal and acausal hypotheses? Well, it's a natural dividing line between two importantly distinct classes of decision-theory problems. In the former, no matter how you're mispredicted, you must have some chance of actually entering the situation where you're mispredicted. In the other class, you could be mispredicted and then locked out so you can't show the misprediction is wrong. More on this in the next section.

As we'll show later, you can turn any pseudocausal hypothesis into an equivalent causal one by adding Nirvana appropriately, and turn any causal hypothesis into an actual set of a-environments. But we have a hard time viewing acausal hypotheses as coming from a set of a-environments, at least without further work. So, the next section will be about poking at a toy case where we have an acausal hypothesis we can't turn into a causal one, and seeing how we have to adapt things to view an agent as believing that it's in some set of environments, instead of the agent working purely with belief functions that don't have a representation as a set of environments.

**Acausal Hypotheses and Surmeasures**

Let's take the example of Transparent Newcomb, where Omega fills the transparent box with a hundred dollars if it predicts you'd 1-box upon seeing the transparent box filled, and has ϵ probability of making a mistake (the other box contains 1 dollar). This is a policy-dependent environment, which depends on your policy, as well as your action and the past history. The way we can encode this as a set of environments is by hardcoding all possible policies into the policy slot to make a set of environments, and send you to Nirvana if you ever act contrary to the hardcoded policy. This forces Murphy to make the hardcoded policy match your own, lest you go to Nirvana. The ϵ chance of error guarantees that you have a shot at Nirvana if the hardcoded policy doesn't match your own. Translating this set of environments to a Nirvana-free belief function yields a pseudocausal hypothesis. XOR blackmail and Counterfactual Mugging are also pseudocausal hypotheses.

But what about Transparent Newcomb with a *perfect* predictor? In that case, we have a very serious problem with our standard way of viewing a policy-dependent environment as a set of environments via hardcoding in all policies and sending you to Nirvana if you violate the hardcoded policy. Maybe you 1-box upon seeing a full box, 2-box upon seeing it empty, but the hardcoded policy in the environment is "you 2-box in both cases". You're stuck 2-boxing and getting 1 dollar, you can never get into a situation where you get the Nirvana reward from calling out Omega on its misprediction. In fact, both "1-box upon full, 2-box upon empty" and "always 2-box" are optimal policies here, so this encoding of a policy-dependent environment doesn't let us get UDT behavior. If we don't want to permanently give up on viewing belief functions as sets of a-environments, how can we go forward?

Well, there's two ways to deal with this. One is encoding ϵ-exploration appropriately, the other doesn't require ϵ-exploration and is more interesting.

For the first one, assume a situation where you have an "intended policy" (deterministic), and you have an internal sense of what you intended to do, but maybe exploration overwrites that. Omega has a hard-coded prediction of your policy, and an exploration coin is flipped on each round. If the coin says "act as normal", things proceed as normal, if the coin says "explore", then Omega does the opposite of what it would normally do re: box-filling or not since it knows you'll do the opposite of what you'd normally do. Then, if the action you intended upon seeing the box's condition doesn't match up with Omega's prediction of what you intended, you go to Nirvana. In either case, the real action is determined by the exploration coin and what you intended to do. If we try going "Omega's prediction is that you intend to 2-box regardless of whether the box is full or empty", and our true intent is 1-boxing if full, 2-boxing if empty, and the coin says to explore, then Omega goes "they do the opposite of what they'd normally do, better fill the box". Then, as you reach for both boxes (because of exploration), the mismatch between your intent of 1-boxing and Omega's prediction that you intended to 2-box manifests and you attain Nirvana in the exploration step. This particular way of setting up things guarantees some probability of getting into a situation where you prove Omega wrong if there's a mismatch between your policy and the hardcoded policy, so we can turn it into a set of environments. But now, let's look at a different way of resolving the problem.

What happens if we just shove Perfect Transparent Newcomb into the form of a suitable Nirvana-free Θ? Neglecting some conditions like upper-completion, Θ(π1,1) and Θ(π1,2) both have the single point where the box is guaranteed-full and you 1-box for a hundred bucks. Θ(π2,1) has the single point where the box is guaranteed-empty and you 1-box for nothing. Θ(π2,2) has the single point where the box is guaranteed-empty and you 2-box for 1 dollar.

But this violates the pseudocausality condition. The outcome "the box is empty, and you 2-box" is in Θ(π2,2), and supported over FNF(π1,2). However, this distribution isn't present in Θ(π1,2). The pseudocausality condition wants us to add the bad outcome back in!
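To make the violation mechanical, here's a direct check of condition P on this toy encoding (my own finite representation, not from the post: a policy is a pair of actions for the full/empty observations, and a distribution is a frozenset of (history, probability) pairs):

```python
def compatible_histories(policy):
    """F^NF(pi) for deterministic Transparent Newcomb: the two
    (observation, action) histories the policy can produce."""
    a_full, a_empty = policy
    return {("full", a_full), ("empty", a_empty)}

def pseudocausal(theta):
    """Condition P: any distribution m in Theta(pi) whose support fits
    inside F^NF(pi') must also appear in Theta(pi')."""
    for dists in theta.values():
        for pi2, dists2 in theta.items():
            for m in dists:
                support = {h for h, p in m if p > 0}
                if support <= compatible_histories(pi2) and m not in dists2:
                    return False
    return True

# Perfect Transparent Newcomb, neglecting upper completion etc.
# Keys are policies (action on full box, action on empty box).
theta = {
    ("1", "1"): {frozenset({(("full", "1"), 1.0)})},
    ("1", "2"): {frozenset({(("full", "1"), 1.0)})},
    ("2", "1"): {frozenset({(("empty", "1"), 1.0)})},
    ("2", "2"): {frozenset({(("empty", "2"), 1.0)})},
}
# False: e.g. the (2,2) outcome is supported inside F^NF(pi_{1,2})
# but absent from Theta(pi_{1,2}).
print(pseudocausal(theta))
```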

One more stab. Let's look at the non-closed set of environments (with Nirvana) corresponding to ϵ-noise Transparent Newcomb, where 0.01≥ϵ>0. The value 0.01 isn't that important, the key part is just that we are in a situation where there's some noise in the prediction and we know it's small but we don't know how small. We can try taking the closure either before or after we turn it into a Θ, and see what happens.

Taking the closure in the space of a-environments, and turning it into a Θ ends up adding in the bad distribution where it's assumed we 2-box, and sadly, this is Nirvana-free for a 1-boxing strategy. So we get the same failure mode as before.

Something rather interesting happens if we take the closure of the Θ(π) sets after translating this problem into a Nirvana-free belief function though. For Θ(π1,2), we've got a bunch of distributions going "1−ϵ-probability on a full box and we 1-box, ϵ-probability on an empty box and we 2-box". Taking the closure won't add the bad distribution, it'll only add in the desired Perfect Transparent Newcomb distribution where it's full and you 1-box, guaranteed. But we still don't have pseudocausality.

This mismatch points towards needing a condition that's something like "you can't add in a Nirvana-free a-measure to Θ(π) if it's a limit of Nirvana-containing a-measures, you can only add it in if it's a limit of Nirvana-free a-measures". Murphy can't take *arbitrary* limits to make an inconvenient a-measure for you, they can only take limits of things that don't give you a shot at infinite value.

Our fundamental problem is that the expected value of 0.01≥ϵ>0 Newcomb, if Omega falsely has "2-box on full-box" locked in as the strategy, and we 1-box in that situation instead, goes "infinity, infinity, infinity... 1 dollar" as ϵ shrinks to 0. Limits can lead outside the set of Nirvana-containing a-measures.

What would happen if we decreed that the set of Nirvana-containing a-measures was closed? The limit point of our sequence should still have infinite value, so it should be something like "probability 1 of an empty box and we 2-box, probability 0+ of a full box and we 1-box leading to Nirvana, of infinite value", where 0+ is interpreted as an infinitesimal quantity. Murphy still avoids this, because Nirvana can't *truly* be ruled out. Or, maybe we could slap some distance metric on a-measures that goes "this one has some probability of Nirvana here, this one doesn't, so they don't get arbitrarily close".

**Definition 17: Surmeasure**

*A surmeasure is a pair of a measure and a function mapping Nirvana-histories to "possible" or "impossible", where Nirvana-histories with positive measure must be labeled "possible". Possible Nirvana-histories with 0 measure are assigned a value of *0+*. A-surmeasures and sa-surmeasures are defined in the obvious way from this, and denoted by *SM*.*

And a survironment is like an environment, except it can have 0+ chance of getting "impossible" observations and from there you're guaranteed to get Nirvana at some point.

**Definition 18: Surtopology**

*The surtopology is the topology over the space of (s)a-(sur)measures with a subbasis of the usual open balls around points, and the sets *{(m,b)|m(h)=0}* where *h* ranges over the histories that end with Nirvana.*

**Definition 19: Surmetric**

*The surmetric is the metric over the space of (s)a-(sur)measures defined as:*

ds((m1,b1),(m2,b2))=max(d(m1,m2)+|b1−b2|,γt(m1,m2))

*where *d* is the usual KR-metric, *γ<1*, and *t* is the minimum length of a Nirvana-containing history that is possible/positive measure according to *m1* and impossible/0 measure according to *m2* (or vice-versa)*
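A minimal sketch of the surmetric on finite-support measures (my own toy encoding: a measure is a dict from history strings to mass, with 'N' marking Nirvana at the end of a history; total variation stands in for the KR metric, and GAMMA is an assumed discount):

```python
GAMMA = 0.5  # assumed gamma < 1

def surmetric(m1, b1, m2, b2):
    """max of the base distance and GAMMA^t, where t is the length of
    the shortest Nirvana history that's possible (positive measure) in
    one measure and impossible (0 measure) in the other."""
    base = sum(abs(m1.get(h, 0.0) - m2.get(h, 0.0))
               for h in set(m1) | set(m2)) + abs(b1 - b2)
    mismatched = [len(h) for h in set(m1) | set(m2)
                  if h.endswith("N") and (m1.get(h, 0) > 0) != (m2.get(h, 0) > 0)]
    if not mismatched:
        return base
    return max(base, GAMMA ** min(mismatched))

# A sequence losing its Nirvana-mass converges in the ordinary metric,
# but in the surmetric it stays bounded away from the Nirvana-free
# limit: every term prints GAMMA^2 = 0.25 here.
m_lim = {"b": 1.0}
for eps in (0.1, 0.01, 0.001):
    m_eps = {"aN": eps, "b": 1 - eps}
    print(surmetric(m_eps, 0.0, m_lim, 0.0))
```

This is exactly the "points with any shot at Nirvana don't get arbitrarily close to points with none" behavior that motivates the definition.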

Murphy will avoid all points with *any* chance of Nirvana. Remember, we only need to deal with these shenanigans if our decision-theory problem has the environment *completely* lock us out of being able to disprove a false prediction. Or, if you really want to deal with these decision-theoretic cases and don't want to deal with surmeasures, you could just stick with ordinary acausal hypotheses and avoid attempting to cram them into a causal hypothesis/set of environments.

Interestingly enough, these three ways of dodging the problem are equivalent!

**Theorem 1:** *The surmetric on the space of sa-measures *Msa(F(πpa))* induces the surtopology. The Cauchy completion of *Msa(F(πpa))* w.r.t the surmetric is exactly the space of sa-surmeasures.*

If you didn't quite follow the previous discussion, the tl;dr version of it is that the standard way of getting a set of environments from a set of policy-dependent environments, where you hardcode in all possible policies and you go to Nirvana if you violate the hardcoded policy, only works when you're guaranteed to have some chance at Nirvana if you're mispredicted. Some decision theory problems with perfect predictors violate this property, but in that case, we can still make it work by going "oh, the probability of getting into a situation where you're mispredicted isn't 0, it's just arbitrarily small". Letting in outcomes that are possible but have arbitrarily low measure, gets us survironments and surmeasures, so we can view all policy-dependent decision theory problems this way, as arising from a set of survironments (with that 0+ chance of some outcomes), instead of just a vanilla set of environments.

**Hypothesis Translation Theorems**

Alright, now that we've got all that stuff out of the way, we can start presenting a bunch of theorems!

Let ↑ be the function mapping a Θst defined only over policy-stubs to a Θ over all partial policies, via ↑(Θst)(πpa):=⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))

And let ↓ be the function mapping a Θω defined only over full policies to a Θ over all partial policies, via ↓(Θω)(πpa):=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θω(π)))

Let →st and →ω be the functions mapping a Θ to the same Θ but only defined over policy stubs/full policies respectively.

**Theorem 2: Isomorphism Theorem:** *For (causal, pseudocausal, acausal, surcausal) *Θst* or *Θω* which fulfill finitary or infinitary analogues of all the defining conditions, *↑(Θst)* and *↓(Θω)* are (causal, pseudocausal, acausal, surcausal) hypotheses. Also, *↑* and *→st* define an isomorphism between *Θ* and *Θst*, and *↓* and *→ω* define an isomorphism between *Θ* and *Θω*.*

This says we can uniquely recover the whole Θ from just behavior on policy-stubs, or full policies, as long as analogues of the hypothesis conditions hold for the stubs or full policies, and the full policy behavior and policy-stub behavior will uniquely pin each other down. The interesting part of this is recovering consistency (the key element of the isomorphism) from conditions which aren't consistency.

Now for the next one. Let SΘ be the set of a-environments (or a-survironments) (λe,b) s.t. for all πpa, (λ(πpa⋅e),b)∈Θ(πpa). Let ΘS, where S is an arbitrary set of a-environments, be the belief function where ΘS(πpa):={(λμ,b)|∃(λe,b)∈S:πpa⋅e=μ}

πpa⋅e is the probability distribution over F(πpa) produced by πpa interacting with the environment e.

**Proposition 1:** *If *Θ* fulfills the causality condition, nonemptiness, closure, and convexity, then *SΘ* is a nonempty, closed, convex set of a-environments or a-survironments and *ΘSΘ=Θ*. Also, *S⊆SΘS*.*

So, we can go from causal hypotheses to actual sets of a-environments (or a-survironments, as the case may be). If we're fortunate enough to have our set of a-environments induce a causal hypothesis (the hard part isn't causality, you get that for free, the hard part is stuff like Hausdorff-continuity and the extreme point property), then going back to environments may introduce additional points corresponding to the long-ago mentioned "chameleon environments" which mimic the behavior of preexisting environments for every policy. Since SΘ contains *every* such mimic, going from a belief function to a set of a-environments yields a canonical choice, removing the redundancy in sets of environments we mentioned back in the first post.

So, we can freely switch between finitary and infinitary versions of causal, pseudocausal, acausal, and surcausal hypotheses, they're isomorphic. And we can translate between causal/surcausal hypotheses and actual sets of environments. What about linking pseudocausal/acausal to causal/surcausal by strategic addition of Nirvana? Can we do that? Yes.

**Definition 20: Nirvana Injection**

*If *πhipa≥πlopa*, then *Iπlopa,πhipa* is the function *F(πlopa)→F(πhipa)* where*

h∈F(πhipa)→Iπlopa,πhipa(h)=h

h∉F(πhipa)→Iπlopa,πhipa(h)=hπhipa(h)N

Iπlopa,πhipa∗:Ma(F(πlopa))→Ma(F(πhipa))* is the pushforward induced by *Iπlopa,πhipa*.*

**Definition 21: Nirvana Injection (Surcausal)**

Iπlopa,πhipa∗s* is the same as *Iπlopa,πhipa∗*, except it labels every Nirvana-history not in *F(πlopa)* as possible.*

Note that these injections go up, not down, as projections do. I∗ is a sort of Nirvana-adding injection up where we cap off every history we need to extend with "Nirvana happens immediately afterwards, guaranteed". I∗s is defined identically, except that if a Nirvana-history would normally get 0 measure, it's assigned 0+ measure instead. It's more aggressive about adding Nirvana everywhere it can.

→c is the function mapping a pseudocausal Θst only defined on policy stubs, to a causal Θst only defined on policy stubs, via: →c(Θst)(πst):=¯¯¯¯¯¯¯¯c.h(⋃πlost≤πstIπlost,πst∗(Θst(πlost)))

And →sc is defined identically, just with I∗s, it maps acausal to surcausal.

→NF is just "intersect with NF", it turns a causal/surcausal Θst to a pseudocausal/acausal Θst.

Now that we have the machinery to go from pseudocausal/acausal to causal/surcausal, and back, we have:

**Theorem 3.1: Pseudocausal Translation:** *For all pseudocausal *Θst* hypotheses defined only on policy stubs, *→c(Θst)* is a causal hypothesis only defined on policy stubs. And,*→NF(→c(Θst))=Θst*. For all causal *Θst* hypotheses defined only on policy stubs, *→NF(Θst)* is a pseudocausal hypothesis only defined on policy stubs.*

**Theorem 3.2: Acausal Translation: ***For all acausal *Θst* hypotheses defined only on policy stubs, *→sc(Θst)* is a surcausal hypothesis only defined on policy stubs. And, *→NF(→sc(Θst))=Θst*. For all surcausal *Θst* hypotheses defined only on policy stubs, *→NF(Θst)* is an acausal hypothesis only defined on policy stubs.*

These two are quite powerful. First, they let us freely translate between pseudocausal/acausal and causal/surcausal. Second, there's the philosophical meaning.

If you think that Nirvana looks like an ugly hack, but aren't attached to viewing everything as a set of a-environments, then you can just go "Awesome, we can just strip off all the Nirvana and work with Nirvana-free belief functions directly!"

If you think Nirvana looks like a sensible way of making Murphy not pick certain outcomes to encode policy selection problems, but really want to view everything as coming from a set of a-environments, then you can go "ooh, by adding in Nirvana, we can get all these wonky belief functions into a causal form, and from there to a set of a-environments/a-survironments!"

I should also remark that in the "Nirvana is 1 reward forever" setting, this turns into an isomorphism and we get a parallel with the original formulation of an infradistribution where all possible points have been added to the set as long as they don't affect expected values.

Ok, so now that we can freely translate things... Well, there were an awful lot of conditions for something to be a hypothesis/belief function. Let's say our Θ is some random-ass batch of a-measures for each policy. Which conditions do we *need* to check in order to ensure that this is legitimate, and which hypothesis conditions can be massaged into existence if they weren't present already?

**Proposition 2:** *Given a Nirvana-free *Θ?ω* which is defined over full policies but not necessarily a hypothesis (may violate some conditions), the minimal constraints we must check of *Θ?ω* to turn it into an acausal hypothesis via convex hull, closure, upper-completion, renormalization, and using the Isomorphism Theorem to extend to all partial policies are: Nonemptiness, Restricted Minimals, Hausdorff-Continuity, and non-failure of renormalization.*

**Proposition 3:** *Given a Nirvana-free *Θ?ω* which can be turned into an acausal hypothesis, turning it into a *Θ* has: *EΘ(π)(f)=α(EΘ?ω(π)(f)−β)* for all *π* and *f*.*

So, when we turn an arbitrary batch of points for each policy into an acausal belief, it leaves the worst-case values the same, modulo a scale α and shift β. The process is basically "take closed convex hull, take upper completion, renormalize, generate the finite levels from the infinite ones by isomorphism"

Taking a break to recap, we can uniquely specify a hypothesis by only presenting Θ for policy-stubs, or full policies, and it works with any of our hypothesis variants. Causal and surcausal hypotheses let you craft an actual set of a-environments/a-survironments that they came from. We can freely translate pseudocausal/acausal hypotheses to causal/surcausal hypotheses, showing that either Nirvana or a-environments can be dropped, depending on philosophical taste. And finally, you only need to verify four conditions to craft an acausal hypothesis from some completely arbitrary Θ?ω, and the behavior matches up modulo a scale-and-shift.

Hm, what else might we want? Probably some result that says we got everything. Fortunately, there is:

**Proposition 4:** *For all hypotheses *Θ* and *Θ′

(∀π,f:EΘ(π)(f)=EΘ′(π)(f))↔(→NF(Θ)=→NF(Θ′))

In other words, if, regardless of the utility function we picked, Murphy can force the same minimum value for both belief functions, then when we slice away the Nirvana, they're identical, full stop.

Er... we wanted to do maximin policy selection, and since we're working with deterministic policies, it's unclear that an argmax exists, ie, that there are even maximizing policies at all. Fortunately, there are!

**Proposition 5:** *For all hypotheses *Θ*, and all continuous functions *g* from policies to functions *f∈C((A×O)ω,[0,1])*, then the set *argmaxπEΘ(π)(g(π))* exists and is closed.*

Of course, we only really need one function, our utility function of choice. I figured it was worthwhile to show it in more generality, maybe we'll need it one day for learning utility functions over time.

**Mixing and Updates**

"But wait", I hear you ask, "What about updates or mixing belief functions to make a prior? We didn't cover those at all!"

Well, for mixing, we've got three issues. We can easily mix belief functions for Nirvana-free Θω, by normal set mixing. (EζΘi)(π):=Eζ(Θi(π))
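As a concrete sketch of this set mixing, here's a toy model with finite sets of a-measures over a finite history space (all names and numbers hypothetical): the mixture set contains every point Σiζi(mi,bi) with one a-measure drawn from each component set, matching (EζΘi)(π):=Eζ(Θi(π)).

```python
from itertools import product

def mix_sets(component_sets, weights):
    """Set mixing: the zeta-weighted mixture contains every point
    sum_i zeta_i * (m_i, b_i), with one a-measure (m_i, b_i) drawn from
    each component set.  Measures are dicts over a finite history space."""
    mixed = set()
    for combo in product(*component_sets):
        m, b = {}, 0.0
        for w, (mi, bi) in zip(weights, combo):
            for hist, mass in mi.items():
                m[hist] = m.get(hist, 0.0) + w * mass
            b += w * bi
        mixed.add((frozenset(m.items()), b))
    return mixed

# Two hypothetical one-policy belief sets over histories "h0", "h1":
theta1 = [({"h0": 1.0}, 0.0), ({"h1": 1.0}, 0.1)]
theta2 = [({"h0": 0.5, "h1": 0.5}, 0.0)]
mix = mix_sets([theta1, theta2], [0.5, 0.5])
```

Note this is mixing of the raw sets only; closed convex hull and renormalization (the (EζΘi)R step below) are elided.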

Our first issue is that mixing on the full-policy level, and regenerating the lower levels via the Isomorphism theorem, is *not* the same as mixing the sets at the lower levels. If you really care about what happens at lower levels, it's (EζΘi)(πpa)=¯¯¯¯¯¯¯¯c.h(⋃π≥πpa(Eζ(prπ,πpa∗(Θi(π)))))

Our second issue is that mixing normalized belief functions may fail to be normalized. We'll use EζΘi for the raw mix, and (EζΘi)R for the renormalized mix.

And our third issue is that mixing breaks a condition in the causal case. So, if you want to mix causal hypotheses to make a prior, you should strip off the Nirvana, translate to a pseudocausal hypothesis, mix there, and then go back via Pseudocausal Translation. This is another issue that we expect to vanish if Nirvana just counts as 1 reward forever.

We do need the usual ∑iζiλ⊙i<∞ condition for our mixture of belief functions to be a belief function. Finally, to state the next theorem, we need to define what a nontrivial Θ is.

**Definition 22: Nontriviality**

*A hypothesis *Θ* is nontrivial iff there exists some policy *π* where *EΘ(π)(1)≠EΘ(π)(0)*. This is a very weak condition.*

Nontriviality of some Θi (a component hypothesis that you're mixing together to make a prior) is a sufficient (though not necessary) condition for the prior you make by mixing together the Θi to be able to be turned into a hypothesis by renormalizing.

**Proposition 6:** *For pseudocausal and acausal hypotheses *Θi* where *∑iζiλ⊙i<∞* and there exists a nontrivial *Θi*, then mixing them and renormalizing produces a pseudocausal or acausal hypothesis.*

**Proposition 7:** *For pseudocausal and acausal hypotheses, *E(EζΘi)(πpa)(f)=Eζ(EΘi(πpa)(f))

**Proposition 8:** *For pseudocausal and acausal hypotheses,*

prπhipa,πlopa∗((EζΘi)(πhipa))=Eζ(prπhipa,πlopa∗(Θi(πhipa)))

Ok, so mixing works sensibly. What about updates? Updates are somewhat complicated, because we have to slice away part of the policy tree, and incorporate a partial policy off-history into our update (again, for dynamic consistency reasons, we want to keep track of what's happening off-history, and we need to know our off-history policy in order to do that).

Also, there's renormalization to consider, and Nirvana is a bit of an issue. If Nirvana happens off-h, and we update, then that would fold a slice of infinite value into the b term, which can't happen. Our first order of business is taking care of the bookkeeping for getting rid of the h prefix.

**Definition 23: Partial Policies After **h

Π+|h* is the set of partial policies starting at *h*. For a policy in this set, you can translate to *Π+* by adding a prefix of *h* onto all the o-histories and specifying that the partial policy behaves appropriately to make the finite history *h* occur. In the reverse direction, you can take a partial policy capable of producing *h*, remove all o-histories without *h* as a prefix, and remove the *h* prefix from whatever is left.*

**Definition 24: History-Clip**

c* is the partial function that takes an o-history and clips the *h* prefix off, and is otherwise undefined. There's an implicit dependence on the history *h* that we're suppressing in the notation.*

**Definition 25: Off-History Policy**

π¬h* is some partial policy in *Π+* that's defined everywhere except on o-histories with *h* as a prefix, and is also consistent with the history *h*. It specifies everything except what happens after *h*, and can produce *h*.*

**Definition 26: Policy Gluing**

π¬h∙πpa* is the partial policy made by extending *πpa∈Π+|h* to be a partial policy in *Π+* and unioning that with *π¬h*. More simply, it's the policy made by "specify what you do off-h, and glue that onto what you do on-h".*

Also, now that we know how normalization works, we can define an analogue of our PgH(L) scale term that showed up in updating an infradistribution.

**Definition 27: Probability**

PgΘ,π¬h(h)* is the probability of history *h* relative to the belief function *Θ*, off-h policy *π¬h*, and function *g*.*

PgΘ,π¬h(h):=maxπ>π¬hEΘ(π)(1★hg)−EΘ(π¬h)(0★hg)

An important thing to note about these is that they're not additive, and it's probably better to think about them as the scale term necessary to renormalize after updating than as a true probability. However, various results end up using this where the analogous result in the standard Bayesian setting would have standard probability, so it plays a very similar role. And it recovers the usual notion of probability in the special case of one environment.

The rationale for this definition is that we want the worst possible case to have a value of 0, and the best possible case to have a value of 1. This requires rescaling by the gap between the worst-case and best-case outcomes. Further, we know some information about our policy (it behaves as π¬h does off-h, so the policy we selected must be above that), and our utility function off-history (it's g). Thus, the best-case is maxπ>π¬hEΘ(π)(1★hg), and the worst-case is minπ>π¬hEΘ(π)(0★hg). Further, we can prove minπ>π¬hEΘ(π)(0★hg)=EΘ(π¬h)(0★hg).

So, that's where this definition comes from. The gap between best-case and worst-case payoffs plays the analogous role to probability, and knowing π¬h and g gives you some info on what your best-case and worst-case outcomes are.
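The single-environment special case mentioned above can be sketched directly: with one environment, the off-history g-terms in the best-case and worst-case expectations coincide and cancel, leaving ordinary probability. A minimal toy version (history names hypothetical):

```python
def history_probability(env, h):
    """With a single environment, the best-case/worst-case gap
    E(1*^h g) - E(0*^h g) collapses: the off-history g-terms are the
    same in both expectations and cancel, leaving the ordinary
    probability mass on histories with h as a prefix."""
    return sum(p for hist, p in env.items() if hist.startswith(h))

# Hypothetical environment as a distribution over finite histories.
env = {"ha": 0.25, "hb": 0.25, "xc": 0.5}
p_h = history_probability(env, "h")
```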

One more note. We'll use m|h to refer to the measure m⋅1h, chopping down the measure to only the histories with h as a prefix. No renormalization back up afterwards.

*Finally*, we can define an update.

**Definition 28: Belief Function Update**

(Θ|g,π¬h,h)(πpa)* is the set made by intersecting *Θ(π¬h∙πpa)* with*

{(m,b)|m(nirvana occurs without h as a prefix)=0}

*pushing it through the following function, and taking the closure.*

(m,b)↦1PgΘ,π¬h(h)(c(m|h),b+m(0★hg)−EΘ(π¬h)(0★hg))

This process is: In order to find out what happens for a partial policy πpa post-h, we glue it to a complete specification of what happens off-h π¬h, clip away a bit of Θ(π¬h∙πpa) so that infinite reward off-h doesn't contaminate the b term, do some bookkeeping to prune the h prefixes off the post-h conditional measure we get, update, and renormalize. Then close if needed.
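Here's a minimal sketch of the pointwise map in Definition 28, for a-measures over a finite set of history strings. The Nirvana-clipping intersection step is elided, and all names and numbers are hypothetical:

```python
def update_point(m, b, h, g, worst_case, P):
    """Push one a-measure (m, b) through the map in Definition 28:
    keep the h-prefixed part of m with the prefix clipped off, fold the
    off-h contribution m(0*^h g) into b, subtract the worst-case term
    E_Theta(pi_{not-h})(0*^h g), and rescale by the 'probability' P."""
    m_h = {hist[len(h):]: mass / P
           for hist, mass in m.items() if hist.startswith(h)}
    off_h = sum(mass * g(hist)
                for hist, mass in m.items() if not hist.startswith(h))
    return m_h, (b + off_h - worst_case) / P

g = lambda hist: 0.5               # hypothetical off-history utility
new_m, new_b = update_point({"ha": 0.5, "xb": 0.5}, 0.0, "h",
                            g, worst_case=0.25, P=0.5)
```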

With all this extra detail, it's highly prudent to check the basic properties of an update. Fortunately, we have analogues of everything.

**Proposition 9:** *For causal, surcausal, pseudocausal and acausal hypotheses, updating them produces a causal, surcausal, pseudocausal or acausal hypothesis as long as renormalization doesn't fail.*

**Proposition 10:** *For causal, pseudocausal, acausal, and surcausal hypotheses,*

EΘ(π¬h∙πpa)(f★hg)=EΘ(π¬h)(0★hg)+PgΘ,π¬h(h)⋅E(Θ|g,π¬h,h)(πpa)(fh)

If you're wondering what fh is, it's the restriction of f to outcomes with h as a prefix, and then the h prefix gets clipped off. This is another bookkeeping thing.

**Proposition 11:** *If *hh′* is a valid o-history, then for causal, pseudocausal, acausal, and surcausal hypotheses, *(Θ|g,π¬h,h)|gh,π¬h′,h′=Θ|g,(π¬h∙π¬h′),hh′

Ah good, we have analogues of all the usual update properties. What about our Bayes Theorem analogue?

**Theorem 4: Belief Function Bayes:** *For pseudocausal and acausal hypotheses, if there's some *i* s.t. *Θi|g,π¬h,h* is well-defined and nontrivial, then*

(Eζ(Θi))R|g,π¬h,h=(Eζ(PgΘi,π¬h(h)⋅(Θi|g,π¬h,h))/Eζ(PgΘi,π¬h(h)))R

So, mixing hypotheses to make a prior, normalizing it, and updating, is the same as mixing the updated hypotheses in accordance with the probability they put on the observation, and rescaling that. Compare to Bayesian updating going "if you mix hypotheses to make a prior, and update, that's the same as mixing the updated hypotheses in accordance with the probability they put on the observation". Also, this means we don't generally have to worry about the agent giving up and crying when something in its prior says "nothing you do matters", because it'll just drop out of the updated prior completely and we can ignore it. So, as long as something in the prior says that what the agent does matters, we're good.
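The re-weighting structure here parallels ordinary Bayes. A toy sketch of just the weight bookkeeping (hypothetical numbers), showing how a component with "probability" 0 on the observation drops out of the updated prior:

```python
def updated_prior_weights(zeta, probs):
    """Re-weight mixture components by the 'probability' each one assigns
    to the observed history, then renormalize so weights sum to 1.
    Components assigning probability 0 vanish from the updated prior."""
    raw = [z * p for z, p in zip(zeta, probs)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical prior over three component hypotheses, and the
# 'probability' each assigns to the observed history h:
w = updated_prior_weights([0.25, 0.25, 0.5], [0.5, 0.0, 0.25])
```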

**Decision Theory**

Ok, the first, most basic thing we want is dynamic consistency. Does dynamic consistency work? Yes it does!

**Theorem 5: Dynamic Consistency:** *Given a hypothesis *Θ* (causal, pseudocausal, acausal, surcausal), and an arbitrary policy *π* and utility function *U*, then, with *πh* being the continuation of *π* post-update, *π¬h* being the off-h behavior of *π*, and *π∗* being such that *

E(Θ|U,π¬h,h)(πh)(Uh)⪋E(Θ|U,π¬h,h)(π∗)(Uh)

*then *EΘ(π)(U)⪋EΘ(π¬h∙π∗)(U)

Ok, ⪋ may be confusing. This is shorthand for "if we have a < in the first part, then we have a < in the second part, and likewise for = and >".

So, viewed from the start, you perfectly agree with future-you-that-updated-on-h's judgement. If they think a policy does better than the default candidate you stuck them with, you should defer to them. If they think a policy does worse than the default you stuck them with, you also agree with them on that too. So, there's no situations where you disapprove of future-you's decisions when they update on a history.

Ok, that's cool, but how does this thing stack up against UDT? Well, let's take an arbitrary set S of policy-selection environments. A policy selection environment is a function Π×(A×O)<ω×A→ΔO that is continuous in Π, so very late policy changes don't affect the probability distribution now very much. This doesn't necessarily assume you're predicted accurately! If the environment ignores your policy, it's just an ordinary environment, and if there's a noisy predictor, well, that's why we have ΔO. This covers any reasonable way in which the environment could depend on your policy.

S does need to have a uniform modulus of continuity, though. Ie, for all ϵ, there needs to be *some* ridiculously large time n such that any two policies that are identical up till time n make every policy selection environment in S vary in its behavior *now* by at most ϵ. If we're just encoding finitely many policy-selection environments, we automatically get this property.

**Theorem 6: Maximin UDT:** *Translating a set *S* of policy selection environments with a uniform modulus of continuity to an acausal hypothesis *Θ* (via making a *Θ?ω* where *Θ?ω(π):={(m,b)|b=0,∃e∈S:π⋅e=m}*, and turning it into an acausal hypothesis by the procedure in Proposition 2) always works. Also, for all utility functions* U,

argmaxπinfe∈SEπ⋅e(U)=argmaxπEΘ(π)(U)

So, this thing acts as maximin UDT on sets of policy-dependent environments, and in the case where there's a single policy-dependent environment, it replicates UDT perfectly. Nice!
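A minimal sketch of maximin policy selection over a finite set S, with the expected utilities Eπ⋅e(U) just tabulated as toy numbers (the policy and environment labels are hypothetical, Newcomb-flavored for color):

```python
# Expected utilities E_{pi . e}(U), tabulated directly as toy numbers.
payoff = {
    ("one-box", "accurate-omega"): 0.9,
    ("one-box", "broken-omega"):   0.4,
    ("two-box", "accurate-omega"): 0.1,
    ("two-box", "broken-omega"):   0.5,
}
policies = ["one-box", "two-box"]
envs = ["accurate-omega", "broken-omega"]

def maximin_policy(policies, envs, payoff):
    """argmax_pi of inf_e E_{pi . e}(U): pick the policy whose
    worst-case expected utility over the environment set is largest."""
    return max(policies, key=lambda p: min(payoff[(p, e)] for e in envs))

best = maximin_policy(policies, envs, payoff)
```

With a single environment in S, the min over environments is trivial and this reduces to plain argmax, matching the single-policy-dependent-environment UDT case.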

Let's ponder the bestiary of decision-theory problems. Note that the following analysis is cheating, in a sense: it just assumes we have the proper hypothesis corresponding to the problem encoded in our head, and doesn't cover learnability of the problem if we don't start out with the correct model. Much more interesting is whether we can learn a given decision theory problem, given some suitably broad prior, if we're repeatedly faced with it. Also, the "you should know your entire policy off-history" thing is just for defining the theoretical gold standard; real agents don't fulfill this property.

5-and-10 falls immediately.

Troll bridge is a bit interesting because we always have deterministic policies. We could stipulate that we always have an accurate internal sense of which action we meant to take, but it goes through an RNG first and we may end up exploring with that. We cross (the internal event of crossing means you probably cross, with a small chance of staying on the shore; the internal event of not-crossing means you probably don't cross, with a small probability of getting blown up on the bridge).

Doing what UDT would means we also get XOR blackmail, Counterfactual Mugging, Newcomb, Transparent Newcomb, and the version of Death In Damascus where you have to pay for a random coin to escape Death, along with many others.

We haven't yet crammed Absent-Minded Driver into this setting, because of the assumption that you know your past history. Maybe more on that later?

More interesting is which of these decision theory problems are pseudocausal vs acausal. Counterfactual Mugging, Newcomb, Death In Damascus, and XOR Blackmail are pseudocausal. Transparent Newcomb is also pseudocausal if there's some small probability of Omega making a mistake. Really, the issue is with problems where you have 0 probability of getting into a situation where you can prove a false prediction wrong, like perfect Transparent Newcomb. Perfect predictors aren't a problem, as long as you're dealing with a situation where you have to make a bad situation worse locally in exchange for making the situation impossible in the first place. They're only a problem (ie, make an acausal instead of pseudocausal hypothesis) for decision theory problems where you're like "I'll make this good situation worse locally in exchange for making the situation possible in the first place", because false predictions lock you out in that case.

Also, there's material on learnability and infra-POMDPs and Game Theory and "we have a theorem prover, what do we do with that" deferred for a later post. We're still working on this stuff, and the field feels unusually broad and tractable.

Let's finish up by presenting some material on learnability, and a proof sketch for the analogue of the Complete Class Theorem.

**Definition 29: **Uγ

*Use *h:n* to refer to the prefix string of the first *n* actions and observations for an infinite history *h*. Given a reward function *r:(A×O)<ω→[0,1]*, and a time-discount parameter *γ∈[0,1)*, *Uγ* is the utility function generated from *r* and *γ* via: *Uγ(h):=(1−γ)∑∞n=0γnr(h:n)

If γ is near 1, this is low time-discount. An important notation note: U is used to denote some generic utility function, while the family U is more like a utility function template, with the reward function specified but still missing the time-discount factor γ necessary to make it an actual utility function.
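A quick sketch of Definition 29 with the infinite sum truncated at a finite horizon (the truncation error is geometrically small; the constant reward function and history encoding are hypothetical toys):

```python
def U_gamma(r, prefix_of, gamma, horizon=2000):
    """U_gamma(h) = (1 - gamma) * sum_{n>=0} gamma^n * r(h:n), truncated
    at `horizon`; the discarded tail is at most gamma**horizon.
    `r` maps finite history prefixes to rewards in [0, 1]; `prefix_of`
    lazily produces the length-n prefix of the infinite history."""
    return (1 - gamma) * sum(gamma ** n * r(prefix_of(n))
                             for n in range(horizon))

# Toy check: constant reward 1 on every prefix gives utility ~1 for any gamma.
r = lambda prefix: 1.0                  # hypothetical reward function
prefix_of = lambda n: ("a", "o") * n    # hypothetical constant history
u = U_gamma(r, prefix_of, gamma=0.9)
```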

Let's define regret in the standard Bayesian case. For some environment e and utility function U, the regret of policy π is R(π,e,U):=maxπ∗(Eπ∗⋅e(U)−Eπ⋅e(U))

Ie, regret is the gap between the best possible value you could get on the environment, and the value you actually got.

In the Bayesian case, learnability of a set of environments ei w.r.t a family of utility functions U is: There exists some γ-indexed family of policies, πγ, where: ∀i:limγ→1R(πγ,ei,Uγ)=0

So, this means that there's some policy which takes in the time discount and decides what to do, and regardless of environment, you can find a really low time discount where the policy nearly matches the performance of a policy which starts out knowing what the true environment is.

Learnability can be thought of as a sort of weak no-traps condition. A trap is something that makes you lose a non-negligible amount of value in the low-time-discount limit. If you can learn a class of environments, that means that there's some policy that (in the low-time-discount limit), doesn't run into traps and figures out how to be near-optimal for the environment, no matter which environment in the class it finds itself in.
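To make the learnability picture concrete, here's a toy sketch: two deterministic two-armed environments, and a policy family that tries each arm once and then commits to the better one. Its discounted regret is O(1−γ), so it vanishes as γ→1 (all numbers hypothetical):

```python
def regret(env, gamma):
    """Discounted regret of a try-both-arms-then-commit learner on a
    deterministic two-armed environment env = (reward_A, reward_B).
    The optimal policy plays the better arm from step 0."""
    r_a, r_b = env
    best = max(env)
    # Learner: arm A at step 0, arm B at step 1, the better arm forever after.
    learner = (1 - gamma) * (r_a + gamma * r_b) + gamma ** 2 * best
    # Equals (1-gamma)*(best-r_a) + gamma*(1-gamma)*(best-r_b) -> 0 as gamma -> 1.
    return best - learner

envs = [(0.3, 0.8), (0.9, 0.2)]         # hypothetical environments
worst_regret = [max(regret(e, g) for e in envs) for g in (0.9, 0.99, 0.999)]
```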

Also, there's a notion of a Bayes-optimal family of policies. Given some nondogmatic (nothing with probability 0) prior ζ over environments ei, which can be viewed as one big environment eζ, a Bayes-optimal family of policies π∗γ is one where, for all γ<1, π∗,γ∈argmaxπEπ⋅eζ(Uγ)

One of the most basic results of learnability in the standard Bayesian setting is that if a collection of environments is learnable *at all*, then if we make any nondogmatic prior out of them, the Bayes-optimal family of policies for that prior will learn that collection of environments as well.

Now let's transplant these to our setting.

**Definition 30: Regret**

*The regret of the policy *π* on the hypothesis *Θ* with utility function *U* is*

R(π,Θ,U):=maxπ∗(EΘ(π∗)(U)−EΘ(π)(U))

Regret, as usual, is "what's the gap between the value Murphy forced for you, and the value Murphy forced for the optimal policy?" And then learnability of a family of hypotheses Θi turns into:

**Definition 31: Learnability**

*For a family of utility functions *U*, a family of hypotheses *Θi* is learnable if there is a *γ*-indexed family of policies *πγ*, where *∀i:limγ→1R(πγ,Θi,Uγ)=0

So, on every hypothesis in the family, you converge to having similar worst-case value as the optimal policy which already knows which component hypothesis it's in.

Fixing some prior (EζΘi)R, and abbreviating it as Θζ, there's a notion of a Bayes-optimal family of policies for the prior.

**Definition 32: InfraBayes-Optimal Family**

An InfraBayes-optimal family of policies π∗γ w.r.t. a prior Θζ is one where, for all γ<1, π∗,γ∈argmaxπEΘζ(π)(Uγ)

So, we've got nice parallels of everything. But do we still have the same result that an InfraBayes-optimal family for a prior will learn a collection of hypotheses if it's learnable? Yes.

**Proposition 12:** *If a collection of hypotheses *Θi* is learnable, then any InfraBayes-optimal family for a prior on them learns the collection of hypotheses as well.*

Now, what about the Complete Class Theorem? Let's say there are finitely many states S, observations O, and actions A, a function ob:S→ΔO that tells you the probabilities of observations given the state, and a payoff function P:S×A→[0,1]. Then we can view a policy π:O→A as a function fπ:S→[0,1], given by s↦Eob(s)(P(s,π(o))).
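A minimal sketch of this policy-to-payoff-function translation, with finite toy states, observations, and actions (all names and probabilities hypothetical):

```python
def f_pi(policy, ob, P, states):
    """Translate a deterministic policy pi: O -> A into its
    state-to-expected-payoff function s -> E_{o ~ ob(s)} P(s, pi(o))."""
    return {s: sum(prob * P[(s, policy[o])] for o, prob in ob[s].items())
            for s in states}

# Hypothetical toy instance: two states, two observations, two actions.
states = ["s0", "s1"]
ob = {"s0": {"o0": 0.75, "o1": 0.25},   # observation probabilities per state
      "s1": {"o0": 0.25, "o1": 0.75}}
P = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 0.0, ("s1", "a1"): 1.0}
payoffs = f_pi({"o0": "a0", "o1": "a1"}, ob, P, states)
```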

The standard formulation of the Complete Class Theorem permits mixtures between policies, and anything on the Pareto-frontier has some probability distribution over states where it's optimal. In our case, we (probably) have something much stronger. We stick with our deterministic policies.

**Definition 33: Convexity Property**

*A total order *>* on equivalence classes of policies *π* (*π* and *π′* are in the same equivalence class iff *fπ=fπ′*) fulfills the convexity property if, viewing the *fπ* as points in *[0,1]S*,*

fπ′∈c.h({fπ′′|π′′>π})+[0,∞)S→π′>π

To sum up what we've done so far, we associated each policy with its corresponding function from states to expected payoffs. We consider policies equivalent if they have the same expected payoff for all states, and then stick a total order on them. The convexity property says that if π′ Pareto-dominates, or equals, a state-to-payoff function made by mixing policies that outperform π (according to the total order), then it must outperform π as well. For any Pareto-optimal policy, we can make an ordering like this that puts it as the highest-ranked policy.

**Complete Class Conjecture: ***If we have a total order *>* over equivalence classes of policies that fulfills the convexity property, then there exists an infradistribution over states, *H*, where*

π′>π→EH(fπ′)>EH(fπ)

So, we can (probably, still a conjecture) match a wide variety of rankings of policies. Such an infradistribution can be thought of as a set of a-environments (with the appropriate mix of states comprising the environment), and then thought of as a causal hypothesis without Nirvana.

Well, what's the proof sketch of this? Why do we think it's true? By Theorem 5 of the previous post, there's a duality between infradistributions and concave monotone Lipschitz normalized functionals on [0,1]S (the space of continuous functions on states). So, viewing our policies as points in [0,1]S, we just need to craft a suitable function. We can do something like this.

The points are the scores of the policies in the states, and the number over each is its ranking in the total order, from worst to best. What we do is basically make a topographic map that's compatible with the total order, like we did here, except in more dimensions. Assigning each contour line from worst to best an elevation/value, and filling in the rest of the values by linear interpolation, we can get a quasiconcave monotone Lipschitz function (quasiconcave just means all superlevel sets are convex, which we obviously have) where 0 in all coordinates is assigned the lowest value, 1 in all coordinates is assigned the highest value, and the function respects the total ordering we have.

Then, we just have to turn this into a concave function and renormalize to get a function of the proper form, which we can turn into an infradistribution. Fortunately, there's a process called concavification, which lets you adjust the values assigned to the topographic contour lines to make a concave function. It's detailed in this paper. The one issue is that the conditions you have to verify are pretty nightmarish. Two of the three look fine, the nightmarish one is the third condition in Theorem 3 about having uniformly bounded variation of the logarithm of the upper derivative along all strictly increasing geodesic paths when we warp our function in a certain way. That's why this is still a conjecture.

There is, however, a weaker version of the Complete Class Theorem we were able to prove, the one where we can pick any Pareto-optimal policy and come up with an infradistribution over states where it is indeed optimal.

**Complete Class Theorem Weak Version: ***Given any Pareto-optimal policy *π*, there is an infradistribution *H* over states, where *∀π′:fπ′≠fπ:EH(fπ)>EH(fπ′)

A proof sketch for why this is true is: Because we are able to translate from concave Lipschitz monotone normalized functionals over [0,1]S to infradistributions over states, we just have to get a concave Lipschitz monotone functional where our policy π is optimal, and then it can be normalized back up to 1, and then it can be turned into an infradistribution over states by LF-duality. Said concave Lipschitz monotone functional is:

h(f):=mins∈S(f(s)−Eo∼ob(s)P(s,π(o)))

If f is some fπ′ (the payoff function induced by a policy π′), then this can be rewritten as:

h(fπ′):=mins∈S(fπ′(s)−fπ(s))

And so, now it's clearer what this means: the infradistribution picks the state where π has the largest advantage over π′. π is Pareto-optimal, so there's always a state where π does strictly better than π′.
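A toy numerical check of this functional (payoff vectors hypothetical): h assigns 0 to the Pareto-optimal policy's own payoff function and a strictly negative value to the competitors.

```python
def h(f, f_base):
    """h(f) = min over states of f(s) - f_pi(s): a min of affine
    functions, hence concave, and monotone in f, with the base
    (Pareto-optimal) policy's own payoff function scoring exactly 0."""
    return min(f[s] - f_base[s] for s in f_base)

# Hypothetical payoff vectors over two states; f_opt is Pareto-optimal.
f_opt  = {"s0": 0.75,  "s1": 0.5}
f_alt1 = {"s0": 0.625, "s1": 0.625}
f_alt2 = {"s0": 0.875, "s1": 0.375}
scores = (h(f_opt, f_opt), h(f_alt1, f_opt), h(f_alt2, f_opt))
```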

And that's about it for this post. On to further results and glory! (in progress)

Discuss

### Proofs Section 1.1 (Initial results to LF-duality)

Fair upfront warning: This is not a particularly readable proof section (though much better than Section 2 about belief functions). There's dense notation, logical leaps due to illusion of transparency since I've spent a month getting fluent with these concepts, and a relative lack of editing since it's long. If you really want to read this, I'd suggest PM-ing me to get a link to MIRIxDiscord, where I'd be able to guide you through it and answer questions.

**Proposition 1:** *If *f∈C(X,[0,1])* then *f+:(m,b)↦m(f)+b* is a positive functional on *Msa(X)*.*

Proof Sketch: We just check three conditions: linearity, being nonnegative on Msa(X), and continuity.

Linearity proof. Using a,a′ for constants,

f+(a(m,b)+a′(m′,b′))=f+(am+a′m′,ab+a′b′)=(am+a′m′)(f)+ab+a′b′

=a(m(f)+b)+a′(m′(f)+b′)=af+(m,b)+a′f+(m′,b′)

So we have verified that f+(aM+a′M′)=af+(M)+a′f+(M′) and we have linearity.

Positivity proof: An sa-measure M, writeable as (m,b), has m uniquely writeable as a pair of finite measures m+ (all the positive regions) and m− (all the negative regions) by the Jordan Decomposition Theorem, with b+m−(1)≥0. So,

f+(M)=m(f)+b=m+(f)+m−(f)+b≥0+m−(1)+b≥0

The first ≥ is by 1≥f≥0: the expectation m+(f) is nonnegative, and since m− is a negative measure, m−(f)≥m−(1). The second ≥ is by the condition on how m− relates to b.

Continuity proof: Fix a sequence (mn,bn) converging to (m,b). Obviously the b part converges, so now we just need to show that mn(f) converges to m(f). The metric we have on the space of finite signed measures is the KR-metric, and convergence in it gives exactly that mn(f)→m(f) for continuous f. This only works for continuous f, not general f.
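For a finite X these objects are easy to play with directly. A toy sketch (names and numbers hypothetical) checking nonnegativity on an sa-measure and linearity under scaling:

```python
def f_plus(m, b, f):
    """The positive functional (m, b) -> m(f) + b, for a finite signed
    measure m represented as a dict over a finite space X."""
    return sum(mass * f(x) for x, mass in m.items()) + b

# Hypothetical sa-measure: the negative part has m-(1) = -0.25, and
# b = 0.25 offsets it, so the sa-measure condition b + m-(1) >= 0 holds.
m = {"x0": 0.5, "x1": -0.25}
b = 0.25
f = lambda x: 1.0 if x == "x0" else 0.5   # hypothetical f in C(X, [0,1])
val = f_plus(m, b, f)
doubled = f_plus({x: 2 * mass for x, mass in m.items()}, 2 * b, f)
```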

**Theorem 1:** *Every positive functional on *Msa(X)* can be written as *(m,b)↦c(m(f)+b)*, where *c≥0*, and *f∈C(X,[0,1])

Proof Sketch: The first part is showing that it's impossible to have a positive functional where the b term doesn't matter, without the positive functional being the one that maps everything to 0. The second part of the proof is recovering our f by applying the positive functional to Dirac-delta measures δx, to see what the function must be on point x.

Part 1: Let's say f+ isn't 0, ie there's some nonzero (m,b) pair where f+(m,b)>0, and yet f+(0,1)=0 (which, by linearity, means that f+(0,b)=0 for all b). We'll show that this situation is impossible.

Then, 0<f+(m,b)=f+(m+,0)+f+(m−,b) by our starting assumption, and Jordan decomposition of m, along with linearity of positive functionals. Now, f+(m−,b)+f+(−2(m−),0)=f+(−(m−),b) because positive functionals are linear, and everything in that above equation is an sa-measure (flipping a negative measure makes a positive measure, which doesn't impose restrictions on the b term except that it be ≥0). And so, by nonnegativity of positive functionals on sa-measures, f+(m−,b)≤f+(−(m−),b). Using this, we get

f+(m+,0)+f+(m−,b)≤f+(m+,0)+f+(−(m−),b)

=f+(m+,0)+f+(−(m−),0)+f+(0,b)=f+(m+,0)+f+(−(m−),0)

Another use of linearity was invoked for the first = in the second line, and then the second = made use of our assumption that f+(0,b)=0 for all b.

At this point, we have derived that 0<f+(m+,0)+f+(−(m−),0). Both of these are positive measures. So, there exists some positive measure m′ where f+(m′,0)>0.

Now, observe that, for all b, 0=f+(0,b)=f+(m′,0)+f+(−(m′),b)

Let b be sufficiently huge to make (−(m′),b) into an sa-measure. Also, since f+(m′,0)>0, f+(−(m′),b)<0, which is impossible because positive functionals are nonnegative on all sa-measures. Contradiction. Due to the contradiction, if there's a nonzero positive functional, it must assign f+(0,1)>0, so let f+(0,1) be our c term.

Proof part 2: Let's try to extract our f. Let f(x):=f+(δx,0)/f+(0,1). This is just recovering the value of the hypothesized f on x by feeding our positive functional the measure δx that assigns 1 value to x and nothing else, and scaling. Now, we just have to verify that this f is continuous and in [0,1].

For continuity, let xn limit to x. By the KR-metric we're using, (δxn,0) limits to (δx,0). By continuity of f+, f+(δxn,0) limits to f+(δx,0). Therefore, f(xn) limits to f(x) and we have continuity.

For a lower bound, f≥0, because f(x) is a ratio of two nonnegative numbers, and the denominator isn't 0.

Now we just have to show that f≤1. For contradiction, assume there's an x where f(x)>1. Then f+(δx,0)/f+(0,1)>1, so f+(δx,0)>f+(0,1), and in particular, f+(0,1)−f+(δx,0)<0.

But then, f+(−(δx),1)+f+(δx,0)=f+(0,1), so f+(−(δx),1)=f+(0,1)−f+(δx,0)<0

However, (−(δx),1) is an sa-measure, because (−δx)(1)+1=0, and must have nonnegative value, so we get a contradiction. Therefore, f∈C(X,[0,1]).

To wrap up, we can go:

f+(m,b)=f+(m,0)+f+(0,b)=(f+(0,1)/f+(0,1))(∫X(f+(δx,0))dm+f+(0,b))

=f+(0,1)(∫X(f+(δx,0)/f+(0,1))dm+f+(0,b)/f+(0,1))=c(∫Xf(x)dm+b)=c(m(f)+b)

And c≥0, and f∈C(X,[0,1]), so we're done.
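As a concrete sanity check of Theorem 1's conclusion, here's a minimal sketch on a hypothetical 3-point space X, representing measures as lists of point masses (all names here are made up for illustration, not from the post): a functional of the form f+(m,b)=c(m(f)+b) with c≥0 and f valued in [0,1] is nonnegative on sa-measures, and f is recovered pointwise as f(x)=f+(δx,0)/f+(0,1), exactly as in Proof part 2.

```python
# Finite-X sketch of Theorem 1 (hypothetical X = {0,1,2}).
c = 2.0
f = [0.25, 1.0, 0.0]          # a [0,1]-valued function on X

def f_plus(m, b):
    # the positive functional c*(m(f) + b); m is a signed measure
    # given by its values on the three points
    return c * (sum(mi * fi for mi, fi in zip(m, f)) + b)

def is_sa_measure(m, b):
    # sa-measure condition: b + m-(1) >= 0
    return b + sum(min(mi, 0.0) for mi in m) >= 0

# nonnegativity on a few sa-measures
examples = [([1.0, -0.5, 0.2], 0.5), ([0.0, 0.0, 0.0], 1.0), ([-1.0, 2.0, -1.0], 2.0)]
assert all(is_sa_measure(m, b) for m, b in examples)
assert all(f_plus(m, b) >= 0 for m, b in examples)

# recovering f pointwise from the functional, as in the proof
recovered = [f_plus([1.0 if i == j else 0.0 for i in range(3)], 0) / f_plus([0, 0, 0], 1)
             for j in range(3)]
assert recovered == f
```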

**Lemma 1: Compactness Lemma:** *Fixing some nonnegative constants λ◯ and b◯, the set of sa-measures where m+(1)∈[0,λ◯] and b∈[0,b◯] is compact. Further, if a set lacks an upper bound on m+(1) or on b, it's not compact.*

Proof Sketch: We fix an arbitrary sequence of sa-measures, and then use the fact that closed intervals are compact and the space ΔX is a compact complete metric space to isolate a suitable convergent subsequence. Since all sequences have a limit point, the set is compact. Then, we go in the other direction, and get a sequence with no limit points assuming either a lack of upper bounds on m+(1), or a lack of upper bounds on b.

Proof: Fix some arbitrary sequence Mn wandering about within this space, which breaks down into (m+n,0)+(m−n,bn), and then, since all measures are just a probability distribution scaled by the constant m(1), it further breaks down into (m+n(1)⋅μn,0)+(m−n(1)⋅μ′n,bn). Since bn+m−n(1)≥0, m−n(1) must be bounded in [−b◯,0].

Now, what we can do is extract a subsequence where bn, m+n(1), m−n(1), μn, and μ′n all converge, by Tychonoff's Theorem (finite product, no axiom of choice required). Our three number sequences are all confined to a bounded interval, and our two probability sequences are wandering around within ΔX, which is a compact complete metric space if X is. The limit of this subsequence is a limit point of the original sequence, since all its components are arbitrarily close to the components that make up Mn for large enough n in our subsequence.

The limiting value of m+(1) and b both obey their respective bounds, and the cone of sa-measures is closed, so the limit point is an sa-measure and respects the bounds too. Therefore the set is compact, because all sequences of points in it have a limit point.

In the other direction, assume a set B has unbounded b values. Then we can fix a sequence (mn,bn)∈B where bn increases without bound, so the sa-measures can't converge. The same applies to all subsequences, so there's no limit point, so B isn't compact.

Now, assume a set B has bounded b values, call the least upper bound b⊙, but the value of m+(1) is unbounded. Fix a sequence (mn,bn)∈B where m+n(1) is unbounded above. Assume a convergent subsequence exists. Since bn+m−n(1)≥0, m−n(1) must be bounded in [−b⊙,0]. Then, because mn(1)=m+n(1)+m−n(1)≥m+n(1)−b⊙, and b⊙ is finite while m+n(1) is unbounded above, mn(1) must be unbounded above. However, in order for the mn to limit to some m, limn→∞mn(1)=m(1), which results in a contradiction. Therefore, said convergent subsequence doesn't exist, and B is not compact.

Put together, we have a necessary-and-sufficient condition for a closed subset of Msa(X) to be compact. There must be an upper bound on b and m+(1), respectively.
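On a finite space, the Jordan decomposition used throughout is just splitting coordinates by sign, and the Compactness Lemma's criterion is a pair of box constraints. A minimal sketch under that finite-X assumption (names illustrative):

```python
# Finite-X sketch: Jordan decomposition m = m+ + m-, and the
# Compactness Lemma's criterion bounding m+(1) and b.

def jordan(m):
    # split a signed measure (list of point masses) by sign
    m_plus = [max(mi, 0.0) for mi in m]
    m_minus = [min(mi, 0.0) for mi in m]
    return m_plus, m_minus

def in_compact_box(m, b, lam, b_max):
    # the Lemma 1 criterion: m+(1) in [0, lam], b in [0, b_max]
    m_plus, _ = jordan(m)
    return 0.0 <= sum(m_plus) <= lam and 0.0 <= b <= b_max

m = [0.5, -0.25, 1.0]
m_plus, m_minus = jordan(m)
assert m_plus == [0.5, 0.0, 1.0] and m_minus == [0.0, -0.25, 0.0]
assert [p + q for p, q in zip(m_plus, m_minus)] == m    # m = m+ + m-
assert in_compact_box(m, 0.25, lam=2.0, b_max=1.0)      # m+(1)=1.5 <= 2, b=0.25 <= 1
assert not in_compact_box(m, 0.25, lam=1.0, b_max=1.0)  # m+(1)=1.5 > 1
```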

**Lemma 2:** *The upper completion of a closed set of sa-measures is closed.*

Proof sketch: We'll take a convergent sequence (mn,bn) in the upper completion of B that limits to (m,b), and show that, in order for it to converge, the same sorts of bounds as the Compactness Lemma uses must apply. Then, breaking down (mn,bn) into (mBn,bBn)+(m∗n,b∗n), where (mBn,bBn)∈B, and (m∗n,b∗n) is an sa-measure, we'll transfer these Compactness-Lemma-enabling bounds to the sequences (mBn,bBn) and (m∗n,b∗n), to get that they're both wandering around in a compact set. Then, we just take a convergent subsequence of both, add the two limit points together, and get our limit point (m,b), witnessing that it's in the upper completion of B.

Proof: Let (mn,bn)∈B+Msa(X) limit to some (m,b). A convergent sequence (plus its one limit point) is a compact set of points, so, by the Compactness Lemma, there must be a b◯ and λ◯ that are upper bounds on the bn and m+n(1) values, respectively.

Now, for all n, break down (mn,bn) as (mBn,bBn)+(m∗n,b∗n), where (mBn,bBn)∈B, and (m∗n,b∗n) is an sa-measure.

Because bBn+b∗n=bn≤b◯, we can bound the bBn and b∗n quantities by b◯. This transfers into a −b◯ lower bound on mB−n(1) and m∗−n(1), respectively.

Now, we can go:

mB+n(1)+mB−n(1)+m∗+n(1)+m∗−n(1)=mBn(1)+m∗n(1)=mn(1)

=m+n(1)+m−n(1)≤m+n(1)≤λ◯

Using worst-case values for mB−n(1) and m∗−n(1), we get:

mB+n(1)+m∗+n(1)−2b◯≤λ◯

mB+n(1)+m∗+n(1)≤λ◯+2b◯

So, λ◯+2b◯ is an upper bound on both mB+n(1) and m∗+n(1).

Due to the sequences (mBn,bBn) and (m∗n,b∗n) respecting bounds on b and m+(1) (b◯ and λ◯+2b◯ respectively), and wandering around within the closed sets B and Msa(X) respectively, we can use the Compactness Lemma and Tychonoff's theorem (finite product, no axiom of choice needed) to go "hey, there's a subsequence where both (mBn,bBn) and (m∗n,b∗n) converge, call the limit points (mB,bB) and (m∗,b∗). Since B and Msa(X) are closed, (mB,bB)∈B, and (m∗,b∗)∈Msa(X)."

Now, does (mB,bB)+(m∗,b∗)=(m,b)? Well, for any ϵ, there's some really large n where d((mBn,bBn),(mB,bB))<ϵ, d((m∗n,b∗n),(m∗,b∗))<ϵ, and d((mn,bn),(m,b))<ϵ. Then, we can go:

d((m,b),(mB,bB)+(m∗,b∗))≤d((m,b),(mn,bn))+d((mn,bn),(mB,bB)+(m∗,b∗))

=d((m,b),(mn,bn))+d((mBn,bBn)+(m∗n,b∗n),(mB,bB)+(m∗,b∗))

=d((m,b),(mn,bn))+||((mBn,bBn)+(m∗n,b∗n))−((mB,bB)+(m∗,b∗))||

=d((m,b),(mn,bn))+||((mBn,bBn)−(mB,bB))+((m∗n,b∗n)−(m∗,b∗))||

≤d((m,b),(mn,bn))+||(mBn,bBn)−(mB,bB)||+||(m∗n,b∗n)−(m∗,b∗)||

=d((m,b),(mn,bn))+d((mBn,bBn),(mB,bB))+d((m∗n,b∗n),(m∗,b∗))<3ϵ

So, regardless of ϵ, d((m,b),(mB,bB)+(m∗,b∗))<3ϵ, so (mB,bB)+(m∗,b∗)=(m,b). So, we've written (m,b) as a sum of an sa-measure in B and an sa-measure, certifying that (m,b)∈B+Msa(X), so B+Msa(X) is closed.

**Proposition 2:** *For closed convex nonempty* B,B+Msa(X)={M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Proof sketch: Show both subset inclusion directions. One is very easy, then we assume the second direction is false, and invoke the Hahn-Banach theorem to separate a point in the latter set from the former set. Then we show that the separating functional is a positive functional, so we have a positive functional where the additional point underperforms everything in B+Msa(X), which is impossible by the definition of the latter set.

Easy direction: We will show that B+Msa(X)⊆{M|∀f+∃M′∈B:f+(M)≥f+(M′)}

This is because any M∈B+Msa(X) can be written as M=MB+M∗. Let MB be our M′ of interest. Then, it is indeed true that for all f+, f+(M)=f+(MB)+f+(M∗)≥f+(MB)

Hard direction: Assume by contradiction that

B+Msa(X)⊂{M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Then there's some M where ∀f+∃M′∈B:f+(M)≥f+(M′) and M∉B+Msa(X). B+Msa(X) is the upper completion of a closed set, so by Lemma 2, it's closed, and since it's the Minkowski sum of convex sets, it's convex.

Now, we can use the variant of the Hahn-Banach theorem from the Wikipedia article on "Hahn-Banach theorem", in the "separation of a closed and compact set" section. Our single point M is compact, convex, nonempty, and disjoint from the closed convex set B+Msa(X). Banach spaces are locally convex, so we can invoke Hahn-Banach separation.

Therefore, there's some continuous linear functional ϕ s.t. ϕ(M)<infM′∈(B+Msa(X))ϕ(M′)

We will show that this linear functional is actually a positive functional!

Assume there's some sa-measure M∗ where ϕ(M∗)<0. Then we can pick any MB∈B, and consider ϕ(MB+cM∗), where c is extremely large. MB+cM∗ lies in B+Msa(X), but it would also produce an extremely negative value for ϕ which undershoots ϕ(M), which is impossible. So ϕ is a positive functional.

However, ϕ(M)<infM′∈(B+Msa(X))ϕ(M′), so ϕ(M)<infM′∈Bϕ(M′). But also, M fulfills the condition ∀f+∃M′∈B:f+(M)≥f+(M′), because of the set it came from. So, there must exist some M′∈B where ϕ(M)≥ϕ(M′). But, we have a contradiction, because ϕ(M)<infM′∈Bϕ(M′).

So, there cannot be any point in {M|∀f+∃M′∈B:f+(M)≥f+(M′)} that isn't in B+Msa(X). This establishes equality.
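The easy direction of Proposition 2 can be checked numerically on a finite hypothetical X (a sketch, with made-up points; the positive functionals take the c(m(f)+b) form from Theorem 1): adding an sa-measure M∗ to a point MB can only increase the value of every positive functional.

```python
# Finite-X sketch of Proposition 2's easy direction.
import random

random.seed(0)

def apply_functional(f, c, m, b):
    # positive functional c*(m(f) + b), c >= 0, f in C(X,[0,1])
    return c * (sum(mi * fi for mi, fi in zip(m, f)) + b)

MB = ([1.0, -0.5], 0.5)        # a point of B (an sa-measure: 0.5 - 0.5 >= 0)
Mstar = ([-0.25, 2.0], 0.25)   # an sa-measure: 0.25 - 0.25 >= 0
M = ([a + b for a, b in zip(MB[0], Mstar[0])], MB[1] + Mstar[1])

# every sampled positive functional scores M at least as high as MB
for _ in range(100):
    f = [random.random() for _ in range(2)]
    c = random.random()
    assert apply_functional(f, c, *M) >= apply_functional(f, c, *MB)
```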

**Lemma 3:** *For any closed set B⊆Msa(X) and point M∈B, the set ({M}−Msa(X))∩B is nonempty and compact.*

Proof: It's easy to verify nonemptiness, because M is in the set. Also, it's closed because it's the intersection of two closed sets. B was assumed closed, and the other part is the Minkowski sum of {M} and −Msa(X), which is closed if −Msa(X) is, because it's just a shift of −Msa(X) (via a single point). −Msa(X) is closed because it's -1 times a closed set.

We will establish a bound on the m+(1) and b values of anything in the set, which lets us invoke the Compactness Lemma to show compactness, because it's a closed subset of a compact set.

Note that if M′∈({M}−Msa(X))∩B, then M′=M−M∗, so M′+M∗=M. Rewrite this as (m′,b′)+(m∗,b∗)=(m,b)

Because b′+b∗=b, we can bound b′ and b∗ by b. This transfers into a −b lower bound on m′−(1) and m∗−(1). Now, we can go:

m′+(1)+m′−(1)+m∗+(1)+m∗−(1)=m′(1)+m∗(1)=m(1)

=m+(1)+m−(1)≤m+(1)

Using worst-case values for m′−(1) and m∗−(1), we get:

m′+(1)+m∗+(1)−2b≤m+(1)

m′+(1)≤m′+(1)+m∗+(1)≤m+(1)+2b

So, we have an upper bound of m+(1)+2b on m′+(1), and an upper bound of b on b′. Further, (m′,b′) was arbitrary in ({M}−Msa(X))∩B, so we have our bounds. This lets us invoke the Compactness Lemma, and conclude that said closed set is compact.

**Lemma 4:** *If ≥ is a partial order on B where M′≥M iff there's some sa-measure M∗ where M=M′+M∗, then*

∃M′>M↔(M∈B∧∃M′≠M:M′∈({M}−Msa(X))∩B)↔M is not minimal in B

Proof: ∃M′>M↔∃M′≠M:M′≥M

Also, M′≥M↔(M′,M∈B∧∃M∗:M=M′+M∗)

Also, ∃M∗:M=M′+M∗↔∃M∗:M−M∗=M′↔M′∈({M}−Msa(X))

Putting all this together, we get

(∃M′>M)↔(M∈B∧∃M′≠M:M′∈({M}−Msa(X))∩B)

And we're halfway there. Now for the second half.

M is not minimal in B↔M∈B∧(∃M′∈B:M′≠M∧(∃M∗:M=M′+M∗))

Also, ∃M∗:M=M′+M∗↔∃M∗:M−M∗=M′↔M′∈({M}−Msa(X))

Putting this together, we get

M is not minimal in B↔(M∈B∧∃M′≠M:M′∈({M}−Msa(X))∩B)

And the result has been proved.

**Theorem 2:** *Given a nonempty closed set B, the set of minimal points Bmin is nonempty and all points in B are above a minimal point.*

Proof sketch: First, we establish a partial order that's closely tied to the ordering on B, but flipped around, so minimal points in B are maximal elements. We show that it is indeed a partial order, letting us leverage Lemma 4 to translate between the partial order and the set B. Then, we show that every chain in the partial order has an upper bound via Lemma 3 and compactness arguments, letting us invoke Zorn's lemma to show that everything in the partial order is below a maximal element. Then, we just do one last translation to show that minimal points in B perfectly correspond to maximal elements in our partial order.

Proof: first, impose a partial order on B, where M′≥M iff there's some sa-measure M∗ where M=M′+M∗. Notice that this flips the order. If an sa-measure is "below" another sa-measure in the sa-measure addition sense, it's above that sa-measure in this ordering. So a minimal point in B would be maximal in the partial order. We will show that it's indeed a partial order.

Reflexivity is immediate. M=M+(0,0), so M≥M.

For transitivity, assume M′′≥M′≥M. Then there's some M∗ and M′∗ s.t. M=M′+M∗, and M′=M′′+M′∗. Putting these together, we get M=M′′+(M∗+M′∗), and adding sa-measures gets you an sa-measure, so M′′≥M.

For antisymmetry, assume M′≥M and M≥M′. Then M=M′+M∗, and M′=M+M′∗. By substitution, M=M+(M∗+M′∗), so M′∗=−M∗. For all positive functionals, f+(M′∗)=f+(−M∗)=−f+(M∗), and since positive functionals are always nonnegative on sa-measures, the only way this can happen is if M∗ and M′∗ are 0, showing that M=M′.

Anyways, since we've shown that it's a partial order, all we now have to do is show that every chain has an upper bound in order to invoke Zorn's lemma to show that every point in B lies below some maximal element.

Fix some ordinal-indexed chain Mγ, and associate each of them with the set Sγ=({Mγ}+(−Msa(X)))∩B, which is compact by Lemma 3 and always contains Mγ.

The collection of Sγ also has the finite intersection property, because, fixing finitely many of them, we can consider a maximal γ∗, and Mγ∗ is in every associated set by:

Case 1: Some other Mγ equals Mγ∗, so Sγ=Sγ∗ and Mγ∗∈Sγ∗=Sγ.

Case 2: Mγ∗>Mγ, and by Lemma 4, Mγ∗∈({Mγ}−Msa(X))∩B.

Anyways, since all the Sγ are compact, and have the finite intersection property, we can intersect them all and get a nonempty set containing some point M∞. M∞ lies in B, because all the sets we intersected were subsets of B. Also, because M∞∈({Mγ}−Msa(X))∩B for all γ in our chain, then if M∞≠Mγ, Lemma 4 lets us get M∞>Mγ, and if M∞=Mγ, then M∞≥Mγ. Thus, M∞ is an upper bound for our chain.

By Zorn's Lemma, because every chain has an upper bound, there are maximal elements in B, and every point in B has a maximal element above it.

To finish up, use Lemma 4 to get: M is maximal↔¬∃M′>M↔M is minimal in B

**Proposition 3:** Given an f∈C(X,[0,1]), and a B that is nonempty closed, inf(m,b)∈B(m(f)+b)=inf(m,b)∈Bmin(m(f)+b)

Direction 1: since Bmin is a subset of B, we get one direction easily, that

inf(m,b)∈B(m(f)+b)≤inf(m,b)∈Bmin(m(f)+b)

Direction 2: Take an M∈B. By Theorem 2, there is an Mmin∈Bmin s.t. M=Mmin+M∗. Applying our positive functional m(f)+b (by Proposition 1), we get that m(f)+b≥mmin(f)+bmin. Because every point in B has a point in Bmin which scores as low or lower according to the positive functional,

inf(m,b)∈B(m(f)+b)≥inf(m,b)∈Bmin(m(f)+b)

And this gives us our desired equality.
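For a finite B on a finite hypothetical X, Proposition 3 can be checked directly (a sketch with made-up points): minimal points are the ones that can't be written as another point plus an sa-measure, and they already attain the inf of m(f)+b.

```python
# Finite-X, finite-B sketch of Proposition 3.
def is_sa(m, b):
    return b + sum(min(mi, 0.0) for mi in m) >= 0

def minus(M, Mp):
    return ([a - b for a, b in zip(M[0], Mp[0])], M[1] - Mp[1])

B = [([1.0, 0.0], 0.0),        # minimal
     ([0.0, 1.0], 0.0),        # minimal
     ([1.5, 0.5], 0.25),       # = first point + sa-measure ([0.5,0.5],0.25)
     ([0.0, 1.0], 1.0)]        # = second point + (0,1)

def is_minimal(M):
    # M is minimal iff no other point of B plus an sa-measure gives M
    return not any(Mp != M and is_sa(*minus(M, Mp)) for Mp in B)

B_min = [M for M in B if is_minimal(M)]
assert B_min == [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]

# the inf of m(f)+b over B is already attained on B_min
f = [0.75, 0.25]
vals = lambda S: min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in S)
assert vals(B) == vals(B_min)
```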

**Proposition 4:** *Given a nonempty closed convex B, Bmin=(Buc)min and (Bmin)uc=Buc*

Proof: First, we'll show Bmin=(Buc)min. We'll use the characterization in terms of the partial order ≤ we used for the Zorn's Lemma proof of Theorem 2. If a point M is in Buc, then it can be written as M=MB+M∗, so M≤MB. Since all points added in Buc lie below a preexisting point in B (according to the partial order from Theorem 2), the set of maximals (i.e., the set of minimal points) is completely unchanged when we add all the new points to the partial order via upper completion, so Bmin=(Buc)min.

For the second part, one direction is immediate. Bmin⊆B, so (Bmin)uc⊆Buc. For the reverse direction, take a point M∈Buc. It can be decomposed as MB+M∗, and then by Theorem 2, MB can be decomposed as Mmin+M′∗, so M=Mmin+(M∗+M′∗), so it lies in (Bmin)uc, and we're done.

**Theorem 3:** *If the nonempty closed convex sets A and B have Amin≠Bmin, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f)*

Proof sketch: We show that upper completion is idempotent, and then use that to show that the upper completions of A and B are different. Then, we can use Hahn-Banach to separate a point of A from Buc (or vice-versa), and show that the separating functional is a positive functional. Finally, we use Theorem 1 to translate from a separating positive functional to different expectation values of some f∈C(X,[0,1])

Proof: Phase 1 is showing that upper completion is idempotent. (Buc)uc=Buc. One direction of this is easy, Buc⊆(Buc)uc. In the other direction, let M∈(Buc)uc. Then we can decompose M into M′+M∗, where M′∈Buc, and decompose that into MB+M′∗ where MB∈B, so M=MB+(M∗+M′∗) and M∈Buc.

Now for phase 2, we'll show that the minimal points of one set aren't in the upper completion of the other set. Assume, for contradiction, that this is false, so Amin⊆Buc and Bmin⊆Auc. Then, by idempotence, Proposition 4, and our subset assumption,

Auc=(Amin)uc⊆(Buc)uc=Buc

Swapping the A and B, the same argument holds, so Auc=Buc, so (Buc)min=(Auc)min.

Now, using this and Proposition 4, Bmin=(Buc)min=(Auc)min=Amin.

But wait, we have a contradiction, we said that the minimal points of B and A weren't the same! Therefore, either Bmin⊈Auc, or vice-versa. Without loss of generality, assume that Bmin⊈Auc.

Now for phase 3, Hahn-Banach separation to get a positive functional with different inf values. Take a point MB in Bmin that lies outside Auc. Now, use the Hahn-Banach separation of {MB} and Auc used in the proof of Proposition 2, to get a linear functional ϕ (which can be demonstrated to be a positive functional by the same argument as the proof of Proposition 2) where: ϕ(MB)<infM∈Aucϕ(M). Thus, infM∈Bϕ(M)<infM∈Aϕ(M), so infM∈Bϕ(M)≠infM∈Aϕ(M)

Said positive functional can't be 0, otherwise both sides would be 0. Thus, by Theorem 1, ϕ((m,b))=a(m(f)+b) where a>0, and f∈C(X,[0,1]). Swapping this out, we get:

inf(m,b)∈Ba(m(f)+b)≠inf(m′,b′)∈Aa(m′(f)+b′)

inf(m,b)∈B(m(f)+b)≠inf(m′,b′)∈A(m′(f)+b′)

and then this is EB(f)≠EA(f). So, we have crafted our f∈C(X,[0,1]) which distinguishes the two sets, and we're done.

**Corollary 1:** *If two nonempty closed convex upper-complete sets A and B are different, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f)*

Proof: Either Amin≠Bmin, in which case we can apply Theorem 3 to separate them, or their sets of minimal points are the same. In that case, by Proposition 4 and upper completion, A=Auc=(Amin)uc=(Bmin)uc=Buc=B and we have a contradiction because the two sets are different.

**Theorem 4:** *If H is an infradistribution/bounded infradistribution, then h:f↦EH(f) is concave in f, monotone, uniformly continuous/Lipschitz, h(0)=0, h(1)=1, and if range(f)⊈[0,1], h(f)=−∞*

Proof sketch: h(0)=0,h(1)=1 is trivial, as is uniform continuity from the weak bounded-minimal condition. For concavity and monotonicity, it's just some inequality shuffling, and for h(f)=−∞ if f∈C(X),f∉C(X,[0,1]), we use upper completion to have its worst-case value be arbitrarily negative. Lipschitzness is much more difficult, and comprises the bulk of the proof. We get a duality between minimal points and hyperplanes in C(X)⊕R, show that all the hyperplanes we got from minimal points have the same Lipschitz constant upper bound, and then show that the chunk of space below the graph of h itself is the same as the chunk of space below all the hyperplanes we got from minimal points. Thus, h has the same (or lesser) Lipschitz constant as all the hyperplanes chopping out stuff above the graph of h.

Proof: For normalization, h(1)=EH(1)=1 and h(0)=EH(0)=0 by normalization for H. Getting the uniform continuity condition from the weak-bounded-minimal condition on an infradistribution H is also trivial, because the condition just says f↦EH(f) is uniformly continuous, and that's just h itself.

Let's show that h is concave over C(X,[0,1]), first. We're shooting for h(pf+(1−p)f′)≥ph(f)+(1−p)h(f′). To show this,

h(pf+(1−p)f′)=EH(pf+(1−p)f′)=inf(m,b)∈H(m(pf+(1−p)f′)+b)

=inf(m,b)∈H(p(m(f)+b)+(1−p)(m(f′)+b))

≥pinf(m,b)∈H(m(f)+b)+(1−p)inf(m′,b′)∈H(m′(f′)+b′)

=pEH(f)+(1−p)EH(f′)=ph(f)+(1−p)h(f′)

And concavity has been proved.

Now for monotonicity. By Proposition 3 and Proposition 1,

∀f:inf(m,b)∈H(m(f)+b)=inf(m,b)∈Hmin(m(f)+b)

Now, let's say f′≥f. Then:

EH(f)=inf(m,b)∈H(m(f)+b)=inf(m,b)∈Hmin(m(f)+b)≤inf(m,b)∈Hmin(m(f′)+b)

=inf(m,b)∈H(m(f′)+b)=EH(f′)

And we're done. The critical inequality in the middle came from all minimal points in an infradistribution having no negative component by positive-minimals, so swapping out a function for a greater function produces an increase in value.

Time for range(f)⊈[0,1]→h(f)=−∞. Let's say there exists an x s.t. f(x)>1. We can take an arbitrary sa-measure (m,b)∈H, and consider (m,b)+c(−δx,1), where δx is the point measure that's 1 on x, and c is extremely huge. The latter part is an sa-measure. But then, (m−cδx)(f)+(b+c)=m(f)+b+c(1−δx(f))=m(f)+b+c(1−f(x)). Since f(x)>1, and c is extremely huge, this is extremely negative. So, since there's sa-measures that make the function as negative as we wish in H by upper-completeness, inf(m,b)∈H(m(f)+b)=−∞. A very similar argument can be done if there's an x where f(x)<0, we just add in (cδx,0) to force arbitrarily negative values.
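The argument above can be traced numerically on a finite hypothetical X (a sketch with made-up values): if f(x)>1 somewhere, adding c(−δx,1), which is an sa-measure, to any (m,b) shifts the value by c(1−f(x)), so it drops without bound as c grows.

```python
# Finite-X sketch of the range(f) outside [0,1] argument in Theorem 4.
f = [1.5, 0.5]                 # f(x0) = 1.5 > 1
x = 0                          # the offending point
m, b = [1.0, 1.0], 0.0         # an arbitrary a-measure in H

def value(m, b):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

def add_scaled_delta(m, b, c):
    # (m, b) + c*(-delta_x, 1)
    m2 = list(m)
    m2[x] -= c
    return m2, b + c

base = value(m, b)             # = 2.0
for c in [1.0, 10.0, 100.0]:
    v = value(*add_scaled_delta(m, b, c))
    assert v == base + c * (1 - f[x])   # shifts by c*(1 - 1.5), decreasing without bound
assert value(*add_scaled_delta(m, b, 100.0)) < -40
```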

Now for Lipschitzness, which is by far the worst of all. A minimal point (m,b) induces an affine function hm,b (kinda like a hyperplane) of the form hm,b(f)=m(f)+b. Regardless of (m,b), as long as it came from a minimal point in H, hm,b≥h for functions with range in [0,1], because

hm,b(f)=m(f)+b≥inf(m,b)∈H(m(f)+b)=EH(f)=h(f)

Ok, so if a point is on-or-below the graph of h over C(X,[0,1]), then it's on-or-below the graph of hm,b for all (m,b)∈Hmin.

What about the other direction? Is it possible for a point (f,b′) to be strictly above the graph of h and yet ≤ all the graphs of hm,b? Well, no. Invoking Proposition 3,

b′>h(f)=EH(f)=inf(m,b)∈H(m(f)+b)=inf(m,b)∈Hmin(m(f)+b)=inf(m,b)∈Hmin(hm,b(f))

So, there exists a minimal point (m,b)∈Hmin where b′>hm,b(f), so (f,b′) lies above the graph of hm,b.

Putting these two parts together, h's hypograph over C(X,[0,1]) is the same as the intersection of the hypographs of all these hm,b. If we can then show all the hm,b have a Lipschitz constant bounded above by some constant, then we get that h itself is Lipschitz with the same constant.

First, a minimal (m,b) must have m having no negative parts, so it can be written as λμ, and by bounded-minimals (since we have a bounded infradistribution), λ≤λ⊙. Now,

|hm,b(f)−hm,b(f′)|=|m(f)+b−m(f′)−b|=|m(f−f′)|≤m(|f−f′|)

=(λμ)(|f−f′|)=λ(μ)(|f−f′|)≤λsupx∈X|f(x)−f′(x)|≤λ⊙supx∈X|f(x)−f′(x)|

So, we get that: |hm,b(f)−hm,b(f′)|/supx∈X|f(x)−f′(x)|≤λ⊙supx∈X|f(x)−f′(x)|/supx∈X|f(x)−f′(x)|=λ⊙

Note that supx∈X|f(x)−f′(x)| is our distance metric between functions in C(X). This establishes that regardless of which minimal point we picked, hm,b is Lipschitz with Lipschitz constant ≤λ⊙, and since h=inf(m,b)∈Hminhm,b, then h itself has the same bound on its Lipschitz constant.
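The Lipschitz bound at the end is easy to check numerically on a finite hypothetical X (a sketch, with made-up λ and μ): for hm,b(f)=m(f)+b with m=λμ, the change in hm,b is at most λ times the sup-distance between the functions.

```python
# Finite-X sketch of the Lipschitz bound on h_{m,b} in Theorem 4.
lam = 2.0
mu = [0.25, 0.75]              # a probability distribution on X
m = [lam * p for p in mu]      # m = lam * mu
b = 0.5

def h_mb(f):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

f, fp = [0.9, 0.1], [0.2, 0.6]
sup_dist = max(abs(a - c) for a, c in zip(f, fp))   # sup-norm distance = 0.7
assert abs(h_mb(f) - h_mb(fp)) <= lam * sup_dist    # Lipschitz constant <= lam
```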

**Lemma 5:** ∀m:inff∈C(X,[0,1])(m(f))=m−(1)

Proof sketch: We'll work in the Banach space L1(|m|) of L1 measurable functions w.r.t. the absolute value of the signed measure m. Then, we consider the discontinuous (but L1) function that's 1 everywhere where m is negative. Continuous functions are dense in L1 measurable functions, so we can fix a sequence of continuous functions limiting to said indicator function. Then we just have to check that f↦m(f) is a bounded linear functional, and we get that there's a sequence of continuous functions f′n where m(f′n) limits to the measure of the indicator function that's 1 where everything is negative. Which is the same as the measure of the "always 1" function, but only on the negative parts, and we're done.

Consider the Banach space L1(|m|) of measurable functions w.r.t. the absolute value of the signed measure m, i.e., |m|=m+−m−, which is a measure. It has a norm given by ||f||=∫X|f|d|m|. To begin with, we can consider the L1 indicator function 1m− that's 1 where the measure is negative. Note that

m(1m−)=∫X1m−dm=∫X1m−dm++∫X1m−dm−

=∫X0dm++∫X1dm−=∫X1dm−=m−(1)

Because continuous functions are dense in L1, we can fix a sequence of continuous functions fn limiting to 1m−. Then, just clip those continuous functions to [0,1], making a continuous function f′n. They'll get *closer* to 1m− that way, so the sequence f′n of continuous functions X→[0,1] limits to 1m− too.

We'll take a detour and show that m is a bounded linear functional L1(|m|)→R, with a Lipschitz constant of 1 or less.

First, m(af+a′f′)=am(f)+a′m(f′), trivially, establishing linearity.

As for the boundedness, if ||f||≤1, then ∫X|f|d|m|≤1, so:

1≥∫X|f|d|m|=∫Xsup(f,0)d|m|−∫Xinf(f,0)d|m|

=∫Xsup(f,0)dm++∫Xsup(f,0)d|m−|−∫Xinf(f,0)dm+−∫Xinf(f,0)d|m−|

=∣∣∫Xsup(f,0)dm+∣∣+∣∣−∫Xsup(f,0)d|m−|∣∣+∣∣∫Xinf(f,0)dm+∣∣+∣∣−∫Xinf(f,0)d|m−|∣∣

≥∣∣∫Xsup(f,0)dm+−∫Xsup(f,0)d|m−|+∫Xinf(f,0)dm+−∫Xinf(f,0)d|m−|∣∣

=∣∣∫Xsup(f,0)dm+∫Xinf(f,0)dm∣∣=∣∣∫Xfdm∣∣=|m(f)|

So, m(f)∈[−1,1]. An f having a norm of 1 or less gets mapped to a number with a norm of 1 or less, so the Lipschitz constant of f↦m(f) is 1 or less. This implies continuity.

Now that we have all requisite components, fix some ϵ. There's some n where, for all greater n, d(1m−,f′n)<ϵ. Mapping them through f↦m(f), due to having a Lipschitz constant of 1 or less, then means that ϵ>|m(f′n)−m(1m−)|=m(f′n)−m(1m−)=m(f′n)−m−(1), because the value of 1-but-only-on-negative-parts is as-or-more negative than f′n on the measure, due to f′n being bounded in [0,1]. Summarizing, ϵ>m(f′n)−m−(1) for all n beyond a certain point, so, for all n beyond a certain point, m(f′n)<ϵ+m−(1)

So we have a sequence of functions in C(X,[0,1]) where m(f′n) limits to m−(1), and our signed measure was arbitrary. Therefore, we have our result that ∀m:inff∈C(X,[0,1])m(f)=m−(1).
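On a finite space the continuity issues in Lemma 5's proof disappear, and the claim can be checked by brute force (a sketch with a made-up signed measure): the minimizer of m(f) over [0,1]-valued f is the indicator of the set where m is negative, and its value is m−(1).

```python
# Finite-X sketch of Lemma 5: inf over f in [0,1]^X of m(f) equals m-(1).
import itertools

m = [0.5, -0.25, -1.0, 2.0]

def m_of(f):
    return sum(mi * fi for mi, fi in zip(m, f))

indicator_neg = [1.0 if mi < 0 else 0.0 for mi in m]
m_minus_1 = sum(mi for mi in m if mi < 0)           # m-(1) = -1.25

assert m_of(indicator_neg) == m_minus_1

# a grid search over [0,1]-valued f confirms nothing does better
grid = [0.0, 0.5, 1.0]
best = min(m_of(f) for f in itertools.product(grid, repeat=len(m)))
assert best == m_minus_1
```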

**Theorem 5:** If h is a function C(X)→R that is concave, monotone, uniformly-continuous/Lipschitz, h(0)=0, h(1)=1, and range(f)⊈[0,1]→h(f)=−∞, then it specifies an infradistribution/bounded infradistribution by: {(m,b)|b≥(h′)∗(m)}, where h′ is the function given by h′(−f)=−h(f), and (h′)∗ is the convex conjugate of h′. Also, going from an infradistribution to an h and back recovers exactly the infradistribution, and going from an h to an infradistribution and back recovers exactly h.

Proof sketch: This is an extremely long one. Phases 1 and 2 are showing isomorphism. One direction is reshuffling the definition of H until we get the definition of the set built from h′ via convex conjugate, showing that going from H to h and back recovers your original set. In the other direction, we show that expectations w.r.t. the set we built from h match up with h exactly.

Phase 3 is cleanup of the easy conditions. Nonemptiness is pretty easy to show, the induced set being a set of sa-measures is harder to show and requires moderately fancier arguments, and closure and convexity require looking at basic properties of functions and the convex conjugate. Upper completeness takes some equation shuffling to show but isn't too bad. The weak-minimal bound property is immediate, and normalization is fairly easy.

That just leaves the positive-minimal property and the bounded-minimal properties, respectively, which are nightmares. A lesser nightmare and a greater nightmare. For phase 4 to lay the groundwork for these, we establish an isomorphism between points in H and hyperplanes which lie above the graph of h, as well as a way of certifying that a point in H isn't minimal by what its hyperplane does.

Phase 5 is, for showing positive-minimals, we can tell whether a hyperplane corresponds to an a-measure, and given any hyperplane above the graph of h, construct a lower one that corresponds to a lower point in H that *does* correspond to an a-measure.

Phase 6 is, for bounded-minimals, we take a hyperplane that *may* correspond to a minimal point, but which is too steep in certain directions. Then, we make an open set that fulfills the two roles of: if you enter it, you're too steep, or you overshoot the hyperplane of interest that you're trying to undershoot. Some fancy equation crunching and one application of Hahn-Banach later, we get a hyperplane that lies above h and doesn't enter our open set we crafted. So, in particular, it undershoots our hyperplane of interest, and isn't too steep. This certifies that our original "too steep" hyperplane didn't actually correspond to a minimal point, so all minimal points must have a bound on their λ values by the duality between hyperplanes above h and points in H.

Fix the convention that supf or inff is assumed to mean f∈C(X); we'll explicitly specify when f has bounds.

Phase 1: Let's show isomorphism. Our first direction is showing H to h and back is H exactly. By upper completion, and Proposition 2, we can also characterize H as

{M|∀f+∃M′∈H:f+(M)≥f+(M′)}

Using Theorem 1 to express all positive functionals as arising from an f∈C(X,[0,1]), and observing that the constant a in front doesn't change which stuff scores lower than which other stuff, so we might as well characterize everything in terms of f, H can also be expressed as

{(m,b)|∀f∈C(X,[0,1]):m(f)+b≥inf(m′,b′)∈H(m′(f)+b′)}

We can swap out C(X,[0,1]) for C(X), because, from the −∞ argument in Theorem 4, f going outside [0,1] means that inf(m′,b′)∈H(m′(f)+b′)=−∞. And then, our H can further be reexpressed as

{(m,b)|∀f:m(f)+b≥EH(f)}={(m,b)|∀f:b≥EH(f)−m(f)}

={(m,b)|b≥supf(EH(f)−m(f))}

Also, EH(f)=h(f)=−h′(−f), so we can rewrite this as:

{(m,b)|b≥sup−f(m(−f)−h′(−f))}={(m,b)|b≥supf(m(f)−h′(f))}

and, by the definition of the convex conjugate (sup characterization), and the space of finite signed measures being the dual space of C(X), and m(f) being a functional applied to an element, this is {(m,b)|b≥(h′)∗(m)}. So, our original set H is identical to the convex-conjugate set, when we go from H to h back to a set of sa-measures.
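Phase 1's containment can be illustrated by brute force on a finite hypothetical X with a finite H (a sketch; the grid approximation of the sup is mine, not from the post): with h(f)=inf(m,b)∈H(m(f)+b), every point of H satisfies b≥supf(h(f)−m(f)), the convex-conjugate bound, while a point with too small a b is excluded.

```python
# Finite-X, finite-H sketch of the conjugate duality in Phase 1.
import itertools

H = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
grid = [i / 4 for i in range(5)]                   # f-values in [0,1]
fs = list(itertools.product(grid, repeat=2))

def h(f):
    # h(f) = inf over H of m(f) + b
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def conj(m):
    # grid approximation of (h')*(m) = sup_f (h(f) - m(f))
    return max(h(f) - sum(mi * fi for mi, fi in zip(m, f)) for f in fs)

for m, b in H:
    assert b >= conj(m)                            # H sits inside the conjugate set

# a point with b below the conjugate bound is excluded: (0, 0.5) would need b >= 1
assert conj([0.0, 0.0]) == 1.0
assert not 0.5 >= conj([0.0, 0.0])
```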

Proof Phase 2: In the reverse direction for isomorphism, assume that h fulfills the conditions. We want to show that E{(m,b)|b≥(h′)∗(m)}(f)=h(f), so let's begin.

E{(m,b)|b≥(h′)∗(m)}(f)=inf(m,b):b≥(h′)∗(m)(m(f)+b)

Given an m, we have a natural candidate for minimizing the b, just set it equal to (h′)∗(m). So then we get infm(m(f)+(h′)∗(m))=infm((h′)∗(m)−m(−f))

And this is just... −(h′)∗∗(−f) (proof by Wikipedia article, check the inf characterization), and, because h is continuous over C(X,[0,1]), and concave, and −∞ everywhere outside the legit functions, h′ is continuous over C(X,[−1,0]), and convex, and ∞ everywhere outside the legit functions. So, in particular, h′ is convex and lower-semicontinuous and proper, so h′=(h′)∗∗ by the Fenchel-Moreau Theorem. From that, we get

E{(m,b)|b≥(h′)∗(m)}(f)=−(h′)∗∗(−f)=−h′(−f)=h(f)

and we're done with isomorphism. Now that isomorphism has been established, let's show the relevant conditions hold. Namely: nonemptiness, closure, convexity, upper completion, normalization, and weak-bounded-minimals (phase 3), then positive-minimals (phase 5) and bounded-minimals (assuming h is Lipschitz) (phase 6) to finish off. The last two will be extremely hard.

Begin phase 3. Weak-bounded-minimals is easy by isomorphism. For the H′ we constructed, if f↦EH′(f) weren't uniformly continuous, then because EH′(f) equals h(f), we'd get a failure of uniform continuity for h, contradicting our assumptions.

By the way, the convex conjugate, (h′)∗(m), can be expressed as (by Wikipedia, sup characterization) supf(m(f)−h′(f))=supf(m(−f)−h′(−f))=supf(h(f)−m(f)). We can further restrict f to functions with range in [0,1], because if it was anything else, we'd get −∞. We'll be using (h′)∗(m)=supf∈C(X,[0,1])(h(f)−m(f)) (or the supf variant) repeatedly.

For nonemptiness, observe that (0,1) is present in H′, because

(h′)∗(0)=supf∈C(X,[0,1])(h(f)−0(f))=supf∈C(X,[0,1])h(f)=1

This is from our format of the convex conjugate, and h being normalized and monotone, so the highest it can be is 1 and it attains that value. Therefore, 1≥(h′)∗(0), so (0,1) is in the H′ we constructed.

For showing that our constructed set H′ lies in Msa(X), we have that, for an arbitrary (m++m−,b)∈H′, it has (by our characterization of (h′)∗(m))

b+m−(1)≥supf∈C(X,[0,1])(h(f)−(m++m−)(f))+m−(1)

≥supf∈C(X,[0,1])(−(m++m−)(f))+m−(1)

=m−(1)−inff∈C(X,[0,1])((m++m−)(f))=m−(1)−m−(1)=0

This is by the lower bound on b being (h′)∗(m++m−) and unpacking the convex conjugate, h(f)≥h(0)=0 by monotonicity and normalization, a reexpression of sup, and Lemma 5, respectively. b+m−(1)≥0 so it's an sa-measure.

For closure and convexity, by monotonicity of h, we have 0=−h(0)≥−h(f)≥−h(1)=−1 and h is continuous on C(X,[0,1]), concave, and −∞ everywhere else by assumption, so h′ is proper, continuous on C(X,[−1,0]), convex, and lower-semicontinuous in general because of the ∞ everywhere else, so, by the Wikipedia page on "Closed Convex Function", h′ is a closed convex function, and then by the Wikipedia page on "Convex Conjugate" in the Properties section, (h′)∗ is convex and closed. From the Wikipedia page on "Closed Convex Function", this means that the epigraph of (h′)∗ is closed, and also the epigraph of a convex function is convex. This takes care of closure and convexity for our H′

Time for upper-completeness. Assume that (m,b) lies in the epigraph. Our task now is to show that (m,b)+(m∗,b∗) lies in the epigraph. This is equivalent to showing that b+b∗≥(h′)∗(m+m∗). Note that b∗≥−m∗−(1), because (m∗,b∗) is an sa-measure. Let's begin.

(h′)∗(m+m∗)=supf∈C(X,[0,1])(h(f)−(m+m∗)(f))

=supf∈C(X,[0,1])(h(f)−m(f)−m∗+(f)−m∗−(f))≤supf∈C(X,[0,1])(h(f)−m(f)+b∗)

=b∗+supf∈C(X,[0,1])(h(f)−m(f))=b∗+(h′)∗(m)≤b∗+b

This was done by unpacking the convex conjugate, splitting up m∗ into m∗+ and m∗−, locking two of the components in the sup to be an upper bound (which also gives the sup more flexibility on maximizing the other two components, so this is greater), packing up the convex conjugate, and using that b≥(h′)∗(m) because (m,b)∈H′

Normalization of the resulting set is easy. Going from h to a (maybe)-inframeasure H′ back to h is identity as established earlier, so all we have to do is show that a failure of normalization in a (maybe)-inframeasure makes the resulting h not normalized. Thus, if our h is normalized, and it makes an H′ that isn't normalized, then going back makes a non-normalized h, which contradicts isomorphism. So, assume there's a failure of normalization in H′. Then EH′(0)≠0, or EH′(1)≠1, so either h(0)≠0 or h(1)≠1 and we get a failure of normalization for h which is impossible. So H′ must be normalized.

Begin phase 4. First, continuous affine functionals ϕ that lie above the graph of h perfectly correspond to sa-measures in H′. This is because the continuous dual space of C(X) is the space of finite signed measures, so we can interpret ϕ−ϕ(0) as a finite signed measure, and ϕ(0) as the b term. In one direction, given an (m,b)∈H′,

ϕ(f)=m(f)+b≥inf(m,b)∈H′(m(f)+b)=EH′(f)=h(f)

so every point in H′ induces a continuous affine functional C(X)→R whose graph is above h.

In the other direction, from earlier, we can describe H′ as: {(m,b)|b≥supf(h(f)−m(f))}

and then, for (ϕ−ϕ(0),ϕ(0)),

supf(h(f)−(ϕ−ϕ(0))(f))=supf(h(f)−ϕ(f)+ϕ(0))≤ϕ(0)

because ϕ(f)≥h(f). So continuous affine functionals whose graph lies above the graph of h correspond to points in H′.

So, we have a link between affine functionals that lie above the graph of h, and points in H′. What would a minimal point correspond to? Well, a non-minimal point corresponds to (m,b)+(m∗,b∗), where the latter component is nonzero. There's *some* f+ where f+((m,b)+(m∗,b∗))>f+(m,b) due to the latter component being nonzero, and for all f+, f+((m,b)+(m∗,b∗))≥f+(m,b). Using Theorem 1 to translate positive functionals to f, this means that the ϕ induced by (m,b) lies *below* the affine functional induced by (m,b)+(m∗,b∗) over the f∈C(X,[0,1]). So, if there's a different affine functional ψ s.t. ∀f∈C(X,[0,1]):h(f)≤ψ(f)≤ϕ(f), then ϕ must correspond to a nonminimal point.

Further, we can characterize whether ϕ corresponds to an a-measure or not. For a measure, if you increase the function you're feeding in, you increase the value you get back out: f′≥f→ϕ(f′)≥ϕ(f). For a signed measure with some negative component, Lemma 5 says we can find some f′∈C(X,[0,1]) on which the measure attains a negative value, so you can add one of those f′ to your f and get ϕ(f+f′)<ϕ(f). So, a ϕ corresponds to an a-measure exactly when it's monotone.
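A tiny numeric illustration of this monotonicity criterion (the two-point X and the specific numbers are assumptions for illustration): a functional whose measure component has a negative part fails monotonicity, so it cannot come from an a-measure.

```python
# Signed measure on a two-point space with a negative component
# (illustrative assumption), plus a b term.
m, b = (0.5, -0.3), 1.0

def phi(f):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

f_low  = (0.0, 0.0)
f_high = (0.0, 1.0)             # f_high >= f_low pointwise...
print(phi(f_low), phi(f_high))  # ...yet phi(f_high) < phi(f_low)
```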

Phase 5: Proving positive-minimals. With these links in place, this means we just have to take any old point that's an sa-measure in H′, get a ϕ from it, it'll fulfill certain properties, and use those properties to find a ψ that lies below ϕ and above h on C(X,[0,1]) and is monotone, certifying that ψ corresponds to a point *below* our minimal-point of interest that's still in H′ but is an a-measure, so we have a contradiction.

To that end, fix a ϕ that corresponds to some point in H′ that's not an a-measure (in particular, it has a negative component), it lies above the graph of h.

Now, translate ϕ to a (mϕ,bϕ), where bϕ=ϕ(0), and mϕ(f)=ϕ(f)−ϕ(0). ϕ is minimized at some f. Since our ϕ corresponds to something that's not an a-measure, (mϕ)−(1)<0

Let our affine continuous functional ψ be defined as ψ(f)=(mϕ)+(f)+ϕ(0)+(mϕ)−(1).

In order to show that ψ corresponds to an a-measure below (mϕ,bϕ) in H′, we need three things. One is that ψ is monotone (is an a-measure), two is that it lies below ϕ over C(X,[0,1]), and three is that it lies above h. Take note of the fact that ϕ(0)+(mϕ)−(1)≥0, because ϕ(0)=bϕ and (mϕ,bϕ) is an sa-measure.

For monotonicity of ψ, it's pretty easy. If f′≥f, then

ψ(f′)=ψ(f+(f′−f))=(mϕ)+(f+(f′−f))+ϕ(0)+(mϕ)−(1)

≥(mϕ)+(f)+ϕ(0)+(mϕ)−(1)=ψ(f)

and we're done with that part.

For being less than or equal to ϕ over C(X,[0,1]) (we know it's not the same as ϕ because ϕ isn't monotone and ψ is),

ψ(f)=(mϕ)+(f)+ϕ(0)+(mϕ)−(1)≤(mϕ)+(f)+ϕ(0)+(mϕ)−(f)

=mϕ(f)+ϕ(0)=ϕ(f)−ϕ(0)+ϕ(0)=ϕ(f)

For being ≥h over C(X,[0,1]), it takes a somewhat more sophisticated argument. By Lemma 5, regardless of ϵ, there exists an f′ where mϕ(f′)<(mϕ)−(1)+ϵ. Then, we can go:

ψ(f)+ϵ>ψ(f)+mϕ(f′)−(mϕ)−(1)

=(mϕ)+(f)+ϕ(0)+(mϕ)−(1)+mϕ(f′)−(mϕ)−(1)=(mϕ)+(f)+ϕ(0)+mϕ(f′)

=(mϕ)+(f+f′)+ϕ(0)+(mϕ)−(f′)≥(mϕ)+(sup(f,f′))+ϕ(0)+(mϕ)−(sup(f,f′))

=mϕ(sup(f,f′))+ϕ(0)=ϕ(sup(f,f′))≥h(sup(f,f′))≥h(f)

The last steps were done via the definition of ϕ, ϕ≥h, and h being monotonic.

So, ψ(f)+ϵ>h(f) for all ϵ>0 and all f∈C(X,[0,1]), getting ψ(f)≥h(f) for f∈C(X) (because h is −∞ everywhere else)

Thus, ψ specifies an a-measure (ψ being monotone) that is below the sa-measure encoded by ϕ (by ϕ≥ψ over C(X,[0,1])), yet ψ≥h, so said point is in H′. This witnesses that there can be no minimal points in H′ that aren't a-measures. That just leaves getting the slope bound from Lipschitzness, the worst part of this whole proof.

Phase 6: Let λ⊙ be the Lipschitz constant for h. Fix a ϕ that corresponds to a minimal point with λ>λ⊙. This violates the Lipschitz bound when traveling from 0 to 1, so the Lipschitz bound is violated in some direction. Further, the graph of ϕ touches the graph of h at some point f∗∈C(X,[0,1]), because if it didn't, you could shift ϕ down further until it *did* touch, witnessing that the point ϕ came from wasn't minimal (you could sap more from the b term).

Now, if this point is minimal, it should be *impossible* to craft a ψ which is ≤ϕ over C(X,[0,1]), ≥h, and different from ϕ. We shall craft such a ψ, witnessing that said point isn't actually minimal. Further, said ψ won't violate the Lipschitz bound in any direction. Thus, all affine functionals corresponding to minimal points must obey the Lipschitz bound and be monotone, so they're a-measures with λ≤λ⊙.

In order to do this, we shall craft three sets in C(X)⊕R. A, B1, and B2.

Set A is {(f,b)|f∈C(X,[0,1]),b≤h(f)}. Pretty much, this set is the hypograph of h. It's obviously convex because h is concave, and the hypograph of a concave function is convex. It's closed because h is continuous.

Set B1 is {(f,b)|f∈C(X,(0,1)),b>ϕ(f)}. This could be thought of as the interior of the epigraph of ϕ restricted to C(X,[0,1]). Undershooting this means you never exceed ϕ over C(X,[0,1]). First, it's open. This is because, due to f being continuous over a compact set X, the maximum and minimum are attained, so any f∈C(X,(0,1)) is bounded strictly below 1 and strictly above 0, so we've got a little bit of room to freely wiggle f in any direction. Further, since ϕ−ϕ(0) is a continuous linear functional on C(X), which is a Banach space, it's a bounded linear functional and has *some* Lipschitz constant (though it may exceed λ⊙), so we have a little bit of room to freely wiggle b as well. So B1 is open.

Also, B1 is convex, because a mixture of f and f′ that are bounded away from 0 and 1 is also bounded away from 0 and 1, and pb+(1−p)b′>pϕ(f)+(1−p)ϕ(f′)=ϕ(pf+(1−p)f′).

Set B2 is {(f,b)|b>λ⊙d(f,f∗)+ϕ(f∗)}. This could be thought of as an open cone with a point (it's missing that exact point, though) at (f∗,ϕ(f∗)), that opens straight up, and certifies a failure of the λ⊙ bound on slope. If an affine function includes the point (f∗,ϕ(f∗)) in its graph, then if it increases faster than λ⊙ in any direction, it'll land in this set. It's open because, given a point in it, we can freely wiggle the f and b values around a little bit in any direction, and stay in the set. Now we'll show it's convex. Given an (f,b) and (f′,b′) in it, due to C(X) being a Banach space (so it has a norm), we want to check whether pb+(1−p)b′>λ⊙d(pf+(1−p)f′,f∗)+ϕ(f∗).

Observe that (using the defining axioms for a norm)

pb+(1−p)b′>p(λ⊙d(f,f∗)+ϕ(f∗))+(1−p)(λ⊙d(f′,f∗)+ϕ(f∗))

=λ⊙(pd(f,f∗)+(1−p)d(f′,f∗))+ϕ(f∗)=λ⊙(p||f−f∗||+(1−p)||f′−f∗||)+ϕ(f∗)

=λ⊙(||pf−pf∗||+||(1−p)f′−(1−p)f∗||)+ϕ(f∗)

≥λ⊙(||pf−pf∗+(1−p)f′−(1−p)f∗||)+ϕ(f∗)

=λ⊙(||pf+(1−p)f′−f∗||)+ϕ(f∗)=λ⊙d(pf+(1−p)f′,f∗)+ϕ(f∗)

So, B2 is convex.

Ok, so we've got a convex closed set and two convex opens. Now, consider B:=c.h(B1∪B2). The convex hull of an open set is open. We will show that A∩B=∅.

Assume this is false, and that they overlap. The point where they overlap can then be written as a convex mixture of points from B1∪B2. However, B1 and B2 are both convex, so we can reduce it to a case where we're mixing one point (f,b) from B1 and one point (f′,b′) in B2. And (pf+(1−p)f′,pb+(1−p)b′)∈A.

If p=0, then we've just got a single point in B2. Also, ϕ(f∗)=h(f∗).

b′>λ⊙d(f′,f∗)+ϕ(f∗)=λ⊙d(f′,f∗)+h(f∗)≥h(f′)

This is because ϕ(f∗)=h(f∗) and h has a Lipschitz constant of λ⊙, so it can't increase as fast as we're demanding as we move from f∗ to f′, which stays in C(X,[0,1]). So (f′,b′)∉A.

If p=1, then we've just got a single point in B1. Then b>ϕ(f)≥h(f), so again, (f,b)∉A.

For the case where p isn't 0 or 1, we need a much more sophisticated argument. Remembering that (f,b)∈B1, and (f′,b′)∈B2, we will show that (pf+(1−p)f∗,pb+(1−p)ϕ(f∗)) lies *strictly* above the graph of h. Both f and f∗ lie in C(X,[0,1]), so their mix lies in the same set, so we don't have to worry about h being undefined there. Also, remember that ϕ≥h over C(X,[0,1]). Now,

pb+(1−p)ϕ(f∗)>pϕ(f)+(1−p)ϕ(f∗)=ϕ(pf+(1−p)f∗)≥h(pf+(1−p)f∗)

The critical > is by the definition of B1, and (f,b)∈B1. So, the b term is strictly too high for this point (different than the one we care about) to land on the graph of h.

With the aid of this, we will consider "what slope do we have as we travel from (pf+(1−p)f∗,pb+(1−p)ϕ(f∗)) to (pf+(1−p)f′,pb+(1−p)b′)"? Said slope is

((pb+(1−p)b′)−(pb+(1−p)ϕ(f∗)))/d(pf+(1−p)f′,pf+(1−p)f∗)=((1−p)(b′−ϕ(f∗)))/(||(pf+(1−p)f′)−(pf+(1−p)f∗)||)

=((1−p)(b′−ϕ(f∗)))/((1−p)||f′−f∗||)=(b′−ϕ(f∗))/d(f′,f∗)>(λ⊙d(f′,f∗)+ϕ(f∗)−ϕ(f∗))/d(f′,f∗)=λ⊙

That critical > is by (f′,b′)∈B2 and the definition of B2.

So, if we start at (pf+(1−p)f∗,pb+(1−p)ϕ(f∗)) (and pf+(1−p)f∗ lies in C(X,[0,1])), we're above the graph of h. Then, we travel to (pf+(1−p)f′,pb+(1−p)b′), where pf+(1−p)f′∈C(X,[0,1]) by assumption that this point is in A, but while doing this, we ascend *faster* than λ⊙, the Lipschitz constant for h. So, our point of interest (pf+(1−p)f′,pb+(1−p)b′) lies above the graph of h and can't lie in A, and we have a contradiction.

Putting all this together, A∩B=∅. Since B is open, and they're both convex and nonempty, we can invoke Hahn-Banach (first version of the theorem in the "Separation of Sets" section) and conclude they're separated by some continuous linear functional ψL. Said linear functional must increase as b does, because (0,0)∈A, and (0,b) (for some sufficiently large b) lies in B2, thus in B. This means that given any f and a∈R to specify a level, we can find a unique b where ψL(f,b)=a.

So, any level set of this continuous linear functional we crafted can also be interpreted as an affine functional. There's a critical value of the level set that achieves the separation, ψL(f∗,ϕ(f∗)). This is because (f∗,ϕ(f∗))=(f∗,h(f∗))∈A, but (f∗,ϕ(f∗)+ϵ) is in B2, thus in B, for all ϵ>0. So we've uniquely pinned down which affine function ψ we're going for. Since the graph of ψ is a hyperplane separating A and B (it may touch the set A, just not cut into it, but it doesn't touch B), from looking at the definitions of A and B1 and B2, we can conclude:

From the definition of A, ψ(f)≥h(f), so ψ≥h over C(X,[0,1]).

From the definition of B1, ψ(f)≤ϕ(f) over C(X,(0,1)), and they're both continuous, so we can extend ψ(f)≤ϕ(f) to C(X,[0,1]) by continuity, so ψ≤ϕ over C(X,[0,1]).

Also, h(f∗)≤ψ(f∗)≤ϕ(f∗)=h(f∗), so ψ(f∗)=ϕ(f∗), and this, paired with the ability of B2 to detect whether an affine function exceeds the λ⊙ slope bound (as long as the graph of said function goes through (f∗,ϕ(f∗))), means that the graph of ψ not entering B2 certifies that its Lipschitz constant is λ⊙ or less. Since ϕ *does* enter B2 due to violating the Lipschitz constant bound, this also certifies that ϕ≠ψ.

Putting it all together, given a ϕ which corresponds to a minimal point and violates the Lipschitz bound, we can find a ψ below it that's also above h, so said minimal point isn't actually minimal.

Therefore, if you were to translate a minimal point in the induced H into an affine function above h, it'd have to A: not violate the Lipschitz bound (otherwise we could undershoot it) and B: be monotone (otherwise we could undershoot it). Being monotone certifies that it's an a-measure, and having a Lipschitz constant of λ⊙ or less certifies that the λ of the a-measure is λ⊙ or less. We're finally done!

The next proofs are here.

Discuss

### Proofs Section 1.2 (Mixtures, Updates, Pushforwards)

The previous proofs are here.

**Proposition 5:** *If *Bmin⊆Ma(X)*, then the condition "there is a *λ⊙* where, *∀(λμ,b)∈Bmin:λ≤λ⊙*" is equivalent to "there is a compact *C* s.t. *Bmin⊆C*"*

Proof sketch: One direction is immediate from the Compactness Lemma. For showing that just a bound on the λ values suffices to be contained in a compact set, instead of a bound on the λ and b values to invoke the Compactness Lemma, we use a proof by contradiction where we can get a bound on the b values of the minimal points from just a bound on the λ values.

Proof: In one direction, assume there's a compact C s.t. Bmin⊆C, and yet there's no upper-bounding λ⊙ on the λ values. This is impossible by the Compactness Lemma, since (λμ)+(1)=λμ+(1)=λμ(1)=λ.

In the other direction, assume there's a λ⊙ bound on λ for the minimal points. Fix some arbitrary (λμ,b)∈Bmin for the rest of the proof. Now, we will show that all minimal points (λ′μ′,b′)∈Bmin have λ′≤λ⊙, and b′≤λ⊙+b, letting us invoke the Compactness Lemma to get that everything is in a suitable compact set C. The first bound is obvious. Since λ′ came from a minimal point, it must have λ⊙ as an upper bound.

For the other one, by contradiction, let's assume that there's a minimal point (λ′μ′,b′) where b′>λ⊙+b. Then, we can write (λ′μ′,b′) as: (λμ,b)+(−λμ,λ⊙)+(λ′μ′,b′−λ⊙−b)

The first component, (λμ,b), is our fixed minimal point of interest. The second component is an sa-measure, because λ⊙−λ≥0, due to the λ⊙ upper bound on the λ value of minimal points. The third component is also a nonzero sa-measure, because λ′ is nonnegative (it came from a minimal point), and by assumption, b′>λ⊙+b. Hang on, we wrote a minimal point (λ′μ′,b′) as another minimal point (λμ,b), plus two sa-measures (one of which is nonzero), so (λ′μ′,b′) can't be minimal, and we have a contradiction.
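Summing the three components coordinate-wise confirms that the decomposition above really recovers the point in question:

```latex
(\lambda\mu,\,b) + (-\lambda\mu,\,\lambda^{\odot}) + (\lambda'\mu',\,b'-\lambda^{\odot}-b)
\;=\; \big(\lambda\mu - \lambda\mu + \lambda'\mu',\;\; b + \lambda^{\odot} + b' - \lambda^{\odot} - b\big)
\;=\; (\lambda'\mu',\,b')
```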

Therefore, all (λ′μ′,b′)∈Bmin have b′≤λ⊙+b. Now that we have bounds on λ and b for minimal points, we can invoke the Compactness Lemma to conclude that everything is in a compact set.

**Proposition 6:** *EB(0)=EB(1) only occurs when there's only one minimal point, of the form (0,b).*

Proof: Unpacking the expectations, and in light of Proposition 3,

EB(1)=inf(λμ,b)∈Bmin(λμ(1)+b)=inf(λμ,b)∈Bmin(λ+b) and EB(0)=inf(λμ,b)∈Bmin(λμ(0)+b)=inf(λμ,b)∈Bminb

So, take a minimal a-measure (λμ,b) that minimizes λ+b. One must exist because we have λ and b bounds, so by the Compactness Lemma, we can restrict our attention to an actual compact set, and continuous functions from a compact set to R have a minimum, so there's an actual minimizing minimal point.

λ must be 0, because otherwise EB(1)=λ+b>b≥EB(0), which contradicts EB(1)=EB(0). Further, since b=λ+b=EB(1)=EB(0), said b must be the lowest b possible amongst minimal points.

So, we have a minimal point of the form (0,b) where b is the lowest possible b amongst the minimal points. Any other distinct minimal point must be of the form (λ′μ′,b′), where b′≥b. This other minimal point can be written as (0,b)+(λ′μ′,b′−b), where the latter component is an sa-measure, so it's not minimal. Thus, there's only one minimal a-measure and it's of the form (0,b).

**Proposition 7:** *Renormalizing a bounded inframeasure produces a bounded infradistribution, if renormalization doesn't fail.*

Proof sketch: Our first order of business is showing that our renormalization process doesn't map anything outside the cone of sa-measures. A variant of this argument establishes that the preimage of a minimal point in BR must be a minimal point in B, which quickly establishes positive-minimals and bounded-minimals for BR. Then, we verify the other conditions of a bounded infradistribution. Nonemptiness, closure, and convexity are very easy, upper-closure is shown by adding appropriately-scaled sa-measures such that, after renormalization, they hit whatever sa-measure you want. Then, finally, we just have to verify that our renormalization procedure is the right one to use, that it makes EBR(1)=1 and EBR(0)=0.

Proof: First up, we need to show that after renormalization, nothing gets mapped outside the cone of sa-measures. Observe that the renormalization process is injective. If two points are distinct, after a scale-and-shift, they'll still be distinct.

Let B be our original set and BR be our renormalized set. Take a point in BR, given by (m,b). Undoing the renormalization, we get (EB(1)−EB(0))(m,b)+(0,EB(0))∈B.

By decomposition into a minimal point and something else via Theorem 2, we get that

(EB(1)−EB(0))(m,b)+(0,EB(0))=(mmin,bmin)+(m∗,b∗)

where (mmin,bmin)∈Bmin. Renormalizing back, we get that

(m,b)=(1/(EB(1)−EB(0)))((mmin,bmin−EB(0))+(m∗,b∗))

bmin≥EB(0), obviously, because EB(0) is the minimal b value amongst the minimal points. So, the first component is an a-measure, the second component is an sa-measure, so adding them is an sa-measure, and then we scale by a nonnegative constant, so (m,b) is an sa-measure as well.

This general line of argument also establishes positive-minimals and bounded-minimals, as we'll now show. If the (m∗,b∗) isn't 0, then we just wrote (m,b) as

(1/(EB(1)−EB(0)))(mmin,bmin−EB(0))+(1/(EB(1)−EB(0)))(m∗,b∗)

And the first component lies in BR, but the latter component is nonzero, witnessing that (m,b) isn't minimal. So, if (m,b) is minimal in BR, then (m∗,b∗)=0, so it must be the image of a single minimal point (mmin,bmin)∈Bmin by injectivity. Ie, the preimage of a minimal point in BR is a minimal point in B.

Scale-and-shift maps a-measures to a-measures, showing positive-minimals, and the positive scale constant of (EB(1)−EB(0))−1 just scales up the λ⊙ upper bound on the λ values of the minimal points in B, showing bounded-minimals.

For the remaining conditions, nonemptiness, closure, and convexity are trivial. We're taking a nonempty closed convex set and doing a scale-and-shift so it's nonempty closed convex.

Time for upper-completeness. Letting B be our original set and BR be our renormalized set, take a point MR+M∗ in (BR)uc. By injectivity, MR has a single preimage point M∈B (M is MR scaled by EB(1)−EB(0), with EB(0) added back to the b term). Now consider M+(EB(1)−EB(0))M∗. This lies in B by upper-completeness, and renormalizing it back produces MR+M∗, which is in BR, so BR is upper-complete.

That just leaves showing that after renormalizing, we're normalized.

EBR(1)=inf(λμ,b)∈BR(λ+b)=inf(λ′μ′,b′)∈B(1/(EB(1)−EB(0)))(λ′+b′−EB(0))

=(1/(EB(1)−EB(0)))(inf(λ′μ′,b′)∈B(λ′+b′)−EB(0))=(EB(1)−EB(0))/(EB(1)−EB(0))=1

For the other part,

EBR(0)=inf(λμ,b)∈BRb=inf(λ′μ′,b′)∈B(1/(EB(1)−EB(0)))(b′−EB(0))

=(1/(EB(1)−EB(0)))(inf(λ′μ′,b′)∈Bb′−EB(0))=(EB(0)−EB(0))/(EB(1)−EB(0))=0

And we're done.
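To make the scale-and-shift concrete, here is a sketch of Proposition 7's renormalization on a one-point space, where an sa-measure reduces to a pair (λ, b) and, for constant f, EB(f) is just an inf of λf+b (the particular set B is an illustrative assumption):

```python
# Renormalization sketch on a one-point space X (an illustrative
# assumption): an sa-measure is a pair (lam, b), and for a constant
# function f, E_B(f) = inf over points of (lam * f + b).
B = [(1.0, 0.5), (0.5, 0.25)]

def E(points, f):
    return min(lam * f + b for lam, b in points)

scale = E(B, 1) - E(B, 0)  # E_B(1) - E_B(0)
BR = [(lam / scale, (b - E(B, 0)) / scale) for lam, b in B]
print(E(BR, 1), E(BR, 0))  # 1.0 0.0, as Proposition 7 requires
```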

**Lemma 6:** g∗ *is a continuous linear operator.*

Proof sketch: First show linearity, then continuity, for the operator that just maps a signed measure through g, using some equation-crunching and characterizations of continuity. Then, since g∗ is just the pair of that and the identity function, it's trivial to show that it's linear and continuous.

We'll use g′∗ to refer to the function M±(X)→M±(Y) defined by (g′∗(m))(Z)=m(g−1(Z)), where Z is a measurable subset of Y and g∈C(X,Y). Ie, this specifies what the measure g′∗(m) is in terms of telling you what value it assigns to all measurable subsets of Y.

We'll use g∗ to refer to the function M±(X)⊕R→M±(Y)⊕R given by g∗(m,b)=(g′∗(m),b).

Our first order of business is establishing the linearity of g′∗. Observe that, for all measurable Z⊆Y, and a,a′ being real numbers, and m,m′ being signed measures over X,

(g′∗(am+a′m′))(Z)=(am+a′m′)(g−1(Z))=am(g−1(Z))+a′m′(g−1(Z))

=ag′∗(m)(Z)+a′g′∗(m′)(Z)=(ag′∗(m)+a′g′∗(m′))(Z)

So, g′∗(am+a′m′)=ag′∗(m)+a′g′∗(m′) and we have linearity of g′∗.

Now for continuity of g′∗. Let mn limit to m. The sequence g′∗(mn) converging to g′∗(m) in our metric on M±(Y) is equivalent to: ∀f∈C(Y):limn→∞g′∗(mn)(f)=g′∗(m)(f)

So, if g′∗(mn) fails to converge to g′∗(m), then there is some continuous function f∈C(Y) that witnesses the failure of convergence. But, because g is a continuous function X→Y, then f∘g∈C(X), and also mn(f∘g)=g′∗(mn)(f), so:

limn→∞g′∗(mn)(f)=limn→∞mn(f∘g)=m(f∘g)=g′∗(m)(f)

The key step in the middle is that mn limits to m, so mn(f∘g) limits to m(f∘g), by our characterization of continuity. Thus, we get a contradiction, our f that witnesses the failure of convergence actually *does* converge. Therefore, g′∗(mn) limits to g′∗(m) if mn limits to m, so g′∗ is continuous.

To finish up, continuity for g∗ comes from the product of two continuous functions being continuous (g′∗ which we showed already, and idR because duh), and linearity comes from:

g∗(a(m,b)+a′(m′,b′))=g∗(am+a′m′,ab+a′b′)=(g′∗(am+a′m′),ab+a′b′)

=(ag′∗(m)+a′g′∗(m′),ab+a′b′)=a(g′∗(m),b)+a′(g′∗(m′),b′)=ag∗(m,b)+a′g∗(m′,b′)

**Proposition 8:** *If *f∈C(X,[0,1])* and *g* is a continuous function *X→Y*, then *Eg∗(H)(f)=EH(f∘g)

Eg∗(H)(f)=inf(m,b)∈(g∗(H))(m(f)+b)=inf(m,b)∈H(g′∗(m)(f)+b)

=inf(m,b)∈H(m(f∘g)+b)=EH(f∘g)
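Proposition 8 can be sanity-checked numerically on finite spaces (the particular X, Y, g, H, and f below are illustrative assumptions, not from the post):

```python
# Numeric check of Proposition 8: X = {0,1,2}, Y = {0,1}, with
# g(0)=0 and g(1)=g(2)=1 (all choices here are illustrative).
g = [0, 1, 1]
H = [((0.5, 0.25, 0.25), 0.0), ((0.2, 0.1, 0.3), 0.4)]

def push(m):
    # g'_*(m) assigns to each y the m-mass of g^{-1}(y)
    out = [0.0, 0.0]
    for x, mx in enumerate(m):
        out[g[x]] += mx
    return tuple(out)

def E(points, f):
    # E_H(f) = inf over (m, b) of (m(f) + b)
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in points)

f = (0.25, 0.75)                        # a function on Y
f_g = tuple(f[g[x]] for x in range(3))  # f o g, a function on X
gH = [(push(m), b) for m, b in H]
print(E(gH, f), E(H, f_g))              # the two expectations agree
```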

**Proposition 9:** g∗(H)* is a (bounded) inframeasure if *H* is, and it doesn't require upper completion if *g* is surjective.*

Proof sketch: Nonemptiness is obvious, and showing that it maps sa-measures to sa-measures is also pretty easy. Closure takes a rather long argument that the image of any closed subset of sa-measures over X, through g∗, is closed, which is fairly tedious. We may or may not invoke upper completion afterwards, but if we do, we can just appeal to the lemma that the upper completion of a closed set is closed. Convexity is immediate from linearity of g∗.

For upper completion, we can just go "we took the upper completion" if g isn't surjective, but we also need to show that we don't need to take the upper completion if g is surjective, which requires crafting a measurable inverse function to g via the Kuratowski-Ryll-Nardzewski selection theorem, in order to craft suitable preimage points.

Then we can use LF-Duality to characterize the induced h function, along with Proposition 8, which lets us get positive-minimals, bounded-minimals, and normalization fairly easily, wrapping up the proof.

Proof: Nonemptiness is obvious. For showing that it takes sa-measures to sa-measures, take an (m,b)∈H, and map it through to get (g′∗(m),b)∈g∗(H). (m,b) is an sa-measure, so b+m−(1)≥0. Now, we can use Lemma 5 to get:

b+(g′∗(m))−(1)=b+inff∈C(Y,[0,1])g′∗(m)(f)=b+inff∈C(Y,[0,1])m(f∘g)

≥b+inff′∈C(X,[0,1])m(f′)=b+m−(1)≥0

So the b term is indeed big enough that the image of (m,b) is an sa-measure.

For closure, fix a sequence of (mn,bn)∈g∗(H) limiting to some (m,b), with preimage points (m′n,b′n)∈H. Due to convergence of (mn,bn) there must be some b◯ bound on the bn. g∗ preserves those values, so b◯ is an upper bound on the b′n. Since the (m′n,b′n) are sa-measures, −b◯ is a lower bound on the m′−n(1) values. Since mn converges to m, mn(1) converges to m(1), so there's a λ◯ upper bound on the mn(1) values. Further,

λ◯≥mn(1)=g′∗(m′n)(1)=m′n(1∘g)=m′n(1)=m′+n(1)+m′−n(1)≥m′+n(1)−b◯

So, for all n, m′+n(1)≤λ◯+b◯, so we have an upper bound on the b′n and m′+n(1) values. Now we can invoke the Compactness Lemma to conclude that there's a convergent subsequence of the (m′n,b′n), with a limit point (m′,b′), which must be in H since H is closed. By continuity of g∗ from Lemma 6, g∗(m′,b′) must equal (m,b), witnessing that (m,b)∈g∗(H). So, g∗(H) is closed. Now, if we take upper completion afterwards, we can just invoke Lemma 2 to conclude that the upper completion of a closed set of sa-measures is closed.

Also, g∗ is linear from Lemma 6, so it maps convex sets to convex sets getting convexity.

Now for upper completion. Upper completion is immediate if g isn't surjective, because we had to take the upper completion there. Showing we don't need upper completion if g is surjective is trickier. We must show that g∗ is a surjection from Msa(X) to Msa(Y).

First, we'll show that g(U), where U is an open subset of X, is a measurable subset of Y. In metrizable spaces (of which X is one), every open set is an Fσ set, ie, it can be written as a countable union of closed sets. Because our space is compact, all those closed sets are compact. And the continuous image of a compact set is a compact set, ie closed. Therefore, g(U) is a countable union of closed sets, ie, measurable.

X is a Polish space (all compact metric spaces are Polish), it has the Borel σ-algebra, and we'll use the function g−1. Note that g−1(y) is closed and nonempty for all y∈Y due to g being a continuous surjection. Further, the set {y:g−1(y)∩U≠∅} equals g(U) for all open sets U. In one direction, if the point y is in the first set, then there's some point x∈U where g(x)=y. In the other direction, if a point y is in g(U), then there's some point x∈U where g(x)=y so g−1(y)∩U is nonempty.

Thus, g−1 is weakly measurable, because for all open sets U of X, {y:g−1(y)∩U≠∅}=g(U) and g(U) is measurable. Now, by the Kuratowski-Ryll-Nardzewski Measurable Selection Theorem, we get a measurable function g◊ from Y to X where g◊(y)∈g−1(y) so g(g◊(y))=y, and g◊ is an injection.

So, we can push any sa-measure of interest (m∗,b∗) through g◊∗ (which preserves the amount of negative measure due to being an injection), to get an sa-measure that, when pushed through g∗ recovers (m∗,b∗) exactly. Thus, if g∗(m,b)∈g∗(H), and you want to show g∗(m,b)+(m∗,b∗)∈g∗(H), just consider

g∗((m,b)+g◊∗(m∗,b∗))=g∗(m,b)+g∗(g◊∗(m∗,b∗))=g∗(m,b)+(m∗,b∗)

So, since (m,b)+g◊∗(m∗,b∗)∈H due to upper-completeness, then g∗((m,b)+g◊∗(m∗,b∗))=g∗(m,b)+(m∗,b∗)∈g∗(H) And we have shown upper-completeness of g∗(H) if g is a surjection.

We should specify something about using LF-Duality here. If you look back through the proof of Theorem 5 carefully, the only conditions you really *need* for isomorphism are (on the set side) g∗(H) being closed, convex, and upper complete (in order to use Proposition 2 to rewrite g∗(H) appropriately for the subsequent arguments, we have these properties), and (on the functional side), f↦Eg∗(H)(f) being concave (free), −∞ if range(f)⊈[0,1] (by proof of Theorem 4, comes from upper completeness), and continuous over f∈C(Y,[0,1]) (showable by Proposition 8 that Eg∗(H)(f)=EH(f∘g), and the latter being continuous since H is an infradistribution)

It's a bit of a pain to run through this argument over and over again, so we just need to remember that if you can show closure, convexity, upper completeness, and the expectations to be continuous, that's enough to invoke LF-Duality and clean up the minimal point conditions. We did that, so we can invoke LF-Duality now.

Time for normalization. From Proposition 8, the g∗(h) function we get from f↦Eg∗(H)(f) is uniquely characterized as: g∗(h)(f)=h(f∘g). So,

Eg∗(H)(1)=g∗(h)(1)=h(1∘g)=h(1)=EH(1)=1

Eg∗(H)(0)=g∗(h)(0)=h(0∘g)=h(0)=EH(0)=0

and normalization is taken care of.

For bounded-minimals/weak-bounded-minimals, since g∗(H) is the LF-dual of g∗(h), we can appeal to Theorem 5 and just check whether g∗(h) is Lipschitz/uniformly continuous. If d(f,f′)<δ, then d(f∘g,f′∘g)<δ according to the sup metric on C(Y,[0,1]) and C(X,[0,1]), respectively, which (depending on whether we're dealing with Lipschitzness or uniform continuity) implies that |h(f∘g)−h(f′∘g)|<λ⊙δ, or ϵ for uniform continuity. So, we get: |g∗(h)(f)−g∗(h)(f′)|=|h(f∘g)−h(f′∘g)|<λ⊙δ (or ϵ for uniform continuity), thus establishing that f and f′ being sufficiently close means that g∗(h) doesn't change much, which, by Theorem 5, implies bounded-minimals/weak-bounded-minimals in g∗(H).

For positive-minimals it's another Theorem 5 argument. If f′≥f, then f′∘g≥f∘g, so: g∗(h)(f′)−g∗(h)(f)=h(f′∘g)−h(f∘g)≥0 And we have monotonicity for g∗(h), which, by Theorem 5, translates into positive-minimals on g∗(H).

**Lemma 7:** *If *M∈(EζHi)min*, then for all decompositions of *M* into *Mi*, *Mi∈(Hi)min

This is easy. Decompose M into EζMi. To derive a contradiction, assume there exists a nonminimal Mi that decomposes into Mmini+M∗i where M∗i≠0. Then,

M=EζMi=Eζ(Mmini+M∗i)=Eζ(Mmini)+Eζ(M∗i)

Thus, we have decomposed our minimal point into another point which is also present in EζHi, and a nonzero sa-measure because there's a nonzero M∗i so our original "minimal point" is nonminimal. Therefore, all decompositions of a minimal point in the mixture set must have every component part being minimal as well.

**Proposition 10:** EEζHi(f)=Eζ(EHi(f))

EEζHi(f)=inf(m,b)∈EζHi(m(f)+b)=inf(mi,bi)∈ΠiHi((Eζmi)(f)+Eζbi)

=inf(mi,bi)∈ΠiHi(Eζ(mi(f))+Eζ(bi))=inf(mi,bi)∈ΠiHiEζ(mi(f)+bi)

=Eζ(inf(mi,bi)∈Hi(mi(f)+bi))=Eζ(EHi(f))

Done.
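A quick numeric check of Proposition 10 with two components (the two-point X, the sets Hi, ζ, and f are illustrative assumptions): the inf of the mixture equals the mixture of the infs, since the inf over the product set decomposes coordinate-wise.

```python
import itertools

# Mixing two finite sets of (m, b) pairs over a two-point X with
# weights zeta = (0.5, 0.5) (all choices here are illustrative).
H1 = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.25)]
H2 = [((0.5, 0.5), 0.125)]
zeta = (0.5, 0.5)
f = (0.75, 0.25)

def ev(m, b):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

# inf over the mixture set vs. mixture of the per-component infs
mixed_inf = min(zeta[0] * ev(*p1) + zeta[1] * ev(*p2)
                for p1, p2 in itertools.product(H1, H2))
inf_mixed = zeta[0] * min(ev(*p) for p in H1) + zeta[1] * min(ev(*p) for p in H2)
print(mixed_inf, inf_mixed)  # equal, as Proposition 10 states
```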

**Proposition 11:** *A mixture of infradistributions is an infradistribution. If it's a mixture of bounded infradistributions with Lipschitz constants on their associated *h* functions of *λ⊙i*, and *∑iζiλ⊙i<∞*, then the mixture is a bounded infradistribution.*

Proof sketch: Nonemptiness, convexity, upper completion, and normalization are pretty easy to show. Closure is a nightmare.

The proof sketch of Closure is: Take a sequence (mn,bn) limiting to (m,b). Since each approximating point is a mixture of points from the Hi, we can shatter each of these (mn,bn)∈EζHi into countably many (mi,n,bi,n)∈Hi. This defines a sequence in *each* Hi (not necessarily convergent). Then, we take some bounds on the (mn,bn) and manage to translate them into (rather weak) i-dependent bounds on the (mi,n,bi,n) sequence. This lets us invoke the Compactness Lemma and view everything as wandering around in a compact set, regardless of Hi. Then, we take the product of these compact sets to view everything as a *single* sequence in the product of compact sets, which is compact by Tychonoff's theorem. This is only a countable product of compact metric spaces, so we don't need full axiom of choice. Anyways, we isolate a convergent subsequence in there, which makes a convergent subsequence in each of the Hi. And then, we can ask "what happens when we mix the limit points in the Hi according to ζ?" Well, what we can do is just take a partial sum of the mixture of limit points, like the i from 0 to 1 zillion. We can establish that (m,b) gets arbitrarily close to the upper completion of a partial sum of the mixture of limit points, so (m,b) lies above all the partial sums of our limit points. We show that the partial sums don't have multiple limits, then, we just do one more invocation of Lemma 3 to conclude that the mixture of limit points lies below (m,b). Finally, we appeal to upper completion to conclude that (m,b) is in our mixed set of interest. Whew!

Once those first 4 are out of the way, we can then invoke Theorem 5 to translate to the h view, and mop up the remaining minimal-point conditions.

First, nonemptiness. By Theorem 5, we can go "hm, the hi are monotone on C(X,[0,1]), and −∞ everywhere else, and hi(1)=1, so the affine functional ϕ:ϕ(f)=1 lies above the graph of hi". This translates to the point (0,1) being present in all the Hi. Then, we can just go: Eζ(0,1)=(0,1), so we have a point in our EζHi set.

For normalization, appeal to Proposition 10 and normalization for all the Hi. EEζHi(1)=Eζ(EHi(1))=Eζ(1)=1 and EEζHi(0)=Eζ(EHi(0))=Eζ(0)=0.

Convexity is another easy one. Take M,M′∈EζHi. They shatter into Mi,M′i∈Hi. Then, we can just go:

p(m,b)+(1−p)(m′,b′)=pEζ(mi,bi)+(1−p)Eζ(m′i,b′i)=Eζ(p(mi,bi)+(1−p)(m′i,b′i))

and then, by convexity of the Hi, p(mi,bi)+(1−p)(m′i,b′i)∈Hi, so we wrote p(m,b)+(1−p)(m′,b′) as a mixture of points in Hi.

Upper completion is another easy one, because, if (m,b)∈EζHi, then you can go

(m,b)+(m∗,b∗)=Eζ(mi,bi)+Eζ(m∗,b∗)=Eζ((mi,bi)+(m∗,b∗))

And ((mi,bi)+(m∗,b∗))∈Hi by upper completion.

That leaves the nightmare of closure. Fix a sequence Mn∈Eζ(Hi) limiting to M. You can think of the Mn as (mn,bn). We can shatter the Mn into Mi,n∈Hi, where Mi,n can be thought of as (mi,n,bi,n).

Now, since the Mn converge to something, there must be upper bounds on the bn and mn(1) terms of the sequence; call them b◯ and λ◯. Now, for all n and all i′, b◯≥bn=∑iζibi,n≥ζi′bi′,n, so, for all n and i, bi,n≤b◯/ζi.

Also, for all n and i′, λ◯+b◯≥mn(1)+bn=∑i(ζi(mi,n(1)+bi,n))≥ζi′(mi′,n(1)+bi′,n), and reshuffling, we get (λ◯+b◯)/ζi′≥mi′,n(1)+bi′,n, which then makes (λ◯+b◯)/ζi′≥m+i′,n(1)+(m−i′,n(1)+bi′,n). Further, due to (mi′,n,bi′,n) being an sa-measure, bi′,n+m−i′,n(1)≥0, so for all n and i, m+i,n(1)≤(λ◯+b◯)/ζi.

Ok, so taking stock of what we've shown so far: for all i, the sequence Mi,n is roaming about within Hi∩{(m,b)|b≤b◯/ζi,m+(1)≤(λ◯+b◯)/ζi}. And, by the Compactness Lemma, this set is compact, since it's got bounds (weak bounds, but bounds nonetheless). Defining

¯Mn∈∏i(Hi∩{(m,b)|b≤b◯/ζi,m+(1)≤(λ◯+b◯)/ζi})

where ¯Mn(i):=Mi,n, we can view everything as one single sequence ¯Mn wandering around in the product of compact sets. By Tychonoff's theorem (we've only got a countable product of compact metric spaces, so we don't need full axiom of choice; dependent choice suffices), we can fix a convergent subsequence of this, and the projections of this subsequence to every Hi converge.

Ok, so we've got a subsequence of n where, regardless of i, the Mi,n converge to some Mi∈Hi (by closure of Hi). How does that help us? We don't even know if mixing these limit points converges to something or runs off to infinity. Well... fix any j you like; we'll just look at the partial sum of the first j components. Also fix any ϵ you please. On our subsequence of interest, the Mn converge to M, and in all i, the Mi,n converge to Mi. So, let n be large enough (and in our subsequence) that d(Mn,M)<ϵ and ∀i≤j:d(Mi,n,Mi)<ϵ; we can always find such an n.

Now, ∑i≤jζiMi+∑i>jζiMi,n is a well-defined point (because it's a finite sum of points plus a convergent series, as witnessed by the well-definedness of Mn, which breaks down as ∑iζiMi,n). It also lies in the upper completion of the single point ∑i≤jζiMi. We'll show that this point is close to M. Since we're working in a space with a norm,

d(M+M∗,M′+M∗)=||(M+M∗)−(M′+M∗)||=||M−M′||=d(M,M′)

This will come in handy in the later equations.

d(∑i≤jζiMi+∑i>jζiMi,n,M)≤d(∑i≤jζiMi+∑i>jζiMi,n,Mn)+d(Mn,M)

<d(∑i≤jζiMi+∑i>jζiMi,n,∑iζiMi,n)+ϵ=d(∑i≤jζiMi,∑i≤jζiMi,n)+ϵ

≤∑i≤jd(ζiMi,ζiMi,n)+ϵ=∑i≤j||ζiMi−ζiMi,n||+ϵ=∑i≤jζi||Mi−Mi,n||+ϵ

=∑i≤jζid(Mi,Mi,n)+ϵ<∑i≤jζiϵ+ϵ≤ϵ+ϵ=2ϵ

So, M is less than 2ϵ away from the upper completion of the point ∑i≤jζiMi, which is a closed set (the Minkowski sum of a closed and a compact set is closed). ϵ can be shrunk to 0 with increasing n, so M has distance 0 from the upper completion of said partial sum, and thus lies above the partial sum!

Abbreviating ∑i≤jζiMi as Mj, we get that all the Mj lie in {M}−Msa(X), and are all sa-measures. Thus, *if* the sequence Mj converges to a unique point, then said limit point is ∑iζiMi, and all the Mi∈Hi, so ∑iζiMi would lie in EζHi. Further, by Lemma 3, ∑iζiMi∈{M}−Msa(X), since that set is compact, so M lies above ∑iζiMi, and would lie in EζHi by upper-completeness.

So, all that's left to wrap up our closure argument is showing that the sequence Mj has a *single* limit point. Since it's wandering around in ({M}−Msa(X))∩Msa(X) which is compact by Lemma 3, there are convergent subsequences. All we have to show now is that all convergent subsequences must have the same limit point.

Assume this is false, and there are two distinct limit points of the sequence Mj; call them M∞ and M′∞. Because it's impossible for two points to *both* be above one another (in the minimal-point/adding-points sense) without the two points being identical, either M∞∉{M′∞}−Msa(X), or vice-versa. Without loss of generality, assume M∞∉{M′∞}−Msa(X). Since the latter is a closed set, M∞ must be ϵ away from it for some ϵ>0. Fix some j from the subsequence that M∞ is a limit point of, where d(Mj,M∞)<ϵ/2. There must be some *strictly greater* j′ from the subsequence that M′∞ is a limit point of.

Mj′=∑i≤j′ζiMi=∑i≤jζiMi+∑j<i≤j′ζiMi=Mj+∑j<i≤j′ζiMi

Further, the ζi are nonzero. Also, no Mi can be the 0 point, because Mi∈Hi, and if Mi=(0,0), then EHi(1)=0, which is impossible by normalization. So, Mj lies strictly below Mj′. Also, Mj′ lies below M′∞, because for all j∗>j′,

Mj∗=∑i≤j∗ζiMi=∑i≤j′ζiMi+∑j′<i≤j∗ζiMi=Mj′+∑j′<i≤j∗ζiMi

so Mj∗∈{Mj′}+Msa(X) for all j∗>j′. The sequence that limits to M′∞ is roaming around in this set, which is closed because the sum of a compact set (a single point) and a closed set is closed. So, M′∞ lies above Mj′, which lies above Mj. Thus, Mj∈{M′∞}−Msa(X). However, Mj is ϵ/2 or less distance from M∞, which must be ϵ distance from {M′∞}−Msa(X), and we have a contradiction.

Ok, so the sequence of partial sums Mj has a *single* limit point, which is ∑iζiMi, and all the Mi∈Hi, so ∑iζiMi∈EζHi, and by Lemma 3, ∑iζiMi∈{M}−Msa(X), since that set is compact, so M lies above ∑iζiMi, and lies in EζHi by upper-completeness. We're done!

For minimals, by our argument about what it takes to invoke LF-Duality in Proposition 9, we only need convexity, closure, and upper completion (which we have), and that the h induced by EζHi is continuous. By Proposition 10, EEζHi(f)=Eζ(EHi(f))=Eζ(hi(f))=(Eζhi)(f). We might as well go for uniform continuity, since all the Hi are infradistributions, and so fulfill weak-bounded-minimals, so their hi are uniformly continuous. Then, this continuity lets you invoke LF-Duality, and transfer uniform continuity for the h induced by EζHi to weak-bounded-minimals for EζHi.

For uniform continuity/weak-bounded-minimals, given an arbitrary ϵ, we can pick a finite j where ∑i>jζi<ϵ/2, and a finite δ where, for all hi with i≤j, d(f,f′)<δ implies |hi(f)−hi(f′)|<ϵ/2. Monotonicity and normalization for the hi ensure that, no matter what, hi(f)∈[0,1], so regardless of f and f′, |hi(f)−hi(f′)|≤1. Then, we can go: Ok, if d(f,f′)<δ, then

|Eζ(hi(f))−Eζ(hi(f′))|≤Eζ|hi(f)−hi(f′)|

=∑i≤jζi|hi(f)−hi(f′)|+∑i>jζi|hi(f)−hi(f′)|

<∑i≤jζi(ϵ/2)+∑i>jζi<∑iζi(ϵ/2)+ϵ/2=ϵ/2+ϵ/2=ϵ

And by our earlier argument, we invoke LF-Duality and pick up weak-bounded-minimals.

For positive-minimals, we can just observe that, if f′≥f, then

(Eζhi)(f′)=Eζ(hi(f′))≥Eζ(hi(f))=(Eζhi)(f)

By monotonicity for the hi because Hi had positive-minimals. Going back to EζHi, since its associated h is monotone, it must have positive-minimals as well.

For bounded minimals assuming the Lipschitz constants aren't too big, fix some ϵ. We know that ∑iζiλ⊙i<∞, where λ⊙i is the Lipschitz constant of hi. So, if d(f,f′)<ϵ, then:

|Eζ(hi(f))−Eζ(hi(f′))|≤Eζ|hi(f)−hi(f′)|=∑iζi|hi(f)−hi(f′)|<∑iζiλ⊙iϵ

So, ∑iζiλ⊙i is a finite constant, and is an upper bound on the Lipschitz constant of the mixture of the hi, so the h corresponding to EζHi has a Lipschitz constant, which, by Theorem 5, translates to bounded-minimals. And we're done.
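This Lipschitz-constant bound on the mixture can be illustrated numerically. A minimal sketch, assuming a hypothetical finite example where each hi(f)=λi·min(f), which is monotone, concave, and λi-Lipschitz in the sup-norm:

```python
import numpy as np

# The mixture sum_i zeta_i * h_i of lambda_i-Lipschitz functionals is Lipschitz
# with constant at most sum_i zeta_i * lambda_i.
rng = np.random.default_rng(1)
k, n = 50, 10
zeta = rng.random(k); zeta /= zeta.sum()       # mixture weights, sum to 1
lam = rng.random(k) * 3.0                      # Lipschitz constants lambda_i

def mix(f):
    # (E_zeta h_i)(f), with h_i(f) = lambda_i * min(f)
    return sum(zeta[i] * lam[i] * f.min() for i in range(k))

f = rng.random(n)
fp = f + rng.uniform(-0.1, 0.1, n)             # a nearby function
gap = abs(mix(f) - mix(fp))
bound = (zeta * lam).sum() * np.abs(f - fp).max()
assert gap <= bound + 1e-12
```

Each component contributes at most ζiλi·d(f,f′) to the gap, and the contributions sum to the claimed constant.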

**Proposition 12:** g∗(Eζ(Hi))=Eζ(g∗(Hi))

Let's use Theorem 5 to translate this into the concave functional setting. We want to show that g∗(Eζhi)=Eζ(g∗(hi)) Now, given any function f∈C(Y,[0,1]),

(g∗(Eζhi))(f)=(Eζhi)(f∘g)=Eζ(hi(f∘g))=Eζ(g∗(hi)(f))=(Eζ(g∗(hi)))(f)

and we're done! The two concave functionals corresponding to those two sets are the same, so the sets themselves are the same.

**Lemma 8:** *The "raw update" *ugL:Msa(X)→Msa(L)* defined by *(m,b)↦(m⋅L,b+m(0★Lg))* is a continuous linear operator.*

For linearity,

ugL(a(m,b)+a′(m′,b′))=ugL(am+a′m′,ab+a′b′)

=((am+a′m′)⋅L,ab+a′b′+(am+a′m′)(0★Lg))

=(a(m⋅L)+a′(m′⋅L),ab+a′b′+am(0★Lg)+a′m′(0★Lg))

=a(m⋅L,b+m(0★Lg))+a′(m′⋅L,b′+m′(0★Lg))=augL(m,b)+a′ugL(m′,b′)

Now for continuity. mn⋅L limits to m⋅L if, for all f∈C(supp(L)) (where supp(L) denotes the closed support of L), (mn⋅L)(f) limits to (m⋅L)(f). Observe that (m⋅L)(f)=m(f★L0), and f★L0 is continuous.

Now, for any f we can go

limn→∞((mn⋅L)(f))=limn→∞(mn(f★L0))=m(f★L0)=(m⋅L)(f)

establishing continuity in the first vector component, by mn limiting to m. For the second vector component,

m(0★Lg)+b=limn→∞(mn(0★Lg))+limn→∞bn=limn→∞(mn(0★Lg)+bn)

So we have continuity in the second vector component as well, and we're done.
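The raw update and its linearity are easy to check in a finite-dimensional stand-in. A minimal sketch, assuming a hypothetical 5-point space where a signed measure m is a vector, L and g are vectors in [0,1]^n, and (f★Lg)=L·f+(1−L)·g pointwise:

```python
import numpy as np

def star(f, L, g):
    # f star_L g = L*f + (1-L)*g, pointwise
    return L * f + (1 - L) * g

def raw_update(m, b, L, g):
    # (m, b) |-> (m * L, b + m(0 star_L g))
    return m * L, b + m @ star(np.zeros_like(m), L, g)

# Linearity check (Lemma 8): u(a*M + a'*M') = a*u(M) + a'*u(M')
rng = np.random.default_rng(2)
n = 5
L, g = rng.random(n), rng.random(n)
m, b = rng.standard_normal(n), rng.random()
mp, bp = rng.standard_normal(n), rng.random()
a, ap = 2.0, -0.5
lhs = raw_update(a * m + ap * mp, a * b + ap * bp, L, g)
u1, u2 = raw_update(m, b, L, g), raw_update(mp, bp, L, g)
assert np.allclose(lhs[0], a * u1[0] + ap * u2[0])
assert np.isclose(lhs[1], a * u1[1] + ap * u2[1])
```

In finite dimensions continuity is automatic for a linear map, so only the linearity computation needs checking here.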

**Lemma 9:** (ugL(H))min⊆ugL(Hmin)

As a recap, the raw update function ugL is: (m,b)↦(m⋅L,b+m(0★Lg))

Take a point (m,b)∈(ugL(H))min. Now there must be a preimage point (m′,b′)∈H that, when we apply ugL, produces (m,b). Because (m′,b′) is in an infradistribution, we can decompose it into a minimal point and something else, (m′,b′)=(mmin,bmin)+(m∗,b∗). Then,

(m,b)=ugL((m′,b′))=ugL((mmin,bmin)+(m∗,b∗))=ugL(mmin,bmin)+ugL(m∗,b∗)

This was done by using linearity of ugL via Lemma 8.

Note that, since we have written (m,b) as a sum of a different point also in ugL(H) and an sa-measure, but (m,b) is minimal in ugL(H), the sa-measure must be 0, so (m,b)=ugL(mmin,bmin)∈ugL(Hmin), and we're done.

**Proposition 13:** *When updating a bounded infradistribution over *Msa(X)*, if the renormalization doesn't fail, you get a bounded infradistribution over the set *Msa(L)*. (for infradistributions in general, you may have to take the closure)*

Proof sketch: It doesn't matter whether you take upper-completion before or after renormalization, so we can appeal to Proposition 7: Renormalizing a bounded inframeasure produces a bounded infradistribution (if the renormalization doesn't fail).

So, we just have to show nonemptiness, convexity, upper-completion (trivial), positive-minimals/bounded minimals (by Lemma 9, the preimage of a minimal point contains a minimal point, so we can transfer over the properties from the minimal point in the preimage), and closure. The set of minimal points in H is contained in a compact set, so we can take a sequence in (ugL(H))uc, split into a component in ugL(H) and something else, take preimage points, get minimals below all of them, isolate a convergent subsequence, map the limit point back through, and show that the limit point lands under your point of interest. That establishes all conditions for a bounded inframeasure, so then we just have to check that our renormalization is the right one to do.

Proof: Nonemptiness is trivial, since ugL isn't a partial function. Upper-completion is also trivial, because we explicitly took the upper completion. For convexity, observe that ugL is a linear operator by Lemma 8, so it maps convex sets to convex sets, and the Minkowski sum of two convex sets is convex. ugL maps sa-measures to sa-measures, because

b+m(0★Lg)+(m⋅L)−(1)=b+m(0★Lg)+(m−⋅L)(1)

=b+m(0★Lg)+m−(1★L0)=b+m+(0★Lg)+m−(0★Lg)+m−(1★L0)

≥b+m−(1★Lg)≥b+m−(1)≥0

For positive-minimals and bounded-minimals, we invoke Lemma 9, (ugL(H))min⊆ugL(Hmin). All minimal points in ugL(H) must have a preimage minimal in H, which is an a-measure. Chopping down a measure by L keeps it a measure, so we still have no negative components post-update, and all minimal points in ugL(H) are a-measures. Similarly, chopping down a measure by L reduces the λ value, and we had an upper bound of λ⊙ originally, so the upper bound still works post-update. This gets bounded-minimals.

This just leaves closure. Fix a sequence Mn in (ugL(H))uc limiting to M. The Mn break down into ugL(M′n)+M∗n, where M′n∈H. M′n further breaks down into Mminn+M∗∗n, where Mminn∈Hmin. By Proposition 5, the Mminn sequence is wandering around in a compact set, since we have bounded-minimals on H, so there's a convergent subsequence, which has a limit point Mmin. Map that convergent subsequence and limit point through ugL, which is continuous by Lemma 8, to get a sequence of points ugL(Mminn) limiting to ugL(Mmin)∈ugL(H). Fix some really big n where d(M,Mn)<ϵ and d(ugL(Mminn),ugL(Mmin))<ϵ.

Now, ugL(Mmin)+ugL(M∗∗n)+M∗n lies in the upper completion of the point ugL(Mmin). We'll show that this sum of 3 terms is close to M. Since we're working in a Banach space, d(x+y,z+y)=d(x,z), by norm arguments.

d(ugL(Mmin)+ugL(M∗∗n)+M∗n,M)≤d(ugL(Mmin)+ugL(M∗∗n)+M∗n,Mn)+d(Mn,M)

<d(ugL(Mmin)+ugL(M∗∗n)+M∗n,ugL(M′n)+M∗n)+ϵ

=d(ugL(Mmin)+ugL(M∗∗n),ugL(M′n))+ϵ=d(ugL(Mmin)+ugL(M∗∗n),ugL(Mminn+M∗∗n))+ϵ

=d(ugL(Mmin)+ugL(M∗∗n),ugL(Mminn)+ugL(M∗∗n))+ϵ=d(ugL(Mmin),ugL(Mminn))+ϵ<2ϵ

So, M is within 2ϵ of the upper completion of {ugL(Mmin)} for all ϵ, and it's a closed set, so M lies above ugL(Mmin)∈ugL(H), so M∈(ugL(H))uc, and we have closure.

Now that all prerequisite conditions have been established, we just need to show that 1/PgH(L) and EH(0★Lg) are the proper renormalization constants to use.

The proper renormalization to use is 1/(E(ugL(H))uc(1)−E(ugL(H))uc(0)) for the scale, and E(ugL(H))uc(0) for the shift. So let's unpack these quantities.

E(ugL(H))uc(0)=EugL(H)(0)=inf(m,b)∈ugL(H)b=inf(m,b)∈H(b+m(0★Lg))=EH(0★Lg)

So, our shift constant checks out, it's the proper shift constant to use. In the other direction,

E(ugL(H))uc(1)=EugL(H)(1)=inf(m,b)∈ugL(H)(m(1)+b)

=inf(m,b)∈H((m⋅L)(1)+b+m(0★Lg))=inf(m,b)∈H(m(1★L0)+b+m(0★Lg))

=inf(m,b)∈H(m(1★Lg)+b)=EH(1★Lg)

For the scale constant, observe that 1/(E(ugL(H))uc(1)−E(ugL(H))uc(0))=1/(EH(1★Lg)−EH(0★Lg))=1/PgH(L)

So our scale constant is also the right scale constant to use. Now, we can invoke Proposition 7: Renormalizing a bounded inframeasure produces a bounded infradistribution if the renormalization doesn't fail.
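The unpacking of the updated expectations can be verified on a finite toy H. A minimal sketch, assuming a hypothetical set of a-measures (m,b) over a 4-point space:

```python
import numpy as np

# Check that E_{u(H)}(1) = E_H(1 star_L g) and E_{u(H)}(0) = E_H(0 star_L g),
# so the scale and shift constants above are the right ones.
rng = np.random.default_rng(3)
n, k = 4, 8
H = [(rng.random(n), rng.random()) for _ in range(k)]   # a-measures (m, b)
L, g = rng.random(n), rng.random(n)
star = lambda f: L * f + (1 - L) * g                    # f star_L g

def E(points, f):
    return min(m @ f + b for m, b in points)

# Raw update of each point: (m, b) |-> (m * L, b + m(0 star_L g))
updated = [(m * L, b + m @ star(np.zeros(n))) for m, b in H]
assert np.isclose(E(updated, np.ones(n)), E(H, star(np.ones(n))))
assert np.isclose(E(updated, np.zeros(n)), E(H, star(np.zeros(n))))
```

Upper completion doesn't change the expectations, so checking against the raw-updated points suffices.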

**Proposition 14:** EH(f★Lg)=EH(0★Lg)+PgH(L)EH|gL(f)

Proof: if PgH(L)≠0, then

EH(0★Lg)+PgH(L)EH|gL(f)=EH(0★Lg)+PgH(L)(inf(m,b)∈H|gL(m(f)+b))

=EH(0★Lg)+PgH(L)(inf(m,b)∈H((1/PgH(L))(m⋅L)(f)+(1/PgH(L))(b+m(0★Lg)−EH(0★Lg))))

=EH(0★Lg)+inf(m,b)∈H((m⋅L)(f)+b+m(0★Lg)−EH(0★Lg))

=inf(m,b)∈H((m⋅L)(f)+b+m(0★Lg))

=inf(m,b)∈H(m(f★L0)+b+m(0★Lg))=inf(m,b)∈H(m(f★Lg)+b)=EH(f★Lg)

Now, if PgH(L)=0, then EH(1★Lg)=EH(0★Lg). For any f∈C(X,[0,1]), (1★Lg)≥(f★Lg)≥(0★Lg) pointwise, so by monotonicity for the h induced by H, h(1★Lg)≥h(f★Lg)≥h(0★Lg), and since h(1★Lg)=h(0★Lg), we get h(f★Lg)=h(0★Lg). Therefore,

EH(0★Lg)+PgH(L)EH|gL(f)=EH(0★Lg)+0=EH(f★Lg)

and we get our same result.
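Proposition 14 can also be spot-checked numerically on a finite toy H. A minimal sketch, assuming a hypothetical set of a-measures over a 4-point space, with the renormalized update set built explicitly:

```python
import numpy as np

# Check E_H(f star_L g) = E_H(0 star_L g) + P^g_H(L) * E_{H|L}(f).
rng = np.random.default_rng(3)
n = 4
H = [(rng.random(n), rng.random()) for _ in range(6)]   # a-measures (m, b)
L, g, f = rng.random(n), rng.random(n), rng.random(n)
star = lambda f: L * f + (1 - L) * g

def E(points, f):
    return min(m @ f + b for m, b in points)

E0 = E(H, star(np.zeros(n)))
P = E(H, star(np.ones(n))) - E0                          # P^g_H(L), nonzero here
# Renormalized update: (m, b) |-> (m*L/P, (b + m(0 star_L g) - E0) / P)
updated = [(m * L / P, (b + m @ star(np.zeros(n)) - E0) / P) for m, b in H]
assert np.isclose(E(H, star(f)), E0 + P * E(updated, f))
```

The shift and scale cancel inside the infimum exactly as in the chain of equalities above.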

**Proposition 15:** (H|gL)|g′L′=H|g″(LL′), *where* g″:=g★((1−L)/(1−LL′))g′

Proof sketch: First, we do some shuffling around of the stars to get a lemma that will help. Then, we can use the link between updated sets and their associated concave functionals h, getting the identity purely on the concave functional level, where it's much easier to approach.

Proof: First, the star shuffling. For any f,g,g′,L,L′∈C(X,[0,1]), we'll show that

f★(LL′)(g★((1−L)/(1−LL′))g′)=(f★L′g′)★Lg.

Let's begin. First, let's deal with points x where L(x)=L′(x)=1, because that gets you a divide-by-zero error.

(f★(LL′)(g★((1−L)/(1−LL′))g′))(x)=L(x)L′(x)f(x)+(1−L(x)L′(x))(g★((1−L)/(1−LL′))g′)(x)

=L(x)L′(x)f(x)+0+0=L(x)L′(x)f(x)+L(x)⋅0⋅g′(x)+0⋅g(x)

=L(x)L′(x)f(x)+L(x)(1−L′(x))g′(x)+(1−L(x))g(x)

=L(x)(L′(x)f(x)+(1−L′(x))g′(x))+(1−L(x))g(x)

=((L′f+(1−L′)g′)★Lg)(x)=((f★L′g′)★Lg)(x)

and we're done with the divide-by-zero case. In the other case, we can safely assume there's no divide-by-zero errors.

f★(LL′)(g★((1−L)/(1−LL′))g′)=LL′f+(1−LL′)(g★((1−L)/(1−LL′))g′)

=LL′f+(1−LL′)(((1−L)/(1−LL′))g+(1−(1−L)/(1−LL′))g′)

=LL′f+(1−LL′)(((1−L)/(1−LL′))g+((1−LL′−1+L)/(1−LL′))g′)

=LL′f+(1−L)g+(1−LL′−1+L)g′=LL′f+(1−L)g+L(1−L′)g′

=L(L′f+(1−L′)g′)+(1−L)g=(L′f+(1−L′)g′)★Lg=(f★L′g′)★Lg
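The star identity just derived can be spot-checked numerically. A minimal sketch, assuming hypothetical random vectors on a 6-point space, with L and L′ kept below 1 to avoid the divide-by-zero case handled separately above:

```python
import numpy as np

# Check f star_{LL'} (g star_{(1-L)/(1-LL')} g') = (f star_{L'} g') star_L g.
rng = np.random.default_rng(4)
f, g, gp = rng.random(6), rng.random(6), rng.random(6)
L, Lp = rng.uniform(0, 0.99, 6), rng.uniform(0, 0.99, 6)
star = lambda f, L, g: L * f + (1 - L) * g
inner = star(g, (1 - L) / (1 - L * Lp), gp)
lhs = star(f, L * Lp, inner)
rhs = star(star(f, Lp, gp), L, g)
assert np.allclose(lhs, rhs)
```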

Ok, so we've established our crucial f★(LL′)(g★((1−L)/(1−LL′))g′)=(f★L′g′)★Lg identity. Let's proceed. Updates for concave functionals are: (h|gL)(f)=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))

Importing Proposition 14, EH(f★Lg)=EH(0★Lg)+PgH(L)EH|gL(f), and rearranging it (and unpacking the definition of PgH(L)), we get EH|gL(f)=(EH(f★Lg)−EH(0★Lg))/(EH(1★Lg)−EH(0★Lg))

So, updating fulfills the positive functional definition of update, because this transfers into (h|gL)(f)=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg)), which is exactly our concave functional definition of updating. So, in order to verify that the two updates equal the one big update, we can just show that their concave functional definitions are equivalent. (H|gL)|g′L′ would, on the concave functional level, turn into:

((h|gL)|g′L′)(f)=((h|gL)(f★L′g′)−(h|gL)(0★L′g′))/((h|gL)(1★L′g′)−(h|gL)(0★L′g′))

=((h((f★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))−(h((0★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg)))/((h((1★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))−(h((0★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg)))

=(h((f★L′g′)★Lg)−h(0★Lg)−h((0★L′g′)★Lg)+h(0★Lg))/(h((1★L′g′)★Lg)−h(0★Lg)−h((0★L′g′)★Lg)+h(0★Lg))

=(h((f★L′g′)★Lg)−h((0★L′g′)★Lg))/(h((1★L′g′)★Lg)−h((0★L′g′)★Lg))

and now we can use our earlier star identity to rewrite as:

=(h(f★(LL′)g″)−h(0★(LL′)g″))/(h(1★(LL′)g″)−h(0★(LL′)g″)), where g″ abbreviates g★((1−L)/(1−LL′))g′,

=(h|g″(LL′))(f)

establishing our identity of updating twice, vs one big update of a different form.

**Corollary 2:** *Regardless of L, L′, and g,* (H|gL)|gL′=H|g(LL′)

Just use Proposition 15, and notice that g★((1−L)/(1−LL′))g=((1−L)/(1−LL′))g+(1−(1−L)/(1−LL′))g=g, getting us our result.

**Corollary 3:** *If *Y* and *Z* are clopen sets, then, abusing notation by glossing over the difference between indicator functions and sets,* (H|gY)|gZ=H|g(Y∩Z)

Invoke Corollary 2, and observe that 1Y⋅1Z=1Y∩Z.

**Lemma 10:** PgEζHi(L)=Eζ(PgHi(L))

Proof: Invoke Proposition 10 to go:

PgEζHi(L)=EEζHi(1★Lg)−EEζHi(0★Lg)=Eζ(EHi(1★Lg))−Eζ(EHi(0★Lg))

=Eζ(EHi(1★Lg)−EHi(0★Lg))=Eζ(PgHi(L))

**Theorem 6:** (EζHi)|gL=Eζ(PgHi(L)⋅(Hi|gL))/Eζ(PgHi(L)), *if the update doesn't fail.*

Proof: Let ζ′ be defined as ζ′i:=ζiPgHi(L)/∑jζjPgHj(L). It is a well-defined probability distribution: if all the PgHi(L) were 0, then Eζ(PgHi(L))=0, and so by Lemma 10, PgEζHi(L)=0, which would cause the update to fail.

The left-hand-side corresponds to (Eζhi)|gL on the concave functional level, and the right-hand-side corresponds to Eζ′(hi|gL) on the concave functional level. Let's begin unpacking. Lemma 10 will be used throughout, as well as the definition of PgHi(L).

(Eζ′(hi|gL))(f)=Eζ′((hi|gL)(f))=∑i((ζiPgHi(L)/∑jζjPgHj(L))⋅(hi(f★Lg)−hi(0★Lg))/(hi(1★Lg)−hi(0★Lg)))

=∑i((ζiPgHi(L)/∑jζjPgHj(L))⋅(hi(f★Lg)−hi(0★Lg))/PgHi(L))=∑i(ζi(hi(f★Lg)−hi(0★Lg))/∑jζjPgHj(L))

=(∑iζi(hi(f★Lg)−hi(0★Lg)))/Eζ(PgHi(L))=Eζ(hi(f★Lg)−hi(0★Lg))/PgEζHi(L)

=(Eζ(hi(f★Lg))−Eζ(hi(0★Lg)))/(Eζ(hi(1★Lg))−Eζ(hi(0★Lg)))=((Eζhi)(f★Lg)−(Eζhi)(0★Lg))/((Eζhi)(1★Lg)−(Eζhi)(0★Lg))=((Eζhi)|gL)(f)

So, (Eζhi)|gL=Eζ′(hi|gL) as desired, which shows our result.
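Theorem 6 can be sanity-checked at the concave-functional level on finite toys. A minimal sketch, assuming two hypothetical finite sets of a-measures over a 4-point space, with hi(f)=min(m,b)(m(f)+b):

```python
import numpy as np

# Check (E_zeta h_i)|gL = E_zeta'(h_i|gL) with zeta'_i = zeta_i P_i / sum_j zeta_j P_j.
rng = np.random.default_rng(5)
n = 4
zeta = np.array([0.4, 0.6])
Hs = [[(rng.random(n), rng.random()) for _ in range(5)] for _ in range(2)]
L, g, f = rng.random(n), rng.random(n), rng.random(n)
star = lambda f: L * f + (1 - L) * g

def h(H, f):
    return min(m @ f + b for m, b in H)

def upd(H, f):                       # (h_i | g L)(f)
    lo, hi = h(H, star(np.zeros(n))), h(H, star(np.ones(n)))
    return (h(H, star(f)) - lo) / (hi - lo)

P = np.array([h(H, star(np.ones(n))) - h(H, star(np.zeros(n))) for H in Hs])
zp = zeta * P / (zeta * P).sum()     # the reweighted distribution zeta'

hmix = lambda f: sum(z * h(H, f) for z, H in zip(zeta, Hs))   # E_zeta h_i
lo, hi = hmix(star(np.zeros(n))), hmix(star(np.ones(n)))
lhs = (hmix(star(f)) - lo) / (hi - lo)                        # (E_zeta h_i)|gL
rhs = sum(zp[i] * upd(Hs[i], f) for i in range(2))            # E_zeta'(h_i|gL)
assert np.isclose(lhs, rhs)
```

The reweighting by zeta' exactly cancels the per-component normalizers, as in the algebra above.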


### Proofs Section 2.1 (Theorem 1, Lemmas)

Fair upfront warning: This is not a particularly readable proof section. There's a bunch of dense notation, logical leaps due to illusion of transparency since I've spent months getting fluent with these concepts, and a relative lack of editing since it's long. If you really want to read this, I'd suggest PM-ing me to get a link to MIRIxDiscord, where I'd be able to guide you through it and answer questions. This post will be recapping the notions and building up an arsenal of lemmas, the next one will show the isomorphism theorem, translation theorems, and behavior of mixing, and the last one is about updates and the decision-theory results. It's advised to have them open in different tabs and go between them as needed.
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

With that said, let's establish some notation and mental intuition. I'll err on the side of including more stuff, because of the illusion of transparency. First, visualize the tree of alternating actions and observations in an environment. A full policy π can be viewed as that tree with some branches pruned off, specifying every history that's *possible* with your policy of interest. All our policies are deterministic. A policy stub πst is a policy tree that's been mostly pruned down (it doesn't extend further than some finite time n). A partial policy πpa is just any policy tree in any state of specification or lack thereof, from tiny stubs to full policies to trees that are infinite down some branches but not others.

π∅ denotes the empty policy (a stub) which specifies nothing about what a policy does, and π¬h is some partial policy which specifies everything (acts like a full policy) everywhere except on history h and afterwards.

There's a distance metric on histories, as well as a distance metric on partial policies. Both of them are of the form γt where γ<1, and t is the "time of first difference". For histories, it's "what's the first time these histories differ", for policies, it's "what's the shortest time by which one partial policy is defined and the other is undefined, or where the policies differ on what to do". So, thinking of the distance as getting smaller as the time of first difference gets bigger is a reliable guide.
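As a concrete toy version of the history metric, assuming histories are tuples of action/observation symbols and taking γ=0.5 (the constant, the function names, and the symbols are all illustrative, not from the original construction):

```python
# A minimal sketch of the gamma^t distance on histories, assuming
# histories are finite tuples of action/observation symbols.
GAMMA = 0.5

def first_difference(h1, h2):
    """Index of the first position where the histories differ. If one is
    a strict prefix of the other, the first difference is at the first
    position where only one of them is defined."""
    for t, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            return t
    if len(h1) != len(h2):
        return min(len(h1), len(h2))
    return float("inf")  # identical histories

def history_distance(h1, h2):
    t = first_difference(h1, h2)
    return 0.0 if t == float("inf") else GAMMA ** t

# Later first difference => smaller distance.
assert history_distance(("a", "o1"), ("a", "o2")) < history_distance(("a",), ("b",))
```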

The outcome set F(πpa) is... take the tree corresponding to πpa, and it's the set of all the observation leaf nodes and infinite paths. No matter what, if you're interacting with an environment and acting according to πpa, the history you get is *guaranteed* to have, as a prefix, something in F(πpa). FNF(πpa) is that same set but minus all the Nirvana observations. Nirvana is a special observation which can occur at any time, counts as infinite reward, and ends the history. This is our compact metric space of interest that we're using to define a-measures and sa-measures. We assume that there's only finitely many discrete actions/observations available at any given point in time.

In this setting, sa-measures and a-measures over FNF(πpa) are defined as usual (a pair of a signed measure m and a number b where b+m−(1)≥0 for sa-measures, and a measure m with no negative parts and b≥0, respectively), because there's no infinite reward shenanigans. Sa-measures over F(πpa) require a technicality, though, which is that no nirvana event can have negative measure. λ will denote the total amount of measure you have. So, for a probability distribution, λ will be 1. We'll just use this for a-measures, and talk freely about the λ and b values of an a-measure. We use the KR-metric for measuring the distance between sa-measures (or a-measures), which is like "if two measures are really similar for a long time and then start diverging at late times, they're pretty similar." It's also equivalent to the earthmover distance, which is "how much effort does it take to rearrange the pile-of-dirt-that-is-this-measure into the pile-of-dirt-that-is-that-measure."
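The earthmover picture can be made concrete in the simplest possible setting: for two finitely-supported probability distributions on the real line, the earthmover (Wasserstein-1) distance is the integral of the absolute difference of their CDFs. This is only a one-dimensional toy to build intuition, not the actual KR metric on spaces of histories:

```python
# A 1-D toy of the earthmover distance: for finitely-supported
# probability distributions on the real line, Wasserstein-1 equals the
# integral of |CDF difference|. The data here is made up.
def earthmover_1d(mu, nu):
    """mu, nu: dicts mapping points on the line to probability mass."""
    xs = sorted(set(mu) | set(nu))
    total, cdf_gap = 0.0, 0.0
    for x_prev, x in zip(xs, xs[1:]):
        cdf_gap += mu.get(x_prev, 0.0) - nu.get(x_prev, 0.0)
        total += abs(cdf_gap) * (x - x_prev)
    return total

# Moving all the mass a distance of 2 costs 2.
assert earthmover_1d({0.0: 1.0}, {2.0: 1.0}) == 2.0
```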

One important note. While m(f) is "what's the expectation of the continuous function f over histories, according to the measure we have", we frequently abuse notation and use m(h) to refer to what technically should be "what's the expectation of the indicator function for "this history has h as a prefix" w.r.t the measure". The reason we can do this is because the indicator function for the finite history h is a continuous function! So we can just view it as "what's the measure assigned to history h". Similarly, f★hg is the continuous function that's f on histories with h as a prefix and g on histories without h as a prefix.
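A toy version of the m(h) abuse of notation, with a measure stored on complete histories and m(h) read off as the expectation of the prefix-indicator (the example measure is made up):

```python
# Sketch of the m(h) notation: a measure over complete histories, with
# m(h) read as the total mass of histories extending the prefix h.
def prefix_mass(m, h):
    """m: dict mapping complete histories (tuples) to mass."""
    return sum(mass for hist, mass in m.items() if hist[:len(h)] == h)

m = {("a", "o1"): 0.25, ("a", "o2"): 0.5, ("b", "o1"): 0.25}
assert prefix_mass(m, ("a",)) == 0.75  # mass of everything starting with "a"
```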

For a given Ma(F(πpa)) or the nirvana-free variant, NF is just the subset of that where the measure components of the a-measures assign 0 measure to Nirvana occurring. They're safe from infinite reward. We suppress the dependency on πpa. Similarly,

EB(f)=inf(m,b)∈B∩NF(m(f)+b)

because if a Nirvana-containing measure was selected by Murphy, you'd get infinite expected value, so Murphy won't pick anything with Nirvana in it. Keep that in mind.
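A minimal sketch of that expectation, assuming a finite toy set B of a-measures over finite histories, with Nirvana marked by an "N" symbol (all the data and names here are illustrative):

```python
# Toy version of E_B(f): Murphy minimizes m(f)+b over B, but skips any
# a-measure giving Nirvana positive mass (infinite reward for us).
def expectation(B, f):
    """B: list of (m, b), m a dict over histories; f: dict of payoffs."""
    def nirvana_free(m):
        return all(mass == 0 for h, mass in m.items() if "N" in h)
    values = [sum(m[h] * f[h] for h in m) + b for m, b in B if nirvana_free(m)]
    return min(values)

B = [
    ({("o",): 1.0, ("N",): 0.0}, 0.5),  # nirvana-free, value 1.5
    ({("o",): 1.0, ("N",): 0.0}, 1.0),  # nirvana-free, value 2.0
    ({("o",): 0.5, ("N",): 0.5}, 0.0),  # Nirvana has positive mass: skipped
]
f = {("o",): 1.0, ("N",): 0.0}
assert expectation(B, f) == 1.5  # Murphy never picks the Nirvana option
```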

There's a fiddly thing to take into account about upper completion. We're usually working in the space of a-measures Ma(F(πpa)) or the nirvana-free equivalent. But the variant of upper completion we impose on our sets is: take the nirvana-free part of your set of interest, take the upper completion w.r.t the cone of nirvana-free sa-measures, then intersect with a-measures again. So, instead of the earlier setting where we could have any old sa-measure in our set and we could add any old sa-measure to them, now, since we're working purely in the space of a-measures and only demanding upper closure of the nirvana-free part, our notion of upper completion is something more like "start with a nirvana-free a-measure, you can add a nirvana-free sa-measure to it, and adding them has to make a nirvana-free a-measure".

Even worse, this is the notion of upper completion we impose, but for checking whether a point counts as minimal, we use the cone of sa-measures (with nirvana). So, for certifying that a point is non-minimal, we have to go "hey, there's another a-measure where we can add an sa-measure and make our point of interest". A different notion of upper completion here.

And, to make matters even worse, sometimes we do arguments involving the cone of sa-measures or nirvana-free sa-measures and don't impose the a-measure restriction. I'll try to clarify which case we're dealing with, but I can't guarantee it'll all be clear or sufficiently edited.

There's a partial ordering on partial policies, which is πpa≥π′pa if the two policies never disagree on which action to take, and πpa is defined on more histories than π′pa is (is a bigger tree). So, instead of viewing a partial policy as a tree, we can view the set of partial policies as a big poset. The full policies π are at the top, the empty policy π∅ is at the bottom. Along with this, we've got two important notions. One is the fundamental sequence of a partial policy. Envisioning it at the tree level, πnpa is "the tree that is πpa, just cut off at level n". Envisioning it at the poset level, the sequence πnpa is a chain of points in our poset ascending up to the point πpa.
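The fundamental sequence is easy to sketch if a partial policy is modelled as a dict from observation-histories to actions (a hypothetical encoding, purely for illustration):

```python
# Sketch of the fundamental sequence: pi^n_pa is pi_pa cut off at
# level n. A partial policy is modelled as a dict from
# observation-histories (tuples) to actions; truncation keeps only the
# entries defined before time n. Encoding and names are made up.
def truncate(policy, n):
    return {h: a for h, a in policy.items() if len(h) < n}

pi = {(): "a0", ("o1",): "a1", ("o2",): "a0", ("o1", "o1"): "a1"}
assert truncate(pi, 1) == {(): "a0"}        # tiny stub at the bottom
assert truncate(pi, 3) == pi                # the chain ascends up to pi
```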

Also synergizing with the partial-order view, we've got functions heading down the partial policy poset. prπhipa,πlopa∗ is the function that takes an a-measure or sa-measure over F(πhipa), and is like "ok, everything in F(πhipa) has a unique prefix in F(πlopa), push your measure component down, keep the b term the same". A good way of intuiting this is that this sort of projection describes what happens when you crunch down a measure over 10-bit-bitstrings to a measure over 8-bit-bitstrings. So view your poset of partial policies as being linked together by a bunch of projection arrows heading down.
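A sketch of that projection in the bitstring picture, in miniature (3-bit strings crunched down to 2-bit prefixes; the encoding is illustrative):

```python
# Projection pr_*: push the measure component of an a-measure down to
# unique prefixes, keeping the b term the same.
def project(a_measure, prefix_len):
    m, b = a_measure
    m_lo = {}
    for h, mass in m.items():
        p = h[:prefix_len]
        m_lo[p] = m_lo.get(p, 0.0) + mass
    return (m_lo, b)  # b is untouched

m = {"000": 0.25, "001": 0.25, "010": 0.25, "110": 0.25}
m_lo, b = project((m, 1.0), 2)
assert m_lo == {"00": 0.5, "01": 0.25, "11": 0.25}
```

Note that the total measure is preserved, matching the later fact that projections preserve the λ and b values.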

There's a function Θ mapping each partial policy πpa to a set of a-measures over F(πpa) (or the nirvana-free variant), fulfilling some special properties. Maybe Θ is only defined over policy stubs or full policies, in which case we use Θst or Θω, respectively. So, the best mental visualization/sketching aid for a lot of the proofs is your partial policies of interest with an ordering on them where bigger=up and smaller=down, and a set/point for each one. That organizes things fairly well, and is how many of these proofs were created.

Every Θ (or the stub/full policy analogue) is associated with a λ⊙ and b⊙ value, which is "smallest upper bound on the λ of the minimal points of the Θ(πpa) sets" and "smallest upper bound on the b of the minimal points of the Θ(πpa) sets". Accordingly, the set {≤⊙} is defined as {(λμ,b)|λ+b≤λ⊙+b⊙}, and is a way of slicing out a bounded region of a set that contains all minimal points, if we need to do compactness arguments.

Finally, we'll reiterate two ultra-important results from basic inframeasure theory that get used a *lot* in here and will be tossed around casually for arguments without citing where they came from. There's the Compactness Lemma, which says that if you have a bound on the λ values and the b values of a closed set of a-measures, the set is compact. There's also Theorem 2, which says that you can break down any sa-measure into (minimal point + sa-measure), we use that decomposition a whole lot.

Other results we'll casually use: projections commute (projecting down and then down again is the same as doing one big projection down); projections are linear (it doesn't matter whether you mix before or after projecting); projections don't expand distance (if two a-measures are ϵ apart before being projected down, they'll be ϵ or less apart after being projected down); if two a-measures are distinct in Ma(F(πpa)), then the measure components differ at *some* finite time (or the b terms differ), so we can project down to some finite πnpa (same thing, just end history at time n) and they'll still be different; and projections preserve the λ and b value of an a-measure.
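The projections-commute fact can be checked concretely in the same bitstring picture as before (again an illustrative encoding, not the actual construction):

```python
# Projecting 3-bit strings down to 2 bits and then to 1 bit matches
# projecting straight down to 1 bit: projections commute.
def project(m, k):
    out = {}
    for h, mass in m.items():
        out[h[:k]] = out.get(h[:k], 0.0) + mass
    return out

m = {"000": 0.25, "011": 0.25, "101": 0.5}
assert project(project(m, 2), 1) == project(m, 1)
```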

One last note, Ma(∞) is the space of a-measures on nirvana-free histories. This is *all* histories, not just the ones compatible with a specific policy. And a surmeasure SM is like a measure, but it can assign 0+ value to a nirvana event, marking it as "possible" even though it has 0 (arbitrarily low) measure.

Now, we can begin. Our first order of business is showing how the surtopology/surmetric/surmeasures are made and link together, but the bulk of this is the Isomorphism theorem, which takes about 20 lemmas of setup to compress all the tools we need for it, and then the proof itself is extremely long. After that, things go a bit faster.

**Lemma 1:** d(x,x′):=max(d1(x,x′),d2(x,x′)) *is a metric if *d1* is a metric and *d2* is a pseudometric.*

For identity of indiscernibles, d(x,x′)=0→d1(x,x′)=0→x=x′ because d1 is a metric, and in the reverse direction, if x=x′, then d2(x,x′)=0 (pseudometrics have 0 distance from a point to itself) and d1(x,x′)=0, so d(x,x′)=0.

For symmetry, both metrics and pseudometrics have symmetry, so

d(x,x′)=max(d1(x,x′),d2(x,x′))=max(d1(x′,x),d2(x′,x))=d(x′,x)

For triangle inequality, both metrics and pseudometrics fulfill the triangle inequality, so

d(x,z)=max(d1(x,z),d2(x,z))≤max(d1(x,y)+d1(y,z),d2(x,y)+d2(y,z))

≤max(d1(x,y),d2(x,y))+max(d1(y,z),d2(y,z))=d(x,y)+d(y,z)

And we're done.
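A numeric spot-check of Lemma 1, using d1=|x−y| (a metric) and d2(x,y)=|x²−y²| (a pseudometric, since d2(1,−1)=0; both choices are just illustrative stand-ins):

```python
# Brute-force check that d = max(d1, d2) satisfies the triangle
# inequality when d1 is a metric and d2 is a pseudometric.
import itertools
import random

def d1(x, y): return abs(x - y)          # a metric on the reals
def d2(x, y): return abs(x * x - y * y)  # a pseudometric: d2(1, -1) = 0
def d(x, y):  return max(d1(x, y), d2(x, y))

random.seed(0)
pts = [random.uniform(-2, 2) for _ in range(30)]
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12
```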

**Lemma 2:** *The surmetric is a metric.*

To recap, the surmetric over sa-measures is

ds((m,b),(m′,b′)):=max(d(m,m′)+|b−b′|,γt(m,m′))

where γ<1, and t(m,m′) is the minimum length of a Nirvana-containing history that has positive measure according to m and 0 measure according to m′ (or vice-versa). We'll show that γt(m,m′) acts as a pseudometric, and then invoke Lemma 1.

The first three conditions of nonnegativity, γt(m,m)=γ∞=0, and symmetry are immediate. That just leaves checking the triangle inequality. Let t1:=t(m1,m2), t2:=t(m2,m3), and t3:=t(m1,m3).

Assume t3<min(t1,t2). Then, going from m1 to m2, all changes in the possibility of a Nirvana-history take place strictly *after* t3, and going from m2 to m3, all changes in the possibility of a Nirvana-history also take place strictly after t3, so m1 and m3 behave identically (w.r.t. Nirvana-possibilities) up to and including time t3, which is impossible, because t3 is "what's the shortest time where m1 and m3 disagree on the possibility of a Nirvana-history".

Therefore, t3≥min(t1,t2), and this case is ruled out.

In one case, t3>min(t1,t2). Without loss of generality, assume the minimum is t1, so t1<t3. Then, γt3<γt1≤γt1+γt2 and the triangle inequality is shown.

In the other case, t3=min(t1,t2). Without loss of generality, assume the minimum is t1. Then, γt3=γt1≤γt1+γt2, and the triangle inequality is shown.

Therefore, γt(m,m′) is a pseudometric. Now, we can invoke Lemma 1 to show that ds is a metric.

**Theorem 1:** *The surmetric on the space of sa-measures *Ma(F(πpa))* induces the surtopology. The Cauchy completion of *Ma(F(πpa))* w.r.t the surmetric is exactly the space of sa-surmeasures.*

Proof sketch: First, use the metric to get an entourage (check the Wikipedia page on "Uniform Space"), and use the entourage to get a topology. Then, we go in both directions, and check that entourage-open sets are open according to the surtopology and the surtopology subbasis sets are entourage-open, to conclude that the topology induced by the surmetric is exactly the surtopology. Then, for the Cauchy completion, we'll show a bijection between equivalence classes of Cauchy sequences w.r.t. the surmetric and sa-surmeasures.

The surmetric is ds((m,b),(m′,b′)):=max(d(m,m′)+|b−b′|,γt(m,m′)) where γ<1, and t is the minimum length of a Nirvana-containing history that has positive measure according to m and 0 measure according to m′ (or vice-versa).

From the Wikipedia page on "Uniform Space", a fundamental system of entourages for Msa(F(πpa)) is given by

{(M,M′)∈Msa(F(πpa))×Msa(F(πpa)):ds(M,M′)≤ϵ}

A set O is open w.r.t. the uniformity iff for all M∈O, there exists an entourage V where V[M] lies entirely within O (wikipedia page). Because V is a subset of Msa(F(πpa))×Msa(F(πpa)), V[M] is the set of all second components that V pairs with the given sa-measure M.

So, let's say O is open w.r.t. the uniformity. Then, for all M∈O, there's an entourage V where V[M] lies entirely within O. A fundamental system of entourages has the property that every entourage is a superset of some set from the fundamental system. Thus, from our earlier definition of the fundamental system, there exists some ϵM where

{M′∈Msa(F(πpa)):ds(M,M′)≤ϵM}⊆O

We'll construct an open set from the surtopology that is a subset of this set and contains M, as follows. First, observe that if ds(M,M′)≤ϵM, then d(M,M′)≤ϵM and γt(m,m′)≤ϵM. For the latter, there are finitely many nirvana-containing histories with a length less than logγ(ϵM)+1, and if an M′ matches M w.r.t. which nirvana-containing histories of that finite set are possible or impossible, then γt(m,m′)<ϵM (because M and M′ then only differ on which Nirvana-histories are possible at very late times).

Accordingly, intersect the following sets:

1: the open ball centered at M with a size of ϵM

2: For all the nirvana-histories hN where |hN|≤logγ(ϵM)+1 and m(hN)>0, intersect all the sets of a-measures where that history has positive measure. These are open because they're the complements of "this finite history has zero measure", which is a closed set of sa-measures.

3: For all the nirvana-histories hN where |hN|≤logγ(ϵM)+1 and m(hN)=0, intersect all the sets of a-measures where that nirvana-history has 0 measure. These are open because they are subbasis sets for the surtopology.

We intersected finitely many open sets, so the result is open. Due to 2 and 3 and our earlier discussion, any M′ in the intersection must have γt(m,m′)<ϵM. Due to 1, d(M,M′)<ϵM.

This finite intersection of open sets (in the surtopology) produces an open set that contains M (obviously) and is a subset of {M′∈Msa(F(πpa)):ds(M,M′)≤ϵM}, which is a subset of V[M] which is a subset of O.

Because this argument can be applied to *every* point M∈O to get an open set (in the surtopology) that contains M and is a subset of O, we can make O itself by just unioning all our open sets together, which shows that O is open in the surtopology.

In the reverse direction, let's show that all sets in the subbasis of the surtopology are open w.r.t. the uniform structure.

First, we'll address the open balls around a point M. Every point M′ in such an open ball has some ϵM′-sized open ball which fits entirely within the original open ball. Then, we can just consider our entourage V being

{(M,M′′)∈Msa(F(πpa))×Msa(F(πpa)):ds(M,M′′)≤ϵM′/2}

And then V[M′] is all points that are ϵM′/2 or less away from M′ according to the surmetric, and ds(M′,M′′)≥d(M′,M′′), so this is a subset of the ϵM′-sized ball around M′, which is a subset of the ball around M.

The extra measure we added in total on step n is (note that no nirvana-history can have a length of 0, so we start at 1, and t denotes timesteps in the history)

∑t≥1∑hN:|hN|=t(2−(n+t)/#(t))=∑t≥1#(t)⋅(2−(n+t)/#(t))≤∑t≥12−(n+t)=2−n∑t≥12−t≤2−n

So, as n increases, the deviation of this sequence of sa-measures from the limit sa-surmeasure approaches 0 w.r.t. the usual metric, and every component in this sequence agrees with the others and the limit sa-surmeasure on which nirvana events are possible or impossible, so it's a Cauchy sequence limiting to the sa-surmeasure of interest.
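The geometric-series bound in that computation can be sanity-checked numerically (the truncation at T terms is just so the sum is finite; the function name is made up):

```python
# The extra measure added on step n: summing 2^-(n+t) over t >= 1 gives
# 2^-n * sum_t 2^-t <= 2^-n, regardless of how many nirvana histories
# #(t) exist at each length (the #(t) factors cancel).
def extra_measure(n, T=40):
    return sum(2 ** -(n + t) for t in range(1, T + 1))

assert extra_measure(3) <= 2 ** -3
```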

Thus, all parts have been shown. The surmetric induces the surtopology, and the Cauchy completion of the sa-measures w.r.t. the surmetric is the set of sa-surmeasures. The same proof works if you want a-surmeasures (get it from the a-measures), or surmeasures (get it from the measures).

Alright, now we can begin the lemma setup for the Isomorphism Theorem, which is the Big One. See you again at Lemma 21.

**Lemma 3:** *If *B⊆Ma(FNF(πst))* and *B* is nonempty, closed, convex, nirvana-free upper-complete, and has bounded-minimals, then *c.h(Bmin)=c.h(Bxmin)

So, first, Bxmin refers to the set of extreme minimal points of B. An extreme point of B is one that cannot be written as a mixture of other points in B.

Proof Sketch: One subset direction, c.h(Bmin)⊇c.h(Bxmin) is immediate. For the other direction, we need a way to write a minimal point as a finite mixture of extreme minimal points. What we do is first show that all extreme points in B must lie below the λ⊙+b⊙ bound by crafting a way to write them as a mix of different points with upper completion if they violate the bound. Then, we slice off the top of B to get a compact convex set with all the original minimal (and extreme) points in it. Since πst is a policy stub, FNF(πst) has finitely many possible outcomes, so we're working in a finite-dimensional vector space. In finite dimensions, a convex compact set is the convex hull of its extreme points, which are all either (extreme points in B originally), or (points on the high hyperplane we sliced at). Further, a minimal point can only be made by mixing together other minimal points. Putting this together, our minimal point of interest can be made by mixing together extreme minimal points, and the other subset direction is immediate from there.

Proof: As stated in the proof sketch, one subset direction is immediate, so we'll work on the other one. To begin with, fix a Mex that is extreme in B. It's an a-measure. If Mex has λ+b>λ⊙+b⊙, then it's not minimal (B has bounded-minimals), so we can decompose it into a minimal point Mmin respecting the bound and some nonzero sa-measure M∗. Mex=Mmin+M∗. Now, consider the point Mmin+(m∗−,−m∗−(1)) instead. We're adding on the negative part of m∗, and *just* enough of a b term to compensate, so it's an sa-measure. The sum of these two points is an a-measure, because we already know from Mex being an a-measure that the negative part of m∗ isn't enough to make any negative parts when we add it to mmin.

Anyways, summing the two parts like that saps a bit from the λ value of Mmin, but adds an equal amount on the b value, so it lies below the λ⊙+b⊙ "barrier", and by nirvana-free upper-completeness, it also lies in B. Then, we can express Mex as

Mex=Mmin+M∗=Mmin+(m∗,b∗)=Mmin+(m∗−,−m∗−(1))+(m∗+,b∗+m∗−(1))

=0.5(Mmin+(m∗−,−m∗−(1)))+0.5(Mmin+(m∗−,−m∗−(1))+2(m∗+,b∗+m∗−(1)))

Now, we already know that Mmin+(m∗−,−m∗−(1)) is an a-measure, and (m∗+,b∗+m∗−(1)) is an a-measure (no negative parts, end term is ≥0). So, we just wrote our extreme point as a mix of two distinct a-measures, so it's not extreme. Contradiction. Therefore, all extreme points have λ+b≤λ⊙+b⊙.
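The algebra of that mixture is just the identity 0.5A+0.5(A+2B)=A+B applied to a-measures; a quick finite-dimensional spot-check (the coordinate vectors stand in for (measure components, b) and are made up):

```python
# Spot-check of writing M_ex = M_lo + M_plus as the even mixture
# 0.5*M_lo + 0.5*(M_lo + 2*M_plus), showing a point with a nonzero
# "upper" part is a mix of two distinct points, hence not extreme.
def mix(p, M1, M2):
    return tuple(p * x + (1 - p) * y for x, y in zip(M1, M2))

def add(M1, M2):
    return tuple(x + y for x, y in zip(M1, M2))

M_lo = (2.0, 1.0, 1.0)    # stands in for M_min + (m*_-, -m*_-(1))
M_plus = (1.0, 0.0, 2.0)  # stands in for (m*_+, b* + m*_-(1))
M_ex = add(M_lo, M_plus)
assert mix(0.5, M_lo, add(M_lo, tuple(2 * x for x in M_plus))) == M_ex
```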

Let's resume. From bounded-minimals, we know that B has a suitable bound on λ+b, so the minimal points respect the λ⊙+b⊙ bound. Take B and chop it off at some high hyperplane, like λ+b≤2(λ⊙+b⊙) (the constant 2 isn't that important; it just has to be above 1 so we net all the extreme points and minimal points). Call the resulting set C. Due to the bound, and B being closed, C is compact by the Compactness Lemma. It's also convex.

Now, invoke the Krein-Milman theorem to get C=c.h(extreme(C)) (we're in a finite-dimensional space, since we're working with a stub and there are finitely many observation leaf nodes, so we don't need to take the closure afterwards; check the bottom of the Wikipedia page for the Krein-Milman theorem). The only extreme points in C are either points that were originally extreme in B, or points on the high hyperplane that we chopped at.

Fix some Mmin∈Bmin. Bmin⊆C, so Mmin∈C. Thus, Mmin can be written as a finite mixture of points from extreme(C). However, because Mmin is minimal, it can only be a mixture of minimal points, as we will show now.

Decompose Mmin into EζMi, and then decompose the Mi into Mmini+M∗i. To derive a contradiction, assume there exists some i′ where Mi′ isn't minimal, so that M∗i′ isn't 0. Then,

Mmin=EζMi=Eζ(Mmini+M∗i)=Eζ(Mmini)+Eζ(M∗i)

Thus, we have decomposed our minimal point into another point which is also present in B, and a nonzero sa-measure because M∗i′ is nonzero, so our original minimal point is actually nonminimal, and we have a contradiction. Therefore, all decompositions of a minimal point into a mix of points must have every component point being minimal as well.

So, when we decomposed Mmin into a mix of points in extreme(C), all the extreme points we decomposed it into are minimal, so there's no component on the high hyperplane. Mmin was arbitrary in Bmin, establishing that Bmin⊆c.h(Bxmin). Therefore, c.h(Bmin)⊆c.h(Bxmin).

So we have our desired result.

**Lemma 4:** *If *πpa≥πhist≥πlost*, and *A⊆Ma(F(πhist))* and *B⊆Ma(F(πlost))* (also works with the nirvana-free variants) and *prπhist,πlost∗(A)⊆B* then *(prπpa,πhist∗)−1(A)⊆(prπpa,πlost∗)−1(B) *This works for surmeasures too.*

A point M in the preimage of A has prπpa,πhist∗(M)∈A, and by projections commuting and projecting down further landing you in B, we get prπpa,πlost∗(M)∈B, so M is in the preimage of B too.

**Lemma 5:** *Given a partial policy *πpa* and stub *πst*, if *πpa≥πst*, then *∃n:πnpa≥πst

πst is a stub that specifies less about what the policy does than πpa, and because it's a stub it has a minimum time beyond which it's guaranteed to be undefined, so just let that be your n. πnpa then specifies everything that πst does, and maybe more, because it has all the data of πpa up till time n.

**Lemma 6:** *If *πpa* is a partial policy, and* ∀πlost,πhist≥πlost:prπhist,πlost∗(Θst(πhist))⊆Θst(πlost) *holds, then, for all *m*, *⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))=⋂n≥m(prπpa,πnpa∗)−1(Θst(πnpa)) *This works for surmeasures too.*

First, all the πnpa≤πpa are stubs, so we get one subset direction immediately.

⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))⊆⋂n≥m(prπpa,πnpa∗)−1(Θst(πnpa))

In the other direction, use Lemma 5 to find a πnpa≥πst, with n≥m, and then pair

∀πlost,πhist≥πlost:prπhist,πlost∗(Θst(πhist))⊆Θst(πlost)

with Lemma 4 to deduce that

(prπpa,πnpa∗)−1(Θst(πnpa))⊆(prπpa,πst∗)−1(Θst(πst))

Due to being able to take any stub preimage and find a smaller preimage amongst the fundamental sequence for πpa (with an initial segment clipped off) we don't need anything other than the preimages of the fundamental sequence (with an initial segment clipped off), which establishes the other direction and thus our result.

**Lemma 7:** *If *M* is an a-measure and *prπpa,πst∗(M)=M′* and *M′=M′lo+M′∗* and *M′lo* is an a-measure, then there exists a *Mlo∈Ma(F(πpa))* (or the nirvana-free variant) s.t. *prπpa,πst∗(Mlo)=M′lo* and there's an sa-measure *M∗* s.t. *Mlo+M∗=M*. This works for a-surmeasures and sa-surmeasures too.*

What this essentially says is "let's say we start with a M and project it down to M′, and then find a point M′lo below M′. Can we "go back up" and view M′lo as the projection of some point below M? Yes". It's advised to sketch out the setup of this one, if not the proof itself.

Proof sketch: To build our Mlo and M∗, the b components are preserved, but crafting the measure component for them is tricky. They've gotta project down to M′lo and M′∗ so those two give us our base case to start working from with the measures (and automatically get the "must project down appropriately" requirement) and then we can recursively build up by extending both of them with the conditional probabilities that M gives us. However, we must be wary of division-by-zero errors and accidentally assigning negative measure on Nirvana, which complicates things considerably. Once we've shown how to recursively build up the measure components of our Mlo and M∗, we then need to check four things. That they're both well formed (sum of measure on 1-step extensions of a history=measure on the history, no semimeasures here), that they sum up to make M, the measure component of Mlo can't go negative anywhere (to certify that it's an a-measure), and that the b term attached to M∗ is big enough to cancel out the negative regions (to certify that it's an sa-measure).

Proof: Let Mlo=(mlo,blo) where blo is the b term of M′lo. Let M∗=(m∗,b∗) where b∗ is the b term of M′∗. Recursively define mlo and m∗ on h that are prefixes of something in F(πpa) (or the nirvana-free variant) as follows:

If h is a prefix of something in F(πst) (or the nirvana-free variant), mlo(h)=m′lo(h) and m∗(h)=m′∗(h). That defines our base case. Now for how to inductively build up by mutual recursion. Let's use haN for a nirvana-history and hao for a non-nirvana history.

If m∗(h)<0, then

mlo(haN)=m(haN), mlo(hao)=m(hao)−m∗(h)/#o

m∗(haN)=0, m∗(hao)=m∗(h)/#o

#o is the number of non-nirvana observations that can come after ha.

If m∗(h)≥0 and m(h)>0, then

mlo(hao)=(m(hao)/m(h))⋅mlo(h), m∗(hao)=(m(hao)/m(h))⋅m∗(h)

and the same holds for defining mlo(haN) and m∗(haN).

If m∗(h)≥0 and m(h)=0, then mlo(hao)=m∗(hao)=0

We need to verify that these sum up to m, that they're both well-formed signed measures, that mlo has no negative parts, and that the b value for M∗ is big enough. mlo having no negative parts is immediate by the way we defined it, because it's nonnegative on all the base cases since m′lo came from an a-measure, and m came from an a-measure as well which lets you use induction to transfer that property all the way up the histories.

To verify that they sum up to m, observe that for base-case histories in F(πst),

mlo(h)+m∗(h)=m′lo(h)+m′∗(h)=m′(h)=m(h)

For non-base-case histories hao we can use induction (assume it's true for h) and go:

Negative case, m∗(h)<0.

mlo(haN)+m∗(haN)=m(haN)+0=m(haN)

mlo(hao)+m∗(hao)=m(hao)−m∗(h)/#o+m∗(h)/#o=m(hao)

Nonnegative case, no division by zero.

mlo(hao)+m∗(hao)=(m(hao)/m(h))(mlo(h)+m∗(h))=(m(hao)/m(h))⋅m(h)=m(hao)

Zero case: m(hao)=0 because m(h)=0 and m came from an a-measure and has no negative parts. mlo(hao)+m∗(hao)=0+0=0=m(hao)

Ok, so we've shown that mlo+m∗=m.

What about checking that they're well-formed signed measures? To do this, it suffices to check that summing their measure-mass over the one-step extensions of h (the haoi, and haN if present) recovers the measure-mass on h. This works over the base case, so we just have to check the induction steps.

In the negative case, for m∗,

m∗(haN)+∑im∗(haoi)=∑i(m∗(h)/#o)=m∗(h)

and for mlo

mlo(haN)+∑imlo(haoi)=m(haN)+∑i(m(haoi)−m∗(h)/#o)=m(h)−m∗(h)=mlo(h)

In the nonnegative case, no division by zero, then

mlo(haN)+∑imlo(haoi)=(m(haN)/m(h))⋅mlo(h)+∑i(m(haoi)/m(h))⋅mlo(h)

=(mlo(h)/m(h))(m(haN)+∑im(haoi))=(mlo(h)/m(h))⋅m(h)=mlo(h)

And similar for m∗.

In the zero case where m(h)=0, we need to show that mlo(h) and m∗(h) will *also* be zero. Winding h back, there's some longest prefix h′ where m(h′)>0. Now, knowing that m(h′)=mlo(h′)+m∗(h′), we have two possible options here.

In the first case, m∗(h′)≥0, so mlo(h′ao) (advancing one step) is:

mlo(h′ao)=(m(h′ao)/m(h′))⋅mlo(h′)=(0/m(h′))⋅mlo(h′)=0

And similar for m∗(h′ao), so they're both 0, along with m′, on h′ao, and then the zero case transfers the "they're both zero" property all the way up to h.

In the second case, m∗(h′)<0 and mlo(h′)>0. Then, proceeding forward, m∗(h′ao)<0, and this keeps holding all the way up to h, so we're actually in the negative case, not the zero case.

So, in the case where m∗(h)≥0, m(h)=0 implies mlo(h)=m∗(h)=0. Then it's easy: mlo(haN)+∑imlo(haoi)=0=mlo(h), and the same for m∗.

Also, m∗, by the way we defined it, never puts negative measure on a nirvana event, so we're good there, they're both well-formed signed measures. For the b∗ value being sufficient to compensate for the negative-measure of m∗, observe that the way we did the extension, the negative-measure for m∗ is the same as the negative measure for m′∗, and b∗=b′∗, and the latter is sufficient to cancel out the negative measure for m′∗, so we're good there.

We're done now, and this can be extended to a-surmeasures by taking the 0+ nirvana-events in m and saying that all those nirvana-events have 0+ measure in mlo.

**Lemma 8:** *Having a *λ+b* bound on a set of a-surmeasures is sufficient to ensure that it's contained in a compact set w.r.t the surtopology.*

This is the analogue of the Compactness Lemma for the sur-case. We'll keep it in the background instead of explicitly invoking it each time we go "there's a bound, therefore compactness". It's important.

Proof sketch: Given a sequence, the bound gets convergence of the measure part by the Compactness Lemma, and then we use Tychonoff to show that we can get a subsequence where the a-surmeasures start agreeing on which nirvana events are possible or impossible, for all nirvana events, so their first time of disagreement gets pushed arbitrarily far out, forcing convergence w.r.t. the surmetric.

Proof: given a sequence of a-surmeasures SMn, and rounding them off to their "standard" part (slicing off the 0+ probability), we can get a convergent subsequence, where the measure part and b part converges, by the Compactness Lemma since we have a λ+b bound, which translates into bounds on λ and b.

Now, all we need is a subsequence of that subsequence that ensures that, for each nirvana-event, the sequence of a-surmeasures starts agreeing on whether it's possible or impossible. There are countably many finite histories, and each nirvana-history is a finite history, so we index our nirvana events by natural numbers, and we can view our sequence as wandering around within {0,1}ω, where the t'th coordinate keeps track of whether the t'th nirvana event is possible or impossible. {0,1}ω is compact by Tychonoff's Theorem, so we can find a convergent subsequence, which corresponds to a sequence of a-surmeasures that, for any nirvana event, eventually start agreeing on whether it's possible or impossible. There's finitely many nirvana events before a certain finite time, so if we go far enough out in the n, the a-surmeasures agree on what nirvana events are possible or impossible for a very long time, and so the surdistance shrinks to 0 and they converge, establishing that all sequences have a convergent subsequence, so the set is compact.
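The pigeonhole/diagonalization step can be illustrated with a finite stand-in for {0,1}ω (a hypothetical encoding: vectors of 0/1 flags standing for "is the t'th nirvana event possible", with "occurs most often" standing in for "occurs infinitely often"):

```python
def agreeing_subsequence(vectors, n_coords):
    """Refine the index set coordinate by coordinate, keeping the branch of
    {0,1} with the most occurrences. The surviving indices pick out vectors
    that agree on every one of the first n_coords coordinates."""
    indices = list(range(len(vectors)))
    for t in range(n_coords):
        vals = [vectors[i][t] for i in indices]
        keep = 1 if vals.count(1) >= vals.count(0) else 0
        indices = [i for i in indices if vectors[i][t] == keep]
    return indices
```

In the infinite setting, each coordinate's refinement keeps an infinite branch, and diagonalizing across coordinates yields the convergent subsequence that Tychonoff guarantees.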

**Lemma 9:** *Given a *πpa* and a sequence of nonempty compact sets *Bn∈Ma(F(πnpa))* (or the nirvana-free variant) where *∀n:prπn+1pa,πnpa∗(Bn+1)⊆Bn* then there is a point *M∈Ma(F(πpa))* (or the nirvana-free variant) where *∀n:prπpa,πnpa∗(M)∈Bn*. This also works with a-surmeasures.*

To sketch this one out: it's essentially "if the sets Bn get smaller and smaller, but stay nonempty, as you ascend up the chain πnpa towards πpa, and are nested in each other, then there's something at the πpa level that projects down into all of them".

Proof sketch: Projection preserves λ and b, the Compactness Lemma says that compactness means you have a λ and b bound, so the preimage of a compact set is compact. Then, we just have to verify the finite intersection property to show that the intersection of the preimages is nonempty, which is pretty easy since all our preimages are nested in each other like an infinite onion.

Proof: Consider the intersection ⋂n(prπpa,πnpa∗)−1(Bn). Because the Bn are all compact, they have a λ and b bound. Projection preserves the λ and b values, so the preimage of Bn has a λ and b bound, and therefore lies in a compact set (by Lemma 8 for the sur-case). The preimage of a closed set is also closed, so all these preimages are compact. This is then an intersection of a family of compact sets, so we just need to check the finite intersection property. Fixing finitely many m, we can find an n above them all, pick an arbitrary point in the preimage of Bn, and invoke Lemma 4 on Bn to conclude that said point lies in all lower preimages, thus demonstrating the finite intersection property. Therefore, the intersection is nonempty.
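A toy version of the finite-intersection argument, with nested compact intervals in R standing in for the nested compact preimages (purely illustrative, not the actual a-measure spaces):

```python
def point_in_nested_intervals(intervals):
    """intervals: list of (lo, hi) pairs, each contained in the previous one.
    Compactness plus nestedness means the intersection is nonempty; the sup
    of the lower endpoints is a witness point."""
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    assert lo <= hi, "nested nonempty compact sets must intersect"
    return lo
```

The proof above runs the same argument one abstraction level up: compactness turns "every finite subfamily intersects" into "the whole family intersects".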

**Lemma 10:** Given a sequence of nonempty closed sets Bn where prπn+1pa,πnpa∗(Bn+1)⊆Bn, and a sequence of points Mn∈(prπpa,πnpa∗)−1(Bn), all limit points of the sequence Mn (if they exist) lie in ⋂n(prπpa,πnpa∗)−1(Bn) (works in the a-surmeasure case)

Proof: Assume a limit point exists, isolate a subsequence limiting to it. By Lemma 4, the preimages are nested in each other. Also, the preimage of a closed set is closed. Thus, for our subsequence, past n, the points are in the preimage of Bn and don't ever leave, so the limit point is in the preimage of Bn. This argument works for all n, so the limit point is in the intersection of the preimages.

The next three Lemmas are typically used in close succession to establish nirvana-free upper-completeness for projecting down a bunch of nirvana-free upper complete sets, and taking the closed convex hull of them, which is an operation we use a lot. The first one says that projecting down a nirvana-free upper-complete set is upper-complete, the second one says that convex hull preserves the property, and the third one says that closure preserves the property. The first one requires building up a suitable measure via recursion on conditional probabilities, the second one requires building up a whole bunch of sa-measures via recursion on conditional probabilities and taking limits of them to get suitable stuff to mix together, and the third one also requires building up a whole bunch of sa-measures via recursion on conditional probabilities and then some fanciness with defining a limiting sequence.

**Lemma 11:** *In the nirvana-free setting, a projection of an upper-complete set is upper-complete.*

Proof sketch: To be precise about exactly what this says, since we're working with a-measures, it says "if you take the fragment of the upper completion composed of a-measures, and project it down, then the thing you get is: the fragment of the upper completion of (projected set) composed of a-measures". Basically, since we're not working in the full space of sa-measures, and just looking at the a-measure part of the upper completion, that's what makes this one tricky and not immediate.

The proof path is: Take an arbitrary point Mlo in the projection of B which has been crafted by projecting down Mhi. Given an arbitrary M′lo:=Mlo+M∗lo (assuming it's an a-measure) which lies in the upper completion of the projection of B, we need to certify that it's in the projection of B to show that the projection is upper-complete. In order to do this, we craft an M∗hi and M′hi (an a-measure) s.t. Mhi+M∗hi=M′hi (certifying that M′hi is in B since B is upper complete), and M′hi projects down to hit our M′lo point of interest.

These a-measures are crafted by starting with the base case of M∗lo and M′lo, and recursively building up the conditional probabilities in accordance with the conditional probabilities of Mhi. Then we just verify the basic conditions like the measures being well-formed, M′hi being an a-measure, M∗hi having a big enough b term, and Mhi+M∗hi=M′hi, to get our result. Working in the Nirvana-free case is nice since we don't need to worry about assigning negative measure to Nirvana.

Proof: Let B⊆Ma(FNF(πhipa)) be our upper-complete set. We want to show that prπhipa,πlopa∗(B) is upper-complete. To that end, fix a Mlo in the projection of B that's the projection of a Mhi∈B. Let M′lo:=Mlo+M∗lo, where M′lo is an a-measure. Can we find an a-measure M′hi in B that projects down to M′lo? Let's define M∗hi and M′hi as follows:

Let M′hi=(m′hi,b′hi) where b′hi is b′lo. Let M∗hi=(m∗hi,b∗hi) where b∗hi is b∗lo. Recursively define m′hi and m∗hi on h that are prefixes of something in FNF(πhipa) as follows:

If h is a prefix of something in F(πlopa), m′hi(h)=m′lo(h) and m∗hi(h)=m∗lo(h).

Otherwise, recursively define the measure components m′hi and m∗hi as:

If mhi(h)>0, then

m′hi(hao) = (mhi(hao)/mhi(h))·m′hi(h), m∗hi(hao) = (mhi(hao)/mhi(h))·m∗hi(h)

If mhi(h)=0, then m′hi(hao)=m∗hi(hao)=0.

We need to verify that m′hi has no negative parts so it's fitting for an a-measure, that mhi+m∗hi=m′hi, that the b value for M∗hi works, and that they're both well-formed signed measures. The first part is easy to establish: m′hi(h)=m′lo(h)≥0 in the base cases since M′lo is an a-measure, and a quick induction, as well as mhi coming from the a-measure Mhi (so no negatives anywhere), establishes that m′hi has no negative parts.

To verify that they sum up to m′hi, observe that for base-case histories in FNF(πlopa), mhi(h)+m∗hi(h)=mlo(h)+m∗lo(h)=m′lo(h)=m′hi(h). Then, in the general case, we can use induction (assume it's true for h) and go:

If mhi(h)>0, then

mhi(hao)+m∗hi(hao) = mhi(hao) + (mhi(hao)/mhi(h))·m∗hi(h) = (mhi(hao)/mhi(h))·(mhi(h)+m∗hi(h))

= (mhi(hao)/mhi(h))·m′hi(h) = m′hi(hao)

If mhi(h)=0, then mhi(hao)=0, so

mhi(hao)+m∗hi(hao)=0+0=0=m′hi(hao)

What about checking that they're well-formed measures? To do this, it suffices to check that summing up their measure-mass over haoi gets the measure-mass over h. If mhi(h)>0, then:

∑im′hi(haoi) = ∑i (mhi(haoi)/mhi(h))·m′hi(h) = (mhi(h)/mhi(h))·m′hi(h) = m′hi(h)

And similar for m∗.

If mhi(h)=0, then trace back to the longest prefix h′ where mhi(h′)>0; then mhi(h′ao)=0, so:

m′hi(h′ao) = (mhi(h′ao)/mhi(h′))·m′hi(h′) = (0/mhi(h′))·m′hi(h′) = 0

And the same for m∗hi(h′ao). This extends forward up to h, so mhi(h)=0 implies m′hi(h)=m∗hi(h)=0. And we get:

∑im′hi(haoi)=0=m′hi(h)

and the same for m∗hi(h).

For the b value being sufficient, observe that the way we did the extension, the negative-measure for m∗hi is the same as the negative measure for m∗lo, and b∗hi=b∗lo, and the latter is sufficient to cancel out the negative measure for m∗lo, so we're good there. And for projecting down appropriately, observe that m∗hi and m′hi copy m∗lo and m′lo wherever they get a chance to do so.

Thus, M′hi=Mhi+M∗hi so M′hi lies in the upper completion of B, so it's in B, and projects down onto M′lo, certifying that M′lo lies in the projection of B, so the projection of an upper-complete set of a-measures is upper-complete.

**Lemma 12:** *In the nirvana-free setting, the convex hull of an upper-closed set is upper-closed.*

Proof sketch: Again, this is tricky since we're working with a-measures. We have a point M in the convex hull, which shatters into EζMi where the Mi are in the (non-convex) set B itself. If M+M∗ is an a-measure, we want to find M∗i s.t. Mi+M∗i is an a-measure, and mixing these makes M+M∗. However, it's really hard to define these M∗i directly, so we craft approximations indexed by n where Mi+M∗i,n is an a-measure, and mixing the m∗i,n matches m∗ perfectly up till time n. We get weak bounds on the amount of positive measure and the b term to invoke the compactness lemma, and then use Tychonoff to get a suitable convergent subsequence for all the i to define our final M∗i that, when combined with Mi, makes an a-measure. Mixing these together replicates the measure component of M∗, but not the b term. However, that's easily fixable by adding a bit of excess to one of the b terms, and we're done. We took an arbitrary M+M∗ where M∈c.h(B), and crafted M∗i where, for all i, Mi+M∗i is an a-measure (and in B by its upper-completeness) and it mixes together to make M+M∗, certifying that the convex hull is upper-closed.

Our way of constructing the m∗i,n is basically: start at length n, use mi(h)/m(h) as your scale factor for filling in the measure of histories, extend down, and then extend up with the conditional probabilities of Mi.

Proof: Take a point M∈c.h(B). It decomposes into EζMi for finitely many i, where Mi∈B. Fix some arbitrary M+M∗ (as long as it's an a-measure). We'll craft M∗i where Mi+M∗i∈B by upper completeness of B, and the M∗i mix together to make M∗ itself, certifying that M+M∗ lies in c.h(B).

Let m∗i,n be defined by: If h is of length n, or lies in FNF(πpa) and is shorter than length n,

m∗i,n(h) = (mi(h)/m(h))·m∗(h)

(m∗i,n(h) defaults to m∗(h) if m(h)=0)

And then it's defined for shorter h via:

m∗i,n(h)=∑jm∗i,n(haoj)

The b value is (summing over the base-case histories, where h is of length n or lies in FNF(πpa) and is shorter than length n): b∗i,n = −∑h min(0, m∗i,n(h))

And if mi(h)>0, extend to bigger histories via

m∗i,n(hao) = (mi(hao)/mi(h))·m∗i,n(h)

And if mi(h)=0, it's m∗i,n(hao) = m∗i,n(h)/#o

We've got a few things to show. First is showing that m∗i,n is a well-defined signed measure. ∑jm∗i,n(haoj)=m∗i,n(h) trivially if h is shorter than length n.

Otherwise (assuming mi(h)>0),

∑jm∗i,n(haoj) = ∑j (mi(haoj)/mi(h))·m∗i,n(h) = (mi(h)/mi(h))·m∗i,n(h) = m∗i,n(h)

If mi(h)=0, then ∑jm∗i,n(haoj) = ∑j m∗i,n(h)/#o = m∗i,n(h)

Ok, so it's well-defined. Also, past n, it doesn't add any more negative measure. If it's negative on a length-n history, it'll stay that negative forevermore and never go positive, so the b value we stuck on it is a critical b value (exactly sufficient to cancel out the areas of negative measure).

We do need to show that m∗i,n+mi is an a-measure. For the h of length n or in FNF(πpa) and shorter, we can split into two cases. For the case where m(h)>0,

m∗i,n(h)+mi(h) = (mi(h)/m(h))·m∗(h) + mi(h) = mi(h)·(m∗(h)/m(h) + 1)

And because m∗(h)+m(h)≥0 (they sum to make an a-measure), m∗(h)≥−m(h), so m∗(h)/m(h)≥−1, so we get a nonnegative number times a nonnegative number, so m∗i,n(h)+mi(h)≥0.

Now for the case where m(h)=0. In that case mi(h)=0 because m came from an a-measure and is the mix of the mi. Also, because m(h)+m∗(h)≥0 due to M+M∗ being an a-measure, m∗(h)≥0. Then, m∗i,n(h)+mi(h)=m∗(h)≥0

Now for the short histories. By induction down,

m∗i,n(h)+mi(h) = ∑jm∗i,n(haoj)+∑jmi(haoj) = ∑j(m∗i,n(haoj)+mi(haoj)) ≥ 0

Now for the long histories. If mi(h)>0,

m∗i,n(hao)+mi(hao) = mi(hao) + (mi(hao)/mi(h))·m∗i,n(h) = mi(hao)·(1 + m∗i,n(h)/mi(h))

And then, by induction up we can assume m∗i,n(h)+mi(h)≥0, so m∗i,n(h)≥−mi(h), so m∗i,n(h)/mi(h)≥−1, so it's a multiplication of a nonnegative number and a nonnegative number, so we're good there on showing nonnegativity.

If mi(h)=0, then mi(hao)=0. Since m∗i,n(h)+mi(h)≥0, we have m∗i,n(h)≥0. Then, m∗i,n(hao)+mi(hao) = m∗i,n(h)/#o ≥ 0

Ok, so, for all n, Mi+M∗i,n is an a-measure.

One last thing we'll want to show is that, for histories h of length n or shorter, (Eζm∗i,n)(h)=m∗(h)

First, for the histories of length n or in FNF(πpa) (assuming m(h)>0):

(Eζm∗i,n)(h) = Eζ(m∗i,n(h)) = Eζ((mi(h)/m(h))·m∗(h)) = (m∗(h)/m(h))·Eζ(mi(h)) = (m∗(h)/m(h))·(Eζmi)(h)

= (m∗(h)/m(h))·m(h) = m∗(h)

Assuming m(h)=0, then m∗i,n(h)=m∗(h) immediately giving you your result. So, since the mixture of the m∗i,n mimics m∗ on everything of length n or shorter in FNF(πpa), the "sum up the stuff ahead of you" thing makes it mimic m∗ on all histories of length n or shorter.
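A quick numeric sanity check of this mixture identity (toy numbers, a hypothetical two-component mixture, all variable names invented for illustration):

```python
# With m(h) = E_zeta m_i(h) and m*_{i,n}(h) = (m_i(h)/m(h)) * m*(h),
# the zeta-mixture of the m*_{i,n}(h) recovers m*(h) exactly.
zeta = [0.25, 0.75]
m_i_h = [0.4, 0.8]                               # m_i(h) for each component
m_h = sum(z * mi for z, mi in zip(zeta, m_i_h))  # m(h) = E_zeta m_i(h)
m_star_h = -0.3                                  # any value of m*(h)
m_star_i_h = [(mi / m_h) * m_star_h for mi in m_i_h]
mixed = sum(z * ms for z, ms in zip(zeta, m_star_i_h))
# mixed equals (m*(h)/m(h)) * E_zeta m_i(h) = m*(h)
```

The cancellation is exactly the one in the displayed chain of equalities: the scale factor m∗(h)/m(h) pulls out of the expectation, and Eζmi(h) collapses back to m(h).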

Further, m∗ is nonpositive on a history of length n or a shorter history in FNF(πpa) iff for all i, m∗i,n is nonpositive on it. So, since the b values for m∗i,n are the negative component, Eζb∗i,n is the negative-measure of m∗ up till time n, which is less negative than the negative-measure of m∗, so the mixture of the b∗i,n terms undershoots b∗.

Consider the sequence M∗i,n (the sequence is in n; i is fixed). It's a sequence of sa-measures. To show that there's a limit point, we need a bound on the positive value part and the negative value part (our b is critical, it can't go any smaller, so bounding the negative value part bounds the b). Fixing an n, the "boundary" of stuff of length n or shorter and in FNF(πpa) suffices to establish what the mass of the negative part and positive part are. We either mimic m∗ if m(h)=0, or mi(h)=0 while m(h)>0 so m∗i,n(h)=0, or both quantities are positive, so we have a scale term of mi(h)/m(h) = mi(h)/Eζmi(h) ≤ mi(h)/(ζi·mi(h)) = 1/ζi

So, our amount of positive and negative measure on m∗i,n on the "n boundary" is at most 1/ζi times the positive and negative measure on m∗ at the "n boundary", which is less than or equal to the amount of positive and negative measure that m∗ has overall. So, that gets us bounds on the positive and negative parts of m∗i,n of m∗+(1)/ζi and m∗−(1)/ζi, respectively (which bounds our b∗i,n terms).

Now, we can consider our sequence (m∗i,n,b∗i,n) as a sequence ¯Mn in

∏i(Msa(FNF(πpa)) ∩ {(m,b) | m+(1) ≤ m∗+(1)/ζi, b ≤ −m∗−(1)/ζi})

where ¯Mn(i) = (m∗i,n,b∗i,n)

By the Compactness Lemma, these sets are compact, and by Tychonoff, the product is compact. So, there's a convergent subsequence, and the limit point projects down to the coordinates to make an M∗i for each i. The set of sa-measures that, when added to an a-measure, make another a-measure, is closed, and regardless of n, Mi+M∗i,n is an a-measure, so Mi+M∗i is an a-measure.

Eζm∗i,n mimics m∗ up till timestep n, so Eζm∗i=m∗. And because, at each step, the mixture of the b values undershoots b∗, Eζb∗i≤b∗.

So, our final batch of sa-measures is M∗i for i>0, and for i=0, it's M∗0+(0,(b∗−Eζb∗i)/ζ0). Now, all these M∗i are sa-measures that, when added to Mi, make a-measures, and one of them has some extra b term on it, which doesn't impede it from being an a-measure. By upper-completeness of B, they're all in B, and mixing them makes M+M∗ exactly, because

ζ0(M0+M∗0+(0,(b∗−Eζb∗i)/ζ0)) + ∑i>0 ζi(Mi+M∗i) = Eζ(Mi)+Eζ(M∗i)+(0,b∗−Eζb∗i)

= M + Eζ(m∗i,b∗i) + (0,b∗−Eζb∗i) = M + (Eζm∗i,Eζb∗i) + (0,b∗−Eζb∗i)

= M + (Eζm∗i,b∗) = M + (m∗,b∗) = M + M∗

M+M∗ was an arbitrary a-measure above an M∈c.h(B), and we showed it's a mix of a-measures in B (using that B is upper-complete), so c.h(B) is upper-complete.

**Lemma 13:** *In the nirvana-free setting, the closure of an upper-closed set of a-measures is upper-closed.*

Proof sketch: We have an M∈¯B, and a sequence Mn∈B limiting to M. Let M′:=M+M∗ be an arbitrary a-measure above M. We must craft a sequence limiting to M′. What we do is make a bunch of M∗n,j with the special property that mn+m∗n,j perfectly mimics m′ up till time j, by basically going "copy m′−mn for time j or before, and complete with the conditional probabilities of mn so mn+m∗n,j doesn't go negative". And then the b term is set to mimic the b term of M∗, or set to cancel out the amount of negative measure, whichever is greater. The reason we only copy up till time j, instead of skipping to the chase and just going "copy m′−mn, stick on whichever b term you need", is that it affords us finer control and understanding over what our b terms are doing.

Then, we let j increase as ρ(n) to get a sequence of one variable, Mn+M∗n,ρ(n) where ρ(n) is selected to diverge to infinity at a suitable rate to get convergence to M′ itself. Again, no matter what ρ(n) is, as long as it diverges to infinity as n does, we get convergence of the measure term to the measure term m′, the hard part is selecting ρ to appropriately control what the b term is doing. Once ρ(n) is suitably defined, then we can get upper and lower bounds on how the b term of the sum compares to the b term of M′, and show convergence.

Proof: Ok, so B is upper-closed, and we want to show upper-closure of ¯¯¯¯B. Thus, we have an M∈¯¯¯¯B, a sequence of points Mn∈B that limit to M, and if M′=M+M∗ is an a-measure, we want to show that M′ is in ¯¯¯¯B. This is going to require a rather intricate setup to get our limit of interest. In this case, we'll be using both n and j as limit parameters.

Let m∗n,j be defined up till time j by: If h is of length j or shorter, m∗n,j(h)=m′(h)−mn(h)

Extend to longer histories via (if mn(h)>0)

m∗n,j(hao) = (mn(hao)/mn(h))·m∗n,j(h)

And if mn(h)=0, go with m∗n,j(hao) = m∗n,j(h)/#o

The b value is defined as (summing over histories of length j, or in FNF(πpa) and shorter): max(b∗, −∑h min(0, m∗n,j(h)))

We've got a few things to show. First is showing that m∗n,j is a well-defined signed measure. If h is length j or shorter,

∑im∗n,j(haoi)=∑i(m′(haoi)−mn(haoi))=m′(h)−mn(h)=m∗n,j(h)

Otherwise (assuming mn(h)>0),

∑im∗n,j(haoi) = ∑i (mn(haoi)/mn(h))·m∗n,j(h) = (mn(h)/mn(h))·m∗n,j(h) = m∗n,j(h)

Assuming mn(h)=0, ∑im∗n,j(haoi) = ∑i m∗n,j(h)/#o = m∗n,j(h)

Ok, so it's well-defined. Also, past j, it doesn't add any more negative measure. If it's negative on a length j history, it'll stay that negative forevermore and never go positive, so the b value we stuck on it is either a critical b value, or greater than that. In particular, this implies that our definition of b∗n,j can be reexpressed as: b∗n,j=max(b∗,−m∗−n,j(1))

We do need to show that m∗n,j+mn is an a-measure. For the h of length j or in FNF(πpa) and shorter, we can go: m∗n,j(h)+mn(h) = m′(h)−mn(h)+mn(h) = m′(h) ≥ 0, because M′ is an a-measure. This also means that mn+m∗n,j perfectly mimics m′ up till time j.

Now for the long histories. Assume mn(h)>0.

m∗n,j(hao)+mn(hao) = mn(hao) + (mn(hao)/mn(h))·m∗n,j(h) = mn(hao)·(1 + m∗n,j(h)/mn(h))

And then, by induction up we can assume m∗n,j(h)+mn(h)≥0, so m∗n,j(h)≥−mn(h), so m∗n,j(h)/mn(h)≥−1, so it's a multiplication of a nonnegative number and a nonnegative number, so we're good there.

Assume mn(h)=0. Then mn(hao)=0 and m∗n,j(hao)+mn(hao) = m∗n,j(h)/#o

And then, since m∗n,j(h)+mn(h)≥0 (induction up), and mn(h)=0, m∗n,j(h)≥0, and we get our result, showing that Mn+M∗n,j is an a-measure, and by upwards closure of B, all the Mn+M∗n,j lie in B.

There's one more fiddly thing to take care of. What we'll be doing is letting j increase as ρ(n), to get a function of 1 variable, and showing that Mn+M∗n,ρ(n) limits to M′. So we should think carefully about what we want out of ρ.

First, let m∗j− be the measure gotten by restricting m∗ to only histories which are in FNF(πpa) with a length of <j with negative measure, and histories where their length-j prefix has negative measure. This is kinda like a bounded way of slicing out areas with negative measure from m∗, falling short of the optimal decomposition m∗−.

Also, (m∗n,j)−(1), as n increases and j remains fixed, limits to m∗j−(1). The reason for this is that for histories of length j or shorter,

m∗n,j(h)−m∗(h)=(m′(h)−mn(h))−(m′(h)−m(h))=m(h)−mn(h)

and the end term limits to 0 because mn limits to m. So, past a sufficiently large n, m∗n,j comes *extremely* close to mimicking m∗ for the first j steps. So, dialing up n far enough, the negative-measure of m∗n,j comes really close to the negative measure mass in m∗ as evaluated up till time j, due to the aforementioned mimicry.

Further, m∗j−(1)≥m∗−(1), because only evaluating up till length j isn't as good at slicing out areas of negative measure as the optimal decomposition of m∗ into positive and negative components.

With all this, our rule for ρ(n) will be:

ρ(n) := sup{j≤n | ∀n′≥n: |m∗−n′,j(1) − m∗j−(1)| ≤ 2^(−j)}

ρ(n) never decreases. What this is basically doing is going "ok, I'll step up j, but *only* when there's a guarantee that I'll mimic m∗ up till timestep j (re: amount of negative measure) sufficiently closely forever afterward". ρ(n) eventually diverges to infinity, though it might do so very slowly, because for all the j, m∗−n,j(1) limits to m∗j−(1), so eventually we get to a large enough n that the defining condition is fulfilled and we can step up j.
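A finite-horizon sketch of this schedule (hypothetical encoding: `err[n_][j]` stands for |m∗−n′,j(1) − m∗j−(1)|, tabulated up to finite horizons in both indices, and converging to 0 in n_ for each fixed j):

```python
def rho(n, err):
    """Largest j <= n whose error is guaranteed <= 2^-j from n onward
    (0 if none), approximating the sup in the definition above."""
    horizon_j = len(err[0]) - 1
    best = 0
    for j in range(1, min(n, horizon_j) + 1):
        if all(err[n_][j] <= 2 ** (-j) for n_ in range(n, len(err))):
            best = j  # take the sup over all qualifying j <= n
    return best
```

With errors that shrink in n, the schedule steps j up only once the mimicry guarantee locks in, and it never steps back down.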

Now we can *finally* define our sequence as: M∗n,ρ(n)+Mn. We'll show that this limits to M′. First, it's always an a-measure, and always in B because it's in the upper completion of B which is upper-complete. For a fixed n, m∗n,ρ(n)+mn *always* flawlessly matches m′ up till time ρ(n). Since ρ(n) diverges to infinity, eventually we get a flawless match up till any finite time we name, so the measure components do converge to m′ as n limits to infinity.

But what about the b component? Well, the b component of the sum is: max(b∗,−m∗−n,ρ(n)(1))+bn. Let's bound it. Obviously, a lower bound is b∗+bn.

An upper bound is a bit more interesting.

max(b∗, −m∗−n,ρ(n)(1)) + bn ≤ max(b∗+2^(−ρ(n)), −m∗−n,ρ(n)(1)) + bn

By the way we defined ρ(n), we can go to −m∗ρ(n)−(1) with a small constant overhead, and then swap out −m∗ρ(n)−(1) (amount of negative measure up till time ρ(n)) for −m∗−(1) (total amount of negative measure) which is greater.

max(b∗+2^(−ρ(n)), −m∗−n,ρ(n)(1)) + bn ≤ max(b∗+2^(−ρ(n)), −m∗ρ(n)−(1)+2^(−ρ(n))) + bn

≤ max(b∗+2^(−ρ(n)), −m∗−(1)+2^(−ρ(n))) + bn ≤ max(b∗+2^(−ρ(n)), b∗+2^(−ρ(n))) + bn

= b∗ + 2^(−ρ(n)) + bn

(because b∗ must be equal to or exceed the amount of negative measure in m∗ for M∗ to be a legit sa-measure)

So, our bounds on the b term for M∗n,ρ(n)+Mn are b∗+bn on the low end, and b∗+2^(−ρ(n))+bn on the high end. As n limits to infinity, so does ρ(n), so that term vanishes, and bn limits to b, so our limiting b value is b∗+b=b′, and we're done. We built a sequence of a-measures in B limiting to M′, certifying that it's in ¯B, and M′ was arbitrary above some M∈¯B. Thus, the closure of an upper-complete set is upper-complete.

The next one is a story proof, because I couldn't figure out how to make it formal. It essentially says that given two points near each other, their nirvana-free upper-completions (the set of a-measures, if it was a set of sa-measures, it'd be immediate to show) are close to each other.

**Lemma 14:** *For stubs, in the nirvana-free setting, if *M* and *M′* are *ϵ* apart, then the Hausdorff-distance between *{M}uc∩Ma(FNF(πst))* and *{M′}uc∩Ma(FNF(πst))* is *ϵ* or less.*

Ok, I don't really know how to make this formal, so all I have is a story-proof. The KR-metric (what's the maximum difference between two measures w.r.t 1-Lipschitz bounded functions) is the same as (or at least within a constant of) the earthmover distance. The earthmover distance is "interpret your measure as piles (or pits) of dirt on various spots. It takes ϵ effort to move 1 unit of dirt ϵ distance. Also, ϵ effort lets you create or destroy ϵ units of dirt. What's the minimum amount of effort it takes to rearrange one pile of dirt into the other pile of dirt?". So our proof will be a story about moving dirt.
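To make the dirt-moving picture slightly more concrete, here's the standard one-dimensional special case (purely illustrative; the actual proof works over history spaces): for equal-mass piles on evenly spaced points, the minimum effort is the total mass carried past each gap.

```python
def earthmover_1d(pile_a, pile_b):
    """Minimum effort to rearrange pile_a into pile_b, for equal-mass piles
    sitting on points 0, 1, 2, ... (moving one unit of dirt one step costs
    one unit of effort)."""
    assert abs(sum(pile_a) - sum(pile_b)) < 1e-9, "equal total mass assumed"
    effort, carried = 0.0, 0.0
    for a, b in zip(pile_a, pile_b):
        carried += a - b        # net surplus that must cross to the right
        effort += abs(carried)  # carrying it one step costs its mass
    return effort
```

The general case adds creation/destruction of dirt at unit cost, which is what lets the b terms soak up mass mismatches in the story above.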

Let's just examine the measure components of M and M′. Since the earthmover distance is ϵ (it might be less because of different b), it takes ϵ effort to rearrange the dirt pile of m into the dirt pile of m′ in an optimal way. Let's say M+M∗ is an a-measure (no negative parts ie no dirt pits). We need some M′∗ to add to M′ to make an a-measure within ϵ of M+M∗.

The procedure to construct M′∗ is as follows: b′∗=b∗. For the measure component, start with the m∗ pile (there may be dirt pits ie areas of negative measure). Now, keep a close tab on the process of rearranging m into m′. One crumb of dirt at a time is moved, or dirt is created/destroyed. The rule is:

Let's say a crumb of dirt is moved from h to h′, created at h, or destroyed at h. If the pile-being-rearranged into m′ has more measure on h than the size of the pit (negative measure) on h for the pile-being-rearranged into m′∗, sit around and do nothing. If moving or destroying that crumb would make h have negative measure (the dirt pile on h for the pile-being-turned-into m′ would become smaller than the size of the hole on h for the pile-being-turned-into m′∗), then take the latter pile and move a crumb from h′ into the pit for h (at the same expenditure of effort), or create a crumb of dirt there instead. Once you're done, that's your m′∗. Keep the b value the same.

Now, this process does several things. First, very little effort was expended (ϵ or less), to reshuffle m∗ into m′∗ because you're either sitting around, or mimicking the same low-effort dirt moving process in reverse. Second, b∗ stays as a viable bound throughout, because whenever you move a crumb, you're dropping it into a pit, so an increase in negative measure at one spot is balanced out by a decrease in negative measure for the spot you moved the crumb to. Also, you never destroy crumbs, only create them. Also, in the whole process by which we rearrange m+m∗ into m′+m′∗, we always preserve the invariant that (pile on h + pit on h ≥0), so m′+m′∗ is a measure, not a signed measure.

For the final bit, we can imagine reshuffling m+m∗ into m′+m′∗ as a whole. Then, either a crumb is moved from point A to point B, or you move a crumb from point A to point B, and a crumb back from point B to point A so you can skip that step. Or, a crumb is created at point A, or a crumb is both destroyed and created at A, so you can skip that step. So, the dirt-moving procedure to turn m+m∗ into m′+m′∗ spends *as much or less* effort than the procedure to turn m into m′, which takes ϵ effort.

Putting it all together, we took an arbitrary point in the upper-completion of M, and it only takes ϵ or less effort to shift the b a little bit and reshuffle the measures to get a point in the upper-completion of M′.

The argument works in reverse, just switch the labels, to establish that the two upper-completions are ϵ apart or less.

For the next one, we have two ways of expressing uniform Hausdorff-continuity for a belief function. As a recap, Ma(∞) is the set of a-measures over all outcomes (regardless of whether or not they could have come from a single policy or not), and all belief functions have a critical λ⊙+b⊙ parameter that controls the λ and b values of the set of minimal points regardless of πpa. {≤⊙} is the set {(λμ,b)|λ+b≤λ⊙+b⊙}. They are:

1: For all nonzero ϵ, there exists a nonzero δ where d(πpa,π′pa)<δ implies (pr∞,πpa∗)−1(Θ(πpa)∩NF∩{≤⊙}) has a Hausdorff-distance of ϵ from the corresponding set for π′pa.

2: For all nonzero ϵ, there exists a nonzero δ where d(πpa,π′pa)<δ implies: If (λμ,b)∈(pr∞,πpa∗)−1(Θ(πpa)∩NF), then (λμ,b) has a distance of ϵ(λ+1) or less from the set (pr∞,π′pa∗)−1(Θ(π′pa)∩NF) (and symmetrically for the other set)

**Lemma 15:** *The two ways of expressing the Hausdorff-continuity requirement are equivalent for a belief function *Θ* or *Θω* obeying nirvana-free nonemptiness, closure, nirvana-free upper-completion, and bounded-minimals.*

Proof sketch: We start with the second λ-dependent distance condition and derive the first. Roughly, that {≤⊙} restriction means the tail where the λ values are high gets clipped so the two sets are within a constant of ϵ away from each other. In the other direction... Well, we start with a point M in one preimage and do a bunch of projecting points down and finding minimals and taking preimages and using earlier lemmas and our first distance condition, and eventually end up with a fancy diagram, and finish up with an argument that two points are close to each other, so M and one other point are "similarly close". This isn't good exposition, but I've got diagrams to keep a mental picture of the dozen different points and how they relate to each other in working memory.

Folk Result (from Vanessa): if two measures m and m′ are ϵ-distance apart in the KR metric, then if you extend m in some way, and extend m′ with the same conditional probabilities, then the two resulting measures remain ϵ-apart. We'll be using this in both directions.

Proof direction 1: Ok, we'll show the second way implies the first way, first. Fix some ϵ, and let the δ (distance between two partial policies) be low enough to guarantee that the distance parameter between the two preimages (according to definition 2, which has the λ-dependent distance guarantee) is (ϵ/(2(1+λ⊙)))·(1+λ). We can always do this. λ⊙ is fixed by our belief function.

Keep this image in mind while reading the following arguments. The upper left set is the preimage of Θ(πpa)∩NF, the upper right set is the preimage of Θ(π′pa)∩NF, and the bottom right set is Θ(π′pa)∩NF itself.

Now, any M in the preimage of Θ(πpa)∩NF∩{≤⊙} has a λ⊙ upper bound on its λ value because projection preserves λ. By the λ-dependent distance condition, it's within ϵ(1+λ⊙)/(2(1+λ⊙))=ϵ/2 of the preimage of Θ(π′pa)∩NF, so we can hop over that far and get a point M′.

Admittedly, moving over to the nearby point M′ may involve violating the λ⊙ bound by ϵ/2 or less, but if that happens, we can project our M′ point down to Θ(π′pa)∩NF making M′lo, find a point M′min (nirvana-free) in Θ(π′pa)∩NF∩{≤⊙} (bounded-minimals) below M′lo where M′min+M∗=M′lo by upper-completion for Θ(π′pa), and then consider the reexpression of M′lo as M′min+(m∗−,−m∗−(1))+(m∗+,b∗+m∗−(1))

The sum of the first two terms is a nirvana-free a-measure (because M′lo is an a-measure, adding on the negative component does nothing) that lies below M′lo and respects the λ⊙+b⊙ bound (exactly as much as it adds to b, it takes away from the measure). Then you can add in *most* of the third term, going *just* up to the bound, to get a point M′′lo only ϵ/2 (at most) away from M′lo, which respects the bound (so it lies in Θ(π′pa)∩NF∩{≤⊙}).

Now, you can complete M′′lo with the conditional probabilities of the measure part of M′, to make a point M′′ in the preimage of Θ(π′pa)∩NF∩{≤⊙} that's ϵ/2 or less distance away from M′.

Going from M to M′, and from M′ to M′′, is ϵ/2 distance each time, so we found a δ distance between any two partial policies πpa and π′pa that ensures the preimages of Θ(πpa)∩NF∩{≤⊙} and Θ(π′pa)∩NF∩{≤⊙} are only ϵ apart.

Proof Direction 2: Keep tabs on the following diagram to see how the 12 different points in 5 different sets relate to each other.

This δ gives us an n via n=logγ(δ), which is the first time the partial policies start disagreeing on what to do. The upper left and upper right sets are the preimages of Θ(πpa)∩NF and Θ(π′pa)∩NF respectively, and the middle-left and middle-right sets are the sets themselves. For the bottom set: take the inf of πpa and π′pa; it's another partial policy that's fully defined before time n, because those policies agree up till that time. Chop it off at time n to get a stub, and call this stub πlopa. The bottom set is Ma(FNF(πlopa)).
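To make the δ-to-n conversion concrete, here's a small sketch (assumed convention, matching the post's usage: the distance between two partial policies is γ^n, with n the first timestep where they disagree; the action strings and γ value are invented for illustration):

```python
import math

gamma = 0.5  # illustrative time-discount parameter, with γ < 1

def first_disagreement(pi1, pi2):
    """First timestep at which two (flattened) policies pick different actions."""
    for t, (a, b) in enumerate(zip(pi1, pi2)):
        if a != b:
            return t
    return min(len(pi1), len(pi2))

pi, pi_prime = "LLRRL", "LLRLL"       # hypothetical action sequences
n = first_disagreement(pi, pi_prime)  # they agree up until t = 3
delta = gamma ** n                    # d(π, π′) = γ^n

# Recovering n from δ, as the proof does: n = log_γ(δ).
assert abs(n - math.log(delta, gamma)) < 1e-9
```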

Now, follow the diagram in conjunction with the following proof. We start with an arbitrary M in the preimage of Θ(πpa)∩NF. We project it down to Θ(πpa)∩NF to make Mmid. Due to bounded-minimals, we can find a minimal point below it, M∗mid which obeys the λ⊙+b⊙ bound. Now, we go in two directions. One is projecting Mmid and M∗mid down to make Mlo and M∗lo, the latter of which lies below Mlo. Let's just keep those two in mind. In the other direction, since M∗mid obeys the λ⊙+b⊙ bound and lies in Θ(πpa)∩NF∩{≤⊙}, we can find a point M∗ in the preimage that obeys the λ⊙+b⊙ bound, and so, there's another point

M′∗ in (pr∞,π′pa∗)−1(Θ(π′pa)∩NF∩{≤⊙})

that's only ϵ/2 or less away, by our version of the Hausdorff condition that only works on the clipped version of the preimages. M′∗ projects down to Θ(π′pa)∩NF to make M′∗mid, and projects down further to make M′∗lo.

Now, projections preserve or contract distances; M∗lo is the projection of M∗, M′∗lo is the projection of M′∗, and M∗ and M′∗ are only ϵ/2 apart, so M∗lo and M′∗lo are only ϵ/2 apart, and Mlo lies above M∗lo. Now, we can invoke Lemma 14 to craft an M′lo that's above M′∗lo and within ϵ/2 of Mlo. Then, we can observe that Θ(π′pa)∩NF is nirvana-free and nirvana-free upper-complete. So, by Lemma 11, its projection down is nirvana-free and nirvana-free upper-complete. M′∗lo is the projection down of M′∗mid, and M′lo is above M′∗lo, so M′lo is in the projection of Θ(π′pa)∩NF, and we can craft a point M′mid∈Θ(π′pa)∩NF that projects down accordingly. And then go a level up to the preimage of Θ(π′pa)∩NF, and make a preimage point M′ by extending m′mid with the conditional probabilities of m up till time n whenever you get a chance, and then doing whatever; that'll be our M′ point of interest. The diagram sure came in handy, huh?

We still need to somehow argue that M and M′ are close to each other in a λ-dependent way (the λ of M). And the only tool we have is that Mlo and M′lo are within ϵ/2 of each other, and M and M′ project down onto them. So how do we do that? Well, notice that before time n, m′ and m are either in a part of the action-observation tree that πlopa has opinions on, where they're ϵ/2-apart, or m′ is copying the conditional probabilities of m. So, if we were to chop m and m′ off at timestep n, the two measures would be within ϵ/2 of each other.

However, after timestep n, things go to hell; they both end up diverging and doing their own thing.

Now, we can give the following dirt-reshuffling procedure to turn m into m′. You've got piles of dirt on each history, corresponding to the measure component of M. You can "coarse-grain" and imagine all your distinct and meticulous, but close-together, piles of dirt on histories with a prefix of h, where |h|=n, as just one big pile on h. So, you follow the optimal dirt-reshuffling procedure for turning m (clipped off at length n) into m′ (clipped off at length n), which takes ϵ/2 effort or less. Then, we un-coarse-grain and go "oh damn, we've gotta sort out all our little close-together piles now to make m′ exactly! We're not done yet!"

But we've got something special. When we're sorting out all our little close-together piles... said piles are the extensions of a finite history with length n. All those extensions will agree for the first n timesteps. And the distance between histories is γ^n where n is the first timestep they disagree, right? And further, n was logγ(δ), so whenever we move a bit of dirt somewhere else to rearrange all our close-together piles, we're only moving it δ distance! So, in the worst case of doing a complete rearrangement, we've gotta move our *whole* quantity of dirt δ distance, at a cost of δλ′ effort (λ′ being the total amount of measure for m′).

Let's try to bound this, shall we? Our first phase of dirt rearrangement (and adjusting the b values) took ϵ/2 effort or less, and our second phase took δλ′ effort or less. Now, we can observe two crucial facts. The first is that, at the outset, we insisted that δ was <ϵ/2. Our second crucial fact is that λ′ and λ can't be more than ϵ/2 apart, because projection preserves λ values, and M and M′ project down to Mlo and M′lo respectively, which are ϵ/2 or less apart. So, the total amount of measure they have can't differ by more than ϵ/2. This lets us get:

d(M,M′) ≤ ϵ/2+δλ′ ≤ ϵ/2+δ(ϵ/2+λ) < ϵ/2+(ϵ/2)(ϵ/2+λ) < ϵ/2+(ϵ/2)(1+2λ) = ϵ+ϵλ = ϵ(1+λ)

And so, given any ϵ, there's a δ where if d(πpa,π′pa)<δ, then for any point M in the preimage of Θ(πpa)∩NF, there's a point M′ in the preimage of Θ(π′pa)∩NF s.t. d(M,M′)<ϵ(1+λ), deriving our second formulation of Hausdorff-continuity from our first one. And we're done! Fortunately, the next one is easier.
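That chain of inequalities can be sanity-checked numerically (pure arithmetic over a grid of hypothetical parameter values, with δ<ϵ/2 and λ′ at most ϵ/2 above λ, as the proof requires, and ϵ<2):

```python
def bound_holds(eps, lam):
    """Check ε/2 + δλ' < ε(1 + λ) in the worst case allowed by the proof."""
    delta = 0.99 * (eps / 2)       # any δ strictly below ε/2
    lam_prime = lam + eps / 2      # λ' at most ε/2 above λ (projection preserves λ)
    return eps / 2 + delta * lam_prime < eps * (1 + lam)

assert all(bound_holds(e, l)
           for e in (0.01, 0.1, 0.5, 1.0)
           for l in (0.0, 0.5, 1.0, 10.0))
```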

**Lemma 16:** *If *Mn* limits to *M*, and *Mlon* are all below their corresponding *Mn* and obey a *λ⊙+b⊙* bound, then all limit points of *Mlon* lie below *M*. This works for a-surmeasures too.*

Proof sketch: We've got a λ⊙+b⊙ bound, so we can use the Compactness Lemma or Lemma 8 to get a convergent subsequence. Now, this is a special proof because we don't have to be as strict as we usually are about working only with a-measures and sa-measures only showing up as intermediate steps. What we do is take a limit point of the low sequence, and add some sa-measure to it that makes the resulting sa-measure close to M, so M is close to the upper completion of our limit point. We can make it arbitrarily close, and the upper completion of a single point is closed, so M actually does lie above our limit point and we're done. To do our distance fiddling argument in the full generality that works for sur-stuff, we do need to take a detour and show that for surmeasures, ds(x+y,z+y)≤ds(x,z).

Proof: The Mlon obey the λ⊙+b⊙ bound, so convergent subsequences exist by the compactness lemma or Lemma 8. Pick out a convergent subsequence to work in, giving you a limit point Mlo. All the Mn can be written as Mlon+M∗n.

We'll take a brief detour, and observe that if we're just dealing with sa-measures, then, since we're in a Banach space, d(x+y,z+y)=d(x,z). But what about the surmetric? Well, the surmetric is the max of the usual metric and γ raised to the power of "first time the measure components start disagreeing on what nirvana events are possible or impossible". Since sa-measures and sa-surmeasures can't assign negative probability to Nirvana, adding an sa-surmeasure adds *more* nirvana spots into both surmeasure components! In particular, they won't disagree more, and may disagree less, since adding that sa-surmeasure in may stick nirvana on a spot that they disagree on, so now they both agree that Nirvana happens there. So, since the standard distance component stays the same and the nirvana-sensitive component stays the same or gets closer, ds(x+y,z+y)≤ds(x,z). We'll be using this.
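Here's a toy numeric check of that monotonicity property (everything in it is invented for illustration: "surmeasures" are modeled as nonnegative mass vectors over indexed nirvana spots, total variation stands in for the usual metric, and a spot counts as nirvana-possible when its mass is positive):

```python
import random

gamma = 0.5

def tv(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def first_possibility_disagreement(a, b):
    """First spot where the vectors disagree on whether nirvana is possible."""
    for i, (p, q) in enumerate(zip(a, b)):
        if (p > 0) != (q > 0):
            return i
    return None

def ds(a, b):
    """Toy surmetric: max of the base metric and γ^(first disagreement)."""
    i = first_possibility_disagreement(a, b)
    return max(tv(a, b), 0.0 if i is None else gamma ** i)

def add(a, b):
    return [p + q for p, q in zip(a, b)]

# Adding the same nonnegative y can only add nirvana-possible spots to both
# sides, so disagreements can vanish but never appear: ds(x+y, z+y) <= ds(x, z).
random.seed(0)
for _ in range(1000):
    x, z, y = ([random.choice([0.0, random.random()]) for _ in range(6)]
               for _ in range(3))
    assert ds(add(x, y), add(z, y)) <= ds(x, z) + 1e-12
```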

Let n be large enough that d(Mn,M)<ϵ and d(Mlon,Mlo)<ϵ (same for the surmetric). Now, consider the point Mlo+M∗n. It is an sa-measure or sa-surmeasure that lies above Mlo, and we'll show that it's close to M. Whether we're working with the sa-measures or sa-surmeasures,

d(Mlo+M∗n,M)≤d(Mlo+M∗n,Mn)+d(Mn,M)<d(Mlo+M∗n,Mlon+M∗n)+ϵ

≤d(Mlo,Mlon)+ϵ<2ϵ

So, M is <2ϵ distance from the upper completion of Mlo in the space of sa-measures/sa-surmeasures, for all ϵ. Said upper completion is the sum of a closed set (cone of sa-measures/sa-surmeasures) and a compact set (a single point) so it's closed, so M (an a-measure/a-surmeasure) lies above Mlo (an a-measure/a-surmeasure that was an arbitrary limit point of the Mlon) and we're done.

The next three, Lemmas 17, 18, and 19, are used to set up the critical Lemma 20 which we use a lot.

**Lemma 17:** *The function *Π→Ma(∞)* of the form *π↦Θω(π)∩NF∩{≤⊙}* has closed graph assuming Hausdorff-continuity for *Θω* on policies, and that *Θω(π)* is closed for all *π*. Also works for a *Θ* that fulfills the stated properties.*

Let πn limit to π, and let Mn∈Θω(πn)∩NF∩{≤⊙} limit to M. We'll show that M∈Θω(π)∩NF∩{≤⊙} (the definition of closed graph). Take some really big n that guarantees that d(πn,π)<δ and d(Mn,M)<ϵ. Then we go:

The distance from M to Mn is ϵ or less, and since Mn∈Θω(πn), we can invoke uniform Hausdorff continuity and conclude Mn is only ϵ or less away from a point in Θω(π)∩NF∩{≤⊙}. So, the distance from M to Θω(π)∩NF∩{≤⊙} is ≤2ϵ. This argument works for all ϵ, so it's at distance 0 from Θω(π)∩NF∩{≤⊙}, and that set is closed because it's an intersection of closed sets, so M∈Θω(π)∩NF∩{≤⊙}, and we have closed-graph.

**Lemma 18: **⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙}) *is compact as long as *Θω(π)* is closed for all *π* and *Θω* fulfills the Hausdorff-continuity property on policies. Also works for a *Θ* that fulfills the stated properties.*

The set of π≥πst is closed in the topology on Π, because a limit of policies above πst will still be above πst. More specifically, because it's a closed subset of a compact space, it's compact. Also, remembering that projection preserves λ and b, we can consider the set {≤⊙} (for Ma(∞)) which is compact.

Take the product of those two compact sets to get a compact set in Π×Ma(∞), and intersect it with the graph of our function mapping π to Θω(π)∩NF∩{≤⊙} (which is closed by Lemma 17) to get a compact set. Project it down to the Ma(∞) coordinate (still compact, since projection to a coordinate is continuous), and everything in that will be safe to project down to Ma(FNF(πst)), getting you exactly the set ⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙}), which is compact because it's the image of a compact set through a continuous function.

**Lemma 19:**

¯¯¯¯¯¯¯¯c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF))=(c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙})))uc

*Where the upper completion is with respect to the cone of nirvana-free sa-measures and then we intersect with the set of nirvana-free a-measures, and *Θω(π)* is closed and nirvana-free upper-complete for all *π* and *Θω* fulfills the Hausdorff-continuity property on policies and the bounded-minimals property. Also works for a *Θ* that fulfills the stated properties.*

One direction of this,

¯¯¯¯¯¯¯¯c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF))⊇(c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙})))uc

is pretty easy. Everything in the convex hull of the clipped projections lies in the closed convex hull of the full projections, and then, from lemmas 11, 12, and 13, the closed convex hull of these projections is nirvana-free upper complete since Θω(π) is for all π, so that gets the points added by upper completion as well, establishing one subset direction.

Now for the other direction,

¯¯¯¯¯¯¯¯c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF))⊆(c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙})))uc

Let M lie in the closed convex hull. There's a sequence Mn that limits to it, where all the Mn are made by taking M∞i,n from above, projecting down and mixing. By bounded minimals, we can find some M∞,mini,n∈Θω(π)∩NF below the M∞i,n, and they're minimal points so they all obey the λ⊙+b⊙ bound. Now, project the M∞,mini,n down, and mix in the same way, to get an a-measure Mlon below Mn, which lies in the convex hull of clipped projections.

From Lemma 16, we can take a limit point of Mlon to get a Mlo below M. Now, we just have to show that Mlo lies in the convex hull set in order to get M by upper completion. Mlo is a limit of points from the convex hull set, so we just have to show that said convex hull set is closed. The thing we're taking the convex hull of is compact (Lemma 18), and in finite dimensions (because we're working in a stub), the convex hull of a compact set is compact. Thus, Mlo lies in the convex hull, and is below M, so M lies in the upper completion of the convex hull and we're done.

**Lemma 20: ** c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF)) *is closed, if *Θω(π)* is closed and nirvana-free upper-complete for all *π* and *Θω* fulfills the Hausdorff-continuity property on policies and the bounded-minimals property. Also works for a *Θ* that fulfills the stated properties.*

By Lemmas 11 and 12, said convex hull is nirvana-free upper-complete. Any point in the closure of the convex hull, by Lemma 19, lies above some finite mixture of nirvana-free stuff from above that respects the λ⊙+b⊙ bound, projected down. However, since the convex hull is upper-complete, our arbitrary point in the closure of the convex hull is captured by the convex hull alone.

**Lemma 21:** *If *Θ* is consistent and nirvana-free upper-complete for *Θ(π)*, and obeys the extreme point condition, and obeys the Hausdorff-continuity condition on policies, then* Θ(πpa)∩NF=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)∩NF)) *and* Θ(πst)∩NF=c.h(⋃π>πstprπ,πst∗(Θ(π)∩NF))*. This works in the sur-case too.*

Proof sketch: One subset direction is pretty dang easy from Consistency. The other subset direction for stubs (that any nirvana-free point in Θ(πst) lies in the convex hull of projections from above) is done by taking your point M of interest, finding a minimal point below it, using Lemma 3 to split your minimal points into finitely many minimal extreme points, and using the extreme point condition to view them as coming from policies above, so the minimal point has been captured by the convex hull, and then Lemmas 11 and 12 say that the convex hull of those projections is nirvana-free upper-complete, so our M is captured by the convex hull.

Getting it for partial policies is significantly more complex. We take our M and project it down into Θ(πnpa) for some very large n. Then, using our result for stubs, we can view our projected point Mn as a mix of nirvana-free stuff from policies above πnpa. If n is large enough, πnpa is very close to πpa itself, so we can perturb our points at the infinite level a little bit to be associated with policies above πpa with Hausdorff-Continuity, and then we can project down and mix, and show that this point (in the convex hull of projections of nirvana-free stuff from above) is close to M itself, getting a sequence that limits to M, witnessing that it's in the closed convex hull of projections of nirvana-free stuff from above. It's advised to diagram the partial policy argument; it's rather complicated.

Ok, so one direction is easy, Θ(πpa)∩NF⊇¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)∩NF)) (and likewise for stubs). Consistency implies that Θ(πpa) (or Θ(πst)) is the closed convex hull of projections down from above, so the closed (or vanilla) convex hulls of the projections of nirvana-free stuff from above are a subset of the nirvana-free part of Θ(πpa) (or Θ(πst)).

For the other direction... we'll show the stub form, which is easier, and build on that. We're shooting for Θ(πst)∩NF⊆c.h(⋃π>πstprπ,πst∗(Θ(π)∩NF))

Fix some M∈Θ(πst)∩NF. Find a minimal point Mmin below it, which must be nirvana-free, because you can't make Nirvana vanish by adding sa-measures. Invoke Lemma 3 to write Mmin as a finite mixture of minimal extreme points in the nirvana-free part of Θ(πst). These must be minimal and extreme and nirvana-free in Θ(πst), because you can't mix nirvana-containing points and get a nirvana-free point, nor can you add something to a nirvana-containing point without getting a nirvana-containing point. By the extreme point condition, there are nirvana-free points from above that project down to those extreme points. Mixing them back together witnesses that Mmin lies in the convex hull of projections of nirvana-free stuff from above. M is nirvana-free and lies above Mmin, so it's captured by the convex hull (with Lemmas 11 and 12).

Now for the other direction with partial policies, that

Θ(πpa)∩NF⊆¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)∩NF))

Fix some M∈Θ(πpa)∩NF. We can project M down to all the πnpa to get Mn which are nirvana-free and in Θ(πnpa) by Consistency.

Our task is to express M as a limit of some sequence of points that are finite mixtures of nirvana-free stuff projected from policies above πpa. Also, remember the link between "time of first difference n" and the δ distance between two partial policies: δn=γ^n where γ<1. Each δn induces an ϵn number for the purposes of Hausdorff-continuity.

First, πnpa is a stub which, as we have already established, has its nirvana-free part equal to the convex hull of projections of nirvana-free stuff down from above. So, Mn is made by taking finitely many M∞i,n∈Θ(πi)∩NF where πi≥πnpa, projecting down, and mixing. By linearity of projection, we can mix the M∞i,n before projecting down and hit the same point; call this mix M∞n.

Since the distance between πnpa and πpa is δn or less, each policy πi has another policy within δn that's above πpa, and by uniform Hausdorff-continuity (the variant from Lemma 15) we only have to perturb the M∞i,n by ϵn(1+λi,n) to get M′∞i,n in Θ(π′i)∩NF where π′i≥πpa for all i.

Mixing these in the same proportion makes an M′∞n within ϵn(1+λ) of M∞n, which projects down to Θ(πpa)∩NF (because mix-then-project is the same as projecting the M′∞i,n then mixing, and the convex hull of projections of stuff from above is a subset of Θ(πpa) by consistency). The projection of M′∞n we'll call M′n. It lies in the convex hull of the projections of nirvana-free stuff from above.

Now, to finish up, we just need to show that M′n limit to M, witnessing that M is in the closed convex hull of projections of nirvana-free stuff from above. Since M′n is the projection of M′∞n, which is ϵn(1+λ) away from M∞n, and projection doesn't increase distance, and the projection of M∞n is Mn, we can go

d(prπpa,πnpa∗(M′n),Mn)=d(pr∞,πnpa∗(M′∞n),pr∞,πnpa∗(M∞n))≤d(M′∞n,M∞n)<ϵn(1+λ)

So, we can conclude that, restricting to before time n, the measure components of M′n and M are fairly similar (ϵn(1+λ) distance), and so are the b components. Then some stuff happens after time n. Because our distance evaluation is done with Lipschitz functions, it really doesn't care much what happens at late times. So, in the n→∞ limit, the difference between the b terms vanishes, and the measure components agree increasingly well: they limit to perfect agreement for times before n, and since the other stuff happens at increasingly late times (n is diverging), the measure components converge.
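The "late differences are cheap" point can be made concrete with a small sketch (assumed conventions, invented for illustration: histories are strings, d(h,h′)=γ^(first disagreement), and the cost of a prefix-matched coupling upper-bounds the KR distance):

```python
import itertools

gamma, n = 0.5, 3

def hist_dist(h1, h2):
    """γ^(first index where the histories disagree); 0 if they agree."""
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            return gamma ** i
    return 0.0

# Two measures over length-5 histories: identical up to time n, then
# deterministic (and different) tails -- "doing their own thing" afterward.
prefixes = ["".join(p) for p in itertools.product("01", repeat=n)]
weight = {p: 1.0 / len(prefixes) for p in prefixes}
m = {p + "00": w for p, w in weight.items()}
m_prime = {p + "11": w for p, w in weight.items()}

# Couple histories sharing a prefix; this coupling's transport cost
# upper-bounds the KR distance between m and m'.
cost = sum(weight[p] * hist_dist(p + "00", p + "11") for p in prefixes)
assert cost <= gamma ** n + 1e-12  # disagreeing only after time n costs at most γ^n per unit mass
```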

So, we just built M as a limit of points from the convex hull of the projections of nirvana-free parts down from above, and we're done.

Alright, we're back. We've finally accumulated a large enough lemma toolkit to attack our major theorem, the Isomorphism theorem. Time for the next post!


### 19th-century progress studies

I wrote earlier about an 1857 plan for the transcontinental railroad (as yet unbuilt), and how it advocated funding the project through the sale of stock to a broad base of public investors. However, another aspect of the plan was even more striking: it explicitly advocated studying the history of progress, in order to judge what was possible and desirable for the railroad. It was like a 19th-century form of progress studies.

The author of the plan, Theodore Judah, first addresses the question of how fast the trains will travel from St. Louis to San Francisco: three days, he estimates. But then he points out that this estimate is only based on the speeds attainable by “our present class of engines.” The next paragraphs exemplify the 19th-century spirit of progress:

But—

Is there to be no improvement in our present class of engines? have we reached a point in the stage of progress where we must stop, beyond which we cannot go? Are we willing to admit that fifty miles per hour is the limit to speed? Are we contented, and do we desire to go no faster?

No—

However well we may be satisfied with the present rate of speed in traveling, we dare not admit the principle—we wish to go as fast as we can. Improvements are progressive and the future is before us. No, we have not arrived at the limit, at a final stopping place; we are only at a station, a way station—we have paused, but not to remain. We do not travel fast enough, nor will we, until a speed of one hundred miles per hour is attained with as much ease, and as little risk, as at present.

In defense of this idea, he explicitly appeals to history:

Does the idea seem preposterous? Is it foolish, visionary? Is it absurd?

Let us inquire into the matter a little: **let us extract a few notes from the history of progress.**

What follows is a long passage about the history of roads and horse travel in England starting in the 16th century. He points out that horseback riding was slow, uncomfortable, and dangerous. He describes the establishment of the English post office in 1660, which greatly sped up mail delivery; and the development of stage coach service, which was an improvement on riding horseback.

He also points out that advances in travel were often unappreciated, even opposed. He quotes several paragraphs from a 1673 pamphlet decrying the stage coaches and pining for the old days of horseback. The pamphlet claimed that horseback was a healthier mode of travel, and advocated restrictions on coach service. He tells the story of an early railroad promoter, Thomas Gray, who was called a fool in the 1820s and who failed to raise funding for his proposed system of railways. He quotes authors as late as 1829, the year that locomotives were demonstrated to be practical in the famous Rainhill trials, writing of “the ridiculous expectations… that we shall see engines traveling at the rate of twelve, sixteen, eighteen or twenty miles per hour” and calling grand plans for railroad networks “visionary schemes unworthy of notice.” He concludes:

Such opinions thus expressed by authorities of such eminence, in opposition to what is now an every day reality, may well induce the most intelligent and far sighted to hesitate in making dogmatical assertions as to what may or may not be the revelations of the future.

Judah was writing less than thirty years into the great age of railroads, but already he spoke of what had been accomplished so far in glowing terms:

… those short twenty-six years are a living monument in the progress of time more grand, lofty and noble than the proudest pyramid which the world has yet gazed upon.

Twenty-five thousand miles of railroad have been built in this country or an average of one thousand miles per year.

Where is the man who can sum up the grand, mighty benefits which have, in consequence, accrued to mankind. If the man could be found with a mind vast enough to comprehend and with talent sufficient to compass them, he could write a tale in comparison with which the mightiest achievements of the collective world would sink into utter insignificance. No one appreciates the innumerable blessings which have flowed in consequence, for the story has never been told; it is not understood.

(Already, in 1857, he is saying that progress is not understood or appreciated!)

But it is one of the first steps of the newly awakened young giant, Progress, and shall we measure his glorious march by a few strides? No: he may pause to rest, but it is to recruit his powers for new conquests, and among them some of us will yet see the realization of our preposterous, absurd idea, viz, “traveling by railroad at the rate of 100 miles per hour, with the same safety as present,” is not near so startling or absurd a proposition, in this age, as was that of 20 miles per hour only thirty years ago.

Incidentally, many trains in Europe and Asia today travel at well over 100 miles per hour.

It is amazing to think of Judah’s tremendous confidence in progress, given how much of it was yet to come. In 1857, electric light and power were still decades away. The Bessemer process for making cheap steel had not yet been introduced on a large scale. Synthetic fertilizer had not yet been invented. The germ theory had not yet been discovered, and mortality from infectious disease was still staggering by today’s standards. Telephone was still a dream; radio and television a fantasy. And of course, the internal combustion engine, and its twin children the automobile and airplane, were well in the future.

And yet, here is Judah, already confident in the march of capital-P Progress, and pointing to history to justify his vision and to dismiss skeptics.


### What is the interpretation of the do() operator?

To reductively explain causality, it has to be explained in non-causal terms, most likely in terms of total probability distributions. Pearl explains causality in terms of causal graphs, which are created by conditionalizing the probability distribution on not
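A minimal simulation sketch of the distinction the question is about (the structural model, probabilities, and variable names here are invented for illustration, not taken from the post): conditioning on X filters samples and so inherits the confounder's influence on X, while do(X) overrides X's mechanism, severing the Z→X edge.

```python
import random

random.seed(0)

def sample(do_x=None):
    """Toy structural causal model: Z -> X, Z -> Y, X -> Y."""
    z = random.random() < 0.5
    if do_x is None:
        x = random.random() < (0.9 if z else 0.1)  # Z confounds X
    else:
        x = do_x                                   # intervention: ignore Z
    y = random.random() < 0.4 + 0.4 * z + 0.2 * x
    return z, x, y

N = 100_000
obs = [sample() for _ in range(N)]
p_y_cond = sum(y for _, x, y in obs if x) / sum(x for _, x, _ in obs)

intv = [sample(do_x=True) for _ in range(N)]
p_y_do = sum(y for _, _, y in intv) / N

# Conditioning overstates X's effect, because X = 1 mostly co-occurs with Z = 1.
assert p_y_cond > p_y_do + 0.05
```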
@font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')}
@font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')}
@font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold}
@font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')}
@font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic}
@font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')}
@font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')}
@font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')}
@font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')}
@font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold}
@font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')}
@font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic}
@font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')}
@font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')}
@font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic}
@font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')}
@font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')}
@font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')}
@font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')}
@font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')}
@font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')}
@font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold}
@font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}
event, but do(event). What does this mean? It's easy enough to explain *in causal terms*: You make it so event occurs without changing any of its causal antecedents. But of course that fails to explain causality. How could it be explained without that?

Discuss

### WebBundles and URL Randomization

*Disclosure: I work on ads at Google; this is a personal post.*

Peter Snyder of Brave recently wrote a post on the proposed WebBundle specification arguing that it makes URL randomization easier and so is harmful to content blocking. The core of the post is incorrect, however, based on a misunderstanding of what can be done today and what WebBundles make easier. Snyder and I discussed this some on HN, and I wanted to take a step back and try to write up the issues more clearly.

A WebBundle (see explainer) allows you to serve a single file that the browser can treat as multiple files. This solves many long-standing problems, some of which I've been working around for a long time. A classic one is that many sites will have a large number of small files. Everyone who visits the site will need essentially all of these files, and you lose a lot of performance in individually requesting each one. When I worked on open source web page optimization software (mod_pagespeed) it could combine CSS, JS, and images so you could request a single 'bundle' for each, but it was not ideal:

- Instead of one bundle containing the resources the site needs, you have at least three: CSS, JS, images.

- There are differences in error handling and recovery that mean you can't just concatenate CSS or JS files. Even with code to work around these issues, there were still differences that meant automatic bundling was a bit risky.

- Image combining (spriting) is especially difficult, and our tool was less a fully automatic image combiner than a tool to take the toil out of doing it manually.

Instead of using a dynamic tool like mod_pagespeed, you can also do this kind of concatenation at build time (webpack, browserify, parcel, rollup, etc.), but this doesn't fix any of these problems.

A WebBundle is a much more straightforward approach: you tell the browser explicitly about your bundling, instead of trying to glue a bunch of things together and pass them off to the browser as one.

I'm no longer working on mod_pagespeed, but these issues still come up; in my work now I'm interested in using WebBundles to allow a single ad response to provide multiple ads.

My understanding of Snyder's view, from their post, HN comments, and spec comments, is that bundles make it much easier to randomize URLs, and bypass URL-based content blockers. Specifically, that if you deliver a single bundle with both HTML and resources, it becomes very easy for you to randomize URLs.

This claim is based, however, on a misunderstanding of what bundling makes easier. Sites that deliver both the content and the ads, such as Facebook, often already use this kind of randomization to make things harder for ad blockers. Since they're delivering the HTML, they can randomize all their URLs on every deploy, and it's hard for blockers to keep up. If blockers did catch up, they could use cookies to randomize their URLs on a per-user basis, or even encrypt them before sending to the client. All of these approaches work much better for them than sending a bundle with the HTML and JS in a single file, because that would mean the JS could not be cached on the client. (If they were willing to give up on caching they could already use HTTP/2's server push or plain inlining to send the JS and HTML at once.)
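To make that existing dynamic concrete, here is a toy sketch of per-user URL randomization with a keyed hash, something a first party that serves its own HTML can already do today, without bundles. All names here (the secret, the `/assets/` path) are hypothetical, not any site's actual scheme:

```python
import hashlib
import hmac

SECRET = b"rotated-per-deploy-key"  # hypothetical server-side secret, rotated each deploy

def randomized_url(user_id: str, resource: str) -> str:
    """Derive a per-user, per-deploy resource URL that a static
    blocklist cannot predict, while staying stable enough to cache."""
    digest = hmac.new(SECRET, f"{user_id}:{resource}".encode(), hashlib.sha256)
    return f"/assets/{digest.hexdigest()[:16]}.js"

# The same user keeps hitting the same (cacheable) URL...
assert randomized_url("alice", "ads.js") == randomized_url("alice", "ads.js")
# ...but different users see different URLs, so one shared filter rule can't cover them all.
assert randomized_url("alice", "ads.js") != randomized_url("bob", "ads.js")
```

The point is that this trick requires controlling the HTML that references the URL, which is exactly what cross-party ad integrations lack.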

On the other hand, sites like this are few. It is worth it for Facebook or Google Search to run their own ads, but most sites instead use an ad network. These networks typically integrate client-side, with JS. For example, to put AdSense ads on your page you include something like:

```html
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
     style="display:block"
     data-ad-client="ca-pub-NNNNNNNN"
     data-ad-slot="NNNNNNNN"
     data-ad-format="auto"
     data-full-width-responsive="true">
</ins>
```

This loads adsbygoogle.js (the ad JS), which reads configuration from the <ins> and handles requesting and rendering the ads. Ad blockers recognize adsbygoogle.js by its URL and prevent it from loading and running.
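That URL-based recognition can be sketched roughly like this. This is a toy illustration only; real blockers use large curated filter lists (such as EasyList) with a much richer rule syntax:

```python
import re

# Toy stand-in for a blocker's filter list; real lists have tens of
# thousands of rules and dedicated matching syntax.
BLOCK_PATTERNS = [
    re.compile(r"pagead2\.googlesyndication\.com/pagead/js/adsbygoogle\.js"),
    re.compile(r"/ads/.*\.js$"),
]

def should_block(url: str) -> bool:
    """Return True if any filter pattern matches the requested URL."""
    return any(p.search(url) for p in BLOCK_PATTERNS)

assert should_block("https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js")
assert not should_block("https://example.com/site.css")
```

A consistent integration-point URL is what makes this matching possible, which is the crux of the argument below.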

Neither the publisher site nor the ad network can randomize the URL of the ad JS on their own: it's a cross-party integration point. If they worked closely together, perhaps by moving integration from client-side to server-side, then they could randomize URLs easily even with today's technology. A server-side integration is much more technically difficult, however, which is why I think we rarely see them. Bundles don't change the nature of this situation: a client-side integration still needs a consistent (and, hence, blockable) URL to load the ad JS, while a server-side integration doesn't become any easier.

Snyder is concerned about a world in which ad blockers aren't able to operate because they can't recognize ad resources by URL. While I think this is a reasonable concern, the WebBundle proposal is orthogonal to this problem, and does not bring us any closer to that sort of world.

*Comment via: facebook*

Discuss

### [AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents


**Newsletter #114**

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

The Alignment Problem for Bayesian History-Based Reinforcement Learners *(Tom Everitt et al)* (summarized by Rohin): After forgetting its existence for quite a while, I've finally read through this technical report (which won first place in round 2 of the AI alignment prize (AN #3)). It analyzes the alignment problem from an AIXI-like perspective, that is, by theoretical analysis of powerful Bayesian RL agents in an online POMDP setting.

In this setup, we have a POMDP environment, in which the environment has some underlying state, but the agent only gets observations of the state and must take actions in order to maximize rewards. The authors consider three main setups: 1) rewards are computed by a preprogrammed reward function, 2) rewards are provided by a human in the loop, and 3) rewards are provided by a *reward predictor* which is trained interactively from human-generated data.

For each setup, they consider the various objects present in the formalism, and ask how these objects could be corrupted, misspecified, or misleading. This methodology allows them to identify several potential issues, which I won't get into as I expect most readers are familiar with them. (Examples include wireheading and threatening to harm the human unless they provide maximal reward.)

They also propose several tools that can be used to help solve misalignment. In order to prevent reward function corruption, we can have the agent *simulate* the future trajectory, and *evaluate* this future trajectory with the current reward, removing the incentive to corrupt the reward function. (This was later developed into current-RF optimization (AN #71).)
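A minimal sketch of the current-RF idea, with a toy world model and toy rewards standing in for the paper's formalism: candidate plans are scored by simulating their trajectories and evaluating them with the *current* reward function, so tampering with the future reward function earns the agent nothing:

```python
# Toy stand-ins: a "plan" is simply its trajectory of states, and the
# rewards below are illustrative, not from the paper.

def simulate(plan):
    """Toy world model: the plan already is its state trajectory."""
    return plan

def current_reward(state):
    """The agent's current, uncorrupted reward function."""
    return {"work": 1.0, "corrupt_reward_fn": 0.0}.get(state, 0.0)

def plan_value(plan, reward_fn):
    """Score a plan by evaluating its simulated trajectory with a
    fixed reward function, ignoring any future reward changes."""
    return sum(reward_fn(s) for s in simulate(plan))

plans = [["work", "work"], ["corrupt_reward_fn", "work"]]
best = max(plans, key=lambda p: plan_value(p, current_reward))
assert best == ["work", "work"]  # corrupting the reward function is never rewarded
```

Because the corrupted reward function is never consulted, the corruption step is wasted effort from the planner's perspective.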

Self-corruption awareness refers to whether or not the agent is aware that its policy can be modified. A self-corruption *unaware* agent is one that behaves as though its current policy function will never be changed, effectively ignoring the possibility of corruption. It is not clear whether awareness or unawareness is more desirable: while a self-corruption unaware agent will be more corrigible (in the MIRI sense), it also will not preserve its utility function, as it believes that even if the utility function changes the policy will not change.

Action-observation grounding ensures that the agent only optimizes over policies that work on histories of observations and actions, preventing agents from constructing entirely new observation channels ("delusion boxes") which mislead the reward function into thinking everything is perfect.

The interactive setting in which a reward predictor is trained based on human feedback offers a new challenge: that the human data can be corrupted or manipulated. One technique to address this is to get *decoupled* data: if your corruption is determined by the current state s, but you get feedback about some different state s', as long as s and s' aren't too correlated it is possible to mitigate potential corruptions.

Another leverage point is how we decide to use the reward predictor. We could consider the *stationary* reward function, which evaluates simulated trajectories with the *current* reward predictor, i.e. assuming that the reward predictor will never be updated again. If we combine this with self-corruption unawareness (so that the agent also never expects its policy to change), then the incentive to corrupt the reward predictor's data is removed. However, the resulting agent is *time-inconsistent*: it acts as though its reward never changes even though in practice it does, and so it can make a plan and start executing it, only to switch over to a new plan once the reward changes, over and over again.

The *dynamic* reward function avoids this pitfall by evaluating the kth timestep of a simulated trajectory while also taking an expectation over the future data that the reward predictor will get. This agent is no longer time-inconsistent, but it is now incentivized to manipulate the data. This can be fixed by building a single integrated Bayesian agent, which maintains a single model that predicts both the reward data and the environment. The resulting agent is time-consistent, utility-preserving, and has no direct incentive to manipulate the data. (This is akin to the setup in assistance games / CIRL (AN #69).)

One final approach is to use a *counterfactual* reward function, in which the data is simulated in a counterfactual world where the agent executed some known safe default policy. This no longer depends on the current time, and is not subject to data corruption since the data comes from a hypothetical that is independent of the agent's actual policy. However, it requires a good default policy that does the necessary information-gathering actions, and requires the agent to have the ability to simulate human feedback in a counterfactual world.

**Read more:** Tom Everitt's PhD thesis

**Rohin's opinion:** This paper is a great organization and explanation of several older papers (that haven't been summarized in this newsletter because they were published before 2018 and I read them before starting this newsletter), and I wish I had read it sooner. It seems to me that the integrated Bayesian agent is the clear winner -- the only downside is the computational cost, which would be a bottleneck for any of the models considered here.

One worry I have with this sort of analysis is that the guarantees you get out of it depend quite a lot on how you model the situation. For example, let's suppose that after I sleep I wake up refreshed and more capable of intellectual work. Should I model this as "policy corruption", or as a fixed policy that takes as an input some information about how rested I am?

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

Universality Unwrapped *(Adam Shimi)* (summarized by Rohin): This post explains the ideas behind universality and ascription universality, in a more accessible way than the original posts and with more detail than my summary.

Mesa-Search vs Mesa-Control *(Abram Demski)* (summarized by Rohin): This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, it seems to me that this perspective is common so it seems fairly likely that the typical reader would find the post useful.

Radical Probabilism *(Abram Demski)* (summarized by Rohin): The traditional Bayesian treatment of rational agents assumes that the only way an agent can get new information is by getting some new observation that is known with probability 1. However, we would like a theory of rationality that can allow for agents that also get more information by thinking longer. In such a situation, some of the constraints imposed by traditional Bayesian reasoning no longer apply. This detailed post explores what constraints remain, and what types of updating are allowable under this more permissive definition of rationality.

**Read more:** The Bayesian Tyrant

**Rohin's opinion:** I particularly enjoyed this post; it felt like the best explanation in relatively simple terms of a theory of rationality that is more suited to bounded agents that cannot perfectly reason about an environment larger than they are. (Note “simple” really is relative; the post still assumes a lot of technical knowledge about traditional Bayesianism.)

My AI Timelines Have Sped Up *(Alex Irpan)* (summarized by Nicholas): Alex Irpan updates his predictions of when AGI will arrive to be sooner:

10% chance by 2035 (previously 2045)

50% chance by 2045 (previously 2050)

90% chance by 2070

The main reasons why are:

- Alex is now more uncertain because the pace of research over the past five years has been more surprising than expected: faster in some domains, but slower in others.

- Accounting for improvements in tooling. New libraries like TensorFlow and PyTorch have accelerated progress. Even CNNs can be used as a “tool” that provides features for downstream tasks like robotic control.

- He previously thought that labeled data might be a bottleneck, based on scaling laws showing that data needs might increase faster than compute; however, semi- and unsupervised learning have improved significantly, GPT-3 being the latest example of this.

- Alex now believes that compute will play a larger role and that compute can scale faster than algorithms because there is large worldwide consumer demand.

The post ends with a hypothetical description of how AGI may happen soon that I will leave out of the summary but recommend reading.

**Nicholas's opinion:** My personal opinion on timelines is that I think it is much more informative to draw out the full CDF/PDF of when we will get to AGI instead of percentages by different years. It isn’t included in the post, but you can find Alex’s here. I end up placing higher likelihood on AGI happening sooner than Alex does, but I largely agree with his reasoning.

More uncertainty than the original prediction seems warranted to me; the original prediction had a very high likelihood of AGI between 2045-2050 that I didn’t understand. Of the rest of the arguments, I agree most strongly with the section on tooling providing a speedup. I’d even push the point farther to say that there are many inputs into current ML systems, and all of them seem to be improving at a rapid clip. Hardware, software tools, data, and the number of ML researchers all seem to be on track to improve significantly over the next decade.

MISCELLANEOUS (ALIGNMENT)

The Problem with Metrics is a Fundamental Problem for AI *(Rachel Thomas et al)* (summarized by Flo): The blog post lists five problems of current AI that are exacerbated by the cheap cost and easy scaling of AI systems combined with the common belief that algorithms are objective and error-free:

1. It is often hard for affected people to address problems in algorithmic decisions

2. The complexity of AI problems can easily lead to a diffusion of responsibility

3. AI can encode biases and sometimes magnify them via feedback loops

4. Big tech companies lack accountability

5. Current AI systems usually focus exclusively on optimizing metrics.

The paper then dives deeper into the last point. The authors review a series of case studies and draw four conclusions:

1. Measured metrics are usually only a proxy for what we really care about: Youtube's terminal goal is certainly not to maximize viewing time, and society does not inherently care about student test scores.

2. Metrics can and will be gamed: Soviet workers would often achieve their production targets at the cost of some unmeasured aspects of performance, reported waiting times in the English healthcare system were distorted once targets were set for them, and evaluating teachers by test scores has led to cheating scandals in the US.

3. Metrics tend to overemphasise short-term concerns, as these are often easier to measure. This can be seen in businesses like Facebook and Wells Fargo that have faced political backlash, worse access to talent pools, or lawsuits because of an excessive focus on click-through rates and quarterly earnings.

4. Tech firms often focus on metrics associated with addictive environments. For example, "engagement" metrics are used as proxies for user preferences but rarely reflect them accurately in contexts that were optimized for these metrics.

The authors then propose three remedies: using multiple metrics to get a more holistic picture and make gaming harder, combining metrics with qualitative accounts, and involving domain experts and stakeholders who would be personally affected by the deployed system.

**Read more:** I’m an AI researcher, and here’s what scares me about AI

**Flo's opinion:** I found this interesting to read, as it does not really seem to be written from the perspective of AI safety but still lists some problems that are related to AI safety and governance. Just think of an AI system tasked to help with realizing human preferences magnifying "biases" in its preference elicitation via unwanted feedback loops, or about the lack of firms' accountability for socioeconomic disturbances their AI systems could create, which the windfall clause (AN #88) was envisioned to mitigate.

REINFORCEMENT LEARNING

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey *(Sanmit Narvekar et al)* (summarized by Zach): For a variety of learning problems, the training process is organized so that new concepts and tasks leverage previously learned information. This can serve as a broad definition of curriculum learning. This paper gives an overview of curriculum learning and a framework to organize various approaches to the curriculum learning problem. One central difficulty is that there is a broad class of methods that can be considered curricula. At one extreme, we have curricula where new tasks are created to speed up learning. At the other extreme, some curricula simply reorder experience samples; the prioritized replay buffer is one such reordering method. Thus, to cover as much of the literature as possible, the authors outline a framework for curriculum learning and then use that structure to classify various approaches. In general, the definition, learning, construction, and evaluation of curricula are all covered in this work. This is done by breaking the curriculum learning problem into three steps: task generation, sequencing, and transfer learning. Using this problem decomposition, the authors give an overview of work addressing each component.
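As a toy illustration of the sequencing step only (the stub learner below is invented for this sketch and is not any method from the survey): train on each task until a success threshold is met, carrying accumulated skill forward to the next task:

```python
import random

random.seed(0)

def train_episode(task_difficulty, skill):
    """Stub learner: succeeds more often as skill exceeds difficulty."""
    return random.random() < min(0.95, 0.5 + (skill - task_difficulty) * 0.1)

def run_curriculum(difficulties, threshold=0.8, window=50, max_episodes=200_000):
    """Simple threshold-based sequencing: stay on a task until the
    cumulative success rate clears the threshold, then advance."""
    skill = 0.0
    for d in difficulties:
        successes = 0
        for episodes in range(1, max_episodes + 1):
            if train_episode(d, skill):
                successes += 1
                skill += 0.01  # crude stand-in for a learning update
            if episodes >= window and successes / episodes >= threshold:
                break
    return skill

final_skill = run_curriculum([1.0, 2.0, 3.0])
assert final_skill > 0.0  # skill accumulated across the task sequence
```

Real curricula replace the stub with an actual RL training loop and much smarter sequencing; the survey's taxonomy covers how those choices are made.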

**Zach's opinion:** Before I read this, I thought of curricula as 'hacks' used to improve training. However, the authors' presentation of connections with transfer learning and experience replay has significantly changed my opinion. In particular, the phrasing of curriculum learning as a kind of 'meta-MDP' seems particularly interesting to me. Moreover, there seem to be interesting challenges in this field. One such challenge is that there does not seem to be much theory about *why* curricula work, which could indicate a point of departure for people interested in safety research. Knowing more about the theory could help answer safety questions. For example, how do we design curricula so that we can guarantee/check the agent is behaving correctly at each step?

Looking for adversarial collaborators to test our Debate protocol *(Beth Barnes)* (summarized by Rohin): OpenAI is looking for people to help test their debate (AN #86) protocol, to find weaknesses that allow a dishonest strategy to win such debates.

**Podcast:** An audio podcast version of the **Alignment Newsletter** is available, recorded by Robert Miles.


### Excusing a Failure to Adjust

Here is a handle for a mistake I make all the time that works really well for me… that is, the *handle* works really well for me. I make this mistake fairly often, but of course I notice it in other people more often.

Imagine that your credence in a belief is represented by a physical slider. When you learn a new piece of evidence or argument or whatever, there are two kinds of mental motions you can react with. You can adjust the slider up or down, or you can explain why you don’t have to adjust the slider at all.

Sometimes the right thing to do really is to not adjust the slider. Sometimes people will think that you should adjust the slider when really you shouldn’t for reasons that are difficult to explain. But this is far far rarer than cases where you should adjust the slider, but instead you focus on coming up with reasons to not adjust the slider. On some days, it seems like 90% of human cognition is coming up with reasons to not adjust the slider.

And of course, for beliefs that you like, you will come up with excuses for refusing to adjust the slider down way more often than you will come up with excuses for refusing to adjust the slider up. And vice versa for beliefs that you dislike.

Probably, you would be better off banning that mental motion altogether. Nearly every new piece of evidence should cause some adjustment, even if it is very small. Most conversations should be about how much to adjust, not about whether to adjust. But banning that mental motion is difficult, and before you can, you must learn to notice the motion, and learn how it feels different from other motions.
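For what it's worth, the slider metaphor maps neatly onto Bayesian updating in odds form (my gloss, not the author's): any evidence whose likelihood ratio differs from 1 moves the slider at least a little, and only a likelihood ratio of exactly 1 licenses leaving it untouched.

```python
def update_credence(prior, likelihood_ratio):
    """Bayesian update in odds form: posterior odds = prior odds * likelihood ratio,
    where likelihood_ratio = P(evidence | belief) / P(evidence | not belief)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Even weak evidence (likelihood ratio 1.2) nudges a 50% slider up to ~54.5%.
nudged = update_credence(0.50, 1.2)

# Only a likelihood ratio of exactly 1 leaves the slider where it was.
unchanged = update_credence(0.50, 1.0)
```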

This is the best tool I have found for noticing so far. I ask myself: “Am I trying to explain why I don’t need to adjust the slider?” Something different might work better for you, but I highly recommend at least trying to notice the difference.


### nostalgebraist: Recursive Goodhart's Law

Would kind of like to excerpt the whole post, but that feels impolite, so I'll just quote the first four paragraphs and then suggest reading the whole thing:

There’s this funny thing about Goodhart’s Law, where it’s easy to say “being affected by Goodhart’s Law is bad” and “it’s better to behave in ways that aren’t as subject to Goodhart’s Law,” but it can be very *hard* to explain why these things are true to someone who doesn’t already agree.

Why? Because any such explanation is going to involve some step where you say, “see, if you do that, *the results are worse*.” But this requires some standard by which we can judge results … and any such standard, when examined closely enough, has Goodhart problems of its own.

There are times when you can’t convince someone without a formal example or something that amounts to one, something where you can say “see, Alice’s cautious heuristic strategy wins her $X while Bob’s strategy of computing the global optimum under his world model only wins him the smaller $Y, which is objectively worse!”

But if you’ve gotten to this point, you’ve conceded that there’s *some* function whose global optimum is the one true target. It’s hard to talk about Goodhart at all without something like this in the background – how can “the metric fails to capture the true target” be the problem unless there is some true target?
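To make the quoted point concrete, here is a toy example of my own (not from the post): a proxy metric that only counts raw effort, while the true value has diminishing and eventually negative returns. Note that judging the proxy-optimizer "worse" requires access to the true-value function itself, which is exactly the regress nostalgebraist is pointing at.

```python
def true_value(x):
    """The 'one true target': returns diminish and eventually go negative."""
    return x - x ** 2 / 10.0

def proxy_metric(x):
    """The measurable proxy: it only counts raw effort."""
    return x

candidates = [i * 0.5 for i in range(41)]          # x in [0, 20]
best_by_proxy = max(candidates, key=proxy_metric)  # picks x = 20.0
best_by_value = max(candidates, key=true_value)    # picks x = 5.0

# true_value(20.0) is -20.0 while true_value(5.0) is 2.5, so maximizing the
# proxy is "objectively worse". But that verdict only exists relative to
# true_value, the very standard Goodhart says we cannot fully specify.
```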


### Predictions for Neuralink Working Device

Elon Musk just announced a "live webcast of a working Neuralink device" Friday at 3PM PT. In the same vein as last year's Neuralink prediction post, feel free to specify what is likely/unlikely to be revealed, giving your credence. We will likely do a "scoring of the predictions" after the event (like in last year's post).


### On Suddenly Not Being Able to Work

Cross-posted, as always, from Putanumonit.

Since I started Putanumonit five years ago I’ve rarely ever gone two weeks without posting anything. I’ve now gone a full month.

It’s not that I have nothing to write about. I have several half-finished drafts, Twitter threads that can be expanded into posts, marked-up books for review, and more. This post is none of the above; it’s about the reason why I have not been writing. And chiefly, it’s a way to just get back to writing by any means possible, to make this a habit again.

If this sounds like a boring post that you don’t want to read, you are probably correct. You can instead watch my video interview with Chase Harris about COVID, Rationality, social reality, and more.

The reason I haven’t blogged in a month is that I haven’t been able to do my job. My work laptop is plugged in, my colleagues are online, my deadlines are approaching fast, and yet I have consistently failed to put in the required (and not very large) number of hours it would take me to catch up on my work. I am not sure why this has been happening, and I have observed this happening as if from the outside with some bemusement.

My brain keeps “work” and “blogging” very near each other on the mental shelf. Both are obligations of a sort. Both are done on the same screen with the same keyboard sitting on the same chair. And so, since my rising panic at falling behind on work is ever-present, whenever I sit down to write I think I should do work instead — it pays my rent and has specific people that depend on it, as opposed to an abstract readership. So whenever I try to blog I switch to trying to work. Then I fail and end up doing neither.

This experience of not being able to do work is very strange. On the face of it, nothing is different in August from March and April when I was working from home quite productively on similar projects. I am seeing more friends now and spending more time outdoors but these usually happen in the evening, when I have already spent the day mysteriously not working.

I don’t particularly dread my work or feel stressed about it. It’s as if the job is shrouded in fog — it’s hard to get started and when I do I’m easily distracted by other things which capture my attention for unusually long periods of time. I have also noticed in the last two weeks that my stomach acts up whenever I log in to the work laptop, even several times a day, making me delay the start of work while I run to the bathroom. I thought “gut feeling” was just a metaphor.

I talked to a friend who had similar psychosomatic anxiety responses to working on her thesis, to the point of suffering loss of vision and other strange symptoms. Surprisingly, the solution she chose was just to power through. Even more surprisingly, it worked — she now holds a PhD. This is basically my mainline strategy for now.

Other friends suggested other approaches, although all of them would require not working for a while (on purpose). Unfortunately, I feel like I can’t afford to do that with the amount of work I have piled up and the upcoming deadlines. If I somehow get a lot done and meet the deadlines, I actually feel that it would be much easier to go back to working normally. The more I have to do the bigger the anxiety around my work builds, and thus the harder it is to do it. I realized this a short while ago, but the trap had already shut above me.

I’m really taking a big Rationality “L” on this one. It feels like a failure of planning, introspection, and organization that __I would have expected to avoid__. I guess anything other than asking myself constantly *“am I failing at life in some dumb and obvious way”* is unforgivable hubris.

One thing that’s important to know about my job is that it is utterly lacking in intrinsic motivation. The work is not very fun, challenging, impactful, or meaningful. __No one is cheering me on as I do it__ or admiring me for it. It is something that the world needs in the abstract sense of someone willing to pay for it in a fair market, but I would certainly not care whether it was done or not if it wasn’t my responsibility.

I’m quite grateful to my company for paying us a fair salary instead of trying to sell us on some vision of how our work is world-changing or pride-worthy. We’re paid in real cash rather than made-up meaning — it’s an honest deal and I appreciate it. At least, part of me does.

The problem is that my reasons for doing the work are very distant from my subconscious systems of motivation. Even the main reason, money, is not strongly stimulating to me these days. I’m not shopping or eating at restaurants or planning vacations, my cost of living has gone way down because COVID. If I was fired tomorrow my lifestyle wouldn’t change. My job also fit into some broader plans I had for my life this year and next, plans that have been cancelled or thrown up in the air because COVID as well. This uncertainty about the future is probably also sucking away my motivation.

It is strange to think how much cognitive effort it takes to remember why I want to do the work. Perhaps more effort than it takes to actually do it!

When I first joined the company I committed to working for two years, then for another two (for work visa purposes). Having a hard contract that I intended to honor prevented me from spending mental energy on thinking of alternatives and second-guessing myself. I have a choice now, and __choices are bad__.

My friend recently tweeted that he’s *addicted to indecision*, and I think I know what he’s talking about. Having choices feels bad, but giving up choices also feels bad. The longer you spend with a choice, the more it feels like whatever decision you make has to be absolutely awesome. If the decision space is vague and no choice is clearly superior you can get stuck in a way that feels a lot like addiction.

Several of my young friends are dealing with the same thing. They’re feeling anxious about having a life full of choices ahead of them, choices about careers, relationships, places to live. For many, all their choices seem worse now than they did in 2019, and not any more concrete or dependable. Others are waiting for a partner / friend / employer to make a choice and are resenting the other person’s addiction to indecision.

Another friend said:

Each decision is like a little death. But putting off the choice isn’t actually letting me live forever.


### I Started a Sports and Gambling Substack

The substack is here.

The latest post is here and contains predictions for tomorrow’s baseball games. I also posted predictions for yesterday’s games here, and here you can find my offensive team ratings. Pitcher ratings will be posted soon.

I wanted to try out the Substack system and software as a potential landing point, as per my discussion regarding the forced switch to the new WordPress editor. I also wanted to avoid conflating the sports modeling and gambling talk with my rationality blog, as I expect them to be relevant to the interests of different groups of people.

If you have meaningful feedback for what the model has to say, or are finding these posts interesting or useful, let me know and also hopefully subscribe to the substack. If not, I’ll likely periodically mention it is there, but will do my best to keep the two efforts distinct.


### Sunday, 20/8/30, 12pm PDT – Tagging Celebration: Habryka/Crawford + Party

**FB Event**

Woop! Woop! This past Saturday, gallant taggers brought our immense first tagging campaign to a close. The archives have been swept and every post with over 25 karma has been given at least one tag. To date, 15,440 tags have been applied to 8,022 posts.

See the full celebration post here.

While this is only the beginning for tagging, what a beginning it is! We think this deserves some celebration. We're preparing a two-part event:

**Part 1: Oliver Habryka and Jason Crawford discuss Intellectual Progress**

**https://us02web.zoom.us/j/82547847213**

Back in 2017, when the future of LessWrong was in doubt, Habryka led the revival efforts, pulling together a team and creating LessWrong 2.0. Central to Habryka's vision of LessWrong was that it would accelerate intellectual progress on the important problems facing humanity. *Intellectual progress* here simply means progress on building knowledge and understanding.

For several years, Jason has been seeking to answer the question of which factors have led to human progress in general: how did we go from living at the mercy of nature with merely stone tools and fire to buildings, electricity, medicine, legal systems, etc? Jason's research, and info about the broader Progress Studies movement, can be found at his blog, Roots of Progress (with many pieces crossposted to LessWrong).

Of course, the history of human progress, in general, is tightly woven with humanity's intellectual progress. When our understanding of the world increased, so did our ability to shape it (for better or worse).

To celebrate the new tagging system (itself designed with the goal of intellectual progress) we've decided to have Habryka and Jason chat about questions such as:

- What historical factors were important for intellectual progress?
- What conditions are most important to create now in order to get intellectual progress?

**Starting at 12:00 PDT, Habryka and Jason will chat on Zoom for ~1 hour, including some Q&A**

**Part 2: Party in Rational Woods (our Topia location)**

**https://topia.io/rational-woods** (feel free to check it out now)

Following the talk, we will migrate to a social environment that more easily allows people to strike up small group conversations. Raemon has created the marvelous Rational Woods on the Topia platform. You can check it out now for social hangouts. There are some neat features, so make sure to click on things.

This section will begin following the main talk and will last for several hours, so long as people are having a good time. Conceivably, we could play Breakfast is Served or Person Do Thing with LessWrong tags if there's enough enthusiasm. Not sure if that'll work well or not.

Also, I hope everyone can treat our top taggers as VIPs at the party. They deserve it. Three cheers for them!

**The Details**

**When: Sunday, 30th August, 12:00PM PDT / 3:00PM EDT / 7:00PM UTC**

**Where: **

**FB Event: ****https://www.facebook.com/events/2827544884191033/**

