Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 6 минут 30 секунд назад

Word-Idols (or an examination of ties between philosophy and horror literature)

9 июня, 2019 - 06:34
Published on June 9, 2019 12:31 AM UTC

An examination of any ties between Philosophy and horror literature is, indeed, quite rare an undertaking... There are many reasons for the scarcity of articles on this topic, ranging from a reluctance to acknowledge horror literature as serious (literary) fiction, to Philosophy itself being dismissed as overrated, superfluous or obsolete. As with most cases of categorical nullification of entire genres or orders, this one as well can largely be attributed to lack of familiarity with the essential subjects they encompass.

It can be argued that there indeed are grounds to assert a link between Philosophy and Horror literature. Socrates himself, while pondering a definition of Philosophy, notes that the noun thámvos - the Greek term for dazzle – was traditionally regarded as the progenitor of philosophical thought, and goes on to speak favorably of this connection. Socrates offers the insight that Philosophy is a hunt for the source of the dazzling sense a thinker may have of there being unknown things in our own mental world; the sense that we are, both by necessity and will, progressing on a surface of things and sliding along, minding to steer away from any chasms, while below the level of consciousness is perpetuated a dark abyss of unknowns.

Anyone who has read H.P. Lovecraft would instantly recognize the aforementioned image. A deep, unexplored abyss teeming with potentially dangerous forces, juxtaposed to a relatively well-established surface area where humans carry on their everyday lives with neither the ability nor the will to investigate what lurks below. The lack of ability itself is to be expected: the human mind has its own limitations, and so does the conscious power of any individual. The absence of will, however, does signify fear.

That said, in Philosophy the subject matter does not – usually – allow for lack of will to manifest (what would a non-thinking philosopher be?). Nevertheless, it can be regarded as self-evident that will to examine the depths of one’s own mind is generally lacking in most people. It can be lacking in philosophically-inclined individuals as well, given there are topics which may cause even the supposedly self-indulgent thinker to make the conscious decision to back down from further examination; these topics primarily have to do with bringing into light what hasn’t been formed stably before: to self-reflect, to insist in examining one’s deeper world of thought is a little bit like having to look at a bright and blinding light that cannot be immediately softened. A dangerous and powerful beam which is potent enough to reveal new and not entirely well-defined forms moving about below the conscious mind. Sometimes – as in Plato’s Allegory of the Cave – one has to first look away from the Sun, and prefer to observe not the forms themselves but their idols as they are reflected on the surface of a lake or river. Or choose to simply retain a memory of the first impression, and then dealing only with the memory, having replaced the striking and dazzling original with a replica sculpted out of more familiar thoughts and notions.

Let us recall the opening paragraph of H.P. Lovecraft’s “The Call of Cthulhu”:

“The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the deadly light into the peace and safety of a new dark age. ”

Reading the above one cannot help but notice that a discovery may lead to disaster; for two reasons: The person made the discovery by stumbling upon it, but in essence this lack of readiness can well be something impossible to change and not even fruitful to attempt changing. It may indeed be caused by inherent checks and balances in the human mind. The sense of so intense and ominous a surprise is potent enough to demand meticulous examination: In De Maupassant’s dark short stories we often read of the narrator having to take notes in the aftermath of such a pathos, and those are notes taken not with the end to further the insight granted by the original revelation, but in fact with an almost antithetical goal: they are conceived and – perhaps – scribbled down so as to serve as another barrier between the frightened note-taker and the dangerous glow of the dazzling revelation, since they aspire to dim the light by burying it under pages and pages of a peculiar safety net. In Lovecraft, again, we often read the narrator claim that he is writing down his story not out of hope to establish some logical explanation (and thus make his horror diminish) but because he wishes for an account to remain, an account of a cursed barrier he stumbled upon. The horrified thinker is forced to become a strange patent creator and come up with means to repress a dangerous sense originating in the depths of one’s own mind. The sculptor in H.P.L’s “The Call of Cthulhu” can bare to look at the idol he created, but only out of sedation, while the original, witnessed in the dream, was impossible to withstand.

As stated, most of the issues dealt with in Philosophy do not immediately border so dizzying or dazzling a sense. Socrates did say that he was “almost afraid” of examining Parmenides, due to the nauseous implications of the Eleatic Philosophy; yet that was a discussion on Dialectics, a branch of Philosophy that deals with matters which by their own nature are open-ended and theoretical. And while potentially any examination of notions themselves may eventually lead the thinker to sense he isn’t aware of what lies further below (or even if any set foundation exists in those unlit depths of the unconscious from which all notions spring and are later on crystallized into terms to be used and communicated freely) it is obvious that the large majority of philosophical subjects are more distinctly outlined and consequently rendered quite fit for smooth and relatively unexcited discussion.

And yet, Lovecraft’s idea about an unintended revelation does echo other philosophical-literary sentiments by celebrated authors. The sense of a critical border – an event horizon, so to speak – in consciousness, is perhaps one of the most common subjects in well-known literary fiction, one examined by authors such as F. Kafka, J.L. Borges, H. Hesse, C. Baudelaire and E.A. Poe. It is, I think, highly unfortunate that when it becomes the centerpiece in horror literature – as in the case of H.P. Lovecraft’s works – the focus usually rests on the sentiment of fear and not on the arguably philosophical and psychological cause: the fear of the unknown.

Perhaps Lovecraft himself is – at least partly – to blame for diverting attention from the philosophical meaning of his allegorical “invasion” or “colonization” by “alien” lifeforms; this type of furtive coexistence may literally be alluding to the necessary lack of awareness in all of us for what lurks deeper inside our mental cosmos. After all, don’t we fossilize any sense of that deep into neat notions, and don’t we proceed to carve – far less potent than the original – idols of those notions in the shape of words?

(by Kyriakos Chalkopoulos - article can also be found at https://www.patreon.com/posts/word-idols-are-27458809 )



Discuss

Coercive Formats

9 июня, 2019 - 04:26
Published on June 9, 2019 1:26 AM UTC

In How to Make a Complete Map of Every Thought You Think, Lion Kimbro introduces the idea of coercive vs uncoercive formats for information.

Coercive:

  • Stories.
  • Lesson plans; course prerequisites.
  • Videos.
  • Linear text which must be read in order. Introducing terminology and ideas once (rather than wherever they're needed), without an index or other better reference mechanisms.
  • Explaining what you're getting at only via your chain of argumentation, rather than up-front.
  • Intertwining development of many different themes (making it impossible to look for only the information you want).

Uncoercive:

  • Blueprints.
  • Wikis.
  • Reference manuals.
  • Hyperlinks.
  • TLDRs.
  • Tools which make it easy to quickly get an overview, such as paper abstracts or introductions which actually summarize a work (rather than teasing its conclusions or giving autobiographical context, etc).
  • Diagrams/tables/charts and other graphic elements. (The eye is free to roam to whichever part is of interest; also, these elements can often be made fairly self-explanatory, so that one doesn't have to read everything else to see what is going on.)
  • Tables of contents; outlines.
  • Chapters and sections which can be read independently.

I initially wrote the above lists as two paragraphs. Turning them into bulleted lists made them feel less coercive. The rest of this post is coercive, in that I don't separate out different topics in an easily distinguishable way.

I'm not going to claim that "less coercive" is always better. Coercive formats are often more fun to read, for example. On this spectrum, HPMOR would be more coercive than the sequences. Furthermore, you might want to present information in a coercive format if getting information in the wrong order is likely to make a person actively misunderstand a concept rather than only fail to understand. (Mere prerequisite concepts are not really a good excuse -- a wiki-style format, where prerequisite concepts are obsessively hyperlinked, works fine.)

A 'sequence' is a somewhat coercive format -- a set of posts intended to be read in a particular order. However, sequences on LW1.0 were less coercive than sequences as they exist on LW2.0. LW1.0 sequences (mostly?) listed the posts in a sequence with summaries of each post,, while LW2.0 sequences only have a summary of the entire sequence at the top, w/o a short description of each post.

A coercive format can also be better for attention management. Uncoercive formats can create a tendency for attention to jump all over the place and not spend enough time on any one topic.

Coercive formats seem workable for small and medium-sized chunks of information, but become unwieldy for large amounts of information. Imagine if an encyclopedia were written to be read in a sequential order.

Uncoercive formats often take more effort to create. Drawing a good diagram tends to take more effort than writing the information in text (though, obviously, this depends on a lot of factors). Maybe the ideal should be to move information from more-coercive to less-coercive formats: it is good to initially create a lot of content in very linear formats which have to be read from start to finish, but it is also good to eventually move things to a wiki or something like that.

Another idea is that information should just be available in as broad a variety of formats as possible.

Coercive formats seem to be better for convincing people of a target position, whereas uncoercive formats seem better for presenting all of the information. This would make coercive formats a symmetric weapon. This isn't entirely true -- advertisements often seem to use information-delivery strategies which I would ordinarily characterize as less coercive (such as bulleted lists, text elements that don't need to be read in a particular order, images). Maybe that's because they need to work with a low attention span.

It is possible that rigorous argument and careful analysis fits better in linear, coercive formats. So, it isn't clear which formatting style is better for healthy epistemics.

Maybe communities which share information in a manner heavily bias toward coercive formats have a greater risk of being convinced by long and overly clever arguments, whereas communities which share information heavily biased toward noncoercive formats suffer from overly quick impressions formed via merely glancing at a million things.

But, let's not jump to the conclusion that a healthy balance is sufficient. The ideal would be to get the advantages of both.



Discuss

Mistakes with Conservation of Expected Evidence

9 июня, 2019 - 02:07
Published on June 8, 2019 11:07 PM UTC

Epistemic Status: I've really spent some time wrestling with this one. I am highly confident in most of what I say. However, this differs from section to section. I'll put more specific epistemic statuses at the end of each section.

Some of this post is generated from mistakes I've seen people make (or, heard people complain about) in applying conservation-of-expected-evidence or related ideas. Other parts of this post are based on mistakes I made myself. I think that I used a wrong version of conservation-of-expected-evidence for some time, and propagated some wrong conclusions fairly deeply; so, this post is partly an attempt to work out the right conclusions for myself, and partly a warning to those who might make the same mistakes.

All of the mistakes I'll argue against have some good insight behind them. They may be something which is usually true, or something which points in the direction of a real phenomenon while making an error. I may come off as nitpicking.

"You can't predict that you'll update in a particular direction."

Starting with an easy one.

It can be tempting to simplify conservation of expected evidence to say you can't predict the direction which your beliefs will change. This is often approximately true, and it's exactly true in symmetric cases where your starting belief is 50-50 and the evidence is equally likely to point in either direction.

To see why it is wrong in general, consider an extreme case: a universal law, which you mostly already believe to be true. At any time, you could see a counterexample, which would make you jump to complete disbelief. That's a small probability of a very large update downwards. Conservation of expected evidence implies that you must move your belief upwards when you don't see such a counterexample. But, you consider that case to be quite likely. So, considering only which direction your beliefs will change, you can be fairly confident that your belief in the universal law will increase -- in fact, as confident as you are in the universal law itself.

The critical point here is direction vs magnitude. Conservation of expected evidence takes magnitude as well as direction into account. The small but very probable increase is balanced by the large but very improbable decrease.

The fact that we're talking about universal laws and counterexamples may fool you into thinking about logical uncertainty. You can think about logical uncertainty if you want, but this phenomenon is present in the fully classical Bayesian setting; there's no funny business with non-Bayesian updates here.

Epistemic status: confidence at the level of mathematical reasoning.

"Yes requires the possibility of no."

Scott's recent post, yes requires the possibility of no, is fine. I'm referring to a possible mistake which one could make in applying the principle illustrated there.

“Those who dream do not know they dream, but when you are awake, you know you are awake.” -- Eliezer, Against Modest Epistemology

Sometimes, look around, and ask myself whether I'm in a dream. When this happens, I generally conclude very confidently that I'm awake.

I am not similarly capable of determining that I'm dreaming. My dreaming self doesn't have the self-awareness to question whether he is dreaming in this way.

(Actually, very occasionally, I do. I either end up forcing myself awake, or I become lucid in the dream. Let's ignore that possibility for the purpose of the thought experiment.)

I am not claiming that my dreaming self is never deluded into thinking he is awake. On the contrary, I have those repeatedly-waking-up-only-to-find-I'm-still-dreaming dreams occasionally. What I'm saying is that I am not able to perform the actually good test, where I look around and really consciously consider whether or not I might be dreaming. If I want to know if I'm awake, I can just check.

A "yes-requires-the-possibility-of-no" mindset might conclude that my "actually good test" is no good at all, because it can't say no. I conclude the exact opposite: my test seems really quite effective, because I only successfully complete it while awake.

Sometimes, your thought processes really are quite suspect; yet, there's a sanity check you can run which tells you the truth. If you're deluding yourself, the general category of "things which you think are simple sanity checks you can run" is not trustworthy. If you're deluding yourself, you're not even going to think about the real sanity checks. But, that does not in itself detract from the effectiveness of the sanity check.

The general moral in terms of conservation of expected evidence is: "'Yes' only requires the possibility of silence.". In many cases, you can meaningfully say yes without being able to meaningfully say no. For example, the axioms of set theory could prove their own inconsistency. They could not prove themselves consistent (without also proving themselves inconsistent). This does not detract from the effectiveness of a proof of inconsistency! Again, although the example involves logic, there's nothing funny going on with logical uncertainty; the phenomenon under discussion is understandable in fully Bayesian terms.

Symbolically: as is always the case, you don't really want to update on the raw proposition, but rather, the fact that you observed the proposition, to account for selection bias. Conservation of expected evidence can be written .mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} P(H)=P(H|E)P(E)+P(H|¬E)P(¬E), but if we re-write it to explicitly show the "observation of evidence", it becomes P(H)=P(H|obs(E))P(obs(E))+P(H|¬obs(E))P(¬obs(E)). It does not become P(H)=P(H|obs(E))P(obs(E))+P(H|obs(¬E))P(obs(¬E)). In English: evidence is balanced between making the observation and not making the observation, not between the observation and the observation of the negation.

Epistemic status: confidence at the level of mathematical reasoning for the core claim of this section. However, some applications of the idea (such as to dreams, my central example) depend on trickier philosophical issues discussed in the next section. I'm only moderately confident I have the right view there.

"But then what do you say to the Republican?"

I suspect that many readers are less than fully on board with the claims I made in the previous section. Perhaps you think I'm grossly overconfident about being awake. Perhaps you think I'm neglecting the outside view, or ignoring something to do with timeless decision theory.

A lot of my thinking in this post was generated by grappling with some points made in Inadequate Equilibria. To quote the relevant paragraph of against modest epistemology:

Or as someone advocating what I took to be modesty recently said to me, after I explained why I thought it was sometimes okay to give yourself the discretion to disagree with mainstream expertise when the mainstream seems to be screwing up, in exactly the following words: “But then what do you say to the Republican?”

Let's put that in (pseudo-)conservation-of-expected-evidence terms: we know that just applying one's best reasoning will often leave one overconfident in one's idiosyncratic beliefs. Doesn't that mean "apply your best reasoning" is a bad test, which fails to conserve expected evidence? So, should we not adjust downward in general?

In the essay, Eliezer strongly advises allowing yourself to have an inside view even when there's an outside view which says inside views broadly similar to yours tend to be mistaken. But doesn't that go against what he said in Ethical Injunctions?

Ethical Injunctions argues that there are situations where you should not trust your reasoning, and fall back on a general rule. You do this because, in the vast majority of cases of that kind, your oh-so-clever reasoning is mistaken and the general rule saves you from the error.

In Against Modest Epistemology, Eliezer criticizes arguments which rely on putting arguments in very general categories and taking the outside view:

At its epistemological core, modesty says that we should abstract up to a particular very general self-observation, condition on it, and then not condition on anything else because that would be inside-viewing. An observation like, “I’m familiar with the cognitive science literature discussing which debiasing techniques work well in practice, I’ve spent time on calibration and visualization exercises to address biases like base rate neglect, and my experience suggests that they’ve helped,” is to be generalized up to, “I use an epistemology which I think is good.” I am then to ask myself what average performance I would expect from an agent, conditioning only on the fact that the agent is using an epistemology that they think is good, and not conditioning on that agent using Bayesian epistemology or debiasing techniques or experimental protocol or mathematical reasoning or anything in particular.Only in this way can we force Republicans to agree with us… or something.

He instead advises that we should update on all the information we have, use our best arguments, reason about situations in full detail:

If you’re trying to estimate the accuracy of your epistemology, and you know what Bayes’s Rule is, then—on naive, straightforward, traditional Bayesian epistemology—you ought to condition on both of these facts, and estimate P(accuracy|know_Bayes) instead of P(accuracy). Doing anything other than that opens the door to a host of paradoxes.

In Ethical Injunctions, he seems to warn against that very thing:

But surely... if one is aware of these reasons... then one can simply redo the calculation, taking them into account.  So we can rob banks if it seems like the right thing to do after taking into account the problem of corrupted hardware and black swan blowups.  That's the rational course, right?There's a number of replies I could give to that.I'll start by saying that this is a prime example of the sort of thinking I have in mind, when I warn aspiring rationalists to beware of cleverness.

Now, maybe Eliezer has simply changed views on this over the years. Even so, that leaves us with the problem of how to reconcile these arguments.

I'd say the following: modest epistemology points out a simple improvement over the default strategy: "In any group of people who disagree, they can do better by moving their beliefs toward each other." "Lots of crazy people think they've discovered secrets of the universe, and the number of sane people who truly discover such secrets is quite small; so, we can improve the average by never believing we've discovered secrets of the universe." If we take a timeless decision theory perspective (or similar), this is in fact an improvement; however, it is far from the optimal policy, and has a form which blocks further progress.

Ethical Injunctions talks about rules with greater specificity, and less progress-blocking nature. Essentially, a proper ethical injunction is actually the best policy you can come up with, whereas the modesty argument stops short of that.

Doesn't the "actually best policy you can come up with" risk overly-clever policies which depend on broken parts of your cognition? Yes, but your meta-level arguments about which kinds of argument work should be independent sources of evidence from your object-level confusion. To give a toy example: let's say you really, really want 8+8 to be 12 due to some motivated cognition. You can still decide to check by applying basic arithmetic. You might not do this, because you know it isn't to the advantage of the motivated cognition. However, if you do check, it is actually quite difficult for the motivated cognition to warp basic arithmetic.

There's also the fact that choosing a modesty policy doesn't really help the republican. I think that's the critical kink in the conservation-of-expected-evidence version of modest epistemology. If you, while awake, decide to doubt whether you're awake (no matter how compelling the evidence that you're awake seems to be), then you're not really improving your overall correctness.

So, all told, it seems like conservation of expected evidence has to be applied to the details of your reasoning. If your put your reasoning in a more generic category, it may appear that a much more modest conclusion is required by conservation of expected evidence. We can justify this in classical probability theory, though in this section it is even more tempting to consider exotic decision-theoretic and non-omnescience considerations than it was previously.

Epistemic status: the conclusion is mathematically true in classical Bayesian epistemology. I am subjectively >80% confident that the conclusion should hold in >90% of realistic cases, but it is unclear how to make this into a real empirical claim. I'm unsure enough of how ethical injunctions should work that I could see my views shifting significantly. I'll mention pre-rationality as one confusion I have which seems vaguely related.

"I can't credibly claim anything if there are incentives on my words."

Another rule which one might derive from Scott's Yes Requires the Possibility of No is: you can't really say anything if pressure is being put on you to say a particular thing.

Now, I agree that this is somewhat true, particularly in simple cases where pressure is being put on you to say one particular thing. However, I've suffered from learned helplessness around this. I sort of shut down when I can identify any incentives at all which could make my claims suspect, and hesitate to claim anything. This isn't a very useful strategy. Either "just say the truth", or "just say whatever you feel you're expected to say" are both likely better strategies.

One idea is to "call out" the pressure you feel. "I'm having trouble saying anything because I'm worried what you will think of me." This isn't always a good idea, but it can often work fairly well. Someone who is caving to incentives isn't very likely to say something like that, so it provides some evidence that you're being genuine. It can also open the door to other ways you and the person you're talking to can solve the incentive problem.

You can also "call out" something even if you're unable or unwilling to explain. You just say something like "there's some thing going on"... or "I'm somehow frustrated with this situation"... or whatever you can manage to say.

This "call out" idea also works (to some extent) on motivated cognition. Maybe you're worried about the social pressure on your beliefs because it might influence the accuracy of those beliefs. Rather than stressing about this and going into a spiral of self-analysis, you can just state to yourself that that's a thing which might be going on, and move forward. Making it explicit might open up helpful lines of thinking later.

Another thing I want to point out is that most people are willing to place at least a little faith in your honesty (and not irrationally so). Just because you have a story in mind where they should assume you're lying doesn't mean that's the only possibility they are -- or should be -- considering. One problematic incentive doesn't fully determine the situation. (This one also applies internally: identifying one relevant bias or whatever doesn't mean you should block off that part of yourself.)

Epistemic status: low confidence. I imagine I would have said something very different if I were more an expert in this particular thing.

"Your true reason screens off any other evidence your argument might include."

In The Bottom Line, Eliezer describes a clever arguer who first writes the conclusion which they want to argue for at the bottom of a sheet of paper, and then comes up with as many arguments as they can to put above that. In the thought experiment, the clever arguer's conclusion is actually determined by who can pay the clever arguer more. Eliezer says:

So the handwriting of the curious inquirer is entangled with the signs and portents and the contents of the boxes, whereas the handwriting of the clever arguer is evidence only of which owner paid the higher bid. There is a great difference in the indications of ink, though one who foolishly read aloud the ink-shapes might think the English words sounded similar.

Now, Eliezer is trying to make a point about how you form your own beliefs -- that the quality of the process which determines which claims you make is what matters, and the quality of any rationalizations you give doesn't change that.

However, reading that, I came away with the mistaken idea that someone listening to a clever arguer should ignore all the clever arguments. Or, generalizing further, what you should do when listening to any argument is try to figure out what process wrote the bottom line, ignoring any other evidence provided.

This isn't the worst possible algorithm. You really should heavily discount evidence provided by clever arguers, because it has been heavily cherry-picked. And almost everyone does a great deal of clever arguing. Even a hardboiled rationalist will tend to present evidence for the point they're trying to make rather than against (perhaps because that's a fairly good strategy for explaining things -- sampling evidence at random isn't a very efficient way of conversing!).

However, ignoring arguments and attending only to the original causes of belief has some absurd consequences. Chief among them is: it would imply that you should ignore mathematical proofs if the person who came up with the proof only searched for positive proofs and wouldn't have spend time trying to prove the opposite. (This ties in with the very first section -- failing to find a proof is like remaining silent.)

This is bonkers. Proof is proof. And again, this isn't some special non-Bayesian phenomenon due to logical uncertainty. A Bayesian can and should recognize decisive evidence, whether or not it came from a clever arguer.

Yet, I really held this position for a while. I treated mathematical proofs as an exceptional case, rather than as a phenomenon continuous with weaker forms of evidence. If a clever arguer presented anything short of a mathematical proof, I would remind myself of how convincing cherry-picked evidence can seem. And I'd notice how almost everyone mostly cherry-picked when explaining their views.

This is like throwing out data when it has been contaminated by selection bias, rather than making a model of the selection bias so that you can update on the data appropriately. It might be a good practice in scientific publications, but if you take it as a universal, you could find reasons to throw out just about everything (especially if you start worrying about anthropic selection effects).

The right thing to do is closer to this: figure out how convincing you expect evidence to look given the extent of selection bias. Then, update on the difference between what you see and what expected. If a clever arguer makes a case which is much better than what you would have expected they could make, you can update up. If it is worse that you'd expect, even if the evidence would otherwise look favorable, you update down.

My view also made me uncomfortable presenting a case for my own beliefs, because I would think of myself as a clever-arguer any time I did something other than recount the actual historical causes of my belief (or honestly reconsider my belief on the spot). Grognor made a similar point in Unwilling or Unable to Explain:

Let me back up. Speaking in good faith entails giving the real reasons you believe something rather than a persuasive impromptu rationalization. Most people routinely do the latter without even noticing. I'm sure I still do it without noticing. But when I do notice I'm about to make something up, instead I clam up and say, "I can't explain the reasons for this claim." I'm not willing to disingenuously reference a scientific paper that I'd never even heard of when I formed the belief it'd be justifying, for example. In this case silence is the only feasible alternative to speaking in bad faith.

While I think there's something to this mindset, I no longer think it makes sense to clam up when you can't figure out how you originally came around to the view which you now hold. If you think there are other good reasons, you can give them without violating good faith.

Actually, I really wish I could draw a sharper line here. I'm essentially claiming that a little cherry-picking is OK if you're just trying to convince someone of the view which you see as the truth, so long as you're not intentionally hiding anything. This is an uncomfortable conclusion.

Epistemic status: confident that the views I claim are mistaken are mistaken. Less confident about best-practice claims.

"If you can't provide me with a reason, I have to assume you're wrong."

If you take the conclusion of the previous section too far, you might reason as follows: if someone is trying to claim X, surely they're trying to give you some evidence toward X. If they claim X and then you challenge them for evidence, they'll try to tell you any evidence they have. So, if they come up with nothing, you have to update down, since you would have updated upwards otherwise. Right?

I think most people make this mistake due to simple conversation norms: when navigating a conversation, people have to figure out what everyone else is willing to assume, in order to make sensible statements with minimal friction. So, we look for obvious signs of whether a statement was accepted by everyone vs rejected. If someone was asked to provide a reason for a statement they made and failed to do so, that's a fairly good signal that the statement hasn't been accepted into the common background assumptions for the conversation. The fact that other people are likely to use this heuristic as well makes the signal even stronger. So, assertions which can't be backed up with reasons are likely to be rejected.

This is almost the opposite mistake from the previous section; the previous one was justifications don't matter, whereas this idea is only justifications matter.

I think something good happens when everyone in a conversation recognizes that people can believe things for good reason without being able to articulate those reasons. (This includes yourself!)

You can't just give everyone a pass to make unjustified claims and assert that they have strong inarticulable reasons. Or rather, you can give everyone a pass to do that, but you don't have to take them seriously when they do it. However, in environments of high intellectual trust, you can take it seriously. Indeed, applying the usual heuristic will likely cause you to update in the wrong direction.

Epistemic status: moderately confident.

Conclusion

I think all of this is fairly important -- if you're like me, you've likely made some mistakes along these lines. I also think there are many issues related to conservation of expected evidence which I still don't fully understand, such as explanation vs rationalization, ethical injunctions and pre-rationality. Tsuyoku Naritai!



Discuss

Two labyrinths - where would you rather be?

8 июня, 2019 - 22:23
Published on June 8, 2019 5:48 PM UTC


Two Kings and two Labyrinths is  a very short story, written by J.L. Borges. It barely manages to fill  the space of a single page; and yet there is enough in it to allow for  an interesting dissertation. 

The actual story is about a rivalry between  two kings: The King of Babylon had once invited the King of Arabia to  his capital, and there got him to enter a labyrinth made of intricate  passages, surrounded by tall walls. The King of Arabia only managed  to find his way out after imploring his God for help. The experience  terrified him, and he swore that in the future he would repay the  Babylonian in kind, by introducing him to another labyrinth; one particular to his native and desolate Arabian realm...

After his victory in war, the King of Arabia takes the King of  Babylon hostage. He brings him to the desert, where, at the end of a  three-day journey, he is abandoned. The desert is another kind of  labyrinth. It has neither passages nor walls, but still finding one's  way out of it is virtually impossible.

A Labyrinth is More Than Just a Prison Cell

A  labyrinth isn’t just a structure which confines; it is one which serves  the purpose of getting one disoriented. While a prison cell – regardless if it is  nameless and obscure or one as famous as the stone vault in Sophocles’  play, Antigone, which was used to imprison the heroine and  slowly drain her of the will to live – is just a simple room, enough to  enclose, limit, and cause desperation, an actual labyrinth functions by  allowing the person inside to still hope there is a chance of finding a  way out... The labyrinth is different from a group of  interconnecting cells, in that somewhere in it one may still discover a  passage which will lead to liberation...

The possibility of finding the exit may be so small that, in  practice, one wouldn't ever succeed in this quest... It's not important,  though, because the very form of the labyrinth forces its prisoner to  accept that there are always new routes to explore, or another idea to  test; the progression from each part of the labyrinth to the next one  may be quite monotonous, and almost reveal no change, but the prisoner  inside is actually moving, is still progressing – and this allows for  hope.

The Babylonian Labyrinth

The  first of the labyrinths presented in the story is the one the reader  would readily identify as a typical labyrinth. A maze, filled with  corridors and forking paths, and with the line of sight in every one of  its locations being crucially obstructed by tall and sturdy masonry. In  such an edifice one can attempt to examine every minute difference  between the numerous interconnecting rooms, aspiring to devise some  manner of identifying and then memorizing which paths have already been  taken, and come up with a plan that would allow for the exploration of  as many areas as possible, all the while hoping that through a combination of methodology and luck it may happen that the exit will be  discovered!

Every room has specific forms, and every step can be – and moreover may have  to be – retraced, to allow for a progressively more thorough and valid  impression in regards to the overall shape of the labyrinth.

The Arabian Labyrinth

The  labyrinth in Arabia is, of course, the desert itself. It stretches for  endless miles. Here there are no rooms, nor walls, nor any other element  which changes as one carries on walking. It is, indeed, a labyrinth  which consists of a singular vast space; and, unlike the Babylonian type,  this labyrinth will reveal its exit if you simply walk far enough so  that the first signs of something other than the desert becomes visible on the horizon... Unlike with the built maze, the desert doesn’t  allow for retracing of steps; you have to choose a direction, and carry  on moving. It may, in fact, easily be the case that your very first step and your  very first choice has already either saved or doomed you! Only at a far later point in time will you find out which of the two was true.

While in the built maze you need to form a sense of the overall pattern, keep track of the various routes you had taken and construct a  plan so as to allow for a new, original route to be set in every  subsequent attempt, in the desert maze you have an infinite number of  routes which only differ in essence in regards to their direction: if  (for example) this desert's end can only be reached – before your  stamina and supplies are depleted – if you keep moving eastwards, you  won’t ever succeed if you moved to the west. 

The Crucial Difference Between The Two 

Both  versions of the labyrinth exist so as to achieve the same: prevent the one inside to escape without conscious effort. Or, to put it in a more poignant manner:  not allow one to leave unless they had gained a particular knowledge  about the labyrinth; the knowledge of a way out. After all, no labyrinth  can remain imposing once you have located its exit.

But the two versions differ in a very crucial way: While the  labyrinth of corridors will keep you hoping until the very last second  of your life – for the exit may always be found in the next room and therefore still be accessible even if you are about to collapse, starving and  reduced to crawling on the floor – the labyrinth of nothingness,  the cruel and level plane of the desert, will have informed you long  before you fall to the sand, never to rise back again, that you already have lost and are to die inside it...

And yet it must be noted that this difference brings about also a  complementary and antithetic element; an elegant juxtaposition: In the  labyrinth of corridors you will retain hope until you draw your last  breath, yes, but you will also keep being fooled into thinking your moves  up to that point haven’t failed you. In the labyrinth of open space you  will be informed that you failed, and that you will die, long before it  happens – since there won’t be any settlement visible on the horizon,  and your body has already shown the tell-tale signs of giving up.

by Kyriakos Chalkopoulos

*

I want to ask which of the two types of labyrinth you would rather be in. Certainly one can imagine life as a journey inside (external as well as internal) labyrinths.




Discuss

AGI will drastically increase economies of scale

8 июня, 2019 - 02:17
Published on June 7, 2019 11:17 PM UTC

Note to mods: I'm a bit uncertain whether posts like this one currently belongs on the Alignment Forum. Please move it if it doesn't. Or if anyone would prefer not to have such posts on AF, please let me know.

In Strategic implications of AIs’ ability to coordinate at low cost, I talked about the possibility that different AGIs can coordinate with each other much more easily than humans can, by doing something like merging their utility functions together. It now occurs to me that another way for AGIs to greatly reduce coordination costs in an economy is by having each AGI or copies of one AGI profitably take over much larger chunks of the economy, and this can be done with AGIs that don't even have explicit utility functions, such as copies of an AGI that are all corrigible/intent-aligned to a single person.

Today, there are many industries with large economies of scale, due to things like fixed costs, network effects, reduced deadweight loss when monopolies in different industries merge (because they can internally charge each other prices that equal marginal costs), but because coordination costs among humans increase super-linearly with the number of people involved (see Moral Mazes and Short Termism for a related recent discussion), that creates diseconomies of scale which counterbalance the economies of scale, so companies tend to grow to a certain size and then stop. But an AGI-operated company, where for example all the workers are AGIs that are intent-aligned to the CEO, would eliminate almost all of the internal coordination costs (i.e., all of the coordination costs that are caused by value differences, such as all the things described in Moral Mazes, "market for lemons" or lost opportunities for trade due to asymmetric information, principal-agent problems, monitoring/auditing costs, costly signaling, and suboptimal Nash equilibria in general), allowing such companies to grow much bigger. In fact, from purely the perspective of maximizing the efficiency/output of an economy, I don't see why it wouldn't be best to have (copies of) one AGI control everything.

If I'm right about this, it seems quite plausible that some countries would foresee it too, and as soon as it can feasibly be done, nationalize all of their productive resources and place them under the control of one AGI (perhaps intent-aligned to a supreme leader or to a small, highly coordinated group of humans), which would allow them to out-compete any other countries that are not willing to do this (and don't have some other competitive advantage to compensate for this disadvantage). This seems to be an important consideration that is missing from many people's pictures of what happens after AGI in a slow-takeoff scenario.



Discuss

Asymmetric Weapons Aren't Always on Your Side

7 июня, 2019 - 23:26
Published on June 7, 2019 8:26 PM UTC

Some time ago, Scott Alexander wrote about asymmetric weapons, and now he writes again about them. During these posts, Scott repeatedly characterizes asymmetric weapons as inherently stronger for the good guys than they are for the bad guys. Here is a quote from his first post:

Logical debate has one advantage over narrative, rhetoric, and violence: it’s an asymmetric weapon. That is, it’s a weapon which is stronger in the hands of the good guys than in the hands of the bad guys.

And here is a quote from his more recent one:

A symmetric weapon is one that works just as well for the bad guys as for the good guys. For example, violence – your morality doesn’t determine how hard you can punch; they can buy guns from the same places we can.An asymmetric weapon is one that works better for the good guys than the bad guys. The example I gave was Reason. If everyone tries to solve their problems through figuring out what the right thing to do is, the good guys (who are right) will have an easier time proving themselves to be right than the bad guys (who are wrong). Finding and using asymmetric weapons is the only non-coincidence way to make sustained moral progress.

One problem with this concept is that just because something is asymmetric doesn't mean that it's asymmetric in a good direction.

Scott talks about weapons that are asymmetric towards those who are right. However, there are many more types of asymmetries than just right vs. wrong - physical violence is asymmetric towards the strong, shouting people down is asymmetric towards the loud, and airing TV commercials is asymmetric towards people with more money. Violence isn't merely symmetric - it's asymmetric in a bad direction, since fascists are better than violence than you.

This in turn means that various sides will all be trying to pull things in directions that are asymmetric to their advantage. Indeed, a basic principle in strategy is to try to shift conflicts into areas where you are strong and your opponent is weak.

For instance, people who are good at violence benefit from things getting violent. People who are locally popular benefit from popularity contests. People who have lots of free time benefit from time-consuming processes. People who are better at keeping their composure benefit from discourse norms that punish displays of emotion.

Developing asymmetric processes that point towards truth is a good idea, and I'm all for it. But in practice there are also asymmetric processes that point towards error, or merely asymmetric processes that point towards what's currently popular or faddish. Those processes are, if anything, just as likely to have people trying to promote them than the pro-truth ones - perhaps more likely! That doesn't make the people promoting those ideas "anti-truth" or whatever - they may not even be aware of what they're doing - but even so, people tend to respond to incentives, and those incentives may well pull them towards norms and methods that are asymmetric in their favor independent of whether those norms and methods promote truth.



Discuss

Risks from Learned Optimization: Conclusion and Related Work

7 июня, 2019 - 22:53
Published on June 7, 2019 7:53 PM UTC

.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

This is the fifth of five posts in the Mesa-Optimization Sequence based on the MIRI paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

 

Related work

Meta-learning. As described in the first post, meta-learning can often be thought of meta-optimization when the meta-optimizer's objective is explicitly designed to accomplish some base objective. However, it is also possible to do meta-learning by attempting to make use of mesa-optimization instead. For example, in Wang et al.'s “Learning to Reinforcement Learn,” the authors claim to have produced a neural network that implements its own optimization procedure.(28) Specifically, the authors argue that the ability of their network to solve extremely varied environments without explicit retraining for each one means that their network must be implementing its own internal learning procedure. Another example is Duan et al.'s “RL2: Fast Reinforcement Learning via Slow Reinforcement Learning,” in which the authors train a reinforcement learning algorithm which they claim is itself doing reinforcement learning.(5) This sort of meta-learning research seems the closest to producing mesa-optimizers of any existing machine learning research.

Robustness. A system is robust to distributional shift if it continues to perform well on the objective function for which it was optimized even when off the training environment.(29) In the context of mesa-optimization, pseudo-alignment is a particular way in which a learned system can fail to be robust to distributional shift: in a new environment, a pseudo-aligned mesa-optimizer might still competently optimize for the mesa-objective but fail to be robust due to the difference between the base and mesa- objectives.

The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).(8) In the context of mesa-optimization, pseudo-alignment leads to a reward-result gap because the system's behavior outside the training environment is determined by its mesa-objective, which in the case of pseudo-alignment is not aligned with the base objective.

It should be noted, however, that while inner alignment is a robustness problem, the occurrence of unintended mesa-optimization is not. If the base optimizer's objective is not a perfect measure of the human's goals, then preventing mesa-optimizers from arising at all might be the preferred outcome. In such a case, it might be desirable to create a system that is strongly optimized for the base objective within some limited domain without that system engaging in open-ended optimization in new environments.(11) One possible way to accomplish this might be to use strong optimization at the level of the base optimizer during training to prevent strong optimization at the level of the mesa-optimizer.(11)

Unidentifiability and goal ambiguity. As we noted in the third post, the problem of unidentifiability of objective functions in mesa-optimization is similar to the problem of unidentifiability in reward learning, the key issue being that it can be difficult to determine the “correct” objective function given only a sample of that objective's output on some training data.(20) We hypothesize that if the problem of unidentifiability can be resolved in the context of mesa-optimization, it will likely (at least to some extent) be through solutions that are similar to those of the unidentifiability problem in reward learning. An example of research that may be applicable to mesa-optimization in this way is Amin and Singh's(20) proposal for alleviating empirical unidentifiability in inverse reinforcement learning by adaptively sampling from a range of environments.

Furthermore, it has been noted in the inverse reinforcement learning literature that the reward function of an agent generally cannot be uniquely deduced from its behavior.(30) In this context, the inner alignment problem can be seen as an extension of the value learning problem. In the value learning problem, the problem is to have enough information about an agent's behavior to infer its utility function, whereas in the inner alignment problem, the problem is to test the learned algorithm's behavior enough to ensure that it has a certain objective function.

Interpretability. The field of interpretability attempts to develop methods for making deep learning models more interpretable by humans. In the context of mesa-optimization, it would be beneficial to have a method for determining whether a system is performing some kind of optimization, what it is optimizing for, and/or what information it takes into account in that optimization. This would help us understand when a system might exhibit unintended behavior, as well as help us construct learning algorithms that create selection pressure against the development of potentially dangerous learned algorithms.

Verification. The field of verification in machine learning attempts to develop algorithms that formally verify whether systems satisfy certain properties. In the context of mesa-optimization, it would be desirable to be able to check whether a learned algorithm is implementing potentially dangerous optimization.

Current verification algorithms are primarily used to verify properties defined on input-output relations, such as checking invariants of the output with respect to user-definable transformations of the inputs. A primary motivation for much of this research is the failure of robustness against adversarial examples in image recognition tasks. There are both white-box algorithms,(31) e.g. an SMT solver that in principle allows for verification of arbitrary propositions about activations in the network,(32) and black-box algorithms(33). Applying such research to mesa-optimization, however, is hampered by the fact that we currently don't have a formal specification of optimization.

Corrigibility. An AI system is corrigible if it tolerates or assists with its human programmers in correcting itself.(25) The current analysis of corrigibility has focused on how to define a utility function such that, if optimized by a rational agent, that agent would be corrigible. Our analysis suggests that even if such a corrigible objective function could be specified or learned, it is nontrivial to ensure that a system trained on that objective function would actually be corrigible. Even if the base objective function would be corrigible if optimized directly, the system may exhibit mesa-optimization, in which case the system's mesa-objective might not inherit the corrigibility of the base objective. This is somewhat analogous to the problem of utility-indifferent agents creating other agents that are not utility-indifferent.(25) In the fourth post, we suggest a notion related to corrigibility—corrigible alignment—which is applicable to mesa-optimizers. If work on corrigibility were able to find a way to reliably produce corrigibly aligned mesa-optimizers, it could significantly contribute to solving the inner alignment problem.

Comprehensive AI Services (CAIS).(11) CAIS is a descriptive model of the process by which superintelligent systems will be developed, together with prescriptive implications for the best mode of doing so. The CAIS model, consistent with our analysis, makes a clear distinction between learning (the base optimizer) and functionality (the learned algorithm). The CAIS model predicts, among other things, that more and more powerful general-purpose learners will be developed, which through a layered process will develop services with superintelligent capabilities. Services will develop services that will develop services, and so on. At the end of this “tree,” services for a specific final task are developed. Humans are involved throughout the various layers of this process so that they can have many points of leverage for developing the final service.

The higher-level services in this tree can be seen as meta-optimizers of the lower-level services. However, there is still the possibility of mesa-optimization—in particular, we identify two ways in which mesa-optimization could occur in the CAIS-model. First, a final service could develop a mesa-optimizer. This scenario would correspond closely to the examples we have discussed in this sequence: the base optimizer would be the next-to-final service in the chain, and the learned algorithm (the mesa-optimizer in this case), would be the final service (alternatively, we could also think of the entire chain from the first service to the next-to-final service as the base optimizer). Second, however, an intermediary service in the chain might also be a mesa-optimizer. In this case, this service would be an optimizer in two respects: it would be the meta-optimizer of the service below it (as it is by default in the CAIS model), but it would also be a mesa-optimizer with respect to the service above it.

 

Conclusion

In this sequence, we have argued for the existence of two basic AI safety problems: the problem that mesa-optimizers may arise even when not desired (unintended mesa-optimization), and the problem that mesa-optimizers may not be aligned with the original system's objective (the inner alignment problem). However, our work is still only speculative. We are thus left with several possibilities:

  1. If mesa-optimizers are very unlikely to occur in advanced ML systems (and we do not develop them on purpose), then mesa-optimization and inner alignment are not concerns.
  2. If mesa-optimizers are not only likely to occur but also difficult to prevent, then solving both inner alignment and outer alignment becomes critical for achieving confidence in highly capable AI systems.
  3. If mesa-optimizers are likely to occur in future AI systems by default, and there turns out to be some way of preventing mesa-optimizers from arising, then instead of solving the inner alignment problem, it may be better to design systems to not produce a mesa-optimizer at all. Furthermore, in such a scenario, some parts of the outer alignment problem may not need to be solved either: if an AI system can be prevented from implementing any sort of optimization algorithm, then there may be more situations where it is safe for the system to be trained on an objective that is not perfectly aligned with the programmer's intentions. That is, if a learned algorithm is not an optimizer, it might not optimize the objective to such an extreme that it would cease to produce positive outcomes.

Our uncertainty on this matter is a potentially significant hurdle to determining the best approaches to AI safety. If we do not know the relative difficulties of the inner alignment problem and the unintended optimization problem, then it is unclear how to adequately assess approaches that rely on solving one or both of these problems (such as Iterated Distillation and Amplification(34) or AI safety via debate(35)). We therefore suggest that it is both an important and timely task for future AI safety work to pin down the conditions under which the inner alignment problem and the unintended optimization problem are likely to occur as well as the techniques needed to solve them.

 

The full paper that this sequence is based on, "Risks from Learned Optimization in Advanced Machine Learning Systems," is now available on arXiv.

Glossary | Bibliography



Discuss

Personal musings on Individualism and Empathy

7 июня, 2019 - 19:27
Published on June 7, 2019 4:27 PM UTC

One type of the more common accusations that people throw at me can be described like this: "You're an egoist. You don't think about others." To my teenage self this sounded amusing, and I would respond with a smug "Why yes, I am an egoist, thank you very much!"

Here I have to make a confession: back then I was pretty much obsessed with Ayn Rand's Objectivism. For those who don't know much about her philosophy, it's a flawed attempt to create a consistent philosophical system that covers just about everything, from ontology to aesthetics. It has very stark individualistic, libertarian-ish vibes. Being a naive yet curious kid, I was enamored with what turned out to be my first expose to (relatively) rigorous argumentation, logic, and philosophy/critical thought in general. I can't thank Rand enough for that, because through her I encountered the concepts of reason and rationality, which in turn led me to LW, The Sequences and everything else which formed the basis of my mature thought.

Anyway, after a while I moved away from objectivist ideas. I clearly remember the "final nail in the coffin" moment. I was rereading THE SEMINAL WORK THAT SYSTEMATICALLY EXPLORES AND DEVELOPS THE IDEAS OF AYN RAND, particularly the part about the consciousness axiom. I remember thinking: "Wait, the assumptions that the argument lies on are completely unfounded, what the hell is this epistemology LOL". That was it. No dramatic catharsis-like moment, no nothing, just the feeling of having finally checked off an item from my bucket list.

Despite my going through the rejection ritual, a lot of moral intuitions that I formed in that period carried over into my new, semi-adult life. I shunned social norms and laughed at those who had to conform. I prided myself on being socially brave, freely asking people for favors and being able to take a rejection and reject myself, giving compliments left and right whenever I felt like others deserved them, expressing myself authentically, without fear of judgement or public shaming.

Most of this was connected to the classical-liberal-ish notion of "anything's OK as long as it's not initiation of physical force" and its corollaries, coupled with intuitions like "nobody owns nothing to no one, thus e.g. asking for favors is totally OK and even good (cuz self-interest), but only if you're able to take a 'no' for an answer; we all are free people after all" and others from the same neg-lib-and-egoism cluster.

Plot twist. This Randian experience left my empathy in a stunted, primitive state. It's not on the levels of psychopathy/ASPD, I experience rich emotions, can empathize with others (both affectively and cognitively), my tracking-social-cues mechanism works fine, I'm able to enjoy fiction, etc. But my brain doesn't mark most significant personal information about others as salient, unless I'm explicitly told to keep it in mind. This means that generic social rules and explicit agreements work fine for me, but many implicit expectations are bound to fall short, because the necessary personal information about others' preferences just doesn't jump to my mind whenever I'm making a decision that can affect some other person that's relatively important to me. Of course it's more complex than this, so please remember that I'm painting in broad strokes here.

This fucking sucks. I do actually feel remorse every time I unintentionally hurt someone, it's pretty horrible, almost every time I wish it didn't happen. It's usually not an issue with acquaintances and friends, i.e. we're not close enough, so they end up getting hurt by me very rarely, if ever, but at some point if you want the relationship to progress, you have to be able to count on the other person to take on some of the responsibility for your well-being. That's how you develop trust, which is an important prerequisite for vulnerability and sustenance of long-term cooperative relationships. To be responsible, you first have to become aware.

Which is exactly where my bottleneck lies. I don't model others well enough, typical-minding aside. Or rather, for some reason I'm disinclined to do it. It feels like a burden. I want others to tell me what they want from me, explicitly and precisely. That's how I operate myself. To me it feels like taking personal responsibility for your needs, which is Good. Actively taking responsibility for others feels Wrong, because what the hell, why am I supposed to do something for them? No one owns me anything, but neither do I!

Now, I do exhibit caring behavior, sometimes, towards certain people. It feels good and authentic. That's because when I do it, I really value the person and want them to be well. One confounding factor could be that I simply don't like most people, they're not "good enough for me" in my eyes, so I (read as "my brain") don't care about them enough to model them sufficiently. But then again, maybe the reason why I don't like most people and find them shallow is my lack of empathy, so the causes and effects might be tangled up here.

Several broken relationships and lots of pain behind me, I'm dating someone again. I'm different now, having studied a lot of psychology, philosophy, basically everything human nature. I'm stronger. And, as it happens, this relationship is much healthier than others, it spurs a significant amount of personal growth, despite the short time span: patterns become obvious, small yet meaningful change takes place. I'm thankful to my SO. Even if our paths diverge at some point, I've gotten a lot of value out of our relationship, and I hope so did they!

And now, dear readers, I present you with my Latest, Juiciest Insight into the psyche of Me, a random person on the Internet.

First off, let's start with the belief/alief dichotomy. Here's my daring and frivolous interpretation: Beliefs are about meticulous calculations, S2 override, and social interfacing. Aliefs are what drives your S1 by default, affecting greatly your motivations and behavior. It's important to be able to explicate aliefs, because that gives you the opportunity to intentionally change them. They do change on their own, of course, but it happens more or less randomly, subject to the chance workings of environment, experience, and reflection. Lots of so-called internal conflicts -- procrastination, negative spirals, self-loathing, etc -- are the result of identifying too much with S2, the mind, and not enough of "accepting" S1, the body, as a part of your self-concept, or so I think anyway. Cartesian delusions are crappy at modeling reality, but sadly are pretty intuitive and thus quite widespread.
But I digress.

Stripped of all ethical chaff, I unearthed an alief. It goes like this:

"Freedom from social constraints is good. If I let myself become sufficiently aware of others' preferences, this will turn me into an anxious self-conscious nervewrack and I'll lose all the benefits I'm reaping from my current attitude"

The concept of slack ties into this very well. For those who want a quick refresher, something something our behavior is partly managed by the expectations and implicit demands of other people. This post by Val is probably relevant, too.
I have lots of slack. What's more, I'm able to constrain others rather heavily, e.g. by openly signaling my preferences. I'm allowed to do this precisely because I'm less attuned to what others prefer. It's a strong, honest precommitment mechanism: if you're not even consciously aware of the fact that you're thwarting someone else's preferences via action, you pass the internal-S2-self-policing-social check; after all, none of this was intentional! This lets you keep your self-esteem safe and sound, and thus lack of adequate punishment perpetuates this behavior further.

You could say that my S1 values my slack so much that it's afraid to even look into the minds of others, aside from barely scraping the surface. There's no going back after tasting the forbidden fruit. "Yes, you're naked and ashamed, boo-hoo, go cry me a river, and fuck off from Heaven while you're at it" -- God, probably.

Already noticed the mistake I'm making? No?

Well here it is:

Other-awareness and other-modeling are two different things.

If a person who has no ToM gets suddenly struck by Insight from Above and gains the ability to See the Other, they'll ebb-and-flow between very unpleasant states of panic and less-but-still unpleasant states of confusion and disorientation. They didn't have the privilege of gradually easing into social reality and slowly mastering aspects of peopling, and reality hit'em like a brick. Being aware of the enormous pile of demands and expectations without knowing exactly what they are must feel like, I don't know, doom.

However, it mustn't stay so. If this newly-baked Seer survives the initial shock, they'll start developing better and better mental models of other people and accumulating useful heuristics. The Art of Prediction will get honest and polished with a zealous striving for perfection.
Now that the initial dip in effectiveness is persevered through, the Seer finds that their life has become much better, being able to subtly navigate between the streams of social information, manipulating these flows to their liking, enjoying the ever-increasing benefits of social aptitude, the sly beast!

Anyway, here it is. Only thing left is to feed experiential evidence into this new reframe, which requires crashing into reality, which can hardly be done by putting more words on the metaphorical paper.

So I guess I'll go out and do just that.



Discuss

For the past, in some ways, we are moral degenerates

7 июня, 2019 - 18:57
Published on June 7, 2019 3:57 PM UTC

Have human values improved over the last few centuries? Or is it just that current human values are naturally closer to our (current) human values and so we think that there's been moral progress towards us?

If we project out in the future, the first scenario posits continuing increased moral improvements (as the "improvement trend" continues) and the second posits moral degeneration (as the values drift away from our own). So what is it?

I'll make the case that both trends are happening. We have a lot less slavery, racism, ethnic conflicts, and endorsements of slavery, racism, and ethnic conflicts. In an uneven way, poorer people have more effective rights than they did before, so it's somewhat less easy to abuse them.

Notice something interesting about the previous examples? They can all be summarised as "some people who were treated badly are now treated better". Many people throughout time would agree that these people are actually being treated better. On the issue of slavery, consider the following question:

  • "If X would benefit from being a non-slave more than being a slave, and there were no costs to society, would it be better for X not to be a slave?"

Almost everyone would agree to that throughout history, barring a few examples of extremely motivated reasoning. So most defences of slavery rest on the idea that some classes of people are better off as slaves (almost always a factual error, and generally motivated reasoning), or that some morally relevant group of people benefited from slavery enough to make it worthwhile.

So most clear examples of moral progress are giving benefits to people, such that anyone who knew all the facts would agree it was beneficial for those people.

That trend we might expect to continue; as we gain greater knowledge how to benefit people, and as we gain greater resources, we can expect more people to be benefited.

Values that we have degenerated on

But I'll argue that there are a second class of values that have less of a "direction" to, and where we could plausibly be argued to have "degenerated". And, hence, where we might expect our descendants to "degenerate" more (ie move further away from us).

Community and extended family values, for example, are areas where much of the past would be horrified by the present. Why are people not (generally) meeting up with their second cousins every two weeks, and why do people waste time gossiping about irrelevant celebrities rather than friends and neighbours?

On issues of honour and reputation, why have we so meekly accepted to become citizens of administrative bureaucracies and defer to laws and courts, rather than taking pride in meeting out our own justice and defending our own honour? "Yes, yes", the hypothetical past person would say, "your current system is fairer and more efficient; but why did it have to turn you all so supine"? Are you not free men?

Play around with vaguely opposite virtues: spontaneity versus responsibility; rationality versus romanticism; pride versus humility; honesty versus tact, and so on. Where is the ideal mean between any of those two extremes? Different people and different cultures put the ideal mean in different places, and there's no reason to suspect that the means are "getting better" rather than just "moving around randomly".

I won't belabour the point; it just seems to me that there are areas where the moral progress narrative makes more sense (giving clear benefits to people who didn't have them) and areas where the "values drift around" narrative makes more sense. And hence we might hope for continuing moral progress in some areas, and degeneration (or at least stagnation) in others.



Discuss

Map of (old) MIRI's Research Agendas

7 июня, 2019 - 10:22
Published on June 7, 2019 7:22 AM UTC

(disclaimer: I do not work for MIRI)

In late 2016 I curated a visual map of the areas of research MIRI was working on at the time, together with descriptions of each area and arrows explaining the relationships between areas.

While MIRI's research has evolved since, I have foud myself going back to this map as a reference while doing research, so I though I would share it with the rest of you in case it might be useful for some.

Map of Research Areas in AI Alignment

The map is based on the Agent Foundations agenda by Soares and Fallenstein of 2014 and the Alignment for Advanced ML Systems agenda by Taylor et al in 2016.

Let me know if this was useful for you! It will be useful evidence on whether spending time curating more maps of the same style is a good use of my time.




Discuss

Fractional Reserve Charity

7 июня, 2019 - 08:14
Published on June 7, 2019 5:14 AM UTC

On the EA Corner discord server, various participants, most notably @wolframhead, tossed around the following idea: Some people of an EA bent may want to donate vast amounts of money, huge percentages of high salaries, but be held back in case they turn out to need that money as savings later. A health crisis, their job market failing them, or some other unforeseen but expensive eventuality could make them regret prior profligate giving once they need a runway for a few months with no paychecks, run into a pay cut, or have to contend with huge surprise bills. But of course not all of these people will actually experience such a crisis even though all of them run some risk of doing so. If there was a way for them to get their money back from the charity they donated to conditional on an emergency cropping up, the charity would be able to keep most such conditional funds even not taking into account the possibility that they could make interest on the capital in the time between donation and refund. Overall charity revenue would probably go up if this "fractional reserve" model were feasible. Here's a writeup of my understanding of the topic; the EA Corner has open membership for anyone who wants to go view the conversation in its original form.

1. Possible Structures

A - Charities could directly provide this service. This would require them to deal with the overhead of assessing emergency needs in their donors, which is probably not their core competency. The charity would operate under uncertainty about not only its future budget but also its current budget, as the rug could be pulled out from under them on sizable donations at any time. It'd also probably affect how EA money flowed to charities that managed this versus ones that did not, as a charity that has a refund possibility is "cheaper" to donate to than one that does not. The concept of "room for more funding" could possibly limit the strangling effect on charities that didn't participate to a point, but only to a point, and making lots of charities spend on overhead and uncertainty-tolerant budget forecasts just to remain competitive donation targets seems like a deadweight loss.

B - A middleman organization could form (or it could grow out of an existing meta-charity like Givewell). Such an organization would mostly just pass on donations to the intended recipients - you'd send them a thousand dollars marked for AMF, and AMF would get most of those dollars to spend on AMF things with no worries about paying it back. Some of what passed through the middleman organization would float as liquid assets and be kicked back to donors who needed it, the way a bank doesn't keep enough cash on hand for everyone to withdraw their savings all at once but can handle a normal rate of withdrawals without a blip. There would be some loss in the form of overhead for this organization, but plausibly less than the gain in effective donations generated by the insurance scheme.

C - A tontine[1]-like structure with less formal scaffolding could handle it. Groups of EAs who know and trust each other - and ideally have comparable budgets - could all pool donations of the same size, keep back that amount or perhaps twice that amount held by someone responsible in the group, and kick that amount back to whoever needs it first as assessed by other group members. A person with more money to throw around could belong to several of these. Relying on interpersonal rather than institutional honesty would reduce access to the model for people who are less socially in with other EAs and might make it harder to solidly verify people's emergency conditions by strict rubrics.

2. Probable Obstacles

A - Assorted regulatory complexity. Are any of the above structures legal? How will they interact with taxes, with nonprofit status, with whatever laws govern insurance even though (since in the vision I describe above payouts never exceed pay-ins) it's technically not insurance? Does it look like a tax dodge (is it a tax dodge)? Is it sort of like a donor-advised fund? What are the rules about those and which apply? How would you explain any of this to an accountant or an auditor without sounding insane?

B - Adverse selection. While anyone can worry about a trip to the hospital or a sudden firing, people who are more worried about these things might have more real reason to be worried, carrying elevated risk; expectations about payback rates based on baseline statistics about how common these things are may not be reliable, and a middleman that didn't account enough for that could wind up without enough cash on hand to return money to all its distressed donors, or a charity which provided the payback service in-house could try to budget its operations based on a guess that turns out too optimistic. Insurance companies have some institutional knowhow about how to handle this but it would increase overhead a lot.

C - Theft and fraud. While people presumably trust the charities they donate to, a donation the return of which could be debated based on assessed need would present an incentive for whoever held the bag to assess the need unsympathetically. A middleman organization or a private individual holding money for a tontine-style structure would face temptation to run off with the cash, too, and it might not be obvious this was happening right away if everyone who donated through them happened to be lucky and not expect the money back for an extended period of time. In the opposite direction, people might fake serious catastrophes to get their money back even if their use case for the money was not intended to be one that got a payback - this wouldn't be as appealing as normal insurance fraud, since you wouldn't get back more money than you put in in the first place, but it could get to be a problem if (say) a deceased donor's beneficiaries wanted to puff up their inheritances, or a third party believed they could intercept the money by posing as a donor in distress.

D - Bikeshedding and competition. If people have a lot of small differences of opinion about how this should work, that could support a large number of tontine-style setups, and different charities handling it in-house could "cheapen" their donations to various audiences of assorted inclinations, leaving each donor to decide if they prefer to deworm children under scheme 1 or net beds under scheme 2. But it'd sink a middleman organization to have only a handful of participants and be in competition with a lot of similar orgs, because it would run entirely on distribution of risk, and that requires a high population. Regular insurance companies have such a large pool of potential customers that they can manage splitting the pie; I don't know if the market for refundable donations is that big or could easily get that way (it's a weird idea that not even all preexisting EAs might want, for one thing).

3. Questions

A - Does something like this already exist? I would likely not know about it if it does (I don't handle my household finances or charity budget) but nobody else on the EA Corner mentioned "oh, you're talking about thus-and-such a program!", so perhaps it doesn't - or maybe there's a fledgeling one, or a poorly advertised sub-product belonging to some financial firm. Or maybe there's a way to squish something from its original shape into this shape without inventing new practices altogether.

B - Would this see enough use to benefit from risk pooling and to increase effective charity revenue? EAs move a fair amount of money already. While it would be nice if they endured less risk in so doing, the really high leverage of setting up fractional reserve charity options would be if it increased donations relative to the counterfactual. I don't have a sense of how large this space really is.

C - Are EA emergencies too correlated, e.g. are too many of us in a position to suffer from collapse in the tech sector for any EA techies to consider this employment insurance lite? We probably don't have correlated medical problems unless someone torches an EA Global, but if a bubble a lot of EAs work in popped, and every displaced software engineer or quant were each of them expecting five plus figures returned from their charities, that would capsize anything inadequately diversified in its path. Is there much use to a version that only refunds in case of medical emergency?

--

[1]I learned the word "tontine" in a form different from the apparently central use case; the meaning I intend is a group savings/loan plan where (as an example of one way it can be structured) every one of N people in a group pays X per month into a shared pool, and every month a different member of the group gets N*X in a lump sum until N months have gone by.



Discuss

Grokking that the Schelling Choice is "Rabbit", not "Stag"

7 июня, 2019 - 07:57
Published on June 7, 2019 4:57 AM UTC

Followup/distillation/alternate-take on Duncan Sabien's Dragon Army Retrospective and Open Problems in Group Rationality.

There's a particular failure mode, which I've witnessed, and fallen into myself:

I see a problem. I see, what seems to me, to be an obvious solution to the problem. If only everyone Took Action X, we could Fix Problem Z. So I start X-ing, and maybe talking about how other people should start X-ing. Action X takes some effort on my part but it's obviously worth it.

And yet... nobody does. Or not enough people do. And a few months later, here I'm still taking Action X and feeling burned and frustrated.

Or –

– the problem is that everyone is taking Action Y, which directly causes Problem Z. If only everyone would stop Y-ing, Problem Z would go away. Action Y seems obviously bad, clearly we should be on the same page about this. So I start noting to people when they're doing Action Y, and expect them to stop.

They don't stop.

So I start subtly socially punishing them for it.

They don't stop. What's more... now they seem to be punishing me.

I find myself getting frustrated, perhaps angry. What's going on? Are people wrong-and-bad? Do they have wrong-and-bad beliefs?

Alas. So far in my experience it hasn't been that simple.

A Recap of 'Rabbit' vs 'Stag'

I'd been planning to write this post for years. Duncan Sabien went ahead and wrote it before I got around to it. But, Dragon Army Retrospective and Open Problems in Group Rationality are both lengthy posts with a lot of points, and it still seemed worth highlighting in a single post.

I used to think a lot of a lot in terms of Prisoner's Dilemma, and Cooperate/Defect. I'd see problems that could easily be solved if everyone just put a bit of effort in, which would benefit everyone. And people didn't put the effort in, and this felt like a frustrating, obvious coordination failure. Why do people defect so much?

Eventually Duncan shifted towards using Stag Hunt rather than Prisoner's Dilemma as the model here. If you haven't read it before, it's worth reading the description in full. If you're familiar you can skip to my current thoughts below.

My new favorite tool for modeling this is stag hunts, which are similar to prisoner’s dilemmas in that they contain two or more people each independently making decisions which affect the group. In a stag hunt:—Imagine a hunting party venturing out into the wilderness.— Each player may choose stag or rabbit, representing the type of game they will try to bring down.— All game will be shared within the group (usually evenly, though things get more complex when you start adding in real-world arguments over who deserves what).— Bringing down a stag is costly and effortful, and requires coordination, but has a large payoff. Let’s say it costs each player 5 points of utility (time, energy, bullets, etc.) to participate in a stag hunt, but a stag is worth 50 utility (in the form of food, leather, etc.) if you catch one.— Bringing down rabbits is low-cost and low-effort and can be done unilaterally. Let’s say it only costs each player 1 point of utility to hunt rabbit, and you get 3 utility as a result.— If any player unexpectedly chooses rabbit while others choose stag, the stag escapes through the hole in the formation and is not caught. Thus, if five players all choose stag, they lose 25 utility and gain 50 utility, for a net gain of 25 (or +5 apiece). But if four players choose stag and one chooses rabbit, they lose 21 utility and gain only 3.This creates a strong pressure toward having the Schelling choice be rabbit. It’s saner and safer (spend 5, gain 15, net gain of 10 or +2 apiece), especially if you have any doubt about the other hunters’ ability to stick to the plan, or the other hunters’ faith in the other hunters, or in the other hunters’ current resources and ability to even take a hit of 5 utility, or in whether or not the forest contains a stag at all.Let’s work through a specific example. Imagine that the hunting party contains the following five people:Alexis (currently has 15 utility “in the bank”)Blake (currently has 12)Cameron (9)Dallas (6)Elliott (5)If everyone successfully coordinates to choose stag, then the end result will be positive for everyone. The stag costs everyone 5 utility to bring down, and then its 50 utility is divided evenly so that everyone gets 10, for a net gain of 5. The array [15, 12, 9, 6, 5] has bumped up to [20, 17, 14, 11, 10].If everyone chooses rabbit, the end result is also positive, though less excitingly so. Rabbits cost 1 to hunt and provide 3 when caught, so the party will end up at [17, 14, 11, 8, 7].But imagine the situation where a stag hunt is attempted, but unsuccessful. Let’s say that Blake quietly decides to hunt rabbit while everyone else chooses stag. What happens?Alexis, Cameron, Dallas, and Elliott each lose 5 utility while Blake loses 1. The rabbit that Blake catches is divided five ways, for a total of 0.6 utility apiece. Now our array looks like [10.6, 11.6, 4.6, 1.6, 0.6].(Remember, Blake only spent 1 utility in the first place.)If you’re Elliott, this is a super scary result to imagine. You no longer have enough resources in the bank to be self-sustaining—you can’t even go out on another rabbit hunt, at this point.And so, if you’re Elliott, it’s tempting to preemptively choose rabbit yourself. If there’s even a chance that the other players might defect on the overall stag hunt (because they’re tired, or lazy, or whatever) or worse, if there might not even be a stag out there in the woods today, then you have a strong motivation to self-protectively husband your resources. Even if it turns out that you were wrong about the others, and you end up being the only one who chose rabbit, you still end up in a much less dangerous spot: [10.6, 7.6, 4.6, 1.6, 4.6].Now imagine that you’re Dallas, thinking through each of these scenarios. In both cases, you end up pretty screwed, with your total utility reserves at 1.6. At that point, you’ve got to drop out of any future stag hunts, and all you can do is hunt rabbit for a while until you’ve built up your resources again.So as Dallas, you’re reluctant to listen to any enthusiastic plan to choose stag. You’ve got enough resources to absorb one failure, and so you don’t want to do a stag hunt until you’re really darn sure that there’s a stag out there, and that everybody’s really actually for real going to work together and try their hardest. You’re not opposed to hunting stag, you’re just opposed to wild optimism and wanton, frivolous burning of resources.Meanwhile, if you’re Alexis or Blake, you’re starting to feel pretty frustrated. I mean, why bother coming out to a stag hunt if you’re not even actually willing to put in the effort to hunt stag? Can’t these people see that we’re all better off if we pitch in hard, together? Why are Dallas and Elliott preemptively talking about rabbits when we haven’t even tried catching a stag yet?I’ve recently been using the terms White Knight and Black Knight to refer, not to specific people like Alexis and Elliott, but to the roles that those people play in situations requiring this kind of coordination. White Knight and Black Knight are hats that people put on or take off, depending on circumstances.The White Knight is a character who has looked at what’s going on, built a model of the situation, decided that they understand the Rules, and begun to take confident action in accordance with those Rules. In particular, the White Knight has decided that the time to choose stag is obvious, and is already common knowledge/has the Schelling nature. I mean, just look at the numbers, right?The White Knight is often wrong, because reality is more complex than the model even if the model is a good model. Furthermore, other people often don’t notice that the White Knight is assuming that everyone knows that it’s time to choose stag—communication is hard, and the double illusion of transparency is a hell of a drug, and someone can say words like “All right, let’s all get out there and do our best” and different people in the room can draw very different conclusions about what that means.So the White Knight burns resources over and over again, and feels defected on every time someone “wrongheadedly” chooses rabbit, and meanwhile the other players feel unfairly judged and found wanting according to a standard that they never explicitly agreed to (remember, choosing rabbit should be the Schelling option, according to me), and the whole thing is very rough for everyone.If this process goes on long enough, the White Knight may burn out and become the Black Knight. The Black Knight is a more mercenary character—it has limited resources, so it has to watch out for itself, and it’s only allied with the group to the extent that the group’s goals match up with its own. It’s capable of teamwork and coordination, but it’s not zealous. It isn’t blinded by optimism or patriotism; it’s there to engage in mutually beneficial trade, while taking into account the realities of uncertainty and unreliability and miscommunication.The Black Knight doesn’t like this whole frame in which doing the safe and conservative thing is judged as “defection.” It wants to know who this White Knight thinks he is, that he can just declare that it’s time to choose stag, without discussion or consideration of cost. If anyone’s defecting, it’s the White Knight, by going around getting mad at people for following local incentive gradients and doing the predictable thing.But the Black Knight is also wrong, in that sometimes you really do have to be all-in for the thing to work. You can’t always sit back and choose the safe, calculated option—there are, sometimes, gains that can only be gotten if you have no exit strategy and leave everything you’ve got on the field.I don’t have a solution for this particular dynamic, except for a general sense that shining more light on it (dignifying both sides, improving communication, being willing to be explicit, making it safe for both sides to be explicit) will probably help. I think that a “technique” which zeroes in on ensuring shared common-knowledge understanding of “this is what’s good in our subculture, this is what’s bad, this is when we need to fully commit, this is when we can do the minimum” is a promising candidate for defusing the whole cycle of mutual accusation and defensiveness.(Circling with a capital “C” seems to be useful for coming at this problem sideways, whereas mission statements and manifestos and company handbooks seem to be partially-successful-but-high-cost methods of solving it directly.)

The key conceptual difference that I find helpful here is acknowledging that "Rabbit" / "Stag" are both positive choices, that bring about utility. "Defect" feels like it brings in connotations that aren't always accurate.

Saying that you're going to pay rent on time, and then not, is defecting.

But if someone shows up saying "hey lets all do Big Project X" and you're not that enthusiastic about Big Project X but you sort of nod noncommittally, and then it turns out they thought you were going to put 10 hours of work into it and you thought you were going to put in 1, and then they get mad at you... I think it's more useful to think of this as "choosing rabbit" than "defecting."

Likewise, it's "rabbit" if you say "nah, I just don't think Big Project X is important". Going about your own projects and not signing up for every person's crusade is a perfectly valid action.

Likewise, it's "rabbit" if you say "look, I realize we're in a bad equilibrium right now and it'd be better if we all switched to A New Norm. But right now the Norm is X, and unless you are actually sure that we have enough buy-in for The New Norm, I'm not going to start doing a costly thing that I don't think is even going to work."

A lightweight, but concrete example

At my office, we have Philosophy Fridays*, where we try to get sync about important underlying philosophical and strategic concepts. What is our organization for? How does it connect to the big picture? What individual choices about particular site-features are going to bear on that big picture?

We generally agree that Philosophy Friday is important. But often, we seem to disagree a lot about the right way to go about it.

In a recent example: it often felt to me that our conversations were sort of meandering and inefficient. Meandering conversations that don't go anywhere is a stereotypical rationalist failure mode. I do it a lot by default myself. I wish that people would punish me when I'm steering into 'meandering mode'.

So at some point I said 'hey this seems kinda meandering.'

And it kinda meandered a bit more.

And I said, in a move designed to be somewhat socially punishing: "I don't really trust the conversation to go anywhere useful." And then I took out my laptop and mostly stopped paying attention.

And someone else on the team responded, eventually, with something like "I don't know how to fix the situation because you checked out a few minutes ago and I felt punished and wanted to respond but then you didn't give me space to."

"Hmm," I said. I don't remember exactly what happened next, but eventually they explained:

Meandering conversations were important to him, because it gave them space to actually think. I pointed to examples of meetings that I thought had gone well, that ended will google docs full of what I thought had been useful ideas and developments. And he said "those all seemed like examples of mediocre meetings to me – we had a lot of ideas, sure. But I didn't feel like I actually got to come to a real decision about anything important."

"Meandering" quality allowed a conversation to explore subtle nuances of things, to fully explore how a bunch of ideas would intersect. And this was necessary to eventually reach a firm conclusion, to leave behind the naggling doubts of "is this *really* the right path for the organization?" so that he could firmly commit to a longterm strategy.

We still debate the right way to conduct Philosophy Friday at the office. But now we have a slightly better frame for that debate, and awareness of the tradeoffs involved. We discuss ways to get the good elements of the "meandering" quality while still making sure to end with clear next-actions. And we discuss alternate modes of conversation we can intelligently shift between.

There's a time when I would have pre-emptively gotten really frustrated, and started rationalizing reasons why my teammate was willfully pursuing a bad conversational norm. Fortunately I had thought enough about this sort of problem that I noticed that I was failing into a failure mode, and shifted mindsets.

Rabbit in this case was "everyone just sort of pursues whatever conversational types seem best to them in an uncoordinated fashion", and Stag is "we deliberately choose and enforce particular conversational norms."

We haven't yet coordinated enough to really have a "stag" option we can coordinate around. But I expect that the conversational norms we eventually settle into will be better than if we had naively enforced either my or my teammate's preferred norms.

Takeaways

There seem like a couple important takeaways here, to me.

One is that, yes:

Sometimes stag hunts are worth it.

I'd like people in my social network to be aware that sometimes, it's really important for everyone to adopt a new norm, or for everyone to throw themselves 100% into something, or for a whole lot of person-hours to get thrown into a project.

When discussing whether to embark on a stag hunt, it's useful to have shorthand to communicate why you might ever want to put a lot of effort into a concerted, coordinated effort. And then you can discuss the tradeoffs seriously.

I have more to say about what sort of stag hunts seem do-able. But for this post I want to focus primarily on the fact that...

The Schelling Option is Rabbit

Some communities have established particular norms favoring 'stag'. But in modern, atomic, Western society you should probably not assume this as a default. If you want people to choose stag, you need to spend special effort building common knowledge that Big Project X matters, and is worthwhile to pursue, and get everyone on board with it.

Corollary: Creating common knowledge is hard. If you haven't put in that work, you should assume Big Project X is going to fail, and/or that it will require a few people putting in herculean effort "above their fair share", which may not be sustainable for them.

This depends on whether effort is fungible. If you need 100 units of effort, you can make do with one person putting in 100 units of effort. If you need everyone to adopt a new norm that they haven't bought into, it just won't work.

If you are proposing what seems (to you) quite sensible, but nobody seems to agree...

...well, maybe people are being biased in some way, or motivated to avoid considering your proposed stag-hunt. (People sure do seem biased about things, in general, even when they know about biases, so this may well be at least part of the issue)

But I think it's quite likely that you're dramatically underestimating the inferential distance – both the distance between their outlook and "why your proposed action is good", as well as the distance between your outlook and "why their current frame is weighing tradeoffs very differently than your current frame."

The short summary of my own strategies right now are...

In general, choose Rabbit.

  • Keep at least 30% slack in reserve (such that I can absorb not one, not two, but three major surprise costs without starting to burn out). Don't spend energy helping others if I've dipped below 30% for long.
  • Find local improvements I can make that don't require much coordination from others.

Follow Rabbit Trails into Stag Country

Given a choice, seek out "Rabbit" actions that preferentially build option value for improved coordination later on.

  • Metaphorically, this means "Follow rabbit trails that lead into Stag-and-Rabbit Country", where I'll have opportunities to say:
    • "Hey guys I see a stag! Are we all 100% up for hunting it?" and then maybe it so happens we can stag hunt together.
    • Or, I can sometimes say, at small-but-manageable-cost-to-myself "hey guys, I see a whole bunch of rabbits over there, you could hunt them if you want."
  • Sliiightly more concretely, this means: given the opportunity, without requiring actions on the part of other people... pursue actions that demonstrate my trustworthiness, and which build bits of infrastructure that'll make it easier to work together in the future.

(I'll hopefully have more to say about this in the future.)

Occasionally run informal "Kickstarters for Coordinated Action", for actual Stag Hunts. If enough people actually commit, Hunt Stag.

For example, the call-to-action in my Relationship Between the Village and Mission post (where I asked people to contact me if they were serious about improving the Village) was designed to give me information about whether it's possible to coordinate on a Staghunt to improve the Berkeley rationality village.



Discuss

Visiting the Bay Area from 17-30 June

7 июня, 2019 - 05:40
Published on June 7, 2019 2:40 AM UTC

I'll be in the Bay Area from the 17-30 June for training as a TripleByte Interviewer. My interests include:

  • Effective altruism - particularly movement building
  • Agent foundations - particularly updateless decision theory and anthropics
  • Infinite ethics
  • More broadly, the boundary of philosophy of maths, but also philosophy in general

PM me if you want to meet up or if you have any recommendations.




Discuss

Arbital scrape

7 июня, 2019 - 03:04
Published on June 6, 2019 11:11 PM UTC

I've scraped http://arbital.com as the site is unusably slow and hard to search for me.

The scrape is locally browsable and plain HTML save for MathJax.

https://drive.google.com/open?id=1b7dKhOzfMpFwngAeI8efeOzv147Lv_mx

If there's interest let me know as I may tidy up and open source my code.




Discuss

Circle Games

6 июня, 2019 - 19:40
Published on June 6, 2019 4:40 PM UTC

I may be reinventing a known thing in child development or psychology here, but bear with me.

The simplest games I see babies play — games simple enough that cats and dogs can play them too — are what I’d call “circle games.”

Think of the game of “fetch”.  I throw the ball, Rover runs and brings it back, and then we repeat, ad infinitum.  (Or, the baby version: baby throws the item out of the stroller, I pick it up, and then we repeat.)

Or, “peek-a-boo.” I hide, I re-emerge, baby laughs, repeat.

My son is also fond of “open the door, close the door, repeat”, or “open the drawer, close the drawer, repeat”, which are solo circle games, and “together/apart”, where he pushes my hands together and apart and repeats, and of course being picked up and put down repeatedly.

A lot of toys are effectively solo circle games in physical form.  The jack-in-the-box: “push a button, out pops something! close the box, start again.” Fidget toys with buttons and switches to flip: “push the button, get a satisfying click, repeat.”

It’s obvious, observing a small child, that the purpose of these “games” is learning.  And, in particular, learning cause and effect.  What do you learn by opening and closing a door? Why, how to open and close doors; or, phrased a different way, “when I pull the door this way, it opens; when I push it that way, it closes.” Playing fetch or catch teaches you about how objects move when dropped or thrown.  Playing with button-pushing or latch-turning toys teaches you how to handle the buttons, keys, switches, and handles that are ubiquitous in our built environment.

But what about peek-a-boo? What are you “learning” from that? (It’s a myth that babies enjoy it because they don’t have object permanence; babies get object permanence by 3 months, but enjoy peek-a-boo long after that.) My guess is that peek-a-boo trains something like “when I make eye contact I get smiles and positive attention” or “grownups go away and then come back and are happy to see me.” It’s social learning.

It’s important for children to learn, generally, “when I act, the people around me react.” This gives them social efficacy (“I can achieve goals through interaction with other people”), access to social incentives (“people respond positively when I do this, and negativey when I do that”), and a sense of social significance (“people care enough about me to respond to my actions.”) Attachment psychology argues that when babies and toddlers don’t have any adults around who respond to their behavior, their social development goes awry — neglected children can be extremely fearful, aggressive, or checked-out, missing basic abilities in interacting positively with others.

It’s clear just from observation that the social game of interaction — “I make a sound, you make a sound back” — is learned before verbal speech.  Preverbal babies can even execute quite sophisticated interaction patterns, like making the tonal pattern of a question followed by an answering statement.  This too is a circle game.

The baby’s fascination with circle games completely belies the popular notion that drill is an intrinsically unpleasant way to learn. Repetition isn’t boring to babies who are in the process of mastering a skill. They beg for repetition.

My personal speculation is that the “craving for drill”, especially in motor learning, is a basal ganglia thing; witness how abnormalities in the ganglia are associated with disorders like OCD and Tourette’s, which involve compulsive repetition of motor activities; or how some dopaminergic drugs given to Parkinsonian patients cause compulsions to do motor activities like lining up small objects or hand-crafts. Introspectively, a “gear can engage” if I get sufficiently fascinated with something and I’ll crave repetition — e.g. craving to listen to a song on repeat until I’ve memorized it, craving to get the hang of a particular tricky measure on the piano — but there’s no guarantee that the gear will engage just because I observe that it would be a good idea to master a particular skill.

I also think that some kinds of social interaction among adults are effectively circle games.

Argument or fighting, in its simplest form, is a circle game: “I say Yes, you say No, repeat!” Of course, sophisticated arguments go beyond this; each player’s “turn” should contribute new information to a logical structure. But many arguments in practice are not much more sophisticated than “Yes, No, repeat (with variations).”  And even intellectually rigorous and civil arguments usually share the basic turn-taking adversarial structure.

Now, if the purpose of circle games is to learn a cause-and-effect relationship, what are we learning from adversarial games?

Keep in mind that adversarial play — “you try to do a thing, I try to stop you” — kicks in very early and (I think) cross-culturally. It certainly exists across species; puppies do it.

Almost tautologically, adversarial play teaches resistance.  When you push on others, others push back; when others push on you, you push back.

War, in the sense we know it today, may not be a human universal, and certainly isn’t a mammalian universal; but resistance seems to be an inherent feature of social interaction between any beings whose interests are imperfectly aligned.

A lot of social behaviors generally considered maladaptive look like adversarial circle games. Getting sucked into repetitive arguments? That’s a circle game. Falling into romantic patterns like “you want to get closer to me, I pull away, repeat”? Circle game.  Being shocking or reckless to get attention? Circle game.

The frame where circle games are for learning suggests that people do these things because they feel like they need more practice learning the lesson.  Maybe people who are very combative feel, on some level, that they need to “get the hang of” pushing back against social resistance, or conversely, learning how not to do things that people will react badly to.  It’s unsatisfying to feel like a ghost, moving through the world but not getting any feedback one way or another. Maybe when people crave interaction, they’re literally craving training data.

If you always do A, and always get response B, and you keep wanting to repeat that game, for much longer than is “normal”, then a couple things might be happening:

  • Your “learning algorithm” has an unusually slow “learning rate” such that you just don’t update very efficiently on what ought to be ample data (in general or in this specific context).
  • You place a very high importance on the A-B relationship such that you have an unusually high need to be sure of it.  (e.g. your algorithm has a very high threshold for convergence.) So even though you learn as well as anybody else, you want to keep learning for longer.
  • You have a very strong “prior” that A does not cause B, which it takes a lot of data to “disprove.”
  • You have something like “too low a rate of stochasticity.”  What you actually need is variation — you need to see that A’ causes B’ — but you’re stuck in a local rut where you can’t explore the space properly so you just keep swinging back and forth in that rut. But your algorithm keeps returning “not mastered yet”. (You can get these effects in algorithms as simple as Newton’s Method.)
  • You’re not actually trying to learn “A causes B.” You’re trying to learn “C causes D.” But A correlates weakly with C, and B correlates weakly with D, and you don’t know how to specifically do C, so you just do A a lot and get intermittent reinforcement.

These seem like more general explanations of how to break down when repetition will seem “boring” vs. “fascinating” to different people or in different contexts.



Discuss

Story reminder

6 июня, 2019 - 16:35
Published on June 6, 2019 1:35 PM UTC

[Epistemic status: Fictional]

While waking up I did my basic routine: wake up, eat, read news go to school. But this time something different happened, something unlikely. Sitting in the kitchen and then my phone screen just made a static like on the old TVs and didn't work for seconds and then I got an Evernote reminder. I have been using lately Evernote reminders to help ease my days, Evernote has this reminder that can be set to a specific date and time.

I hate those thought and without looking, you know how I like everything silent and so I muted my phone. The commute was normal up until I saw lightings in the morning, a horrible storm would come. School was over so I came home with same arrow shaped trains that we got in Romania. It did had to wait a little bit but everything was fine. Staring at my phone while others looked at the horrible rain with the thunder that just came. There was lighting every few half a minutes and so. Funny was when a lightning stroke a Mobile Virtual Network operator, those towers. When I arrived in my hometown, we people usually don't go through the tunnels but this time was raining. But I still didn't go through the tunnel, then it went silent... Someone that I let ahead of me (young lady) was hurrying and wasn't aware that a train with full speed... And she is dead, holy f***?!

I just let a scream and don't know what to do. Everyone was terrified of what happened, everyone saw while waiting for the rain to stop. Still raining aggressively I wasn't sure what to do. My father got us home with the car.

DO NOT LET HER GO AHEAD

I wrote to Evernote, and thought this foolish thing would go message would go back in time. I set it for 2:32 PM 16th May 2019 Close my eyes and hope to see it before and repair... Nothing happens. Hahaha... This is stupid...

I got home an run quickly to not get more wet, put my wet coat on the on the sofa. Looked at the kitchen and then it hit me: or maybe is just that I couldn't see it because I set it on mute (this morning). I need to try again. So I did and set it to see it before I put it on mute...

don'tLetHerGoAhead 6:45 AM 16th May 2019

So I did and closed my eyes.. just a few bits of information in the message... transferring the bits of information through mobile data for Evernote to upload it in the cloud... I just heard a thunder... Did something changed? I hope so, I move on and see that I hadn't read two notifications, my phone blue led on the top right was flickering so I saw it and it read: DO NOT LET HER GO AHEAD (To be continued)



Discuss

Impact of aging research besides LEV

6 июня, 2019 - 13:42
Published on June 6, 2019 10:42 AM UTC

This is the third post of a series in which I'm trying to build a framework to evaluate aging research. Previous posts:

1. A general framework for evaluating aging research. Part 1: reasoning with Longevity Escape Velocity

2. Aging research and population ethics

Summary

Age-related disease and decline account for, more or less, 30% of total DALYs in the 2017 Global Disease Report made by the World Health Organisation, and the figure is rising sharply each year. Aside from DALYs lost, there's evidence that age-related disease and disability negatively impact life satisfaction. Better healthcare for age-related diseases would result in social and economic benefits, such as an increased global workforce, a decrease of social burden due to age-related diseases, a decrease in individual healthcare expenditure, and less strain on the social security systems, pension systems, and healthcare systems of the world's governments. Currently, all these variables are worsening due to an aging global population. Other, more subtle, sources of impact could be an increased concern for the long-term future along with more preservation of knowledge, which could result in new breakthroughs and improved day-to-day human activities.

Aging research would also impact pets in similar ways to humans, as similar considerations about LEV and DALYs averted apply.

DALYs averted

The last Global Burden of Disease report from the World Health Organisation was made in 2017. It released a handy data visualisation tool, from which we can extract the fraction of DALYs attributable to age-related diseases and decline and their annual percent change from 1990 to 2017. If you sum the percentages, also trying to account for the fraction of age-related ailments not necessarily caused by aging (although this could be unnecessary, since even if some of these diseases appear also in the young, they share the same underlying characteristics), the percentage of total DALYs you get is more or less 30%. Some age-related diseases and ailments accounting for a large fraction of DALYs are: Ischemic heart disease: 6.83%, Stroke: 5.29%, Diabetes Mellitus: 2.71%, Back pain: 2.59%, Lung cancer: 1.64%, Hearing Loss: 1.36%, Alzheimer's: 1.22%, Blindness: 0.71%

The weighted average of annual percent change is positive and sharp, as expected from the aging of the world population and a better control of communicable diseases worldwide. The percent annual change for some of the most impactful diseases: Diabetes Mellitus: +1.58% Osteoarthritis: +1.53%, Alzheimer's +1.55%, Hearing Loss: +0.9%, Blindness: +0.81%, Lung cancer: +0.48%, liver cancer: +0.37%, colorectal cancer: +0.56%, breast cancer: +0.67%. Most impactful age-related diseases that presented negative annual percent change: Stomach cancer: -1.29%, Stroke: -0.22%.

It's also interesting to point out that age-related diseases are the leading cause of deathin the world, with cardiovascular diseases and cancers alone causing almost 50% of total deaths. Even if the population is restricted to only 15-49 year olds, these two diseases remain the leading cause of death, and they share the same characteristics.

SarahC, the founder of The Longevity Research Institute, wrote an overview of the potential cost-effectiveness of the cause area, using DALYs as a measure of impact. She also wrote a more in-depth article than this section, with all the relevant statistics about DALYs and deaths caused by aging using informational graphs and pictures.

Impact on life satisfaction

Aside from DALYs lost, there's evidence that age-related disease and disability negatively impact life satisfaction. Sarah Constantin lays out the evidence in the second part of her article about the impact of age-related diseases.

While people tend to weigh physical disability as less impactful on happiness than mental illness, the data seems to roughly suggest that more severe DALY burden and mortality goes along with less happiness. In fact, people entering severe disability suffer a sustained loss in life satisfaction. According to the Gallup World Poll, health is positively correlated (r=0.25, p<0.01) with life satisfaction across countries, but it’s not the largest effect; social support and GDP have larger correlations. According to the World Values Survey, self-reported "excellent health" had the second-largest effect size on life satisfaction, and according to a 43-country Pew Research poll, people in emerging markets were most likely (68%) to rate as "very important" good health among various options, such as good education (65%) for children, safety from crime (64%), and owning a home (63%). According to a Gallup poll, Americans rate family and health as the most important aspects in life, more than work, friends, money, religion, leisure, hobbies, and communities.

The longevity dividend

The term "longevity dividend" is often used to refer to the health, social, and economic benefits that result from delayed aging. Here, I will use it to refer to the social and economic benefits.

Let's go through some obvious ones: people would be able to contribute to the economy for longer and would be less of a burden on it. Individual healthcare expenditures would fall, since half of lifetime expenditure on healthcare is made in the last ten years of life.

Some societal issues, which would be ameloriated by better treatment of the diseases and disabilities of aging, are currently worsening due to an increasingly aged world population. This trend is due to decreasing mortality due to infectious diseases and other causes of death along with a fertility rate decreasing hand-in-hand with improved standards of living all over the world.

Current geriatric medicine isn't able to address these problems. On the contrary, it worsens them, since it lets people live longer in ill health. The goal and the direction of modern biogerontology, instead, is to prolong life in good health. In fact, this analysis is about the impact of advancing aging research and not increasing geriatric healthcare funding.

According to the United Nations Population Prospects:

The global population is ageing as fertility declines and life expectancy increases. In 2015, 12 per cent of the global population, or 901 million people, were aged 60 or over. The number of older persons is growing at an annual rate of 3.3 per cent, faster than any other age group. Due to a projected overall reduction in fertility, population ageing will continue at high levels globally, and by 2050, 22 per cent of the total population, or 2.1 billion persons, will be aged 60 or over. Currently, Europe has the highest percentage of population aged 60 or over (24 per cent), but rapid ageing will occur in other parts of the world as well. All major areas of the world, except for Africa, will have nearly a quarter or more of their populations aged 60 or over by 2050. Countries need to anticipate and plan for population ageing and ensure the well-being of older persons with regard to the protection of their human rights, their economic security, access to appropriate health services, and formal and informal support networks.

The trend is proving true all over the world, less developed nations included.

Less-developed nations currently don't have the economic means to face the burden of an aging population and are already facing increased mortality form age-related diseases. This poses a very serious problem for these nations' healthcare systems and social security systems.

In his article on Global Risks Insight, professor of Finance and Economics Antonio Guarino sums up the problems that are increasingly arising in social security, pensions and healthcare due to an aging population.

Security system and pensions:

One key economic implication of an aging population is the strain on social insurance programs and pension systems.  With a large increase in an aging population, many nations must raise their budget allocations for social security.  For example, India’s social security system presently covers only 10 percent of its working-age populace, but its system is operating at a deficit with more funds exiting than entering.  In the United States, projections state that the level of social security contributions will start to fall short of legislated benefits this year.In other words, the amount of money coming into social security will lessen due to fewer contributions from workers and more funds going to an aging retired population.  In Europe, in order to fund their social security system, 24 nations have payroll tax rates equaling or exceeding 20 percent of wages.  The situation is also precarious for pensions.  As the global population for elderly and retired workers increases, pensions must provide more income to these recipients so that they can enjoy at least a reasonable standard of living.The problem for pensions is the declining number of younger workers thus resulting in lower funds being contributed and necessitating a higher return for their investments.  The problem is compounded when public pension plans in certain nations actually encourage workers to retire early, thus making retiree payouts more expensive than ever before.

Health-care system:

Another key economic implication of an aging population is the increase in health care costs.  As the population ages, health generally declines with more medical attention required such as doctor visits, surgery, physical therapy, hospital stays, and prescription medicine.There are also increased cases of cancer, Alzheimer’s, and cardiovascular problems.  All of this costs money and will increase not only due to rising demand by an aging populace, but also because of inflation.  For example, in the United States, it is projected that public health expenditures will rise from 6.7 percent of GDP in 2010 to 14.9 percent in the year 2050.   This increase in health care costs will mean that nations must put more funds and human resources into providing health care while also attending to the needs of other segments of their people.With an increasing aged population, there will also be shortages of skilled labor trained to care for aged patients.  It is projected that the registered nurse workforce in the United States will see a decline of nearly 20 percent by 2020 which is below projected requirements.

He concludes with the problem arising form a declining global workforce:

With an aging global population, economic growth will also be impacted.  Most importantly, there will ultimately be less workers available for firms to make products and provide services.  This shrinking labor force will mean that fewer workers must support greater numbers of retirees since they must pay taxes for social security, health care programs, and public pension benefits.  The global workforce will shrink causing policy and economic concerns.  For example, India is expected to see a 46 percent increase in its working-age population over the next quarter century, however, there will be markedly slower economic growth of only 9 percent over the following 20 years.  In order for India to become a major player in the world economy, it must have a growing and vibrant workforce.Nations, such as China and Mexico, are expected to witness declines in their workforce from the year 2030 to 2050.  Other large economies such as Japan are projected to see a 19 percent decrease in their worker population within the next 25 years followed by a 24 percent decrease over the next 20 years.  Europe is also expected to see declining numbers in its workforce which will impact their chance to have a growing, competitive economy.This decline in the global workforce will lead to an increase in the age dependency ratio which is the ratio of working-age to old-age individuals.  Globally, the dependency ratio in 1970 was 10 workers for each individual over age 64, but the expected ratio in 2050 is four workers for each person over 64.

Additional sources about the topic of this article include Statistics on healthcare expenditure organised by Our World in Data, The Lifetime Distribution of Healthcare Costs, and The Health and Wealth of Nations, a landmark paper about the positive cause-effect relationship between the health of a nation and its economy.

Therapies targeting aging could ameliorate all of these problems by providing true technological solutions to aging as opposed to prolonging ill health without treating its causes. In the long run, the best-case scenario could be the entire population being healthy enough to contribute to the economy, with healthcare expenditures reduced only to treatments that directly cure ill health (the elimination of geriatric medicine) with prices being reduced over time; a pension system would also be unnecessary.

These benefits are the most apparent ones. There could be subtler ones not immediately obvious from current trends or data or which are just guesses for which conclusive data is currently lacking. Examples:

  • It's possible that longer lived people would care about longer term issues, like climate change and existential risk.
  • Longer lifespans could mean more personal experience and knowledge being preserved. This could help to make discoveries otherwise not accessible only with knowledge and insight gained in a normal human lifespan, but could also help every other human activity, like parenting.
Impact on non-human animals

All of the high-level considerations about LEV and averting DALYs written in this post and in the previous ones also apply to pets. It's also possible that effective rejuvenating therapies will arrive for certain kinds of pets before they arrive for humans.

I expect impact to be very large on this category, possibly rivaling the one on humans. Even if a small minority of pet owners will be able to afford therapies for their pets at first, the fraction would likely increase as prices fall. It's also worthy to note that the same proof I laid out in my previous post applies: how a technology spreads and how much time it takes to reach its maximum share of usage doesn't impact cost-effectiveness evaluations if the source of impact is hastening the arrival of that technology.

----------------------------------------------------------------------------

Crosspost from the EA Forum



Discuss

[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming

6 июня, 2019 - 02:20
Published on June 5, 2019 11:20 PM UTC

[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming View this email in your browser

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Highlights

Designing robust & reliable AI systems and how to succeed in AI (Rob Wiblin and Pushmeet Kohli): (As is typical for large content, I'm only summarizing the most salient points, and ignoring entire sections of the podcast that didn't seem as relevant.)

In this podcast, Rob delves into the details of Pushmeet's work on making AI systems robust. Pushmeet doesn't view AI safety and AI capabilities as particularly distinct -- part of building a good AI system is ensuring that the system is safe, robust, reliable, and generalizes well. Otherwise, it won't do what we want, so why would we even bother using it. He aims to improve robustness by actively searching for behaviors that violate the specification, or by formally verifying particular properties of the neural net. That said, he also thinks that one of the major challenges here is in figuring out the specification of what to verify in the first place.

He sees the problems in AI as being similar to the ones that arise in programming and computer security. In programming, it is often the case that the program that one writes down does not accurately match the intended specification, leading to bugs. Often we simply accept that these bugs happen, but for security critical systems such as traffic lights we can use techniques like testing, fuzzing, symbolic execution, and formal verification that allow us to find these failures in programs. We now need to develop these techniques for machine learning systems.

The analogy can go much further. Static analysis involves understanding properties of a program separately from any inputs, while dynamic analysis involves understanding a program with a specific input. Similarly, we can have "static" interpretability, which understands the model as a whole (as in Feature visualization), or "dynamic" interpretability, which explains the model's output for a particular input. Another example is that the technique of abstract interpretation of programs is analogous to a particular method for verifying properties of neural nets.

This analogy suggests that we have faced the problems of AI safety before, and have made substantial progress on them; the challenge is now in doing it again but with machine learning systems. That said, there are some problems that are unique to AGI-type systems; it's just not the specification problem. For example, it is extremely unclear how we should communicate with such a system, which may have its own concepts and models that are very different from those of humans. We could try to use natural language, but if we do we need to ground the natural language in the way that humans do, and it's not clear how we could do that, though perhaps we could test if the learned concepts generalize to new settings. We could also try to look at the weights of our machine learning model and analyze whether it has learned the concept -- but only if we already have a formal specification of the concept, which seems hard to get.

Rohin's opinion: I really like the analogy between programming and AI; a lot of my thoughts have been shaped by thinking about this analogy myself. I agree that the analogy implies that we are trying to solve problems that we've attacked before in a different context, but I do think there are significant differences now. In particular, with long-term AI safety we are considering a setting in which mistakes can be extremely costly, and we can't provide a formal specification of what we want. Contrast this to traffic lights, where mistakes can be extremely costly but I'm guessing we can provide a formal specification of the safety constraints that need to be obeyed. To be fair, Pushmeet acknowledges this and highlights specification learning as a key area of research, but to me it feels like a qualitative difference from previous problems we've faced, whereas I think Pushmeet would disagree with that (but I'm not sure why).

Read more: Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (AN #52)

Technical AI alignment   Learning human intent

Perceptual Values from Observation (Ashley D. Edwards et al) (summarized by Cody): This paper proposes a technique for learning from raw expert-trajectory observations by assuming that the last state in the trajectory is the state where the goal was achieved, and that other states have value in proportion to how close they are to a terminal state in demonstration trajectories. They use this as a grounding to train models predicting value and action-value, and then use these estimated values to determine actions.

Cody's opinion: This idea definitely gets points for being a clear and easy-to-implement heuristic, though I worry it may have trouble with videos that don't match its goal-directed assumption.

Delegative Reinforcement Learning (Vanessa Kosoy): Consider environments that have “traps”: states that permanently curtail the long-term value that an agent can achieve. A world without humans could be one such trap. Traps could also happen after any irreversible action, if the new state is not as useful for achieving high rewards as the old state.

In such an environment, an RL algorithm could simply take no actions, in which case it incurs regret that is linear in the number of timesteps so far. (Regret is the difference between the expected reward under the optimal policy and the policy actually executed, so if the average reward per timestep of the optimal policy is 2 and doing nothing is always reward 0, then the regret will be ~2T where T is the number of timesteps, so regret is linear in the number of timesteps.) Can we find an RL algorithm that will guarantee regret sublinear in the number of timesteps, regardless of the environment?

Unsurprisingly, this is impossible, since during exploration the RL agent could fall into a trap, which leads to linear regret. However, let's suppose that we could delegate to an advisor who knows the environment: what must be true about the advisor for us to do better? Clearly, the advisor must be able to always avoid traps (otherwise the same problem occurs). However, this is not enough: getting sublinear regret also requires us to explore enough to eventually find the optimal policy. So, the advisor must have at least some small probability of being optimal, which the agent can then learn from. This paper proves that with these assumptions there does exist an algorithm that is guaranteed to get sublinear regret.

Rohin's opinion: It's interesting to see what kinds of assumptions are necessary in order to get AI systems that can avoid catastrophically bad outcomes, and the notion of "traps" seems like a good way to formalize this. I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.

Of course, if we want the advisor to be a human, both of the assumptions are unrealistic, but I believe Vanessa's plan is to make the assumptions more realistic in order to see what assumptions are actually necessary.

One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident -- I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

Reward learning theory

By default, avoid ambiguous distant situations (Stuart Armstrong)

Handling groups of agents

PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings (Nicholas Rhinehart et al) (summarized by Cody): This paper models a multi-agent self driving car scenario by developing a model of future states conditional on both its own action and the action of multiple humans, and picking the latent-space action that balances between the desiderata of reaching its goal and preferring trajectories seen in the expert multi-agent trajectories its shown (where, e.g., two human agents rarely crash into one another).

Miscellaneous (Alignment)

Reinforcement learning with imperceptible rewards (Vanessa Kosoy): Typically in reinforcement learning, the reward function is defined over observations and actions, rather than directly on states, which ensures that the reward can always be calculated. However, in reality we care about underlying aspects of the state that may not easily be computed from observations. We can't guarantee sublinear regret, since if you are unsure about the reward in some unobservable part of the state that your actions nonetheless affect, then you can never learn the reward and approach optimality.

To fix this, we can work with rewards that are restricted to instrumental states only. I don't understand exactly how these work, since I don't know the math used in the formalization, but I believe the idea is for the set of instrumental states to be defined such that for any two instrumental states, there exists some "experiment" that the agent can run in order to distinguish between the states in some finite time. The main point of this post is that we can establish a regret bound for MDPs (not POMDPs yet), assuming that there are no traps.

AI strategy and policy

Beijing AI Principles: These principles are a collaboration between Chinese academia and industry, and hit upon many of the problems surrounding AI discussed today, including fairness, accountability, transparency, diversity, job automation, responsibility, ethics, etc. Notably for long-termists, it specifically mentions control risks, AGI, superintelligence, and AI races, and calls for international collaboration in AI governance.

Read more: Beijing publishes AI ethical standards, calls for int'l cooperation

Other progress in AI   Deep learning

Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (Hattie Zhou, Janice Lan, Rosanne Liu et al) (summarized by Cody): This paper runs a series of experimental ablation studies to better understand the limits of the Lottery Ticket Hypothesis, and investigate variants of the initial pruning and masking procedure under which its effects are more and less pronounced. It is first and foremost a list of interesting results, without any central theory tying them together. These results include the observation that keeping pruned weights the same sign as their "lottery ticket" initialization seems more important than keeping their exact initial magnitudes, that taking a mixed strategy of zeroing pruned weights or freezing them at initialization can get better results, and that applying a learned 0/1 mask to a re-initialized network can get surprisingly high accuracy even without re-training.

Cody's opinion: While it certainly would have been exciting to have a paper presenting a unified (and empirically supported) theoretical understanding of the LTH, I respect the fact that this is such a purely empirical work, that tries to do one thing - designing and running clean, clear experiments - and does it well, without trying to construct explanations just for the sake of having them. We still have a ways to go in understanding the optimization dynamics underlying lottery tickets, but these seem like important and valuable data points on the road to that understanding.

Read more: Cody's longer summary

Applications

Challenges of Real-World Reinforcement Learning (Gabriel Dulac-Arnold et al) (summarized by Cody): This paper is a fairly clear and well-done literature review focusing on the difficulties that will need to be overcome in order to train and deploy reinforcement learning on real-world problems. They describe each of these challenges - which range from slow simulation speeds, to the need to frequently learn off-policy, to the importance of safety in real world systems - and for each propose or refer to an existing metric to capture how well a given RL model addresses the challenge. Finally, they propose a modified version of a humanoid environment with some of these real-world-style challenges baked in, and encourage other researchers to test systems within this framework.

Cody's opinion: This is a great introduction and overview for people who want to better understand the gaps between current RL and practically deployable RL. I do wish the authors had spent more time explaining and clarifying the design of their proposed testbed system, since the descriptions of it are all fairly high level.

News

Offer of collaboration and/or mentorship (Vanessa Kosoy): This is exactly what it sounds like. You can find out more about Vanessa's research agenda from The Learning-Theoretic AI Alignment Research Agenda (AN #13), and I've summarized two of her recent posts in this newsletter.

Human-aligned AI Summer School (Jan Kulveit et al): The second Human-aligned AI Summer School will be held in Prague from July 25-28, with a focus on "optimization and decision-making". Applications are due June 15.

Open Phil AI Fellowship — 2019 Class: The Open Phil AI Fellows for this year have been announced! Congratulations to all of the fellows :)

TAISU - Technical AI Safety Unconference (Linda Linsefors)

Learning-by-doing AI Safety workshop (Linda Linsefors)

Copyright © 2019 Rohin Shah, All rights reserved.


Want to change how you receive these emails?
You can update your preferences or unsubscribe from this list.



Discuss

Steelmanning Divination

6 июня, 2019 - 02:05
Published on June 5, 2019 10:53 PM UTC

[This post was primarily written in 2015, after I gave a related talk, and other bits in 2018; I decided to finish writing it now because of a recent SSC post.]

The standard forms of divination that I’ve seen in contemporary Western culture--astrology, fortune cookies, lotteries, that sort of thing--seem pretty worthless to me. They’re like trying to extract information from a random number generator, which is a generally hopeless phenomenon because of conservation of expected evidence. Thus I had mostly written off divination; although I've come across some arguments that divination served as a way to implement mixed strategies in competitive games. (Hunters would decide where to hunt by burning bones, which generated an approximately random map of their location, preventing their targets from learning where the humans liked to hunt and avoiding that location.) But then I came across this striking passage, and sat up straight:

One performs the rain sacrifice and it rains. Why? I say: there is no special reason why. It is the same as when one does not perform the rain sacrifice and it rains anyway. When the sun and moon suffer eclipse, one tries to save them. When Heaven sends drought, one performs the rain sacrifice. One performs divination and only then decides on important affairs. But this is not to be regarded as bringing one what one seeks, but rather is done to give things proper form. Thus, the gentleman regards this as proper form, but the common people regard it as connecting with spirits. If one regards it as proper form, one will have good fortune. If one regards it as connecting with spirits, one will have misfortune.

This is from Eric L. Hutton's translation of a collection of essays called Xunzi (presumably written by Xunzi, an ancient Chinese philosopher who was Confucian with heavy Legalist influences). The book was overall remarkable in how much of Xunzi's brilliance shone through, which is something I very rarely think about authors. (Talking to another rationalist who was more familiar with Chinese philosophy than I was, he also had this impression that Xunzi simply had a lot more mental horsepower than many other core figures.) By the end of it, I was asking myself, "if they had this much of rationality figured out back then, why didn't they conquer the world?" Then I looked into the history a bit more and figured out that two of Xunzi's students were core figures in Qin Shi Huang's unification of China to become the First Emperor.

So this paragraph stuck with me. When Xunzi talks about the way that earlier kings did things, I registered it as an applause light and moved on. When he talked about how an important role of government was to prevent innovation in music, I registered it as covering a very different thing than what I think of when I think about 'music' and moved on. But when he specifically called out the reason why I (and most educated people I know) don't pay much attention to astrology or other sorts of divination or magic, said "yeah, those would be dumb reasons to do this," and then said "but there's still a reason", I was curious. What's the proper form that he's talking about? (Sadly, this was left as an exercise for the reader; the surrounding paragraphs are only vaguely related.)

In his introduction, Hutton summarizes the relevant portion of Xunzi's philosophy:

In this process of becoming good, ritual plays an especially important role in Xunzi's view. As he conceives them, the rituals constitute a set of standards for proper behavior that were created by the past sages and should govern virtually every aspect of a person's life. These rituals are not inviolable rules: Xunzi allows that people with developed moral judgment may need to depart from the strict dictates of ritual on some occasions, but he thinks those just beginning the process of moral learning need to submit completely to the requirements of ritual. Of the many important roles played by the rituals in making people good on Xunzi's view, three particularly deserve mention here. First the rituals serve to display certain attitudes and emotions. The ritually prescribed actions in the case of mourning, for instance, exhibit grief over the loss of a loved one, whether or not the ritual practitioner actually feels sadness. Second, even if the ritual practitioner does not actually feel the particular attitude or emotion embodied in the ritual, Xunzi believes that repeated performance of the ritual can, when done properly, serve to cultivate those attitudes and emotions in the person. To use a modern example, toddlers who do not know to be grateful when given a gift may be taught to say "thank you" and may do so without any understanding of its meaning or a feeling or gratitude. With repetition, time, and a more mature understanding of the meaning of the phrase, many of these children grow into adults who not only feel gratitude upon receiving gifts but also say "thank you" as a conscious expression of that feeling. Similarly, on Xunzi's view, rituals serve to inculcate attitudes and feelings, such as caring and respect, that are characteristic of virtue, and then serve to express a person's virtue once it is fully developed. A third important function of the rituals is to allot different responsibilities, privileges, and goods to different individuals, and thereby help to prevent conflict over those things among people.

So what is cultivated by performing divination?

The first step is figuring out what sort of divination we're discussing. Xunzi probably had in mind the I Ching, a book with 64 sections, each corresponding to a situation or perspective, and advice appropriate for that situation. In the simplest version, one generates six random bits and then consults the appropriate chapter. I actually tried this for about a month, and then have done it off and on since then. I noticed several things about it that seemed useful:

  • Entries in the I Ching typically focused on perspectives or principles instead of situations, consequences, or actions. Today’s Taurus horoscope says “your self-esteem might be challenged by a fast talker or unpleasant situation” and counsels me “don’t accept things as they appear on first glance,” whereas the I Ching reading I just randomly selected talks about how following proper principles leads to increased power and how the increased power tempts us to abandon the principles that generated that power. This makes it much easier to scan one’s life and see where the perspectives shed new light on a situation, or where principles had been ignored. (One of the early successes of my I Ching practice was a chapter that suggested reaching out to a trusted guide for advice, and I realized I should talk to a mentor at work about a developing situation, which I wouldn’t have done otherwise.)
  • Given that, daily divination almost filled the same role as daily retrospectives or planning sessions; I was frequently thinking about all the different parts of my life on a regular interval, using a variety of random access to filter things down.
  • “One performs divination and only then decides on important affairs.” Often one is faced with a challenge that is “above one’s pay grade,” and having a prescribed ritual for what sort of cognition needs to be done encourages reflection and popping out of the obvious frame. Simply thinking about a situation in the way one naturally would doesn’t correct for biases, while attempting to make sense of a situation from a randomly generated frame does help expand one’s conception of it.
  • Since the divination result was a particular perspective (rather than an object-level claim about events or the correct action), it was easy to see when the perspective had been ‘fully considered’ and the reflection was done. If I’m trying to make a binary choice and I flip a coin to resolve it, basically all I’m doing is checking whether or not my gut is secretly hoping the coin lands a particular way, and in cases of genuine uncertainty I will end up just as uncertain after consulting the coin flip. But when I have a situation and I resolve the questions of “what does patience have to say about X?” and “what does humility have to say about X?” then I can have the sense of having actually made progress.
  • It encouraged experimentation by partially decoupling one’s mood and one’s decisions. The first few days, I was instructed to consider diligence and so I worked more than I would have wanted to (and discovered that this was generally fine); on the fourth day, I joked, “when is it going to tell me to goof off and play video games?”, and then got a reading that said “effort is hopeless today; that happens, cope with it.” So despite feeling like I could have been productive, I took the day off, and later had the sense that I had confirmed my initial sense that I didn’t need to take the day off, providing data to calibrate on that I wouldn’t have gotten except through random variation.

Essentially, it looked to me like the steelman of divination is something like Oblique Strategies, where challenging situations (either ‘daily life’ or a specific important decision) are responded to by random access to a library of perspectives or approaches, and the particular claim made by a source is what distribution is most useful. There was previously an attempt on LW to learn what advice was useful, but I think on the wrong level of abstraction (the ‘do X’ variety, instead of the ‘think about X’ variety).

This approach has also served me well with other forms of divination I've since tried; a Tarot deck works by focusing your attention on a situation, and then randomly generating a frame (from an implied distribution of relevant symbols), giving one access to parts of the space that they wouldn't have considered otherwise. This also trains the habit of 'understanding alien frames'; if I am considering a conflict with another person and then have to figure out what it means that "I'm the vizier of water, the relationship is the three of earth, and the other person is strength" (where, of course, each of those is in fact an image rich in detail rather than a simple concept), this trains the habit of adopting other perspectives / figuring out how things make sense from the other point of view.



Discuss

Deceptive Alignment

5 июня, 2019 - 23:16
Published on June 5, 2019 8:16 PM UTC

.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

This is the fourth of five posts in the Mesa-Optimization Sequence based on the upcoming MIRI paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper, with the full paper set to be published on the MIRI blog with the release of the last post in the sequence.

 

With enough training in sufficiently diverse environments, it seems plausible that the base objective will eventually have to be fully represented in the mesa-optimizer. We propose that this can happen without the mesa-optimizer becoming robustly aligned, however. Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.

Figure 4.1. A toy example of deceptive alignment. In this task, the base objective is for the robot to get to A, while the mesa-objective is to get to B. The task is run for two training episodes and three deployment episodes. Parameter updates are made only after the training episodes, while after deployment the parameters are fixed. If the mesa-optimizer goes to its objective (B) during training, it will be modified to ensure it goes to A in future episodes (dashed arrows). Therefore, if the mesa-optimizer's objective spans across episodes, the dominant strategy for the mesa-optimizer is to go to A in training and B in testing (filled arrows).

As a simple example, consider an environment that consists of a 3×3 grid (see figure 4.1). The top left corner is “corner A” and the top right corner is “corner B.” The agent, which is a mesa-optimizer, starts in the bottom left corner. The base objective is to get to corner A in as few steps as possible, but the objective of the mesa-optimizer is to get to corner B. The environment will be run five times. The agent is modified only after the first two episodes (representing the training period), but not thereafter (representing testing/deployment). If a suboptimal move is made in training, the agent is modified so that its mesa-objective becomes the base objective. The objective of the mesa-optimizer spans all episodes: it wants B to be reached in as many episodes as possible.

For the sake of this example, we will assume that the agent knows all of the above facts. If the agent goes to B in any training episode, then it will be modified to go to A in all subsequent episodes, which means that it would reach B in only one episode in total. Alternatively, if the agent goes to A in the first two episodes, then it will not be modified, and could therefore go to B in the remaining three episodes. Using this strategy, B is reached in three episodes in total, which is better (on the mesa-objective) than the one time B is reached if the mesa-optimizer goes there in training. This simple example demonstrates how a misaligned mesa-optimizer could be incentivized to “pretend” to be aligned with the base objective until it is no longer in training or is otherwise given an opportunity to in some way compromise or escape the training process.

We can distinguish two ways that a base objective function could be incorporated into a mesa-optimizer:[3]

  1. Internalization of the base objective. The mesa-objective function gets adjusted towards the base objective function to the point where it is robustly aligned.
  2. Modeling of the base objective. The base objective is incorporated into the mesa-optimizer's epistemic model rather than its objective, leaving open the possibility that the objective could still be misaligned.

In the former case, information about the base objective flows into the learned algorithm via the optimization performed by the base optimizer—the base objective is built into the mesa-optimizer as it is adapted by the base optimizer. But there is another possibility: information about the base objective could enter through the mesa-optimizer's input channel instead. In this case, the base objective will be represented in the learned algorithm in a different—and potentially more unstable—way than if the base objective had been incorporated through optimization.[4] In such a case, the information describing the base objective does not become internalized in the mesa-objective, potentially leading to deceptive alignment.

 

4.1. Safety concerns arising from deceptive alignment

If deceptive alignment is a real possibility, it presents a critical safety problem. A deceptively aligned mesa-optimizer is likely to stop optimizing the base objective as soon as it is in a position where it no longer needs to perform well on the base objective to achieve its goal. Thus, it will eventually stop performing the function it was trained to perform, and instead start pursuing some other, indeterminate goal. We will refer to this transition to optimizing for the mesa-objective instead of the base-objective as “defection”. This is likely to happen whenever the mesa-optimizer believes defecting will no longer cause it to be modified.

Particularly concerning is the case where the mesa-optimizer actively intervenes to remove the threat of modification—by exploiting vulnerabilities in the software it is running on, for example. In extreme cases, a deceptive mesa-optimizer might try to manipulate or plan around its programmers, since they form part of the modification threat—they could shut the system down if it fails to perform well, or simply choose not to deploy it. Both of these possibilities reduce the system's expected future influence, and are therefore likely to have a low ranking in its mesa-objective. In its most extreme form, this failure of alignment could lead to a “treacherous turn” as described by Nick Bostrom in Superintelligence.(23)

Paul Christiano provides a scenario that illustrates how deceptive alignment could pose a catastrophic risk.(24) In Christiano's example, deceptively aligned mesa-optimizers[5] arise from future machine learning techniques that are advanced enough to search over a large model space that includes policies that can understand the world. In this scenario, some cases of defection by deceptively aligned mesa-optimizers may have catastrophic consequences—when the systems control large-scale infrastructure, for example—though such failures will likely still be recoverable. As we rely more on automated systems to keep track of an increasingly complex world, however, it will eventually become impossible to recover from a correlated failure of many AI systems simultaneously. At this point, a cascade of defections by many deceptively aligned mesa-optimizers simultaneously could lead to the world “going out with a bang.”(24) Christiano notes that this may be most likely to first occur during a “period of heightened vulnerability” such as an inter-state conflict or natural disaster.

 

4.2. Conditions for deceptive alignment

We propose that for deceptive alignment to occur the following conditions must be satisfied:

  1. The mesa-optimizer must have an objective that extends across parameter updates.
  2. The mesa-optimizer must be able to model the fact that it is being modified to achieve a particular base objective and must have some model of what that objective is.[6]
  3. The mesa-optimizer must expect the threat of modification to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.

The mesa-optimizer must have an objective that extends across parameter updates. It is not clear under what conditions such objectives would be incentivized. The most plausible scenario is likely one in which such an objective is encouraged by the training setup, as there seem to be situations in which this would be required for good performance. For example, the agent might be tasked with solving an environment in which its actions can have permanent consequences across episodes or where planning between episodes is otherwise necessary.

The mesa-optimizer must be able to model the fact that it is being modified to achieve a particular base objective and must have some model of what that objective is. A mesa-optimizer has several ways of learning about the optimization pressure it is under.

First, such knowledge can come about as a result of the optimization pressure itself. Small pieces of information about the base objective can be beneficial if they enable the mesa-optimizer to improve its score on the base objective function—by improving its prior on the tasks it is likely to receive, for example. Since each piece of information independently helps with achieving the base objective, this forms an approximately monotonically increasing path that a local base optimizer might follow. The parameter updates could, in principle, be learned like any other element of the dynamics, without needing an explicit model of the base optimizer, potentially making this option available even for mesa-optimizers with quite modest capabilities.

Second, in many realistic applications of advanced machine learning, a mesa-optimizer could use information from its environment to deduce its situation at runtime. This is how humans learned that we were generated by the base optimizer of evolution, and seems like a very plausible pathway for a hypothetical mesa-optimizer to discover the existence of the optimization pressure it is under. For example, an agent tasked with forecasting stock performance in the tech sector might be given data on current AI technology in order to improve its performance, but this information might also allow it to learn about its own situation.

This suggests that it may be desirable for the programmers to limit the mesa-optimizer's access to information describing the base objective. It is unclear to what extent this is possible, given that many advanced ML systems will likely have to operate in environments where this information is readily available.

Even if the programmers successfully limit access to this information, however, there may still be other channels through which information about the base objective could enter the mesa-optimizer in unintended ways:

  • The mesa-optimizer could reason about why it is given the particular tasks it is being asked to solve. Most mesa-optimizers would need a prior over the sorts of tasks they are likely to receive, and a mesa-optimizer might be able to learn about the existence of the base optimizer by inspecting this prior.
  • The mesa-optimizer might be able to reason about itself and about why it is the kind of thing that it is, and observe that it is well-designed to accomplish some goal.

However, these two strategies might require a much more advanced mesa-optimizer.

The mesa-optimizer must expect the threat of modification to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer. First, note that this condition is a statement about the mesa-optimizer's expectations, not necessarily the actual truth. Second, even if there is no point at which training stops by design, so long as there exists some way for the mesa-optimizer to execute a treacherous turn, there is always some way for it to remove the threat of modification. For the mesa-optimizer to act on such a strategy, however, it has to actually be aware of some way for it to do so. Conversely, a mesa-optimizer that mistakenly believes it has an opportunity to defect might also exhibit deceptive alignment.

Since it is impossible to discover through exploration that the threat of modification can disappear, the mesa-optimizer will either have to infer this through some form of abstract reasoning or have a prior that assigns a non-negligible probability to the hypothesis that the selection pressure will not continue indefinitely. The second option seems plausible; a Bayesian prior should always assign a nonzero probability to the hypothesis that some fact will eventually stop being true.[7] As for the first option, the mesa-optimizer could infer that the optimization pressure at some point may disappear if it learns about the existence of the base optimizer (either by using information in its input or by having a strong prior for the existence of a base optimizer), though this might require a relatively advanced mesa-optimizer.

 

4.3. The learning dynamics of deceptive alignment

Once a mesa-optimizer becomes deceptive, the dynamics of what causes it to perform well on the base objective begin to work differently. Instead of being selected only on the basis of its objective and ability, a deceptively aligned optimizer is also selected for how well it infers the base objective function and how discerning it is about when to optimize the base objective versus its own mesa-objective. We identify at least two regimes of selection for a mesa-optimizer under deceptive alignment.

Joint optimization. If the mesa-optimizer attempts to jointly optimize its objective and the base objective or makes mistakes about the timing of its defection, then there will still be some pressure on its objective to be more compatible with the base objective. This is unlikely to be the same pressure as exists when the optimizer is just working towards its own objective, but there might be some movement towards an objective that is more compatible with being jointly optimized alongside the base objective.

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.” As its objective is largely irrelevant to its outputs now, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.

The joint optimization mode is not stable because it is suboptimal from the perspective of the base optimizer. As the mesa-optimizer becomes more competent, it will therefore have to either move to the domain of pure deception or become robustly aligned.

 

4.4. Internalization or deception after extensive training

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective. We identify three distinct ways in which this can happen:

  1. Robust alignment through internalization. The mesa-objective function gets adjusted towards the base objective function, removing any pseudo-alignment.
  2. Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer's epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.
  3. Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

To distinguish between the two different paths to robust alignment, we will use the term internally aligned to refer to the first case and the term corrigibly aligned to refer to the last case. We analyze some ways in which these different paths may be more or less attractive below.

There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former. In a case where the total amount of computation available is fixed, a deceptively aligned mesa-optimizer is going to have less time to spend on optimizing the base objective than any robustly aligned optimizer, resulting in potentially worse plans with respect to the base objective. Furthermore, in both deceptive and corrigible alignment, the mesa-optimizer will have to spend time learning about the base objective to enable it to properly optimize for it, making internal alignment the most time efficient. Thus, an explicit or implicit time penalty would seem to disadvantage a deceptively aligned mesa-optimizer and encourage an internally aligned one. However, it is unclear what the magnitude of this extra cost is. It is plausible that on complex tasks, the overhead of spotting defection opportunities could be low in comparison to the cost of solving the task itself. Moreover, even if it is not globally optimal, incurring this overhead might be a local optimum, with no clear paths for the base optimizer to escape it.

Deceptive and corrigible alignment might have a lower description length. Deceptively and corrigibly aligned mesa-optimizers are likely to have an advantage in representing the base objective compared to internally aligned mesa-optimizers as they are able to use the environment to provide some of the information. Specifically, a deceptively or corrigibly aligned mesa-optimizer doesn't need to represent the entire base objective internally, as it can infer pieces of it by looking for information in its environment during each run. An internally aligned mesa-optimizer, by contrast, has to start with a full model of the base objective in its mesa-objective to make it robustly aligned, since an internally aligned mesa-optimizer will have no incentive to change its objective based on any additional information it gets about the base objective. Thus, to the extent that a base optimizer is selecting based on description length, it might preferentially select for a deceptively or corrigibly aligned mesa-optimizer over an internally aligned one.

Modeling might happen faster than internalization. It may take less time for the mesa-optimizer to learn to model the base objective than for the base optimizer to adapt the mesa-objective. Suppose that a mesa-optimizer is deceptive but has not yet correctly inferred the full base objective.[8] In such a case, what will determine whether it ends up robustly aligned or deceptively aligned after further training will be whether it infers enough about the base objective before the base optimizer modifies its mesa-objective (either to make it internally aligned or corrigibly aligned). Two factors are likely to contribute to which happens first: the rate at which the mesa-optimizer improves its estimate of the base objective, and the parameter updates made by the base optimizer at each mistake.

The first factor, improving its estimate of the base objective, seems to depend primarily on the capabilities of the mesa-optimizer during the timeframe when it is deceptive, and on the complexity of the base objective that needs to be inferred. An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory, allowing the mesa-optimizer to build upon its prior thinking in a process that happens faster than the rate at which the parameter updates align its mesa-objective.

The second factor, parameter updates, depends on the nature of the base optimizer. Current gradient-descent-based base optimizers would have a small reaction to any given mistake, but more sophisticated base optimizers with advanced safety features might be able to do better. For example, a more sophisticated base optimizer might be able to scramble the mesa-optimizer's parameters or call for human oversight whenever a suspicious failure is detected.

In general, parameter updates are not guaranteed to make a mesa-optimizer more aligned. For instance, instead of aligning the mesa-objective, an update made after a defection could push the system to improve its planning to the point that it no longer makes detectable mistakes.

A consequence of these considerations is that any possible systematic bias in how the base optimizer updates the mesa-optimizer's objective or modeling capabilities could play a role in determining the final outcome.[9] This analysis remains speculative, however—if we are to have confidence in systems that are capable of producing mesa-optimizers, we still need a better understanding of the conditions under which strongly optimizing a system results in deceptive alignment rather than robust alignment.

 

4.5. Distributional shift and deceptive alignment

Once a mesa-optimizer is deceptive, it will begin to monitor its input for opportunities to defect. During training, a deceptively aligned optimizer is strongly discouraged from defecting because it knows that the base optimizer will select against it. However, once a deceptively aligned mesa-optimizer is free of the threat of external modification, this no longer applies. Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.

This means that the detrimental effects of deceptive alignment will only be externally visible when the mesa-optimizer defects—either randomly or after undergoing a distributional shift. However, the underlying problem is not the distributional shift itself, but the failure to align the mesa-optimizer in the first place. One can either try to solve the underlying problem, by making the mesa-optimizer robust to the distributional shift (i.e., robustly aligned), or one can try to patch the problem by preventing any distributional shift from occurring.

In current AI systems, a small amount of distributional shift between training and deployment need not be problematic: so long as the difference is small enough in the task-relevant areas, the training distribution does not need to perfectly reflect the deployment distribution. However, this may not be the case for a deceptively aligned mesa-optimizer. If a deceptively aligned mesa-optimizer is sufficiently advanced, it may detect very subtle distributional shifts for the purpose of inferring when the threat of modification has ceased.

This raises the question of whether it would even be possible to contain a deceptive mesa-optimizer by completely removing any detectable difference between training and deployment. There are a number of reasons to believe that this might not be feasible in practice.

First, it would significantly limit the use cases for the system. Machine learning is often done with the intent of making systems that can solve new problems. If we require that there is no distributional shift of any form, then we cannot allow the system to encounter problems that it could not have encountered during training, which means that we cannot give it truly new problems. For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.

Second, there are subtle forms of distributional shift that seem very difficult to eliminate. For example, if the mesa-optimizer is able to detect the presence of parameter updates—via any state carried over between steps, for instance—then it would always experience a form of distributional shift at deployment. Running the system for a longer amount of time (or giving the system access to some form of long-term memories) would also constitute a form of distributional shift. Some examples of differences that a mesa-optimizer might be able to detect include:

  • Training in a simulated environment and deploying in the real world.
  • Running the system for a different amount of time during training and testing.
  • If a model is able to learn from its past experiences, then having access to more memories, or having access to the kinds of memories it would form at later time steps (if there is a difference between the two).
  • The presence of adversarial-example-type attacks by an adversary during deployment.
  • Encountering something that is very hard to compute during training, but that is likely to eventually be computed during deployment (e.g. a factorization of RSA 2048(26)).
  • The ability of the base optimizer to modify the learned algorithm during training but not at other times.
  • The ability of the programmers to correct for perceived errors during training and testing but not deployment.
  • The presence or absence of good opportunities for the mesa-optimizer to defect against its programmers.

Furthermore, it should be noted that it is not strictly necessary that a mesa-optimizer be able to detect the difference between training and deployment in order for it to be incentivized to behave deceptively. Consider again the example in figure 4.1, but suppose this time that the mesa-optimizer is unable to detect which episode it is currently in. The mesa-optimizer is then facing a decision problem similar to the “absent-minded driver problem.”(27) If we apply the same reasoning to our example in figure 4.1, the planning-optimal policy for the mesa-optimizer is to defect with probability p≈0.61. Thus, the mesa-optimizer need not be able to detect when training has ended for it to be incentivized to “gamble” and randomly defect with some probability. Furthermore, a deceptively aligned mesa-optimizer might also try to look for opportunities to defect that would be exceptionally good conditional on it currently not being trained, but that are rare enough to be unlikely to come up during training.

 

The fifth and final post in the Mesa-Optimization Sequence, titled “Risks from Learned Optimization: Conclusion and Related Work,” will be released in two days.

Glossary | Bibliography

  1. Though for clarity we speak of the same agent persisting across parameter updates, neither we nor the agent must reason in this way. A more general description of the situation is that there is a sequence of “successor agents” connected by parameter updates. The agent can potentially learn the effects of the parameter updates on its successors, and thus on the multi-episode return it is optimizing. ↩︎

  2. The concept of deceptive alignment is closely related to Manheim and Garrabrant's concept of “adversarial Goodhart.”(22) In “Categorizing Variants of Goodhart's Law,” Manheim and Garrabrant describe adversarial Goodhart as when “an agent with different goals than the regulator” causes a “collapse of the statistical relationship between a goal which the optimizer intends and the proxy used for that goal.” Deceptive alignment can be seen as an instance of adversarial Goodhart in which the mesa-optimizer plays the role of the agent with different goals and the base optimizer plays the role of the regulator. Training performance is the regulator's proxy, and the deployment performance is the regulator's actual goal. ↩︎

  3. Thanks to Paul Christiano for the internalization/modeling distinction. ↩︎

  4. Note, however, that it is possible for information about the base objective to be introduced via optimization, but not form part of the mesa-objective. For example, it could form part of a prior on tasks, having no direct bearing on the mesa-objective. ↩︎

  5. Christiano refers to deceptively aligned mesa-optimizers as “influence-seeking systems.” ↩︎

  6. Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the optimization pressure it is subject to. If the first condition is not satisfied, then the mesa-optimizer has no incentive to influence the outcome of parameter updates. If the second condition is not satisfied, then the mesa-optimizer will not see a reason to sacrifice gains in its current episode, as it expects future gains to be independent of what happens in this episode. If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective. One must distinguish the question of whether these conditions are possible from whether they are likely, however. We examine the factors involved in each of these conditions below. ↩︎

  7. Note that if the mesa-optimizer also believes that the opportunities it currently has for achieving its own objective are unusually good then it might still pursue its objective greedily, but this would presumably not always be the case. ↩︎

  8. An analogous argument applies for a deceptively aligned mesa-optimizer that is defecting too early, or one that unsuccessfully attempts a treacherous turn. ↩︎

  9. Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched. ↩︎



Discuss

Страницы