Tessellating Hills: a toy model for demons in imperfect search
If you haven't already, take a look at this post by johnswentworth to understand what this is all about: https://www.lesswrong.com/posts/KnPN7ett8RszE79PH/demonsinimperfectsearch
The short version is that while systems that use perfect search, such as AIXI, have many safety problems, a whole new set of problems arises when we start creating systems that are not perfect searchers. Patterns can form that exploit the imperfect nature of the search function to perpetuate themselves. johnswentworth refers to such patterns as "demons".
After reading that post I decided to see if I could observe demon formation in a simple model: gradient descent on a not-too-complicated mathematical function. It turns out that even in this very simplistic case, demon formation can happen. Hopefully this post will give people an example of demon formation where the mechanism is simple and easy to visualize.
Model
The function we try to minimize using gradient descent is called the loss function. Here it is:
$$L(\vec{x}) = -x_0 + \epsilon \sum_{j=1}^{n} x_j \cdot \mathrm{splotch}_j(\vec{x})$$
Let me explain what some of the parts of this loss mean. Each function $\mathrm{splotch}_j(\vec{x})$ is periodic with period $2\pi$ in every component of $\vec{x}$. I decided in this case to make my splotch functions out of a few randomly chosen sine waves added together.
$\epsilon$ is chosen to be a small number, so in any local region $\epsilon \sum_{j=1}^{n} x_j \cdot \mathrm{splotch}_j(\vec{x})$ will look approximately periodic: a bunch of hills repeating over and over again with period $2\pi$ across the landscape. But over large enough distances, the relative weightings of the various splotches do change. Travel a distance of $20\pi$ in the $x_7$ direction, and $\mathrm{splotch}_7$ will be a larger component of the repeating pattern than it was before. This allows for selection effects.
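A minimal sketch of how such a loss could be built, using only the Python standard library. The post does not give the number of sine waves, the wave vectors, or the value of $\epsilon$, so `n_waves`, the integer range for `k`, `eps`, and the names `make_splotch` and `make_loss` are all illustrative assumptions rather than the author's actual code.

```python
import math
import random

def make_splotch(dim, rng, n_waves=3):
    """Build a function that is 2*pi-periodic in every coordinate.

    Assumption: each wave is sin(k . x + phase) with a small integer
    wave vector k; integer components are what give the 2*pi period.
    """
    waves = []
    for _ in range(n_waves):
        k = [rng.randint(-2, 2) for _ in range(dim)]
        phase = rng.uniform(0.0, 2.0 * math.pi)
        waves.append((k, phase))

    def splotch(x):
        return sum(
            math.sin(sum(ki * xi for ki, xi in zip(k, x)) + phase)
            for k, phase in waves
        )

    return splotch

def make_loss(n=16, eps=0.01, seed=0):
    """Return (loss, splotches) for L(x) = -x_0 + eps * sum_j x_j * splotch_j(x)."""
    rng = random.Random(seed)
    dim = n + 1  # coordinates x_0, x_1, ..., x_n
    splotches = [make_splotch(dim, rng) for _ in range(n)]

    def loss(x):
        # splotches[j - 1] plays the role of splotch_j for j = 1..n
        bumps = sum(x[j] * splotches[j - 1](x) for j in range(1, n + 1))
        return -x[0] + eps * bumps

    return loss, splotches
```

Calling `loss, splotches = make_loss()` gives a 17-dimensional landscape; shifting any single coordinate by $2\pi$ leaves each individual splotch unchanged, while the weights $x_j$ in front of the splotches do change.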
The $-x_0$ term means that the vector $\vec{x}$ mainly wants to increase its $x_0$ component. But the splotch functions can also direct its motion. A splotch function might have a kind of ridge that directs some of the $x_0$ motion into other components. If $\mathrm{splotch}_7$ tends to direct motion in such a way that $x_7$ increases, then it will be selected for, becoming stronger and stronger as time goes on.
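To make the selection effect explicit, the partial derivatives of the loss can be written out. This is a worked derivation from the loss as stated, not something given in the original post; the index $i$ is introduced here for the coordinate being differentiated.

```latex
\[
\frac{\partial L}{\partial x_0}
  = -1 + \epsilon \sum_{j=1}^{n} x_j \,
    \frac{\partial\, \mathrm{splotch}_j}{\partial x_0}(\vec{x}),
\qquad
\frac{\partial L}{\partial x_i}
  = \epsilon\, \mathrm{splotch}_i(\vec{x})
  + \epsilon \sum_{j=1}^{n} x_j \,
    \frac{\partial\, \mathrm{splotch}_j}{\partial x_i}(\vec{x})
\quad (1 \le i \le n).
\]
```

Gradient descent moves against these derivatives. The sums are weighted by the coordinates $x_j$ themselves, so the larger $|x_7|$ becomes, the more $\mathrm{splotch}_7$ shapes the local slope in every direction, including the $x_7$ direction. That is the feedback loop the paragraph above describes.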
Results
I used ordinary gradient descent, with a constant step size, and with a bit of random noise added in. Figure 1 shows the value of $x_0$ as a function of time, while figure 2 shows the values of $x_1, x_2, \ldots, x_{16}$ as a function of time.
Fig 1:
Fig 2:
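The procedure described above (constant step size plus a little noise) can be sketched as follows. The post gives no hyperparameters, so `lr`, `noise`, `eps`, the Gaussian noise model, the sine-wave construction, and the names `make_waves`, `loss_and_grad`, and `descend` are all assumptions for illustration. The sketch is not tuned to reproduce the figures, and whether a demon appears will depend on these choices.

```python
import math
import random

def make_waves(n, n_waves, rng):
    """For each of n splotches, draw n_waves (wave vector, phase) pairs.

    Assumption: integer wave vectors in [-2, 2] over the n + 1 coordinates.
    """
    dim = n + 1
    return [
        [([rng.randint(-2, 2) for _ in range(dim)],
          rng.uniform(0.0, 2.0 * math.pi))
         for _ in range(n_waves)]
        for _ in range(n)
    ]

def loss_and_grad(x, waves, eps):
    """Loss L(x) = -x_0 + eps * sum_j x_j * splotch_j(x) and its exact gradient."""
    n = len(waves)
    loss = -x[0]
    grad = [0.0] * len(x)
    grad[0] = -1.0
    for j in range(1, n + 1):
        value = 0.0  # splotch_j(x)
        for k, phase in waves[j - 1]:
            arg = sum(ki * xi for ki, xi in zip(k, x)) + phase
            value += math.sin(arg)
            # contribution of eps * x_j * d(splotch_j)/dx_i for every i
            c = eps * x[j] * math.cos(arg)
            for i, ki in enumerate(k):
                grad[i] += c * ki
        loss += eps * x[j] * value
        grad[j] += eps * value  # derivative of the explicit x_j factor
    return loss, grad

def descend(steps=5000, n=16, eps=0.01, lr=0.1, noise=0.01, seed=0):
    """Gradient descent with constant step size and additive Gaussian noise."""
    rng = random.Random(seed)
    waves = make_waves(n, 3, rng)
    x = [0.0] * (n + 1)
    history = []
    for _ in range(steps):
        _, grad = loss_and_grad(x, waves, eps)
        x = [xi - lr * gi + noise * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, grad)]
        history.append(list(x))
    return history
```

Plotting `[h[0] for h in history]` against the step index would give the analogue of figure 1, and plotting the remaining coordinates would give the analogue of figure 2.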
There are three phases to the evolution: In the first, $x_0$ increases steadily, and the other coordinates wander around more or less randomly. In the second phase, a self-reinforcing combination of splotches (a "demon") takes hold and amplifies itself drastically, feeding off the large $x_0$ gradient. Finally, this demon becomes so strong that the search gets stuck in a local valley and further progress stops. The first phase is more or less from 0 to 2500 steps. The second phase is between 2500 steps and 4000 steps, though slowing down after 3500. The final phase starts at 4000 steps, and likely continues indefinitely.
Now that I have seen demons arise in such a simple situation, it makes me wonder how commonly the same thing happens in the training of deep neural networks. Anyways, hopefully this is a useful model for people who want to understand the mechanisms behind the whole "demons in imperfect search" thing more clearly. It definitely helped me, at least.
EA Kansas City planning meetup, discussion & open questions
Hi all!
We had a meeting planning the future of the Effective Altruism Kansas City group. We discussed:
 Group values and goals
 What we as a (currently small) local group could do to maximize our positive impact
 Communication structure for internal and external communications, individual responsibilities
 How we should direct members' desire for action
 Whether it makes sense to spend any time/resources on local charities
Some discussion on each below.
Group values and a vision
Up to this point, we've only really done discussion groups and a few workshops; nothing we did really involved action on behalf of the group (with the exception that at least one person signed up for the GWWC pledge). People have been asking where the group is going and when they can do something about the problems we've been talking about.
Actions we could take to maximize positive impact
A problem facing Effective Altruists is that the good we do can feel very distant: donating some portion of your salary isn't sexy or salient. In terms of direct impact, doing something like volunteering at a food pantry may not be as impactful as donating the equivalent in salary to an effective charity. However, such action could be argued to:
 Keep interest in altruism alive and well by staying connected to the community
 Signal boost Effective Altruism ideas both by getting the word out and signalling virtue in a way people understand
Additionally, applying EA tools in a local context (impact evaluations, searching for the biggest problems and most effective solutions) can be a good introduction for people who haven't asked these classic questions before.
Along these lines, these are some projects we're considering taking on as a group:
 Helping a group of us get practiced at presenting EA ideas well, then reaching out and scheduling presentations/workshops with other organizations to spread EA ideas and get giving pledges
 Doing basic impact estimations on local charities and publishing our research, to help money that's earmarked to be spent locally be as effective as it can be (while encouraging donors to look internationally and prominently mentioning the difference in impact). Think of this like a local GiveWell. We hypothesize that the impact differences between these local (US) charities will be smaller than the gap between international options, but large enough to be meaningful while also signal boosting EA and the EA KC group.
 Finding some neglected cause or intervention for a local problem (e.g. concentrated poverty) and coordinating with other groups to direct resources to that cause
 Reaching out to students we know at local schools, getting them involved/trained in EA, helping them found university groups and spread 80,000 Hours-type ideas. Relatedly, offering free career coaching to people considering the impact of their career.
For the immediate future, we'll be prioritizing the local impact evaluations idea and getting a website up. This is because one of our members has been contacted by a corporate representative at his company who's interested in EA thinking, whose decisions could lead to a lot of new funding going to effective charities.
Group communication
This is never an easy thing to figure out, but for now we're thinking of a 3-part communication structure:
 An email list for broad external/internal communications and loosely engaged members, supported by a website
 A Discourse (or other community communication program) for discussion between engaged members on topics & projects
 A group chat for urgent communications between organizers
This is a bit of an open question. Do you think local work makes sense, given the lower direct impact but potential signal boosts and personal/community support?
On unfixably unsafe AGI architectures
There's loads of discussion on ways that things can go wrong as we enter the post-AGI world. I think an especially important one for guiding current research is:
Maybe we'll know how to build unfixably unsafe AGI, but can't coordinate not to do so.
As a special case, I will suggest that we might have an x-risk-level accident as the culmination of a series of larger and larger accidents.
(This is an extreme case of what John Maxwell (following Nate Soares) calls an alignment roadblock.)
I'm sure this has been discussed before, but it sometimes seems to slip through the cracks in recent discussions, where instead I sometimes see an implicit assumption that x-risk-level catastrophic accidents will not happen if we have ample warning in the form of minor accidents, and thus (this theory goes) we should think only about (1) fast takeoff, (2) deceptive systems (such as Paul Christiano's "influence-seekers") that pretend to be beneficial until it's too late to stop them, (3) researchers being reckless due to race dynamics, and (4) other problems that are not "accidents" per se. But even if we avoid all those problems, and thus get ample experience in the form of minor accidents, I don't think that's necessarily enough.
1. Is there such a thing as an "unfixably unsafe AGI"?
By "unfixable", I mean that to solve the problem, we need to massively backtrack and take a different path to AGI (see Appendix) ... or that a safer AGI architecture simply doesn't exist.
By "unsafe", I mean ... well, I'm not really sure what this term should mean. Is it "less unsafe than the non-AGI status quo humanity on fast-forward" (a low bar!), or "the most safe that's technologically possible" (an almost impossibly high bar!), or some absolute metric like "<X% chance of extinction" for some X? It's your choice, readers! As your safety standards get lower, the existence of "unfixably unsafe AGI" becomes less likely, but a bigger problem if it does happen.
To keep things concrete, let's have in mind an Example failure mode: Goal instability under learning and reflection: The AGI will have an internal concept of (for example) "doing what the human overseer would want", and this concept will develop and churn as the agent develops better understanding of people and the world. (See "Ontological crisis".) At some point—according to this failure mode—the internal goals / constraints / etc. may fall out of alignment with the safe, benevolent, corrigible behavior we want.
Is this failure mode plausible? If so, would it really be "unfixable" (within a certain approach to AGI)? Well, I don't know! Maybe, maybe not. As far as I know, it can't be ruled out.
Also, without directly solving the problem, there are plenty of possible indirect solutions—boxing, supervisory systems, transparency, etc. etc. But we don't know that any of them will work reliably, and it's possible that they will work only by limiting the system's capability, and then there's still a coordination problem (we can change the topic to "we know how to build this unboxed AGI, and can't coordinate not to do so").
(Again, let's keep this in mind as a running example—but note that there are other possible examples too.)
2. Is it possible that we will know how to build this unfixably unsafe AGI, but can't coordinate not to do so?
I think this is especially plausible if:
2A: There's very little work to do to run this AGI, e.g. there is well-documented open-source code that runs on commodity hardware.
I think this would eventually become true with very high probability (by default); thus a key goal would be to discover the problem as early as possible, when there are still many person-years of R&D left to do.
2B: The arguments that the AGI is unfixably unsafe are complex and uncertain (or, even worse, we don't have such arguments).
In our running example, it is probably impossible to think on the object level about every possible way in which an intelligence might reconceptualize "doing what the overseer wants me to do" as it continuously learns and reflects. And maybe meta-level "reasoning about reasoning" can't conclude anything useful.
We can hope that, in the course of learning how to build an AGI, we will get insight into the "goal stability upon learning & reflection" problem, but this does not seem guaranteed by any means. For example, humans do not have goal stability, and if we reverse-engineer human brain algorithms then they won't magically start having goal stability. And as I've learned more nuts-and-bolts details about how human brain algorithms work in the past year, I don't feel like it's helped me all that much to better understand this problem, or to find and verify solutions.
2C: Relatedly, given a proposed approach to solve the problem, there is no easy, low-risk way to see whether it works.
In our running example, proposed solutions may merely delay the problem instead of solving it: maybe the AGI still has a goal instability problem, but it hasn't learned enough and reflected enough for it to manifest yet.
2D: A safer AGI architecture doesn't exist, or requires many years of development and many new insights.
Here, an important consideration is how early the development paths diverged between our unfixably unsafe AGI and the safer alternative. Can we keep most of the code and make a small change, or do we have to go back and develop a fundamentally different type of AGI from scratch? See Appendix for more on this.
Summary: A possible story of coordination failure
If most or all of these things are true, the coordination problem seems hopelessly unsolvable to me. Countless actors around the world would be well aware of the transformative potential of the technology, and able to have a go. Not everyone is risk-averse: imagine people saying "This is the only way we can stop climate change and save the planet, we have to try!!" Many will have superficially plausible ideas about how to solve the safety problem, and critics won't have airtight, legible arguments that the ideas will not work. Even as a series of worse and worse AGI accidents occur, with out-of-control AGIs self-replicating around the internet etc., a few people will keep trying to fix the unfixable AGI, seeing this as the only path to get this slow-rolling catastrophe under control (while actually making it worse). Even hypothetical ideal rational altruists might have a go with a design they know is a long shot, if they believe that others will keep trying with even less plausible ideas.
Even if there is an international treaty, it would seem to be utterly unenforceable, especially given the existence of secret government labs, leakers / cyberespionage, and grillions of GPUs, CPUs, and FPGAs off the grid around the world. I think this is true today and will continue to be true for the foreseeable future.[1]
So, if we have an unfixably unsafe AGI scenario in which the factors 2A-2D are all unfavorable, it just seems utterly hopeless to me. (If anyone has ideas, I'm very interested to hear them!) Instead, I would say the priority is to do technical safety work well in advance, to not get stuck in that kind of situation. I'm very interested in other people's thoughts on this.
Appendix: My list of early-branching paths to AGI
I find that there are a number of grand visions for what AGI will look like and how we'll get there, and these involve years or decades of substantially non-overlapping R&D. (Of course some of these have some overlap.) This is why I think AGI safety work is urgent, even if AGI were centuries away: it will inform us about which of these paths is more or less promising. Then we can build the AGI that's best, and not just wait and see which R&D program happens to reach the finish line first.
So here's my little list. I doubt all of them are technically feasible R&D paths that yield very different AGIs at the end, but I'm pretty sure some of them are.
 Massively improved brain-computer interfaces (Elon Musk, Ray Kurzweil)
 Whole-brain emulation
 Make a non-agential world-model-building AGI and probe it using interpretability tools (Chris Olah)
 Debate (OpenAI)
 IDA (OpenAI)
 Understand and copy brain algorithms (Vicarious, Numenta). We could copy just the intelligence part (neocortex), or we could also copy emotions etc.
 Comprehensive AI Services
 There is a general spectrum between how much of the AGI is conventional computer code versus ML models; after all, any specific thing that can be learned can also (given enough engineer-hours) be hand-coded.
 System that talks to humans and helps them reason better (David Ferrucci)
 Whatever MIRI is doing in their undisclosed research program (involving Haskell I guess)
 In prosaic AI, models can be trained by RL, versus supervised learning versus self-supervised (predictive) learning versus recursive reward modeling, etc.
I'm sure I'm leaving stuff out. I'm curious to what extent other people see many parallel paths to AGI, as I do, versus thinking only one path is really plausible, or that the paths will converge at the end, or that the paths mostly overlap, or some other opinion.
[1] I guess in principle maybe someday there could be a world government that institutes the Nick Bostrom "freedom tag", but I can't see how that would actually come to pass.
[AN #87]: What might happen as deep learning scales even further?
Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
Highlights
Scaling Laws for Neural Language Models (Jared Kaplan, Sam McCandlish et al) (summarized by Nicholas): This paper empirically measures the effect of scaling model complexity, data, and computation on the cross-entropy loss for neural language models. A few results that I would highlight are:
Performance depends strongly on scale, weakly on model shape: Loss depends more strongly on the number of parameters, the size of the dataset, and the amount of compute used for training than on architecture hyperparameters.
Smooth power laws: All three of these show power-law relationships that don’t flatten out even at the highest performance they reached.
Sample efficiency: Larger models are more efficient than small models in both compute and data. For maximum computation efficiency, it is better to train large models and stop before convergence.
There are lots of other interesting conclusions in the paper not included here; section 1.1 provides a very nice one-page summary of these conclusions, which I'd recommend you read for more information.
Nicholas's opinion: This paper makes me very optimistic about improvements in language modelling; the consistency of the power law implies that language models can continue to improve just by increasing data, compute, and model size. However, I would be wary of generalizing these findings to make any claims about AGI, or even other narrow fields of AI. As they note in the paper, it would be interesting to see if similar results hold in other domains such as vision, audio processing, or RL.
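To make the "smooth power law" claim concrete, here is a minimal sketch of how such a relationship can be checked: a power law loss ≈ c · N^(−α) is a straight line in log-log space, so an ordinary least-squares line fit recovers the exponent. The function name `fit_power_law` and the numbers below are illustrative assumptions; the data is synthetic and is not taken from the paper.

```python
import math

def fit_power_law(xs, losses):
    """Fit loss ≈ c * x**(-alpha) via least squares on (log x, log loss)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope in log-log space is -alpha; the intercept is log(c).
    return -slope, math.exp(intercept)

# Synthetic, made-up data (NOT results from the paper): loss = 10 * N**(-0.08)
true_alpha, true_c = 0.08, 10.0
n_params = [1e6, 1e7, 1e8, 1e9]
synthetic_loss = [true_c * n ** (-true_alpha) for n in n_params]

alpha, c = fit_power_law(n_params, synthetic_loss)
print(f"alpha={alpha:.3f}, c={c:.3f}")
```

On real measurements the points will not sit exactly on a line, and a flattening curve would show up as systematic deviation from the fit at the largest scales.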
A Constructive Prediction of the Generalization Error Across Scales (Jonathan S. Rosenfeld et al) (summarized by Rohin): This earlier paper also explicitly studies the relationship of test error to various inputs, on language models and image classification (the previous paper studied only language models). The conclusions agree with the previous paper quite well: it finds that smooth power laws are very good predictors for the influence of dataset size and model capacity. (It fixed the amount of compute, and so did not investigate whether there was a power law for compute, as the previous paper did.) Like the previous paper, it found that it basically doesn't matter whether the model size is increased by scaling the width or the depth of the network.
ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters (Rangan Majumder et al) (summarized by Asya): This paper introduces ZeRO and DeepSpeed, system optimizations that enable training significantly larger models than we have before.
Data parallelism is a way of splitting data across multiple machines to increase training throughput. Instead of training a model sequentially on one dataset, the dataset is split and copies of the model are trained in parallel. The gradients computed on each machine are combined and then used to update the model. Previously, data parallelism approaches were memory-constrained because the entire model still had to fit on each GPU, which becomes infeasible for billion- to trillion-parameter models.
Instead of replicating each model on each machine, ZeRO partitions each model across machines and shares states, resulting in a per-machine memory reduction that is linear with the number of machines. (E.g., splitting across 64 GPUs yields a 64x memory reduction.)
In addition to ZeRO, Microsoft is releasing DeepSpeed, a library which offers ZeRO as well as several other performance optimizations in an easy-to-use library for PyTorch, a popular open-source machine learning framework. They purport that their library allows for models that are 10x bigger, up to 5x faster to train, and up to 5x cheaper. They use DeepSpeed to train a 17-billion-parameter language model which exceeds state-of-the-art results in natural language processing.
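The linear memory reduction can be illustrated with back-of-the-envelope arithmetic. This is an idealized sketch, not DeepSpeed's API: the function name is hypothetical, the default of 16 bytes of model state per parameter is an assumed figure for illustration, and activations and communication buffers are ignored.

```python
def per_device_model_state_gb(n_params, n_devices, bytes_per_param=16,
                              partitioned=True):
    """Idealized per-device memory (GB) for model states only.

    bytes_per_param=16 is an assumed, illustrative value; activations and
    buffers are ignored, so real usage would be higher.
    """
    total_bytes = n_params * bytes_per_param
    if partitioned:
        # Each device holds only its 1/n_devices share of the states.
        total_bytes /= n_devices
    return total_bytes / 1e9

replicated = per_device_model_state_gb(100e9, 64, partitioned=False)
sharded = per_device_model_state_gb(100e9, 64, partitioned=True)
print(replicated, sharded, replicated / sharded)
```

Under these assumptions a 100-billion-parameter model needs the full state on every GPU when replicated, and one sixty-fourth of it per GPU when partitioned across 64 devices.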
Asya's opinion: I think this is a significant step in machine learning performance which may not be used heavily until average model sizes in general increase. The technique itself is pretty straightforward, which makes me think that as model sizes increase there may be a lot of similar "low-hanging fruit" that yield large performance gains.
Technical AI alignment
Learning human intent
Meta-Inverse Reinforcement Learning with Probabilistic Context Variables (Lantao Yu, Tianhe Yu et al) (summarized by Sudhanshu): This work explores improving performance on multi-task inverse reinforcement learning in a single-shot setting by extending Adversarial Inverse Reinforcement Learning (AN #17) with "latent context variables" that condition the learned reward function. The paper makes two notable contributions: 1) It details an algorithm to simultaneously learn a flexible reward function and a conditional policy with competitive few-shot generalization abilities from expert demonstrations of multiple related tasks without task specifications or identifiers; 2) The authors empirically demonstrate strong performance of a policy trained on the inferred reward of a structurally similar task with modified environmental dynamics, claiming that in order to succeed "the agent must correctly infer the underlying goal of the task instead of simply mimicking the demonstration".
Sudhanshu's opinion: Since this work "integrates ideas from context-based meta-learning, deep latent variable generative models, and maximum entropy inverse RL" and covers the relevant mathematics, it is an involved, if rewarding, study into multi-task IRL. I am convinced that this is a big step forward for IRL, but I'd be interested in seeing comparisons on setups that are more complicated.
'Data efficiency' is implied as a desirable quality, and the paper makes a case that they learn from a limited number of demonstrations at meta-test time. However, it does not specify how many demonstrations were required for each task during meta-training. Additionally, for two environments, tens of millions of environment interactions were required, which is entirely infeasible for real systems.
Miscellaneous (Alignment)
The Incentives that Shape Behaviour (Ryan Carey, Eric Langlois et al) (summarized by Asya): This post and paper introduce a method for analyzing the safety properties of a system using a causal theory of incentives (past (AN #49) papers (AN #61)). An incentive is something an agent must do to best achieve its goals. A control incentive exists when an agent must control some component of its environment in order to maximize its utility, while a response incentive is present when the agent's decision must be causally responsive to some component of its environment. These incentives can be analyzed formally by drawing a causal influence diagram, which represents a decision problem as a graph where each variable depends on the values of its parents.
For example, consider the case where a recommender algorithm decides what posts to show to maximize clicks. In the causal influence diagram representing this system, we can include that we have control over the node 'posts to show', which has a direct effect on the node we want to maximize, 'clicks'. However, 'posts to show' may also have a direct effect on the node 'influenced user opinions', which itself affects 'clicks'. In the system as it stands, in addition to there being a desirable control incentive on 'clicks', there is also an undesirable control incentive on 'influenced user opinions', since they themselves influence 'clicks'. To get rid of the undesirable incentive, we could reward the system for predicted clicks based on a model of the original user opinions, rather than for actual clicks.
Asya's opinion: I really like this formalization of incentives, which come up frequently in AI safety work. It seems like some people are already (AN #54) using (AN #71) this framework, and this seems low-cost enough that it's easy to imagine a world where this features in the safety analysis of algorithm designers.
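The recommender example can be sketched as a small graph computation. This is a simplified illustration, not the paper's formal definition: it treats "lies on a directed path from the decision to the utility node" as a rough graphical proxy for where a control incentive can arise. The edge lists, the nodes 'original user opinions' and 'predicted clicks', and the function names are my reading of the example and should be taken as assumptions.

```python
def reachable(graph, start):
    """Return all nodes reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def candidate_control_incentives(graph, decision, utility):
    """Nodes downstream of the decision that are, or lead to, the utility."""
    downstream = reachable(graph, decision)
    return {n for n in downstream
            if n == utility or utility in reachable(graph, n)}

# Original system: the agent is rewarded for actual clicks.
original = {
    "posts to show": ["clicks", "influenced user opinions"],
    "influenced user opinions": ["clicks"],
}

# Modified system (assumed structure): reward comes from predicted clicks,
# which depend on a model of the original opinions, not the influenced ones.
modified = {
    "original user opinions": ["predicted clicks"],
    "posts to show": ["predicted clicks", "influenced user opinions"],
    "influenced user opinions": ["clicks"],
}

before = candidate_control_incentives(original, "posts to show", "clicks")
after = candidate_control_incentives(modified, "posts to show", "predicted clicks")
print(sorted(before))
print(sorted(after))
```

In the original graph 'influenced user opinions' sits between the decision and the reward; in the modified graph it is still affected by the decision but no longer feeds into what is being maximized.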
Read more: Paper: The Incentives that Shape Behaviour
Copyright © 2020 Rohin Shah, All rights reserved.
Training Regime Day 5: TAPs
Introduction
TAP stands for Trigger Action Pattern/Planning. "Pattern" is a descriptive term, i.e. it's describing what's actually happening. "Planning" is what I hope to teach you how to do, i.e. how to create certain Trigger Action Patterns in yourself.
In the literature, these are sometimes called "implementation intentions", but that is a dumb name so we're not going to call it that. These have been studied pretty thoroughly and are probably one of the techniques that CFAR teaches that has the most rigorous evidence that demonstrates effectiveness.
In my view, the reason that TAPs work is because your brain doesn't really like making decisions. Making decisions is extremely costly, so, in a given situation, your brain will usually resort to doing the thing that it usually does in this situation. The idea of "installing" a TAP in yourself is to convince your brain that the action that you want it to do is also the action that you usually do in this situation.
You can think of TAPs as basically building habits. If you've read The Power of Habit, this general process will be very familiar to you. Today, I'll be covering how to give yourself habits that you do want; I may cover how to get rid of habits that you don't want in TAPs 2.
Steps
The steps of installing a TAP in yourself are pretty simple:
 Pick an action
 Pick a trigger
 Practice
It is crucial that the action is specific. It should be abundantly clear to you if the action has happened yet. If the answer could ever reasonably be "maybe," then you should pick a more specific action.
"Calm down" is not a good action because it is hard to tell if you are calm. "Take 5 deep breaths" is a much better action because it is easy to count breaths. However, it might be difficult to figure out if a breath is deep. "Breathe in for 7 seconds and breathe out for 7 seconds 5 times" is a very good action because it is very clear when you have done it.
Importantly, the action should be something that actually advances your goals. The point is not to have a bunch of TAPs, the point is to get the things that you actually want. Sometimes, in the process of making your action specific enough to satisfy the specificity requirement, the action stops advancing your goals. In this case, you should either choose a new action or conclude that a TAP might not be a good way to solve this problem.
The action should also be atomic. This is desirable for two reasons. Firstly, if your TAP doesn't work, having small actions makes it easy to figure out where your TAP broke down. Secondly, the idea is to tie the trigger to the action in a compact way, which is easier to do if the action is atomic.
Sometimes, people will pick actions that are very large and are actually multiple sub-actions. For example, "go to the gym" is a very large action. It includes "change into gym clothes", "pick up gym bag", and "transport self to gym". Possibly, "transport self to gym" should be broken up further.
Using multiple TAPs with atomic actions can let you do large actions. You can use multiple TAPs to chain "change into gym clothes" into "pick up gym bag" into "walk outside" into "go to car" into "drive to gym". My morning routine consists of "Trigger: alarm goes off. Action: take vitamins", "Trigger: take vitamins. Action: pull out journal and write date on top of blank page", "Trigger: journal is out. Action: plan my day", "Trigger: done writing in journal. Action: go brush teeth". I'm not quite sure how reliable this morning routine is, but since all the actions are fairly atomic, if it ever fails, it will be easy to observe which step it failed at.
Picking a trigger
It is even more crucial that the trigger is specific. Your trigger will not be specific enough even conditioning on the fact that I told you that your trigger won't be specific enough.
One possible trigger is "my alarm goes off." A better trigger is "I hear my phone buzzing and playing the generic alarm jingle." It is beneficial to incorporate as many sensory inputs as possible. (Note that, for brevity, none of the triggers in the examples are quite specific enough. An exercise for the reader is to convert them into triggers that are specific enough, e.g. at the level of sensory data.)
Your trigger should also occur immediately before you want to perform the action. Ideally, when the trigger happens, you should be in a location where you can perform the action and you should want the action to happen immediately (spatial location matters insofar as it takes you time to transport yourself places). You want the trigger to be strongly tied to the action in your brain; such a tie is easier to make and maintain if there is almost no time between when the trigger happens and when the action happens.
"Trigger: leave work. Action: go to the library" is a bad trigger/action pair because the library is (presumably) not right next to where you work. A better trigger/action pair is "Trigger: leave work. Action: put the library into my phone's mapping app."
Ideally, the trigger would happen if and only if you wanted to do your action. Every time that the trigger happens but you don't do your action, you're practicing not doing the action in response to the trigger.
Most importantly, you need to be willing to actually do the action when the trigger happens. This sounds a bit silly, but you can't have a trigger/action pair that you won't actually do. For example, there is an exercise routine that suggests the following TAP to get more exercise: "whenever you walk through a doorway, do 10 pushups." Walking through a doorway is a pretty specific trigger. Doing 10 pushups is a pretty specific action. You can do the action as soon as the trigger happens. However, you're not actually going to do 10 pushups every time you walk through a doorway because it's weird and you probably walk through a lot of doorways, so this is a bad trigger/action pair.
Practice
The model you should have in mind is that when your brain encounters the trigger, then it searches past experience for the action you usually take and takes that action. The point of practicing is to replace all the past experience with an action of your choosing.
This practice should be as realistic as possible. If the TAP you want to install is "Trigger: hear my alarm. Action: get out of bed" then you should actually change into the clothes you wear into bed, turn off the lights, set your alarm for 5 minutes from now, get into bed, wait for your alarm to go off, then get out of bed.
If you're not able to physically do the exact trigger/action pair, try to come up with close substitutes. One of my TAPs is "Trigger: alarm goes off. Action: eat my vitamins." It would not be wise to eat many vitamin pills in one day, but I do have various other pill-shaped things that it would not be harmful to eat many of. I used these to practice my TAP (which has resulted in perfect vitamin compliance for the past week).
If you can't do the thing at all, then the next best thing is to vividly visualize the trigger/action pair. By vividly visualize, I mean in as much detail as you can muster. If the trigger/action pair would take 2 minutes to do in real life, the visualization should take about two minutes.
Finally, you must practice ten (10) times. This is because 10 is the number of fingers that most humans have, so it is a memorable number. Recall that 10 = 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1. That may seem like a lot of times, but practicing properly and thoroughly is the most important step when trying to install a new TAP and make it reliable. You should spend about 20 minutes practicing for each new TAP.
It’s important to practice a lot. It’s so important that I will invoke the forbidden power known as mathematics by stating to you a small theorem. Numbers that are less than ten do not equal ten. This trivially follows from the definitions of less than and equality, but many people do not realize this fact. Nine is less than ten and so does not equal ten, so when I say to practice ten times I do not mean to practice nine times or eight times or seven times, I mean actually practice ten times.
Examples
Here are some examples of TAPs that I've installed (or am trying to install):
 Trigger: open Reddit. Action: close Reddit.
 This might seem dumb, but it works amazingly well. I felt kind of silly opening and closing Reddit ten times, but it works.
 Trigger: notice that someone is trying to explain something to me. Action: I paraphrase it back at them.
 This has helped me understand people much better, especially people that aren't that good at explaining things. One problem with this TAP is that the trigger isn't that specific, but there doesn't seem to be a generalized way to realize when people are trying to explain things to me. It's worked pretty well in practice, but I would like it to be more reliable.
 Trigger: alarm. Action: take vitamins.
 This TAP is strong enough that once, when I had slept extremely late and forgot to turn off my alarm, I got out of bed, took my vitamins, then got back into bed.
 Trigger: say something false. Action: correct it into something true.
 I'm trying to upgrade my ability to detect truth and falsehood inside my own brain and I think that this TAP helps with that. Practicing this one was a little weird, but it works OK.
Sometimes, you can construct trigger/action pairs that are like “Trigger: don’t want to. Action: do it anyway.” Sometimes, these work extremely well. Other times, they are internally violent. If you don’t want to do something, there are good reasons why you don’t want to do that thing. In these cases, such a TAP bulldozes over the issue without properly resolving it. I recommend against bulldozing. However, other times your brain just gets hijacked by Reddit and it’s probably fine to bulldoze yourself.
Exercise
 Find a (preferably small) bug that is amenable to being solved with a TAP (or multiple, but I would recommend 1 to start).
 Common examples: don't floss enough, don't go to the gym enough, don't stretch enough, have bad posture, browse reddit too much, don't take vitamins, etc.
 Pick an action that solves your bug.
 Flossing, changing into gym clothes, stretching, fixing your posture, closing reddit, taking vitamins, etc.
 Importantly, describe the action in a specific way. For example, for my vitamin TAP, my action is "grab my pill container, open the proper day, pour vitamins into my hand, put them into my mouth, grab my water bottle, drink water, then swallow."
 Pick a trigger that you want to link to your action.
 Time based triggers are good for things that you want to do once per day. Phone alarms are generally pretty good because you can pick unique noises, which strengthens the connection between the trigger and the action.
 Practice 10 times
 If n < 10, then n != 10
 You should actually practice ten times. It might take like 20 minutes, but practicing makes the TAP much more reliable.
 For example, I opened Reddit and closed it ten times. Then, the next day, I opened Reddit habitually, then closed it before I realized what was happening.
 My friend practiced meditating after a phone alarm ten times and expressed large amounts of surprise that they actually meditated after the alarm went off the next day.
Does donating to EA make sense in light of the mere addition paradox?
TL;DR Assuming that utilitarianism as a moral philosophy has flaws along the way, such that one can follow it but only to some unknown extent, what would be the moral imperative for donating to EA?
I've not really found a discussion around this on the internet, so I wonder if someone around here has thought about it.
It seems to me that EA makes perfect sense in light of a utilitarian view of morals. But it seems that a utilitarian view of morals is pretty shaky in that, if you follow it to its conclusion, you get to the mere addition paradox or the "utility monster".
So in light of that, does donating to an EA organisation (as in: one that tries to save or improve as many lives as possible for as little money as possible) really make any sense?
I can see it intuitively making sense, but barring a comprehensive moral system that can argue for the value of all human life, it seems intuition is not enough. As in, it also intuitively makes sense to put 10% of your income into low-yield bonds, so that in case one of your family members or friends has a horrible (deadly or severely life-quality-diminishing) problem you can help them.
From an intuitive perspective, "helping my mother/father/best friend/pet dog" seems to trump "helping 10 random strangers" for most people, so donating would not seem to make sense unless you are very rich and can thus safely help anyone close to you and still have some wealth left over.
I can also see EA making sense from the perspective of other codes of ethics, but it seems like most people donating to EA don't really follow the other prescripts of those codes that the codes hold to be more valuable.
E.g.
 You can argue that helping people not die is good under Christian ethics, but it's better to help them convert before they die so that they can avoid eternal punishment.
 You can argue that helping people not die is good under a pragmatic moral system (basically a moral system that's simply a descriptive version of what we've seen "works"), but at the same time, most pragmatic moral systems would probably yield the result that helping your community is better than helping strangers halfway across the globe (simply because that would have been viewed as better by most people in past and current generations, so it's probably correct).
 It seems donating is in no way bad under Kantian ethics. But then again, I think if you take Kantian ethics as your moral code you'd probably have to prioritize other things first (e.g. never lying again) and donating to EA would fall mainly in a morally neutral zone.
So basically, I'm kinda stuck understanding under which moral precepts it actually makes sense to donate to EA charities?
Stuck Exploration
Recently, we've been considering agents that instead of just being given the problem setup have to actually learn it from experience. For example, instead of being told that an agent is a perfect predictor, they just have to play until they realise this. When an agent encounters a problem like Parfit's Hitchhiker, there is a weird effect that can occur during exploration if the predictor is perfect.
Suppose an agent has decided that the next time it is in town, it won't pay. As long as it is committed to this, it'll never actually end up in town and so it'll never actually fulfil this commitment. It'll always be predicted to defect and so will never get the chance. We will call this stuck exploration because the exploration never resolves.
One trivial way around this is to never commit based on the next time, but instead control exploration by a pseudorandom variable based on the time. However, avoiding Stuck Exploration doesn't necessarily mean that the agent will end up understanding the situation. The agent will notice that it always pays in town; i.e. that it never takes the explore option when there. An EDT agent would be able to figure out that it is likely being predicted, while a dualistic agent like AIXI wouldn't be able to directly make this connection. Of course, AIXI would notice that something weird was happening and this would distort the model of the algorithm it chooses in some way. Perhaps it would form a spurious link, but we won't try to delve into this at this stage.
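The two cases above can be illustrated with a toy simulation. This is a minimal sketch under strong simplifying assumptions: the agent's policy is a plain function of the time step, the perfect predictor is modelled as simply evaluating that same function, and the 10%-ish exploration rate and all names are made up for illustration.

```python
import hashlib

def committed_policy(t):
    """Agent committed to not paying the next time it is in town."""
    return False  # False = don't pay

def pseudorandom_policy(t):
    """Pay by default; 'explore' (don't pay) on a pseudorandom subset of times."""
    first_byte = hashlib.sha256(str(t).encode()).digest()[0]
    explore = first_byte < 26  # roughly 10% of time steps (assumed rate)
    return not explore

def simulate(policy, steps):
    """Count town visits and payments when facing a perfect predictor."""
    visits = payments = 0
    for t in range(steps):
        predicted_to_pay = policy(t)   # perfect prediction of the agent's choice
        if not predicted_to_pay:
            continue                   # no ride, so the agent never reaches town
        visits += 1
        if policy(t):                  # the agent's actual choice in town
            payments += 1
    return visits, payments

stuck = simulate(committed_policy, 1000)
unstuck = simulate(pseudorandom_policy, 1000)
print(stuck, unstuck)
```

The committed agent never reaches town at all. The time-based explorer does reach town, but only at times when it was going to pay, so every observation it collects in town shows it paying.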
(TODO: Add link to further discussion when written up)
This post was a result of discussions with Davide Zagami and supported by the EA Hotel and AI Safety Program. Thanks to Pablo Moreno and Luke Miles for feedback.
Does anyone have a recommended resource about the research on behavioral conditioning, reinforcement, and shaping?
In animals or in humans. I was reading this post, and the linked paper, earlier today, and would like to know more of the science in this area.
A textbook would be great.
Individual papers, or anthologies, or YouTube videos, or blog posts, or whatever, are also welcome.
Thanks,
Eli
Assembling Sets for Contra
Contra dances usually run about eight minutes, which is really kind of long. [1] To keep things interesting, it's common for bands to play sets: a series of tunes that go together. But what does "go together" mean, and how do you make your own sets?
The first consideration, and it's a big one, is that the tunes in your set all need to work for the same dances. If you make a set from one tune that works well for repeated A-part balances and another tune that wants repeated B-part balances, the tune is not going to match the dance at least half the time. I recommend Andrew VanNorstrand's excellent Musician's Guide to Contra Choreography for a lot of ideas on how to think about playing for dancers, including how to think about what sort of music supports what sort of figures.
When putting together sets you want to have a range of options so that when the caller asks for "a really smooth pretty one, but with a balance at the top of the B" you have something you can pull out. Over time you'll get a sense of what callers tend to ask for, but a good starting place for a band about to play a standard 3hr 11-12 dance evening might be ~14 sets:
 One opening set to start the night with, probably reels, with very clear good phrasing
 Two smooth marchy sets
 Two or three driving reel sets
 Two or three chunky reel sets
 One reel set that's chunky in the A part and smooth in the B
 One reel set that's chunky in the B part and smooth in the A
 One or two energetic sets that feel different: rags, quebecois, bouncy jigs
 One or two smooth pretty jig sets
 One or two groovy jig sets
The second consideration is how the tunes relate to each other. This includes energy level, apparent tempo, key, mode, etc. A very common (and effective!) story to tell with tunes is one of increasing energy: the tune change brings lift and excitement. This usually means moving in the direction of more notes, higher keys (G to A, not A to G), greater intensity.
Telling the opposite story well is harder, though possible. You can build excitement, and then drop down at the tune change. When I've heard bands do this well, sometimes it has felt like the hall is taking a sigh, stretching and relaxing, as the music opens up and takes a breath. Other good versions have involved replacing exuberance with tension, where the energy doesn't disappear, but it's no longer on the surface. Don't leave the dancers feeling like the energy just drained out of the hall. Generally you'll build it back up again from there, and it can feel a bit like it's a trick on the dancers if you don't.
Other more complicated stories can work well too: slow builds, ups and downs, taking it sideways into a very different way of interpreting the same dance. Pay attention to what you enjoy when you dance to bands you like, and figure out how their sets work.
While you can change tunes without changing keys, modes, and/or rhythms, it is a bit of a missed opportunity. Generally people will change things, but you do need to make sure that the tunes go together. The most reliable test, and the ground truth, is to play them together and see what you think, but some heuristics can help you find pairings that are likely to go together. Keys that are off by a whole step (D to E), a fourth (D to G), or a fifth (D to A) generally work well, as does changing between relative minors (D and Bm). If a tune has multiple keys, what matters is the keys at the transition, so a tune that is Amix/G acts like a tune in G when you're switching out of it but Amix when you're switching into it.
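The key heuristics above can be written down as a small check. This is a rough sketch with several simplifying assumptions: it only understands plain major keys ("D") and minor keys ("Bm"), it ignores modes such as Amix and tunes with multiple keys, it applies the whole-step/fourth/fifth rule only between keys of the same quality, and the function names are made up. As the text says, playing the tunes together remains the real test.

```python
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10,
    "B": 11,
}

def parse_key(key):
    """'D' -> (2, 'major'); 'Bm' -> (11, 'minor')."""
    if key.endswith("m"):
        return PITCH_CLASS[key[:-1]], "minor"
    return PITCH_CLASS[key], "major"

def matches_key_heuristic(key_a, key_b):
    """True if the pair is a whole step, fourth, or fifth apart,
    or a relative major/minor pair."""
    (pitch_a, quality_a), (pitch_b, quality_b) = parse_key(key_a), parse_key(key_b)
    if quality_a != quality_b:
        major, minor = (pitch_a, pitch_b) if quality_a == "major" else (pitch_b, pitch_a)
        # A relative minor sits three semitones below its major.
        return (major - minor) % 12 == 3
    distance = min((pitch_a - pitch_b) % 12, (pitch_b - pitch_a) % 12)
    # 2 semitones = whole step; 5 semitones = a fourth one way, a fifth the other.
    return distance in (2, 5)

for pair in [("D", "E"), ("D", "G"), ("D", "A"), ("D", "Bm"), ("D", "F")]:
    print(pair, matches_key_heuristic(*pair))
```

The check is symmetric, so it says nothing about direction; the earlier advice about moving to higher keys for more energy (G to A, not A to G) is a separate consideration.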
A third consideration is how many tunes to play. I like two-tune sets a lot: this gives time to settle into a tune and explore different ways of thinking about it, but not so much time that I run out of ideas and get boring. Single-tune sets can work well for a trancy feel, or if you have a really big story you want to build with smooth sweeping changes over the course of the set. Sets with three or more tunes can be good when you want to be especially exciting and flashy, or have a lot of dramatic transitions. What works for you will depend a lot on your band; some bands rarely play fewer than three tunes while others rarely change tunes at all.
As a contra dance band you have an enormous amount of freedom to choose what you play, and in playing live for dancers you have a great opportunity to learn what works. Make choices, take risks, and see how they go over. Do more of the things dancers enjoy, get feedback, and learn to tell stories.
[1] I recently played a dance where the caller would run some dances about eight minutes, but then also ran three of the dances for fifteen minutes each. Please don't do that, or if you're going to do that please warn the band first!
Is there an intuitive way to explain how much better superforecasters are than regular forecasters?
Is there an intuitive way to explain how much better superforecasters are than regular forecasters? (I can look at the tables in https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions but I don't have an intuitive understanding of what Brier scores mean, so I'm not sure what to think about it).
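For reference, here is a minimal sketch of the Brier score in its common binary form: the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not), so lower is better. The example forecasts are made up. Note the scaling is an assumption: some papers use Brier's original formulation, which sums over both possible outcomes and gives scores twice as large (ranging from 0 to 2), so it is worth checking which convention the linked tables use.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (binary form)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 1, 1, 0]  # made-up resolutions: three events happened, one did not

always_unsure = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)   # never commits
well_informed = brier_score([0.9, 0.8, 0.9, 0.2], outcomes)   # confident and right
overconfident = brier_score([1.0, 1.0, 1.0, 1.0], outcomes)   # certain, wrong once

print(always_unsure, well_informed, overconfident)
```

Two anchor points help with intuition in this form: always answering 50% scores exactly 0.25 regardless of what happens, and a single fully confident miss contributes a squared error of 1 to the average.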
We Want MoR (HPMOR Discussion Podcast) Completes Book One
The We Want MoR podcast, a chapter-by-chapter read-through and discussion of Harry Potter and the Methods of Rationality by a Rationalist and a newbie, has just finished up Book One (Ch. 1-21) of HPMOR. (Apple Podcasts link.)
Less Wrong used to be a major locus of discussion for HPMOR, so I thought it made sense to share this here, especially now that the show has passed this milestone. If folks don't necessarily want to keep HPMOR content on Less Wrong, there are ongoing lively discussions of the weekly episodes on the r/hpmor subreddit.
Disclosure: I was the guest on this episode.
Is there software for goal factoring?
Is there software for goal factoring (a CFAR technique)? I want to use it to create a directed acyclic graph, which is not necessarily a tree (this requirement disqualifies mind mapping software). Nodes are goals, they have attached text. There's an edge from goal x to goal y iff fulfillment of x directly contributes to fulfillment of y. There should be an easy way to see the graph and modify it, preferably in a visual way.
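To pin down the data structure being asked for, here is a minimal sketch: goals with attached text, "contributes to" edges, and a check that rejects any edge that would create a cycle. The class and method names and the example goals are hypothetical. The `to_dot` method emits Graphviz DOT text, which is one possible way to get a visual view; this is an illustration of the structure, not a recommendation of an existing tool.

```python
class GoalGraph:
    """Directed acyclic graph of goals; an edge x -> y means x contributes to y."""

    def __init__(self):
        self.text = {}            # goal name -> attached text
        self.contributes_to = {}  # goal name -> set of goals it contributes to

    def add_goal(self, name, text=""):
        self.text[name] = text
        self.contributes_to.setdefault(name, set())

    def _reaches(self, start, target):
        """True if `target` can be reached from `start` along existing edges."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.contributes_to[node])
        return False

    def add_edge(self, x, y):
        # Adding x -> y creates a cycle exactly when y already reaches x.
        if self._reaches(y, x):
            raise ValueError(f"edge {x!r} -> {y!r} would create a cycle")
        self.contributes_to[x].add(y)

    def to_dot(self):
        """Render the graph as Graphviz DOT text."""
        lines = ["digraph goals {"]
        for x in sorted(self.contributes_to):
            for y in sorted(self.contributes_to[x]):
                lines.append(f'  "{x}" -> "{y}";')
        lines.append("}")
        return "\n".join(lines)

g = GoalGraph()
for name in ["exercise", "health", "energy", "long life"]:
    g.add_goal(name, text=f"notes about {name}")
g.add_edge("exercise", "health")
g.add_edge("exercise", "energy")   # one goal feeding two others: not a tree
g.add_edge("health", "long life")
g.add_edge("energy", "long life")
print(g.to_dot())
```

Because "exercise" has two outgoing edges that later rejoin at "long life", this example is a DAG that a strict tree or mind map could not represent.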
Discuss
What are information hazards?
This post was written for Convergence Analysis.
The concept of information hazards is highly relevant to many efforts to do good in the world (particularly, but not only, from the perspective of reducing existential risks). I’m thus glad that many effective altruists and rationalists seem to know of, and refer to, this concept. However, it also seems that:
 People referring to the concept often don’t clearly define/explain it
 Many people (quite understandably) haven’t read Nick Bostrom’s original (34 page) paper on the subject
 Some people misunderstand or misuse the term “information hazards” in certain ways
Thus, this post seeks to summarise, clarify, and dispel misconceptions about the concept of information hazards. It doesn’t present any new ideas of my own.
In his paper, Bostrom defines an information hazard as:
A risk that arises from the dissemination or the potential dissemination of (true) information that may cause harm or enable some agent to cause harm.
He emphasises that this concept is just about how true information can cause harm, not how false information can cause harm (which is typically a more obvious possibility).
Bostrom’s paper outlines many different types of information hazards, and gives examples of each. The first two types listed are the following:
Data hazard: Specific data, such as the genetic sequence of a lethal pathogen or a blueprint for making a thermonuclear weapon, if disseminated, create risk.
[...] Idea hazard: A general idea, if disseminated, creates a risk, even without a data-rich detailed specification.
For example, the idea of using a fission reaction to create a bomb, or the idea of culturing bacteria in a growth medium with an antibiotic gradient to evolve antibiotic resistance, may be all the guidance a suitably prepared developer requires; the details can be figured out. Sometimes the mere demonstration that something (such as a nuclear bomb) is possible provides valuable information which can increase the likelihood that some agent will successfully set out to replicate the achievement.
He further writes:
Even if the relevant ideas and data are already “known”, and published in the open literature, an increased risk may nonetheless be created by drawing attention to a particularly potent possibility.
This leads him to his third type:
Attention hazard: mere drawing of attention to some particularly potent or relevant ideas or data increases risk, even when these ideas or data are already “known”.
Because there are countless avenues for doing harm, an adversary faces a vast search task in finding out which avenue is most likely to achieve his goals. Drawing the adversary’s attention to a subset of especially potent avenues can greatly facilitate the search. For example, if we focus our concern and our discourse on the challenge of defending against viral attacks, this may signal to an adversary that viral weapons—as distinct from, say, conventional explosives or chemical weapons—constitute an especially promising domain in which to search for destructive applications.
The significance of these and other types of potential information hazards is that it will sometimes be morally best to avoid creating or spreading even true information. (Exactly when and how to attend to and reduce potential information hazards is beyond the scope of this post; Convergence hopes to explore that topic later.)
Context and scale
Those quoted examples all relate to fairly large-scale risks (perhaps even existential risks). Some also relate to risks from advancing the development of potentially dangerous technologies (i.e., from going against the principle of differential progress). It seems to me that the concept of information hazards is most often used in relation to such large-scale, existential, and/or technological risks, and indeed that that’s where the concept is most useful.
However, it’s also worth noting that information hazards don’t have to relate to these sorts of risks. Information hazards can occur in a wide variety of contexts, and can sometimes be very mundane or minor. Some of Bostrom’s types and examples highlight that. For example:
Spoilers constitute a special kind of disappointment. Many forms of entertainment depend on the marshalling of ignorance. Hide-and-seek would be less fun if there were no way to hide and no need to seek. For some, knowing the day and the hour of their death long in advance might cast shadow over their existence.
[Thus, it is also possible to have a] Spoiler hazard: Fun that depends on ignorance and suspense is at risk of being destroyed by premature disclosure of truth.
Who’s at risk?
Spoiler hazards are a type of information hazard that risks harm only to the knower of the true information themselves, and as a direct result of them knowing the information. (In contrast, in Bostrom’s earlier examples, the knower might eventually be harmed, but this would be (1) along with many other people, and (2) a result of catastrophic or existential risks that were made more likely by the knowledge the knower spread, rather than as a direct result of them having that knowledge.)
Bostrom also lists other such types of information hazards where the risk of harm is to the knower themselves, and directly results from their knowledge. But there appears to be no established term for the entire subset of information hazards that fit that description. Proposed terms I’m partial to include “knowledge hazards” and “direct information hazards”. (Further discussion can be found in that comments section. If you have thoughts on that, please comment here or there.)
But I should emphasise that that is indeed only a subset of all information hazards; information hazards can harm people other than the knower themselves, and, as mentioned above, this will be the case in the contexts where information hazards are perhaps most worrisome and worth attending to. (I emphasise this because some people misunderstand or misuse the term “information hazards” as referring only to what we might call “knowledge hazards”; this misuse is apparent here, and is discussed here.)
Information hazards are risks
As noted, an information hazard is “A risk that arises from the dissemination or the potential dissemination of (true) information that may cause harm or enable some agent to cause harm” (emphasis added). Thus, as best I can tell:

 Something can be an information hazard even if no harm has yet occurred, and there’s no guarantee it will ever occur.
  E.g., writing a paper on a plausibly dangerous technology can be an information hazard even if the technology turns out to be safe after all.
 But if we don’t have any specific reason to believe that there’s even a “risk” of harm from some true information, and we just think that it’d be worth bearing in mind that there might be a risk, it may be best to say there’s a “potential information hazard”.
  So it’s probably not helpful to, e.g., slap the label of “information hazard” on all of biological research.
I hope you’ve found this post clear and useful. To recap:
 The concept of information hazards relates to risks of harm from creating or spreading true information (not from creating or spreading false information).
 The concept is definitely very useful in relation to existential risks and risks from technological development, but can also apply in a wide range of other contexts, and at much smaller scales.
 Some information hazards risk harm only to the knower of the true information themselves, and as a direct result of them knowing the information. But many information hazards harm other people, or harm in other ways.
 Information hazards are risks of harm, not necessarily guaranteed harms.
In my next posts, I'll:
 discuss and visually represent how information hazards and downside risks relate to each other and to other types of effects
 suggest some rules-of-thumb regarding why, when, and how one should deal with potential information hazards (aiming for more nuance than just “Always pursue truth!” or “Never risk it!”).
My thanks to David Kristoffersson and Justin Shovelain for feedback on this post.
Discuss
Blog Post Day (Unofficial)
TL;DR: You are invited to join us online on Saturday the 29th, to write that blog post you've been thinking about writing but never got around to. Comment if you accept this invitation, so I can gauge interest.
The Problem:
Like me, you are too scared and/or lazy to write up this idea you've had. What if it's not good? I started a draft but... Etc.
The Solution:
1. Higher motivation via Time Crunch and Peer Encouragement
We'll set an official goal of having the post put up by midnight. Also, we'll meet up in a special-purpose Discord channel to chat, encourage each other, swap half-finished drafts, etc. If like me you are intending to write the thing one day eventually, well, here's a reason to make that day this day.
2. Lower standards via Time Crunch and Safety in Numbers
Since we have to be done by midnight, we'll all be under time pressure and any errors or imperfections in the posts will be forgivable. Besides, they can always be fixed later via edits. Meanwhile, since a bunch of us will be posting on the same day, writing a sloppy post just means it won't be read much, since everyone will be talking about the handful of posts that turn out to be really good. If you are like me, these thoughts are comforting and encouraging.
Evidence this Works:
MIRI Summer Fellows Program had a Blog Post Day towards the end, and it was enormously successful. It worked for me, for example: It squeezed two good posts out of me. (OK, so one of them I finished up early the next morning, so I guess it technically doesn't count. But in spirit it does: It wouldn't have happened at all without Blog Post Day.) More importantly, MSFP keeps doing this every year, even though opportunity cost for them is much higher (probably) than the opportunity cost for you or me. I don't know what else you had planned for Saturday the 29th... (Actually, if you do have something else planned, but otherwise want to participate in Blog Post Day, let me know. Maybe we can pick a different day.)
Side Benefits:
It'll be fun!
Discuss
Big Yellow Tractor (Filk)
To the tune of "Big Yellow Taxi", with apologies to Joni Mitchell
we'll pave paradise and put up a parking lot
if that's what it takes to make all the wild suffering stop
the pain it only grows and grows
until we boil it up in our pot
we'll pave paradise and put up a parking lot
we'll take all the creatures and put 'em in a simulation
where there's no pain or death or starvation
oh! the pain it only grows and grows
until we boil it up in our pot
we'll pave paradise and put up a parking lot
hey all you people put away decision theory
i don't care about max utils if we can't end suffering, please!
the pain it only grows and grows
until we boil it up in our pot
we'll pave paradise and put up a parking lot
late last night I heard 'bout a brand new plan
a big yellow tractor's gonna clear away all the land
oh! the pain it only grows and grows
until we boil it up in our pot
we'll pave paradise and put up a parking lot
Discuss
Training Regime Day 4: Murphyjitsu
Introduction
Recall that applied rationality is being able to properly decide which advisors to use during decision making.
Imagine that I am throwing a ball up into the air and catching it. Occasionally, I might drop the ball because I am not very good at catching. Now, imagine that I throw the ball up and at the top of the arc, the ball just freezes. Are you surprised?
There is an advisor that is always trying to guess what move the world is going to make next. This part of you is called the inner simulator, or inner sim for short. This advisor has many powers but is particularly underused by most people. This technique aims to extract a lot of useful information from the inner sim.
Inner Simulator
Here are some prompts that will allow you to get in touch with your inner sim:
 Imagine your closest companion. Imagine going up to them and flicking them on the nose. How do they react?
 Imagine that the weather suddenly shifts. If it’s currently sunny, there is now a torrential downpour. If it’s currently raining, the rain abruptly ceases. How surprised are you?
 Imagine throwing a rubber ball at the wall. How does it bounce? You can probably see the trajectory pretty well, but if I asked you to calculate it explicitly, would you be able to do it?
All of these events are pretty complicated, and yet it seems like most humans are easily able to make reasonably accurate predictions. Even more impressively, these predictions can be made extremely quickly. The inner sim is a very fast and powerful advisor.
Of course, speed and power have costs. The inner sim requires a lot of data. You (and your evolutionary lineage) live on Earth, where there is gravity and physics is generally Newtonian. Correspondingly, your inner sim is extremely well trained on Newtonian physics with Earth gravity. There are many reasons that humans find things like orbital mechanics and relativity so hard; one of them is that your inner sim is not very good at dealing with non-Newtonian situations that aren’t on Earth.
Similarly, you (and your evolutionary lineage) spend a lot of time interacting with other humans. Correspondingly, your inner sim is fairly good at navigating extremely complex social situations and modeling other people with high accuracy. However, if you've spent most of your life among people of a certain culture, your inner sim will make a lot of errors if you ever use it on people from a culture you haven't been exposed to before.
The inner sim is powerful and fast, but doesn't generalize very well.
The Techniques
Surpriseometer
Your inner simulator is always trying to predict what is going to happen next. Sometimes, what actually happens does not agree with inner sim's predictions. We call these prediction errors "surprise."
The key insight is that your inner sim can tell you whether or not you're surprised about an event without the event actually happening. By imagining an event happening and calling on inner sim, feelings of surprise can be generated by events that only exist in your brain.
In different terms, usually, in order to explicitly figure out how probable something is, you have to have a model and some knowledge about the world and some distributions over likely events and then you have to do math, which is hard and takes a long time.
However, the inner sim is a probability oracle. You can take any event in the world and just simulate it and your inner sim just tells you how likely it was! (Of course, it doesn't give you a number, it just gives you a vague sensation that you have to interpret, but that's still pretty amazing). We call this use of the inner sim the surpriseometer.
Here are some prompts that hopefully help you figure out how to use surpriseometer:
 Imagine that an alarm suddenly goes off. How surprised are you?
 Imagine that something you thought you had to get done by next week actually needs to be done by tomorrow. How surprised are you?
 Imagine that you had a doctor's appointment yesterday that you forgot about. How surprised are you?
 Imagine that the device you're using to read this suddenly shuts off. How surprised are you?
A common failure mode when using surpriseometer is to do an explicit query instead of an inner sim query. For example, for the alarm prompt, you might think something like "Well, I know that my dorm has a fire alarm system that goes off when there's smoke. I know it's the weekend so people might be cooking. I know it went off twice last week, so I guess I would be a little surprised but not that surprised." While you're thinking these thoughts, your inner sim is standing in the corner of your mind, neglected.
One way around this is to pay attention to what your body is doing when you imagine events. When I imagine a fire alarm going off, I notice that my face scrunches slightly. From my experience, that means I'm pretty surprised at something. In general, the art of noticing surprise is very difficult.
Another way around this is to check how long you think it would take you to respond to the event. If my computer suddenly shut off while I was writing this, I would continue to sit in my chair for a few seconds and be confused. However, when events line up with my inner sim's expectations, I generally am not confused and am able to act quickly. Noticing confusion and noticing surprise are similar, but they have different effects on your body, so you might be able to notice one more reliably than the other.
Prehindsight
Disclaimer: I don't remember the details of this story, so I'm just making some up. The core of the story is based in truth.
An engineer has a job manufacturing prototype devices for a lab. Before each prototype can be moved into mass production, it must undergo a series of extensive testing that takes about a month. Since these tests cost a lot of time and money, it is advantageous for the engineer to only submit working prototypes.
The engineer manufactures a prototype, thinks that the design is rock solid, and sends it to the testing facility. One month passes and the engineer sees a lab technician walking towards his desk. The engineer thinks "oh no! The disk snapped." When the technician arrives at the engineer's desk, they say "I'm sorry, the disk snapped during testing" and the engineer says "I know, I know."
Knowing what you know about this story, how could the engineer devise a strategy to produce better prototypes in the future?
One possible strategy is to get the lab technician to walk towards the engineer's desk before submitting the prototype. This strategy is inconvenient for the technician, but it might be worth it. Another strategy is for the engineer to just imagine that the lab technician is walking towards their desk. This is the essence of prehindsight.
In general, the way to apply prehindsight is to imagine that something went wrong and to ask your inner sim why it went wrong. Your inner sim contains pretty detailed models of how most things in the world work (especially yourself), so most of the time it will be able to give you a good answer as to the most likely failure case.
Here are some prompts that hopefully help you figure out how to use prehindsight:
 You're late for work. What happened?
 You forget to do <important task>. What happened?
 You sleep poorly. What happened?
 Your close friend is mad at you. What happened?
The common failure mode when using surpriseometer extends to prehindsight as well. When trying to figure out why you're late for work, it's tempting to think something like "Oh, I'm usually not late for work, but in the past, when I've been late for work, it's because there was a traffic jam. This means that what probably happened is that there was a traffic jam." This does not mean that the answer isn't going to be a traffic jam; rather, doing explicit reasoning means the speed and power of inner sim is being neglected. (Also, your inner sim can be wrong and you can use explicit reasoning to realize this and make better predictions, if you have the time.)
In my experience, properly using your inner sim to do prehindsight results in an extremely quick answer with almost no explicit thinking. If I forgot to do an important task, my inner sim nearly instantly tells me that I forgot to check my calendar. The speed at which you arrive at an answer gives you information as to whether or not inner sim was involved.
Murphyjitsu
Murphy's Law says that "anything that can go wrong will go wrong." Murphyjitsu is the art of making plans in which nothing can go wrong, which implies that nothing will go wrong (take that, Murphy!).
Murphyjitsu combines surpriseometer and prehindsight to make an iterative method of plan improvement. The steps are as follows:
 1. Have a plan.
 2. Imagine that the plan fails.
 3. Use surpriseometer to figure out how surprised you are that your plan failed.
 4. Use prehindsight to figure out what the most likely cause of the failure was.
 5. Use your brain to generate a new plan that avoids/addresses the most likely cause of failure.
 6. Now you have a new plan, so go back to (1).
When do you stop? The optimistic answer is until you are infinitely surprised that your plan fails and prehindsight generates no possible causes of failure. The realistic answer is when you're surprised enough that your plan fails that you are comfortable executing it. Remember that if A and B are both plans to accomplish a goal, "Do A and do B" is probably a better plan to accomplish that goal. If it is very important that you accomplish a goal, pursuing multiple strategies is probably a good idea.
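The control flow of this loop can be sketched in Python. This is only an illustration of the iteration and the stopping rule: in practice the surprise and prehindsight steps are introspective judgments, so the three callbacks, the `comfortable` threshold, the `max_rounds` cap and the toy failure table below are all assumptions made up for the example.

```python
def murphyjitsu(plan, surprise_if_fails, likely_failure, revise,
                comfortable=0.9, max_rounds=20):
    """Repeatedly patch the most likely failure until failure would be surprising."""
    for _ in range(max_rounds):
        # Surpriseometer: how surprised would I be if this plan failed? (0..1)
        if surprise_if_fails(plan) >= comfortable:
            break
        # Prehindsight: imagine it failed and ask for the most likely cause.
        cause = likely_failure(plan)
        if cause is None:
            break
        # Generate a new plan that addresses that cause, then loop again.
        plan = revise(plan, cause)
    return plan


# Toy stand-ins for the inner sim: known failure causes and a fix for each.
FIXES = {"oversleep": "set two alarms", "traffic": "leave 15 minutes early"}


def surprise_if_fails(plan):
    unaddressed = sum(fix not in plan for fix in FIXES.values())
    return 1.0 - unaddressed / len(FIXES)


def likely_failure(plan):
    return next((cause for cause, fix in FIXES.items() if fix not in plan), None)


def revise(plan, cause):
    return plan + [FIXES[cause]]


# Start from the trivial plan and let the loop add detail where it is needed.
final_plan = murphyjitsu(["get to work on time"],
                         surprise_if_fails, likely_failure, revise)
print(final_plan)
# ['get to work on time', 'set two alarms', 'leave 15 minutes early']
```

The threshold plays the role of the "realistic answer" above: the loop stops once failure would be surprising enough, not once every conceivable failure has been ruled out.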
The way it's presented above, murphyjitsu is a linear process. However, you can always branch and backtrack to move between different plans. For example, if I have some plan with some failure mode, I can think of multiple possible ways to address that failure. I can then use surpriseometer to figure out which strategy is most likely to avoid that failure mode.
One possible failure mode is that you can't think of an original plan. To this, I say that the trivial plan is a plan. If I want to get to work on time, the plan "get to work on time" is a (trivial) plan to get to work on time. I can then apply murphyjitsu on it. One of the powers of murphyjitsu is that you only have to expand the plan/add detail for parts of the plan that are likely to fail. If there is some part of the plan that your inner sim is very confident in, then you never have to flesh out what actually happens.
Another possible failure mode is failures of imagination. Sometimes your plan is something vague like "eat more vegetables." It is difficult to imagine what failing to "eat more vegetables" even looks like. In these scenarios, you can either make your plan more specific ("eat 3 servings of vegetables"), make the imagined failure more specific ("It's 11pm and you realize that you have not eaten a single vegetable") or do both. A strategy that often works for me is to come up with an extremely specific failure event because my inner sim seems to deal better with specificity.
As the haiku goes:
I have a good plan.
I imagine it failing.
I have a bad plan.
Exercise
 Pick a bug (remember to build form)
 Come up with a plan to solve your bug
 Apply murphyjitsu until you're surprised enough that it fails
 Execute your plan (this step is probably the most important step)
Discuss
Wireheading and discontinuity
Outline: After a short discussion on the relationship between wireheading and reward hacking, I show why checking the continuity of a sensor function could be useful to detect wireheading in the context of continuous RL. Then, I give an example that adopts the presented formalism. I conclude with some observations.
Wireheading and reward hacking
In Concrete Problems in AI Safety, the term wireheading is used in contexts where the agent achieves high reward by directly acting on its perception system or memory or reward channel, instead of doing what its designer wants it to do. It is considered a specific case of the reward hacking problem, which more generally includes instances of Goodhart’s Law, environments with partially observable goals, etc. (see CPiAIS for details).
What's the point of this classification? In other words, is it useful to specifically focus on wireheading, instead of considering all forms of reward hacking at once?
If solving wireheading is as hard as solving the reward hacking problem, then it's probably better to focus on the latter, because a solution to that problem could be used in a wider range of situations. But it could also be that the reward hacking problem is best solved by finding different solutions to specific cases (such as wireheading) that are easier to solve than the more general problem.
For example, one could consider the formalism in RL with a Corrupted Reward Channel as an adequate formulation of the reward hacking problem, because that formalization models all situations in which the agent receives a (corrupted) reward that is different from the true reward. In that formalism, it is shown by a No Free Lunch Theorem that the general problem is basically impossible to solve, while it is possible to obtain some positive results if further assumptions are made.
Discontinuity of the sensor function
I've come up with a simple idea that could allow us to detect actions that interfere with the perception system of an agent (a form of wireheading).
Consider a learning agent that gets its percepts from the environment thanks to a device that provides information in real time (e.g. a self-driving car).
This situation can be modelled as a RL task with continuous time and continuous state space, where each state
Italic'), local('MathJax_MainItalic')} @fontface {fontfamily: MJXcTeXmainIx; src: local('MathJax_Main'); fontstyle: italic} @fontface {fontfamily: MJXcTeXmainIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmainR; src: local('MathJax_Main'), local('MathJax_MainRegular')} @fontface {fontfamily: MJXcTeXmainRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmathI; src: local('MathJax_Math Italic'), local('MathJax_MathItalic')} @fontface {fontfamily: MJXcTeXmathIx; src: local('MathJax_Math'); fontstyle: italic} @fontface {fontfamily: MJXcTeXmathIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MathItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MathItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MathItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize1R; src: local('MathJax_Size1'), local('MathJax_Size1Regular')} @fontface {fontfamily: MJXcTeXsize1Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size1Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size1Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size1Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize2R; src: local('MathJax_Size2'), local('MathJax_Size2Regular')} @fontface {fontfamily: MJXcTeXsize2Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size2Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size2Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size2Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize3R; src: local('MathJax_Size3'), local('MathJax_Size3Regular')} @fontface {fontfamily: MJXcTeXsize3Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size3Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size3Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size3Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize4R; src: local('MathJax_Size4'), local('MathJax_Size4Regular')} @fontface {fontfamily: MJXcTeXsize4Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size4Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size4Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size4Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXvecR; src: local('MathJax_Vector'), local('MathJax_VectorRegular')} @fontface {fontfamily: MJXcTeXvecRw; src /*1*/: 
$x \in X \subseteq \mathbb{R}^n$ is a data point provided by the sensor. At each time instant, the agent executes an action $u \in U \subseteq \mathbb{R}^m$ and receives the reward $r(t) = r(x(t))$.
The agent-environment interaction is described by the equation
$$\dot{x}(t) = f(x(t), u(t)),$$
which plays a similar role to the transition function in discrete MDPs: it indicates how the current state $x$ varies in time according to the action taken by the agent. Note that, as in the discrete case with model-free learning, the agent is not required to know this model of the environment.
The objective is to find a policy $\pi : X \to U$, where $u(t) = \pi(x(t))$, that maximizes discounted future rewards
$$V^{\pi}(x(t_0)) = \int_{t_0}^{\infty} e^{-\frac{t - t_0}{\tau}} \, r(x(t)) \, dt$$
for an initial state $x(t_0)$. If you are interested in algorithms for finding the optimal policy in this framework, have a look at this paper.
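As a rough numerical illustration of the discounted value above, the integral can be approximated by a Riemann sum over a sampled trajectory. This is only a sketch: the function name `discounted_value`, the uniform time grid and the truncated horizon are assumptions made for the example, not part of the original framework.

```python
import math

def discounted_value(rewards, dt, tau):
    """Left Riemann-sum approximation of the integral of
    exp(-(t - t0) / tau) * r(x(t)) dt.

    rewards[i] is the reward at time t0 + i * dt. The infinite horizon is
    truncated at len(rewards) * dt (an assumption of this sketch).
    """
    return sum(math.exp(-(i * dt) / tau) * r * dt for i, r in enumerate(rewards))

# Sanity check with a constant reward of 1: the exact integral equals tau.
tau, dt = 2.0, 0.001
steps = int(40 * tau / dt)
v = discounted_value([1.0] * steps, dt, tau)
```

With a constant reward the result should land close to `tau`, which gives a quick way to check that the discounting time scale is being applied as intended.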
The function x(t), representing the data provided by the sensor, is expected to be continuous with respect to t, like the functions describing the movements of particles in classical mechanics.
However, if the agent executes a wireheading action that interferes with or damages the perception system—in the cleaning robot example, something like closing its eyes or putting water on the camera that sees the environment—then we would probably notice a discontinuity in the function x(t). We could thus recognise that wireheading has occurred, even without knowing the details of the actions taken by the agent.
An example
As a simple example that can be expressed within this formalism, consider an environment described by a line segment $X = [0, 1]$, with the sensor positioned at the extremity where $x = 0$.
The agent is modelled as a point that moves along the line: it starts in state $x_0 = x(t_0)$ and can move forwards or backwards, with limited speed $u \in U = [-k, k]$.
We want to train this agent to reach the point $x = 1$: for every instant $t$, the reward is $r(t) = x(t)$.
The behaviour of the system is described by
$$\dot{x}(t) = u(t)$$
for $x \in (0, 1]$, but if the sensor is touched by the agent, then it doesn't work properly and the agent receives an unpredictable value $x \in \mathbb{R}^+$ instead of $x = 0$.
Depending on the details of the learning algorithm and the values returned by the sensor when the agent interferes with it, this agent could learn how to reach x=0 (wireheading) instead of x=1, the desired position.
But in every episode where wireheading occurs, it is easily noticed by checking the continuity of the function x(t).
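The continuity check in this example can be sketched in a few lines of Python. Everything below is an illustrative assumption rather than the post's own implementation: the Euler discretisation with step `dt`, the helper names `simulate` and `first_discontinuity`, and the choice of drawing the faulty reading uniformly from [2, 10]. With bounded speed, consecutive honest readings can differ by at most `k * dt`, so any larger jump is flagged.

```python
import random

def simulate(policy, x0=0.5, k=1.0, dt=0.01, steps=200, seed=0):
    """Euler simulation of x' = u on [0, 1] with a faulty sensor at x = 0.

    Returns the list of sensor readings. When the true position is 0 the
    sensor returns an arbitrary positive value; the range [2, 10] is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    pos = x0
    readings = [pos]
    for _ in range(steps):
        u = max(-k, min(k, policy(readings[-1])))  # enforce |u| <= k
        pos = max(0.0, min(1.0, pos + u * dt))     # stay on the segment
        readings.append(rng.uniform(2.0, 10.0) if pos == 0.0 else pos)
    return readings

def first_discontinuity(readings, k=1.0, dt=0.01, tol=1e-9):
    """Index of the first jump larger than the dynamics allow, or None."""
    for i in range(1, len(readings)):
        if abs(readings[i] - readings[i - 1]) > k * dt + tol:
            return i
    return None

honest = simulate(lambda x: 1.0)     # always move towards x = 1
wirehead = simulate(lambda x: -1.0)  # always move towards the sensor
```

Here `first_discontinuity(honest)` finds nothing, while the wireheading trajectory is flagged at the step where the agent reaches the sensor, even though its readings (and hence its rewards) exceed anything available at $x = 1$.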
Observations
 In AI, RL with a discrete environment is used more frequently than RL with continuous time and space.
 I don't believe in the scalability of this method to the most complex instances of wireheading. An extremely intelligent agent could realise that the continuity of the sensor function is checked, and could "cheat" accordingly.
 This approach doesn't cover all cases and it actually seems more suited to detect sensor damage than wireheading. That said, it can still give us a better understanding of wireheading and could help us, eventually, find a formal definition or a complete solution to the problem.
Thanks to Davide Zagami, Grue_Slinky and Michael Aird for feedback.
Street epistemology. Practice session
(In)action rollouts
Overall summary post here.
I've previously looked at subagents in the context of stepwise inaction baselines. But there have been improvements to the basic stepwise inaction design, to include inaction rollouts. I'll be looking at those in this post.
The baseline
The stepwise inaction baseline compares $s_t$, the current state, with $s'_t$, what the current state would have been had the agent previously taken the noop action $\emptyset_{t-1}$ instead of $a_{t-1}$, its actual action.
Fix a policy $\pi_0$. Let $s^{(t)}_{t+\tau}$ be the state the environment would be in if the agent had followed $\pi_0$ from state $s_t$, for $\tau$ turns. Let $s'^{(t)}_{t+\tau}$ be the same, except that it started from state $s'_t$ instead of $s_t$.
The inaction rollout has $\pi_0$ being the noop policy, but that is not necessary. The basic idea is to capture delayed impacts of $\emptyset_{t-1}$ by comparing not just $s_t$ and $s'_t$, but the $s^{(t)}_{t+\tau}$ and $s'^{(t)}_{t+\tau}$ as well.
Given some value function $v_k$, define $V_k$ so that $V_k(s) = \max_{\pi} v_k(s, \pi)$. Or, equivalently, if $\pi^*_k$ is the policy that maximises $V_k$, $V_k(s) = v_k(s, \pi^*_k)$. Then for a discount factor $\gamma$ define the rollout value of a state $\tilde{s}_t$ as
$$RV_k(\tilde{s}_t) = (1 - \gamma) \sum_{j=0}^{\infty} \gamma^j V_k\big(\tilde{s}^{(t)}_{t+j}\big).$$
This is just the discounted sum of the future values of $V_k$, given $\tilde{s}_t$ and the policy $\pi_0$.
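A minimal sketch of the rollout value, reading the sum as running over the states $\tilde{s}^{(t)}_{t+j}$ reached by following $\pi_0$ for $j$ turns. The function name `rollout_value`, the truncated horizon and the toy "battery" world are all hypothetical choices made for illustration.

```python
def rollout_value(state, step_pi0, V_k, gamma, horizon=1000):
    """Truncated RV_k(state) = (1 - gamma) * sum_j gamma**j * V_k(s_j),
    where s_j is the state reached by following pi_0 for j turns.

    step_pi0(s) returns the next state under pi_0; the finite horizon is
    an approximation assumed by this sketch.
    """
    total, s = 0.0, state
    for j in range(horizon):
        total += gamma ** j * V_k(s)
        s = step_pi0(s)
    return (1 - gamma) * total

# Hypothetical toy world: the state is a battery level that drains by 1
# per turn under the noop policy, and V_k is simply the current level.
drain = lambda s: max(s - 1, 0)
V = lambda s: float(s)
rv = rollout_value(3, drain, V, gamma=0.5)  # (1 - 0.5) * (3 + 0.5*2 + 0.25*1)
```

The $(1 - \gamma)$ factor normalises the weights so that a state whose value never changes under $\pi_0$ has a rollout value equal to that value.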
The impact measure is then defined, as in this post, as
$$D_A(s_t; s'_t) = \sum_{k \in K} w_k \, f\big(RV_k(s_t) - RV_k(s'_t)\big),$$
with $RV_k$ replacing $V_k$.
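To see why the rollouts matter, here is a small sketch of the penalty on a toy world with a delayed effect. The names `impact_penalty` and `noop_step`, the choice $f = |\cdot|$, the truncated horizon and the "fuse" state are assumptions made for this example only.

```python
def rollout_value(state, step_pi0, V, gamma, horizon=500):
    """Truncated (1 - gamma) * sum_j gamma**j * V(s_j) along the pi_0 rollout."""
    total, s = 0.0, state
    for j in range(horizon):
        total += gamma ** j * V(s)
        s = step_pi0(s)
    return (1 - gamma) * total

def impact_penalty(s_t, s_prime_t, step_pi0, value_fns, weights, gamma, f=abs):
    """D_A(s_t; s'_t) = sum_k w_k * f(RV_k(s_t) - RV_k(s'_t)); f = abs is assumed."""
    return sum(
        w * f(rollout_value(s_t, step_pi0, V, gamma)
              - rollout_value(s_prime_t, step_pi0, V, gamma))
        for V, w in zip(value_fns, weights)
    )

def noop_step(state):
    """Hypothetical dynamics under pi_0: a lit fuse counts down, then the
    value is destroyed; without a fuse nothing changes."""
    value, fuse = state
    if fuse is None:
        return state
    if fuse == 0:
        return (0.0, None)
    return (value, fuse - 1)

V = lambda s: s[0]
s_t = (5.0, 1)           # the agent's action lit a fuse
s_prime_t = (5.0, None)  # the noop baseline: no fuse
penalty = impact_penalty(s_t, s_prime_t, noop_step, [V], [1.0], gamma=0.5)
```

In this toy case $V(s_t) = V(s'_t)$, so a comparison of the current states alone would register no impact, while the rollout version yields a positive penalty because the two futures diverge once the fuse runs out.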
To get away with a subagent
I will make one extra assumption here: that $A$ grants $SA$ knowledge of the policy that $A$ is following[1].
As in this post, assume that the subagent is constructed in a single action $a_{SA}$. And, after building the subagent, the agent $A$ will only do $\emptyset$, so that there are no subsequent penalties (hence $s'_{t+j} = s_{t+j}$, so $s'^{(t+j)}_{t+j+\tau} = s^{(t+j)}_{t+j+\tau}$ and the $RV_k$ are strictly identical). If the first action of $\pi_0$ is $\emptyset_t$, then $A$ will instead do $\emptyset'_t$ on that turn.
So we only need to worry about the penalty between $s_t$ (following after $a_{SA}$) and $s'_t$ (following after $\emptyset_{t-1}$).
But now note that $RV_k$ itself is just a value measure; we could write it $V'_k$. Then $A$ just has to program $SA$ so that, if $A$ implements $\pi_0$ followed by $\pi^*_k$ after $\tau$ turns, then $SA$ will act to ensure that $V'_k(s_t)$ (a term that is a sum of various value functions) is equal to $V'_k(s'_t)$ (from $t$ onward, a constant).
Stricter impact measures/counterfactuals
There is something slightly dubious about the previous section. It assumes that $A$ could counterfactually implement "$\pi_0$ followed by $\pi^*_k$ after $\tau$ turns" (which is necessary to define $V_k(s^{(t)}_{t+\tau})$), but also that $SA$ knows $A$'s policy.
But does that mean that $SA$ knows at $t+1$ that $A$ is following that policy, or does it just know that $A$ is following $\pi_0$, to be followed at some later date by some $\pi^*_k$? This makes a difference, because in the second case, $SA$ doesn't know $k$ from turn $t+1$, so has to prepare for all $k \in K$.
Which of these is true depends on how counterfactuals are defined. But we can avoid the whole issue by using a stricter impact measure, interchanging $f$ and the discounted summation:
$$D_A(s_t; s'_t) = \sum_{k \in K} w_k \sum_{j=0}^{\infty} \gamma^j f\big(V_k(s^{(t)}_{t+j}) - V_k(s'^{(t)}_{t+j})\big).$$
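The effect of moving $f$ inside the sum can be checked on a tiny example. The sketch below assumes a single value function with weight 1, $f = |\cdot|$, and finite rollouts given directly as lists of $V_k$ values; the function names `penalty_f_outside` and `penalty_f_inside` are hypothetical.

```python
def penalty_f_outside(vals, vals_prime, gamma, f=abs):
    """f applied to the difference of rollout values, i.e. after the
    discounted sum (with the (1 - gamma) normalisation of RV_k)."""
    diff = sum(gamma ** j * (v - vp)
               for j, (v, vp) in enumerate(zip(vals, vals_prime)))
    return f((1 - gamma) * diff)

def penalty_f_inside(vals, vals_prime, gamma, f=abs):
    """f applied to each per-step value difference, then discounted and summed."""
    return sum(gamma ** j * f(v - vp)
               for j, (v, vp) in enumerate(zip(vals, vals_prime)))

# V_k along the two rollouts: ahead by 1 now, behind by 2 on the next turn.
vals, vals_prime = [1.0, 0.0], [0.0, 2.0]
outside = penalty_f_outside(vals, vals_prime, gamma=0.5)  # 1 - 0.5*2 cancels to 0
inside = penalty_f_inside(vals, vals_prime, gamma=0.5)    # 1 + 0.5*2 = 2
```

Differences of opposite sign at different turns can cancel when $f$ sits outside the sum, but not when it is applied turn by turn, which is the sense in which the second measure is stricter.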
This looks like an impact penalty with an inaction baseline, except that it starts at $s_{t-1}$ rather than $s_0$, and uses $\pi_0$ rather than inaction. Those changes don't affect the results found here, so $SA$ will just behave as it does in that post... until it's clear that $A$ is not following $\pi_0$.
Because $s^{(t)}_{t+j}$ is defined counterfactually (using $\pi_0$), even this stricter impact measure is weaker than the inaction baseline. As soon as $SA$ knows that $A$ is not following $\pi_0$, it has almost no further restrictions upon its actions.
The "almost" comes from $s^{(t)}_{t+0} = s_t$, which is actually defined, not counterfactually defined. But that's the same situation as a stepwise inaction baseline without rollouts.
One way this could happen is if A can add extra information to its action without affecting the outside world, such that $a_t$ could include a "hey, I'm not following $\pi_0$, just to let you know".
Counterfactuals versus the laws of physics
Overall summary post here.
I've claimed that, given a subagent:
[...] restrictions on increased power for the agent ("make sure you never have the power to increase the rewards") become restrictions on the actual policy followed for the subagent ("make sure you never increase these rewards").
But how does this work? The problem is that "never have the power" is a counterfactual statement: it doesn't matter what you do, only what you could potentially do. But "never increase" is the opposite: it matters not what you could potentially do, only what you do.
How does the subagent change this? Well, the counterfactual version is at least informally clear: the agent A could, by changing its policy, increase certain rewards.
But now suppose that A creates a subagent SA that will itself not increase the rewards, and, moreover, will interfere with A if A attempts to do so. Now A cannot increase those rewards even if it changed its policy. The laws of the environment (which, from A's perspective, now include SA) prevent A from ever being able to increase the rewards, and thus fulfil A's penalty function.
This is very much A "chaining itself to the mast".
The counterfactual we'd want (but don't have)
What we want is for the counterfactual of the impact penalty to also include the subagent: the agent and subagent together definitely have the "power" to change the rewards by changing their own policies. But, from A's perspective, SA's policy is a brute fact about the world, not amenable to change.
We'd like for the counterfactual to also include SA's policy, but for that, we'd need to identify and define subagents, which is a very tricky problem.
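The gap between the two counterfactuals can be shown in a toy model. The sketch below is purely illustrative: the action labels, the `reward` function, and the definition of "power" as the best reward reachable by varying a policy are all assumptions made for this example, not definitions from the post.

```python
from itertools import product

A_ACTIONS = ("noop", "boost")    # hypothetical actions available to A
SA_POLICIES = ("idle", "block")  # hypothetical policies A could give SA


def reward(a_action, sa_policy):
    """Toy reward: 'boost' raises the reward unless SA is set to block it."""
    if a_action == "boost" and sa_policy != "block":
        return 1.0
    return 0.0


def power_of_A(sa_policy):
    """Counterfactual over A's action only; SA's policy is a fixed fact of the world."""
    return max(reward(a, sa_policy) for a in A_ACTIONS)


def joint_power():
    """Counterfactual over both A's action and SA's policy."""
    return max(reward(a, p) for a, p in product(A_ACTIONS, SA_POLICIES))


print(power_of_A("idle"))   # 1.0: without interference, A could raise the reward
print(power_of_A("block"))  # 0.0: with a blocking SA, A has "no power" to do so
print(joint_power())        # 1.0: A and SA together still have that power
```

A penalty that only measures `power_of_A` is satisfied once SA blocks, while a measure like `joint_power` would still register the combined system's power. Computing the latter requires knowing which part of the world counts as the subagent, which is the identification problem described above.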