LessWrong.com News
Agent Foundations Foundations and the Rocket Alignment Problem
Many people are quite skeptical of the value of agent foundations. The kinds of problems that MIRI worries about, such as accounting for perfect predictors, cooperation with clones, and being easily predictable, are a world away from the kinds of problems being faced in machine learning. Many people think that proper research in this area would involve code. They may also think that this kind of research consists purely of extremely rare edge cases of no practical importance, won't be integrable into the kinds of AI systems that are likely to be produced, or is just much, much less pressing than solving the kinds of safety challenges that we can already see arising in our current AI systems.
In order to convey his intuition that Agent Foundations is important, Eliezer wrote the Rocket Alignment Problem. The argument is roughly that any attempt to define what an AI should do is built upon shaky premises, which makes it practically impossible to provide guarantees that things will go well. The hope is that by exploring highly simplified, theoretical problems we may learn something important that would deconfuse us. However, it is also noted that this might not pan out, as it is hard to see how useful improved theoretical understanding is before you've obtained it. Further, it is argued that AI safety is an unusually difficult area where things that sound like pretty good solutions could result in disastrous outcomes. For example, powerful utility maximisers are very good at finding the most obscure loopholes in their utility function to achieve a higher score. Powerful approval-based agents are likely to try to find a way to manipulate us. Powerful boxed agents are likely to find a way to escape that box.
Most of my current work on Less Wrong is best described as Agent Foundations Foundations. It involves working through the various claims or results in Agent Foundations, finding aspects that confuse me and then digging deeper to determine whether I'm confused, or something needs to be patched, or there's a deeper problem.
Agent Foundations Foundations research looks very different from most Agent Foundations research. Agent Foundations is primarily about producing mathematical formulations, while Agent Foundations Foundations is primarily about questioning philosophical assumptions. Agent Foundations Foundations is intended to eventually lead to mathematical formalisations, but the focus is more on figuring out exactly what we want from our maths. Rushing ahead and producing incorrect formalisations would defeat the point.
Agent Foundations research rests on a significant number of philosophical assumptions, and this is also an area where the default is disaster. The best philosophers are extremely careful with every step of the argument, yet they often come to entirely different conclusions. Given this, attempting to rush over this terrain by handwaving philosophical arguments is likely to end badly.
One challenge is that mathematicians can be blinded by an elegant formalisation, to the point where they can't objectively assess the merits of the assumptions it is built upon. Another key issue is that when someone is able to use a formalisation to produce a result, they tend to assume that the formalisation must be correct. Agent Foundations Foundations attempts to fight against these biases.
Agent Foundations Foundations focuses on what often appear to be weird niche issues from the perspective of Agent Foundations. This includes questions such as:
 Given a perfect predictor, what is it predicting when the counterfactual is impossible? (Counterfactuals for Perfect Predictors)
 Why doesn't Timeless Decision Theory depend on backwards causation? (The Prediction Problem, further discussion here)
 What class of problems should our decision theory optimise over? (One doubt about Timeless Decision Theories)
 What should you optimise for when your reference class depends on your decision? (Evil Genie Problem, Decision Theory with F@#!edUp Reference Classes)
 What general kind of entities are logical counterfactuals? (Logical Counterfactuals & the Cooperation Game, Deconfusing Logical Counterfactuals)
Of course, lots of other people have done work in this vein too. I didn't want to spend a lot of time browsing the archive, but some examples include:
 Motivating a Semantics of Logical Counterfactuals
 Logical Counterfactuals are Low Res
 Smoking Lesion Steelman
I don't want to pretend that the separation is clean at all. But in some work, the maths is first and the philosophical assumptions are second. For Agent Foundations Foundations, it is the other way round.
The Simple Solow Model of Software Engineering
Optional background: The Super-Simple Solow Model
Software is economic capital, just like buildings, infrastructure, machines, etc. It’s created once, and then used for a (relatively) long time. Using it does not destroy it. Someone who buys/creates a machine usually plans to use it to build other things and make back their investment over time. Someone who buys/creates software usually plans to use it for other things and make back their investment over time.
Software depreciates. Hardware needs to be replaced (or cloud provider switched), operating systems need to be upgraded, and backward compatibility is not always maintained. Security problems pop up, and need to be patched. External libraries are deprecated, abandoned, and stop working altogether. People shift from desktop to browser to mobile to ???. Perhaps most frequently, external APIs change format or meaning or are shut down altogether.
In most macroeconomic models, new capital accumulates until it reaches an equilibrium level, where all investment goes toward repairing/replacing depreciated capital: resurfacing roads, replacing machines, and repairing buildings rather than creating new roads, machines, and buildings. The same applies to software teams/companies: code accumulates until it reaches an equilibrium level, where all effort goes toward repairing/replacing depreciated code (switching to new libraries, updating to match changed APIs, and patching bugs introduced by previous repairs).
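The accumulation-to-equilibrium story above can be sketched numerically. This is a minimal toy of my own, not code from the linked model, and the write rate and depreciation fraction are arbitrary assumed parameters:

```python
# Toy numerical sketch of the Solow-style equilibrium for a codebase.
# Each month the team writes new code at a fixed rate, while a fixed
# fraction of the existing code depreciates and must be rewritten.

def simulate_code_stock(write_rate=100.0, depreciation=0.02, months=600):
    """Return the code stock (arbitrary units, e.g. kLOC) over time."""
    stock = 0.0
    history = []
    for _ in range(months):
        maintenance = depreciation * stock   # effort spent fighting decay
        stock += write_rate - maintenance    # net growth of the code base
        history.append(stock)
    return history

history = simulate_code_stock()
# Growth stops when write_rate == depreciation * stock, i.e. at the
# steady state stock = write_rate / depreciation = 5000 in these units.
print(history[-1])
```

At equilibrium all of the team's effort goes to maintenance: new writing exactly offsets depreciation, so the code stock plateaus at write_rate / depreciation.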
What qualitative predictions does this model make?
Prediction 1: If a software company wants to expand the capabilities of their software over time, they can’t just write more code; the old software will break down if the engineers turn their attention elsewhere. That leaves a few options:
 Hire more engineers (economics equivalent: population/labor force growth)
 Hire/train better engineers (economics equivalent: more education)
 Figure out better ways to make the software do what it does (economics equivalent: innovation)
Hiring more engineers is the “throw money at it” solution, and probably the most common in practice, but it is also the solution most prone to total failure when the VC funding dries up.
Hiring/training better engineers is the dream. Every software company wishes they could do so, many even claim to do so, but few (if any) actually manage it. There are many reasons: it’s hard to recognize skill levels above your own, education is slow and hard to measure, there’s lots of bullshit on the subject and it’s hard to comb through, it’s hard to get buy-in from management, etc.
Figuring out better ways to make the software do what it does is probably the most technically interesting item on the list, and also arguably the item with the most long-term potential. This includes adopting new libraries/languages/frameworks/techniques. It includes refactoring to unify duplicate functionality. It includes designing new abstraction layers. Unfortunately, all of these things are also easy to get wrong (unifying things with significantly divergent use cases, or designing a leaky abstraction), and it’s often hard to tell until later whether the change has helped or hurt.
Prediction 2: New products from small companies tend to catch up to large existing products, at least in terms of features. The new product with a small code base needs to invest much less in fighting back depreciation (i.e. legacy code), so it can add new features much more quickly.
If you’ve worked in a software startup, you’ve probably experienced this first hand.
Conversely, as the code base grows, the pace of new features necessarily slows. Decreasing marginal returns of new features meet up with increasing depreciation load, until adding a new feature means abandoning an old one. Unless a company is constantly adding engineers, the pace of feature addition will slow to a crawl as the code base grows.
Prediction 3: Since this all depends on depreciation, it’s going to hit hardest when the software depreciates fastest.
The biggest factor here (at least in my experience) is external APIs. A company whose code does not call out to any external APIs has a relatively light depreciation load: once their code is written, it’s mostly going to keep running, other than long-term changes in the language or OS. APIs usually change much more frequently than languages or operating systems, and are less stringent about backwards compatibility. (For apps built by large companies, this also includes calling APIs maintained by other teams.)
Redis is a pretty self-contained system, so there’s not much depreciation there. Redis could easily add a lot more features without drowning in maintenance needs. On the other end of the spectrum, a mortgage app needs to call loads of APIs (credit agencies, property databases, government APIs, pricing feeds…) and will hit equilibrium pretty quickly. In that sort of environment, you’ll probably end up with a roughly constant number of APIs per engineer which can be sustained long-term.
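A hypothetical back-of-the-envelope for that last claim (all numbers invented for illustration): if each integrated API eats a fixed amount of engineer time per month in upkeep, then at equilibrium each engineer can sustain only a fixed number of APIs, independent of head count:

```python
# Back-of-the-envelope for "roughly constant APIs per engineer".
# Both numbers below are assumptions, not measurements.
hours_per_engineer_month = 160
upkeep_hours_per_api = 8   # assumed monthly maintenance cost of one API

# At equilibrium, all available time goes to upkeep, so the sustainable
# API count scales linearly with the number of engineers:
apis_per_engineer = hours_per_engineer_month / upkeep_hours_per_api
for engineers in (5, 50, 500):
    print(engineers, "engineers can sustain", engineers * apis_per_engineer, "APIs")
```

The ratio itself (20 here) depends entirely on the assumed upkeep cost; the point is only that it is a per-engineer constant.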
IRL 6/8: Querying the human: Active Reward Learning
Value Learning is only Asymptotically Safe
I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now.)
This result leaves something to be desired: namely, an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function.
Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be a set of observations which caused it to believe that every human was better off dead.
In the presence of cosmic rays, then, this agent is not safe for its entire lifetime with probability 1. For any finite sequence of observations that would cause the agent to conclude that humanity was better off dead, this sequence has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory.
This agent is presumably still asymptotically safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli(θ) random variable (like the indicator of whether a bit was flipped) approaches θ, which is small enough that a competent value learner should be able to deal with it.
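The convergence claim (the sample mean of a Bernoulli(θ) variable approaches θ) can be checked with a quick simulation; the flip probability used here is an arbitrary stand-in, not a figure from the post:

```python
import random

# Sample mean of Bernoulli(theta) draws, e.g. indicators of whether a
# cosmic ray flipped a given bit. theta is an invented illustration value.
random.seed(0)
theta = 0.001
n = 200_000
flips = [1 if random.random() < theta else 0 for _ in range(n)]
sample_mean = sum(flips) / n
print(sample_mean)  # close to theta, by the law of large numbers
```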
This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that it complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?
The AI alignment problem as a consequence of the recursive nature of plans
Ruby's recent post on how Plans are Recursive & Why This is Important has strongly shaped my thinking. I think it is a fact that deserves more attention, both when thinking about our own lives as well as from an AI alignment standpoint. The AI alignment problem seems to be a problem inherent to seeking out sufficiently complex goals.
Any sufficiently complex goal, in order to be achieved, needs to be broken down recursively into subgoals until you have executable actions. E.g. “becoming rich” is not an executable action, and neither is “leaving the maximum copies of your genes.” No one ever *does* those things. The same goes for values: what does it mean to value art, to value meaning, to value life? Whatever it is that you want, you need to translate it into muscle movements and/or lines of code in order to get at all close to achieving it. This is an image Ruby used in that post:
Now, imagine that you have a really complex goal in that upper node. Say you want to "create the best civilization possible" or "maximize happiness in the universe." Or you just have a less complex goal, like "having a good and meaningful life," and you're in a really complex environment, like the modern capitalist economy. A tree representing that goal and the subgoals and actions it needs to be decomposed into would be very tall. I can't make my own image because I am on a borrowed computer, but imagine the tree above with around 20 levels of nodes.
Now, our cognitive resources are not infinite, and they don't scale with the complexity of the environment. You wouldn't be able to store that entire tree in your mind at all times and always keep track of how your every action is serving your one ultimate terminal goal. So in order to act at all, in order to get anywhere, you need to forget a considerable part of your supergoals and focus on the more achievable subgoals: you need to "zoom in."
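The recursive decomposition described above can be sketched as a tiny tree walk. The goal tree here is invented for illustration:

```python
# A goal is either an executable action (a string leaf) or a pair of
# (label, list of subgoals). "Acting" means recursively expanding
# subgoals until only executable actions remain.

def flatten(goal):
    """Return the executable actions a goal ultimately decomposes into."""
    if isinstance(goal, str):      # leaf: already an executable action
        return [goal]
    _label, subgoals = goal        # internal node: keep decomposing
    actions = []
    for sub in subgoals:
        actions.extend(flatten(sub))
    return actions

become_rich = ("become rich", [
    ("build a career", ["study law", "apply to firms"]),
    ("invest savings", ["open a brokerage account"]),
])
print(flatten(become_rich))
# ['study law', 'apply to firms', 'open a brokerage account']
```

Note that the label on the root node never appears in the output: only the leaves are things anyone ever *does*, which is the post's point.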
Any agent that seeks X as an instrumental goal, with, say, Y as a terminal goal, can easily be outcompeted by an agent that seeks X as a terminal goal. If you’re thinking of Y all the time, you’re simply not going to do the best you can to get X. Someone who sees becoming a lawyer as their terminal goal, someone who is intrinsically motivated by it, will probably do much better at becoming a lawyer than someone who sees it as merely a step towards something else. That is analogous to how an AI trained to do X will outcompete an AI trained to do X, *plus* value human lives and meaning and everything.
Importantly, this happens in humans at a very visceral level. Motivation, desire, wanting, are not infinite resources. If there is something that is, theoretically, your one true value/desire/goal, you're not necessarily going to feel any motivation at all to pursue it if you have lowerdown subgoals to achieve first, even if what originally made you select those subgoals was that they brought you closer to that supergoal.
That may be why sometimes we find ourselves unsure about what we want in life. That also may be why we disagree on what values should guide society. Our motivation needs to be directed at something concrete and actionable in order for us to get anywhere at all.
So the crux of the issue is that we need to manage to 1) determine the sequence of subgoals and executable actions that will lead to our terminal goal being achieved, and 2) make sure that those subgoals and executable actions remain aligned with the terminal goal.
There are many examples of that going wrong. Evolution “wants” us to “leave the maximum copies of our genes.” The executable actions that this comes down to are things like “being attracted to specific features in the opposite sex that in the environment of evolutionary adaptedness correlated with leaving the maximum copies of your genes” and “having sex.” Nowadays, of course, having sex doesn’t necessarily lead to spreading genes, so evolution is, in a sense, failing at the human alignment problem.
Another example would be people who work their entire lives to succeed at a specific prestigious profession and get a lot of money, but when they do, they end up not being entirely sure of why they wanted that in the first place, and find themselves unhappy.
You can see humans maximizing for, e.g., pictures of feminine curves on Instagram as kind of like an AI maximizing paperclips. Some people think of the paperclip maximizer thought experiment as weird or arbitrary, but it really isn't any different from what we already do as humans. From what we have to do.
What AI does, in my view, is massively scale and exacerbate that already-existing issue. AI maximizes efficiency, maximizes our capacity to get what we want, and because of that, specifying what we want becomes the trickiest issue.
Goodhart’s law says that “when a measure becomes a target, it ceases to be a good measure.” But every action we take is a movement towards a target! And complex goals will not work as targets to be aimed at directly without the extensive use of proxies. The role of human language is largely to impress other humans. So when we say that we value happiness, meaning, a great civilization, or whatever, that sounds impressive and cool to other humans, but it says very little about what muscle movements need to be made or what lines of code need to be written.
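The proxy problem above can be shown in miniature. The following toy (the "true value" function and search space are made up for illustration) has a proxy that tracks the true goal for small inputs, then decouples from it entirely once an optimizer pushes hard on it:

```python
# A toy illustration of Goodhart's law: a proxy metric correlates with
# the true goal at first, but optimizing the proxy hard enough
# decouples the two. Both functions are invented for illustration.

def true_value(x: int) -> float:
    # What we actually care about: peaks at x = 5, then declines.
    return -(x - 5) ** 2

def proxy(x: int) -> float:
    # A measure that agrees with true_value for small x,
    # but keeps rewarding "more x" forever.
    return x

candidates = range(0, 21)

# An optimizer pointed at the proxy runs off to the boundary...
best_by_proxy = max(candidates, key=proxy)
# ...while the true optimum sits elsewhere.
best_by_value = max(candidates, key=true_value)

print(best_by_proxy)   # 20: the proxy says "more is always better"
print(best_by_value)   # 5: the true optimum
```

A weak optimizer searching only `range(0, 7)` would land near 5 either way; the divergence appears exactly when optimization power grows, which is the sense in which AI "massively scales" the problem.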
Discuss
How did Voldemort learn the horcrux spell?
[HPMOR related]
This question has always bugged me. The Interdict of Merlin stops spells from being learnt from anything but a living mind. Yet Tom Riddle is specifically said to have learnt the horcrux spell from a book. Why are there even spells in books?
Who would believe the written-down ritual for the Philosopher's Stone, since no one can learn it from a book anyway?
The existence of books containing and describing spells seems completely illogical in that world.
Discuss
The Hard Work of Translation (Buddhism)
The issue, as it seems to me, is that almost every text you read on Buddhism does not attempt to do the actual work of translation. The first transmission of Buddhism to the west reified a bunch of translations of terms, such as concentration, equanimity, tranquility, mindfulness, suffering, etc., and works since then have mostly stuck to rearranging these words in different combinations and referencing the same metaphors that have been in use since the time of the Buddha. If these authors had true discernment they would realize that the umpteenth text on 'establishing the noble bases of tranquility secluded from sensuous ignorance' or what-have-you isn't helping anyone who didn't already get the message.
At this point I want to say that I think this approach is 'working' for the fraction of the population it is going to work for. If we want to make the practical fruits of Buddhist practice dramatically more accessible to a broader range of humanity we need people to do the hard work of translation to put the Buddha's teachings in forms that will be accessible to various groups of people.
The hard work of translation is to attempt to use language to point your mind at the same distinctions that the original author was trying to point to. Attempts to do this will inevitably fail in lots of ways, but can hopefully communicate enough of the core message that people can piece together the essential causal relations after which, having had direct experience as a result of skillful practice, they can help to improve the translations further.
So, putting my money where my mouth is, I want to try to produce a translation of what I see as the core causal loop that causes progress on the Buddha's path. I'm attempting this because I believe the core causal loop is actually quite small. The Buddha had a tougher task because he had to explain causation, locus of control, and other critical concepts to farmers from scratch.
To begin with, you may think that the purpose of meditation is to eliminate thoughts. But read the Pali Canon and you find a text rife with concepts, schemas, diagnostic methods for various classifications of mental activity, meditation taxonomies, sensory taxonomies, feedback loops etc. Pretending you're already enlightened and that there isn't hard work to do is something the new agers have borrowed from some shitty spiritual schools of various flavors. I refer to people preaching such messages as mindlessness teachers.
To be clear, a decrease in discursive thought, and especially unpleasant mental contents that don't seem to serve any purpose, are one of many pleasant effects of proper practice, but don't really need to be focused on. It is a benefit that arrives in stages on its own.
So, what is the core loop?
It's basically cognitive behavioral therapy, supercharged with a mental state more intense than most pharmaceuticals.
There are two categories of practice, one for cultivating the useful mental state, the other uses that mental state to investigate the causal linkages between various parts of your perception (physical sensations, emotional tones, and mental reactions) which leads to clearing out of old linkages that weren't constructed well.
You have physical sensations in the course of life. Your nervous system reacts to these sensations with high or low valence (positive, negative, neutral) and arousal (sympathetic and parasympathetic nervous system activation), your mind reacts to these now-emotion-laden sensations with activity (mental image, mental talk) out of which you then build stories to make sense of your situation.
The key insight that drives everything is the knowledge (and later, direct experience) that this system isn't wired up efficiently. Importantly: I don't mean this in a normative way. Like you should wire it the way I say just because, but in the 'this type of circuit only needs 20 nand gates, why are there 60 and why is it shunting excess voltage into the anger circuits over there that have nothing to do with this computation?' way. Regardless of possible arguments over an ultimately 'correct' way to wire everything, there are very low hanging fruit in terms of improvements that will help you effectively pursue *any* other goal you set your mind to.
Funny aside: emotional 'resistance' might be well named. It might be literal electrical resistance in the CNS's wiring, a result of this spaghetti logic.
So back to these stories and story building blocks that are the outputs of this system. You generated a bunch of the primitive building blocks when you were very young, throwing everything together on an as-needed basis with no instructions. You both have a backlog of such stories and story building blocks and are generating new ones all the time. Practice improves each of these situations. It improves the backlog by going through and reprocessing stories that, when examined, aren't actually aligned with reality. Again, not pointing to edge cases here but things in the 'your partner humming the SpongeBob theme shouldn't make you furious because of something that happened when you were 12' class. You can clean up all the obvious stuff and then let your future self (who now has more resources) think about how to wisely deal with the fuzzy edge cases. It improves the new stories coming in (partially by learning as it processes the backlog) by building far fewer incoherent stories out of pieces that don't fit together, and by building fewer of the shittier building blocks in the first place.
I'll go ahead and name these things now to connect them up for people who have some knowledge of existing translations.
Concentration meditation gives rise to a mental state where the mind is very calm and inclined to neutrality. Of the same sort you'd want in a good judge.
Insight meditation makes one aware of the causal links in the perceptual system between physical sensations, feelings, and mental reactions.
Sankharas are the stories and story pieces that get reexamined and refactored as a result.
So what is the core loop of meditation practice?
Concentration puts you in the ideal state for insight.
Insight stirs up Sankharas.
Examining Sankharas riles up the mind, eventually leading to a desire to do some more concentration in order to calm down and keep making progress.
Why is concentration ideal to prepare you for insight practice?
Insight requires a high degree of temporal and spatial resolution in order to see the finer linkages between mental activities that normally flow past you without you noticing. Concentration meditation improves that resolution.
Second, to examine the Sankharas is to, to some extent, reactivate the sensations, feelings, and mental reactions associated with them. Since the ones we are most concerned with are the ones that are causing the biggest negative reactions in our lives, we need the mind to be calm and tranquil in order to do this work. Concentration greatly improves this tranquility as well.
How do insights stir up Sankharas?
This would require more speculation about somatic theories that don't yet have a good evidence base. Subjectively, it feels like building up insights into particular kinds of linkages between physical sensations, feelings, and mental reactions causes areas of your backlog that are particularly heavy in those linkages to get some activation and thus be available to consciousness.
You've experienced this if you've ever had a conceptual insight and then spent the next week noticing ways it was applicable, seemingly spontaneously. The only difference here is that insight can also be non-conceptual (i.e., insight into how two particular physical sensations interact might generate no verbal content/mental talk but some sense of something happening.)
So, the Buddha taught a method of concentration, a system for developing insight that we know as mindfulness, and to use these to both stop building new stories and to clear out our backlog of stories. That's actually it. The rest is details for how this plays out in practice. Failure modes can get a bit weird, and even if you do it right some mind blowing states and experiences can pop up. So there's lots of whataboutism for all that.
The miswired central nervous system story gives us simple answers to things like trauma (extreme levels of miswiring of things into fear and freeze responses), why stuff like yoga and exercise help (general CNS health, probably capacitance/fuse breaker improvements), why psychotherapy sometimes but not always activates childhood memories and the significance of that, and why practitioners claim they have a much better life but can't always explain why (they perform the same actions but with much less internal resistance).
So then why all the rest of this crap?
Well, besides my post on why practitioners make so many metaphysical claims, it's also just that there's a lot of idiosyncrasy in first unwiring a randomly wired CNS and then rewiring it in arbitrary order. Especially when you don't really know that that's what you're doing as you're doing it and your mindlessness teacher is a bit clueless as well (though may still have good pragmatic advice despite bad epistemics.)
In addition, note I said that each of the practices is actually a practice category. The Buddha taught one specific concentration technique and a simple series of insight techniques, but there are probably a dozen alternatives in each category that seem to work for some people, and entire traditions have subsequently built themselves around those alternatives and gotten into fights with rival schools about them.
(I am fairly confident this is how things work up until 2nd path. Since approximately zero percent of people make it beyond that point I'm not too worried about this.)
Discuss
Towards a Quieter Life
My life is LOUD. Thoughts of fiction, conversations, unrealistic hypotheticals, games, and TV shows flood my mind when there's no need to be babbling those thoughts. I have coroutines running in my mind to CHECK YOUR PHONE, YOUR EMAIL, YOUR RSS FEED, ...
Being aware of myself in those moments, I've realized I'm not actually happy doing any of them. Cravings for escapism are not what I want. I want something real, something true... I want reality.
So here's to a quieter life.
I'll check my phone/email after lunch and dinner. If I'm bored, I'll meditate. If I'm tired of meditating, I'll investigate that feeling. If that fails, I can Sabbath for a bit.
When I'm doing something, I'm only doing that. If an unrelated thought comes, I'll accept it and let it go. When I'm done with one thing, I'll go to the next with joy. If I don't have a next thing, I'll make a plan with joy. If that fails, I can Sabbath for a bit.
I'll be strict for the next few weeks, post the results, and go from there.
~Metta
Discuss
Experiment: Observing progress or failure of reaching goals
I've seen a lot of people here on LessWrong asking and writing about things like accomplishing goals, fighting akrasia, learning, overcoming obstacles, etc. There also seems to be some debate about the usefulness of such insights.
I'm proposing an experiment instead, with myself as the subject.
Below are described some of my goals. The goals are semi-discrete entities and can be accomplished in reality in a medium amount of time (weeks to months).
You would get to observe me struggling to complete the goals. If I succeed, you can ask me questions about my methods, my experiences and my background. If I fail, the whole thing fades into forgetfulness.
I'm only one data point, but others may do something similar.
To be honest, I don't quite understand the term 'ambition'. I only wish to figure out everything and improve the amount of control we can exercise over our lives. These far-off goals seem quite sensible to me. I mean, what else could I do? It's not like there are viable alternatives lying around. Press this button and become invincible. Press the other button and cure all illnesses. If buttons like that existed, some idiot would have pressed them by now.
Moving on.
The Goals
In order to reach my far-off goals, I must first ...
1. Learn enough mathematics, so that I can understand and work on basic problems in AGI, mathematics and computer science. For now, that means working through introductory books on most topics. I need a better grasp of what everything is. Then I can figure out the next stages (something along the lines of 'learn more ...' or 'do ...').
2. Improve the structure that I'm using for the organisation of living and moving towards my goals. This work never seems to be quite finished (which is to be expected). As I complete some goals, others will demand behavior that I haven't trained. However, I've observed that particular structures allow more control and less upkeep. I aim to improve these qualities.
3. Programming. This is for fun as much as it is for practice. I started a few months ago, but haven't practiced much. I'm using programming to pass some requirements, and to improve my ability to figure out and implement solutions.
4. Developing my own ideas. I can't measure this adequately. But I need to test the extent of my understanding, improve my ability to sort through confusion, and improve my ability to develop ideas that can be precisely defined. For this I'll write ideas down, challenge myself to explain them better, and investigate parts that are stand-ins for naivety.
I already have basic mathematical ability, the habit of writing out and explaining/investigating my thoughts, and an inner drive that keeps me going. I'm also in a place where I can afford to take some time to do this.
Observing progress or failure
I'm open to suggestions. The experiment will run until late August. Given that there is at least some success, I will do a review as described at the start.
An example of what I could do for observation (about once per week): post a summary of where I'm at for each goal. I could supplement this with one or two pieces of evidence, such as partial reviews of the books I'm reading, how well my organisational structure stood up to scrutiny (perhaps by reporting the ratio of what I succeeded / failed at), or something else.
Please ask questions or comment below.
Discuss
Reinforcement learning with imperceptible rewards
TLDR: We define a variant of reinforcement learning in which the reward is not perceived directly, but can be estimated at any given moment by some (possibly costly) experiment. The reward function is no longer a function of the observation history, but a different object that we call "instrumental reward function". We give two definitions of instrumental reward function and prove their equivalence. We also derive a regret bound for this setting.
Background

In "classical" reinforcement learning, the agent perceives the reward signal on every round of its interaction with the environment, whether through a distinct input channel or through some given way to compute the reward from the interaction history so far. On the other hand, we can easily imagine agents that optimize properties of their environment that they do not directly perceive. For example, if Alice, who lives in Dominica, donates money to the Against Malaria Foundation in order to save someone in Africa, then the result is usually not visible to Alice at the time it occurs, if ever. Similarly, Clippy the paperclip maximizer doesn't always perceive all the paperclips in the universe. Moreover, we might want to design agents that, in order to estimate the reward, direct queries to humans (which is costly and cannot be done continuously).
Now, it is possible to define the perceived reward as the subjective expected value of the "true" imperceptible reward (see the Results section for details). Although this transformation preserves expected utility, it does not preserve Bayesian regret. Indeed, Bayesian regret is the difference between the expected utility attained by the agent and the expected utility attained by a "reference" agent that knows the true environment from the outset. However, after the transformation, the reference agent will behave as if it knows the observable dynamics of the true environment but still pretends not to know the true environment for the purpose of computing the reward. Therefore, regret analysis requires us to consider the imperceptible reward honestly. In fact, as we will see, certain assumptions about the imperceptible reward function are needed even to make the problem learnable at all. Finally, this transformation makes the reward function more complex, and hence applying a "generic" reinforcement learning algorithm after the transformation (instead of exploiting the special form of the resulting reward function) might carry a significant computational complexity penalty.
Related Work: De Blanc 2011 studies so-called "ontological crises". That is, de Blanc examines the problem of translating a reward function from one ontology to another. Here, we avoid this problem by considering reward functions that are automatically defined in all ontologies. That said, it might still be interesting to think about how to specify our type of reward function starting from a reward function that is only defined in one particular ontology. We will return to this in the Discussion section.
Krueger et al 2016 consider a setting where querying the reward is instantaneous and has a fixed cost. This is a special case of our setting, but we allow a much more general type of query and cost. Also, Krueger et al don't derive any theoretical regret bound (but they do present some experimental results).
Finally, multi-armed bandits with partial monitoring are closely related; see, for example, Bartok et al 2013. However, bandits by definition assume a stateless environment, and our approach is also rather different from what is usually studied in partial monitoring.
The literature study was very cursory and I will be glad to know about prior work I missed!
Results

Partially Observable MDPs with Imperceptible Rewards

Partially Observable Markov Decision Processes (POMDPs) serve as a natural starting point for thinking about imperceptible rewards. Indeed, it might seem like all we have to do is consider RL in a POMDP environment and let the reward be a function of the (imperceptible) state. However, this setting is in general unlearnable even given assumptions that rule out traps.
A (finite) POMDP is defined by nonempty finite sets S (states), A (actions) and O (observations), the transition kernel T:S×Ak→S×O and the reward function R:S→[0,1]. As opposed to the "classical" definition of POMDP, we allow the value of R to be imperceptible. The perceptible setting is a special case of the imperceptible setting: we can always encode the reward into the observation.
To formulate a learning problem, we assume S, A and O to be fixed and an initial state s0∈S to be given, while T and R are unknown and belong to some hypothesis class H:
H⊆{S×Ak→S×O}×{S→[0,1]}
Such a problem can be unlearnable even if R is known and there are no irreversible events that can happen:
Example 1
Suppose that S={s0,s−,s+}, A={a−,a+}, O={⊥}, R(s0)=1/2, R(s−)=0, R(s+)=1 and H={(T−,R),(T+,R)} where for any s∈S
T−(s+,⊥∣s,a−)=1 T−(s−,⊥∣s,a+)=1 T+(s+,⊥∣s,a+)=1 T+(s−,⊥∣s,a−)=1
Since |O|=1, there is no way to gain any information about which hypothesis is correct. Moreover, the optimal policies for the two hypotheses are different. Namely, for T+ we should always take action a+ and for T− we should always take action a−.
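To make Example 1 concrete, here is a minimal sketch in Python (the function and variable names are my own, not part of the formalism): both hypotheses emit the same constant observation, yet their optimal actions differ.

```python
# Example 1 as code: states {s0, s-, s+}, actions {a-, a+}, one observation.
STATES = ["s0", "s-", "s+"]
REWARD = {"s0": 0.5, "s-": 0.0, "s+": 1.0}

def T_plus(s, a):
    # under T+, taking a+ leads to s+ and a- leads to s-
    return "s+" if a == "a+" else "s-"

def T_minus(s, a):
    # under T-, the roles of the actions are swapped
    return "s+" if a == "a-" else "s-"

def observation(s):
    # the only observation, emitted in every state
    return "bot"

# The two hypotheses are observationally indistinguishable:
assert all(observation(s) == "bot" for s in STATES)

def long_run_reward(T, a):
    # reward eventually obtained by repeating a fixed action
    s = "s0"
    for _ in range(10):
        s = T(s, a)
    return REWARD[s]

# ...but their optimal actions differ:
assert long_run_reward(T_plus, "a+") == 1.0
assert long_run_reward(T_minus, "a-") == 1.0
assert long_run_reward(T_plus, "a-") == 0.0
```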
To formalize and illustrate the discussion in the Background section, suppose H is Borel and ζ∈ΔH is the prior. We can then define the "perceived effective reward" ER:(A×O)∗→[0,1] by
ER(ao):=E(T,R)∼ζ,(sn+1,o′n)∼T(sn,an)[R(s|ao|)∣o′=o]
It is then easy to see that the E operator preserves expected utility: given any policy π:(A×O)∗k→A and m∈N
E(T,R)∼ζan∼π(ao:n)(sn+1,on)∼T(sn,an)[ER(ao:m)]=E(T,R)∼ζan∼π(ao:n)(sn+1,on)∼T(sn,an)[R(sm)]
and therefore, for any geometric time discount parameter γ∈[0,1)
E(T,R)∼ζan∼π(ao:n)(sn+1,on)∼T(sn,an)[∞∑n=0γnER(ao:n)]=E(T,R)∼ζan∼π(ao:n)(sn+1,on)∼T(sn,an)[∞∑n=0γnR(sn)]
On the other hand, Bayesian regret is not preserved since, in general
E(T,R)∼ζ[maxπEaos∼Tπ[∞∑n=0γnER(ao:n)]]≠E(T,R)∼ζ[maxπEaos∼Tπ[∞∑n=0γnR(sn)]]
Here aos∼Tπ is shorthand notation for the same probability distribution as before.
Indeed, in Example 1 the LHS of the above is (1/2)⋅1/(1−γ) since ER≡1/2, whereas the RHS is 1/2+γ/(1−γ).
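Plugging in a concrete discount factor makes the gap visible (a sketch, using Example 1 with the uniform prior over the two hypotheses):

```python
# With gamma = 0.9: the perceived effective reward is identically 1/2,
# so the best achievable "perceived" value is (1/2)/(1-gamma), while an
# agent that knows the true kernel earns 1/2 at n=0 and 1 afterwards.
gamma = 0.9
lhs = 0.5 / (1 - gamma)          # value w.r.t. the perceived reward
rhs = 0.5 + gamma / (1 - gamma)  # value when the true kernel is known

assert abs(lhs - 5.0) < 1e-9
assert abs(rhs - 9.5) < 1e-9
assert rhs > lhs  # the transformation does not preserve Bayesian regret
```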
The pathology of Example 1 comes about because reward is not only imperceptible but entirely unobservable. That is, no experiment can produce any information about whether the reward on a given round n>0 was 0 or 1. More specifically, the states s− and s+ are assigned different rewards, but there is no observable difference between them. It is as if Alice would assign value, not to people in Africa (whose existence and wellbeing can be measured) but to some Flying Spaghetti Monster s.t. the world behaves exactly the same regardless of its existence or condition.
This observation suggests that, instead of assigning rewards to states which are just abstract labels in a model, we should assign rewards to states that are defined in terms of the observable consequences of the interaction of the agent with the environment. This leads us to the notion of "instrumental state", which we will now define formally.
Instrumental States and Reward Functions

First, we introduce some technical definitions for notational convenience.
Definition 1
ConSet is the category whose objects are pairs (V,C) where V is a real vector space and C is a convex subset of V, and whose morphisms are
Mor((V,C),(W,D)):={f:C→D ∣ ∃A∈Hom(V,W), w∈W: ∀v∈C: f(v)=Av+w}
We omit describing identity and composition of morphisms since they are obvious.
It is easy to see that ConSet has arbitrary limits. In particular, ConSet has a final object that we denote by pt (the one point set), products ((V,C)×(W,D)≅(V⊕W,C×D)) and inverse limits of sequences. For any finite set A, (RA,ΔA) is an object in ConSet. Also, we will sometimes abuse notation by regarding C as an object of ConSet instead of (V,C) (i.e. making V implicit).
Definition 2
The functor Cone:ConSet→ConSet is defined by
Cone(V,C):=(V⊕R, {(λv,λ) ∣ λ∈[0,1], v∈C}) (Cone f)(λv,λ):=(λf(v),λ)
Definition 3
For any C∈ConSet, we define hC:ConeC→[0,1] by
hC(λv,λ):=λ
Note that h is a natural transformation from Cone to the constant functor with value [0,1].
Given C∈ConSet and x∈ConeC s.t. x=(u,λ) for λ>0, we denote [x]:=λ−1u∈C.
Now, we are ready to define instrumental states. We fix the sets A and O.
Definition 4
For any n∈N, we define ISn∈ConSet, the space of n time step instrumental states, recursively by
IS0:=pt ISn+1:=∏a∈A(∏o∈OhISn)−1(ΔO)
Here, ∏o∈OhISn is a mapping from ∏o∈OConeISn to [0,1]O. The inverse image of a convex set (in this case ΔO⊆[0,1]O) under an affine mapping (in this case ∏o∈OhISn) is also a convex set.
The semantics of Definition 4 is as follows. Any α∈(∏o∈OhISn)−1(ΔO) can be regarded as a pair consisting of some α′∈ΔO (the image of α under ∏o∈OhISn) and a mapping α′′:suppα′→ISn defined by α′′(o):=[αo]. Given θ∈ISn+1, θ′a is the probability distribution over observations resulting from taking action a in state θ, whereas θ′′a(o) is the state resulting from taking action a in state θ conditional on observing o. This semantics can be made more formal as follows:
Definition 5
Given θ∈ISn, we define domθ⊆(A×O)∗ recursively as follows:

λ∈domθ

For all h∈(A×O)∗, a∈A and o∈O: aoh∈domθ iff n>0, hISn−1(θao)>0 and h∈dom[θao].
Definition 6
Given θ∈ISn, h∈domθ and a∈A, and assuming that |h|<n, we recursively define θ(ha)∈ΔO by

For h=λ: θ(o∣a):=hISn−1(θao).

For h=a′o′h′ with some a′∈A, o′∈O and h′∈(A×O)∗: θ(ha):=[θa′o′](h′a).
The point of defining ISn in this manner is that (i) different points in ISn correspond to states that are truly not equivalent, i.e. can be empirically distinguished (statistically) and (ii) convex combinations of points in ISn correspond precisely to probabilistic mixtures. Formally, we have:
Definition 7
Given θ∈ISn and π:(A×O)<nk→A (a policy), we define θπ∈Δ(A×O)n (probability distribution over histories resulting from following policy π in state θ) by
Prao∼θπ[am=a∗]=Eao∼θπ[π(a∗ao:m)] Prao∼θπ[om=o∗]=Eao∼θπ[θ(o∗ao:mam)]
Proposition 1
Consider some θ,θ′∈ISn and assume that for any π:(A×O)<n→A, θπ=θ′π. Then, θ=θ′.
Proposition 2
Consider some θ,θ′∈ISn, π:(A×O)<nk→A and p∈[0,1]. Then
(pθ+(1−p)θ′)π=pθπ+(1−p)θ′π
ISn is a bounded polytope but in general it is not a simplex: we cannot regard it as just probability distributions over some set. For example, if |A|=|O|=2, then it's easy to see that IS1 is a square (one axis is the probability of getting a given observation when taking one action, the other axis is the probability for the other action). On the other hand, if |A|=1 then ISn is a simplex: it is canonically isomorphic to ΔOn.
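The |A|=|O|=2 case can be sketched concretely (an illustrative encoding of my own, not the formal ConeISn construction): a one-step instrumental state is determined by the probability of one observation under each of the two actions, so IS1 is the unit square, and mixing states is just mixing these probabilities, in line with Proposition 2.

```python
# A point of IS_1 for |A| = |O| = 2, encoded by two numbers:
# the probability of observing o1 under action a1 and under a2.
def instrumental_state(p_o1_a1, p_o1_a2):
    return {("a1", "o1"): p_o1_a1, ("a1", "o2"): 1 - p_o1_a1,
            ("a2", "o1"): p_o1_a2, ("a2", "o2"): 1 - p_o1_a2}

def mix(theta, theta_prime, p):
    # convex combination = probabilistic lottery over the two states
    return {k: p * theta[k] + (1 - p) * theta_prime[k] for k in theta}

corner = instrumental_state(1.0, 0.0)  # one corner of the square
other = instrumental_state(0.0, 1.0)   # the opposite corner
mid = mix(corner, other, 0.5)          # the center of the square
assert mid[("a1", "o1")] == 0.5 and mid[("a2", "o1")] == 0.5
```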
There is a natural morphism prn:ISn+1→ISn whose semantics is forgetting about the behavior of the state at time step n:
Definition 8
We define prn:ISn+1→ISn for any n∈N recursively. pr0 is the unique morphism from IS1 to IS0≅pt. For any n∈N, prn+1 is given by
prn+1(θ)ao:=(Coneprn)(θao)
We thereby get a sequence IS0←IS1←IS2←…
Definition 9
We define ISω, the space of (infinite time step) instrumental states by
ISω:=lim←−nISn
We denote the canonical projections by prωn:ISω→ISn.
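To make the inverse limit concrete, here is a loose sketch (my own illustrative encoding, not the formal definition): represent an n-step instrumental state by its Definition 6 semantics, a map from short histories to observation distributions; then prn simply restricts the map to shorter histories, and an element of ISω is a family consistent under these restrictions.

```python
# Represent an n-step instrumental state by its behaviour: a map from
# histories (tuples of (action, observation) pairs) of length < n to
# observation distributions. Illustrative encoding only.
def pr(theta, n):
    # forget the behaviour at time step n: keep histories of length < n
    return {h: dist for h, dist in theta.items() if len(h) < n}

# a 2-step state over A = {a}, O = {0, 1}:
theta_2 = {
    (): {0: 0.5, 1: 0.5},
    (("a", 0),): {0: 1.0, 1: 0.0},
    (("a", 1),): {0: 0.0, 1: 1.0},
}
theta_1 = pr(theta_2, 1)
# consistency under projection, as required of an element of IS_omega:
assert theta_1 == {(): {0: 0.5, 1: 0.5}}
```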
Of course, ISω can also be regarded as the space of all possible stochastic environments. Specifically, we have:
Definition 10
For any μ∈ISω we define domμ⊆(A×O)∗ by
domμ:=∞⋃n=0domprωnμ
Definition 11
Given μ∈ISω, h∈domμ and a∈A, we define μ(ha)∈ΔO by
μ(ha):=(prω|h|+1μ)(ha)
Like in the finite time case, we have
Definition 12
Given μ∈ISω and π:(A×O)∗k→A, we define μπ∈Δ(A×O)ω by
Prao∼μπ[an=a∗]=Eao∼μπ[π(a∗ao:n)] Prao∼μπ[on=o∗]=Eao∼μπ[μ(o∗ao:nan)]
Proposition 3
Consider some μ,μ′∈ISω and assume that for any π:(A×O)∗→A, μπ=μ′π. Then, μ=μ′.
Proposition 4
Consider some μ,μ′∈ISω, π:(A×O)∗k→A and p∈[0,1]. Then
(pμ+(1−p)μ′)π=pμπ+(1−p)μ′π
For each n∈N, ISn is finite-dimensional and therefore has a natural topology. ISω also becomes a topological space by equipping it with the inverse limit topology. Since the ISn are closed and bounded, they are compact, and therefore ISω is also compact by Tychonoff's theorem. In the special case |A|=1, ISω≅ΔOω and the inverse limit topology is the weak topology for probability measures, defined w.r.t. the product topology on Oω.
We can now give the first definition of an instrumental reward function: a continuous affine function R:ISω→R. Why affine? A convex linear combination of instrumental states is empirically indistinguishable from a probabilistic lottery. If we assign expected values to probabilistic lotteries (as we should by the VNM theorem), then we must also assign them to convex linear combinations of instrumental states: otherwise our reward again depends on entirely unobservable parameters of our model.
An alternative approach is to consider the notion of "experiment" explicitly.
We will use the notation X≤ω:=X∗⊔Xω. Given a logical condition ϕ, the symbol 1ϕ will denote 1 when ϕ is true and 0 when ϕ is false.
Definition 13
Given π:(A×O)∗k→A⊔{⊥} and μ∈ISω, we define μπ∈Δ(A×O)≤ω by
Prao∼μπ[an=a∗]=Eao∼μπ[1|ao|≥n π(a∗∣ao:n)] Prao∼μπ[on=o∗]=Eao∼μπ[1|ao|>n μ(o∗∣ao:nan)]
π is said to be a terminable policy when for any μ∈ISω
Prh∼μπ[|h|<∞]=1
That is, a terminable policy is allowed to produce a special token ⊥ which terminates the "experiment" (instead of choosing an action), and we require that, for any environment, this token will be eventually produced with probability 1.
Definition 14
Given a terminable policy π and a bounded function r:(A×O)∗→R, we define the function Rπr:ISω→R by
Rπr(μ):=Eh∼μπ[r(h)]
This gives us a second definition of instrumental reward functions. In fact, the two definitions are equivalent:
Theorem 1
For any terminable policy π and bounded function r:(A×O)∗→R, Rπr is continuous and affine. Conversely, for any continuous affine R:ISω→[0,1], there exists a terminable policy π and r:(A×O)∗→[−2,2] s.t. R=Rπr.
Putting the second part of the theorem into words, for any instrumental reward function (in the sense of the first definition) there is some experiment the agent can do which yields an unbiased estimate of the reward for the instrumental state that existed at the beginning of the experiment.
The range [−2,2] is not optimal, but for the regret bound in the next subsection, it's only important that it is bounded by some fixed constant.
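The experiment reading of Theorem 1 can be sketched with a toy Monte Carlo check (everything here is hypothetical: a stand-in environment, a policy that probes twice and then emits the stop token, and a bounded transcript function r):

```python
import random

STOP = None  # stands in for the special token ⊥

def terminable_policy(history):
    # probe twice, then terminate; trivially terminable
    return "probe" if len(history) < 2 else STOP

def environment(history, action):
    # stand-in stochastic environment: observations are fair coin flips
    return random.choice(["o0", "o1"])

def run_experiment(policy, env):
    history = []
    while (action := policy(history)) is not STOP:
        history.append((action, env(history, action)))
    return history

def r(history):
    # bounded estimate: fraction of "o1" in the transcript
    return sum(o == "o1" for _, o in history) / max(len(history), 1)

random.seed(0)
estimates = [r(run_experiment(terminable_policy, environment))
             for _ in range(2000)]
mean = sum(estimates) / len(estimates)
# repeating the experiment estimates the true value 1/2 without bias
assert abs(mean - 0.5) < 0.05
```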
To illustrate this concept of instrumental reward function, imagine that Clippy has access to a black box with a collection of levers on its outside. Pulling the levers produces some sounds that hint at what happens inside the box, but are not enough to determine it with certainty. The box has a shuttered glass window, whose shutter can be opened from the outside. Through the window, Clippy can see a jumble of scrap metal, mechanical manipulators that are controlled by the levers (and can be used to shape the scrap metal), and also a few mice running around the box and wreaking havoc. However, it is not possible to control the manipulators while the shutter is open. Worse, while opening the shutter allows seeing a snapshot of the shape of the metal, it also causes the manipulators to move chaotically, ruining this shape. So, Clippy can experiment with the levers and occasionally open the shutter to test the result. However, in order to produce and maintain paperclips inside the box, the shutter has to be kept closed (and the paperclips hidden) most of the time.
It is also possible to consider reward functions of the more general form R:(A×O)∗×ISω→R, required to be continuous and affine in the second argument. Such a reward function depends both on the current (unknown) instrumental state of the environment and the observable history so far. By Theorem 1, such a reward can be equivalently described in terms of a family {πh:(A×O)∗→A⊔{⊥}}h∈(A×O)∗ of terminable policies and a family {rh:(A×O)∗→R}h∈(A×O)∗ of bounded functions s.t.
R(h,μ)=Rπhrh(μ)
This means that the value of the reward can be estimated empirically, but only if the agent remembers the entire observation history. If the history is forgotten at some point, the agent might never be able to estimate the reward again. We will call such reward functions "semi-instrumental".
Although semi-instrumental reward functions are more general than instrumental reward functions, I think there is some interest in studying the narrower class. Arguably, instrumental reward functions are a better model of what counts as "consequentialist" or "goal-directed" behavior, since they depend only on the state of the environment. Indeed, it is easy to construct a semi-instrumental reward function that makes any given policy Bayes-optimal for any given prior, so semi-instrumental reward functions are incapable (without further assumptions) of distinguishing between "intelligent" and "unintelligent" behavior. On the other hand, optimality w.r.t. some instrumental reward function seems like a stronger condition.
In order to derive a regret bound, we will restrict attention to those reward functions for which the terminable policy can be made to terminate within time that has some finite and bounded expected value. I don't know an elegant way to characterize those reward functions in general, but we will describe one class of such functions.
Definition 15
Consider any C∈ConSet. Then, we can define the total variation distance dtv:C×C→R by
dtv(x,y):=supr∈Mor(C,[0,1])r(x)−r(y)
In general, dtv is a pseudometric. Moreover, when C is finite-dimensional and bounded, it is a metrization of the natural topology. For a finite set A and C=ΔA, dtv is just the usual total variation metric. For C a ball of unit diameter in Euclidean space, dtv is the Euclidean metric.
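As a sanity check of the C=ΔA case (a sketch; the sup over affine r into [0,1] is attained by the indicator of the set where x exceeds y):

```python
def dtv_simplex(x, y):
    # usual total variation distance between two finite distributions
    return sum(max(xi - yi, 0.0) for xi, yi in zip(x, y))

x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
assert abs(dtv_simplex(x, y) - 0.3) < 1e-12
```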
Definition 16
Consider any λ∈(0,1). We define the metric dλtv on ISω by
dλtv(μ,ν):=supn∈Nλndtv(prωnμ,prωnν)
For any λ, dλtv is a metrization of the inverse limit topology on ISω.
Proposition 5
Consider any λ∈(0,1) and R:ISω→R affine and Lipschitz with respect to dλtv. Then there is a terminable policy π and a bounded function r:(A×O)∗→R s.t. R=Rπr and
supμ∈ISω Eh∼μπ[|h|]<∞
Note that λ is a parameter reminiscent of geometric time discount that constrains the shape of the reward function. However, in the regret analysis that follows, it is not the same as the actual geometric time discount parameter γ. In particular, we consider the asymptotics in which the latter approaches 1, while the reward function is assumed to be fixed. It might be interesting to study regimes in which both approach 1, but we will not attempt it at present.
A Regret Bound for RL with Instrumental Rewards

Fix A and O. For any finite set S and T:S×Ak→S×O, there is a unique mapping IT:S→ISω s.t.
IT(s)ao=(∑s′∈ST(s′,o∣∣s,a)IT(s′),∑s′∈ST(s′,o∣∣s,a))
That is, IT(s) is just the instrumental state corresponding to the POMDP state s.
Given R:ISω→R continuous affine, we get for any S and T the POMDP reward function R∘IT. Hence, one natural setting for regret analysis is fixing R and S and considering a hypothesis class of transition kernels
H⊆{S×Ak→S×O}
However, regret analysis for POMDPs involves some technical complexity, and we prefer to start with a simpler setting. Hopefully, we will address the POMDP case in a followup essay.
Suppose that O=S. Then, given any MDP transition kernel T:S×Ak→S, we can define the POMDP transition kernel T♯:S×Ak→S×O by
T♯(s′,s′′∣∣s,a):=1s′=s′′T(s′∣∣s,a)
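The T♯ construction is mechanical; a minimal sketch (kernels encoded as dictionaries of probabilities, names hypothetical):

```python
def lift(T):
    # T maps (s, a) to {next_state: prob}; the lifted kernel emits the
    # next state itself as the observation, deterministically
    def T_sharp(s, a):
        return {(s2, s2): p for s2, p in T(s, a).items()}
    return T_sharp

def T(s, a):
    # toy MDP kernel over two states
    return {"s0": 0.25, "s1": 0.75} if a == "go" else {s: 1.0}

T_sharp = lift(T)
assert T_sharp("s0", "go") == {("s0", "s0"): 0.25, ("s1", "s1"): 0.75}
assert T_sharp("s1", "stay") == {("s1", "s1"): 1.0}
```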
Fix R:S×ISω→[0,1] continuous affine in the second argument (a type of semiinstrumental reward function). For any T as above, we get the induced reward function RT:S→[0,1] given by RT(s):=R(s,IT♯(s)). We can now consider the learning problem corresponding to a hypothesis class of MDP transition kernels
H⊆{S×Ak→S}
It might seem odd to consider a setting with fully observable states in the context of imperceptible rewards. However, we can think about it as follows: what we observe is only a label whose physical meaning is unknown. Only given the (unknown) transition kernel can such a label be interpreted as carrying a reward.
For example, imagine a robot tasked with sorting balls into bins according to their weight. Some of the balls are solid gold and some of them are solid silver, so it's in principle possible to know their weight just by looking at them. However, the robot doesn't know a priori whether silver or gold is heavier. On the other hand, the robot can perform some experiment with a ball to determine its weight. In this situation, the reward is imperceptible even if the state (e.g. the locations, colors and sizes of the balls, the locations and shapes of the bins and the location of the robot and the state of its manipulators) is fully observable.
Using the concepts of MB dimension and RVO dimension we defined in a previous essay, we can formulate the regret bound.
Theorem 2
There is some C∈R+ s.t. the following holds.
Consider any finite nonempty sets S and A, a function R:S×ISω→[0,1] continuous and affine in the second argument, a closed set H⊆{S×Ak→S} and a Borel probability measure ζ on H (prior). Suppose πest:S×(A×S)∗k→A⊔{⊥} is s.t. for any s∈S, πest(s,⋅) is terminable, and r:S×(A×S)∗→[0,1] is a bounded function s.t.
R(s,⋅)=Rπest(s,⋅)r(s,⋅)
Define the maximal expected reward estimation time test by
test:=sups∈S,μ∈ISω Eh∼μπest(s,⋅)[|h|]
Define the maximal bias span τ by
τ:=limsupγ→1 maxT∈H (maxs∈S VTRT(s,γ) − mins∈S VTRT(s,γ))/(1−γ)
Define HR⊆{S→R} by
HR:={RTT∈H}
Denote DMB:=dimMB HR and DRVO:=dimRVO HR. Define the Bayesian regret R(γ) by
R(γ):=ET∼ζ[EU∗TRT(γ)−EUπ†γTRT(γ)]
Then, there is a family of policies {π†γ:S∗×Sk→A}γ∈(0,1) s.t.
limsupγ→1 R(γ)/((τ+test)(DMB+1)DRVO(1−γ)ln(1/(1−γ)))^(1/3) ≤ C
Discussion

More on the Regret Bound

It might appear odd that the regret bound in Theorem 2 depends on the dimensions of the class of reward functions, but not on the dimensions of the class of transition kernels, as in the perceptible case. The reason for this is that we only give the asymptotic form. Asymptotically, the leading term is O(((1−γ)ln(1/(1−γ)))^(1/3)) and its coefficient depends only on those parameters that appear in Theorem 2. However, examining the proof reveals that there is also an O(((1−γ)ln(1/(1−γ)))^(1/2)) term that has the same form as the regret bound in the perceptible case. For the sake of brevity and simplicity we will not attempt to write down a more precise regret bound that reflects the role of the dimensions of the class of transition kernels, but in principle it is not difficult. We must keep in mind, however, that in practice the other term might be significant: a priori, the dimensions of the class of transition kernels are only bounded by a polynomial in |S| and |A|, and the latter might be exponentially big for realistic models.
The Death of the Agent and Kamikaze Strategies

There is one potential application of defining rewards that are not a direct function of observations which seems not supported by instrumental rewards as we defined them. Namely, one can imagine agents that are motivated to follow a certain strategy that is designed to destroy the agent but produce some desirable results in the external world (a "kamikaze strategy"). In other words, although survival is a convergent instrumental goal, it seems entirely reasonable for it to be sacrificed in certain specific scenarios. To give an example, imagine a bodyguard robot whose primary goal is keeping a certain person alive. If assassins shoot at the person, and the only way to stop them is for the robot to block the bullets with its body, then that can be the optimal strategy, even if it destroys the robot and prevents it from continuing its function as bodyguard (assuming that e.g. it will give the person a chance to escape or someone else a chance to stop the assassins).
Our formalism doesn't directly address the possibility of the agent's death, because the sequence of actions and observations is assumed to be infinite. One simple way to accommodate death is postulating a special observation ⊥∈O s.t. it is never received before death and always received after death. If we do this, then death corresponds to a specific instrumental state and therefore its reward is a fixed constant. This seems incompatible with kamikaze strategies where the decision of self-sacrifice is contingent on conditions in the external world after death.
Another approach is assuming the agent becomes a "ghost": it keeps receiving observations which reflect the actual physical world but its actions have no effect. Such ghosts might theoretically be produced by a simplicity prior: for example, if the agent is an AI connected to a camera that monitors a room, then we can imagine a virtual video of the same room continuing beyond the AI's shutdown. This allows for different instrumental states after death and can potentially justify kamikaze strategies, but it seems like a very fragile construct and is unlikely to guarantee reasonable behavior.
The problem with death can be viewed from another perspective: our regret analysis assumes a no-traps condition (τ<∞), whereas death is usually a trap. Therefore, to guarantee rational behavior while accounting for death, we need to operate within a framework that allows for traps.
One such framework is requiring Bayes-optimality and giving up on learnability. This seems both too weak (because nothing is guaranteed for specific environments) and too strong (because it's computationally intractable). That said, I think this approach can be salvaged by requiring something somewhat weaker than Bayes-optimality and proving something somewhat weaker than learnability (hopefully I will write on that in a future essay). In any case, once we give up on learnability we can allow unobservable rewards (the general POMDP setting in the beginning of the Results section), which handle death and kamikaze strategies easily. Specifically, we can have a plethora of "dead" states that produce only the ⊥ observation and whose transitions do not depend on the action, but which have different rewards. So, this approach "solves" the problem but at a high cost: no learnability or regret analysis.
Another framework for dealing with traps is Delegative Reinforcement Learning (DRL). In DRL the agent interacts with an external advisor, and is thereby able to successfully navigate all traps that the advisor knows about. In other words, it converges to a policy that is Bayesoptimal with respect to the belief state of the advisor (while the advisor is not able to produce such a policy by itself). Combining DRL with the instrumental rewards formalism should allow accounting for death. Specifically, in any given POMDP hypothesis, the state space will be decomposed as S=Sdead⊔Salive, with the following assumptions:

The transition kernel on states in Sdead doesn't depend on the action.

Sdead is invariant under the dynamics (death is irreversible).

The reward function on Sdead can be arbitrary.

The reward function on Salive has to factor through IT (i.e. depend only on the instrumental state), and moreover the policy for estimating the reward function is known to be safe (alternatively, we might let the agent learn a safe estimation policy from some space of candidate policies).
Under these assumptions (or some variant thereof), it seems plausible that we should be able to prove learnability and derive a reasonable regret bound. Formalizing this line of thought is a task for future work.
The DRL approach to kamikaze strategies also has implications for corrigibility. Indeed, if the external advice is compatible with an admissible reward function that recommends shutting down the agent (i.e. upon delegation, the advisor acts to shut down the agent), then a DRL agent will assist with its own shutdown. This property is preserved by subagents, since creating a subagent is another potentially irreversible action (and therefore must be vetted by the advisor).
Also, it is tempting to speculate about DRL as a model of human self-sacrifice. We have already speculated before that we can view DRL as a model of human learning, where the human's social environment or humanity as a whole is regarded as playing the role of the advisor. Similarly, humans who sacrifice their lives for the sake of their loved ones or the "greater good" can be regarded as heeding that "advice" regarding the rewards of states in which the human doesn't exist. Under this model, we can make sense of self-sacrifice by individual humans but not of hypothetical self-sacrifice by humanity as a whole (potentially augmented by other minds with which we could communicate and find common ground); but the latter restriction seems compatible with intuition.
Specifying Instrumental Reward Functions

It might be useful to find natural ways to specify instrumental reward functions. In particular, it is interesting to start with a reward function defined on a specific ontology (POMDP) and somehow extend it to a full instrumental reward function (which, like we said before, induces a reward function on any ontology via the IT mapping).
De Blanc suggests one way to extend reward functions from one POMDP to another, but I don't know whether this operation leads to an instrumental reward function, i.e. whether it is compatible with the constraints imposed by affine dependencies between the states (and I don't see any reason why it should).
Essentially, what we're looking for is a way to extend an affine function from an affine subspace to the entire affine space (the affine subspace is the affine span of the instrumental states corresponding to the states of the initial ontology; note that, if these instrumental states are not affine independent, then we have some constraint on this initial reward function). One natural way to do it is looking for an affine function whose differential (the linear function part) has minimal norm, or choosing some privileged projection operator from the entire space to the affine subspace. However, since the natural norms we have here are not inner product norms, these choices are usually nonunique, and it's possible we can't get much out of it.
A different natural approach is using Occam's razor, i.e. looking for an extension of minimal description length or the expected value of a random extension sampled from a conditional simplicity prior. This requires a natural way to describe instrumental reward functions. We do have such a way: assuming that the π and r guaranteed to exist by Theorem 1 can be chosen to be computable, the description is the program for those objects with respect to a fixed universal Turing machine (we consider a single program that computes both π and r to exploit possible mutual information between the two).
These questions are left for future research.
Proofs

Proof of Proposition 1
We proceed by induction on n. For n=0 there is nothing to prove since IS0=pt, so θ=θ′ a priori. Now, suppose n>0.
We need to show that for any a∗∈A and o∗∈O, θa∗o∗=θ′a∗o∗. To show it for given a∗ and o∗, we consider some π s.t. π(λ)=a∗. Since θπ=θ′π, we have
Prao∼θπ[o0=o∗]=Prao∼θ′π[o0=o∗]
By Definition 7, we get,
θ(o∗a∗)=θ′(o∗a∗)
By Definition 6, we get,
hISn−1(θa∗o∗)=hISn−1(θ′a∗o∗)
If this value vanishes then θa∗o∗=θ′a∗o∗=0. Otherwise, consider any σ:(A×O)<n−1→A and define π:(A×O)<n→A by
π(λ):=a∗ π(aoh):=σ(h)
Definitions 6 and 7 imply that
[θa∗o∗]σ(h)=θπ(a∗o∗h)/θ(o∗∣a∗)
and similarly for θ′. We have θπ=θ′π and θ(o∗a∗)=θ′(o∗a∗), which implies [θa∗o∗]σ=[θ′a∗o∗]σ. By the induction hypothesis, we get [θa∗o∗]=[θ′a∗o∗]. Combining this with hISn−1(θa∗o∗)=hISn−1(θ′a∗o∗), we get θa∗o∗=θ′a∗o∗. ■
Proposition A.1
Consider any C∈ConSet, θ,θ′∈ConeC and p∈[0,1]. Assume that
hC(pθ+(1−p)θ′)>0
Then
[pθ+(1−p)θ′]=phC(θ)[θ]+(1−p)hC(θ′)[θ′]phC(θ)+(1−p)hC(θ′)
Here, hC(θ)[θ] is understood to mean 0 when hC(θ)=0 and the same for θ′.
Proof of Proposition A.1
Obvious from the definitions.
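Proposition A.1 is the familiar fact that normalizing a mixture reweights the components by their masses. As a quick sanity check, here is a minimal numeric sketch that models a cone element as a nonnegative vector, with hC playing the role of total mass and [θ] the role of normalization — illustrative stand-ins for the abstract definitions, not the actual constructions:

```python
import numpy as np

# Model a "cone element" as a nonnegative vector: h(theta) stands in for
# hC (total mass) and norm(theta) for [theta] (normalization).
def h(theta):
    return theta.sum()

def norm(theta):
    return theta / h(theta)

rng = np.random.default_rng(0)
theta1, theta2 = rng.random(4), rng.random(4)
p = 0.3

mix = p * theta1 + (1 - p) * theta2
lhs = norm(mix)
# The claimed identity: normalize the mixture = mass-weighted mixture of
# the normalizations, renormalized by the mixed masses.
rhs = (p * h(theta1) * norm(theta1) + (1 - p) * h(theta2) * norm(theta2)) \
      / (p * h(theta1) + (1 - p) * h(theta2))
assert np.allclose(lhs, rhs)
```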
Proof of Proposition 2
Given n∈N+ and some α,β∈Δ(A×O)n, to show that α=β it is sufficient to show that
 For any a∗∈A and o∗∈O,
Prao∼α[ao:1=a∗o∗]=Prao∼β[ao:1=a∗o∗]
 For any a∗∈A and o∗∈O, if Prao∼α[ao:1=a∗o∗]>0 then for any h∈(A×O)n−1
Prao∼α[ao=a∗o∗hao:1=a∗o∗]=Prao∼β[ao=a∗o∗hao:1=a∗o∗]
We will use this to prove the claim by induction on n. For n=0, there is nothing to prove. For n>0, by Definition 7
Prao∼(pθ+(1−p)θ′)π[ao:1=a∗o∗]=π(a∗λ)(pθ+(1−p)θ′)(o∗a∗)
By Definition 6 Prao∼(pθ+(1−p)θ′)π[ao:1=a∗o∗]=π(a∗λ)hISn−1((pθ+(1−p)θ′)a∗o∗)
Pr(pθ+(1−p)θ′)π[ao:1=a∗o∗]=pπ(a∗λ)h(θa∗o∗)+(1−p)π(a∗λ)h(θ′a∗o∗)
Pr(pθ+(1−p)θ′)π[ao:1=a∗o∗]=pπ(a∗λ)θ(o∗a∗)+(1−p)π(a∗λ)θ′(o∗a∗)
Pr(pθ+(1−p)θ′)π[ao:1=a∗o∗]=pPrθπ[ao:1=a∗o∗]+(1−p)Prθ′π[ao:1=a∗o∗]
Pr(pθ+(1−p)θ′)π[ao:1=a∗o∗]=Prpθπ+(1−p)θπ[ao:1=a∗o∗]
Furthermore, Definitions 6 and 7 imply that
Pr(pθ+(1−p)θ′)π[a∗o∗hao:1=a∗o∗]=Pr[(pθ+(1−p)θ′)a∗o∗]σ[h]
Here, σ is defined by σ(h):=π(a∗o∗h).
Denote
q:=ph(θa∗o∗)ph(θa∗o∗)+(1−p)h(θ′a∗o∗)
By Proposition A.1
[(pθ+(1−p)θ′)a∗o∗]=q[θa∗o∗]+(1−q)[θ′a∗o∗]
We get
Pr(pθ+(1−p)θ′)π[a∗o∗hao:1=a∗o∗]=Pr(q[θa∗o∗]+(1−q)[θ′a∗o∗])σ[h]
By the induction hypothesis, we get
Pr(pθ+(1−p)θ′)π[a∗o∗hao:1=a∗o∗]=Prq[θa∗o∗]σ+(1−q)[θ′a∗o∗]σ[h]
Decomposing the right hand side and applying the same reasoning in reverse,
Pr(pθ+(1−p)θ′)π[a∗o∗ha∗o∗]=qPrθπ[a∗o∗ha∗o∗]+(1−q)Prθ′π[a∗o∗ha∗o∗]
By Definition 6, q can be written as
q=pθ(o∗a∗)pθ(o∗a∗)+(1−p)θ′(o∗a∗)
Multiplying the numerator and denominator by π(a∗λ), we get
q=pπ(a∗λ)θ(o∗a∗)pπ(a∗λ)θ(o∗a∗)+(1−p)π(a∗λ)θ′(o∗a∗)
q=pPrθπ[a∗o∗]pPrθπ[a∗o∗]+(1−p)Prθ′π[a∗o∗]
Combining this with the previous identity, we get
…=pPrθπ[a∗o∗]Prθπ[a∗o∗ha∗o∗]+(1−p)Prθ′π[a∗o∗]Prθ′π[a∗o∗ha∗o∗]pPrθπ[a∗o∗]+(1−p)Prθ′π[a∗o∗]
Pr(pθ+(1−p)θ′)π[a∗o∗ha∗o∗]=pPrθπ[a∗o∗h]+(1−p)Prθ′π[a∗o∗h]pPrθπ[a∗o∗]+(1−p)Prθ′π[a∗o∗]
Pr(pθ+(1−p)θ′)π[a∗o∗ha∗o∗]=Prpθπ+(1−p)θ′π[a∗o∗h]Prpθπ+(1−p)θ′π[a∗o∗]
Pr(pθ+(1−p)θ′)π[a∗o∗ha∗o∗]=Prpθπ+(1−p)θ′π[a∗o∗ha∗o∗] ■
Given μ∈ΔXω and f:Xω→Xn defined by f(x):=x:n, we will denote μ:n:=f∗μ.
Proposition A.2
Consider any C,D∈ConSet and f∈Mor(C,D). Then,
hD∘Conef=hC
Proof of Proposition A.2
Obvious from the definitions.
Proposition A.3
Consider any C,D∈ConSet, f∈Mor(C,D) and θ∈ConeC∖0. Then,
[(Conef)(θ)]=f([θ])
Proof of Proposition A.3
Obvious from the definitions.
Proposition A.4
Consider any n∈N, θ∈ISn+1, a∈A and h∈domθ s.t. h<n. Then,
prn(θ)(ha)=θ(ha)
Proof of Proposition A.4
We prove the claim by induction on n. For n=0, there is nothing to prove. Assume n>0. For h=λ, we have
prn(θ)(oa)=h(prn(θ)ao)=h((Coneprn−1)(θao))
By Proposition A.2
prn(θ)(oa)=h(θao)=θ(oa)
Now suppose that h=a′o′h′. We get
prn(θ)(a′o′h′a)=[prn(θ)a′o′](h′a)=[(Coneprn−1)(θa′o′)](h′a)
By Proposition A.3
prn(θ)(a′o′h′a)=prn−1[θa′o′](h′a)
Using the induction hypothesis
prn(θ)(a′o′h′a)=[θa′o′](h′a)=θ(a′o′h′a) ■
Proof of Proposition 3
Proposition A.4 implies that, in Definition 11, we can replace h+1 by any m≥h+1. Therefore, for any n∈N, (prωnμ)π=μπ:n. Hence, μπ=μ′π implies that (prωnμ)π=(prωnμ′)π. By Proposition 1, we get prωnμ=prωnμ′, and therefore μ=μ′. ■
Proof of Proposition 4
For any n∈N, Proposition 2 implies that
(p⋅prωnμ+(1−p)⋅prωnμ′)π=p(prωnμ)π+(1−p)(prωnμ′)π
prωn is an affine mapping, therefore
prωn(pμ+(1−p)μ′)π=p(prωnμ)π+(1−p)(prωnμ′)π
Using Proposition A.4
(pμ+(1−p)μ′)π:n=pμπ:n+(1−p)μ′π:n=(pμπ+(1−p)μ′π):n
Since this holds for any n, we must have that
(pμ+(1−p)μ′)π=pμπ+(1−p)μ′π ■
Proposition A.5
For any terminable policy π
limn→∞supμ∈ISωPrh∼μπ[h>n]=0
Proof of Proposition A.5
Assume to the contrary that there is ϵ∈(0,1) and sequences {nk∈N}k∈N and {μk∈ISω}k∈N s.t. nk+1>nk and, for any k∈N
Prh∼μkπ[h>nk]>ϵ
ISω is compact, therefore {μk} has a convergent subsequence. Without loss of generality, we can assume that {μk} itself converges and denote μ∗:=limk→∞μk. For any k∈N and j>k, we have nk<nj and therefore
Prh∼μjπ[h>nk]≥Prh∼μjπ[h>nj]>ϵ
Using Proposition A.4, it follows that
Prh∼prωnk+1(μj)π[h>nk]>ϵ
The left hand side is clearly continuous in prωnk+1(μj), and the latter converges to prωnk+1(μ∗) as j goes to ∞. We get
Prh∼μ∗π[h>nk]=Prh∼prωnk+1(μ∗)π[h>nk]>ϵ
Since this holds for any k, it follows that
Prh∼μ∗π[h=∞]=limk→∞Prh∼μ∗π[h>nk]≥ϵ
This contradicts the assumption that π is terminable. ■
Proposition A.6
Consider {Xn}n∈N a sequence of compact Polish spaces, and {prn:Xn+1→Xn}n∈N continuous mappings. Denote
Xω:=lim←−nXn
Denote prωn:Xω→Xn the canonical mapping. Let f:Xω→R be continuous. Then,
limn→∞supx∈Xnx1,2∈(prωn)−1(x)f(x1)−f(x2)=0
Proof of Proposition A.6
Assume to the contrary that there is ϵ∈R+, and sequences {nk∈N}k∈N, {xk1,xk2∈Xω}k∈N s.t. nk+1>nk, prωnk(xk1)=prωnk(xk2) and f(xk1)−f(xk2)>ϵ. Without loss of generality, we assume that nk=k and the limits x∗1,2:=limn→∞xn1,2 exist (the latter using the fact that Xω is compact by Tychonoff's theorem). It follows that f(x∗1)−f(x∗2)≥ϵ, and in particular x∗1≠x∗2 and therefore there is m∈N s.t. prωm(x∗1)≠prωm(x∗2). On the other hand, prωn(xn1)=prωn(xn2), implying that, for n≥m, prωm(xn1)=prωm(xn2) and therefore prωm(x∗1)=prωm(x∗2), a contradiction. ■
Proposition A.7
Let V be a finite-dimensional normed vector space and W⊆V∗ a linear subspace. Suppose v∈V and ϵ∈R+ are s.t. for any α∈W, α(v)≤ϵ∥α∥. Then, there is v′∈W⊥⊆V s.t. ∥v−v′∥≤ϵ.
In the above, ∥α∥ refers to the standard dual norm on V∗, induced by the norm on V. W⊥ refers to the set {v∈V∀α∈W:α(v)=0}.
Proof of Proposition A.7
Consider v∗∈W∗ defined by v∗(α):=α(v). By the assumption about v, ∥v∗∥≤ϵ. By the Hahn-Banach theorem, there is u∗∈V∗∗ s.t. u∗W=v∗ and ∥u∗∥≤ϵ. Using the canonical isomorphism V≅V∗∗, u∗ corresponds to some u∈V, and it follows that ∥u∥≤ϵ and for any α∈W, α(u)=α(v). We now take v′:=v−u. ■
Proposition A.8
Let C,D∈ConSet, f∈Mor(C,D) and r∈Mor(C,R). Assume D is a bounded closed finite-dimensional polytope, and f is onto (as a mapping). Suppose ϵ∈R+ s.t. for any x1,2∈C, if f(x1)=f(x2) then r(x1)−r(x2)≤ϵ. Then, there exists r′∈Mor(D,R) s.t. for any x∈C
∣∣r(x)−r′(f(x))∣∣≤32ϵ
Proof of Proposition A.8
Let X be the vector space corresponding to C, let Y be the vector space corresponding to D (so that C⊆X and D⊆Y) and let Q be the (finite) set of vertices of D. Since f is onto, we can choose some g:Q→C s.t. for any q∈Q, f(g(q))=q. Define v∈RQ by vq:=r(g(q)). Let RQ0 be the linear subspace of RQ given by
RQ0:=⎧⎨⎩u∈RQ∣∣ ∣∣∑q∈Quq=0⎫⎬⎭
Define the linear operators A:RQ→X and B:RQ→Y by
Au:=∑q∈Quqg(q)
Bu:=∑q∈Quqq
Consider any w∈kerB∩RQ0∖0. In particular, w∈RQ0 and therefore
0=∑q∈Qwq=∑q∈Q1wq>0wq+∑q∈Q1wq<0wq
Also
∥w∥1=∑q∈Q∣wq∣=∑q∈Q1wq>0wq−∑q∈Q1wq<0wq
Combining the last two identities, we conclude
∑q∈Q1wq>0wq=−∑q∈Q1wq<0wq=∥w∥12
Using the assumption w≠0, this allows us to define w+,w−∈ΔQ by
w+q:=2∥w∥11wq>0wq
w−q:=−2∥w∥11wq<0wq
We have w=12∥w∥1(w+−w−). Also, Bw=0 and therefore Bw+=Bw−.
Now, for any u∈ΔQ, we have
u⋅v=∑q∈Quqvq=∑q∈Quqr(g(q))=r⎛⎝∑q∈Quqg(q)⎞⎠=r(Au)
Moreover,
f(Au)=f⎛⎝∑q∈Quqg(q)⎞⎠=∑q∈Quqf(g(q))=∑q∈Quqq=Bu
It follows that
w⋅v=12∥w∥1(w+⋅v−w−⋅v)=12∥w∥1(r(Aw+)−r(Aw−))
By our previous reasoning, f(Aw+)=Bw+=Bw−=f(Aw−). Therefore, we can use the assumption on r to conclude that
w⋅v=12∥w∥1∣∣r(Aw+)−r(Aw−)∣∣≤ϵ2∥w∥1
By Proposition A.7, it follows that there is some v′∈RQ s.t. v′∈(kerB∩RQ0)⊥ and ∥v−v′∥∞≤ϵ2 (the L∞ norm is dual to the L1 norm).
Now, consider some y∈D. There is some u∈ΔQ s.t. y=Bu. We define r′(y):=u⋅v′. To see this definition is unambiguous, consider some u′∈ΔQ s.t. also y=Bu′. In particular, B(u−u′)=0 and therefore u−u′∈kerB. Moreover, u−u′∈RQ0 since u,u′∈ΔQ. Using that u−u′∈kerB∩RQ0 and v′∈(kerB∩RQ0)⊥, we get
u⋅v′−u′⋅v′=(u−u′)⋅v′=0
It is easy to see that r′ is affine.
Finally, consider any x∈C. Choose u∈ΔQ s.t. f(x)=Bu. We have
∣∣r(x)−r′(f(x))∣∣≤∣∣r(x)−r(Au)∣∣+∣∣r(Au)−r′(f(x))∣∣
Since f(Au)=Bu, we can use the assumption on r to bound the first term on the right hand side:
∣∣r(x)−r(Au)∣∣≤ϵ
For the second term, we have
∣∣r(Au)−r′(f(x))∣∣=∣∣u⋅v−u⋅v′∣∣=∣∣u⋅(v−v′)∣∣≤∥u∥1∥∥v−v′∥∥∞≤ϵ2
Combining the two, we conclude
∣∣r(x)−r′(f(x))∣∣≤32ϵ ■
Proposition A.9
Let A be a finite set, C∈ConSet and R∈Mor(CA,[0,1]). Assume R attains its infimum at some point θ∗∈CA. Then, there exist π∈ΔA and r:A→Mor(C,[0,1]) s.t. for any θ∈CA
R(θ)=Ea∼π[r(a,θa)]
Here, we used implicit currying: r(a,x):=r(a)(x).
Proof of Proposition A.9
For each a∈A, define ιa∈Mor(C,CA) by
ιa(x)b:={x if a=bθ∗b if a≠b
Denote Ra:=R∘ιa and Na:=supRa−infRa. If Na vanishes for every a∈A, then R is constant, so we can set r to the same constant and take arbitrary π. Otherwise, we define π and r by
πa:=Na∑b∈ANb
r(a,x):=R(θ∗)+∑b∈ANbNa(Ra(x)−R(θ∗))
We need to show that r takes values in [0,1]. It is nonnegative since R is minimal at θ∗, so both terms in r are nonnegative. To see it is not greater than 1, observe that
r(a,x)=R(θ∗)+∑b∈ANbNa(Ra(x)−R(θ∗))≤infR+∑b∈ANbNa(supRa−infR)
Clearly infR=infRa and therefore supRa−infR=Na. It is also easy to see that, since R is affine, supR−infR=∑b∈ANb. We get
r(a,x)≤infR+∑b∈ANbNaNa=infR+∑b∈ANb=supR≤1
It remains to show that R(θ)=Ea∼π[r(a,θa)]. We have
Ea∼π[r(a,θa)]=∑a∈Aπar(a,θa)
Ea∼π[r(a,θa)]=∑a∈Aπa(R(θ∗)+∑b∈ANbNa(Ra(θa)−R(θ∗)))
Ea∼π[r(a,θa)]=∑a∈AπaR(θ∗)+∑a∈Aπa∑b∈ANbNa(Ra(θa)−R(θ∗))
Ea∼π[r(a,θa)]=R(θ∗)+∑a∈ANa∑b∈ANb⋅∑b∈ANbNa(Ra(θa)−R(θ∗))
Ea∼π[r(a,θa)]=R(θ∗)+∑a∈A(Ra(θa)−R(θ∗))
Using the fact R is affine, the desired conclusion follows. ■
Proposition A.10
Consider any n∈N and R∈Mor(ISn,[0,1]). Then, there exist π:(A×O)<nk→A and r:(A×O)n→[0,1] s.t. for any θ∈ISn
R(θ)=Eθπ[r]
Proof of Proposition A.10
We proceed by induction on n. For n=0, R is constant and we set r to the same constant. Now, suppose n>0.
Define ISn−12∈ConSet by
ISn−12:=(∏o∈OhISn−1)−1(ΔO)
Using Proposition A.9, we get π0∈ΔA and r0:A→Mor(ISn−12,[0,1]) s.t. for any θ∈ISn
R(θ)=Ea∼π0[r0(a,θa)]
For every o∈O, we define jo∈Mor(ISn−1,ISn−12) by
jo(θ)o′:={(θ,1) if o=o′0 if o≠o′
For every a∈A and o∈O, we apply the induction hypothesis to r0(a)∘jo, and get πao:(A×O)<n−1k→A and rao:(A×O)n−1→[0,1] s.t. for any θ∈ISn−1
r0(a,jo(θ))=Eθπao[rao]
We now define π and r by
π(λ):=π0 π(aoh):=πao(h)
r(aoh):=rao(h)
Now, observe that, for any θ∈ISn
Eh∼θπ[r(h)]=Ea∼π0o∼θ(a)[Eh′∼[θao]πao[rao(h′)]]=Ea∼π0o∼θ(a)[r0(a,jo[θao])]
Using the fact that r0 is affine in the second argument, we get
Eh∼θπ[r(h)]=Ea∼π0[r0(a,Eo∼θ(a)[jo[θao]])]
Moreover, using Definition 6 and the definition of jo, we get
Eo∼θ(a)[jo[θao]]=∑o∈Oh(θao)⨁o′∈O(1o′=oθao,1o′=o)
Eo∼θ(a)[jo[θao]]=⨁o′∈O(∑o∈Oh(θao)1o′=oθao,∑o∈Oh(θao)1o′=o)
Eo∼θ(a)[jo[θao]]=⨁o′∈O(h(θao′)θao′,h(θao′))=⨁o′∈Oθao′=θa
Combining this with the previous identity, we get
Eh∼θπ[r(h)]=Ea∼π0[r0(a,θa)]=R(θ) ■
Proposition A.11
Consider some α∈ΔN and sequences
{πk:(A×O)∗k→A⊔{⊥}}k∈N
{rk:(A×O)∗→[0,1]}k∈N
Assume πk is a terminable policy for every k∈N. Then, there is π a terminable policy and r:(A×O)∗→[0,1] s.t.
Rπr=Ek∼α[Rπkrk]
Proof of Proposition A.11
Define π and r by setting, for any n∈N and ao∈(A×O)n
π(ao):=En∼α[∏n−1m=0πn(amao:m)⋅πn(ao)]En∼α[∏n−1m=0πn(amao:m)]
r(ao):=En∼α[∏n−1m=0πn(amao:m)⋅πn(⊥ao)rn(ao)]En∼α[∏n−1m=0πn(amao:m)⋅πn(⊥ao)]
When the denominator vanishes, we can make an arbitrary choice.
Essentially, what we do is sample n from α and then run the "experiment" defined by (πn,rn). We have
Rπr(μ)=Eμπ[r]=En∼α[Eμπn[rn]]=En∼α[Rπnrn(μ)] ■
For any n,m∈N s.t. m≥n, we will denote by prmn:ISm→ISn the canonical projection. That is
prmn:=prm−1∘prm−2…∘prn+1∘prn
Proof of Theorem 1
Direction 1: Rπr is affine by Proposition 4. To verify it is continuous, consider a sequence {μk∈ISω}k∈N s.t. μ∗:=limk→∞μk exists. Denote Δr:=supr−infr, μnk:=prωn(μk) and μn∗:=prωn(μ∗). For any n∈N, we can decompose the expected value into the contribution of histories of length at most n and the contribution of longer histories.
Rπr(μk)−Rπr(μ∗)≤Δr(dtv(μnk,μn∗)+Prμkπ[h>n]+Prμ∗π[h>n])
μnk converges to μn∗ as k goes to ∞, therefore dtv(μnk,μn∗) converges to 0 and we get
limsupk→∞Rπr(μk)−Rπr(μ∗)≤2Δrsupμ∈ISωPrμπ[h>n]
Since this holds for any n∈N, we get
limsupk→∞Rπr(μk)−Rπr(μ∗)≤2Δrinfn∈Nsupμ∈ISωPrμπ[h>n]
By Proposition A.5, the right hand side vanishes.
Direction 2: Define {ϵn∈[0,1]}n∈N by
ϵn:=supθ∈ISnμ1,2∈(prωn)−1(θ)R(μ1)−R(μ2)
By Proposition A.6, limn→∞ϵn=0. By Proposition A.8, there is a sequence {~Rn∈Mor(ISn,[0,1])}n∈N s.t. for any n∈N
supμ∈ISω∣∣R(μ)−~Rn(prωn(μ))∣∣≤32ϵn
It follows that there is a sequence {nk∈N}k∈N s.t. nk+1>nk and
∞∑k=0supμ∈ISω∣∣~Rnk+1(prωnk+1(μ))−~Rnk(prωnk(μ))∣∣<1
Define ΔRk∈Mor(ISnk+1,[−1,1]) by
ΔRk(θ):=~Rnk+1(θ)−~Rnk(prnk+1nk(θ))
Denote δk:=supΔRk. We can assume without loss of generality that δk>0 for every k∈N (this can be arranged by choosing appropriate nk, unless for some m∈N, R=prωm∘~Rm; but in the latter case, the theorem follows directly from Proposition A.10). By Proposition A.10, for every k∈N, there is πk+1:(A×O)<nk+1k→A and rk+1:(A×O)nk+1→[−1,1] s.t.
1δkΔRk(θ)=Eθπk+1[rk+1]
Also by Proposition A.10, there is π0:(A×O)<n0k→A and r0:(A×O)n0→[0,1] s.t.
~Rn0(θ)=Eθπ0[r0]
By the construction, ∑∞k=0δk<1, and moreover, denoting C:=1+∑∞k=0δk
R=prωn0∘~Rn0+∞∑k=0prωnk+1∘ΔRk=C(1Cprωn0∘~Rn0+∞∑k=0δkC⋅prωnk+1∘ΔRkδk)
By Proposition A.11, there is π a terminable policy and r:(A×O)∗→[−2,2] (the range of rk is [−1,1] and C<2) s.t. Rπr=R. ■
Proposition A.12
Consider any λ∈(0,1), n∈N, θ∈ISn and μ1,2∈(prωn)−1(θ). Then,
dλtv(μ1,μ2)≤λn+1
Proof of Proposition A.12
By Definition 16,
dλtv(μ1,μ2)=supm∈Nλmdtv(prωmμ1,prωmμ2)
For m≤n, prωmμ1=prnmθ=prωmμ2, and therefore dtv(prωmμ1,prωmμ2)=0. For m>n, dtv(prωmμ1,prωmμ2)≤1 by Definition 15. We get
dλtv(μ1,μ2)≤supm>nλm=λn+1 ■
Proof of Proposition 5
Define {ϵn∈[0,1]}n∈N by
ϵn:=supθ∈ISnμ1,2∈(prωn)−1(θ)R(μ1)−R(μ2)
Let L∈R be the Lipschitz constant of R. We have
ϵn≤Lsupθ∈ISnμ1,2∈(prωn)−1(θ)dλtv(μ1,μ2)
By Proposition A.12, we get
ϵn≤Lλn+1
Without loss of generality, assume the range of R is contained in [0,1]. By Proposition A.8, it follows that there is a sequence {~Rn∈Mor(ISn,[0,1])}n∈N s.t. for any n∈N
supμ∈ISω∣∣R(μ)−~Rn(prωn(μ))∣∣≤32Lλn+1
It follows that
∞∑n=0supμ∈ISω∣∣~Rn+1(prωn+1(μ))−~Rn(prωn(μ))∣∣≤32L∞∑n=0(λn+2+λn+1)
∞∑n=0supμ∈ISω∣∣~Rn+1(prωn+1(μ))−~Rn(prωn(μ))∣∣≤3L2⋅λ2+λ1−λ
Define ΔRn∈Mor(ISn+1,[−1,1]) by
ΔRn(θ):=~Rn+1(θ)−~Rn(prn(θ))
Denote δn:=supΔRn. By Proposition A.10, for every n∈N s.t. δn>0, there is πn+1:(A×O)≤nk→A and rn+1:(A×O)n+1→[−1,1] s.t.
1δnΔRn(θ)=Eθπn+1[rn+1]
~R0 is a constant. Define C∈R+ by
C:=1+∞∑k=0δk≤1+3L2⋅λ2+λ1−λ
We have
R=~R0+∞∑n=0prωn+1∘ΔRn=C(1C~R0+∞∑n=0δnC⋅prωn+1∘ΔRnδn)
Here, the terms with δn=0 are understood to vanish. By Proposition A.11, there is π a terminable policy and r:(A×O)∗→[−C,C] s.t. Rπr=R. Moreover, looking at the construction in the proof of Proposition A.11, we can see that
Eh∼μπ[h]=∞∑n=0δnC(n+1)≤∞∑n=03L2(λn+2+λn+1)(n+1)<∞ ■
In order to prove Theorem 2, we will consider the policy πPSγ,T:S∗×Sk→A implemented by an algorithm which is PSRL with two modifications:
 Each episode has random duration, uniformly distributed from 1 to 2T−1, for some fixed T∈N+.
 Between any two regular episodes, there is an "interlude" which consists of performing the reward estimation experiment (πest).
The PSRL algorithm requires choosing a Borel measurable mapping Π:H×S→A s.t. Π(T) is an optimal policy, i.e.
EUΠ(T)TRT(γ)=EU∗TRT(γ)
As usual, we will consider (Ω,P), a probability space governing the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm. Let Y∗:Ω→H be a random variable representing the true hypothesis, {Yk:Ω→H}k∈N be random variables s.t. Yk represents the hypothesis sampled at episode k, {Θn:Ω→S}n∈N be s.t. Θn represents the state at time n, {An:Ω→A}n∈N be s.t. An represents the action taken at time n, {Nk:Ω→N}k∈N be s.t. Nk represents the time when the kth episode starts and {Mk:Ω→N}k∈N be s.t. Mk represents the time when the kth interlude starts.
This probability space can be formally defined via recursive equations on the random variables, but this is straightforward and we will omit it.
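The modified PSRL loop described above can be sketched schematically as follows. The interfaces `env`, `posterior`, `pi_est` and `plan` are hypothetical, introduced only for illustration; the sketch shows the control flow (random episode lengths uniform on {1,…,2T−1}, with a reward-estimation interlude between episodes), not the paper's formal construction:

```python
import random

def modified_psrl(env, posterior, pi_est, plan, T, horizon):
    """Schematic sketch of PSRL with the two modifications above:
    random episode durations uniform on {1, ..., 2T-1}, and a
    reward-estimation interlude between consecutive episodes."""
    t = 0
    history = []
    while t < horizon:
        # Regular episode: sample a hypothesis from the posterior and
        # follow its optimal policy Pi(hypothesis) for a random duration.
        hypothesis = posterior.sample(history)
        duration = random.randint(1, 2 * T - 1)  # uniform on {1,...,2T-1}
        for _ in range(duration):
            a = plan(hypothesis, env.state)      # the mapping Pi(T, s)
            history.append(env.step(a))
            t += 1
        # Interlude: run the reward estimation experiment pi_est.
        for a in pi_est(env.state):
            history.append(env.step(a))
            t += 1
    return history
```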
Proposition A.13
In the setting of Theorem 2, fix T∈N+ and γ∈(0,1).
Denote Rk:=RYk, R∗:=RY∗, ΔRk:=Rk−R∗, ΔYk:=Yk−Y∗, E∗k:=Y∗Π(Yk), Ek∗:=YkΠ(Y∗), E∗∗:=Y∗Π(Y∗), ΔNk:=Mk−Nk and ΔMk:=Nk+1−Mk. Then,
R(γ)=∞∑k=0E⎡⎣Mk−1∑n=Nkγn((1−γ)ΔRk(Θn)+γEΔYk[VYkRk(γ)∣∣Θn,An])⎤⎦ +∞∑k=0E⎡⎣γMkEEΔNkk∗−EΔNk∗k[VY∗R∗(γ)∣∣ΘNk]⎤⎦ +(1−γ)∞∑k=0E⎡⎣Nk+1−1∑n=MkγnEEn−Mk∗∗−En−Mk∗k[R∗∣∣ΘMk]⎤⎦ +∞∑k=0E⎡⎣γNk+1EEΔMk∗∗−EΔMk∗k[VY∗R∗(γ)∣∣ΘMk]⎤⎦
Proof of Proposition A.13
For any n∈N, define Πn:H×H×S∗×Sk→A as follows.
Πn(T1,T2,h,s):={Π(T1,s) if h<nΠ(T2,s) if h≥n
That is, Πn(T1,T2) is a policy that follows Π(T1) for time n and Π(T2) afterwards.
In the following, we use the shorthand notation
V∗(s):=VY∗R∗(s,γ)
Vk(s):=VYkRk(s,γ)
Vkl(s):=VY∗ΠΔNk−l(Yk,Y∗)R∗(s,γ)
It is easy to see that
R(γ)=∞∑k=0E[γNk(V∗(ΘNk)−Vk0(ΘNk))] +(1−γ)∞∑k=0E⎡⎣Nk+1−1∑n=MkγnEEn−Mk∗∗−En−Mk∗k[R∗∣∣ΘMk]⎤⎦ +∞∑k=0E⎡⎣γNk+1EEΔMk∗∗−EΔMk∗k[VY∗R∗(γ)∣∣ΘMk]⎤⎦
Here, the first term represents the regret incurred during the episodes, whereas the second and third terms represent the regret incurred during the interludes (the lost reward and lost value respectively). We will denote the first term R0(γ) in the following.
By definition, Yk and Y∗ have the same distribution even when conditioned by the history up to Nk. Therefore
E[V∗(ΘNk)]=E[Vk(ΘNk)]
It follows that
R0(γ)=∞∑k=0E[γNk(Vk(ΘNk)−Vk0(ΘNk))]
Denote Θkl:=ΘNk+l and ΔVkl:=Vk(Θkl)−Vkl(Θkl). We now prove by induction on l∈N that, with probability 1
l≤ΔNk⟹E[ΔVk0Nk,Mk]= E⎡⎣Nk+l−1∑n=Nkγn−Nk((1−γ)ΔRk(Θn)+γEΔYk[VkΘn,An])+γlΔVkl∣∣ ∣∣Nk,Mk⎤⎦
For l=0 this is a tautology. For any l∈N, the Bellman equation says that
Vk(s)=(1−γ)Rk(s)+γEYkΠ(Yk)[Vks]
l<ΔNk⟹Vkl(s)=(1−γ)R∗(s)+γEY∗Π(Yk)[Vk,l+1s]
Denote Ekk:=YkΠ(Yk). We will also use the notation Ek[X]:=E[XNk,Mk]. Substituting s=Θkl and subtracting the two identities, we get that, in the subspace of Ω defined by l<ΔNk
Ek[ΔVkl]=Ek[(1−γ)ΔR(Θkl)+γ(EEkk[VkΘkl]−EE∗k[Vk,l+1Θkl])]
Denote Akl:=ANk+l. Since Π(Yk) is exactly the policy followed by PSRL at time Nk+l, we get
Ek[ΔVkl]=Ek[(1−γ)ΔR(Θkl)+γ(EYk[VkΘkl,Akl]−EY∗[Vk,l+1Θkl,Akl])]
We now subtract and add γEY∗[VkΘkl,Akl], and use the fact that Y∗(Θkl,Akl) is the conditional distribution of Θk,l+1.
Ek[ΔVkl]=Ek[(1−γ)ΔR(Θkl)+γ(EΔYk[VkΘkl,Akl]+ΔVk,l+1)]
Applying this identity to the last term on the right hand side of the induction hypothesis, we prove the induction step. For l=ΔNk, we get
Ek[ΔVk0]=Mk∑n=Nkγn−NkEk[(1−γ)ΔRk(Θn)+γEΔYk[VkΘn,An]]+ γΔNkEk[Vk(ΘMk)−V∗(ΘMk)]
Clearly
Ek[Vk(ΘMk)]=Ek⎡⎣EEΔNk∗k[Vk∣∣ΘNk]⎤⎦
Ek[V∗(ΘMk)]=Ek⎡⎣EEΔNk∗k[V∗∣∣ΘNk]⎤⎦
Using the definition of PSRL, we can exchange the true and sampled hypotheses and get
Ek[Vk(ΘMk)]=Ek⎡⎣EEΔNkk∗[V∗∣∣ΘNk]⎤⎦
It follows that
Ek[ΔVk0]=Mk∑n=Nkγn−NkEk[(1−γ)ΔRk(Θn)+γEΔYk[VkΘn,An]]+ γΔNkEk⎡⎣EEΔNkk∗−EΔNk∗k[V∗∣∣ΘNk]⎤⎦
Applying this to each term in the earlier expression for R0(γ), we get the desired result. ■
Proposition A.14
Consider (Ω,P) a probability space, {Mk,Δk:Ω→N}k∈N random variables and T∈N+. Assume that Mk+1−Mk≥Δk and the Δk are all independent and uniformly distributed between 1 and 2T−1. Define the random variable Kn:Ω→N by
Kn:=min{k∈NMk≥n}
Then,
Pr[Kn>2nT+1]≤exp(−n4T)
Proof of Proposition A.14
Define M′k:=∑k−1l=0Δl. Obviously, M′k≤Mk. For any k∈N and m∈R≥0
Pr[Mk≤kT−m]≤Pr[M′k≤kT−m]
Apply Hoeffding's inequality to the RHS,
Pr[Mk≤kT−m]≤exp(−2m2k(2T−2)2)≤exp(−m22kT2)
Taking m:=12kT, we get
Pr[Mk≤kT2]≤exp⎛⎜ ⎜⎝−(12kT)22kT2⎞⎟ ⎟⎠=exp(−k8)
Moreover, by definition of Kn, we have, for any k,n∈N
Pr[Kn>k]=Pr[Mk<n]
It follows that
Pr[K⌈12kT⌉>k]=Pr[Mk<⌈kT2⌉]=Pr[Mk<kT2]≤exp(−k8)
Take k:=⌈2nT⌉. We get
Pr[Kn>2nT+1]≤Pr[K⌈12⌈2nT⌉T⌉>⌈2nT⌉]≤exp(−⌈2nT⌉8)≤exp(−n4T) ■
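The bound of Proposition A.14 is easy to probe empirically. Below is a minimal Monte Carlo sketch that takes Mk+1−Mk=Δk exactly — the extreme case permitted by the assumption Mk+1−Mk≥Δk — and checks that Kn exceeds 2n/T+1 no more often than the bound suggests (an illustration, not a proof):

```python
import math
import random

def simulate_K(n, T, trials=2000, seed=0):
    """Estimate Pr[K_n > 2n/T + 1] when the gaps Delta_k are i.i.d.
    uniform on {1, ..., 2T-1} and M_k is their running sum."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        M, k = 0, 0
        while M < n:
            M += rng.randint(1, 2 * T - 1)
            k += 1
        if k > 2 * n / T + 1:
            exceed += 1
    return exceed / trials

n, T = 200, 5
freq = simulate_K(n, T)
bound = math.exp(-n / (4 * T))
assert freq <= bound + 0.05  # Monte Carlo slack
```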
Proposition A.15
Let Ψ be a measurable space, (Ω,P) a probability space, {Hn⊆P(Ω)}n∈N a filtration, {Xn:Ω→Ψ}n∈N a stochastic process adapted to H, {Mk:Ω→N}k∈N stopping times and {fn:Ψn+1→[0,1]}n∈N measurable functions. Consider also γ∈(0,1) and C∈R+, and assume that with probability 1
∞∑k=0γkfk(XM0,XM1…XMk)≤C
In addition, consider some T∈N+, T≥2 and assume that for all k∈N
Pr[Mk=nMk−1]={(2T−1)−1 if Mk−1<n<Mk−1+2T0 otherwise
Here, M−1 is understood to identically equal −1.
Finally, assume that, conditional on Mk≥n, Mk and Xn are independent. Define the Hn-measurable random variable Kn:Ω→N by
Kn:=min{k∈NMk≥n}
Then,
∞∑n=0γ2nT+1E[fKn(XM0,XM1…XMKn−1,Xn)]≤2CT+11−exp(−14T)
Proof of Proposition A.15
We have, with probability 1
∞∑n=01MKn=nγKnfKn(XM0,XM1…XMKn−1,Xn)≤C
Taking expected value
∞∑n=0E[1MKn=nγKnfKn(XM0,XM1…XMKn−1,Xn)]≤C
Denote by K♯n the joint random variables Kn and M0,M1…MKn−1. Our assumptions imply that, conditional on K♯n, the factor 1MKn=n is independent of the rest of the expression inside the expected value. It follows that
∞∑n=0E[Pr[MKn=n∣∣K♯n]γKnE[fKn(XM0,XM1…XMKn−1,Xn)∣∣K♯n]]≤C
Clearly
Pr[MKn=n∣∣K♯n]=12T−n+MKn−1≥12T
We get
∞∑n=0E[12TγKnfKn(XM0,XM1…XMKn−1,Xn)]≤C
∞∑n=0E[γKnfKn(XM0,XM1…XMKn−1,Xn)]≤2CT
Using Proposition A.14, we get
∞∑n=0(E[γ2nT+1fKn(XM0,XM1…XMKn−1,Xn)]−exp(−n4T))≤2CT
∞∑n=0γ2nT+1E[fKn(XM0,XM1…XMKn−1,Xn)]≤2CT+∞∑n=0exp(−n4T)
∞∑n=0γ2nT+1E[fKn(XM0,XM1…XMKn−1,Xn)]≤2CT+11−exp(−14T) ■
Proposition A.16
Consider (Ω,P) a probability space, T∈N+ and {Nk,Δk:Ω→N}k∈N random variables s.t. Nk+1−Nk≥Δk and the Δk are all independent and uniformly distributed between 1 and 2T−1. Consider also γ∈(0,1). Then, for any k∈N
E[γNk]≤(1−Tγ2T−2(1−γ))k
In particular,
∞∑k=0E[γNk]≤1Tγ2T−2(1−γ)
Proof of Proposition A.16
For any n∈N, we have
1−γn1−γ=n−1∑m=0γm
1−γn=(1−γ)n−1∑m=0γm≥n(1−γ)γn−1
γn≤1−n(1−γ)γn−1
Therefore, since Δ0 is uniformly distributed between 1 and 2T−1,
E[γΔ0]=12T−12T−1∑n=1γn≤12T−12T−1∑n=1(1−n(1−γ)γn−1)
E[γΔ0]≤12T−12T−1∑n=1(1−n(1−γ)γ2T−2)=1−T(1−γ)γ2T−2
Also, for any k∈N
Nk≥k−1∑l=0Δl
γNk≤γ∑k−1l=0Δl=k−1∏l=0γΔl
E[γNk]≤E[k−1∏l=0γΔl]=k−1∏l=0E[γΔl]=E[γΔ0]k
Here, we used that the Δl are independent and equally distributed. Substituting the expression for E[Δ0], we get the desired result. ■
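The key step in the proof of Proposition A.16, E[γΔ0]≤1−T(1−γ)γ2T−2 for Δ0 uniform on {1,…,2T−1}, can be checked numerically over a grid of parameters (a sanity check of the derivation, nothing more):

```python
# For Delta uniform on {1, ..., 2T-1}, compare the exact value of
# E[gamma^Delta] against the bound 1 - T(1-gamma)gamma^(2T-2).
for T in (1, 2, 5, 10):
    for gamma in (0.5, 0.9, 0.99, 0.999):
        expectation = sum(gamma**n for n in range(1, 2 * T)) / (2 * T - 1)
        bound = 1 - T * (1 - gamma) * gamma**(2 * T - 2)
        assert expectation <= bound + 1e-12, (T, gamma)
```

For T=1 the two sides coincide exactly (both equal γ), so the bound is tight at that end.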
Proposition A.17
In the setting of Proposition A.16, consider some δ∈R+. Denote
α:=1−Tγ2T−2(1−γ)
Assume 0">α>0. Then,
∞∑k=0min(E[γNk],δ)≤(⌈lnδlnα⌉+11−α)δ
Proof of Proposition A.17
By Proposition A.16
∞∑k=0min(E[γNk],δ)≤∞∑k=0min(αk,δ)
∞∑k=0min(E[γNk],δ)≤⌈lnδlnα⌉−1∑k=0min(αk,δ)+∞∑k=⌈lnδlnα⌉min(αk,δ)
In the second term, we have
αk≤α⌈lnδlnα⌉≤αlnδlnα=elnαlnδlnα=elnδ=δ
We get
∞∑k=0min(E[γNk],δ)≤⌈lnδlnα⌉−1∑k=0δ+∞∑k=⌈lnδlnα⌉αk=⌈lnδlnα⌉δ+α⌈lnδlnα⌉1−α
We've already seen that α⌈lnδlnα⌉≤δ, and therefore
∞∑k=0min(E[γNk],δ)≤⌈lnδlnα⌉δ+δ1−α=(⌈lnδlnα⌉+11−α)δ ■
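The resulting summation bound of Proposition A.17 can be verified directly by truncating the (rapidly decaying) sum — an illustrative check under the stated assumption α∈(0,1):

```python
import math

# Check: sum_k min(alpha^k, delta) <= (ceil(ln delta / ln alpha) + 1/(1-alpha)) * delta
for alpha in (0.5, 0.9, 0.99):
    for delta in (0.3, 0.01, 1e-4):
        total = sum(min(alpha**k, delta) for k in range(10000))
        bound = (math.ceil(math.log(delta) / math.log(alpha)) + 1 / (1 - alpha)) * delta
        assert total <= bound + 1e-9, (alpha, delta)
```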
We will use the notations of Definitions B.1 and B.2 (see Appendix).
The following is a simple generalization of "Lemma 1" in Osband and Van Roy 2014 and the proof is essentially the same.
Proposition A.18
Consider a set X, a finite-dimensional inner product space Y and some F⊆{X→Y}. Consider also some x∈Xω, y∈Yω, T∈N, θ∈R+, an increasing sequence {Nk∈N}k∈N and a nondecreasing sequence {βk∈R+}k∈N. Suppose that for any k∈N, Nk+1≤Nk+T. For any k∈N, define Fk by
Fk:=CSF[xy:Nk,β−1k]
Denote D:=dimRVOF. Then, for any K∈N
∣∣{(k∈[K],n∈N)∣∣Nk≤n<Nk+1, WFk(xn)>θ}∣∣≤(D+1)(4βK−1θ2+T)
Here, β−1 is understood to mean 0.
Proof of Proposition A.18
Let A⊂N×N be the set
A:={(k∈[K],n∈N)∣∣Nk≤n<Nk+1, WFk(xn)>θ}
Let A′⊂N be the set
A′:={n∈N∃k∈[K]:(k,n)∈A}
For each i∈[A], we define (ki,ni)∈A recursively by
n0:=minA′ ni+1:=min{n∈A′∣∣n>ni}
Denote D:=dimRVOF and let L:=⌊AD+1⌋. Given any n∈N∗, the notation xn will refer to the element of X∗ s.t. xn=n and for any i∈[n], (xn)i:=xni. We define j∈[A] and {mli∈N∗}l∈[L],i∈[j+1] by recursion over i.
For i=0, we set ml0:=λ. For any i, we consider whether there is l∈[L] s.t. xni is (F,θ)-independent of xmli. If there is such l, we set ml,i+1:=mlini and for any l′∈[L]∖l, ml′,i+1:=mli. If there is no such l, we set j:=i. The latter situation must occur for some i∈[A], since otherwise we would get
∑l∈[L]∣∣mlA∣∣=A
That would imply that there is l∈[L] s.t.
∣∣mlA∣∣≥AL=A⌊AD+1⌋≥A(AD+1)=D+1
This is impossible since, by construction, each element of the sequence xmlA is (F,θ)-independent of the preceding subsequence, whereas, by definition of D, that is the maximal length of such a sequence.
Since (kj,nj)∈A, WFkj(xnj)>θ. Therefore, there are f,~f∈Fkj s.t.
∥∥f(xnj)−~f(xnj)∥∥>θ
By construction of j and m, for each l∈[L], xnj is (F,θ)-dependent on xmlj. Therefore,
∣∣mlj∣∣−1∑i=0∥∥f(xmlji)−~f(xmlji)∥∥2>θ2
Define J⊆[L] by
J:={l∈[L]∣∣∀i∈[∣∣mlj∣∣]:mlji<Nkj}
∑l∈J∣∣mlj∣∣−1∑i=0∥∥f(xmlji)−~f(xmlji)∥∥2≥Jθ2
By construction, the sequences mlj for all values of l together are a collection of distinct elements of [nj]. Therefore, J≥L−(nj−Nkj)≥L−T+1. It follows that
Nkj−1∑n=0∥∥f(xn)−~f(xn)∥∥2≥(L−T+1)θ2
⎷Nkj−1∑n=0∥∥f(xn)−~f(xn)∥∥2≥θ√max(L−T+1,0)
Denote f∗:=LSF[xy:Nkj].
⎷Nkj−1∑n=0∥f(xn)−f∗(xn)∥2+ ⎷Nkj−1∑n=0∥∥~f(xn)−f∗(xn)∥∥2≥θ√max(L−T+1,0)
Since f,~f∈Fkj, by the definition of Fkj each of the two terms on the LHS is at most √βkj≤√βK−1 and we get
2√βK−1≥θ√max(L−T+1,0)
4βK−1≥θ2(L+1−T)
4βK−1≥θ2(AD+1−T)
A≤(D+1)(4βK−1θ2+T) ■
Proposition A.19
Consider α∈N and f:R+→R+ nondecreasing. Define g:R→R by g(x):=lnf(ex), and assume g is Lipschitz continuous with constant α. In other words, for any t∈R+ and λ∈[1,∞), we have f(λt)≤λαf(t). Let L denote the Laplace transform operator. Then,
L[f](s)≤1−e−1+α!sf(1s)
Proof of Proposition A.19
L[f](s)=∫∞0e−stf(t)dt=∫1s0e−stf(t)dt+∫∞1se−stf(t)dt
For the first term, we will use that f is nondecreasing, and for the second term, we will use that g is Lipschitz continuous with constant α.
L[f](s)≤∫1s0e−stf(1s)dt+∫∞1se−stf(1s)⋅(ts)αdt
L[f](s)≤(∫1s0e−stdt+sα∫∞1se−sttαdt)f(1s)
L[f](s)≤(∫1s0e−stdt+sα∫∞0e−sttαdt)f(1s)
L[f](s)≤(1−e−1s+sα⋅α!s−α−1)f(1s)=1−e−1+α!sf(1s) ■
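As a quick sanity check of Proposition A.19, take f(t)=tα, which satisfies the Lipschitz condition with equality; its Laplace transform is exactly α!/sα+1, which sits comfortably below the bound (1−e−1+α!)/s·f(1/s):

```python
import math

# For f(t) = t^alpha: L[f](s) = alpha!/s^(alpha+1), while the bound of
# Proposition A.19 evaluates to (1 - e^{-1} + alpha!)/s * (1/s)^alpha.
for alpha in (1, 2, 3):
    for s in (0.1, 1.0, 10.0):
        laplace = math.factorial(alpha) / s ** (alpha + 1)
        bound = (1 - math.exp(-1) + math.factorial(alpha)) / s * (1 / s) ** alpha
        assert laplace <= bound, (alpha, s)
```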
Proposition A.20
There is some CA.20∈R+ s.t. the following holds.
Consider a set X, an inner product space Y and some F⊆{X→Y}. Consider also some x∈Xω, y∈Yω, T∈N+, γ∈(0,1), θ∈R+, η0,η1∈R+, δ∈(0,1) and an increasing sequence {Nk∈N}k∈N. Suppose that N0=0, for any k∈N, Nk+1≤Nk+T, and γT>1/2. Denote
β(t):=η0+η1tlnetδ
For any k∈N, define Fk by
Fk:=CSF[xy:Nk,β(Nk+1)−1]
Denote D:=dimRVOF. Then,
∞∑k=0Nk+1−1∑n=NkγkT1WFk(xn)>θ≤CA.20D(1θ2β(11−γ)+T)
Proof of Proposition A.20
By Proposition A.18, for any K∈N
K−1∑k=0Nk+1−1∑n=Nk1WFk(xn)>θ≤(D+1)(4β(NK−1+1)θ2+T)
Here, N−1 is understood to mean 0. Observing that Nk≤kT, we get
K−1∑k=0Nk+1−1∑n=Nk1WFk(xn)>θ≤(D+1)(4β(KT+1)θ2+T)
Multiplying the inequality by γKT and summing over K, we get
∞∑k=0(∞∑K=k+1γKT)Nk+1−1∑n=Nk1WFk(xn)>θ≤(D+1)∞∑K=0γKT(4β(KT+1)θ2+T)
On the left hand side, we sum the geometric series. On the right hand side, we use the observation that
γKT(4β(KT+1)θ2+T)≤1T∫T0γKT+t−T(4β(KT+t+1)θ2+T)dt
γKT(4β(KT+1)θ2+T)≤1TγT∫T0γKT+t(4β(KT+t+1)θ2+T)dt
Here, we used that β(t) is an increasing function for t≥1. We get
∞∑k=0γ(k+1)T1−γTNk+1−1∑n=Nk1WFk(xn)>θ≤D+1TγT∫∞0γt(4β(t+1)θ2+T)dt
γT1−γT∞∑k=0γkTNk+1−1∑n=Nk1WFk(xn)>θ≤D+1TγTLt[4β(t+1)θ2+T](ln1γ)
It is easy to see that Proposition A.19 is applicable to the RHS for α=2. We get,
γT1−γT∞∑k=0γkTNk+1−1∑n=Nk1WFk(xn)>θ≤D+1TγT⋅3−e−1ln1γ(4β(1ln1γ+1)θ2+T)
Using the condition γT>1/2 and absorbing O(1) factors into the definition of CA.20, we get
∞∑k=0γkTNk+1−1∑n=Nk1WFk(xn)>θ≤CA.20D(1θ2β(11−γ)+T) ■
Proposition A.21
There is some CA.21∈R+ s.t. the following holds. In the setting of Proposition A.20, assume that for any x∈X and f∈F, ∥f(x)∥≤1. Then,
∞∑k=0Nk+1−1∑n=NkγkTWFk(xn)≤CA.21(DT+√Dβ(11−γ)11−γ)
Proof of Proposition A.21
Due to the assumption ∥f(x)∥≤1, we have WFk(x)≤2. For any t∈R+, we have
WFk(xn)=∫201WFk(xn)>θdθ≤t+∫2t1WFk(xn)>θdθ
∞∑k=0Nk+1−1∑n=NkγkTWFk(xn)≤tT1−γT+∫2t∞∑k=0Nk+1−1∑n=NkγkT1WFk(xn)>θdθ
Applying Proposition A.20 to the integrand on the RHS
∞∑k=0Nk+1−1∑n=NkγkTWFk(xn)≤tT1−γT+CA.20D∫2t(1θ2β(11−γ)+T)dθ
Evaluating the integral and dropping some negative terms on the right hand side, we get
∞∑k=0Nk+1−1∑n=NkγkTWFk(xn)≤tT1−γT+CA.20D(1tβ(11−γ)+2T)
We now set t to be
t:=√Dβ(11−γ)⋅(1−γ)
For an appropriate choice of CA.21, and using the assumption γT>1/2, it follows that
∞∑k=0Nk+1−1∑n=NkγkTWFk(xn)≤CA.21(DT+√Dβ(11−γ)11−γ) ■
Proof of Theorem 2
We take π†γ:=πPSγ,T where T∈N+ will be specified later. By Proposition A.13
R(γ)≤∞∑k=0E⎡⎣Mk−1∑n=Nkγn((1−γ)ΔRk(Θn)+γ∣∣∣EΔYk[VYkRk(γ)∣∣Θn,An]∣∣∣)⎤⎦ +∞∑k=0E⎡⎣γMk∣∣ ∣∣EEΔNkk∗−EΔNk∗k[VY∗R∗(γ)∣∣ΘNk]∣∣ ∣∣⎤⎦ +(1−γ)∞∑k=0E⎡⎣Nk+1−1∑n=Mkγn∣∣ ∣∣EEn−Mk∗∗−En−Mk∗k[R∗∣∣ΘMk]∣∣ ∣∣⎤⎦ +∞∑k=0E⎡⎣γNk+1∣∣ ∣∣EEΔMk∗∗−EΔMk∗k[VY∗R∗(γ)∣∣ΘMk]∣∣ ∣∣⎤⎦
We will use the notation
ΔV(γ):=maxT∈H(maxs∈SVTRT(s,γ)−mins∈SVTRT(s,γ))
We will also use the notation dtv(μ−ν):=dtv(μ,ν).
It follows that
R(γ)≤∞∑k=0E⎡⎣Mk−1∑n=Nkγn((1−γ)ΔRk(Θn)+γΔV(γ)dtv(ΔYk(Θn,An)))⎤⎦ +ΔV(γ)∞∑k=0E[γMk]+(1−γ)∞∑k=0E⎡⎣Nk+1−1∑n=Mkγn⎤⎦+ΔV(γ)∞∑k=0E[γNk+1]
Denote R0(γ) the first term on the RHS and R1(γ) the sum of the other terms. We have
R(γ)≤R0(γ)+R1(γ)
First, we analyze R1. In the second term, we have γn≤γMk, leading to
E⎡⎣Nk+1−1∑n=Mkγn⎤⎦≤E[ΔMkγMk]=E[E[ΔMkMk]γMk]≤testE[γMk]
We get
R1(γ)≤ΔV(γ)∞∑k=0E[γMk]+test(1−γ)∞∑k=0E[γMk]+ΔV(γ)∞∑k=0E[γNk+1]
Applying Proposition A.16 to each term, we get
R1(γ)≤2ΔV(γ)+test(1−γ)γ2T−2(1−γ)T
Now, we analyze R0. Define ΘrM∈(S×R)ω by
ΘrMk:=(ΘMk,r(ΘAMk:Nk+1ΘNk+1))
That is, ΘrM is the history of reward estimation experiments. For any i∈N, let Li be the stopping time defined recursively by
L0:=0 Li+1:=min{l∈N∣∣l>Li, ∃k∈N:Nk≤l<Mk}
That is, the Li are time indices that traverse the "interior" of the episodes only. Define ΘAL∈(S×A)ω by
ΘALi:=ΘLiALi
We apply Proposition B.1 (see Appendix) with δ:=12(1−γ)2 and ϵ:=(1−γ)2, to each of the two terms in R0(γ):
Pr[R∗∈∞⋂k=0CSHR[ΘrM:k,βR(k+1)−1]]≥1−12(1−γ)2 Pr[Y∗∈∞⋂i=0CSH[ΘAL:iΘLi,βT(i+1)−1]]≥1−12(1−γ)2
Here, βR and βT correspond to β in Proposition B.1.
We also define N′k by the condition LN′k=Nk. Since the hypothesis Yk is sampled from the posterior, for any k∈N we also have
Pr[Rk∈CSHR[ΘrM:k,βR(k+1)−1]]≥1−12(1−γ)2 Pr[Yk∈CSH[ΘAL:N′kΘNk,βT(N′k+1)−1]]≥1−12(1−γ)2
Pr[R∗,Rk∈CSHR[ΘrM:k,βR(k+1)−1]]≥1−(1−γ)2 Pr[Y∗,Yk∈CSH[ΘAL:N′kΘNk,βT(N′k+1)−1]]≥1−(1−γ)2
Denote
GkR:={R∗,Rk∈CSHR[ΘrM:k,βR(k+1)−1]}⊆Ω GkT:={Y∗,Yk∈CSH[ΘAL:N′kΘNk,βT(N′k+1)−1]}⊆Ω
Define RR(γ) and RT(γ) by
RR(γ):=(1−γ)∞∑k=0E⎡⎣Mk−1∑n=NkγnΔRk(Θn)⎤⎦ RT(γ):=ΔV(γ)∞∑k=0E⎡⎣Mk−1∑n=Nkγn+1dtv(ΔYk(Θn,An))⎤⎦
We have
R0(γ)=RR(γ)+RT(γ)
We split RR into the GkR and Gk∁R contributions
RR(γ)=(1−γ)∞∑k=0(E⎡⎣Mk−1∑n=NkγnΔRk(Θn);GkR⎤⎦+ E⎡⎣Mk−1∑n=NkγnΔRk(Θn);Gk∁R⎤⎦)
For the second term, we have
$$\sum_{n=N_k}^{M_k-1}\gamma^n\,\Delta R_k(\Theta_n) \le 2T\,\gamma^{N_k}$$
E⎡⎣Mk−1∑n=NkγnΔRk(Θn);Gk∁R⎤⎦≤2TE[γNk;Gk∁R]≤2Tmin(E[γNk],Pr[Gk∁R])
E⎡⎣Mk−1∑n=NkγnΔRk(Θn);Gk∁R⎤⎦≤2Tmin(E[γNk],(1−γ)2)
∞∑k=0E⎡⎣Mk−1∑n=NkγnΔRk(Θn);Gk∁R⎤⎦≤2T∞∑k=0min(E[γNk],(1−γ)2)
Applying Proposition A.17 to the RHS, we get
∞∑k=0E⎡⎣Mk−1∑n=NkγnΔRk(Θn);Gk∁R⎤⎦≤2T(⌈2ln(1−γ)lnα⌉+1Tγ2T−2(1−γ))(1−γ)2
Here, α is as defined in Proposition A.17. Since we are interested in the asymptotics γ→1, and our ultimate choice of T will ensure that γ^T→1, we will henceforth assume that $\gamma^{2T} > \frac{1}{2}$.
∞∑k=0E⎡⎣Mk−1∑n=NkγnΔRk(Θn);Gk∁R⎤⎦≤2T⎡⎢ ⎢ ⎢ ⎢⎢2ln(1−γ)ln(1−12T(1−γ))⎤⎥ ⎥ ⎥ ⎥⎥(1−γ)2+4(1−γ)
Denote ρ(γ,T) the expression on the RHS. We get
RR(γ)≤(1−γ)⎛⎝∞∑k=0E⎡⎣Mk−1∑n=NkγnΔRk(Θn);GkR⎤⎦+ρ(γ,T)⎞⎠
Denote
HkR:=CSHR[ΘrM:k,βR(k+1)−1]
Clearly
Pr[ΔRk(Θn)≤WHkR(Θn)∣∣GkR]=1
Using this inequality, dropping the ;GkR (since it can only make the right hand side smaller) and moving the sum inside the expected value, we get
RR(γ)≤(1−γ)⎛⎝E⎡⎣∞∑k=0Mk−1∑n=NkγnWHkR(Θn)⎤⎦+ρ(γ,T)⎞⎠
Extending the sum on the RHS to Mk (that can only increase it), and using the fact that the width is ≤1,
RR(γ)≤(1−γ)⎛⎝E⎡⎣∞∑k=0⎛⎝γNk+Mk∑n=Nk+1γnWHkR(Θn)⎞⎠⎤⎦+ρ(γ,T)⎞⎠
By Proposition A.16,
RR(γ)≤(1−γ)⎛⎝E⎡⎣∞∑k=0Mk∑n=Nk+1γnWHkR(Θn)⎤⎦+ρ(γ,T)⎞⎠+2T
We define the random variable Ki:Ω→N by
$$K_i := \min\left\{k \in \mathbb{N} \,\middle|\, M_k \ge L_i\right\}$$
We can now rewrite the previous inequality as
RR(γ)≤(1−γ)(E[∞∑i=0γLi+1WHKiR(ΘLi+1)]+ρ(γ,T))+2T
Obviously Li+1≥i and hence
RR(γ)≤(1−γ)(∞∑i=0γiE[WHKiR(ΘLi+1)]+ρ(γ,T))+2T
On the other hand, by Proposition A.21 (for the T=1, Nk=k case, applied to the subsequence ΘMk, with γ12T playing the role of γ)
∞∑k=0γ12TkWHkR(ΘMk)≤CA.21⎛⎜ ⎜⎝DRVO+ ⎷DRVOβR⎛⎝11−γ12T⎞⎠11−γ12T⎞⎟ ⎟⎠
Denote ~βR(γ,T,DRVO) the expression on the RHS. Applying Proposition A.15, we conclude
∞∑i=0γi+12TE[WHKiR(ΘLi+1)]≤2T~βR(γ,T,DRVO)+11−exp(−14T)
It follows
RR(γ)≤(1−γ)⎛⎜ ⎜⎝γ−12T⎛⎜ ⎜⎝2T~βR(γ,T,DRVO)+11−exp(−14T)⎞⎟ ⎟⎠+ρ(γ,T)⎞⎟ ⎟⎠+2T
Now, we analyze RT. By the same reasoning as for RR, we have
RT(γ)≤ΔV(γ)⎛⎝E⎡⎣Mk−1∑n=Nkγndtv(ΔYk(Θn,An));GkT⎤⎦+ρ(γ,T)⎞⎠
Denote
HkT:=CSH[ΘAL:N′kΘNk,βT(N′k+1)−1]
It is easy to see that for Nk≤n<Mk
Pr[dtv(ΔYk(Θn,An))≤S2WHkT(ΘnAn)∣∣∣GkT]=1
It follows
RT(γ)≤ΔV(γ)⎛⎝S2E⎡⎣∞∑k=0Mk−1∑n=NkγnWHkT(ΘnAn)⎤⎦+ρ(γ,T)⎞⎠
RT(γ)≤ΔV(γ)(S2∞∑i=0E[γLiWHKiT(ΘLiALi)]+ρ(γ,T))
Since Li≥i, by Proposition A.14
∞∑i=0E[γLiWHKiT(ΘLiALi)]≤∞∑i=0(E[γ12(Ki−1)TWHKiT(ΘLiALi)]+exp(−i4T))
∞∑i=0E[γLiWHKiT(ΘLiALi)]≤γ−12T∞∑i=0E[γ12KiTWHKiT(ΘLiALi)]+11−exp(−14T)
Denote
~ρ(γ,T,S):=ρ(γ,T)+S2⋅11−exp(−14T)
We get, using that γ−12T<2
RT(γ)≤ΔV(γ)(S∞∑i=0E[γ12KiTWHKiT(ΘLiALi)]+~ρ(γ,T,S))
On the other hand, dimRVOH≤SA and by Proposition A.21
∞∑i=0γ12KiTWHKiT(ΘLiALi)≤CA.21⎛⎜ ⎜⎝SAT+ ⎷SAβT⎛⎝11−γ12⎞⎠11−γ12⎞⎟ ⎟⎠
Denote ~βT(γ,T,SA) the expression on the RHS. It follows that
RT(γ)≤ΔV(γ)(S~βT(γ,T,SA)+~ρ(γ,T))
We set
$$T := \sqrt[3]{\frac{(\tau + t_{\mathrm{est}})^2}{(D_{\mathrm{MB}}+1)\,D_{\mathrm{RVO}}\,(1-\gamma)\ln\frac{1}{1-\gamma}}}$$
Now, we analyze the γ→1 limit. In this limit, the expression for T justifies our assumption that $\gamma^{2T} > \frac{1}{2}$. Indeed, we have
$$\lim_{\gamma\to1}\ln\gamma^T = \lim_{\gamma\to1}T\ln\gamma = \lim_{\gamma\to1}\sqrt[3]{\frac{(\tau+t_{\mathrm{est}})^2}{(D_{\mathrm{MB}}+1)D_{\mathrm{RVO}}(1-\gamma)\ln\frac{1}{1-\gamma}}}\cdot\ln\gamma = 0$$
We now analyze the separate contributions of RR, RT and R1 to the limit of interest. We will use the notation x≲y to mean that there is some constant C0∈R+, depending on none of the parameters of the problem, such that x≤C0y.
Our previous bound on RR can be written as
RR(γ)≲DRVO(1−γ)T+ ⎷DRVOlnN(HR,(1−γ)−4)⋅11−γ12T⋅(1−γ)T+ ⎷DRVO11−γ12Tln11−γ⋅(1−γ)T+ 1−γ1−exp(−14T)+ T⎡⎢ ⎢ ⎢ ⎢⎢ln(1−γ)ln(1−12T(1−γ))⎤⎥ ⎥ ⎥ ⎥⎥(1−γ)3+ (1−γ)2+ 1T
Here we used that 1≲γT≤1, substituted the expressions for ~βR, ~β and ρ, and dropped some dominated terms.
We analyze the separate contribution of each term.
limsupγ→1DRVO(1−γ)T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1√DRVOlnN(HR,(1−γ)−4)⋅11−γ12T⋅(1−γ)T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1√DRVODMBln11−γ⋅1(1−γ)T⋅(1−γ)T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1√DRVODMB(1−γ)ln11−γ⋅T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲1
limsupγ→1√DRVO11−γ12Tln11−γ⋅(1−γ)T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1√DRVO1(1−γ)Tln11−γ⋅(1−γ)T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ= limsupγ→1√DRVO(1−γ)ln11−γ⋅T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=13√DMB+1
limsupγ→1(1−γ1−exp(−14T))3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1(1−γ)T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1(T⌈ln(1−γ)ln(1−12T(1−γ))⌉(1−γ)3)3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1ln11−γ(1−γ)23√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1(1−γ)23√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→11T3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=1τ+test
Our previous bound on RT can be written as
RT(γ)≲ΔV(γ)S2AT +ΔV(γ)S ⎷SAlnN(H,(1−γ)−4)(1−γ)2⋅11−γ12 +ΔV(γ)S ⎷SA(1−γ)211−γ12ln(11−γ12)(1−γ)2⋅11−γ12 +ΔV(γ)T⎡⎢ ⎢ ⎢ ⎢⎢ln(1−γ)ln(1−12T(1−γ))⎤⎥ ⎥ ⎥ ⎥⎥(1−γ)2 +ΔV(γ)⋅(1−γ) +ΔV(γ)S2⋅11−exp(−14T)
We analyze the contribution of each term.
limsupγ→1ΔV(γ)S2AT3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1τ(1−γ)S2AT3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1ΔV(γ)S√SAlnN(H,(1−γ)−4)(1−γ)2⋅11−γ123√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1τ(1−γ)S√SA(dimMBH+1)ln11−γ⋅11−γ3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1ΔV(γ)S ⎷SA(1−γ)211−γ12ln⎛⎜ ⎜⎝11−γ12⎞⎟ ⎟⎠(1−γ)2⋅11−γ123√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1ΔV(γ)T⌈ln(1−γ)ln(1−12T(1−γ))⌉(1−γ)23√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1τ(1−γ)2ln11−γ3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1ΔV(γ)⋅(1−γ)3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
limsupγ→1ΔV(γ)S2⋅11−exp(−14T)3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1τ(1−γ)ST3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=0
To complete the proof, we need to analyze the contribution of R1. We have
R1(γ)≲1T(ΔV(γ)(1−γ)+test)
limsupγ→11T(ΔV(γ)(1−γ)+test)3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ≲ limsupγ→1τ+testT3√(τ+test)(DMB+1)DRVO(1−γ)ln11−γ=1
Putting everything together, we conclude
$$\limsup_{\gamma\to1}\frac{R(\gamma)}{\sqrt[3]{(\tau+t_{\mathrm{est}})(D_{\mathrm{MB}}+1)D_{\mathrm{RVO}}(1-\gamma)\ln\frac{1}{1-\gamma}}} \lesssim 1\ \blacksquare$$
Appendix

The following is a special case of what appeared in the previous essay as "Definition 1", introduced here for the sake of simplifying notations.
Definition B.1
Consider a set $X$, an inner product space $Y$ and some $\mathcal{F}\subseteq\{X\to Y\}$. Consider also $\theta\in\mathbb{R}_+$, $n\in\mathbb{N}$, a sequence $\{x_k\in X\}_{k\in[n]}$ and $x_*\in X$. We say that $x_*$ is $(\mathcal{F},\theta)$-dependent on $\{x_k\}$ when, for any $f,\tilde{f}\in\mathcal{F}$
$$\sum_{k=0}^{n-1}\left\|f(x_k)-\tilde{f}(x_k)\right\|^2 \le \theta^2 \implies \left\|f(x_*)-\tilde{f}(x_*)\right\| \le \theta$$
Otherwise, $x_*$ is said to be $(\mathcal{F},\theta)$-independent of $\{x_k\}$.
The following is a special case of what appeared in the previous essay as "Definition A.1".
Definition B.2
Consider a set $X$, a finite-dimensional inner product space $Y$ and some $\mathcal{F}\subseteq\{X\to Y\}$. Assume $\mathcal{F}$ is compact w.r.t. the product topology on $X\to Y\cong\prod_{x\in X}Y$. Consider also some $n\in\mathbb{N}$, $x\in X^n$, $y\in Y^n$ and $\lambda\in\mathbb{R}_+$. We then use the notation
$$\mathrm{LS}_{\mathcal{F}}[xy] := \operatorname*{arg\,min}_{f\in\mathcal{F}}\sum_{m=0}^{n-1}\left\|f(x_m)-y_m\right\|^2$$
$$\mathrm{CS}_{\mathcal{F}}[xy,\lambda] := \left\{f\in\mathcal{F}\ \middle|\ \sum_{m=0}^{n-1}\left\|f(x_m)-\mathrm{LS}_{\mathcal{F}}[xy](x_m)\right\|^2 \le \frac{1}{\lambda}\right\}$$
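As an illustrative aside, for a finite class of scalar functions both operators can be computed directly. The sketch below is ours (the definition itself uses an argmin over a compact class); it reads the threshold in the definition of CS as 1/λ, which matches how the confidence sets are invoked later with λ = β(n+1)⁻¹:

```python
def least_squares(F, xs, ys):
    """LS_F[xy]: the function in the finite class F minimizing squared loss on the data."""
    return min(F, key=lambda f: sum((f(x) - y) ** 2 for x, y in zip(xs, ys)))

def confidence_set(F, xs, ys, lam):
    """CS_F[xy, lam]: functions whose squared distance on the observed inputs
    to the least-squares fit is at most 1/lam."""
    best = least_squares(F, xs, ys)
    return [f for f in F if sum((f(x) - best(x)) ** 2 for x in xs) <= 1.0 / lam]

# Toy class of scalar functions; the data is generated by the identity function.
F = [lambda x: 0.0, lambda x: float(x), lambda x: 2.0 * x]
xs, ys = [1.0, 2.0], [1.0, 2.0]
cs = confidence_set(F, xs, ys, lam=1.0)  # threshold 1/lam = 1
```

With this data only the identity function survives: the other two candidates are at squared distance 5 from the least-squares fit, above the threshold.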
Proposition B.1
There is some CB.1∈R+ s.t. the following holds.
Consider a finite set $X$, a finite-dimensional inner product space $Y$ and some $\mathcal{F}\subseteq\{X\to Y\}$. Let $\{\mathcal{H}_n\subseteq\mathcal{P}(X^\omega\times Y^\omega)\}_{n\in\mathbb{N}}$ be the canonical filtration, i.e.
$$\mathcal{H}_n := \left\{A'\subseteq X^\omega\times Y^\omega\ \middle|\ A' = \{xy \mid xy_{:n}\in A\},\ A\subseteq X^n\times Y^n\ \text{Borel}\right\}$$
Consider also $f_*\in\mathcal{F}$ and $\mu\in\Delta(X^\omega\times Y^\omega)$ s.t. for any $n\in\mathbb{N}$ and $x\in X$
$$\mathbb{E}_{xy\sim\mu}\left[y_n\ \middle|\ x_n = x,\ \mathcal{H}_n\right] = f_*(x)$$
Assume that $\|y_n\|\le1$ with $\mu$-probability 1 for all $n\in\mathbb{N}$. Fix $\epsilon\in\mathbb{R}_+$, $\delta\in(0,1)$. Define $\beta:\mathbb{R}_+\to\mathbb{R}$ by
$$\beta(t) := C_{B.1}\left(\sigma^2\ln\frac{\mathcal{N}(\mathcal{F},\epsilon^{-2}\mathrm{Id})}{\delta} + \epsilon t\ln\frac{et}{\delta}\right)$$
Then,
$$\Pr_{xy\sim\mu}\left[f_*\notin\bigcap_{n=0}^{\infty}\mathrm{CS}_{\mathcal{F}}\left[xy_{:n},\beta(n+1)^{-1}\right]\right] \le \delta$$
Proof of Proposition B.1
Straightforward from "Proposition B.1" and "Proposition A.3" in the previous essay. ■
Alignment Newsletter #52
Why we may not want our AI systems to model humans
Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.
Highlights

Thoughts on Human Models (Ramana Kumar and Scott Garrabrant): Many approaches to AI safety involve modeling humans in some way, for example in order to correctly interpret their feedback. However, there are significant disadvantages to human modeling. First and most importantly, if we have AI systems do useful things without modeling humans, then we can use human approval as a "test set": we can check whether the AI's behavior is something we approve of, and this is an independent evaluation of the AI system. However, if the AI system had a human model, then it may have optimized its behavior for human approval, and so we cannot use approval as a "test set". Second, if our AI system has a catastrophic bug, it seems better if it doesn't have any human models. An AI system without human models will at worst optimize for some unrelated goal like paperclips, which at worst leads to it treating humans as obstacles and causing extinction. However, an AI system with human models with a catastrophic bug might optimize for human suffering, or having humans respond to email all day, etc. Thirdly, an AI system with human models might be simulating conscious beings that can suffer. Fourthly, since humans are agent-like, an AI system that models humans is likely to produce a subsystem that is agent-like and so dangerous.
The authors then discuss why it might be hard to avoid human models. Most notably, it is hard to see how to use a powerful AI system that avoids human models to produce a better future. In particular, human models could be particularly useful for interpreting specifications (in order to do what humans mean, as opposed to what we literally say) and for achieving performance given a specification (e.g. if we want to replicate aspects of human cognition). Another issue is that it is hard to avoid human modeling, since even "independent" tasks have some amount of information about human motivations in selecting that task.
Nevertheless, the authors would like to see more work on engineering-focused approaches to AI safety without human models, especially since this area is neglected, with very little such work currently. While MIRI does work on AI safety without human models, this is from a very theoretical perspective. In addition to technical work, we could also promote certain types of AI research that are less likely to develop human models "by default" (e.g. training AI systems in procedurally generated simulations, rather than on human-generated text and images).
Rohin's opinion: While I don't disagree with the reasoning, I disagree with the main thrust of this post. I wrote a long comment about it; the TL;DR is that since humans want very specific behavior out of AI systems, the AI system needs to get a lot of information from humans about what it should do, and if it understands all that information then it necessarily has a (maybe implicit) human model. In other words, if you require your AI system not to have human models, it will not be very useful, and people will use other techniques.
Technical AI alignment

Iterated amplification

AI Alignment Podcast: AI Alignment through Debate (Lucas Perry and Geoffrey Irving) (summarized by Richard): We want AI safety solutions to scale to very intelligent agents; debate is one scalability technique. It's formulated as a two-player zero-sum perfect information game in which agents make arguments in natural language, to be evaluated by a human judge. Whether or not such debates are truth-conducive is an empirical question which we can try to evaluate experimentally; doing so will require both technical and social science expertise (as discussed in a previous post (AN #47)).
Richard's opinion: I think one of the key questions underlying Debate is how efficiently natural language can summarise reasoning about properties of the world. This question is subject to some disagreement (at one extreme, Facebook's roadmap towards machine intelligence describes a training environment which is "entirely linguistically defined") and probably deserves more public discussion in the context of safety.
Rohin's note: If you've read the previous posts on debate, the novel parts of this podcast are on the relation between iterated amplification and debate (which has been discussed before, but not in as much depth), and the reasons for optimism and pessimism about debate.
Agent foundations

Pavlov Generalizes (Abram Demski): In the iterated prisoner's dilemma, the Pavlov strategy is to start by cooperating, and then switch the action you take whenever the opponent defects. This can be generalized to arbitrary games. Roughly, an agent is "discontent" by default and chooses actions randomly. It can become "content" if it gets a high payoff, in which case it continues to choose whatever action it previously chose as long as the payoffs remain consistently high. This generalization achieves Pareto optimality in the limit, though with a very bad convergence rate. Basically, all of the agents start out discontent and do a lot of exploration, and as long as any one agent is discontent the payoffs will be inconsistent and all agents will tend to be discontent. Only when by chance all of the agents take actions that lead to all of them getting high payoffs do they all become content, at which point they keep choosing the same action and stay in the equilibrium.
Despite the bad convergence, the cool thing about the Pavlov generalization is that it only requires agents to notice when the results are good or bad for them. In contrast, typical strategies that aim to mimic Tit-for-Tat require the agent to reason about the beliefs and utility functions of other agents, which can be quite difficult to do. By just focusing on whether things are going well for themselves, Pavlov agents can get a lot of properties in environments with other agents that Tit-for-Tat strategies don't obviously get, such as exploiting agents that always cooperate. However, when thinking about logical time (AN #25), it would seem that a Pavlov-esque strategy would have to make decisions based on a prediction about its own behavior, which is... not obviously doomed, but seems odd. Regardless, given the lack of work on Pavlov strategies, it's worth trying to generalize them further.
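The two-player Pavlov rule described above can be sketched in a few lines of Python (function names are ours, not from the post). The run below shows the rule's key self-correcting property: after a one-off "trembling hand" defection, two Pavlov players pass through one round of mutual defection and then return to mutual cooperation:

```python
def pavlov(my_last, their_last):
    """Start by cooperating; keep your action while the opponent cooperates,
    switch it whenever the opponent defected last round."""
    if my_last is None:
        return 'C'
    if their_last == 'C':
        return my_last
    return 'D' if my_last == 'C' else 'C'

def play(rounds=8, noise_round=3):
    """Two Pavlov players; player A is forced to defect once at noise_round."""
    a_last = b_last = None
    history = []
    for t in range(rounds):
        a = pavlov(a_last, b_last)
        b = pavlov(b_last, a_last)
        if t == noise_round:
            a = 'D'  # the one-off accidental defection
        history.append((a, b))
        a_last, b_last = a, b
    return history

history = play()
# Rounds 3-5 go (D,C) -> (D,D) -> (C,C): cooperation is restored in two rounds.
```

Note that with no noise this rule never exploits an always-cooperator; the exploitation mentioned above arises once accidental defections are possible, since a Pavlov agent that stumbles into defecting against a cooperator keeps defecting while its payoff stays high.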
Approval-directed agency and the decision theory of Newcomb-like problems (Caspar Oesterheld)
Learning human intent

Thoughts on Human Models (Ramana Kumar and Scott Garrabrant): Summarized in the highlights!
Verification

Algorithms for Verifying Deep Neural Networks (Changliu Liu et al): This is a survey paper about verification of properties of deep neural nets.
Robustness

Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (Pushmeet Kohli et al): This post highlights three areas of current research towards making robust AI systems. First, we need better evaluation metrics: rather than just evaluating RL systems on the environments they were trained on, we need to actively search for situations in which they fail. Second, given a specification or constraint that we would like to ensure, we can develop new training techniques that can ensure that the specifications hold. Finally, given a specification, we can use formal verification techniques to ensure that the model obeys the specification on all possible inputs. The authors also list four areas of future research that they are excited about: leveraging AI capabilities for evaluation and verification, developing publicly available tools for evaluation and verification, broadening the scope of adversarial examples beyond the L-infinity norm ball, and learning specifications.
Rohin's opinion: The biggest challenge I see with this area of research, at least in its application to powerful and general AI systems, is how you get the specification in the first place, so I'm glad to see "learning specifications" as one of the areas of interest.
If I take the view from this post, it seems to me that techniques like domain randomization, and more generally training on a larger distribution of data, would count as an example of the second type of research: it is a change to the training procedure that allows us to meet the specification "the agent should achieve high reward in a broad variety of environments". Of course, this doesn't give us any provable guarantees, so I'm not sure if the authors of the post would include it in this category.
Forecasting

Historical economic growth trends (Katja Grace) (summarized by Richard): Data on historical economic growth "suggest that (proportional) rates of economic and population growth increase roughly linearly with the size of the world economy and population", at least from around 0 CE to 1950. However, this trend has not held since 1950; in fact, growth rates have fallen since then.
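A proportional growth rate that rises linearly with size corresponds to the equation dx/dt = c·x², which, unlike exponential growth, blows up in finite time. A minimal Euler-step sketch of this dynamic (our illustration, with purely made-up constants, not from the summarized post):

```python
def simulate(x0, c, dt, steps):
    """Euler steps for dx/dt = c * x^2, i.e. proportional rate (1/x)(dx/dt) = c*x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + c * xs[-1] ** 2 * dt)
    return xs

xs = simulate(x0=1.0, c=0.1, dt=0.01, steps=900)
# The exact solution x(t) = x0 / (1 - c*x0*t) has a finite-time singularity
# at t = 1/(c*x0) = 10 here; the per-step proportional growth rate keeps
# rising as x grows, whereas exponential growth would keep it constant.
```

This is why hyperbolic fits to the pre-1950 data implied a mid-20th-century "singularity", and why the post-1950 fall in growth rates is a genuine break from the trend.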
Miscellaneous (Alignment)

Coherent behaviour in the real world is an incoherent concept (Richard Ngo): In a previous post (AN #35), I argued that coherence arguments (such as those based on VNM rationality) do not constrain the behavior of an intelligent agent. In this post, Richard delves further into the argument, and considers other ways that we could draw implications from coherence arguments.
I modeled the agent as having preferences over full trajectories, and objected that if you only look at observed behavior (rather than hypothetical behavior), you can always construct a utility function such that the observed behavior optimizes that utility function. Richard agrees that this objection is strong, but looks at another case: when the agent has preferences over states at a single point in time. This case leads to other objections. First, many reasonable preferences cannot be modeled via a reward function over states, such as the preference to sing a great song perfectly. Second, in the real world you are never in the same state more than once, since at the very least your memories will change, and so you can never infer a coherence violation by looking at observed behavior.
He also identifies further problems with applying coherence arguments to realistic agents. First, all behavior is optimal for the constant zero reward function. Second, any real agent will not have full information about the world, and will have to maintain beliefs about the world. Any definition of coherence will have to allow for multiple beliefs, but if you allow all beliefs, then you can rationalize any behavior as based on some weird belief that the agent has. If you require the agent to be Bayesian, you can still rationalize any behavior by choosing a prior appropriately.
Rohin's opinion: I reject modeling agents as having preferences over states primarily for the first reason that Richard identified: there are many "reasonable" preferences that cannot be modeled with a reward function solely on states. However, I don't find the argument about beliefs as a free variable very convincing: I think it's reasonable to argue that a superintelligent AI system will on average have much better beliefs than us, and so anything that we could determine as a coherence violation with high confidence should be something the AI system can also determine as a coherence violation with high confidence.
Three ways that "Sufficiently optimized agents appear coherent" can be false (Wei Dai): This post talks about three ways that agents could fail to appear coherent, where "coherent" means "optimizing for a reasonable goal". First, if due to distributional shift the agent is put into situations it has never encountered before, it may not act coherently. Second, we may want to "force" the agent to pretend as though compute is very expensive, even if this is not the case, in order to keep it bounded. Finally, we may explicitly try to keep the agent incoherent; for example, population ethics has impossibility results showing that any coherent agent must bite some bullet that we don't want to bite, and so we may elect to keep the agent incoherent instead. (See Impossibility and Uncertainty Theorems in AI Value Alignment (AN #45).)
The Unavoidable Problem of Self-Improvement in AI and The Problem of Self-Referential Reasoning in Self-Improving AI (Jolene Creighton and Ramana Kumar): These articles introduce the thinking around AI self-improvement, and the problem of how to ensure that future, more intelligent versions of an AI system are just as safe as the original system. This cannot be easily done in the case of proof-based systems, due to Gödel's incompleteness theorem. Some existing work on the problem: Botworld, Vingean reflection, and logical induction.
Other progress in AI

Deep learning

The Lottery Ticket Hypothesis at Scale (Jonathan Frankle et al) (summarized by Richard): The lottery ticket hypothesis is the claim that "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations". This paper builds on previous work to show that winning tickets can also be found for larger networks (ResNet-50, not just ResNet-18), if those winning tickets are initialised not with their initial weights from the full network, but rather with their weights after a small amount of full-network training.
Richard's opinion: It's interesting that the lottery ticket hypothesis scales; however, this paper seems quite incremental overall.
News

OpenAI LP (OpenAI) (summarized by Richard): OpenAI is transitioning to a new structure, consisting of a capped-profit company (OpenAI LP) controlled by the original OpenAI nonprofit organisation. The nonprofit is still dedicated to its charter, which OpenAI LP has a legal duty to prioritise. All investors must agree that generating profits for them is a secondary goal, and that their overall returns will be capped at 100x their investment (with any excess going back to the nonprofit).
Richard's opinion: Given the high cost of salaries and compute for machine learning research, I don't find this a particularly surprising development. I'd also note that, in the context of investing in a startup, a 100x return over a timeframe of decades is not actually that high.
Copyright © 2019 Rohin Shah, All rights reserved.
Would solving logical counterfactuals solve anthropics?
One of the key problems with anthropics is establishing the appropriate reference class. When we attempt to calculate a probability accounting for anthropics, do we consider all agents or all humans or all humans who understand decision theory?
If a tree falls on Sleeping Beauty argues that probability is not ontologically basic and that the "probability" depends on how you count bets. In this vein, one might attempt to solve anthropics by asking whose decision to take a bet is linked to yours. You could then count up all the linked agents who observe A and all the linked agents who observe not-A, and then calculate the expected value of the bet. More generally, you could use the same method to calculate the payoffs given any set of actions.
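The counting procedure above can be made concrete. This is our own illustrative sketch, not from the linked post, using Sleeping Beauty's numbers (one awakening on heads, two linked awakenings on tails):

```python
def linked_bet_value(branches):
    """Expected total payoff of taking a bet, counting every agent whose
    decision is linked to yours. `branches` is a list of
    (probability of world, number of linked agents betting, per-agent payoff)."""
    return sum(p * n * payoff for p, n, payoff in branches)

# Sleeping Beauty: heads (p = 1/2) means one awakening, tails (p = 1/2) means two.
# Consider a per-awakening bet paying +2 on heads and -1 on tails:
value = linked_bet_value([(0.5, 1, 2.0), (0.5, 2, -1.0)])
# value == 0.0: Beauty is exactly indifferent at 2:1 odds, reproducing the
# usual "thirder" betting behaviour without ever assigning an anthropic probability.
```

The reference-class question then becomes the question of which agents' decisions get counted in `n`, rather than a question about what the "true" probability is.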
Rules and Skills
The extraordinary moment in 20th Century Philosophy when Ludwig Wittgenstein’s analysis of how people are able to follow rules lines up with Martin Heidegger’s exposition of skillfully dealing with the world we are embedded in.… Thereby dismantling Epistemological Scepticism and Antirealism in a single and simple insight.

https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/1
There’s a lot going on in Ludwig Wittgenstein’s Philosophical Investigations and in Martin Heidegger’s Being and Time.
But a single and simple related move made by both Ludwig Wittgenstein and Martin Heidegger is enough to dismantle both … Epistemological Skepticism and Antirealism…
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/2

These two masterworks converge around the analysis of following a rule (L.W.) and skillfully dealing with the world we are embedded in (M.H.)
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/3

How do humans acquire skills?
 By playing, experimenting, trial-and-error, trying things out
 By practicing
 By imitating
 By being coached, instructed, trained
 By being embedded in the world … we literally cannot help ourselves from acquiring skills, just as a consequence of the continuous feedback that the world gives us all the time, moment by moment, as we live our lives !
The coach breaks skills down into sub-skills and trains the apprentice in each of the sub-skills, and then trains the apprentice to combine the sub-skills
(1) The coach says: Do X … the apprentice tries to do X.
(2) The coach tells the apprentice whether they did X, or how well or badly they did X
(3) repeat
Notice that the apprentice does not know whether they are doing X or not, or doing it correctly, or doing it well or doing it badly — apart from the say-so of the coach.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/5

Elaboration of being embedded in the world

We are coupled with the world … the world gives us continuous feedback in respect of the consequences and outcomes of our actions.
At one time you couldn’t do it … now you have to really concentrate or pay attention to not do it !
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/6

As far as human beings are concerned, the world is not usually composed of objects (substances) at all … it’s composed of TOOLS
(and other people — which is a whole other conversation)
Tools become transparent as we become familiar with them and how to use them, i.e. as we become more skilful.
Think of the clutch pedal on a manual-geared car or the door handle you used to enter the room you’re sat in. Tools withdraw from our conscious awareness as we become familiar and competent with their use. Tools and skills withdraw from our conscious awareness and become part of the tacit-knowledge, taken-for-granted background against which it becomes possible to follow a rule or an instruction.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/7

The world we each get to live in is a function of the skills we have.
The primary connection between humans and the world is not via perception or via cognition or even some combination thereof … it is via SKILL.
To be clearer, it is a mistake to speak of a connection between humans and the world, because humans arise first always already in the world and as that world.
Skills (like tools) are transparent.
We are typically not able to say what rules we are following when skilfully dealing with the world in a particular way.
When learning a new skill we try to apply rules, but rules only take us so far.
We are in any case not able to capture the entirety of a skill in a finite set of rules. Specifying how and when a given rule is to be applied requires another rule, which leads to an infinite regress of rules required to explicate the application of other rules.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/8

No finite set of rules is able to capture the performance of a skill

Ludwig Wittgenstein points out in Philosophical Investigations that it is not possible to specify a finite set of rules for how to follow a rule, because it leads to an infinite regress.
How do we know how to apply a rule ?
It can’t be by means of the “rule-applying rule”, because then it can just be asked how we know how to apply that rule (i.e. how to apply the rule-applying rule).
How then is it possible to follow a rule ?
The answer is simple. We are able to apply rules because we have acquired the necessary skills to be able to do so.
Consideration of this makes it apparent that we operate on the basis of a whole load of tacit knowledge, i.e. know-how that we bring when we apply a rule — know-how that we don’t even need to think about; know-how that we possibly may not even have words for or conceptual understanding of.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/9

tacit-knowledge = know-how = transparent-skills

Performative knowledge, i.e. know-how, involves time — activity happens (can only happen) in time and space.
Performative knowledge is a function of our embeddedness.
Contrast know-how with know-that.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/10

Ludwig Wittgenstein’s approach in Philosophical Investigations is to ask awkward questions.
Martin Heidegger’s approach in Being and Time is to assert the primacy of experience and behaviour.
These two approaches come together in this extraordinary moment in philosophy where the ungroundedness of rule-following is balanced by our embeddedness as skillful agents.
We are indeed grounded in the world in the only way we could be grounded — by being embedded in the world that we operate within.
The world gives us continuous momentbymoment feedback in respect of every action we ever took. It is because we are embedded in the world (coupled with the world), and that things matter to us, that the continuous feedback provided by the world enables us to develop the tacit skills we need to operate adequately in the world.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/11

How does understanding ourselves as skilful agents (always already embedded in the world and already taking action in the world) defeat Epistemological Skepticism and Antirealism?

Epistemological Skepticism asserts that we cannot have “knowledge of the external world” which lies outside our minds/brains/souls/whatever. The only thing we can genuinely know is the truth of our own knowing subject.
But following from Wittgenstein’s awkward questions and Heidegger’s assertions, we see that our experience and behaviour does not suggest that we are primarily knowing subjects over and against an external world of objects in the first place … rather it implies that we are primarily embedded skilful agents who use transparent tools by means of transparent skills.
[We arise in the world prior to becoming a subject. Whatever it is that a subject is, it arises secondarily to the primacy of our having first arisen in the world we all share. This primacy of world over subject is both in terms of time but also in respect of how it makes sense to understand ourselves. We are always already in the world first, and subjects second.]
It is only by virtue of these transparent skills and tools that we have any knowledge of our physiology and neurology in the first place.
It is on the basis of this set of tools and skills that we are transparently able to build up our explicit knowthat by building on the foundation of our implicit knowhow which itself is a consequence of the continuous feedback the world constantly gives us in response to our actions.
Antirealism asserts that we have no “direct contact with reality”. All our supposed experience of reality is actually just experience of sense data, such as impulses propagated by our optic nerves.
But again, once we understand ourselves as embedded skilful agents who use transparent tools by means of transparent skills, then we cannot in the first place find a gap between ourselves and reality that needs to be bridged.
The world that we primarily find ourselves in, is one in which we are always already completely immersed and already taking action. This is the world that we find ourselves already in when we first find ourselves at all, and find ourselves anywhere. We don’t have to break into this world; much rather we have very few means of escaping from it.
This is the world that we always find ourselves already in — the place we start from prior to any understanding of it.
Consequently there is in fact no need to try to break out of the “cave” of our minds, and into the light of an “external world”. We are always already in the only world we have — the one world we all share and live in together.
https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/12

Summary

https://thort.space/journey/110898789587140112950_5587971339654183211_6886495592041384105/13
What would a better replacement for tissues look like?
When one has a cold, common disposable tissues are the state-of-the-art solution for getting mucus out of one's nose.

If there's a lot of mucus, however, a bit of that mucus usually ends up on the hand that holds the tissue.
This in turn makes it likely that germs end up on the hand and on objects that one touches with the hand.
Inventing a better solution to replace disposable tissues would be valuable for reducing the spread of infectious disease.
Does anybody have ideas?
Aumann Agreement by Combat
The first paper in the SIGBOVIK 2019 proceedings, “Aumann Agreement by Combat” by Travis Hance, seems relevant to this site. One can’t link to sections within a PDF, but the paper is on page 4 of the PDF (page 8 if you include front matter).
The abstract:
The celebrated Aumann’s Agreement Theorem shows that two rational agents with the same priors on an event who make different observations will always converge on the same posteriors after some civilized conversation over tea. Furthermore, they will come to agree even if they do nothing other than simply state their posteriors over and over again.
However, this protocol is widely criticized for being too boring. We therefore introduce a more exciting alternative, which we name Aumann Agreement by Combat.
The SIGBOVIK conference is held every year on April 1.
Cambridge SSC Meetup
Monthly Cambridge (Massachusetts) SSC meetup.
In apartment 2.
How do people become ambitious?
I have a pet theory about how people become ambitious and agenty.
I was about to write up some insight porn about it, and then was like "you know, Raemon, you should probably actually think about this for real, since it seems like Pet Psychology Theories are one of the easier ways to get stuck in dumb cognitive traps."
"How do people get ambitious and agenty" seems to be something people should have actually tried to study before. I'm thinking something as simple as "interviewing lots of people and checking for common patterns."
2 seconds spent on google scholar suggested I could use better keywords.
Curious if anyone has looked into this (either reviewing existing literature, or conducting interviews themselves, or otherwise trying to tackle the question in a serious way).
For clarity, the two phenomena I want to understand better are:
 Ambition. How do people end up having plans (which they realistically expect to achieve) that affect at least thousands, and preferably millions, of people?
 Agency. How do people generally gain the capacity to be self-motivated, think through plans, and decide how to pursue goals that will change the world (changing it in small ways is fine)?
Modelling Model Comparisons
Define a model as a set of objects and their relationships. For example, when discussing a model of music, "notes" would be the objects and a possible relationship would be "harmony". [Technically, the objects would be (frequency, volume, time), and from there you could define the relationships "harmony", "tempo", "crescendo", etc.]
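The objects-plus-relationships definition can be sketched as a tiny data structure. This is purely my own illustration of the post's definition; the names `Model`, `related`, and the toy music example are not from the original.

```python
# A minimal sketch of "a model = objects + relationships".
# All names here are illustrative, not standard terminology.

class Model:
    def __init__(self, objects, relations):
        self.objects = set(objects)
        # relations: name -> set of object pairs standing in that relation
        self.relations = {name: set(pairs) for name, pairs in relations.items()}

    def related(self, name, a, b):
        """Check whether objects a and b stand in the named relationship."""
        return (a, b) in self.relations.get(name, set())

# A toy "music" model: notes as objects, "harmony" as a relationship.
music = Model(
    objects=["C4", "E4", "G4"],
    relations={"harmony": [("C4", "E4"), ("C4", "G4"), ("E4", "G4")]},
)
print(music.related("harmony", "C4", "E4"))  # True
```

Under this sketch, comparing two models means comparing either their relationship structure (Type 1 below) or mapping object/relationship pairs in one model to objects in another (Type 2 below).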
Using this, there are two different ways to compare models: similar relationships, and (Object1, Relationship) = (Object2).
1. Similar Relationships [Type 1 comparison]
Using the relationship "repetition", there is repetition in musical phrases (Für Elise), in poetic phrases (Annabel Lee), in song choruses, in themes in movies, etc. Even though these four examples contain different objects, we're able to find the similar relationships and compare them.
I can think of two uses for this type of comparison. The first is in metaphors. The second is in generalization. Elaborating on the latter: I find a piece of music annoying if it's too repetitive. The same is true for poems, songs, and movies; however, I very much enjoy a song if it strikes a good balance between repetition and novelty. Learning to strike that balance when writing music has generalized to doing the same in writing lyrics and dance choreography.
The same generalization can happen in non-art fields as well, such as realizing that two different functions are both convex or that two different problems can be solved recursively, and so on.
2. (Object1, Relationship) = (Object2) [Type 2 comparison]
A good example to start is:
(Carbon, atoms arranged in diamond cubic) = ("diamond")
But this is more than just connecting quarks/electrons to atoms or atoms to objects in language. You can connect models of various levels together. With the model of language, we can connect to the low-level object "quarks" and the high-level object "diamond". This has helped my understanding of how Multilevel Models relate to transparency: if an AI can find the correct Type 2 comparison between what it's doing and our human language, then transparency is solved; however, human language is very complex.
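A Type 2 comparison can be sketched as a lookup from a low-level (object, relationship) description to a single higher-level object in another model, here an English word. The dictionary entries and function name below are illustrative only, not a real ontology or anything from the post.

```python
# Sketch: a Type 2 comparison maps an (object, relationship) pair in one
# model to a single object in a higher-level model (here, English words).
# The entries are illustrative, not a real ontology.

type2 = {
    ("carbon", "atoms arranged in diamond cubic"): "diamond",
    ("carbon", "atoms arranged in hexagonal sheets"): "graphite",
}

def name_of(obj, relationship):
    """Return the high-level word for this (object, relationship) pair, if known."""
    return type2.get((obj, relationship))

print(name_of("carbon", "atoms arranged in diamond cubic"))  # diamond
```

The transparency framing above then amounts to: the AI would need to compute this mapping from its own internal objects to human words, which is hard precisely because the left-hand side of a real table like this is enormous and ambiguous.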
Human Language and Communicating Models
Let's say you're talking with Alice. Two failures in communicating models are when (1) you're not discussing the same objects (Alice's definition of "sound" ≠ your definition) or (2) you don't know how Alice believes those objects relate, or vice versa.
One might naively say that the model of language is simply "words" and how they relate; however, it's more like "words, your models that relate to those words, and your model of Alice's models that relate to those words". The two failures above arise from thinking in the naive model of language, and can be alleviated by the more complex model. Two helpful thoughts when talking to Alice are:
1. "Are we discussing the same objects?"
2. "How does Alice think these objects relate?"
and of course asking Alice the relevant questions to figure that out.
Future Work:
1. Type 1 comparisons:
 What are the very useful relationships used to solve past problems, and could they be used to solve current problems (e.g., using the same math technique to solve an entire class of math problems)?
 Is there a way to find those relationships systematically?
2. Type 2 comparisons with human language/transparency
3. Creating lots of examples of Type 1 and Type 2 comparisons to unconfuse myself (I predict that what I wrote in this post is confused but at least points in the right direction)
4. A better word for Type 1, 2 comparisons