General Intelligence is a Fractal Representation
Throwing more data and compute at today's neural network architectures will not spit out an artificial general intelligence (AGI).
An AGI is a machine with human-equivalent intelligence. Computers are already superior to human beings at memory, recall, scalability, compute speed, and math. Humans are superior to computers at small data.
Today's artificial neural networks (ANNs) will not solve the problem of small data because today's ANNs are based on the multilayer perceptron, and the multilayer perceptron is too data-hungry; it scales badly. No realistic quantity of training data plus physical compute hardware can compensate for a sufficiently bad scaling factor.
We know the multilayer perceptron must scale badly because it has fractal dimension 1.
Data Compression

"Entropy" is the minimum message length necessary to encode information. The combined entropy $S_{A+B}$ of two blocks of random information $A, B$ equals the sum of the two entropies $S_A, S_B$ because random information is incompressible.
$S_{A+B} = S_A + S_B$ (incompressible data)
Suppose instead you have two blocks of non-random information $C, D$ such that $C$ and $D$ are related. Then the combined entropy $S_{C+D}$ is less than the sum of the individual entropies.
$S_{C+D} < S_C + S_D$ (compressible data)
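Both relationships can be checked empirically by using an off-the-shelf compressor as a crude stand-in for ideal entropy (a rough sketch only; zlib approximates entropy and adds a few bytes of header overhead per call):

```python
import os
import zlib

def clen(data: bytes) -> int:
    """Compressed length as a crude proxy for entropy."""
    return len(zlib.compress(data, 9))

# Random blocks A, B: compressing them together saves essentially nothing.
A, B = os.urandom(4096), os.urandom(4096)
assert abs(clen(A + B) - (clen(A) + clen(B))) < 128

# Related blocks C, D (here D is an exact copy of C): compressing them
# together is far cheaper than compressing them separately.
C = os.urandom(4096)
D = C
assert clen(C + D) < clen(C) + clen(D)
```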
In practice, part of this message can be thought of as encoding ontologies (generalizations) and part of the message can be thought of as encoding specifics.
For example, consider principal component analysis (PCA), where you calculate the eigenvectors of a dataset, throw away all eigenvectors with small eigenvalues, and then project your original dataset into this lower-dimensional space. The eigenvectors with large eigenvalues can be thought of as encoding ontologies and the projection of your dataset into this new basis can be thought of as encoding specifics.
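The PCA split just described can be sketched in a few lines of NumPy (an illustrative sketch, not code from the original post; the 90% variance-retained threshold is an arbitrary choice):

```python
import numpy as np

def pca_split(X: np.ndarray, var_kept: float = 0.9):
    """Split a dataset into 'ontologies' (the large-eigenvalue axes) and
    'specifics' (each sample's coordinates along those axes)."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep just enough eigenvectors to explain var_kept of the variance.
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    basis = eigvecs[:, :k]    # "ontologies": the dominant directions
    coords = Xc @ basis       # "specifics": the data in the new basis
    return basis, coords

# 200 points that are almost one-dimensional: one axis should dominate.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t + 0.01 * rng.normal(size=(200, 1))])
basis, coords = pca_split(X)
print(basis.shape, coords.shape)  # a single retained component
```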
As you compress more and more information, the ratio of ontologies to specifics in the compressed message increases. This is the fundamental principle behind transfer learning and the last conceptual hurdle between today's technology and human-equivalent artificial intelligence.
Compressible Data is Fractal

The word "fractal" describes a self-similar mathematical structure with a fractal dimension $D_f$ greater than its topological dimension $D_t$.
Compressible data meets both requirements.
- Compressed data is trivially self-similar.
- The topological length $l_t$ of a block of information is equal to its raw uncompressed length. The fractal length $l_f$ of a block of information is equal to its entropy. $D_f > D_t$ follows from $S_{C+D} < S_C + S_D$.
Compressed data is fractal too because compressed data is isomorphic to compressible data.
Generalizing a dataset and compressing it are the same thing. An artificial general intelligence equals a general-purpose compression algorithm. If a general-purpose compression algorithm is to scale to arbitrary levels of complexity then it must encode data fractally.
- It is no coincidence that human brainwaves exhibit a fractal structure. General intelligence, including AGI, is necessarily fractal.
- A multilayer perceptron scales badly on hierarchically complex data because the multilayer perceptron's fractal dimension of 1 equals its topological dimension of 1.
Emptiness and Form
Translation note: There are no English equivalents to the Sanskrit words शून्यता śūnyatā and रूप rūpa. By convention, शून्यता is translated "emptiness" and रूप is translated "form". I follow this convention. My use of the words "emptiness" and "form" in this post has little to do with the English words "emptiness" and "form"; they are placeholders for the Sanskrit.
Consider a cat. From the perspective of fundamental physics, the cat is a collection of particles no more special than any other collection of particles. There is no clear line between "cat" and "non-cat". Everything is quantum fields. The "cat" is a representation created by the human mind. It is a trick of human perspective. From the perspective of an omniscient unbiased observer, the cat is just a scoop of water in a limitless ocean.
Cats are real.
The perspective "cats are real" is called "form". The perspective "cats are an arbitrary ontology with no well-defined meaning amongst the fundamental laws of the universe" is called "emptiness". There is no conflict between form and emptiness, just as there is no conflict between quantum mechanics and classical mechanics. They are different ways of interpreting the same thing at different scales.
Classical mechanics can be more practical than quantum mechanics even though quantum mechanics is more fundamental than classical mechanics. Similarly, emptiness is more fundamental than form yet form is a more useful model of the world than emptiness. Emptiness and form are neither equally true nor equally practical.
Maps ≠ Form & Emptiness ≠ Territory

You could say "form" roughly corresponds to "maps" and "emptiness" roughly corresponds to "territory". That would constitute a better translation from the original Sanskrit than "form" and "emptiness". But the form-emptiness dichotomy draws its line in a slightly different place than the map-territory dichotomy.
The map-territory dichotomy draws the line between reality and one's simplified models of reality. In this way, the map-territory dichotomy is a materialist perspective.
The form-emptiness dichotomy is an informatic perspective. If there is no difference between a map and a territory then, mathematically, the map and the territory are isomorphic representations of the same group.
Ontologies

"Emptiness" describes a shared quality between the reductionist nature of objective reality and the raw sensory data coming into a mind. In both cases, our Bayesian priors bucket high-dimensional data into an ontology called "form".
In other words, form is a byproduct of subjectivity. All ontologies dissolve under the scrutiny of theoretical physics.
The duality between emptiness and form is fundamental to general intelligence.
Discreteness and Differentiability

Big data is easy. The hard problem of general intelligence concerns small data. Small data is all about transfer learning. Transfer learning is all about ontologies.
An intelligent system with hardcoded ontologies is not conceptually adaptable and therefore not a general intelligence. A general intelligence's ontologies must be emergent from its input data. But ontologies are discrete, and the only practical way to navigate high-dimensional input data is via the gradient descent algorithm. And the gradient descent algorithm requires a continuous representation. Can a representation be both continuous and discrete?
In theory, no. In practice, yes.
Consider the sigmoid function in the multilayer perceptron.
$\text{output} = \sigma(\text{input}) = \dfrac{1}{1 + e^{-\text{input}}}$
{fontfamily: MJXcTeXmainIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmainR; src: local('MathJax_Main'), local('MathJax_MainRegular')} @fontface {fontfamily: MJXcTeXmainRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmathI; src: local('MathJax_Math Italic'), local('MathJax_MathItalic')} @fontface {fontfamily: MJXcTeXmathIx; src: local('MathJax_Math'); fontstyle: italic} @fontface {fontfamily: MJXcTeXmathIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MathItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MathItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MathItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize1R; src: local('MathJax_Size1'), local('MathJax_Size1Regular')} @fontface {fontfamily: MJXcTeXsize1Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size1Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size1Regular.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size1Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize2R; src: local('MathJax_Size2'), local('MathJax_Size2Regular')} @fontface {fontfamily: MJXcTeXsize2Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size2Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size2Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size2Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize3R; src: local('MathJax_Size3'), local('MathJax_Size3Regular')} @fontface {fontfamily: MJXcTeXsize3Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size3Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size3Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size3Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize4R; src: local('MathJax_Size4'), local('MathJax_Size4Regular')} @fontface {fontfamily: MJXcTeXsize4Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size4Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size4Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size4Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXvecR; src: local('MathJax_Vector'), local('MathJax_VectorRegular')} @fontface {fontfamily: MJXcTeXvecRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_VectorRegular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_VectorRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_VectorRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXvecB; src: local('MathJax_Vector Bold'), local('MathJax_VectorBold')} @fontface {fontfamily: MJXcTeXvecBx; src: local('MathJax_Vector'); fontweight: bold} @fontface {fontfamily: MJXcTeXvecBw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_VectorBold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_VectorBold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_VectorBold.otf') format('opentype')}
If we zoom in on this function, we can see it is continuously differentiable.
But when we zoom out, it appears as a discrete step function.
The sigmoid function illustrates the scale-dependence of emptiness and form. When we zoom in we see continuity (emptiness), which is a prerequisite for gradient descent. When we zoom out, we see a discrete system (form), which is necessary for the emergence of ontologies. Emptiness and form work together to produce emergent ontologies.
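This scale-dependence is easy to verify numerically. A minimal sketch using the standard logistic sigmoid (no external dependencies):

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

# Zoomed in (|x| small): the function is nearly linear with slope ~0.25,
# i.e. continuously differentiable -- the property gradient descent needs.
slope_near_zero = (sigmoid(0.01) - sigmoid(-0.01)) / 0.02

# Zoomed out (|x| large): the outputs saturate to 0 and 1,
# so at this scale the function behaves like a discrete step.
far_left, far_right = sigmoid(-20.0), sigmoid(20.0)

print(round(slope_near_zero, 3))  # ~0.25
print(far_left < 1e-6, far_right > 1 - 1e-6)
```

The same object supports both views: a smooth gradient signal up close, a near-binary decision from far away.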
Discuss
Forecasting Thread: Existential Risk
This is a thread for displaying your probabilities of an existential catastrophe that causes extinction or the destruction of humanity’s long-term potential.
Every answer to this post should be a forecast showing your probability of an existential catastrophe happening at any given time.
For example, here is Michael Aird’s timeline:
The goal of this thread is to create a set of comparable, standardized x-risk predictions, and to facilitate discussion of the reasoning and assumptions behind those predictions. The thread isn’t about setting predictions in stone – you can come back and update at any point!
How to participate
 Go to this page
 Create your distribution
 Specify an interval using the Min and Max bin, and put the probability you assign to that interval in the probability bin.
 You can specify a cumulative probability by leaving the Min box blank and entering the cumulative value in the Max box.
 To put probability on never, assign probability above January 1, 2120 using the edit button to the right of the graph. Specify your probability for never in the notes, to distinguish this from putting probability on existential catastrophe occurring after 2120.
 Click 'Save snapshot' to save your distribution to a static URL
 A timestamp will appear below the 'Save snapshot' button. This links to the URL of your snapshot.
 Make sure to copy it before refreshing the page, otherwise it will disappear.
 Click ‘Log in’ to automatically show your snapshot on the Elicit question page
 You don’t have to log in, but if you do, Elicit will:
 Store your snapshot in your account history so you can easily access it.
 Automatically add your most recent snapshot to the x-risk question page under ‘Show more’. Other users will be able to import your most recent snapshot from the dropdown.
 We’ll set a default name that your snapshot will be shown under – if you want to change it, you can do so on your profile page.
 If you’re logged in, your snapshots for this question will be publicly viewable.
 Copy the snapshot timestamp link and paste it into your LessWrong comment
 You can also add a screenshot of your distribution in your comment using the instructions below.
How to add an image to your comment
 Take a screenshot of your distribution
 Then do one of two things:
 If you have beta features turned on in your account settings, drag and drop the image into your comment
 If not, upload it to an image hosting service like imgur.com, then write the following markdown syntax for the image to appear, with the url appearing where it says ‘link’: ![](link)
 If it worked, you will see the image in the comment before hitting submit.
If you have any bugs or technical issues, reply to Ben from the LW team or Amanda (me) from the Ought team in the comment section, or email me at amanda@ought.org.
Questions to consider as you're making your prediction
 What definitions are you using? It’s helpful to specify them.
 What evidence is driving your prediction?
 What are the main assumptions that other people might disagree with?
 What evidence would cause you to update?
 How is the probability mass allocated among x-risk scenarios?
 Would you bet on these probabilities?
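As a sanity check before entering numbers into Elicit, a forecast in the interval-plus-“never” format described above can be represented and validated directly. The distribution below is purely illustrative, not anyone’s actual forecast:

```python
# Each entry: (start_year, end_year) -> probability that existential
# catastrophe first occurs in that interval. "never" holds the remaining
# mass, as in the instructions above. All numbers are hypothetical.
forecast = {
    (2020, 2050): 0.10,
    (2050, 2120): 0.15,
    "never": 0.75,  # includes catastrophe after 2120, per the caveat above
}

def is_normalized(dist: dict) -> bool:
    """A valid forecast's probabilities must sum to 1."""
    return abs(sum(dist.values()) - 1.0) < 1e-9

def cumulative_by(year: int, dist: dict) -> float:
    """Cumulative probability of catastrophe by a given year."""
    return sum(p for interval, p in dist.items()
               if interval != "never" and interval[1] <= year)

print(is_normalized(forecast))                  # True
print(round(cumulative_by(2120, forecast), 2))  # 0.25
```

This mirrors what the Elicit interface does with the Min/Max bins, and makes it easy to check that the interval probabilities and the “never” mass account for everything.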
Discuss
How often do series C startups fail to exit?
How often do series C startups really fail? By fail, I mean never have an acquisition or IPO. The internet says 80% (see https://medium.com/journalofempiricalentrepreneurship/dissectingstartupfailurebystage34bb70354a36), but this seems very high to me.
Most Series C companies are worth in the 100-200M range; the one I'm at is worth 270M. How does all the value just evaporate? What happens to the companies that "fail"?
Asking to decide whether to exercise my options. I only need my company to exit at 41M to break even. I am bearish on the company, but with around 40M in ARR it is hard to imagine it not exiting.
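One way to frame the exercise decision is as a simple expected-value calculation. The sketch below uses entirely made-up numbers (exercise cost, scenario probabilities, exit values) and a simplified linear payoff model that ignores liquidation preferences and dilution; it only illustrates the shape of the reasoning:

```python
# All numbers are hypothetical placeholders for illustration.
exercise_cost = 10_000        # cost to exercise the options today
breakeven_exit = 41_000_000   # exit value at which proceeds equal the cost

# Scenario -> (probability, assumed exit value). Probabilities sum to 1.
scenarios = {
    "fail (no exit)": (0.50, 0),
    "modest exit":    (0.30, 100_000_000),
    "strong exit":    (0.20, 400_000_000),
}

def proceeds(exit_value: float) -> float:
    """Simplified model: proceeds scale linearly with exit value,
    equalling the exercise cost exactly at the break-even exit."""
    return exercise_cost * exit_value / breakeven_exit

expected_proceeds = sum(p * proceeds(v) for p, v in scenarios.values())
expected_profit = expected_proceeds - exercise_cost
print(expected_profit > 0)  # True under these made-up assumptions
```

The point is that even a high failure rate can be consistent with a positive expected value when the break-even bar is low relative to plausible exit sizes.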
Discuss
What AI companies would be most likely to have a positive long-term impact on the world as a result of investing in them?
Ever since GPT-3 was unveiled, I've been thinking pretty heavily about increasing my investment in AI-related companies. My first thoughts were to invest in Microsoft and Alphabet (Google): Microsoft because they are partnered with OpenAI, and Alphabet since they have big AI research projects of their own. But in the process of thinking about investing in these companies, I started wondering about the long-term impacts such investments would have on the world. Investing in the right or wrong company could dramatically change how the world looks 20 years from now, and whether it is a place I'd want to live in. The worst case scenario would be all humans dead, or even worse; the best case scenario is... too amazing to put into words. And there's plenty of room in between for how things can go, depending on who makes the important decisions and how good those decisions are. (While I'm only a single person with modest funds to invest, I also consider that my actions are acausally correlated with those of others sufficiently similar to me, which means the acausal results of any investment I make will be multiplied by an amount that gives my decisions a nontrivial impact on the world.)
So the important question is: do I expect Microsoft and Alphabet to do better or worse, with regard to alignment and ethical issues, than the other actors who would develop AGI in their stead? (I do expect someone would develop AGI in their stead.) I can think of actors who would likely do worse than Microsoft or Alphabet: the government of basically any country, or firms based in countries with more totalitarian ethics than the US. Whereas I can only think of alternative actors who I expect to do roughly as well as Microsoft or Alphabet, but not necessarily better. I trust MIRI, but I also don't perceive MIRI as being actively involved in the development of working AI systems; it seems to me that they are laying the important theoretical groundwork for getting things right, but aren't in a position to be the ones who actually do the work that needs to be gotten right.
So my main problem here is a lack of knowledge: there almost certainly are other firms who, if I had the relevant information, I would expect to do better on alignment and ethical issues than Microsoft or Alphabet, but I don't know who those firms are or why I should expect that of them. So my question is: for an investor looking to make an AGI-sized profit off of AGI, but who also cares about what the future looks like as a result of such investment, which companies would be most likely to lead to a good long-term future for humanity?
Note that I'm not asking which company will make the most profit: as long as I reasonably expect a company to make an AGI-sized profit, that's all I care about on that front. What matters is the impact it has on the desirability of the future world it will lead to. I'm also not asking about organizations to donate to, because while that is important, it's not the problem I'm chewing over right now.
Discuss
Prepare for COVID-19 Human Challenge Trials: A Petition in Canada (and soon, the UK)
Canadians: sign the petition here.
TL;DR: COVID-19 human challenge trials could save tens of thousands of lives by quickly narrowing the field of promising candidates, and there is strong reason to believe that signaling clear public support for these trials via an official petition could meaningfully accelerate preparation.
Why COVID-19 Human Challenge Trials?
In a COVID-19 human challenge trial, willing participants would receive the vaccine candidate and, once the vaccine takes effect, be deliberately exposed to live coronavirus. The ability to observe participants closely and gather samples while tracing the progress of infection in real time, knowing exactly when they were infected and with what dose, and being able to follow up over a long period, would offer an unprecedented level of scientific and medical insight into an unfamiliar virus. It would also help us test vaccines far faster. If a challenge trial brings us one day closer to the development of an additional vaccine that could avert just 25% of daily COVID-19 deaths, it would save 1,250 lives. If it brings us a month closer, it would save 37,500 lives.
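The arithmetic behind those figures is straightforward. A sketch assuming roughly 5,000 daily global COVID-19 deaths (the rate implied by the numbers above):

```python
daily_deaths = 5_000    # assumed global COVID-19 deaths per day
vaccine_averts = 0.25   # fraction of daily deaths an additional vaccine averts

lives_saved_per_day = daily_deaths * vaccine_averts
print(int(lives_saved_per_day))       # 1250 lives per day sooner
print(int(lives_saved_per_day * 30))  # 37500 lives per month sooner
```

Under these assumptions, each day of acceleration is worth about 1,250 lives, so even modest speedups dominate the cost-benefit calculation.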
To learn more about COVID-19 human challenge trials:
 Watch this Vox video
 Read this piece by Dr. Sayantan Banerjee in The Telegraph
 Read this paper in the Journal of Clinical Infectious Diseases
For one, signing is remarkably easy (it takes around 20 seconds), so even a very small chance that your signature makes a difference tips the scale in any cost-effectiveness calculation on the margin.
More broadly, though, signaling a groundswell of public support for COVID-19 human challenge trials has directly led to faster preparation for these trials. In May, an NIH document noted that their consideration of challenge trials “has been driven almost entirely by the altruism of potential volunteer advocates and the intense considerations of bioethicists.”
1Day Sooner, which has worked systematically to include volunteers in the public conversation about challenge trials, launched an open letter in support of COVID-19 challenge trials on July 15, signed by over 100 academics and experts as well as 2,000 potential challenge trial volunteers. A week later, the Washington Post Editorial Board wrote in favor of challenge trial preparation. Within a few weeks, Reuters reported that the National Institutes of Health were preparing a coronavirus strain for a COVID-19 challenge trial, in part due to “pressure from advocacy groups such as 1Day Sooner.”
The logic behind the effectiveness of public advocacy for challenge trials is that vaccine developers want assurance that their decision to deliberately infect people with a dangerous virus won’t prompt public backlash. By making clear that the public is actually on board with these trials, stakeholders have a safety net to move forward.
We are now launching a Canada and UK petition campaign because Oxford’s Jenner Institute and several Canadian MPs have signaled interest in conducting a COVID19 human challenge trial. By showing broad support for these trials, we hope to make it easier for more stakeholders to come out in favor of these trials.
Discuss
Sequences Reading Club
Needed: AI infohazard policy
The premise of AI risk is that AI is a danger, and therefore research into AI might be dangerous. In the AI alignment community, we're trying to do research which makes AI safer, but occasionally we might come up with results that have significant implications for AI capability as well. Therefore, it seems prudent to come up with a set of guidelines that address:
 Which results should be published?
 What to do with results that shouldn't be published?
These are thorny questions that it seems unreasonable to expect every researcher to solve for themselves. The inputs to these questions involve not only technical knowledge about AI, but also knowledge about the dynamics of progress, to the extent we can produce such knowledge from the historical record or other methods. AI risk organizations might already have internal policies on these issues, but they don't share them and don't discuss or coordinate them with each other (as far as I know; maybe some do so in private correspondence). Moreover, coordination might be important even if each actor is doing something reasonable when regarded in isolation (avoiding bad Nash equilibria). We need to have a public debate on the topic inside the community, so that we arrive at some consensus (which might be updated over time). If not consensus, then at least a reasonable spectrum of possible policies.
Some considerations that such a policy should take into account:
 Some results might have implications that shorten the AI timelines, but are still good to publish since the distribution of outcomes is improved.
 Usually we shouldn't even start working on something which falls in the should-not-be-published category, but sometimes the implications only become clear later, and sometimes dangerous knowledge might still be net positive as long as it's contained.
 In the midgame, it is unlikely for any given group to make it all the way to safe AGI. Therefore, safe AGI is a broad collective effort and we should expect most results to be published. In the endgame, it might become likely for a given group to make it all the way to safe AGI. In this case, incentives for secrecy might become stronger.
 The policy should not fail to address extreme situations that we only expect to arise rarely, because those situations might have especially major consequences.
Some questions that such a policy should answer:
 What are the criteria that determine whether a certain result should be published?
 What are good channels for asking for advice on such a decision?
 How to decide what to do with a potentially dangerous result? Circulate in a narrow circle? If so, which? Conduct experiments in secret? What kind of experiments?
The last point is also related to a topic of independent significance, namely: what are reasonable precautions for testing new AI algorithms? This has both technical aspects (e.g. testing on particular types of datasets or particular types of environments) and procedural aspects (who should be called on to advise or decide on the matter). I expect to have several tiers of precautions, such that a tier can be selected according to our estimate of the new algorithm's potential, along with guidelines for producing such an estimate.
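As a concrete illustration of the tiered-precaution idea, here is a hypothetical sketch; the tier thresholds and precaution lists are invented for illustration, not a proposed policy:

```python
# Hypothetical tiers: estimated capability potential (0-1) -> precautions.
# Thresholds and measures are placeholders, not recommendations.
TIERS = [
    (0.2, ["publish openly"]),
    (0.5, ["circulate to trusted reviewers before publishing"]),
    (0.8, ["test only on restricted datasets", "no publication"]),
    (1.0, ["secret experiments with external oversight", "no publication"]),
]

def precautions_for(estimated_potential: float) -> list[str]:
    """Select the first tier whose threshold covers the estimate."""
    for threshold, measures in TIERS:
        if estimated_potential <= threshold:
            return measures
    raise ValueError("estimate must be in [0, 1]")

print(precautions_for(0.1))  # ['publish openly']
```

The hard part, of course, is the guideline for producing the estimate itself; the mapping from estimate to precautions is the easy, pre-committable half of the policy.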
I emphasize that I don't presume to have good answers to these questions. My goal here was not to supply answers, but to foster debate.
Discuss
Zen and Rationality: Just This Is It
This is post 4/? about the intersection of my decades of LW-style rationality practice and my several years of Zen practice.
In today's installment, I look at "just this is it" from a rationalist perspective.
When Dongshan, the co-founder of what would become the Soto Zen school within which I practice, was preparing to leave his teacher Yunyan and go out into the world, he asked Yunyan how he might summarize his teaching. Yunyan replied, "just this [is it]". Because the more we say, the more we move into the world of words and away from reality as it is on its own, prior to conception, this is often shortened in various ways to "just this" or "just is" or "this is" or "it is" or, perhaps best of all short of saying nothing and letting reality stand on its own, "is".
This is arguably the core teaching of Soto Zen, and maybe of all of Buddhism: to perceive and accept reality just as it is. Yet I see it all over the place in the LessWrong corpus, too. I'll mention a few examples.
Egan's Law posits that "it all adds up to normality". In "A Technical Explanation of Technical Explanation", Eliezer phrased a similar sentiment as "since the beginning, not one unusual thing has ever happened". Both point at the way that reality is just as it is, and the only way we can be confused or surprised is because we had an idea about how reality is rather than simply looking and seeing how it is.
I think this is a hard thing to remember, because to the kind of people who are attracted to Less Wrong, better models of reality are very attractive. I know they are to me! Yet it's very easy to go from accepting reality as it is and trying to better predict it to getting lost in the model that does the predicting, confusing it for the real thing. Thus, even as we look for models with better gears that more precisely carve reality at its joints, we also have to remember that those boundaries are fuzzy and that all models are ultimately wrong, even and especially when they are useful. It's perhaps the great koan of Less Wrong to build better models while simultaneously accepting that all models are somewhere wrong.
To help us deal with this koan, we have a poem. You might think I mean the Litany of Tarski, but you would be wrong: that poem is about having beliefs correspond to reality, whereas "just this is it" is about getting under those beliefs and just seeing what's actually being perceived. For that, we turn to the Litany of Gendlin:
What is true is already so.
Owning up to it doesn’t make it worse.
Not being open about it doesn’t make it go away.
And because it’s true, it is what is there to be interacted with.
Anything untrue isn’t there to be lived.
People can stand what is true,
for they are already enduring it.
This was said by Eugene Gendlin of Focusing fame, a technique for helping you reconnect to your perceptions just as they are, without judgement or modeling. The method is simple, yet for many its impact can be profound: it gets them out of their ideas about how things are and back to the evidence they are actually getting about the world. Zen asks us, over and over again, to come back to this fundamental point: reality is just as we perceive it, not what we believe about it, and belief is just a useful mechanism for helping us better live our lives, if only we don't get tripped up into mistaking the map for the territory.
Finally, to return to "A Technical Explanation of Technical Explanation", it contains one other phrase that neatly captures the spirit of "just this": "joy in the merely real". If we can take joy in what actually is, if that can be enough for us, then all else becomes the playground in which we live our lives.
Discuss
Clarifying “What failure looks like” (part 1)
Thanks to Jess Whittlestone, Daniel Eth, Shahar Avin, Rose Hadshar, Eliana Lorch, Alexis Carlier, Flo Dorner, Kwan Yee Ng, Lewis Hammond, Phil Trammell and Jenny Xiao for valuable conversations, feedback and other support. I am especially grateful to Jess Whittlestone for long conversations and detailed feedback on drafts, and her guidance on which threads to pursue and how to frame this post. All errors are my own.
Epistemic status: My Best Guess
Epistemic effort: ~70 hours of focused work (mostly during FHI’s summer research fellowship), talked to ~10 people.
Introduction
“What failure looks like” is one of the most comprehensive pictures of what failure to solve the AI alignment problem looks like, in worlds without discontinuous progress in AI. I think it was an excellent and much-needed addition to our understanding of AI risk. Still, if many believe that this is a main source of AI risk, I think it should be fleshed out in more than just one blog post. The original story has two parts; I’m focusing on part 1 because I found it more confusing and nebulous than part 2.
Firstly, I’ll summarise part 1 (hereafter “WFLL1”) as I understand it:

In the world today, it’s easier to pursue easy-to-measure goals than hard-to-measure goals.

Machine learning is differentially good at pursuing easy-to-measure goals (assuming that we don’t have a satisfactory technical solution to the intent alignment problem[1]).

We’ll try to harness this by designing easy-to-measure proxies for what we care about, and deploying AI systems across society which optimize for these proxies (e.g. in law enforcement, legislation and the market).

We’ll give these AI systems more and more influence (e.g. eventually, the systems running law enforcement may actually be making all the decisions for us).

Eventually, the proxies for which the AI systems are optimizing will come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory.
WFLL1 is quite thin on some important details:

WFLL1 does not envisage AI systems directly causing human extinction. So, to constitute an existential risk in itself, the story must involve the lock-in of some suboptimal world.[2] However, the likelihood that the scenario described in part 1 gets locked in (especially over very long time horizons) is not entirely clear from the original post.

It’s also not clear how bad this locked-in world would actually be.
I’ll focus on the first point: how likely is it that the scenario described in WFLL1 leads to the lock-in of some suboptimal world? I’ll finish with some rough thoughts on the second point (how bad or severe that locked-in world might be) and by highlighting some remaining open questions.
Likelihood of lock-in
The scenario described in WFLL1 seems very concerning from a longtermist perspective if it leads to humanity getting stuck on some suboptimal path (I’ll refer to this as “lock-in”). But the blog post itself isn't all that clear about why we should expect such lock-in, i.e. why we won't be able to stop the trend of AI systems optimising for easy-to-measure things before it's too late; this confusion has been pointed out before. In this section, I'll talk through some different mechanisms by which this lock-in can occur, discuss some historical precedents for these mechanisms, and then discuss why the scenario described in WFLL1 might be more likely to lead to lock-in than those precedents were.
The mechanisms for lock-in
Summary: I describe five complementary mechanisms by which the scenario described in WFLL1 (i.e. AI systems across society optimizing for simple proxies at the expense of what we actually want) could get locked in permanently. The first three mechanisms show how humanity may come to depend on the superior reasoning abilities of AIs optimizing for simple proxies to run (e.g.) law enforcement, legislation and the market, despite it being apparent, at least to some people, that this will be bad in the long term. The final two mechanisms explain how this may eventually lead to a truly permanent lock-in, rather than merely temporary delays in fixing the problem.
Before diving into the mechanisms, let’s first be clear about the kind of world in which they may play out. The original post assumes that we have not solved intent alignment and that AI is “responsible for” a very large fraction of the economy.[3] So we’ve made sufficient progress on alignment (and capabilities) that we can deploy powerful AI systems across society which pursue easy-to-measure objectives, but not hard-to-measure ones.
(1) Short-term incentives and collective action
Most actors (e.g. corporations, governments) have short-term objectives (e.g. profit, being re-elected). These actors will be incentivised to deploy (or sanction the deployment of) AI systems to pursue these short-term objectives. Moreover, even if some of these actors are aware that pursuing proxies in place of true goals is prone to failure, if they decide not to use AI then they will likely fall behind on their short-term objectives and therefore lose influence (e.g. be outcompeted, or not re-elected). This kind of situation is a collective action problem, since it requires actors to coordinate on collectively limiting their use of AI; individual actors are better off (in the short term) deploying AI anyway.
Example: predictive policing algorithms used in the US are biased against people of colour. We can’t debias these algorithms, because we don’t know how to design algorithms that pursue the hard-to-measure goal of “fairness”. Meanwhile, such algorithms continue to be used. Why? Given crime rate objectives and a limited budget, police departments do better on those objectives by using (cheap) predictive algorithms than by hiring more staff to think through bias and fairness issues. So individual departments are “better off” in the short term (i.e. more likely to meet their objectives and so keep their jobs) if they just keep using predictive algorithms. Even if some department chief realises that this minimization of reported crime rate produces perverse outcomes, they are unable to take straightforward action to fix the problem, because doing so would likely increase the reported crime rate for their department, hurting that chief’s career prospects.
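The situation described above has the structure of a prisoner's dilemma, which can be sketched with an illustrative payoff matrix (the numbers are invented; only their ordering matters):

```python
# Illustrative short-term payoffs for two departments choosing whether
# to deploy a proxy-optimizing algorithm. Only the ordering matters:
# deploying is individually better regardless of the other's choice,
# yet mutual deployment is worse than mutual restraint.
PAYOFFS = {  # (A's choice, B's choice) -> (A's payoff, B's payoff)
    ("deploy", "deploy"):     (1, 1),
    ("deploy", "restrain"):   (3, 0),
    ("restrain", "deploy"):   (0, 3),
    ("restrain", "restrain"): (2, 2),
}

def best_response(others_choice: str) -> str:
    """A's payoff-maximizing choice given B's fixed choice."""
    return max(["deploy", "restrain"],
               key=lambda mine: PAYOFFS[(mine, others_choice)][0])

# Deploying dominates: it is the best response to either choice,
# so both actors deploy and both end up worse off than under restraint.
print(best_response("deploy"), best_response("restrain"))  # deploy deploy
```

This is why individual good judgment is not enough to halt the trend; escaping the equilibrium requires coordination.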
(2) Regulatory capture
The second mechanism is that influential people will benefit from the AIs optimizing for easy-to-measure goals, and they will oppose attempts to put on the brakes. Think of a powerful CEO using AI techniques to maximize profit: they will be incentivised to capture regulators who attempt to stop the use of AI, for example via political donations or lobbying.
Example: Facebook is aware of how failures of user data protection and the spread of viral misinformation led to problems in the 2016 presidential election. Yet in 2019 it spent $17 million lobbying the US government to assuage regulators who were trying to introduce countervailing regulation.
(3) Genuine ambiguity
The third mechanism is that there will be genuine ambiguity about whether the scenario described in WFLL1 is good or bad. For a while, humans will be better off overall, in absolute terms, than they are today.[4] From the original post:
There will be legitimate arguments about whether the implicit long-term purposes being pursued by AI systems are really so much worse than the long-term purposes that would be pursued by the shareholders of public companies or corrupt officials.
This will be heightened by the fact that it’s easier to make arguments about things for which you have clear, measurable objectives.[5] So arguments that the world is actually fine will be easier to make, in light of the evidence about how well things are going according to the objectives being pursued by AIs. Arguments that something is going wrong, however, will have no such concrete evidence to support them (they might only be able to appeal to a vague sense that the world just isn’t as good as it could be).
This ambiguity will make the collective action problem of the first mechanism even harder to resolve, since disagreement between actors on the severity of a collective problem impedes collective action on that problem.
Example: genuine ambiguity about whether capitalism is “good” or “bad” in the long run. Do negative externalities become catastrophically high, or does growth lead to sufficiently advanced technology fast enough to compensate for these externalities?
(4) Dependency and deskilling
If used widely enough across important societal functions, there may come a time when ceasing to use AI systems would require something tantamount to societal collapse. We can build some intuition for this argument by thinking about electricity, one general-purpose technology on which society already depends heavily. Suppose for the sake of argument that some research comes out arguing that our use of electricity will eventually cause our future to be less good than it otherwise could have been. How would humanity respond? I’d expect to see research on potential modifications to our electricity network, and research that tries to undermine the original study. But actually giving up electricity seems unlikely. Even if doing so would not imply total societal collapse, it would at least significantly destabilise society, reducing our ability to deal with other existential risks. This destabilisation would increase the chance of conflict, which would further erode international trust and cooperation and increase risks posed by a range of weapon technologies.[6] And even if giving up electricity was actually the best strategy in expectation, we wouldn’t necessarily do so, due to the problems of short-term incentives, collective action, regulatory capture and genuine ambiguity mentioned above.
Furthermore, if we increasingly depend on AIs to make the world work, then humans are unlikely to continue to learn the skills we would need to replace them. In a world where most businesspeople/doctors/lawyers are now AIs, we would likely cut costs by closing down most human business/medical/law schools. This deskilling is an additional reason to think we could be locked in to a world where AI systems are filling these roles.
(5) Opposition to taking back influence
Whilst these four mechanisms may mean that our attempts at taking back influence from AIs will be delayed, and will come at some cost, surely we will eventually realise that something has gone wrong, and make a proper attempt to fix it, even if this involves some costly reskilling and destabilisation?
By way of answering this question in the negative, the original article imagines the following possibility:
Eventually, largescale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.
This opposition could take two forms. The first can be seen as a continuation of the “genuine ambiguity” mechanism. Simply because the AIs are doing their jobs so well, we may be increasingly unlikely to realise that anything is going wrong. Reported sense of security, healthcare statistics, life satisfaction, GDP, etc. will look great, because it is precisely these proxies for which the AIs are optimizing. As the gap between how things are and how they appear grows, so too will the persuasion/deception abilities of AIs and the world’s incomprehensibility. Eventually, AIs will be able to manipulate human values and our ability to perceive the world in sophisticated ways (think: highly addictive video games, highly persuasive media or education; cf. the human safety problem).
Example: recommender algorithms maximizing click-throughs feed users more extreme content in order to keep them online for longer. Stuart Russell claims that this is an example of an algorithm making its users’ values more extreme, in order to better pursue its objective.[7]
Secondly, the AIs may explicitly oppose any attempts to shut them down or otherwise modify their objectives. This is because human attempts to take back influence probably will result in (short term) losses according to their objective functions (e.g. reported sense of security will go down if the systems that have been driving this down are switched off). Therefore, AIs will be incentivised to oppose such changes.
What this opposition looks like depends on how general the AIs are. In CAIS-type scenarios, AIs would probably be limited to the narrow kinds of deception described above. For example, an AI police service with bounded resources minimizing the number of complaints before the end of the day (as a proxy for society’s actual safety) will not take long-term, large-scale actions to manipulate human values (e.g. producing advertising to convince the public that complaining is ineffectual). However, it could still take unintended short-term, small-scale actions, if they’re helpful for the task before the end of the bound (e.g. offer better protection to people if they don’t file complaints).
More general AI could oppose human attempts to take back influence in more concerning ways. For example, it could hamper human attempts at collective action (by dividing people’s attention across different issues), cut funding for research on AI systems that can pursue hard-to-measure objectives or undermine the influence of key humans in the opposition movement. Our prospects certainly seem better in CAIS-type scenarios.
Historical precedents
I think the existence of these mechanisms makes the case that it is possible that the scenario described in WFLL1 will get locked in. But is it plausible? In particular, will we really fail to make a sufficient attempt to fix the problem before it is irreversibly locked in? I’ll examine three historical precedents in which these mechanisms played out, which positively updates my credence that they will also play out in the case of WFLL1. However, this reasoning via historical precedents is far from decisive evidence, and I can imagine completely changing my mind if I had more evidence about factors like takeoff speeds and the generality of AI systems.
Climate change
Climate change is a recent example of how mechanisms 1-3 delayed our attempts to solve a problem until some irreversible damage was already done. However, note that the mechanism for the irreversible lock-in is different to WFLL1 (the effects of climate change are locked in via irreversible physical changes to the climate system, rather than mechanisms 4 and 5 described above).
(1) Short-term incentives and collective action
Most electricity generation companies maximize profit by producing electricity from fossil fuels. Despite the unequivocal scientific evidence that burning fossil fuels causes climate change and will probably make us collectively worse off in the long term, individual companies are better off (in the short term) if they continue to burn fossil fuels. And they will be outcompeted if they don’t. The result is a slow-rolling climate catastrophe, despite attempts at collective action like the Kyoto Protocol.
(2) Regulatory capture
BP, Shell, Chevron, ExxonMobil and Total have spent €251m lobbying the EU since 2010 in order to water down EU climate legislation.
(3) Genuine ambiguity
Consensus among the scientific community that human-caused emissions were contributing to climate change was not established until the 1990s. Even today, some people deny there is a problem. This probably delayed attempts to solve the problem.
The agricultural revolution
The agricultural revolution is a precedent for mechanisms 1 and 4 leading to lock-in of technology that arguably made human life worse (on average) for thousands of years. (The argument that agriculture made human life worse is that increased population density enabled epidemics, farm labour increased physical stress, and malnutrition rose due to the replacement of a varied diet with fewer starchy foods.[8])
(1) Shortterm incentives and collective action
Humans who harnessed agricultural technology could increase their population relative to their hunter-gatherer peers. Despite the claimed lower levels of health among agricultural communities, their sheer advantage in numbers gave them influence over hunter-gatherers:
The greater political and military power of farming societies since their inception resulted in the elimination and displacement of late Pleistocene foragers (Bowles, 2011).
So, individual communities were incentivised to convert to agriculture, on pain of being eradicated by more powerful groups who had adopted agriculture.
(4) Dependency
Once a community had been depending on agricultural technology for some generations, it would be difficult to regress to a hunter-gatherer lifestyle. They would have been unable to support their increased population, and would probably have lost some skills necessary to be successful hunter-gatherers.
The colonisation of New Zealand
The colonisation of New Zealand is a precedent for a group of humans permanently losing some influence over the future, due to mechanisms 1, 3 and 5. In 1769, the indigenous Māori were the only people in New Zealand, but by 1872, the British (with different values to the Māori) had a substantial amount of influence over New Zealand’s future (see this animation of decline in Māori land ownership for a particularly striking illustration of this). Despite the superficial differences, I think this provides a fairly close analogy to WFLL1.[9]
(1) Shortterm incentives and collective action
The British purchased land from the Māori, in exchange for (e.g.) guns and metal tools. Each tribe was individually better off if they engaged in trade, because guns and tools were economically and militarily valuable; tribes that did not obtain guns were devastated in the Musket Wars. However, tribes became collectively worse off because the British charged unreasonable prices (e.g. in 1848, over 30% of New Zealand was purchased for around NZD 225,000 in today’s currency) and could use this land to increase their influence in the longer term (more settlers could arrive and dominate New Zealand’s agriculturebased economy).
(3) Genuine ambiguity
British goals were initially somewhat aligned with Māori goals. Most early contact was peaceful and welcomed by Māori. In absolute economic terms, the Māori were initially better off thanks to trade with the British. The Māori translation of the Treaty of Waitangi, which the Māori knew would bring more British settlers, was signed by around 540 Māori chiefs.
(5) Opposition to taking back influence
However, once the British had established themselves in New Zealand, the best ways to achieve their goals ceased to be aligned with Māori goals. Instead, they turned to manipulation (e.g. breaking agreements about how purchased land would be used), confiscation (e.g. the New Zealand Settlements Act 1863) and conflict (e.g. the New Zealand Wars). For the past 150 years, Māori values have sadly been just one of many determinants of New Zealand’s future, and not even a particularly strong one.
How WFLL1 may differ from precedents
These precedents demonstrate that each of the lock-in mechanisms has already played out, making WFLL1 seem more plausible. This section discusses how WFLL1 may differ from the precedents. I think these differences suggest that the lock-in mechanisms are a stronger force in WFLL1 than in the precedents, which also positively updates my credence that WFLL1 will be locked in.
AI may worsen the “genuine ambiguity” mechanism
If AI leads to a proliferation of misinformation (e.g. via language models or deepfakes), then this will probably reduce our ability to reason and reach consensus about what is going wrong. This misinformation need not be sufficiently clever to convince people of falsehoods; it just has to splinter the attention of people who are trying to understand the problem enough to break our attempts at collective action.[10]
Another way in which AI may increase the amount of “genuine ambiguity” we have about the problem is the epistemic bubble/echo chamber phenomenon, supposedly aggravated by social media recommender systems. The claim is that (1) epistemic communities are isolated from each other via (accidental or deliberate) lack of exposure to (reasonable interpretations of) dissenting viewpoints, and (2) recommender systems, by virtue of maximising click-throughs, have worsened this dynamic. If this is true, and epistemic communities disagree about whether specific uses of AI (e.g. AI systems maximizing easy-to-measure goals replacing judges in courts) are actually serving society’s goals, this would make it even harder to reach the consensus required for collective action.
High risk of dependency and deskilling
WFLL1 assumes that AI is “responsible for” a very large fraction of the economy, making it the first time in human history where most humans are no longer required for the functioning of the economy. The agricultural and industrial revolutions involved some amount of deskilling, but humans were still required at most stages of production. However, in WFLL1 it seems likely that humans will heavily depend on AI for the functioning of the economy, making it particularly hard to put on the brakes.
Speed and warning shots
As AI gets more advanced, the world will probably start moving much faster than today (e.g. Christiano once said he thinks the future will be “like the Industrial Revolution but 10x-100x faster”). Naively, this would seem to make things less likely to go well because we’ll have less opportunity to identify and act on warning signs.
That said, some amount of speed may be on our side. If the effects of climate change manifested more quickly, it seems more likely that individual actors would be galvanised towards collective action. So faster change seems to make it more likely that the world wakes up to there being a problem, but less likely that we’re able to fix the problem if we do.
Another way of putting this might be: too fast, and the first warning shot spells doom; too slow, and warning shots don’t show up or get ignored. I’m very uncertain about what the balance will look like with AI. All things considered, perhaps faster progress is worse because human institutions move slowly even when they’re galvanised into taking action.
This discussion seems to carry an important practical implication. Since warning shots are only as helpful as our responses to them, it makes sense to set up institutions that are likely to respond effectively to warning shots if they happen. For example, having a clear, reputable literature describing these kinds of risks, which (roughly) predicts what early warning shots would look like, and argues persuasively that things will only get worse in the long run if we continue to use AI to pursue easytomeasure goals, seems pretty helpful.
Severity of lock-in
The extent to which we should prioritise reducing the risk of a lock-in of WFLL1 also depends on how bad this world actually is. Previous discussion has seen some confusion about this question. Some possibilities include:

The world is much worse than our current world, because humans eventually become vastly less powerful than AIs and slowly go extinct, in much the same way as insects that become extinct in our world.

The world is worse than our current world, because (e.g.) despite curing disease and ageing, humans have no real freedom or understanding of the world, and spend their lives in highly addictive but unrewarding virtual realities.

The world is better than our current world, because humans still have some influence over the future, but our values are only one of many forces, and we can only make use of 1% of the cosmic endowment.

The world is much better than our current world, because humans lead fairly worthwhile lives, assisted by AIs pursuing proxies. We course-corrected these proxies along the way and they ended up capturing much of what we value. However, we still don’t make use of the full cosmic endowment.
It seems that Christiano had something like the third scenario in mind, but it isn’t clear to me why this is the most likely. The question is: how bad would the future be, if it is at least somewhat determined by AIs optimizing for easy-to-measure goals, rather than human intentions? I think this is an important open question. If I were to spend more time thinking about it, here are some things I’d do.
Comparison with precedents
In the same way that it was helpful when reasoning about the likelihood of lock-in to think about past examples, then work out how WFLL1 may compare, I think this could be a useful approach to this question. I’ll give two examples: both involve systems optimizing for easy-to-measure goals rather than human intentions, but seem to differ in the severity of the outcomes.
CompStat: where optimizing for easy-to-measure goals was net negative?[11]

CompStat is a system used by police departments in the US.

It’s used to track crime rate and police activity, which ultimately inform the promotion and remuneration of police officers.

Whilst the system initially made US cities much safer, it ended up leading to:

Widespread under/misreporting of crime (to push reported crime rate down).

The targeting of people of the same race and age as those who were committing crimes (to push police activity up).

In NYC one year, the reported crime rate was down 80%, but in interviews, officers reported it was only down ~40%.

It seems plausible that pressure on police to pursue these proxies made cities less safe than they would have been without CompStat: there were many other successful initiatives which were introduced alongside CompStat, and there were cases of substantial harm caused to the victims of crime underreporting and unjust targeting.
“Publish or perish”: where optimizing for easy-to-measure goals is somewhat harmful but plausibly net positive?

The pressure to publish papers to succeed in an academic career has some negative effects on the value of academic research.

However, much important work continues to happen in academia, and it’s not obvious that there’s a clearly better system that could replace it.
In terms of how WFLL1 may differ from precedents:

Human institutions incorporate various “corrective mechanisms”, e.g. checks and balances in political institutions, and “common sense”. However, it’s not obvious that AI systems pursuing easytomeasure goals will have these.

Most human institutions are at least somewhat interpretable. This means, for example, that humans who tamper with the measurement process to pursue easytomeasure objectives are prone to being caught, as eventually happened with CompStat. However, ML systems today are currently hard to interpret, and so it may be more difficult to catch interference with the measurement process.
What this post has done:

Clarified in more detail the mechanisms by which WFLL1 may be locked in.

Discussed historical precedents for lock-in via these mechanisms and ways in which WFLL1 differs from these precedents.

Taken this as cautious but far from decisive evidence that the lock-in of WFLL1 is plausible.

Pointed out that there is confusion about how bad the future would be if it is partially influenced by AIs optimizing for easy-to-measure goals rather than human intentions.

Suggested how future work might make progress on this confusion.
As well as clarifying this confusion, future work could:

Explore the extent to which WFLL1 could increase existential risk by being a risk factor in other existential risks, rather than an existential risk in itself.

Search for historical examples where the mechanisms for lock-in didn’t play out.

Think about other ways to reason about the likelihood of lock-in of WFLL1, e.g. via a game-theoretic model, or digging into The Age of Em scenario where similar themes play out.
I’m worried that WFLL1 could happen even if we had a satisfactory solution to the intent alignment problem, but I’ll leave this possibility for another time. ↩︎
WFLL1 could also increase existential risk by being a risk factor in other existential risks, rather than a mechanism for destroying humanity’s potential in itself. To give a concrete example: faced with a global pandemic, a health advice algorithm minimising short-term excess mortality may recommend complete social lockdown to prevent the spread of the virus. However, this may ultimately result in higher excess mortality due to the longer-term (and harder-to-measure) effects on mental health and economic prosperity. I think that exploring this possibility is an interesting avenue for future work. ↩︎
The latter assumption is not explicit in the original post, but this comment suggests that it is what Christiano had in mind. Indeed, WFLL1 talks about AI being responsible for running corporations, law enforcement and legislation, so the assumption seems right to me. ↩︎
This isn’t clear in the original post, but is clarified in this discussion. ↩︎
I owe this point to Shahar Avin. ↩︎
These pathways by which conflict may increase existential risk are summarised in The Precipice (Ord, 2020, ch. 6). ↩︎
From Human Compatible: “... consider how content-selection algorithms function on social media. They aren’t particularly intelligent, but they are in a position to affect the entire world because they directly influence billions of people. Typically, such algorithms are designed to maximize click-through, that is, the probability that the user clicks on presented items. The solution is simply to present items that the user likes to click on, right? Wrong. The solution is to change the user’s preferences so that they become more predictable. A more predictable user can be fed items that they are likely to click on, thereby generating more revenue. People with more extreme political views tend to be more predictable in which items they will click on. (Possibly there is a category of articles that diehard centrists are likely to click on, but it’s not easy to imagine what this category consists of.) Like any rational entity, the algorithm learns how to modify the state of its environment—in this case, the user’s mind—in order to maximize its own reward. The consequences include the resurgence of fascism, the dissolution of the social contract that underpins democracies around the world, and potentially the end of the European Union and NATO. Not bad for a few lines of code, even if it had a helping hand from some humans. Now imagine what a really intelligent algorithm would be able to do.” ↩︎
There is some controversy about whether this is the correct interpretation of the paleopathological evidence, but there seems to at least be consensus about the other two downsides (epidemics and physical stress increasing due to agriculture). ↩︎
I got the idea for this analogy from Daniel Kokotajlo’s work on takeovers by conquistadors, and trying to think of historical precedents for takeovers where loss of influence happened more gradually. ↩︎
I owe this point to Shahar Avin. ↩︎
Source for these claims about CompStat: this podcast. ↩︎
My (Mis)Adventures With Algorithmic Machine Learning
Introduction
This was originally posted here.
I've been researching, for quite some time, the prospect of machine learning on a wider variety of data types than normally considered; things other than tables of numbers and categories. In particular, I want to do ML for program and proof synthesis, which requires, at the very least, learning the structures of trees or graphs which don't come from a differentiable domain. Normal ML algorithms can't handle these, though some recent methods, such as graph neural networks and transformers, can be adapted to this domain with some promising results. However, these methods still rely on differentiation. Is this really required? Are we forever doomed to map all our data onto a differentiable domain if we want to learn with it?
An alternative approach that has been bandied about for a while is the utilization of compression. It's not hard to find articles and talks about the relationship between compression and prediction. If you have a good predictor, then you can compress a sequence into a seed for that predictor and decompress by running said predictor. Going the other way is harder, but, broadly speaking, if you have a sequence that you want to make a prediction on and a good compressor, then whichever addition increases the compressed size the least should be considered the likeliest prediction. This approach is quite broad, applying to any information which can be represented on a computer and not requiring any assumptions whatsoever about the structure of our data beyond that. We could use this idea to, for example, fill in gaps in graphs, trees, sets of input-output pairs, etc.
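A minimal sketch of this compression-to-prediction direction, using Python's zlib as a crude stand-in for a good compressor (as discussed below, a purely statistical compressor like deflate is ultimately insufficient, but it illustrates the mechanism):

```python
import zlib

def predict_next(seq: bytes, candidates: list) -> bytes:
    """Return the candidate whose addition grows the compressed size least."""
    return min(candidates, key=lambda c: len(zlib.compress(seq + c, 9)))

# A highly regular sequence ending ...abab; the pattern-continuing byte is "a".
seq = b"ab" * 100
print(predict_next(seq, [b"a", b"b"]))
```

On ties, min() keeps the first candidate; a serious implementation would need a much better approximation to K than deflate provides.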
It's important to understand what's actually required here. We don't actually need to compress our training data; we only need a way to estimate the change in minimal compression size as we add a prediction. This minimal compression size is called the Kolmogorov Complexity, denoted K(X). The minimal compression size of a program which outputs X on an input Y is called the Conditional Kolmogorov Complexity, denoted K(X|Y). The topic of Kolmogorov Complexity is quite broad, and I won't explain all its complexities here. A standard introduction is the textbook An Introduction to Kolmogorov Complexity and Its Applications by Ming Li and Paul Vitányi. If we have a good method for calculating K, then we don't need to actually make use of compression.
Making this practical is quite hard and under-researched, and there aren't many papers on the topic. But there is this: Algorithmic Probability-guided Supervised Machine Learning on Non-differentiable Spaces, which reproduces some standard ML applications using this approach. I want to understand how doing ML this way works, and this post will basically be a collection of the notes I made while reading the paper. If I refer to "the paper", "this paper", etc. in this post, this is what I'm referring to. These notes will digress quite often and I'm also quite critical of some aspects of the paper. This post was also written somewhat like a stream of consciousness, so I'll often say something which I correct later on. This post isn't intended to merely summarize the paper, but to describe what I learned and thought as I read it. Hopefully, you'll learn stuff too.
Why Not Use Ordinary Compression?
One of the most common suggestions for approximating K(X) is to simply use an already existing compression algorithm. The problem is that most "optimal" compression algorithms, such as arithmetic encoding, are only optimal up to the Shannon Entropy of the data. That is, if we assume the data is sampled randomly from a distribution, the best we can do is estimate the shape of this distribution and give appropriately shorter encodings to more likely symbols. This is, asymptotically, about the same as counting substring occurrences to reduce redundancy. If our data is actually just randomly sampled, then this is great! But the real world isn't like this. Most real-world data can be construed as an essentially deterministic process with some added noise. Most compression potential comes from modeling this underlying process, not the noise.
Consider the sequence:
1, 2, 3, 4, ..., 1000
This is, obviously, very compressible. An optimal (truly optimal, not Shannon-entropy-optimal) compressor would be able to compress this into a program producing this output. Maybe Range@1000, or something even smaller, depending on what language it's using. But statistical compression will just try to find repetitive substrings. Even if we represent this list in binary and compress it, statistical methods won't be able to compress this much better than a truly random string, since there are few repetitious patterns.
There are lots of natural examples of this. Compressing the digits of π, compressing the coordinates of regular geometric figures, compressing a list of positions for a simple physical system simulation. It's obvious that these can have small algorithmic complexity, that they should be compressible into small programs that generate them, and yet statistical compression methods won't be able to take advantage of this.
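The gap is easy to demonstrate. In the experiment below, a short generating program plays the role of the "truly optimal" compressed form; exact byte counts will vary by zlib version, so only the ordering matters:

```python
import os
import zlib

# The decimal listing of 1..1000: algorithmically trivial, statistically messy.
data = ", ".join(str(i) for i in range(1, 1001)).encode()

# A short program that regenerates the data exactly; its length is an upper
# bound on the algorithmic complexity (up to the choice of language).
program = b'print(", ".join(str(i) for i in range(1, 1001)))'

compressed = zlib.compress(data, 9)              # statistical compression
noise = zlib.compress(os.urandom(len(data)), 9)  # incompressible baseline

print(len(data), len(compressed), len(program), len(noise))
assert len(program) < len(compressed) < len(noise)
```

The statistical compressor does beat the random baseline, but it remains far above the few dozen bytes of the generating program.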
As a consequence, we must use compression methods that do something more sophisticated than statistical compression. Unfortunately, essentially all general-purpose compression methods are statistical in this sense. There are some ML-based methods that probably aren't. A lot of the text-compression algorithms which participated in the Large Text Compression Benchmark use RNNs, which are definitely doing something other than statistical compression.
Much of the paper is dedicated to explaining one method for approximating Kolmogorov Complexity. Kolmogorov Complexity isn't computable, and getting a good approximation is very hard. Some clever methods have been devised, but we can't get a good grasp of it as easily as we can perform statistical compression.
Methods for Approximating Kolmogorov Complexity
We, ultimately, need a way to approximate Kolmogorov Complexity. The learning methods themselves should be largely independent of this, but choosing a method is essential for real-world applications. Here are a few methods I've found in the literature:
CTM - Coding Theorem Method
This is the method the paper endorses, so I'll talk about it in more detail later on.
The idea is to enumerate all strings of a given length and run them as programs for some chosen model. We collect all the input-output pairs into a database. We then use the coding theorem to justify using this to estimate K. In particular, if we add together 2^-l(p), where l(p) is the length of p, for all programs p which output X, we get an estimate of the algorithmic probability m(X), and K(X) is then approximately -log2(m(X)). This is basically what the coding theorem says, hence the name of the method.
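A toy illustration of the enumeration, using a made-up three-instruction machine rather than the Turing machine enumerations used in the actual CTM work, and glossing over the prefix-free coding details of the real coding theorem:

```python
from itertools import product
from math import log2

# A deliberately tiny machine: "a" appends A, "b" appends B,
# "d" doubles the output so far (a no-op on the empty string).
def run(program) -> str:
    out = ""
    for op in program:
        if op == "a":
            out += "A"
        elif op == "b":
            out += "B"
        else:  # "d"
            out += out
    return out

def ctm_estimate(max_len: int = 6) -> dict:
    """Sum 2^-len(p) over every program producing each output, then take
    K(x) ~ -log2(m(x)), as the coding theorem suggests."""
    m = {}
    for n in range(1, max_len + 1):
        for prog in product("abd", repeat=n):
            x = run(prog)
            m[x] = m.get(x, 0.0) + 2.0 ** -n
    return {x: -log2(mass) for x, mass in m.items()}

k = ctm_estimate()
print(k["AAAA"], k["ABBA"])  # the self-similar string comes out simpler
```

"AAAA" has many short generators (e.g. "aad", "add"), while "ABBA" is only produced by spelling it out, so the estimate ranks it as more complex, which is the behaviour CTM is after.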
BDM - Block Decomposition Method
This utilizes an existing CTM database to estimate Kolmogorov Complexity. It first tries finding algorithmically compressible substrings using CTM and then uses that information in conjunction with a Shannon-entropy-like calculation to estimate the complexity of the whole string. For small strings, BDM is close to the performance of CTM; for large strings, its average-case performance is close to statistical compression. Many large strings in practice, however, tend to be compressed better than with statistical methods.
See:
 Numerical Evaluation of Algorithmic Complexity for Short Strings
 A Decomposition Method for Global Evaluation of Shannon Entropy and Local Estimations of Algorithmic Complexity
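The BDM formula itself is simple once a CTM table exists. Below is a sketch using a hypothetical, made-up CTM table for 4-bit blocks; real BDM uses values from an actual CTM enumeration, so the numbers here are only for illustration:

```python
from math import log2

# Hypothetical CTM values for 4-bit blocks (made up for illustration; a real
# table would be computed by exhaustive CTM enumeration).
CTM = {"0000": 2.2, "1111": 2.2, "0101": 3.0, "1010": 3.0,
       "0110": 3.4, "1001": 3.4, "0011": 3.1, "1100": 3.1}

def bdm(s: str, block: int = 4) -> float:
    """BDM(X) = sum over distinct blocks of CTM(block) + log2(multiplicity)."""
    counts = {}
    for i in range(0, len(s), block):
        b = s[i:i + block]
        counts[b] = counts.get(b, 0) + 1
    return sum(CTM[b] + log2(n) for b, n in counts.items())

print(bdm("0000" * 8))          # one distinct block: 2.2 + log2(8)
print(bdm("0000011001010011"))  # four distinct blocks: rated far more complex
```

Repeats of a block cost only log2 of their count, which is what gives BDM its Shannon-entropy-like behaviour on large strings.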
List Approximation
This method is based on a simple observation: while generating the smallest program generating X is not computable, generating a list guaranteed to contain the smallest program is. In particular, we can return a list enumerating all strings up to and including X. This list will definitely contain the smallest program generating X, but it will be exponentially large in the length of X. How small can this list be?
Short lists with short programs in short time (and an improved version in Short lists for shortest descriptions in short time) shows that this list can be made quadratically large (and asymptotically no smaller) in the length of the input while still being guaranteed to contain the smallest program. This makes searching for K(X) much more practical, as we only need to run a number of programs quadratic in the size of X.
If we are willing to accept only approximating K with a list, we can accept an O(log(|X|)) penalty to our smallest generating program and make the list linear in the size of X, as shown in Linear list-approximation for short programs.
These methods seem promising, but the algorithms themselves are quite abstruse and some require exponential space, making them impractical. However, improvements may be possible.
It's unclear how much labor is actually saved when using the approximation lists. It may be that both the smallest possible representations of programs and everything else in the list require an absurd amount of work to normalize. The list may exclude exactly those programs which were already easy to dismiss during brute-force search while keeping the ones that are hard to assess anyway. The list may also contain only one hard-to-assess program: the smallest one. If there's no second-best approximation to K, then we're stuck having to find the actual smallest value, with no backup if that's impractical to know. Without any practical demonstrations, it's hard to know if these are genuine problems.
Universal Almost Optimal Compression

This method is based on a generic property of compression-decompression pairs. As it turns out, we can, while incurring polylogarithmic overhead in the size of the compressed string, replace a (potentially non-computable) compression algorithm and its decompressor with a pair consisting of an efficient compressor and a potentially inefficient decompressor. By fixing our compressor-decompressor pair to be K and E (the function that simply evaluates a program), we can get a new compression-decompression pair that will compress inputs to a length which differs, at most, polylogarithmically from K. This compressor would not get us smaller, equivalent programs, but, if our goal is simply to approximate the size of a hypothetical Kolmogorov-compressed program, this should work fine.
The basic idea is the following; rather than trying to find the actual smallest compressed string, instead only generate a small "fingerprint" which would allow you to identify what the smallest compressed string might be. The decompressor then just generates a list of candidates, perhaps exhaustively, and uses the fingerprint to find and run it by brute force.
Depending on how much work we're willing to put into making this fingerprint, we can get it down to a pretty small size. According to the paper, it can be made within K(X) + O(log(|X|)) with only polynomial effort in the size of the string.
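A toy sketch of the fingerprint idea, using a plain hash as the fingerprint (the actual construction builds far more careful fingerprints out of pseudorandomness machinery, but the compress-cheaply/decompress-by-brute-force shape is the same):

```python
import hashlib
from itertools import product

def compress(x):
    """Toy version of the fingerprint idea: rather than find the smallest
    program for x, emit just enough identifying information (here a plain
    hash plus the length) for a brute-force decompressor to recover x."""
    return (len(x), hashlib.sha256(x.encode()).hexdigest())

def decompress(fp):
    """Enumerate candidates exhaustively and return the one matching the
    fingerprint: exponential work, but the compressed form stays tiny."""
    n, digest = fp
    for bits in product("01", repeat=n):
        cand = "".join(bits)
        if hashlib.sha256(cand.encode()).hexdigest() == digest:
            return cand

print(decompress(compress("01101001")))  # recovers "01101001"
```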
I don't fully understand how this technique works. It's tied up in a lot of the theory that List Approximation uses as well. Its concepts come from the theory of pseudorandomness, something I'll have to become more familiar with.
Incremental compression

Instead of calculating K(X) all at once, it can usually be done piecemeal. The idea is that, given some input X, we want to find a pair of functions F, D such that F(D(X)) = X and |F| + |D(X)| < |X|. Specifically, we want to find the smallest F meeting this requirement. The idea is that D(X) reduces the size of X, deleting whatever information is in F from X. F is then that information, isolated from X. By repeating this over and over again, we can decompose X into a series F1(F2(F3(...(R)))), where R is the residual which wasn't compressed. In the limit, R should basically consist of all the random information present in X, while the Fs correspond to algorithmic "features" which can be isolated from X. So long as the Fs are always as small as possible, this construction will approach the actual Kolmogorov complexity.
I think this line of work hints towards a rich theory of "atomic" algorithmic information, but it's not ready for practical applications as of yet. The incremental compression is not computable, but it should be much quicker to approximate, on average, than K(X) while still approaching K(X).
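A minimal sketch of one incremental step, using an invented feature family (each F just reinserts a deleted substring wherever a marker occurs); the real theory quantifies over arbitrary computable F and D:

```python
def best_feature(x, markers="#@$%"):
    """Search a toy feature family: each feature F means 'reinsert the
    substring s wherever the marker m occurs', and D(x) deletes s from x
    by replacing its occurrences with m.  Returns (s, m, D(x)) for the
    substring whose removal shrinks |F| + |D(x)| the most, or None if
    nothing shrinks x.  (Invented for illustration only.)"""
    m = next(c for c in markers if c not in x)  # assumes a marker is free
    best = None
    for i in range(len(x)):
        for j in range(i + 2, len(x) + 1):
            s = x[i:j]
            n = x.count(s)
            saving = n * len(s) - n - (len(s) + 1)  # |x| - (|F| + |D(x)|)
            if saving > 0 and (best is None or saving > best[0]):
                best = (saving, s)
    if best is None:
        return None
    s = best[1]
    return (s, m, x.replace(s, m))

def decompose(x):
    """Repeatedly strip out the best feature: X = F1(F2(...(R)))."""
    features = []
    while (f := best_feature(x)) is not None:
        s, m, x = f
        features.append((s, m))
    return features, x

def reconstruct(features, residual):
    """Apply the features innermost-first to rebuild the original string."""
    for s, m in reversed(features):
        residual = residual.replace(m, s)
    return residual

feats, resid = decompose("abcabcabcXabcabcabc")
print(resid)                                               # "###X###"
print(reconstruct(feats, resid) == "abcabcabcXabcabcabc")  # True
```

Here the residual "###X###" plays the role of R, and the single extracted feature ("abc", "#") is the isolated algorithmic content.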
See:
Higher-order compression

This is a method of compressing lambda expressions by observing a connection between grammar-based compression and lambda binding.
The procedure is very simple. Start with a lambda expression.

Compress the expression using a tree grammar (using RePair, for instance). Convert this tree grammar back into a lambda expression.

Run a "simplification procedure" which performs
 eta-reduction
 beta-reduction on linear lambda bindings
 beta-reduction on applications to bound variables

Repeat until the expression stops shrinking.
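The outer loop of this procedure can be sketched on a tiny lambda AST. Only the eta-reduction step is implemented below; the tree-grammar compression and the two beta steps are omitted, so this is a sketch of the loop shape rather than the full algorithm:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Lam:
    var: str
    body: object

@dataclass(frozen=True)
class App:
    fun: object
    arg: object

def free_in(name, t):
    """Is `name` a free variable of term t?"""
    if isinstance(t, Var):
        return t.name == name
    if isinstance(t, Lam):
        return t.var != name and free_in(name, t.body)
    return free_in(name, t.fun) or free_in(name, t.arg)

def eta(t):
    """One pass of eta-reduction: rewrite λx. f x to f when x is not
    free in f."""
    if isinstance(t, Lam):
        body = eta(t.body)
        if (isinstance(body, App) and isinstance(body.arg, Var)
                and body.arg.name == t.var and not free_in(t.var, body.fun)):
            return body.fun
        return Lam(t.var, body)
    if isinstance(t, App):
        return App(eta(t.fun), eta(t.arg))
    return t

def simplify(t):
    """Repeat until the expression stops changing."""
    while (r := eta(t)) != t:
        t = r
    return t

# λf. λx. f x  eta-reduces to  λf. f, and then no further:
print(simplify(Lam('f', Lam('x', App(Var('f'), Var('x'))))))
```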
I honestly have a hard time believing this works. I'll have to think about it more carefully. To me, this doesn't seem like it should perform better than statistical compression, but, according to the paper Functional Programs as Compressed Data;
our representation of compressed data in the form of λterms is optimal with respect to Kolmogorov complexity, up to an additive constant.
I don't buy the argument given in the paper, though, which just seems to argue that optimal compression should be possible in theory; it doesn't even mention the specifics of the algorithm they present. Nonetheless, I want to include this here since it makes a specific and relevant claim. Some follow-up work seems to be doing something more computationally interesting, such as Compaction of Church Numerals for Higher-Order Compression, so a future version of this might be better suited for the task at hand.
Similar grammar-based methods should work for other structured models of computation. For example, using RePair for graphs as presented in Grammar-Based Graph Compression, a version should be possible for interaction nets.
Approximating Conditional Kolmogorov Complexity using CTM

While those methods allow approximating K(X), we usually actually want to approximate K(X|Y), the amount of information stored in X but not in Y; the amount of information X tells us if we already know Y. The paper tries to give a method for doing this, but the method seems very questionable. It says to do the following;
 Fix a "computable relation" M : x → y.
 I don't know what the paper means by this, and the phrase "computable relation" is never clarified. I would assume that it means that a list of output ys can be enumerated on any input x, but I don't know. M is also described as being a "Turing complete space", in the typical case (e.g. when using CTM). M cannot be both a space and a relation, so clearly space is meant in some loose sense, but it's unclear what a "Turing complete space" is supposed to be. I interpret this as meaning that M is supposed to be a relation from programs to outputs in the typical case, which is a function, not a relation. But this framing implies that M could be broader. Perhaps M may be a relation in the case of a nondeterministic computation model, but this is not expanded upon in the paper.
 Fix a finite set P, such that (y, x) ∈ P iff y ∈ M(x)
 In the ordinary case, P would be the database in CTM of outputinput pairs.
 The paper then states that we can approximate K(X|Y) by taking the log₂ of the sum, for all (Y, X) ∈ P, of 1/|P|.
The problems start when we realize that, since P is a set, any pair occurs in P at most once, meaning that this value is log₂(1/|P|) if (Y, X) is in the database or −∞ if it isn't. This is obviously not what's intended, but I also can't glean from the context what is. Furthermore, the full expression given in the paper is;
CTM(X|Y) = log₂ Σ{(Y, X) ∈ P} 1/|P|

Both X and Y are bound twice, once in defining CTM and once by the sum itself. It seems like the second binding is trying to reference the first, but that makes no sense, syntactically. Alternatively, if we interpret the binders as entirely separate, then CTM does nothing with its arguments and just returns 0 on all inputs (since log₂(|P|/|P|) = 0), which is obviously wrong.
The simplest fix is to simply make P a multiset which may contain multiple copies of any given pair. The calculation should then be;
CTM(x|y) = log₂( |[ p ∈ P : p == (x, y) ]| / |P| ) = log₂ Σ{p ∈ P} (if p == (x, y) then 1/|P| else 0)

This may be off, but it's the closest thing to the original paper I could come up with. It just calculates the log-likelihood of a random program covered by P outputting x on input y. Except there are a few problems with this. Firstly, this quantity will always be negative, since log(X) < 0 when 0 < X < 1, and K can never be negative. Even if we fix this, there's still a bigger issue: this CTM definition assumes that all programs are uniformly distributed, but they aren't. Think of the procedure we'd go through when generating a random program. Assuming our choices are uniformly distributed, the programs we generate won't be. If we assume our programs are binary strings, then we will, at each point, make one of three choices: add a 0, add a 1, or end the string. If each choice is uniformly sampled, then there will be a one in three chance of generating the empty string, a one in nine chance of generating 1, and about a one in 60,000 chance of generating 001101001. The chance of generating a program decays exponentially with the length of the program. This observation is built into algorithmic probability, and it's weird that the CTM measure, as described in this paper, ignores it.
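For concreteness, the multiset reading can be computed directly, and it exhibits exactly the negativity problem just described (the database below is invented for illustration):

```python
import math
from collections import Counter

def ctm_multiset(x, y, P):
    """The multiset reading of the formula above: P is a multiset of
    (output, input) pairs and CTM(x|y) = log2(count of (x, y) / |P|).
    Note that this is a log-likelihood: it is never positive, and it
    implicitly treats programs as uniformly distributed."""
    n = Counter(P)[(x, y)]
    return math.log2(n / len(P)) if n else float('-inf')

# A toy multiset in which the pair ("x", "y") occurs twice out of four:
P = [("x", "y"), ("x", "y"), ("z", "y"), ("x", "w")]
print(ctm_multiset("x", "y", P))  # log2(2/4) = -1.0
```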
Digressing a bit, I feel like the authors may have some over-familiarity with one specific model of computation. One approach to defining M, used by the Complexity Calculator project, is to use an enumeration of Turing machines which, to my knowledge, was originally devised for investigating busy beaver numbers. I believe the authors are imagining M as a function that enumerates all Turing machines, runs x on them all, and outputs a stream of all the ys that each Turing machine outputs. This would certainly be a function rather than some generic relation, though.
Let's think of what this measure means for other models of computation. If we were using, say, lambda expressions instead, M should enumerate all lambda expressions, take another lambda expression as input, and output the normal forms of the input applied to all possible lambda expressions. This does seem like it makes sense for any model of computation, but I'm not sure it makes sense as a measure of algorithmic similarity.
The justification for this procedure is supposed to come from the coding theorem, which states that K(X) + O(1) = −log₂(m(X)), where

m(X) = Σ{p | p ↓ X} 2^(−l(p))

where p ↓ X means p normalizes to X and l(p) is the length of p. There's that exponential decay I was talking about.
See:
 Scholarpedia: Algorithmic probability
 Theorem 4.3.3 of "An Introduction to Kolmogorov Complexity and Its Applications"
Modifying this for lambda expressions,
m(X) = Σ{l | l ↓ X} 2^(−I(l))

where l ↓ X means l normalizes to X and I(l) measures the bit information of l, essentially the number of binary decisions made when constructing l. I(l) would be calculated
I(l) := I(l, 0)
I(λ x . y, b) := log₂(2 + b) + I(y, b + 1)
I(x y, b) := log₂(2 + b) + I(x, b) + I(y, b)
I(x, b) := log₂(2 + b)

Incidentally, the length of a binary string doesn't actually give the information content of that string. If a string's length isn't fixed beforehand, then each additional digit incurs one trit of information, since at each stage of the construction we are choosing between one of three options: 0, 1, or stop constructing the string. From this, we can conclude that l(s) = log₃(2^I(s)) − 1; that is, the length of a string is one less than the number of trits in that string. If the string's length is fixed beforehand, so that we cannot choose to end the construction of the string at our leisure, then each choice is actually binary and l(s) = I(s).
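The recurrence for I above can be transcribed directly. Terms are represented as de Bruijn-style tuples, since in this cost scheme only the binder depth matters, not which variable is chosen:

```python
import math

def info(t, b=0):
    """The recurrence for I, transcribed directly.  A node built at binder
    depth b is one choice among 2 + b options (lambda, application, or one
    of the b bound variables), costing log2(2 + b) bits.  Terms are tuples:
    ('lam', body), ('app', f, x), ('var', i)."""
    cost = math.log2(2 + b)
    if t[0] == 'lam':
        return cost + info(t[1], b + 1)
    if t[0] == 'app':
        return cost + info(t[1], b) + info(t[2], b)
    return cost  # a variable costs just its own choice

# I(λ x . x) = log2(2) for the lambda plus log2(3) for the variable ≈ 2.585
print(info(('lam', ('var', 0))))
```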
I think that using I(s) to calculate the information rather than the length is more theoretically correct than the usual expression in terms of length. It doesn't seem to matter too much in the case of strings because the sum over all s of 2^(−l(s)−1) = the sum over all s of 2^(−I(s)) = 1, so both are valid ways of making a probability distribution over all programs with a similar exponential decay. That −l(s)−1 exponent is there so that the empty string isn't given 100% of the distribution. The real problem is generalizability: the length calculation generally fails to make a coherent distribution if our computation model no longer accepts binary strings as inputs. The information, however, can always be adapted even if our computational model expects programs to be something esoteric, like graphs, as is the case with interaction nets.
As a side note, despite length being theoretically incorrect, it's been used in some papers for measuring the information of a lambda expression; see Computable Variants of AIXI which are More Powerful than AIXItl, for instance. It seems like the wrong thing to do, especially since the actual information is so easy to calculate. I think many authors in this field don't think too carefully about the information content of the things they write about, which is quite ironic.
The conditional coding theorem states that;

K(X|Y) + O(1) = −log₂(m(X|Y))

where

m(X|Y) = Σ{p | p(Y) ↓ X} 2^(−l(p))

See:
 Theorem 4.3.4 and Definition 4.3.7 in "An Introduction to Kolmogorov Complexity and Its Applications"
This is definitely not what that CTM measure is approximating. Because the original in the paper is so obviously wrong, and the nature of M is so poorly explained, it's hard to patch it up to whatever the authors intended. In fact, I'm not sure this is actually possible. The conditional coding theorem relies on the length of the program, p, which Y is being fed into. This would require us to incorporate the complexity of the Turing machine itself, but P doesn't store that information.
Let me try to offer a more sensible formulation of the CTM idea. Assume a computing function M : x → y which simply evaluates an input program into an output using a fixed computational model. Let P be a finite set of output-input pairs (y, x). The input type should satisfy the s-m-n theorem, so we can format programs like f(x); that is, we can have functions which can have variables substituted into them. For some Turing machines, application is often just done as list concatenation, though this becomes squirrelly if we want to represent functions which take multiple arguments, nested function application, etc. For a more well-structured model of computation, such as the lambda calculus, application may be a fundamental operation. Regardless, we can then define our metric as;
CTM(x|y) = −log₂( Σ{ p | (x, p(y)) ∈ P } 2^(−I(p)) )

This would require us to be able to pattern match so as to detect p(y). If application is just concatenation, then this is as simple as looking for the suffix y, which is pretty trivial.
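A sketch of this corrected measure with concatenation as application, over an invented toy database of (output, program) pairs, weighting each program by 2^(−I(p)) with I counted in trits as above:

```python
import math

def bits(s):
    """Bit information of a binary string built by a uniform three-way
    choice (0, 1, or stop) at each step: len(s) + 1 trits."""
    return (len(s) + 1) * math.log2(3)

def ctm_cond(x, y, P):
    """CTM(x|y) = -log2( sum of 2^(-I(p)) over programs p with (x, p ++ y)
    in the database ), reading application as concatenation: we look for
    database programs ending in the suffix y."""
    total = sum(2.0 ** (-bits(prog[:len(prog) - len(y)]))
                for (out, prog) in P
                if out == x and prog.endswith(y))
    return -math.log2(total) if total > 0 else float('inf')

# Hypothetical database of (output, program) pairs:
P = {("0", "01"), ("0", "001"), ("1", "11")}
# Programs ending in "1" that output "0" are "01" and "001", i.e. p = "0"
# and p = "00"; the sum is 3^-2 + 3^-3 = 4/27, so CTM("0"|"1") = log2(27/4)
print(ctm_cond("0", "1", P))
```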
This doesn't much resemble what's in the paper, but it makes much more sense.
The paper mentions that CTM(x), which approximates non-conditional Kolmogorov complexity, can be defined as CTM(x|""), x conditioned on the empty string. Well, actually it says that CTM(""|x) should do this, but that doesn't make any sense. Unclear as the original is, it should definitely be CTM(x|"") in my modified version, since it would just be summing over every program p = p ++ "" which outputs x; hence it's eminently compatible with concatenation-as-application. In general, a separate measure would need to be made for other models of computation, since application-as-concatenation doesn't even make sense in general for Turing machines (do you really think application should be associative?), much less other models of computation. More generically, we'd define;
CTM(x) = −log₂( Σ{ p | (x, p) ∈ P } 2^(−I(p)) )

Block Decomposition

In the section on BDM (Block Decomposition Method), the paper keeps calling things "tensors", but it never explains what a tensor is. It definitely isn't meant in the ordinary mathematical sense, since the only things described as tensors are just binary strings. The original paper on BDM talks about compressing tensors and vectors. However, neither of those two things is actually compressed in that paper. Instead, it seems like the authors think that any list is a vector and any list-of-lists is a tensor, which is what they actually compress. It's annoying when authors abuse terminology like this; it's just confusing. From that, I think "tensor" just means a 2-dimensional array of bits. Just call them arrays if that's what they are! This is a CS paper, after all.
I have a suspicion that the segment on block decomposition was copypasted from somewhere else without any copyediting. Tensors aren't mentioned outside the section on BDM.
Setting that aside, we're trying to approximate K(X|Y) using BDM(X|Y). Assume a fixed "partitioning strategy". The paper never explains what this is, but I read other sources (which I'll talk about later on) which informed me that a "partitioning strategy" is simply a method of splitting a string into (possibly overlapping) substrings which are already in our CTM database. What BDM tries to do is then devise a "pairing strategy" which minimizes a quantity. The paper doesn't state what a "pairing strategy" is either, and no other source I read clarifies. It only says the following;
A pairing strategy generates a set P
 consisting of pairs of pairs ((rx, nx), (ry, ny))
 where rx and ry are partitions of X and Y made by our partitioning strategy
 where each rx occurring in P must only occur once, though there is no similar restriction on ry. This just means that P, treated as a relation, is (non-totally) functional.
 where nx and ny are the occurrence counts of rx and ry within the partitionings of X and Y, respectively.
That's all it says on pairing strategies. As far as I can tell from this, a pairing strategy that pairs nothing and is just empty is valid, but I'm pretty sure it's not supposed to be.
Assuming we have an understanding of what additional constraints a pairing strategy should have, we want to find the pairing strategy which minimizes the following quantity;
Σ{((rx, nx), (ry, ny)) ∈ P} CTM(rx|ry) + (if nx == ny then 0 else log(nx))

The minimal value of this quantity will be BDM(X|Y).
This quantity will always be nonnegative and we can always minimize it to zero by making P empty. This is obviously not intended. It also doesn't make much sense to me that we're taking the log of nx if nx is just the count of rxs rather than something involving the length of rx. And shouldn't that log term scale with the difference between nx and ny in some way? The paper offers no real intuition.
Maybe looking at the original BDM paper can offer clarification. It gives a nice example which I'll reproduce here. Let's say we're applying BDM to the string
010101010101010101

We have a choice of partitioning strategy, but it gives the example of splitting the string into substrings of length 12 which may overlap by, at most, 11 digits. When we do this, we get 3 101010101010s and 4 010101010101s. According to CTM, both strings have a complexity of 26.99 bits (assuming we're using only 2-state binary Turing machines). This would indicate that the smallest program generating either 12-digit string is about 27 bits long. Also, there are no Turing-complete 2-state binary Turing machines, so this choice seems doubly weird. We then calculate the BDM value as
26.99 + log(3) + 26.99 + log(4) ≈ 57.565

Okay, but shouldn't it be way smaller? The original string wasn't even twice as long as its partitions, and it should be almost as easy to generate as the partitions. I thought this might be an idiosyncrasy of the specific kind of Turing machine which the paper uses, but the complexity calculator website says almost the same thing, giving the "BDM algorithmic complexity estimation" as 57.5664 bits when we select a block size of 12 with an overlap of 11.
Let's digress a bit and think of K in the lambda calculus. Firstly, we need a way to represent binary strings. We'll just encode these as lists of bits. The type of bits will be defined as
2 = ∀ X . X → X → X = {0, 1}

where

0 = λ f . λ t . f
1 = λ f . λ t . t

Strings should have an appropriate elimination rule stating that, for any type family or predicate P over binary strings,
∀ S : BinString . (∀ s : BinString . (b : 2) → P s → P (b :: s)) → P "" → P S

This is essentially the induction rule for binary strings. One form of it, anyway. We can replace that predicate with a polymorphic variable to get our representation.
BinString = ∀ X . (2 → X → X) → X → X

Compare
∀ S : BinString . ∀ P . (∀ s . (b : 2) → P s → P (b :: s)) → P "" → P S
∀ X . (2 → X → X) → X → X

For any particular string, S, we can realize the induction principle using
λ c : ∀ s . (b : Bits) → P s → P (b :: s) . λ n : P "" . S c n

See
 Generic Derivation of Induction for Impredicative Encodings in Cedille, I guess, since I don't know of a better source on this topic.
I'll talk about alternate representations later on, but I think this is among the most natural representations given the mathematical structure of the data type. "Most natural" doesn't necessarily mean "best", though. Unary natural numbers are more natural than binary representations since pretty much every simple representation of the universal property of ℕ suggests a unary representation. However, they're horribly inefficient.
Using this representation, the original string would be encoded as
λc . λn . c 0 (c 1 (c 0 (c 1 (c 0 (c 1 (c 0 (c 1 (c 0 (c 1 (c 0 (c 1 (c 0 (c 1 (c 0 (c 1 (c 0 (c 1 n)))))))))))))))))

This representation has about 192.747 bits of information. That information count might seem like a lot, but the lambda calculus represents trees of any kind as efficiently as it can represent lists. In this case, the string is a list of bits, which is about as small as can be expected.
However, it can be compressed to;
λc . λn . (λ f . λ x . f (f x)) (λ f . λ x . f (f (f x))) (λ x . c 0 (c 1 x)) n

which has about 83.93 bits of info. One of the 12-character subpartitions can be compressed into
λc . λn . (λ f . λ x . f (f x)) ((λ f . λ x . f (f (f x))) (λ x . c 0 (c 1 x))) n

For the other, just swap 0 and 1. This has about 83.93 bits; the exact same, in fact, as the full string. The reason these are the same is that there are 9 repetitions of 01 in the full string and 9 = 3 ^ 2, while in the substrings there are 6 repetitions of 01 or 10, and 6 = 2 * 3. The information in the multiplication is the same as the information in the exponentiation, so the representations end up with the same amount of info. Of course, I don't know if these are actually the smallest possible representations; they're just the smallest I could come up with. They do illustrate my point, however. The two strings should have about the same information; maybe the original should have a little more. It seems extremely suspicious to me that the two strings have such dramatically different bit counts according to BDM.
This isn't the only way to represent binary strings. We can write the original string instead as;
λ0 . λ1 . 0 (1 (0 (1 (0 (1 (0 (1 (0 (1 (0 (1 (0 (1 (0 (1 (0 (1 (λ x . x))))))))))))))))))

To justify this representation we need to prove that its type is isomorphic to something satisfying the universal property of binary strings. Here, 1 and 0 are expected to take a function X → X and return another function of the same type. This means our new representation has type;
∀ X . (X → X) → (X → X) → (X → X)

We can rewrite this using a bit of type algebra;
∀ X . (X → X) → (X → X) → (X → X)
≅ ∀ X . (X → X) × (X → X) × X → X
≅ ∀ X . (X → X) × (X → X) × (1 → X) → X
≅ ∀ X . (X + X + 1 → X) → X
≅ ∀ X . (2 × X + 1 → X) → X

Note that any type of the form ∀ X . (F(X) → X) → X is the (weakly) initial algebra over the endofunctor F. Binary strings are the initial algebra over the endofunctor X ↦ 2 × X + 1, a special case of initiality for lists in general. Continuing this calculation;
∀ X . (2 × X + 1 → X) → X
≅ ∀ X . (2 × X → X) → (1 → X) → X
≅ ∀ X . (2 → X → X) → X → X

which is the representation we started with. This justifies that the new representation is isomorphic to the old one, and we can comfortably use it interchangeably. See also;
Our new representation has about 78.9 bits of information, while the substrings have about 54.9 bits. We can compress our larger string to
λ0 . λ1 . (λ f . λ x . f (f x)) (λ f . λ x . f (f (f x))) (λ x . 0 (1 x)) (λ x . x)

which has about 57.27 bits of information, less even than what BDM states the Turing machine representation should have. And the lambda calculus has to represent every tree-like datatype! What's the Turing machine representation doing with all that space below 57 bits if it can't even fit 9 repetitions of 01? As far as I can tell, the two 12-digit substrings can't be compressed any further, but my point from before still stands: both strings have similar amounts of algorithmic information. It's suspicious that BDM would say otherwise.
...
Okay, I think I figured it out. I was confused about the partitioning strategy. It's up to us to find a good selection of block size and overlap. Going back to the complexity calculator, if I set the block size to 2 and the overlap to 1, it estimates the complexity to be 12.82 bits. Doing the calculation myself, we have 9 repetitions of 01 and 8 repetitions of 10. The calculation would then be;
3.3274 + log(9) + 3.3274 + log(8) ≈ 12.8247

Going through all the options in the complexity calculator, the minimum is a block size of 2 with an overlap of 0. This partition has only 9 01s and nothing else. This gives the calculation as;
3.3274 + log(9) ≈ 6.4973

The BDM paper states that
[...] if |Adj(X)| is close to 1, then BDM(X) ≈ K(X).
where Adj(X) is the set of pairs of substrings with their occurrence counts. The block size of 2 with an overlap of 0 gives an |Adj(X)| of exactly 1, since it only has one partition: Adj(X) = {(01, 9)}. This should, presumably, be as close to the actual Kolmogorov complexity as BDM can get. I suppose that means we're trying to find the partition strategy which minimizes the number of distinct substrings (each of which must be covered by the CTM database).
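The partition counting and the BDM sum are easy to reproduce. Using the CTM values quoted above, this recovers the 57.565, 12.8247, and 6.4973 figures:

```python
import math
from collections import Counter

def partitions(x, block, overlap):
    """Split x into substrings of the given block size whose starting
    positions advance by block - overlap."""
    step = block - overlap
    return [x[i:i + block] for i in range(0, len(x) - block + 1, step)]

def bdm(x, block, overlap, ctm):
    """BDM(x): for each distinct block r occurring n times in the
    partition, add CTM(r) + log2(n)."""
    counts = Counter(partitions(x, block, overlap))
    return sum(ctm[r] + math.log2(n) for r, n in counts.items())

x = "01" * 9  # the example string 010101010101010101

ctm12 = {"010101010101": 26.99, "101010101010": 26.99}
print(round(bdm(x, 12, 11, ctm12), 3))  # ≈ 57.565

ctm2 = {"01": 3.3274, "10": 3.3274}
print(round(bdm(x, 2, 1, ctm2), 4))  # 3.3274 + log2(9) + 3.3274 + log2(8) ≈ 12.8247
print(round(bdm(x, 2, 0, ctm2), 4))  # 3.3274 + log2(9) ≈ 6.4973
```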
The appendix of the paper has a section on "The Impact of the Partition Strategy". It says the following;
BDM better approximates the universal measure K(X) as the number of elements resulting from applying the partition strategy to X.
as the number of elements... what? Was this paper not copy-edited!
BDM(XY) is a good approximation to K(XY) when the Adj(X) and Adj(Y) share a high number of base tensors.
I assume that this doesn't actually rely on our data being tensors (or 2-dimensional arrays).
We conjecture that there is no general strategy for finding a best partition strategy [...] Thus the partition strategy can be considered an hyperparameter that can be empirically optimized from the available data.
Hmm... this seems like a big gap in the whole approach.
So this clears up some things about the partitioning strategy, at any rate. But I wasn't worried about the partitioning strategy anyway; I wanted to know what a "pairing strategy" is! The original paper on BDM isn't any help since it doesn't describe conditional BDM at all.
Going back to the topic paper of this post, it does describe a "coarse conditional BDM of X with respect to the tensor Y". Again, tensors are not explained at all in the paper, and it's unclear if Y actually needs to be a tensor in any mathematical sense. As I stated before, I think the authors just mean a 2-dimensional array when they say "tensor", and it seems obvious that the construction doesn't rely on dimensionality at all. It defines BDM(X|Y) as
BDM(X|Y) = (Σ{(rx, nx) ∈ Adj(X) && rx ∉ Adj(Y)} CTM(rx) + log(nx)) + (Σ{(rx, nx) ∈ Adj(X) ∩ Adj(Y)} log(nx))

This definition isolates the unique information in X while issuing additional penalties if the information shared between X and Y appears more or less often in X than in Y. I'm not sure if this makes sense. The paper says;
[the second sum] is important in cases where such multiplicity dominates the complexity of the objects
but, intuitively, it seems to me like the sum should only add a penalty if nx > ny; because, otherwise, we're penalizing the conditional complexity of X for content that's in Y but not in X. I'll have to think about this a bit more.
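For reference, the coarse formula as written can be transcribed directly (the numbers below are invented, purely to exercise the two sums):

```python
import math

def coarse_bdm_cond(adj_x, adj_y, ctm):
    """Coarse conditional BDM as given above.  adj_x and adj_y map each
    block to its occurrence count; blocks unique to X contribute
    CTM(r) + log2(n), while shared blocks contribute only log2(n_x)."""
    total = 0.0
    for r, n in adj_x.items():
        total += math.log2(n) if r in adj_y else ctm[r] + math.log2(n)
    return total

# Invented numbers, purely to exercise the two sums:
adj_x = {"01": 9, "11": 2}  # "01" is shared with Y, "11" is unique to X
adj_y = {"01": 4}
ctm = {"11": 5.0}
print(coarse_bdm_cond(adj_x, adj_y, ctm))  # log2(9) + (5.0 + log2(2)) ≈ 9.17
```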
The "coarse" BDM is, I guess, less accurate than the "strong" BDM that I first looked at; but, at least, it makes sense. It's weaker since it doesn't utilize conditional CTM. But without additional clarification on what a "pairing strategy" is, I just can't understand how the strong version works.
I've thought a lot about it and, while I'm not confident, I think I've figured out the two most reasonable fixes.

The pairing strategy could be required to cover X; that would solve the specific problem I pointed out.

Alternatively, P could be required to be totally functional over the partitions of X.

Neither of these conditions is hinted at in the paper, but it's the best I've got. The paper does say;
prior knowledge of the algorithmic structure of the objects can be used to facilitate the computation by reducing the number of possible pairings to be explored
So, at the very least, the pairing strategy is supposed to be determined by some algorithm that isn't described in any detail. I'm frustrated by this whole thing.
Spitballing ways to improve BDM and CTM

One of the thoughts I had was that there may be program patterns in the computational model that always uniformly evaluate. For example, it may always be the case that 100101X10101Y0, for any strings X and Y, evaluates to 1100101, or maybe to 0Y1010. In either case, we can make the replacement to get a smaller program without running the program. Or maybe it reduces to 01Y0Y01; depending on the length of Y, this may or may not be a reduction in size. And there may be variations on this involving variable substrings being riffled and combined in nontrivial ways.
The CTM database only keeps track of string reductions, but it may be possible to search for patterns and perform block replacement via unification/pattern matching instead. This description was given in terms of first-order unification, but there may be a close link with higher-order unification: unification modulo computation.
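A toy sketch of this kind of pattern-based block replacement, with invented rewrite rules (including the hypothetical 100101X10101Y0 → 1100101 rule from above) compiled to regexes, using backreferences for repeated metavariables:

```python
import re

# Hypothetical rewrite rules of the sort described above; X and Y are
# metavariables standing for arbitrary substrings.  These rules are
# invented for illustration, not drawn from any real CTM database.
RULES = [
    ("100101X10101Y0", "1100101"),
    ("0X0X", "0X"),  # a doubled block collapses to a single copy
]

def to_regex(pattern):
    """Compile a pattern with metavariables into an anchored regex; a
    repeated metavariable becomes a backreference so both occurrences
    must match the same substring."""
    seen, out = {}, []
    for ch in pattern:
        if ch in "XY":
            if ch in seen:
                out.append("\\%d" % seen[ch])
            else:
                seen[ch] = len(seen) + 1
                out.append("(.*)")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

def capture(pattern, m):
    """Map each metavariable to the substring it matched."""
    seen = {}
    for ch in pattern:
        if ch in "XY" and ch not in seen:
            seen[ch] = m.group(len(seen) + 1)
    return seen

def rewrite(program):
    """Keep applying any rule that makes the program smaller, without
    ever running the program."""
    changed = True
    while changed:
        changed = False
        for pat, rep in RULES:
            m = re.match(to_regex(pat), program)
            if m:
                env = capture(pat, m)
                new = "".join(env.get(ch, ch) for ch in rep)
                if len(new) < len(program):
                    program, changed = new, True
    return program

print(rewrite("1001011110101110"))  # matches the first rule: "1100101"
print(rewrite("011011"))            # matches the doubling rule: "011"
```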
This also seems similar to finding smaller extensionally equivalent expressions for certain programs; that is, finding programs that don't normalize to the same term but do normalize to the same term when given the same inputs. Programs that are not the same but are behaviourally indistinguishable. I wrote a program a while ago in Mathematica which enumerated SKI expressions, applied 30 or so variable arguments to them, and collected into pairs those which normalized to the same expression after application. In this way, I built up a database which I used to make expressions smaller by replacing them with smaller extensionally equivalent ones. In practice, this ran into some issues. Namely, two extensionally equivalent expressions are not guaranteed to have the same behavior, since one may start evaluating after two arguments while the other will only evaluate after three. For example, the following two expressions are extensionally equivalent, but they only normalize to the same term after two arguments are applied, despite the first fully normalizing after only one application.
λ f . f
λ f . λ x . f x

This only ceases to be a problem if you have some sort of typing discipline which allows you to infer the number of arguments an expression expects. You can then assess extensional equivalence up to that number of arguments while also guaranteeing the preservation of expected behavior up to the type of the full expression you're compressing. This, of course, doesn't work on extensional equivalences which require some elimination principle to justify, e.g. the equivalence of merge sort with quicksort. This may be particularly relevant to incremental compression.
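The mismatch can be seen directly by encoding the two terms as Python functions and applying symbolic arguments that merely record their applications:

```python
class Term:
    """A symbolic value: applying it records the application, so we can
    compare the syntax trees that two Python-encoded lambda terms build."""
    def __init__(self, label, parts=None):
        self.label, self.parts = label, parts
    def __call__(self, arg):
        return Term("app", (self, arg))
    def __eq__(self, other):
        return (isinstance(other, Term)
                and self.label == other.label
                and self.parts == other.parts)

# The two extensionally equivalent expressions from above:
t1 = lambda f: f               # λ f . f
t2 = lambda f: lambda x: f(x)  # λ f . λ x . f x

a, b = Term("a"), Term("b")

# After one argument they differ: t1(a) is the symbolic term a itself,
# while t2(a) is still a Python function waiting for another argument.
print(t1(a) == t2(a))        # False
# After two arguments both build the same application tree (a b):
print(t1(a)(b) == t2(a)(b))  # True
```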
Generating programs with specific, even simple, types is highly nontrivial. It's just the proof synthesis problem where you want to enumerate all proofs (encoding programs) of a given theorem (encoding a type). Restricting to certain typing disciplines, such as simple types without intersection types, certain polymorphic typing disciplines, some refinement type disciplines, some disciplines heavily leaning on algebraic data types, and some others, can be searched fairly efficiently, however. The following papers seem particularly relevant;
 Type-and-Example-Directed Program Synthesis
 Example-Directed Synthesis: A Type-Theoretic Interpretation
 Program Synthesis from Polymorphic Refinement Types
 Generating Constrained Test Data using Datatype Generic Programming
This may be leverageable for some applications. In fact, I'd guess most applications could leverage this.
It's also worth noting that this whole problem is just the inductive programming problem + a desire to minimize the induced program's size. Inductive programming is a whole field unto itself. There are algorithms which can exhaustively produce programs which have a particular output sequence;
See;
These approaches can be used as compression methods. They're not guaranteed to approach the Kolmogorov complexity, but they should generally do a much, much better job than statistical compression methods while being much more efficient than methods attempting to approximate K directly.
Consider the possibility of an algorithm, C(X), and a Shannon optimal compressor, S(X), such that the sum over all X of C(X) − S(X) is negative infinity. That is to say, C(X) in total performs infinitely better than a Shannon optimal compressor while being no slower. BDM already does this, but is there an algorithm that can do this without keeping track of a large database which takes exponential effort beforehand to calculate?
It's actually fairly easy to outperform Shannon optimal compressors by infinitely much. Have a compressor do the following: if it detects a string encoding 0, 1, 2, 3, 4, 5, ..., replace it with a flag indicating such and a number indicating how far the sequence counts. For all other strings, replace them with the output of a Shannon optimal compressor. Such a scheme would generally perform no worse than a Shannon optimal compressor while performing infinitely better in total, though the benefits clearly only apply to a small fraction of strings, even if there are infinitely many such patterns. This means that CTM will generally be able to improve by infinitely much by adding entries to its database, but expanding this database takes exponential effort. Is there a way to do better? Is there a way to characterize how far this can go without exponential effort? Even if it doesn't cover as much of program space asymptotically, is there a way to grow the database forever using only, say, linear effort? Or quadratic? Or logarithmic? And could such things be efficiently sewn together so that we discover the easy things first and foremost, for even very large strings, and the hard things later?
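The counting-sequence trick can be sketched directly. Here zlib stands in for the "Shannon optimal" statistical compressor, and the comma-separated encoding is my own choice for illustration:

```python
import zlib

FLAG_SEQ, FLAG_RAW = b'\x01', b'\x00'

def compress(s: bytes) -> bytes:
    """If s is exactly "0,1,2,...,n", emit a flag plus n; else fall back."""
    parts = s.split(b',')
    try:
        nums = [int(p) for p in parts]
    except ValueError:
        nums = None
    if nums is not None and nums == list(range(len(nums))):
        canonical = b','.join(str(i).encode() for i in range(len(nums)))
        if canonical == s:  # reject non-canonical spellings like b"00,1"
            return FLAG_SEQ + str(len(nums) - 1).encode()
    # zlib stands in for a statistical (Shannon-style) compressor.
    return FLAG_RAW + zlib.compress(s)

def decompress(c: bytes) -> bytes:
    if c[:1] == FLAG_SEQ:
        n = int(c[1:])
        return b','.join(str(i).encode() for i in range(n + 1))
    return zlib.decompress(c[1:])
```

On the string "0,1,...,99999" this emits 6 bytes where the statistical fallback needs far more, and the gap grows without bound as the sequence lengthens, which is the "better by infinitely much" claim in miniature; every other string costs one flag byte of overhead.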
1, 2, 3, ..., 999999999, 1000000000

is easy to compress algorithmically, but I wouldn't expect BDM to do much better than statistical compression. I do believe that such things should be efficiently discoverable anyway, but not by CTM or BDM as they stand.
I think we may need to start thinking about the amount of effort we'd be willing to expend during compression. K and its alternatives seem somewhat backward in their definition. K is the minimal size of a program that can become our target if we're willing to expend an arbitrarily large effort during decompression. Levin complexity is just K plus the log of the runtime of the smallest program. Essentially, Levin complexity is the minimal size of a program that can become our target if we're willing to expend only an exponential amount of effort on decompression. But, shouldn't it be the other way around? Shouldn't we care more about the amount of effort we want to put during compression?
For the purposes of ML, we don't care about decompression at all. What would be better to know is how small a program can be if we're only willing to spend exponential, quadratic, linear, etc. effort with respect to the length of the string we're compressing. Are these problems solvable? I have a suspicion that feedforward NNs are essentially solving a subset of the linear effort case.
This is an area which has been explored quite a bit; by the paper on Universal almost optimal compression I mentioned earlier, for instance. I would like to explore this area in the future, and I believe it will be of tremendous importance for making compression practical for machine learning.
Here's another idea. This one is original to me, but I wouldn't be surprised if someone came up with something similar. Some models of computation can be run backward, nondeterministically. Specifically, models where every state can be reached in one step by only a finite number of transitions. This can't be done effectively with the lambda calculus. If we were in a state λ x . λ f . f x x, that could have been reached in one step by
λ x . (λ y . λ f . f y y) x
(λ x . x) (λ x . λ f . f x x)
λ x . λ f . (λ x . x) (f x x)
(λ d . λ x . λ f . f x x) (λ x . x)
(λ d . λ x . λ f . f x x) (λ x . x x)
(λ d . λ x . λ f . f x x) (λ x . x x x)
...

and infinitely many other things. This means that running a lambda expression backward implies enumerating infinite possibilities at each step. That doesn't mean running expressions backward is impossible, but it limits the utility of such an approach since we'd basically be enumerating every lambda expression an infinite number of times at each backward step. The same applies to combinator logic.
Many models of computation, however, don't have this property. Any model where a fixed amount of work is done at each step doesn't; that includes Turing machines, interaction nets, the linear lambda calculus, and most abstract machines. These can all be run backward as a result. We can then enumerate all the programs which normalize to a particular output by doing the following, assuming we're using an appropriate Turing machine;
 Start with our output string.
 Enumerate every end state of this machine over that string; that is, every configuration where the head of the machine is at each possible position while in the halting state.
 For each of these, generate an infinitely tall rose tree by recursively running the program backward one time step at a time. We can collapse each tree into a stream via breadth-first search, and collapse these streams together by riffling them.
 Every time we reach a point where the machine's head is at the beginning of the string in the starting state, we've logged a program which normalizes to our output.
This procedure will look for and find only those programs which normalize to our desired output, ordered by running time. We can keep this going for as long as we want, remembering only the smallest program found so far. The longer we go, the closer our approximation is to the actual shortest program and therefore the actual Kolmogorov complexity. There are also probably heuristics we could apply to prune the search of paths which won't get us anything smaller than what we already have. I don't know how efficient this could be made, but it seems to me that it would do better than BDM on large strings.
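Here's a toy sketch of the backward search over a tiny fixed-length-tape Turing machine I made up for illustration (a real implementation would need a growable tape and a richer rule set):

```python
from collections import deque

# A made-up toy machine: in state 'start', turn 0s into 1s moving right,
# and halt upon reading a 1. Transitions: (state, read) -> (state, write, move).
RULES = {
    ('start', '0'): ('start', '1', +1),
    ('start', '1'): ('halt',  '1', +1),
}

def backward_search(output, max_depth=25):
    """Yield starting tapes (head at cell 0, state 'start') that run
    forward to `output` in the halting state, in breadth-first order."""
    output = tuple(output)
    # Seed with every halting configuration over the output string.
    frontier = deque(((output, h, 'halt'), 0) for h in range(len(output) + 1))
    seen = set()
    while frontier:
        config, depth = frontier.popleft()
        tape, head, state = config
        if config in seen or depth > max_depth:
            continue
        seen.add(config)
        if state == 'start' and head == 0:
            yield tape  # a starting tape that normalizes to our output
        # Undo one step: any rule that could have produced this configuration.
        for (q0, a), (q1, b, move) in RULES.items():
            prev = head - move
            if q1 == state and 0 <= prev < len(tape) and tape[prev] == b:
                prev_tape = tape[:prev] + (a,) + tape[prev + 1:]
                frontier.append(((prev_tape, prev, q0), depth + 1))
```

For example, both the tapes 001 and 111 run forward to the output 111 on this machine, and the backward search recovers them.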
For parallel models of computation, such as interaction nets, we can optimize this further by treating the (backward) transformations of different segments of the program independently and only combine the timelines when they start interacting.
A further optimization, which would make finding small programs easier but would make finding values verifiably close to K harder, is to iteratively compress programs. We run a program backward only until it shrinks. We then abandon all other search branches and start over with the new, smaller program. Doing this over and over may allow one to effectively find and run recursive programs that generate an output backward much more efficiently than an unpruned search.
Here's another idea. In CTM, we're enumerating all programs and seeing what they output. It may, instead, be better to enumerate programs and then filter them for randomness. Essentially, we'd build up a database by looking at each program in algorithmic order. We'd try building that program from the programs already in the database. If we can't build it from the existing elements in a way that's smaller than the thing itself, then what we're looking at is algorithmically random, and we add it to the database as a random "atom". This should significantly cut down on the combinatorial explosion, though I'm pretty sure it would still be exponential.
Ultimately, I think this whole problem should eat itself. If we're using a universal learning algorithm, then, certainly, it can learn to be better, somehow. Bootstrapping should be possible in this domain.
Algorithmic Loss

Hey, wasn't this post supposed to be about machine learning? Oh ya! Let's finally talk about that!
For any standard ML technique, we need a definition of loss which we want to minimize. The exact way this is defined will depend on what our task is. Generally, we'll use some form of conditional Kolmogorov complexity directly as our loss. Similar measures are already used in some applications. Cross-entropy is the most directly related loss, and using algorithmic complexity can be thought of as a more robust form of entropy.
Generally, our loss will try to capture how much of the data isn't captured by our model. We want our loss to answer the question, "given our model, how much of the data still needs to be explained?" To that end, our loss will generally be a function of the training data conditioned on our predictions. Given a data point y and a prediction Y, our loss on that point will be K(y|Y). To get a total loss, we can just add all these measures together.
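As a sketch, the conditional complexity K(y|Y) can be crudely approximated with a real compressor, in the spirit of the compression-distance literature; zlib and the C(Y·y) − C(Y) proxy are my stand-ins here, not the paper's method:

```python
import zlib

def c_len(b: bytes) -> int:
    """Compressed length; a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(b, 9))

def cond_k(y: bytes, pred: bytes) -> int:
    """Approximate K(y|pred) as C(pred + y) - C(pred):
    how much the data still costs once the prediction is already known."""
    return max(c_len(pred + y) - c_len(pred), 0)

def total_loss(data, predictions):
    """Sum the per-point conditional complexities, as in the text."""
    return sum(cond_k(y, p) for y, p in zip(data, predictions))
```

A prediction close to the data leaves little to explain: cond_k(x, x) is near zero, while cond_k of x given an unrelated string is close to the full cost of x.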
But the paper suggests adding the squared losses together. Why should we do this? Why do we do this for normal ML? Well, we can't normally just add together the losses since they can often be negative. K can never give negative values, so that's not a problem here. Why would we use the mean squared error rather than the mean absolute error? There are two explanations I've seen. Firstly, I've seen some say MSE is easier to differentiate than MAE. This isn't true outside of very simple models where you want to find the optimum in a single step, such as linear regression, and we aren't going to be differentiating K anyway, so this doesn't matter. The other reason comes from an assumption that we're modeling things sampled from a gaussian distribution.
This has always irked me. No matter how many statisticians say "it's reasonable to assume everything is being sampled from a gaussian", that doesn't make it true. If you do any bayesian ML, you'll find that a significant part of the effort spent on standard techniques is distribution engineering. If what you're modeling can't be negative, is multimodal, follows a power law, or a litany of other things, then you're not looking at data sampled from a gaussian and you'll have to model it with a different distribution instead.
Anyway, let's do the derivation;
Firstly, what we always really want to do is maximize the likelihood. Our model is going to make predictions about the probability of various data. The likelihood of our model is just the product of all the probabilities of each training point as predicted by our model.
L(m) = Π(i) m_prob(yi)

In practice, this will usually end up multiplying a bunch of numbers below 1 together, getting a vanishingly small likelihood for models training on a lot of data. Because of this, we usually use the negative log-likelihood, which is
NLL(m) = − Σ(i) log(m_prob(yi))

This makes all our too-small numbers large without losing any information, so this is usually what real algorithms try to minimize. On a historical note, this trick was often used to make carrying out tough calculations easier; logarithm tables were a hot commodity back before calculators became commonplace. Anyway, we often also divide by the total number of data points to turn this log-likelihood into a mean log-likelihood, that way our loss doesn't become huge just because we're working with a lot of data points.
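The underflow problem and the log trick can be sketched numerically (a minimal illustration, not from the paper):

```python
import math

def likelihood(probs):
    """L(m): product of the predicted probabilities of each training point.
    Underflows toward 0.0 as the dataset grows."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def mean_nll(probs):
    """Mean negative log-likelihood: same optimum, no underflow,
    and insensitive to dataset size."""
    return -sum(math.log(p) for p in probs) / len(probs)
```

Since log is monotonic, a model with higher likelihood always has lower NLL, so minimizing one maximizes the other.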
The equation of a gaussian is;
e^(−(x − μ)² / (2σ²)) / (σ √(2π))

The "prediction" made by a Gaussian model will be the mean, μ, and the likelihood of a particular piece of data x will be that data fed into the PDF of a gaussian with the predicted mean. Substituting those changes into our negative log-likelihood, this becomes
− Σ(i) log( e^(−(yi − yi_pred)² / (2σ²)) / (σ √(2π)) ) = (1 / (2σ²)) Σ(i) (yi − yi_pred)² + N log(σ √(2π))

which is exactly the squared error, modulo constants we don't care about. And getting the average by dividing by the number of data points will get us the mean squared error, MSE. This should also illustrate that if you don't think it's reasonable to assume your data were randomly sampled from a gaussian distribution, then you should also not think it's reasonable to use the squared error without a similar derivation.
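We can check this derivation numerically: the Gaussian negative log-likelihood differs from half the sum of squared errors by a constant that depends only on N and σ, never on the data (a sketch with σ held fixed):

```python
import math

def gaussian_nll(ys, preds, sigma=1.0):
    """Negative log-likelihood under y_i ~ Normal(pred_i, sigma^2)."""
    return sum(
        0.5 * ((y - p) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))
        for y, p in zip(ys, preds)
    )

def sse(ys, preds):
    """Sum of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))
```

Because the leftover term N·log(σ√(2π)) is constant across candidate predictions, minimizing the NLL and minimizing the squared error pick out the same model.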
Okay, so, what's the justification for squaring K? Let's think about this: what are the probabilities in our likelihood? Well, they'll be the algorithmic probabilities; the probability that a random program will output the data point when given our model's prediction as an input. The coding theorem says exactly that the Kolmogorov complexity of a string is (within an additive constant) the negative logarithm of its algorithmic probability, meaning the appropriate negative log-likelihood is exactly the sum of Ks.
But, wait, I was looking for justification for squaring K. That's what the paper does. Does it say why?
[...] in order to remain congruent with the most widely used cost functions, we will, for the purpose of illustration, use the sum of the squared algorithmic differences.
Oh, so there is no reason. To be clear, there is a clearly right thing to do here: use the sum of Ks, not squared Ks. We may also want to divide by our number of data points to get the mean K error rather than the total error. I don't think the authors thought very hard about what the loss should be. For much of the paper this odd, clearly wrong loss function is used.
The paper goes on to talk about categorical loss. The obvious thing, to me, is to do basically the same thing, and that's what the paper recommends. Assume our model is outputting some object which is being used to predict the class. In classical ML, this would be like the class probabilities before assigning a class. The loss will be K(Y|M(X)), where Y is the actual class and X is our input data. This signifies how much information is in the real class but not in the prediction of our model. If we were using our model for prediction, then the class C which minimizes K(C|M(X)) would be our prediction.
The paper relates this to clustering by algorithmic distance. I don't see the connection, but it does reference another generic version of a standard ML technique; specifically, it points out the paper Clustering by Compression.
Algorithmic Optimization

Great! So, we know how to assess a model; how do we actually do optimization? Prepare to be disappointed, because the answer offered in the paper is "brute search"! Well, the paper phrases it in a more, umm, appealing way. The optimization algorithm is defined as follows;
 Keep track of the most recent minimal cost in a variable called minCost which is initially set to infinity.
 Keep track of the best set of parameters in a variable called param, which is, I guess, initially set to null, or something.
 Create a stream of all possible parameters of our model in algorithmic order (that is, ordered by their K).
 For each set of parameters, calculate the loss of the model with those parameters. If the loss is less than the current minCost, then set minCost to this value and set param to the parameters being used.
 Keep going until you're bored or satisfied, or stop for no reason at all; the paper doesn't care.
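The steps above can be sketched as follows. Zlib-compressed length is a crude stand-in for K (the paper would use CTM/BDM-style estimates), and a finite toy parameter space replaces the infinite stream; both are my assumptions for illustration:

```python
import zlib

def k_approx(p) -> int:
    """Crude proxy for K: compressed length of the parameter's repr."""
    return len(zlib.compress(repr(p).encode()))

def algorithmic_optimize(params, loss, budget=None):
    """Scan parameters in (approximate) algorithmic order, tracking the best."""
    stream = sorted(params, key=k_approx)   # "algorithmic order", approximately
    if budget is not None:
        stream = stream[:budget]            # stop when bored or satisfied
    min_cost, best = float('inf'), None
    for p in stream:
        cost = loss(p)
        if cost < min_cost:
            min_cost, best = cost, p
    return best, min_cost
```

Running it on a toy loss, e.g. `algorithmic_optimize(range(256), lambda p: abs(p - 42))`, just scans every candidate, which is exactly the brute-force character complained about below.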
That's it. However, there are some complications that make me not entirely sure if this is right. It defines the cost function (the sum of squared Ks) to be J_a(ˆX, M), where ˆX is our dataset and M is the model we're assessing. However, the actual description of the algorithm tells us to use J_a(ˆM, σ_i), where σ_i is the ith parameter in our parameter stream and ˆM is never defined. Presumably, ˆM has something to do with our model, but it can't possibly be a replacement for ˆX, since ˆX is just a collection of input-output pairs and our model is, obviously, not. Either there's an entirely separate loss function over models and parameters which is never defined, or there was a typo and the algorithm should have said J_a(ˆX, M{σ_i}), or something like that. The version I wrote seems pretty intuitive (if overly simplistic), so I'm leaning toward the latter.
The paper states that this algorithm "can ['minimize K(M) and minimize the cost function'] in an efficient amount of time". But, uh, no it doesn't. It does as bad as a brute force search because it is a brute force search. It goes on to say
the algorithmic parameter optimization always finds the lowest algorithmically complex parameters that fit the data ˆX within the halting condition [...] algorithmic parameter optimization will naturally be a poor performer when inferring models of high algorithmic complexity.
Ya think? I'm not confident that this would work on anything but toy problems. Who knows, maybe I'm wrong and it actually works surprisingly well on real-world data, but I doubt it. The algorithm doesn't even try taking advantage of what it knows about good parameters.
As an aside, the paper mentions that
In the context of artificial evolution and genetic algorithms, it has been previously shown that, by using an algorithmic probability distribution, the exponential random search can be sped up to quadratic
giving a few citations. This seems more reasonable to me, as such methods aren't just brute-force searching.
This will be a bit of a digression, but if you read this far you probably don't care about that. The first example it uses is a regression problem on two variables. It says the following on the ability to enumerate the parameter space;
For instance, in order to fit the output of the function f (Eq. 2) by means of the model M, we must optimize over two continuous parameters s1 and s2. Therefore the space of parameters is composed of the pairs of real numbers σ_i = [σ_i1, σ_i2]. However, a computer cannot fully represent a real number, using instead an approximation by means of a fixed number of bits. Since this second space is finite, so is the parameter space and the search space which is composed of pairs of binary strings of finite size [...]
This entire paragraph is rather head-scratching. Computers certainly can fully represent a real number. We can figure out how by following the same basic procedure I used before to figure out how to encode binary strings: you just state a universal property of the mathematical object you want to represent and derive a type of realizers. This is a bit squirrely with the real numbers, as the exact universal properties diverge in constructive settings; Dedekind reals and Cauchy reals aren't isomorphic anymore, for instance. There are also practical questions about how to make calculating with them as easy as possible. That being said, the simplest universal property for any kind of real number I'm aware of is the following: the (nonnegative) real numbers are the final coalgebra of the endofunctor X ↦ ℕ × X. There are a few places that say this in various guises. The most direct is On coalgebra of real numbers, which is all about this observation. See this as well. This basically says that a real number is an infinite stream of natural numbers. There are a few ways of viewing what this represents, and that will largely determine whether you see each number as representing a nonnegative real or something isomorphic, like something in [0, 1). For the latter, we can read each natural number as describing how many 1s we encounter before encountering a 0 in the binary expansion of a number. For example;
0 = [0, 0, 0, ...]
0.1 = [0, 0, 2, 0, 2, 0, 2, 0, 2, ...]
0.2 = [0, 2, 0, 2, 0, 2, 0, 2, 0, ...]
1/3 = [1, 1, 1, 1, 1, 1, 1, ...]
√2 − 1 = [2, 1, 1, 0, 0, 0, 0, 1, 0, 4, ...]
π − 3 = [0, 1, 0, 1, 0, 0, 0, 6, 2, 1, 1, ...]

Of course, we can map this back onto the nonnegative reals by interpreting each number x as 1/(1−x) − 1 instead. Then we'd have
0 = [0, 0, 0, ...]
0.1 = [0, 0, 0, 1, 3, 1, 0, 0, 1, 3, 1, 0, ...]
0.2 = [0, 1, 1, 1, 1, 1, 1, 1, ...]
√2 = [1, 0, 1, 1, 5, 2, 0, 0, ...]
3 = [2, 0, 0, 0, 0, 0, ...]
π = [2, 0, 0, 0, 1, 0, 0, 2, 0, 0, ...]
e = [1, 3, 2, 0, 1, 0, 2, 1, ...]

Alternatively, we can imagine each entry in a sequence [a0, a1, a2, ...] as representing a number as a simple continued fraction of the form
a0 + 1/(a1 + 1/(a2 + 1/...))

This can represent any nonnegative real number.
0 = [0, 0, 0, ...]
1 = [1, 0, 0, 0, ...]
0.1 = [0, 10, 0, 0, 0, ...]
0.2 = [0, 5, 0, 0, 0, ...]
1/3 = [0, 3, 0, 0, 0, ...]
√2 = [1, 2, 2, 2, 2, 2, ...]
π = [3, 7, 15, 1, 292, 1, 1, ...]

Whichever interpretation we use will determine how we define, for instance, addition, multiplication, etc. Note that there are some complications with representing real numbers as continued fractions in this way. Notably, some numbers don't have unique representations. While I understand these caveats, I don't understand how to solve them, though I've been told that such solutions are "various".
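Under the continued-fraction reading, a real is just a function from positions to naturals, and finite prefixes can be evaluated exactly. A minimal sketch (the all-zero tails are among the ill-defined representations just mentioned, so they're avoided here):

```python
from fractions import Fraction

def cf_eval(a, depth):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) truncated after `depth` terms,
    where `a` maps a position n to the natural number a_n."""
    x = Fraction(a(depth - 1))
    for n in range(depth - 2, -1, -1):
        # Guard against zero tails, whose reciprocal is undefined.
        x = Fraction(a(n)) + (Fraction(1) / x if x else Fraction(0))
    return x

# sqrt(2) = [1; 2, 2, 2, ...], and the identity function encodes [0; 1, 2, 3, ...]
sqrt2 = lambda n: 1 if n == 0 else 2
identity = lambda n: n
```

`float(cf_eval(sqrt2, 20))` agrees with √2 to machine precision, and `float(cf_eval(identity, 20))` gives ≈ 0.697775, the value I₁(2)/I₀(2) discussed below for the least complex real.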
Following Recursive types for free!, the (weakly) final coalgebra of X ↦ ℕ × X can simply be defined as
∃ X . (X → ℕ × X) × X

We can construct a real number by fixing a type, X, giving an X as a seed, and then defining a method of generating new digits and new seeds from an old seed. For example, we can construct an infinite stream of zeros by setting X to be ⊤, the unit type, giving •, the only inhabitant of ⊤, as our seed, and defining our generator as λ x . (0, •). In full, we'd have
0 : ℝ := λ x . (0, •), •

or, if we want to be explicit about all our encodings;
0 := λ p . p (λ x . λ p . p (λ z . λ s . z) (λ x . x)) (λ x . x)

This means that the real number zero has, at most, about 32.511 bits of complexity; surprisingly small for something which is supposedly infinitely large.
The usual reason we'd want to use floating-point numbers over exact real numbers is efficiency; floating-point numbers are much faster to compute with since our computers have hardware dedicated to them. But this approach represents the parameters as raw outputs of some virtual computer anyway, so that doesn't apply here. To use floating points, we'd have to convert them to some encoding of floats in our computational model. We gain no efficiency by using floats here.
We can make our representation a bit more efficient. Following the Coinductive function spaces page, we can use a few isomorphisms to change this representation. Notably, it's generally the case that
∃ X . (X → A × X) × X ≅ (∀ X . (1 + X → X) → X) → A

for any A. Since ℕ is the initial algebra over the endofunctor X ↦ 1 + X, the above can be rewritten as;
∃ X . (X → A × X) × X ≅ ℕ → A

So we can redefine the nonnegative reals as just functions ℕ → ℕ. Neat! Following this, we can define zero instead as;
0 := λ n . λ z . λ s . z

the constant function which just returns 0 for any input. This has only about 6.9 bits! Not bad for representing infinitely many digits. It would actually be MORE complex if we truncated it to finitely many digits. We may even notice that the least complex real number will simply be encoded by the identity function. This will be the number whose continued fraction is [0, 1, 2, 3, 4, ...]. As it turns out, this number is
I₁(2)/I₀(2) ≈ 0.697775

where the Is are Bessel I functions. This is called the "Continued Fraction Constant". Or, if we were interpreting the number as representing the binary expansion, this would be
2 − ϑ₂(1/√2) / 2^(7/8) ≈ 0.358367

where ϑ₂ is an elliptic theta function. I don't know if this constant has a name; I couldn't find it anywhere. When using our previous map to turn this into the full positive reals, this becomes ≈ 0.558524.
Anyway, my whole point with this exercise was to show that we can represent real numbers, and many other mathematical structures besides, just fine on computers. We don't, and, in fact, shouldn't use floating-point numbers if we're going to take algorithmic complexity seriously. The original BDM paper mentions π a few times, saying, for instance,
the digits of π have been shown to be [...] only algorithmic in the way they are produced from any of the many known generating formulas
so the authors know that π (and presumably other real numbers), in all of its infinite digits, can be represented by an algorithm. But in this paper, they insist on using floating-point numbers. Why? The paper just says that the parameter space becomes enumerable, but we can effectively enumerate the (constructive) reals by enumerating all the inhabitants of the type
ℝ := (∀ X . X → (X → X) → X) → (∀ Y . Y → (Y → Y) → Y)

The output of such a procedure, if it were done in some breadth-first manner, would be encodings of real numbers in essentially algorithmic order.
I think the authors need a crash course in higher-type computability. There's a whole wide world that you're missing out on if you really believe computers can only represent discrete data types.
Final Thoughts

The rest of the paper just goes through some usage examples. I don't feel the need to summarize them, but you may find them interesting to look at. The cellular automata classifier was a particularly good illustration. In the "hybrid machine learning" section, an interesting suggestion is to use K as a regularization method on top of another loss function. The same section also suggests giving training weights to simpler samples from a given space so that a model prioritizes the most plausible samples. I don't know how effective these would be, but they're interesting suggestions which may be useful as additional tools in the ML toolbox.
The conclusion section refers to the whole framework as a "symbolic inference engine" being integrated into traditional ML. I... wouldn't phrase it that way. There's not much inference and even less symbology. That being said, a typesensitive version of these ideas might be better.
I'm not satisfied with the methods presented in this paper. Nothing in it connected back to that idea of "whichever addition increases the compressed size the least should be considered the likeliest prediction" I mentioned at the beginning of this post. All the paper said, really, was "when tuning parameters, just look through them in order from least to most complex. Also, use Kolmogorov complexity, not Shannon entropy, to measure complexity." I'll keep that in mind, but I think I'd want to look at other methods. I think something more carefully designed will need to be made for practical ML. I'll take a closer look at those evolutionary methods I mentioned.
I also suspect that some variation of a bandit algorithm may benefit from these ideas. The paper Information-gain computation and its follow-ups described a hypothetical system which uses a contextual bandit for exploring a logic-program-like search space. The expert that the bandit uses simply calculates the Kullback–Leibler divergence, a statistical entropy metric, of the history for each branch to give recommendations. The idea is that desirable histories should maximize the information gained about our goal as we go down them. A better system might use a K approximation rather than an entropy measure. Though, the paper also suggests using an RNN to do the history compression, which I'd expect to give a lower value than the entropy since it can exploit the temporal structure of the history for compression in a way that statistical compression can't.
...
THE END
Are aircraft carriers super vulnerable in a modern war?
It seems like aircraft carriers are a good candidate for something that only exists because it made sense in WWII and it looks the part of "impressive military asset", i.e. it's all larping at this point. A carrier seems vulnerable to attack relative to its huge cost because offense has an advantage over defense: can't an enemy send tons of cheap drone planes and drone submarines to first hunt for its location and then swarm-attack it?
Note: I don’t know anything about this subject. The ideal answerer is someone with domain knowledge who has good epistemology; that’s why I wasn’t satisfied with a Google search and I’m asking here.
I'm Voting For Ranked Choice, But I Don't Like It
This fall, Ranked Choice / Instant Runoff Voting (IRV) will be on the ballot in Massachusetts. I'm voting for it, but only because it's better than the status quo, not because I think it's a very good voting system.
Massachusetts currently uses traditional plurality ("first past the post") voting: whoever gets the most votes wins. Unfortunately, this only works well when you have two candidates. With more candidates, allied candidates tend to hurt each other by competing for the same pool of votes, making it more likely that an opponent wins.
In IRV each voter lists their preferred candidates in order, and if your first choice is eliminated then your vote goes to your next favorite. This mostly fixes the problem of minor spoiler candidates: anyone who is not a serious contender will get eliminated and their votes redistributed.
Unfortunately, IRV has major problems when you have more than two serious candidates. For example, even if there is a candidate that a majority of voters prefer to every other, they can still lose if their competitors happen to be eliminated in the wrong order. In Why Ranked Choice Voting Isn't Great I give examples of realistic situations in which IRV can give poor results.
While every voting method has cases it handles poorly, some are better than others. One attempt to compare them is called Voter Satisfaction Efficiency (more details). The idea is, you run a large number of simulations and see how different methods perform. It turns out that IRV does very poorly here, and if voters are highly strategic IRV does even worse than traditional plurality voting.
While I wish the voting method for us to consider were Approval (or maybe 3-2-1 or STAR), I do still think IRV is better than what we have today, and I'm planning on voting for it. One specific way in which IRV is an improvement is that its failings mostly don't benefit one type of party. This means that if we switch to IRV, and then as third-party candidates become stronger contenders we start to run into IRV's problems, it should be politically practical to switch to a better system. I do think there is some risk of setting back alternative voting systems in general by implementing an inferior version, but on balance I think the benefit of fixing the "minor spoiler" problem is likely larger.
Comment via: facebook
"Hammertime" Bug-Hunting Marathon. Week Three
Academic Meetup
Where is human level on text prediction? (GPTs task)
I look at graphs like these (from the GPT-3 paper), and I wonder where human-level is:
Gwern seems to have the answer here:
GPT-2-1.5b had a cross-entropy validation loss of ~3.3 (based on the perplexity of ~10 in Figure 4, and log2(10) = 3.32).
{display: tablerow} .mjxmtd {display: tablecell; textalign: center} .mjxlabel {display: tablerow} .mjxbox {display: inlineblock} .mjxblock {display: block} .mjxspan {display: inline} .mjxchar {display: block; whitespace: pre} .mjxitable {display: inlinetable; width: auto} .mjxrow {display: tablerow} .mjxcell {display: tablecell} .mjxtable {display: table; width: 100%} .mjxline {display: block; height: 0} .mjxstrut {width: 0; paddingtop: 1em} .mjxvsize {width: 0} .MJXcspace1 {marginleft: .167em} .MJXcspace2 {marginleft: .222em} .MJXcspace3 {marginleft: .278em} .mjxtest.mjxtestdisplay {display: table!important} .mjxtest.mjxtestinline {display: inline!important; marginright: 1px} .mjxtest.mjxtestdefault {display: block!important; clear: both} .mjxexbox {display: inlineblock!important; position: absolute; overflow: hidden; minheight: 0; maxheight: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjxtestinline .mjxleftbox {display: inlineblock; width: 0; float: left} .mjxtestinline .mjxrightbox {display: inlineblock; width: 0; float: right} .mjxtestdisplay .mjxrightbox {display: tablecell!important; width: 10000em!important; minwidth: 0; maxwidth: none; padding: 0; border: 0; margin: 0} .MJXcTeXunknownR {fontfamily: monospace; fontstyle: normal; fontweight: normal} .MJXcTeXunknownI {fontfamily: monospace; fontstyle: italic; fontweight: normal} .MJXcTeXunknownB {fontfamily: monospace; fontstyle: normal; fontweight: bold} .MJXcTeXunknownBI {fontfamily: monospace; fontstyle: italic; fontweight: bold} .MJXcTeXamsR {fontfamily: MJXcTeXamsR,MJXcTeXamsRw} .MJXcTeXcalB {fontfamily: MJXcTeXcalB,MJXcTeXcalBx,MJXcTeXcalBw} .MJXcTeXfrakR {fontfamily: MJXcTeXfrakR,MJXcTeXfrakRw} .MJXcTeXfrakB {fontfamily: MJXcTeXfrakB,MJXcTeXfrakBx,MJXcTeXfrakBw} .MJXcTeXmathBI {fontfamily: MJXcTeXmathBI,MJXcTeXmathBIx,MJXcTeXmathBIw} .MJXcTeXsansR {fontfamily: MJXcTeXsansR,MJXcTeXsansRw} .MJXcTeXsansB {fontfamily: MJXcTeXsansB,MJXcTeXsansBx,MJXcTeXsansBw} .MJXcTeXsansI 
{fontfamily: MJXcTeXsansI,MJXcTeXsansIx,MJXcTeXsansIw} .MJXcTeXscriptR {fontfamily: MJXcTeXscriptR,MJXcTeXscriptRw} .MJXcTeXtypeR {fontfamily: MJXcTeXtypeR,MJXcTeXtypeRw} .MJXcTeXcalR {fontfamily: MJXcTeXcalR,MJXcTeXcalRw} .MJXcTeXmainB {fontfamily: MJXcTeXmainB,MJXcTeXmainBx,MJXcTeXmainBw} .MJXcTeXmainI {fontfamily: MJXcTeXmainI,MJXcTeXmainIx,MJXcTeXmainIw} .MJXcTeXmainR {fontfamily: MJXcTeXmainR,MJXcTeXmainRw} .MJXcTeXmathI {fontfamily: MJXcTeXmathI,MJXcTeXmathIx,MJXcTeXmathIw} .MJXcTeXsize1R {fontfamily: MJXcTeXsize1R,MJXcTeXsize1Rw} .MJXcTeXsize2R {fontfamily: MJXcTeXsize2R,MJXcTeXsize2Rw} .MJXcTeXsize3R {fontfamily: MJXcTeXsize3R,MJXcTeXsize3Rw} .MJXcTeXsize4R {fontfamily: MJXcTeXsize4R,MJXcTeXsize4Rw} .MJXcTeXvecR {fontfamily: MJXcTeXvecR,MJXcTeXvecRw} .MJXcTeXvecB {fontfamily: MJXcTeXvecB,MJXcTeXvecBx,MJXcTeXvecBw} @fontface {fontfamily: MJXcTeXamsR; src: local('MathJax_AMS'), local('MathJax_AMSRegular')} @fontface {fontfamily: MJXcTeXamsRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_AMSRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_AMSRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_AMSRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXcalB; src: local('MathJax_Caligraphic Bold'), local('MathJax_CaligraphicBold')} @fontface {fontfamily: MJXcTeXcalBx; src: local('MathJax_Caligraphic'); fontweight: bold} @fontface {fontfamily: MJXcTeXcalBw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_CaligraphicBold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_CaligraphicBold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_CaligraphicBold.otf') format('opentype')} @fontface {fontfamily: 
MJXcTeXfrakR; src: local('MathJax_Fraktur'), local('MathJax_FrakturRegular')} @fontface {fontfamily: MJXcTeXfrakRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_FrakturRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_FrakturRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_FrakturRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXfrakB; src: local('MathJax_Fraktur Bold'), local('MathJax_FrakturBold')} @fontface {fontfamily: MJXcTeXfrakBx; src: local('MathJax_Fraktur'); fontweight: bold} @fontface {fontfamily: MJXcTeXfrakBw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_FrakturBold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_FrakturBold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_FrakturBold.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmathBI; src: local('MathJax_Math BoldItalic'), local('MathJax_MathBoldItalic')} @fontface {fontfamily: MJXcTeXmathBIx; src: local('MathJax_Math'); fontweight: bold; fontstyle: italic} @fontface {fontfamily: MJXcTeXmathBIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MathBoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MathBoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MathBoldItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsansR; src: local('MathJax_SansSerif'), local('MathJax_SansSerifRegular')} @fontface {fontfamily: MJXcTeXsansRw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_SansSerifRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_SansSerifRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_SansSerifRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsansB; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerifBold')} @fontface {fontfamily: MJXcTeXsansBx; src: local('MathJax_SansSerif'); fontweight: bold} @fontface {fontfamily: MJXcTeXsansBw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_SansSerifBold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_SansSerifBold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_SansSerifBold.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsansI; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerifItalic')} @fontface {fontfamily: MJXcTeXsansIx; src: local('MathJax_SansSerif'); fontstyle: italic} @fontface {fontfamily: MJXcTeXsansIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_SansSerifItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_SansSerifItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_SansSerifItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXscriptR; src: local('MathJax_Script'), local('MathJax_ScriptRegular')} @fontface {fontfamily: MJXcTeXscriptRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_ScriptRegular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_ScriptRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_ScriptRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXtypeR; src: local('MathJax_Typewriter'), local('MathJax_TypewriterRegular')} @fontface {fontfamily: MJXcTeXtypeRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_TypewriterRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_TypewriterRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_TypewriterRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXcalR; src: local('MathJax_Caligraphic'), local('MathJax_CaligraphicRegular')} @fontface {fontfamily: MJXcTeXcalRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_CaligraphicRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_CaligraphicRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_CaligraphicRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmainB; src: local('MathJax_Main Bold'), local('MathJax_MainBold')} @fontface {fontfamily: MJXcTeXmainBx; src: local('MathJax_Main'); fontweight: bold} @fontface {fontfamily: MJXcTeXmainBw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainBold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainBold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainBold.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmainI; src: local('MathJax_Main 
Italic'), local('MathJax_MainItalic')} @fontface {fontfamily: MJXcTeXmainIx; src: local('MathJax_Main'); fontstyle: italic} @fontface {fontfamily: MJXcTeXmainIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmainR; src: local('MathJax_Main'), local('MathJax_MainRegular')} @fontface {fontfamily: MJXcTeXmainRw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MainRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MainRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MainRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXmathI; src: local('MathJax_Math Italic'), local('MathJax_MathItalic')} @fontface {fontfamily: MJXcTeXmathIx; src: local('MathJax_Math'); fontstyle: italic} @fontface {fontfamily: MJXcTeXmathIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_MathItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_MathItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_MathItalic.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize1R; src: local('MathJax_Size1'), local('MathJax_Size1Regular')} @fontface {fontfamily: MJXcTeXsize1Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size1Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size1Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size1Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize2R; src: local('MathJax_Size2'), local('MathJax_Size2Regular')} @fontface {fontfamily: MJXcTeXsize2Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size2Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size2Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size2Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize3R; src: local('MathJax_Size3'), local('MathJax_Size3Regular')} @fontface {fontfamily: MJXcTeXsize3Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size3Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size3Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size3Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXsize4R; src: local('MathJax_Size4'), local('MathJax_Size4Regular')} @fontface {fontfamily: MJXcTeXsize4Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_Size4Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_Size4Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_Size4Regular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXvecR; src: local('MathJax_Vector'), local('MathJax_VectorRegular')} @fontface {fontfamily: MJXcTeXvecRw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_VectorRegular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_VectorRegular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_VectorRegular.otf') format('opentype')} @fontface {fontfamily: MJXcTeXvecB; src: local('MathJax_Vector Bold'), local('MathJax_VectorBold')} @fontface {fontfamily: MJXcTeXvecBx; src: local('MathJax_Vector'); fontweight: bold} @fontface {fontfamily: MJXcTeXvecBw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/eot/MathJax_VectorBold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/woff/MathJax_VectorBold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTMLCSS/TeX/otf/MathJax_VectorBold.otf') format('opentype')} ). GPT3 halved that loss to ~1.73 judging from Brown et al 2020 and using the scaling formula (2.57⋅(3.64⋅103)−0.048). For a hypothetical GPT4, if the scaling curve continues for another 3 orders or so of compute (100–1000×) before crossing over and hitting harder diminishing returns, the crossentropy loss will drop, using to ~1.24 (2.57⋅(3.64⋅106)−0.048).
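The arithmetic in the quote above is easy to check directly. A minimal sketch, assuming the quoted power-law fit L(C) = 2.57 · C^(−0.048) with compute C measured in the units Gwern uses (where GPT-3 sits at ~3.64·10³):

```python
import math

def loss_from_perplexity(ppl):
    # Cross-entropy in bits per token is log base 2 of perplexity.
    return math.log2(ppl)

def scaling_loss(compute, a=2.57, b=-0.048):
    # Quoted power-law fit: L(C) = a * C**b.
    return a * compute ** b

print(round(loss_from_perplexity(10), 2))  # GPT-2's ~3.32 bits
print(round(scaling_loss(3.64e3), 2))      # GPT-3's ~1.73
print(round(scaling_loss(3.64e6), 2))      # hypothetical GPT-4: ~1.24 after +3 orders of compute
```

Note how flat the exponent is: three extra orders of magnitude of compute only moves the loss from ~1.73 to ~1.24, which is the "data hungry" scaling behavior the post is arguing about.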
If GPT-3 gained so much meta-learning and world knowledge by dropping its absolute loss ~50% when starting from GPT-2's near-human level, what capabilities would another ~30% improvement over GPT-3 gain? What would a drop to ≤1, perhaps using wider context windows or recurrency, gain?
So, am I right in thinking that if someone took random internet text and fed it to me word by word and asked me to predict the next word, I'd do about as well as GPT-2 and significantly worse than GPT-3? If so, this actually lengthens my timelines a bit.
(Thanks to Alexander Lyzhov for answering this question in conversation)