## Вы здесь

# Сборщик RSS-лент

### Ненасильственное общение. Тренировка

### Уличная эпистемология. Тренировка

### Рациональное додзё. Факторизация целей

### Preparing for coding interviews

### Большая открытая встреча

### Prokaryote Multiverse. An argument that potential simulators do not have significantly more complex physics than ours

Definitions

"Universe" can no longer be said to mean "everything", such a definition wouldn't be able to explain the existence of the word "multiverse". I define universe as a region of existence that, from the inside, is difficult to see beyond.

I define "Multiverse" as: Everything, with a connoted reminder; "everything" can be presumed to be much larger and weirder than "everything that you have seen or heard of".

What this argument is for

**This argument disproves the simulation argument for simulators hailing from universes much more complex than our own**. Complex physics suffice much much more powerful computers (I leave proving this point as an exercise to the reader). If we had to guess what our simulators might look like, our imagination might go first to universes where simulating an entire pocket universe like ours is easy, universes which are as we are to flatland or to conway's game of life. We might imagine universes with more spacial dimensions or forces that we lack.

I will argue that this would be vanishingly unlikely.

**This argument does not refute the common bounded simulation argument of simple universes** (which includes ancestor simulations). It does carve it down a bit. It seems to be something that, if true, would be useful to know.

The argument

The first fork of the argument is that a more intricate machine is much less likely to generate an interesting output.

Life needs an interesting output. Life needs a very even combination of possibility, stability, and randomness. The more variables you add to the equation, the smaller the hospitable region within the configuration space. The hospitable configuration-region within our own physics appears to be tiny (wikipedia, anthropic coincidences) (and I'm sure it is much tinier than is evidenced there). The more variables a machine has to align before it can support life, the more vanishingly small the cradle will be within that machine's spaces.

The second fork of the argument is that complex physics are simply the defining feature of a theory that fails kolmogorov's razor (our favoured formalisation of occam's razor).

If we are to define some prior distribution over what exists, out beyond what we can see, kolmogorov complexity seems like a sensible metric to use. A universe generated by a small machine is much more likely a-priori - perhaps we should assume it occurs with much greater frequency - than a universe that can only be generated by a large machine.

If you have faith in solomonoff induction, you must assign lower measure to complex universes even before you consider those universes' propensity to spawn life.

I claim that one large metaphysical number will be outweighed by another large metaphysical number. I propose that the maximum number of simple simulated universes that could be hosted within a supercomplex universe is unlikely to outnumber the natural instances of simple universes that lay about in the multiverse's bulk.

Discuss

### Neural Nets in Python 1

Introduction

This post is an attempt to explain how to write a neural network in Python using numpy. I am obviously not the first person to do this. Almost all of the code is here adapted from Michael Nielsen's fantastic online book Neural Networks and Deep Learning . Victor Zhou also has a great tutorial in Python . Why am I trying to do the same? Partially, it's for my own benefit, cataloging my code so I can refer back to it later in a form more captivating than a mere docstring. Also partially, I think I can share a few intuitions which make the backpropagation equations a lot easier to derive.

Okay, so here's a typical picture of a neural network:

If you're new to all this: A neural network is a function that takes in an input vector (or matrix) and outputs another vector (or matrix). The input starts at the leftmost vertical layer of nodes and then gets transformed, via a series of operations, to the rightmost vertical layer of nodes. Each layer is a linear combination of the layer before it, followed by an activation function, which is applied to each node. In other words, a neural net is parameterized by a series of weight matrices .mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0}
.MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0}
.mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table}
.mjx-full-width {text-align: center; display: table-cell!important; width: 10000em}
.mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0}
.mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left}
.mjx-numerator {display: block; text-align: center}
.mjx-denominator {display: block; text-align: center}
.MJXc-stacked {height: 0; position: relative}
.MJXc-stacked > * {position: absolute}
.MJXc-bevelled > * {display: inline-block}
.mjx-stack {display: inline-block}
.mjx-op {display: block}
.mjx-under {display: table-cell}
.mjx-over {display: block}
.mjx-over > * {padding-left: 0px!important; padding-right: 0px!important}
.mjx-under > * {padding-left: 0px!important; padding-right: 0px!important}
.mjx-stack > .mjx-sup {display: block}
.mjx-stack > .mjx-sub {display: block}
.mjx-prestack > .mjx-presup {display: block}
.mjx-prestack > .mjx-presub {display: block}
.mjx-delim-h > .mjx-char {display: inline-block}
.mjx-surd {vertical-align: top}
.mjx-mphantom * {visibility: hidden}
.mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%}
.mjx-annotation-xml {line-height: normal}
.mjx-menclose > svg {fill: none; stroke: currentColor}
.mjx-mtr {display: table-row}
.mjx-mlabeledtr {display: table-row}
.mjx-mtd {display: table-cell; text-align: center}
.mjx-label {display: table-row}
.mjx-box {display: inline-block}
.mjx-block {display: block}
.mjx-span {display: inline}
.mjx-char {display: block; white-space: pre}
.mjx-itable {display: inline-table; width: auto}
.mjx-row {display: table-row}
.mjx-cell {display: table-cell}
.mjx-table {display: table; width: 100%}
.mjx-line {display: block; height: 0}
.mjx-strut {width: 0; padding-top: 1em}
.mjx-vsize {width: 0}
.MJXc-space1 {margin-left: .167em}
.MJXc-space2 {margin-left: .222em}
.MJXc-space3 {margin-left: .278em}
.mjx-test.mjx-test-display {display: table!important}
.mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px}
.mjx-test.mjx-test-default {display: block!important; clear: both}
.mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex}
.mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left}
.mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right}
.mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0}
.MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal}
.MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal}
.MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold}
.MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold}
.MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw}
.MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw}
.MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw}
.MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw}
.MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw}
.MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw}
.MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw}
.MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw}
.MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw}
.MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw}
.MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw}
.MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw}
.MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw}
.MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw}
.MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw}
.MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw}
.MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw}
.MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw}
.MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw}
.MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw}
.MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw}
@font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')}
@font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')}
@font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold}
@font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')}
@font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')}
@font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold}
@font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')}
@font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic}
@font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')}
@font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')}
@font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold}
@font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')}
@font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic}
@font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')}
@font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')}
@font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')}
@font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')}
@font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold}
@font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')}
@font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic}
@font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')}
@font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')}
@font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic}
@font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')}
@font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')}
@font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')}
@font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')}
@font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')}
@font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')}
@font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold}
@font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}
W1,W2,.., a series of bias vectors, b1,b2,..., and an activation function *a* (typically a nonlinear function like tanh(x) which is applied element-wise).

The typical picture, while good for representing the general idea of a neural net, does not do a good job of showing the different operations being performed. I prefer representing a neural net as a computational graph, like below:

Here, it's clearer to see how each node is a function of the step before it. A normal three-layer neural network is given by the following composition of functions:

f0=X=input

f1=W1⋅f0+b1

f2=a(f1)

f3=W2⋅f2+b2

f4=a(f3)=^Y=predicted output

This recursive definition will make it easy to derive the backpropagation algorithm, which we'll use to train our network. It also allows us to easily unroll the function, if we want to see what's going on in on line, by substituting until we get back to the input:

f4=a(W2⋅(a(W1⋅f0+b1)+b2))

And of course, if our neural network has more than three layers, we just add more recursively defined functions.

Forward Pass

The process of taking an input and going through the process of matrix multiplications, vector additions, and activation functions to get the output is referred to as the **forward pass**. To get started, let's write a neural net class that can perform a forward pass, given the dimensions for each layer and an activation function:

```python

import numpy as np

class NN():

def __init__(self, sizes, activation):

self.num_layers = len(sizes)

self.sizes = sizes

self.x_dim = sizes[0]

self.b = [np.random.randn(1, y) for y in sizes[1:]]

self.w = [np.random.randn(x, y)

for x, y in zip(sizes[: -1], sizes[1:])]

self.activ = activation

def forward_pass(self, input, classify=False):

if (input.ndim == 1):

input = input.reshape(1, -1)

for i in range(self.num_layers-1):

input = self.activ.fn(np.dot(input, self.w[i]) + self.b[i])

return input

```

A quick explanation: our class takes in an array of layer sizes and creates appropriate weight matrices with random values from 0 to 1. EX: [20, 50, 10] would result in weight matrices of dimensions 20×50 and 50×10. For the forward_pass function, we can see from the computational graph that each matrix multiplication (and vector addition) is followed by an application of the activation function, so we can simply recurse until we go through all of our weight matrices.

Loss FunctionSo far, I haven't explained how the neural net is supposed to actually work. Say we have some data and their associated target values (EX: different measurements of divers and how long they can hold their breath). Using the above code, even if we get the dimensions of the input/output right, our forward pass is going to give us garbage results.

This is because we randomly initialized our weights and biases. We don't want *any* set of weights and biases, but a "good" set of weights and biases. In doing so, we now need to define what we mean by "good". At the very least, it seems that a good set of weights and biases should lead to predicted values which are close to the associated target values, for most of the data we have.

This is where loss functions come in. They take in as input our predicted value and the true value and output a measure of just how far apart the two values are. There are many functions we could choose to measure the distance between Y and ^Y. For ease of explanation, we'll go with L2 norm of their difference, i.e. the sum of the squares of their differences.

Now, after taking a forward pass, we can use our loss function to tell us just how far away our predicted value is from the true value. It's easy to add this change to our class:

```python

class NN():

def __init__(self, sizes, activation, loss):

self.loss = loss

'''

The rest is unchanged

'''

'''

predict is unchanged

'''

def eval_loss(self, input, output):

return self.loss.fn(self.predict(input), output)

```

We can easily integrate this into our model by adding it to our computational graph:

Now we've added one more step:

f0=X=input

f1=W1⋅f0+b1

f2=a(f1)

f3=W2⋅f2+b2

f4=a(f3)=^Y=predicted output$

f5=∥f4−Y∥2

Now that we have our loss function defined, we can begin the work of actually optimizing our network because we have an answer to the question, "Optimizing with respect to what?"

Recall that our network is parameterized by a set of weights and biases. (There's also the activation function, but that's more of a fixed thing we can't really fine-tune.)

Backpropagation allows us to figure out how much each weight and bias is responsible for the loss function. We do this by taking partial derivatives of the loss function with respect to each weight matrix and bias vector. Given that a neural net is just a big composite function, we'll be using the Chain Rule a lot. This is where the recursive notation shines. It's much easier to have a placeholder like $f_3$ than a big clump of nested parentheses.

The reason we are taking partial derivatives at all is because they'll allow us to perform iterative optimization, e.g. gradient descent, on our network, which is how the "training" happens.

We'll start with the biases b1,b2,... first:

First, let's find ∂f5∂b2

From above, we've already defined f6 to be the loss function applied after a forward pass, so that's why we're taking the partial derivative of f5 with respect to b2. Note below that $a'$ is the derivative of the activation function.

∂f5∂b2=∂f5∂f4∂f4∂b2=2(^Y−Y)∂f4∂b2

∂f4∂b2=∂f4∂f3∂f3∂b2=a′(f3)∂f3∂b2

∂f3∂b2=1

Thus, ∂f5∂b2=2(^Y−Y)a′(f3).

Next, let's find ∂f5∂b1

(Below, I've omitted the intermediary step of showing ∂g∂x=∂g∂f∂f∂x.)

∂f5∂b1=2(^Y−Y)∂f4∂b1

∂f4∂b1=a′(f3)∂f3∂b1

∂f3∂b1=W2⋅∂f2∂b1

∂f2∂b1=a′(f1)∂f1∂b1

∂f1∂b1=1

Thus ∂f5∂b1=2(^Y−Y)a′(f3)⋅W2⋅a′(f1).

Before we go any further, there are a two useful things to notice:

1. The loss function's derivative (in this case, 2(^Y−Y)) will always be the first term in the partial derivative of the loss with respect to any weight or bias.

2. The partial derivatives of the bias vectors is recursively defined. ∂L∂bn−1=∂L∂bn⋅Wn⋅a′(zn−1) where zc is defined to be the result of Wc⋅f2c−2+bc. In other words, zcis the result of multiplying the previous layer by the cth weight matrix and adding the cth bias vector. We let L represent the general loss function, applied after an arbitrary number of layers.

Let's do the weight matrices W1,W2,... next:

First, let's find ∂f5∂W2

∂f5∂W2=2(^Y−Y)∂f4∂W2

∂f4∂W2=a′(f3)∂f3∂W2

∂f3∂W2=f2=a(f1)

Thus, ∂f5∂W2=2(^Y−Y)a′(f3)f2=∂f5∂b1a(f1).

Now we find ∂f5∂W1

∂f5∂W1=2(^Y−Y)∂f4∂W1

∂f4∂W1=a′(f3)∂f3∂W1

∂f3∂W1=W2⋅∂f2∂W1

∂f2∂W1=a′(f1)∂f1∂W1

∂f1∂W1=f0=X

Thus ∂f5∂W1=2(^Y−Y)a′(f3)⋅W2⋅a′(f1)f0=∂f5∂b1f0

Here, in both partial derivatives, we see something useful: The partial derivative of the loss function with respect to a weight matrix can be calculated in part using the partial derivative of the loss function with respect to the bias vector in the same layer. The extra term we need is the activation function applied element-wise to the layer before it.

In other words: $\frac{\partial \text{L}}{\partial W_n} = \frac{\partial \text{L}}{\partial b_n} a(z_{n-1})$.

Thus, as long as we store both the results of $z_c$ and $a(z_c)$ during a forward pass operation, we'll have most of the information we need to calculate the partial derivatives.

We're now ready to write the code:

```python

class NN:

def backprop(self, x, y):

z = []

activations = [x]

for i in range(self.num_layers-1):

x = np.dot(x, self.w[i]) + self.b[i]

z.append(x)

x = self.activ.fn(x)

activations.append(x)

```

To start with we perform a forward pass. Along the way, we store the results in activations and z. One small caveat: we start with x in activations as well because our recursive definition bottoms out at the input value, so we need for the gradients at the first layer.

Now, we go backwards and recursively calculate our gradients:

```python

class NN:

def backprop(self, train_x, train_Y):

'''

same as above

'''

deltas = []

b_grad = []

w_grad = []

for i in range(len(z)):

if i == 0:

delta = self.loss.deriv(activations[-1], y

)*self.activ.deriv(z[-1])

deltas.append(delta)

if i != 0:

deltas.append(np.dot(deltas[i-1], self.w[-

i].T)*self.activ.deriv(z[-i-1]))

b_grad.append(np.sum(deltas[i], axis=0))

w_grad.append(np.dot(activations[-2-i].T, deltas[i]))

return w_grad,b_grad

```

deltas is a list holding $\frac{\partial L}{\partial b_c}$ values. The first case handles the derivative of the loss function. We pass it in activations[-1] which represents the output of our neural net (as it's the activation of the last layer) and multiply it by activ.deriv, the derivative of the activation function (which we assume we've defined elsewhere).

Otherwise, we follow the recursive formula from earlier and multiply the previous delta value by the next weight matrix and we multiply it by a′of the next znvalue. To get the ∂L∂bc value, we simply take the current value of delta (and sum up if our input was a matrix rather than a vector). To get the ∂L∂Wc value, we follow the recursive formula and perform one more matrix multiplication (we index activations`by [-2-i] because we added `x as an extra value when starting out).

And we're done! We've now calculated the partial derivatives for all the weights and biases. Next time, we'll dive into different optimization methods and go over how to put these gradients to use.

Discuss

### Inspection Paradox as a Driver of Group Separation

*Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?*-- George Carlin

The Inspection Paradox, where the reported results are heavily observer-dependent, has been mentioned here a couple of times before:

https://www.lesswrong.com/posts/tqEAiScpz7g2xLHtE/the-inspection-paradox-is-everywhere

https://www.lesswrong.com/posts/HW2fbbGM8B6y7pkDb/the-just-world-hypothesis#8r2uaA6Mb25k2vF9Z

An example of it that is familiar to everyone is that, when driving with an average speed, you see all other cars separating into two categories: slow drivers and fast drivers, because you naturally encounter more cars who are faster or slower than you are, and none that move at the same speed. So a normal distribution of speeds becomes bimodal, from something like this:

to something like this:

Another familiar example, from the newer post by the original author, is that for an average Facebook user,

Your friends have more Facebook friends than you do.Because, naturally, members with more connections are more likely to have a connection with you, among others.

(For the record, the inspection paradox can be classified as subset of the sampling bias, which, in turn, is a form of the oft discussed selection bias.)

But back to the apparent multi-modality. You encounter is whenever unusual events have a higher availability, sometimes properly measured like in the average driver case above, and sometimes perceived, like in the availability heuristic. This doesn't have to be about the frequency of the observations, it could be about their emotional impact. If the more "out there" the observations are, the more they affect you, then their distribution would appear bimodal, and you might instinctively recoil from them, and seek the comfort of the in-group. In this case the inspection-induced multi-modality may turn into an actual one, a case of perception becoming reality. (I had attempted to model it numerically some months ago, in two companion posts, sadly not written well enough to attract much interest, but showing that the effect described may well be real.)

Discuss

### Problems in AI Alignment that philosophers could potentially contribute to

(This was originally a comment that I wrote as a follow up to my question for William MacAskill's AMA. I'm moving it since it's perhaps more on-topic here.)

It occurs to me that another reason for the lack of engagement by people with philosophy backgrounds may be that philosophers aren't aware of the many philosophical problems in AI alignment that they could potentially contribute to. So here's a list of philosophical problems that have come up just in my own thinking about AI alignment.

- Decision theory for AI / AI designers
- How to resolve standard debates in decision theory?
- Logical counterfactuals
- Open source game theory
- Acausal game theory / reasoning about distant superintelligences

- Infinite/multiversal/astronomical ethics
- Should we (or our AI) care much more about a universe that is capable of doing a lot more computations?
- What kinds of (e.g. spatial-temporal) discounting is necessary and/or desirable?

- Fair distribution of benefits
- How should benefits from AGI be distributed?
- For example, would it be fair to distribute it equally over all humans who currently exist, or according to how much AI services they can afford to buy?
- What about people who existed or will exist at other times and in other places or universes?

- Need for "metaphilosophical paternalism"?
- However we distribute the benefits, if we let the beneficiaries decide what to do with their windfall using their own philosophical faculties, is that likely to lead to a good outcome?

- Metaphilosophy
- What is the nature of philosophy?
- What constitutes correct philosophical reasoning?
- How to specify this into an AI design?

- Philosophical forecasting
- How are various AI technologies and AI safety proposals likely to affect future philosophical progress (relative to other kinds of progress)?

- Preference aggregation between AIs and between users
- How should two AIs that want to merge with each other aggregate their preferences?
- How should an AI aggregate preferences between its users?

- Normativity for AI / AI designers
- What is the nature of normativity? Do we need to make sure an AGI has a sufficient understanding of this?

- Metaethical policing
- What are the implicit metaethical assumptions in a given AI alignment proposal (in case the authors didn't spell them out)?
- What are the implications of an AI design or alignment proposal under different metaethical assumptions?
- Encouraging designs that make minimal metaethical assumptions or is likely to lead to good outcomes regardless of which metaethical theory turns out to be true.
- (Nowadays AI alignment researchers seem to be generally good about not placing too much confidence in their own moral theories, but the same can't always be said to be true with regard to their metaethical ideas.)

Discuss

### Встреча YC Startup School

### How can you use music to boost learning?

I often find that I am able to appreciate the beauty of a subject more while listening to music (especially instrumental music). Hearing the notes while I think about the topic helps creates a lot of subconscious connections with the material, solidifying what I am learning as a distinct set of memories. In general I think that associations with powerful sensory experience is just a good way to remember things and learn.

However, I have also heard that listening to music can distract you when you are trying to do deep work. Apparently, there is some research on this, but I have barely scratched the surface of the literature, and I wouldn't know where to start.

Is there an optimal way to use music to learn? Should I employ certain strategies, like putting on the music only after I've read something, so that I can think about what I just read while music floats through my consciousness?

Discuss

### A Primer on Matrix Calculus, Part 3: The Chain Rule

This post concludes the subsequence on matrix calculus. Here, I will focus on an exploration of the chain rule as it's used for training neural networks. I initially planned to include Hessians, but perhaps for that we will have to wait.

Deep learning has two parts: *deep* and *learning.* The *deep *part refers to the fact that we are composing simple functions to form a complex function. In other words, in order to perform a task, we are mapping some input .mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0}
.MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0}
.mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table}
.mjx-full-width {text-align: center; display: table-cell!important; width: 10000em}
.mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0}
.mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left}
.mjx-numerator {display: block; text-align: center}
.mjx-denominator {display: block; text-align: center}
.MJXc-stacked {height: 0; position: relative}
.MJXc-stacked > * {position: absolute}
.MJXc-bevelled > * {display: inline-block}
.mjx-stack {display: inline-block}
.mjx-op {display: block}
.mjx-under {display: table-cell}
.mjx-over {display: block}
.mjx-over > * {padding-left: 0px!important; padding-right: 0px!important}
.mjx-under > * {padding-left: 0px!important; padding-right: 0px!important}
.mjx-stack > .mjx-sup {display: block}
.mjx-stack > .mjx-sub {display: block}
.mjx-prestack > .mjx-presup {display: block}
.mjx-prestack > .mjx-presub {display: block}
.mjx-delim-h > .mjx-char {display: inline-block}
.mjx-surd {vertical-align: top}
.mjx-mphantom * {visibility: hidden}
.mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%}
.mjx-annotation-xml {line-height: normal}
.mjx-menclose > svg {fill: none; stroke: currentColor}
.mjx-mtr {display: table-row}
.mjx-mlabeledtr {display: table-row}
.mjx-mtd {display: table-cell; text-align: center}
.mjx-label {display: table-row}
.mjx-box {display: inline-block}
.mjx-block {display: block}
.mjx-span {display: inline}
.mjx-char {display: block; white-space: pre}
.mjx-itable {display: inline-table; width: auto}
.mjx-row {display: table-row}
.mjx-cell {display: table-cell}
.mjx-table {display: table; width: 100%}
.mjx-line {display: block; height: 0}
.mjx-strut {width: 0; padding-top: 1em}
.mjx-vsize {width: 0}
.MJXc-space1 {margin-left: .167em}
.MJXc-space2 {margin-left: .222em}
.MJXc-space3 {margin-left: .278em}
.mjx-test.mjx-test-display {display: table!important}
.mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px}
.mjx-test.mjx-test-default {display: block!important; clear: both}
.mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex}
.mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left}
.mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right}
.mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0}
.MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal}
.MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal}
.MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold}
.MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold}
.MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw}
.MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw}
.MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw}
.MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw}
.MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw}
.MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw}
.MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw}
.MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw}
.MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw}
.MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw}
.MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw}
.MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw}
.MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw}
.MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw}
.MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw}
.MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw}
.MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw}
.MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw}
.MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw}
.MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw}
.MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw}
@font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')}
@font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')}
@font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold}
@font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')}
@font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')}
@font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold}
@font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')}
@font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic}
@font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')}
@font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')}
@font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold}
@font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')}
@font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic}
@font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')}
@font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')}
@font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')}
@font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')}
@font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold}
@font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')}
@font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic}
@font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')}
@font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')}
@font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic}
@font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')}
@font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')}
@font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')}
@font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')}
@font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')}
@font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')}
@font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold}
@font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}
x to an output y using some long nested expression, like y=f1(f2(f3(x))). The *learning *part refers to the fact that we are allowing the properties of the function to be set automatically via an iterative process like gradient descent.

Conceptually, combining these two parts is easy. What's hard is making the whole thing efficient so that we can get our neural networks to actually train on real world data. That's where the backpropagation enters the picture.

*Backpropagation *is simply a technique to train neural networks by efficiently using the chain rule to calculate the partial derivatives of each parameter. However, backpropagation is notoriously a pain to deal with. These days, modern deep learning libraries provide tools for automatic differentiation, which allow the computer to automatically perform this calculus in the background. However, while this might be great for practitioners of deep learning, here we primarily want to understand the notation as it would be written on paper.1 Plus, if we were writing our own library, we'd want to know what's happening in the background.

What I have discovered is that, despite my initial fear of backpropagation, it is actually pretty simple to follow if you just understand the notation. Unfortunately, the notation can get a bit difficult to deal with (and was a pain to write out in Latex).

We start by describing the single variable chain rule. This is simply ddxf(g(x))=f′(g(x))g′(x). But if we write it this way, then it's in an opaque notation and hides which variables we are taking the derivative with respect to. Alternatively we can write the rule in a way that makes it more obvious what we are doing: ddxf(g(x))=dfdgdgdx, where g is meant as shorthand for g(x). This way it is intuitively clear that we can cancel the fractions on the bottom, and this reduces to dfdx, as desired.

It turns out, that for a function f:Rn→Rm and g:Rk→Rn, the chain rule can be written as ∂∂xf(g(x))=∂f∂g∂g∂x where ∂f∂g is the Jacobian of f with respect to g.

Isn't that neat. Our understanding of Jacobians has now well paid off. Not only do we have an intuitive understanding of the Jacobian, we can now formulate the vector chain rule using a compact notation — one that matches the single variable case perfectly.2

However, in order to truly understand backpropagation, we must go beyond mere Jacobians. In order to work with neural networks, we need to introduce the *generalized Jacobian. *If the Jacobian from yesterday was spooky enough already, I recommend reading no further. Alternatively if you want to be able to truly understand how to train a neural network, read at your own peril.

First, a vector can be seen as a list of numbers, and a matrix can be seen as an ordered list of vectors. An ordered list of matrices is... a tensor of order 3. Well not exactly. Apparently some people are actually disappointed with the term tensor because a tensor means something very specific in mathematics already and isn't *just *an ordered list of matrices.3 But whatever, that's the term we're using for this blog post at least.

As you can probably guess, a list of tensors of order n is a tensor of order n+1. We can simply represent tensors in code using multidimensional arrays. In the case of the Jacobian, we were taking the derivative of functions between two vector spaces, Rn and Rm. When we are considering mapping from a space of tensors of order n to a space of tensors of order m, we denote the relationship y=f(x) as between the spaces R(M1×M2×...×Mn)→R(M1×M2×...×Mm).

The generalized Jacobian J between these two spaces is an object with shape (M1×M2×...×Mn)×(N1×N2×...×Nm). We can think of this object as a generalization of the matrix, where each row is a tensor with the same shape as the tensor y and each column has the same shape as the tensor x. The intuitive way to understand the generalized Jacobian is that we can index J with vectors →i and →j. At each index in J we find the partial derivative between the variables y→i and x→j, which are scalar variables located in the tensors y and x.

Formulating the chain rule using the generalized Jacobian yields the same equation as before: for z=f(y) and y=g(x), ∂z∂x=∂z∂y∂y∂x. The only difference this time is that ∂z∂xhas the shape (K1×...×KDz)×(M1×...×MDx) which is itself formed by the result of a generalized matrix multiplication between the two generalized matrices, ∂z∂y and ∂y∂x. The rules for this generalized matrix multiplication is similar to regular matrix multiplication, and is given by the formula:

(∂z∂x)i,j=∑k(∂z∂y)i,k(∂y∂x)k,jHowever, where this differs from matrix multiplication is that i,j,k are vectors which specify the location of variables within a tensor.

Let's see if we can use this notation to perform backpropagation on a neural network. Consider a neural network defined by the following composition of simple functions: f(x)=W2(relu(W1x+b1))+b2. Here, relu describes the activation function of the first layer of the network, which is defined as the element-wise application of relu(x)=max(x,0). There are a few parameters of this network: the weight matrices, and the biases. These parameters are the things that we are taking the derivative with respect to.

There is one more part to add before we can train this abstract network: a loss function. In our case, we are simply going to train the parameters with respect to the loss function L(^y,y)=||^y−y||22 where ^y is the prediction made by the neural network, and y is the vector of desired outputs. In full, we are taking ∂∂wL(f(x),y), for some weights w, which include W1,W2,b1,b2. Since this loss function is parameterized by a constant vector y, we can henceforth treat the loss function as simply L(f(x)).

Ideally, we would not want to make this our loss function. That's because the true loss function should be over the entire dataset — it should take into account how good the predictions were for each sample that it was given. The way that I have described it only gave us the loss for a single prediction.

However, taking the loss over the entire dataset is too expensive and converges slowly. Alternatively, taking the loss over a single point (ie: stochastic gradient descent) is also too slow because it doesn't allow us to take into account parallel hardware. So, actual practitioners use what's called mini-batch descent, where their loss function is over some subset of the data. For simplicity, I will just show the stochastic gradient descent step.

For ∂∂b2L(f(x)) we have ∂∂b1L(f(x))=∂L∂f∂f∂b2. From the above definition of f, we can see that ∂f∂b2=I, where I is the identity matrix. From here on I will simply assume that the partial derivatives are organized in some specific manner, but omitted. The exact way it's written doesn't actually matter too much as long as you understand the *shape* of the Jacobian being represented.

We can now evaluate ∂f∂W2. Let U be (relu(W1x+b1)). Then computing the derivative ∂f∂W2comes down to finding the generalized Jacobian of W2Uwith respect to W2. I will illustrate what this generalized Jacobian would look like by building up from analogous, lower order derivatives. The derivative dydx of y=cx is c. The gradient ∇xc⊺x is c. The Jacobian Jx of Ux is U. We can therefore see that the generalized Jacobian JW2of W2U will be some type of order 3 tensor which would look like a simple expression involving U.

The derivatives for the rest of the weight matrices can be computed similarly to the derivatives I have indicated for b2 and W2. We simply need to evaluate the terms later on in the chain ∂L∂f⋯∂v∂W1where v is shorthand for the function v=W1x.

We have, however, left out one crucial piece of information, which is how to calculate the derivative over the relu function. To do that we simply separate the derivative into a piecewise function. When the input is less than zero, the derivative is 0. When the input is greater than zero, the derivative is 1. But since the function is not differentiable at 0, we just pretend that it is and make it's derivative 0; this doesn't cause any issues.

0 \end{cases}">∂∂xrelu(x)={0x≤01x>0This means that we are pretty much done, as long as you can fill in the details for computing the generalized Jacobians. The trickiest part in the code is simply making sure that all the dimensions line up. Now, once we have computed by derivatives, we can incorporate this information into some learning algorithm like Adam, and use this to update the parameters and continue training the network.

There are, however, many ways that we can make the algorithm more efficient than one might make it during a naive implementation. I will cover one method briefly.

We can start by taking into account information about the direction we are calculating the Jacobians. In particular, if we consider some chain ∂L∂f⋯∂v∂W1, we can take advantage of the fact that tensor-tensor products are associative. Essentially, this means that we can start by computing the last derivative ∂v∂W1 and then multiplying forward. This is called *forward accumulation*. We can also compute this expression in reverse, which is referred to as *reverse accumulation*.

Besides forward and reverse accumulation, there are more complex intracacies for fully optimizing a library. From Wikipedia,

Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. The problem of computing a full Jacobian of*f*: ℝn → ℝm with a minimum number of arithmetic operations is known as the

*optimal Jacobian accumulation*(OJA) problem, which is NP-complete.

Now if you've followed this post and the last two, and filled in some of the details I (sloppily) left out, you should be well on your way to being able to implement efficient backpropagation yourself. Perhaps read this famous paper for more ways to make it work.

1 This is first and foremost my personal goal, rather than a goal that I expect the readers here to agree with.

2 If you want to see this derived, see section 4.5.3 in the paper.

3 The part about people being disappointed comes from my own experience, as it's what John Canny said in CS 182. The definition of Tensor can be made more precise as a multidimensional array that satisfies a specific transformation law. See here for more details.

Discuss

### Клуб чтения цепочек

### Рациональное додзё. Внутренний симулятор

### Beliefs Are For True Things

One of the core principles -- maybe the most core principle -- of the art of rationality is that *beliefs are for true things*. In other words, you should believe things because they are true. You should not believe things that are not true.

Holding that beliefs are for true things means that you do not believe things because they are *useful, *believe things because they *sound nice*, or believe things because you *prefer them to be true*. You believe things that *are* true (or at least that you believe to be true, which is often the best we can get!).

Eliezer referred to this principle as "the void", writing in his "The Twelve Virtues of Rationality":

Before these eleven virtues is a virtue which is nameless.Miyamoto Musashi wrote, in The Book of Five Rings:“The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means. Whenever you parry, hit, spring, strike or touch the enemy’s cutting sword, you must cut the enemy in the same movement. It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him. More than anything, you must be thinking of carrying your movement through to cutting him.”Every step of your reasoning must cut through to the correct answer in the same movement. More than anything, you must think of carrying your map through to reflecting the territory.Musashi wrote that you must always think of carrying your motion through to cutting; I write, with Eliezer, that every belief and every step in your belief must cut through to knowing the truth.

Beliefs, after all, are for true things, and if you lose sight of that you will lose your epistemics. If you think only of what gives you an advantage in a debate, of what sounds nice, of what wins you the admiration of your peers, of what is politically correct, or of what you would prefer to be true, you will not be able to actually believe true things.

I would like to take the perhaps unusual step of closing with a poem by Rudyard Kipling, which addresses this point (among others) rather well:

As I pass through my incarnations in every age and race,

I make my proper prostrations to the Gods of the Market-Place.

Peering through reverent fingers I watch them flourish and fall.

And the Gods of the Copybook Headings, I notice, outlast them all.We were living in trees when they met us. They showed us each in turn,

That water would certainly wet us, as Fire would certainly burn:

But we found them lacking in Uplift, Vision, and Breadth of Mind,

So we left them to teach the Gorillas while we followed the March of Mankind.We moved as the Spirit listed.

*They*never altered their pace,

Being neither cloud nor wind-borne like the Gods of the Market-Place;

But they always caught up with our progress, and presently word would come

That a tribe had been wiped off its icefield, or the lights had gone out in Rome.With the Hopes that our World is built on they were utterly out of touch.

They denied that the Moon was Stilton; they denied she was even Dutch.

They denied that Wishes were Horses; they denied that a Pig had Wings.

So we worshiped the Gods of the Market Who promised these beautiful things.When the Cambrian measures were forming, They promised perpetual peace.

They swore, if we gave them our weapons, that the wars of the tribes would cease.

But when we disarmed They sold us and delivered us bound to our foe,

And the Gods of the Copybook Headings said:

*"Stick to the Devil you know."*On the first Feminian Sandstones we were promised the Fuller Life

(Which started by loving our neighbor and ended by loving his wife)

Till our women had no more children and the men lost reason and faith,

And the Gods of the Copybook Headings said:

*"The Wages of Sin is Death."*In the Carboniferous Epoch we were promised abundance for all,

By robbing selective Peter to pay for collective Paul;

But, though we had plenty of money, there was nothing our money could buy,

And the Gods of the Copybook Headings said:

*"If you don't work you die."*Then the Gods of the Market tumbled, and their smooth-tongued wizards withdrew,

And the hearts of the meanest were humbled and began to believe it was true

That All is not Gold that Glitters, and Two and Two make Four —

And the Gods of the Copybook Headings limped up to explain it once more. * * * * * *As it will be in the future, it was at the birth of Man —

There are only four things certain since Social Progress began: —

That the Dog returns to his Vomit and the Sow returns to her mire,

And the burnt Fool's bandaged finger goes wabbling back to the Fire;

And that after this is accomplished, and the brave new world begins

When all men are paid for existing and no man must pay for his sins,

As surely as Water will wet us, as surely as Fire will burn,

The Gods of the Copybook Headings with terror and slaughter return!

Discuss

### What experiments would demonstrate "upper limits of augmented working memory?"

Wikipedia has this discussion of working-memory-as-ability-to-discern-relationships-simultaneously:

Other have argued that working memory capacity is better characterized as "the ability to mentally form relations between elements, or to grasp relations in given information. This idea has been advanced by Halford, who illustrated it by our limited ability to understand statistical interactions between variables.[34]"These authors asked people to compare written statements about the relations between several variables to graphs illustrating the same or a different relation, as in the following sentence: "If the cake is from France, then it has more sugar if it is made with chocolate than if it is made with cream, but if the cake is from Italy, then it has more sugar if it is made with cream than if it is made of chocolate". This statement describes a**relation between three variables**(country, ingredient, and amount of sugar), which is the maximum most individuals can understand. The capacity limit apparent here is obviously not a memory limit (all relevant information can be seen continuously) but a limit to how many relationships are discerned simultaneously.

A common argument I've heard is that large monitors, notebooks, whiteboards, etc, are important tools to expand working memory.

I notice I'm not 100% sure what this means – in particular in the context of "discerning relationships simultaneously."

In this blogpost on distributed teams , Elizabeth plots out her model of worker productivity, which looks like this:

I look at any chunk of that, and it makes sense.

If I were to try to summarize the whole thing without looking at the reference drawing, I would definitely not be able to (without a lot of memorization and/or thinking about the model to get it deeply entangled within myself)

If I *have* the model right in front of me, I *still* can't really explain it, it's too complicated.

Diagrams help – I'm pretty sure I could track more moving parts with a diagram than without a diagram. But how much do they help? And what does that mean?

I'm interested in this as part of a general hypothesis that working-memory might be a key bottleneck on intellectual progress. It seems like you should be able to formalize the limit of how many relationships people can reason about at once, and how much visual aids and other working-memory augmentation help. But I'm not quite sure what testing it would mean.

If I try to memorize a phone number with *no* visual aids, it's obvious to check how many digits I can remember. If I *have* a visual aid, it's easy - just read off the page. But when it comes to discerning relationships, just reading off the page "what inputs plug into what" isn't really the question.

I'm interested in:

- generally getting deconfused about this topic
- whether there's any science that tries actually answering this question
- what science
*could*theoretically try answering this question if it hasn't been done yet.

Discuss

### Clarifying some key hypotheses in AI alignment

We've created a diagram mapping out important **and** controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.

*A part of the diagram. Click through to see the full version.*

- This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value, Goodhart's Curse, AI being deployed in a catastrophe-sensitive context.
- This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria was something like:
- The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.
- The idea must be explained sufficiently well that we believe it is plausible.

- Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications — please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow — you may disagree with it, so don't blindly follow the arrow!

Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations.

One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.

This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis.

You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. However, **we recommend starting with the diagram**, then if you are interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below.

The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments, for lack of software to neatly embed this information (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised are referring to a box in the diagram.

Definitions- AGI: a system (not necessarily agentive) that, for almost all economically relevant cognitive tasks,
*at least*matches any human's ability at the task. Here, "agentive AGI" is essentially what people in the AI safety community usually mean when they say AGI. References to before and after AGI are to be interpreted as fuzzy, since this definition is fuzzy. - CAIS: comprehensive AI services. See Reframing Superintelligence.
- Goal-directed: describes a type of behaviour, currently not formalised, but characterised by generalisation to novel circumstances and the acquisition of power and resources. See Intuitions about goal-directed behaviour.

Will the first AGI be most effectively modelled like a unitary, unbounded, goal-directed agent?

- Related reading: Reframing Superintelligence, Comments on CAIS, Summary and opinions on CAIS, embedded agency sequence, Intuitions about goal-directed behaviour
- Comment: This is consistent with some of classical AI theory, and agency continues to be a relevant concept in capability-focused research, e.g. reinforcement learning. However, it has been argued that the way AI systems are taking shape today, and the way humans historically do engineering, are cause to believe superintelligent capabilities will be achieved by different means. Some grant that a CAIS-like scenario is probable, but maintain that there will still be
*Incentive for agentive AGI*. Others argue that the current understanding of agency is problematic (perhaps just for being vague, or specifically in relation to embeddedness), so we should defer on this hypothesis until we better understand what we are talking about. It appears that this is a strong crux for the problem of*Incorrigible goal-directed superintelligence*and the general aim of*(Near) proof-level assurance of alignment*, versus other approaches that reject alignment being such a hard, one-false-move kind of problem. However, to advance this debate it does seem important to clarify notions of goal-directedness and agency.

Are there features of systems built like unitary goal-directed agents that offer a worthwhile advantage over other broadly superintelligent systems?

- Related reading: Reframing Superintelligence, Comments on CAIS, Summary and opinions on CAIS, Will humans build goal-directed agents?, AGI will drastically increase economies of scale
- Comment: Some basic points argued in favour are that agentive AGI is significantly more efficient, or humans find agents easier to think about, or humans just want to build human-like agents for its own sake. However, even if agentive AGI offers greater efficiency, one could argue it is too risky or difficult to build, so we are better off settling for something like CAIS.

In general and holding resources constant, is a collection of modular AI systems with distinct interfaces more competent than a single integrated AI system?

- Related reading: Reframing Superintelligence Ch. 12, 13, AGI will drastically increase economies of scale
- Comment: an almost equivalent trade-off here is generality vs. specialisation. Modular systems would benefit from specialisation, but likely bear greater cost in principal-agent problems and sharing information (see this comment thread). One case that might be relevant to think about is human roles in the economy — although humans have a general learning capacity, they have tended towards specialising their competencies as part of the economy, with almost no one being truly self-sufficient. However, this may be explained merely by limited brain size. The recent success of end-to-end learning systems has been argued in favour of integration, as has the evolutionary precedent of humans (since human minds appear to be more integrated than modular).

AI systems so far generally lack some key qualities that are traditionally supposed of AGI, namely: pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Does this lacking extrapolate, with increasing automation of AI R&D and the rise of a broad collection of superintelligent services?

- Related reading: Reframing Superintelligence Ch. I

Will systems built like unitary goal-directed agents develop incidentally from something humans or other AI systems build?

- Related reading: Subsystem Alignment, Risks from Learned Optimization, Let's talk about "Convergent Rationality"

Given sufficient capacity, does an AI system converge on rational agency and consequentialism to achieve its objective?

- Related reading: Let's talk about "Convergent Rationality"
- Comment: As far as we know, "convergent rationality" has only been named recently by David Krueger, and while it is not well fleshed out yet, it seems to point at an important and commonly-held assumption. There is some confusion about whether the convergence could be a theoretical property, or is merely a matter of human framing, or merely a matter of
*Incentive for agentive AGI*.

Will there be optimisation processes that, in turn, develop considerably powerful optimisers to achieve their objective? A historical example is natural selection optimising for reproductive fitness to make humans. Humans may have good reproductive fitness, but optimise for other things such as pleasure even when this diverges from fitness.

- Related reading: Subsystem alignment, Risks from Learned Optimization

Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI it seems most likely to result from a qualitative change in learning curve, due to an algorithmic insight, architectural change or scale-up in resource utilisation.

- Related reading: Intelligence Explosion Microeconomics, The Hanson-Yudkowsky AI-Foom Debate, A Contra AI FOOM Reading List, Any rebuttals of Christiano and AI Impacts on takeoff speeds?, A shift in arguments for AI risk
- Comment: Discontinuity or fast takeoff is a central assumption of early arguments for AI risk and seems to have the greatest quantity of debate. A large proportion of the community supports it in some form, but this proportion has apparently decreased significantly in the last few years (beliefs changing or new people, it's unclear), with Paul Christiano's and Katja Grace's writing being a key influence. Note that we distinguish
*to*AGI and*from*AGI because of strategic and developmental considerations around human-level. In published works the distinction has not been very clear — we would like to see more discussion about it. Thanks to Ben Garfinkel for pointing out how the distinction can be important.

Is an AI system that improves through its own AI R&D and self-modification capabilities more likely than distributed AI R&D automation? Recursive improvement would give some form of explosive growth, and so could result in unprecedented gains in intelligence.

- Related reading: Intelligence Explosion Microeconomics, Reframing Superintelligence Ch. 1

Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI it seems most likely to result from a recursive improvement capability.

- Related reading: see
*Discontinuity to AGI* - Comment: see
*Discontinuity to AGI*

Do contemporary machine learning techniques scale to general human level (and beyond)? The state-of-the-art experimental research aiming towards AGI is characterised by a set of theoretical assumptions, such as reinforcement learning and probabilistic inference. Does this paradigm readily scale to general human-level capabilities without fundamental changes in the assumptions or methods?

- Related reading: Prosaic AI alignment, A possible stance for alignment research, Conceptual issues in AI safety: the paradigmatic gap, Discussion on the machine learning approach to AI safety
- Comment: One might wonder how much change in assumptions or methods constitutes a paradigm shift, but the more important question is how relevant current ML safety work can be to the most high-stakes problems, and that seems to depend strongly on this hypothesis. Proponents of the ML safety approach admit that much of the work could turn out to be irrelevant, especially with a paradigm shift, but argue that there is nonetheless a worthwhile chance. ML is a fairly broad field, so people taking this approach should think more specifically about what aspects are relevant and scalable. If one proposes to build safe AGI by scaling up contemporary ML techniques, clearly they should believe the hypothesis — but there is also a feedback loop: the more feasible approaches one comes up with, the more evidence there is for the hypothesis. You may opt for
*Foundational or "deconfusion" research*if (1) you don't feel confident enough about this to commit to working on ML, or (2) you think that, whether or not ML scales in terms of*capability*, we need deep insights about intelligence to get a satisfactory solution to*alignment*. This implies*Alignment is much harder than, or does not overlap much with, capability gain*.

Do we need a much deeper understanding of intelligence to build an aligned AI?

- Related reading: The Rocket Alignment Problem

Do corrigible AI systems have a broad basin of attraction to intent alignment? Corrigible AI *tries* to help an overseer. It acts to improve its model of the overseer's preferences, and is incentivised to make sure any subsystems it creates are aligned — perhaps even more so than itself. In this way, perturbations or errors in alignment tend to be corrected, and it takes a large perturbation to move out of this "basin" of corrigibility.

- Related reading: Corrigibility, discussion on the need for a grounded definition of preferences (comment thread)
- Comment: this definition of corrigibility is still vague, and although it can be explained to work in a desirable way, it is not clear how practically feasible it is. It seems that proponents of corrigible AI accept that greater theoretical understanding and clarification is needed: how
*much*is a key source of disagreement. On a practical extreme, one would iterate experiments with tight feedback loops to figure it out, and correct errors on the go. This assumes ample opportunity for trial and error, rejecting*Discontinuity to/from AGI*. On a theoretical extreme, some argue that one would need to develop a new mathematical theory of preferences to be confident enough that this approach will work, or such a theory would provide the necessary insights to make it work at all. If you find this hypothesis weak, you probably put more weight on threat models based on Goodhart's Curse, e.g.*Incorrigible goal-directed superintelligence*, and the general aim of*(Near) proof-level assurance of alignment*.

Will a concrete, catastrophic AI failure be overwhelmingly hard to recognise or anticipate? For certain kinds of advanced AI systems (namely the goal-directed type), it seems that short of near proof-level assurances, all safeguards are thwarted by the nearest unblocked strategy. Such AI may also be incentivised for deception and manipulation towards a treacherous turn. Or, in a machine learning framing, it would be very difficult to make such AI robust to distributional shift.

- Related reading: Importance of new mathematical foundations to avoid inconspicuous failure (comment thread)
- Comment: This seems to be a key part of many people's models for AI risk, which we associate most with MIRI. We think it significantly depends on whether there is
*Agentive AGI*, and it supports the general aim of*(Near) proof-level assurance of alignment*. If we can get away from that kind of AI, it is more likely that we can relax our approach and*Use feedback loops to correct course as we go*.

Would gradual gains in the influence of AI allow small problems to accumulate to catastrophe? The gradual aspect affords opportunity to recognise failures and think about solutions. Yet for any given incremental change in the use of AI, the economic incentives could outweigh the problems, such that we become more entangled in, and reliant on, a complex system that can collapse suddenly or drift from our values.

- Related reading: What failure looks like, A shift in arguments for AI risk > The alignment problem without a discontinuity > Questions about this argument

*Thanks to Stuart Armstrong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Emmons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feedback on drafts of this work. Ben especially thanks Rohin for his generous feedback and assistance throughout its development.*

Discuss

### Ненасильственное общение. Тренировка

### Tessercube — OpenPGP Made Mobile

I cited several words from my colleague *Neruthes*. He's not at LessWrong but he also joined the most recent Shanghai LessWrong meetup: https://www.lesswrong.com/events/zR4atrRmiaqGLvjYj/shanghai-lesswrong-meetup#NZHcXphJXcsTFwpmf

Recently we have been working on this project in the sense that I feel there is no good OpenPGP utility on Mobile (especially iOS). By good, I mean the UX should be good, and the license should be AGPL or GPL.

In the process, we got the idea of App Penetration — by making OpenPGP into keyboards (input methods), we can literally be end-to-end encrypted when using any channel of communication, as long as the other side can decrypt — on Facebook Messenger, on Telegram, on iMessage, whatever.

For now, we have been releasing Android beta test versions on __Google Play__. The iOS version on __App Store__. It might be a bit early to announce because there are plenty of bugs and a big shortage of tutorials, but I believe hardcore users can go through it.

There can be a lot of bugs and UX flaws. If you find any bug, just go to __GitHub__ and open an issue. And I will appreciate!

In larger perspective, building __Tessercube__ is just a humble beginning. We would like to give general public proper tools of encryption and make them possible to protect their privacy and enable the people to really own their data. That's why we also made __Maskbook__ , an encryption and programmable layer on top of all existing giants, such as Facebook, Twitter, etc. I will write a separate post for our story and our approach.

([I:b])

Discuss

### A Primer on Matrix Calculus, Part 2: Jacobians and other fun

I started this post thinking that I would write all the rules for evaluating Jacobians of neural network parameters in specific cases. But while this would certainly be useful for grokking deep learning papers, frankly it's difficult to write that in Latex and the people who have written The Matrix Calculus You Need For Deep Learning paper have already done it much better than I can do.

Rather, I consider my comparative advantage here to provide some expansion on why we should use Jacobians in the first place. If you were to just read the paper above, you might start to think that Jacobians are just notational perks. I hope to convince you that they are much more than that. In at least one setting, Jacobians provide a mathematical framework for analyzing the input-output behavior of deep neural networks, which can help us see things which we might have missed without this framework. A specific case of this phenomenon is a recently discovered technique which was even more recently put into a practical implementation: Jacobian regularization. Here we will see some fruits of our matrix calculus labor.

Deep learning techniques require us to train a neural network by slowly modifying parameters of some function until the function begins returning something close to the intended output. These parameters are often represented in the form of matrices. There are a few reasons for this representation: the matrix form is compact, and it allows us to use the tools of linear algebra directly. Matrix computations can also be processed in parallel, and this standardization allows programmers to build efficient libraries for the training of deep neural networks.

One quite important matrix in deep learning is the Jacobian.

In one sense, the Jacobian matrix is just a way of organizing gradient vectors. Gradient vectors, in turn, are just ways of organizing partial derivatives of an expression. Therefore, the Jacobian matrix is just a big matrix which allows us to organize a lot of partial derivatives. Formally, the Jacobian of .mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} f:Rn→Rm is defined by the following matrix. We denote fi(x) as the mapping from Rn→Ri, where Ri is the real number line in the ith coordinate of the output vector Rm. Then the Jacobian is simply

⎡⎢ ⎢ ⎢ ⎢ ⎢⎣∇f⊺1∇f⊺2⋮∇f⊺m⎤⎥ ⎥ ⎥ ⎥ ⎥⎦But describing the Jacobian as just some big rectangular box of partial derivatives of vector-valued functions hides the intuition for why Jacobians are important. To truly grasp the Jacobian, we will first build up to it by imagining a function that maps between two real coordinate spaces: .mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} f:Rn→Rm. If n=m, this function has a natural interpretation. We can see it by considering the case where n=m=2.

Now imagine stretching, warping, and contracting the plane so that the points are moved around in some manner. This means that every point is mapped to some other point on the graph. For instance, (2,5) could be mapped to (1,3). Crucially, make sure this transformation doesn't cause any sort of sharp discontinuity: we don't want the graph to rip. I am not good with illustrating that type of thing, so I encourage you to imagine it instead in your head, or alternatively watch this video.

There is a special set of such mappings, which we call *linear transformations*. Linear transformations have the property that when we perform the stretching, the gridlines are kept straight. We could still rotate the graph, for instance. But what we can't do is bend some axis so that it takes on a curvy shape after the transformation. If we drew a line on the graph before the linear transformation, it must remain a straight line after the transformation.

What does this have to do with the Jacobian? To see we first must ask why derivatives are useful. In the most basic sense, the derivative is getting at trying to answer the questoin, "What is the local behavior of this function?"

First, if you will consider by analogy, we could ask for the local behavior of some differentiable function f:R→R. This would be a line whose slope is provided by the derivative. Similarly we could ask for the local behavior of some multivariate function f:Rn→R, which would be a hyperplane whose direction is determined by the gradient. Now when we ask what the local behavior of some vector-valued function f:Rn→Rm is, we get a linear transformation described by the Jacobian.

In the above illustration, the Jacobian is evaluated at the point in the bottom left corner of the red tiles. The linear transformation implied by the Jacobian is represented by the translucent square after the function is applied, which is a rotation transformation some angle clockwise. As we can see, while f is some extra curvy function, the Jacobian can approximate the local behavior at a point quite well.

To see how this visualization is useful, consider the case of applying a Jacobian to a neural network. We could be designing a simple neural network to predict the output of two variables, representing perhaps normalized class probabilities, given an input of three variables, representing perhaps input pixel data. We now illustrate the neural network.

This neural networks implements a particular function from R3→R2. However, the exact function that is being implemented depends crucially on the parameters, here denoted by the connections between the nodes. If we compute the Jacobian of this neural network with respect to the input, at some input instance, we would end up getting a good idea of how the neural network changes within the neighborhood of that particular input.

One way we can gain insight from a Jacobian is by computing its *determinant*. Recall, a determinant is a function from square matrices to scalars which is defined recursively as the alternating sum and subtraction of determinants of the minors of the matrix multiplied by elements in the top row. On second thought, *don't* recall that definition of determinant; that's not going to get you anywhere. Despite the determinant's opaque definition, we can gain deeper insight into what the determinant represents by instead viewing it geometrically. In a few words, the determinant computes the scaling factor for a given linear transformation of a matrix.

Above, I have pulled from Wikimedia a parallelepiped, which was formed from some linear mapping of a cube. The volume of this parallelepiped is some multiple of the volume of the cube before the transformation. It turns out that no matter which region of space we look at, this linear transformation will generate the *same* ratio from post-transformed regions to pre-transformed regions. This ratio is given by the determinant of the matrix representing the linear mapping. In other words, the determinant tells us how much some transformation is expanding or contracting space.

What this means is for the Jacobian is that the determinant tells us how much space is being squished or expanded in the *neighborhood *around a point. If the output space is being expanded a lot at some input point, then this means that the neural network is a bit unstable at that region, since minor alterations in the input could cause huge distortions in the output. By contrast, if the determinant is small, then some small change to the input will hardly make a difference to the output.

This very fact about the Jacobian is behind a recent development in the regularization of deep neural networks. The idea is that we could use this interpretation of the Jacobian as a measure of robustness to input-perturbations around a point to make neural networks more robust off their training distribution. Traditional approaches like L2 regularization have emphasized the idea of keeping some parameters of the neural network from wandering off into extreme regions. The idea here is that smaller parameters are more likely *a priori*, which motivates the construction of some type of penalty on parameters that are too large.

In contrast to L2 regularization, the conceptual framing of Jacobian regularization comes from a different place. Instead of holding a leash on some parameters, to keep them from wandering off into the abyss, Jacobian regularization emphasizes providing robustness to small changes in the input space. The motivation behind this approach is clear to anyone who has been paying attention to adversarial examples over the last few years. To explain, adversarial examples are cases where we provide instances of a neural network where it performs very poorly, even if it had initially done well on a non-adversarial test set. Consider this example, provided by OpenAI.

The first image was correctly identified as a panda by the neural network. However, when we added a tiny bit of noise to the image, the neural network spit out garbage, confidently classifying a nearly exact copy as a gibbon. One could imagine a hypothetical adversary using this exploit to defeat neural network systems in practice. In the context of AI safety, adversarial attacks constitutes a potentially important subproblem of system reliability.

In Jacobian regularization, we approach this issue by putting a penalty on the size of the entries in the Jacobian matrix. The idea is simple: the smaller the values of the matrix, the less that tiny perturbations in input-space will affect the output. Concretely, the regularizer is described by taking the frobenius norm of the Jacobian matrix, ||J(x)||2F. The frobenius norm is nothing complicated, and is really just a way of describing that we square all of the elements in the matrix, take the sum, and then take the square root of this sum. Put another way, if we imagine concatenating all the gradient vectors which compose the Jacobian, the frobenius norm is just describing the L2 penalty of this concatenated vector.

Importantly, this technique is subtly different from taking the L2 norm over the *parameters*. In the case of a machine learning algorithm with no linearity, this penalty does however reduce to L2 regularization. Why? Because when we take the Jacobian of a purely affine function, we obtain the global information about how the function stretches and rotates space, excluding the translation offset. This global information precisely composes the parameters that we would be penalizing. It is theoretically similar to how if we take the derivative of a line, we can reconstruct the line from the derivative and a bias term.

If while reading the last few paragraphs, you starting thinking *how is this just now being discovered?* you share my thoughts exactly. As far as I can tell, the seeds of Jacobian regularization have existed since at least the 1990s. However, it took until 2016 for a team to create a full implementation. Only recently, as I write this in August 2019, has a team of researchers claimed to have discovered an efficient algorithm for applying this regularization penalty to neural networks.

The way that researchers created this new method is by using random projections to approximate the Frobenius norm. Whereas the prior approach *mentioned *random projections, it was never put into practice. The new paper succeeded by devising an algorithm to approximate Jacobians efficiently with minimal overhead cost.

How efficiently? The paper states that there is

only a negligible difference in model solution quality between training with the exact computation of the Jacobian as compared to training with the approximate algorithm, even when using a single random projection.If this technique really works as its described, this is a significant result. The paper claims that by applying Jacobian regularization, training uses only 30 more computation compared to traditional stochastic gradient descent without regularization at all. And for all that, we get some nice benefits: the system was significantly more robust to a PGD attack, and it was apparently much better than vanilla L2 regularization due to the distance between decision cells in the output space.

I recommend looking at the paper for more details.

Discuss

## Страницы

- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- следующая ›
- последняя »