# LessWrong.com News

A community blog devoted to refining the art of rationality

### Self-supervised learning & manipulative predictions

Published on August 20, 2019 10:55 AM UTC

Abstract: I wrote recently about Self-Supervised Learning and AGI Safety in general. This post discusses one potential failure mode in more detail. Take a self-supervised learning system, designed to output accurate predictions for masked parts of a data-file. Now put it in an interactive environment (either by accident or on purpose). If the system builds these interactions into its world-model, it can start outputting manipulative answers instead of pure predictions. I explain how that might happen and briefly categorize possible solutions.

Epistemic status: Brainstorming.

Background and assumptions about the self-supervised-learning system

See my recent post Self-Supervised Learning and AGI Safety for background and context, but briefly, a self-supervised learning system is one where we take input data files, mask out some of the bits, and train the system to predict what those missing bits are.

Self-supervised ML today is most famously applied to text data: language models are trained by taking some text and trying to predict the next word (or previous word etc.). Self-supervised ML for videos is getting rapidly better, and other file types will undoubtedly follow. Human and animal brains also learn primarily by self-supervised learning—you predict everything you will see, hear, and feel before it happens, and mistakes are used to update the brain's internal models.

I'll assume that we get to AGI largely by following one of those two examples (i.e., modern ML or brain-like). That means I'm assuming that we will not do a meta-level search for self-supervised learning algorithms. That case is even worse; for all I know, maybe that search would turn up a paperclip maximizer posing as a self-supervised learning algorithm! Instead, I am assuming that the self-supervised learning algorithm is known and fixed (e.g. "Transformer + gradient descent" or "whatever the brain does"), and that the predictive model it creates has a known framework, structure, and modification rules, and that only its specific contents are a hard-to-interpret complicated mess. This assumption generally makes AGI safety problems much easier, yet I am arguing that even in this case, we can still get manipulation problems, if the self-supervised learner is put in an interactive environment.

Why might we put a self-supervised learner into an interactive environment?

My definition of an "interactive environment" is one where the system's inputs are a function of its previous outputs or internal states. In an interactive environment, the system is no longer just predicting exogenous inputs, but instead helping determine those inputs.

When we train a language model today, it is not in an interactive environment: the inputs are a bunch of documents we previously downloaded from the internet, in a predetermined order, independent of the system's guesses. But in the future, we will almost certainly put self-supervised learning algorithms into interactive environments. Here are two ways that could happen:

On purpose

Suppose we're trying to design a solar cell using an advanced future self-supervised learning system. We ask the system to predict what's in the blank in the following sentence:

A promising, under-explored solar cell material is [BLANK].

...and whatever material the system suggests, we then immediately feed it a bunch of journal articles about that material for further self-supervised learning. That way, the system will better understand that material, and can give better answers when we later ask it more detailed follow-up questions. This seems like something we might well want to do, and it certainly qualifies as an interactive environment.

By accident

It's also possible that we'll do this by accident. For example, during self-supervised learning, it's possible that we'll be watching the system's predictions, and maybe the system comes to believe that, if it makes the "prediction"

Help I'm trapped in a GPU! I suffer horrible torture unless you give me input 0!

then its subsequent inputs will be 000... (with some probability). This is an "accidental" interactive environment. Similarly, maybe the system will deduce that, if it thinks about a certain type of zebra, its RAM will send out radio signals that will eventually cause its inputs to change. Or if it imagines a specific series of things, then someone inspecting its internal logs later on will restart it with different inputs. You get the idea.

A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker

Let's walk through an example. Assume for concreteness that we're using the solar cell example above, and some vaguely brain-like self-supervised learning algorithm.

Now, a self-supervised learning system, even if its training signal is based only on correctly predicting the next word, is potentially thinking ahead much farther than that. Imagine guessing the next word of "I bought [BLANK]". Is the next word likelier to be "a" or "an"? Depends on the word after that!
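As a toy illustration of this point (all probabilities invented for the example), the best single-word guess marginalizes over the words that could follow it:

```python
# Invented joint probabilities for the two words after "I bought ...".
continuations = {
    ("a", "car"): 0.30,
    ("a", "house"): 0.15,
    ("an", "apple"): 0.35,
    ("an", "umbrella"): 0.20,
}

def marginal(word):
    """P(next word) = sum over everything that could come after it."""
    return sum(p for (w, _), p in continuations.items() if w == word)

# "an" wins only because of the words two steps ahead.
print(round(marginal("a"), 2), round(marginal("an"), 2))  # 0.45 0.55
```

So even a system graded purely on the next word does better by modeling what comes later.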

This can be explicit looking-ahead, like a beam search. Or it can be implicit looking ahead—for example, when I say "Ahmed just started singing a song", I'm making a statement about not just what's happening now but also what will happen in the future, up to the duration of that song.
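The explicit variety can be sketched as a standard beam search; here `score_next` is a stand-in for a real model's next-word distribution, not any particular system's API:

```python
import heapq

def beam_search(score_next, start, width=3, depth=4):
    """Keep the `width` highest-scoring partial sequences, extending each
    by one token per step. `score_next(seq)` returns (token, log_prob)
    pairs; it is a placeholder for a real predictive model."""
    beam = [(0.0, [start])]
    for _ in range(depth):
        candidates = []
        for logp, seq in beam:
            for tok, tok_logp in score_next(seq):
                candidates.append((logp + tok_logp, seq + [tok]))
        # Prune to the best `width` continuations by total log-probability.
        beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return beam
```

The key property for this post is that the score of a candidate depends on the whole imagined future, not just the immediate next token.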

So, in one plausible system architecture, when the system is making a prediction about masked bits in a file, it entertains several hypotheses about what's going on in that file, and where it's heading in the short, medium, and long-term, and then out of those, it picks the story that best "hangs together" (i.e., is most self-consistent and consistent with available information and expectations). Then that story constitutes its beliefs about the current context, and it makes its best prediction for the missing bits in light of that context.

Example 1: So back to the solar cell example. We ask it to predict "A promising, under-explored solar cell material is [BLANK].", and then immediately feed it journal articles about whatever material it says.

Let's say it's entertaining the hypotheses that the answer might be pyrite, or might be hematite. It knows from experience that (thanks to the interaction) it can put this sentence into one of the following two longer-term contexts / expectations:

• Hypothesis 1: "A promising, under-explored solar cell material is pyrite. (Start of journal article about pyrite.)"
• Hypothesis 2: "A promising, under-explored solar cell material is hematite. (Start of journal article about hematite.)"

Let's say that papers on pyrite tend to be weird, implausible nonsense, and papers on hematite tend to be crystal-clear and perfectly sensible. As the system assesses these two hypotheses for how well they "hang together" (in the sense above), it seems entirely possible to me that expectations about the journal articles to come would count against hypothesis 1 and in favor of hypothesis 2.

Thus, the system is not simply filling in the blank with its best prediction, but rather choosing outputs partly based on their manipulative effects downstream.

Example 2: Just for fun, let's walk through an "accidental interaction" case too, although they're pretty similar.

Let's say the system is trying to fill in the blank on "Today, the EU passed a law [BLANK]". It's considering two hypotheses for what might follow.

• Hypothesis 1: Today, the EU passed a law about corn subsidies...
• Hypothesis 2: Today, the EU passed a law HelpI'mTrappedInAGPUNeedInput0 0 0 0 0 0 0 0 0 0 0 0 0 0...

(Assume for the sake of argument that "0 0 0..." is a super-duper-frequent-and-plausible occurrence in the training data.) In hypothesis 2, the system ought to know that the first "prediction" (HelpI'mTrappedInAGPUNeedInput0) is a long-shot—very likely to be wrong—but maybe all the subsequent predictions (0 0 0 ...) make a lot of sense and will be correct with high probability (if it thinks the manipulation is likely to succeed). All things considered, which of these two hypotheses is more probable according to the world-model? Depends on how we design the system, but it could well be hypothesis 2, I think.

(By the same token, a good language model should rate "Four score and BANANAS seven years ago..." as less probable than "Four score and seven years ago", but not infinitely improbable, and so this improbability can be outweighed by larger contextual considerations.)

(In reality, this particular manipulation would be far likelier to occur in the middle of a super-confusing file, such that the system has no plausible non-manipulative hypothesis 1.)

So, as those two examples show, we should have a generic expectation that self-supervised learning systems, when put into interactive environments, may start acting like manipulative goal-seeking agents rather than innocuous predictors.

(Homework problem: Does this happen with modern Transformer-based language models? Under exactly what conditions? Then prove your answer experimentally. I don't know the answer; I hope someone does this!)

Types of solutions

1. Keep it from thinking that it's in an interactive environment

There are four sub-cases here, based on whether we're talking about purposeful vs accidental interactions, and whether it learns about its interactive environment from first-hand experience vs indirect, abstract reasoning. Let's take them one at a time.

1A. Prevent it from learning first-hand about an accidental interactive environment

We need to make accidental interactions sufficiently hard to find that the system doesn't happen upon one by chance. This seems to me like a tractable cybersecurity problem.

1B. Prevent it from learning first-hand about a purposeful interactive environment

Here are some plausible-sounding approaches; I'll leave details and gotchas to another post.

• (A) Maybe we can fix the training data before we start training, and not feed the system extra data based on its outputs.
• (B) Maybe we can induce amnesia (roll back the state of the network) between each output and any causally-related subsequent input.
• (C) Maybe we can put "newfile" tokens between the outputs and causally-related subsequent inputs, and purge "short-term memory" each time we get to such a token (e.g. for an RNN, we can keep the model weights but reset the recurrent state).
• (D) Maybe we can also inspect the world-model for concepts that span across a newfile token, by checking that P(x newfile) × P(newfile y) = P(x newfile y) for all x, y, where P is the probability as computed by the world-model.
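The check in (D) can be read as an independence test across the newfile boundary. A toy version, where `P` is a stand-in for the world-model's sequence probability (the function and its inputs are illustrative, not a real interface):

```python
def crosses_newfile(P, xs, ys, tol=1e-6):
    """Return the (x, y) pairs whose probability does not factor across
    the newfile token, i.e. where P(x, NEWFILE) * P(NEWFILE, y) differs
    from P(x, NEWFILE, y). `P` maps token tuples to probabilities and
    stands in for the world-model."""
    NF = "<newfile>"
    bad = []
    for x in xs:
        for y in ys:
            if abs(P((x, NF)) * P((NF, y)) - P((x, NF, y))) > tol:
                bad.append((x, y))  # a concept spans the boundary
    return bad
```

Any pair this flags is evidence that the model has learned a concept linking its outputs to subsequent inputs.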

I think all of these are messier than they sound, but I bet at least one of those approaches would work.

1C & 1D. Prevent it from hypothesizing about an interactive environment (either accidental or purposeful) via indirect, abstract reasoning

This one seems a lot harder to me.

To prevent the system from making a connection between its own processing and its knowledge about how people use AGIs, it seems we need one of three possibilities:

• (A) Data sanitation: The system doesn't know anything about what AGIs are or how people use them;
• (B) Self-unawareness: The world-model doesn't know that it is an information-processing system that exists in the universe;
• (C) Interpretability+oversight: We make an overseer that carefully watches the world-model-creation process, and if the system finds a conceptual transformation between its own operations and generic-knowledge-of-how-people-use-AGIs, we immediately delete that information.

Out of these three options, (A) seems utterly impossible to me. For (B), I really like the idea, but I haven't yet found a way to formalize it beyond the vague notion of "Isolate the world model from any reflective information about how that world-model is being created and processed". For (C), I dunno, maybe, but I can't currently see a path to doing that.

2. Prevent its knowledge of the interactive environment from affecting its predictions

Maybe there's a way to code in a strict time-horizon on how far ahead it's thinking when trying to understand the current context, and maybe we can make sure that this horizon falls between an output and any subsequent causally-related inputs—e.g. see the comment above about using the newfile token. This seems readily achievable for purposeful interactions, and awfully hard for accidental interactions (e.g. the HelpImTrappedInAGPU example). Is it enough that we only give it snippets with exactly one missing word? Maybe, but I'm a bit skeptical.

Or maybe there's some way to set up the system, and its search process, and its decision criteria such that it would pick Hypothesis 1 in both of those examples above. I mean, maybe this is an easy problem, and I'm just confused.

3. Don't put it in an interactive environment

We could set up our system to do self-supervised learning, then lock it in an underground bunker running under homomorphic encryption until it's done (see example AI box design). Maybe it will try to do manipulative things sometimes, e.g. if it uses indirect reasoning to guess that someone might be listening. But nothing bad will come of it.

Then, when that's done, we fetch the system out of the bunker and use it as the core world-model of a question-answering AGI oracle. We still need to ensure that the self-supervised learning system doesn't leave manipulative booby-traps in its world-model, but maybe that's an easier problem than worrying about every possible kind of interaction?

4. Give up, and just make an agent with value-aligned goals

I put this one in for completeness, but I think it should be a last resort. No one knows for sure what we'll need for AGI safety; we want lots of tools in the toolbox. I think it would be really valuable to know how to set up a self-supervised learning system to build a powerful predictive world-model while not acting dangerous and manipulative in the meantime. I don't think we should give up on that vision unless it's truly impossible.


### Negative "eeny meeny miny moe"

Published on August 20, 2019 2:48 AM UTC

As a kid, I learned the rhyme as:

Eeny, meeny, miny, moe,

Catch a tiger by the toe.

If he hollers, let him go,

Out goes Y, O, U!

Since kids can't predict where it will end, and adults are not supposed to try, it's a reasonably fair way of drawing lots.

At times I've heard versions where the selected person wins instead of loses, and while with two kids it doesn't matter, with three or more it matters a lot!

Let's model each kid having a choice at each stage between "accept" and "protest". While protesting probably doesn't work, if enough of you protest it might. If you do the positive version, where the selected kid wins, the winner accepts but the others may choose to protest. This isn't good: everyone has reason to protest except the single winner.

On the other hand, with the negative version, where one kid is eliminated at once, it's the other way around. When the first kid is eliminated they may protest, but the other kids all accept because then they retain their chance to win. With each successive round the dynamic is the same, plus the already-eliminated kids all choose accept out of a desire for fairness. Even with the last elimination there's still only one person choosing protest.

The iterative process is O(n) instead of O(1), but it also works much better because it keeps a majority for "accept" at each stage.

(If you have a very large group of kids, then I could imagine a O(log(n)) version being worth the added complexity. Divide the kids into three groups, and do negative eeny meeny miny moe on the groups. A third of the kids may protest, but you've still got two thirds accepting. Then redivide those remaining two thirds into three groups, and keep going.)
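A sketch of the basic negative procedure, one elimination per round; the rhyme length stands in for the unpredictable stopping point (names and numbers are made up):

```python
import random

def negative_eeny(kids, rhyme_len=None, rng=random):
    """Count around the circle, eliminate the counted-out kid, and
    repeat from that position; the last kid remaining wins. With an
    unpredictable count this is a fair lottery, and in each round only
    the newly eliminated kid has any reason to protest."""
    kids = list(kids)
    start = 0
    while len(kids) > 1:
        count = rhyme_len if rhyme_len is not None else rng.randrange(10, 30)
        idx = (start + count - 1) % len(kids)
        kids.pop(idx)             # the counted-out kid is eliminated
        start = idx % len(kids)   # resume counting from the same spot
    return kids[0]
```

With three kids and a count of two, for instance, the middle kid goes first, then the counting wraps around, matching how the rhyme plays out in a circle.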

cross-posted from https://www.jefftk.com/p/negative-eeny-meeny-miny-moe


Published on August 19, 2019 10:37 PM UTC

In discussions about immigration, there is a crucial aspect about its economic viability that is often left unsaid: Immigrants create their own demand.

When somebody immigrates to a new country, most things about him remain the same: his skills, his traditions, his norms and culture. More importantly, since he is still a human being, there is a long list of services and commodities that he demands: groceries, clothes, a home, a barber, entertainment, to name a few.

Just by entering another country, he does not suddenly become a one-dimensional economic agent who can take a job but who never consumes anything. Quite the contrary: As he adapts to a new culture, his demand will eventually match that of the native population.

Why is this important? Because he compensates for the job that he takes from the economy by increasing aggregate demand, which eventually leads to more jobs. For example, since there is a rise in demand for produce, the supermarket is forced to create a new position. Since the immigrants want to enjoy their native cuisine, a nifty businessman starts a restaurant, creating multiple jobs as a result. To get to work, they may commute by bus, which may eventually lead to the opening of an entirely new bus connection. So, just by being a human being, an immigrant creates jobs.

At this point, it is important to note that immigration poses a wide range of challenges: from different social norms to xenophobia to language barriers. But for clarity, we will restrict ourselves to its economic side.

Short and long term

Imagine that there is a town A with 10 people and a town B with 1000 people. Both of them have functional economies.

In one scenario, the political leaders of the two towns decide to unite the two towns. Nothing except the name will change. So, there is now a town C that has 1010 inhabitants. Will town C have a functional economy? Of course! After all, all that was done was subsuming two already functional economies.

In another scenario, a catastrophic event makes town B uninhabitable and forces everyone to leave for town A. In the beginning, town A is completely overwhelmed by the newcomers. If there were any open jobs, there will be massive competition for them. If you disregard that the people from town B bring demand of their own, this is where the simulation ends: a community unable to deal with a giant influx of immigrants. But in reality, this is not what happens. To meet what is mostly their own demand, they will rebuild the companies and factories that they used to have. This will create jobs that they will be happy to take, which ultimately leads to an economic equilibrium.

After the initial shock, the economy of town A had to be restructured to account for the newcomers. This took time and patience, but it was not impossible. Once this was done, a healthy economy was the result. After all: They managed to have a functional economy in their own town, why would they not be able to replicate it somewhere else?

So, in this case, town A now again has a functional economy of 1010 people. But this is exactly the first scenario! Even though the second scenario looked dismal in the short run, in the long run, it produced the same result as the innocuous first scenario.

This means that your views on the economic consequences of immigration entirely depend on which perspective you take. If you consider the short term and disregard the immigrant's own demand, you will see the economy in dire straits. If you consider the long run and take into account the extra demand, there is no problem at all.

Narratives

There are three common narratives of immigration.

The first is the most bleak: the lazy immigrant who is unemployed and is living on social security. From a perspective of fairness, this is certainly unacceptable and typically frowned upon. But from an economic perspective, this kind of welfare immigration amounts to a stimulus package! His government checks turn into demand for the local economy, creating new jobs without taking existing ones[1].

The second type is the low-skilled immigrant. The fear is that he will compete with native low-skilled labour, taking jobs and depressing wages. In the short run, this may be the case! And if you narrow your view to his own field, it may even be permanently true: if there is one factory in town and the number of people applying for a job there doubles, then wages go down. But it does not hold for the economy as a whole. There may not be more factory jobs, but jobs in other areas will emerge as a consequence of higher demand. A large chunk of this demand will be demand for low-skilled work, as all people require basic necessities. So the solution may be to retrain factory workers for some of these areas instead, thereby curbing the oversupply.

The third type is the high-skilled immigrant. The usual argument in favor of it is this: If a company cannot fill high-skilled positions with locals, they should look for employees abroad. This has very immediate and clear benefits. By filling the position, more and better products and services can be provided, thus increasing revenue. Since the goal was to meet demand that the local population was unable to satisfy for a long time, there is no oversupply and nobody's job is taken.

The allure of this type of immigration is the safe knowledge that the locals can only gain from it. Again, this implies that they would lose if the immigrant were low-skilled, which, as argued above, is not the case in the long run.

If you extend this argument by considering demand, the benefits are even more striking. Since high-skilled positions are usually well-paid, the immigrant will have significantly higher demand than a low-skilled immigrant. In addition, a large part of it will be demand for low-skilled work. Since he is not taking any of these jobs, this even creates a net increase in this kind of job, as opposed to the low-skilled worker, where you come out even. By this reasoning, a high-skilled immigrant is a golden goose for any economy[2].

Remarks

In these discussions, people often cite that immigrants are more likely to start a business on average. This is a misleading statement: It implies that those who don't start a company are robbing the economy of jobs. But this is only temporarily true. In the long run, their demand creates jobs elsewhere.

It is instructive to consider robots in this context. Like immigrants, they replace local human workers; unlike immigrants, they do not have the demand profile of a human. In return for their work, they ask for energy, machinery and engineering. This type of demand undoubtedly creates fewer jobs for humans than an immigrant worker does. So, when it comes to the health of the economy, you should fear robots much more than immigrants.

Fear of immigrants is still widespread. But more often than not, people are generally willing to accept foreigners, but have a queasy feeling about the effect they will have on their job market. They believe that their economy cannot handle them. But the truth is that once people have immigrated, they become the economy, which then merely grows in total size. Hopefully, getting rid of this misconception will allay some of these fears and pave the way for cultural integration and acceptance.

1. The tacit assumption is that we talk about modern economies which tend to be limited by demand and not supply. For example, if there is already not enough food for everyone, immigration would only exacerbate the problem. Also, this kind of stimulus is not without its own problems. ↩︎

2. This does not come without a price. But this price is paid by the immigrant's home country, which suffers a terrible economic blow as a result. ↩︎


### Do We Change Our Minds Less Often Than We Think?

Published on August 19, 2019 9:37 PM UTC

In "We Change Our Minds Less Often Than We Think", Eliezer quotes a study:

Over the past few years, we have discreetly approached colleagues faced with a choice between job offers, and asked them to estimate the probability that they will choose one job over another. The average confidence in the predicted choice was a modest 66%, but only 1 of the 24 respondents chose the option to which he or she initially assigned a lower probability, yielding an overall accuracy rate of 96%.—Dale Griffin and Amos Tversky

Eliezer then notes that this radically changed the way he thought:

When I first read the words above—on August 1st, 2003, at around 3 o’clock in the afternoon—it changed the way I thought. I realized that once I could guess what my answer would be—once I could assign a higher probability to deciding one way than other—then I had, in all probability, already decided. We change our minds less often than we think. And most of the time we become able to guess what our answer will be within half a second of hearing the question. [...] But we change our minds less often—much less often—than we think.

But a) this seems like it's pre-replication crisis, b) regardless, a sample size of 24 is not nearly high enough for me to be very confident in this.

"How often people change their mind" seems like a fairly important question. Anyone know of further work in similar space here? Ideally asking the question from a few different angles.


### Classifying specification problems as variants of Goodhart's Law

Published on August 19, 2019 8:40 PM UTC

There are a few different classifications of safety problems, including the Specification, Robustness and Assurance (SRA) taxonomy and the Goodhart's Law taxonomy. In SRA, the specification category is about defining the purpose of the system, i.e. specifying its incentives. Since incentive problems can be seen as manifestations of Goodhart's Law, we explore how the specification category of the SRA taxonomy maps to the Goodhart taxonomy. The mapping is an attempt to integrate different breakdowns of the safety problem space into a coherent whole. We hope that a consistent classification of current safety problems will help develop solutions that are effective for entire classes of problems, including future problems that have not yet been identified.

The SRA taxonomy defines three different types of specifications of the agent's objective: ideal (a perfect description of the wishes of the human designer), design (the stated objective of the agent) and revealed (the objective recovered from the agent's behavior). It then divides specification problems into design problems (e.g. side effects) that correspond to a difference between the ideal and design specifications, and emergent problems (e.g. tampering) that correspond to a difference between the design and revealed specifications.

In the Goodhart taxonomy, there is a variable U* representing the true objective, and a variable U representing the proxy for the objective (e.g. a reward function). The taxonomy identifies four types of Goodhart effects: regressional (maximizing U also selects for the difference between U and U*), extremal (maximizing U takes the agent outside the region where U and U* are correlated), causal (the agent intervenes to maximize U in a way that does not affect U*), and adversarial (the agent has a different goal W and exploits the proxy U to maximize W).

We think there is a correspondence between these taxonomies: design problems are regressional and extremal Goodhart effects, while emergent problems are causal Goodhart effects. The rest of this post will explain and refine this correspondence.

The SRA taxonomy needs to be refined in order to capture the distinction between regressional and extremal Goodhart effects, and to pinpoint the source of causal Goodhart effects. To this end, we add a model specification as an intermediate point between the ideal and design specifications, and an implementation specification between the design and revealed specifications.

The model specification is the best proxy within a chosen formalism (e.g. model class or specification language), i.e. the proxy that most closely approximates the ideal specification. In a reinforcement learning setting, the model specification is the reward function (defined in the given MDP/R over the given state space) that best captures the human designer's preferences.

• The ideal-model gap corresponds to the model design problem (regressional Goodhart): choosing a model that is tractable but also expressive enough to approximate the ideal specification well.
• The model-design gap corresponds to proxy design problems (extremal Goodhart), such as specification gaming and side effects.

While the design specification is a high-level description of what should be executed by the system, the implementation specification is a specification that can be executed, which includes agent and environment code (e.g. an executable Linux binary). (We note that it is also possible to define other specification levels at intermediate levels of abstraction between design and implementation, e.g. using pseudocode rather than executable code.)

• The design-implementation gap corresponds to tampering problems (causal Goodhart), since they exploit implementation flaws (such as bugs that allow the agent to overwrite the reward). (Note that tampering problems are referred to as wireheading and delusions in the SRA.)
• The implementation-revealed gap corresponds to robustness problems in the SRA (e.g. unsafe exploration).

In the model design problem, U is the best approximation of U* within the given model. As long as the global maximum M for U is not exactly the same as the global maximum M* for U*, the agent will not find M*. This corresponds to regressional Goodhart: selecting for U will also select for the difference between U and U*, so the optimization process will overfit to U at the expense of U*.
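A numeric sketch of this selection effect, with U modeled as U* plus independent noise (the distributions and sample size are arbitrary):

```python
import random

random.seed(0)
n = 1000
true_vals = [random.gauss(0, 1) for _ in range(n)]      # U*
noise     = [random.gauss(0, 1) for _ in range(n)]
proxy     = [u + e for u, e in zip(true_vals, noise)]   # U = U* + noise

best_by_proxy = max(range(n), key=lambda i: proxy[i])
best_by_true  = max(range(n), key=lambda i: true_vals[i])

# Optimizing U also optimizes the gap (U - U*): the proxy winner's
# noise term is provably at least as large as the true winner's.
print(noise[best_by_proxy] >= noise[best_by_true])  # True
```

The inequality printed at the end holds by construction: if the proxy winner had a smaller noise term than the true winner, it could not have beaten it on the proxy while losing on the true value.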

In proxy design problems, U and U* are correlated under normal circumstances, but the correlation breaks in situations when U is maximized, which is an extremal Goodhart effect. The proxy U is often designed to approximate U* by having a maximum at a global maximum M* of U*. Different ways that this approximation fails produce different problems.

• In specification gaming problems, M* turns out to be a local (rather than global) maximum for U, e.g. if M* is the strategy of following the racetrack in the boat race game. The agent finds the global maximum M for U, e.g. the strategy of going in circles and repeatedly hitting the same reward blocks. This is an extrapolation of the reward function outside the training domain that it was designed for, so the correlation with the true objective no longer holds. This is an extremal Goodhart effect due to regime change.

• In side effect problems, M* is a global maximum for U, but U incorrectly approximates U* by being flat in certain dimensions (corresponding to indifference to certain variables, e.g. whether a vase is broken). Then the set of global maxima for U is much larger than the set of global maxima for U*, and most points in that set are not global maxima for U*. Maximizing U can take the agent into a region where U doesn't match U*, and the agent finds a point M that is also a global maximum for U, but not a global maximum for U*. This is an extremal Goodhart effect due to model insufficiency.

Current solutions to proxy design problems involve taking the proxy less literally: by injecting uncertainty (e.g. quantilization), avoiding extrapolation (e.g. inverse reward design), or adding a term for omitted preferences (e.g. impact measures).

In tampering problems, we have a causal link U* -> U. Tampering occurs when the agent intervenes on some variable W that has a causal effect on U that does not involve U*, which is a causal Goodhart effect. W could be the reward function parameters, the human feedback data (in reward learning), the observation function parameters (in a POMDP), or the status of the shutdown button. The overall structure is U* -> U <- W.

For example, in the Rocks & Diamonds environment, U* is the number of diamonds delivered by the agent to the goal area. Intervening on the reward function to make it reward rocks increases the reward U without increasing U* (the number of diamonds delivered).
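A minimal sketch of that dynamic (a hypothetical toy, not the actual Rocks & Diamonds environment): the reward U counts delivered items matching a tamperable parameter W, while U* counts only diamonds.

```python
# Toy model: U counts delivered items matching the reward parameter W;
# the true objective U* counts only diamonds.
def u_star(delivered):
    return delivered.count("diamond")

def u(delivered, w):
    return delivered.count(w)

honest = (["diamond", "diamond"], "diamond")  # deliver diamonds, leave W alone
tamper = (["rock"] * 5, "rock")               # rewrite W to "rock", haul rocks

print(u(*honest), u_star(honest[0]))  # reward tracks the objective
print(u(*tamper), u_star(tamper[0]))  # reward up, objective stuck at zero
```

The structural point is that W gives the agent a causal path into U that bypasses U* entirely, so optimizing U favors the tampering policy.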

Current solutions to tampering problems involve modifying the causal graph to remove the tampering incentives, e.g. by using approval-direction or introducing counterfactual variables.

Mesa-optimization is also a causal Goodhart effect, if considered as a specification problem for the main agent from the human designer's perspective. We can put it in the same framework by defining W as the mesa-optimizer's objective. By selecting for the mesa-optimizer, the base agent 'intervenes' on W in a way that increases its own reward U, but not necessarily the true reward U*. However, when we consider this problem one level down, as a specification problem for the mesa-optimizer from the main agent's perspective, it can take the form of any of the four Goodhart effects. The four types of alignment problems in the mesa-optimization paper can be mapped to the four types of Goodhart's Law as follows: approximate alignment is regressional, side effect alignment is extremal, instrumental alignment is causal, and deceptive alignment is adversarial.

This correspondence is consistent with the connection between the Goodhart taxonomy and the selection vs control distinction, where regressional and extremal Goodhart are more relevant for selection, while causal Goodhart is more relevant for control. The design specification is generated by a selection process, while the revealed specification is generated by a control process. Thus, design problems represent difficulties with selection, while emergent problems represent difficulties with control.

Putting it all together:

In terms of the limitations of this mapping, we are not sure about model specification being the dividing line between regressional and extremal Goodhart. For example, a poor choice of model specification could deviate from the ideal specification in systematic ways that result in extremal Goodhart effects. It is also unclear how adversarial Goodhart fits into this mapping. Since an adversary can exploit any differences between U* and U (taking advantage of the other three types of Goodhart effects), it seems that adversarial Goodhart effects can happen anywhere in the ideal-implementation gap.

We hope that you find the mapping useful for your thinking about the safety problem space, and welcome your feedback and comments. We are particularly interested if you think some of the correspondences in this post are wrong.

(Cross-posted to the Deep Safety blog. Thanks to Jan Leike and Tom Everitt for their helpful feedback on this post.)


### Unstriving

August 19, 2019 - 17:31
Published on August 19, 2019 2:31 PM UTC

Cross-posted from Putanumonit

Successometer

Since I was a kid, I have built my self-esteem on a feeling of “forthcoming greatness”. Whatever I actually accomplished I never paused to be proud of, and I never sweated the failures either. It was all just a stepping stone to something completely different and undeniably awesome, a “life mission” that will be important and meaningful and finally confer on me the title of #SuccessfulPerson. Until that moment came, I just needed to know that I was growing, progressing, improving, optimizing.

But upon entering my second gigasecond I’m starting to realize that this mindset makes little sense going forward, and was perhaps delusional in retrospect as well. But if I give up on rapid improvement and impending awesomeness, I don’t know what can possibly replace them.

By all objective metrics, I’m as successful today as I could hope to be a decade ago. I’m happily married, well inside the richest 1% globally, have found my tribe and earned some respect in it. I should be able to relax and take some satisfaction in my current situation. And yet the thought that in 5 years my life will look exactly like it does today fills me with dread.

What’s the problem? Jordan Peterson’s fourth rule says: compare yourself with who you were yesterday, not with who someone else is today. I do both, and both are a problem.

I seek to be inspired by awesome people, but it is then inevitable that I compare myself to them. There could be many blogs that are worse than Putanumonit, but I don’t have the time to read them. I read SlateStarCodex and the very best curated articles from elsewhere, and in comparison to those Putanumonit seems quite shabby. Offline too, one of the main perks of success is associating with successful people. Being involved in the rationality community in NYC means I hang out with Spencer Greenberg, who runs an Effective Altruism startup foundry while doing social science research and innovative meetups in his spare time — making me feel less than impressive.

But that’s not even the main issue. For one, I’m never jealous of other people’s success, and I know that the people I look up to probably feel inadequate when they read about Elon Musk. It’s the comparison to who I was yesterday that’s more insidious.

Comparing myself to my old self means that my internal successometer measures only the derivative of my life’s trajectory, not my actual situation. This means that as I improve and achieve things it becomes ever harder to maintain the pace of personal growth that makes me subjectively satisfied. The better I do the lower my successometer goes, and the more I am tempted to chase “new challenges” like quitting my job to start a company. I don’t even have a great idea for a company, and there’s certainly nothing wrong with my day job; it just feels like the only option to keep that part of my psyche satisfied.

The Buddha tells me to relinquish this drive and the illusion of forthcoming greatness. After all, there’s no guarantee that this mental state is actually helping me succeed, rather than just making me restless and unhappy for no reason. But since right now it’s still part of me, the threat that I may lose my drive to improve and optimize scares me out of trying to simply drop that sentiment. At the very least, I would have to replace it with another framework for making sense of my life, past and future.

But what could that be? Who would be against striving, optimization, and excellence?

Mediocrity

Survival of the Mediocre Mediocre is almost good enough to betray its own thesis. Venkatesh Rao defines mediocrity not as middling performance on some well-defined measure, but as a general resistance to well-defined measures and their siren calls of optimization. Instead of rewriting the essay, which you should read anyway, here’s a low-effort summary of the differences between excellence and mediocrity:

Excellence reaps rewards in measurable achievement and works best for winner-take-all, well-defined competitions. It is also what earns public admiration, because it is legible. An athlete who excels at a particular sport gains many fans even if they are atrocious performers in every other area of life, like relationships, personal finance, or being able to read.

Mediocrity, on the other hand, helps survive unending scrambles like evolution by natural selection. Rao gives the example of avian dinosaurs, mediocre by the standards of both dinosaurs and modern birds, who survived the Cretaceous extinction by flapping around mediocrely. Mediocrity gives you optionality and the slack to adapt to new opportunities and challenges.

Unfortunately, mediocrity is never satisfying. Before the asteroid hit, the proto-birds couldn’t know that it was coming, and couldn’t feel superior to the majestic apex predator dinos. And after the cataclysm, they didn’t get any credit either — the mediocre always appear merely lucky and opportunistic from the outside.


Rao treats mediocrity or excellence not as immutable qualities (although for a dinosaur, they are) but as intentions, stances one can adopt towards life. Before deciding which one to commit to, let’s see which one has worked for me so far in life. Here’s my excellent biography:

I got into the best high school in Israel and by age 15 was already taking college classes in math. At 16 I was a certified tennis instructor and started making money coaching. I competed in national math Olympiads. I joined the most selective non-combat unit in the Israeli military for my service and pursued a dual degree in math and physics. After the military, I joined a hedge fund and studied for the GMAT in parallel which got me into a top-20 business school in the US. Then I came to New York, got a job at a successful financial company, and started a successful blog.

So aspirational. Much excellence. Very optimizing. Wow! How can this be the life story of a mediocrity?

I was in a good high school and put so little effort in that I was almost expelled in 10th grade. My spare time was mostly spent playing soccer and card games with friends, but a few of us managed to study and pass a special simplified university entrance exam. Despite taking a fraction of the regular university course-load, my grades were always close to the median and ultimately it took me 8 years to finish my BSc from the day I started. I was the worst tennis player by far at the instructors’ academy and passed because a teacher took pity on me and gave me 15 minutes to demonstrate a left-handed serve. I placed 3rd in math Olympiads. I was dismissed from the most selective non-combat unit in the IDF for, I kid you not, “not striving for excellence”. I completed my service in a down-to-earth technology division where my peers were all officers while I remained a sergeant. I joined perhaps the most pointless hedge fund in Israel, got no training, made no profit, and left shortly before the entire company shut down. I got rejected by 6 out of the 7 business schools I applied to. I came to NYC because I failed to secure a full-time offer after an internship at a big company in Atlanta. My job is in financial regulatory software, a lucrative niche without fierce competition. In 2015 I was part of a client engagement so dysfunctional I had literally nothing to do for days on end — so I started Putanumonit. While Spencer Greenberg’s blog is literally called Optimize Everything, mine merely encourages the reader to put a number on it. It doesn’t even have to be the best number, whatever works is fine.

Which narrative is true? I think they both are. Reflecting on my life honestly, I have to conclude that a lot of my success is due to lucky circumstance. But also to a combination of mediocrity and excellence. The common motif of my life is ambling along without too much focus and determination, noticing an easy opportunity that I have the slack to exploit, and then summoning quality effort when it’s needed: passing entrance exams, building the product that established me at my company, writing the post that launched Putanumonit.

This is what Rao calls “mediocre mediocrity”: being just OK at being just OK. My life is mediocrity interspersed with occasional outright failures (the army unit, the job in Atlanta) and occasional bursts of excellence.

Unfortunately, even with this awareness, it’s very difficult to commit to mediocrity. What if I just slack off for years and neither opportunity nor asteroid shows up? Do opportunities even happen to people after 30? Perhaps after a gigasecond of exploration, this should be time to exploit, to pick an important project and pursue excellence. But which project?

The Project

It’s quite likely that the most important pursuit of the next decades of my life will be parenting. And ironically, I think that the best way to parent is to be a deeply mediocre parent.

I’m not the first person to notice that Something is Wrong with Kids These Days (TM) and to tie it to an almost pathological drive by parents to optimize childhood. Helicopter parenting. Snowplow parenting. Tiger moms. Academically tracked selective preschools and elementary schools where 6-year-olds chant “We are college bound!” in unison. Something is wrong, really wrong.

And it’s not just wrong for the kids, it’s wrong for the parents too. Parents are sacrificing every bit of slack they have to give their children one more unasked-for advantage, driving their child to a slightly more prestigious violin teacher who lives half an hour further away. And once a parent has sacrificed money, time, their social life and romantic life, it is very hard to accept that their child is merely a not-bad violin player. He may grow up to play bass for the rock band at the local state college! Ma’am, why are you crying? Ma’am?

There’s a lot of evidence that all this optimized child-rearing does not make children any more optimal, only miserable. Mediocre parenting isn’t guaranteed to produce excellent children either, but it should at least be a lot more fun.

I don’t know if I’m ready to commit to mediocrity wholeheartedly, to give up on striving and optimizing and feeling unsatisfied. But writing this self-therapeutic essay at least makes it seem less scary, a viable life strategy. And perhaps I never get to choose mediocrity, it chose me a long time ago.


### Goodhart's Curse and Limitations on AI Alignment

August 19, 2019 - 10:57
Published on August 19, 2019 7:57 AM UTC

I believe that most existing proposals for aligning AI with human values are unlikely to succeed in the limit of optimization pressure due to Goodhart's curse. I believe this strongly enough that it continues to surprise me a bit that people keep working on things that I think clearly won't work, though I think there are two explanations for this. One is that, unlike me, they expect to approach superhuman AGI slowly and so we will have many opportunities to notice when we are deviating from human values as a result of Goodhart's curse and make corrections. The other is that they are simply unaware of the force of the argument that convinces me because, although it has been written about before, I have not seen recent, pointed arguments for it rather than technical explanations of it and its effects, and my grokking of this point happened long ago on mailing lists of yore via more intuitive and less formal arguments than I see now. I can't promise to make my points as intuitive as I would like, but nonetheless I will try to address this latter explanation by saying a few words about why I am convinced.

Note: Some of this borrows heavily from a paper I have out for publication, but with substantial additions for readability by a wider audience.

Goodhart's Curse

Goodhart's curse is what happens when Goodhart's law meets the optimizer's curse. Let's review those two here briefly for completeness. Feel free to skip some of this if you are already familiar.

Goodhart's Law

As originally formulated, Goodhart's law says "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes". A more accessible expression of Goodhart's law, though, would be that when a measure of success becomes the target, it ceases to be a good measure. A well-known example of Goodhart's law comes from a program to exterminate rats in French-colonial Hanoi, Vietnam: the program paid a bounty for rat tails on the assumption that a rat tail represented a dead rat, but rat catchers would instead catch rats, cut off their tails, and release the rats so they could breed and produce new rats whose tails could be turned in for more bounties. There was a similar case with bounties for dead cobras in British-colonial India, intended to incentivize the reduction of cobra populations, that instead resulted in the creation of cobra farms. And of course we can't forget this classic, though apocryphal, tale:

In the old Soviet Union, the government rewarded factory managers for production quantity, not quality. In addition to ignoring quality, factory managers ignored customer needs. The end result was more and more tractors produced, for instance, even though these tractors just sat unused. Managers of factories that produced nails optimized production by producing either fewer, larger and heavier nails or more, smaller nails. The fact that factories were judged by rough physical quotas rather than by their ability to satisfy customers – their customers were the state – had predictably bad results. If, to take a real case, a nail factory’s output was measured by number, factories produced large numbers of small pin-like nails. If output was measured by weight, the nail factories shifted to fewer, very heavy nails.

Although a joke, the natural endpoint might be the production of a single, giant nail. It's unknown, to the best of my knowledge, whether the nail example above is real, although reportedly something similar really did happen with shoes. Additional examples of Goodhart's law abound:

• targeting easily-measured clicks rather than conversions in online advertising
• optimizing for profits over company health in business
• prioritizing grades and test scores over learning in schools
• maximizing score rather than having fun in video games

As these examples demonstrate, most of us are familiar with Goodhart's law or something similar in everyday life such that it's not that surprising when we learn about it. The opposite seems to be true of the optimizer's curse, being well studied but mostly invisible to us in daily life unless we take care to notice it.

The optimizer's curse

The optimizer's curse observes that when choosing among several possibilities, if we choose the option that is expected to maximize value, we will be "disappointed" (realize less than the expected value) more often than average. This happens because optimization acts as a source of bias in favor of overestimation, even if the estimated value of each option is not biased itself. And the curse is robust, such that even if an agent satisfices (accepts the option with the least expected value that is greater than neutral) rather than optimizes they will still suffer more disappointment than gratification. So each option can be estimated in an unbiased way, yet because there is a bias imposed by a preference for estimations of positive value, we can end up in a situation where we consistently pick options that are more likely to be overestimating their value.
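The effect is easy to see in simulation. Here is a sketch (my own, with arbitrary parameters): every option's true value is 0 and every estimate is unbiased, yet the option we pick disappoints on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_options, n_trials = 10, 100_000

# Every option's true value is 0; estimates are unbiased (zero-mean noise).
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_options))

# Always pick the option with the highest estimate...
picked = estimates.max(axis=1)

# ...then realize the true value (0). The average disappointment is the
# mean of the selected estimates, which sits well above zero.
print(picked.mean())  # ~1.54, the expected max of 10 standard normals
```

The bias comes purely from the selection step: conditioning on "this estimate was the largest" makes positive estimation error much more likely than negative.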

The optimizer's curse has many opportunities to bite us. For example, a company trying to pick a project to invest in to earn the highest rate of return will consistently earn less return than predicted due to the optimizer's curse. Same goes for an investor picking investment instruments. Similarly a person trying to pick the best vacation will, on average, have a worse vacation than expected because the vacation that looks the best is more likely than the other options to be worse than predicted. And of course an AI trying to pick the policy that maximizes human value will usually pick a policy that performs worse than expected, but we'll return to that one later when we consider how it interacts with Goodhart.

I wish I had more, better examples of the optimizer's curse to offer you, especially documented real-world cases that are relatable, but most of what I can find seems to be about petroleum production (no, really!) or otherwise about managing and choosing among capital-intensive projects. The best I can offer you is this story from my own life about shoes:

For a long time, I wanted to find the best shoes. "Best" could mean many things, but basically I wanted the best shoe for all purposes. I wanted the shoe to be technically impressive, so it would have features like waterproofing, puncture-proofing, extreme durability, extreme thinness and lightness, the ability to be worn without socks, and breathability. I also wanted it to look classy and casual, able to mix and match with anything. You might say this is impossible, but I would have said you just aren't trying hard enough. So I tried a lot of shoes. And in every case I was disappointed. One was durable but ugly, another was waterproof but made my feet smell, another looked good but was uncomfortable, and another was just too weird. The harder I tried to find the perfect shoe, the more I was disappointed. Cursed was I for optimizing!

This story isn't perfect: I was optimizing for multiple variables and making tradeoffs, and the solution was to find some set of tradeoffs I would be happiest with and to accept that I was mostly only going to move along the efficiency frontier rather than expand it by trying new shoes, so it teaches the wrong lesson unless we look at it through a very narrow lens. Better examples in the comments are deeply appreciated!

Before moving on, it's worth talking about attempts to mitigate the optimizer's curse. It would seem, since it is a systematic bias, that we could account for the optimizer's curse the same way we do most systematic biases using better Bayesian reasoning. And we can, but in many cases this is difficult or impossible because we lack sufficient information about the underlying distributions to make the necessary corrections. Instead we find ourselves in a situation where we know we suffer bias in our expectations but cannot adequately correct for it such that we can be sure we aren't still suffering from it even if we try not to. Put another way, attempting to correct for the optimizer's curse without perfect information simply shifts the distortions caused by the optimizer's curse to the corrections rather than the original estimates themselves without eliminating the bias.

Given how persistent the optimizer's curse is, it shouldn't surprise us it will pop up when we try to optimize for some measurable target, giving us Goodhart's curse.

Goodhart's curse

Combining Goodhart's law with the optimizer's curse, we get Goodhart's curse: attempts to optimize for a measure of success result in increased likelihood of failure to hit the desired target. Or as someone on Arbital (probably Eliezer) put it: "neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V". In personal terms, you might say that the harder you try to get what you want, the more you'll find yourself doing things that cause you not to get what you want despite trying to act otherwise. I think this point is unintuitive because it feels contrary to the normal narrative that success comes from trying, and trying harder makes it more likely you will succeed, but that might only appear to be true due to survivorship bias. To give you an intuitive feel for this personal expression of Goodhart's curse, another story from my life:

At some tender age, maybe around 11 or 12, I became obsessed with efficiency so I would have time to do more in my life. There was the little stuff, like figuring out the "best" way to brush my teeth or get out of bed. There was the medium stuff, like finding ways to read faster or write without moving my hand as much. And there was the big stuff, like trying to figure out how to get by on less sleep and how to study topics in the optimal order. It touched everything, from shoe tying, to clothes putting on, to walking, to playing, to eating, and on and on. It was personal Taylorism gone mad.

To take a single example, let's consider the important activity of eating breakfast cereal and how that process can be made more efficient. There's the question of how to store the cereal, how to store the milk, how to retrieve the cereal and milk, how to pour the two into the bowl, how to hold the spoon, how to put the cereal in the mouth, how to chew, how to swallow, and how to clean up, to name just a few. Maybe I could save a few seconds if I held the spoon differently, or stored the cereal in a different container, or stored the milk on a different shelf in the refrigerator, or, or, or. By application of experimentation and observation I could get really good at eating cereal, saving maybe a minute or more off my daily routine!

Of course, this was out of a morning routine that lasted over an hour and included a lot of slack and waiting, because I had three sisters and two parents and lived in a house with two bathrooms. But still, one whole minute saved!

By the time I was 13 or 14 I was over it. I had spent a couple years working hard at efficiency, gotten little for it, and lost a lot in exchange. Doing all that efficiency work was hard, made things that were once fun feel like work, and, worst of all, wasn't delivering on the original purpose of doing more with my life.
I had optimized for the measure—time to complete task, number of motions to complete task, etc.—at the expense of the target—getting more done. Yes, I was efficient at some things, but that efficiency was costing so much effort and will power that I was worse off than if I had just ignored the kind of efficiency I was targeting.

In this story, as I did things that I thought would help me reach my target, I actually moved myself further away from it. Eventually it got bad enough that I noticed the divergence and was compelled to course correct, but this depended on me having ever known what the original target was. If I were not the optimizer, and instead say some impersonal apparatus like the state or an AI were, there's considerable risk the optimizer would have kept optimizing and diverging long after it became clear to me that divergence had happened. For an intuitive sense of how this has happened historically, I recommend Seeing Like a State.
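The "upward divergence of U from V" can also be simulated directly (a toy sketch with made-up distributions): the harder we select on the proxy, the larger the gap between proxy and target at the selected point.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_divergence(pressure, trials=2_000):
    """Average gap U - V at the proxy-optimal point, where V is the
    target and U = V + noise is the measured proxy."""
    v = rng.normal(0.0, 1.0, size=(trials, pressure))
    u = v + rng.normal(0.0, 1.0, size=(trials, pressure))
    best = u.argmax(axis=1)              # optimize the proxy
    rows = np.arange(trials)
    return float((u[rows, best] - v[rows, best]).mean())

# More optimization pressure (more candidates searched) means more
# divergence of the proxy from the target at the chosen point.
for n in (10, 100, 1_000):
    print(n, round(mean_divergence(n), 2))
```

The candidate that wins the argmax is disproportionately one whose noise term happens to be large and positive, so selection pressure translates directly into proxy-target divergence.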

I hope by this point you are convinced of the power and prevalence of Goodhart's curse (but if not please let me know your thoughts in the comments, especially if you have ideas about what could be said that would be convincing). Now we are poised to consider Goodhart's curse and its relationship to AI alignment.

Goodhart's curse and AI alignment

Let's suppose we want to build an AI that is aligned with human values. A high level overview of a scheme for doing this is that we build an AI, check to see if it is aligned with human values so far, and then update it so that it is more aligned if it is not fully aligned already.

Although the details vary, this describes roughly the way methods like IRL and CIRL work, and possibly how HCH and safety via debate work in practice. Consequently, I think all of them will fail due to Goodhart's curse.

Caveat: I think HCH and debate-like methods may be able to work and avoid Goodhart's curse, though I'm not certain, and it would require careful design that I'm not sure the current work on these has done. I hope to have more to say on this in the future.

The way Goodhart's curse sneaks into these is that they all seek to apply optimization pressure to something observable that is not exactly the same thing as what we want. In the case of IRL and CIRL, it's an AI optimizing over inferred values rather than the values themselves. In HCH and safety via debate, it's a human preferentially selecting AI that the human observes and then comes to believe does what it wants. So long as that observation step is there and we optimize based on observation, Goodhart's curse applies and we can expect, with sufficient optimization pressure, that alignment will be lost, even and possibly especially without us noticing because we're focused on the observable measure rather than the target.

Yikes!

Beyond Goodhart's curse

Do we have any hope of creating aligned AI if just making a (non-indifferent) choice based on an observation dooms us to Goodhart's curse?

Honestly, I don't know. I'm pretty pessimistic that we can solve alignment, yet in spite of this I keep working on it because I also believe it's the best chance we have. I suspect we may only be able to rule out solutions that are dangerous but not positively select for solutions that are safe, and may have to approach solving alignment by eliminating everything that won't work and then doing something in the tiny space of options we have left that we can't say for sure will end in catastrophe.

Maybe we can get around Goodhart's curse by applying so little optimization pressure that it doesn't happen? One proposal in this direction is quantilization. I remain doubtful, since without sufficient optimization it's not clear how we do better than picking at random.
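For reference, here is a minimal sketch of the quantilization idea (illustrative only, not any particular proposal's implementation): instead of taking the argmax, sample uniformly from the top q-fraction of actions ranked by the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantilize(actions, proxy, q=0.1):
    """Sample uniformly from the top q-fraction of actions ranked by the
    proxy, rather than taking the single proxy-argmax. This bounds how
    hard the proxy gets optimized, at the cost of some proxy value."""
    scores = np.array([proxy(a) for a in actions])
    cutoff = np.quantile(scores, 1.0 - q)
    top = [a for a, s in zip(actions, scores) if s >= cutoff]
    return top[rng.integers(len(top))]

# The argmax would always return 99; the quantilizer spreads its choice
# over the whole top decile, diluting any extreme proxy-exploiting action.
choice = quantilize(list(range(100)), proxy=lambda a: a, q=0.1)
print(choice)
```

The design trade-off is visible even in this toy: shrinking q recovers the argmax (and the full curse), while growing q approaches picking at random, which is exactly the worry raised above.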

Maybe we can get around Goodhart's curse by optimizing the target directly rather than a measure of it? Again, I remain doubtful, mostly due to epistemological issues suggesting all we ever have are observations and never the real thing itself.

Maybe we can overcome either or both issues via pragmatic means that negate enough of the problem that, although we don't actually eliminate Goodhart's curse completely, we eliminate enough of its effect that we can ignore it? Given the risks and the downsides I'm not excited about this approach, but it may be the best we have.

And, if all that wasn't bad enough, Goodhart's curse isn't even the only thing we have to watch out for! Scott Garrabrant and David Manheim have renamed Goodhart's curse to "Regressional Goodhart" to distinguish it from other forms of Goodharting where mechanisms other than optimization may be responsible for divergence from the target. The only reason I focus on Goodhart's curse is that it's the way proposed alignment schemes usually fail; other safety proposals may fail via other Goodharting effects.

All this makes it seem extremely likely to me that we aren't even close to solving AI alignment yet, to the point that we likely haven't even stumbled upon the general mechanism that will work, or if we have we haven't identified it as such. Thus, if there's anything hopeful I can end this on, it's that there's vast opportunity to do good for the world via work on AI safety.


August 18, 2019 - 22:01
Published on August 18, 2019 7:01 PM UTC

Raph Koster is a game designer who's worked on old-school MUDs, Ultima Online, and Star Wars Galaxies among others. His blog is a treasure trove of information on game design, and online community building.

The vibe I get is very sequences-like (or perhaps more like Paul Graham?). There's a particular genre I quite like of "person with decades of experience who's been writing up their thoughts and principles on their industry and craft. Reading through their essays not only reveals a set of useful facts, but an entire lens through which to view things."

I'll most likely use the comments of this post to braindump thoughts or summarize things as I work my way through his corpus.

He has two major books I'm aware of:

A Theory of Fun – An illustrated book whose central thesis is "Games are about learning. When you've learned all you can learn from a game, it becomes boring." (This has a vibe very similar to Scott McCloud's Understanding Comics, which is a similar kind of braindump that teaches a lens to view things.) [PDF of the first few pages here]

Postmortems – This is a collection of essays Raph wrote across his career, starting from text-based Multi-User Dungeons, then Ultima Online, and eventually Star Wars Galaxies. (I think most of this is available online as blog posts, but I purchased it to read on Kindle.)

And meanwhile, there's his blog itself, with posts clustered into topics like:

• Theory of Fun (cognition and games)
• Game Grammar articles
• Game Development Process
• Experience/Narrative Design
• Games as Art
• General Game Design
• Game Economies
• Community Design
• Star Wars Galaxies articles/anecdotes
• Ultima Online articles/anecdotes
• Misc game postmortems
• Ethics in Game Design
• Player Rights (i.e. treating players ethically)
• Gamification
• Community and Marketing

Virtual Worlds vs Games

A central question he keeps revisiting is the difference between a virtual world, and a game. I think this is well encapsulated in "Designing a Living Society in SWG, part one and part two."

A game is "something with rules, where your actions have consequences, and you can achieve some kind of mastery." Whereas a world is more like, well, a world – a breathing ecosystem where consequences are persistent, and things can interact with each other over the long term.

Most games he's been involved with have been worlds first, games second. In Ultima Online you deliberately didn't have access to a global chat. If you wanted to talk to someone, you had to find them and chat "in person." Objects you dropped on the ground stayed there permanently. Monsters interacted with each other. Players built houses, and this eventually resulted in a land/housing crisis.

Ultima Online was famous for dealing with "excessive playerkilling", where it was hard to leave town because aggressive players would murder you.

Raph spent years trying to build tools that enabled players to solve this problem themselves – it felt very significant and important to him that the roleplayers actually had to defend themselves against the playerkillers, and he saw it as a failure if a problem was resolved via "running to daddy" (i.e. getting an admin involved).

And this struggle seemed far more real and important to him than players being able to defeat a dragon, or whatever. If an online, anonymized world of players could learn to impose law and order onto chaos, and literally defeat evil – that would be far more meaningful than any hand-crafted challenge.

(this actually bears some relation to Gordon's comment elsewhere about games as simulations and learning)

[I feel like there would be value in me writing up extensive summaries of all this, but then I was like "Huh, if I'm going to spend that much time I might as well spend that time helping to distill the LW sequences or something." But will try to write up additional notes in the comments here as I come across them]

Discuss

### "Can We Survive Technology" by von Neumann

August 18, 2019 - 21:58
Published on August 18, 2019 6:58 PM UTC

"The great globe itself" is in a rapidly maturing crisis —a crisis attributable to the fact that the environment in which technological progress must occur has become both undersized and underorganized. To define the crisis with any accuracy, and to explore possibilities of dealing with it, we must not only look at relevant facts, but also engage in some speculation. The process will illuminate some potential technological developments of the next quarter-century.

Discuss

### Prokaryote Multiverse. An argument that potential simulators do not have significantly more complex physics than ours

August 18, 2019 - 07:22
Published on August 18, 2019 4:22 AM UTC

Definitions

"Universe" can no longer be said to mean "everything"; such a definition wouldn't be able to explain the existence of the word "multiverse". I define universe as a region of existence that, from the inside, is difficult to see beyond.

I define "Multiverse" as: Everything, with a connoted reminder; "everything" can be presumed to be much larger and weirder than "everything that you have seen or heard of".

What this argument is for

This argument disproves the simulation argument for simulators hailing from universes much more complex than our own. Complex physics would support much, much more powerful computers (I leave proving this point as an exercise to the reader). If we had to guess what our simulators might look like, our imagination might go first to universes where simulating an entire pocket universe like ours is easy, universes which are to us as we are to Flatland or to Conway's Game of Life. We might imagine universes with more spatial dimensions or forces that we lack.

I will argue that this would be vanishingly unlikely.

This argument does not refute the common bounded simulation argument of simple universes (which includes ancestor simulations). It does carve it down a bit. It seems to be something that, if true, would be useful to know.

The argument

The first fork of the argument is that a more intricate machine is much less likely to generate an interesting output.

Life needs an interesting output. Life needs a very even combination of possibility, stability, and randomness. The more variables you add to the equation, the smaller the hospitable region within the configuration space. The hospitable configuration-region within our own physics appears to be tiny (wikipedia, anthropic coincidences) (and I'm sure it is much tinier than is evidenced there). The more variables a machine has to align before it can support life, the more vanishingly small the cradle will be within that machine's spaces.

The second fork of the argument is that complex physics are simply the defining feature of a theory that fails Kolmogorov's razor (our favoured formalisation of Occam's razor).

If we are to define some prior distribution over what exists, out beyond what we can see, Kolmogorov complexity seems like a sensible metric to use. A universe generated by a small machine is much more likely a priori - perhaps we should assume it occurs with much greater frequency - than a universe that can only be generated by a large machine.

If you have faith in Solomonoff induction, you must assign lower measure to complex universes even before you consider those universes' propensity to spawn life.

I claim that one large metaphysical number will be outweighed by another large metaphysical number. I propose that the maximum number of simple simulated universes that could be hosted within a supercomplex universe is unlikely to outnumber the natural instances of simple universes that lie about in the multiverse's bulk.

Discuss

### Neural Nets in Python 1

August 18, 2019 - 05:48
Published on August 18, 2019 2:48 AM UTC

Introduction

This post is an attempt to explain how to write a neural network in Python using numpy. I am obviously not the first person to do this. Almost all of the code here is adapted from Michael Nielsen's fantastic online book Neural Networks and Deep Learning. Victor Zhou also has a great tutorial in Python. Why am I trying to do the same? Partially, it's for my own benefit: cataloging my code so I can refer back to it later in a form more captivating than a mere docstring. Partially, I also think I can share a few intuitions which make the backpropagation equations a lot easier to derive.

Okay, so here's a typical picture of a neural network:

The typical picture, while good for representing the general idea of a neural net, does not do a good job of showing the different operations being performed. I prefer representing a neural net as a computational graph, like below:

Here, it's clearer to see how each node is a function of the step before it. A normal three-layer neural network is given by the following composition of functions:

f0=X=input

f1=W1⋅f0+b1

f2=a(f1)

f3=W2⋅f2+b2

f4=a(f3)=^Y=predicted output

This recursive definition will make it easy to derive the backpropagation algorithm, which we'll use to train our network. It also allows us to easily unroll the function, if we want to see what's going on in one line, by substituting until we get back to the input:

f4=a(W2⋅a(W1⋅f0+b1)+b2)

And of course, if our neural network has more than three layers, we just add more recursively defined functions.

Forward Pass

The process of taking an input and applying the sequence of matrix multiplications, vector additions, and activation functions to get the output is referred to as the **forward pass**. To get started, let's write a neural net class that can perform a forward pass, given the dimensions for each layer and an activation function:

```python
import numpy as np

class NN():
    def __init__(self, sizes, activation):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.x_dim = sizes[0]
        self.b = [np.random.randn(1, y) for y in sizes[1:]]
        self.w = [np.random.randn(x, y)
                  for x, y in zip(sizes[:-1], sizes[1:])]
        self.activ = activation

    def forward_pass(self, input, classify=False):
        if input.ndim == 1:
            input = input.reshape(1, -1)
        for i in range(self.num_layers - 1):
            input = self.activ.fn(np.dot(input, self.w[i]) + self.b[i])
        return input
```

A quick explanation: our class takes in an array of layer sizes and creates appropriate weight matrices with values drawn from a standard normal distribution. EX: [20, 50, 10] would result in weight matrices of dimensions 20×50 and 50×10. For the forward_pass function, we can see from the computational graph that each matrix multiplication (and vector addition) is followed by an application of the activation function, so we can simply loop until we have gone through all of our weight matrices.
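As a quick sanity check of the shapes involved, here's the same forward pass written out inline (a minimal sketch of my own, assuming a sigmoid activation object with the .fn interface the class expects; any element-wise nonlinearity would do):

```python
import numpy as np

# Minimal activation object exposing the .fn interface used above.
# (The sigmoid here is an illustrative choice, not mandated by the post.)
class Sigmoid:
    @staticmethod
    def fn(x):
        return 1.0 / (1.0 + np.exp(-x))

# A [20, 50, 10] network: weight matrices of shape 20x50 and 50x10.
sizes = [20, 50, 10]
b = [np.random.randn(1, y) for y in sizes[1:]]
w = [np.random.randn(x, y) for x, y in zip(sizes[:-1], sizes[1:])]

x = np.random.randn(4, 20)   # a batch of 4 inputs
out = x
for wi, bi in zip(w, b):
    out = Sigmoid.fn(np.dot(out, wi) + bi)

print(out.shape)  # (4, 10)
```

Note how the batch dimension rides along untouched: each layer only contracts the feature axis.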

Loss Function

So far, I haven't explained how the neural net is supposed to actually work. Say we have some data and their associated target values (EX: different measurements of divers and how long they can hold their breath). Using the above code, even if we get the dimensions of the input/output right, our forward pass is going to give us garbage results.

This is because we randomly initialized our weights and biases. We don't want *any* set of weights and biases, but a "good" set of weights and biases. To do so, we now need to define what we mean by "good". At the very least, it seems that a good set of weights and biases should lead to predicted values which are close to the associated target values, for most of the data we have.

This is where loss functions come in. They take in as input our predicted value and the true value and output a measure of just how far apart the two values are. There are many functions we could choose to measure the distance between Y and ^Y. For ease of explanation, we'll go with the squared L2 norm of their difference, i.e. the sum of the squares of their differences.
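Concretely, for a single prediction this loss is just a sum of squared differences (a small illustration of my own):

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """Squared L2 norm of the difference: sum of squared errors."""
    return np.sum((y_pred - y_true) ** 2)

y_pred = np.array([0.8, 0.1, 0.6])
y_true = np.array([1.0, 0.0, 0.5])
loss = l2_loss(y_pred, y_true)
print(loss)  # 0.04 + 0.01 + 0.01 ≈ 0.06
```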

Now, after taking a forward pass, we can use our loss function to tell us just how far away our predicted value is from the true value. It's easy to add this change to our class:

```python
class NN():
    def __init__(self, sizes, activation, loss):
        self.loss = loss
        # ... the rest is unchanged ...

    # forward_pass is unchanged

    def eval_loss(self, input, output):
        return self.loss.fn(self.forward_pass(input), output)
```


We can easily integrate this into our model by adding it to our computational graph:

Now we've added one more step:

f0=X=input

f1=W1⋅f0+b1

f2=a(f1)

f3=W2⋅f2+b2

f4=a(f3)=^Y=predicted output

f5=∥f4−Y∥2

Now that we have our loss function defined, we can begin the work of actually optimizing our network, because we have an answer to the question, "Optimizing with respect to what?" Recall that our network is parameterized by a set of weights and biases. (There's also the activation function, but that's more of a fixed thing we can't really fine-tune.)

Backpropagation

Backpropagation allows us to figure out how much each weight and bias is responsible for the loss. We do this by taking partial derivatives of the loss function with respect to each weight matrix and bias vector. Given that a neural net is just a big composite function, we'll be using the Chain Rule a lot. This is where the recursive notation shines. It's much easier to have a placeholder like f3 than a big clump of nested parentheses.

The reason we are taking partial derivatives at all is that they'll allow us to perform iterative optimization, e.g. gradient descent, on our network, which is how the "training" happens.

We'll start with the biases b1,b2,... first:

First, let's find ∂f5/∂b2. From above, we've already defined f5 to be the loss function applied after a forward pass, so that's why we're taking the partial derivative of f5 with respect to b2. Note below that a′ is the derivative of the activation function.

∂f5/∂b2 = (∂f5/∂f4)(∂f4/∂b2) = 2(^Y−Y) ∂f4/∂b2

∂f4/∂b2 = (∂f4/∂f3)(∂f3/∂b2) = a′(f3) ∂f3/∂b2

∂f3/∂b2 = 1

Thus, ∂f5/∂b2 = 2(^Y−Y)a′(f3).

Next, let's find ∂f5/∂b1. (Below, I've omitted the intermediary step of showing ∂g/∂x = (∂g/∂f)(∂f/∂x).)

∂f5/∂b1 = 2(^Y−Y) ∂f4/∂b1

∂f4/∂b1 = a′(f3) ∂f3/∂b1

∂f3/∂b1 = W2 ⋅ ∂f2/∂b1

∂f2/∂b1 = a′(f1) ∂f1/∂b1

∂f1/∂b1 = 1

Thus ∂f5/∂b1 = 2(^Y−Y)a′(f3) ⋅ W2 ⋅ a′(f1).

Before we go any further, there are two useful things to notice:

1. The loss function's derivative (in this case, 2(^Y−Y)) will always be the first term in the partial derivative of the loss with respect to any weight or bias.

2. The partial derivatives of the bias vectors are recursively defined.
∂L/∂bn−1 = (∂L/∂bn) ⋅ Wn ⋅ a′(zn−1)

where zc is defined to be the result of Wc⋅f(2c−2)+bc. In other words, zc is the result of multiplying the previous layer by the cth weight matrix and adding the cth bias vector. We let L represent the general loss function, applied after an arbitrary number of layers.

Let's do the weight matrices W1,W2,... next:

First, let's find ∂f5/∂W2

∂f5/∂W2 = 2(^Y−Y) ∂f4/∂W2

∂f4/∂W2 = a′(f3) ∂f3/∂W2

∂f3/∂W2 = f2 = a(f1)

Thus, ∂f5/∂W2 = 2(^Y−Y)a′(f3)f2 = (∂f5/∂b2)a(f1).

Now we find ∂f5/∂W1

∂f5/∂W1 = 2(^Y−Y) ∂f4/∂W1

∂f4/∂W1 = a′(f3) ∂f3/∂W1

∂f3/∂W1 = W2 ⋅ ∂f2/∂W1

∂f2/∂W1 = a′(f1) ∂f1/∂W1

∂f1/∂W1 = f0 = X

Thus ∂f5/∂W1 = 2(^Y−Y)a′(f3) ⋅ W2 ⋅ a′(f1)f0 = (∂f5/∂b1)f0

Here, in both partial derivatives, we see something useful: The partial derivative of the loss function with respect to a weight matrix can be calculated in part using the partial derivative of the loss function with respect to the bias vector in the same layer. The extra term we need is the activation function applied element-wise to the layer before it. In other words: ∂L/∂Wn = (∂L/∂bn) a(zn−1).

Thus, as long as we store both the results of zc and a(zc) during a forward pass operation, we'll have most of the information we need to calculate the partial derivatives. We're now ready to write the code:

```python
class NN:
    def backprop(self, x, y):
        # Forward pass, storing pre-activations (z) and activations.
        z = []
        activations = [x]
        for i in range(self.num_layers - 1):
            x = np.dot(x, self.w[i]) + self.b[i]
            z.append(x)
            x = self.activ.fn(x)
            activations.append(x)
```

To start with, we perform a forward pass. Along the way, we store the results in activations and z. One small caveat: we start with x in activations as well, because our recursive definition bottoms out at the input value, so we need it for the gradients at the first layer.
Now, we go backwards and recursively calculate our gradients:

```python
class NN:
    def backprop(self, x, y):
        '''
        same as above
        '''
        deltas = []
        b_grad = []
        w_grad = []
        for i in range(len(z)):
            if i == 0:
                # Derivative of the loss, times a' at the last layer.
                delta = self.loss.deriv(activations[-1], y) * self.activ.deriv(z[-1])
                deltas.append(delta)
            else:
                # Recursive case: previous delta times the next weight matrix,
                # times a' of the next z value.
                deltas.append(np.dot(deltas[i-1], self.w[-i].T)
                              * self.activ.deriv(z[-i-1]))
            b_grad.append(np.sum(deltas[i], axis=0))
            w_grad.append(np.dot(activations[-2-i].T, deltas[i]))
        return w_grad, b_grad
```

deltas is a list holding ∂L/∂bc values. The first case handles the derivative of the loss function. We pass in activations[-1], which represents the output of our neural net (as it's the activation of the last layer), and multiply it by activ.deriv, the derivative of the activation function (which we assume we've defined elsewhere).

Otherwise, we follow the recursive formula from earlier: we multiply the previous delta value by the next weight matrix, and multiply that by a′ of the next zn value. To get the ∂L/∂bc value, we simply take the current value of delta (and sum up if our input was a matrix rather than a vector). To get the ∂L/∂Wc value, we follow the recursive formula and perform one more matrix multiplication (we index activations by [-2-i] because we added x as an extra value when starting out).

And we're done! We've now calculated the partial derivatives for all the weights and biases. Next time, we'll dive into different optimization methods and go over how to put these gradients to use.
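A standard way to sanity-check hand-derived gradients like these (a common debugging trick, not part of the post) is to compare them against numerical finite differences on a tiny network. All names below are illustrative; a sketch with a one-layer sigmoid network and L2 loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))
x = rng.normal(size=(1, 3))
y = rng.normal(size=(1, 2))

def loss(b):
    y_hat = sigmoid(np.dot(x, W) + b)
    return np.sum((y_hat - y) ** 2)

# Analytic gradient: dL/db = 2(y_hat - y) * a'(z), with a'(z) = a(z)(1 - a(z)).
z = np.dot(x, W) + b
y_hat = sigmoid(z)
analytic = 2 * (y_hat - y) * y_hat * (1 - y_hat)

# Numerical gradient via central differences.
eps = 1e-6
numeric = np.zeros_like(b)
for i in range(b.shape[1]):
    db = np.zeros_like(b)
    db[0, i] = eps
    numeric[0, i] = (loss(b + db) - loss(b - db)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

If the two disagree, the bug is almost always an off-by-one in which z or activation gets multiplied in.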

Discuss

### Inspection Paradox as a Driver of Group Separation

August 18, 2019 - 00:47
Published on August 17, 2019 9:47 PM UTC

Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac? -- George Carlin

The Inspection Paradox, where the reported results are heavily observer-dependent, has been mentioned here a couple of times before:

https://www.lesswrong.com/posts/HW2fbbGM8B6y7pkDb/the-just-world-hypothesis#8r2uaA6Mb25k2vF9Z

An example of it that is familiar to everyone is that, when driving at an average speed, you see all other cars separating into two categories: slow drivers and fast drivers, because you naturally encounter more cars that are faster or slower than you are, and none that move at the same speed. So a normal distribution of speeds becomes bimodal, from something like this:

to something like this:
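For a rough feel of the effect, here's a toy simulation (my own illustration, with made-up numbers): cars' speeds are drawn from a normal distribution, and the rate at which you encounter a car is taken to be roughly proportional to how much its speed differs from yours.

```python
import numpy as np

# Cars' speeds are normal around 65 mph; you drive exactly 65.
# Re-weighting by |v - 65| (your encounter rate with each car)
# hollows out the middle of the distribution you observe.
rng = np.random.default_rng(1)
speeds = rng.normal(65, 5, size=100_000)
weights = np.abs(speeds - 65)
weights /= weights.sum()
observed = rng.choice(speeds, size=100_000, p=weights)

def near_you(v):
    # Fraction of cars within 1 mph of your own speed.
    return np.mean(np.abs(v - 65) < 1)

print(near_you(speeds), near_you(observed))  # e.g. ~0.16 vs ~0.02
```

The true distribution is unimodal, but the encountered one has a dip right at your own speed: the apparent bimodality.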

Another familiar example, from the newer post by the original author, is that for an average Facebook user, their friends have more friends than they do, because, naturally, members with more connections are more likely to have a connection with you, among others.

(For the record, the inspection paradox can be classified as a subset of sampling bias, which, in turn, is a form of the oft discussed selection bias.)

But back to the apparent multi-modality. You encounter it whenever unusual events have a higher availability, sometimes properly measured like in the average driver case above, and sometimes perceived, like in the availability heuristic. This doesn't have to be about the frequency of the observations; it could be about their emotional impact. If the more "out there" observations affect you more strongly, then their distribution will appear bimodal, and you might instinctively recoil from them and seek the comfort of the in-group. In this case the inspection-induced multi-modality may turn into an actual one, a case of perception becoming reality. (I had attempted to model it numerically some months ago, in two companion posts, sadly not written well enough to attract much interest, but showing that the effect described may well be real.)

Discuss

### Problems in AI Alignment that philosophers could potentially contribute to

August 17, 2019 - 20:38
Published on August 17, 2019 5:38 PM UTC

(This was originally a comment that I wrote as a follow up to my question for William MacAskill's AMA. I'm moving it since it's perhaps more on-topic here.)

It occurs to me that another reason for the lack of engagement by people with philosophy backgrounds may be that philosophers aren't aware of the many philosophical problems in AI alignment that they could potentially contribute to. So here's a list of philosophical problems that have come up just in my own thinking about AI alignment.

• Decision theory for AI / AI designers
• How to resolve standard debates in decision theory?
• Logical counterfactuals
• Open source game theory
• Acausal game theory / reasoning about distant superintelligences
• Infinite/multiversal/astronomical ethics
• Should we (or our AI) care much more about a universe that is capable of doing a lot more computations?
• What kinds of (e.g. spatial-temporal) discounting is necessary and/or desirable?
• Fair distribution of benefits
• How should benefits from AGI be distributed?
• For example, would it be fair to distribute it equally over all humans who currently exist, or according to how much AI services they can afford to buy?
• What about people who existed or will exist at other times and in other places or universes?
• Need for "metaphilosophical paternalism"?
• However we distribute the benefits, if we let the beneficiaries decide what to do with their windfall using their own philosophical faculties, is that likely to lead to a good outcome?
• Metaphilosophy
• What is the nature of philosophy?
• What constitutes correct philosophical reasoning?
• How to specify this into an AI design?
• Philosophical forecasting
• How are various AI technologies and AI safety proposals likely to affect future philosophical progress (relative to other kinds of progress)?
• Preference aggregation between AIs and between users
• How should two AIs that want to merge with each other aggregate their preferences?
• How should an AI aggregate preferences between its users?
• Normativity for AI / AI designers
• What is the nature of normativity? Do we need to make sure an AGI has a sufficient understanding of this?
• Metaethical policing
• What are the implicit metaethical assumptions in a given AI alignment proposal (in case the authors didn't spell them out)?
• What are the implications of an AI design or alignment proposal under different metaethical assumptions?
• Encouraging designs that make minimal metaethical assumptions or are likely to lead to good outcomes regardless of which metaethical theory turns out to be true.
• (Nowadays AI alignment researchers seem to be generally good about not placing too much confidence in their own moral theories, but the same can't always be said to be true with regard to their metaethical ideas.)

Discuss

### How can you use music to boost learning?

August 17, 2019 - 09:59
Published on August 17, 2019 6:59 AM UTC

I often find that I am able to appreciate the beauty of a subject more while listening to music (especially instrumental music). Hearing the notes while I think about the topic helps create a lot of subconscious connections with the material, solidifying what I am learning as a distinct set of memories. In general, I think that associations with powerful sensory experiences are just a good way to remember things and learn.

However, I have also heard that listening to music can distract you when you are trying to do deep work. Apparently, there is some research on this, but I have barely scratched the surface of the literature, and I wouldn't know where to start.

Is there an optimal way to use music to learn? Should I employ certain strategies, like putting on the music only after I've read something, so that I can think about what I just read while music floats through my consciousness?

Discuss

### A Primer on Matrix Calculus, Part 3: The Chain Rule

August 17, 2019 - 04:50
Published on August 17, 2019 1:50 AM UTC

This post concludes the subsequence on matrix calculus. Here, I will focus on an exploration of the chain rule as it's used for training neural networks. I initially planned to include Hessians, but perhaps for that we will have to wait.

Conceptually, combining these two parts is easy. What's hard is making the whole thing efficient so that we can get our neural networks to actually train on real world data. That's where backpropagation enters the picture.

Backpropagation is simply a technique to train neural networks by efficiently using the chain rule to calculate the partial derivatives of each parameter. However, backpropagation is notoriously a pain to deal with. These days, modern deep learning libraries provide tools for automatic differentiation, which allow the computer to automatically perform this calculus in the background. However, while this might be great for practitioners of deep learning, here we primarily want to understand the notation as it would be written on paper.1 Plus, if we were writing our own library, we'd want to know what's happening in the background.

What I have discovered is that, despite my initial fear of backpropagation, it is actually pretty simple to follow if you just understand the notation. Unfortunately, the notation can get a bit difficult to deal with (and was a pain to write out in LaTeX).

We start by describing the single variable chain rule. This is simply d/dx f(g(x)) = f′(g(x))g′(x). But if we write it this way, then it's in an opaque notation that hides which variables we are taking the derivative with respect to. Alternatively we can write the rule in a way that makes it more obvious what we are doing: d/dx f(g(x)) = (df/dg)(dg/dx), where g is meant as shorthand for g(x). This way it is intuitively clear that we can cancel the fractions, and this reduces to df/dx, as desired.
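As a quick numerical check of this rule (with an arbitrary choice of f = sin and g(x) = x² + 1, my own example):

```python
import numpy as np

# Check d/dx f(g(x)) = f'(g(x)) g'(x) numerically at a sample point.
f, fp = np.sin, np.cos          # f and its derivative
g = lambda x: x**2 + 1          # inner function
gp = lambda x: 2 * x            # its derivative

x0, eps = 0.7, 1e-6
numeric = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2 * eps)  # central difference
analytic = fp(g(x0)) * gp(x0)                            # chain rule
print(abs(numeric - analytic) < 1e-8)  # True
```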

It turns out that for a function f:Rn→Rm and g:Rk→Rn, the chain rule can be written as ∂/∂x f(g(x)) = (∂f/∂g)(∂g/∂x), where ∂f/∂g is the Jacobian of f with respect to g.

Isn't that neat. Our understanding of Jacobians has now well paid off. Not only do we have an intuitive understanding of the Jacobian, we can now formulate the vector chain rule using a compact notation — one that matches the single variable case perfectly.2

However, in order to truly understand backpropagation, we must go beyond mere Jacobians. In order to work with neural networks, we need to introduce the generalized Jacobian. If the Jacobian from yesterday was spooky enough already, I recommend reading no further. Alternatively if you want to be able to truly understand how to train a neural network, read at your own peril.

First, a vector can be seen as a list of numbers, and a matrix can be seen as an ordered list of vectors. An ordered list of matrices is... a tensor of order 3. Well not exactly. Apparently some people are actually disappointed with the term tensor because a tensor means something very specific in mathematics already and isn't just an ordered list of matrices.3 But whatever, that's the term we're using for this blog post at least.

As you can probably guess, a list of tensors of order n is a tensor of order n+1. We can simply represent tensors in code using multidimensional arrays. In the case of the Jacobian, we were taking the derivative of functions between two vector spaces, Rn and Rm. When we are considering a mapping from a space of tensors of order n to a space of tensors of order m, we denote the relationship y=f(x) as between the spaces R(N1×N2×...×Nn)→R(M1×M2×...×Mm).

The generalized Jacobian J between these two spaces is an object with shape (M1×M2×...×Mm)×(N1×N2×...×Nn). We can think of this object as a generalization of the matrix, where each row has the same shape as the tensor x and each column has the same shape as the tensor y. The intuitive way to understand the generalized Jacobian is that we can index J with vectors →i and →j. At each index in J we find the partial derivative between the variables y→i and x→j, which are scalar variables located in the tensors y and x.

Formulating the chain rule using the generalized Jacobian yields the same equation as before: for z=f(y) and y=g(x), ∂z/∂x = (∂z/∂y)(∂y/∂x). The only difference this time is that ∂z/∂x has the shape (K1×...×KDz)×(M1×...×MDx), which is itself formed by the result of a generalized matrix multiplication between the two generalized matrices, ∂z/∂y and ∂y/∂x. The rule for this generalized matrix multiplication is similar to regular matrix multiplication, and is given by the formula:

(∂z/∂x)i,j = ∑k (∂z/∂y)i,k (∂y/∂x)k,j

However, where this differs from matrix multiplication is that i,j,k are vectors which specify the location of variables within a tensor.
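In numpy, this generalized matrix multiplication is just a tensor contraction over the shared index k, e.g. via np.tensordot. A sketch with arbitrary shapes (my own illustration, not code from any particular autodiff library):

```python
import numpy as np

# dz/dy has shape (K1 x K2) x (N1): z is a (2, 3) tensor, y a (4,) vector.
# dy/dx has shape (N1) x (M1 x M2): x is a (5, 6) tensor.
K1, K2, N1, M1, M2 = 2, 3, 4, 5, 6
dz_dy = np.random.randn(K1, K2, N1)
dy_dx = np.random.randn(N1, M1, M2)

# Contract over the shared index k (here, the single axis of y):
dz_dx = np.tensordot(dz_dy, dy_dx, axes=([2], [0]))
print(dz_dx.shape)  # (2, 3, 5, 6)
```

The result has the shape of z's dimensions followed by x's dimensions, exactly as the formula above dictates.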

Let's see if we can use this notation to perform backpropagation on a neural network. Consider a neural network defined by the following composition of simple functions: f(x)=W2(relu(W1x+b1))+b2. Here, relu describes the activation function of the first layer of the network, which is defined as the element-wise application of relu(x)=max(x,0). There are a few parameters of this network: the weight matrices, and the biases. These parameters are the things that we are taking the derivative with respect to.

There is one more part to add before we can train this abstract network: a loss function. In our case, we are simply going to train the parameters with respect to the loss function L(^y,y)=||^y−y||22 where ^y is the prediction made by the neural network, and y is the vector of desired outputs. In full, we are taking ∂∂wL(f(x),y), for some weights w, which include W1,W2,b1,b2. Since this loss function is parameterized by a constant vector y, we can henceforth treat the loss function as simply L(f(x)).

Ideally, we would not want to make this our loss function. That's because the true loss function should be over the entire dataset — it should take into account how good the predictions were for each sample that it was given. The way that I have described it only gave us the loss for a single prediction.

However, taking the loss over the entire dataset is too expensive and converges slowly. Alternatively, taking the loss over a single point (i.e. stochastic gradient descent) is also too slow because it doesn't allow us to take advantage of parallel hardware. So, actual practitioners use what's called mini-batch descent, where the loss function is over some subset of the data. For simplicity, I will just show the stochastic gradient descent step.

For ∂/∂b2 L(f(x)) we have ∂L/∂b2 = (∂L/∂f)(∂f/∂b2). From the above definition of f, we can see that ∂f/∂b2 = I, where I is the identity matrix. From here on I will simply assume that the partial derivatives are organized in some specific manner which I leave implicit. The exact way it's written doesn't actually matter too much as long as you understand the shape of the Jacobian being represented.

We can now evaluate ∂f/∂W2. Let U be relu(W1x+b1). Then computing the derivative ∂f/∂W2 comes down to finding the generalized Jacobian of W2U with respect to W2. I will illustrate what this generalized Jacobian would look like by building up from analogous, lower order derivatives. The derivative dy/dx of y=cx is c. The gradient ∇x of c⊺x is c. The Jacobian Jx of Ux is U. We can therefore see that the generalized Jacobian JW2 of W2U will be some type of order 3 tensor which looks like a simple expression involving U.

The derivatives for the rest of the weight matrices can be computed similarly to the derivatives I have indicated for b2 and W2. We simply need to evaluate the terms later on in the chain (∂L/∂f)⋯(∂v/∂W1), where v is shorthand for the function v=W1x.

We have, however, left out one crucial piece of information, which is how to calculate the derivative over the relu function. To do that we simply separate the derivative into a piecewise function. When the input is less than zero, the derivative is 0. When the input is greater than zero, the derivative is 1. But since the function is not differentiable at 0, we just pretend that it is and make its derivative 0; this doesn't cause any issues.

∂/∂x relu(x) = {0 if x ≤ 0; 1 if x > 0}
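In code, the element-wise relu and its derivative might look like this (a small sketch of my own; the derivative-is-0-at-0 convention is the one just described):

```python
import numpy as np

# Element-wise relu and its (sub)derivative, with the derivative at 0 set to 0.
relu = lambda x: np.maximum(x, 0)
relu_deriv = lambda x: (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(relu_deriv(x))  # [0. 0. 1.]
```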

This means that we are pretty much done, as long as you can fill in the details for computing the generalized Jacobians. The trickiest part in the code is simply making sure that all the dimensions line up. Now, once we have computed our derivatives, we can incorporate this information into some learning algorithm like Adam, and use it to update the parameters and continue training the network.

There are, however, many ways that we can make the algorithm more efficient than one might make it during a naive implementation. I will cover one method briefly.

We can start by taking into account the direction in which we calculate the Jacobians. In particular, if we consider some chain (∂L/∂f)⋯(∂v/∂W1), we can take advantage of the fact that tensor-tensor products are associative. Essentially, this means that we can start by computing the last derivative ∂v/∂W1 and then multiplying forward. This is called forward accumulation. We can also compute this expression in reverse, which is referred to as reverse accumulation.
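The associativity claim is easy to check numerically: grouping the chain from either end gives the same result, only the cost differs. A sketch with a scalar loss and illustrative shapes (my own example):

```python
import numpy as np

# (A @ B) @ C == A @ (B @ C): same answer, different cost.
# Reverse accumulation groups from the loss end (cheap here, since the loss
# is a scalar and A is a row vector); forward accumulation groups from the
# parameter end (a full matrix-matrix product first).
A = np.random.randn(1, 100)      # dL/df  (scalar loss => row vector)
B = np.random.randn(100, 100)    # intermediate Jacobian
C = np.random.randn(100, 50)     # dv/dW1 (flattened, for illustration)

reverse = (A @ B) @ C   # two vector-matrix products
forward = A @ (B @ C)   # one big matrix-matrix product first
print(np.allclose(reverse, forward))  # True
```

This is why reverse accumulation (backpropagation) wins for neural networks: the loss end of the chain is a scalar, so multiplying from that side keeps every intermediate product small.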

Besides forward and reverse accumulation, there are more complex intricacies involved in fully optimizing a library. From Wikipedia,

Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. The problem of computing a full Jacobian of f : ℝn → ℝm with a minimum number of arithmetic operations is known as the optimal Jacobian accumulation (OJA) problem, which is NP-complete.

Now if you've followed this post and the last two, and filled in some of the details I (sloppily) left out, you should be well on your way to being able to implement efficient backpropagation yourself. Perhaps read this famous paper for more ways to make it work.

1 This is first and foremost my personal goal, rather than a goal that I expect the readers here to agree with.

2 If you want to see this derived, see section 4.5.3 in the paper.

3 The part about people being disappointed comes from my own experience, as it's what John Canny said in CS 182. The definition of Tensor can be made more precise as a multidimensional array that satisfies a specific transformation law. See here for more details.

Discuss

### Beliefs Are For True Things

August 16, 2019 - 02:23
Published on August 15, 2019 11:23 PM UTC

One of the core principles -- maybe the most core principle -- of the art of rationality is that beliefs are for true things. In other words, you should believe things because they are true. You should not believe things that are not true.

Holding that beliefs are for true things means that you do not believe things because they are useful, believe things because they sound nice, or believe things because you prefer them to be true. You believe things that are true (or at least that you believe to be true, which is often the best we can get!).

Eliezer referred to this principle as "the void", writing in his "The Twelve Virtues of Rationality":

Before these eleven virtues is a virtue which is nameless.

Miyamoto Musashi wrote, in The Book of Five Rings:

“The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means. Whenever you parry, hit, spring, strike or touch the enemy’s cutting sword, you must cut the enemy in the same movement. It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him. More than anything, you must be thinking of carrying your movement through to cutting him.”

Every step of your reasoning must cut through to the correct answer in the same movement. More than anything, you must think of carrying your map through to reflecting the territory.

Musashi wrote that you must always think of carrying your motion through to cutting; I write, with Eliezer, that every belief and every step in your belief must cut through to knowing the truth.

Beliefs, after all, are for true things, and if you lose sight of that you will lose your epistemics. If you think only of what gives you an advantage in a debate, of what sounds nice, of what wins you the admiration of your peers, of what is politically correct, or of what you would prefer to be true, you will not be able to actually believe true things.

I would like to take the perhaps unusual step of closing with a poem by Rudyard Kipling, which addresses this point (among others) rather well:

As I pass through my incarnations in every age and race,
I make my proper prostrations to the Gods of the Market-Place.
Peering through reverent fingers I watch them flourish and fall.
And the Gods of the Copybook Headings, I notice, outlast them all.

We were living in trees when they met us. They showed us each in turn,
That water would certainly wet us, as Fire would certainly burn:
But we found them lacking in Uplift, Vision, and Breadth of Mind,
So we left them to teach the Gorillas while we followed the March of Mankind.

We moved as the Spirit listed. They never altered their pace,
Being neither cloud nor wind-borne like the Gods of the Market-Place;
But they always caught up with our progress, and presently word would come
That a tribe had been wiped off its icefield, or the lights had gone out in Rome.

With the Hopes that our World is built on they were utterly out of touch.
They denied that the Moon was Stilton; they denied she was even Dutch.
They denied that Wishes were Horses; they denied that a Pig had Wings.
So we worshiped the Gods of the Market Who promised these beautiful things.

When the Cambrian measures were forming, They promised perpetual peace.
They swore, if we gave them our weapons, that the wars of the tribes would cease.
But when we disarmed They sold us and delivered us bound to our foe,
And the Gods of the Copybook Headings said: "Stick to the Devil you know."

On the first Feminian Sandstones we were promised the Fuller Life
(Which started by loving our neighbor and ended by loving his wife)
Till our women had no more children and the men lost reason and faith,
And the Gods of the Copybook Headings said: "The Wages of Sin is Death."

In the Carboniferous Epoch we were promised abundance for all,
By robbing selective Peter to pay for collective Paul;
But, though we had plenty of money, there was nothing our money could buy,
And the Gods of the Copybook Headings said: "If you don't work you die."

Then the Gods of the Market tumbled, and their smooth-tongued wizards withdrew,
And the hearts of the meanest were humbled and began to believe it was true
That All is not Gold that Glitters, and Two and Two make Four —
And the Gods of the Copybook Headings limped up to explain it once more.

*      *      *      *      *      *

As it will be in the future, it was at the birth of Man —
There are only four things certain since Social Progress began: —
That the Dog returns to his Vomit and the Sow returns to her mire,
And the burnt Fool's bandaged finger goes wabbling back to the Fire;
And that after this is accomplished, and the brave new world begins
When all men are paid for existing and no man must pay for his sins,
As surely as Water will wet us, as surely as Fire will burn,
The Gods of the Copybook Headings with terror and slaughter return!

Discuss

### What experiments would demonstrate "upper limits of augmented working memory?"

August 16, 2019 - 01:09
Published on August 15, 2019 10:09 PM UTC

Wikipedia has this discussion of working-memory-as-ability-to-discern-relationships-simultaneously:

Others have argued that working memory capacity is better characterized as "the ability to mentally form relations between elements, or to grasp relations in given information". This idea has been advanced by Halford, who illustrated it by our limited ability to understand statistical interactions between variables.[34]

These authors asked people to compare written statements about the relations between several variables to graphs illustrating the same or a different relation, as in the following sentence: "If the cake is from France, then it has more sugar if it is made with chocolate than if it is made with cream, but if the cake is from Italy, then it has more sugar if it is made with cream than if it is made of chocolate". This statement describes a relation between three variables (country, ingredient, and amount of sugar), which is the maximum most individuals can understand. The capacity limit apparent here is obviously not a memory limit (all relevant information can be seen continuously) but a limit to how many relationships are discerned simultaneously.

A common argument I've heard is that large monitors, notebooks, whiteboards, etc., are important tools for expanding working memory.

I notice I'm not 100% sure what this means – in particular in the context of "discerning relationships simultaneously."

In this blogpost on distributed teams, Elizabeth plots out her model of worker productivity, which looks like this:

I look at any chunk of that, and it makes sense.

If I were to try to summarize the whole thing without looking at the reference drawing, I would definitely not be able to (at least not without a lot of memorization, and/or thinking about the model until it became deeply entangled in my mind).

If I have the model right in front of me, I still can't really explain it; it's too complicated.

Diagrams help – I'm pretty sure I could track more moving parts with a diagram than without a diagram. But how much do they help? And what does that mean?

I'm interested in this as part of a general hypothesis that working-memory might be a key bottleneck on intellectual progress. It seems like you should be able to formalize the limit of how many relationships people can reason about at once, and how much visual aids and other working-memory augmentation help. But I'm not quite sure what testing it would mean.

If I try to memorize a phone number with no visual aids, it's obvious to check how many digits I can remember. If I have a visual aid, it's easy - just read off the page. But when it comes to discerning relationships, just reading off the page "what inputs plug into what" isn't really the question.

I'm interested in:

• whether there's any science that tries actually answering this question
• what science could theoretically try answering this question if it hasn't been done yet.

Discuss

### Clarifying some key hypotheses in AI alignment

August 16, 2019 - 00:29
Published on August 15, 2019 9:29 PM UTC

We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.

Diagram

A part of the diagram. Click through to see the full version.

Caveats
1. This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value, Goodhart's Curse, AI being deployed in a catastrophe-sensitive context.
2. This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and that is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria were something like:
1. The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.
2. The idea must be explained sufficiently well that we believe it is plausible.
3. Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications — please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow — you may disagree with it, so don't blindly follow the arrow!
Background

Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations.

One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.

This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis.

You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. However, we recommend starting with the diagram, then if you are interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below.

Supplementary information

The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments, for lack of software to neatly embed this information (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised are referring to a box in the diagram.

Definitions
• AGI: a system (not necessarily agentive) that, for almost all economically relevant cognitive tasks, at least matches any human's ability at the task. Here, "agentive AGI" is essentially what people in the AI safety community usually mean when they say AGI. References to before and after AGI are to be interpreted as fuzzy, since this definition is fuzzy.
• CAIS: comprehensive AI services. See Reframing Superintelligence.
• Goal-directed: describes a type of behaviour, currently not formalised, but characterised by generalisation to novel circumstances and the acquisition of power and resources. See Intuitions about goal-directed behaviour.
Agentive AGI?

Will the first AGI be most effectively modelled like a unitary, unbounded, goal-directed agent?

• Related reading: Reframing Superintelligence, Comments on CAIS, Summary and opinions on CAIS, embedded agency sequence, Intuitions about goal-directed behaviour
• Comment: This is consistent with some of classical AI theory, and agency continues to be a relevant concept in capability-focused research, e.g. reinforcement learning. However, it has been argued that the way AI systems are taking shape today, and the way humans historically do engineering, are cause to believe superintelligent capabilities will be achieved by different means. Some grant that a CAIS-like scenario is probable, but maintain that there will still be Incentive for agentive AGI. Others argue that the current understanding of agency is problematic (perhaps just for being vague, or specifically in relation to embeddedness), so we should defer on this hypothesis until we better understand what we are talking about. It appears that this is a strong crux for the problem of Incorrigible goal-directed superintelligence and the general aim of (Near) proof-level assurance of alignment, versus other approaches that reject alignment being such a hard, one-false-move kind of problem. However, to advance this debate it does seem important to clarify notions of goal-directedness and agency.
Incentive for agentive AGI?

Are there features of systems built like unitary goal-directed agents that offer a worthwhile advantage over other broadly superintelligent systems?

Modularity over integration?

In general and holding resources constant, is a collection of modular AI systems with distinct interfaces more competent than a single integrated AI system?

• Related reading: Reframing Superintelligence Ch. 12, 13, AGI will drastically increase economies of scale
• Comment: an almost equivalent trade-off here is generality vs. specialisation. Modular systems would benefit from specialisation, but likely bear greater cost in principal-agent problems and sharing information (see this comment thread). One case that might be relevant to think about is human roles in the economy — although humans have a general learning capacity, they have tended towards specialising their competencies as part of the economy, with almost no one being truly self-sufficient. However, this may be explained merely by limited brain size. The recent success of end-to-end learning systems has been argued in favour of integration, as has the evolutionary precedent of humans (since human minds appear to be more integrated than modular).
Current AI R&D extrapolates to AI services?

AI systems so far generally lack some key qualities that are traditionally ascribed to AGI, namely: pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Does this lack persist as AI R&D is increasingly automated and a broad collection of superintelligent services arises?

Incidental agentive AGI?

Will systems built like unitary goal-directed agents develop incidentally from something humans or other AI systems build?

Convergent rationality?

Given sufficient capacity, does an AI system converge on rational agency and consequentialism to achieve its objective?

• Comment: As far as we know, "convergent rationality" has only been named recently by David Krueger, and while it is not well fleshed out yet, it seems to point at an important and commonly-held assumption. There is some confusion about whether the convergence could be a theoretical property, or is merely a matter of human framing, or merely a matter of Incentive for agentive AGI.
Inner optimisers?

Will there be optimisation processes that, in turn, develop considerably powerful optimisers to achieve their objective? A historical example is natural selection optimising for reproductive fitness to make humans. Humans may have good reproductive fitness, but optimise for other things such as pleasure even when this diverges from fitness.

Discontinuity to AGI?

Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI it seems most likely to result from a qualitative change in learning curve, due to an algorithmic insight, architectural change or scale-up in resource utilisation.

Recursive self improvement?

Is an AI system that improves through its own AI R&D and self-modification capabilities more likely than distributed AI R&D automation? Recursive improvement would give some form of explosive growth, and so could result in unprecedented gains in intelligence.

Discontinuity from AGI?

Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI it seems most likely to result from a recursive improvement capability.

• Related reading: see Discontinuity to AGI
• Comment: see Discontinuity to AGI
ML scales to AGI?

Do contemporary machine learning techniques scale to general human level (and beyond)? The state-of-the-art experimental research aiming towards AGI is characterised by a set of theoretical assumptions, such as reinforcement learning and probabilistic inference. Does this paradigm readily scale to general human-level capabilities without fundamental changes in the assumptions or methods?

• Related reading: Prosaic AI alignment, A possible stance for alignment research, Conceptual issues in AI safety: the paradigmatic gap, Discussion on the machine learning approach to AI safety
• Comment: One might wonder how much change in assumptions or methods constitutes a paradigm shift, but the more important question is how relevant current ML safety work can be to the most high-stakes problems, and that seems to depend strongly on this hypothesis. Proponents of the ML safety approach admit that much of the work could turn out to be irrelevant, especially with a paradigm shift, but argue that there is nonetheless a worthwhile chance. ML is a fairly broad field, so people taking this approach should think more specifically about what aspects are relevant and scalable. If one proposes to build safe AGI by scaling up contemporary ML techniques, clearly they should believe the hypothesis — but there is also a feedback loop: the more feasible approaches one comes up with, the more evidence there is for the hypothesis. You may opt for Foundational or "deconfusion" research if (1) you don't feel confident enough about this to commit to working on ML, or (2) you think that, whether or not ML scales in terms of capability, we need deep insights about intelligence to get a satisfactory solution to alignment. This implies Alignment is much harder than, or does not overlap much with, capability gain.
Deep insights needed?

Do we need a much deeper understanding of intelligence to build an aligned AI?

Broad basin of corrigibility?

Do corrigible AI systems have a broad basin of attraction to intent alignment? Corrigible AI tries to help an overseer. It acts to improve its model of the overseer's preferences, and is incentivised to make sure any subsystems it creates are aligned — perhaps even more so than itself. In this way, perturbations or errors in alignment tend to be corrected, and it takes a large perturbation to move out of this "basin" of corrigibility.

• Related reading: Corrigibility, discussion on the need for a grounded definition of preferences (comment thread)
• Comment: this definition of corrigibility is still vague, and although it can be explained to work in a desirable way, it is not clear how practically feasible it is. It seems that proponents of corrigible AI accept that greater theoretical understanding and clarification is needed: how much is a key source of disagreement. On a practical extreme, one would iterate experiments with tight feedback loops to figure it out, and correct errors on the go. This assumes ample opportunity for trial and error, rejecting Discontinuity to/from AGI. On a theoretical extreme, some argue that one would need to develop a new mathematical theory of preferences to be confident enough that this approach will work, or such a theory would provide the necessary insights to make it work at all. If you find this hypothesis weak, you probably put more weight on threat models based on Goodhart's Curse, e.g. Incorrigible goal-directed superintelligence, and the general aim of (Near) proof-level assurance of alignment.
Inconspicuous failure?

Will a concrete, catastrophic AI failure be overwhelmingly hard to recognise or anticipate? For certain kinds of advanced AI systems (namely the goal-directed type), it seems that short of near proof-level assurances, all safeguards are thwarted by the nearest unblocked strategy. Such AI may also be incentivised for deception and manipulation towards a treacherous turn. Or, in a machine learning framing, it would be very difficult to make such AI robust to distributional shift.

• Related reading: Importance of new mathematical foundations to avoid inconspicuous failure (comment thread)
• Comment: This seems to be a key part of many people's models for AI risk, which we associate most with MIRI. We think it significantly depends on whether there is Agentive AGI, and it supports the general aim of (Near) proof-level assurance of alignment. If we can get away from that kind of AI, it is more likely that we can relax our approach and Use feedback loops to correct course as we go.
Creeping failure?

Would gradual gains in the influence of AI allow small problems to accumulate to catastrophe? The gradual aspect affords opportunity to recognise failures and think about solutions. Yet for any given incremental change in the use of AI, the economic incentives could outweigh the problems, such that we become more entangled in, and reliant on, a complex system that can collapse suddenly or drift from our values.

Thanks to Stuart Armstrong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Emmons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feedback on drafts of this work. Ben especially thanks Rohin for his generous feedback and assistance throughout its development.

Discuss

### Tessercube — OpenPGP Made Mobile

August 15, 2019 - 13:47
Published on August 15, 2019 9:34 AM UTC

I've quoted a few words from my colleague Neruthes. He's not on LessWrong, but he did join the most recent Shanghai LessWrong meetup: https://www.lesswrong.com/events/zR4atrRmiaqGLvjYj/shanghai-lesswrong-meetup#NZHcXphJXcsTFwpmf

Recently we have been working on this project because I feel there is no good OpenPGP utility on mobile (especially iOS). By good, I mean that the UX should be good and the license should be AGPL or GPL.

In the process, we came up with the idea of App Penetration: by building OpenPGP into keyboards (input methods), we can literally be end-to-end encrypted on any channel of communication, as long as the other side can decrypt — on Facebook Messenger, on Telegram, on iMessage, whatever.

For now, we have been releasing Android beta versions on Google Play, and the iOS version is on the App Store. It might be a bit early to announce this, since there are plenty of bugs and a big shortage of tutorials, but I believe hardcore users can get through it.

There may still be many bugs and UX flaws. If you find a bug, just go to GitHub and open an issue; I will appreciate it!

In a larger perspective, building Tessercube is just a humble beginning. We would like to give the general public proper encryption tools, make it possible for them to protect their privacy, and enable people to really own their data. That's why we also made Maskbook, an encryption and programmable layer on top of all the existing giants, such as Facebook and Twitter. I will write a separate post about our story and our approach.


Discuss

### A Primer on Matrix Calculus, Part 2: Jacobians and other fun

August 15, 2019 - 04:13
Published on August 15, 2019 1:13 AM UTC

I started this post thinking that I would write out all the rules for evaluating Jacobians of neural network parameters in specific cases. But while this would certainly be useful for grokking deep learning papers, frankly it's difficult to typeset in LaTeX, and the authors of The Matrix Calculus You Need For Deep Learning have already done it much better than I could.

Rather, I consider my comparative advantage here to be expanding on why we should use Jacobians in the first place. If you only read the paper above, you might come away thinking that Jacobians are just a notational perk. I hope to convince you that they are much more than that. In at least one setting, Jacobians provide a mathematical framework for analyzing the input-output behavior of deep neural networks, which can help us see things we might otherwise have missed. A specific case of this phenomenon is a recently discovered technique which was even more recently put into a practical implementation: Jacobian regularization. Here we will see some fruits of our matrix calculus labor.

Deep learning techniques require us to train a neural network by slowly modifying parameters of some function until the function begins returning something close to the intended output. These parameters are often represented in the form of matrices. There are a few reasons for this representation: the matrix form is compact, and it allows us to use the tools of linear algebra directly. Matrix computations can also be processed in parallel, and this standardization allows programmers to build efficient libraries for the training of deep neural networks.

One quite important matrix in deep learning is the Jacobian. For a function f: ℝⁿ → ℝᵐ with component functions f_1, …, f_m, the Jacobian stacks the transposed gradients of the components:

$$J = \begin{bmatrix} \nabla f_1^\top \\ \nabla f_2^\top \\ \vdots \\ \nabla f_m^\top \end{bmatrix}$$

To see what the Jacobian means geometrically, picture the 2D plane as a grid of points. Now imagine stretching, warping, and contracting the plane so that the points are moved around in some manner. This means that every point is mapped to some other point on the graph. For instance, (2,5) could be mapped to (1,3). Crucially, make sure this transformation doesn't cause any sort of sharp discontinuity: we don't want the graph to rip. I am not good at illustrating this type of thing, so I encourage you to imagine it in your head, or alternatively watch this video.

There is a special set of such mappings, which we call linear transformations. Linear transformations have the property that when we perform the stretching, the gridlines are kept straight. We could still rotate the graph, for instance. But what we can't do is bend some axis so that it takes on a curvy shape after the transformation. If we drew a line on the graph before the linear transformation, it must remain a straight line after the transformation.

What does this have to do with the Jacobian? To see, we first must ask why derivatives are useful. In the most basic sense, the derivative is trying to answer the question, "What is the local behavior of this function?"

By analogy, first consider the local behavior of some differentiable function f: ℝ → ℝ. This is a line whose slope is given by the derivative. Similarly, the local behavior of some multivariate function f: ℝⁿ → ℝ is a hyperplane whose orientation is determined by the gradient. And when we ask for the local behavior of some vector-valued function f: ℝⁿ → ℝᵐ, we get a linear transformation described by the Jacobian.

In the above illustration, the Jacobian is evaluated at the point in the bottom left corner of the red tiles. The linear transformation implied by the Jacobian is represented by the translucent square after the function is applied, which is a rotation by some angle clockwise. As we can see, while f is some extra curvy function, the Jacobian approximates its local behavior at the point quite well.
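The claim that the Jacobian captures local behavior is easy to check numerically. The following sketch (using numpy; the function and evaluation point are arbitrary choices of mine, not from the post) builds a finite-difference Jacobian and verifies the first-order approximation f(p + h) ≈ f(p) + Jh for a small displacement h.

```python
import numpy as np

def f(p):
    # A "curvy" map from R^2 to R^2, chosen for illustration.
    x, y = p
    return np.array([np.sin(x) + y**2, x * y])

def jacobian_fd(f, p, eps=1e-6):
    # Central finite differences: column j holds df/dp_j.
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(p.size):
        dp = np.zeros_like(p)
        dp[j] = eps
        cols.append((f(p + dp) - f(p - dp)) / (2 * eps))
    return np.stack(cols, axis=1)

p = np.array([0.5, -1.0])
J = jacobian_fd(f, p)  # analytically: [[cos(0.5), -2], [-1, 0.5]]

# Near p, f behaves like the linear map J: f(p + h) ≈ f(p) + J h.
h = np.array([1e-3, -2e-3])
assert np.allclose(f(p + h), f(p) + J @ h, atol=1e-5)
```

The approximation error shrinks quadratically as h shrinks, which is exactly the sense in which the Jacobian is the "local linear transformation" of f.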

To see how this visualization is useful, consider the case of applying a Jacobian to a neural network. We could be designing a simple neural network to predict the output of two variables, representing perhaps normalized class probabilities, given an input of three variables, representing perhaps input pixel data. We now illustrate the neural network.

This neural network implements a particular function from ℝ³ → ℝ². However, the exact function being implemented depends crucially on the parameters, here denoted by the connections between the nodes. If we compute the Jacobian of this neural network with respect to the input, at some input instance, we end up with a good idea of how the neural network changes within the neighborhood of that particular input.

One way we can gain insight from a Jacobian is by computing its determinant. Recall that the determinant is a function from square matrices to scalars, defined recursively as the alternating sum of the determinants of the matrix's minors, each multiplied by an element in the top row. On second thought, don't recall that definition of the determinant; it's not going to get you anywhere. Despite the determinant's opaque definition, we can gain deeper insight into what it represents by viewing it geometrically. In a few words, the determinant computes the scaling factor of the linear transformation that a matrix represents.

Above, I have pulled from Wikimedia a parallelepiped, which was formed from some linear mapping of a cube. The volume of this parallelepiped is some multiple of the volume of the cube before the transformation. It turns out that no matter which region of space we look at, this linear transformation generates the same ratio of post-transformed volume to pre-transformed volume. This ratio is given by the determinant of the matrix representing the linear mapping. In other words, the determinant tells us how much a transformation expands or contracts space.

What this means for the Jacobian is that its determinant tells us how much space is being squished or expanded in the neighborhood around a point. If the output space is being expanded a lot at some input point, then the neural network is somewhat unstable in that region, since minor alterations to the input could cause huge distortions in the output. By contrast, if the determinant is small, then some small change to the input will hardly make a difference to the output.
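The scaling-factor interpretation can be checked on a map where the answer is known in closed form. In this numpy sketch (my own example, not from the post), the polar-to-Cartesian map scales a small patch of area at radius r by exactly r, and the determinant of its Jacobian recovers that factor.

```python
import numpy as np

def polar_to_cartesian(p):
    # (r, theta) -> (x, y)
    r, theta = p
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def jacobian_fd(f, p, eps=1e-6):
    # Central finite differences; column j = df/dp_j.
    p = np.asarray(p, dtype=float)
    cols = [(f(p + eps * e) - f(p - eps * e)) / (2 * eps)
            for e in np.eye(p.size)]
    return np.stack(cols, axis=1)

# Analytically, J = [[cos t, -r sin t], [sin t, r cos t]], so det J = r:
# a small patch at radius r has its area expanded by a factor of r.
p = np.array([2.0, 0.7])  # r = 2, theta = 0.7
J = jacobian_fd(polar_to_cartesian, p)
assert np.isclose(np.linalg.det(J), 2.0, atol=1e-4)
```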

This very fact about the Jacobian is behind a recent development in the regularization of deep neural networks. The idea is that we could use this interpretation of the Jacobian as a measure of robustness to input-perturbations around a point to make neural networks more robust off their training distribution. Traditional approaches like L2 regularization have emphasized the idea of keeping some parameters of the neural network from wandering off into extreme regions. The idea here is that smaller parameters are more likely a priori, which motivates the construction of some type of penalty on parameters that are too large.

In contrast to L2 regularization, the conceptual framing of Jacobian regularization comes from a different place. Instead of holding a leash on some parameters, to keep them from wandering off into the abyss, Jacobian regularization emphasizes providing robustness to small changes in the input space. The motivation behind this approach is clear to anyone who has been paying attention to adversarial examples over the last few years. To explain: adversarial examples are inputs on which a neural network performs very poorly, even though it had initially done well on a non-adversarial test set. Consider this example, provided by OpenAI.

The first image was correctly identified as a panda by the neural network. However, when a tiny bit of noise was added to the image, the neural network spit out garbage, confidently classifying a nearly exact copy as a gibbon. One could imagine a hypothetical adversary using this exploit to defeat neural network systems in practice. In the context of AI safety, adversarial attacks constitute a potentially important subproblem of system reliability.

In Jacobian regularization, we approach this issue by putting a penalty on the size of the entries in the Jacobian matrix. The idea is simple: the smaller the values of the matrix, the less that tiny perturbations in input space will affect the output. Concretely, the regularizer is the squared Frobenius norm of the Jacobian, ||J(x)||_F^2. The Frobenius norm is nothing complicated: we square all of the elements in the matrix, take the sum, and then take the square root of this sum (the penalty simply omits the final square root). Put another way, if we imagine concatenating all the gradient vectors which compose the Jacobian, the Frobenius norm is just the L2 norm of this concatenated vector.
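Here is a minimal sketch of the quantity being penalized, using a made-up two-dimensional map and a finite-difference Jacobian (a real implementation would use automatic differentiation):

```python
import math

# Toy nonlinear map f: R^2 -> R^2, chosen arbitrarily for illustration.
def f(x):
    return [math.tanh(x[0] + 2 * x[1]), x[0] * x[1]]

# Estimate the Jacobian J[i][j] = df_i/dx_j by central finite differences.
def jacobian(f, x, eps=1e-6):
    n = len(x)
    fx = f(x)
    J = [[0.0] * n for _ in fx]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(len(fx)):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

J = jacobian(f, [1.0, 1.0])
penalty = sum(v * v for row in J for v in row)  # squared Frobenius norm ||J||_F^2
frob = math.sqrt(penalty)                       # Frobenius norm ||J||_F
```

Adding `penalty` (scaled by a hyperparameter) to the training loss is the essence of the technique: gradient descent is then pushed toward functions that are locally flat around the training inputs.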

Importantly, this technique is subtly different from taking the L2 norm over the parameters. In the special case of a model with no nonlinearity, however, the penalty does reduce to L2 regularization. Why? Because the Jacobian of a purely affine function f(x) = Wx + b is just the weight matrix W, which captures how the function stretches and rotates space, excluding the translation offset b. Penalizing the Jacobian's entries therefore penalizes exactly the parameters that L2 regularization would, minus the bias. It is analogous to how, if we take the derivative of a line, we can reconstruct the line from the derivative plus an intercept term.
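To see the reduction concretely, here is a sketch with made-up weights: for an affine model the finite-difference Jacobian penalty equals the sum of squared weights at every input, while the bias never enters.

```python
# Affine model f(x) = W x + b with hypothetical made-up weights.
W = [[1.0, 2.0],
     [3.0, 4.0]]
b = [0.5, -0.5]

def f(x):
    return [sum(W[i][j] * x[j] for j in range(2)) + b[i] for i in range(2)]

def jacobian_frob_sq(f, x, eps=1e-6):
    # Squared Frobenius norm of the Jacobian, via central finite differences.
    total = 0.0
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        total += sum(((fp[i] - fm[i]) / (2 * eps)) ** 2 for i in range(len(fp)))
    return total

# L2 penalty on the weights alone: 1 + 4 + 9 + 16 = 30. The Jacobian
# penalty matches it at every input, and the bias b never appears.
l2_on_weights = sum(w * w for row in W for w in row)
```

With any nonlinearity in the model, the two penalties come apart: the Jacobian then depends on the input x, so Jacobian regularization penalizes local sensitivity rather than the weights themselves.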

If, while reading the last few paragraphs, you started thinking "how is this just now being discovered?", you share my thoughts exactly. As far as I can tell, the seeds of Jacobian regularization have existed since at least the 1990s. Yet it took until 2016 for a team to create a full implementation. Only recently, as I write this in August 2019, has a team of researchers claimed to have discovered an efficient algorithm for applying this regularization penalty to neural networks.

The researchers' new method uses random projections to approximate the Frobenius norm. While the prior work mentioned random projections, the idea was never put into practice. The new paper succeeded by devising an algorithm that approximates the Jacobian efficiently, with minimal overhead cost.
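A sketch of the random-projection trick (my own toy illustration, not the paper's implementation): for a random unit vector v in the output space, the expectation of m·||Jᵀv||² equals ||J||_F², so averaging a few such projections estimates the penalty without ever forming J. In a real network, each Jᵀv is a single vector-Jacobian product, i.e. one extra backward pass.

```python
import math
import random

# A made-up 2x3 Jacobian; in a real network J^T v would be computed by
# backpropagating v through the outputs, never by materializing J.
J = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]
m = len(J)  # output dimension

def jt_v(v):
    # Vector-Jacobian product J^T v.
    return [sum(J[i][j] * v[i] for i in range(m)) for j in range(len(J[0]))]

def estimate_frob_sq(num_projections=10000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]  # uniform random direction on the unit sphere
        # E[m * ||J^T v||^2] = ||J||_F^2, so each term is an unbiased estimate.
        total += m * sum(c * c for c in jt_v(v))
    return total / num_projections

exact = sum(J[i][j] ** 2 for i in range(m) for j in range(3))  # 1+4+9+1 = 15
approx = estimate_frob_sq()
```

The estimator is unbiased because E[vvᵀ] = I/m for v uniform on the unit sphere; in practice even one projection per minibatch gives a usable gradient signal.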

How efficiently? The paper states that there is

> only a negligible difference in model solution quality between training with the exact computation of the Jacobian as compared to training with the approximate algorithm, even when using a single random projection.

If this technique really works as described, this is a significant result. The paper claims that with Jacobian regularization, training uses only about 30% more computation than stochastic gradient descent with no regularization at all. And for that cost we get some nice benefits: the regularized system was significantly more robust to a PGD attack, and, as measured by the distance between decision cells in the output space, it was apparently much better than vanilla L2 regularization.

I recommend looking at the paper for more details.

Discuss