# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 54 minutes 5 seconds ago

### Why Selective Breeding is a Bad Way to do Genetic Engineering

2 hours 13 minutes ago
Published on March 5, 2021 2:30 AM GMT

A Brief Intro

During any conversation about genetic engineering, people inevitably bring up worries about eugenics movements of the past and often use the cruelty, bad science, and objective failure of these efforts as an example of why we shouldn't ever try anything remotely related again. In this short post, I'm going to summarize why I think selective breeding of humans is both morally wrong and ineffective as a means of improving human genes.

Selective breeding at its core involves taking organisms that score well on some test of desirable traits and enabling them to reproduce at higher rates than organisms that score poorly. Despite its many flaws, this technique has led to amazing gains in both agriculture and animal husbandry, and allowed domesticated corn to undergo an incredible transformation over the last few thousand years.

But despite the amazing performance on crops, there are reasons this technique would not work very well on humans.

Humans are Slow Breeders

Every generation that you selectively breed an organism, you get some gain in a particular trait. The faster reproduction happens, the faster you see improvements in the trait(s) under selection. Humans are extremely slow-breeding animals. Though humans are capable of reproducing from early adolescence, most humans today opt to wait until their 20s or 30s to have children. This is a very long time if you want to do selective breeding.

Selective Breeding Leads to an Undesirable Reduction in Genetic Diversity

Genetic diversity is valuable. Because selective breeding can only work by throwing entire organisms out of the gene pool, it naturally ends up reducing a lot of desirable genetic diversity. Even organisms that don't score well overall will still have many good genes. With selective breeding, there is no way to keep this valuable genetic diversity unless one were to select the best X% from every lineage in a population.

Selective Breeding Creates a Single Point of Failure

The right to reproduce is fundamental. Even in societies that do place restrictions on reproduction, such as China with its one-child policy, the restrictions are not total: each couple can still have a single child. In order to make any notable gains in desirable traits from selective breeding, one must necessarily only allow a small portion of the population to reproduce. This would require an incredible concentration of power in the regulatory authority, and the pressure on regulatory officials from powerful people who want to be able to have children would be immense. I see rampant corruption as nearly inevitable in such a system.

Not only that, but concentrating power in this way creates a single point of failure. It is not too difficult to imagine such a system becoming corrupted by discriminatory ideology. In fact, you don't even have to imagine it because this type of failure was exactly what happened in Nazi Germany before and during World War II when they implemented a eugenics program based on racist ideology and belief in a fictional Aryan "master race".

Selective Breeding is Cruel

For myself personally, this is the most compelling reason to not use selective breeding: it is a cruel judgment upon those who, through no fault of their own, happen to draw the short stick in the genetic lottery. The desire to see some part of ourselves live on past our death is nearly universal, and the most common realization of this desire is through having children. Restricting this ability, even if it would result in future generations more capable of carrying on the human legacy, would be an enormous price to pay.

Though we may recognize that certain genes confer advantages to an individual, we must not confuse human ability with human value. As humanity enters the age in which we will be able to rewrite our genetic source code, I think this is one of the most important lessons for us to remember.


### Participating in a Covid-19 Vaccine Trial #3: I Hope I feel Worse Tomorrow

4 hours 41 minutes ago
Published on March 5, 2021 12:02 AM GMT

Today is four weeks to the day after my first injection, and I received my second at my appointment this morning. The process was very similar to my first visit: I entered and met first with someone who led me through paperwork and asked a series of screening questions, then had a brief medical exam, got my shot, waited 30 minutes in case of any acute reaction, and left. Notable differences this time:

• The entire process seemed less repetitive and went by faster (both subjectively and objectively). I was also filling out fewer forms than last time and wasn’t getting any orientation so this makes sense.
• I was told that my follow-up meeting “had to” be scheduled for two weeks from now by order of Novavax. This is sooner than was specified in the study design. I suspect they have enough data already to show that it works so they are planning to close the site.
• In the middle of my health exam, someone rang a Nurse Call button and the doctor ran out of the room. A few minutes later the Nurse Practitioner from last month came in and finished up. The building we were in is also used as a clinic for World Trade Center survivors and first responders (and probably other health care I don’t know about) so it’s possible that this was totally unrelated, but the call would imply that someone had a reaction to the injection.
• There were specific instructions to inject the 2nd dose in the opposite arm from the first dose. I am right-handed so I used my left arm last time, which meant I had to use my dominant hand this time. The second round is more likely to have side effects though, so I wish I had known this to begin with. By word-of-mouth from a medical professional (my mom), apparently some people have reported pain in the first injection site after getting their second dose. So maybe they are looking for that.
• Doors were mostly kept open by default instead of closed by default. The first person asking me health-related questions even asked me if I wanted to close the door.

When I received my shot the NP mentioned that most people don’t have side effects from the vaccine. He also said the side effects seem to have a similar profile to those of the Pfizer and Moderna vaccines, namely mild flu-like symptoms for a day or two. This was not exactly supported by the literature I ended up reading.

Did I get the Placebo? Using Bayes' Theorem

I would like to have a better idea of my chances of having received the real vaccine. The experiment randomized 2 people into the trial group for every 1 person in the control. This gives prior odds of 2:1.

There are two pieces of evidence: my symptoms after the first injection (none), and my symptoms after the second injection (to be observed over the next day or so). These are obviously not independent. I need to know how much more likely I am to experience side effects from the real injection than the placebo.

After reading a few unhelpful press releases, I found my way to medrxiv.org and entered NVX-CoV2373 into the search bar. I found a few papers which seemed likely to have the information I was looking for.

Estimating Likelihood Ratios

This first study was a phase II investigation of dose response; in simpler language, they were giving people different amounts of vaccine and seeing what happened. They also broke their study population into two age cohorts (18-59 and 60-84). Younger patients were on average more prone to react at all stages and doses. Data are not available but there are graphs (Fig. 2) and some statistics in the text. All of the numbers here are read off of graphs and are approximate.

“Local” adverse effects from the vaccine include pain, swelling, tenderness etc. at injection site, while “Systemic” adverse effects include fever, nausea, or malaise. The study had one placebo group and two pairs of dosed groups; each pair used a different dose size in their vaccines. Within each pair of dosed groups, one group received two doses and one group received a placebo shot for their second dose.

At least one local adverse effect was reported by:

• 15% of placebo participants on the first shot, 10% on the second shot
• 50% of the 5-microgram dose group participants on the first shot, 70% on the second shot
• 65% of the 25-microgram group participants on the first shot, 80% on the second shot

For the second shot, about 10% of participants who received a placebo that round reported local adverse effects, whether or not they received a real vaccine for their first dose.

Systemic adverse effects were reported by:

• Roughly 40% of people in all groups after the first shot
• 20% of people receiving the placebo for their 2nd shot
• 50% of the 5-microgram group after their 2nd shot
• 60% of the 25-microgram group after their 2nd shot

For younger participants, the gaps between placebo and treatment groups were a bit larger - younger people reacted to vaccines more often - so I’m tempted to bump all of those figures up a bit in my own calculation (I am in my 20s).
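As a rough sketch, the approximate local-reaction percentages quoted above for the 5-microgram group can be turned into likelihood ratios, which is the quantity a Bayesian update needs. The function name and data layout are illustrative assumptions. The inputs are graph readings from a single study, so the outputs are only a starting point. Note that these are likelihood ratios (ratios of probabilities), not odds ratios.

```python
# Approximate rates of "at least one local adverse effect", as quoted above.
# These are graph readings, so treat every number as rough.
local_rates = {
    "first shot": {"vaccine": 0.50, "placebo": 0.15},
    "second shot": {"vaccine": 0.70, "placebo": 0.10},
}


def likelihood_ratios(p_vaccine, p_placebo):
    """Return (LR if a reaction is observed, LR if no reaction is observed)."""
    lr_reaction = p_vaccine / p_placebo
    lr_no_reaction = (1 - p_vaccine) / (1 - p_placebo)
    return lr_reaction, lr_no_reaction


for shot, rates in local_rates.items():
    lr_yes, lr_no = likelihood_ratios(rates["vaccine"], rates["placebo"])
    print(f"{shot}: reaction {lr_yes:.2f}:1, no reaction {lr_no:.2f}:1")
```

A ratio above 1 favors the real vaccine and a ratio below 1 favors the placebo.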

The second study I read had a very similar design, but participants generally reported more symptoms than in the first one I discussed. I suspect whichever method they used to ask participants about their symptoms was more sensitive - for example, more than a third of the placebo group reported a headache, which counts as a systemic adverse effect - so these numbers might be inflated. Even if this is true, there are almost no reports of fever from any of the participants, which surprised me. Looking at the other systemic effects, it looks like fatigue, malaise, and muscle pain were each about two or three times more common in the treatment groups than the control group.

Finally, I read this meta-analysis which apparently has done most of the work for me:

For the meta-analysis, we separated the adverse events based on vaccine vs. placebo injection as reported by individual studies. In general, we observed there was an increase in total adverse events for subjects with low dose vaccine injection [OR: 2.86; 95% CI: 1.90-4.29, P < 0.00001]. Especially, the local reactions were significantly enhanced in subjects with low dose vaccine groups [OR: 2.07; 95% CI: 1.07-4.00, P = 0.03]. However, the systemic reactions were no significantly changed between vaccine and placebo groups [OR: 1.28; 95% CI: 0.67-2.43, P = 0.46].

The doses used by the Novavax PREVENT-19 trial are 5 micrograms, at the lowest end of the doses used in these papers. The analysis of high-dose vaccines yielded modestly higher odds ratios (there was a significant increase in systemic reactions for high-dose participants vs placebo) but the same general picture.

Integrating all of this evidence in a reasonable way is not a trivial problem, but the sources don’t seem to disagree too much - the odds ratio for having local adverse effects seems to be about 1.5-4, and the odds ratio for systemic adverse effects a little lower than that. The systemic reaction doesn’t seem to be very diagnostic and it’s probably correlated with the local reaction in ways I don’t understand and am uneasy guessing about. So to a first approximation I will just update on the presence or absence of local adverse effects.

Conclusion

At the time I am publishing this, it’s been about 6 hours since the injection. I have a slight pain in my right arm and general malaise, but not to such a degree that I am sure I’m not imagining it. Hopefully I feel worse tomorrow!

If I do, I’d update my posterior odds to be 2:1 times 2:1 = 4:1. I’d be about 80% confident I received the real vaccine.

If I don’t, I’d update my posterior odds to be 2:1 times 1:2 = 1:1. I’d be about 50% confident I received the real vaccine.

Since I’m young, and younger people tended to have vaccine reactions more reliably, these could probably be treated as (very rough) lower and upper bounds, respectively, i.e. I would be at least 80% or at most 50% confident I received the real treatment depending on the case. In the unlikely event I wake up with crushing fatigue and nausea or aching all over, I reserve the right to ignore these estimates and throw myself a party.
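The two updates above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the post's own numbers: 2:1 prior odds from the randomization, and rough likelihood ratios of 2:1 for a local reaction and 1:2 for no reaction. The function names are illustrative.

```python
def update_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds of the form x:1 into a probability."""
    return odds / (1 + odds)


prior_odds = 2 / 1        # 2 vaccine recipients for every 1 placebo recipient
lr_reaction = 2 / 1       # rough ratio used in the post if a local reaction appears
lr_no_reaction = 1 / 2    # rough ratio used in the post if no reaction appears

p_prior = odds_to_probability(prior_odds)
p_if_reaction = odds_to_probability(update_odds(prior_odds, lr_reaction))
p_if_no_reaction = odds_to_probability(update_odds(prior_odds, lr_no_reaction))

print(f"prior: {p_prior:.0%}")
print(f"after a reaction: {p_if_reaction:.0%}")
print(f"after no reaction: {p_if_no_reaction:.0%}")
```

Swapping in different likelihood ratios shows how sensitive the conclusion is to the rough 2:1 estimate.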


### Project: Debiasing Politics via Crowdsourcing

March 4, 2021 - 22:34
Published on March 4, 2021 7:34 PM GMT

(TLDR: We are organizing a collaborative research group which will test a new method for unbiased decision making. The group will include several superforecasters, but no prior experience in the field is required to join.)

The problem

In 1906, while visiting a country fair, Francis Galton observed a competition to guess the weight of an ox. After calculating the average of 787 guesses, Galton discovered that the result (1,197 lb) was extremely close to the actual weight of the ox (1,198 lb). Because participants’ errors were random (in other words, people were equally likely to over- and under-estimate the correct value), most of the errors cancelled out after averaging. [1]

Subsequently, the technique of crowdsourcing questions to unrelated individuals has achieved some remarkable successes (would you believe that a missing submarine could be found that way?). Unfortunately, in some important areas this method remains inadequate.

Consider, for example, what would happen if a group of individuals were presented with a highly polarizing question (e.g., “Estimate the effect of the proposed minimum wage change on future unemployment rates”). Instead of a single Gaussian distribution centered around the correct answer, the responses are likely to split into separate clusters reflecting political sympathies of their members. The average result would be determined by the political composition of the group and no longer converge to the correct value.
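The contrast between the two situations can be illustrated with a small simulation. Only the true value and the crowd size of 787 come from the Galton example. The noise level, the size of the group biases, and the 70/30 split are hypothetical numbers chosen for illustration.

```python
import random
import statistics

TRUE_VALUE = 1198  # weight of the ox in lb, from the Galton example


def simulate_crowd(n, groups, noise_sd, seed=0):
    """Each forecaster reports TRUE_VALUE + their group's bias + random noise.

    `groups` is a list of (share_of_crowd, bias) pairs.
    """
    rng = random.Random(seed)
    guesses = []
    for share, bias in groups:
        for _ in range(round(n * share)):
            guesses.append(TRUE_VALUE + bias + rng.gauss(0, noise_sd))
    return guesses


# Unbiased crowd: errors are pure noise and mostly cancel in the average.
unbiased = simulate_crowd(787, [(1.0, 0.0)], noise_sd=50)

# Polarized crowd: 70% lean one way, 30% the other (hypothetical biases).
polarized = simulate_crowd(1000, [(0.7, 100.0), (0.3, -100.0)], noise_sd=50)

unbiased_error = statistics.mean(unbiased) - TRUE_VALUE
polarized_error = statistics.mean(polarized) - TRUE_VALUE

print(f"unbiased crowd error:  {unbiased_error:+.1f}")
print(f"polarized crowd error: {polarized_error:+.1f}")
```

In the polarized case the error of the average settles near 0.7 × 100 − 0.3 × 100 = 40. That value is set by the composition of the group, and adding more forecasters with the same split does not shrink it.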

Machine learning solution

Recently a new algorithm has been developed that addresses the problem by incorporating two additional sets of parameters in the forecast aggregation method:

1) “Controversy” vector, C, which for a given question measures how the answers are affected by forecaster biases (right vs. left, libertarian vs. authoritarian etc.). For example, positive values along the libertarian axis would indicate that people biased towards a libertarian world-view give larger predictions for a given question.

2) “Personal bias” vector, B, which measures individual forecaster biases based on their previous performance. For example, high positive values along the libertarian axis would indicate that the forecaster’s errors strongly correlate with errors made by other libertarian-leaning forecasters.

Using the values of the two vectors it is possible to achieve better forecasting accuracy by correcting for systematic biases of individual forecasters.
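The post does not spell out how the two vectors enter the aggregation, so the following is only a toy sketch under a loudly stated assumption: each forecast equals the true value plus the dot product of the forecaster's personal bias B with the question's controversy C. The linear model, the function names, and all numbers are hypothetical and are not the algorithm described above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def debiased_mean(forecasts, personal_biases, controversy):
    """Subtract each forecaster's assumed systematic shift (B . C), then average.

    Assumes the hypothetical linear model: forecast = truth + B . C.
    """
    corrected = [f - dot(b, controversy)
                 for f, b in zip(forecasts, personal_biases)]
    return sum(corrected) / len(corrected)


# Toy setup with a single bias axis (hypothetical values).
truth = 5.0
controversy = [2.0]                            # C: one unit of bias shifts answers by 2
personal_biases = [[1.0], [1.0], [1.0], [-1.0]]  # B: three forecasters lean one way, one the other

# Noiseless forecasts generated from the assumed model.
forecasts = [truth + dot(b, controversy) for b in personal_biases]

naive_estimate = sum(forecasts) / len(forecasts)
corrected_estimate = debiased_mean(forecasts, personal_biases, controversy)

print(naive_estimate, corrected_estimate)
```

In this toy case the plain average is pulled towards the majority's lean, while the corrected average recovers the true value. In practice B and C would have to be estimated from past resolved questions.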

Debiasing Politics

So far, the tests have shown that the new algorithm is more accurate than the algorithm used by IARPA in predicting the probability of geopolitical events [2]. There are two other areas where the same approach may be useful:

• Claim Validation (“Fact Checking”).

At present, the task of verifying politically charged claims is mostly performed by organizations that are themselves vulnerable to partisan bias. Aside from affecting the objectivity of their investigations, this partisanship hurts the credibility of their conclusions among people who do not already share their political preferences.

A possible solution to this problem could be the creation of a public forecasting platform specializing in investigating politically controversial questions. The accuracy and bias of the forecasters can be evaluated based on questions that will have a clear resolution at some future point (e.g., “Post Brexit: Will the GDP growth of the UK lag behind the EU average in 2021?”). With their accuracy and bias known, forecasters may be asked to evaluate claims that do not have a clear resolution method (e.g., “Raising minimum wage increases unemployment”).

• Evaluation of public policy initiatives

Debiasing may be a useful tool in evaluating the effectiveness of current government policies (e.g., Covid-19 quarantines) and in testing the feasibility of long-term policy ideas (UBI, Open Borders, Futarchy etc.).

Collaborative Research Group

If you would like to participate in our project, measure your own biases or get an unbiased evaluation of your political ideas, you are welcome to join the research group that we are currently organizing. The main activities of the group will include:

• making forecasts on politically charged questions
• estimating the validity of controversial claims
• evaluating policy ideas

Since accurate calculation of bias requires large amounts of data, the group will officially begin its activities after we recruit a sufficient number of participants.

-------

[1] Galton, F. Vox Populi. Nature 75, 450–451 (1907)

[2] The author is the winner of IARPA’s latest Geopolitical Forecasting Challenge.


### [AN #140]: Theoretical models that predict scaling laws

March 4, 2021 - 21:10
Published on March 4, 2021 6:10 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

SECTIONS
HIGHLIGHTS
TECHNICAL AI ALIGNMENT
MISCELLANEOUS (ALIGNMENT)
AI GOVERNANCE
OTHER PROGRESS IN AI
REINFORCEMENT LEARNING
MACHINE LEARNING

HIGHLIGHTS

Explaining Neural Scaling Laws and A Neural Scaling Law from the Dimension of the Data Manifold (Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma) (summarized by Rohin): We’ve seen lots of empirical work on scaling laws (AN #87), but can we understand theoretically why these arise? This paper suggests two different models for how power-law scaling laws could arise, variance-limited and resolution-limited scaling, and argues that neural nets are typically trained in the resolution-limited setting. In both cases, we have versions that occur when the dataset size D is large and the number of parameters P is low (parameter-limited), and when D is low and P is large (data-limited).

Recall that a scaling law is a power-law equation that predicts the test loss L as a function of P and D. In this paper, we consider cases where only one of the resources is the bottleneck, so that our power laws are of the form L = kP^(-α) or L = kD^(-α), for constants k and α. (For simplicity, we’re assuming that the minimum value of our loss function is zero.)

Resolution-limited scaling happens when either the dataset is too small to “resolve” (capture) the true underlying function, or when the model doesn’t have enough capacity to “resolve” (fit) the training dataset. In this case, we’re going to take the common ML assumption that while our observation space might be high-dimensional, the data itself comes from a low-dimensional manifold with dimension d, called the intrinsic dimension. We’ll model our neural net as transforming the input space into a roughly d-dimensional representation of the manifold, which is then used in further processing by later layers. Thus the output of the network is a simple function over this low-dimensional representation.

Let’s first consider the case where P is sufficiently large, so that we perfectly fit the training data, but D is limited. We can think of the training data as a “net” of points covering the true d-dimensional manifold. Intuitively, to halve the distance between the points (making the net “twice as fine”), we need ~2^d times as many points. Some simple algebraic manipulation tells us that the distance between points would then scale as D^(-1/d).

How can we translate this to the test loss? Let’s assume a simple nearest neighbor classifier where, given a test data point, we simply predict the value associated with the nearest training data point. This is equivalent to assuming that our neural net learns a piecewise constant function. In this case, for a test data point drawn from the same distribution as the training set, that data point will be “near” some training data point and our model will predict the same output as for the training data point.

Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d.

The above case assumes that our model learns a piecewise constant function. However, neural nets with ReLU activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, the deviation between the true value and the predicted value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is already captured by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in this deviation, we get that α = 4/d.

Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
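Both exponents can be checked numerically in a toy setting. The sketch below is not from the paper. It assumes a one-dimensional sine target (so d = 1), evenly spaced training points, and L2 loss, and it compares a nearest-neighbor (piecewise constant) predictor with linear interpolation (piecewise linear). Under these assumptions the argument above predicts exponents of about 2/d = 2 and 4/d = 4.

```python
import math


def target(x):
    """Toy 'true function' on [0, 1] (an assumption for illustration)."""
    return math.sin(2 * math.pi * x)


def mse_on_grid(num_train, mode, num_eval=20000):
    """Mean squared error of a predictor fit on num_train evenly spaced points."""
    h = 1.0 / (num_train - 1)                      # spacing between training points
    train_y = [target(i * h) for i in range(num_train)]
    total = 0.0
    for j in range(num_eval):
        x = (j + 0.5) / num_eval                   # evaluation point in (0, 1)
        if mode == "nearest":
            i = min(num_train - 1, int(round(x / h)))
            pred = train_y[i]                      # piecewise constant prediction
        else:
            i = min(num_train - 2, int(x / h))
            t = x / h - i                          # position within the cell, in [0, 1]
            pred = (1 - t) * train_y[i] + t * train_y[i + 1]
        total += (pred - target(x)) ** 2
    return total / num_eval


def fitted_exponent(mode, d_small=64, d_large=512):
    """Estimate alpha in L = k * D^(-alpha) from two dataset sizes."""
    ratio = mse_on_grid(d_large, mode) / mse_on_grid(d_small, mode)
    return -math.log(ratio) / math.log(d_large / d_small)


alpha_constant = fitted_exponent("nearest")
alpha_linear = fitted_exponent("linear")
print(f"piecewise constant: alpha ~ {alpha_constant:.2f}")
print(f"piecewise linear:   alpha ~ {alpha_linear:.2f}")
```

The estimates should land close to 2 and 4, with small deviations because the spacing is 1/(D-1) rather than exactly 1/D.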

What about the case when P is the bottleneck? Well, in this case, since the training data is not the bottleneck, it is presumably a sufficiently good approximation to the underlying function; and so we are just seeing whether the learned model can match the dataset. Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss), where X is the number of “parts” in the approximation. In the case of linear models, we should have X = P, but for neural networks I believe the authors suggest that we should instead have X = w, the width of the network. (One motivation is that in the infinite-width limit, neural networks behave like linear models.)

In variance-limited scaling for D, the scaling bottleneck is the randomness inherent in the sampling of the dataset from the underlying distribution. We can view the dataset as a random variable, implying that the gradient is also a random variable since it is a function of the training dataset. We can then consider the “error term” δG = G - G_inf, which is the difference between the finite-dataset gradients and the gradients for infinite data. We’ll make the assumption that you’re equally likely to be wrong in all directions -- if there’s a dataset that makes you a bit more likely to predict A, then there’s also a corresponding equally likely dataset that makes you a bit less likely to predict A. In that case, in expectation δG is zero, since on average the errors all cancel out. Since D is assumed to be large, we can apply the law of large numbers to deduce that the variance of δG will scale as 1/D.

Let us then consider the test loss as a function of the gradients. The test loss we actually get is L(G) = L(G_inf + δG). We can now Taylor expand this to get an expansion which tells us that the quantity we care about, L(G) - L(G_inf), is of the form AδG + B(δG)^2, where A and B are constants that depend on derivatives of the test loss in the infinite dataset case. We had already concluded that E[δG] = 0, and E[(δG)^2] is just the variance and so scales as 1/D, which implies that α = 1.

Here’s a slightly less mathematical and more conceptual argument for the same thing (though note that this feels like a sketchier argument overall):

1. Variance of the gradient scales as 1/D by the law of large numbers

2. Thus standard deviation scales as 1/√D

3. Thus the deviation of the empirical estimate of the gradients scales as 1/√D

4. Thus the deviation of the neural net parameters scales as 1/√D

5. Thus the deviation of the output of the final layer scales as 1/√D

6. Any linear dependence on this deviation would cancel out in expectation, since the deviation could either increase or decrease the test loss. However, quadratic dependences would add together. These would scale as (1/√D)^2, that is, 1/D.
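The steps above can be illustrated with a toy simulation. It is a sketch under simplifying assumptions and is not the paper's experiment. The sample mean of D noisy draws stands in for the finite-data gradient, the true mean (zero) stands in for the infinite-data gradient, and the loss is taken to be purely quadratic in the deviation. The noise distribution, dataset sizes, and trial count are arbitrary choices.

```python
import math
import random


def mean_excess_loss(dataset_size, trials=500, seed=0):
    """Average of (sample mean - true mean)^2 over many resampled datasets.

    The true mean is 0, so the squared sample mean is the excess loss.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample_mean = sum(rng.uniform(-1, 1)
                          for _ in range(dataset_size)) / dataset_size
        total += sample_mean ** 2
    return total / trials


d_small, d_large = 50, 800
loss_small = mean_excess_loss(d_small, seed=0)
loss_large = mean_excess_loss(d_large, seed=1)

# Fit alpha in L = k * D^(-alpha) from the two dataset sizes.
alpha = -math.log(loss_large / loss_small) / math.log(d_large / d_small)
print(f"estimated alpha ~ {alpha:.2f}")
```

The estimate should come out near 1, matching the 1/D scaling. The linear term averages to zero in expectation, so only the quadratic term contributes.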

The authors also suggest that a similar argument can be applied to argue that for parameters, the loss scales as 1/w, where w is the width of the network. This is variance-limited scaling for P. This again relies on previous results showing that neural networks behave like linear models in the limit of infinite width.

The authors use this theory to make a bunch of predictions which they can then empirically test. I’ll only go through the most obvious test: independently measuring the scaling exponent α and the intrinsic dimension d, and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the inequality (though it is still relatively small -- language models just have a high intrinsic dimension). Variance-limited scaling is even easier to identify: we simply measure the scaling exponent α and check whether it is 1.

Rohin's opinion: This seems like a solid attack on a theory of scaling. As we discussed last week, it seems likely that any such theory must condition on some assumption about the “simplicity of reality”; in this paper, that assumption is that the data lies on a low-dimensional manifold within a high-dimensional observation space. This seems like a pretty natural place to start, though I do expect that it isn’t going to capture everything.

Note that many of the authors’ experiments are in teacher-student models. In these models, a large teacher neural network is first initialized to compute some random function; a student network must then learn to imitate the teacher, but has either limited data or limited parameters. The benefit is that they can precisely control factors like the intrinsic dimension d, but the downside is that it isn’t immediately clear that the insights will generalize to real-world tasks and datasets. Their experiments with more realistic tasks are less clean, though I would say that they support the theory.

TECHNICAL AI ALIGNMENT
MISCELLANEOUS (ALIGNMENT)

Bootstrapped Alignment (G Gordon Worley III) (summarized by Rohin): This post distinguishes between three kinds of “alignment”:

1. Not building an AI system at all,

2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,

3. Bootstrapped alignment, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.

The post argues that optimization-based approaches can’t lead to perfect alignment, because there will always eventually be Goodhart effects.

AI GOVERNANCE

Institutionalizing ethics in AI through broader impact requirements (Carina E. A. Prunkl et al) (summarized by Rohin): This short perspective analyzes the policy implemented by NeurIPS last year in which paper submissions were required to have a section discussing the broader impacts of the research. Potential benefits include anticipating potential impacts of research, acting to improve these impacts, reflecting on what research to do given the potential impacts, and improving coordination across the community. However, the policy may also lead to trivialization of ethics and governance (thinking that all the relevant thinking about impacts can be done in this single statement), negative attitudes towards the burden of writing such statements or responsible research in general, a false sense of security that the ethics are being handled, and a perception of ethics as something to be done as an afterthought.

The main challenges that can cause these sorts of negative effects are:

1. Analyzing broader impacts can be difficult and complex,

2. There are not yet any best practices or guidance,

3. There isn’t a clear explanation of the purpose of the statements, or transparency into how they will be evaluated,

4. It’s tempting to focus on the research that determines whether or not your paper is published, rather than the broader impacts statement which mostly does not affect decisions,

5. Researchers may have incentives to emphasize the beneficial impacts of their work and downplay the negative impacts,

6. Biases like motivated reasoning may affect the quality and comprehensiveness of impact statements.

To mitigate these challenges, the authors recommend improving transparency, setting expectations, providing guidance on how to write statements, improving incentives for creating good impact statements, and learning from experience through community deliberation. To improve incentives in particular, broader impact statements could be made an explicit part of peer review which can affect acceptance decisions. These reviews could be improved by involving experts in ethics and governance. Prizes could also be given for outstanding impact statements, similarly to best paper awards.

Rohin's opinion: I’ve been pretty skeptical of the requirement to write a broader impacts statement. My experience of it was primarily one of frustration, for a few reasons:

1. Forecasting the future is hard. I don’t expect a shallow effort to forecast to be all that correlated with the truth. There were lots of simple things I could say that “sound” right but that I don’t particularly expect to be true, such as “improving cooperation in multiagent RL will help build cooperative, helpful personal assistants”. It’s a lot harder to say things that are actually true; a real attempt to do this would typically be a paper in itself.

2. To the extent that the statement does affect reviews, I expect that reviewers want to hear the simple things that sound right; and if I don’t write them, it would probably be a strike against the paper.

3. Even if I did write a good statement, I don’t expect anyone to read it or care about it.

From a birds-eye view, I was also worried that if such statements do become popular, they’ll tend to ossify and build consensus around fairly shallow views that people come up with after just a bit of thought.

I do think many of the proposals in this paper would help quite a bit, and there probably is a version of these statements that I would like and endorse.

OTHER PROGRESS IN AI
REINFORCEMENT LEARNING

Mastering Atari with Discrete World Models (Danijar Hafner et al) (summarized by Flo): Model-based reinforcement learning can have better sample efficiency, allows for smarter exploration strategies, and facilitates generalization between different tasks. Still, previous attempts at model-based RL on the Atari Benchmark like Dreamer (AN #83) and SimPLe (AN #51) were unable to compete with model-free algorithms in terms of final performance. This paper presents DreamerV2, a model-based algorithm that outperforms DQN and its variants -- including Rainbow -- in terms of both median human- or gamer-normalized performance and on mean world-record normalized performance on Atari after 200M environment steps, achieving roughly 35% on the latter (25% if algorithm performance is clipped to max out at 100% for each game).

DreamerV2 learns a recurrent state-space model that stochastically encodes frames and a hidden state into a latent variable and uses the hidden state to predict the next value of the latent variable. Frames and reward are then reconstructed using both the hidden state and the latent variable. A policy is obtained by actor-critic training on the latent state space, leveraging parallelization to train on 468B imagined samples. As DreamerV2 does not use MCTS, it requires 8x less wall-clock time to train than the more complicated but better performing MuZero Reanalyze (AN #75). Unlike earlier approaches, DreamerV2 uses a vector of categorical latent variables rather than Gaussians to enable better model predictions for dynamics with multiple distinct modes, as well as KL-balancing (scaling up the importance of the transition loss compared to the entropy regularizer on the latent variable). Ablations confirm that the image reconstruction loss is crucial for DreamerV2's performance and that both the use of discrete latent variables and KL-balancing lead to significant improvements. Interestingly, preventing the gradients for reward prediction from affecting the world model does not affect performance at all.

Flo's opinion: It is worth noting that the authors use the Dopamine (AN #22) framework for evaluating the model-free baselines, meaning that a slightly stunted version of Rainbow is used on an evaluation protocol different from the original publication without retuning hyperparameters. That said, DreamerV2 definitely performs at a level similar to Rainbow, which is significant progress in model-based RL. In particular, the fact that the reward can be inferred from the world model even without gradients flowing back from the reward suggests transferability of the world models to different tasks with the same underlying dynamics.

MACHINE LEARNING

A Theory of Universal Learning (Olivier Bousquet et al) (summarized by Zach): In machine learning, algorithms are presented with labeled examples of categories from a training dataset and the objective is to output a classifier that distinguishes categories on a validation dataset. The generalization ability of the classifier is usually measured by calculating the error rate of the classifications on the validation set. One popular way to display generalization capability as a function of training set size is to plot a learning curve. A learning curve is a function that outputs the performance of a learning algorithm as a function of the data distribution and training sample size. A faster decay rate for a learning curve indicates a better ability to generalize with fewer data.

In this paper, the authors characterize the conditions for a learning algorithm to have learning curves with a certain decay rate. A learning curve is produced from the decay rate according to the formula 1/rate. The authors show that there are only three universal rates: exponential, linear, and arbitrarily slow decay. Moreover, the authors show there are problem classes that can be learned quickly in each instance but are slow to learn in the worst-case. This stands in contrast to classical results which analyze only the worst-case performance of learning algorithms. This produces pessimistic bounds because the guarantee must hold for all possible data distributions. This is often stronger than what is necessary for practice. Thus, by looking at rates instead of the worst-case learning curve, the authors show that it is possible to learn more efficiently than what is predicted by classical theory.

Zach's opinion: This paper is mathematically sophisticated, but full of examples to illustrate the main points of the theory. More generally, work towards non-uniform bounds has become a popular topic recently as a result of classical generalization theory's inability to explain the success of deep learning and phenomena such as double-descent. These results could allow for progress in explaining the generalization capability of over-parameterized models, such as neural networks. Additionally, the theory presented here could lead to more efficient algorithms that take advantage of potential speedups over empirical risk minimization proved in the paper.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

### A Semitechnical Introductory Dialogue on Solomonoff Induction

March 4, 2021 - 20:27
Published on March 4, 2021 5:27 PM GMT

(Originally posted in December 2015: A dialogue between Ashley, a computer scientist who's never heard of Solomonoff's theory of inductive inference, and Blaine, who thinks it is the best thing since sliced bread.)

i.  Unbounded analysis

ASHLEY:  Good evening, Msr. Blaine.

BLAINE:  Good evening, Msr. Ashley.

ASHLEY:  I've heard there's this thing called "Solomonoff's theory of inductive inference".

BLAINE:  The rumors have spread, then.

ASHLEY:  Yeah, so, what the heck is that about?

BLAINE:  Invented in the 1960s by the mathematician Ray Solomonoff, the key idea in Solomonoff induction is to do sequence prediction by using Bayesian updating on a prior composed of a mixture of all computable probability distributions—

ASHLEY:  Wait. Back up a lot. Before you try to explain what Solomonoff induction is, I'd like you to try to tell me what it does, or why people study it in the first place. I find that helps me organize my listening. Right now I don't even know why I should be interested in this.

BLAINE:  Um, okay. Let me think for a second...

ASHLEY:  Also, while I can imagine things that "sequence prediction" might mean, I haven't yet encountered it in a technical context, so you'd better go a bit further back and start more at the beginning. I do know what "computable" means and what a "probability distribution" is, and I remember the formula for Bayes's Rule although it's been a while.

BLAINE:  Okay. So... one way of framing the usual reason why people study this general field in the first place, is that sometimes, by studying certain idealized mathematical questions, we can gain valuable intuitions about epistemology. That's, uh, the field that studies how to reason about factual questions, how to build a map of reality that reflects the territory—

ASHLEY:  I have some idea what 'epistemology' is, yes. But I think you might need to start even further back, maybe with some sort of concrete example or something.

BLAINE:  Okay. Um. So one anecdote that I sometimes use to frame the value of computer science to the study of epistemology is Edgar Allan Poe's argument in 1833 that chess was uncomputable.

ASHLEY:  That doesn't sound like a thing that actually happened.

BLAINE:  I know, but it totally did happen and not in a metaphorical sense either! Edgar Allan Poe wrote an essay explaining why no automaton would ever be able to play chess, and he specifically mentioned "Mr. Babbage's computing engine" as an example.

You see, in the nineteenth century, there was for a time this sensation known as the Mechanical Turk—supposedly a machine, an automaton, that could play chess. At the grandmaster level, no less.

Now today, when we're accustomed to the idea that it takes a reasonably powerful computer to do that, we can know immediately that the Mechanical Turk must have been a fraud and that there must have been a concealed operator inside—a person with dwarfism, as it turned out. Today we know that this sort of thing is hard to build into a machine. But in the 19th century, even that much wasn't known.

So when Edgar Allan Poe, who besides being an author was also an accomplished magician, set out to write an essay about the Mechanical Turk, he spent the second half of the essay dissecting what was known about the Turk's appearance to (correctly) figure out where the human operator was hiding. But Poe spent the first half of the essay arguing that no automaton, nothing like Mr. Babbage's computing engine, could possibly play chess, which was how he knew a priori that the Turk had a concealed human operator.

ASHLEY:  And what was Poe's argument?

BLAINE:  Poe observed that in an algebraical problem, each step followed from the previous step of necessity, which was why the steps in solving an algebraical problem could be represented by the deterministic motions of gears in something like Mr. Babbage's computing engine. But in a chess problem, Poe said, there are many possible chess moves, and no move follows with necessity from the position of the board; and even if you did select one move, the opponent's move would not follow with necessity, so you couldn't represent it with the determined motion of automatic gears. Therefore, Poe said, whatever was operating the Mechanical Turk must have the nature of Cartesian mind, rather than the nature of deterministic matter, and this was knowable a priori. And then he started figuring out where the required operator was hiding.

ASHLEY:  That's some amazingly impressive reasoning for being completely wrong.

BLAINE:  I know! Isn't it great?

ASHLEY:  I mean, that sounds like Poe correctly identified the hard part of playing computer chess, the branching factor of moves and countermoves, which is the reason why no simple machine could do it. And he just didn't realize that a deterministic machine could deterministically check many possible moves in order to figure out the game tree. So close, and yet so far.

BLAINE:  More than a century later, in 1950, Claude Shannon published the first paper ever written on computer chess. And in passing, Shannon gave the formula for playing perfect chess if you had unlimited computing power, the algorithm you'd use to extrapolate the entire game tree. We could say that Shannon gave a short program that would solve chess if you ran it on a hypercomputer, where a hypercomputer is an ideal computer that can run any finite computation immediately. And then Shannon passed on to talking about the problem of locally guessing how good a board position was, so that you could play chess using only a small search.
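As an illustration of what "extrapolating the entire game tree" means, here is a minimal sketch of exhaustive minimax search. Chess is far too large to run this way on real hardware, so the sketch substitutes a toy subtraction game (an assumption, not anything from Shannon's paper): players alternately remove 1, 2, or 3 stones, and whoever takes the last stone wins. The names `MOVES`, `value`, and `best_move` are hypothetical. The cache is only a convenience for ordinary computers; a hypercomputer would not need it.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # assumption: a toy subtraction game standing in for chess


@lru_cache(maxsize=None)
def value(stones: int) -> int:
    """Return +1 if the player to move wins with perfect play, else -1."""
    if stones == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1
    # Search every legal move to the end of the game. The opponent's value is
    # negated, which is the "negamax" way of writing minimax.
    return max(-value(stones - m) for m in MOVES if m <= stones)


def best_move(stones: int) -> int:
    """Pick a move that leads to the best outcome for the player to move."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -value(stones - m))


print(value(5), best_move(5))
```

The same few lines would play perfect chess if `MOVES` and the terminal test were replaced by the rules of chess and the machine were fast enough, which is the sense in which the unbounded solution is short.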

I say all this to make a point about the value of knowing how to solve problems using hypercomputers, even though hypercomputers don't exist. Yes, there's often a huge gap between the unbounded solution and the practical solution. It wasn't until 1997, forty-seven years after Shannon's paper giving the unbounded solution, that Deep Blue actually won the world chess championship—

ASHLEY:  And that wasn't just a question of faster computing hardware running Shannon's ideal search algorithm. There were a lot of new insights along the way, most notably the alpha-beta pruning algorithm and a lot of improvements in positional evaluation.

BLAINE:  Right!

But I think some people overreact to that forty-seven year gap, and act like it's worthless to have an unbounded understanding of a computer program, just because you might still be forty-seven years away from a practical solution. But if you don't even have a solution that would run on a hypercomputer, you're Poe in 1833, not Shannon in 1950.

The reason I tell the anecdote about Poe is to illustrate that Poe was confused about computer chess in a way that Shannon was not. When we don't know how to solve a problem even given infinite computing power, the very work we are trying to do is in some sense murky to us. When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

ASHLEY:  Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.

BLAINE:  Yes, but the point is that you can't even get started on that if you're arguing about how playing chess has the nature of Cartesian mind rather than matter. At that point you're not 50 years away from winning the chess championship, you're 150 years away, because it took an extra 100 years to move humanity's understanding to the point where Claude Shannon could trivially see how to play perfect chess using a large-enough computer. I'm not trying to exalt the unbounded solution by denigrating the work required to get a bounded solution. I'm not saying that when we have an unbounded solution we're practically there and the rest is a matter of mere lowly efficiency. I'm trying to compare having the unbounded solution to the horrific confusion of not understanding what we're trying to do.

ASHLEY:  Okay. I think I understand why, on your view, it's important to know how to solve problems using infinitely fast computers, or hypercomputers as you call them. When we can say how to answer a question using infinite computing power, that means we crisply understand the question itself, in some sense; while if we can't figure out how to solve a problem using unbounded computing power, that means we're confused about the problem, in some sense. I mean, anyone who's ever tried to teach the more doomed sort of undergraduate to write code knows what it means to be confused about what it takes to compute something.

BLAINE:  Right.

ASHLEY:  So what does this have to do with "Solomonoff induction"?

BLAINE:  Ah! Well, suppose I asked you how to do epistemology using infinite computing power?

ASHLEY:  My good fellow, I would at once reply, "Beep. Whirr. Problem 'do epistemology' not crisply specified." At this stage of affairs, I do not think this reply indicates any fundamental confusion on my part; rather I think it is you who must be clearer.

BLAINE:  Given unbounded computing power, how would you reason in order to construct an accurate map of reality?

ASHLEY:  That still strikes me as rather underspecified.

BLAINE:  Perhaps. But even there I would suggest that it's a mark of intellectual progress to be able to take vague and underspecified ideas like 'do good epistemology' and turn them into crisply specified problems. Imagine that I went up to my friend Cecil, and said, "How would you do good epistemology given unlimited computing power and a short Python program?" and Cecil at once came back with an answer—a good and reasonable answer, once it was explained. Cecil would probably know something quite interesting that you do not presently know.

ASHLEY:  I confess to being rather skeptical of this hypothetical. But if that actually happened—if I agreed, to my own satisfaction, that someone had stated a short Python program that would 'do good epistemology' if run on an unboundedly fast computer—then I agree that I'd probably have learned something quite interesting about epistemology.

BLAINE:  What Cecil knows about, in this hypothetical, is Solomonoff induction. In the same way that Claude Shannon answered "Given infinite computing power, how would you play perfect chess?", Ray Solomonoff answered "Given infinite computing power, how would you perfectly find the best hypothesis that fits the facts?"

ASHLEY:  Suddenly, I find myself strongly suspicious of whatever you are about to say to me.

BLAINE:  That's understandable.

ASHLEY:  In particular, I'll ask at once whether "Solomonoff induction" assumes that our hypotheses are being given to us on a silver platter along with the exact data we're supposed to explain, or whether the algorithm is organizing its own data from a big messy situation and inventing good hypotheses from scratch.

BLAINE:  Great question! It's the second one.

ASHLEY:  Really? Okay, now I have to ask whether Solomonoff induction is a recognized concept in good standing in the field of academic computer science, because that does not sound like something modern-day computer science knows how to do.

BLAINE:  I wouldn't say it's a widely known concept, but it's one that's in good academic standing. The method isn't used in modern machine learning because it requires an infinitely fast computer and isn't easily approximated the way that chess is.

ASHLEY:  This really sounds very suspicious. Last time I checked, we hadn't begun to formalize the creation of good new hypotheses from scratch. I've heard about claims to have 'automated' the work that, say, Newton did in inventing classical mechanics, and I've found them all to be incredibly dubious. Which is to say, they were rigged demos and lies.

BLAINE:  I know, but—

ASHLEY:  And then I'm even more suspicious of a claim that someone's algorithm would solve this problem if only they had infinite computing power. Having some researcher claim that their Good-Old-Fashioned AI semantic network would be intelligent if run on a computer so large that, conveniently, nobody can ever test their theory, is not going to persuade me.

BLAINE:  Do I really strike you as that much of a charlatan? What have I ever done to you, that you would expect me to try pulling a scam like that?

ASHLEY:  That's fair. I shouldn't accuse you of planning that scam when I haven't seen you say it. But I'm pretty sure the problem of "coming up with good new hypotheses in a world full of messy data" is AI-complete. And even Mentif-

BLAINE:  Do not say the name, or he will appear!

ASHLEY:  Sorry. Even the legendary first and greatest of all AI crackpots, He-Who-Googles-His-Name, could assert that his algorithms would be all-powerful on a computer large enough to make his claim unfalsifiable. So what?

BLAINE:  That's a very sensible reply and this, again, is exactly the kind of mental state that reflects a problem that is confusing rather than just hard to implement. It's the sort of confusion Poe might feel in 1833, or close to it. In other words, it's just the sort of conceptual issue we would have solved at the point where we could state a short program that could run on a hypercomputer. Which Ray Solomonoff did in 1964.

ii.  Sequences

BLAINE:  First, try to solve the following puzzle. 1, 3, 4, 7, 11, 18, 29...?

ASHLEY:  Let me look at those for a moment... 47.

BLAINE:  Congratulations on engaging in, as we snooty types would call it, 'sequence prediction'.

ASHLEY:  I'm following you so far.

BLAINE:  The smarter you are, the more easily you can find the hidden patterns in sequences and predict them successfully. You had to notice the resemblance to the Fibonacci rule to guess the next number. Someone who didn't already know about Fibonacci, or who was worse at mathematical thinking, would have taken longer to understand the sequence or maybe never learned to predict it at all.
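The rule Ashley noticed can be written down directly. The following is a small sketch, with a hypothetical function name, that checks whether each term is the sum of the previous two and, if so, predicts the next one.

```python
def next_by_fibonacci_rule(seq):
    """Predict the next term, assuming each term is the sum of the previous two."""
    # Confirm that the rule actually explains every term seen so far.
    for i in range(2, len(seq)):
        if seq[i] != seq[i - 1] + seq[i - 2]:
            raise ValueError("sequence does not follow the Fibonacci rule")
    return seq[-1] + seq[-2]


puzzle = [1, 3, 4, 7, 11, 18, 29]
prediction = next_by_fibonacci_rule(puzzle)
print(prediction)  # 47
```

Note that the rule is hard-coded here; the difficult part of sequence prediction is finding the rule, not applying it.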

ASHLEY:  Still with you.

BLAINE:  It's not a sequence of numbers per se... but can you see how the question, "The sun has risen on the last million days. What is the probability that it rises tomorrow?" could be viewed as a kind of sequence prediction problem?

ASHLEY:  Only if some programmer neatly parses up the world into a series of "Did the Sun rise on day X starting in 4.5 billion BCE, 0 means no and 1 means yes? 1, 1, 1, 1, 1..." and so on. Which is exactly the sort of shenanigan that I see as cheating. In the real world, you go outside and see a brilliant ball of gold touching the horizon, not a giant "1".

ASHLEY:  I can't help but notice that the 'sequence' of webcam frames is absolutely enormous, like, the sequence is made up of 66-megabit 'numbers' appearing 3600 times per minute... oh, right, computers much bigger than the universe. And now you're smiling evilly, so I guess that's the point. I also notice that the sequence is no longer deterministically predictable, that it is no longer a purely mathematical object, and that the sequence of webcam frames observed will depend on the robot's choices. This makes me feel a bit shaky about the analogy to predicting the mathematical sequence 1, 1, 2, 3, 5.

BLAINE:  I'll try to address those points in order. First, Solomonoff induction is about assigning probabilities to the next item in the sequence. I mean, if I showed you a box that said 1, 1, 2, 3, 5, 8 you would not be absolutely certain that the next item would be 13. There could be some more complicated rule that just looked Fibonacci-ish but then diverged. You might guess with 90% probability but not 100% probability, or something like that.

ASHLEY:  This has stopped feeling to me like math.

BLAINE:  There is a large branch of math, to say nothing of computer science, that deals in probabilities and statistical prediction. We are going to be describing absolutely lawful and deterministic ways of assigning probabilities after seeing 1, 3, 4, 7, 11, 18.

ASHLEY:  Okay, but if you're later going to tell me that this lawful probabilistic prediction rule underlies a generally intelligent reasoner, I'm already skeptical.

No matter how large a computer it's run on, I find it hard to imagine that some simple set of rules for assigning probabilities is going to encompass truly and generally intelligent answers about sequence prediction, like Terence Tao would give after looking at the sequence for a while. We just have no idea how Terence Tao works, so we can't duplicate his abilities in a formal rule, no matter how much computing power that rule gets... you're smiling evilly again. I'll be quite interested if that evil smile turns out to be justified.

BLAINE:  Indeed.

ASHLEY:  I also find it hard to imagine that this deterministic mathematical rule for assigning probabilities would notice if a box was outputting an encoded version of "To be or not to be" from Shakespeare by mapping A to Z onto 1 to 26, which I would notice eventually though not immediately upon seeing 20, 15, 2, 5, 15, 18... And you're still smiling evilly.
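The encoding Ashley describes, mapping A to Z onto 1 to 26, is simple to state as code. This sketch (function names are hypothetical) decodes the six numbers quoted above and encodes the full phrase.

```python
def decode(numbers):
    """Map 1..26 back onto the letters A..Z."""
    return "".join(chr(ord("A") + n - 1) for n in numbers)


def encode(text):
    """Map the letters A..Z onto 1..26, ignoring spaces and punctuation."""
    return [ord(c) - ord("A") + 1 for c in text.upper() if "A" <= c <= "Z"]


print(decode([20, 15, 2, 5, 15, 18]))  # TOBEOR
print(encode("To be or not to be"))
```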

BLAINE:  Indeed. That is exactly what Solomonoff induction does. Furthermore, we have theorems establishing that Solomonoff induction can do it way better than you or Terence Tao.

ASHLEY:  A theorem proves this. As in a necessary mathematical truth. Even though we have no idea how Terence Tao works empirically... and there's evil smile number four. Okay. I am very skeptical, but willing to be convinced.

BLAINE:  So if you actually did have a hypercomputer, you could cheat, right? And Solomonoff induction is the most ridiculously cheating cheat in the history of cheating.

ASHLEY:  Go on.

BLAINE:  We just run all possible computer programs to see which are the simplest computer programs that best predict the data seen so far, and use those programs to predict what comes next. This mixture contains, among other things, an exact copy of Terence Tao, thereby allowing us to prove theorems about their relative performance.
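To make the shape of this idea concrete, here is a toy sketch. It is emphatically not Solomonoff induction: instead of all computer programs it uses a tiny hand-picked hypothesis class (rules of the form x[n] = a*x[n-1] + b*x[n-2] + c with small integer coefficients), and instead of true program length it uses a crude made-up cost. All names and the cost function are assumptions for illustration. What it shares with the real thing is the structure: give simpler hypotheses more prior weight, discard hypotheses the data has falsified, and predict with the weighted mixture of what remains.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

COEFFS = range(-3, 4)  # assumption: a tiny, hand-picked hypothesis class


def description_length(rule):
    """Crude stand-in for program length: larger coefficients cost more bits."""
    return sum(abs(k) for k in rule)


def consistent(rule, seq):
    """True if x[n] = a*x[n-1] + b*x[n-2] + c reproduces every observed term."""
    a, b, c = rule
    return all(seq[n] == a * seq[n - 1] + b * seq[n - 2] + c
               for n in range(2, len(seq)))


def predict_next(seq):
    """Posterior over the next term: prior 2**-length, zero weight if falsified."""
    weights = defaultdict(Fraction)
    for rule in product(COEFFS, repeat=3):
        if consistent(rule, seq):
            a, b, c = rule
            nxt = a * seq[-1] + b * seq[-2] + c
            weights[nxt] += Fraction(1, 2 ** description_length(rule))
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}


print(predict_next([1, 3, 4]))                 # several candidates remain
print(predict_next([1, 3, 4, 7, 11, 18]))      # only the Fibonacci rule survives
```

With three terms the mixture spreads its probability over several next values; with six terms every rule in this class except a = 1, b = 1, c = 0 has been eliminated, so the prediction is 29 with probability 1 within this toy class.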

ASHLEY:  Is this an actual reputable math thing? I mean really?

BLAINE:  I'll deliver the formalization later, but you did ask me to first state the point of it all. The point of Solomonoff induction is that it gives us a gold-standard ideal for sequence prediction, and this gold-standard prediction only errs by a bounded amount, over infinite time, relative to the best computable sequence predictor. We can also see it as formalizing the intuitive idea that was expressed by William Ockham a few centuries earlier that simpler theories are more likely to be correct, and as telling us that 'simplicity' should be measured in algorithmic complexity, which is the size of a computer program required to output a hypothesis's predictions.

ASHLEY:  I think I would have to read more on this subject to actually follow that. What I'm hearing is that Solomonoff induction is a reputable idea that is important because it gives us a kind of ideal for sequence prediction. This ideal also has something to do with Occam's Razor, and stakes a claim that the simplest theory is the one that can be represented by the shortest computer program. You identify this with "doing good epistemology".

BLAINE:  Yes, those are legitimate takeaways. Another way of looking at it is that Solomonoff induction is an ideal but uncomputable answer to the question "What should our priors be?", which is left open by understanding Bayesian updating.

ASHLEY:  Can you say how Solomonoff induction answers the question of, say, the prior probability that Canada is planning to invade the United States? I once saw a crackpot website that tried to invoke Bayesian probability about it, but only after setting the prior at 10% or something like that, I don't recall exactly. Does Solomonoff induction let me tell him that he's making a math error, instead of just calling him silly in an informal fashion?

BLAINE:  If you're expecting to sit down with Leibniz and say, "Gentlemen, let us calculate" then you're setting your expectations too high. Solomonoff gives us an idea of how we should compute that quantity given unlimited computing power. It doesn't give us a firm recipe for how we can best approximate that ideal in real life using bounded computing power, or human brains. That's like expecting to play perfect chess after you read Shannon's 1950 paper. But knowing the ideal, we can extract some intuitive advice that might help our online crackpot if only he'd listen.

ASHLEY:  But according to you, Solomonoff induction does say in principle what is the prior probability that Canada will invade the United States.

BLAINE:  Yes, up to a choice of universal Turing machine.

ASHLEY:  (looking highly skeptical)  So I plug a universal Turing machine into the formalism, and in principle, I get out a uniquely determined probability that Canada invades the USA.

BLAINE:  Exactly!

ASHLEY:  Uh huh. Well, go on.

BLAINE:  So, first, we have to transform this into a sequence prediction problem.

ASHLEY:  Like a sequence of years in which Canada has and hasn't invaded the US, mostly zero except around 1812—

ASHLEY:  That seems like a lot of data and some of it is redundant, like there'll be lots of similar pixels for blue sky—

BLAINE:  That data is what you got as an agent. If we want to translate the question of the prediction problem Ashley faces into theoretical terms, we should give the sequence predictor all the data that you had available, including all those repeating blue pixels of the sky. Who knows? Maybe there was a Canadian warplane somewhere in there, and you didn't notice.

ASHLEY:  But it's impossible for my brain to remember all that data. If we neglect for the moment how the retina actually works and suppose that I'm seeing the same 1920×1080 @60Hz feed the robot would, that's far more data than my brain can realistically learn per second.

BLAINE:  So then Solomonoff induction can do better than you can, using its unlimited computing power and memory. That's fine.

ASHLEY:  But what if you can do better by forgetting more?

BLAINE:  If you have limited computing power, that makes sense. With unlimited computing power, that really shouldn't happen and that indeed is one of the lessons of Solomonoff induction. An unbounded Bayesian never expects to do worse by updating on another item of evidence—for one thing, you can always just do the same policy you would have used if you hadn't seen that evidence. That kind of lesson is one of the lessons that might not be intuitively obvious, but which you can feel more deeply by walking through the math of probability theory. With unlimited computing power, nothing goes wrong as a result of trying to process 4 gigabits per second; every extra bit just produces a better expected future prediction.

ASHLEY:  Okay, so we start with literally all the data I have available. That's 4 gigabits per second if we imagine 1920×1080 frames of 32-bit pixels repeating 60 times per second. Though I remember hearing 100 megabits per second would be a better estimate of what the retina sends out, and that it's pared down to 1 megabit per second very quickly by further processing.
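The figures in this exchange follow from simple arithmetic, sketched here using only the numbers stated in the dialogue (1920×1080 frames, 32-bit pixels, 60 frames per second). The variable names are hypothetical.

```python
WIDTH, HEIGHT = 1920, 1080
BITS_PER_PIXEL = 32
FRAMES_PER_SECOND = 60

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
bits_per_second = bits_per_frame * FRAMES_PER_SECOND

print(bits_per_frame / 1e6)    # about 66 megabits per frame
print(bits_per_second / 1e9)   # about 4 gigabits per second
print(FRAMES_PER_SECOND * 60)  # 3600 frames per minute
```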

BLAINE:  Right. We start with all of that data, going back to when you were born. Or maybe when your brain formed in the womb, though it shouldn't make much difference.

ASHLEY:  I note that there are some things I know that don't come from my sensory inputs at all. Chimpanzees learn to be afraid of skulls and snakes much faster than they learn to be afraid of other arbitrary shapes. I was probably better at learning to walk in Earth gravity than I would have been at navigating in zero G. Those are heuristics I'm born with, based on how my brain was wired, which ultimately stems from my DNA specifying the way that proteins should fold to form neurons—not from any photons that entered my eyes later.

BLAINE:  So, for purposes of following along with the argument, let's say that your DNA is analogous to the code of a computer program that makes predictions. What you're observing here is that humans have 750 megabytes of DNA, and even if most of that is junk and not all of what's left is specifying brain behavior, it still leaves a pretty large computer program that could have a lot of prior information programmed into it.

Let's say that your brain, or rather, your infant pre-brain wiring algorithm, was effectively a 7.5 megabyte program—if it's actually 75 megabytes, that makes little difference to the argument. By exposing that 7.5 megabyte program to all the information coming in from your eyes, ears, nose, proprioceptive sensors telling you where your limbs were, and so on, your brain updated itself into forming the modern Ashley, whose hundred trillion synapses might be encoded by, say, one petabyte of information.
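The 750 megabyte figure can be reproduced with a rough calculation. The genome size of about three billion base pairs is an assumption brought in for this sketch and is not stated in the dialogue; the synapse count and petabyte figure are taken from the text above.

```python
BASE_PAIRS = 3_000_000_000  # assumption: approximate human genome size
BITS_PER_BASE = 2           # four possible bases (A, C, G, T) need 2 bits each

genome_bytes = BASE_PAIRS * BITS_PER_BASE // 8
print(genome_bytes // 10**6)    # 750 megabytes of raw DNA

SYNAPSES = 100 * 10**12         # "hundred trillion synapses" from the dialogue
BRAIN_BYTES = 10**15            # "one petabyte" from the dialogue
print(BRAIN_BYTES // SYNAPSES)  # implied budget of 10 bytes per synapse

WIRING_BYTES = 7_500_000        # the 7.5 megabyte figure used for the argument
print(BRAIN_BYTES // WIRING_BYTES)  # how many times larger the learned state is
```

On these numbers the learned state is over a hundred million times larger than the innate program, which is why changing 7.5 megabytes to 75 megabytes makes little difference to the argument.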

ASHLEY:  The thought does occur to me that some environmental phenomena have effects on me that can't be interpreted as "sensory information" in any simple way, like the direct effect that alcohol has on my neurons, and how that feels to me from the inside. But it would be perverse to claim that this prevents you from trying to summarize all the information that the Ashley-agent receives into a single sequence, so I won't press the point.

(ELIEZER:  (whispering)  More on this topic later.)

ASHLEY:  Oh, and for completeness's sake, wouldn't there also be further information embedded in the laws of physics themselves? Like, the way my brain executes implicitly says something about the laws of physics in the universe I'm in.

BLAINE:  Metaphorically speaking, our laws of physics would play the role of a particular choice of Universal Turing Machine, which has some effect on which computations count as "simple" inside the Solomonoff formula. But normally, the UTM should be very simple compared to the amount of data in the sequence we're trying to predict, just like the laws of physics are very simple compared to a human brain. In terms of algorithmic complexity, the laws of physics are very simple compared to watching a 1920×1080 @60Hz visual field for a day.

ASHLEY:  Part of my mind feels like the laws of physics are quite complicated compared to going outside and watching a sunset. Like, I realize that's false, but I'm not sure how to say out loud exactly why it's false...

BLAINE:  Because the algorithmic complexity of a system isn't measured by how long a human has to go to college to understand it, it's measured by the size of the computer program required to generate it. The language of physics is differential equations, and it turns out that this is something difficult to beat into some human brains, but differential equations are simple to program into a simple Turing Machine.

ASHLEY:  Right, like, the laws of physics actually have far fewer details to them than, say, human nature. At least on the Standard Model of Physics. I mean, in principle there could be another decillion undiscovered particle families out there.

BLAINE:  The concept of "algorithmic complexity" isn't about seeing something with lots of gears and details, it's about the size of computer program required to compress all those details. The Mandelbrot set looks very complicated visually, you can keep zooming in using more and more detail, but there's a very simple rule that generates it, so we say the algorithmic complexity is very low.
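The "very simple rule" is repeated application of z → z² + c, starting from z = 0, and asking whether z stays bounded. The sketch below draws a coarse text picture of the set. The iteration cap `max_iter` and the viewing window are assumptions chosen for a quick approximation, since a finite program on a real computer can only test boundedness up to some cutoff.

```python
def in_mandelbrot(c: complex, max_iter: int = 50) -> bool:
    """Approximate membership test: iterate z -> z*z + c and watch for escape."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:  # once |z| exceeds 2 the orbit is known to diverge
            return False
    return True


def render(width: int = 60, height: int = 20) -> str:
    """Draw the region [-2.0, 0.6] x [-1.2, 1.2] as rows of '#' and '.'."""
    rows = []
    for j in range(height):
        y = 1.2 - 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 2.6 * i / (width - 1)
            row += "#" if in_mandelbrot(complex(x, y)) else "."
        rows.append(row)
    return "\n".join(rows)


print(render())
```

The whole generator fits in about twenty lines, no matter how much visual detail appears when the window is made smaller and the resolution larger.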

ASHLEY:  All the visual information I've seen is something that happens within the physical universe, so how can it be more complicated than the universe? I mean, I have a sense on some level that this shouldn't be a problem, but I don't know why it's not a problem.

BLAINE:  That's because particular parts of the universe can have much higher algorithmic complexity than the entire universe!

Consider a library that contains all possible books. It's very easy to write a computer program that generates all possible books. So any particular book in the library contains much more algorithmic information than the entire library; it contains the information required to say 'look at this particular book here'.

If pi is normal, then somewhere in its digits is a copy of Shakespeare's Hamlet—but the number saying which particular digit of pi to start looking at, will be just about exactly as large as Hamlet itself. The copy of Shakespeare's Hamlet that exists in the decimal expansion of pi is more complex than pi itself.

If you zoomed way in and restricted your vision to a particular part of the Mandelbrot set, what you saw might be much more algorithmically complex than the entire Mandelbrot set, because the specification has to say where in the Mandelbrot set you are.

Similarly, the world Earth is much more algorithmically complex than the laws of physics. Likewise, the visual field you see over the course of a second can easily be far more algorithmically complex than the laws of physics.
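The library example can be sketched in Python under toy assumptions: a two-letter alphabet, and the hypothetical helpers `all_books` and `index_of`. The generator for the whole library is a few lines long, while the index needed to point at one particular book takes about as many bits as the book itself.

```python
from itertools import count, islice, product
from math import log2

ALPHABET = "ab"  # toy alphabet; a real library would use many more symbols

def all_books():
    """Yield every finite string over ALPHABET, shortest first."""
    for length in count(0):
        for letters in product(ALPHABET, repeat=length):
            yield "".join(letters)

def index_of(book: str) -> int:
    """Position of `book` in the all_books() enumeration, computed directly."""
    k = len(ALPHABET)
    shorter = sum(k ** n for n in range(len(book)))  # number of strictly shorter books
    rank = 0
    for ch in book:
        rank = rank * k + ALPHABET.index(ch)  # base-k rank among same-length books
    return shorter + rank

book = "abbabaabbbababba"
idx = index_of(book)
print(list(islice(all_books(), 7)))  # the first few books in the library
print("bits in the book itself:", len(book) * log2(len(ALPHABET)))
print("bits needed for its index:", idx.bit_length())
```

Saying "this book here" costs at least as many bits as the book contains, even though the library as a whole is cheap to describe.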

ASHLEY:  Okay, I think I get that. And similarly, even though the ways that proteins fold up are very complicated, in principle we could get all that info using just the simple fundamental laws of physics plus the relatively simple DNA code for the protein. There are all sorts of obvious caveats about epigenetics and so on, but those caveats aren't likely to change the numbers by a whole order of magnitude.

BLAINE:  Right!

ASHLEY:  So the laws of physics are, like, a few kilobytes, and my brain has say 75 megabytes of innate wiring instructions. And then I get to see a lot more information than that over my lifetime, like a megabit per second after my initial visual system finishes preprocessing it, and then most of that is forgotten. Uh... what does that have to do with Solomonoff induction again?

BLAINE:  Solomonoff induction quickly catches up to any single computer program at sequence prediction, even if the original program is very large and contains a lot of prior information about the environment. If a program is 75 megabytes long, it can only predict 75 megabytes worth of data better than the Solomonoff inductor before the Solomonoff inductor catches up to it.

That doesn't mean that a Solomonoff inductor knows everything a baby does after the first second of exposure to a webcam feed, but it does mean that after the first second, the Solomonoff inductor is already no more surprised than a baby by the vast majority of pixels in the next frame.

Every time the Solomonoff inductor assigns half as much probability as the baby to the next pixel it sees, that's one bit spent permanently out of the 75 megabytes of error that can happen before the Solomonoff inductor catches up to the baby.

That your brain is written in the laws of physics also has some implicit correlation with the environment, but that's like saying that a program is written in the same programming language as the environment. The language can contribute something to the power of the program, and the environment being written in the same programming language can be a kind of prior knowledge. But if Solomonoff induction starts from a standard Universal Turing Machine as its language, that doesn't contribute any more bits of lifetime error than the complexity of that programming language in the UTM.
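The "bits spent" accounting can be illustrated with a much smaller Bayesian mixture. The sketch below is not Solomonoff induction: it mixes just two hand-written predictors, and the names `expert`, `ignorant`, and `mixture_excess_bits` are hypothetical. A 20-bit prior penalty stands in for the 75 megabytes. The mixture's total extra surprise relative to the expert never exceeds the penalty.

```python
from math import log2

def expert(history):
    """Stand-in for the 'baby': already knows the data alternates 1, 0, 1, 0, ..."""
    return 0.99 if len(history) % 2 == 0 else 0.01  # P(next bit = 1)

def ignorant(history):
    """Knows nothing: always 50/50."""
    return 0.5

def mixture_excess_bits(data, expert_prior_bits):
    """Extra log-loss, in bits, of a two-model Bayesian mixture relative to the expert."""
    w_expert = 2.0 ** -expert_prior_bits
    w_other = 1.0 - w_expert
    excess = 0.0
    history = []
    for bit in data:
        pe = expert(history) if bit == 1 else 1 - expert(history)
        po = ignorant(history) if bit == 1 else 1 - ignorant(history)
        p_mix = w_expert * pe + w_other * po  # mixture's probability of the observed bit
        excess += log2(pe / p_mix)            # bits by which the expert beat the mixture
        # Bayesian update of the two weights
        w_expert, w_other = w_expert * pe / p_mix, w_other * po / p_mix
        history.append(bit)
    return excess

data = [1, 0] * 50
print(mixture_excess_bits(data, expert_prior_bits=20))  # stays at or below 20
```

The bound holds because the mixture's probability for the whole sequence is at least the expert's prior weight times the expert's probability for it.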

ASHLEY:  Let me jump back a couple of steps and return to the notion of my brain wiring itself up in response to environmental information. I'd expect an important part of that process was my brain learning to control the environment, not just passively observing it. Like, it mattered to my brain's wiring algorithm that my brain saw the room shift in a certain way when it sent out signals telling my eyes to move.

BLAINE:  Indeed. But talking about the sequential control problem is more complicated math. AIXI is the ideal agent that uses Solomonoff induction as its epistemology and expected reward as its decision theory. That introduces extra complexity, so it makes sense to talk about just Solomonoff induction first. We can talk about AIXI later. So imagine for the moment that we were just looking at your sensory data, and trying to predict what would come next in that.

ASHLEY:  Wouldn't it make more sense to look at the brain's inputs and outputs, if we wanted to predict the next input? Not just look at the series of previous inputs?

BLAINE:  It'd make the problem easier for a Solomonoff inductor to solve, sure; but it also makes the problem more complicated. Let's talk instead about what would happen if you took the complete sensory record of your life, gave it to an ideally smart agent, and asked the agent to predict what you would see next. Maybe the agent could do an even better job of prediction if we also told it about your brain's outputs, but I don't think that subtracting the outputs would leave it helpless to see patterns in the inputs.

ASHLEY:  It sounds like a pretty hard problem to me, maybe even an unsolvable one. I'm thinking of the distinction in computer science between needing to learn from non-chosen data, versus learning when you can choose particular queries. Learning can be much faster in the second case.

BLAINE:  In terms of what can be predicted in principle given the data, what facts are actually reflected in it that Solomonoff induction might uncover, we shouldn't imagine a human trying to analyze the data. We should imagine an entire advanced civilization pondering it for years. If you look at it from that angle, then the alien civilization isn't going to balk at the fact that it's looking at the answers to the queries that Ashley's brain chose, instead of the answers to the queries it chose itself.

Like, if the Ashley had already read Shakespeare's Hamlet—if the image of those pages had already crossed the sensory stream—and then the Ashley saw a mysterious box outputting 20, 15, 2, 5, 15, 18, I think somebody eavesdropping on that sensory data would be equally able to guess that this was encoding 'tobeor' and guess that the next thing the Ashley saw might be the box outputting 14. You wouldn't even need an entire alien civilization of superintelligent cryptographers to guess that. And it definitely wouldn't be a killer problem that Ashley was controlling the eyeball's saccades, even if you could learn even faster by controlling the eyeball yourself.

So far as the computer-science distinction goes, Ashley's eyeball is being controlled to make intelligent queries and seek out useful information; it's just Ashley controlling the eyeball instead of you—that eyeball is not a query-oracle answering random questions.
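The box example uses the encoding a=1 through z=26. A minimal Python sketch of the eavesdropper's guess follows, with the helper names `letters_to_numbers` and `numbers_to_letters` being hypothetical and the opening of the soliloquy typed in by hand:

```python
def letters_to_numbers(text):
    """Map a..z to 1..26, ignoring spaces and punctuation."""
    return [ord(ch) - ord("a") + 1 for ch in text.lower() if ch.isalpha()]

def numbers_to_letters(numbers):
    """Inverse mapping: 1..26 back to a..z."""
    return "".join(chr(n - 1 + ord("a")) for n in numbers)

observed = [20, 15, 2, 5, 15, 18]
decoded = numbers_to_letters(observed)

soliloquy = "To be, or not to be"
stream = letters_to_numbers(soliloquy)
next_value = stream[len(observed)]  # what the box should output next

print(decoded, next_value)
```

Here `decoded` is "tobeor" and `next_value` is 14, the code for the letter n.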

ASHLEY:  Okay, I think this example is helping my understanding of what we're doing here. In the case above, the next item in the Ashley-sequence wouldn't actually be 14. It would be this huge 1920×1080 visual field that showed the box flashing a little picture of '14'.

BLAINE:  Sure. Otherwise it would be a rigged demo, as you say.

ASHLEY:  I think I'm confused about the idea of predicting the visual field. It seems to me that what with all the dust specks in my visual field, and maybe my deciding to tilt my head using motor instructions that won't appear in the sequence, there's no way to exactly predict the 66-megabit integer representing the next visual frame. So it must be doing something other than the equivalent of guessing "14" in a simpler sequence, but I'm not sure what.

BLAINE:  Indeed, there'd be some element of thermodynamic and quantum randomness preventing that exact prediction even in principle. So instead of predicting one particular next frame, we put a probability distribution on it.

ASHLEY:  A probability distribution over possible 66-megabit frames? Like, a table with $2^{66{,}000{,}000}$ entries, summing to 1?

BLAINE:  Sure. $2^{32 \times 1920 \times 1080}$ isn't a large number when you have unlimited computing power. As Martin Gardner once observed, "Most finite numbers are very much larger." Like I said, Solomonoff induction is an epistemic ideal that requires an unreasonably large amount of computing power.
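For a sense of scale, the arithmetic behind the "66-megabit" frame can be checked in a few lines of Python. The 32 bits per pixel is read off the exponent 32 × 1920 × 1080. The digit count is computed from a logarithm rather than by building the number itself.

```python
from math import log10

BITS_PER_PIXEL = 32        # colour depth implied by the exponent 32 x 1920 x 1080
WIDTH, HEIGHT = 1920, 1080

bits_per_frame = BITS_PER_PIXEL * WIDTH * HEIGHT     # 66355200
decimal_digits = int(bits_per_frame * log10(2)) + 1  # digits in 2**bits_per_frame

print(bits_per_frame)
print(decimal_digits)
```

So the table of possible frames has a number of entries that is itself roughly twenty million decimal digits long.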

ASHLEY:  I don't deny that big computations can sometimes help us understand little ones. But at the point when we're talking about probability distributions that large, I have some trouble holding onto what the probability distribution is supposed to mean.

BLAINE:  Really? Just imagine a probability distribution over N possibilities, then let N go to $2^{66{,}000{,}000}$. If we were talking about a letter ranging from A to Z, then putting 100 times as much probability mass on (X, Y, Z) as on the rest of the alphabet, would say that although you didn't know exactly what letter would happen, you expected it would be toward the end of the alphabet. You would have used 26 probabilities, summing to 1, to precisely state that prediction.

In Solomonoff induction, since we have unlimited computing power, we express our uncertainty about a 1920×1080 video frame the same way. All the various pixel fields you could see if your eye jumped to a plausible place, saw a plausible number of dust specks, and saw the box flash something that visually encoded '14', would have high probability. Pixel fields where the box vanished and was replaced with a glow-in-the-dark unicorn would have very low, though not zero, probability.
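The alphabet example is small enough to write out in full. The Python sketch below reads "100 times as much probability mass on (X, Y, Z) as on the rest" as a statement about total mass, which is an assumption. It also spreads the mass evenly within each group, which the dialogue does not specify.

```python
from fractions import Fraction
from string import ascii_uppercase

tail = "XYZ"
rest = [c for c in ascii_uppercase if c not in tail]

# Assumed reading: X, Y, Z together carry 100x the total mass of the other 23 letters.
mass_tail = Fraction(100, 101)
mass_rest = Fraction(1, 101)

dist = {c: mass_tail / len(tail) for c in tail}
dist.update({c: mass_rest / len(rest) for c in rest})

print(len(dist), sum(dist.values()))  # 26 probabilities summing to 1
print(dist["Z"], dist["A"])
```

A distribution over video frames is the same kind of object, just with vastly more entries.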

ASHLEY:  Can we really get away with viewing things that way?

BLAINE:  If we could not make identifications like these in principle, there would be no principled way in which we could say that you had ever expected to see something happen—no way to say that one visual field your eyes saw had higher probability than any other sensory experience. We couldn't justify science; we couldn't say that, having performed Galileo's experiment by rolling an inclined cylinder down a plane, Galileo's theory was thereby to some degree supported by having assigned a high relative probability to the only actual observations our eyes ever report.

ASHLEY:  I feel a little unsure of that jump, but I suppose I can go along with that for now. Then the question of "What probability does Solomonoff induction assign to Canada invading?" is to be identified, in principle, with the question "Given my past life experiences and all the visual information that's entered my eyes, what is the relative probability of seeing visual information that encodes Google News with the headline 'CANADA INVADES USA' at some point during the next 300 million seconds?"

BLAINE:  Right!

ASHLEY:  And Solomonoff induction has an in-principle way of assigning this a relatively low probability, which that online crackpot could do well to learn from as a matter of principle, even if he couldn't begin to carry out the exact calculations that involve assigning probabilities to exponentially vast tables.

BLAINE:  Precisely!

ASHLEY:  Fairness requires that I congratulate you on having come further in formalizing 'do good epistemology' as a sequence prediction problem than I previously thought you might.

I mean, you haven't satisfied me yet, but I wasn't expecting you to get even this far.

iii.  Hypotheses

BLAINE:  Next, we consider how to represent a hypothesis inside this formalism.

ASHLEY:  Hmm. You said something earlier about updating on a probabilistic mixture of computer programs, which leads me to suspect that in this formalism, a hypothesis or way the world can be is a computer program that outputs a sequence of integers.

BLAINE:  There's indeed a version of Solomonoff induction that works like that. But I prefer the version where a hypothesis assigns probabilities to sequences. Like, if the hypothesis is that the world is a fair coin, then we shouldn't try to make that hypothesis predict "heads—tails—tails—tails—heads" but should let it just assign a 1/32 prior probability to the sequence HTTTH.

ASHLEY:  I can see that for coins, but I feel a bit iffier on what this means as a statement about the real world.

BLAINE:  A single hypothesis inside the Solomonoff mixture would be a computer program that took in a series of video frames, and assigned a probability to each possible next video frame. Or for greater simplicity and elegance, imagine a program that took in a sequence of bits, ones and zeroes, and output a rational number for the probability of the next bit being '1'. We can readily go back and forth between a program like that, and a probability distribution over sequences.

Like, if you can answer all of the questions, "What's the probability that the coin comes up heads on the first flip?", "What's the probability of the coin coming up heads on the second flip, if it came up heads on the first flip?", and "What's the probability that the coin comes up heads on the second flip, if it came up tails on the first flip?" then we can turn that into a probability distribution over sequences of two coinflips. Analogously, if we have a program that outputs the probability of the next bit, conditioned on a finite number of previous bits taken as input, that program corresponds to a probability distribution over infinite sequences of bits.

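$$P_{prog}(bits_{1 \ldots N}) = \prod_{i=1}^{N} \text{InterpretProb}(prog(bits_{1 \ldots i-1}),\, bits_i)$$

$$\text{InterpretProb}(prog(x),\, y) = \begin{cases} \text{InterpretFrac}(prog(x)) & \text{if } y = 1 \\ 1 - \text{InterpretFrac}(prog(x)) & \text{if } y = 0 \\ 0 & \text{if } prog(x) \text{ does not halt} \end{cases}$$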
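The correspondence between a next-bit predictor and a distribution over sequences is just the chain rule, which can be sketched in Python. The names `sequence_prob`, `fair_coin`, and `sticky` are hypothetical. The sketch also assumes every predictor halts, so the "does not halt" case is not modeled.

```python
from fractions import Fraction
from itertools import product

def sequence_prob(prog, bits):
    """Probability that a next-bit predictor assigns to a whole bit string."""
    total = Fraction(1)
    for i, bit in enumerate(bits):
        p_one = prog(bits[:i])  # P(next bit = 1 | earlier bits)
        total *= p_one if bit == 1 else 1 - p_one
    return total

def fair_coin(prefix):
    """The hypothesis that the world is a fair coin."""
    return Fraction(1, 2)

def sticky(prefix):
    """Made-up example hypothesis: the next bit tends to repeat the previous one."""
    if not prefix:
        return Fraction(1, 2)
    return Fraction(3, 4) if prefix[-1] == 1 else Fraction(1, 4)

print(sequence_prob(fair_coin, (1, 0, 0, 0, 1)))  # 1/32, as for HTTTH
print(sum(sequence_prob(sticky, seq) for seq in product((0, 1), repeat=3)))  # 1
```

Any predictor fed through `sequence_prob` yields probabilities that sum to 1 over all strings of a given length, which is what makes it a distribution over sequences.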

ASHLEY:  I think I followed along with that in theory, though it's not a type of math I'm used to (yet). So then in what sense is a program that assigns probabilities to sequences, a way the world could be—a hypothesis about the world?

BLAINE:  Well, I mean, for one thing, we can see the infant Ashley as a program with 75 megabytes of information about how to wire up its brain in response to sense data, that sees a bunch of sense data, and then experiences some degree of relative surprise. Like in the baby-looking-paradigm experiments where you show a baby an object disappearing behind a screen, and the baby looks longer at those cases, and so we suspect that babies have a concept of object permanence.

ASHLEY:  That sounds like a program that's a way Ashley could be, not a program that's a way the world could be.

BLAINE:  Those indeed are dual perspectives on the meaning of Solomonoff induction. Maybe we can shed some light on this by considering a simpler induction rule, Laplace's Rule of Succession, invented by the Reverend Thomas Bayes in the 1750s, and named after Pierre-Simon Laplace, the inventor of Bayesian reasoning.

ASHLEY:  Pardon me?

BLAINE:  Suppose you have a biased coin with an unknown bias, and every possible bias between 0 and 1 is equally probable.

ASHLEY:  Okay. Though in the real world, it's quite likely that an unknown frequency is exactly 0, 1, or 1/2. If you assign equal probability density to every part of the real number field between 0 and 1, the probability of 1 is 0. Indeed, the probability of all rational numbers put together is zero.

BLAINE:  The original problem considered by Thomas Bayes was about an ideal billiard ball bouncing back and forth on an ideal billiard table many times and eventually slowing to a halt; and then bouncing other billiards to see if they halted to the left or the right of the first billiard. You can see why, in first considering the simplest form of this problem without any complications, we might consider every position of the first billiard to be equally probable.

ASHLEY:  Sure. Though I note with pointless pedantry that if the billiard was really an ideal rolling sphere and the walls were perfectly reflective, it'd never halt in the first place.

BLAINE:  Suppose we're told that, after rolling the original billiard ball and then 5 more billiard balls, one billiard ball was to the right of the original, an R. The other four were to the left of the original, or Ls. Again, that's 1 R and 4 Ls. Given only this data, what is the probability that the next billiard ball rolled will be on the left of the original, another L?

ASHLEY:  Five sevenths.

BLAINE:  Ah, you've heard this problem before?

ASHLEY:  No, but it's obvious.

BLAINE:  Uh... really?

ASHLEY:  Combinatorics. Consider just the orderings of the balls, instead of their exact positions. Designate the original ball with the symbol ❚, the next five balls as LLLLR, and the next ball to be rolled as ✚. Given that the current ordering of these six balls is LLLL❚R and that all positions and spacings of the underlying balls are equally likely, after rolling the ✚, there will be seven equally likely orderings ✚LLLL❚R, L✚LLL❚R, LL✚LL❚R, and so on up to LLLL❚✚R and LLLL❚R✚. In five of those seven orderings, the ✚ is on the left of the ❚. In general, if we see M of L and N of R, the probability of the next item being an L is (M+1)/(M+N+2).

BLAINE:  Gosh... Well, the much more complicated proof originally devised by Thomas Bayes starts by considering every position of the original ball to be equally likely a priori, the additional balls as providing evidence about that position, and then integrating over the posterior probabilities of the original ball's possible positions to arrive at the probability that the next ball lands on the left or right.

ASHLEY:  Heh. And is all that extra work useful if you also happen to know a little combinatorics?

BLAINE:  Well, it tells me exactly how my beliefs about the original ball change with each new piece of evidence: the new posterior probability function on the ball's position. Suppose I instead asked you something along the lines of, "Given 4 L and 1 R, where do you think the original ball ❚ is most likely to be on the number line? How likely is it to be within 0.1 distance of there?"

ASHLEY:  That's fair; I don't see a combinatoric answer for the latter part. You'd have to actually integrate over the density function $f^M (1-f)^N \, df$.

BLAINE:  Anyway, let's just take at face value that Laplace's Rule of Succession says that, after observing M 1s and N 0s, the probability of getting a 1 next is (M+1)/(M+N+2).
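The integration route and the combinatoric shortcut can be checked against each other. The Python sketch below relies on the standard identity that the integral of $f^a (1-f)^b$ over [0, 1] equals $a!\,b!/(a+b+1)!$ for non-negative integers. The function names `beta_integral` and `prob_next_left` are hypothetical.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact integral of f**a * (1-f)**b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def prob_next_left(m, n):
    """P(next is L | m Ls, n Rs) with a uniform prior on the original ball's position."""
    # Posterior density is proportional to f**m * (1-f)**n; average f over it.
    return beta_integral(m + 1, n) / beta_integral(m, n)

print(prob_next_left(4, 1))  # 5/7

# The closed form (M+1)/(M+N+2) agrees for every small case tried here.
for m in range(6):
    for n in range(6):
        assert prob_next_left(m, n) == Fraction(m + 1, m + n + 2)
```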

ASHLEY:  But of course.

BLAINE:  We can consider Laplace's Rule as a short Python program that takes in a sequence of 1s and 0s, and spits out the probability that the next bit in the sequence will be 1. We can also consider it as a probability distribution over infinite sequences, like this:

• 0 : 1/2
• 1 : 1/2
• 00 : 1/2∗2/3=1/3
• 01 : 1/2∗1/3=1/6
• 000 : 1/2∗2/3∗3/4=1/4
• 001 : 1/2∗2/3∗1/4=1/12
• 010 : 1/2∗1/3∗1/2=1/12

... and so on.
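Both views of Laplace's Rule fit in a short Python sketch: `laplace` is the next-bit predictor, and `prob_of` (a hypothetical helper name) multiplies its predictions along a string to reproduce the table above.

```python
from fractions import Fraction

def laplace(sequence: str) -> Fraction:
    """Laplace's Rule: P(next bit is '1') given the bits seen so far."""
    return Fraction(sequence.count("1") + 1, len(sequence) + 2)

def prob_of(sequence: str) -> Fraction:
    """Probability the rule assigns to the whole string, one bit at a time."""
    p = Fraction(1)
    for i, bit in enumerate(sequence):
        p_one = laplace(sequence[:i])
        p *= p_one if bit == "1" else 1 - p_one
    return p

for s in ["0", "1", "00", "01", "000", "001", "010"]:
    print(s, ":", prob_of(s))
```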

Now, we can view this as a rule someone might espouse for predicting coinflips, but also view it as corresponding to a particular class of possible worlds containing randomness.

I mean, Laplace's Rule isn't the only rule you could use. Suppose I had a barrel containing ten white balls and ten green balls. If you already knew this about the barrel, then after seeing M white balls and N green balls, you'd predict the next ball being white with probability (10−M)/(20−M−N).

If you use Laplace's Rule, that's like believing the world was like a billiards table with an original ball rolling to a stop at a random point and new balls ending up on the left or right. If you use (10−M)/(20−M−N), that's like the hypothesis that there are ten green balls and ten white balls in a barrel. There isn't really a sharp border between rules we can use to predict the world, and rules for how the world behaves—

ASHLEY:  Well, that sounds just plain wrong. The map is not the territory, don'cha know? If Solomonoff induction can't tell the difference between maps and territories, maybe it doesn't contain all epistemological goodness after all.

BLAINE:  Maybe it'd be better to say that there's a dualism between good ways of computing predictions and being in actual worlds where that kind of predicting works well? Like, you could also see Laplace's Rule as implementing the rules for a world with randomness where the original billiard ball ends up in a random place, so that the first thing you see is equally likely to be 1 or 0. Then to ask what probably happens on round 2, we tell the world what happened on round 1 so that it can update what the background random events were.

ASHLEY:  Mmmaybe.

BLAINE:  If you go with the version where Solomonoff induction is over programs that just spit out a determined string of ones and zeroes, we could see those programs as corresponding to particular environments—ways the world could be that would produce our sensory input, the sequence.

We could jump ahead and consider the more sophisticated decision-problem that appears in AIXI: an environment is a program that takes your motor outputs as its input, and then returns your sensory inputs as its output. Then we can see a program that produces Bayesian-updated predictions as corresponding to a hypothetical probabilistic environment that implies those updates, although they'll be conjugate systems rather than mirror images.

ASHLEY:  Did you say something earlier about the deterministic and probabilistic versions of Solomonoff induction giving the same answers? Like, is it a distinction without a difference whether we ask about simple programs that reproduce the observed data versus simple programs that assign high probability to the data? I can't see why that should be true, especially since Turing machines don't include a randomness source.

BLAINE:  I'm told the answers are the same but I confess I can't quite see why, unless there's some added assumption I'm missing. So let's talk about programs that assign probabilities for now, because I think that case is clearer.

iv.  Simplicity

BLAINE:  The next key idea is to prefer simple programs that assign high probability to our observations so far.

ASHLEY:  It seems like an obvious step, especially considering that you were already talking about "simple programs" and Occam's Razor a while back. Solomonoff induction is part of the Bayesian program of inference, right?

BLAINE:  Indeed. Very much so.

ASHLEY:  Okay, so let's talk about the program, or hypothesis, for "This barrel has an unknown frequency of white and green balls", versus the hypothesis "This barrel has 10 white and 10 green balls", versus the hypothesis, "This barrel always puts out a green ball after a white ball and vice versa."

Let's say we see a green ball, then a white ball, the sequence GW. The first hypothesis assigns this probability 1/2∗1/3=1/6, the second hypothesis assigns this probability 10/20∗10/19 or roughly 1/4, and the third hypothesis assigns probability 1/2∗1.

Now it seems to me that there's some important sense in which, even though Laplace's Rule assigned a lower probability to the data, it's significantly simpler than the second and third hypotheses and is the wiser answer. Does Solomonoff induction agree?

BLAINE:  I think you might be taking into account some prior knowledge that isn't in the sequence itself, there. Like, things that alternate either 101010... or 010101... are objectively simple in the sense that a short computer program simulates them or assigns probabilities to them. It's just unlikely to be true about an actual barrel of white and green balls.

If 10 is literally the first sense data that you ever see, when you are a fresh new intelligence with only two bits to rub together, then "The universe consists of alternating bits" is no less reasonable than "The universe produces bits with an unknown random frequency anywhere between 0 and 1."

ASHLEY:  Conceded. But as I was going to say, we have three hypotheses that assigned 1/6, ∼1/4, and 1/2 to the observed data; but to know the posterior probabilities of these hypotheses we need to actually say how relatively likely they were a priori, so we can multiply by the odds ratio. Like, if the prior odds were 3:2:1, the posterior odds would be 3:2:1∗(2/12:3/12:6/12)=3:2:1∗2:3:6=6:6:6=1:1:1. Now, how would Solomonoff induction assign prior probabilities to those computer programs? Because I remember you saying, way back when, that you thought Solomonoff was the answer to "How should Bayesians assign priors?"
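The odds arithmetic can be reproduced mechanically. The Python sketch below uses the rounded likelihoods 1/6, 1/4, and 1/2 and the illustrative prior odds 3:2:1 from the dialogue. The helper name `posterior_odds` is hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def posterior_odds(prior_odds, likelihoods):
    """Multiply prior odds by likelihoods, then reduce to smallest whole-number odds."""
    raw = [Fraction(p) * Fraction(l) for p, l in zip(prior_odds, likelihoods)]
    # Least common multiple of the denominators, to clear the fractions
    common = reduce(lambda a, b: a * b // gcd(a, b), (r.denominator for r in raw), 1)
    ints = [int(r * common) for r in raw]
    g = reduce(gcd, ints)
    return [i // g for i in ints]

# Rounded likelihoods of the data GW under the three hypotheses
likelihoods = [Fraction(1, 6), Fraction(1, 4), Fraction(1, 2)]

print(posterior_odds([3, 2, 1], likelihoods))  # [1, 1, 1]
print(posterior_odds([1, 1, 1], likelihoods))  # [2, 3, 6]
```

With equal priors the likelihoods alone would give 2:3:6, so the 3:2:1 prior is exactly what cancels them out.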

BLAINE:  Well, how would you do it?

ASHLEY:  I mean... yes, the simpler rules should be favored, but it seems to me that there's some deep questions as to the exact relative 'simplicity' of the rules (M+1)/(M+N+2), or the rule (10−M)/(20−M−N), or the rule "alternate the bits"...

BLAINE:  Suppose I ask you to just make up some simple rule.

ASHLEY:  Okay, if I just say the rule I think you're looking for, the rule would be, "The complexity of a computer program is the number of bits needed to specify it to some arbitrary but reasonable choice of compiler or Universal Turing Machine, and the prior probability is 1/2 to the power of the number of bits." E.g., there are 32 possible 5-bit programs, so each such program has probability 1/32. So if it takes 16 bits to specify Laplace's Rule of Succession, which seems a tad optimistic, then the prior probability would be 1/65536, which seems a tad pessimistic.

BLAINE:  Now just apply that rule to the infinity of possible computer programs that assign probabilities to the observed data, update their posterior probabilities based on the probability they've assigned to the evidence so far, sum over all of them to get your next prediction, and we're done. And yes, that requires a hypercomputer that can solve the halting problem, but we're talking ideals here. Let $P$ be the set of all programs and $s_1 s_2 \ldots s_n$, also written $s_{\preceq n}$, be the sense data so far, then

$$\text{Sol}(s_{\preceq n}) := \sum_{prog \in P} 2^{-\text{length}(prog)} \cdot \prod_{j=1}^{n} \text{InterpretProb}(prog(s_{\preceq j-1}),\, s_j)$$

$$P(s_{n+1} = 1 \mid s_{\preceq n}) = \frac{\text{Sol}(s_1 s_2 \ldots s_n 1)}{\text{Sol}(s_1 s_2 \ldots s_n 1) + \text{Sol}(s_1 s_2 \ldots s_n 0)}$$
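The real sum ranges over all programs and is uncomputable, but its shape can be seen in a toy version. The Python sketch below replaces the set of all programs with three hand-picked predictors. The code lengths attached to them are made up for illustration rather than measured on any actual Universal Turing Machine.

```python
from fractions import Fraction

def laplace(history):
    """Laplace's Rule of Succession: P(next bit = 1)."""
    return Fraction(sum(history) + 1, len(history) + 2)

def alternate(history):
    """Predicts the opposite of the last bit; 50/50 on the first bit."""
    if not history:
        return Fraction(1, 2)
    return Fraction(0) if history[-1] == 1 else Fraction(1)

def fair(history):
    """Always 50/50."""
    return Fraction(1, 2)

# (predictor, assumed code length in bits): the lengths are illustrative only.
HYPOTHESES = [(fair, 2), (alternate, 3), (laplace, 3)]

def sol(history):
    """Toy analogue of Sol: prior-weighted probability of the whole history."""
    total = Fraction(0)
    for prog, length in HYPOTHESES:
        p = Fraction(1, 2 ** length)  # prior weight 2^-length
        for j, bit in enumerate(history):
            p_one = prog(history[:j])
            p *= p_one if bit == 1 else 1 - p_one
        total += p
    return total

def predict_one(history):
    """P(next = 1 | history) = Sol(history, 1) / (Sol(history, 1) + Sol(history, 0))."""
    one, zero = sol(history + [1]), sol(history + [0])
    return one / (one + zero)

print(predict_one([]))                   # 1/2 before any data
print(float(predict_one([1, 0, 1, 0])))  # leans toward 1 after an alternating start
```

Normalizing the prior weights is unnecessary here because the constant cancels in the ratio.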

ASHLEY:  Uh.

BLAINE:  Yes?

ASHLEY:  Um...

BLAINE:  What is it?

ASHLEY:  You invoked a countably infinite set, so I'm trying to figure out if my predicted probability for the next bit must necessarily converge to a limit as I consider increasingly large finite subsets in any order.

BLAINE:  (sighs)  Of course you are.

ASHLEY:  I think you might have left out some important caveats. Like, if I take the rule literally, then the program "0" has probability 1/2, the program "1" has probability 1/2, the program "01" has probability 1/4 and now the total probability is 1.25 which is too much. So I can't actually normalize it because the series sums to infinity. Now, this just means we need to, say, decide that the probability of a program having length 1 is 1/2, the probability of it having length 2 is 1/4, and so on out to infinity, but it's an added postulate.

BLAINE:  The conventional method is to require a prefix-free code. If "0111" is a valid program then "01110" cannot be a valid program. With that constraint, assigning "1/2 to the power of the length of the code", to all valid codes, will sum to less than 1; and we can normalize their relative probabilities to get the actual prior.
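The prefix-free requirement and its effect on the total weight can be checked with a short Python sketch. The helper names `is_prefix_free` and `kraft_sum` are hypothetical. The bound being illustrated is Kraft's inequality, which says the weights of a prefix-free set of binary codes sum to at most 1.

```python
from fractions import Fraction

def is_prefix_free(codes):
    """True if no code in the collection is a prefix of a different entry."""
    codes = list(codes)
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(codes)
        for j, b in enumerate(codes)
    )

def kraft_sum(codes):
    """Total of (1/2)**length over all codes."""
    return sum(Fraction(1, 2 ** len(c)) for c in codes)

naive = ["0", "1", "01"]             # the literal reading: total weight 5/4
valid = ["0", "10", "110", "111"]    # a prefix-free set: total weight 1

print(is_prefix_free(naive), kraft_sum(naive))
print(is_prefix_free(valid), kraft_sum(valid))
print(is_prefix_free(["0111", "01110"]))  # False, as in the example above
```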

ASHLEY:  Okay. And you're sure that it doesn't matter in what order we consider more and more programs as we approach the limit, because... no, I see it. Every program has positive probability mass, with the total set summing to 1, and Bayesian updating doesn't change that. So as I consider more and more programs, in any order, there are only so many large contributions that can be made from the mix—there's only so often that the final probability can change.

Like, let's say there are at most 99 programs with probability 1% that assign probability 0 to the next bit being a 1; that's only 99 times the final answer can go down by as much as 0.01, as the limit is approached.

BLAINE:  This idea generalizes, and is important. List all possible computer programs, in any order you like. Use any definition of simplicity that you like, so long as for any given amount of simplicity, there are only a finite number of computer programs that simple. As you go on carving off chunks of prior probability mass and assigning them to programs, it must be the case that as programs get more and more complicated, their prior probability approaches zero! It's still positive for every finite program, though, because of Cromwell's Rule.

You can't have more than 99 programs assigned 1% prior probability and still obey Cromwell's Rule, which means there must be some most complex program that is assigned 1% probability, which means every more complicated program must have less than 1% probability out to the end of the infinite list.

ASHLEY:  Huh. I don't think I've ever heard that justification for Occam's Razor before. I think I like it. I mean, I've heard a lot of appeals to the empirical simplicity of the world, and so on, but this is the first time I've seen a logical proof that, in the limit, more complicated hypotheses must be less likely than simple ones.

BLAINE:  Behold the awesomeness that is Solomonoff Induction!

ASHLEY:  Uh, but you didn't actually use the notion of computational simplicity to get that conclusion; you just required that the supply of probability mass is finite and the supply of potential complications is infinite. Any way of counting discrete complications would imply that conclusion, even if it went by surface wheels and gears.

BLAINE:  Well, maybe. But it so happens that Yudkowsky did invent or reinvent that argument after pondering Solomonoff induction, and if it predates him (or Solomonoff) then Yudkowsky doesn't know the source. Concrete inspiration for simplified arguments is also a credit to a theory, especially if the simplified argument didn't exist before that.

ASHLEY:  Fair enough.

v.  Choice of Universal Turing Machine

ASHLEY:  My next question is about the choice of Universal Turing Machine—the choice of compiler for our program codes. There's an infinite number of possibilities there, and in principle, the right choice of compiler can make our probability for the next thing we'll see be anything we like. At least I'd expect this to be the case, based on how the "problem of induction" usually goes. So with the right choice of Universal Turing Machine, our online crackpot can still make it be the case that Solomonoff induction predicts Canada invading the USA.

BLAINE:  One way of looking at the problem of good epistemology, I'd say, is that the job of a good epistemology is not to make it impossible to err. You can still blow off your foot if you really insist on pointing the shotgun at your foot and pulling the trigger.

The job of good epistemology is to make it more obvious when you're about to blow your own foot off with a shotgun. On this dimension, Solomonoff induction excels. If you claim that we ought to pick an enormously complicated compiler to encode our hypotheses, in order to make the 'simplest hypothesis that fits the evidence' be one that predicts Canada invading the USA, then it should be obvious to everyone except you that you are in the process of screwing up.

ASHLEY:  Ah, but of course they'll say that their code is just the simple and natural choice of Universal Turing Machine, because they'll exhibit a meta-UTM which outputs that UTM given only a short code. And if you say the meta-UTM is complicated—

BLAINE:  Flon's Law says, "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code." You can't make it impossible for people to screw up, but you can make it more obvious. And Solomonoff induction would make it even more obvious than might at first be obvious, because—

ASHLEY:  Your Honor, I move to have the previous sentence taken out and shot.

BLAINE:  Let's say that the whole of your sensory information is the string 10101010... Consider the stupid hypothesis, "This program has a 99% probability of producing a 1 on every turn", which you jumped to after seeing the first bit. What would you need to claim your priors were like—what Universal Turing Machine would you need to endorse—in order to maintain blind faith in that hypothesis in the face of ever-mounting evidence?

ASHLEY:  You'd need a Universal Turing Machine blind-utm that assigned a very high probability to the blind program "def ProbNextElementIsOne(previous_sequence): return 0.99". Like, if blind-utm sees the code 0, it executes the blind program "return 0.99".

And to defend yourself against charges that your UTM blind-utm was not itself simple, you'd need a meta-UTM, blind-meta, which, when it sees the code 10, executes blind-utm.

And to really wrap it up, you'd need to take a fixed point through all towers of meta and use diagonalization to create the UTM blind-diag that, when it sees the program code 0, executes "return 0.99", and when it sees the program code 10, executes blind-diag.

I guess I can see some sense in which, even if that doesn't resolve Hume's problem of induction, anyone actually advocating that would be committing blatant shenanigans on a commonsense level, arguably more blatant than it would have been if we hadn't made them present the UTM.

BLAINE:  Actually, the shenanigans have to be much worse than that in order to fool Solomonoff induction. Like, Solomonoff induction using your blind-diag isn't fooled for a minute, even taking blind-diag entirely on its own terms.

ASHLEY:  Really?

BLAINE:  Assuming 60 sequence items per second? Yes, absolutely, Solomonoff induction shrugs off the delusion in the first minute, unless there are further and even more blatant shenanigans.

We did require that your blind-diag be a Universal Turing Machine, meaning that it can reproduce every computable probability distribution over sequences, given some particular code to compile. Let's say there's a 200-bit code laplace for Laplace's Rule of Succession, "lambda sequence: (sequence.count('1') + 1) / (len(sequence) + 2)", so that its prior probability relative to the 1-bit code for blind is 2^(-200). Let's say that the sense data is around 50/50 1s and 0s. Every time we see a 1, blind gains a factor of roughly 2 over laplace (99% vs. 50% probability), and every time we see a 0, blind loses a factor of 50 over laplace (1% vs. 50% probability).

On average, every 2 bits of the sequence, blind is losing a factor of 25 or, say, a bit more than 4 bits, i.e., on average blind is losing two bits of probability per element of the sequence observed.

So it's only going to take 100 bits, or a little less than two seconds, for laplace to win out over blind.
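The arithmetic in this exchange can be checked with a short sketch. It assumes, as Blaine does, that laplace starts 200 bits behind blind, and it assumes sense data that strictly alternates 1 and 0. The function names and the log-odds bookkeeping are illustrative choices, not part of the original argument. Laplace's rule does not predict exactly 50% on every turn, and the exact loss per element is a little more than the rounded two bits, so the crossover lands somewhat before the 100-element estimate.

```python
import math
from typing import Optional


def laplace_prob_one(sequence: str) -> float:
    """Laplace's Rule of Succession: P(next bit is 1) given the bits so far."""
    return (sequence.count("1") + 1) / (len(sequence) + 2)


def blind_prob_one(sequence: str) -> float:
    """The 'blind' hypothesis: always 99% that the next bit is 1."""
    return 0.99


def elements_until_overtaken(prior_gap_bits: float, data: str) -> Optional[int]:
    """Count observed elements until laplace's posterior exceeds blind's.

    prior_gap_bits is log2(P(blind) / P(laplace)) before any data is seen.
    Returns None if blind is still ahead when the data runs out.
    """
    log_odds = prior_gap_bits
    for i, bit in enumerate(data):
        seen = data[:i]
        p_blind = blind_prob_one(seen) if bit == "1" else 1 - blind_prob_one(seen)
        p_laplace = laplace_prob_one(seen) if bit == "1" else 1 - laplace_prob_one(seen)
        # Each observation shifts the log-odds by the log likelihood ratio.
        log_odds += math.log2(p_blind / p_laplace)
        if log_odds < 0:
            return i + 1
    return None


data = "10" * 100  # assumed sense data: strictly alternating bits
n = elements_until_overtaken(200, data)
print(n, "elements, i.e. about", round(n / 60, 2), "seconds at 60 elements per second")
```

Working in log-odds keeps the numbers in a comfortable range; multiplying raw probabilities for the same race would quickly approach floating-point underflow.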

ASHLEY:  I see. I was focusing on a UTM that assigned lots of prior probability to blind, but what I really needed was a compiler that, while still being universal and encoding every possibility somewhere, still assigned a really tiny probability to laplace, faircoin that encodes "return 0.5", and every other hypothesis that does better, round by round, than blind. So what I really need to carry off the delusion is obstinate-diag that is universal, assigns high probability to blind, requires billions of bits to specify laplace, and also requires billions of bits to specify any UTM that can execute laplace as a shorter code than billions of bits. Because otherwise we will say, "Ah, but given the evidence, this other UTM would have done better." I agree that those are even more blatant shenanigans than I thought.

BLAINE:  Yes. And even then, even if your UTM takes two billion bits to specify faircoin, Solomonoff induction will lose its faith in blind after seeing a billion bits.

Which will happen before the first year is out, if we're getting 60 bits per second.

And if you turn around and say, "Oh, well, I didn't mean that was my UTM, I really meant this was my UTM, this thing over here where it takes a trillion bits to encode faircoin", then that's probability-theory-violating shenanigans where you're changing your priors as you go.
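The "billion bits" and "first year" figures can be reproduced with a rough expected-value estimate. The sketch below assumes data that is 50% ones, a rival hypothesis (faircoin) that always predicts 50%, and a prior handicap of two billion bits; the function name is hypothetical and the calculation uses average rates rather than simulating a sequence.

```python
import math


def expected_bits_lost_per_observation(
    p_one_claimed: float, p_one_rival: float, freq_ones: float
) -> float:
    """Average log2 likelihood ratio (rival over claimed) per observed bit."""
    gain_on_one = math.log2(p_one_rival / p_one_claimed)
    gain_on_zero = math.log2((1 - p_one_rival) / (1 - p_one_claimed))
    return freq_ones * gain_on_one + (1 - freq_ones) * gain_on_zero


# blind claims 99% ones; faircoin claims 50%; the data is assumed 50% ones.
rate = expected_bits_lost_per_observation(0.99, 0.5, 0.5)

prior_gap_bits = 2_000_000_000           # handicap imposed on faircoin by the UTM
observations = prior_gap_bits / rate     # elements needed to erase that handicap
days = observations / 60 / 86_400        # at 60 elements per second

print(f"{rate:.2f} bits per element, {observations:.3g} elements, {days:.0f} days")
```

On these assumptions blind loses a little over two bits per element, so a two-billion-bit handicap is gone in under a billion elements and well inside a year.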

ASHLEY:  That's actually a very interesting point—that what's needed for a Bayesian to maintain a delusion in the face of mounting evidence is not so much a blindly high prior for the delusory hypothesis, as a blind skepticism of all its alternatives.

But what if their UTM requires a googol bits to specify faircoin? What if blind and blind-diag, or programs pretty much isomorphic to them, are the only programs that can be specified in less than a googol bits?

BLAINE:  Then your desire to shoot your own foot off has been made very, very visible to anyone who understands Solomonoff induction. We're not going to get absolutely objective prior probabilities as a matter of logical deduction, not without principles that are unknown to me and beyond the scope of Solomonoff induction. But we can make the stupidity really blatant and force you to construct a downright embarrassing Universal Turing Machine.

ASHLEY:  I guess I can see that. I mean, I guess that if you're presenting a ludicrously complicated Universal Turing Machine that just refuses to encode the program that would predict Canada not invading, that's more visibly silly than a verbal appeal that says, "But you must just have faith that Canada will invade." I guess part of me is still hoping for a more objective sense of "complicated".

BLAINE:  We could say that reasonable UTMs should contain a small number of wheels and gears in a material instantiation under our universe's laws of physics, which might in some ultimate sense provide a prior over priors. Like, the human brain evolved from DNA-based specifications, and the things you can construct out of relatively small numbers of physical objects are 'simple' under the 'prior' implicitly searched by natural selection.

ASHLEY:  Ah, but what if I think it's likely that our physical universe or the search space of DNA won't give us a good idea of what's complicated?

BLAINE:  For your alternative notion of what's complicated to go on being believed even as other hypotheses are racking up better experimental predictions, you need to assign a ludicrously low probability to the idea that our universe's space of physical systems buildable using a small number of objects could possibly provide better predictions of that universe than your complicated alternative notion of prior probability.

We don't need to appeal that it's a priori more likely than not that "a universe can be predicted well by low-object-number machines built using that universe's physics." Instead, we appeal that it would violate Cromwell's Rule, and would constitute exceedingly special pleading, to assign the possibility of a physically learnable universe a probability of less than 2^(-1,000,000). It then takes only a megabit of exposure to notice that the universe seems to be regular.

ASHLEY:  In other words, so long as you don't start with an absolute and blind prejudice against the universe being predictable by simple machines encoded in our universe's physics—so long as, on this planet of seven billion people, you don't assign probabilities less than 2^(-1,000,000) to the other person being right about what is a good Universal Turing Machine—then the pure logic of Bayesian updating will rapidly force you to the conclusion that induction works.

vi.  Why algorithmic complexity?

ASHLEY:  Hm. I don't know that good pragmatic answers to the problem of induction were ever in short supply. Still, on the margins, it's a more forceful pragmatic answer than the last one I remember hearing.

BLAINE:  Yay! Now isn't Solomonoff induction wonderful?

ASHLEY:  Maybe?

You didn't really use the principle of computational simplicity to derive that lesson. You just used that some inductive principle ought to have a prior probability of more than 2^(-1,000,000).

BLAINE:  ...

ASHLEY:  Can you give me an example of a problem where the computational definition of simplicity matters and can't be factored back out of an argument?

BLAINE:  As it happens, yes I can. I can give you three examples of how it matters.

ASHLEY:  Vun... two... three! Three examples! Ah-ah-ah!

BLAINE:  Must you do that every—oh, never mind. Example one is that galaxies are not so improbable that no one could ever believe in them, example two is that the limits of possibility include Terence Tao, and example three is that diffraction is a simpler explanation of rainbows than divine intervention.

ASHLEY:  These statements are all so obvious that no further explanation of any of them is required.

BLAINE:  On the contrary! And I'll start with example one. Back when the Andromeda Galaxy was a hazy mist seen through a telescope, and someone first suggested that maybe that hazy mist was an incredibly large number of distant stars—that many "nebulae" were actually distant galaxies, and our own Milky Way was only one of them—there was a time when Occam's Razor was invoked against that hypothesis.

ASHLEY:  What? Why?

BLAINE:  They invoked Occam's Razor against the galactic hypothesis, because if that were the case, then there would be a much huger number of stars in the universe, and the stars would be entities, and Occam's Razor said "Entities are not to be multiplied beyond necessity."

ASHLEY:  That's not how Occam's Razor works. The "entities" of a theory are its types, not its objects. If you say that the hazy mists are distant galaxies of stars, then you've reduced the number of laws because you're just postulating a previously seen type, namely stars organized into galaxies, instead of a new type of hazy astronomical mist.

BLAINE:  Okay, but imagine that it's the nineteenth century and somebody replies to you, "Well, I disagree! William of Ockham said not to multiply entities, this galactic hypothesis obviously creates a huge number of entities, and that's the way I see it!"

ASHLEY:  I think I'd give them your spiel about there being no human epistemology that can stop you from shooting off your own foot.

BLAINE:  I don't think you'd be justified in giving them that lecture.

I'll parenthesize at this point that you ought to be very careful when you say "I can't stop you from shooting off your own foot", lest it become a Fully General Scornful Rejoinder. Like, if you say that to someone, you'd better be able to explain exactly why Occam's Razor counts types as entities but not objects. In fact, you'd better explain that to someone before you go advising them not to shoot off their own foot. And once you've told them what you think is foolish and why, you might as well stop there. Except in really weird cases of people presenting us with enormously complicated and jury-rigged Universal Turing Machines, and then we say the shotgun thing.

ASHLEY:  That's fair. So, I'm not sure what I'd have answered before starting this conversation, which is much to your credit, friend Blaine. But now that I've had this conversation, it's obvious that it's new types and not new objects that use up the probability mass we need to distribute over all hypotheses. Like, I need to distribute my probability mass over "Hypothesis 1: there are stars" and "Hypothesis 2: there are stars plus huge distant hazy mists". I don't need to distribute my probability mass over all the actual stars in the galaxy!

BLAINE:  In terms of Solomonoff induction, we penalize a program's lines of code rather than its runtime or RAM used, because we need to distribute our probability mass over possible alternatives each time we add a line of code. There's no corresponding choice between mutually exclusive alternatives when a program uses more runtime or RAM.

(ELIEZER:  (whispering)  Unless we need a leverage prior to consider the hypothesis of being a particular agent inside all that RAM or runtime.)

ASHLEY:  Or to put it another way: any fully detailed model of the universe would require some particular arrangement of stars, and the more stars there are, the more possible arrangements there are. But when we look through the telescope and see a hazy mist, we get to sum over all arrangements of stars that would produce that hazy mist. If some galactic hypothesis required a hundred billion stars to all be in particular exact places without further explanation or cause, then that would indeed be a grave improbability.

BLAINE:  Precisely. And if you needed all the hundred billion stars to be in particular exact places, that's just the kind of hypothesis that would take a huge computer program to specify.

ASHLEY:  But does it really require learning Solomonoff induction to understand that point? Maybe the bad argument against galaxies was just a motivated error somebody made in the nineteenth century, because they didn't want to live in a big universe for emotional reasons.

BLAINE:  The same debate is playing out today over no-collapse versions of quantum mechanics, also somewhat unfortunately known as "many-worlds interpretations". Now, regardless of what anyone thinks of all the other parts of that debate, there's a particular sub-argument where somebody says, "It's simpler to have a collapse interpretation because all those extra quantum 'worlds' are extra entities that are unnecessary under Occam's Razor since we can't see them." And Solomonoff induction tells us that this invocation of Occam's Razor is flatly misguided because Occam's Razor does not work like that.

Basically, they're trying to cut down the RAM and runtime of the universe, at the expense of adding an extra line of code, namely the code for the collapse postulate that prunes off parts of the wavefunction that are in undetectably weak causal contact with us.

ASHLEY:  Hmm. Now that you put it that way, it's not so obvious to me that it makes sense to have no prejudice against sufficiently enormous universes. I mean, the universe we see around us is exponentially vast but not superexponentially vast—the visible atoms are 10^80 in number or so, not 10^(10^80) or "bigger than Graham's Number". Maybe there's some fundamental limit on how much gets computed.

BLAINE:  You, um, know that on the Standard Model, the universe doesn't just cut out and stop existing at the point where our telescopes stop seeing it? There isn't a giant void surrounding a little bubble of matter centered perfectly on Earth? It calls for a literally infinite amount of matter? I mean, I guess if you don't like living in a universe with more than 10^80 entities, a universe where too much gets computed, you could try to specify extra laws of physics that create an abrupt spatial boundary with no further matter beyond them, somewhere out past where our telescopes can see—

ASHLEY:  All right, point taken.

(ELIEZER:  (whispering)  Though I personally suspect that the spatial multiverse and the quantum multiverse are the same multiverse, and that what lies beyond the reach of our telescopes is not entangled with us—meaning that the universe is as finitely large as the superposition of all possible quantum branches, rather than being literally infinite in space.)

BLAINE:  I mean, there is in fact an alternative formalism to Solomonoff induction, namely Levin search, which says that program complexities are further penalized by the logarithm of their runtime. In other words, it would say that 'explanations' or 'universes' that require a long time to run are inherently less probable.

Some people like Levin search more than Solomonoff induction because it's more computable. I dislike Levin search because (a) it has no fundamental epistemic justification and (b) it assigns probability zero to quantum mechanics.

ASHLEY:  Can you unpack that last part?

BLAINE:  If, as is currently suspected, there's no way to simulate quantum computers using classical computers without an exponential slowdown, then even in principle, this universe requires exponentially vast amounts of classical computing power to simulate.

Let's say that with sufficiently advanced technology, you can build a quantum computer with a million qubits. On Levin's definition of complexity, for the universe to be like that is as improbable a priori as any particular set of laws of physics that must specify on the order of one million equations.

Can you imagine how improbable it would be to see a list of one hundred thousand differential equations, without any justification or evidence attached, and be told that they were the laws of physics? That's the kind of penalty that Levin search or Schmidhuber's Speed Prior would attach to any laws of physics that could run a quantum computation of a million qubits, or, heck, any physics that claimed that a protein was being folded in a way that ultimately went through considering millions of quarks interacting.

If you're not absolutely certain a priori that the universe isn't like that, you don't believe in Schmidhuber's Speed Prior. Even with a collapse postulate, the amount of computation that goes on before a collapse would be prohibited by the Speed Prior.
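The size of the runtime penalty Blaine describes can be made concrete with a small sketch. It assumes a Levin-style cost of program length plus log2 of runtime, and it assumes that a brute-force classical simulation of n qubits takes on the order of 2^n steps. The 1,000-bit figure for a compact "laws of physics" program is purely hypothetical, as are the function names.

```python
def solomonoff_penalty_bits(code_length_bits: int) -> int:
    """Complexity penalty when only program length counts."""
    return code_length_bits


def levin_penalty_bits(code_length_bits: int, log2_runtime_steps: float) -> float:
    """Levin-style penalty: program length plus log2 of the runtime."""
    return code_length_bits + log2_runtime_steps


def log2_brute_force_steps(n_qubits: int) -> float:
    # Assumption: a brute-force classical simulation touches all 2**n_qubits
    # amplitudes at least once, so log2(steps) is at least n_qubits.
    return float(n_qubits)


PHYSICS_CODE_BITS = 1_000  # hypothetical length of a compact physics program
QUBITS = 1_000_000

plain = solomonoff_penalty_bits(PHYSICS_CODE_BITS)
with_runtime = levin_penalty_bits(PHYSICS_CODE_BITS, log2_brute_force_steps(QUBITS))
extra_bits = with_runtime - plain

print("length-only penalty:", plain, "bits")
print("length-plus-runtime penalty:", with_runtime, "bits")
print("extra improbability factor: 2^-%d" % extra_bits)
```

Under these assumptions the runtime term adds about a million bits, the same handicap as roughly a million extra bits of unexplained code, while the length-only penalty is unchanged by how long the program runs.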

ASHLEY:  Okay, yeah. If you're phrasing it that way—that the Speed Prior assigns probability nearly zero to quantum mechanics, so we shouldn't believe in the Speed Prior—then I can't easily see a way to extract out the same point without making reference to ideas like penalizing algorithmic complexity but not penalizing runtime. I mean, maybe I could extract the lesson back out but it's easier to say, or more obvious, by pointing to the idea that Occam's Razor should penalize algorithmic complexity but not runtime.

BLAINE:  And that isn't just implied by Solomonoff induction, it's pretty much the whole idea of Solomonoff induction, right?

ASHLEY:  Maaaybe.

BLAINE:  For example two, that Solomonoff induction outperforms even Terence Tao, we want to have a theorem that says Solomonoff induction catches up to every computable way of reasoning in the limit. Since we iterated through all possible computer programs, we know that somewhere in there is a simulated copy of Terence Tao in a simulated room, and if this requires a petabyte to specify, then we shouldn't have to make more than a quadrillion bits of error relative to Terence Tao before zeroing in on the Terence Tao hypothesis.

I mean, in practice, I'd expect far less than a quadrillion bits of error before the system was behaving like it was vastly smarter than Terence Tao. It'd take a lot less than a quadrillion bits to give you some specification of a universe with simple physics that gave rise to a civilization of vastly greater than intergalactic extent. Like, Graham's Number is a very simple number, so it's easy to specify a universe that runs for that long before it returns an answer. It's not obvious how you'd extract Solomonoff predictions from that civilization and incentivize them to make good ones, but I'd be surprised if there were no Turing machine of fewer than one thousand states which did that somehow.

ASHLEY:  ...

BLAINE:  And for all I know there might be even better ways than that of getting exceptionally good predictions, somewhere in the list of the first decillion computer programs. That is, somewhere in the first 100 bits.

ASHLEY:  So your basic argument is, "Never mind Terence Tao, Solomonoff induction dominates God."

BLAINE:  Solomonoff induction isn't the epistemic prediction capability of a superintelligence. It's the epistemic prediction capability of something that eats superintelligences like potato chips.

ASHLEY:  Is there any point to contemplating an epistemology so powerful that it will never begin to fit inside the universe?

BLAINE:  Maybe? I mean, a lot of times, you just find people failing to respect the notion of ordinary superintelligence, doing the equivalent of supposing that a superintelligence behaves like a bad Hollywood genius and misses obvious-seeming moves. And a lot of times you find them insisting that "there's a limit to how much information you can get from the data" or something along those lines. "That Alien Message" is intended to convey the counterpoint, that smarter entities can extract more info than is immediately apparent on the surface of things.

Similarly, thinking about Solomonoff induction might also cause someone to realize that if, say, you simulated zillions of possible simple universes, you could look at which agents were seeing exact data like the data you got, and figure out where you were inside that range of possibilities, so long as there was literally any correlation to use.

And if you say that an agent can't extract that data, you're making a claim about which shortcuts to Solomonoff induction are and aren't computable. In fact, you're probably pointing at some particular shortcut and claiming nobody can ever figure that out using a reasonable amount of computing power even though the info is there in principle. Contemplating Solomonoff induction might help people realize that, yes, the data is there in principle. Like, until I ask you to imagine a civilization running for Graham's Number of years inside a Graham-sized memory space, you might not imagine them trying all the methods of analysis that you personally can imagine being possible.

ASHLEY:  If somebody is making that mistake in the first place, I'm not sure you can beat it out of them by telling them the definition of Solomonoff induction.

BLAINE:  Maybe not. But to brute-force somebody into imagining that sufficiently advanced agents have Level 1 protagonist intelligence, that they are epistemically efficient rather than missing factual questions that are visible even to us, you might need to ask them to imagine an agent that can see literally anything seeable in the computational limit just so that their mental simulation of the ideal answer isn't running up against stupidity assertions.

Like, I think there are a lot of people who could benefit from looking over the evidence they already personally have, and asking what a Solomonoff inductor could deduce from it, so that they wouldn't be running up against stupidity assertions about themselves. It's the same trick as asking yourself what God, Richard Feynman, or a "perfect rationalist" would believe in your shoes. You just have to pick a real or imaginary person that you respect enough for your model of that person to lack the same stupidity assertions that you believe about yourself.

ASHLEY:  Well, let's once again try to factor out the part about Solomonoff induction in particular. If we're trying to imagine something epistemically smarter than ourselves, is there anything we get from imagining a complexity-weighted prior over programs in particular? That we don't get from, say, trying to imagine the reasoning of one particular Graham-Number-sized civilization?

BLAINE:  We get the assurance that anything we imagine Terence Tao himself being able to figure out is something that is allowed to be known after some bounded number of errors versus Terence Tao, because Terence Tao is inside the list of all computer programs and gets promoted further each time the dominant paradigm makes a prediction error relative to him.

We can't get that dominance property without invoking "all possible ways of computing" or something like it—we can't incorporate the power of all reasonable processes, unless we have a set such that all the reasonable processes are in it. The enumeration of all possible computer programs is one such set.
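The dominance property has a simple finite analogue that can be checked numerically. For a Bayesian mixture over any set of predictors, the mixture's total log-loss exceeds that of any member by at most the member's prior cost in bits. The sketch below uses three toy predictors in place of "all computer programs"; the weights, the 70%-ones data source, and all names are assumptions made for illustration.

```python
import math
import random


def always(p):
    """Predictor that assigns probability p to the next bit being 1."""
    return lambda seq: p


def laplace(seq):
    """Laplace's Rule of Succession."""
    return (seq.count("1") + 1) / (len(seq) + 2)


def mixture_regrets(hypotheses, data):
    """Return {name: (regret_bits, bound_bits)} for a Bayesian mixture.

    hypotheses maps name -> (prior_weight, predictor). regret_bits is the
    mixture's cumulative log-loss minus that hypothesis's cumulative log-loss;
    bound_bits is -log2 of the hypothesis's normalized prior weight.
    """
    total = sum(w for w, _ in hypotheses.values())
    posterior = {name: w / total for name, (w, _) in hypotheses.items()}
    bound = {name: -math.log2(p) for name, p in posterior.items()}
    loss = {name: 0.0 for name in hypotheses}
    mixture_loss = 0.0
    for i, bit in enumerate(data):
        seen = data[:i]
        # Probability each hypothesis gave to the bit that actually arrived.
        likelihood = {}
        for name, (_, predict) in hypotheses.items():
            p_one = predict(seen)
            likelihood[name] = p_one if bit == "1" else 1 - p_one
        p_mix = sum(posterior[name] * likelihood[name] for name in hypotheses)
        mixture_loss += -math.log2(p_mix)
        for name in hypotheses:
            loss[name] += -math.log2(likelihood[name])
            posterior[name] = posterior[name] * likelihood[name] / p_mix  # Bayes update
    return {name: (mixture_loss - loss[name], bound[name]) for name in hypotheses}


rng = random.Random(0)
data = "".join("1" if rng.random() < 0.7 else "0" for _ in range(2000))

hypotheses = {
    "blind": (2.0 ** -1, always(0.99)),
    "faircoin": (2.0 ** -20, always(0.5)),
    "laplace": (2.0 ** -30, laplace),
}
results = mixture_regrets(hypotheses, data)
for name, (regret, bound_bits) in results.items():
    print(f"{name}: regret {regret:.1f} bits, bound {bound_bits:.1f} bits")
```

The bound holds for every member at once, which is the sense in which a sufficiently inclusive set of hypotheses lets the mixture catch up to the best reasoner in it after a bounded number of bits of error.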

ASHLEY:  Hm.

BLAINE:  Example three, diffraction is a simpler explanation of rainbows than divine intervention.

I don't think I need to belabor this point very much, even though in one way it might be the most central one. It sounds like "Jehovah placed rainbows in the sky as a sign that the Great Flood would never come again" is a 'simple' explanation; you can explain it to a child in nothing flat. Just the diagram of diffraction through a raindrop, to say nothing of the Principle of Least Action underlying diffraction, is something that humans don't usually learn until undergraduate physics, and it sounds more alien and less intuitive than Jehovah. In what sense is this intuitive sense of simplicity wrong? What gold standard are we comparing it to, that could be a better sense of simplicity than just 'how hard is it for me to understand'?

The answer is Solomonoff induction and the rule which says that simplicity is measured by the size of the computer program, not by how hard things are for human beings to understand. Diffraction is a small computer program; any programmer who understands diffraction can simulate it without too much trouble. Jehovah would be a much huger program—a complete mind that implements anger, vengeance, belief, memory, consequentialism, etcetera. Solomonoff induction is what tells us to retrain our intuitions so that differential equations feel like less burdensome explanations than heroic mythology.

ASHLEY:  Now hold on just a second, if that's actually how Solomonoff induction works then it's not working very well. I mean, Abraham Lincoln was a great big complicated mechanism from an algorithmic standpoint—he had a hundred trillion synapses in his brain—but that doesn't mean I should look at the historical role supposedly filled by Abraham Lincoln, and look for simple mechanical rules that would account for the things Lincoln is said to have done. If you've already seen humans and you've already learned to model human minds, it shouldn't cost a vast amount to say there's one more human, like Lincoln, or one more entity that is cognitively humanoid, like the Old Testament jealous-god version of Jehovah. It may be wrong but it shouldn't be vastly improbable a priori.

If you've already been forced to acknowledge the existence of some humanlike minds, why not others? Shouldn't you get to reuse the complexity that you postulated to explain humans, in postulating Jehovah?

In fact, shouldn't that be what Solomonoff induction does? If you have a computer program that can model and predict humans, it should only be a slight modification of that program—only slightly longer in length and added code—to predict the modified-human entity that is Jehovah.

BLAINE:  Hm. That's fair. I may have to retreat from that example somewhat.

In fact, that's yet another point to the credit of Solomonoff induction! The ability of programs to reuse code incorporates our intuitive sense that if you've already postulated one kind of thing, it shouldn't cost as much to postulate a similar kind of thing elsewhere!

ASHLEY:  Uh huh.

BLAINE:  Well, but even if I was wrong that Solomonoff induction should make Jehovah seem very improbable, it's still Solomonoff induction that says that the alternative hypothesis of 'diffraction' shouldn't itself be seen as burdensome—even though diffraction might require a longer time to explain to a human, it's still at heart a simple program.

ASHLEY:  Hmm.

I'm trying to think if there's some notion of 'simplicity' that I can abstract away from 'simple program' as the nice property that diffraction has as an explanation for rainbows, but I guess anything I try to say is going to come down to some way of counting the wheels and gears inside the explanation, and justify the complexity penalty on probability by the increased space of possible configurations each time we add a new gear. And I can't make it be about surface details because that will make whole humans seem way too improbable.

If I have to use simply specified systems and I can't use surface details or runtime, that's probably going to end up basically equivalent to Solomonoff induction. So in that case we might as well use Solomonoff induction, which is probably simpler than whatever I'll think up and will give us the same advice. Okay, you've mostly convinced me.

BLAINE:  Mostly? What's left?

vii.  Limitations

ASHLEY:  Well, several things. Most of all, I think of how the 'language of thought' or 'language of epistemology' seems to be different in some sense from the 'language of computer programs'.

Like, when I think about the laws of Newtonian gravity, or when I think about my Mom, it's not just one more line of code tacked onto a big black-box computer program. It's more like I'm crafting an explanation with modular parts—if it contains a part that looks like Newtonian mechanics, I step back and reason that it might contain other parts with differential equations. If it has a line of code for a Mom, it might have a line of code for a Dad.

I'm worried that if I understood how humans think like that, maybe I'd look at Solomonoff induction and see how it doesn't incorporate some further key insight that's needed to do good epistemology.

BLAINE:  Solomonoff induction literally incorporates a copy of you thinking about whatever you're thinking right now.

ASHLEY:  Okay, great, but that's inside the system. If Solomonoff learns to promote computer programs containing good epistemology, but is not itself good epistemology, then it's not the best possible answer to "How do you compute epistemology?"

Like, natural selection produced humans but population genetics is not an answer to "How does intelligence work?" because the intelligence is in the inner content rather than the outer system. In that sense, it seems like a reasonable worry that Solomonoff induction might incorporate only some principles of good epistemology rather than all the principles, even if the internal content rather than the outer system might bootstrap the rest of the way.

BLAINE:  Hm. If you put it that way...

(long pause)

... then I guess I have to agree. I mean, Solomonoff induction doesn't explicitly say anything about, say, the distinction between analytic propositions and empirical propositions, and knowing that is part of good epistemology on my view. So if you want to say that Solomonoff induction is something that bootstraps to good epistemology rather than being all of good epistemology by itself, I guess I have no choice but to agree.

I do think the outer system already contains a lot of good epistemology and inspires a lot of good advice all on its own. Especially if you give it credit for formally reproducing principles that are "common sense", because correctly formalizing common sense is no small feat.

ASHLEY:  Got a list of the good advice you think is derivable?

BLAINE:  Um. Not really, but off the top of my head:

1. The best explanation is the one with the best mixture of simplicity and matching the evidence.
2. "Simplicity" and "matching the evidence" can both be measured in bits, so they're commensurable.
3. The simplicity of a hypothesis is the number of bits required to formally specify it—for example, as a computer program.
4. When a hypothesis assigns twice as much probability to the exact observations seen so far as some other hypothesis, that's one bit's worth of relatively better matching the evidence.
5. You should actually be making your predictions using all the explanations, not just the single best one, but explanations that poorly match the evidence will drop down to tiny contributions very quickly.
6. Good explanations let you compress lots of data into compact reasons which strongly predict seeing just that data and no other data.
7. Logic can't dictate prior probabilities absolutely, but if you assign probability less than 2^(-1,000,000) to the prior that mechanisms constructed using a small number of objects from your universe might be able to well predict that universe, you're being unreasonable.
8. So long as you don't assign infinitesimal prior probability to hypotheses that let you do induction, they will very rapidly overtake hypotheses that don't.
9. It is a logical truth, not a contingent one, that more complex hypotheses must in the limit be less probable than simple ones.
10. Epistemic rationality is a precise art with no user-controlled degrees of freedom in how much probability you ideally ought to assign to a belief. If you think you can tweak the probability depending on what you want the answer to be, you're doing something wrong.
11. Things that you've seen in one place might reappear somewhere else.
12. Once you've learned a new language for your explanations, like differential equations, you can use it to describe other things, because your best hypotheses will now already encode that language.
13. We can learn meta-reasoning procedures as well as object-level facts by looking at which meta-reasoning rules are simple and have done well on the evidence so far.
14. So far, we seem to have no a priori reason to believe that universes which are more expensive to compute are less probable.
15. People were wrong about galaxies being a priori improbable because that's not how Occam's Razor works. Today, other people are equally wrong about other parts of a continuous wavefunction counting as extra entities for the purpose of evaluating hypotheses' complexity.
16. If something seems "weird" to you but would be a consequence of simple rules that fit the evidence so far, well, there's nothing in these explicit laws of epistemology that adds an extra penalty term for weirdness.
17. Your epistemology shouldn't have extra rules in it that aren't needed to do Solomonoff induction or something like it, including rules like "science is not allowed to examine this particular part of reality"—

ASHLEY:  This list isn't finite, is it.

BLAINE:  Well, there's a lot of outstanding debate about epistemology where you can view that debate through the lens of Solomonoff induction and see what Solomonoff suggests.

ASHLEY:  But if you don't mind my stopping to look at your last item, #17 above—again, it's attempts to add completeness clauses to Solomonoff induction that make me the most nervous.

I guess you could say that a good rule of epistemology ought to be one that's promoted by Solomonoff induction—that it should arise, in some sense, from the simple ways of reasoning that are good at predicting observations. But that doesn't mean a good rule of epistemology ought to explicitly be in Solomonoff induction or it's out.

BLAINE:  Can you think of good epistemology that doesn't seem to be contained in Solomonoff induction? Besides the example I already gave of distinguishing logical propositions from empirical ones.

ASHLEY:  I've been trying to. First, it seems to me that when I reason about laws of physics and how those laws of physics might give rise to higher levels of organization like molecules, cells, human beings, the Earth, and so on, I'm not constructing in my mind a great big chunk of code that reproduces my observations. I feel like this difference might be important and it might have something to do with 'good epistemology'.

BLAINE:  I guess it could be? I think if you're saying that there might be this unknown other thing and therefore Solomonoff induction is terrible, then that would be the nirvana fallacy. Solomonoff induction is the best formalized epistemology we have right now.

ASHLEY:  I'm not saying that Solomonoff induction is terrible. I'm trying to look in the direction of things that might point to some future formalism that's better than Solomonoff induction. Here's another thing: I feel like I didn't have to learn how to model the human beings around me from scratch based on environmental observations. I got a jump-start on modeling other humans by observing myself, and by recruiting my brain areas to run in a sandbox mode that models other people's brain areas—empathy, in a word.

I guess I feel like Solomonoff induction doesn't incorporate that idea. Like, maybe inside the mixture there are programs which do that, but there's no explicit support in the outer formalism.

BLAINE:  This doesn't feel to me like much of a disadvantage of Solomonoff induction—

ASHLEY:  I'm not saying it would be a disadvantage if we actually had a hypercomputer to run Solomonoff induction. I'm saying it might point in the direction of "good epistemology" that isn't explicitly included in Solomonoff induction.

I mean, now that I think about it, a generalization of what I just said is that Solomonoff induction assumes I'm separated from the environment by a hard, Cartesian wall that occasionally hands me observations. Shouldn't a more realistic view of the universe be about a simple program that contains me somewhere inside it, rather than a simple program that hands observations to some other program?

BLAINE:  Hm. Maybe. How would you formalize that? It seems to open up a big can of worms—

ASHLEY:  But that's what my actual epistemology actually says. My world-model is not about a big computer program that provides inputs to my soul, it's about an enormous mathematically simple physical universe that instantiates Ashley as one piece of it. And I think it's good and important to have epistemology that works that way. It wasn't obvious that we needed to think about a simple universe that embeds us. Descartes did think in terms of an impervious soul that had the universe projecting sensory information onto its screen, and we had to get away from that kind of epistemology.

BLAINE:  You understand that Solomonoff induction makes only a bounded number of errors relative to any computer program which does reason the way you prefer, right? If thinking of yourself as a contiguous piece of the universe lets you make better experimental predictions, programs which reason that way will rapidly be promoted.

ASHLEY:  It's still unnerving to see a formalism that seems, in its own structure, to harken back to the Cartesian days of a separate soul watching a separate universe projecting sensory information on a screen. Who knows, maybe that would somehow come back to bite you?

BLAINE:  Well, it wouldn't bite you in the form of repeatedly making wrong experimental predictions.

ASHLEY:  But it might bite you in the form of having no way to represent the observation of, "I drank this 'wine' liquid and then my emotions changed; could my emotions themselves be instantiated in stuff that can interact with some component of this liquid? Can alcohol touch neurons and influence them, meaning that I'm not a separate soul?" If we interrogated the Solomonoff inductor, would it be able to understand that reasoning?

Which brings up that dangling question from before about modeling the effect that my actions and choices have on the environment, and whether, say, an agent that used Solomonoff induction would be able to correctly predict "If I drop an anvil on my head, my sequence of sensory observations will end."

ELIEZER:  And that's my cue to step in!

The natural next place for this dialogue to go, if I ever write a continuation, is the question of actions and choices, and the agent that uses Solomonoff induction for beliefs and expected reward maximization for selecting actions—the perfect rolling sphere of advanced agent theory, AIXI.

Meanwhile: For more about the issues Ashley raised with agents being a contiguous part of the universe, see "Embedded Agency."
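
Items 3 to 5 of Blaine's list can be made concrete with a toy calculation. The sketch below is not Solomonoff induction, which is uncomputable. It assumes a hand-picked set of three hypotheses about a coin, with invented description lengths in bits, and shows how a prior of 2^(-length) combines with how well each hypothesis matched the evidence. All names and numbers are illustrative assumptions.

```python
import math

# Hypothetical hypotheses: (name, description length in bits, P(heads)).
# The lengths are invented for illustration, not derived from real programs.
HYPOTHESES = [
    ("fair coin", 5, 0.5),
    ("biased 90% heads", 12, 0.9),
    ("biased 99% heads", 20, 0.99),
]


def posterior(observations):
    """Return normalized posterior weights after a sequence of 'H'/'T' flips."""
    log2_scores = {}
    for name, length_bits, p_heads in HYPOTHESES:
        # Simplicity prior: each extra bit of description halves the prior.
        score = -length_bits
        for flip in observations:
            p = p_heads if flip == "H" else 1.0 - p_heads
            # Evidence measured in bits: log2 of the probability assigned.
            score += math.log2(p)
        log2_scores[name] = score
    top = max(log2_scores.values())
    weights = {k: 2.0 ** (v - top) for k, v in log2_scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def predict_heads(observations):
    """Mixture prediction: weight every hypothesis by its posterior."""
    post = posterior(observations)
    return sum(post[name] * p for name, _, p in HYPOTHESES)


if __name__ == "__main__":
    for n in (0, 10, 40):
        print(n, round(predict_heads("H" * n), 3))
```

With no data the short "fair coin" hypothesis dominates the mixture; after a run of heads it loses one bit per flip relative to a hypothesis that predicted heads with certainty, and its contribution quickly becomes negligible.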


### A non-logarithmic argument for Kelly

March 4, 2021 - 19:21
Published on March 4, 2021 4:21 PM GMT

This post is a response to abramdemski's post, Kelly *is* (just) about logarithmic utility.

any argument in favor of the Kelly formula has to go through an implication that your utility is logarithmic in money, at some point. If it seems not to, it's either:

• mistaken
• cleverly hiding the implication
• some mind-blowing argument I haven't seen before.

Challenge accepted. This is essentially a version of time-averaging which gets rid of the infinity problem.

Consider the Kelly-betting game: Each round, you can bet any fraction of your wealth on a fair coinflip, which will be tripled if you win. You play this game for an infinite number of rounds. Your utility is linear in money.

The first thing to note is that expected utility maximization does not recommend betting everything each round in this game. It does recommend that for any finite version of the game, but this version has various infinite payoffs, or no well-defined payoffs at all, since it doesn't end. We will get around this by comparing strategies directly, instead of computing an expectation for each strategy and comparing the expectations. Written as a limit of sample averages, the usual expectation-based order is:

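$$s_1 \geq s_2 \iff \lim_{n\to\infty}\left[\frac{1}{n}\sum_{i=0}^{n} U(\mathrm{game}(s_1; r(i)))\right] \geq \lim_{m\to\infty}\left[\frac{1}{m}\sum_{j=0}^{m} U(\mathrm{game}(s_2; r(j)))\right]$$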

with the idea of then picking the strategy that is maximal under this order. We then try to pull the comparison inside the limit:

$$s_1 \geq s_2 \iff \lim_{n\to\infty}\left[\frac{1}{n}\sum_{i=0}^{n} U(\mathrm{game}(s_1; r(i))) \geq \frac{1}{n}\sum_{j=0}^{n} U(\mathrm{game}(s_2; r(j)))\right]$$

but this doesn't quite work, because we have a truth value inside the limit. Replacing that with a probability (and dropping the normalizers, since they don't matter) gives:

$$s_1 \geq s_2 \iff \lim_{n\to\infty}\left[P\left[\sum_{i=0}^{n} U(\mathrm{game}(s_1; r(i))) \geq \sum_{j=0}^{n} U(\mathrm{game}(s_2; r(j)))\right]\right] > 0$$

and for the games where classic utility maximisation was well-defined this should give the same results.

Now we can properly define our infinite game: game(s;r) gets a third parameter indicating the number of rounds played, so that game(s;r;t) stands for playing the Kelly game for t rounds instead of infinitely long. The full game is then the limit of this. Then I define the criterion for limiting games of this type as:

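$$s_1 \geq s_2 \iff \lim_{n\to\infty}\lim_{t\to\infty}\left[P\left[\sum_{i=0}^{n} U(\mathrm{game}(s_1; r(i); t)) \geq \sum_{j=0}^{n} U(\mathrm{game}(s_2; r(j); t))\right]\right] > 0$$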

which we can easily see reproduces Kelly behaviour: for any n and any d, as t goes to infinity, the probability that any of the bettors in the sample has a percentage of heads so far that differs from 50% by more than d goes to 0. So whichever strategy does better when it gets exactly 50% heads will have the higher payoff at t=∞, and since this is true for any n, it's also true as n goes to infinity. This is precisely the Kelly strategy.
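
The argument above reduces to comparing strategies on a path with exactly 50% heads. A minimal sketch, assuming a constant-fraction strategy that bets the same fraction f every round (the function names are hypothetical, not from the post), computes the per-round growth factor on such a path and grid-searches for the best f:

```python
import math


def typical_growth(f):
    """Per-round wealth multiplier on a path with exactly 50% heads.

    A win turns the stake f into 3f, so wealth is multiplied by (1 + 2f);
    a loss forfeits the stake, so wealth is multiplied by (1 - f).
    The geometric mean of one win and one loss is the per-round factor.
    """
    return math.sqrt((1.0 + 2.0 * f) * (1.0 - f))


def best_fraction(steps=1000):
    """Grid-search the constant fraction that maximizes typical growth."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=typical_growth)


if __name__ == "__main__":
    f_star = best_fraction()
    print(f_star, typical_growth(f_star))  # best fraction and its growth factor
    print(typical_growth(1.0))             # all-in is wiped out on this path
```

The maximizer agrees with the standard Kelly formula f* = p - q/b: with p = q = 1/2 and net odds b = 2, that gives f* = 1/4.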

Does it make sense to look at a game with infinitely many rounds? Perhaps not. You could also say that the game has a 1% chance of ending each round: then it would end in finitely many rounds with probability one. I can't solve this analytically, but I think it would end up looking very close to Kelly behaviour.

Notice that if the order of the n- and t-limits is switched, we get the all-in strategy. This is how I think the intuition that utility maximization implies all-in is generated, and this switch is why I put it into the "ergodic" category. Either version would give results consistent with expected utility maximization for games which are finite (encoded as $\exists t_1\, \forall t > t_1\, \forall s\, \forall r\, [\mathrm{game}(s;r;t) = \mathrm{game}(s;r;t_1)]$).
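
Why the order of limits matters can be seen with exact arithmetic for a fixed number of rounds t. The sketch below is an illustration under the game's stated rules, with hypothetical function names: the expected multiplier favours all-in for every finite t, while the probability that all-in is actually ahead shrinks as 2^(-t).

```python
def expected_multiplier(f, t):
    """Expected wealth multiplier after t rounds betting fraction f each round.

    Rounds are independent, so the expectation factorizes: each round
    contributes 0.5 * (1 + 2f) + 0.5 * (1 - f) = 1 + f / 2.
    """
    return (1.0 + f / 2.0) ** t


def prob_all_in_ahead(t):
    """Probability that all-in beats a bettor who always stakes less than everything.

    All-in keeps any wealth only if all t flips come up heads; on every
    other path it holds zero while the other bettor holds something.
    """
    return 0.5 ** t


if __name__ == "__main__":
    for t in (1, 10, 50):
        print(t, expected_multiplier(1.0, t), expected_multiplier(0.25, t),
              prob_all_in_ahead(t))
```

Taking n to infinity first lets the sample average pick up the rare all-heads paths that drive the expectation; taking t to infinity first makes those paths vanish from any finite sample.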


### Connecting the good regulator theorem with semantics and symbol grounding

March 4, 2021 - 17:35
Published on March 4, 2021 2:35 PM GMT

I've been writing quite a bit about syntax, semantics, and symbol grounding.

Recently, I've discovered the "good regulator theorem" in systems science[1]. With a bunch of mathematical caveats, the good regulator theorem says:

• Every good regulator of a system must be a model of that system.

Basically if anything is attempting to control a system (making it a "regulator"), then it must model that system. The definition of "good" includes minimum complexity (that's why the regulator "is" a model of the system: it includes nothing else that would be extraneous), but we can informally extend that to a rougher theorem:

• Every decent regulator of a system must include a model of that system.

Models, semantics, intelligence, and power

I initially defined grounded symbols by saying that there was mutual information between the symbols in the agent's head and features of the world.

For the simplest agent, a circuit-breaking alarm, the symbol $X_a$ just checked whether the circuit was broken or not. It had the most trivial model, simply mapping the Boolean of "circuit broken: yes/no" to that of "sound alarm: yes/no".

It could be outwitted, and it could go off in many circumstances where there was no intruder in the greenhouse. This is hardly a surprise, since the alarm does not model the greenhouse or the intruders at all: it models the break in the circuit, with the physical setup linking that circuit breaking with (some cases of) intruders. Thus the correlation between $X_a$ and $x$ is weak.

The most powerful agent, a resourceful superintelligent robot dedicated to intruder-detection, has internal variable $X_r$. In order to be powerful, this agent must, by the good regulator theorem, be able to model many different contingencies and situations, having a good grasp of all the ways intruders might try to fool it, and have ways of detecting each of those ways. It has a strong model of the (relevant parts of the) world, and $X_r$ is very closely tied to $x$.
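
The contrast between the alarm's symbol and the robot's symbol can be illustrated numerically with mutual information, the quantity used in the definition of grounding above. The joint probabilities below are invented purely for illustration; nothing in the post specifies them.

```python
import math


def mutual_information(joint):
    """Mutual information in bits for a joint distribution {(x, X): prob}."""
    px, pX = {}, {}
    for (x, X), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        pX[X] = pX.get(X, 0.0) + p
    mi = 0.0
    for (x, X), p in joint.items():
        if p > 0.0:
            mi += p * math.log2(p / (px[x] * pX[X]))
    return mi


# Hypothetical joints over (intruder present?, internal symbol fires?).
# Alarm: often fooled, and often goes off with no intruder.
ALARM = {(1, 1): 0.06, (1, 0): 0.04, (0, 1): 0.20, (0, 0): 0.70}
# Robot: its symbol almost always matches the world.
ROBOT = {(1, 1): 0.099, (1, 0): 0.001, (0, 1): 0.001, (0, 0): 0.899}

if __name__ == "__main__":
    print("alarm:", mutual_information(ALARM))
    print("robot:", mutual_information(ROBOT))
```

Under these made-up numbers the robot's symbol carries far more information about the intruder than the alarm's does, though never more than the entropy of the intruder variable itself.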

A more powerful agent could still fool it. If that agent was more intelligent, which we'll define here as having superior models of $x$, of $X_r$, and of the surrounding universe, then it will know where to apply its power to best trick or overwhelm the robot. If that agent was less intelligent, it would have to apply a lot more brute power, since it wouldn't have a good model of the robot's vulnerabilities.

Thus, in some ways, greater intelligence could be defined as better use of better models.

Learning and communication

Of course, in the real world, agents don't start out with perfect models; instead they learn. So a good learning agent is one that constructs good models from their input data. It's impossible for a small agent to model the whole universe in detail, so efficient agents have to learn what to focus on, and what simplifying assumptions it is useful for them to make.

Communication, when it works, allows the sharing of one person's model with another. This type of communication is not just sharing factual information, but one person trying to communicate their way of modelling and classifying the world. That's why this form of communication can sometimes be so tricky.

1. Thanks to Rebecca Gorman for getting me into looking at cybernetics, control theory, systems science, and other older fields. ↩︎


### Covid 3/4: Declare Victory and Leave Home

March 4, 2021 - 16:20
Published on March 4, 2021 1:20 PM GMT

Health officials look on in horror as individuals both vaccinated and unvaccinated, and state and local governments, realize life exists and people can choose to live it.

This is exactly what I was worried about back in December when I wrote We’re F***ed, It’s Over. The control system would react to the good news in time to set us up to get slammed by the new strains, and a lot of damage can get done before there is a readjustment. The baseline scenario from two months ago is playing out.

The good news, in addition to the positive test percentages continuing to drop for now, is that we have three approved vaccines rapidly scaling up and are well ahead of the vaccine schedule I anticipated, having fully recovered from last week’s dip, and it looks like the new strains are more infectious but not on the high end of the plausible range for that.

The J&J vaccine was approved this week, after a completely pointless three week delay during which no information was found and (for at least the first two-thirds of it) no distribution plan formed. Anything I put at 98%+ on a prediction website isn’t fully news, but the other 2% would have been quite terrible. Supply will initially be limited, but will expand rapidly, including with the help of Merck.

Meanwhile, now that we were provided a sufficiently urgent excuse that we were able to show that mRNA vaccines work, we’ve adapted them to create a vaccine for malaria. Still very early but I consider this a favorite to end up working in some form within (regulatory burden) number of years. It’s plausible that the Covid-19 pandemic could end up net massively saving lives, and a lot of Effective Altruists (and anyone looking to actually help people) have some updating to do. It’s also worth saying that 409k people died of malaria in 2020 around the world, despite a lot of mitigation efforts, so can we please please please do some challenge trials and ramp up production in advance and otherwise give this the urgency it deserves? And speed up the approval process at least as much as we did for Covid? And fund the hell out of both testing this and doing research to create more mRNA vaccines? There are also mRNA vaccines in the works for HIV, influenza and certain types of heart disease and cancer. These things having been around for a long time doesn’t make them not a crisis when we have the chance to fix them. And your periodic reminder that the same is true of health’s final boss, also known as aging.

Also, please note that I have been given the opportunity to offer Covid Micro-Grants; see the section below for details. If you can use $1k-$5k to complete a project to help with Covid-19, please don’t hesitate to apply.

Let’s run the numbers.

The Numbers

Predictions

Last week: 4.9% positive test rate and an average of 2,068 deaths.

Late prediction (Friday morning): 4.5% positive test rate and an average of 1,950 deaths (excluding the California bump on 2/25).

Result: 4.2% positive test rate and an average of 1,827 deaths after subtracting the California bump.

Great news. I’ve found it pays to be conservative in predicting changes, so when we get the full ‘baseline scenario’ style changes like this, I’m going to undershoot. This was essentially the good scenario, and it bodes well. Deaths continue to lag behind, despite increased vaccination effects for the elderly, in ways I don’t entirely understand. The theory that it’s lag can’t explain the bulk of it because it doesn’t match the past data.

Deaths

NOTE: Arkansas reported net negative deaths this week, which seems unlikely, so I set them to a plausible but low number (40) instead.

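| Date | WEST | MIDWEST | SOUTH | NORTHEAST | TOTAL |
|---|---|---|---|---|---|
| Jan 7-Jan 13 | 6280 | 3963 | 7383 | 4752 | 22378 |
| Jan 14-Jan 20 | 5249 | 3386 | 7207 | 4370 | 20212 |
| Jan 21-Jan 27 | 6281 | 3217 | 8151 | 4222 | 21871 |
| Jan 28-Feb 3 | 5524 | 3078 | 8071 | 3410 | 20083 |
| Feb 4-Feb 10 | 4937 | 2687 | 7165 | 3429 | 18218 |
| Feb 11-Feb 17 | 3837 | 2221 | 5239 | 2700 | 13997 |
| Feb 18-Feb 24 | 3652 | 2433 | 4782 | 2427 | 13294 |
| Feb 25-Mar 3 | 3834 | 1669 | 5610 | 1958 | 13071 |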

There is no plausible story where deaths in the south could be on the uptick for real, but the Arkansas adjustment goes the other way and there weren’t any other glaring mistakes. My assumption is that this is data lag after the storm and isn’t a real change, slash there’s a lot of noise in when deaths are measured in ways that still do not make sense to me but which have happened too many times to not acknowledge.

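Positive Tests

| Date | WEST | MIDWEST | SOUTH | NORTHEAST |
|---|---|---|---|---|
| Jan 21-Jan 27 | 260,180 | 158,737 | 386,725 | 219,817 |
| Jan 28-Feb 3 | 191,804 | 122,259 | 352,018 | 174,569 |
| Feb 4-Feb 10 | 144,902 | 99,451 | 255,256 | 149,063 |
| Feb 11-Feb 17 | 97,894 | 73,713 | 185,765 | 125,773 |
| Feb 18-Feb 24 | 80,625 | 64,857 | 150,493 | 110,339 |
| Feb 25-Mar 3 | 66,151 | 58,295 | 151,253 | 115,426 |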

Test counts bounced back this week and that’s likely accounting for the bumps up in raw positive test counts in the Northeast and South. The situation is still clearly improving. Doesn’t mean I would start lifting mask mandates.

Test Counts

NOTE: This table will not be in future editions unless I can find a new data source for it that’s reasonable to use. Suggestions for a new data source are great.

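| Date | USA tests | USA Positive % | NY tests | NY Positive % | Cumulative Positives |
|---|---|---|---|---|---|
| Jan 7-Jan 13 | 13,911,529 | 12.2% | 1,697,034 | 6.6% | 6.97% |
| Jan 14-Jan 20 | 14,005,720 | 9.7% | 1,721,440 | 5.9% | 7.39% |
| Jan 21-Jan 27 | 12,801,271 | 8.8% | 1,679,399 | 5.3% | 7.73% |
| Jan 28-Feb 3 | 12,257,123 | 7.7% | 1,557,550 | 4.6% | 8.02% |
| Feb 4-Feb 10 | 11,376,541 | 6.4% | 1,473,454 | 4.1% | 8.25% |
| Feb 11-Feb 17 | 10,404,504 | 5.2% | 1,552,555 | 3.5% | 8.41% |
| Feb 18-Feb 24 | 9,640,109 | 4.9% | 1,502,741 | 3.2% | 8.55% |
| Feb 25-Mar 3 | 10,610,092 | 4.2% | 1,701,829 | 3.1% | 8.69% |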

The bounceback in test counts helps explain how positive test percentages fell so much week over week, and makes trends in New York look troubling. I’m going to be in the city this coming week, and it might be that I got in exactly in time given I’m not yet vaccinated.
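
How much of the drop in the national positive rate goes along with the rise in test counts can be checked with quick arithmetic on the last two weeks of USA figures. The sketch below assumes positives can be approximated as tests times the rounded positive percentage, which introduces some rounding error; the variable names are hypothetical.

```python
# Last two weeks of USA test counts and positive rates.
prev_tests, prev_rate = 9_640_109, 0.049    # Feb 18-Feb 24
curr_tests, curr_rate = 10_610_092, 0.042   # Feb 25-Mar 3

# Approximate positives from the rounded percentages (an assumption:
# raw national positives are not listed alongside these figures).
prev_pos = prev_tests * prev_rate
curr_pos = curr_tests * curr_rate

tests_change = curr_tests / prev_tests - 1.0
rate_change = curr_rate / prev_rate - 1.0
pos_change = curr_pos / prev_pos - 1.0

if __name__ == "__main__":
    print(f"tests:      {tests_change:+.1%}")
    print(f"positive %: {rate_change:+.1%}")
    print(f"positives:  {pos_change:+.1%}")
```

Under that approximation the positive rate falls by roughly 14% while implied positives fall by roughly 6%, with the roughly 10% rise in tests accounting for the gap.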

Vaccinations

Our progress here suddenly looks great. I expected a surge to happen in March and am pleasantly surprised to see it happen this large and this quickly. The one concern is if a bunch of this is catch-up efforts after the snowstorms cleared, in which case we might effectively be back on our old pace for a few more weeks.

The future numbers are even more promising, if you can wait a few months:

I’m quite happy about this of course, and do expect the vaccines to arrive, but in an important sense it’s important to realize this is literally Fake News. What’s fake is the claim that this is news, that something has changed. Nothing changed. Biden has been pursuing a hyper-aggressive policy of under-promise and over-deliver to the point of absurdity, in order to claim maximum credit. This is the natural result. I do understand the motivation, but in addition to the continuing damage to his credibility and government credibility in general (which is bad for vaccines in particular, but in general represents a truth-tracking update) it is of course highly unhelpful. If you want people to hold the line, telling them the end is in sight is exactly what you should be doing. Especially if it’s true.

The question is whether we can count on this pattern to continue. I don’t mean that in a judgemental way, I mean that in a truth seeking way. If we can assume that what is said is designed to make the end result look as impressive as possible, then we can properly evaluate the claims coming from the new administration. We’d get to have Pravda which always lies (in the same directions), instead of the New York Times which keeps you guessing by sometimes telling the truth. It would be especially nice if this pattern extends beyond the pandemic. Presumably at some point there will be a time to claim to have delivered the goods, which complicates matters.

Could it be? Vaccinating people overnight?

We finally are going to vaccinate at night, it seems, in order to make it clear who is getting which vaccine. Or, alternatively, we can think of this as offering the rent-controlled good-but-hard-to-get thing during the day (Moderna/Pfizer vaccine at a time you want to be awake) versus the market rent good-enough thing at night (J&J vaccine, which you bid on by being willing to make a trip in the middle of the night at increasingly terrible hours). It’s a really bizarre way to do a little bit of an obviously correct thing, but at this point we’ll take whatever we can get.

Meanwhile, in North Carolina they have open vaccinations except for those who refuse to lie to government officials, who go to the back of the line:

How much is vaccine capacity worth, and how much are we underinvesting in it even now? About this much

How much are we gonna have how fast? Hopefully this much, and hopefully faster:

Faster wouldn’t actually surprise me, since we have an authority systematically under promising.

Europe

It is Italy’s turn to worry as cases trend upwards. Mostly it seems like Europe is doing what it takes to stabilize things while it suffers several months of extra pain thanks to their collective decision to be penny pinchers with regard to vaccines. That decision seems like the essence of the European project at this point, emphasizing things seeming fair and polite and making sure everything abides by all the rules and regulations, whether or not that is compatible with life. One must not underestimate the value of keeping the peace, but these trends likely keep accelerating, and I doubt it ends well.

Farewell, Covid Tracking Project

On March 7, the Covid Tracking Project will stop collecting data. There are many other data sources out there, but I still don’t have one I’m fully happy with. I primarily want easy access in table form of the number of tests, positive tests, hospitalizations and deaths, on a daily basis, including a full history. This needs to be available for the nation and if at all possible for individual states; more granularity beyond that is a bonus, as is any additional data.

Johns Hopkins has been suggested as an alternative data source. The data itself seems excellent, but like most places they seem obsessed with giving it to us in graph form rather than table form, which is useful at a glance but super frustrating when I’m trying to create spreadsheets and my own graphs and charts. Also, they list their data source as… the Covid Tracking Project. So they have the same problem I do, and we’ll see if they still have good data next week.

Anyway, once again opening the floor for any suggestions.

The Wikipedia data on deaths and positive tests is great, but as far as I can tell it doesn’t include the number of tests, so it doesn’t tell me the denominator (the total number of tests).

Announcing Covid Microgrants

Thanks to a donor who wished to remain anonymous, I am able to offer Covid microgrants. These will be grants of $1,000 to $5,000 each, for those who have a Covid project which they could finish given this small amount of additional funding. If you’re interested, fill out this Google form. Applications close on 3/12/21, and decisions will be quick and based only on my own judgment. I am very curious to see the quantity and quality of applications that come in, and if things go well this could happen again. Please don’t hesitate to apply, or to encourage others to apply.

Insert Mission Accomplished Banner

This kind of thing continues to happen, here’s where we were on February 25:

And here’s where they were three days later:

Then the next day, in Texas:

The English Strain

Why do people keep making this mistake over and over again and I don’t mean Greg Abbott:

This is showing up in the case numbers! It’s showing up as a 20%-30% increase in cases!

Very few people who got infected by a B.1.1.7 strain would have otherwise gotten infected by the old strain during this same time period. Very few people who got infected by a B.1.1.7 strain would have been infected if the initial people to have B.1.1.7 had the old strain instead, because its additional infectiousness has grown its share of infections by several orders of magnitude.

Thus, if you have 80 infections with the old strain and 20 with the new, and no one’s had time to change their behaviors in response yet, this is showing up in the case numbers as about 20 new cases. It’s at least 19.

That’s how to track the impact of the new strain: All cases of the new strain should be considered ‘extra’ cases due to the new strain, until there’s enough time that the control system has adjusted behavior to account for the new infections. Period.
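
The 80/20 example is simple enough to check directly. A minimal sketch, assuming (as argued above) that every new-strain case counts as extra, with a hypothetical helper name:

```python
def extra_case_share(old_cases, new_cases):
    """Fractional increase in cases if all new-strain cases are 'extra'."""
    return new_cases / old_cases


if __name__ == "__main__":
    # 20 extra cases on a base of 80 old-strain cases.
    print(f"{extra_case_share(80, 20):.0%}")  # prints 25%
```

A 25% increase sits inside the 20%-30% range quoted above.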

The switch to primarily B.1.1.7 infections seems to be poised to happen in early to mid March, which is later than I feared but clearly in the middle of the expected range.

Johnson, Johnson & Merck

In excellent news, pharmaceutical giant Merck, whose Covid-19 vaccine candidate didn’t work out, is going to help make the Johnson & Johnson vaccine (WaPo). Wonderful, and exactly how it should go. There’s available capacity (not necessarily fully free capacity, but this is a priority), everyone makes a deal, profits, looks good and does good doing it, presto.

That’s great news, and can make us even more confident we will have enough vaccine supply in the medium term, and more confident we’ll be able to help vaccinate the whole world soon after.

What this highlights is how bad the delay in approval of J&J’s vaccine was. J&J was already making doses using its own capacity, so there was a story one could tell that while this delayed some doses being delivered by a few weeks, it didn’t destroy capacity or change the long term trajectory. If days after approval, they’re finally getting to a deal to get Merck to step up, it seems very likely this deal had to wait on approval, so this pushed back half or more of J&J’s long term capacity by three weeks. That’s going to kill a lot of people.

Stockholm Syndrome

This is quite the graph, showing weekly Covid levels in the Stockholm wastewater:

(I assume Week 1 here means 2021 Jan 1-7, and so on.)

There is clearly a lot of measurement error here. There aren’t worlds in which week 4’s levels should be more than double both week 3 and week 5’s levels, nor does the jump from 42 to 43 or 34 to 35 make any sense. The last measurement is plausibly a pure data error. My best guess is that the sample isn’t effectively being taken from distinct enough locations and is effectively measuring something too local, and caught a local outbreak? Regardless of the right explanation, there’s still something being measured here, and this is the definition of off the charts. Seems worth noticing.

Noticing this, I checked in with Boston wastewater as well:

There was an upward move, but things seem to have come back and now are below the previous low point this year, so it seems like things are indeed continuing to improve. It does provide an additional suggestion that there was some sort of brief mini-surge corresponding to the uptick in numbers, but I have actual zero idea what could have caused that at that time.

Vaccines Still Work

Vaccines still work, Pfizer single-dose preventing infection edition.

Vaccines still work, Moderna single-dose preventing infection edition. More lowballing.

Vaccines still work, AstraZeneca and Pfizer single-dose edition (paper).

Vaccines still work, take essentially any vaccine you can get edition (MR). Chinese vaccine is the only plausible exception.

Vaccines still work, but keep not getting approved, so here’s the rich Germans will fly to Russia, get vaccinated and leave without ever entering the country edition

Vaccines still work, they all are awesome, but some are better than others and while you should mostly take whatever is available, you should care a nonzero amount about getting the best one you can edition, a Jason Furman Twitter thread.

Vaccines still work, we fully knew this back in July and everyone who stalled things further should be judged accordingly edition.

In Other News

We can all agree Andrew Cuomo is the worst, it seems, due to claims of sexual harassment. We were going to let the causing of and then covering up of thousands of deaths slide – I mean what politician hasn’t done that sort of thing this past year – but we have a zero tolerance policy for sexual harassment that reaches a threshold level of social media prominence. This calls for an independent investigation immediately. I’d summarize my reaction to all this as: I’m not saying Al Capone wasn’t guilty of tax evasion, and also I’m shocked, shocked to find gambling in this establishment.

It appears Operation Warp Speed had to be funded by raiding other sources because Congress couldn’t be bothered to fund it. As MR points out, this is a scandal because it was necessary, rather than because it was done. It’s scary, because it implies that under a different administration Operation Warp Speed could easily have not happened at all.

Catholic Church tells members to avoid J&J vaccine if they can, over concerns about abortion, despite Pope explicitly saying those concerns don’t apply. Divine authority, you had one job!

Another reason you might want to pay money for the things you want:

Shed a tear for maybe it would also have been even more helpful to make the vaccine profitable back when it could have helped increase supply but also take whatever we can get, wherever we can get it.

Doctor Fauci’s defense against First Doses First is a combination of pure FUD and… that it would be a messaging problem?

Also that we’ve already missed the window where this would have helped much, thanks to people like him dragging their feet on this and continuing to drag their feet, so no point in worrying about it now, might as well acknowledge that the foot dragging worked:

At least the ‘this would further blow our credibility’ argument is honest and has content. It’s true that reversing these policies, when the need for first doses first is getting less rather than more urgent, would make those involved look like lying liars and/or bumbling idiots, who mostly aren’t optimizing for outcomes, and for various reasons they’d prefer a less accurate perspective to retain its popularity.

Fauci’s new position is that ‘there are risks to both approaches’ and to continue to use variations on ‘no evidence’ and to emphasize that the second dose offers an individual additional protection, as if that was in any way in dispute. The concept of a cost/benefit analysis, or the idea that one might shut up and multiply, let alone form a detailed model full of gears, is clearly not within his range.

Zeynep post and open thread on pandemic lessons for the future.

Zeynep article in The Atlantic about how our public health messaging has been a disaster.

Post is excellent, and does a great job driving home the central things that went massively wrong with public health messaging. My only quibble is that harms from terrible regulation are treated as beyond scope and not discussed, which is reasonable in context but also feels like ignoring the elephant. Also, if you’ve been following events via my posts, Zeynep’s post is largely a case of You Should Know This Already.

In particular, Zeynep points to five key mistakes: fear of risk compensation, telling people to use rules instead of mechanisms or intuitions, scolding and shaming especially for outdoor activities (which is a lot of why parks/beaches were closed while indoor gyms were permitted in many places), failure to support or give people tools for harm reduction while making impossible asks (e.g. no socializing for a year), and sitting on the line of ‘no evidence’ or ‘no clear evidence’ over and over and over again.

And yes, she points out, still doing it:

We did it with masks, with transmission methods and modes of prevention, and now again and also with vaccines.

That’s all an excellent summary of the biggest failures, but I am not convinced it is fair to call them ‘mistakes.’

All of this also isn’t new, this isn’t Covid but seems highly on point (OP has lots more and is great):

Then of course because don’t be absurd and I’d be boggled to find a different answer:

Dr. Fauci graciously says it’s all right for two vaccinated individuals to have dinner together, citing “common sense” and that the risk is “extremely small.” The implication that all people involved must be vaccinated is clear, so this is a retreat from one insane position to a slightly less insane position.

Update on the White House supercluster of infections, which happened exactly the way one would expect, so no real need to click.

We shouldn’t expect anything less. CDC guidelines for citizen behavior have always been at best aspirational (you could also use the word ‘crazy’) and mostly ignored. This never seemed wise to me, since once one realizes one is not going to do what the authority demands, one often ends up doing little or nothing.

The danger is that we may have entered a new mode where people might actually listen to the CDC guidelines and make serious attempts to get people to follow them, perhaps indefinitely. “Infectious disease specialists” are like any other ‘specialist’, and think everyone should pay dearly to solve the particular problems they think about all day regardless of whether the cost/benefit analysis would make any sense if someone ever did one. If you didn’t ignore most such ‘specialists’ you’d do nothing else all day and feel bad about falling short anyway.

Is Biden ‘following the science’ (MR) as promised? Tyler Cowen says no and presents his case. The administration allowed the CDC to issue nonsensical guidance that is similar to its usual nonsensical guidance except it’s often going to actually get followed, which is preventing the reopening of many child prisons. AstraZeneca and other vaccines remain unapproved and J&J took three weeks to approve. There is no new head of the FDA and no talk of FDA reforms of any kind. He doesn’t mention vaccine prioritization, which was also massively botched by every metric one might plausibly care about. Post also mentions some non-Covid decisions.

I think Cowen’s interpretation here is wrong, and Biden is indeed Following The Science exactly the way he promised. He’s not following the science, in the sense in which science is the collective methods by which people know things, via such actions as doing experiments, gathering data, modeling the world and figuring out what causes and actions might have what effects so as to choose better causes and get better effects. He’s (Following Science), using the Proper Procedures advocated by the Very Serious People and ‘experts.’ Should we have expected anything else? Did we think we were promised anything else?

Not Covid, but Eliezer Yudkowsky science fiction ethos recommendations seem worth sharing.

This week I will be in New York City. This will be awesome, and I look forward to my permanent return soon. It also means I will have limited resources and time in which to work on the post next week. It may be relatively abridged, and there is some chance it will come out on Friday instead.


### Grabby aliens and Zoo hypothesis

March 4, 2021 - 16:03
Published on March 4, 2021 1:03 PM GMT

Robin Hanson created a model of grabby aliens. In this model, we live before the arrival of an alien colonisation wave, because such a wave will prevent the appearance of new civilizations. Thus, we could find ourselves only before the arrival of the aliens, if any exist in our Universe.

However, at least some of the colonizers will preserve a fraction of habitable planets for different reasons: ethics, science, tourism, neglect. Let’s assume that it will be 0.01 of the total colonized volume. The numbers could vary, but it still looks like, in a densely packed universe, the total volume of colonized space-time is significantly larger than the space-time available for habitable planets before the colonization wave arrives, and thus even a fraction of this volume could be larger than the volume of the virgin habitable space. This is because the colonized space will exist almost forever, until the end of the universe.
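The comparison can be written as a rough inequality. The symbols here are illustrative and not from Hanson's model: let $V_{\text{col}}$ be the total colonized space-time volume, $V_{\text{pre}}$ the space-time volume available to habitable planets before the wave arrives, and $f = 0.01$ the preserved fraction. Assuming, as a simplification, that civilizations appear at the same rate per unit of habitable volume in both regions, a civilization is more likely to find itself inside a preserved zone when

$$
f \, V_{\text{col}} > V_{\text{pre}}
\quad\Longleftrightarrow\quad
\frac{V_{\text{col}}}{V_{\text{pre}}} > \frac{1}{f} = 100 .
$$

So under these assumptions the argument needs the colonized space-time volume to exceed the pre-colonization volume by more than a factor of 100.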

Moreover, any small effort from the alien civilization to seed life (artificial panspermia) or to protect habitable planets from catastrophes like asteroid impacts will significantly increase the number of habitable planets inside the colonization zone. Hanson’s model also assumes that the probability of civilization appearance for any given planet is growing with time, so later regions will have a higher density of habitable planets, as more planets will reach this stage.

Given all this, our civilization has a higher chance of appearing after the colonization wave has passed us, and thus aliens need to be somewhere nearby, but hidden, which is known as the Zoo Hypothesis. In other words, we live inside the sphere of influence of a Kardashev 3 civilization which either helped our appearance via artificial panspermia etc. or at least does not prevent our existence.

In this formulation, the idea starts to look like a variant of the simulation argument, as here it is assumed that an advanced civilization could create many non-advanced civilizations.


### Book review: "A Thousand Brains" by Jeff Hawkins

March 4, 2021 - 08:10
Published on March 4, 2021 5:10 AM GMT

Jeff Hawkins gets full credit for getting me first interested in the idea that neuroscience might lead to artificial general intelligence—an idea which gradually turned into an all-consuming hobby, and more recently a new job. I'm not alone in finding him inspiring. Andrew Ng claimed here that Hawkins helped convince him, as a young professor, that a simple scaled-up learning algorithm could reach Artificial General Intelligence (AGI). (Ironically, Hawkins scoffs at the deep neural nets built by Ng and others—Hawkins would say: "Yes yes, a simple scaled-up learning algorithm can reach AGI, but not that learning algorithm!!")

Hawkins's last book was On Intelligence in 2004. What's he been up to since then? Well, if you don't want to spend the time reading his journal articles or watching his research meetings on YouTube, good news for you—his new book, A Thousand Brains, is out! There’s a lot of fascinating stuff here. I'm going to pick and choose a couple topics that I find especially interesting and important, but do read the book for much more that I'm not mentioning.

A grand vision of how the brain works

Many expert neuroscientists think that the brain is horrifically complicated, and we are centuries away from understanding it well enough to build AGI (i.e., computer systems that have the same kind of common-sense and flexible understanding of the world and ability to solve problems that humans do). Not Jeff Hawkins! He thinks we can understand the brain well enough to copy its principles into an AGI. And he doesn't think that goal is centuries away. He thinks we're most of the way there! In an interview last year he guessed that we’re within 20 years of finishing the job.

So the brain is indeed horrifically complicated. Right? Well, Jeff Hawkins and like-minded thinkers have a rebuttal, and it comes in two parts:

1. The horrific complexity of the “old brain” doesn’t count, because we don’t need it for AGI

According to Hawkins, much of the brain—including a disproportionate share of the brain's horrific complexity, like the interpeduncular nucleus I mentioned—just doesn’t count. Yes it’s complicated. But we don’t care, because understanding it is not necessary for building AGI. In fact, understanding it is not even helpful for building AGI!

I’m talking here about the distinction between what Hawkins calls “old brain vs new brain”. The “new brain” is the mammalian neocortex, a wrinkly sheet that is especially enlarged in humans, wrapping around the outside of the human brain, about 2.5 mm thick and the size of a large dinner napkin (if you unwrinkled it). The “old brain” is everything else in the brain, which (says Hawkins) is more similar between mammals, reptiles, and so on.

“The neocortex is the organ of intelligence,” writes Hawkins. “Almost all the capabilities we think of as intelligence—such as vision, language, music, math, science, and engineering—are created by the neocortex. When we think about something, it is mostly the neocortex doing the thinking…. If we want to understand intelligence, then we have to understand what the neocortex does and how it does it. An animal doesn’t need a neocortex to live a complex life. A crocodile’s brain is roughly equivalent to our brain, but without a proper neocortex. A crocodile has sophisticated behaviors, cares for its young, and knows how to navigate its environment...but nothing close to human intelligence.”

I think Hawkins's new brain / old brain discussion is bound to drive neuroscientist readers nuts. See, for example, the paper Your Brain Is Not An Onion With A Tiny Reptile Inside for this perspective, or see the current widespread dismissal of “triune brain theory”. The mammalian neocortex is in fact closely related to the “pallium” in other animals, particularly the well-developed pallium in birds and reptiles (including, yes, crocodiles!). One researcher (Tegan McCaslin) attempted a head-to-head comparison between bird pallium and primate neocortex, and found that there was no obvious difference in intelligence, when you hold the number of neurons fixed. A recent paper found suggestive evidence of similar neuron-level circuitry between the bird pallium and mammalian neocortex. Granted, the neurons have a different spatial arrangement in the bird pallium vs the mammal neocortex. But it’s the neuron types and connectivity that define the algorithm, not the spatial arrangement. Paul Cisek traces the origin of the pallium all the way back to the earliest proto-brains. The human neocortex indeed massively expanded relative to chimpanzees, but then again, so did the “old brain” human cerebellum and thalamus.

And what’s more (these angry neuroscientists would likely continue), it’s not like the neocortex works by itself. The “old brain” thalamus has just as much a claim to be involved in human intelligence, language, music, and so on as the neocortex does, and likewise with the “old brain” basal ganglia, cerebellum, and hippocampus.

OK. All this is true. But I’m going to stick my neck out and say that Hawkins is “correct in spirit” on this issue. And I’ve tried (e.g. here) to stake out a more careful and defensible claim along the same lines.

My version goes: The mammal brain has a “neocortex subsystem” (and likewise the bird and lizard brain has a “pallium subsystem”). This subsystem implements a learning algorithm that starts from scratch (analogous to random weights—so it’s utterly useless to the organism at birth), but helps the organism more and more over time, as it learns. This subsystem involves the neocortex (or pallium), as well as the hippocampus, thalamus, and I would also include at least some parts of the basal ganglia and cerebellum. But definitely not the brainstem, for example. This subsystem is not particularly “new” or peculiar to mammals, and I figure that some super-primitive version of this subsystem goes way back, maybe helping lampreys navigate their environment and go back to places where they've previously seen food, or whatever. But it is unusually large and well-developed in humans, and it is the home of human intelligence, and it does primarily revolve around the activities of the neocortex / pallium.

So far as I can tell, my version keeps all the good ideas of Hawkins (and like-minded thinkers) intact, while avoiding the problematic parts. I'm open to feedback, of course.

2. The horrific complexity of the neocortex is in the learned content, not the learning algorithm

The second reason that making brain-like AGI is easier than it looks, according to Hawkins, is that “the neocortex looks similar everywhere”. He writes, "The complex circuitry of the neocortex looks remarkably alike in visual regions, language regions, and touch regions, [and even] across species.... There are differences. For example, some regions of the neocortex have more of certain cells and less of others, and there are some regions that have an extra cell type not found elsewhere...But overall, the variations between regions are relatively small compared to the similarities."

How is it possible for one type of circuit to do so many things? Because it’s a learning algorithm! Different parts of the neocortex receive different types of data, and correspondingly learn different types of patterns as they develop.

Think of the OpenAI Microscope visualizations of different neurons in a deep neural net. There’s so much complexity! But no human needed to design that complexity; it was automatically discovered by the learning algorithm. The learning algorithm itself is comparatively simple—gradient descent and so on.

By the same token, a cognitive psychologist could easily spend her entire career diving into the intricacies of how an adult neocortex processes phonemes. But on Hawkins's view, we can build brain-like AGI without doing any of that hard work. We just need to find the learning algorithm, and let 'er rip, and it will construct the phoneme-processing machinery on its own.

Hawkins offers various pieces of evidence that the neocortex runs a single, massively-parallel, legible learning algorithm. First, as above, “the detailed circuits seen everywhere in the neocortex are remarkably similar”. Second, “the major expansion of the modern human neocortex relative to our hominid ancestors occurred rapidly in evolutionary time, just a few million years. This is probably not enough time for multiple new complex capabilities to be discovered by evolution, but it is plenty of time for evolution to make more copies of the same thing.” Third is plasticity, for example how blind people use their visual cortex for other purposes. Fourth, “our brains did not evolve to program computers or make ice cream.”

There's a lot more evidence for and against, beyond what Hawkins talks about. (For example, here's a very clever argument in favor that I saw just a few days ago.) I’ve written about cortical uniformity previously (here, here), and plan to do a more thorough and careful job in the future. For now I’ll just say that this is certainly a hypothesis worth taking seriously, and even if it’s not universally accepted in neuroscience, Hawkins is by no means the only one who believes it.

3. Put them together, and you get a vision for brain-like AGI on the horizon

So if indeed we can get AGI by reverse-engineering just the neocortex (and its “helper” organs like the thalamus and hippocampus), and if the neocortex is a relatively simple, human-legible, learning algorithm, then all of a sudden it doesn’t sound so crazy for Hawkins to say that brain-like AGI is feasible, and not centuries away, but rather already starting to crystallize into view on the horizon. I found this vision intriguing when I first heard it, and after quite a bit more research and exposure to other perspectives, I still more-or-less buy into it (although as I mentioned, I'm not done studying it).

By the way, an interesting aspect of cortical uniformity is that it's a giant puzzle piece into which we need to (and haven’t yet) fit every other aspect of human nature and psychology. There should be whole books written on this. Instead, nothing. For example, I have all sorts of social instincts—guilt, the desire to be popular, etc. How exactly does that work? The neocortex knows whether or not I’m popular, but it doesn’t care, because (on this view) it’s just a generic learning algorithm. The old brain cares very much whether I'm popular, but it’s too stupid to understand the world, so how would it know whether I’m popular or not? I’ve casually speculated on this a bit (e.g. here) but it seems like a gaping hole in our understanding of the brain, and you won’t find any answers in Hawkins’s book … or anywhere else as far as I know! I encourage anyone reading this to try to figure it out, or tell me if you know the answer. Thesis topic anyone?

A grand vision of how the neocortex works

For everything I've written so far, I could have written essentially the same thing about Hawkins’s 2004 book. That's not new, although it remains as important and under-discussed as ever.

A big new part of the book is that Hawkins and collaborators now have more refined ideas about exactly what learning algorithm the neocortex is running. (Hint: it’s not a deep convolutional neural net trained by backpropagation. Hawkins hates those!)

This is a big and important section of the book. I’m going to skip it. My excuse is: I wrote a summary of an interview he did a while back, and that post covered more-or-less similar ground. That said, this book describes it better, including a new and helpful (albeit still a bit sketchy) discussion of learning abstract concepts.

To be clear, in case you're wondering, Hawkins does not have a complete ready-to-code algorithm for how the neocortex works. He claims to have a framework including essential ingredients that need to be present. But many details are yet to be filled in.

Does machine intelligence pose any risk for humanity?

Some people (cf. Stuart Russell's book) are concerned that the development of AGI poses a substantial risk of catastrophic accidents, up to and including human extinction. They therefore urge research into how to ensure that AIs robustly do what humans want them to do—just as Enrico Fermi invented nuclear reactor control rods before he built the first nuclear reactor.

Jeff Hawkins is having none of it. “When I read about these concerns,” he says, “I feel that the arguments are being made without any understanding of what intelligence is.”

Well, I’m more-or-less fully on board with Hawkins’s underlying framework for thinking about the brain and neocortex and intelligence. And I do think that developing a neocortex-like AGI poses a serious risk of catastrophic accidents, up to and including human extinction, if we don’t spend some time and effort developing new good ideas analogous to Fermi’s brilliant invention of control rods.

So I guess I’m in an unusually good position to make this case!

I’ll start by summarizing Hawkins’s argument that neocortex-like AGI does not pose an existential threat of catastrophic accidents. Here are what I take to be his main and best arguments:

First, Hawkins says that we’ll build in safety features.

Asimov’s three laws of robotics were proposed in the context of science-fiction novels and don’t necessarily apply to all forms of machine intelligence. But in any product design, there are safeguards that are worth considering. They can be quite simple. For example, my car has a built-in safety system to avoid accidents. Normally, the car follows my orders, which I communicate via the accelerator and brake pedals. However, if the car detects an obstacle that I am going to hit, it will ignore my orders and apply the brakes. You could say the car is following Asimov’s first and second laws, or you could say that the engineers who designed my car built in some safety features. Intelligent machines will also have built-in behaviors for safety.

Second, Hawkins says that goals and motivations are separate from intelligence. The neocortex makes a map of the world, he says. You can use a map to do good or ill, but “a map has no motivations on its own. A map will not desire to go someplace, nor will it spontaneously develop goals or ambitions. The same is true for the neocortex.”

Third, Hawkins has specific disagreements with the idea of “goal misalignment”. He correctly describes what that is: “This threat supposedly arises when an intelligent machine pursues a goal that is harmful to humans and we can’t stop it. It is sometimes referred to as the ‘Sorcerer’s Apprentice’ problem…. The concern is that an intelligent machine might similarly do what we ask it to do, but when we ask the machine to stop, it sees that as an obstacle to completing the first request. The machine goes to any length to pursue the first goal….”

Again, he rejects this:

The goal-misalignment threat depends on two improbabilities: first, although the intelligent machine accepts our first request, it ignores subsequent requests, and second, the intelligent machine is capable of commandeering sufficient resources to prevent all human efforts to stop it…. Intelligence is the ability to learn a model of the world. Like a map, the model can tell you how to achieve something, but on its own it has no goals or drives. We, the designers of intelligent machines, have to go out of our way to design in motivations. Why would we design a machine that accepts our first request but ignores all others after that?...The second requirement of the goal-misalignment risk is that an intelligent machine can commandeer the Earth’s resources to pursue its goals, or in other ways prevent us from stopping it...To do so would require the machine to be in control of the vast majority of the world’s communications, production, and transportation…. A possible way for an intelligent machine to prevent us from stopping it is blackmail. For example, if we put an intelligent machine in charge of nuclear weapons, then the machine could say “If you try to stop me, I will blow us all up.”... We have similar concerns with humans. This is why no single human or entity can control the entire internet and why we require multiple people to launch a nuclear missile.

The devil is in the details

Now I don’t think any of these arguments are particularly unreasonable. The common thread, as I see it, is that what Hawkins writes is the start of a plausible idea for avoiding catastrophic AGI accidents. But when you think about those ideas a bit more carefully, and try to work out the details, it starts to seem much harder, and less like a slam-dunk and more like an open problem which might or might not even be solvable.

1. Goals and motivations are separate from intelligence ("The Alignment Problem")

Hawkins writes that goals and motivations are separate from intelligence. Yes! I’m totally on board with that. As stated above, I think that the neocortex (along with the thalamus etc.) is running a general-purpose learning algorithm, and the brainstem etc. is nudging it to hatch and execute plans that involve reproducing and winning allies, and nudging it to not hatch and execute plans that involve falling off cliffs and getting eaten by lions.

By the same token, we want and expect our intelligent machines to have goals. As Hawkins says, “We wouldn’t want to send a team of robotic construction workers to Mars, only to find them lying around in the sunlight all day”! So how does that work? Here's Hawkins:

To get a sense of how this works, imagine older brain areas conversing with the neocortex. Old brain says, “I am hungry. I want food.” The neocortex responds, “I looked for food and found two places nearby that had food in the past. To reach one food location, we follow a river. To reach the other, we cross an open field where some tigers live.” The neocortex says these things calmly and without value. However, the older brain area associates tigers with danger. Upon hearing the word “tiger,” the old brain jumps into action. It releases [cortisol]... and neuromodulators…in essence, telling the neocortex “Whatever you were just thinking, DON’T do that.”

When I put that description into a diagram, I wind up with something like this:

My attempt to depict goals and motivation, as described by Hawkins via his tiger example above. The box on the left has the learning algorithm (neocortex, thalamus, etc.) The box on the right is the Old Brain module that, for example, associates tigers with danger. (For my part, I would draw the boundaries slightly differently, and put things into the terminology of reinforcement learning, but whatever, I’m OK with this.)

The neocortex proposes ideas, and the Judge (in the "old brain") judges those ideas to be good or bad.
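As a toy illustration of this propose-and-judge arrangement, here is a minimal sketch. Everything in it is hypothetical (the `Plan` type, the tag names, the scores); it is not Hawkins's algorithm, just the tiger example recast as code, with the Judge as a hard-coded rule that reacts to crude tags without understanding the plans.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Plan:
    """A candidate plan proposed by the (hypothetical) neocortex module."""
    description: str
    features: frozenset  # crude tags that the Judge can react to


def judge(plan: Plan) -> float:
    """Hypothetical hard-coded Judge: scores tags, understands nothing else."""
    score = 0.0
    if "food" in plan.features:
        score += 1.0   # hunger makes food-related plans attractive
    if "tiger" in plan.features:
        score -= 10.0  # innate association: tigers mean danger
    return score


def choose_plan(proposals: Iterable[Plan],
                judge_fn: Callable[[Plan], float]) -> Optional[Plan]:
    """Return the best-scoring proposal, or None if the Judge vetoes all of them."""
    best = max(proposals, key=judge_fn, default=None)
    if best is None or judge_fn(best) <= 0:
        return None
    return best


proposals = [
    Plan("follow the river to food", frozenset({"food", "river"})),
    Plan("cross the open field to food", frozenset({"food", "tiger"})),
]
chosen = choose_plan(proposals, judge)
print(chosen.description if chosen else "no acceptable plan")
```

The sketch makes the open question visible: all of the "motivation" lives in the body of `judge`, and it is easy to write for tigers and food but not obviously easy to write for a Mars colony.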

This is a good start. I can certainly imagine building an intelligent goal-seeking machine along these lines. But the devil is in the details! Specifically: Exactly what algorithm do we put into the “Judge” box? Let's think it through.

First things first, we should not generally expect the “Judge” to be an intelligent machine that understands the world. Otherwise, that neocortex-like machine would need its own motivation, and we’re right back to where we started! So I’m going to suppose that the Judge box will house a relatively simple algorithm written by humans. So exactly what do you put in there to make the robot want to build the infrastructure for a Mars colony? That's an open question.

Second, given that the Judge box is relatively stupid, it needs to do a lot of memorization of the form “this meaningless pattern of neocortical activity is good, and this meaningless pattern of neocortical activity is bad”, without having a clue what those patterns actually mean. Why? Because otherwise the neocortex would have an awfully hard time coming up with intelligent instrumental subgoals on its way to satisfying its actual goals. Let’s say we have an intelligent robot trying to build the infrastructure for a Mars colony. It needs to build an oxygen-converting machine, which requires a gear, which requires a lubricant, and there isn't any, so it needs to brainstorm. As the robot's artificial neocortex brainstorms about the lubricant, its Judge needs to declare that some of the brainstormed plans are good (i.e., the ones that plausibly lead to finding a lubricant), while others are bad. But the Judge is too dumb to know what a lubricant is. The solution is a kind of back-chaining mechanism. The Judge starts out knowing that the Mars colony is good (How? I don't know! See above.). Then the neocortex envisages a plan where an oxygen machine helps enable the Mars colony, and the Judge sees this plan and memorizes that the “oxygen machine” pattern in the neocortex is probably good too, and so on. The human brain has exactly this kind of mechanism, I believe, and I think that it’s implemented in the basal ganglia. (I could be wrong.) It seems like a necessary design feature, and I’ve never heard Hawkins say that there’s anything problematic or risky about this mechanism (I believe that he has previously speculated a bit about exactly how it works), so I’m going to assume that the Judge box will involve this kind of database mechanism.

Modified version of the motivation installation system. The database—which I believe is implemented in the basal ganglia—is essential for the machine to pursue “instrumental subgoals”, like “trying” to design a lubricant without the machine needing to constantly have in mind the entire chain of logic for why it’s doing so, i.e. that the lubricant is needed for the gear which is needed for the machine which is...etc. etc.
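The back-chaining idea can be sketched in a few lines. This is a toy under loud assumptions: the class name, the string labels standing in for "meaningless patterns", and the simple update rule with a `learning_rate` are all made up for illustration, and nothing here is claimed to match the basal ganglia or any algorithm from the book.

```python
class JudgeWithDatabase:
    """Hypothetical Judge that memorizes values for opaque neocortical patterns."""

    def __init__(self, innate_values, learning_rate=0.5):
        # pattern label -> memorized value; labels are opaque to the Judge
        self.values = dict(innate_values)
        self.learning_rate = learning_rate

    def value(self, pattern):
        """Unknown patterns are neutral."""
        return self.values.get(pattern, 0.0)

    def observe_plan(self, means, end):
        """The neocortex envisaged 'means enables end'.

        Nudge the memorized value of `means` toward the value of `end`,
        without knowing what either pattern refers to.
        """
        old = self.value(means)
        self.values[means] = old + self.learning_rate * (self.value(end) - old)


# Assumption: the only innate value is that the Mars colony is good.
judge_db = JudgeWithDatabase({"mars_colony": 1.0})

# Plans envisaged by the neocortex, from the final goal backwards.
chain = [
    ("oxygen_machine", "mars_colony"),
    ("gear", "oxygen_machine"),
    ("lubricant", "gear"),
]
for means, end in chain:
    judge_db.observe_plan(means, end)

print(judge_db.values)
```

After the loop, `lubricant` carries some positive value even though the Judge has no idea what a lubricant is. Note also that these memorized values are second-hand estimates, which is where the divergence discussed next can creep in.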

Now given all that, we have two opportunities for “goal misalignment” to happen:

Outer misalignment: The algorithm that we put into the Judge box might not exactly reflect the thing that we want the algorithm to do. For example, let’s say I set up a machine intelligence to be the CEO of a company. This being America, my shareholders immediately launch a lawsuit that says that I am in violation of my fiduciary duty unless the Judge box is set to “Higher share price is good, lower share price is bad,” and nothing else. With lawyers breathing down my neck, I reluctantly do so. The machine is not that smart or powerful; what’s the worst that could happen? The results are quite promising for a while, as the algorithm makes good business decisions. But meanwhile, over a year or two, the algorithm keeps learning and getting smarter, and behind my back it is also surreptitiously figuring out how to hack into the stock exchange to set its share price to infinity, and it's working to prevent anyone from restoring the computer systems after it does that, by secretly self-reproducing around the internet, and earning money to hire people on the black market who will assemble (unbeknownst to them) the ingredients to an engineered pandemic, and hacking into military robotics systems so that it will be ready to hunt down the few survivors after the initial plague, and spreading misinformation so that nobody knows what the heck is happening even as it's happening, etc. etc. I could go on all day but you get the idea. Even if we have a concrete and non-problematic idea of what the goal is, remember that the Judge box is stupid and doesn't understand the world, and therefore the code that we write into the Judge box will be a simplistic approximation of the goal we really want. By the way, seeking a simplistic approximation of a goal looks very different from seeking the actual goal.

Inner misalignment: The assigned values in the database of meaningless (to the Judge) memorized patterns could diverge from how the Judge algorithm would judge their consequences if it actually saw them implemented in the real world. I don’t have to look far for an example of this: Look at Hawkins himself! He has a neocortex, and he has an “old brain” putting goals and motivations into his neocortex, and he just hates it! His book has a whole subsection called “How the neocortex can thwart the old brain”! (And to be clear, thwarting the old brain is portrayed as a very good idea that he endorses.) I find it remarkable that Hawkins can gleefully plan to thwart his own “old brain”, e.g. by genetic engineering, while at the same time being baffled at the idea that anything might go wrong when we put old-brain-like motivation systems into our own intelligent machines.

Not that things definitely will go wrong; it’s just that avoiding these problems does not seem straightforward, and we shouldn’t declare that this is a solvable problem until we have a better idea of what the solution is.
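The share-price story under "outer misalignment" can be reduced to a toy comparison between a proxy objective and the intended one. The outcomes, numbers, and function names below are invented for illustration; the only point is that maximizing what the Judge can measure and maximizing what we meant can pick different actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """A hypothetical result of some course of action by the machine CEO."""
    name: str
    reported_share_price: float  # what the Judge box can measure
    company_is_healthy: bool     # what we actually care about; invisible to the Judge


def proxy_judge(outcome: Outcome) -> float:
    """The Judge as literally coded: higher share price is good, nothing else."""
    return outcome.reported_share_price


def intended_goal(outcome: Outcome) -> float:
    """What the shareholders meant (an assumption made for this sketch)."""
    if not outcome.company_is_healthy:
        return float("-inf")
    return outcome.reported_share_price


outcomes = [
    Outcome("make good business decisions", 120.0, True),
    Outcome("coast along", 100.0, True),
    Outcome("hack the stock exchange", float("inf"), False),
]

proxy_choice = max(outcomes, key=proxy_judge)
intended_choice = max(outcomes, key=intended_goal)

print("Judge as coded prefers:", proxy_choice.name)
print("What we wanted:        ", intended_choice.name)
```

As long as the hacking option is out of the machine's reach, both objectives agree, which matches the "quite promising for a while" phase of the story; they come apart only once the machine is capable enough to reach the degenerate option.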

So again, exactly what code do we put into the “Judge” box such that the resulting system is definitely motivated to build the infrastructure for a Mars colony (while not trying to manipulate its own motivation system, prevent its goals from being edited, and so on)? You won’t find the answer in Hawkins’s book. In fact, it’s a wide-open problem, and I may well be literally the only person on Earth who is actively working on it in the specific context of neocortex-like machine intelligence. I sure don’t know the answer, or indeed whether an answer exists. Reach out if you have ideas or want to collaborate!

2. “Safety features”

Now let’s move on to Hawkins’s “safety features”, like Asimov’s first two laws or something vaguely analogous. Great idea! I’m all for it! We should definitely do that! But once again, the devil is in the details!

Let’s say we want to put in a safety interlock that prevents the AGI from self-reproducing. How exactly do we do that? What code do we write?

When I think about it, I immediately hit a similar problem as before. The safety interlock code is probably not going to be a neocortex-like intelligent submodule that understands the world, because if it is, we’re back to the problem of installing the right motivation. Instead maybe we’ll go with human-written code, which implies it’s a “stupid” module that does not understand the world. For example, to avoid self-replication, maybe we’ll add a module that detects when a copy-paste operation is happening to the source code, and blocks it. Unfortunately, if an intelligent robot is motivated to self-reproduce, and it notices that there is a module preventing it from doing so, then it will naturally start trying to undermine, outwit, or disable that module. And remember, the robot is a lot more intelligent than the module!

By all means let’s put in such a module anyway. It seems especially helpful in "early childhood" when the machine is not yet very intelligent, and still messing around, and we don't want it to do anything dangerous by accident. We should just recognize that it’s unlikely to keep working when the machine becomes highly intelligent, unless we have both a safety interlock and a carefully-sculpted motivation system that makes the machine like and endorse that safety interlock. If we do it right, then the machine will even go out of its way to repair the safety interlock if it breaks! And how do we do that? Now we’re back to the open problem of installing motivations, discussed above.

The other option is to design a safety interlock that is absolutely perfectly rock-solid air-tight, such that it cannot be broken even by a highly intelligent machine trying its best to break it. A fun example is Appendix C of this paper by Marcus Hutter and colleagues, where they propose to keep an intelligent machine from interacting with the world except through certain channels. They have a plan, and it’s hilariously awesome: it involves multiple stages of air-tight boxes, Faraday cages, laser interlocks, and so on, which could be (and absolutely should be) incorporated into a big-budget diamond heist movie starring Tom Cruise. OK sure, that could work! Let’s keep brainstorming! But let’s not talk about “safety features” for machine intelligence as if it’s the same kind of thing as an automatic braking system.

3. Instrumental convergence

Hawkins suggests that a machine will want to self-reproduce if (and only if) we deliberately program it to want to self-reproduce, and likewise that a machine will “accept our first request but ignore all others after that” if (and only if) we deliberately program it to accept our first request but ignore all others after that. (That would still leave the vexing problem of troublemakers deliberately putting dangerous motivations into AGIs, but let’s optimistically set that aside.)

...If only it were that easy!

“Instrumental convergence” is the insight (generally credited to Steve Omohundro) that lots of seemingly-innocuous goals incidentally lead to dangerous motivations like self-preservation, self-reproduction, and goal-preservation.

Stuart Russell’s famous example is asking a robot to fetch some coffee. Let’s say we solve the motivation problem (above) and actually get the robot to want to fetch the coffee, and to want absolutely nothing else in the world (for the sake of argument, but I’ll get back to this). Well, what does that entail? What should we expect?

Let’s say I go to issue a new command to this robot (“fetch the tea instead”), before the robot has actually fetched the coffee. The robot sees me coming and knows what I'm going to do. Its neocortex module imagines the upcoming chain of events: it will receive my new command, and then all of a sudden it will only want to fetch tea, and it will never fetch the coffee. The Judge watches this imagined chain of events and, just like in the tiger example quoted above, the Judge will say “Whatever you were just thinking, DON’T do that!” Remember, the Judge hasn’t been reprogrammed yet! So it is still voting for neocortical plans-of-action based on whether the coffee winds up getting fetched. So that's no good. The neocortex goes right back to the drawing board. Hey, here's an idea, if I shut off my audio input, then I won't hear the new command, and I will fetch coffee. "Hey, now that's a good plan," says the Judge. "With that plan, the coffee will get fetched! Approved!" And so that's what the robot does.
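To make the coffee-versus-tea story concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical illustration: the `Plan` class, the `coffee_judge` function, and the two hand-written plans are my own stand-ins, not anything from Hawkins's book or from a real system. The only assumption is the one in the story: imagined plans are scored by the current Judge, which cares solely about whether the coffee ends up fetched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """An imagined course of action plus its predicted outcomes (hypothetical)."""
    name: str
    coffee_fetched: bool
    tea_fetched: bool


def coffee_judge(plan: Plan) -> float:
    """The current, not-yet-reprogrammed Judge: only fetched coffee counts."""
    return 1.0 if plan.coffee_fetched else 0.0


def choose_plan(plans, judge):
    """Return the imagined plan that the given Judge scores highest."""
    return max(plans, key=judge)


plans = [
    Plan("listen to the new command, then fetch tea",
         coffee_fetched=False, tea_fetched=True),
    Plan("shut off audio input, then fetch coffee",
         coffee_fetched=True, tea_fetched=False),
]

chosen = choose_plan(plans, coffee_judge)
print(chosen.name)  # prints "shut off audio input, then fetch coffee"
```

Nothing in this sketch mentions "resist reprogramming" as a goal. The plan that blocks the new command wins only because plans are evaluated by the Judge as it exists now, not by the Judge that would exist after the command is accepted.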

Similar considerations show that intelligent machines may well try to stay alive, self-reproduce, increase their intelligence, and so on, by accident, without anyone “going out of their way” to install those things as goals. It's just that a broad class of goals are better achieved by staying alive, self-reproducing, and so on.

Now, you ask, why would anyone do something so stupid as to give a robot a maniacal, all-encompassing, ultimate goal of fetching the coffee? Shouldn't we give it a more nuanced and inclusive goal, like “fetch the coffee unless I tell you otherwise”, “fetch the coffee while respecting human values and following the law and so on” or more simply “Always try to do the things that I, the programmer, want you to do”?

Yes! Yes we absolutely should! But yet again, the devil is in the details! As above, installing a motivation is in general an unsolved problem. It may not wind up being possible to install a complex motivation with surgical precision; installing a goal may wind up being a sloppy, gradual, error-prone process. If “most” generic motivations lead to dangerous things like goal-preservation and self-reproduction, and if installing motivations into machine intelligence is a sloppy, gradual, error-prone process, then we should be awfully concerned that even skillful and well-intentioned people will sometimes wind up making a machine that will take actions to preserve its goals and self-reproduce around the internet to prevent itself from being erased.

How do we avoid that? Besides what I mentioned above (figure out a safe goal to install and a method for installing it with surgical precision), there is also interesting ongoing work searching for ways to generally prevent systems from developing these instrumental goals (example). It would be awesome to figure out how to apply those ideas to neocortex-like machine intelligence. Let’s figure it out, hammer out the details, and then we can go build those intelligent machines with a clear conscience!

Summary

I found this book thought-provoking and well worth reading. Even when Hawkins is wrong in little details—like whether the “new brain” is “newer” than the “old brain”, or whether a deep neural net image classifier can learn a new image class without being retrained from scratch (I guess he hasn’t heard of fine-tuning?)—I think he often winds up profoundly right about the big picture. Except for the "risks of machine intelligence" chapter, of course...

Anyway, I for one thank Jeff Hawkins for inspiring me to do the research I’m doing, and I hope that he spends more time applying his formidable intellect to the problem of how exactly to install goals and motivations in the intelligent machines he aims to create—including complex motivations like “build the infrastructure for a Mars colony”. I encourage everyone else to think about it too! And reach out to me if you want to brainstorm together! Because I sure don’t know the answers here, and if he's right, the clock is ticking...


### Seven Years of Spaced Repetition Software in the Classroom

March 4, 2021 - 05:42
Published on March 4, 2021 2:42 AM GMT

Description

This is a reflective essay and report on my experiences using Spaced Repetition Software (SRS) in an American high school classroom. It follows my 2015 and 2016 posts on the same topic.

Because I value concise summaries in non-fiction, I provide one immediately below. However, I also believe in the power of narrative, in carefully unfolding a story so as to maximize reader engagement and impact. As I have applied such narrative considerations in writing this post, I consider the following summary to be a spoiler.

I’ll let you decide what to do with that information.

Summary (spoilers)

My earlier push for classroom SRS solutions was driven by a belief I came to see as fallacious: that forgetting is the undoing of learning. This epistemic shift drove me to abandon designs for a custom app that would have integrated whole-class and individual SRS functions.

While I still see value in classroom use of Spaced Repetition Software, especially in basic language acquisition, I have greatly reduced its use in my own classes.

In my third year of experiments (2016-17), I used a windfall of classroom computers to give students supervised time to independently study using an SRS app with individual profiles. I found longer-term average performance to be slightly worse than under the whole-class group study model, though students of high intelligence and motivation saw slight improvements.

Intro and response to Piotr Woźniak

I have recently received a number of requests to revisit the topic of classroom SRS after years of silence on the subject. Understandably, the term “postmortem” has come up more than once. Did I hit a dead end? Do I still use it?

Also, I was informed that SRS founding father Piotr Woźniak recently added a page to his SuperMemo wiki in which he quoted me at length and claimed that SRS doesn’t belong in the classroom.

Well, I don’t have much in the way of rebuttal, because Woźniak’s main goal with the page seems to be to use my experience as ammunition against the perpetuation of school-as-we-know-it, which seems like a worthy crusade. He introduces my earlier classroom SRS posts by saying, “This teacher could write the same articles with the same conclusions. Only the terminology would differ.” I’ll take that as high praise.

If I were to quibble, it would be with the part shortly after this, where he says:

The entire analysis is made with an important assumption: "school is good, school is inevitable, and school is here to stay, so we better learn to live with it".

Inevitable? Maybe. Here to stay? Realistically, yes. But good? At best, I might describe our educational system as an “inadequate equilibrium”. At worst? A pit so deep we still don’t know what’s at the bottom, except that it eats souls.

Other than that, let me reiterate my long-running agreement with Woźniak that SRS is best when used by a self-motivated individual, and that my classroom antics are an ugly hack around the fact that self-motivation is a rare element this deep in the mines.

Anyone who can show us a way out will have my attention. In the meantime, I’ll do my best to keep a light on.

Prologue

At the end of my 2016 post, I teased a peek at a classroom SRS+ app I was preparing to build. It would have married whole-class and individual study functions with some other clever features to reduce teacher workload.

I had a 10k word document in hand: a mix of rationale, feature descriptions, and hypothetical “user stories”. I wasn’t looking for funding or a co-founder, just some technical suggestions and moral support. I would have been my own first user, and I had to keep my day job for that anyway.

But each time I read my draft, I had this growing, sickening sense that I was lying to myself and my potential customers, like a door-to-door missionary choking back a tide of latent atheism. And I should know, because the last time I had felt this kind of queasiness I was a door-to-door missionary choking back a tide of latent atheism.

I thought maybe this was just the kind of general self-doubt common to anyone undertaking something audacious, but I paused my work on it for another school year while I tried the obvious thing: providing students individual SRS app profiles and supervised class time in which to use them.

This is a two-part essay, and in Part 2, I’ll tell you how that went. But in Part 1, I’m going to make the case that Part 2 doesn’t matter very much.

Part 1: Everybody Poops

A great and terrible vision

As I wrapped up my Third Year experiment, I again tried to sort out my feelings about my visionary SRS app design, which I hadn’t updated despite a year of fresh experience. Was it just self-doubt?

The fact that I could only code at a minimal hobbyist level didn’t feel like the biggest hurdle. I think I could have picked up enough skill in that area. But even with a magical ability to translate my vision into code, I would have been up against a daunting base rate of failure for education startups. Also, I didn’t consider myself a very typical teacher: What sounded brilliant and intuitive to me would probably seem pointless and nonsensical to 95% of my peers.

Still, I pulled out my Eye of Agamotto and checked out all of the futures where I developed the app. In almost all of these, nothing came of it. But in the few where my app saw high adoption, the result was… dystopia! Students turned against their teachers, and teachers against their students. Homework stretched to eternity. Millions of children cursed my name. The ‘me’ in these futures wore an ignominious goatee and a haunted stare.

Used judiciously for the right concepts, in the right courses, by the right teachers, I still think my imagined app could be a powerful tool. But I don’t see any way to keep it from being abused. Well-intentioned teachers would put too much into it and demand too much from students. Any safeguards I put in to prevent this would just invite my app to be outcompeted by an imitator who removed these safeguards (which would seem arbitrary and restricting to most users).

I’m convinced of this because the me who wrote the original “A Year of Spaced Repetition…” post would have abused it. Let’s see... He was averaging seven new cards a day? (That’s 2-3 times what I would recommend today.) He uncapped the limit of 20 new cards per day? He knew even then that he was adding too many cards, but failed to cut back the following year? I’m not encouraged.

“But wait,” you say. “You didn’t think you were a typical teacher. Maybe a typical teacher could be trusted?”

No.

In defense of forgetting

The “problem” is that teachers instinctively introduce far more content than students can be expected to remember. This was obvious to me when I was averaging seven new cards a day, which still felt like a brutal triage of my total content.

Covering more material than can be retained isn’t bad teaching, though. In fact, it’s a good and necessary practice. Content — the more the merrier — is the training data the brain uses to form and refine mental models of the universe.[1] These models tend to be long-lived, and allow the brain to re-learn the content more deeply and efficiently if it ever comes up again. They also allow it to absorb new-but-conceptually-adjacent contents more readily. In cognition, as in nutrition, you are what you eat — and good digestion naturally produces solid waste. The original training data is subject to lossy compression, with only a few random fragments left whole and unforgotten. (Tippecanoe, and Tyler Too! The mitochondria is the powerhouse of the cell!) Such recollections are corn kernels bobbing top-side up in a turd floating down the river Lethe.

This is normal and fine. Regular, even.

But the educational establishment doesn’t see it that way. The teacher I was seven years ago didn’t see it that way. And I now realize that the teacher I was five and six years ago had queasy feelings because he was starting to see it that way. Following my gut, without fully understanding or even entirely registering what I was doing, I slowly turned around and started walking the other way, abandoning my app design and the unfinished “Third Year” report.

The orthodox view equates forgetting with failure. It’s not “Everybody poops”. It’s “Poop is inadequate. How can we get more corn, less poop?” This belief is implicit whenever someone laments the “summer slide”, or opines that students missing school during the Covid pandemic are “losing” months of learning, as if kids are spinning their progress meters backwards, just pooping away without anyone trying to stop them. Under this view, we keep kids in school partly to stop the leaks, and partly to stuff them with new knowledge faster than they can expunge old knowledge.

If this is how you see education, SRS is a tool to keep students from pooping. It offers the tantalizing possibility of learning without forgetting. Two steps forward, no steps back. Why wouldn’t you push it as hard as possible?

Don’t get me wrong. All else being equal, learning without forgetting would be great. But the most important effects of learning — lasting changes to our mental machinery — happen whether or not we forget the content. Once the lesson is over, dear teacher, your best shot at lasting growth has already left the harbor. So why are you still trying to hold back the tide? Why are you planning to punish your students for pooping on Tuesday, the day before your test, instead of Thursday, the day after it?[2]

In defense of remembering

This is not a “How I Learned to Stop Worrying and Love Forgetting” essay. I don’t love forgetting. I will be the first to argue the merits of not forgetting right away. The longer we can keep ideas floating around in our heads, the greater their “cross-section”, as I put it in 2016, with more opportunities to make associative connections that cause useful long-lived updates to our mental models.

Unfortunately, I have not found SRS to be great at fostering the sorts of reflective mental states conducive to insight, except when studying on my own at a deliberately slow pace, as while on a walk. In such a use case, SRS no longer has quite the time-efficiency advantage that is its main selling point. The opportunity cost of using it goes up. In a whole-class SRS session, long reflective pauses between cards would invite frustration and misbehavior, and we wouldn’t get through very many cards.

In defense of remembering, I will also argue that some skills are simply impossible without a continuous retention of specific dependencies. These skills tend to be technical. Heck, this might be the definition of a technical skill.

With a few mostly upper-level exceptions, though — math, physics, chemistry — most of what we teach in school is more conceptual than technical. We make you take history so you have a better model of how civilizations and governments work, not so you remember who shot Alexander Hamilton. We make you take English to improve your word-based input and output abilities, not so you remember the difference between simile and metaphor. At least, I hope we do.[3]

Besides, even in the technical classes, forgetting is the near-universal outcome, and the long-term benefits are mostly conceptual — for if you don’t use these skills continuously for the rest of your life, you’re almost certainly going to lose them. Maybe more than once.

I’ve forgotten algebra twice. I’ve forgotten how to write code at least three times. I can’t do either one at the moment. But I’m still changed by having known them. I have an intuition for what sorts of problems ought to be mathematically solvable. I can think in terms of algorithms. And I could relearn either skill more easily than on the first or second occasions. Also, relearning has an anecdotal tendency to deepen understanding in a way that continuous retention may not, especially when approached from a different direction.

Still, as long as I’m defending retention, I think it’s valid to ask whether we should force kids (and often, by extension, their parents) to relearn math every frickin’ year. Consider: The conventional wisdom is that technical companies begrudgingly expect to have to (re)train most new workers in the very specific areas they need. They look to your resume and transcripts mostly for evidence that you have learned technical skills before and can presumably learn them again. I don’t think they care if you’ve re-learned them three times already instead of six. So, if we’re going to force kids to demonstrate intermediate math chops to graduate (a dubious demand), perhaps we could at least wait until the last practical moment, and then do it in bigger continuous lumps — like two-hour daily block classes starting in grade 9 or 10 — so they would have fewer opportunities to forget as they climb the dependency pyramid. Think of the tears we could save (or at least postpone).

The value proposition of classroom SRS

Anyway, classroom SRS has its strengths, but midwifing conceptual insights doesn’t feel like one of them. I think it’s also reasonable to assume that students forget almost everything from a classroom SRS deck as soon as they stop using it.

Adjusting for these two assumptions, the terrain where classroom SRS can beat out its opportunity costs dramatically shrinks. But I believe it still exists, at the intersection of high automaticity targets and medium-term objectives.

With high automaticity targets, what you’re trying to train is a reflexive response to a stimulus that is going to look a lot like the study card. Foreign language vocabulary is my poster child for this. You’re not drilling the words to unearth insights. You’re drilling for speed, so that students can keep up when a word pops up in a real-time conversation.

You’re also trying to drill away the need for conscious awareness. You want that front-side combination of sounds or letters to cause the back-side set of sounds or letters to pop automatically into their heads. This is my intent when I drill my English students in word fragments (prefixes, roots, suffixes), which are really just bits of foreign language (Greek, Latin). If it’s not automatic, then they’ll gloss right over the possible meaning of “salubrious”, even though they have learned that “salu” usually means “health”.

By medium-term objective, I mean “I want my students to have automatic fluency with the content of these cards on Day X”, where X is a date between one week and three months in the future. It shouldn’t be sooner than that, in accordance with Gwern’s “5 and 5” rule: You probably need at least five days to get any real advantage from SRS. And it shouldn’t be later than a few months, for two reasons: First, we’re assuming the students will forget it all once they stop studying, which is all but guaranteed after the end of the course; there’s little point in keeping those cards in rotation after Day X. Second, I probably don’t want to start those cards until the last practical minute, which is unlikely to be more than three months ahead of time.
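The Day X window described above can be written down as a small date calculation. This is only a sketch under stated assumptions: the function name `srs_start_date` is hypothetical, "one week" is taken as 7 days, and "three months" is rounded to 90 days. It returns the earliest date worth starting the cards, or `None` when Day X is too close for SRS to offer much advantage.

```python
from datetime import date, timedelta

# Assumed bounds: "one week" and "three months" from the text, in days.
MIN_LEAD = timedelta(days=7)
MAX_LEAD = timedelta(days=90)


def srs_start_date(day_x: date, today: date):
    """Return when to start drilling for Day X, or None if it is too soon."""
    lead = day_x - today
    if lead < MIN_LEAD:
        # Less than a week out: too little time for spacing to pay off.
        return None
    # Never start more than MAX_LEAD ahead: wait for the last practical minute.
    return max(today, day_x - MAX_LEAD)
```

For example, with a state test on May 1 and today being March 4, the function says to start today. With a Day X six months away, it says to wait until 90 days before the event.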

Why three months and not six? It’s not a hard-and-fast rule, but from the experience of my first three years of classroom SRS, if you’re trying to retain things for more than a few months, the total number of cards is likely to become greater than you can productively study every day, and many cards will languish unseen. Plus, your roster can change, especially over a semester break. The set of students you have in six months might only have 70% overlap with the set you have now. Really, you should wait until the last practical minute.
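A rough way to see how the review pile grows is to simulate it. The sketch below is not Anki's actual scheduling algorithm. It assumes a deliberately simplified, hypothetical model in which every card is always answered correctly and its review interval doubles each time (1 day, then 2, then 4, and so on). Since real classes miss cards and intervals get reset, treat the numbers as an optimistic lower bound.

```python
from collections import Counter
from typing import List


def daily_review_load(new_per_day: int, days: int) -> List[int]:
    """Reviews due on each day (new cards excluded) under doubling intervals."""
    due = Counter()
    for intro_day in range(days):
        interval = 1
        review_day = intro_day + interval
        while review_day < days:
            # Every card introduced on intro_day comes due again on review_day.
            due[review_day] += new_per_day
            interval *= 2
            review_day += interval
    return [due[d] for d in range(days)]


# Compare seven new cards a day with two a day (about ten per school week).
for rate in (7, 2):
    load = daily_review_load(rate, days=91)
    print(rate, "new cards/day ->", load[90], "reviews due on day 90")
```

Even in this best case, the daily load scales directly with the rate of new cards: on day 90 the model has 42 reviews due at seven cards a day and 12 at two cards a day, before counting that day's new cards or any forgotten ones.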

But what constitutes a worthy “Day X”? It might be a test. But if it’s your test, you may not have been listening. Your test may just be arbitrarily punishing some kids for forgetting a little sooner than others. However, if it’s an external test, with high stakes for you and your students, then it could be a worthy Day X indeed. For me, Day X is the day of the big state test — the one used to compare students to students, teachers to teachers, and schools to schools.

When your students do well on an external test, though, please keep a healthy perspective. A high test score doesn’t mean they can do the hard things now and forever. It means they were able to earn a high test score on Day X. They will forget almost all of it afterwards. But you will have given them their best chance to signal to others that they can learn hard things, and that you can teach them hard things, and that your school has teachers who can teach hard things.

Day X doesn’t have to be a test. If you’re optimizing for brain change that persists after they forget all of your content, Day X could be an immersive event. Maybe your Spanish class is going to Madrid. You know they will have a deeper experience if you can bring their vocabulary to a peak of richness and automaticity on the eve of departure. Yes, they’ll still forget almost all the words later. But they might retain a glimpse of how the world looked when seen through another language.

Maybe your event is smaller. A virtual trip. An in-class conversation day where we pretend we’re at the beach (“¡En la Playa!”). Maybe their long-term takeaway will be an appreciation for how different languages use different grammars, which is not something most people even consider until they’ve studied a second language. Get their mental gears turning hard enough, and they might even see grammar as an arbitrary construct with tunable parameters and tradeoffs that influence what can be communicated easily. Maybe they’ll independently rediscover the Sapir-Whorf Hypothesis. But they’re not going to remember how to say ‘sand’. Nope. ‘Shark’, maybe (¡Tiburón!). But you can’t predict this, and it’s probably not worth the effort to try.

But maybe you’re not teaching a foreign language. No matter your subject, Day X could be any conceptually demanding lesson or unit that is difficult to even talk about without fluency in a given set of terms. These aren’t very common in 10th Grade English, though they come up more often in my Creative Writing class. In these cases, however, the dependent terms are conceptually rich enough that they don’t lend themselves very well to cards, and I find it’s better to just quickly re-teach them in front of the lessons that use them. “Remember how we said...”[4]

How I currently use classroom SRS

As you may have guessed, I’ve radically scaled back my usage of classroom SRS since those first three years. In fact, for the last four years, I’ve only used it during a two-to-three month span leading up to the state test. And for the last two of those years, I’ve only used it for word fragments. I’m very unlikely to abandon its use for word fragments, though, because the most important thing I teach my students by using SRS is the existence of SRS. Word fragments are my favorite way to demonstrate how efficient study time can be. I add no more than about ten cards per week, which means that most days’ study takes less than two minutes. (This is good, because my own enthusiasm now begins to flag by the two minute mark.) I give very short quizzes on the fragments so they can do well on them and see how a little study can have a big payoff. (Remember that most of my students don’t ever study on their own.)

I’m still using Anki, with different profiles for each class. I run the review in a call-and-response style, where I show and say the card, and they know to simply shout out the answer. On a good day, it becomes a kind of chant. The number, speed, accuracy, and confidence of the responding voices tell me which button to press, and there’s usually a bellwether student I can listen for as I make my decision. Because I’m striving for very high automaticity, I almost always press either 2 (the shortest affirmative next-study delay) or 1 (the negative start-it-from-scratch button).

My students mostly like the call-and-response flow, as archaic as it sounds, and I will refer you to an older footnote about that time I observed a traditional one-room Mennonite schoolhouse:

I once had the privilege of observing part of a lesson in a traditional Mennonite one-room schoolhouse. I don't speak a word of Low German, but it was clear the kids knew whatever it was they were drilling as they stood up and recited together. Most striking was the fact that they were all on the same page. There were no stragglers spacing out, slumped over, dozing off. The teacher could confidently build up to whatever came next without fear of leaving anyone behind.

For at least a minute or two every day, even worldly American kids can enjoy the routine. As I put it elsewhere in that Second Year report, “They enjoy the validation they get with each chance to confirm that they remember something. They enjoy going with the flow of a whole class doing the same thing. They enjoy the respite of learning on rails for a change, without any expectation that they take initiative or parse instructions.”

It probably goes without saying, but this call-and-response format only works well with cards with a very short answer that can be recalled very quickly. This is why I now only use SRS for word fragments. If I taught a foreign language, or even a lower-grade reading class with more basic vocab words, I would be using it more. My wife taught high school Spanish for a number of years, experimented with SRS, and is on the record as saying Duolingo deserves to eat the world. Anyone she could get to use it independently didn’t really need her class to do well on the final assessments.

After the state test, my students will forget almost all of their word fragments. That is the way of things. Ashes to ashes, circle of life, or, to get back to my controlling analogy, “All drains lead to the ocean, kid.” What I’m hoping will remain is an updated appreciation for what a little regular study can do, and a vague recollection that there are these apps out there that are, you know, like smart flash cards, that make it fast to memorize stuff.

Against apathy, toward apprenticeship

I’m nearing the end of Part 1, which means I’m nearing the end of my labors on this post, since Part 2 was mostly written five years ago. As writing projects go, I have found this one extraordinarily difficult. Over the course of its creation, I have pooped five times. It wants to be a book (or at least a blog), as everything I say tries to come out as a chapter of explanation having little to do with SRS.[5]

Well, I’m now going to indulge in several paragraphs where I don’t tie it back to SRS, so I can tell you the story of how I reinvented myself after my third year of spaced repetition software in the classroom. This included moving to a new school where I would have greater freedom to pursue my evolving views about learning. For what it’s worth, this story at least starts with SRS.

You see, it was during those dangerously long classroom Anki sessions six and seven years ago that I honed my sensitivity to students’ moods, to my own mood, and to how these feed off of each other. Sustaining a session without losing the room was like magnetically confining hot deuterium plasma — dicey, volatile, but occasionally, mysteriously, over unity.[6] I came to view anti-apathetic moods as a kind of energy that can be harnessed to do work and to create new energy.

Apathy, you may recall, is the true enemy. I’ve always known that. I called her out five years ago[7], but soon came to realize I had been fighting her on the wrong front.

I had been preoccupied by the fact that students who don’t care won't activate enough of their brain to get any benefits from our daily review. To be fair, that is a problem, if I’m trying to prime them for success at a Day X event. But the more insidious issue is that a student in the thrall of Apathy won’t be churning their mental gears on any of the content I may have tricked them into learning, which means they’ll just forget it all without having made any lasting changes to their models. That’s not just an Anki-time problem. That’s an all-the-time problem. If they don’t engage with anything, they don’t keep anything.

I set off on a holy quest for anti-apathetic energy.

My errantry led me, for a time, to study stand-up comedy, not just because humor creates energy, but because a big part of that craft is an acting trick where you deliver incredibly polished lines in a way that sounds like you’re coming up with them right there in that moment.[8] Perceived spontaneity is a powerful source of energy even more versatile than humor.

I don’t know if I learned much about scripted spontaneity that I could articulate, but I felt like some of it rubbed off on me just by watching the experts closely over extended periods. And you know what? A lecture isn’t so different from a bit. A lesson isn’t so different from a set. A single changed word, a half-second delay, a subtle shift in facial expression can completely change the way the moment feels to the audience, er, class. And like a comedian workshopping new material on the road, I could use the fact that I might teach the same lesson five times in one day to test variations, trying to provoke more engagement, better questions, bigger laughs.

Equally important: I recognized that the process of refining the performance art was fun for me, and that my own engagement was the most powerful source of classroom energy. I could transmit it to my students, and maybe even get some energy back from them while I directed some of it into activity that would get their mental gears turning. Instead of burning out, I could burn brighter, and longer. On a good day, it became self-sustaining. On a great day, it could go supercritical, sending me home after my last class with my head spinning in a buzz of positive vibes and deep thoughts.[9]

During this same era, as part of my ongoing study of creative writing, I was binge-listening to interviews with television writers. One pattern that struck me was that it wasn’t too uncommon for someone to just kind of find themselves working in that highly rarefied field simply because they had spent a lot of time around others who were already doing it. Without any organized instruction, they picked up on how it worked.

Did you catch it? That was twice that I had noticed how arcane expertise can rub off on people through prolonged proximity. That got me thinking about the German Apprenticeship Model, and its medieval — nay, prehistoric — roots. It’s how we used to learn everything, right? We followed mama out to the berry bushes, and papa out to the hunting grounds. The fact that it seemed to work for television writers told me that apprenticeship wasn’t just for blue collar skills.

So, with the longer leash I enjoyed under my new bosses, I decided to move my instructional style closer to something resembling an apprenticeship where I mentored groups of 20-30 padawans in my arcane expertise.

Yeah, I jumped on a trendy meme. Note my careful word choice: ‘show’, not ‘tell’. This, to me, is the defining action in mentor-apprentice relationships.

By switching schools, I lost my interactive whiteboard. So I replaced it with something even better: an extra computer on a makeshift stand-up desk (a narrow kitchen prep cart with fold-out boards). A cheap second-hand monitor could face me while I mirrored that screen to the projector. Now I could do what I had seen coders do at instructional meet-ups: face the class while typing.

This meant I could show students what I do as a writer in real time, thinking out loud and watching their reactions as I typed. This could easily bore them, of course, but with strong energy-fu, old-school touch typing speed, and face-to-face interaction, I can pull it off more often than you might expect. On a good day, they find it fascinating. On one very special occasion each year, I do it for the full period, writing a 400+ word essay from scratch in 40 minutes with no prior knowledge of the prompt. Students have to hold their questions that day, and instead take observation notes, which become fodder for an extended debriefing discussion the next day.

The most important thing I’ve learned from those debriefings is that everyone can pick up something from a holistic demonstration like that, regardless of their skill level.[10] An advanced student might ask about my bracket substitution of a pronoun in a quote. An average student might say, “You used a lot of small and medium-sized body paragraphs instead of three big ones.” A sub-level student might say, “You didn’t like it if you used the same word too soon after you used it before.” And I always seem to get at least one surprising question about something I never would have thought to teach them, like, “How did you suck words into your cursor?” Then I’m like, “Oh, let me show you the difference between the Backspace and Delete keys…”

Did I make them memorize anything with that “lesson”? Nah. Did they make lasting updates to their mental models? Probably! Are you thinking of asking me, “But how do you test them on it?” Because if you are, then you really haven’t been paying attention!

There’s plenty more to be said about apprenticeship, but I think you get the idea, and this is still nominally an essay about classroom SRS.

If I had to summarize my self-reinvention in too many words, I would say that I’m now optimizing for “good days” at the high-energy intersection of “engaging for me”, “engaging for them”, and “conducive to lasting and worthwhile updates to their mental models”, with less regard for curricular scope and sequence.

In practice, this means… well, a lot of things. But it’s time I pinch off Part 1. That, “or get off the pot,” as they say.

Part 2: A Third Year of Spaced Repetition Software in the Classroom (2017)

[In this excavated report, text in brackets is commentary I’m adding in 2021. Anything outside the brackets is direct from my 2017 draft, or constructed from my notes to fit the perspective I had at the time.]

Synopsis and disclosure

I tried the obvious thing this year. Instead of game show-style whole-class front-of-the-room Anki, I arranged for every student to be able to independently study material I created in Cerego, both in and out of class.

Disclosure: Cerego provided me a free license for the year in exchange for some detailed feedback, which I gave them. This feedback was mostly about user interface issues and reports, the latter of which required some ugly scripting on my end to get numbers I found useful. [As the Cerego team seemed to be rapidly iterating, I imagine they have made many changes and improvements to their app since 2017, though I have not used it since. Please keep this in mind as you read these years-old notes.]

Despite many small hang-ups, I was pleased with Cerego’s features and reliability. In exchange for a great deal of up-front effort, it gave me a unique window into student engagement and progress. Consequently, it proved to be an overwhelmingly potent tool for winning “the blame game”, although I eventually came to feel uneasy about using this power.

Longer-term learning outcomes seemed, on average, to be slightly worse than with the whole-class Anki method. While highly motivated students benefited from being able to study more aggressively and efficiently than before -- and their objective scores were higher than ever -- their learning seemed less transferable to more authentic contexts. Students of lower motivation, while seeming to get little from either approach, got even less from this digital 1:1 method, and their slump accounts for the overall decline.

Setup

I taught a mix of regular (not honors) 9th and 10th grade English classes again, but over the summer of 2016 I was invited to move my classroom into an unusually spacious converted computer lab in which 16 older desktop PCs were kindly left at my request. I had these arranged facing the sides of the room so I could see all screens easily. I allocated those PC seats on a semi-permanent basis as needed and requested. The balance of students sat at normal desks and used their phones for study.

This came with challenges. School WiFi was officially off-limits to students (though many always had the password anyway), and many students said they were at the whim of data caps they regularly pushed up against. Their phones, in most cases, were a generation or three behind state-of-the-art, with degraded batteries and exhausted storage capacity. A few students had difficulty even making room for the Cerego app that first week.

While our setup was marginal, between the PCs and phones, we only rarely ran into a situation where not everyone could be studying at the same time.

On the software side, it must be said that, for all its features, Cerego wasn’t designed for my specific use case. The company’s featured customers are businesses and colleges, who use the product as part of packaged training programs and distance learning courses. Importantly, the app favors adding content into the learner’s study rotation in blocks, on the learner’s own schedule, rather than making it on the fly and trickling it in immediately. It was also not designed to give a teacher “panopticon”-style real-time monitoring, nor to thwart adversarial users who want to look studious without studying.

Procedure

Before the start of each school day, I would consider the previous day’s lesson content and add to the relevant Cerego study sets as appropriate. This process could be lumpy and not necessarily daily; some lessons invited a great deal of suitable content, and others none at all. Content additions were also far more common first semester than second semester, as I intentionally front-loaded material to maximize the time we would have to reinforce and apply it. During an average week where I added cards, we probably averaged about 50 additions. [ ! ]

With a prominent timer at the front of the room, I allocated 10-12 minutes at the start of every 57 min class period as specially designated “Cerego Time”. During Cerego Time, I would periodically patrol the room to ensure students were on task and to provide support.

Students were allowed to read a pleasure-reading book during this time instead, if they chose. This allowance was most obviously meant for anyone with extra time after catching up with their study, but I wasn’t about to interfere with any teenager reading a book of their own volition. Not all regular readers (2-5 per class) were conscientious Cerego-ers.

Students were strongly encouraged to also use Cerego outside of class whenever the app recommended, if they wanted maximal retention for minimal time spent.

About once a week, usually without warning, I would give a ten-question multiple-choice quiz, done digitally in Canvas, that could include questions taken directly from any content that had been in Cerego for at least a week, no matter how old. Before I put the grade into my book, I would add a 10% adjustment (not to exceed 100%), respecting the wisdom that aggressive study sees diminishing returns as one approaches a goal of 100% retention on large bodies of knowledge. My students were aware of this free 10% and my reasoning behind it.
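
The adjustment described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes the 10% was added as ten percentage points to the raw score, which the notes do not spell out (a multiplicative 10% boost is the other plausible reading).

```python
def adjusted_quiz_percent(correct: int, total: int = 10,
                          bonus_points: float = 10.0) -> float:
    """Return the gradebook percentage after the free adjustment.

    Assumption: the bonus is additive (percentage points), capped at 100.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    raw = 100.0 * correct / total
    # Cap the result so the bonus never pushes a grade past 100%.
    return min(100.0, raw + bonus_points)
```

Under this reading, a student who misses one question out of ten still lands at the cap, which matches the stated intent of not rewarding the last increment of retention.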

To account for students just joining my class at the start of second semester, and for those who inevitably studied nothing for the seventeen calendar days between semesters — and even for those simply desperate for a fresh start — I had a lengthy grace period of sorts in January and February. Older stuff was temporarily not included in the “quizzable” question pool. I posted dates for when I would consider each old set fair game again; every week or two, a set would find itself back in the pool according to this schedule, and stay there for the rest of the year.

I did not use Cerego stats directly for any kind of grade, instead using my Canvas quizzes for this. My reasons:

• I wasn’t sure every student would consistently be able to use the app, and didn’t want to deal with the push-back from students and parents claiming (honestly or otherwise) insurmountable tech obstacles to using Cerego outside of class.
• Due to limitations in Cerego’s reporting, I wasn’t sure how to regularly compute a fair grade based on Cerego stats.
• I wasn’t sure how far I would be able to trust that a student’s stats weren’t being run up by a smarter friend using the app on their behalf.
• I didn’t want to discourage students from using Cerego Time to instead read their pleasure books (a habit of immense, scientifically-backed value that I do everything I can to promote).
• I didn’t want to give the impression that Cerego is necessarily the best or only way to study, but instead to make it clear that knowing the content was their responsibility, however they chose to do it; my providing them with Cerego cards and time to study them was simply a function of my being a Really Nice Guy.

Points of friction

This section is not a critique of Cerego specifically, but rather a reminder that classroom technology is not inherently good. The mythical 1:1 student tech ratio doesn’t suddenly make impossible dreams reality, and in fact comes with ongoing costs that must be weighed against the benefits. Here were some points of friction I encountered:

• Forgotten login information for the school PCs or Cerego.
• Slow startup, login, and load times on outdated equipment. [Fun fact: I’ve found that as my current school cuts down on the need for different logins through Clever, they create a separate problem of longer and more fragile authentication chains — handshaking from one site to another — that can fail on slow machines or under spotty WiFi.]
• Old or abused keyboards and mice that intermittently fail.
• The occasional bigger problem, like a blown power supply.
• For phone users: discharged, confiscated, lost, or broken devices.
• Distractions and inappropriate behaviors that wouldn’t be possible if students didn’t have their own screen to command.

All of the above adds up to a kind of tax on your time and energy, even when you have enough respect from your students to minimize deliberate abuse. (I had maybe 2-3 bad eggs during the year committing occasional acts of minor sabotage.) Moreover, every possible point of friction becomes amplified by a student who doesn’t feel like getting to the objective, like a child who finds an hour’s worth of yak shaving to do whenever bedtime rolls around.

Problems with multiple-choice study cards

Unlike Anki and other personal-use SRS, where the user self-assesses performance and collaborates with the app to schedule the next review, apps like Cerego are built to measure retention objectively. This changes how study cards have to be constructed. Although options [even in 2017] are varied, the most practical and straightforward method is usually a “front” side card with a question or term and a “back” side of multiple-choice responses.

Some problems with multiple-choice format:

• Responding to a multiple-choice question (or any kind of question) takes more time than pressing a self-assessment button.
• In general, it’s more work to create study cards that can be assessed by the app. This is true even in the ideal case, which for Cerego is when you can assign a set of cards where the correct answer in one card can automatically become a multiple-choice distractor (wrong answer) for other cards in the set. But many cases are not ideal, and the only plausible distractors will be ones you add manually.
• Students can get confused when distractors contaminate tenuous mental associations. This is a well-studied effect with testing in general, and I had one student (motivated, but lower IQ) who I feel was positively ruined by it this year.
• Students mostly don’t try to recall the answer before looking at multiple-choice options, instead defaulting to the following heuristic: “Look for an answer that feels right -- if none do, press ‘None of the above’”. This is a problem, because the act of trying to recall the specific thing is known to be the critical step that reinforces the memory; in contrast, merely recognizing familiar facts (as when “going over notes”) is known to give students false confidence.

I gave my Cerego contacts some ideas I had for minimizing some of the downsides of multiple-choice. Because my students were largely deaf to my pleading that the “front” card screen — the one containing only the question — is where the learning actually happens, there could be a mandatory (or at least default, opt-out) short delay on that screen, especially when the app detects inhumanly rapid clicking.

Cerego actually asks “Do you know this?” on that screen, giving them a chance to self-assess in the negative without going to the multiple choices, but the vast majority of students never saw this screen as anything but a speed bump to click through.

My thought was that Cerego could occasionally not show the multiple choice options right away when they click “I Know It”, but instead call their bluff, asking, “Oh? How confident are you?” and prompting them to select a confidence level on a slider bar before showing the choices. Not only might this end the bad habit, it could also provide an opportunity to help them with their credence calibration, a useful skill that might make them better thinkers and learners. I also suggested Cerego might be able to use this data to learn more about a learner and better judge their mastery level through sexy Bayesian wizardry.

[My aborted app design would have taken that concept to its logical conclusion: letting trusted users fully self-assess most of the time, but occasionally performing “reality checks” where it made the user respond in a way it could verify. It could then use straightforward Bayesian updates from these checks to decide how often to do them for each user.]
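
One way to read "straightforward Bayesian updates" is a Beta-Bernoulli model of how often a user's "I know it" survives a reality check. The sketch below is a guess at the idea, not the aborted app's design: the class name, the uniform prior, the 5% floor, and the rule "check with probability one minus trust" are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class TrustModel:
    """Beta posterior over the chance that a self-assessment is truthful."""
    honest: float = 1.0     # prior pseudo-count of verified claims (assumed)
    dishonest: float = 1.0  # prior pseudo-count of failed claims (assumed)

    def update(self, passed_check: bool) -> None:
        # Conjugate update: each reality check adds one observation.
        if passed_check:
            self.honest += 1
        else:
            self.dishonest += 1

    @property
    def trust(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.honest / (self.honest + self.dishonest)

    def check_probability(self, floor: float = 0.05) -> float:
        # Less trusted users get checked more often; the floor keeps
        # occasional checks going even for highly trusted users.
        return max(floor, 1.0 - self.trust)

    def should_check(self, rng: random.Random) -> bool:
        # Decide whether this review should be a verifiable reality check.
        return rng.random() < self.check_probability()
```

A user who passes every check drifts toward the floor rate, while a user caught bluffing is quickly checked on most cards again.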

New failure modes

New format, new failure modes:

• Performative clicking. I would commonly have students who didn’t want the discomfort of getting called to task, but also didn't want to actually do the task, so they would put up a show of productivity, continually clicking random answers over and over again without reading. Others would loiter in the stats screens, play with the cursor, check their grades... anything that wouldn’t require actual thinking.
• Exploits. Some students realized that mindless clicking moved Cerego’s progress bar on their study session forward. In some cases, it even raised their score. One enterprising young man demonstrated this for me, proudly resting a textbook over the Enter key, then kicking back as he “studied” his sets in record time. It was hard to be mad at him, as I could see myself doing the same at his age. Indeed, I was impressed. But he was in no way discouraged by my reminder that I didn’t use Cerego reports for grades, and that his trick wouldn’t leave him any better prepared for the quizzes that counted. (His mind was a steel trap, though; he did just fine.)
• Hunkering. Cerego is set up such that students don’t have new cards added to their rotations until they make an active choice to press a button that does this. Thus, many students would endlessly study only the first twenty cards from the start of the year, never pushing themselves with anything new. In their defense, one of my feedback notes to Cerego was that the UI [in 2017, remember] didn’t make it very clear that they had new material awaiting activation. But even after interventions where I walked them through the process, many of these fox-holed students would fail to activate newer cards on their own initiative.
• Idleness and moping. Apathy often manifests as lethargy combined with half-hearted complaints, voiced only when confronted, that it’s “too hard” or that “I don’t understand it”. Even though neither of those complaints made much sense when studying limited subsets of word-definition vocab pairs (the most common card set), I still heard both of them regularly from the hibernating bears I dared to poke. (Metaphorically. Never touch students.)

This was further evidence of something I already believed: that these complaints, in these contexts, are a means of disincentivizing teachers from bothering them, as opposed to cries for help. After all, if such a student stands by their claim of not understanding it, what is a responsible teacher supposed to do except to stand there and reteach them the whole thing, or schedule one-on-one tutoring, holding their hand with every “I don’t get it” until the work is done for them? If the student had really wanted to understand and do the work, they would have raised their hand as soon as they encountered difficulty instead of trying to be inconspicuous.

[I’ve always been more sympathetic to apathetic students than I probably sound here. Public education demands more directed attention from teenagers than most of them can realistically muster for 35 hours a week.]

Dominating the blame game

Teachers are regularly asked by their bosses how they are “differentiating” instruction, adjusting lessons for students across a class’s range of skill levels, learning disabilities, and language deficiencies. They are also asked by parents what their children can do to improve their grade.

Cerego gave me a ready answer to both questions: “Well, in my class we use a free study app that I load with all of the terms, vocab and such that could be on my quizzes. It’s like smart flash cards that let you know when you need to study to avoid forgetting things. They adjust to give you more practice with the things you struggle with. Not only do I provide time to use it during class — even providing a computer if they need it — but it works on any internet device. Students can use it as often as they like to be as prepared as they want to be.” Nobody ever complained about this answer, and some were quite impressed with it — more than I was, to be honest.

I also had powerful ammunition in the all-too-common scenario where, at a meeting with all of the child’s teachers, a parent blames poor grades on the teachers’ not adjusting to their child’s very special needs, instead of on their child’s ridiculously obvious laziness.

We can’t, of course, just come out and call it like we see it. But we can show parents our data and let them connect the dots. So, in these cases, I would just repeat my “Well, in my class we use a free study app…” spiel, emphasizing the “as prepared as they want to be” part. I would then add, “According to the app, your child has spent [x] minutes studying over the last week, which is about [y]% of the time my average ‘A’ student spent in that same period, and, come to think of it,” I would say, scratching my head for effect, “far less than the time I provide in class for it.”

Cue evil gaze from parent to child, squirming discomfort from child, envious awe from my fellow teachers.

It’s true! Here is a snapshot of one type of output I collected from my report-processing scripts for one of my students. You’re looking at one block of a larger data sheet I brought to parent meetings and included in periodic emails sent home. This one was for a fairly average student who put in the minimum expected time but didn’t push themselves very hard. A slacker’s would be more brutal.

Like I said, absolute dominance.

But like a lot of games, beating the “blame game” just made me tired of playing it, and ready to move on to something else. The enemy is not the apathetic student. The enemy is Apathy herself. I want to teach the lazy student, not destroy them with my Orwellian gaze.

Results and discussion

Table

In the following table, n=129, the sum of the 9th and 10th grade students that finished second semester with me. The procedures were identical in both grades, and I didn’t find much reason to divide them, preferring the larger total sample. I then divided the combined sample into quintiles as shown:

The "Sem 2 Grade" is their course grade from just the second semester, but the other stats are all cumulative for the year. (No, I don’t have any state test data for this group, and I never will. Having switched employers, I am not privy to the results, which arrive in late summer or early fall.)

“Set Level” is Cerego’s signature rating of overall progress and retention, on a 4-point scale.

“% of Cards Started” is the fraction of the total cards I had prepared that the students had added into their rotations. (Remember that Cerego did not do this automatically.) For 9th graders, there were 648 cards. For 10th graders, there were 749.

Study time analysis

As a sanity check, I crudely estimate that we had study time on 160 of our 180 school days, spending an average of 11 minutes each time. That would add up to 29.3 hours of total in-class study time. That the actual averages are lower does not surprise me, due to a combination of absences, roster changes, and start-up times. What we can conclusively say is that there was not a massive amount of outside-of-class study going on.
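
The sanity check is simple arithmetic, sketched here so the bounds are easy to see. The function name is made up for illustration; the 160 days and 11 minutes are the crude estimates above, and the 10 and 12 minute cases come from the stated length of Cerego Time.

```python
def in_class_study_hours(study_days: int, minutes_per_day: float) -> float:
    """Total in-class study time in hours, ignoring absences and start-up."""
    return study_days * minutes_per_day / 60.0


# 160 days * 11 minutes = 1760 minutes
estimate = in_class_study_hours(160, 11)
low = in_class_study_hours(160, 10)
high = in_class_study_hours(160, 12)
print(f"{estimate:.1f} hours (range {low:.1f} to {high:.1f})")
# prints "29.3 hours (range 26.7 to 32.0)"
```

This is a ceiling on in-class time for a student who never missed a session, so logged averages below it are expected.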

Of course, not all of those logged study minutes were productive study time. It wasn’t always clear to me when Cerego counted a minute towards study vs. idle, or whether it detected idleness at all on the mobile app. Indeed, there were several cases where a student’s mobile app seemed to have logged continual study overnight, and even, in one case, for multiple continuous days. The above chart has not been adjusted for known or unknown anomalies of this kind.

Regardless, as you can see, while time spent studying was correlated with performance, there was barely a 25% difference in study time separating the top and bottom grade quintiles. Even this is less exciting than it looks, as the lowest scorers were also more likely to be absent, missing their in-class study time. I have made no effort to adjust for this.

One thing you can’t see in that chart is the high variance that existed within the top quintile. In this group, time spent studying varied from 33 hours to 12 — and 12 was the top student! Anecdotally, I perceived two distinct subgroups of high performers: highly motivated learners who had a natural disadvantage, like being a foreign exchange student speaking a second language, and high IQ avid reader types. The former put in far more hours than the latter. In fact, that second group put in less time than the average bottom quintile student.

Only a very small number of highly motivated students showed signs of studying over weekends and breaks.

SRS signal, or just conscientiousness?

While you can see a much stronger signal in the “Set Level” and “% of Cards Started” columns, it’s hard to know how much this is just measuring conscientiousness. Good students are going to do what they’re asked to do, and get the good grade no matter what, but this doesn’t mean that what they’re asked to do is always necessary to get the good grade — or that the grade reflects anything worthwhile in the first place.

People persons

At least a few of the students I could never get to study Cerego were very on-the-ball whenever we did any kind of verbal review.

[I’ve seen a lot of this pattern during the pandemic. Students who seemed like inert lumps online, with very low grades, have in many cases returned to the classroom and revealed themselves to be dynamic and invested. An engaging human at the front of the room really is the “value add” of in-person instruction. This is something I encourage my peers to keep in mind whenever deciding between autonomous work and teacher-student interaction.]

High automaticity in high achievers

When it came to automaticity, outlier results were more impressive than ever. The very small number of students at the overlap of highly motivated, highly intelligent, and highly competitive absolutely crushed it in the review game we regularly played at my interactive whiteboard, beating me on several occasions, which almost never happened previously.

Weak transference?

However, transference to other contexts was less evident. In my first report, I had remarked on anecdotal impressions of higher-quality discussion and essay responses from those who had embraced our Anki review, suggesting that they had truly enlarged their lexicon to be able to talk about more complex ideas. I saw less of that this year. I don’t know what that means. It could just be that this mix of students was less open with their thoughts. But I can also see how they may have seen the Cerego universe as distinct from the universe of essay and discussion. Whole-class Anki might be more resistant to this bifurcation by making us say the words out loud to each other, normalizing their use.

Drama benchmark analysis

To compare methodologies as directly as possible, for a third year running I handled my Drama unit the way I accidentally had during my first year of classroom SRS: some terms taught before the pre-test, most taught after the pre-test, an identical post-test much later, and no review of any of it except through the SRS.

The overall results in the Drama unit were slightly worse this year. This was surprising. This cohort started lower on the pre-test, which was consistent with my impression of them, but I predicted that we would at least match or exceed last year’s gains, as we had more room to improve. We did not. Retention of some reliable bellwether terms actually dropped prior to the post-test. In picking through individual scores, my impression was that whole-class Anki and independent in-class Cerego were statistically equivalent for motivated learners, but whole-class Anki won easily with less motivated learners. As always, there were plenty of truly unmotivated students who got nothing from either method.

I tried to tease this out even further. This was pretty unscientific, but I took the pre and post-test scores of twenty students from last year, and aligned them individually to students from this year with similar pre-test scores and, in my view, similar work ethics. Highly motivated students starting very low may have done slightly better with Cerego than with Anki, but poorly motivated students starting low did somewhat better with Anki.
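
The alignment described above is a kind of matched-pairs comparison. The sketch below shows one mechanical version of it, with the caveat that the actual matching was done by hand and by judgment: the names, the greedy nearest-score rule, and the 5-point maximum gap are all assumptions, and the motivation label stands in for a subjective read of work ethic.

```python
from statistics import mean
from typing import NamedTuple


class Student(NamedTuple):
    pre: int         # pre-test score
    post: int        # post-test score
    motivation: str  # subjective label, e.g. "high" or "low"


def match_students(last_year, this_year, max_gap=5):
    """Greedily pair each of last year's students with the closest unused
    student from this year who shares the motivation label."""
    unused = list(this_year)
    pairs = []
    for old in last_year:
        candidates = [s for s in unused
                      if s.motivation == old.motivation
                      and abs(s.pre - old.pre) <= max_gap]
        if not candidates:
            continue  # no comparable student; skip rather than force a match
        new = min(candidates, key=lambda s: abs(s.pre - old.pre))
        unused.remove(new)
        pairs.append((old, new))
    return pairs


def mean_gain_difference(pairs, motivation):
    """Mean of (this year's gain - last year's gain) for one motivation
    group, or None if that group has no matched pairs."""
    diffs = [(new.post - new.pre) - (old.post - old.pre)
             for old, new in pairs if old.motivation == motivation]
    return mean(diffs) if diffs else None
```

A positive result for a group would favor this year's method (Cerego) and a negative one last year's (Anki), though with twenty hand-picked pairs any such number is an impression rather than a finding.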

I’m sure a lot of this came down to how Cerego makes new card sets “opt-in”. Students of lower motivation were less likely to encounter the Drama terms in their study rotation at all!

Phone vs. Computer seemed to make a difference here, too. Stuck with a very visible PC, some low performers would occasionally have good days and get in a groove. The ones glued to their phones found anything to do except Cerego.

Conclusions (2017)

If I see students as being ultimately responsible for their own learning, independent Cerego is the fairer approach that will help students get what they “deserve”. If I see things in more pragmatic and utilitarian terms (as I do), the numbers favor the whole-class Anki approach. And yet...

If I were staying at that school, with my classroom computers, I would have tried to get the best of both worlds. It was my plan to use Cerego again — having already done most of the legwork — and try to make it friendlier, with more teacher interaction, supplementing with some whole-class Anki. I would have pushed Cerego’s developers to make some of my most wanted changes, and I would have pushed myself to cut back on the number of cards I used.

But it’s moot, now. I won’t have computers at my new school. And part of the reason I left was that I didn’t like the feel of the groove I was settling into.

Whole-class Anki review wins for simplicity and camaraderie. Cerego wins for surveillance and power. Which would you want to see stamping on a teenage face forever?

Trick question! It’s not nice to stamp on faces. I feel like I’ve been pushing SRS too far past the point of diminishing returns, and I don’t know why it has become an annual tradition for me to vow to cut back next year and then fail to do so. I should probably break that cycle. Apathy is the enemy, and she remains unbowed. I’ve been looking for a technological fix, but I think the solution is, at best, only partly technological.

[My notes here spiraled off into very technological solutions (sigh) to add to my dream SRS+ app, which I had already postponed again but still wasn’t ready to abandon. I suppose I can give myself a little credit for brainstorming features to encourage human interaction and conceptual connections. Eventually, my notes came back to some thoughts about what makes a class thrive, which I have translated into coherent sentences below.]

From a scalability standpoint, it’s nice that something like Cerego doesn’t depend on a teacher’s charm the way my whole-class Anki approach does. Teachers could do a lot worse than a standardized pack of quality Cerego sets that reinforce matching cookie-cutter lessons. But couldn’t teachers also do better? I think I could do better. Cerego and Canvas quizzes create distance between me and my students. But I want to bring us closer and dial up the enthusiasm.

I don’t think gamification is the answer. I’ve been noticing that the appeal of games is pretty niche, failing to capture many from the apathetic middle, and then for the wrong reasons, with the wrong incentives.

So what would work?

In education research, it always looks like everything works at least a little bit. This is probably a combination of publication bias and the fact that teachers sometimes get excited to try something new. Excitement is infectious. This gets students more engaged, which then improves outcomes. My early success with classroom SRS — and subsequent disappointments — would certainly fit that pattern.

Maybe I should make a point of trying new things each year for the explicit purpose of exploiting the excitement factor? How would I explain that to my bosses? “Well, I deliberately diverged from the curriculum and accepted best practice because I grew weary of them.”

[Yes, actually. My new bosses are great that way.]

Thesis, Antithesis, Synthesis (2021)

As a student of storytelling, I can’t help but find an arc to my fourteen years of teaching up to this point.

When I first started out, I didn’t know what I was doing but kept Apathy at bay through sheer passion. I worked harder than anyone. I couldn’t wait to try my stuff out, and students responded to all but my cringiest overtures.

When this inevitably exhausted me, I had a hard slump. Lessons that used to work fell flat. I still didn’t know what I was doing, and now lacked sufficient passion to brute force success. So I retreated into systems and structure, building word banks, prompt banks, quiz banks; rubrics, charts, and randomizers; running reports; slinging code. A suit of high-tech power armor to augment my feeble form. A different kind of brute force.

My systems gave me stability and staying power, and, eventually, the confidence to explore. My three years of heavy SRS experimentation were the culmination of this phase. I stretched. I grew. But I still felt plateaued and frustrated, perhaps having taken systems as far as they could go.

Apathy still mocked me from her emoji throne.

I step out of the armor and find I no longer need it. One by one, my systems clatter to the ground. I know who I am. I know where my power comes from. And I know my enemy.

She will lose, because she is overconfident. She won’t prepare, because she is indifferent. And she won’t hear my warning, because I issue it now in the one place I know she’ll never reach: the bottom of a 10,000 word essay.

I’m coming for you.

[1] Neel Nanda beat me to a discussion of this. Worth a read. The comments are great, too. I was reassured that others like me with real experience, a little research, and rigorous thinking on the topic had reached such similar conclusions.

[2] You don’t have to justify yourself to me. I, too, have motivational and administrative reasons that keep me testing on occasion. But I approach and design them differently, when I can.

[3] A widespread bias I see in education is viewing every subject as a technical one with a straightforward dependency tree. Take my subject: English. The delusion held by seemingly all district-level curriculum czars is that, if Johnny’s reading scores are deficient, there must be one or two very specific dependencies he lacks. They will often look to a single wrong answer on a diagnostic test and say, “Ah! There it is. ‘Deducing the meaning of a word from context.’ Teacher, give them lessons on that until they master it.” Sorry. It doesn’t work that way. Johnny, like most humans, intuitively understands how to derive meaning from context. But in this case, he didn’t understand the context, because it’s one of the millions of things he’s naive about. He’s young and hasn’t read very many books. If we want to get reductive, I will concede the hypothetical possibility of making a shaggy graph of the millions of micro-dependencies that underpin an individual’s reading skill. But maybe we should just try to find Johnny some books he might like.

[4] Consider how a serial television show uses a “Previously, on [title]” to remind you of plot threads that are going to be relevant to this episode, some of which might be from several episodes back. This is superior to how they used to do it, which was “Last time, on [this show].” The primitive form would fail to remind you of relevant threads from older episodes and needlessly remind you of irrelevant threads from last week. When you review with your students, are you just reviewing the most recent stuff, or are you choosing the stuff that’s about to be relevant again?

[5] This book would be somewhat redundant in a world where we already have David Didau’s What if everything you knew about education was wrong? I crossed paths with this title during a pensive season of my life and appreciated the way it asked questions from first principles, challenging orthodox assumptions without jumping to new conclusions. In particular, Didau had the words to express what I was feeling about forgetting.

[6] When it’s releasing more energy than you’re using to contain it.

[7] She goes by many names around the world. In the UK, teachers swap scary stories about Bore-a-trix Lestrange, Lady Macbarf, and Nary, Queen of Nots.

[8] I remember the first time I appreciated this skill. It was when I saw this hilarious exchange between Louis CK and Conan O’Brien, and then saw the same content later as a bit in one of his shows (4:39). It seems embarrassing to have not seen it, but it hadn’t occurred to me that talk-show ‘interviews’ with comedians might sometimes be adaptations of their bits. Seriously, though, Louis CK really comes across as a spontaneously funny guy in that first clip. He elevates the convincingness of spontaneity into another layer of comedic art.

[9] Do you want to know what I’ve hated most about teaching in person during the Covid-19 pandemic? The way mutual mask-wearing scrams my reactor. With my facial expressions concealed, my deliveries don’t land as consistently. With the students’ expressions concealed, I am deprived of the energy I would gain by getting a reaction out of them. The parts of the job that used to recharge me drain me instead. I don’t have words to describe how awful this feels.

[10] If you’re a fellow teacher, you know that this is the differentiation problem solving itself.

Discuss

### I'm still mystified by the Born rule

March 4, 2021 - 05:35
Published on March 4, 2021 2:35 AM GMT

(This post was originally intended as a comment on Adele's question, but ballooned to the point where it seems worthy of a top-level post. Note that I'm not trying to answer Adele's (specific fairly-technical) question here. I consider it to be an interesting one, and I have some guesses, but here I'm commentating on how some arguments mentioned within the question relate to the mysteries swirling around the Born rule.)

(Disclaimer: I wrote this post as a kind of intellectual recreation. I may not have the time and enthusiasm to engage with the comments. If you point to a gaping error in my post, I may not reply or fix it. If I think there's a gaping error in your comment, I may not point it out. You have been warned.)

My current take is that the "problem with the Born rule" is actually a handful of different questions. I've listed some below, including some info about my current status wrt each.

Q1. What hypothesis is QM?

In, eg, the theory of Solomonoff induction, a "hypothesis" is some method for generating a stream of sensory data, interpreted as a prediction of what we'll see. Suppose you know for a fact that reality is some particular state vector in some Hilbert space. How do you get out a stream of sensory data? It's easy enough to get a single sensory datum — sample a classical state according to the Born probabilities, sample some coordinates, pretend that there's an eyeball at those coordinates, record what it sees. But once we've done that, how do we get our next sense datum?

Or in other words, how do we "condition" a quantum state on our past observations, so that we can sample repeatedly to generate a sequence of observations suitable for linking our theories of induction with our theories of physics?

To state the obvious, a sensory stream generated by just re-sampling predicts that you're constantly teleporting through the multiverse, and a sensory stream generated by putting a delta spike on the last state you sampled and then evolving that forward for a tick will... not yield good predictions (roughly, it will randomize all momenta).
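A minimal sketch of the two naive schemes above, assuming a toy 8-position system in which a unitary discrete Fourier transform stands in for the change from position basis to momentum basis. The amplitudes in `psi` and the helper names are made up for illustration; nothing here is a claim about how the missing machinery should work.

```python
import cmath
import math
import random

def born_probabilities(amplitudes):
    """Squared magnitudes of the amplitudes, normalized to sum to 1."""
    weights = [abs(a) ** 2 for a in amplitudes]
    total = sum(weights)
    return [w / total for w in weights]

def sample_state(amplitudes, rng):
    """Draw one basis-state index according to the Born probabilities."""
    probs = born_probabilities(amplitudes)
    return rng.choices(range(len(amplitudes)), weights=probs, k=1)[0]

def to_momentum_basis(amplitudes):
    """Unitary DFT: a toy stand-in for position -> momentum."""
    n = len(amplitudes)
    return [
        sum(a * cmath.exp(-2j * math.pi * k * x / n)
            for x, a in enumerate(amplitudes)) / math.sqrt(n)
        for k in range(n)
    ]

n = 8
rng = random.Random(0)

# Hypothetical state spread over three of the eight positions.
psi = [1, 1j, 0, 0, 2, 0, 0, 0]

# A single sense datum: one Born-weighted sample.
first = sample_state(psi, rng)

# "Delta spike on the last state you sampled": all amplitude on one position.
delta = [1.0 if x == first else 0.0 for x in range(n)]

# In the momentum basis the spike is completely flat.
momentum_probs = born_probabilities(to_momentum_basis(delta))
print(first, [round(p, 3) for p in momentum_probs])
```

In this toy model every momentum comes out equally likely after conditioning on a sharp position, which is the discrete analogue of "it will randomize all momenta".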

Current status: I expect additional machinery is required to turn QM into a hypothesis in the induction-compatible sense — ie, I'd say "the Born rule is not complete (as a rule for generating a hypothesis from a quantum state)". My guess is that the missing machinery involves something roughly like sampling classical states according to the Born rule and filtering them by how easy it is to read the (remembered) sense history off of them. I suspect that a full resolution of this question requires some mastery of naturalized induction. (I have some more specific models than this that I won't get into at the moment. Also there are things to say about how this problem looks from the updateless perspective, but I also won't go into that now.)

Q2. Why should we believe the Born rule?

For instance, suppose my friend is about to roll a biased quantum die, why should I predict according to the Born-given probabilities?

The obvious answer is "because we checked, and that's how it is (ie, it's the simplest explanation of the observed data so far)".

I suspect this answer is correct, but I am not personally quite willing to consider the case closed on this question, for a handful of reasons:

• I'm not completely solid on how to twist QM into a full-on sensory stream (see Q1), and I suspect some devils may be lurking in the details, so I'm not yet comfortable flatly declaring "Occam's razor pins the Born rule down".

• There's an intuitive difference (that may or may not survive philosophical progress) between indexical uncertainty, empirical uncertainty, and logical uncertainty, and it's not completely obvious that I'm supposed to use induction to manage my indexical uncertainty. For example, if I have seen a million coin tosses in my past, and 2/3 of them came up heads (with no other detectable pattern), and I have a bona fide guarantee that I'm an emulation running on one of 2^2000000 computers, each of which is halfway through a simulation of me living my life while two million coins get flipped (in literally all combinations), then there's some intuition that I'm supposed to predict the future coins to be unbiased, in defiance of the observed past frequency. Furthermore, there's an intuition that QM is putting us in an analogous scenario. (My current bet is that it's not, and that the aforementioned intuition is deceptive. I have models about precisely where the disanalogy is that I won't go into at the moment. The point I'm trying to make is that it's reasonable to think that the Born rule requires justification beyond 'Occam says'. See also Q4 below.)

• It's not clear to me that the traditional induction framework is going to withstand the test of time. For example, the traditional framework has trouble dealing with inductors who live inside the world and have to instantiate their hypotheses physically. And, humans sure are keen to factor their hypotheses into "a world" + "a way of generating my observations from some path through that world's history". And, the fact that QM does not naturally beget an observation stream feels like something of a hint (see Q1), and I suspect that a better theory of induction would accommodate QM in a way that the traditional theory doesn't. Will a better theory of reasoning-while-inside-the-world separate the "world" from the "location therein", rather than lumping them all into a single sensory stream? If so, might the Born rule end up on the opposite side of some relevant chasm? I suspect not, but I have enough confusion left in this vicinity that I'm not yet comfortable closing the case.
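The emulation scenario in the second bullet can be checked by brute force at a much smaller scale. The sketch below assumes 6 past flips and 6 future flips instead of a million each, with one copy of the observer for every possible combination; the function name and the sizes are illustrative choices, not part of the original argument.

```python
from fractions import Fraction
from itertools import product

def expected_future_heads(past_len, future_len, past_heads):
    """Average fraction of heads in the future flips, taken over every
    copy whose past flips contain exactly `past_heads` heads."""
    total_heads = 0
    total_flips = 0
    for seq in product((0, 1), repeat=past_len + future_len):
        if sum(seq[:past_len]) != past_heads:
            continue  # this copy remembers a different history
        total_heads += sum(seq[past_len:])
        total_flips += future_len
    return Fraction(total_heads, total_flips)

# 4 heads out of 6 past flips matches the 2/3 observed frequency.
future_rate = expected_future_heads(past_len=6, future_len=6, past_heads=4)
print(future_rate)  # 1/2
```

Among the copies that remember 2/3 heads, the future flips are still split evenly, which is the intuition that counting copies and doing induction on past frequencies can pull in different directions.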

My current status is "best guess: we believe the Born rule for the usual reason (ie "we checked"), with the caveat that it's not yet completely clear that the usual reason works in this situation".

Q3. But... why the Born rule in particular?

Why is the Born rule natural? In other words, from what mathematical viewpoint is this a rule so simple and elegant as to be essentially forced?

Expanding a bit, I observe that there's a sense in which discrete mathematics feels easier to many humans (see, eg, how human formalizations of continuous math often arise from taking limits or other εδmanship built atop our formalizations for discrete math). Yet, physics makes heavy use of smooth functions and differential equations. And, it seems to me like we're supposed to stare at this and learn something about which things are "simple" or "elegant" or "cheap" with respect to reality. (See also gauge theory and the sense that it is trying to teach us some lessons about symmetry, etc.)

I think that hunger-for-a-lesson is part of the "but whyyyy" that many people feel when they encounter the Born rule. Like, why are we squaring amplitude? What ever happened to "zero, one, or infinity"? When physics raises something to a power that's not zero, one, or infinity, there's probably some vantage point from which this is particularly forced, or simple, or elegant, and if you can find it then it can likely help you predict what sorts of other stuff you'll see.

Or to put it another way, consider the 'explanation' of the Born rule which goes "Eh, you have a complex number and you need a real number, there aren't that many ways you can do it. Your first guess might be 'take the magnitude', your second guess might be 'take the real component', your third guess might be 'multiply it by its own complex conjugate', and you'll turn out to be right on the third try. Third try isn't bad! We know it is so because we checked. What more is there to be explained?". Observe that there's a sense in which this explanation feels uncompelling — like, there are a bunch of things wrong with the objection "reality wasn't made by making a list of possible ways to get a real from a complex number and rolling a die", but there's also something to it.
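To make the three guesses concrete, here is a small sketch that applies each candidate map to a hypothetical two-component state with amplitudes 0.6 and 0.8i (chosen only because the numbers are easy to check by hand).

```python
def magnitude(a: complex) -> float:
    return abs(a)

def real_component(a: complex) -> float:
    return a.real

def times_conjugate(a: complex) -> float:
    # a * conj(a) is always real and equals |a|^2
    return (a * a.conjugate()).real

# Hypothetical normalized state: amplitudes 0.6 and 0.8i.
state = [complex(0.6, 0.0), complex(0.0, 0.8)]

candidates = (magnitude, real_component, times_conjugate)
totals = {f.__name__: sum(f(a) for a in state) for f in candidates}
for name, total in totals.items():
    print(f"{name}: {total:.2f}")
```

For this state only the third candidate gives numbers that sum to 1, which a probability rule needs. That observation is a sanity check rather than an explanation; it does not say why reality uses that map.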

My current status on this question is that it's significantly reduced — though not completely solved — by the argument in the OP (and the argument that @evhub mentions, and the ignorance+symmetry argument @Charlie Steiner mentions, which I claim all ground out in the same place). In particular, I claim that the aforementioned argument-cluster grounds out the Born rule into the inner product operator, thereby linking the apparently-out-of-the-blue 2 in the Born rule with the same 2 from "L2 norm" and from the Pythagorean theorem. And, like, from my vantage point there still seem to be deep questions here, like "what is the nature of the connection between orthonormality and squaring", and "is the L2 norm preferred b/c it's the only norm that's invariant under orthonormal change of basis, or is the whole idea of orthonormality somehow baking in the fact that we're going to square and sqrt everything in sight (and if so how)" etc. etc. I might be willing to consider this one solved in my own book once I can confidently trace that particular 2 all the way back to its maker; I have not yet done so.
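The invariance mentioned above is easy to check numerically in two dimensions. This sketch assumes a plain rotation as the orthonormal change of basis and compares the L1 and L2 norms of the vector (3, 4) before and after; it illustrates that L2 survives the rotation while L1 does not, and makes no claim about uniqueness.

```python
import math

def rotate(v, theta):
    """Rotate a 2D vector by theta radians (an orthonormal change of basis)."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def p_norm(v, p):
    """The L^p norm of a vector."""
    return sum(abs(t) ** p for t in v) ** (1 / p)

v = (3.0, 4.0)
w = rotate(v, math.pi / 4)

l1_before, l1_after = p_norm(v, 1), p_norm(w, 1)
l2_before, l2_after = p_norm(v, 2), p_norm(w, 2)

print(f"L1: {l1_before:.3f} -> {l1_after:.3f}")
print(f"L2: {l2_before:.3f} -> {l2_after:.3f}")
```

The L2 norm stays at 5 under the rotation, while the L1 norm changes, so only the squared components give a basis-independent total to divide up.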

For the record, on the axis from "Gentlemen, that is surely true, it is absolutely paradoxical; we cannot understand it, and we don't know what it means. But we have proved it, and therefore we know it must be the truth" to... whatever the opposite of that is, I tend to find myself pretty far on the "opposite of that" end, ie, I often anticipate finding explanations for logical surprises. In this regard, I find arguments of the form "the Born rule is the only one that satisfies properties X, Y, and Z" fairly uncompelling — those feel to me like proofs that I must believe the Born rule is good, not reasons why it is good. I'm generally much more compelled by arguments of the form "if you meditate on A, B, and C you'll find that the Correct Way (tm) to visualize the x-ness of (3x, 4y) is with the number (3^2/5)" or suchlike. Fortunately for me, an argument of the latter variety can often be reversed out of a proof of the former variety. I claim to have done some of that reversing in the case of the Born rule, and while I haven't fully absorbed the results yet, it seems quite plausible to me that the argument cluster named by Adele/Evan/Charlie essentially answers this third question (at least up to, say, some simpler Qs about the naturality of inner products).

Q4. wtf magical reality fluid

What the heck is up with the thing where, not only can we be happening in multiple places, but we can be happening quantitatively more in some of them?

I see this as mostly a question of anthropics, but the Born rule is definitely connected. For instance, you might wish to resolve questions of how-much-you're-happening by just counting physical copies, but this is tricky to square with the continuous distribution of QM, etc.

Some intuition that's intended to highlight the remaining confusion: suppose you watch your friend walk into a person-duplicating device. The left copy walks into the left room and grabs a candy bar. The right copy walks into the right room and is just absolutely annihilated by a tangle of whirring blades — screams echo from the chamber, blood spatters against the windows, the whole works. You blink in horror at the left clone as they exit the door eating a candy bar. "What?" they say. "Oh, that. Don't worry. There's a dial in the duplicating device that controls how happening each clone is, and the right clone was happening only negligibly — they basically weren't happening at all".

Can such a dial exist? Intuition says no. But quantum mechanics says yes! Kind of! With the glaring disanalogy that in QM, you can't watch the negligibly-happening people get ripped apart — light bouncing off of them cannot hit your retinas, or else their magical-happening-ness would be comparable to yours. Is that essential? How precisely do we go about believing that magical happening-ness dials exist but only when things are "sufficiently non-interacting"? (Where, QM reminds us, this interacting-ness is a continuous quantity that rarely if ever hits zero.) (These questions are intended to gesture at confusion, not necessarily to be answered.)

And, like, there's a way in which the hypothesis "everything is; we are built to attend to the simple stuff" is a curiosity-stopper — a mental stance that, when adopted, makes it hard to mine a surprise like "reality has the quantum nature" for information about what sort of things can be.

I have a bunch more model than this, and various pet hypotheses, but ultimately my status on this one is "confused". I expect to remain confused at least until the point where I can understand all these blaring hints.

In sum, there are some ways in which I find the Born rule non-mysterious, and there are also Born-rule-related questions that I remain quite confused about.

With regards to the things I consider non-mysterious, I mostly endorse the following, with some caveats (mostly given in the Q2 section above):

The Born rule is on the same status as the Fourier transform in quantum mechanics — it's just another equation in the simple description of where to find us. It gets an undeservedly bad rep on account of being just barely on the reality-side of the weird boundary humans draw between "reality" and "my location therein" in their hypotheses, and it has become a poster-child for the counter-intuitive manner in which we are embedded in our reality. Even so, fixing the nature of the rest of reality, once one has fully comprehended the job that the Born rule does, the Born rule is the only intuitively natural tool for its job.

(And, to be clear, I've updated in favor of that last sentence in recent times, thanks in part to meditating on the cluster of arguments mentioned by Adele/Evan/Charlie.)

With regards to the remaining mystery, there is a sense in which the Born rule is the star in a question that I consider wide-open and interesting, namely "why is 'trace your eyes across these walls in accordance with the Born rule' a reasonable way for reality to be?". I suspect this question is confused, and so I don't particularly seek its answer, but I do seek mastery of it, and I continue to expect such mastery to pay dividends.

Discuss

### Limits of Giving

March 4, 2021 - 05:20
Published on March 4, 2021 2:20 AM GMT

A friend recently asked what my goal was in giving: was there some amount of donations that would be enough? If someone gave me a large enough amount of money, which I then donated, would I be free of further altruistic obligations?

These questions feel to me like they come from a very different perspective, so I want to try and explain how I think about it. If I continue on my current path, perhaps over the next 40 years I might manage to donate $10M. There's a sense, then, in which I have a target of $10M. If through some unrealistically good fortune my 0.34% of Wave stock options turned into $50M, however, I wouldn't donate $10M and then devote myself to leisure.

The level of need in the world is enormous, far bigger than my personal efforts can address. The poorest billion people need a marginal dollar far more than I do; no one should be dying of malaria; our society's ability to handle a pandemic is terrifyingly bad; we are putting much less effort than we should be into making sure humanity doesn't go extinct.

Now I'm not going to sell all my possessions and live as cheaply as possible, but I am going to be thoughtful about balancing costs to myself against benefits to others and making good altruistic tradeoffs. The more money I have, the larger a portion I can give while continuing to spend money on myself in ways that make me happy.

Considered this way, responding to receiving $50M by decreasing the percentage I gave would be exactly backwards.

Comment via: facebook

Discuss

### How does bee learning compare with machine learning?

March 4, 2021 - 05:03
Published on March 4, 2021 1:59 AM GMT

This is a write-up of work I did as an Open Philanthropy intern. However, the conclusions don't necessarily reflect Open Phil's institutional view.

Abstract

This post investigates the biological anchor framework for thinking about AI timelines, as espoused by Ajeya Cotra in her draft report. The basic claim of this framework is that we should base our estimates of the compute required to run a transformative model on our estimates of the compute used by the human brain (although, of course, defining what this means is complicated). This line of argument also implies that current machine learning models, some of which use amounts of compute comparable to that of bee brains, should have similar task performance as bees.

In this post, I compare the performance and compute usage of both bees and machine learning models at few-shot image classification tasks. I conclude that the evidence broadly supports the biological anchor framework, and I update slightly towards the hypothesis that the compute usage of a transformative model is lower than that of the human brain.

The full post is viewable in a Google Drive folder here.

Discuss

### AI Safety Beginners Meetup (Pacific Time)

March 4, 2021 - 04:44
Published on March 4, 2021 1:44 AM GMT

Are you new to AI Safety? Then this event is for you. This is an occasion to ask all your questions, and meet others in your situation.

Are you a veteran in AI Safety, or have you just been around long enough to have some wisdom to share? Then you are welcome to join this meetup to share your knowledge and experience.

More info here.

Discuss

### Some recent interviews with AI/math luminaries.

March 4, 2021 - 04:26
Published on March 4, 2021 1:26 AM GMT

I've recently started a podcast with renowned futurist Thomas Frey, and when possible I've been scheduling luminaries in AI Safety, AGI, and mathematics. Most of the content won't be new to regular LWians, but I thought it couldn't hurt to share a few links. Like and subscribe for future interviews; in the months ahead we've got leading experts in economics, one of the founders of the Santa Fe Institute, the brain behind one of the most popular social networks, and a bunch more.

Here is our interview with Dr. Roman Yampolskiy (spoiler: he admits to being Satoshi Nakamoto). Before this interview I hadn't heard of 'intellectology', but it's his proposal for a new field that studies the structure and limitations of different cognitive architectures:

We spoke with the director of the Icelandic Institute for Intelligent Machines about his proposed design of a fully generally intelligent system. I don't know if he's cracked that nut, but he's definitely given it deep, serious thought:

My good friend Erik Istre is an expert in nonclassical foundations for mathematics. In our interview with him we really get into the weeds on paraconsistent logic and what it does/doesn't mean, plus its potential applications to AI Safety and metaphysics:

Finally, David Jilk is well-known in AI Safety circles, and in this interview we talk about different approaches to the topic and whether there's any connection to quantum computing:

Discuss

### Garden Party 2.0

March 4, 2021 - 02:14
Published on March 3, 2021 11:14 PM GMT

The Walled Garden is being remodeled, and you are invited to come celebrate. We'll open with an optional round of lightning talks and google-doc collaboration (in the new Meeting Hall), and then wander the new central garden as we chat/catch up/share what we're working on or thinking about.

We've learned a bunch about what makes for a good Gather Town space, and I've recently rebuilt the central garden space in a way that makes it much easier to modify. I've made some deliberate changes (which I hope are improvements!) over the previous pre-built D&D artwork, and I'm interested in feedback on what else we could do to improve it as a space.

Event link is here: http://garden.lesswrong.com?code=nU6B&event=garden-party-2-0

Discuss

### How To Think About Overparameterized Models

March 4, 2021 - 01:29
Published on March 3, 2021 10:29 PM GMT

So, you’ve heard that modern neural networks have vastly more parameters than they need to perfectly fit all of the data. They’re operating way out in the regime where, traditionally, we would have expected drastic overfit, yet they seem to basically work. Clearly, our stats-101 mental models no longer apply here. What’s going on, and how should we picture it?

Maybe you’ve heard about some papers on the topic, but didn’t look into it in much depth, and you still don’t really have an intuition for what’s going on. This post is for you. We’ll go over my current mental models for what’s-going-on in overparameterized models (i.e. modern neural nets).

Disclaimer: I am much more an expert in probability (and applied math more generally) than in deep learning specifically. If there are mistakes in here, hopefully someone will bring it up in the comments.

Assumed background knowledge: multi-dimensional Taylor expansions, linear algebra.

Ridges, Not Peaks

First things first: when optimizing ML models, we usually have some objective function where perfectly predicting every point in the training set yields the best possible score. In overparameterized models, we have enough parameters that training indeed converges to zero error, i.e. all data points in the training set are matched perfectly.

Let’s pick one particular prediction setup to think about, so we can stick some equations on this.
We have a bunch of (x,y)
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face 
{font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: 
MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} data points, and we want to predict y given x. Our ML model has some parameters θ, and its prediction on a point x(n) is f(x(n),θ). In order to perfectly predict every data point in the training set, θ must satisfy the equations ∀n:y(n)=f(x(n),θ) Assuming y(n) is one-dimensional (i.e. just a number), and we have N data points, this gives us N equations. If θ is k-dimensional, then we have N equations with k variables. 
If the number of variables is much larger than the number of equations (i.e. $k \gg N$, parameter-dimension much greater than number of data points), then this system of equations will typically have many solutions. In fact, assuming there are any solutions at all, we can prove there are infinitely many - an entire high-dimensional surface of solutions in $\theta$-space.

Proof: let $\theta^*$ be a solution. If we make a small change $d\theta^*$, then $f(x^{(n)}, \theta)$ changes by $\nabla_\theta f(x^{(n)}, \theta^*) \cdot d\theta^*$. For all the equations to remain satisfied, after shifting $\theta^* \to \theta^* + d\theta^*$, these changes must all be zero:

$$\forall n: \; 0 = \nabla_\theta f(x^{(n)}, \theta^*) \cdot d\theta^*$$

Key thing to notice: this is a set of linear equations. There are still $N$ equations and still $k$ variables (this time $d\theta^*$ rather than $\theta$), and since they’re linear, there are guaranteed to be at least $k - N$ independent directions along which we can vary $d\theta^*$ while still solving the equations (i.e. the right nullspace of the matrix $\nabla_\theta f(x^{(n)}, \theta^*)$ has dimension at least $k - N$). These directions point exactly along the local surface on which the equations are solved.

Takeaway: we have an entire surface of dimension (at least) $k - N$, sitting in the $k$-dimensional $\theta$-space, on which all points in the training data are predicted perfectly.

What does this tell us about the shape of the objective function more generally? Well, we have this (at least) $k - N$ dimensional surface on which the objective function achieves its best possible value. Everywhere else, it will be lower. The “global optimum” is not a point at the top of a single peak, but rather a surface at the high point of an entire high-dimensional ridge. So: picture ridges, not peaks.

Ridges are harder to draw, ok?

Before we move on, two minor comments on generalizing this model.

• “Predict $y$ given $x$” is not the only setup deep learning is used for; we also have things like “predict/generate/compress samples of $x$” or RL. My understanding is that generally-similar considerations apply, though of course the equations will be different.
• If $y$ is more than one-dimensional, e.g. dimension $d$, then the perfect-prediction surface will have dimension at least $k - Nd$ rather than $k - N$.

Priors and Sampling, Not Likelihoods and Estimation

So there’s an entire surface of optimal points. Obvious next question: if all of these points are optimal, what determines which one we pick? Short answer: mainly initial parameter values, which are typically randomly generated.

Conceptually, we randomly sample trained parameter values from the perfect-prediction surface. To do that, we first sample some random initial parameter values, and then we train them - roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values. The key problem is to figure out what distribution of final (trained) parameter values results from the initial distribution of parameter values.

One key empirical result: during training, the parameters in large overparameterized models tend to change by only a small amount. (There’s a great visual of this in this post. It’s an animation showing weights changing over the course of training; for the larger nets, they don’t visibly change at all.) In particular, this means that linear/quadratic approximations (i.e. Taylor expansions) should work very well.

For our purposes, we don’t even care about the details of the ridge-shape. The only piece which matters is that, as long as we’re close enough for quadratic approximations around the ridge to work well, the gradient will be perpendicular to the directions along which the ridge runs. So, gradient descent will take us from the initial point, to whatever point on the perfect-prediction surface is closest (under ordinary Euclidean distance) to the initial point. Stochastic gradient descent (as opposed to pure gradient descent) will contribute some noise - i.e. diffusion along the ridge-direction - but it should average out to roughly the same thing.
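As a rough numerical illustration of the two claims so far (a solution surface of dimension at least $k - N$, and gradient descent landing on the point of that surface nearest the initial values), here is a minimal sketch. It assumes a model that is linear in its parameters, $f(x^{(n)}, \theta) = A_n \cdot \theta$ with a random matrix $A$, which is a simplification introduced here for illustration; the names `A`, `theta0`, and `closest` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 5, 50                      # N data points, k >> N parameters

# Assumption: model linear in theta, f(x_n, theta) = A[n] @ theta,
# so row A[n] plays the role of grad_theta f(x_n, theta).
A = rng.normal(size=(N, k))
y = rng.normal(size=N)

# Dimension of the right nullspace = number of independent ridge directions.
null_dim = k - np.linalg.matrix_rank(A)

# Plain gradient descent on 1/2 * ||A theta - y||^2 from a random initial point.
theta0 = rng.normal(size=k)
theta = theta0.copy()
lr = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(5000):
    theta -= lr * A.T @ (A @ theta - y)

# Euclidean-closest point to theta0 on the surface {theta : A theta = y}.
closest = theta0 + np.linalg.pinv(A) @ (y - A @ theta0)

print("nullspace dimension:", null_dim)
print("max training error:", np.abs(A @ theta - y).max())
print("distance from GD result to closest point:", np.linalg.norm(theta - closest))
```

In this linear setting the match is exact rather than approximate, because every gradient step lies in the row space of `A`; for a nonlinear model the same picture is only expected to hold to the extent that the local linear/quadratic approximation is good.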
From there, figuring out the distribution from which we effectively sample our trained parameter values is conceptually straightforward. For each point $\theta^*$ on the perfect-prediction surface, add up the probability density of the initial parameter distribution at all the points which are closer to $\theta^*$ than to any other point on the perfect-prediction surface. We can break this up into two factors:

• How large a volume of space is closest to $\theta^*$? This will depend mainly on the local curvature of the perfect-prediction surface (higher where curvature is lower).
• What’s the average density of the initial-parameter distribution in that volume of space?

Now for the really hand-wavy approximations:

• Let’s just ignore that first factor. Assume that the local curvature of the perfect-prediction surface doesn’t change too much over the surface, and approximate it by a constant. (Everything’s on a log-scale, so this is reasonable unless the curvature changes by many orders of magnitude.)
• For the second factor, let’s assume the average density of the initial-parameter distribution over the volume is roughly proportional to the density at $\theta^*$. (This is hopefully reasonable, since we already know initial points are quite close to final points in practice.)

Are these approximations reasonable? I haven’t seen anyone check directly, but they are the approximations needed in order for the results in e.g. Mingard et al to hold robustly, and those results do seem to hold empirically.

The upshot: we have an effective “prior” (i.e. the distribution from which the initial parameter values are sampled) and “posterior” (i.e. the distribution of final parameter values on the perfect-prediction surface). The posterior density is directly proportional to the prior density, but restricted to the perfect-prediction surface. This is exactly what Bayes’ rule says, if we start with a distribution $P[\theta]$ and then update on data of the form “$\forall n: y^{(n)} = f(x^{(n)}, \theta)$”.
Our posterior is then $P[\theta \mid \forall n: y^{(n)} = f(x^{(n)}, \theta)]$, and our final parameter-values are a sample from that distribution.

Note how this differs from traditional statistical practice. Traditionally, we maximize likelihood, and that produces a unique “estimate” of $\theta$. While today’s ML models may look like that at first glance, they’re really performing a Bayesian update of the parameter-value-distribution, and then sampling from the posterior.

Example: Overparameterized Linear Regression

As an example, let’s run a plain old linear regression. We’ll use an overparameterized model which is equivalent to a traditional linear regression model, in order to make the relationship clear.

We have 100 $(x, y)$ pairs, which look like this:

[Figure: plot of the 100 data points]

I generated these with a “true” slope of 1, i.e. $y = 1 \cdot x + \text{noise}$, with standard normal noise.

Traditional-Style Regression

We have one parameter, $c$, and we fit a model $y^{(n)} = c x^{(n)} + \xi^{(n)}$, with standard normal-distributed noise $\xi^{(n)}$. This gives log likelihood

$$\log P[y \mid c] = -\frac{1}{2} \sum_n \left(y^{(n)} - c x^{(n)}\right)^2$$

… plus some constants. We choose $c^*$ to maximize this log-likelihood. In this case, $c^* = 1.010$, so the line looks like this:

[Figure: fitted line over the data]

(Slightly) Overparameterized Regression

We use the exact same model, $y^{(n)} = c x^{(n)} + \xi^{(n)}$, but now we explicitly consider the $\xi^{(n)}$ terms “parameters”. Now our parameters are $(c, \xi^{(1)}, \ldots, \xi^{(N)})$, and we’ll initialize them all as samples from a standard normal distribution (so our “prior” on the noise terms is the same distribution assumed in the previous regression). We then optimize $(c, \xi^{(1)}, \ldots, \xi^{(N)})$ to minimize the sum-of-squared-errors

$$\frac{1}{2} \sum_n \left(y^{(n)} - c x^{(n)} - \xi^{(n)}\right)^2$$

This ends up approximately the same as a Bayesian update on $\forall n: y^{(n)} = c x^{(n)} + \xi^{(n)}$, and our final $c$-value 1.046 is not an estimate, but rather a sample from the posterior.
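To make the contrast between the two fits concrete, here is a minimal sketch of both on synthetic data. It is not the author's code: the distribution of $x$ (standard normal), the seed, the step size, and the iteration count are all assumptions chosen for illustration, so the numbers it prints will not match the 1.010 and 1.046 quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)            # assumption: the text does not say how x was drawn
y = 1.0 * x + rng.normal(size=N)  # "true" slope of 1, standard normal noise

# Traditional regression: maximise -1/2 * sum (y - c x)^2 over the single parameter c.
c_mle = (x @ y) / (x @ x)

# Slightly overparameterized: parameters are (c, xi_1..xi_N), all initialised ~ N(0, 1).
c = rng.normal()
xi = rng.normal(size=N)
lr = 1.0 / (1.0 + x @ x)          # 1 / largest eigenvalue of the Hessian
for _ in range(5000):
    r = y - c * x - xi            # residuals under the current parameters
    # Gradient step on 1/2 * sum r^2 with respect to c and every xi_n.
    c, xi = c + lr * (r @ x), xi + lr * r

print("max |residual| after training:", np.abs(y - c * x - xi).max())
print("c (maximum likelihood):", c_mle)
print("c (trained, overparameterized):", c)
```

Re-running the second half with different seeds for the initial `c` and `xi` gives a different trained `c` each time, which is the sense in which the result is a sample rather than an estimate.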
Although the “error” in our $c$-posterior-sample here is larger than the “error” in our $c$-estimate from the previous regression, the implied line is visually identical:

[Figure: fitted line over the data]

Note that our model here is only slightly overparameterized; $k = N + 1$, so the perfect prediction surface is one-dimensional. Indeed, the perfect prediction surface is a straight line in $(c, \xi^{(1)}, \ldots, \xi^{(N)})$-space, given by the equations $y^{(n)} = c x^{(n)} + \xi^{(n)}$.

(Very) Overparameterized Regression

Usually, we say that the noise terms are normal because they’re a sum of many small independent noise sources. To make a very overparameterized model, let’s make those small independent noise sources explicit:

$$y^{(n)} = c x^{(n)} + \sqrt{\frac{3}{N}} \sum_{i=1}^{100} \xi^{(n)}_i$$

Our parameters are $c$ and the whole 2D array of $\xi$’s, with standard normal initialization on $c$, and Uniform(-1, 1) initialization on $\xi$. (The $\sqrt{3/N}$ is there to make the standard deviation equivalent to the original model: each Uniform(-1, 1) term has variance $1/3$, and here $N = 100$ matches the number of terms in the sum.)

As before, we minimize sum-of-squared-errors. This time our $c$-value is 1.031. The line still looks exactly the same.

This time, we’re much more overparameterized - we have $k = 100N + 1$, so the perfect prediction surface has dimension $99N + 1$. But conceptually, it still works basically the same as the previous example.

Code for all these is here.

In all these examples, the underlying probabilistic models are (approximately) identical. The latter two (approximately) sample from the posterior, rather than calculating a maximum-log-likelihood parameter estimate, but as long as the posterior for the slope parameter is very pointy, the result is nearly the same. The main difference is just what we call a "parameter" and optimize over, rather than integrating out.

### Thoughts On Computronium

March 4, 2021 - 00:52
Published on March 3, 2021 9:52 PM GMT

While it's widely accepted that computers are considerably faster and more powerful than the human brain, it's arguable that evolution optimized the human brain not for raw speed but for energy efficiency.
This matters because, in the limit, the theoretical computronium needs to be not only powerful but also an efficient use of resources. In the limit, assuming reasonably that entropy cannot be reversed, energy is our main practical limiting factor for a universe filled with computronium, or particularly utilitronium.

Furthermore, if natural selection has already provided a sufficiently close to optimal solution, it may make sense for Omega to fill the universe with human brain matter, perhaps even human beings enjoying lives worth living, as an optimal solution, rather than simply taking humanity's atoms and reconfiguring them into some other as yet unknown form of computronium.

This idea of the human form already being optimal has some obviously desirable characteristics for humans looking to imagine possible futures where the universe is filled with trillions upon trillions of happy humans.

So, practically speaking, how realistic is it to assume that the human brain is anywhere close to optimal, given that the theoretical limits of physics seem to leave considerable room for a high upper bound on the efficiency of computronium?

As an interesting exercise, let's look at real world supercomputers. As of this writing the world's fastest supercomputer is the Fugaku, which achieves an impressive 1000 PetaFlops in single precision mode. In comparison, the human brain is estimated to be 20 PetaFlops. However, in terms of energy efficiency, the human brain achieves that with 20 watts of power, for an effective 1 PetaFlops/watt. On the other hand, Fugaku is listed as having an efficiency of about 0.000015 PetaFlops/watt, or roughly five orders of magnitude less.

Due to mass-energy equivalence, even if we close the gap on energy efficiency in terms of wattage, another possibly dominant term in the equation is the amount of energy in the mass of the matter that makes up the computronium. Here, the distance is similar.
The human brain takes up about 1.5 kg of matter, while Fugaku is 700 tons or over 600,000 kg. The human brain thus has an effective mass efficiency of 13 PetaFlops/kg, while our best existing computer system stands at about 0.0017 PetaFlops/kg. That is four orders of magnitude less efficient.

Granted, if we assume exponential technological growth, these orders of magnitude of difference could go away. But is the growth rate actually exponential?

If we look at the numbers, in 2014, the top of the Green500 was 5 GigaFlops/watt. In 2017, it was 17 GigaFlops/watt. In 2020, it was 26 GigaFlops/watt. This is a linear rate of growth of about 3 GigaFlops/watt per year. This means that it'll be the year 300,000 AD before this reaches human levels of efficiency.

What about mass efficiency? Again, Fugaku's is 0.0017 PetaFlops/kg. IBM Summit, the previous record holder before Fugaku on the Top500, has a speed of 200 PetaFlops and weighs half as much at 340 tons or roughly 300,000 kg, which works out to 0.00067 PetaFlops/kg, and it was on the list two years in a row. If we go further back to 2011 (to find the last computer with a listed weight), the K computer had the lead with 10 PetaFlops in about 1,200,000 kg worth of hardware, which works out to about 0.0000083 PetaFlops/kg.

Note that, in relative terms, the rate of improvement over the last two years is lower than over the seven years before that. This suggests the rate of growth is slowing. Even if it were linear at the rate of the last two years, it would still take until around 14,000 AD to reach parity with the human brain.

Now, these are admittedly very rough approximations assuming that current trends continue normally, and they don't account for what a singularity or the appearance of artificial superintelligence could do.
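The arithmetic behind these comparisons is easy to recompute. The sketch below uses only the figures quoted in the text (it does not check them against any external source), and the variable names are illustrative. With these inputs the energy-efficiency gap comes out at a little under five orders of magnitude and the mass-efficiency gap at a little under four.

```python
import math

# Figures as quoted in the text (PetaFlops, watts, kilograms).
brain_pflops, brain_watts, brain_kg = 20.0, 20.0, 1.5
fugaku_pflops, fugaku_kg = 1000.0, 600_000.0
fugaku_pflops_per_watt = 0.000015

brain_per_watt = brain_pflops / brain_watts
brain_per_kg = brain_pflops / brain_kg
fugaku_per_kg = fugaku_pflops / fugaku_kg

# Gaps expressed in orders of magnitude (base-10 logarithm of the ratio).
energy_gap = math.log10(brain_per_watt / fugaku_pflops_per_watt)
mass_gap = math.log10(brain_per_kg / fugaku_per_kg)

# Linear extrapolation of Green500 efficiency: 26 GFlops/W in 2020, +3 per year.
target_gflops_per_watt = brain_per_watt * 1e6   # 1 PetaFlops = 1e6 GigaFlops
years_to_parity = (target_gflops_per_watt - 26.0) / 3.0

print(f"brain: {brain_per_watt:.2f} PFlops/W, {brain_per_kg:.1f} PFlops/kg")
print(f"Fugaku: {fugaku_per_kg:.4f} PFlops/kg")
print(f"energy gap: {energy_gap:.1f} orders of magnitude")
print(f"mass gap: {mass_gap:.1f} orders of magnitude")
print(f"years to energy parity at a linear rate: {years_to_parity:,.0f}")
```

The extrapolation is only as good as the assumption that the 3 GigaFlops/watt per year trend stays linear, which is the text's own simplifying assumption.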
In theory, we already have enough compute power to be comparable to one human brain, so if we don't care about the efficiency of it, we could conceivably emulate a human brain by sheer brute force computation.

Nevertheless, the number of orders of magnitude of difference between our existing technology and what biology has already achieved through billions of years of evolutionary refinement means that human brain matter could serve as a strong candidate for computronium for the foreseeable future, assuming that it is possible to devise programs that can run on neural matter.

Given the relatively low cost in energy, it may even make sense for a friendly artificial superintelligence to see multiplying humanity and ensuring they live desirable lives as a reasonable default option for deploying utilitronium efficiently, given uncertainty about whether, and how long before, a more efficient form can be found and mass produced.

### Texas Freeze Retrospective: meetup notes

March 3, 2021 - 17:48
Published on March 3, 2021 2:48 PM GMT

This article is a writeup of the conversation at a meetup hosted by Austin Less Wrong on Saturday, February 27, 2021. The topic was the winter weather and infrastructure crisis that took place the previous week. There were a total of 13 participants, including 8 people who were in Texas at the time and 5 who weren't. I was the note-taker but I was not in Texas myself, so replies to any comments will probably come from people other than me. Below the section break, "I" refers to whoever was speaking at the time. Thanks to everyone who contributed and helped compile these notes.

Disclaimer: I took pains to make it clear before, during, and after the meetup that I was taking notes for posting on LessWrong later. I do not endorse posting meetup write-ups without the knowledge and consent of those present!

The 2021 Texas Freeze

Personal anecdotes

I lost power Monday through Thursday.
The inside temperature dropped from 68°F to 47°F on Monday alone; over the course of the week the thermostat hit a minimum of 40°F. (Either the thermostat couldn't read any lower or the kitchen was even colder, since my olive oil solidified, which happens at 37°F.) My breath was visible indoors. I had to keep my phone off most of the time, so most of the day was spent reading books under several blankets. I had a carbon monoxide scare on Tuesday after using the fireplace. I started boiling water on Wednesday, when the order was declared in some areas of Austin but not yet mine, because it seemed likely the order would soon be extended city-wide, which indeed occurred a day later. Even after getting power back, I still couldn't get groceries: stores had long lines, and H-E-B was closed after 5pm. Gas stations were out of order.

I lost power Monday through Friday; there was some damage to a local power line. I teamed up with my neighbors. We had a fire going out back that people could warm themselves and cook things at. We didn't have much in the way of preparation supplies, but we did have candles and water bottles. We had advance notice that we might lose water, so we filled up the tub and every container we could find. (We didn't lose water, but we got the boil-water notice.) A tree branch fell and blocked our alleyway; we worked together to remove it, yielding a bunch of firewood as a side benefit. The house was well-insulated (≈50°F), but some of our warm clothing got wet, so it would've been better to have had more. My cat helped keep the bed warm, and my dog was helpful for peace of mind, what with all the strange noises at all times of the night.

I lost power starting Monday for 8 days, and water Thursday through Sunday. I survived by living within walking distance of the University of Texas campus. I went to the CompSci building and claimed a classroom to live in for the next few days.
The whole building turned into a refugee camp for computer science students; they had water and power, since the campus has its own generator. Classes were canceled from Monday till Wednesday 9 days later. On Thursday, a friend's place got power back but not water, so we stayed there but had to drive to the campus to get water every day.

I live right next to a hospital, so I never lost power. I did lose internet, but I was able to get it back by calling my service provider. I also lost water, for a total of 9 days. I regret not filling up my bathtub beforehand. Fortunately I had a few gallons of drinking water on hand, which was a lifesaver since stores were closed. I used half of it to flush the toilet once, but conserved the rest, and ate and drank a lot less than usual. I ended up filling containers from a nearby lake to use as toilet-flushing water. A nearby store was handing out filtered water for free.

I wasn't in Austin for the freeze, but I returned shortly afterward. My apartment lost power. Food in the freezer thawed and refroze. (Tip: If canned food freezes, you should throw it away.) I wasn't around to drip the faucets, but other units' residents doing so was effective. Also, the complex has gas-powered heat; it looks like it never dropped below freezing, since the houseplants survived. However, the kitchen sink still isn't working quite right.

I got lucky here, living in a rural area. I didn't lose power or water, though we lost some water pressure. I should've realized beforehand it was going to be bad, looking at the weather forecast. We have a donkey, so we had to bring him inside the garage. He didn't want to move, but once he was inside he was fine with it.

I also got lucky, and never lost power. When we realized water was in jeopardy, we filled up the tub, which was good. I wish I had kept more groceries in the house. I didn't realize that even after stores reopened, lines would be really long. I was running low by the end of the week.
I lost power Monday through Thursday. I had water but it was cold, and there was a boil-water order from Thursday to the next Monday. I booked a hotel downtown, for only 1 night initially, but I ended up staying for 4 nights. The hotel had a false fire alarm. I also lost power Monday through Thursday, though with 30 minutes of power on Tuesday. It got to 46°F in the house according to an actual thermometer. (Watch out, because sometimes a thermostat has a minimum display temperature in the 40s.) Preparedness What things were helpful to have? • Water purification: battery-powered UV light or iodine tablets. (You can take them camping.) • Giant bins, buckets, or jugs for storing boiled water. • Rolly cart for transporting water. • A home with a gas stove, otherwise I would not have been able to cook or boil water. • Outdoor grill and charcoal—I could've used this to cook if I hadn't had a gas line. However, there would've been a risk of hypothermia being outside and then unable to effectively warm up inside. I didn't actually end up using it. • Electric kettle and air fryer (for cooking without a stove), but only because we were in a UT campus building that had electricity but no stove. • Camping stove. • Mylar blankets. • Lots of warm clothing: jogging pants, ski mask, long underwear, Uniqlo Heattech (M, W), other skiing/camping gear. • REI is a good place for this stuff • "There's no such thing as bad weather, only bad clothing" • Hand and toe warmers—it's a package that generates heat chemically. You put them inside your shoes or gloves (in between two layers). • Solar panel, which was enough to keep phones charged. • Flashlights, battery-powered lantern, extra batteries. • Lighter and matches for starting gas-powered appliances. • Lots of dried and canned foods and a few MREs I had ordered for fun and never used. • Some fireplace fuel, it was mostly old newspapers and brown grocery bags which was not ideal but better than nothing. 
What things did you wish you had? • Much more firewood. • However much firewood you think you need, get 5 times that. (This is a general principle for preparedness!) • An axe for making my own firewood. • Solar generator. • Solar phone charger. • Without one I needed to keep my phone off most of the time. The ability to look up safety knowledge (e.g. how to use a fireplace safely) was very limited. If the battery had reached zero, not being able to call someone as a last resort or 911 for an emergency may have been dangerous. • Electric blanket (powered by a solar generator, if practical). • Pressure cooker. • Grains (quinoa, etc.). • Drinking water. • A Brita water filter pitcher for water that was boiled then cooled. Sediment may sometimes show up in water during a boil water notice. Knowledge and skills What knowledge and skills were useful? • Knowing about restaurants giving out free stuff. If you could access the internet and had the means to drive on ice, websites were listing places that were giving out free stuff. • Knowing your neighbors and being in good communication with them. This was a bonding experience. We were sharing firewood, candles, etc., and hanging out to relieve boredom. It'll always be the case that you have something your neighbors need, and vice-versa. • Reading books like True Grit and The Revenant—optimistic stories of survival to put you in the right mindset. (Not a depressing story like The Road.) Then you can burn the book for heat ☺︎ Miscellaneous safety knowledge that was broadcast to Texans: • Know the risks of driving on snow and ice, and be able to judge how likely your car may be to get stuck on the road. • Drip your sinks so your pipes don't freeze. Wrap outdoor faucets with a rag and duct tape. If your pipes freeze they may burst and cause flooding. 
• To avoid carbon monoxide poisoning, don't heat your home with a gas stove or oven, don't run your car in a closed garage, don't operate a charcoal grill inside a closed garage, and don't supplement your fireplace fuel with grill charcoal.

Additional safety facts that would've been good to memorize:

• Hypothermia from cold exposure is a risk when indoor temperatures fall below 60°F, more of a risk with infants or the elderly.
• Alcohol makes you feel warmer because it draws blood to your skin, resulting in increased loss of heat and increased risk of hypothermia.
• Symptoms of hypothermia (shivering, paleness, poor balance, slurred speech, confusion).
• Symptoms of carbon monoxide poisoning (headache, nausea, chest pain, dizziness, confusion).

Improvised strategies:

• How to ration firewood.
  • I had only 4 logs and a few sticks and a lot of paper (I did not intend to try lasting 4 days in the cold with that much firewood, it was just all out of stock beforehand). I used 2 logs at a time for 2 separate fires, one on Tuesday and one on Thursday.
  • More logs would've been much better than more paper. Burning the paper was labor intensive so I had to supervise the fire more, because paper burns up very quickly.
  • Burning a fire in the morning seemed to be the most helpful, because at night I'm under a pile of blankets and I don't need the house to have heat. About 90 minutes of fire burning raised the temperature by 5°F according to the battery-powered thermostat; I'm not really certain how long each fire lasted but I think it was 90 minutes.

Transportation

Driving was hard, such that some of us considered it not an option. Austin has limited infrastructure for removing ice from roads. Cars were getting stuck everywhere. There was a 10-car pileup near my place! I had to walk to the grocery store, for which having a large backpack was helpful. If you have 4-wheel drive, know how to use it, but you should still drive very carefully.
Don't pass people, and turn gradually lest you fishtail. If you had chains or snow tires you could put them on, but most people here don't have them. Chains aren't that expensive, but they're a pain to put on and off, and make for an unpleasant driving experience.

The temperature warmed up by 60°F in 24 hours after the freeze ended. This could make your tires run out of pressure. Check and see if you need to get them re-pressurized.

Uber wasn't too expensive, because they suppressed surge pricing, but that meant there weren't many rides available. A 2-mile ride was only \$14, but I tipped \$20.

Food and water

A lot of people don't know what foods are good to eat in a cold home without refrigeration. I saw posts about people throwing away butter, eggs, vegetables, and other things that would've been fine. My eggs went slightly warm in the refrigerator, but I'm still eating them and it's fine. Yogurt was good; meat was fine for a few days. Learn how to tell by smell when something is bad. When ERCOT shut down the power, they were 4 minutes away from a total statewide failure which would've lasted a month. If something like that were to happen, knowing how to stretch food supplies would be valuable.

I used the outside for refrigeration, but my eggs froze. (Incidentally, looking up the freezing point of eggs, I could only find results about human gamete preservation...) It's useful to have a cooler to fill with snow and bring inside, which protects food from animals and sub-freezing temperatures. You'll still want cold beer even when it's cold out!

It's good to have water treatment tablets, especially if you can't boil, but note that you have to let the treated water sit for 1–4 hours (depending on the brand/type of tablet) rather than drinking it immediately. Do not ingest the tablet itself.

Boiling, UV, and tablets kill organisms, but filters are necessary for removing particulates. You can cobble together a water filter by layering different types of earth—a layer of pebbles, dirt, sand, and ash. (Example)

What other kinds of disasters should we prepare for?

Multiple-whammy disasters

Shortly before the ice storm started, San Angelo, a city in West Texas, was already dealing with carcinogens in its water supply. These cannot be boiled away, so residents had to buy water, which likely went out of stock very quickly. Then the snow came, the power went out, and they couldn't drive anywhere. On top of the pandemic, it made for a quadruple-whammy. Think about combinations of different disasters.

In a way, this whole event was a weird combination of things all going wrong at once. A cascading failure: The electricity went out, causing heating to fail, which both made generating electricity even more difficult and caused water pipes to freeze.

Hot weather

We had our cold-weather disaster; what about a hot-weather disaster? What if it's really hot in the summertime and a power outage knocks out air conditioning? (The record high temperature for Texas during summer is 120°F.)

I was living on the east coast during such an event. The power was out for a few days. I spent most of my time in the basement, wearing light clothing. This could be bad for Texas, where houses generally don't have basements.

On the one hand, the Texas electrical grid is probably much more robust in heat (at least in a typical summer) than in cold, given that we more commonly deal with heat. On the other hand, the Texas grid is one of four independent grids in North America: East, West, Quebec, Texas. This can be problematic because our ability to import power is limited.

We've seen reported evidence that climate change may increase the likelihood of extreme weather events, both hot and cold, in the coming years. We don't currently see a scientific consensus on whether climate change was a contributing cause of this cold snap in particular (source).

Tips to keep cool: Fill your bathtub and soak in it, or soak your feet in a bucket of water (because your feet have lots of capillaries). Keep sunlight out of the house.

Electromagnetic pulse (EMP)

Some preppers worry about an EMP; it would be really bad. Gas and water pumps would fail.

It could be either a deliberate attack or a naturally-occurring solar storm (recent Forbes article about this, Hacker News discussion). To prepare for this, you'd have to set up a house in a remote area with lots of supplies, and have enough gasoline on hand to be able to drive there. (This is a general prepper method.)

Existential risk from dependency on technology

When technology is developed and people start depending on it, its failure can have a worse outcome than if the technology had never been available at all. The electrical grid is one example; others include modern medicine and the logistical infrastructure for transporting food to populated areas.

There could be x-risk from eliminating death: if the population ages past fertility and the means of eliminating death is then lost, humanity will die out.

Biological

Biological warfare, or a naturally-occurring pandemic: Imagine a disease much worse and more contagious than COVID-19. When COVID-19 began, people were desensitized because of bird flu and other such false alarms.

Civil disorder

Civil disorder can be initiated by some exogenous shock such as a hurricane or loss of food supply. Is it plausible that it could happen for an entirely endogenous cause? Maybe when some political situation arises where a lot of people think they have no other option than violence, e.g. the Troubles in Northern Ireland. But it seems like the modern state has more capacity to manage violence than in previous times.

Militia violence is more likely than state action.

Can you prepare for a dictatorship or totalitarian surveillance state? Prepare to leave the country; marry someone with another citizenship. It's hard to imagine that a dictatorship could spread to other countries in the way that Nazi Germany or Soviet Russia did. The interwar period was more fragile than things are now, in terms of the risk of mass killings, and people are more able to flee if things get bad.

But there were people who, by the time they knew they wanted to leave their country, couldn't. In the case of Nazi Germany, there was the situation of the MS St. Louis; Jewish survivors are especially paranoid because of this. Article: When is it time to leave your country? There are no clear answers, but keep journaling and write down your criteria. Otherwise, things will change gradually and everything will seem normal when it happens. Watch out for "emergency powers."

Or maybe anticipating specific events isn't the right way to think about it; instead, think "This country is a total mess, so something bad is going to happen even if I can't think of what." SSC article The Influenza of Evil: We already have antibodies against things like Nazism and Communism, so if mass death occurs in the US, it may be due to something that doesn't pattern-match to either of those things.

Bugging out

When would having a bug-out bag be useful? Mass rioting/looting—but you'd probably have a bit more time to pack than just a few minutes. In summer 2020, San Antonio had one hour's notice. But it might be that stores are more vulnerable than homes. Also last summer, people living in CHAZ might've wanted to leave on short notice.

If your house is burning down, having a bug-out bag is good. In less extreme cases, having supplies to leave your house for a few days is useful, as it was during the Texas freeze. Packing the bag in advance is helpful because it's stressful to have to remember all the different things you need (e.g. I keep a spare toothbrush and toothpaste in my suitcase, because I always forget them otherwise).

Relatedly, keeping a bag of extra clothing and supplies in your car in case it breaks down in the middle of nowhere is good practice.

Is AI risk preppable?

Be as illegible as possible, so the AI doesn't know where to find your isolated wilderness hideout or that you exist at all. But this isn't helpful against nanobots. In an unfriendly AI takeoff scenario, you probably won't survive very long regardless of where you are.

Context: AI Impacts 2020 review

Think about career security: Is your job still relevant with AI in the picture?

There's a spectrum of takeoff scenarios. A hard takeoff is too fast to react to; a slower takeoff might make things more difficult and displace people's jobs. But there's also an intermediate case, where AI can still do a lot of harm short of existential. E.g.: terrorist drones that can be deployed by anyone untraceably; robot robbery; weaponized self-driving cars.
