# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 1 hour 5 minutes ago

### Washington DC Meetup: Reading Discussion

9 hours 21 minutes ago
Published on June 16, 2021 11:54 PM GMT

First, a meta note. For those of you who don't know Maia and me, we used to organize the DC meetup until 2014. We've been in the Bay Area since then, but just moved back. I'm looking forward to meeting or re-meeting all of y'all!

We'll be discussing "Avoid News, Part 2: What the Stock Market Taught Me about News" and "Your Book Review: Progress And Poverty". We'd prefer it if you read the articles beforehand, but there will be printed copies available.

Note that this isn't the usual spot: the Smithsonian is still requiring tickets, so we can't have the meetup in the Kogod Courtyard. We'll be meeting at Navy Memorial Plaza nearby. I'll have a sign saying "DC LessWrong".

For help finding the group, please call (or text, with a likely-somewhat-slower response rate): 301-458-0764.


### Neo-Mohism

11 hours 17 minutes ago
Published on June 16, 2021 9:57 PM GMT

“When one advances claims, one must first establish a standard of assessment. To make claims in the absence of such a standard is like trying to establish on the surface of a spinning potter’s wheel where the sun will rise and set. Without a fixed standard, one cannot clearly ascertain what is right and wrong, or what is beneficial and harmful.”

-Mozi, ‘A Condemnation of Fatalism’

Epistemic status

This is more meant as a description of my personal moral philosophy than a prescriptive philosophy/religion for others to follow, but I hope to refine it to the point where many people would be on-board with it while still being useful. Please be charitable with your critique, as this is a work in progress (as my own moral philosophy is) which I whipped up in the span of an hour, and I have yet to settle on where Neo-Mohism ends and my personal ethics begin (if anywhere!). But do let me know what should be clarified more, or what you think is incorrect!

High Concept

Humans accomplish goals more effectively when they cooperate, but humans have different goals and different ways they wish to cooperate.

Inspired by the ancient philosophy of Mohism, this is an attempt to find a universal framework that most systems can agree on, with an emphasis on argumentation and consequentialism.

Neo-Mohism is primarily a meta-ethic, but also includes some moral conclusions.

Tenets
• All tenets are subject to revision. The ultimate arbiter of this philosophy is the ability to make advance, falsifiable predictions, allowing the universe to judge between competing ideas.
• All tenets should be valid and sound, in accordance with the logical absolutes (non-contradiction, identity, and excluded middle). If the logical absolutes are somehow disproved or changed, this is subject to revision.
• Any moral framework adopted should avoid "repugnant conclusions" that violate the principles of wellbeing (below).
• The word "morality" within this system is defined as "[X action] is that which ought to be done, given the goal of [Y]." As such, Neo-Mohism does not concern itself with any is/ought problem, and uses Consequentialism as a meta-ethic. The only problem to solve is whether the goal is shared, and how you determine what accomplishes that goal.
• The goal of Neo-Mohism is "maximize the wellbeing of conscious creatures".
Wellbeing

Wellbeing is defined as "Suffering is bad (you are not well if you are suffering), and death is bad (you are not being if you are dead)", and applies only to conscious creatures who exist or who we are confident will exist (creatures that neither exist nor will ever exist do not have a "being" in any meaningful sense). Furthermore, the Neo-Mohist pillars of wellbeing are “Happiness”, "Truth", "Freedom", "Responsibility", and “Life”.

"Happiness" is defined as "Hedonic Enjoyment" (the good feeling I get from drinking a glass of Boba Tea, the warmth I feel when I bask in the sun, the entertainment I get from watching a good movie, etc) and “Emotional Fulfillment” (having an intellectually-stimulating relationship, doing hard work aligned with one’s values, being mindful in the present moment, and self improvement). Neo-Mohists want themselves and everyone to be happy. Especially, they do not want anyone to suffer, as suffering is a problematic barrier to happiness in a way that lesser pleasure is not.

“Truth” is defined as “that which comports with external reality”, or “that which exists regardless of what one believes”. There is an objective reality that we experience, and we can learn about this reality through observation and reason. Bayesian Rationality is the epistemology of choice for Neo-Mohists, who desire to know what is true for its own sake, and want everyone to believe true things.

“Freedom” is defined as “agency to make informed decisions”, from which we derive the importance of consent.

“Responsibility” can best be defined by Gandhi’s saying “Be the change you wish to see in the world.” Many things exist outside our control, but it’s merely a matter of degree. If something goes wrong, a Neo-Mohist must admit any part they played in it. Other people often cannot be relied upon to do the right thing. It is not enough to say “this has been delegated”. If a Neo-Mohist sees suffering that they can do something about, it is their responsibility to try and fix it.

“Life” is defined as the active state of a sentient creature. "Sentience" is defined as "anything that can suffer or reflect on the concept of suffering". Life provides meaning to an otherwise meaningless universe. All sentient creatures have moral weight (humans, animals, AI, aliens, etc), including ones far away from you (distance does not matter).
Because all sentient beings have moral weight, it is immoral to subjugate, kill or eat them.
"Sapience" is of special concern; the ability to be aware of one’s awareness and the ability to make moral judgments. Humanity is the only known species to have this trait, and therefore the loss of humanity would be a terrible thing indeed, reducing our universe to a meaningless machine. Misanthropy is misguided at best, dangerous and immoral at worst.

Derived from "Freedom" and "Life": Everyone has the right to use technology to be whatever they wish to be, including *not dead*.

Derived from "Truth" and "Freedom": deception is a hostile act that deprives people of the ability to have a relationship with you based on fully informed consent. As such, any deception (including lies of omission) is an immoral act (although sometimes immoral acts must be performed to avoid more immoral acts). The closer your relationship, the more important honesty is.


### Variables Don't Represent The Physical World (And That's OK)

June 16, 2021 - 22:05
Published on June 16, 2021 7:05 PM GMT

Suppose I’m a fish farmer, running an experiment with a new type of feed for my fish. In one particular tank, I have 100 fish, and I measure the weight of each fish. I model the weight of each fish as an independent draw from the same distribution, and I want to estimate the mean of that distribution.

Key point: even if I measure every single one of the 100 fish, I will still have some uncertainty in the distribution-mean. Sample-mean is not distribution-mean. My ability to control the fish’s environment is limited; if I try to re-run the experiment with another tank of fish, I don’t think I’ll actually be drawing from the same distribution (more precisely, I don’t think the “same distribution” model will correctly represent my information anymore). Once I measure the weights of each of the 100 fish, that’s it - there are no more fish I can measure to refine my estimate of the distribution-mean by looking at the physical world, even in principle. Maybe I could gain some more information with detailed simulations and measurements of tank-parameters, but that would be a different model, with a possibly-different distribution-mean.

The distribution-mean is not fully determined by variables in the physical world.
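
As a minimal sketch of the fish example, the snippet below draws 100 weights from an assumed normal distribution and reports the sample mean with its standard error. The distribution parameters (500 g mean, 50 g standard deviation), the seed, and the helper name `estimate_mean` are all hypothetical choices for illustration; none of them come from the post.

```python
import math
import random
import statistics

def estimate_mean(weights):
    """Return the sample mean and its standard error for a list of measurements."""
    n = len(weights)
    sample_mean = statistics.fmean(weights)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    standard_error = statistics.stdev(weights) / math.sqrt(n)
    return sample_mean, standard_error

# Assumed model (illustrative only): weights in grams drawn from Normal(500, 50).
rng = random.Random(0)
TRUE_MEAN, TRUE_SD = 500.0, 50.0
weights = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(100)]

sample_mean, standard_error = estimate_mean(weights)
print(f"sample mean = {sample_mean:.1f} g, standard error = {standard_error:.1f} g")
```

Even with every fish in the tank measured, the standard error stays above zero: the sample mean pins down the distribution-mean only up to roughly that margin.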

And this isn’t some weird corner-case! This is one of the simplest, most prototypical use-cases of probability/statistics. It’s in intro classes in high-school.

Another example: temperature. We often describe temperature, intuitively, as representing the average kinetic energy of molecules, bouncing around microscopically. But if we look at the math, temperature works exactly like the distribution-mean in the fish example: it doesn’t represent the actual average energy of the particles, it represents the mean energy of some model-distribution from which the actual particle-energies are randomly drawn. Even if we measured the exact energy of every particle in a box, we’d still have nonzero (though extremely small) uncertainty in the temperature.

One thing to note here: we’re talking about purely classical uncertainty. This has nothing to do with quantum mechanics or the uncertainty principle. Quantum adds another source of irresolvable uncertainty, but even in the quantum case we will also have irresolvable classical uncertainty.

Clusters

Clustering provides another lens on the same phenomenon. Consider this clustering problem:

Let’s say that the lower-left cluster contains trees, the upper-right apples, and the lower-right pencils. I want to estimate the distribution-mean of some parameter for the tree-cluster.

If there’s only a limited number of data points, then this has the same inherent uncertainty as before: sample mean is not distribution mean. But even if there’s an infinite number of data points, there’s still some unresolvable uncertainty: there are points which are boundary-cases between the “tree” cluster and the “apple” cluster, and the distribution-mean depends on how we classify those. There is no physical measurement we can make which will perfectly tell us which things are “trees” or “apples”; this distinction exists only in our model, not in the territory. In turn, the tree-distribution-parameters do not perfectly correspond to any physical things in the territory.
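
A toy illustration of the boundary-case point, under made-up numbers: the one-dimensional feature values and the `tree_cluster_mean` helper below are hypothetical. The same data give different "tree" means depending on how the ambiguous points are labeled.

```python
import statistics

def tree_cluster_mean(clear_points, boundary_points, include_boundary):
    """Mean of the 'tree' cluster under one choice of labels for the boundary cases."""
    points = clear_points + boundary_points if include_boundary else clear_points
    return statistics.fmean(points)

# Hypothetical feature values (illustrative only).
clear_trees = [1.0, 1.2, 0.8, 1.1, 0.9]
boundary_points = [2.4, 2.6]  # ambiguous between "tree" and "apple"

mean_if_excluded = tree_cluster_mean(clear_trees, boundary_points, include_boundary=False)
mean_if_included = tree_cluster_mean(clear_trees, boundary_points, include_boundary=True)
print(f"boundary excluded: {mean_if_excluded:.3f}, boundary included: {mean_if_included:.3f}")
```

No additional measurement of the points changes which answer is "right"; the difference comes entirely from the labeling choice made in the model.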

My own work on abstraction implicitly claims that this applies to high-level concepts more generally (indeed, Bayesian clustering is equivalent to abstraction-discovery, under this formulation). It still seems likely that a wide variety of cognitive algorithms would discover and use similar clusters - the clusters themselves are “natural” in some sense. But that does not mean that the physical world fully determines the parameters. It’s just like temperature: it’s a natural abstraction, which I’d expect a wide variety of cognitive processes to use in order to model the world, but that doesn’t mean that the numerical value of a temperature is fully determined by the physical world-state.

Takeaway: the variables in our world-models are not fully determined by features of the physical world, and this is typical: it’s true of most of the variables we use day-to-day.

Our Models Still Work Just Fine

Despite all this, our models still work just fine. Temperature is still coupled to things in the physical world, we can still measure it quite precisely, it still has lots of predictive power, and it’s useful day-to-day. Don't be alarmed; it all adds up to normality.

The variables in our models do not need to correspond to anything in the physical world in order to be predictive and useful. They do need to be coupled to things in the physical world: we need to be able to gain some information about the variables by looking at the physical world and vice versa. But there are variables in our world models which are not fully determined by physical world-state. Even if we knew exactly the state of the entire universe, there would still be uncertainty in some of the variables in our world-models, and that’s fine.


### [AN #152]: How we’ve overestimated few-shot learning capabilities

June 16, 2021 - 20:20
Published on June 16, 2021 5:20 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

True Few-Shot Learning with Language Models (Ethan Perez et al) (summarized by Rohin): We can get GPT-3 (AN #102) to perform useful tasks using “prompt programming”, in which we design an input sentence such that the most likely continuation of that sentence would involve GPT-3 performing the task of interest. For example, to have GPT-3 answer questions well, we might say something like “The following is a transcript of a dialogue with a helpful, superintelligent, question-answering system:”, followed by a few example question-answer pairs, after which we ask our questions.

Since the prompts only contain a few examples, this would seem to be an example of strong few-shot learning, in which an AI system can learn how to do a task after seeing a small number of examples of that task. This paper contends that while GPT-3 is capable of such few-shot learning, the results reported in various papers exaggerate this ability. Specifically, while it is true that the prompt only contains a few examples, researchers often tune their choice of prompt by looking at how well it performs on a relatively large validation set -- which of course contains many examples of performing the task, something we wouldn’t expect to have in a true few-shot learning context.

To illustrate the point, the authors conduct several experiments where we start with around 12 possible prompts and must choose which to use based only on the examples given (typically 5). They test two methods for doing so:

1. Cross-validation: Given a prompt without examples, we attach 4 of the examples to the prompt and evaluate it on the last example, and average this over all possible ways of splitting up the examples.

2. Minimum description length: While cross-validation evaluates the final generalization loss on the last example after updating on previous examples, MDL samples an ordering of the examples and then evaluates the average generalization loss as you feed the examples in one-by-one (so more like an online learning setup).
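
The two selection criteria above can be sketched roughly as follows. This is not the paper's code: `toy_loss` and `PROMPT_PENALTY` are hypothetical stand-ins for a language model's loss on a held-out example, and the cross-validation here ignores the ordering of the context examples for simplicity.

```python
import random
from statistics import fmean

def cross_validation_score(prompt, examples, loss):
    """Leave-one-out: condition on all other examples, score the held-out one."""
    losses = []
    for i, held_out in enumerate(examples):
        context = examples[:i] + examples[i + 1:]
        losses.append(loss(prompt, context, held_out))
    return fmean(losses)

def mdl_score(prompt, examples, loss, rng):
    """Sample one ordering, then average the loss on each example given its predecessors."""
    order = list(examples)
    rng.shuffle(order)
    return fmean([loss(prompt, order[:i], order[i]) for i in range(len(order))])

# Hypothetical stand-in for a model's loss: a fixed per-prompt penalty plus a
# term that shrinks as more examples are placed in the context.
PROMPT_PENALTY = {"prompt A": 0.9, "prompt B": 0.4, "prompt C": 0.6}

def toy_loss(prompt, context, target):
    return PROMPT_PENALTY[prompt] + 1.0 / (1 + len(context))

examples = [(f"question {i}", f"answer {i}") for i in range(5)]
prompts = list(PROMPT_PENALTY)

best_by_cv = min(prompts, key=lambda p: cross_validation_score(p, examples, toy_loss))
best_by_mdl = min(prompts, key=lambda p: mdl_score(p, examples, toy_loss, random.Random(0)))
print(best_by_cv, best_by_mdl)
```

Both criteria use only the five given examples, which is the point of the "true few-shot" setting: no separate validation set is consulted when picking the prompt.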

On the LAMA-UHN task, the difference between a random prompt and the best prompt looks to be roughly 5-6 percentage points, regardless of model size. Using MDL or cross-validation usually gives 20-40% of the gain, so 1-2 percentage points. This suggests that on LAMA-UHN, typical prompt-based “few-shot” learning results are likely 3-5 percentage points higher than what you would expect if you were in a true few-shot setting where there is no validation set to tune on.
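
The arithmetic behind those ranges can be checked directly. The helper name `inflation_range` is hypothetical; the inputs are the rough figures quoted in the summary above (a 5-6 point gap, of which 20-40% is recovered).

```python
def inflation_range(gap_pp, recovered_fraction):
    """Return (recovered, inflation) ranges in percentage points.

    gap_pp: (low, high) gap between a random prompt and the best prompt.
    recovered_fraction: (low, high) share of that gap recovered without a validation set.
    """
    low_gap, high_gap = gap_pp
    low_frac, high_frac = recovered_fraction
    recovered = (low_gap * low_frac, high_gap * high_frac)
    # Whatever is not recovered is the inflation from tuning on a validation set.
    inflation = (low_gap * (1 - high_frac), high_gap * (1 - low_frac))
    return recovered, inflation

recovered, inflation = inflation_range(gap_pp=(5.0, 6.0), recovered_fraction=(0.2, 0.4))
print(f"recovered: {recovered[0]:.1f}-{recovered[1]:.1f} pp")
print(f"inflation: {inflation[0]:.1f}-{inflation[1]:.1f} pp")
```

The extremes come out at about 1.0-2.4 points recovered and 3.0-4.8 points of inflation, which matches the rounded "1-2" and "3-5" figures.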

But it may actually be worse than that. We’ve talked just about the prompt so far, but the validation set can also be used to improve hyperparameters, network architecture, the design of the learning algorithm etc. This could also lead to inflated results. The authors conduct one experiment with ADAPET on SuperGLUE which suggests that using the validation set to select hyperparameters can also lead to multiple percentage points of inflation.

Rohin's opinion: The phenomenon in this paper is pretty broadly applicable to any setting in which a real-world problem is studied in a toy domain where there is extra information available. For example, one of my projects at Berkeley involves using imitation learning on tasks where there really isn’t any reward function available, and it’s quite informative to see just how much it slows you down when you can’t just look at how much reward your final learned policy gets; research becomes much more challenging to do. This suggests that performance on existing imitation learning benchmarks is probably overstating how good we are at imitation learning, because the best models in these situations were probably validated based on the final reward obtained by the policy, which we wouldn’t normally have access to.

TECHNICAL AI ALIGNMENT
TECHNICAL AGENDAS AND PRIORITIZATION

High Impact Careers in Formal Verification: Artificial Intelligence (Quinn Dougherty) (summarized by Rohin): This post considers the applicability of formal verification techniques to AI alignment. Now in order to “verify” a property, you need a specification of that property against which to verify. The author considers three possibilities:

1. Formally specifiable safety: we can write down a specification for safe AI, and we’ll be able to find a computational description or implementation

2. Informally specifiable safety: we can write down a specification for safe AI mathematically or philosophically, but we will not be able to produce a computational version

3. Nonspecifiable safety: we will never write down a specification for safe AI.

Formal verification techniques are applicable only to the first case. Unfortunately, it seems that no one expects the first case to hold in practice: even CHAI, with its mission of building provably beneficial AI systems, is talking about proofs in the informal specification case (which still includes math), on the basis of comments like these in Human Compatible. In addition, it currently seems particularly hard for experts in formal verification to impact actual practice, and there doesn’t seem to be much reason to expect that to change. As a result, the author is relatively pessimistic about formal verification as a route to reducing existential risk from failures of AI alignment.

LEARNING HUMAN INTENT

Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation (Sam Devlin, Raluca Georgescu, Ida Momennejad, Jaroslaw Rzepecki, Evelyn Zuniga et al) (summarized by Rohin): Since rewards are hard to specify, we are likely going to have to train AI agents using human feedback. However, human feedback is particularly expensive to collect, so we would like to at least partially automate this using reward models. This paper looks at one way of building such a reward model: training a classifier to distinguish between human behavior and agent behavior (i.e. to be the judge of a Turing Test). This is similar to the implicit or explicit reward model used in adversarial imitation learning algorithms such as GAIL (AN #17) or AIRL (AN #17).

Should we expect these classifiers to generalize, predicting human judgments of how human-like a trajectory is on all possible trajectories? This paper conducts a user study in order to answer the question: specifically, they have humans judge several of these Turing Tests, and see whether the classifiers agree with the human judgments. They find that while the classifiers do agree with human judgments when comparing a human to an agent (i.e. the setting on which the classifiers were trained), they do not agree with human judgments when comparing two different kinds of artificial agents. In fact, it seems like they are anti-correlated with human judgments, rather than simply having no correlation at all -- only one of the six classifiers tested does better than chance (at 52.5%), the median is 45%, and the worst classifier gets 22.5%. (Note however that the sample size is small, I believe n = 40 though I’m not sure.)
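
To get a feel for how much the small sample matters, here is a rough check that treats each comparison as an independent coin flip. That independence, and the sample size of 40 (which the summary itself is unsure of), are assumptions; `binomial_cdf` is a hypothetical helper.

```python
from math import comb

def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

N_COMPARISONS = 40  # assumption: the summary is not sure of the true sample size

for accuracy in (0.525, 0.45, 0.225):
    correct = round(accuracy * N_COMPARISONS)
    # Probability of agreeing with humans this rarely (or less) by pure chance.
    lower_tail = binomial_cdf(correct, N_COMPARISONS)
    print(f"{accuracy:.1%} ({correct}/{N_COMPARISONS}): P(X <= {correct}) = {lower_tail:.4f}")
```

Under these assumptions, the accuracies near 50% are well within what chance would produce, while the 22.5% result is far enough below chance to be unlikely as a pure fluke.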

Rohin's opinion: Ultimately my guess is that if you want to predict human judgments well, you need to train against human judgments, rather than the proxy task of distinguishing between human and agent behavior. That being said, I do think these proxy tasks can serve as valuable pretraining objectives, or as auxiliary objectives that help to improve sample efficiency.

FORECASTING

AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors (Daniel Filan and Ajeya Cotra) (summarized by Rohin): This podcast goes over the biological anchors framework (AN #121), as well as three other (AN #105) approaches (AN #145) to forecasting AI timelines and a post on aligning narrowly superhuman models (AN #141). I recommend reading my summaries of those works individually to find out what they are. This podcast can help contextualize all of the work, adding in details that you wouldn’t naturally see if you just read the reports or my summaries of them.

For example, I learned that there is a distinction between noise and effective horizon length. To the extent that your gradients are noisy, you can simply fix the problem by increasing your batch size (which can be done in parallel). However, the effective horizon length is measuring how many sequential steps you have to take before you get feedback on how well you’re doing. The two are separated in the bio anchors work because the author wanted to impose specific beliefs on the effective horizon length, but was happy to continue extrapolating from current examples for noise.
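
The noise half of that distinction can be illustrated with a small simulation. The true gradient (1.0), the per-sample noise (standard deviation 4.0), and the function names are hypothetical choices for this sketch, which is not taken from the bio anchors report.

```python
import random
import statistics

def batch_gradient(true_gradient, noise_sd, batch_size, rng):
    """Average of `batch_size` noisy per-sample gradient estimates."""
    samples = [rng.gauss(true_gradient, noise_sd) for _ in range(batch_size)]
    return statistics.fmean(samples)

def empirical_noise(batch_size, trials=2000, seed=0):
    """Standard deviation of the batch gradient across many independent batches."""
    rng = random.Random(seed)
    estimates = [batch_gradient(1.0, 4.0, batch_size, rng) for _ in range(trials)]
    return statistics.stdev(estimates)

for batch_size in (1, 16, 64):
    print(f"batch size {batch_size:>2}: gradient noise ~ {empirical_noise(batch_size):.2f}")
```

The noise shrinks roughly as one over the square root of the batch size, and every sample in a batch can be computed in parallel. Nothing comparable shortens the horizon length, since those steps must happen one after another before any feedback arrives.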

FIELD BUILDING

AI Safety Career Bottlenecks Survey Responses Responses (Linda Linsefors) (summarized by Rohin): A past survey asked for respondents’ wish list of things that would be helpful and/or make them more efficient (with respect to careers in AI safety). This post provides advice for some of these wishes. If you’re trying to break into AI safety work, this seems like a good source to get ideas on what to try or resources that you hadn’t previously seen.

MISCELLANEOUS (ALIGNMENT)

"Existential risk from AI" survey results (Rob Bensinger) (summarized by Rohin): This post reports on the results of a survey sent to about 117 people working on long-term AI risk (of which 44 responded), asking about the magnitude of the risk from AI systems. I’d recommend reading the exact questions asked, since the results could be quite sensitive to the exact wording, and as an added bonus you can see the visualization of the responses. In addition, respondents expressed a lot of uncertainty in their qualitative comments. And of course, there are all sorts of selection effects that make the results hard to interpret.

Keeping those caveats in mind, the headline numbers are that respondents assigned a median probability of 20% to x-risk caused due to a lack of enough technical research, and 30% to x-risk caused due to a failure of AI systems to do what the people deploying them intended, with huge variation (for example, there are data points at both ~1% and ~99%).

Rohin's opinion: I know I already harped on this in the summary, but these numbers are ridiculously non-robust and involve tons of selection biases. You probably shouldn’t conclude much from them about how much risk from AI there really is. Don’t be the person who links to this survey with the quote “experts predict 30% chance of doom from AI”.

Survey on AI existential risk scenarios (Sam Clarke et al) (summarized by Rohin): While the previous survey asked respondents about the overall probability of existential catastrophe, this survey seeks to find which particular risk scenarios respondents find more likely. The survey was sent to 135 researchers, of which 75 responded. The survey presented five scenarios along with an “other”, and asked people to allocate probabilities across them (effectively, conditioning on an AI-caused existential catastrophe, and then asking which scenario happened).

The headline result is that all of the scenarios were roughly equally likely, even though individual researchers were opinionated (i.e. they didn’t just give uniform probabilities over all scenarios). Thus, there is quite a lot of disagreement over which risk scenarios are most likely (which is yet another reason not to take the results of the previous survey too seriously).

AI GOVERNANCE

Some AI Governance Research Ideas (Alexis Carlier et al) (summarized by Rohin): Exactly what it says.

OTHER PROGRESS IN AI
DEEP LEARNING

The Power of Scale for Parameter-Efficient Prompt Tuning (Brian Lester et al) (summarized by Rohin): The highlighted paper showed that prompt programming as currently practiced depends on having a dataset on which prompts can be tested. If we have to use a large dataset anyway, then could we do better by using ML techniques like gradient descent to choose the prompt? Now, since prompts are discrete English sentences, you can’t calculate gradients for them, but we know how to deal with this -- the first step of a language model is to embed English words (or syllables, or bytes) into a real-valued vector, after which everything is continuous. So instead of using gradient descent to optimize the English words in the prompt, we instead optimize the embeddings directly. Another way of thinking about this is that we have our “prompt” be a sentence of (say) 50 completely new words, and then we optimize the “meaning” of those words such that the resulting sequence of 50 newly defined words becomes a good prompt for the task of interest.

The authors show that this approach significantly outperforms the method of designing prompts by hand. While it does not do as well as finetuning the full model on the task of interest, the gap between the two decreases as the size of the model increases. At ~10 billion parameters, the maximum size tested, prompt tuning and model tuning are approximately equivalent.

In addition, using a prompt is as simple as prepending the new prompt embedding to your input and running it through your model. This makes it particularly easy to do ensembling: if you have N prompts in your ensemble, then given a new input, you create a batch of size N where the ith element consists of the ith prompt followed by the input, and run that batch through your model to get your answer. (In contrast, if you had an ensemble of finetuned models, you would have to run N different large language models for each input, which can be significantly more challenging.)
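
The prepend-and-batch step can be sketched with plain Python lists standing in for tensors. Only the input assembly is shown; the gradient-based optimization of the prompt embeddings and the language model itself are omitted. The dimensions and the function names (`prepend_prompt`, `ensemble_batch`, `random_vectors`) are hypothetical.

```python
import random

def random_vectors(count, dim, rng):
    """Stand-in for learned or looked-up embeddings: `count` vectors of size `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

def prepend_prompt(prompt_vectors, input_vectors):
    """Place the tuned prompt embeddings in front of the embedded input tokens."""
    return list(prompt_vectors) + list(input_vectors)

def ensemble_batch(prompts, input_vectors):
    """Build one batch whose ith element is the ith prompt followed by the same input."""
    return [prepend_prompt(prompt, input_vectors) for prompt in prompts]

rng = random.Random(0)
EMBED_DIM, PROMPT_LEN, INPUT_LEN, N_PROMPTS = 8, 50, 5, 3  # illustrative sizes

prompts = [random_vectors(PROMPT_LEN, EMBED_DIM, rng) for _ in range(N_PROMPTS)]
input_vectors = random_vectors(INPUT_LEN, EMBED_DIM, rng)

batch = ensemble_batch(prompts, input_vectors)
print(len(batch), len(batch[0]), len(batch[0][0]))  # batch size, sequence length, embedding size
```

A single forward pass of one shared model over this batch would then yield all N ensemble predictions, which is what makes prompt ensembles cheaper than ensembles of separately finetuned models.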

NEWS

AI Safety Research Project Ideas (Owain Evans et al) (summarized by Rohin): In addition to a list of research project ideas, this post also contains an offer of mentorship and/or funding. The deadline to apply is June 20.

Research Fellow- AI TEV&V (summarized by Rohin): CSET is currently seeking a Research Fellow to focus on the safety and risk of deployed AI systems.

Deputy Director (CSER) (summarized by Rohin): The Centre for the Study of Existential Risk (CSER) is looking to hire a Deputy Director.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.


### Announcing the Nuclear Risk Forecasting Tournament

June 16, 2021 - 19:16
Published on June 16, 2021 4:16 PM GMT

Summary: From June 22, Rethink Priorities and Metaculus will run a Nuclear Risk Forecasting Tournament to help inform funding, policy, research, and career decisions aimed at reducing existential risks. The starting prize pool is $2,500. We would be excited for you to help via making forecasts, participating in the Discussion Forum, increasing forecaster rewards by becoming a monthly Supporter, and/or reaching out to relevant people to suggest they participate.

In 1941, no one had ever built a nuclear weapon, and most people had no idea that anyone ever might. Over the following 80 years, the Allies built such weapons, the US detonated two on cities, and scientists and engineers developed or proposed many new and vastly more destructive types of nuclear weapons. Global stockpiles peaked at 70,000 warheads (compared to today’s ~13,000), and the total yield of the US arsenal alone peaked at over a million times the yield of the bomb dropped on Hiroshima.

In 2021, nine states are believed to possess nuclear weapons (Russia, the United States, China, France, the UK, Pakistan, India, Israel, and North Korea). In theory, they could launch them at any time, this could directly cause millions of deaths, and this could perhaps indirectly, via nuclear winter, cause billions of deaths and even the permanent loss of our potential to reach a flourishing future.

On the other hand, nuclear weapons have never again been used in war since 1945, warhead numbers and total yields have declined dramatically since their peaks, and if nuclear conflict does occur it might involve low numbers of weapons, low yields, or targets that result in relatively low fatality levels and no nuclear winter.

So how likely is each of the strikingly different possible futures of nuclear risks? And what can and should we do about this: which goals should we pursue, and how can we use our limited resources to best achieve them?

Rethink Priorities is investigating these questions in order to help improve funding, policy, research, and career decisions aimed at reducing existential risks. The scope and theory of change for this work is discussed in more detail here. These questions are complex, high-stakes, and time-sensitive, so we want to draw on the skills and perspectives of a community of forecasters.

To do so, we’re partnering with Metaculus to run the Nuclear Risk Forecasting Tournament, hosted within a new Metaculus Forecasting Cause initiative focused on Flourishing Futures. Forecasts, comments, and essays produced for this tournament will inform and in some cases be featured within our other research, and some will be packaged for and directly shared with some key decision-makers.

The tournament will unfold across three rounds, with a mix of calibration, long-term, and fortified essay questions in the following topics:
• Armed Conflict
• Initiation of Nuclear Conflict
• Escalation and Targeting
• States’ Arsenals
• Policies for Nuclear Weapons Use
• Technological Developments
• Climate and Famine Effects
• Interventions

Round 1 is opening on June 22. We would be excited for you to help in any of the following ways:

1. Join the Nuclear Risk Tournament and make forecasts when it opens! The starting prize pool is $2,500, which will be split across four prizes for forecasting performance and one for Fortified Essays. The tournament closes on December 31, 2022, and the prizes will be distributed within two weeks of that time.
2. Share your expertise, brainstorm forecasting questions, and update the community with the latest research in the Flourishing Futures Discussion Forum.
3. Increase forecaster rewards by becoming a monthly Supporter.
4. Reach out to researchers and experts to enlist their help furthering this cause by generating new shared knowledge on the Metaculus platform.

We also hope this tournament will provide a space for experienced forecasters to learn about and share insights on nuclear risks, for experts on nuclear risk to learn about and share insights on forecasting, for people with interest but less experience to start engaging and getting up to speed, and for fruitful connections to be made between these communities. We welcome you to join the tournament community.

Credits

This research is a project of Rethink Priorities. It was written by Michael Aird and cross-posted to Medium and the EA Forum. Thanks to Gaia Dempsey, Peter Wildeford, and Linch Zhang for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can see more of our work here.


### AI-Based Code Generation Using GPT-J-6B

June 16, 2021 - 18:05
Published on June 16, 2021 3:05 PM GMT

Above is a link to an interesting post about synthetic code generation with a transformer model trained on The Pile, which contains a large chunk of GitHub and StackOverflow. Due to CommonCrawl's deficiency in this area, the much smaller GPT-J-6B outperforms OpenAI’s largest publicly available GPT-3 models. The performance is impressive enough that one wonders how capable a 100+ billion parameter model trained on The Pile will be, let alone what an AlphaGo-level engineering effort towards the end of synthetic code generation would achieve.

As the The Pile was created to provide a dataset for 100 billion paramater+ models, we may not have to wait long. The examples in the post are clearly trivial, but I personally take this to be something of a fire alarm. I was not previously aware of how poorly-optimized GPT-3 was for code generation, and I have updated toward surprising gains in this area in the next few years.

I no longer consider agents with superhuman performance in competitive programming to be a ridiculous thing to pursue.

It is useful to remind myself of how shocked I would have been to see such things in 2012. In 2012 I would have taken this as a sign that AGI was near.

Scenario-based planning postulates that one should predict symptoms emblematic of a given scenario and then robotically assume you are in said scenario once a sufficient number of these symptoms come to pass. I am unsure whether there is wisdom in this approach, but I find it a discomfiting line of thought.


### MIRIx Part I: Insufficient Values

June 16, 2021 - 17:33
Published on June 16, 2021 2:33 PM GMT

I wish this were crossposted from the AI Alignment Forum.  May contain more technical jargon than usual.

I’m Jose.  I realized recently I wasn’t taking existential risk seriously enough, and in April, a year after I first applied, I started running a MIRIx group in my college.  I’ll write summaries of the sessions that I thought were worth sharing.  Most of the members are very new to FAI, so this will partly be an incentive to push upward and partly my own review process.  Hopefully some of this will be helpful to others.

This one focuses on how aligning creator intent with the base objective of an AI might not be enough for outer alignment, starting with an overview of Coherent Extrapolated Volition and its flaws.  This was created in collaboration with Jacob Abraham and Abraham Francis.

Coherent Extrapolated Volition

From the wiki,

In calculating CEV, an AI would predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI's utility function.

In other words, CEV is an AI having not only a precise model of human values, but also the meta understanding of how to resolve contradictions and incompleteness in those values in a friendly way.

This line of research was considered obsolete by Eliezer, due to the problems it runs into - some of which make it appear like the proposal itself only shifts the pointers to the goals.  In the time we spent discussing it, we ended up with a (most likely not comprehensive) list of the major flaws of CEV.

Criticism

1. Goodhart’s Law.  CEV doesn’t address the issue of wireheading, where an AI maximizes a proxy utility function instead of the intended utility function.  For example, without a sufficiently descriptive embedded understanding of what it means to be “more the people we wished we were” or even “smarter”, the AI is incentivized to exploit that ambiguity to create the cheapest outcomes.  To a sufficiently powerful AI, you’re giving it limited indirect write access to the values it will implement.  The obvious solution is a formal framework for these properties, which might just be rephrasing the general alignment problem.
However, CEV can be viewed as a proposal of what to do with AGI after problems like Goodhart’s Law have been solved.
2. Computational Costs.  Brute forcing extended high-fidelity simulations of all the humans that have ever lived in an attempt to formulate CEV will probably be too expensive for any first-generation AGI.  It’ll definitely be too expensive to make it a viable option in a competitive scenario.  Solutions to these include abstracting the details to make it more cost-efficient (which exacerbates Goodhart’s Law), and choosing a subset of the human population to extrapolate, which runs into philosophical dilemmas.
3. Value Convergence.  CEV works under the assumption that at the limit, human values will converge to an equilibrium.  This need not necessarily be the case, and at the very least, there’s enough breathing room for the alternative that we can’t be assured of its safety.  Some variants that have been proposed remedy this, but run into issues of their own.
4. Ideal Advisor Theory.  CEV shares much in common with the Ideal Advisor theories of philosophy.  Some of the criticisms raised against it by Sobel (1994) are also applicable to vanilla CEV, such as:
   • The views held by a perfectly rational and fully informed human may change over time; i.e., values need not converge on a reasonable time scale.  Mitigations to this come with their own bag of issues, such as privileging a single perspective at a predetermined stopping point, or managing a complex trade-off of values between different voices.
   • Under the assumptions that some lives can only be evaluated if they are experienced, and that for those experiences to be unbiased you’d have to start all of them from the same “blank slate”, Sobel proposes an amnesia model of agent experience as the most plausible implementation.  Under this model, agents undergo all lives sequentially, but have their memories extracted and wiped after each one.  These memories are returned to them at the very end.
     Sobel considers two objections to the amnesia model:
      • The perspective of the person at the end, informed by all the lives they now remember, would be different from the perspective of their self living that life, so properly evaluating those memories may be beyond them.
      • This entire process might drive the agent insane.  “Idealizing” them to a point where that isn’t true might leave them different enough from the original to not qualify as the same person.
   • Extrapolated models may be so much more advanced than their non-extrapolated selves that they can no longer objectively weigh the life and well-being of the latter.  In an extreme case, this could end up being the same way in which we value lesser sentient forms of life.  In a more moderate case, the way we might judge our life’s worth after an accident causing serious mental damage - there are some who would consider death as a viable alternative to that.
     This particular objection seems unlikely to me, but not obviously implausible.
• There are variants to CEV - such as Bostrom’s parliamentary model - that resolve some of these criticisms (the parliamentary model solves all but the last of Sobel’s arguments).  However, they run into new problems of their own.  This isn’t solid or even convincing proof to me that good variants can’t exist though, but it serves as precedent for new issues forming when old ones are avoided.

I thought of CEV to begin this because it specifically targets the assumption that human values themselves might not be enough.

Inner and Outer Alignment

There is little consensus on a definition for the entire alignment problem, but a large part, intent alignment, i.e. making sure the AI does what the programmers want it to, is composed of two components: inner and outer alignment.

Inner Alignment is about making sure the AI actually optimizes what our reward function specifies.  The reward function is the base objective: the objective the AI searches for optimizers to implement.  But in its search, the AI may find proxy objectives that are easier to optimize and do the job fairly well.  Such a proxy is the mesa objective (think of evolution, where the base objective is reproductive fitness, while the mesa objective includes heuristics like pain aversion, status signaling, etc.).  Inner Alignment is aligning the mesa objective with the base objective.

Outer Alignment is about making sure we like the reward function we’re training the AI for.  That is, if we had a model that solves inner alignment and was actually optimizing for the objective it’s given, would we like that model?  This is the centre of much of classical alignment discussion (the paperclip AI thought experiment, for example).

Recall that what CEV addresses is the potential for aligning our intent with the base objective to be insufficient, that a model that optimizes an objective we like can still fail in the limit as it runs into inconsistencies or other problems with our value systems.  The friendly resolution of these problems may be beyond a base human or human model at test time; far from necessarily so, but I think with enough probability in at least a few instances to be a problem.

Insufficient Values

Note: Epistemic status on the following is speculative at best, and is based on what posts and papers we could read in the time we had.

Based on my limited understanding of Outer Alignment, it doesn’t include a formalization of aligning AI with the values we would hold at the limit.  Some of the proposals we looked at also ran into this problem.

Imitative amplification, for example, relies on a model that tries to imitate a human with access to the model.  With oversight using transparency tools to account for deceptive or other harmful behaviour, it is plausibly outer aligned.  However, a base human may not be able to reliably resolve in a friendly way the contradictions and inconsistencies it would face at the limit.  That’s fairly uncharted territory, and might involve the human model diverging from the human template too far.  I don’t think the oversight would be of much help here either, because it isn’t necessary that these problems would come up as early as training time.  It’s also possible nearly any sort of resolution would seem misaligned to us.

Some proposals bypass this problem altogether, but at terrible cost.  STEM AI, for example, avoids value modelling entirely, but does so by ignoring the class of use cases where those would be relevant.

It’s possible that we wouldn’t need to worry about this problem at all.  Perhaps it will be addressed during training time, or instead of resolving inconsistencies, the AI could account for them as new value axioms.  But while the former may even be the likely scenario, the alternative still holds a distinct probability, especially in realistic scenarios where training until hypothetical future value conflicts are resolved isn’t competitive.  Treating inconsistencies as new axioms could well be dangerous, and might not even solve the core problem, because each new axiom could introduce new inconsistencies in an endless chain, in Gödelian fashion.

Endnote: I hesitated for a while before posting this because it felt like something that must have been addressed already.  I didn’t find much commenting on this in any of the posts we went through though, so I just peppered this with what was possibly an irksome number of uncertainty qualifiers.  Whatever we got wrong, tell us.


### Reward Is Not Enough

June 16, 2021 - 16:52
Published on June 16, 2021 1:52 PM GMT

Three case studies

1. Incentive landscapes that can’t feasibly be induced by a reward function

You’re a deity, tasked with designing a bird brain. You want the bird to get good at singing, as judged by a black-box hardcoded song-assessing algorithm that you already built into the brain last week. The bird chooses actions based in part on within-lifetime reinforcement learning involving dopamine. What reward signal do you use?

Well, we want to train the bird to sing the song correctly. So it’s easy: the bird practices singing, and it listens to its own song using the song-assessing black box, and it does RL using the rule:

The better the song sounds, the higher the reward.

Oh wait. The bird is also deciding how much time to spend practicing singing, versus foraging or whatever. And the worse it sings, the more important it is to practice! So you really want the rule:

The worse the song sounds, the more rewarding it is to practice singing.

Uh oh.

How do you resolve this conflict?

• Maybe “Reward = Time derivative of how good the song sounds”? Nope, under this reward, if the bird is bad at singing, and improving very slowly, then practice would not feel very rewarding. But here, the optimal action is to continue spending lots of time practicing. (Singing well is really important.)
• Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?” Sure—I mean, that is ultimately what evolution is going for, and that’s what it would look like for an adult human to “want to get out of debt” or whatever. But how do you implement that? "I want to be able to sing well" is an awfully complicated thought; I doubt most birds are even able to think it—and if they could, we still have to solve a vexing symbol-grounding problem if we want to build a genetic mechanism that points to that particular concept and flags it as desirable. No way. I think this is just one of those situations where “the exact thing you want” is not a feasible option for the within-lifetime RL reward signal, or else doesn’t produce the desired result. (Another example in this category is “Don’t die”.)
• Maybe you went awry at the start, when you decided to choose actions using a within-lifetime RL algorithm? In other words, maybe “choosing actions based on anticipated future rewards, as learned through within-lifetime experience” is not a good idea? Well, if we throw out that idea, it would avoid this problem, and a lot of reasonable people do go down that route (example), but I disagree (discussion here, here); I think RL algorithms (and more specifically model-based RL algorithms) are really effective and powerful ways to skillfully navigate a complex and dynamic world, and I think there's a very good reason that these algorithms are a key component of within-lifetime learning in animal brains. There’s gotta be a better solution than scrapping that whole approach, right?
• Maybe after each singing practice, you could rewrite those memories, to make the experience seem more rewarding in retrospect than it was at the time? I mean, OK, maybe in principle, but can you actually build a mechanism like that which doesn't have unintended side-effects? Anyway, this is getting ridiculous.

…“Aha”, you say. “I have an idea!” One part of the bird brain is “deciding” which low-level motor commands to execute during the song, and another part of the bird brain is “deciding” whether to spend time practicing singing, versus foraging or whatever else. These two areas don’t need the same reward signal! So for the former area, you send a signal: “the better the song sounds, the higher the reward”. For the latter area, you send a signal: “the worse the song sounds, the more rewarding it feels to spend time practicing”.

...And that’s exactly the solution that evolution discovered! See the discussion and excerpt from Fee & Goldberg 2011 in my post Big picture of phasic dopamine.
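The two-signal idea can be sketched in a few lines of Python. This is only a toy illustration: the function names, and the assumption that the song-assessing black box outputs a quality score between 0 and 1, are hypothetical and not taken from Fee & Goldberg. It also includes the rejected "time derivative" reward for comparison.

```python
def motor_reward(song_quality: float) -> float:
    """Signal for the area choosing motor commands: better song, higher reward."""
    return song_quality


def practice_reward(song_quality: float) -> float:
    """Signal for the area deciding whether to practice: worse song, higher reward."""
    return 1.0 - song_quality


def derivative_reward(previous_quality: float, current_quality: float) -> float:
    """The rejected single-signal alternative: reward = improvement in quality."""
    return current_quality - previous_quality


# A bad singer who is improving very slowly (made-up numbers, quality in [0, 1]).
before, after = 0.10, 0.11
print(derivative_reward(before, after))  # small: practice would barely feel rewarding
print(practice_reward(after))            # large: keep practicing
print(motor_reward(after))               # small: this rendition was poor
```

With a single broadcast signal, one of these two areas would be pushed in the wrong direction; with two signals, each area gets the gradient appropriate to its own decision.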

2. Wishful thinking

You’re the same deity, onto your next assignment: redesigning a human brain to work better. You’ve been reading all the ML internet forums and you’ve become enamored with the idea of backprop and differentiable programming. Using your godlike powers, you redesign the whole human brain to be differentiable, and apply the following within-lifetime learning rule:

When something really bad happens, do backpropagation-through-time, editing the brain’s synapses to make that bad thing less likely to happen in similar situations in the future.

OK, to test your new design, you upgrade a random human, Ned. Ned then goes on a camping trip to the outback, goes to sleep, wakes up to a scuttling sound, opens his eyes and sees a huge spider running towards him. Aaaah!!!

The backpropagation kicks into gear, editing the synapses throughout Ned's brain so as to make that bad signal less likely in similar situations in the future. What are the consequences of these changes? A bunch of things! For example:

• In the future, the decision to go camping in the outback will be viewed as less appealing. Yes! Excellent! That's what you wanted!
• In the future, when hearing a scuttling sound, Ned will be less likely to open his eyes. Whoa, hang on, that’s not what you meant!
• In the future, when seeing a certain moving black shape, Ned’s visual systems will be less likely to classify it as a spider. Oh jeez, this isn’t right at all!!

In The Credit Assignment Problem, Abram Demski describes actor-critic RL as a two-tiered system: an “instrumental” subsystem which is trained by RL to maximize rewards, and an “epistemic” subsystem which is absolutely not trained to maximize rewards, in order to avoid wishful thinking / wireheading.
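A toy sketch of why the "epistemic" side should not be trained on reward. Everything here is an assumption made for illustration: the spider detector is a single logit, the reward is -1 whenever a spider is perceived, and a spider is always really present. Training the detector on reward teaches it to stop seeing spiders; training it on prediction accuracy teaches it to see them.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_detector(rule: str, steps: int = 200, lr: float = 0.5) -> float:
    """Return P(detector reports "spider") after training, given a spider is present.

    rule="reward":     gradient ascent on expected reward (the wishful-thinking design)
    rule="prediction": gradient ascent on the log-likelihood of the true label
    """
    w = 0.0  # logit of reporting "spider"; starts at 50%
    for _ in range(steps):
        p = sigmoid(w)
        if rule == "reward":
            # Expected reward is -p, so its derivative with respect to w is -p * (1 - p).
            grad = -p * (1.0 - p)
        elif rule == "prediction":
            # The true label is "spider", so d(log p)/dw = 1 - p.
            grad = 1.0 - p
        else:
            raise ValueError(f"unknown rule: {rule}")
        w += lr * grad
    return sigmoid(w)


print(train_detector("reward"))      # falls toward 0: Ned stops seeing spiders
print(train_detector("prediction"))  # rises toward 1: Ned sees what is there
```

The instrumental subsystem can still use the reward to avoid camping in the outback; the point is only that the perceptual subsystem needs its own, non-reward training signal.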

Brains indeed do an awful lot of processing which is not trained by the main reward signal, for precisely this reason:

• Low-level sensory processing seems to run on pure predictive (a.k.a. self-supervised) learning, with no direct involvement of RL at all.
• Some higher-level sensory-processing systems seem to have a separate reward signal that rewards them for discovering and attending to “important things” both good and bad; see discussion of inferotemporal cortex here.
• The brainstem and hypothalamus seem to be more-or-less locked down, doing no learning whatsoever—which makes sense since they’re the ones calculating the reward signals. (If the brainstem and hypothalamus were being trained to maximize a signal that they themselves calculate … well, it's easy enough to guess what would happen, and it sure wouldn't be “evolutionarily adaptive behavior”.)
• Other systems that help the brainstem and hypothalamus calculate rewards and other assessments—amygdala, ventral striatum, agranular prefrontal cortex, etc.—likewise seem to have their own supervisory training signals that are different from the main reward signal.

So we get these funny within-brain battles involving subsystems that do not share our goals and that we cannot directly intentionally control. I know intellectually that it's safe to cross the narrow footbridge over the ravine, but my brainstem begs to differ, and I wind up turning around and missing out on the rest of the walk. “Grrr, stupid brainstem,” I say to myself.

3. Deceptive AGIs

You're a human. You have designed an AGI which has (you believe) a good corrigible motivation, and it is now trying to invent a better solar panel.

• There's some part of the AGI's network that is imagining different ways to build a solar panel, and trying to find a good design;
• There's another part of the AGI's network that is choosing what words to say, when the AGI is talking to you and telling you what it’s working on.

(In the human case, we could point to different parts of the cortex. The parts are interconnected, of course, but they can still get different reward signals, just as in the bird example above.)

The obvious approach is to have one reward signal, widely broadcast, influencing both parts of the network. And if we get to a point where we can design reward signals that sculpt an AGI's motivation with surgical precision, that's fine! We would sculpt the motivation so that the AGI is trying to invent a better solar panel as a means to an end, with the ultimate goal of helping you. Then when it chooses words, it would also be maximally helpful.

I think it's more likely that we won't get to that point: that AGIs' motivations will always be a bit off from what we intended for various reasons, but we'll (hopefully) muddle through anyway thanks to conservatism, transparency, and so on. Anyway, the motivation will certainly be off during training, when it's still learning basic things about the world, the existence of people, itself, etc. So either way, we should be deeply concerned that the AGI may sometimes be overly motivated to invent a better solar panel. And then, when it chooses the words to say to you, it may choose its words partly to manipulate you so as to advance its goal of inventing a better solar panel, rather than choosing words that are purely maximally helpful and honest.

So it seems to me that the word-choosing part of the network should have a separate reward signal (or be built in a different way altogether), a reward signal which doesn't care a whit about inventing a better solar panel, a reward signal geared purely towards clear and honest communication.

I don’t have a specific plan here. I just feel like I'd be surprised if the best possible safe AGI design that anyone can dream up will involve using the exact same reward signal for “making plans” and “choosing what words to emit when talking to a human”. I think we'll want different reward signals. Just like the bird example above.

This kind of non-unified agent sounds kinda weird, right? Like an AGI with Tourette’s or something?!? But it shouldn’t sound weird! It’s similar to (or even isomorphic to) a robot taking actions while being inspected by an interpretability / oversight module:

These are the exact same thing, just drawn differently. In the left diagram, we are meant to imagine a “unified agent” AGI that chooses actions to maximize reward, meanwhile being inspected / overseen by an independent algorithm running on a laptop nearby. In the right diagram, we moved the inspection / oversight algorithm into the same box as the AGI, albeit with its own separate speaker. Here the drawing encourages us to imagine this system as a kind of “non-unified” AGI, likely with multiple subsystems running in parallel, each trained on a different reward / supervisory signal.

Does an agent need to be "unified" to be reflectively stable?

“Need”? No. It’s clearly possible for a non-unified system, with different supervisory signals training different subsystems, to be reflectively stable. For example, take the system “me + AlphaZero”. I think it would be pretty neat to have access to a chess-playing AlphaZero. I would have fun playing around with it. I would not feel frustrated in the slightest that AlphaZero has “goals” that are not my goals (world peace, human flourishing, etc.), and I wouldn’t want to change that.

By the same token, if I had easy root access to my brain, I would not change my low-level sensory processing systems to maximize the same dopamine-based reward signal that my executive functioning gets. I don't want the wishful thinking failure mode! I want to have an accurate understanding of the world! (Y’know, having Read The Sequences and all…) Sure, I might make a few tweaks to my brain here and there, but I certainly wouldn’t want to switch every one of my brain subsystems to maximize the same reward signal.

(If AlphaZero were an arbitrarily powerful goal-seeking agent, well then, yeah, I would want it to share my goals. But it’s possible to make a subsystem that is not an arbitrarily powerful goal-seeking agent. For example, take AlphaZero itself—not scaled up, just literally exactly as coded in the original paper. Or a pocket calculator. Or really any algorithm ever implemented as of this writing.)

So it seems to me that a “non-unified” agent is not inevitably reflectively unstable. However, they certainly can be. Just like I have a few bones to pick with my brainstem, as mentioned above, it's likewise very possible for different parts of the agent to start trying to trick each other, or hack into each other, or whatever. This is an obvious potential failure mode that we’d be nuts to ignore.

It's not a new problem though. Remember the figure above: it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". So it's not a fundamentally different type of problem from gradient hacking. Or, for that matter, deception in general. (After all, humans are algorithms too.)

The Fraught Valley

Still, while it's not a new problem, I'll take this as an excuse to talk about solving it.

The way I’m thinking about it is:

• Early in training, we have The Path Of Incompetence, where the “executive / planning submodule” of the AGI is too stupid / insufficiently self-aware / whatever to formulate and execute a plan to undermine other submodules.
• Late in training, we can hopefully get to The Trail of Corrigibility. That’s where we have succeeded at making a corrigible AGI that understands and endorses the way that it’s built—just like how, as discussed above, my low-level sensory processing systems don’t share my goals, but I like them that way.
• If there’s a gap between those, we’re in, let’s call it, The Fraught Valley.

For example, go back to that figure above, and imagine using those interpretability / oversight tools to install and verify good motivations in the executive / planning submodule. The goal is to do this successfully before the AGI is sophisticated enough to undermine the interpretability tools themselves.

Or imagine trying to do value learning (IRL etc.) in an AGI that builds a world-model from scratch, as I believe humans do. Here we literally can’t install the right motivations from the get-go, because “the right motivations” are inevitably defined in terms of concepts like objective reality, people, self, etc., that are (as of yet) nowhere to be found in the world-model. So maybe we let it do some learning, with some carefully-thought-through curriculum of data and rewards, and spin up the IRL subsystem as soon as the world-model is developed enough to support it.

Anyway, the goal is to make the width of the Fraught Valley as small as possible, or better yet, eliminate it altogether. This involves:

1. Making it hard and complicated to corrupt the various motivation-installation and interpretability systems. I don’t think it’s realistic to harden these systems against a superintelligent adversary,  but every little roadblock we can think of is good—it helps stretch out the Path Of Incompetence.
2. Meanwhile, we push from the other side by designing the AGI in such a way that we can install good motivations, and especially root out the most dangerous ones, early. This might involve things like directing its attention to learn corrigibility-relevant concepts early, and self-awareness late, or whatever. Maybe we should even try to hardcode some key aspects of the world-model, rather than learning the world-model from scratch as discussed above. (I’m personally very intrigued by this category and planning to think more along these lines.)

Success here doesn't seem necessarily impossible. It just seems like a terrifying awful mess. (And potentially hard to reason about a priori.) But it seems kinda inevitable that we have to solve this, unless of course AGI has a wildly different development approach than the one I'm thinking of.

Finally we get to the paper “Reward Is Enough” by Silver, Sutton, et al.

The title of this post here is a reference to the recent paper by David Silver, Satinder Singh, Doina Precup, and Rich Sutton at DeepMind.

I guess the point of this post is that I’m disagreeing with them. But I don’t really know. The paper left me kinda confused.

Starting with their biological examples, my main complaint is that they didn’t clearly distinguish “within-lifetime RL (involving dopamine)” from “evolution treated as an RL process maximizing inclusive genetic fitness”.

With the latter (intergenerational) definition, their discussion is entirely trivial. Oh, maximizing inclusive genetic fitness “is enough” to develop perception, language, etc.? DUUUHHHH!!!

With the former (within-lifetime) definition, their claims are mostly false when applied to biology, as discussed above. Brains do lots of things that are not “within-lifetime RL with one reward signal”, including self-supervised (predictive) learning, supervised learning, auxiliary reward signals, genetically-hardcoded brainstem circuits, etc. etc.

Switching to the AI case, they gloss over the same interesting split: whether the running code, the code controlling the AI’s actions and thoughts in real time, looks like an RL algorithm (analogous to within-lifetime dopamine-based learning), or whether they are imagining reward-maximization purely as an outer loop (analogous to evolution). If it’s the latter, then, well, what they’re saying is trivially obvious (humans being an existence proof). If it’s the former, then the claim is nontrivial, but it’s also, I think, wrong.

As a matter of fact, I personally expect an AGI much closer to the former (the real-time running code involves an RL algorithm) than the latter (RL purely as an outer-loop process), for reasons discussed in Against Evolution as an analogy for how humans will create AGI. If that’s what they were talking about, then the point of my post here is that “reward is not enough”. The algorithm would need other components too, components which are not directly trained to maximize reward, like self-supervised learning.

Then maybe their response would be: “Such a multi-component system would still be RL; it’s just a more sophisticated RL algorithm.” If that’s what they meant, well, fine, but that’s definitely not the impression I got when I was reading the text of the paper. Or the paper's title, for that matter.


### Escaping the Löbian Obstacle

June 16, 2021 - 13:54
Published on June 16, 2021 12:02 AM GMT

Earlier this year, when looking for an inroad to AI safety, I learned about the Löbian Obstacle, which is a problem encountered by 'purely logical' agents when trying to reason about and trust one another. In the original paper of Yudkowsky and Herreshoff [1], they show that a consequence of Löb's theorem is that an agent X can only "trust" the reasoning of an agent Y with a strictly weaker reasoning system than themselves, where "trust" here means 'formally prove that the conclusions of the other agent's reasoning will be true'. As stated, this looks like a major problem if X is a human trying to build an artificially intelligent system Y, but it's also a problem for any individual (embedded) agent trying to reason about their own future behaviour.
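For reference, Löb's theorem is usually stated as below, for a formal system strong enough to talk about its own proofs (Peano Arithmetic is the standard example). The notation is the conventional one rather than anything taken from this post: ⊢ P means "the system proves P", and □P means "P is provable in the system".

```latex
\text{If } \vdash \Box P \rightarrow P, \text{ then } \vdash P.
\qquad
\text{Internalized form: } \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P.
```

So a system that proved "whatever I prove is true" (□P → P) for every P would thereby prove every P, and so be inconsistent. This is why blanket trust in a reasoner of equal strength, including one's own future self, cannot be established by proof inside the system.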

I'm not the first person to find this problem counterintuitive, and for good reason. In this article I'm going to explain why a formal (purely syntactic) logic system alone is a poor model of the reasoning of embedded agents, and show that by fixing this, we remove the foundation for the difficulties arising from Löb's theorem.

For the uninitiated, there is a handy survey of applications of Löb's theorem in AI safety research by Patrick LaVictoire [6].

Pure syntax

First, I should explain the formal set-up for applying Löb's theorem to agents. We model an agent's reasoning with a formal language, or logic, which I'll call L. Here I shall make the further assumption that this logic fits (or can be squeezed into) a formal language of the kind logicians are familiar with: the logic consists of some formal symbols or variables A,B,C... along with some logical connectives, operators and quantifiers for combining variables into expressions, or formulas. The agent is also assumed to carry some inference rules for manipulating formulas. Altogether, this data constitutes the syntax of L (its symbolic content and the rules for manipulating those symbols).

Since we don't care precisely what the symbols in L refer to, we need go no further, and we can start worrying about how this formal system behaves.
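To make "pure syntax" concrete, here is a minimal sketch of such a system in Python. It is an illustration under my own assumptions, not the formalism of the original paper: the class names `Var` and `Implies` and the choice of modus ponens as the only inference rule are hypothetical. The point is that the rule fires by matching the shape of formulas alone, without asking what any symbol means.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    """A formal symbol such as A, B, C; it carries a name and nothing else."""
    name: str


@dataclass(frozen=True)
class Implies:
    """The formula `antecedent -> consequent`."""
    antecedent: "Formula"
    consequent: "Formula"


Formula = Union[Var, Implies]


def modus_ponens(premise: Formula, implication: Formula) -> Formula:
    """From `premise` and `premise -> q`, derive `q` by symbol matching only."""
    if isinstance(implication, Implies) and implication.antecedent == premise:
        return implication.consequent
    raise ValueError("modus ponens does not apply to these formulas")


a, b = Var("A"), Var("B")
print(modus_ponens(a, Implies(a, b)))  # derives the formula B
```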

To arrive at the Löbian Obstacle, we suppose that an agent A (who reasons with L) builds an agent B who shall also reason with the logic L. Suppose, specifically, that A is building B to help them achieve some goal G. In order for A to ensure B will be helpful, A programs B to produce a proof that B's actions will lead to the goal G before acting. I should clarify that we are assuming that A and B inhabit an idealized environment over which they have complete control and perfect information, so that such a proof represents a guarantee of success.

Since A doesn't know in advance what B needs to do to achieve G (else there would be no dilemma!), there is a large number of formulas in L which could be output by B as 'actions provably achieving G'. Being maximally pessimistic (incidentally, this assumption is another one I find problematic, but I'll allow it for the time being), we can assume it could be any formula ϕ in L. Thus, in order for A to be certain that B will behave helpfully, it would seem that A must be able to prove, for all such formulas, a theorem schema of the form □⌈ϕ⌉→ϕ, which is to say that the existence of a proof of ϕ (which will eventually be supplied by B) implies that ϕ is true.

This is the point at which Löb's Theorem wags its finger: in the words of Yudkowsky and Herreshoff, "no formal system at least as powerful as Peano Arithmetic can prove its own soundness schema". In other words, it would seem that A cannot possibly prove in L that it will be able to trust an arbitrary output from B.
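For reference, here is the standard statement of Löb's Theorem in the notation used above, together with the instance that makes the quoted claim concrete. It assumes L is at least as strong as Peano Arithmetic, so that the provability predicate □ satisfies the usual derivability conditions.

```latex
% Löb's Theorem: L proves an instance of the soundness schema
% only for formulas it already proves outright.
\text{If } L \vdash \Box\lceil\phi\rceil \to \phi, \text{ then } L \vdash \phi.

% Instance \phi = \bot: proving this one instance would make L inconsistent.
L \vdash \Box\lceil\bot\rceil \to \bot \;\Longrightarrow\; L \vdash \bot.
```

The second line is Gödel's second incompleteness theorem in disguise: □⌈⊥⌉→⊥ says "L is consistent", so a consistent L cannot prove even that single instance of the schema.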

The problem of induction in disguise

There's one obvious problem here, which I hinted at in the introduction: the same argument applies to A reasoning about itself! The only means that A has of concluding that its own actions will result in the desired consequences is reasoning in L, and A has no way of proving that its musings in L have any bearing on its environment. This can be interpreted as an extremely condensed version of Hume's problem of induction [5], which consists in the observation that there is no purely logical basis for our inferences regarding, for example, the consequences of our actions.

I contend that applying Löb's Theorem to a purely syntactic system is a rather arcane way of arriving at this conclusion, since it's much easier to see how an agent's reasoning can fail to be sound with a simple example.

Earlier, I emphasised the phrase "since we don't care precisely what the symbols in L refer to," a claim which I will now make more precise. While the precise details of any given implementation aren't important for arguing about generic behaviour, it does matter that the symbols in L refer to something, since the veracity (which is to say the "actual" truth value) of the soundness schema depends on it. Let me throw some vocabulary out for the important concepts here.

The symbols and formulas of L don't come with intrinsic meaning (no formal language does). To imbue them with meaning, we require a mapping from formulas in L to objects, values or concepts which the agent is able to identify in their environment. This mapping is called the semantics of L. The semantic map is typically constructed inductively, starting at interpretations of variables and extending this to formulas one logical connective or quantifier at a time. From another perspective, this mapping is an attempt to express the environment as a model of the logic L.

In order for the agent to reason correctly about its environment (that is, come to valid conclusions using L) it is necessary that the semantics is sound, which is to say that all of the inference rules transform formulas whose interpretation is true into formulas of the same kind. In other words, the semantics should be truth-functional with respect to the inference rules of L.

It's easy to see how soundness can fail. Consider a chess-playing agent. If the agent isn't aware of some rule for how chess pieces may be moved (the agent isn't aware of the possibility of castling, say), then in circumstances where that rule applies, the agent may come to incorrect conclusions regarding the game. The logic involved in this case may be much simpler than Peano Arithmetic!

For our purposes here, the take-away is that soundness of the semantics cannot be guaranteed by syntax alone. Again, we don't need Löb's Theorem to demonstrate this; producing non-sound semantics for any given logic is rarely difficult, since it suffices to deliberately make the semantics fail to be truth-functional (by deliberately misinterpreting formal conjunctions as "or", say).
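The parenthetical example can be made concrete with a minimal sketch. The formula encoding and the names `evaluate`, `and_elimination_is_sound`, `standard` and `deviant` are illustrative assumptions, not anything from the tiling agents literature. The evaluator extends a valuation of variables to compound formulas inductively, and the check asks whether the inference rule "from p ∧ q infer p" sends true premises to true conclusions under a given reading of the conjunction symbol.

```python
from itertools import product

# Formulas are nested tuples (a hypothetical encoding chosen for this sketch):
#   ("var", name), ("not", f), ("and", f, g), ("or", f, g)


def evaluate(formula, valuation, conj):
    """Inductively extend a valuation of variables to all formulas.

    `conj` is the meaning assigned to the formal conjunction symbol.
    """
    tag = formula[0]
    if tag == "var":
        return valuation[formula[1]]
    if tag == "not":
        return not evaluate(formula[1], valuation, conj)
    if tag == "and":
        return conj(evaluate(formula[1], valuation, conj),
                    evaluate(formula[2], valuation, conj))
    if tag == "or":
        return (evaluate(formula[1], valuation, conj)
                or evaluate(formula[2], valuation, conj))
    raise ValueError(f"unknown connective: {tag}")


def and_elimination_is_sound(conj):
    """Check the rule 'from (p and q) infer p' against every valuation."""
    p, q = ("var", "p"), ("var", "q")
    premise = ("and", p, q)
    for vp, vq in product([False, True], repeat=2):
        valuation = {"p": vp, "q": vq}
        premise_true = evaluate(premise, valuation, conj)
        conclusion_true = evaluate(p, valuation, conj)
        if premise_true and not conclusion_true:
            return False  # true premise, false conclusion
    return True


def standard(a, b):
    return a and b


def deviant(a, b):
    # Deliberately misreads the conjunction symbol as "or".
    return a or b


print(and_elimination_is_sound(standard))  # True
print(and_elimination_is_sound(deviant))   # False
```

The syntax and the inference rule are identical in both runs; only the semantic map changes. Under the deviant reading, the valuation p = false, q = true makes the premise "true" and the conclusion false, which is exactly a failure of soundness that no amount of reasoning inside L can detect.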

Soundness as a belief

No agent is able to justify their system of reasoning from immutable first principles; this is another presentation of Hume's problem of induction. If that claim sounds contentious, I highly recommend Nancy Cartwright's essay [2] on relativism in the philosophy of science. And yet, we (humans) continue to reason, and to expect the conclusions of our reasoning to represent valid statements about reality.

The crucial fact underpinning this apparent contradiction is that every agent, embedded or otherwise, must hold metalogical beliefs regarding the soundness of their reasoning. In the very simplest case of an agent with a rigid reasoning system, this will consist of a belief that the chosen semantics for their logic is sound. In other words, while the soundness schema appearing in Löb's Theorem can never be proved in L, the agent necessarily carries a metalogical belief that (the metalogical version of) this soundness schema holds, since this belief is required in order for A to confidently apply any reasoning it may perform within L.

A related metalogical belief is that of consistency: an agent cannot prove the consistency of its own reasoning (Gödel's second incompleteness theorem), but it must nonetheless believe that its logic is consistent for soundness to be a possibility in the first place.

Allowing for the existence of metalogical beliefs immediately dissolves the Löbian Obstacle, since this belief extends to proofs provided by other agents: as soon as agent A encounters a proof in L from any source, they will expect the conclusion to be validated in their environment by the same virtue that they expect the conclusions of their own reasoning to be valid. Agent A can delegate the task of proving things to subordinates with more computational power and still be satisfied with the outcome, for example.
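As a rough illustration of delegation, the sketch below has A check a proof supplied by B instead of searching for one itself. Everything here is a toy assumption: the logic has only modus ponens, formulas are strings or `("->", antecedent, consequent)` tuples, and `check_proof` is a hypothetical helper rather than part of any existing system.

```python
def check_proof(axioms, proof, goal):
    """Return True if `proof` derives `goal` from `axioms` by modus ponens.

    Each step must either be an axiom or follow from two earlier steps
    X and ("->", X, step).
    """
    derived = []
    for step in proof:
        is_axiom = step in axioms
        by_modus_ponens = any(
            ("->", earlier, step) in derived for earlier in derived
        )
        if not (is_axiom or by_modus_ponens):
            return False
        derived.append(step)
    return bool(derived) and derived[-1] == goal


# What A takes for granted about the environment (toy example).
axioms = ["p", ("->", "p", "q"), ("->", "q", "g")]

# A proof handed over by B, who did the expensive search.
proof_from_b = ["p", ("->", "p", "q"), "q", ("->", "q", "g"), "g"]

# A bogus "proof" that simply asserts the goal.
bogus_proof = ["p", "g"]

print(check_proof(axioms, proof_from_b, "g"))  # True
print(check_proof(axioms, bogus_proof, "g"))   # False
```

Checking is cheap even when finding the proof was not. Once the check passes, A's confidence that "g" holds in the environment rests on the same metalogical belief in soundness that A applies to its own derivations, regardless of who produced the proof.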

Where did the problems go?

I do not expect that the perception shift I'm proposing has magically eliminated self-reference problems in general. A critical reader might accuse me of simply ignoring or obfuscating such problems. Let me try to convince you that my proposed shift in perspective achieves something meaningful, by explaining how Löb's theorem rears its head.

A devotee of formal logic might naively hope from my exposition so far that these so-called metalogical beliefs of soundness can simply be encoded as axioms in an expanded logic L' containing L. But Löb's theorem tells us that this will automatically make L' inconsistent!
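The step from "encode soundness as axioms" to "inconsistent" can be spelled out as follows. This sketch assumes that the added schema refers to provability in the expanded logic L' itself, which is what an agent reasoning in L' would need in order to trust its own proofs.

```latex
% Suppose L' contains, for every formula \phi, the axiom
\Box_{L'}\lceil\phi\rceil \to \phi .

% Then for every \phi we trivially have
L' \vdash \Box_{L'}\lceil\phi\rceil \to \phi ,

% and Löb's Theorem, applied to L', yields
L' \vdash \phi \quad \text{for every } \phi, \text{ in particular } L' \vdash \bot .
```

If the added axioms instead only assert the soundness of the original L (that is, □ with subscript L rather than L'), the result is a stronger theory that is consistent whenever L is sound, but L' then faces the same question about its own proofs.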

Thus, I am advocating here that we interpret Löb's theorem differently. Rather than concluding that a logical agent simply cannot believe its own proofs to be valid (a conclusion which we humans flout on a daily basis), which brings us to a seemingly insurmountable obstacle, I am proposing that we relegate assertions of soundness to beliefs at the interface between purely formal reasoning and the semantics of that reasoning. This distinction is protected by Hume's guillotine: the observation that there is a strict separation between purely formal reasoning and value judgements, where the latter includes judgements regarding truth. The agent's beliefs cannot be absorbed into the logic L, precisely because syntax can never subsume (or presume) semantics.

To make this concrete, I am proposing that soundness assertions (and metalogical beliefs more generally) must include the additional data of a semantic mapping, an aboutness condition. Note that this doesn't preclude an agent reasoning about its own beliefs, since there is nothing stopping them from building a semantic map from their logic into their collection of beliefs, or indeed from their logic into itself. Incidentally, the proof of Löb's theorem requires just such a mapping in the form of a Gödel numbering (in modern modal logic proofs, the 'provability' modality takes on this role), although it is usually only a partial semantic map, since it is constrained to an interpretation of a representation of the natural numbers in L into a collection of suitably expressive formulas/proofs in L.

The impossibility of an a priori guarantee of soundness forces an abstract intelligent agent to resort to empiricism in the same way that we human agents do: the agent begins from a hypothetical belief of soundness with less than absolute certainty, and must test that belief in their environment. When failures of soundness are observed, the agent must adapt. Depending on the architecture of the agent, the adaptation may consist of updates to L, an update to the semantic map, or an update to the belief value (if the truth values employed by the agent in their beliefs are sufficiently structured to express domains of truth, say). I am deliberately hinting at some interesting possibilities here, which I may explore in more detail in a later post.
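One of the three kinds of adaptation mentioned above, an update to the belief value, can be sketched numerically. The Beta-Bernoulli model below is purely an assumption made for illustration (nothing above commits to a Bayesian treatment), and `SoundnessBelief` and `run_trials` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SoundnessBelief:
    """Belief that conclusions derived in L hold in the environment.

    Modelled (as an assumption) by a Beta distribution with a uniform prior.
    """
    successes: int = 0
    failures: int = 0

    def update(self, prediction_held: bool) -> None:
        """Record one comparison between a derived conclusion and observation."""
        if prediction_held:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def confidence(self) -> float:
        # Posterior mean under a uniform prior (Laplace's rule of succession).
        return (self.successes + 1) / (self.successes + self.failures + 2)


def run_trials(belief, derived, observed):
    """Compare conclusions derived in L with what the environment shows."""
    for d, o in zip(derived, observed):
        belief.update(d == o)
    return belief.confidence


belief = SoundnessBelief()
derived = [True, True, False, True]    # truth values the agent concluded in L
observed = [True, True, False, False]  # truth values found in the environment
confidence = run_trials(belief, derived, observed)
print(confidence)
```

With a finite number of observations the confidence stays strictly below 1, which matches the point that soundness is held as a belief with less than absolute certainty. An agent with a richer architecture might respond to the one observed failure by revising L or the semantic map instead of merely lowering this number.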

Under the belief that the Löbian Obstacle was a genuine problem to be avoided, several approaches have been proposed in the past decade. Foremost is identifying a probabilistic regime in which Löb's Theorem fails to hold, and by extension fails to produce an obstacle [3]. In the setting of agents reasoning about mathematics or purely formal systems, this has been vastly expanded into the concept of a logical inductor [4] which assigns probabilities to all formulas in its logic (under the implicit assumption of particular fixed semantics!).

However, fixed semantics are inflexible, and any system which assigns values (such as probabilities) directly to formulas in its logic must be extremely data-heavy, since it must store (encodings of) large numbers of formulas explicitly rather than implicitly. An immediate advantage of recognizing that semantics is, and indeed must be, separate from syntax is that it highlights the possibility that the semantic map may vary. Rather than carrying around a heavy knowledge base consisting of all of the statements which it knows to be true (with respect to a fixed semantics), an agent may instead apply the same formal reasoning in different situations and carry the much lighter load of empirical knowledge of situations in which its reasoning will be sound. Or, with a little refinement, the agent can carry an understanding of which fragment of its logic can be applied in a given situation.

To me, this feels a lot closer to my own experience of reasoning: I have some factual knowledge, but I often rely more heavily on my understanding of how to apply my reasoning abilities to a given situation.

---

[1] Tiling Agents for Self-Modifying AI, and the Löbian Obstacle, Yudkowsky, E and Herreshoff, M. intelligence.org, 2013

[2] Relativism in the Philosophy of Science, Cartwright, N., from Relativism, a contemporary anthology, 2010, Columbia University Press. Available on researchgate.

[3] Probabilistic Löb Theorem, Armstrong, S., lesswrong.org, 2013

[4] Logical Induction, Garrabrant, S. et al., arxiv, 2020

[5] The Problem of Induction, Henderson, L., The Stanford Encyclopedia of Philosophy (Spring 2020 Edition), Edward N. Zalta (ed.), url

[6] An Introduction to Löb’s Theorem in MIRI Research, LaVictoire, P., 2015, intelligence.org


### How can we quantify player alignment in 2x2 normal-form games?

June 16, 2021 - 05:09
Published on June 16, 2021 2:09 AM GMT

In my experience, constant-sum games are considered to provide "maximally unaligned" incentives, and common-payoff games are considered to provide "maximally aligned" incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary 2×2
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), 
local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}  payoff table representing a two-player 
normal-form game (like Prisoner's Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?

If this question is ill-posed, why is it ill-posed? And if it's not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity's future.

Thoughts:

• Assume the alignment function has range [0,1] or [−1,1].
• Constant-sum games should have minimal alignment value, and common-payoff games should have maximal alignment value.
• The function probably has to consider a strategy profile (since different parts of a normal-form game can have different incentives; see e.g. equilibrium selection).
• The function should probably measure player A's alignment with player B; for example, player A might always cooperate while player B always defects. It then seems reasonable to say that A is aligned with B (in some sense) while B is not aligned with A (B pursues their own payoff without regard for A's payoff).
• So the function need not be symmetric over players.
• The function should be invariant to applying a separate positive affine transformation to each player's payoffs; it shouldn't matter whether you add 3 to player 1's payoffs, or multiply the payoffs by a half.

• Do some thought experiments to pin down the intuitive concept. Consider simple games where my "alignment" concept returns a clear verdict, and use these to derive functional constraints (like symmetry in players, or the range of the function, or the extreme cases).
• See if I can get enough functional constraints to pin down a reasonable family of candidate solutions, or at least pin down the type signature.
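
As one way to make the constraints in the first list concrete, the sketch below scores a game by the Pearson correlation between the two players' payoffs, weighted by how likely each outcome is under a given strategy profile. This is only an illustrative candidate, not an answer to the question: the function name `payoff_correlation` and the Prisoner's Dilemma payoff numbers are assumptions, and the score is symmetric in the players, so it cannot express the asymmetric case where A is aligned with B but not vice versa.

```python
import math


def payoff_correlation(payoffs_a, payoffs_b, weights=None):
    """Weighted Pearson correlation between two players' payoffs.

    payoffs_a[i] and payoffs_b[i] are the payoffs in outcome i; weights[i] is
    the probability of outcome i under some strategy profile (uniform if
    omitted). Returns None when either player's payoff has zero variance.
    """
    n = len(payoffs_a)
    if weights is None:
        weights = [1.0 / n] * n
    mean_a = sum(w * a for w, a in zip(weights, payoffs_a))
    mean_b = sum(w * b for w, b in zip(weights, payoffs_b))
    cov = sum(w * (a - mean_a) * (b - mean_b)
              for w, a, b in zip(weights, payoffs_a, payoffs_b))
    var_a = sum(w * (a - mean_a) ** 2 for w, a in zip(weights, payoffs_a))
    var_b = sum(w * (b - mean_b) ** 2 for w, b in zip(weights, payoffs_b))
    if var_a == 0 or var_b == 0:
        return None  # correlation is undefined for a constant payoff
    return cov / math.sqrt(var_a * var_b)


# Outcomes ordered (C,C), (C,D), (D,C), (D,D); the numbers are illustrative.
pd_a = [3, 0, 5, 1]
pd_b = [3, 5, 0, 1]

common_payoff = payoff_correlation(pd_a, pd_a)                   # close to +1
constant_sum = payoff_correlation(pd_a, [5 - a for a in pd_a])   # close to -1
prisoners_dilemma = payoff_correlation(pd_a, pd_b)               # in between
# A positive affine transformation of one player's payoffs leaves the score unchanged.
rescaled = payoff_correlation([0.5 * a + 3 for a in pd_a], pd_b)
```

Under these assumptions the score lies in [−1,1], takes its extremes on constant-sum and common-payoff games, and depends on the strategy profile only through the outcome weights.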

Discuss

### Three Paths to Existential Risk from AI

June 16, 2021 - 04:37
Published on June 16, 2021 1:37 AM GMT

Discuss

### Can someone help me understand the arrow of time?

June 15, 2021 - 20:26
Published on June 15, 2021 5:26 PM GMT

If I understand correctly, the psychological arrow of time tries to explain why we perceive time as passing. Its answer is that time does not in fact pass: we exist at a single point in the timeline but have the illusion of time passing because we remember the past and not the future.

Firstly, is this approximately correct?

Secondly, isn't memory itself a process that takes place over time? So how can the illusion occur if no time passes in which it could occur?

Thirdly, if this were true, there'd be no point in doing anything: time is never going to pass, so you can't do anything anyway. So why does nobody seem to act as if that's true? Does nobody actually believe the theory?

I have a feeling there are some major parts of the theory I'm missing, but what are they? Or is the theory less ambitious, trying only to explain why time passes in a particular direction and not why it passes at all?

Discuss

### If instead of giving out dividends, public companies bought total market index funds, which companies would be the biggest?

June 15, 2021 - 20:07
Published on June 15, 2021 5:07 PM GMT

Feel free to make simplifying assumptions or give partial answers (e.g., just links to some of the raw data).

Raw data needed for this calculation:

• dividends paid by each public company (or at least the biggest ones)
• historical value of each public company (or at least the biggest ones)
• historical value of a total market index fund (failing that, top 3000, or top 500 companies)

Can be restricted to the US.
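
One simplified way to set up the calculation from the raw data listed above is sketched below. It assumes each dividend is used to buy index-fund units at the price on the payment date, and that the company's own market value is otherwise unchanged. It ignores taxes, buybacks, and the effect these purchases would have on the index itself. The function name `counterfactual_value` and all numbers are made up for illustration.

```python
def counterfactual_value(final_market_cap, dividends, index_prices, final_index_price):
    """Company value if every dividend had bought index-fund units instead.

    dividends[t] is the total dividend paid in period t, and index_prices[t]
    is the index fund's unit price at that time.
    """
    units_held = sum(d / p for d, p in zip(dividends, index_prices))
    return final_market_cap + units_held * final_index_price


# Hypothetical company: two dividend payments, index price doubling each period.
example_value = counterfactual_value(
    final_market_cap=1000.0,
    dividends=[10.0, 20.0],
    index_prices=[100.0, 200.0],
    final_index_price=400.0,
)
```

Ranking companies by this quantity would then give one rough answer to the question, subject to the assumptions above.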

Discuss

### Vignettes Workshop (AI Impacts)

June 15, 2021 - 15:05
Published on June 15, 2021 12:05 PM GMT

AI Impacts is organizing an online gathering to write down how AI will go down! For more details, see this announcement, or read on.

Plan

1. Try to write plausible future histories of the world, focusing on AI-relevant features. (“Vignettes.”)
2. Read each other’s vignettes and critique the implausible bits: “Wouldn’t the US government do something at that point?” “You say twenty nations promise each other not to build agent AI–could you say more about why and how?”
3. Amend and repeat.

This event will happen over two days, so you can come Friday if this counts as work for you, Saturday if it counts as play, and both if you are keen. RSVP to particular days is somewhat helpful; let us know in the comments.

Date, Time, Location

The event will happen on Friday the 25th of June and Saturday the 26th.

It’ll go from 10am (California time) until probably around 4pm both days.

It will take place online, in the LessWrong Walled Garden. Here are the links to attend:
Friday
Saturday

FAQ

> Do I need literary merit or creativity?
No.

> Do I need to have realistic views about the future?
No, the idea is to get down what you have and improve it.

> Do I need to write stories?
Nah, you can just critique them if you want.

> What will this actually look like?
We’ll meet up online, discuss the project and answer questions, and then spend chunks of time (online or offline) writing and/or critiquing vignettes, interspersed with chatting together.

> Have you done this before? Can I see examples?
Yes, on a small scale. See here for some resulting vignettes. We thought it was fun and interesting.

This event is co-organized by Katja Grace and Daniel Kokotajlo. Thanks to everyone who participated in the trial Vignettes Day months ago. Thanks to John Salvatier for giving us the idea.

Discuss

### Vignettes workshop

June 15, 2021 - 14:10
Published on June 15, 2021 11:10 AM GMT

Discuss

### Stanford EA Confusion Dinner

June 15, 2021 - 13:49
Published on June 15, 2021 10:49 AM GMT

Stanford Effective Altruism will be hosting a Confusion Dinner this Sunday from 5 - 6 pm Pacific Time. Join to meet some of the Stanford EAs/LWers and to talk about whatever it is you are confused about!

The link to join is here.

Discuss

### Cultural Lag and Social Physics

June 15, 2021 - 07:45
Published on June 15, 2021 4:45 AM GMT

Suppose that rational, organic, emergent strains of thought are trying to work their way through the sublime collisions of billions of humans' thoughts, actions, and emotions, as expressed through our world culture, media, and societal efforts. These constitute a basic form of world consciousness brought on by the digital age and the advancement of computers and the Internet. As instances of coherent, cooperative communication and effort enabled by globalization, the adoption of these rational thoughts, beliefs, and behaviors is influenced by something I call Cultural Lag.

Simply put, Cultural Lag is a force of impedance in a system of Social Physics, where Social Physics is the rational application of the Laws of Physics to the Social Studies. This is a concept I've been thinking about and attempting to write about for several years, although I'm sure I'm not the only person thinking along these lines. Since the world of Social Studies consists of the interactions of persons, places, and things, and these are subject to the Laws of Physics, the tenets of Physics must apply.

The ability to measure Cultural Lag is not yet available - although I'd love to give it a shot, as an artist I lack the technical skills to sort it out but am open to discussion about the possibility of even attempting to find an equation for it - but it IS possible to spot examples of it everywhere. The trouble with convincing the majority of Americans at this point to get vaccinated is a prime example. Much of the science and research has been done, the logistics are in place and doses are available, but a variation of Cultural Lag - Vaccine Hesitancy - slows down the progress of the world because of beliefs.

Everything from the existence of Lead Pipes in American public water systems, to lack of internet access for low income and rural Americans, to lack of agreement on whether the Earth is flat, whether God exists, access to decent Physical and Mental healthcare, and many, many more, are all examples of Cultural Lag. I hypothesize that it exists as a force of impedance in the atomic and possibly quantum realm, which expresses itself at the human scale as the incredibly unequal distribution of resources across the spectrum of life on this planet.

One of the main vectors of Cultural Lag is mass communication: the valuable, rational, and Cohumane intentions of the producers of beneficial knowledge, goods, and technology are often lost in the sound and the fury of our media and competing institutional efforts, and the flow of good information and resources is impeded from making its way to those who need it. Cohumane intentions are intentions which attempt to take into account humanity living in harmony with all life in the Universe, including the potential for Aliens from other planets as well as any Artificial Life which might emerge from our Technological efforts.

It is these examples of Cultural Lag I attempt to address through my writing and my art, identifying and attempting to express my thoughts and feelings about them, as well as attempting to create solutions to the very real problems they often represent. Keeping in mind I approach these domains from the perspective of an artist/designer/media critic, this blog is my first attempt to share my ideas with a broader audience. My ideas are open to interpretation and my thinking is (I believe) open to change. I'm open to suggestions, work groups, opportunities to collaborate with other artists or people with technical skills, and just plain discussion.

I intend to use this blog as a way to introduce some of my work, and to invite people to engage productively with it. As my ideas sometimes are so broad I find it hard to cut them up into smaller pieces on a post by post basis, I'm looking forward to the challenge. I understand much of what I think and have written is based on work I've read or seen before, and will try my best to acknowledge those people and their work as things come up, but I'm not that great at academic writing. I'm debating attempting graduate work, possibly in statistics or media criticism, although I have spent a fair amount of time thinking about how to start a non-profit.

What's the takeaway? I've got ideas, some good and some bad; I want to improve my thinking; and I'm looking for some community. Stay tuned while I attempt to do these things and more, as interestingly and productively as possible.

Discuss

### Psyched out

June 15, 2021 - 07:45
Published on June 15, 2021 4:45 AM GMT

I don't know where to begin as I've apparently spent my whole life working at being a Rationalist and didn't know it until I heard about this site and the Rationalist Movement. As a Starving Artist/Amateur Intellectual/Hermit, I'm not in a position to go to graduate school right now, and the people around me these days don't share my same interests. Like any human being I need to engage with people thinking about the same things I do. So here I am.

I've kept boxes of sketchbooks and journals for a couple decades that are full of ideas and concepts I've developed from my own personal experience, reading and studies, but my work doesn't do well on traditional 'art' forums. It's usually very dry, scientific, conceptual work looking for a form, but as a trained artist I've run into technical issues - requiring skills, experience and/or data and resources I don't have - so my work stays theoretical and stacks up in boxes in my living room. I've attempted to write research papers, grant proposals, blogs, twitter posts, essays and books but can't quite finish them because I've lacked a support network. My work is my passion, but I've been unable to share it because I hadn't found a place where so many of my interests could be satisfied. When I look at the list of topics of discussion on this site, I feel agitated I didn't find this community sooner.

So as a newbie looking to develop, share, develop, and share my own work, contributing while benefiting from the work of the others on this board, I'm a little overwhelmed by the scope of this place. Since I'm pretty comfortable in my sexual and gender identity as a man, I wouldn't mind a little hand holding in the beginning.

Discuss

### Knowledge is not just digital abstraction layers

June 15, 2021 - 06:49
Published on June 15, 2021 3:49 AM GMT

Financial status: This is independent research. I welcome financial support to make further posts like this possible.

Epistemic status: This is in-progress thinking.

This post is part of a sequence on the accumulation of knowledge. Our goal is to articulate what it means for knowledge to accumulate within a physical system.

The challenge is this: given a closed physical system, if I point to a region and tell you that knowledge is accumulating in this region, how would you test my claim? What are the physical characteristics of the accumulation of knowledge? We do not take some agent as the fundamental starting point but instead take a mechanistic physical system as the starting point, and look for a definition of knowledge at the level of physics.

The previous post looked at mutual information between a region within a system and the remainder of the system as a definition of the accumulation of knowledge. This post will explore mutual information between the high- and low-level configurations of a digital abstraction layer.

A digital abstraction layer is a way of grouping the low-level configurations of a system together such that knowing which group the system’s current configuration is in allows you to predict which group the system’s next configuration will be in. A group of low-level configurations is called a high-level configuration. There are many ways to divide the low-level configurations of a system into groups, but most will not have this predictive property.
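
For a small deterministic system, the predictive property in this definition can be checked mechanically. The sketch below is a toy illustration, not part of the original argument: the names `is_abstraction_layer`, `step`, and the two groupings are hypothetical, and the eight-state system is chosen only because it is easy to enumerate.

```python
def is_abstraction_layer(states, step, group):
    """Return True if the group of the next state depends only on the
    group of the current state, for every state in `states`."""
    next_group_of = {}
    for s in states:
        current, following = group(s), group(step(s))
        # The first state seen in a group fixes the prediction; any later
        # state in the same group must agree with it.
        if next_group_of.setdefault(current, following) != following:
            return False
    return True


def step(s):
    """Toy low-level dynamics on the states 0..7."""
    return (3 * s + 1) % 8


def parity(s):
    """Group states by parity (two high-level configurations)."""
    return s % 2


def upper_half(s):
    """Group states by whether they are 4 or above."""
    return s >= 4


low_level_states = range(8)
parity_is_layer = is_abstraction_layer(low_level_states, step, parity)
upper_half_is_layer = is_abstraction_layer(low_level_states, step, upper_half)
```

Here grouping by parity is predictive (even states always go to odd states and vice versa), while grouping by upper and lower half is not: states 0 and 1 share a group, but their successors 1 and 4 do not.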

Here are three examples of digital abstraction layers:

Digital abstraction layers in computers

Information in contemporary computers is encoded as electrons in MOS memory cells. In these systems, the low-level configurations are all the ways that a set of electrons can be arranged within a memory cell. There are two high-level configurations corresponding to the "high" and "low" states of the memory cell. These high-level configurations correspond to groups of low-level configurations.

If we knew the high-level configuration of all the memory cells in a computer then we would know which instruction the computer would process next and could therefore predict what the next high-level configuration of the computer would be. We can make this prediction without knowing the specific low-level configuration of electrons in memory cells. Most other ways that low-level configurations of this system could be grouped into high-level configurations would not make such predictions possible. The design of digital computers is chosen specifically in order that this particular grouping of configurations does allow the evolution of the system to be predicted in terms of high-level configurations.

Digital abstraction layers in the genome

Information in the genome is stored as A, C, G, T bases in DNA. The low-level configurations are the possible arrangements of a small collection of carbon, nitrogen, oxygen, and hydrogen atoms. There are four high-level configurations corresponding to the arrangement of those atoms into Adenine, Cytosine, Guanine, and Thymine molecules.

There are many physically distinct configurations of a small collection of carbon, nitrogen, oxygen, and hydrogen atoms. If we counted each configuration as distinct as we did in the preceding post in this sequence then the number of bits of information would be the number of yes-or-no questions required to pinpoint one particular low-level configuration. But we could also say that we are only interested in distinguishing A from C from G from T molecules, and that we are not interested in distinguishing things at any finer granularity than that, and also that we are not interested in any configuration of atoms that does not constitute one of those four molecules. In this case our high-level system encodes two bits of information, since it takes two yes-or-no questions to distinguish between the four possibilities.
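
The counting argument above can be written out as a small sketch. The helper name `yes_no_questions` and the figure of one million low-level configurations are made up for illustration; only the four-base case comes from the text.

```python
def yes_no_questions(num_configurations):
    """Smallest number of yes-or-no questions that can always single out
    one of `num_configurations` possibilities."""
    if num_configurations < 1:
        raise ValueError("need at least one configuration")
    # Equivalent to ceil(log2(n)), computed with exact integer arithmetic.
    return (num_configurations - 1).bit_length()


high_level_bits = yes_no_questions(4)         # distinguishing A, C, G, T
low_level_bits = yes_no_questions(1_000_000)  # hypothetical low-level count
```

The coarser the grouping, the fewer questions are needed, which is why the high-level description carries only two bits per base.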

Digital abstraction layers in Conway’s Game of Life

For the third example of a digital abstraction layer, consider the following construction within Conway’s Game of Life that runs an internal simulation of Conway’s Game of Life (full video here):

In case it’s not visible in the truncated gif above, each of the high-level "cells" in this construction is actually a region consisting of many thousands of low-level cells of a much finer-grained version of Conway’s Game of Life. The fuzzy stuff between the cells is machinery that transmits information between one high-level cell and its neighbors in order to produce the high-level state transitions visible in the animation. That machinery is itself an arrangement of gliders, glider guns, and other constructions that evolve according to the basic laws of Life that are running at the lowest level of the automaton. It is a very cool construction.

Suppose now that I take the high-level configurations of this system to be the pattern of "on" and "off" cells in the zoomed-out game board, and that I group all the possible low-level game states according to which high-level game state they express. (That is, if two low-level configurations appear as the same arrangement of on/off game cells at the zoomed-out level, then we group them together.)

With this grouping of low-level configurations into high-level configurations, I can predict the next high-level configuration if I know the previous high-level configuration. This is because this particular construction is set up in such a way that the high-level configurations evolve according to the laws of Life. So if I know the high-level configuration at one point in time, then I can predict that the overall automaton will soon transition into one of the low-level configurations corresponding to the next high-level configuration predicted by the laws of Life. The same is not true for most ways of grouping low-level configurations into high-level configurations. If I grouped the low-level configurations according to whether each row contains an even or odd number of active cells, then knowing the current high-level configuration (that is, knowing whether there is currently an even or odd number of active cells in each row) would not let me predict the next high-level configuration unless I also knew the current low-level configuration.
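
The failure of the row-parity grouping can be seen on a tiny board. The sketch below is an illustration under assumptions not in the original: it uses a 3×3 board with cells outside the board treated as dead, and the helper names `life_step` and `row_parities` are hypothetical.

```python
def life_step(grid):
    """One step of Conway's Game of Life on a bounded grid of 0s and 1s
    (cells outside the grid count as dead)."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        return sum(
            grid[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))
            if (rr, cc) != (r, c)
        )

    new_grid = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            n = live_neighbors(r, c)
            alive = n == 3 or (grid[r][c] == 1 and n == 2)
            new_row.append(1 if alive else 0)
        new_grid.append(new_row)
    return new_grid


def row_parities(grid):
    """High-level configuration: parity of live cells in each row."""
    return tuple(sum(row) % 2 for row in grid)


blinker = [[0, 0, 0],
           [1, 1, 1],
           [0, 0, 0]]
single_cell = [[0, 0, 0],
               [0, 1, 0],
               [0, 0, 0]]

same_now = row_parities(blinker) == row_parities(single_cell)
same_next = row_parities(life_step(blinker)) == row_parities(life_step(single_cell))
```

Both boards start in the same high-level configuration (only the middle row has odd parity), but the blinker rotates into a column that makes every row odd, while the single cell dies and leaves every row even. The row-parity grouping therefore does not determine its own next state.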

I will call a grouping of low-level configurations into high-level configurations in such a way that transitions between high-level configurations can be understood without knowing the underlying low-level configurations a "digital abstraction layer".
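To make this definition concrete, here is a minimal sketch in Python. It assumes a tiny 3x3 Life board with wrap-around edges, chosen only so that all 512 low-level configurations can be enumerated; the function names are illustrative and are not part of the construction described above. A grouping counts as a digital abstraction layer here if every low-level configuration in a group leads to the same next group.

```python
from itertools import product

SIZE = 3  # assumption: a 3x3 torus, small enough to enumerate every state


def life_step(grid):
    """Apply one step of Conway's Game of Life with wrap-around edges."""
    new_rows = []
    for r in range(SIZE):
        row = []
        for c in range(SIZE):
            neighbors = sum(
                grid[(r + dr) % SIZE][(c + dc) % SIZE]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            alive = neighbors == 3 or (grid[r][c] == 1 and neighbors == 2)
            row.append(1 if alive else 0)
        new_rows.append(tuple(row))
    return tuple(new_rows)


def row_parity(grid):
    """High-level state: whether each row has an even or odd number of live cells."""
    return tuple(sum(row) % 2 for row in grid)


def count_alive(grid):
    """High-level state: the total number of live cells."""
    return sum(sum(row) for row in grid)


def is_abstraction_layer(abstraction, step, states):
    """True if the high-level state alone determines the next high-level state."""
    successor = {}
    for state in states:
        high = abstraction(state)
        next_high = abstraction(step(state))
        if successor.setdefault(high, next_high) != next_high:
            return False  # two states in one group disagree about the next group
    return True


all_states = [
    tuple(tuple(bits[r * SIZE:(r + 1) * SIZE]) for r in range(SIZE))
    for bits in product((0, 1), repeat=SIZE * SIZE)
]

print(is_abstraction_layer(row_parity, life_step, all_states))   # False
print(is_abstraction_layer(count_alive, life_step, all_states))  # True
```

Row parity fails the check, matching the even/odd example above. The live-cell count passes only because on this toy board every cell neighbors every other cell; it is an artifact of the 3x3 assumption and should not be read as a claim about Life in general.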

Knowledge defined in terms of digital abstraction layers

Perhaps we can define knowledge as mutual information between the high-level configurations of a digital abstraction layer contained within some region and the low-level configurations of the whole system. This definition encompasses knowledge about the ancestral environment encoded in the genome, knowledge recorded by a computer, and knowledge stored in synapse potentials in the brain, while ruling out difficult-to-retrieve information imprinted on physical objects by stray photon impacts, which was the counterexample we encountered in the previous post.

Example: Digital computer

The computer using a camera to look for an object discussed in the previous post is an example of a digital abstraction layer. The low-level configurations are the physical configurations of the computer, and the high-level configurations are the binary states of the memory cells and CPU registers that we interact with when we write computer programs. If I know the current high-level configuration of the computer then I can predict the next high-level configuration of the computer, since I know which instruction the CPU will read next and I can predict what effect that will have without knowing anything about the low-level configuration of electrons trapped in semiconductors.

In this example we would be able to measure an increase in mutual information between the high-level configuration of the computer and the low-level configuration of the environment as the computer uses its camera to find the object we have programmed it to seek.
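A toy calculation can illustrate this increase. The sketch below assumes the object sits at one of four equally likely positions and that the computer's memory is a single register; both are simplifying assumptions made for illustration, not details from the previous post.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information, in bits, between x and y over equally likely (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (count / n) * log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


positions = range(4)  # assumption: four equally likely object positions

# Before looking: the register holds 0 no matter where the object is.
before = [(position, 0) for position in positions]
# After looking: the register holds the object's position.
after = [(position, position) for position in positions]

print(mutual_information(before))  # 0.0
print(mutual_information(after))   # 2.0
```

Under these assumptions the register goes from sharing 0 bits to sharing 2 bits with the environment, which is the kind of increase the definition is meant to detect.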

This also helps us to understand what it means for an entity to accumulate self-knowledge: it means that there is an increase over time of the mutual information between the high- and low-level configurations of that entity. A computer using an electron microscope to build a circuit diagram of its own CPU fits this definition, while a rock that is "a perfect map of itself" does not.

Counterexample: Data recorder

But what about a computer that merely records every image captured by a camera? This system has at least as much mutual information with its environment as one that uses those observations to build up a model of its surroundings. It is reasonable to say that a data recorder is accumulating nonzero knowledge, but it is strange to say that exchanging the sensor data for a model derived from that sensor data is always a net decrease in knowledge. If I discovered that my robot vacuum recorded its sensor data to some internal storage device I might be mildly concerned, but if I discovered that my robot vacuum was deriving sophisticated models of human psychology from that sensor data I would be much more concerned. Yet a mutual information conception of knowledge seems to have no way to account for the computational process of turning raw sensor data into actionable models.
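The problem can be seen in a toy calculation. The sketch below assumes an environment with four object positions and two lighting conditions, all equally likely; the recorder stores the raw observation and the model keeps only the position. These variables are hypothetical and chosen only to illustrate the point.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy, in bits, of equally likely samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) over paired, equally likely samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Assumption: the environment is a (position, lighting) pair, 8 states in total.
environment = list(product(range(4), range(2)))

recorder = list(environment)                       # stores the raw observation
model = [position for position, _ in environment]  # keeps only the position

print(mutual_information(environment, recorder))  # 3.0
print(mutual_information(environment, model))     # 2.0
```

The model is computed from the recording, so it can never share more bits with the environment than the recording does, even though the model may be the more useful artifact.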

Conclusion

It is not clear whether digital abstraction layers have a fundamental relationship with knowledge. It does seem that information must be useful and accessible in order to precipitate the kind of goal-directed action that we seek to understand, and digital abstraction layers are one way to think about what it means for information to be useful and accessible at the level of physics.

More fundamentally, it seems that mutual information fails to capture what we mean when we think of the accumulation of knowledge, even after we rule out the cases discussed in the previous post. A physical process that begins with some observations and ends with an actionable model derived from those observations should generally count as a positive accumulation of knowledge, yet such a process will never increase its mutual information with its environment, since by hypothesis no new observations are added along the way. The next post will look beyond information theory for a definition of the accumulation of knowledge.

Discuss

### (Another) Using a Memory Palace to Memorize a Textbook

June 15, 2021 - 03:46
Published on June 14, 2021 10:12 PM GMT

Why do this? I was a year out of graduate school, but I could already feel my knowledge leaking away. This is a frustrating experience that will be familiar to any of you who've switched fields: if you are no longer working with your hard-won knowledge and skills, they seem to vanish. The mind is a leaky sieve; without constant refilling it empties quickly.

Like, if you asked me to write down the Schrödinger equation (my PhD was in physics) right this instant, I'd have a 50/50 chance of getting it right. (I just tried this, by the way, and I failed. Now all I'm left with is a wrong equation and a slightly hollow feeling, like what Scrooge McDuck might feel if he opened up his vault and all that was there were a few quarters. Canadian quarters.)

My experience with memory palaces was that:

1. They are a bitch to get right: you need to put in significant practice to get proficient, but...
2. They give more permanence to memories

This permanence was exactly what I wanted: I wanted to be able to remember things like the Schrödinger equation, even if I hadn't thought about it in years.

So, instead of memorizing the entire textbook, I narrowed my vision: could I memorize the important equations and figures in a chapter? The goal would be to deliver a lecture on the chapter without looking at written notes, moving from important equation to important equation and being sure of my derivations.

The usual format of a palace is: you take a place you are familiar with, and you mentally 'place' objects there, and then you walk through the palace in order, visiting the objects that help encode memories.

Approach 1: Picture-in-the-mind

I first tried just taking a 'snapshot' of each equation and placing the snapshots on pedestals around my palace (which was just my dingy basement suite apartment). Unfortunately, this was an abject failure. My powers of visualization were not enough to create permanent 'mental' snapshots; they crumbled to dust when I wasn't focusing on them. I needed something more memorable...

Approach 2: Story-in-the-mind

For every equation, I tried creating a little visual 'story' for it. This was fairly free-form. Say I wanted to memorize y = x/2 + 1: I might picture an "x" sliding down a divisor "/" into the waiting arms of a "2". This was easier for me to recall than a static picture of the equation, because the visual story allowed me to 'move' through the equation the same way you might if you read it off the page. The problem was that my visual language wasn't consistent and had to be invented on the spot, which made the storage process slow and the retrieval process prone to error.

Intermission

I took a break from this for a year or so. I took up another challenge: memorizing the names and dates of office of all American presidents. In testing out approaches, I came across the Dominic System of memorization. Used by famed memory athlete Dominic O'Brien (who looks, in the best way possible, like a pornstar from the '70s) to win the World Memory Championships multiple times, it is a refreshingly straightforward scheme.

1. Take the following letters (O, A, B, C, D, E, S, G, H, N); they correspond to the digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).

2. Take all two-letter combinations of these letters, giving 10*10 = 100 unique combos (AA, AB, ... HA, HB).

3. For every combo, think of a celebrity who has those initials (AS -> Arnold Schwarzenegger). This is surprisingly difficult. Like, who has the initials H.E.? (I floundered on this for a while; now I picture the Joker giggling "He He".)

Now you can memorize things with 2 digits! Say you parked your car in spot 62: this becomes SB, which could be Simone Biles, so you imagine Ms. Biles tumbling over your car. Useless though. You can't even memorize a date (4 digits).

1. To memorize 4 digits, you use the Object Action system. Each celebrity is intrinsically an Object, and you give each celebrity an Action. Say you wanted to memorize the date 1990: this becomes (AN)(NO), or (Amber Nash)(Nick Offerman), so you might picture Pam (the character Amber Nash voices on the cartoon Archer) making a canoe (this is the Action I've given Nick).

2. You can expand this even further by using the Object Action Item system: each celebrity now also has an Item associated, and you can memorize 6 digits.

The basic premise is, by drilling the conversion between numbers and celebrities until it is second nature, you can quickly construct memorable, visual stories that all have the same common visual language. By placing these stories around a route (or palace), you can memorize very long series of digits.
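The digit-to-initials step is mechanical enough to sketch in a few lines of Python. This is only an illustration of the conversion described above, writing the zero digit as the letter O; the function name is made up, and the celebrity lookup is left out because that table is personal.

```python
# Digit-to-letter table from the list above, with zero written as the letter O.
DIGIT_TO_LETTER = dict(zip("0123456789", "OABCDESGHN"))


def to_initials(number: str) -> list[str]:
    """Split a digit string into pairs and convert each pair to initials."""
    if len(number) % 2 or not number.isdigit():
        raise ValueError("expected an even-length string of digits")
    letters = "".join(DIGIT_TO_LETTER[digit] for digit in number)
    return [letters[i:i + 2] for i in range(0, len(letters), 2)]


print(to_initials("62"))    # ['SB']
print(to_initials("1990"))  # ['AN', 'NO']
```

Each pair of initials then maps to whichever celebrity you drilled for it, as in the parking spot and 1990 examples above.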

But how does this help with memorizing equations?

Approach 3: Extending Dominic

One hundred things is actually a lot of things. Turns out, you can fit most of the common symbols of mathematics into 100 boxes. So I did.

1. You need to include the English alphabet (26 boxes gone: OA-BS). This seems weird, as you are actually increasing the information here: 'j' is encoded as "AB". But "j" is really encoded as "Anthony Bourdain", so it is still encoded as one entity.
2. You need to include the Greek alphabet (21 boxes gone: CO-EO).
3. Basic arithmetic (10 boxes gone: '=', '+', '-', '*', '^', 'root', '(', ')', '|'; SO-SN).
4. Trig (8 boxes gone: e, ln, sin, cos, tan, sinh, cosh, tanh; GO-GG).
5. Calc (4 boxes gone: d/dx, integral, del, summation; HO-HC).

Most physics equations can be written as a combination of the symbols encoded above. You'll notice I have room to spare! I still have the 'N' column. I also have gaps, where for symbol hygiene, I try to start new categories of symbols on new columns. All in all, I still have 31 boxes left, if I want to extend this system further!
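The box bookkeeping can be checked with a short script. This is a sketch under the assumption that each category occupies a contiguous run of boxes starting at the first code of its range (01, 30, 60, 70, 80), with zero written as the letter O; the starting positions are inferred from the ranges above and are not stated explicitly.

```python
LETTERS = "OABCDESGHN"  # index = digit, so 0 -> O, 1 -> A, ..., 9 -> N


def box_code(n: int) -> str:
    """Two-letter code for box number n (0 to 99)."""
    return LETTERS[n // 10] + LETTERS[n % 10]


# (category, first box number, number of boxes); the starts are inferred.
CATEGORIES = [
    ("English alphabet", 1, 26),
    ("Greek alphabet", 30, 21),
    ("basic arithmetic", 60, 10),
    ("trig", 70, 8),
    ("calc", 80, 4),
]

for name, start, count in CATEGORIES:
    first, last = box_code(start), box_code(start + count - 1)
    print(f"{name}: {first}-{last} ({count} boxes)")

used = sum(count for _, _, count in CATEGORIES)
print(used, 100 - used)  # 69 31
```

The computed ranges come out as OA-BS, CO-EO, SO-SN, GO-GG, and HO-HC, and the 31 unused boxes match the count given above.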

This post is getting long, so I'm not going to write down examples of this. It was a pain to set everything up, but I encoded everything in flash cards and worked through them instead of hitting up Reddit when I was in the bathroom, so it only took a week or so to learn. The problem was density. Equations can have a significant number of symbols in them, even fairly simple ones. (And why bother to memorize simple equations? What you're really after, you greedy little STEMLord, is the ability to draw some equation from Jackson's Classical Electrodynamics at the drop of a hat and pistol-whip the insouciant lout who dared question you into complete submission.)

So, if an equation has 12 symbols, then you still have to build a visual story containing 12 elements, on the fly, and keep it in one location. It was too much to manage, which led me to my current approach.

Approach 4: Chained Palaces

The way to order and remember long visual stories is to use a memory palace (duh). So, now what I do is I have a 'main' palace (like a friend's house). Every place within this palace encodes two things: the equation number I'm trying to memorize, and a link to another palace (this is surprisingly easy. I don't need to do anything special to encode the link, weirdly.) In that linked palace, I divide up the equation and place the visual elements in some familiar route.
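For readers who think in data structures, the chained layout looks roughly like the sketch below. All names and contents are hypothetical (the class names, the places, and the reuse of the y = x/2 + 1 example from earlier); it shows only the shape of the scheme, not the actual palaces.

```python
from dataclasses import dataclass, field


@dataclass
class Palace:
    """A familiar place with an ordered route of stops."""
    name: str
    route: list[str] = field(default_factory=list)


@dataclass
class MainStop:
    """One place in the main palace: an equation label plus a linked palace."""
    place: str
    equation_label: str
    linked_palace: Palace


def recall(main_palace: list[MainStop]) -> dict[str, str]:
    """Walk the main palace in order, then walk each linked palace's route."""
    return {
        stop.equation_label: " ".join(stop.linked_palace.route)
        for stop in main_palace
    }


# Hypothetical contents: one equation split into chunks along a kitchen route.
kitchen = Palace("childhood kitchen", ["y =", "x / 2", "+ 1"])
main_palace = [MainStop("friend's front door", "Eq. 1", kitchen)]

print(recall(main_palace))  # {'Eq. 1': 'y = x / 2 + 1'}
```

In practice each route stop would hold a visual story built from the symbol celebrities rather than raw text, but the two-level lookup is the same.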

I thought I would have trouble coming up with enough palaces, but it hasn't been an issue so far. I also re-use the palaces, and as long as the contexts in which I use a palace are dissimilar, it doesn't appear to be a problem.

Soo, what?

I mean, it works. I can successfully encode equations, and have long-term recall of them. I actually did encode a chapter from Jackson a year ago ;), and though the equations are a bit rusty, they are still there.

This approach does take a while though-- you have a lot of setup time to drill the visual language enough that it is reflexive. Then you have to build the visual stories (this gets quicker with practice, but still). Then you have to drill the stories a few times (easy with Anki) to make sure you've got it.

The big win though is verification. Instead of taking a stab at writing something down that looks right (and hoping you have an even number of sign errors), you can write down an equation, and check the answer against the visual story in your palace.

Discuss