Problems with formalising decision theory

LessWrong.com News - August 23, 2019 - 06:27
Published on August 23, 2019 3:27 AM UTC

In this post, I clarify how far we are from a complete solution to decision theory, and the way in which high-level philosophy relates to the mathematical formalism. I’ve personally been confused about this in the past, and I think it could be useful to people who casually follow the field. I also link to some less well-publicized approaches.

The first disagreement you might encounter when reading about alignment-related decision theory is the disagreement between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and different logical decision theories emerging from MIRI and LessWrong, such as Functional Decision Theory (FDT) and Updateless Decision Theory (UDT). This is characterized by disagreements on how to act in problems such as Newcomb’s problem, the smoking lesion and the prisoner's dilemma. MIRI’s paper on FDT represents this debate from MIRI’s perspective, and, as exemplified by the philosopher who refereed that paper, academic philosophy is far from having settled on how to act in these problems.
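To make the disagreement concrete, here is a minimal sketch of Newcomb’s problem in Python. The payoffs, the predictor accuracy, and the functions `edt_value` and `cdt_value` are illustrative assumptions (a textbook-style simplification), not formalisms taken from the FDT paper or from any of the theories' official statements.

```python
# Illustrative numbers only; none of them come from the post.
BIG = 1_000_000   # opaque box, filled only if one-boxing was predicted
SMALL = 1_000     # transparent box, always present
ACTIONS = ["one-box", "two-box"]


def edt_value(action: str, accuracy: float) -> float:
    """Expected payoff when the action is treated as evidence about the prediction."""
    if action == "one-box":
        return accuracy * BIG
    return (1 - accuracy) * BIG + SMALL


def cdt_value(action: str, p_full: float) -> float:
    """Expected payoff when the box contents are held fixed whatever the action is."""
    bonus = SMALL if action == "two-box" else 0
    return p_full * BIG + bonus


def best(value_fn, param: float) -> str:
    """Return the action with the highest value under the given evaluation rule."""
    return max(ACTIONS, key=lambda a: value_fn(a, param))


print(best(edt_value, 0.99))  # evidential reasoning with a 99%-accurate predictor
print(best(cdt_value, 0.5))   # causal reasoning with a 50% prior that the box is full
```

Under these assumptions the evidential rule prefers one-boxing whenever the predictor is accurate enough, while the causal rule prefers two-boxing for every fixed probability that the box is full, which is the kind of split the toy problems are meant to expose.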

I’m quite confident that the FDT-paper gets those problems right, and as such, I used to be pretty happy with the state of decision theory. Sure, the FDT-paper mentions logical counterfactuals as a problem, and sure, the paper only talks about a few toy problems, but the rest is just formalism, right?

As it turns out, there are a few caveats to this:

1. CDT, EDT, FDT, and UDT are high-level clusters of ways to go about decision theory. They have multiple attempted formalisms, and it’s unclear to what extent different formalisms recommend the same things. For FDT and UDT in particular, it’s unclear whether any one attempted formalism (e.g. the graphical models in the FDT paper) will be successful. This is because:
2. Logical counterfactuals is a really difficult problem, and it’s unclear whether there exists a natural solution. Moreover, any non-natural, arbitrary details in potential solutions are problematic, since some formalisms require everybody to know that everybody uses sufficiently similar algorithms. This highlights that:
3. The toy problems are radically simpler than actual problems that agents might encounter in the future. For example, it’s unclear how they generalise to acausal cooperation between different civilisations. Such civilisations could use implicitly implemented algorithms that are more or less similar to each other’s, may or may not be trying and succeeding to predict each other’s actions, and might be in asymmetric situations with far more options than just cooperating and defecting. This poses a lot of problems that don’t appear when you consider pure copies in symmetric situations, or pure predictors with known intentions.

As a consequence, knowing what philosophical position to take in the toy problems is only the beginning. There’s no formalised theory that returns the right answers to all of them yet, and if we ever find a suitable formalism, it’s very unclear how it will generalise.

If you want to dig into this more, Abram Demski mentions some open problems in this comment. Some attempts at making better formalisations include Logical Induction Decision Theory (which uses the same decision procedure as evidential decision theory, but gets logical uncertainty by using logical induction), and a potential modification, Asymptotic Decision Theory. There’s also a proof-based approach called Modal UDT, for which a good place to start would be the 3rd section in this collection of links. Another surprising avenue is that some formalisations of the high-level clusters suggest that they're all the same. If you want to know more about the differences between Timeless Decision Theory (TDT), FDT, and versions 1.0, 1.1, and 2 of UDT, this post might be helpful.


Tabooing 'Agent' for Prosaic Alignment

LessWrong.com News - August 23, 2019 - 05:55
Published on August 23, 2019 2:55 AM UTC

This post is an attempt to sketch a presentation of the alignment problem while tabooing words like agency, goals or optimization as core parts of the ontology. This is not a critique of frameworks which treat these topics as fundamental; in fact, I end up concluding that this is likely justified. This is not a 'new' framework in any sense, but I am writing it down with my own emphasis in case it helps others who feel sort of uneasy about standard views on agency. Any good ideas here probably grew out of Risks From Learned Optimization or my subsequent discussions with Chris, Joar and Evan.

Epistemic State: Likely many errors both in facts and emphasis, I would be very happy to find out where they are.

Prosaic AI Alignment as a generalization problem

I think the current and near-future state of AI development is well-described by us having:

1. very little understanding of intelligence (here defined as something like "generally powerful problem solvers"), but

2. a lot of 'dumb' compute.

Prosaic AGI development is, in my view, about using 2 to get around 1. A very simplistic model of how to do this involves three components:

• A parametrized model-space large enough that we think it contains the sorts of generalized problem solvers we want.
• A search criteria which we can test for.
• A search process which uses massive compute to find parameters in the model-space satisfying the search criteria.

Much of ML is about designing all these parts to make the search process as compute-efficient as possible, for instance by making everything differentiable and using gradient-descent. For the purposes of this discussion I will consider an even simpler model where the search process simply samples random models from some prior over the model-space until it finds one satisfying some (boolean) search criteria.
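The simplified search process described above can be sketched as rejection sampling. Everything concrete here is a hypothetical stand-in chosen for illustration: the model-space is integer-parameter lines `f(x) = a*x + b`, the prior is uniform over a small grid, and the search criteria is exact agreement with a handful of test points.

```python
import random


def sample_model(rng: random.Random) -> tuple:
    """Draw a model from the prior: here, uniform over a small grid of (a, b)."""
    return (rng.randint(-3, 3), rng.randint(-3, 3))


def passes_criteria(model: tuple, tests: list) -> bool:
    """Boolean search criteria: the model must match every (x, y) test point."""
    a, b = model
    return all(a * x + b == y for x, y in tests)


def search(tests: list, rng: random.Random, max_samples: int = 10_000):
    """Sample from the prior until a model passes; return it and the sample count."""
    for n in range(1, max_samples + 1):
        model = sample_model(rng)
        if passes_criteria(model, tests):
            return model, n
    return None, max_samples


rng = random.Random(0)
tests = [(0, 1), (1, 3)]  # only the model (a=2, b=1) satisfies both points
model, samples_used = search(tests, rng)
print(model, samples_used)
```

With only the first test point, seven different models in this toy grid would pass, and they disagree everywhere except at `x = 0`; that is the sense in which a limited search criteria leaves generalization underdetermined.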

While we are generally ignoring computational and practical concerns it is important that the search criteria is limited - you can only check that the model acts correctly on a small fraction of the possible situations we want to use this model in. We might talk about a search criteria being feasible if it is both possible to gather the data required to specify it and reasonable to expect to actually find a model fulfilling it with the amount of compute you have. The goal then is to pick a model-space, prior and feasible search criteria such that the model does what we want in any situation it might end up in. Following Paul Christiano (in Techniques for optimizing worst case performance) we might broadly distinguish two ways that this could be false, which we will refer to as 'generalization failures':

• Benign failure: It does something essentially random or stupid, which has little unexpected effect on the world.

• Malign Failure: It generalizes 'competently' but not in the way we want, affecting the world in potentially catastrophic and unexpected ways.

Benign failures are seen a lot in current ML and often labelled robustness problems, where distributional shifts or adversarial examples lead to failures of generalization. These failures don't seem very dangerous, and can usually be solved eventually through iteration and fine-tuning. This is not the case for malign generalization failure, which risks destroying civilization. (Slightly breaking the taboo: the classic story of training a deceptively aligned expected utility maximizer, which only does what we want because it realizes it is being tested, is a malign generalization failure, though in this framework this is just an example, and whether this is central or not is an empirical claim which will be explored in the second section.)

A contrived story which doesn't rely on superintelligence but also demonstrates a malign generalization failure is:

We are searching for a good design for a robot to clean up garbage dumps, so we run a bunch of simulations until we find one which passes our selection criteria of clearing all the garbage. We happily release this robot in the nearest garbage dump, and find that it does indeed clear the garbage, but alarmingly it does this by manufacturing self-replicating garbage-eating nano-bots. These nano-bots quickly consume the earth. The robot itself knows nothing other than how to construct this precise set of nano-bots from materials found in a garbage dump, which is impressive but not generally intelligent.

Another class of examples are things which are generally intelligent, but in a very messy way with many blind spots and weird quirks, some of which eventually lead to humanity's demise. I think a better understanding of which kinds of malign generalization errors we might be missing could be potentially very important.

So how do we shape the way the model generalizes? I think the key question is understanding how the inputs to the search process (the model-space, the prior and the search criteria) affect the output, which I think is best understood as a posterior distribution in the model-space gained by conditioning on the model clearing the search criteria.
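One way to write this posterior down, as a sketch under assumed notation (the symbols $m$ for a model, $P(m)$ for the prior over the model-space $\mathcal{M}$, and $C(m) \in \{0, 1\}$ for the boolean search criteria are introduced here for illustration):

```latex
P(m \mid C) \;=\; \frac{P(m)\, C(m)}{\sum_{m' \in \mathcal{M}} P(m')\, C(m')}
```

Sampling from the prior until the criteria is satisfied draws exactly from this distribution, provided the denominator is nonzero. With an empty criteria, $C(m) = 1$ for every model, and the expression reduces to the prior.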

Some examples of how the inputs might relate to the outputs:

• With an empty search criteria the posterior is equal to the prior and unless the prior is already very specific you should expect sampling from the posterior to give models acting randomly and getting benign failures everywhere.
• If our search criteria requires unreasonable performance on some small test-set, and our prior doesn't give a significant enough bias toward simple/general models then we should expect benign generalization failures due to 'overfitting'.
• If our search criteria, prior and model-space only focus on a limited task but are set up to correctly identify general solutions to this task then we might expect little generalization error within this task, while getting almost entirely benign errors outside the task. This seems to be where current ML systems are situated.
• If we pick search criteria which require good performance on increasingly general tasks, and we make sure that the prior is increasingly weighted toward the right kind of simple/general solutions then we might expect to see less generalization failure overall in a broad domain, but we also risk malign generalization errors appearing.

Summing up, I think a reasonable definition of the prosaic AI alignment project is to prevent malign generalization error from ever happening, even as we try to eliminate benign errors. This seems difficult mostly because moving toward robust generalization and toward malign generalization seem very similar, and you need some way to differentially advantage the first. Some approaches to this include:

• Design a model-space and prior which advantages the sort of 'intended generalization' that we want, or which are transparent enough that we can use really powerful search criteria.
• Design search criteria which effectively shift the distribution to one containing mostly robustly generalizing models. Such criteria would likely involve a lot of inspection to see how the model actually works and generalizes internally. Paul Christiano's approach of adversarial training seems like a plausibly good way to do this.
• Find ways of making stronger search criteria feasible. An example of this is that the search criterion might be bottle-necked by human judgement and oversight, and amplification is a scheme which might remove this bottleneck and make more detailed search criteria more feasible.
• Consider whether we can replace the idea of one big system which should generalize to any situation with many specialized systems which are allowed to have benign generalization failures in many domains. This might correspond to a more 'comprehensive services'-style solution, as described by Eric Drexler and summarized by Rohin Shah.

Reintroducing agency

Now I will try to give an account of why the concepts of rational agency/goals/optimizers might be useful in this picture, even though they aren't explicitly part of the problem statement nor the mentioned solutions. This is based on a hand-wavy hypothesis:

H: If you have a prior and model-space which sufficiently advantages descriptive simplicity and a selection criteria which tests performance in a sufficiently general set of situations, then your posterior distribution on the model-space will contain a large measure of models internally implementing highly effective expected utility optimization for some utility function.

There are several arguments supporting this hypothesis, such as those presented by Eliezer in Sufficiently Optimized Agents Appear Coherent and the simplicity/effectiveness of simple optimization algorithms.

If H is true then it provides a good reason to study and understand goals, agency and optimization as describing properties of a particular cluster of models which will play a very important role once we start using these methods to solve very general classes of problems.

As a slight aside, this also gives a framing for the much discussed mesa optimization problem in the Risks from Learned Optimization paper, which points out that there is no a priori reason to expect the utility function to be the one you might have used to grade the model as part of the selection criteria, and that most of the measure might in fact be taken up by pseudo-aligned or deceptively-aligned models, which represent a particular example of malign generalization error. In fact, if H is true, avoiding malign generalization errors largely comes down to avoiding misaligned mesa optimizers.

I think the world where H is true is a good world, because it's a world where we are much closer to understanding and predicting how sophisticated models generalize. If we are dealing with a model doing expected utility maximization we can 'just' try to understand whether we agree with its goal, and then essentially trust that it will correctly and stably generalize to almost any situation.

If you agree that understanding how an expected utility maximizer generalizes could be easier than for many other classes of minds, then studying this cluster of model-space could be useful even if H is false, as long as the weaker hypothesis H' still holds.

H': We will be able to find some model-space, prior and feasible selection criteria such that the posterior distribution on the model-space contains a large measure of models internally implementing highly effective expected utility maximization for some utility function.

In the world where H' holds we can then restrict ourselves to this way of searching, and can thus use the kinds of methods and assumptions which we could in the world where H was true.

In either of these cases I think current models of AI Alignment which treat optimizers with goals as the central problem are justified. However, I think there are reasons to believe H and possibly even H' might be false, which essentially come down to embedded agency and bounded rationality concerns pushing away from elegant agent frameworks. I also feel very uncomfortable resting the safety of humanity on assumptions like this, and would like a much better understanding of how generalization works in other clusters or parts of various model-spaces.

Summary

I have tried to present a version of the prosaic AI alignment project which doesn't make important reference to the concept of agency, instead viewing it as a generalization problem where you are trying to avoid finding models which fail disastrously when presented with new situations. Agency then reappears as a potentially important cluster of the space of possible models, which under certain empirical hypotheses justifies it as the central topic, though I still wish we had more understanding of other parts of various model-spaces.


Vaniver's View on Factored Cognition

LessWrong.com News - August 23, 2019 - 05:54
Published on August 23, 2019 2:54 AM UTC

The View from 2018

In April of last year, I wrote up my confusions with Paul’s agenda, focusing mostly on approval directed agents. I mostly have similar opinions now; the main thing I noticed on rereading it was I talked about ‘human-sized’ consciences, when now I would describe them as larger than human size (since moral reasoning depends on cultural accumulation which is larger than human size). But on the meta level, I think they’re less relevant to Paul’s agenda than I thought then; I was confused about how Paul’s argument for alignment worked. (I do think my objections were correct objections to the thing I was hallucinating Paul meant.) So let’s see if I can explain it to Vaniver_2018, which includes pointing out the obstacles that Vaniver_2019 still sees. It wouldn't surprise me if I was similarly confused now, tho hopefully I am less so, and you shouldn't take this post as me speaking for Paul.

Factored Cognition

One core idea that Paul’s approach rests on is that thoughts, even the big thoughts necessary to solve big problems, can be broken up into smaller chunks, and this can be done until the smallest chunk is digestible. That is, problems can be ‘factored’ into parts, and the factoring itself is a task (that may need to be factored). Vaniver_2018 will object that it seems like ‘big thoughts’ require ‘big contexts’, and Vaniver_2019 has the same intuition, but this does seem to be an empirical question that experiments can give actual traction on (more on that later).

The hope behind Paul’s approach is not that the small chunks are all aligned, and chaining together small aligned things leads to a big aligned thing, which is what Vaniver_2018 thinks Paul is trying to do. A hope behind Paul’s approach is that the small chunks are incentivized to be honest. This is possibly useful for transparency and avoiding inner optimizers. A separate hope with small chunks is that they’re cheap; mimicking the sort of things that human personal assistants can do in 10 minutes only requires lots of 10 minute chunks of human time (each of which only costs a few dollars) and doesn’t require figuring out how intelligence works; that’s the machine learning algorithm’s problem.

So how does it work? You put in an English string, a human-like thing processes it, and it passes out English strings--subquestions downwards if necessary, and answers upwards. The answers can be “I don’t know” or “Recursion depth exceeded” or whatever. The human-like thing comes preloaded (or pre-trained) with some idea of how to do this correctly; obviously incorrect strategies like “just pass the question downward for someone else to answer” get ruled out, and the humans we’ve trained on have been taught things like how to do good Fermi estimation and some of the alignment basics. This is general, and lets you do anything humans can do in a short amount of time (and when skillfully chained, anything humans can do in a long amount of time, given the large assumption that you can serialize the relevant state and subdivide problems in the relevant ways).
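The question-in, answer-out recursion described above can be sketched as follows. This is a toy illustration, not Paul's actual scheme: `toy_unit` is a hypothetical stand-in for the human-like unit that only knows how to split "sum ..." questions, and the `answer` driver and its `max_depth` parameter are assumptions made for the example.

```python
from typing import Callable, List, Tuple, Union

DEPTH_EXCEEDED = "Recursion depth exceeded"
DONT_KNOW = "I don't know"

# A unit either answers directly (a string) or returns subquestions plus a
# function that combines the sub-answers into an answer.
Decomposition = Tuple[List[str], Callable[[List[str]], str]]
Unit = Callable[[str], Union[str, Decomposition]]


def answer(question: str, unit: Unit, depth: int = 0, max_depth: int = 5) -> str:
    """Pass subquestions downwards and answers upwards, as plain strings."""
    if depth > max_depth:
        return DEPTH_EXCEEDED
    result = unit(question)
    if isinstance(result, str):
        return result
    subquestions, combine = result
    sub_answers = [answer(q, unit, depth + 1, max_depth) for q in subquestions]
    if DEPTH_EXCEEDED in sub_answers:
        return DEPTH_EXCEEDED
    return combine(sub_answers)


def toy_unit(question: str) -> Union[str, Decomposition]:
    """Hypothetical unit: answers 'sum 1 2 3' style questions by halving them."""
    parts = question.split()
    nums = parts[1:]
    if not nums or parts[0] != "sum" or not all(n.isdigit() for n in nums):
        return DONT_KNOW
    if len(nums) == 1:
        return nums[0]
    mid = len(nums) // 2
    subquestions = ["sum " + " ".join(nums[:mid]), "sum " + " ".join(nums[mid:])]
    return subquestions, lambda answers: str(sum(int(a) for a in answers))


print(answer("sum 1 2 3 4", toy_unit))
print(answer("sum 1 2 3 4", toy_unit, max_depth=1))
```

Note that the driver rules out nothing about the unit's strategy; a unit that simply passed its question downward unchanged would only ever hit the depth limit, which is why the text requires the unit to come pre-trained with some idea of how to decompose correctly.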

Now schemes diverge a bit on how they use factored cognition, but in at least some we begin by training the system to simply imitate humans, and then switch to training the system to be good at answering questions or to distill long computations into cached answers or quicker computations. One of the tricks we can use here is that ‘self-play’ of a sort is possible, where we can just ask the system whether a decomposition was the right move, and this is an English question like any other.

Honesty Criterion

Originally, I viewed the frequent reserialization as a solution to a security concern. If you do arbitrary thought for arbitrary lengths of time, then you risk running into inner optimizers or other sorts of unaligned cognition. Now it seems that the real goal is closer to an ‘honesty criterion’; if you ask a question, all the computation in that unit will be devoted to answering the question, and all messages between units are passed where the operator can see them, in plain English.[1]

Even if one succeeds at honesty, it still seems difficult to maintain both generality and safety. That is, I can easily see how factored cognition allows you to stick to cognitive strategies that definitely solve a problem in a safe way, but don't see how it does that and allows you to develop new cognitive strategies to solve a problem that doesn’t result in an opening for inner optimizers--not within units, but within assemblages of units. Or, conversely, one could become more general while giving up on safety. In order to get both it seems like we’re resting a lot on the Overseer’s Manual or way that we trained the humans that we used as training data.

Serialized State is Inadequate or Inefficient

In my mind, the primary reason to build advanced AI (as opposed to simple AI) is to accomplish megaprojects instead of projects. Curing cancer (in a way that potentially involves novel research) seems like a megaproject, whereas determining how a particular protein folds (which might be part of curing cancer) is more like a project. To the extent that Factored Cognition relies on the serialized state (of questions and answers) to enforce honesty on the units of computation, it seems like that will be inefficient for problems whose state is large enough that it imposes significant serialization costs, and inadequate for problems whose state is too large to serialize. If we allow answers that are a page long at most, or that a human could write out in 10 minutes, then we’re not going to get a 300-page report of detailed instructions. (Of course, allowing them to collate reports written by subprocesses gets around this difficulty, but means that we won’t have ‘holistic oversight’ and will allow for garbage to be moved around without being caught if the system doesn’t have the ability to read what it’s passing.)

The factored cognition approach also has a tree structure of computation, as opposed to a graph structure, which leads to lots of duplicated effort and the impossibility of horizontal communication. If I’m designing a car, I might consider each part separately, but then also modify the parts as I learn more about the requirements of the other parts. This sort of sketch-then-refinement seems quite difficult to do under the factored cognition approach, even though it involves reductionism and factorization.

Shared memory partially solves this (because, among other things, it introduces the graph structure of computation), but now reduces the guarantee of our honesty criterion because we allow arbitrary side effects. It seems to me like this is a necessary component for most of human reasoning, however. Michael Faraday, a pioneer of electromagnetism, lost much of his memory with age, in a way that seriously reduced his scientific productivity. And factored cognition doesn’t even allow the external notes and record-keeping he used to partially compensate.

There's Actually a Training Procedure

The previous section described what seems to me to be a bug; from Paul's perspective this might be a necessary feature because his approaches are designed around taking advantage of arbitrary machine learning, which means only the barest of constraints can be imposed. IDA presents a simple training procedure that, if used with an extremely powerful model-finding machine learning system, allows us to recursively surpass the human level in a smooth way. (Amusingly to me, this is like Paul enforcing slow takeoff.)

Training The Factoring Problem is Ungrounded

From my vantage point, the trick that we can improve the system by asking it questions like “was X a good way to factor question Y?”, where X was the attempt it had at factoring Y, is one of the core reasons to think this approach is workable, and also seems like it won’t work (or will preserve blind spots in dangerous ways). This is because while we could actually find the ground truth on how many golf balls fit in a 737, it is much harder to find the ground truth on what cognitive style most accurately estimates how many golf balls fit in a 737.

1. Check how similar it is to what you would do. A master artist might watch the brushstrokes made by a novice artist, and then point out wherever the novice artist made questionable choices. Similarly, if we get the question “if you’re trying to estimate how many golf balls fit in a 737, is ‘length of 737 * height of 737 * width of 737 / volume of golf ball’ a good method?” we just compute what we would have done and estimate if the approach will have a better or worse error.
2. Check whether or not it accords with principles (or violates them). Checking the validity of a mathematical proof normally is done by making sure that all steps are locally valid according to the relevant rules of inference. In a verbal argument, one might just check for the presence of fallacies of reasoning.
3. Search over a wide range of possible solutions, and see how it compares to the distribution. But how broadly in question-answer policy space are we searching?

We now face some tradeoffs between exploration (in a monstrously huge search space, which may be highly computationally costly to meaningfully explore) and rubber-stamping, where I use my cognitive style to evaluate whether or not my cognitive style is any good. Even if we have a good resolution to that tradeoff, we have to deal with the cognitive credit-assignment problem.

That is, in reinforcement learning one has to figure out which actions taken (or not taken) before a reward led to receiving the reward so that it can properly assign credit; similarly the system that's training the Q&A policy needs to understand well enough how the policy is leading to correct answers such that it can apply the right gradients in the right places (or use a tremendous amount of compute doing this by blind search).

This is complicated by the fact that there may be multiple approaches to problem-solving that are internally coherent, but mixtures of those approaches fail. If we only use methods like gradient-descent that smoothly traverse the solution space, this won't be a problem (because gradient descent won't sharply jump from one to another), but it's an open empirical question as to whether future ML techniques will be based on gradient descent. It’s not obvious how we can extricate ourselves from the dependence on our learned question-answer policy. If I normally split a model into submodels based on a lexicographical ordering, and now I’m considering a hypothetical split into submodels based on statistical clustering, I would likely want to consider the hypothetical split all the way down the tree (as updates to my beliefs on ‘what strategy should I use to A this Q?’ will impact more than just this question), especially if there are two coherent strategies but a mixture of the strategies is incoherent. But how to implement this is nonobvious; am I not just passing questions to the alternate branch, but also a complete description of the new cognitive strategy they should employ? It seems like a tremendous security hole to have ‘blindly follow whatever advice you get in the plaintext of questions’ as part of my Q->A policy, and so it seems more like I should be spinning up a new hypothetical agent (where the advice is baked into their policy instead of their joint memory) in a way that may cause some of my other guarantees that relied on smoothness to fail.

Also note that because updates to my policy impact other questions, I might actually want to consider the impact on other questions as well, further complicating the search space. (Ideally, if I had been handling two questions the same way and discover that I should handle them separately, my policy will adjust to recognize the two types and split accordingly.) While this is mostly done by the machine learning algorithm that’s trying to massage the Q->A policy to maximize reward, it seems like making the reward signal (from the answer to this meta-question) attuned to how it will be used will probably make it better (consider the answer “it should be answered like these questions, instead of those,” though generally we assume yes/no answers are used for reward signals).

When we have an update procedure to a system, we can think of that update procedure as the system's "grounding", or the source of gravity that it becomes arranged around. I don't yet see a satisfying source of grounding for proposals like HCH that are built on factored cognition. Empiricism doesn't allow us to make good use of samples or computation, in a way that may render the systems uncompetitive, and alternatives to empiricism seem like they allow the system to go off in a crazy direction in a way that's possibly unrecoverable. It seems like the hope is that we have a good human seed that then is gradually amplified, in a way that seems like it might work but relies on more luck than I would like: the system is rolling the dice whenever it makes a significant transition in its cognitive style, as it can no longer fully trust oversight from previous systems in the amplification tree as they may misunderstand what's going on in the contemporary system, and it can no longer fully trust oversight from itself, because it's using the potentially corrupted reasoning process to evaluate itself.

1. Of course some messages could be hidden through codes, but this behavior is generally discouraged by the optimization procedure, as whenever you compare to a human baseline they will not do the necessary decoding and will behave in a different way, costing you points.


Redefining Takeoff Speeds

LessWrong.com News - August 23, 2019 - 05:15
Published on August 23, 2019 2:15 AM UTC

This post is a result of numerous discussions with other participants and organizers of the MIRI Summer Fellows Program 2019. It describes ideas that are likely already known by many researchers. However, given how often disagreements about slow/fast takeoffs come up, I believe there is significant value in making them common knowledge.

Takeoff speed & why it matters

In Superintelligence, Nick Bostrom distinguishes between slow, medium, and fast AI takeoff scenarios (where the takeoff speed is measured by how much real-world time passes between the milestones of human-level AI (HLAI) and superintelligent AI (SAI)). He argues that slow takeoff should be reasonably safe since humanity would have sufficient time to coordinate and solve the AI alignment problem, while fast takeoff would be particularly dangerous since we wouldn't be able to react to what the AI does.

Real-world time is not what matters

In many scenarios, the real-time takeoff speed indeed strongly correlates with our ability to influence the outcome. However, we can also imagine many scenarios where this is not the case. As an example, suppose we obtain HLAI by simulating humans in virtual environments, and that this procedure additionally fully preserves the simulated humans' alignment with humanity. Since this effectively increases the speed at which humanity operates, we might get a "fully controlled takeoff" even if the transition from HLAI to SAI[1] only takes a few days of real-world time. More generally, if our path to HLAI also increases the effectiveness of humanity's efforts, the "effective time" we get between HLAI and SAI will scale accordingly. For example, this might be the case if we go the way of Iterated Distillation and Amplification or Comprehensive AI Services. Less controversially, suppose we automate most of the current programming tasks and increase the re-usability of code, such that every computer scientist becomes 100 times as effective as they are now.

Conclusion: measure useful work

Given these examples, I think we should measure takeoff speeds not in real-world time, but rather in (some operationalization of) the work-towards-AI-alignment that humanity will be able to do between HLAI and SAI. Anecdotal examples of such measures might include "integral of the human-originating GDP between HLAI and SAI" or "number of AI safety papers published between HLAI and SAI". I believe that finding a non-anecdotal operationalization would benefit many AI policy/strategy discussions.
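As a rough sketch of what such a measure could look like (the symbols are assumptions introduced here: $e(t)$ is the rate at which humanity produces alignment-relevant work at time $t$, and $e_0$ is today's rate):

```latex
W \;=\; \int_{t_{\mathrm{HLAI}}}^{t_{\mathrm{SAI}}} e(t)\, dt ,
\qquad
e(t) = k\, e_0 \;\Rightarrow\; W = k\, e_0\, (t_{\mathrm{SAI}} - t_{\mathrm{HLAI}})
```

Under the constant-rate simplification, a takeoff lasting 3 real-world days with a hypothetical $k = 100$ gives $W = 300\, e_0$ days, the same amount of work as 300 days at today's rate. The hard part, which this sketch does not address, is choosing an operationalization of $e(t)$.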

1. Recall that Bostrom distinguishes between speed, collective, and quality superintelligence. Arguably, being able to simulate humans (with enough compute) already constitutes a speed superintelligence. However, I don't think this diminishes the overall point of the post. ↩︎


Does Agent-like Behavior Imply Agent-like Architecture?

LessWrong.com News - August 23, 2019 - 05:01
Published on August 23, 2019 2:01 AM UTC

This is not a well-specified question. I don't know what "agent-like behavior" or "agent-like architecture" should mean. Perhaps the question should be "Can you define the fuzzy terms such that 'Agent-like behavior implies agent-like architecture' is true, useful, and in the spirit of the original question?" I mostly think the answer is no, but it seems like it would be really useful to know if true, and the process of trying to make this true might help us triangulate what we should mean by agent-like behavior and agent-like architecture.

Now I'll say some more to try to communicate the spirit of the original question. First, a giant look-up table (GLUT) is not (directly) a counterexample. This is because it might be that the only way to produce an agent-like GLUT is to use agent-like architecture to search for it. Similarly, a program that outputs all possible GLUTs is also not a counterexample, because you might have to use your agent-like architecture to point at the specific counterexample. A longer version of the conjecture is "If you see a program that implements agent-like behavior, there must be some agent-like architecture in the program itself, in the causal history of the program, or in the process that brought your attention to the program." The pseudo-theorem I want is similar to the claim that correlation really does imply causation or the good regulator theorem.

One way of defining agent-like behavior is as that which can only be produced by an agent-like architecture. This makes the theorem trivial, and the challenge is making the theorem non-vacuous. In this light, the question is something like "Is there some nonempty class of architectures that can reasonably be described as a subclass of 'agent-like' such that the class can be equivalently specified either functionally or syntactically?" This looks like it might conflict with the spirit of Rice's theorem, but I think making it probabilistic and referring to the entire causal history of the algorithm might give it a chance of working.

One possible way of defining agent-like architecture is something like "Has a world model and a goal, and searches over possible outputs to find one such that the model believes that output leads to the goal." Many words in this will have to be defined further. World model might be something that has high logical mutual information with the environment. It might be hard to define search generally enough to include everything that counts as search. There also might be completely different ways to define agent-like architecture. Do whatever makes the theorem true.
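To make the candidate definition concrete, here is a minimal sketch of "world model plus goal plus search over outputs". Everything in it is an illustrative assumption rather than a proposed formalization: the names `agent_choose`, `world_model`, and `goal` are hypothetical, states are plain integers, and "search" is reduced to a linear scan.

```python
from typing import Callable, Iterable, Optional


def agent_choose(
    world_model: Callable[[int, int], int],  # (state, output) -> predicted next state
    goal: Callable[[int], bool],             # predicate over predicted states
    outputs: Iterable[int],                  # candidate outputs to search over
    state: int,
) -> Optional[int]:
    """Return the first output the model believes leads to the goal, else None."""
    for out in outputs:
        if goal(world_model(state, out)):
            return out
    return None


# Toy instance (hypothetical): the "world" is a counter and outputs are increments.
model = lambda state, out: state + out
reaches_ten = lambda s: s == 10

choice = agent_choose(model, reaches_ten, range(-5, 6), state=7)
print(choice)  # 3
```

Even in this toy form the hard parts of the definition are visible: what counts as a model, and whether a linear scan is general enough to stand in for "search", are exactly the terms that would need further definition.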


The "Commitment Races" problem

LessWrong.com News - August 23, 2019 - 04:58
Published on August 23, 2019 1:58 AM UTC

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will only invite more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first. If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other fellow limits his. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)

Here is a somewhat concrete example: Two consequentialist AGIs think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.
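As a rough illustration of how two "reasonable" fairness standards can collide, the sketch below computes both bargaining solutions on a made-up problem. The Pareto frontier `1 - u_a**2`, the disagreement point (0, 0), and the grid search are all assumptions chosen for simplicity; nothing here comes from the post beyond the names of the two solution concepts.

```python
# Hypothetical Pareto frontier: if agent A gets u_a in [0, 1], agent B gets 1 - u_a**2.
# The disagreement point is assumed to be (0, 0), so utilities are also gains.
N = 100_000
grid = [i / N for i in range(N + 1)]


def frontier(u_a: float) -> float:
    return 1.0 - u_a ** 2


# Nash bargaining solution: maximize the product of the two gains.
nash_a = max(grid, key=lambda u: u * frontier(u))
nash = (nash_a, frontier(nash_a))

# Kalai-Smorodinsky solution: the frontier point where each gain is the same
# fraction of that agent's best feasible gain. Both maxima are 1 here, so the
# two gains are simply equal.
ks_a = min(grid, key=lambda u: abs(u - frontier(u)))
ks = (ks_a, frontier(ks_a))

# A commits to its Kalai-Smorodinsky share, B commits to its Nash share.
demand_a, demand_b = ks[0], nash[1]
compatible = demand_b <= frontier(demand_a)

print(nash, ks, compatible)
```

In this toy problem A's demand exceeds what the Nash solution would give A, and B's demand exceeds what the Kalai-Smorodinsky solution would give B, so each reads the other as a bully and the two demands cannot both be met.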

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We'd wish we built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

1. Devastating "Grim trigger" commitments are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is much easier also, especially in an acausal context (see above.)

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things do happen to some extent even among humans today--especially in situations of anarchy. Hopefully we can do better.

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she was physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

Conclusion

Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.


Analysis of a Secret Hitler Scenario

LessWrong.com News - August 23, 2019 - 04:37
Published on August 23, 2019 1:24 AM UTC

Secret Hitler is a social deception game in the tradition of Mafia, The Resistance and Avalon [1]. You can read the rules here if you aren't familiar. I haven't played social deception games regularly since 2016 but in my mind it's a really good game that represented the state of the art in the genre at that time. I'm going to discuss an interesting situation in which I reasoned poorly.

I was a liberal in a ten player game. The initial table set up, displaying relevant players was,

We passed a fascist article in the first round. The next government was Marek as president and Chancellor a player 3 to the right of Marek [2]. Marek passed a fascist article and inspected Sam and declared Sam was fascist to Sam's counterclaim that Marek was fascist. Sam was to the left of Marek so we had no data about him. The table generally supported Sam but I leaned towards believing Marek.

At the beginning of the game my view of possibilities looked roughly like:

But of course I'm already approximating. From my perspective an individual is only 4/9 to be Fascist and the events that two individuals are fascist are not independent. A more careful calculation would have been:

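• There are $\binom{9}{4}$ possible distributions of Fascists, and in $\binom{7}{4}$ of them both Marek and Sam are good, for a probability of 5/18.
• There are $\binom{7}{3}$ ways for Marek to be Fascist and Sam to be Liberal. Of course $\binom{7}{4} = \binom{7}{3}$ so we get 5/18 again. By symmetry the same holds for Sam being Fascist and Marek being Liberal.
• The remaining event, that both are evil, must then have probability 1 − 3·(5/18) = 3/18.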
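The counts above can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming the standard ten-player setup in which four of the nine players I cannot see are on the fascist team; the variable names are mine.

```python
from fractions import Fraction
from math import comb

others = 9     # players whose roles I cannot see (ten players minus me)
fascists = 4   # fascist-team members among them, Hitler included

total = comb(others, fascists)               # 126 possible role assignments
both_liberal = Fraction(comb(7, 4), total)   # all 4 fascists among the other 7
marek_only = Fraction(comb(7, 3), total)     # Marek fascist, Sam liberal
sam_only = marek_only                        # symmetric case
both_fascist = Fraction(comb(7, 2), total)   # Marek and Sam both fascist

print(both_liberal, marek_only, sam_only, both_fascist)  # 5/18 5/18 5/18 1/6
```

The four cases sum to 1, and 1/6 is the 3/18 from the last bullet in lowest terms.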

So a better perspective would have been:

I don't think I could do this math consistently in game though so I'll do the rest of the analysis with my original priors. I've included it here for reference in your future 10 player games of Secret Hitler.

When Marek declared Sam was Fascist the only scenario which is confidently eliminated is that both are liberal. Any reasonable liberal player is truth promoting and has no reason to lie. At first glance it seems that the possibility that both are fascist is also eliminated as the fascists should have no reason to fight each other. This doesn't strictly hold though. The fascists only need one fascist in a government to likely sink it and if they reason that the liberals will reason that one of them must be good then they have good reason to pick fights with each other. But in practice in an accusation situation the table often opts to pick neither so it's a risky move.

Now we get into the questionable deduction I made during the game. If Marek is a fascist and he inspects a liberal he has a choice to make. He can accuse the liberal of being fascist to sow distrust among the liberals. Or he can tell the truth to garner trust with the liberals. If Marek is liberal and he inspects a fascist then he has no choice to make. He will declare that the fascist is a fascist.

In the moment I figured there was about a 50-50 chance that a fascist would choose to call a liberal a fascist or a liberal. Let's call fascists who lie about liberals' identities bold and fascists who tell the truth timid. I discounted the possibility that both were fascist and I reasoned with probabilities from the first square. So given that Marek had accused Sam that meant that Marek was a liberal with probability 75%.

The problem is I didn't adequately update on Marek's fascist presidency. In my mental model of the game it's not that improbable to draw 3 fascist articles. This mental model is derived from the fact that it generally happens once or twice a game. But it's still an unlikely event in the sense that I should consider it evidence that the president was fascist. From the reports of the first group 3 fascist articles had already been buried. Even if I distrust them I can still guess at least 2 fascist articles were buried. So the probability that 3 fascist articles were drawn again is at most $\binom{11}{5}/\binom{14}{5} \approx 0.23$. Going into the investigation I had a belief that Marek was Fascist with probability about 50% but I should have already updated to ~88% that he was a fascist as opposed to an unlucky liberal (100% of the time he didn't draw 3 fascist articles he buried a liberal and is a fascist and half the rest of the time he's a fascist by my prior). Given that, even with my conjecture that Fascists make false accusations only half the time I should have guessed Marek was more likely fascist than Sam. Marek's accusation demonstrated Marek is not a timid fascist which I conjectured to be half the fascist probability mass. By Bayes' Theorem I should have updated to Marek being fascist with probability:

$$P(\text{Marek a Fascist} \mid \text{Marek not a timid Fascist}) = \frac{P(\text{Marek not a timid Fascist} \mid \text{Marek a Fascist})\, P(\text{Marek a Fascist})}{P(\text{Marek not a timid Fascist})} = \frac{0.5 \cdot 0.88}{0.12 + 0.88/2} \approx 0.79$$
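The two updates can be replayed numerically. This sketch reads the deck figure as 14 remaining articles of which 5 are liberal (that is, 2 fascist and 1 liberal article already gone), which is my interpretation of the binomial ratio; the 50% prior and the 50% "bold" rate are the in-game guesses stated in the text.

```python
from math import comb

# Assumed deck after the first government: 14 cards left, 5 of them liberal.
# The top 3 are all fascist exactly when the 5 liberal cards sit among the
# other 11 positions.
p_three_fascist = comb(11, 5) / comb(14, 5)

prior_fascist = 0.5  # in-game prior that Marek is fascist
p_bold = 0.5         # conjectured chance a fascist falsely accuses a liberal

# Update 1: Marek enacted a fascist article. Without a triple-fascist draw he
# is certainly fascist; with one, fall back on the prior.
p_fascist = (1 - p_three_fascist) + p_three_fascist * prior_fascist

# Update 2: Marek accused Sam, so he is not a timid fascist.
posterior = p_bold * p_fascist / ((1 - p_fascist) + p_bold * p_fascist)

print(round(p_three_fascist, 2), round(p_fascist, 2), round(posterior, 2))  # 0.23 0.88 0.79
```

Carrying unrounded values through gives the same 0.79 to two decimal places as the hand calculation with 0.88.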

I definitely computed badly in the moment. I think my model building was also bad in a number of other ways which are harder for me to put numbers on:

• I thought Marek was unlikely to pick a fight since he seemed relatively new and he was very quiet after his accusation. In my mind people lie about other people's identities prepared to fight and Marek seemed sort of timid. The counterpoint to this is lying is the obvious level one strategy for fascists. A reasonable person might think it's just what fascists are supposed to do.
• I didn't factor in the probability that both were fascists at all.
• All the players seemed to jump on Marek. I think part of the reason I defended him was a contrarian bias. But being contrarian against too big a consensus in Secret Hitler is important. If everyone agrees about something then some fascists are agreeing.
• On the meta level I was very sleepy and knew myself to not be reasoning that well or remembering basic facts that accurately. I probably should have deferred to the group.

Thanks for reading and let me know if you have any other thoughts about the position.

1. The development mafia -> resistance -> avalon -> Secret Hitler represents substantial progress in board gaming technology. There's also a lot of amazing adjacent games like Two Rooms and a Boom and One Night Ultimate Werewolf. I'm thankful to live in a time of extraordinary board game technological progress. ↩︎

2. Our meta was such that the chancellorship rotated counterclockwise as the presidency rotated clockwise to see the maximum number of players. In later games our meta updated to make the chancellor 3 to the left of the president and neining to them if we got a favorable result which seems to be a very powerful strategy for the liberals. ↩︎


Thoughts from a Two Boxer

LessWrong.com News - August 23, 2019 - 03:57
Published on August 23, 2019 12:24 AM UTC

I'm writing this for blog day at MSFP. I thought about a lot of things here like category theory, the 1-2-3 conjecture and Paul Christiano's agenda. I want to start by thanking everyone for having me and saying I had a really good time. At this point I intend to go back to thinking about the stuff I was thinking about before MSFP (random matrix theory). But I learned a lot and I'm sure some of it will come to be useful. This blog is about (my confusion about) decision theory.

Before the workshop I hadn't read much besides Eliezer's paper on FDT and my impression was that it was mostly a good way of thinking about making decisions and at least represented progress over EDT and CDT. After thinking more carefully about some canonical thought experiments I'm no longer sure. I suspect many of the concrete thoughts which follow will be wrong in ways that illustrate very bad intuitions. In particular I think I am implicitly guided by non-example number 5 of an aim of decision theory in Wei Dai's post on the purposes of decision theory. I welcome any corrections or insights in the comments.

The Problem of Decision Theory

First I'll talk about what I think decision theory is trying to solve. Basically I think decision theory is the theory of how one should[1] decide on an action after one already understands: The actions available, the possible outcomes of actions, the probabilities of those outcomes and the desirability of those outcomes. In particular the answers to the listed questions are only adjacent to decision theory. I sort of think answering all of those questions is in fact harder than the question posed by decision theory. Before doing any reading I would have naively expected that the problem of decision theory, as stated here, was trivial but after pulling on some edge cases I see there is room for a lot of creative and reasonable disagreement.

A lot of the actual work in decision theory is the construction of scenarios in which ideal behavior is debatable or unclear. People choose their own philosophical positions on what is rational in these hairy situations and then construct general procedures for making decisions which they believe behave rationally in a wide class of problems. These constructions are a concrete version of formulating properties one would expect an ideal decision theory to have.

One such property is that an ideal decision theory shouldn't choose to self modify in some wide vaguely defined class of "fair" problems. An obviously unfair problem would be one in which the overseer gives CDT $10 and any other agent $0. One of my biggest open questions in decision theory is where this line between fair and unfair problems should lie. At this point I am not convinced that any problem where agents in the environment have access to our decision theory's source code or copies of our agent is a fair problem. But my impression from hearing and reading what people talk about is that this is a heretical position.

Newcomb's Problem

Let's discuss Newcomb's problem in detail. In this problem there are two boxes one of which you know contains a dollar. In the other box an entity predicting your action may or may not put a million dollars. They put a million dollars if and only if they predict you will only take one box. What do you do if the predictor is 99 percent accurate? How about if it is perfectly accurate? What if you can see the content of the boxes before you make your decision?

An aside on why Newcomb's problem seems important: It is sort of like a prisoner's dilemma. To see the analogy imagine you're playing a classical prisoner's dilemma against a player who can reliably predict your action and then chooses to match it. Newcomb's problem seems important because prisoner's dilemmas seem like simplifications of situations which really do occur in real life. The tragedy of prisoner dilemmas is that game theory suggests you should defect but the real world seems like it would be better if people cooperated.

Newcomb's problem is weird to think about because the predictor and agent's behaviors are logically connected but not causally. That is, if you tell me what the agent does or what the predictor predicts as an outside observer I can guess what the other does with high probability. But once the predictor predicts the agent could still take either option and flip flopping won't flip flop the predictor. Still one may argue you should one box because being a one boxer going into the problem means you will likely get more utility. I disagree with this view and see Newcomb's problem as punishing rational agents.
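For reference, the one-boxer's argument amounts to the following expected-value comparison, which conditions the box contents on the choice actually made. This is a sketch of that argument only, not of the view defended here; the function name and the default stakes ($1 and $1,000,000, as in the problem statement above) are my own framing.

```python
def expected_payoffs(accuracy: float, small: float = 1.0, big: float = 1_000_000.0):
    """Expected dollars for each policy, conditioning on the predictor's accuracy."""
    one_box = accuracy * big                  # big box is full iff one-boxing was predicted
    two_box = small + (1.0 - accuracy) * big  # big box is full only on a misprediction
    return one_box, two_box


for acc in (0.99, 1.0):
    print(acc, expected_payoffs(acc))
```

On this accounting one-boxing comes out far ahead at both accuracy levels. The two-boxer's reply is that the calculation treats the already-fixed contents as if they depended on the present choice.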

If Newcomb's problem is ubiquitous and one imagines an agent walking down the street constantly being Newcombed it is indeed unfortunate if they are doomed to two box. They'll end up with far fewer dollars. But this thought experiment is missing an important part of real world detail in my view: how the predictors predict the agent's behavior. There are three possibilities:

• The predictors have a sophisticated understanding of the agent's inner workings and use it to simulate the agent to high fidelity.
• The predictors have seen many agents like our agent doing problems like this problem and use this to compute a probability of our agent's choice and compare it to a decision threshold.
• The predictor has been following the behavior of our agent and uses this history to assign its future behavior a probability.

In the third bullet the agent should one box if they predict they are likely to be Newcombed often[2]. In the second bullet they should one box if they predict that members of their population will be Newcombed often and they derive more utility from the extra dollars their population will get than the extra dollar they could get for themselves. I have already stated I see the first bullet as an unfair problem.

My big complaint with mind reading is that there just isn't any mind reading. All my understanding of how people behave comes from observing how they behave in general, how the human I'm trying to understand behaves specifically, whatever they have explicitly told me about their intentions and whatever self knowledge I have that I believe is applicable to all humans. Nowhere in the current world do people have to make decisions under the condition of being accurately simulated.

Why then do people develop so much decision theory intended to be robust in the presence of external simulators? I suppose it's because there's an expectation that this will be a major problem in the future which should be solved philosophically before it is practically important. Mind reading could become important to humans if mind surveillance became possible and was deployed. I don't think such a thing is possible in the near term or likely even in the fullness of time. But I also can't think of any insurmountable physical obstructions so maybe I'm too optimistic.

Mind reading is relevant to AI safety because whatever AGI is created will likely be a program on a computer somewhere which could reason its program stack is fully transparent or its creators are holding copies of it for predictions.

Conclusion

Having written that last paragraph I suddenly understand why decision theory in the AI community is the way it is. I guess I wasn't properly engaging with the premises of the thought experiment. If one actually did tell me I was about to do a Newcomb experiment I would still two box because knowing I was in the real world I wouldn't really believe that an accurate predictor would be deployed against me. But an AI can be practically simulated and what's more can reason that it is just a program run by a creator that could have created many copies of it.

I'm going to post this anyway since it's blog-day and not important-quality-writing day but I'm not sure this blog has much of a purpose anymore.

1. This may read like I'm already explicitly guided by the false purpose Wei Dai warned against. My understanding is that the goal is to understand ideal decision making. Just not for the purposes of implementation. ↩︎

2. I don't really know anything but I imagine the game theory of reputation is well developed. ↩︎


LessWrong.com News - August 23, 2019 - 03:21
Published on August 23, 2019 12:21 AM UTC

This post is a result of numerous discussions with other participants and organizers of the MIRI Summer Fellows Program 2019.

I recently (hopefully :-) ) dissolved some of my confusion about agency. In the first part of the post, I describe a concept that I believe to be central to most debates around agency. I then briefly list some questions and observations that remain interesting to me.

A(Θ)-morphization

Architectures

Consider the following examples of "architectures":

Example (architectures)

1. "Agenty" according to me:
    1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made each move and utility function (or heuristic) used to evaluate positions.
    2. (semi-vague) "Classical AI-agent" with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).
    3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).
2. "Non-agenty" according to me:
    1. A hard-coded sequence of actions.
    2. Look-up table.
    3. Random generator (outputting x∼π on every input, for some probability distribution π).
3. Multi-agent systems:
    1. Ant colony.
    2. Company (consisting of individual employees, operating within an economy).
    3. Comprehensive AI services.

Working definition: Architecture A(Θ) is some model parametrizable by θ ∈ Θ that receives inputs, produces outputs, and possibly keeps an internal state. We denote specific instances of A(Θ) as A(θ).

Generalizing anthropomorphization

A standard item in the human mental toolbox is anthropomorphization: modeling various things as humans (specifically, ourselves) with "funny" goals or abilities. We can make the same mental move for architectures other than humans:

Working definition (A(Θ)-morphization): Let X be something that we want to predict or understand and let A(Θ) be an architecture. Then any model A(θ) is an A(Θ)-morphization of X.

Anthropomorphization works well for other humans and some animals (curiosity, fear, hunger). On the other hand, it doesn't work so well for rocks, lightning, and AGIs --- not that it would prevent us from using it anyway. We can measure the usefulness of A(Θ)-morphization by the degree to which it makes good predictions:

Working definition (prediction error): Suppose X exists in a world W and $\vec{E} = (E_1, \dots, E_n)$ is a sequence of variables (events about X) that we want to predict. Suppose that $\vec{e} = (e_1, \dots, e_n)$ is how $\vec{E}$ actually unfolds and $\vec{\pi} = (\pi_1, \dots, \pi_n)$ is the prediction obtained by A(Θ)-morphizing X as A(θ). The prediction error of A(θ) (w.r.t. X and $\vec{E}$ in W) is the expected Brier score of $\vec{\pi}$ with respect to $\vec{e}$.

Informally, we say that A(Θ)-morphizing X is accurate (resp. not accurate) if the corresponding prediction error is low (resp. high).[1]
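For concreteness, here is a minimal sketch of the Brier score for a single realized sequence, assuming each event is binary and each prediction is a probability that the event happens. The expectation over worlds in the working definition is not modelled, and the example numbers are invented.

```python
from typing import Sequence


def brier_score(predictions: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted probabilities and 0/1 outcomes."""
    if len(predictions) != len(outcomes):
        raise ValueError("need one prediction per outcome")
    return sum((p - e) ** 2 for p, e in zip(predictions, outcomes)) / len(outcomes)


outcomes = [1, 0, 1, 1]               # how the events actually unfolded
agenty_model = [0.9, 0.2, 0.8, 0.7]   # hypothetical predictions from one A(θ)
coin_flip = [0.5, 0.5, 0.5, 0.5]      # uninformative baseline

print(brier_score(agenty_model, outcomes), brier_score(coin_flip, outcomes))
```

Lower is better: a morphization counts as "accurate" in the sense above when its score is well below that of an uninformative baseline such as the coin flip.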

When do we call things agents?

Claim: I claim that in many situations where we ask "Is X an agent?", we should instead be asking "Does X exhibit agent-like behavior?". And even better, we should explicitly operationalize this latter question by "Is A(Θ)-morphizing X accurate?". (A related question is how difficult is it for us to "run" A(θ). Indeed, we anthropomorphize so many things precisely because it is cheap for us to do so.)

Relatedly, I believe we already implicitly do this operationalization: Suppose you talk to your favorite human H about agency. H will likely subconsciously associate agency with certain architectures, maybe such as those in Example 1.1-3. Moreover, H will ascribe varying degrees of agency to different architectures --- for me, 1.3 seems more agenty than 1.1. Similarly, there are some architectures that H will associate with "definitely not an agent". I conjecture that, according to H, some X exhibits agent-like behavior if it can be accurately predicted via A(Θ)-morphization for some agenty-to-H architecture A(Θ). Similarly, H would say that X exhibits non-agenty behavior if H can accurately predict it using some non-agenty-to-H architecture.

Critically, exhibiting agent-like and non-agenty behavior are not mutually exclusive, and I think this causes most of the confusion around agency. Indeed, we humans seem very agenty but, at the same time, determinism implies that there exists some hard-coded behavior that we enact. A rock rolling downhill can be viewed as merely obeying the non-agenty laws of physics, but what if it "wants to" get as low as possible?

If we ban the concept of agency, which interesting problems remain?

"Agency" often comes up when discussing various alignment-related topics, such as the following:

Optimizer?

How do we detect whether X performs (or is capable of performing) optimization? How do we detect this from X's architecture (or causal origin) rather than by looking at its behavior? (This seems central to the topic of mesa-optimization.)

Agent-like behavior vs agent-like architecture.

Consider the following conjecture: "Suppose some X exhibits agent-like behavior. Does it follow that X physically contains agent-like architecture, such as the one from Example 1.2?". This conjecture is false --- as an example, Q-learning is a "fairly agenty" architecture that leads to intelligent behavior. However, the resulting RL "agent" has a fixed policy and thus functions as a large look-up table. A better question would thus be whether there exists an agent-like architecture causally upstream of X. This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure[2] causally upstream of X?[3]
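
The Q-learning remark can be made concrete with a small sketch. The five-state chain environment, the hyperparameters and all names below are assumptions chosen for illustration. The point is only that once training stops, what remains is a plain dictionary from states to actions.

```python
import random

N_STATES = 5      # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (0, 1)  # 0 = step left, 1 = step right

def step(state, action):
    """Deterministic toy dynamics: move along the chain, bouncing off state 0."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with purely random exploration (off-policy)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES - 1) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            action = rng.choice(ACTIONS)
            nxt, reward = step(state, action)
            terminal = nxt == N_STATES - 1
            best_next = 0.0 if terminal else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            if terminal:
                break
            state = nxt
    return q

def freeze_policy(q):
    """After training, the 'agent' is just a look-up table: state -> action."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}

q_table = train()
policy = freeze_policy(q_table)
```

Here `policy` maps every non-terminal state to "step right". Nothing in that dictionary looks like search or planning, even though the behavior it encodes is goal-directed.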

Moral standing.

Suppose there is some X, which I model as having some goals. When choosing actions, should I give weight to those goals? (The answer to this question seems more related to consciousness than to A(Θ)-morphization. Note also that a particularly interesting version of the question can be obtained by replacing "I" by "AGI"...)

PC or NPC?

When making plans, should we model X as a part of the environment, or does it enter our game-theoretical considerations? Is X able to model us?

Creativity, unbounded goals, environment-generality.

In some sense, AlphaZero is an extremely capable game-playing agent. On the other hand, if we "gave it access to the internet", it wouldn't do anything with it. The same cannot be said for humans and unaligned AGIs, who would not only be able to orient in this new environment but would eagerly execute elaborate plans to increase their influence. How can we tell whether some X is more like the former or the latter?

To summarize, I believe that many arguments and confusions surrounding agency can disappear if we explicitly use A(Θ)-morphization. This should allow us to focus on the problems listed above. Most definitions I gave are either semi-formal or informal, but I believe they could be made fully formal in more specific cases.

Regarding feedback: Suggestions for a better name super-welcome! If you know of an application for which such formalization would be useful, please do let me know. Pointing out places where you expect a useful formalization to be impossible is also welcome.

1. Distinguishing between "small enough" and "too big" prediction errors seems non-trivial since different environments are naturally more difficult to predict than others. Formalizing this will likely require additional insights. ↩︎

2. An example of such "interesting physical structure" would be an implementation of an optimization architecture. ↩︎

3. Even if true, this conjecture will likely require some additional assumptions. Moreover, I expect "randomly-generated look-up tables that happen to stumble upon AGI by chance" to serve as a particularly relevant counterexample. ↩︎


Logical Optimizers

LessWrong.com News - August 23, 2019 - 02:54
Published on August 22, 2019 11:54 PM UTC

Epistemic status: I think the basic idea is more likely than not sound. Probably some mistakes. Looking for a sanity check.

Black box description

The following is a way to Foom an AI while leaving its utility function and decision theory as blank spaces. You could plug any uncomputable or computationally intractable behavior you might want in, and get an approximation out.

Suppose I was handed a hypercomputer and allowed to run code on it without worrying about mindcrime, then the hypercomputer is removed, allowing me to keep 1Gb of data from the computations. Then I am handed a magic human utility function, as code on a memory stick. This approach would allow me to use the situation to make a FAI.

Example algorithms

Suppose you have a finite set of logical formulas, each of which evaluates to some real number. A logical optimizer is an algorithm that takes those formulas and tries to maximize the value of the formula it outputs.

Another algorithm to pick a large rn is to run a logical inductor to estimate each rn and then pick the rn that maximizes those estimates.

Suppose the formulas were

1) "3+4"

2) "10 if P=NP else 1"

3) "0 if P=NP else 11"

4) "2*6-3"

When run with a small amount of compute, these algorithms would pick option (4). They are in a state of logical uncertainty about whether P=NP, and act accordingly.

Given vast amounts of compute, they would pick either 2 or 3.
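
The four-formula example can be sketched as follows. This is not a logical inductor: the single number `credence` stands in for whatever estimate of "P=NP" a bounded reasoner would have, and the value 0.5 is an assumption. All names are hypothetical.

```python
# Each candidate formula, written as a function of the unknown truth value of "P=NP".
CANDIDATES = {
    1: lambda p_eq_np: 3 + 4,
    2: lambda p_eq_np: 10 if p_eq_np else 1,
    3: lambda p_eq_np: 0 if p_eq_np else 11,
    4: lambda p_eq_np: 2 * 6 - 3,
}

def estimate(formula, credence):
    """Expected value of a formula given a credence that P=NP is true."""
    return credence * formula(True) + (1 - credence) * formula(False)

def pick(credence):
    """Return the index of the formula with the highest estimated value."""
    return max(CANDIDATES, key=lambda k: estimate(CANDIDATES[k], credence))

low_compute_choice = pick(0.5)   # genuinely uncertain about P=NP
resolved_true_choice = pick(1.0)  # "vast compute" case, if P=NP turned out true
resolved_false_choice = pick(0.0)  # "vast compute" case, if P=NP turned out false
```

With a credence of 0.5 the estimates are 7, 5.5, 5.5 and 9, so option 4 wins. Once the question is resolved the pick moves to option 2 or option 3. The low-compute pick does depend on the assumed credence.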

We might choose to implicitly represent a set of propositions in some manner instead of explicitly stating them. This would mean that a Logical Optimizer could optimize without needing to explicitly consider every possible expression. It could use an evolutionary algorithm. It could rule out swaths of propositions based on abstract reasoning.

Self Improvement

Now consider some formally specifiable prior over sets of propositions called P. P could be a straightforward simplicity based prior, or it could be tuned to focus on propositions of interest.

Suppose α1,...,αn are a finite set of programs that take in a set of propositions R={r1,...,rm} and output one of them. If a program fails to choose a number from 1 to m quickly enough then pick randomly, or 1 or something.

Let C(α,R)=ri be the choice made by the program α.

Let S(α) = ∑_{R∈P} C(α,R) × P(R) be the average value of the proposition chosen by the program α, weighted by the prior over sets of propositions P.

Now attempting to maximize S(α) over all short programs α is something that Logical Optimizers are capable of doing. Logical Optimizers are capable of producing other, perhaps more efficient Logical Optimizers in finite time.
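
A toy version of S(α) can be sketched under strong simplifying assumptions: the prior has finite support, every proposition is given directly as its numeric value, and the candidate programs are a few hand-written Python functions. None of these names come from the post.

```python
# A finite stand-in for the prior P: (set of propositions R, weight P(R)).
# Propositions are represented directly by the numbers they evaluate to.
PRIOR = [
    ((7, 9), 0.5),
    ((3, 1, 2), 0.25),
    ((4, 8), 0.25),
]

# Hypothetical candidate programs alpha: each returns the value of its chosen proposition.
def pick_first(props):
    return props[0]

def pick_last(props):
    return props[-1]

def pick_max(props):
    return max(props)  # only feasible here because the values are already computed

def score(program, prior):
    """S(alpha): prior-weighted average value of the proposition the program chooses."""
    return sum(weight * program(props) for props, weight in prior)

PROGRAMS = [pick_first, pick_last, pick_max]
scores = {program.__name__: score(program, PRIOR) for program in PROGRAMS}
best_program = max(PROGRAMS, key=lambda program: score(program, PRIOR))
```

In the post's setting the values are not available up front, so estimating S(α) is itself a problem for a Logical Optimizer. The sketch only shows what is being maximized.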

Odds and ends

Assuming that you can design a reasonably efficient Logical Optimizer to get things started, and that you can choose a sensible P, you could get a FOOM towards a Logical Optimizer of almost maximal efficiency.

Note that Logical Optimizers aren't AIs. They have no concept of empirical uncertainty about an external world. They do not perform Bayesian updates. They barely have a utility function. You can't put one in a prisoner's dilemma. They only resolve a certain kind of logical uncertainty.

On the other hand, a Logical Optimizer can easily be converted into an AI by defining a prior, a notion of Bayesian updating, an action space and a utility function.

Just maximize over action a∈A in expressions of the form "Starting with Prior P and updating it based on evidence E, if you take action a then your utility will be?"
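
The shape of that expression can be sketched with a tiny, fully computable example. The hypotheses, likelihoods and utilities below are invented for illustration. In the construction above, the expected utility would be an expression handed to a Logical Optimizer rather than a number computed directly.

```python
def posterior(prior, likelihood, evidence):
    """Bayesian update of a prior over hypotheses on one piece of evidence."""
    unnormalized = {h: prior[h] * likelihood[h][evidence] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

def best_action(prior, likelihood, evidence, utility, actions):
    """Maximize expected utility over actions after updating on the evidence."""
    post = posterior(prior, likelihood, evidence)

    def expected_utility(action):
        return sum(post[h] * utility[(h, action)] for h in post)

    return max(actions, key=expected_utility)

# Hypothetical toy problem.
PRIOR = {"rain": 0.3, "dry": 0.7}
LIKELIHOOD = {
    "rain": {"clouds": 0.9, "clear": 0.1},
    "dry": {"clouds": 0.2, "clear": 0.8},
}
ACTIONS = ("umbrella", "none")
UTILITY = {
    ("rain", "umbrella"): 1.0,
    ("rain", "none"): -1.0,
    ("dry", "umbrella"): 0.0,
    ("dry", "none"): 1.0,
}

choice_when_cloudy = best_action(PRIOR, LIKELIHOOD, "clouds", UTILITY, ACTIONS)
choice_when_clear = best_action(PRIOR, LIKELIHOOD, "clear", UTILITY, ACTIONS)
```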

I suspect that Logical Optimizers are safe, in the sense that you could get one to FOOM on real world hardware, without homomorphic encryption and without disaster.

Logical Optimizers are not clever-fool-proof. A clever fool could easily turn one into a paperclip maximizer. Do not put a FOOMed one online.

I suspect that one sensible route to FAI is to FOOM a logical optimizer, and then plug in some uncomputable or otherwise infeasible definition of friendliness.


Mechanistic Corrigibility

LessWrong.com News - August 23, 2019 - 02:20
Published on August 22, 2019 11:20 PM UTC

Acceptability

To be able to use something like relaxed adversarial training to verify a model, a necessary condition is having a good notion of acceptability. Paul Christiano describes the following two desiderata for any notion of acceptability:

1. "As long as the model always behaves acceptably, and achieves a high reward on average, we can be happy."
2. "Requiring a model to always behave acceptably wouldn't make a hard problem too much harder."

While these are good conditions that any notion of acceptability must satisfy, there may be many different possible acceptability predicates that meet both of these conditions—how do we distinguish between them? Two additional major conditions that I use for evaluating different acceptability criteria are as follows:

1. It must be not that hard for an amplified overseer to verify that a model is acceptable.
2. It must be not that hard to find such an acceptable model during training.

These conditions are different than Paul's second condition in that they are statements about the ease of training an acceptable model rather than the ease of choosing an acceptable action. If you want to be able to do some form of informed oversight to produce an acceptable model, however, these are some of the most important conditions to pay attention to. Thus, I generally think about choosing an acceptability condition as trying to answer the question: what is the easiest-to-train-and-verify property such that all models that satisfy that property[1] (and achieve high average reward) are safe?

Act-Based Corrigibility

One possible candidate property that Paul has proposed is act-based corrigibility, wherein an agent respects our short-term preferences, including those over how the agent itself should be modified. Not only is such an agent corrigible, Paul argues, but it will also want to make itself more corrigible, since having it be more corrigible is a component of our short-term preferences (Paul calls this the "broad basin" of corrigibility). While such act-based corrigibility would definitely be a nice property to have, it's unclear how exactly an amplified overseer could go about verifying such a property. In particular, if we want to verify such a property, we need a mechanistic understanding of act-based corrigibility rather than a behavioral one, since behavioral properties can only be verified by testing every input, whereas mechanistic properties can be verified just by inspecting the model.

One possible mechanistic understanding of corrigibility is corrigible alignment as described in "Risks from Learned Optimization," which is defined as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." While this gives us a starting point for understanding what a corrigible model might actually look like, there are still a bunch of missing pieces that have to be filled in. Furthermore, this notion of corrigibility looks more like instrumental corrigibility rather than act-based corrigibility, which as Paul notes is significantly less likely to be robust. Mechanistically, we can think of this lack of robustness as coming from the fact that "pointing" to the base objective is a pretty unstable operation: if you point even a little bit incorrectly, you'll end up with some sort of corrigible pseudo-alignment rather than corrigible robust alignment.

We can make this model more act-based, and at least somewhat mitigate this robustness problem, however, if we imagine pointing to only the human's short-term preferences. The hope for this sort of a setup is that, as long as the initial pointer is "good enough," there will be pressure for the mesa-optimizer to make its pointer better in the way in which its current understanding of short-term human preferences recommends, which is exactly Paul's "broad basin" of corrigibility argument. This requires it to be not that hard, however, to find a model with a notion of the human's short-term preferences as opposed to their long-term preferences that is also willing to correct that notion based on feedback.

In particular, it needs to be the case that it is not that hard to find an agent which will correct mistakes in its own prior over what the human's short-term preferences are. From a naive Bayesian perspective, this seems unlikely, as it seems strange for an agent to be incentivized to change its own prior. However, this is actually a very natural state for an agent to be in: if I trust your beliefs about X more than I trust my own, then that means I would endorse a modification of my prior to match yours. In the context of act-based corrigibility, we can think about this from a mechanistic perspective as having a pre-prior that encodes a belief that the human prior over human short-term preferences is to be preferred. Furthermore, pre-priors are generally epistemically valuable for agents to have, as a pre-prior can encourage an agent to correct its own cognitive biases. Thus, agents with pre-priors should be incentivized by most training processes, and thus shouldn't be too difficult to find.

Indifference Corrigibility

Instrumental and act-based corrigibility are not the only forms of corrigibility that have been discussed in the literature, however: there's also indifference corrigibility, wherein the agent is indifferent to modifications the human might make to it such as shutting it off. While this form of corrigibility doesn't in and of itself guarantee acceptability, if you had a way of verifying indifference corrigibility, there might be ways to turn that back into act-based corrigibility.

First, however, if we want to be able to verify indifference corrigibility, we are going to need a mechanistic understanding of it. Let's suppose we accept the argument that a competitive model will likely be a mesa-optimizer such that it will be running some sort of optimization process coupled with some sort of mesa-objective. Furthermore, let's suppose that its optimization process is computed over some sort of world model, which I think is a reasonable assumption for any competitive mesa-optimizer that wants to be able to take actions in the world. In such a case, we can think of indifference corrigibility as a form of stop gradient in the mesa-optimizer's world model which its optimization process doesn't optimize through,[2] which is a highly mechanistic property that we might expect an amplified overseer to be able to verify.

Again, however, we need agents with stop gradients to form a natural enough class such that they are not that hard to find. While it might naively seem as if agents with stop gradients would be very rare, in fact the exact opposite might be the case. Consider an agent such as the mesa-optimizer described previously which is running an optimization process over some world model, but which has preferences over the state of the world rather than the state of its world model. Such an agent will have an incentive to keep its world model aligned with the world, but only in one direction: it will want to modify its world model to match the world, but not modify the world to match its world model. This is a type of stop gradient, as it is a restriction on the agent to only minimize the difference between its world model and the world through one pathway but not through another. This example seems to suggest that stop gradients aren't that uncommon among agents such that it might be fairly achievable to find such an agent.
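
The one-directional pressure described above can be illustrated with a one-dimensional toy in which the world and the world model are each a single number and the mismatch is a squared difference. This is only an analogy with hypothetical names, not a claim about how a mesa-optimizer would implement it. Autodiff libraries expose the same idea as an operator, for example `jax.lax.stop_gradient`.

```python
def update(model, world, lr=0.25, stop_gradient_on_world=True):
    """One gradient step on loss = (model - world) ** 2.

    With the stop gradient, only the model is allowed to move toward the world.
    Without it, the optimizer also changes the world to match the model.
    """
    grad_model = 2 * (model - world)
    grad_world = -2 * (model - world)
    new_model = model - lr * grad_model
    new_world = world if stop_gradient_on_world else world - lr * grad_world
    return new_model, new_world

# With the stop gradient: the model converges to the world, the world is untouched.
model, world = 0.0, 10.0
for _ in range(20):
    model, world = update(model, world, stop_gradient_on_world=True)
with_stop = (model, world)

# Without it: both sides move and meet in the middle after a single step.
without_stop = update(0.0, 10.0, stop_gradient_on_world=False)
```

In the second case the mismatch is removed just as effectively, but partly by dragging the world toward the model's initial guess, which is the pathway the stop gradient is meant to block.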

Even if we can train and verify indifference corrigibility, however, one of the big remaining problems with such an approach is that an indifference corrigible mesa-optimizer might do search in such a way that it finds a policy which is not itself indifference corrigible. We can think about this problem as arising from the fact that indifference corrigibility doesn't tile, though I more often think of this problem as the problem of "forwarding the guarantee," wherein even if we know how to do safe search on the outside, we have to make sure that any search inside of our model is similarly safe as well. This is a particularly interesting type of inner alignment problem in that, rather than being about how to ensure that a mesa-optimizer's objective is aligned, it is about how to ensure that a mesa-optimizer's search is safe even given that its objective is aligned. However, it seems plausible that this sort of problem could be resolved by ensuring that the model has a meta-preference towards any policies it produces also respecting the same stop gradient. In particular, the overseer could verify that any search over policies done by the model enforces the constraint that every policy have such a stop gradient.

Even once we can verify that our model is indifference corrigible and that it will forward that guarantee to any other search it might perform, however, there is still the question of how we might be able to use such a mechanism to produce a safe system. One way in which indifference corrigibility could be used to produce safety is to enforce that our model behave myopically. We can think of a myopic agent as one that only considers how best to answer the single question that you give to it rather than considering any sort of long-term consequences, which can be implemented as a stop gradient preventing any optimization outside of that domain. While myopia on its own is still insufficient to guarantee acceptability, it seems like it would at least prevent deceptive alignment, as one of the conditions for deceptive alignment is that the mesa-optimizer must have something to gain from cooperating now and then defecting later, which is not true for a myopic agent. Thus, if directed at a task which we are confident is outer aligned, such as pure supervised amplification (training a model to approximate a human consulting that model), and combined with a scheme for preventing standard pseudo-alignment (such as adversarial training), myopia verification might be sufficient to resolve the rest of the inner alignment problem by preventing deceptive alignment.

Conclusion

If we want to be able to do relaxed adversarial training to produce safe AI systems, we are going to need a notion of acceptability which is not that hard to train and verify. Corrigibility seems to be one of the most promising candidates for such an acceptability condition, but for that to work we need a mechanistic understanding of exactly what sort of corrigibility we're shooting for and how it will ensure safety. Though I think that both of the paths considered here look promising, further progress in understanding exactly what these different forms of corrigibility look like from a mechanistic perspective is likely to be necessary.

1. Or at least all models that we can find that satisfy that property. ↩︎

2. Thanks to Scott Garrabrant for the stop gradient analogy. ↩︎


Response to Glen Weyl on Technocracy and the Rationalist Community

LessWrong.com News - August 23, 2019 - 02:14
Published on August 22, 2019 11:14 PM UTC

Economist Glen Weyl has written a long essay, "Why I Am Not A Technocrat", a major focus of which is his differences with the rationalist community.

I feel like I've read a decent number of outsider critiques of the rationalist community at this point, and Glen's critique is pretty good. It has the typical outsider critique weakness of not being fully familiar with the subject of its criticism, balanced by the strength of seeing the rationalist community from a perspective we're less familiar with.

As I was reading Glen's essay, I took some quick notes. Afterwards I turned them into this post.

Glen's Strongest Points

The fundamental problem with technocracy on which I will focus (as it is most easily understood within the technocratic worldview) is that formal systems of knowledge creation always have their limits and biases. They always leave out important consideration that are only discovered later and that often turn out to have a systematic relationship to the limited cultural and social experience of the groups developing them. They are thus subject to a wide range of failure modes that can be interpreted as reflecting on a mixture of corruption and incompetence of the technocratic elite. Only systems that leave a wide range of latitude for broader social input can avoid these failure modes.

So far, this sounds a lot like discussions I've seen previously of the book Seeing Like a State. But here's where Glen goes further:

Yet allowing such social input requires simplification, distillation, collaboration and a relative reduction in the social status and monetary rewards allocated to technocrats compared to the rest of the population, thereby running directly against the technocratic ideology. While technical knowledge, appropriately communicated and distilled, has potentially great benefits in opening social imagination, it can only achieve this potential if it understands itself as part of a broader democratic conversation.

...

Technical insights and designs are best able to avoid this problem when, whatever their analytic provenance, they can be conveyed in a simple and clear way to the public, allowing them to be critiqued, recombined, and deployed by a variety of members of the public outside the technical class.

Technical experts therefore have a critical role precisely if they can make their technical insights part of a social and democratic conversation that stretches well beyond the role for democratic participation imagined by technocrats. Ensuring this role cannot be separated from the work of design.

...

[When] insulation is severe, even a deeply “well-intentioned” technocratic class is likely to have severe failures along the corruption dimension. Such a class is likely to develop a strong culture of defending its distinctive class expertise and status and will be insulated from external concerns about the justification for this status.

...

Market designers have, over the last 30 years designed auctions, school choice mechanisms, medical matching procedures, and other social institutions using tools like auction and matching theory, adapted to a variety of specific institutional settings by economic consultants. While the principles they use have an appearance of objectivity and fairness, they play out against the contexts of societies wildly different than those described in the models. Matching theory uses principles of justice intended to apply to an entire society as a template for designing the operation of a particular matching mechanism within, for example, a given school district, thereby in practice primarily shutting down crucial debates about desegregation, busing, taxes, and other actions needed to achieve educational fairness with a semblance of formal truth. Auction theory, based on static models without product market competition and with absolute private property rights and assuming no coordination of behavior across bidders, is used to design auctions to govern the incredibly dynamic world of spectrum allocation, creating holdout problems, reducing competition, and creating huge payouts for those able to coordinate to game the auctions, often themselves market design experts friendly with the designers. The complexities that arise in the process serve to make such mass-scale privatizations, often primarily to the benefit of these connected players and at the expense of the taxpayer, appear the “objectively” correct and politically unimpeachable solution.

...

[Mechanism] designers must explicitly recognize and design for the fact that there is critical information necessary to make their designs succeed that a) lies in the minds of citizens outside the technocratic/designer class, b) will not be translated into the language of this class soon enough to avoid disastrous outcomes and c) does not fit into the thin formalism that designers allow for societal input.

...

In order to allow these failures to be corrected, it will be necessary for the designed system to be comprehensible by those outside the formal community, so they can incorporate the unformalized information through critique, reuse, recombination and broader conversation in informal language. Let us call this goal “legibility”.

...

There will in general be a trade-off between fidelity and legibility, just as both will have to be traded off against optimality. Systems that are true to the world will tend to become complicated and thus illegible.

...

Democratic designers thus must constantly attend, on equal footing, in teams or individually, to both the technical and communicative aspects of their work.

(Please let me know if you think I left out something critical)

A famous quote about open source software development states that "given enough eyeballs, all bugs are shallow". Nowadays, with critical security bugs in open-source software like Heartbleed, the spirit of this claim isn't taken for granted anymore. One Hacker News user writes: "[De facto eyeball shortage] becomes even more dire when you look at code no one wants to touch. Like TLS. There were the Heartbleed and goto fail bugs which existed for, IIRC, a few years before they were discovered. Not surprising, because TLS code is generally some of the worst code on the planet to stare at all day."

In other words, if you want critical feedback on your open source project, it's not enough just to put it out there and have lots of users. You also want to make the source code as accessible as possible--and this may mean compromising on other aspects of the design.

Academic or other in-group status games may encourage the use of big words. But we'd be better off rewarding simple explanations--not only are simple explanations more accessible, they also demonstrate deeper understanding. If we appreciated simplicity properly:

• We'd incentivize the creation of more simple explanations, promoting accessibility. And people wouldn't dismiss simple explanations for being "too obvious".

• Intellectuals would realize that even if a simple idea required lots of effort to discover, it need not require lots of effort to grasp. Verification is much quicker than search.

At the very least, I think, Glen wants our institutions to be like highly usable software: The internals require expertise to create and understand, but from a user's perspective, it "just works" and does what you expect.

Another point Glen makes well is that just because you are in the institution design business does not mean you're immune to incentives. The importance of self-skepticism regarding one's own incentives has been discussed before around here, but this recent post probably comes closest to Glen's position, that you really can't be trusted to monitor yourself.

Finally, Glen talks about the insularity of the rationalist community itself. I think this critique was true in the past. I haven't been interacting with the community in person as much over the past few years, so I hesitate to talk about the present, but I think he's plausibly right. I also think there may be an interesting counterargument that the rationalist community does a better job of integrating perspectives across multiple disciplines than your average academic department.

Possible Points of Disagreement

Although I think Glen would find some common ground with the recent post I linked, it's possible he would also find points of disagreement. In particular, habryka writes:

Highlighting accountability as a variable also highlights one of the biggest error modes of accountability and integrity – choosing too broad of an audience to hold yourself accountable to.

There is tradeoff between the size of the group that you are being held accountable by, and the complexity of the ethical principles you can act under. Too large of an audience, and you will be held accountable by the lowest common denominator of your values, which will rarely align well with what you actually think is moral (if you've done any kind of real reflection on moral principles).

Too small or too memetically close of an audience, and you risk not enough people paying attention to what you do, to actually help you notice inconsistencies in your stated beliefs and actions. And, the smaller the group that is holding you accountable is, the smaller your inner circle of trust, which reduces the amount of total resources that can be coordinated under your shared principles.

I think a major mistake that even many well-intentioned organizations make is to try to be held accountable by some vague conception of "the public". As they make public statements, someone in the public will misunderstand them, causing a spiral of less communication, resulting in more misunderstandings, resulting in even less communication, culminating into an organization that is completely opaque about any of its actions and intentions, with the only communication being filtered by a PR department that has little interest in the observers acquiring any beliefs that resemble reality.

I think a generally better setup is to choose a much smaller group of people that you trust to evaluate your actions very closely, and ideally do so in a way that is itself transparent to a broader audience. Common versions of this are auditors, as well as nonprofit boards that try to ensure the integrity of an organization.

Common wisdom is that it's impossible to please everyone. And specialization of labor is a foundational principle of modern society. If I took my role as a member of "the public" seriously and tried to provide meaningful and fair accountability to everyone, I wouldn't have time to do anything else.

It's interesting that Glen talks up the value of "legibility", because from what I understand, Seeing Like a State emphasizes its disadvantages. Seeing Like a State discusses legibility in the eyes of state administrators, but Glen doesn't explain why we shouldn't expect similar failure modes when "the general public" is substituted for "state administration".

(It's possible that Glen doesn't mean "legibility" in the same sense the book does, and a different term like "institutional legibility" would pinpoint what he's getting at. But there's still the question of whether we should expect optimizing for "institutional legibility" to be risk-free, after having observed that "societal legibility" has downsides. Glen seems to interpret recent political events as a result of excess technocracy, but they could also be seen as a result of excess populism--a leader's charisma could be more "legible" to the public than their competence.)

Anyway, I assume Glen is aware of these issues and working to solve them. I'm no expert, but from what I've heard of RadicalxChange, it seems like a really cool project. I'll offer my own uninformed outsider's perspective on institution design, in the hope that the conceptual raw material will prove useful to him or others.

My Take on Institution Design

I think there's another model which does a decent job of explaining the data Glen provides:

• Human systems are complicated.

• Greed finds & exploits flaws in institutions, causing them to decay over time.

• There are no silver bullets.

From the perspective of this model, Glen's emphasis on legibility could be seen as yet another purported silver bullet. However, I don't see a compelling reason for it to succeed where previous bullets failed. How, concretely, are random folks like me supposed to help address the corruption Glen identifies in the wireless spectrum allocation process? There seems to be a bit of a disconnect between Glen's description of the problem and his description of the solution. (Later Glen mentions the value of "humanities, continental philosophy, or humanistic social sciences"--I'd be interested to hear specific ideas from these areas, which aren't commonly known, that he thinks are quite important & relevant for institution design purposes.)

As a recent & related example, a decade or two ago many people were talking about how the Internet would revitalize & strengthen democracy; nowadays I'd guess most would agree that the Internet has failed as a silver bullet in this regard. (In fact, sometimes I get the impression this is the only thing we can all agree on!)

Anyway... What do I think we should do?

• All untested institution designs have flaws.

• The challenge of institution design is to identify & fix flaws as cheaply as possible, ideally before the design goes into production.

Under this framework, it's not enough merely to have the approval of a large number of people. If these people have similar perspectives, their inability to identify flaws offers limited evidence about the overall robustness of the design.

Legibility is useful for flaw discovery in this framework, just as cleaner code could've been useful for surfacing flaws like Heartbleed. But there are other strategies available too, like offering bug bounties for the best available critiques.

Experiments and field trials are a bit more expensive, but it's critical to actually try things out, and resolve disagreements among bug bounty participants. Then there's the "resume-building" stage of trialing one's institution on an increasingly large scale in the real world. I'd argue one should aim to have all the kinks worked out before "resume-building" starts, but of course, it's important to monitor the roll-out for problems which might emerge--and ideally, the institution should itself have means with which it can be patched "in production" (which should get tested during experimentation & field trials).

The process I just described could itself be seen as an untested institution which is probably flawed and needs critiques, experiments, and field testing. (For example, bug bounties don't do anything on their own for legibility--how can we incentivize the production of clear explanations of the institution design in need of critiques?) Taking everything meta, and designing an institutional framework for introducing new institutions, is the real silver bullet if you ask me :-)

Probable Points of Disagreement

Given Glen's belief in the difficulty of knowledge creation, the importance of local knowledge, and the limitations of outside perspectives, I hope he won't be upset to learn that I think he got a few things wrong about the rationalist community. (I also think he got some things wrong about the EA community, but I believe he's working to fix those issues, so I won't address them.)

Glen writes:

if we want to have AIs that can play a productive role in society, our goal should not be exclusively or even primarily to align them with the goals of their creators or the narrow rationalist community interested in the AIAP.

This doesn't appear to be a difference of opinion with the rationalist community. In Eliezer's CEV paper, he writes about the "coherent extrapolated volition of humankind", not the "coherent extrapolated volition of the rationalist community".

However, now that MIRI's research is non-disclosed by default, I wonder if it would be wise for them to publicly state that their research is for the benefit of all, in a charter like OpenAI has, rather than in a paper published in 2004.

Glen writes:

The institutions likely to achieve [constraints on an AI's power] are precisely the same sorts of institutions necessary to constrain extreme capitalist or state power.

An unaligned superintelligent AI which can build advanced nanotechnology has no need to follow human laws. On the flip side, an aligned superintelligent AI can design better institutions for aggregating our knowledge & preferences than any human could.

Glen writes:

A primary goal of AI design should be not just alignment, but legibility, to ensure that the humans interacting with the AI know its goals and failure modes, allowing critique, reuse, constraint etc. Such a focus, while largely alien to research on AI and on AIAP

This actually appears to me to be one of the primary goals of AI alignment research. See 2.3 in this paper or this parable. It's not alien to mainstream AI research either: see research on explainability and interpretability (pro tip: interpretability is better).

In any case, if the alignment problem is actually solved, legibility isn't needed, because we know exactly what the system's goals are: The goals we gave it.

Conclusion

As I said previously, I have not investigated RadicalxChange in very much depth, but my superficial impression is that it is really cool. I think it could be an extremely high leverage project in a world where AGI doesn't come for a while, or gets invented slowly over time. My personal focus is on scenarios where AGI is invented relatively rapidly relatively soon, but sometimes I wonder whether I should focus on the kind of work Glen does. In any case, I am rooting for him, and I hope his movement does an astonishing job of inventing and popularizing nearly flawless institution designs.

Why so much variance in human intelligence?

LessWrong.com News - August 23, 2019 - 01:36
Published on August 22, 2019 10:36 PM UTC

Epistemic status: Practising thinking aloud. There might be an important question here, but I might be making a simple error.

There is a lot of variance in general competence between species. Here is the standard Bostrom/Yudkowsky graph to display this notion.

There's a sense that while some mice are more genetically fit than others, they're broadly all just mice, bound within a relatively narrow range of competence. Chimps should not be worried about most mice, in the short or long term, but they also shouldn't worry especially so about peak mice - there's no incredibly strong or cunning mouse they ought to look out for.

However, my intuition is very different for humans. While I understand that humans are all broadly similar, that a single human cannot have a complex adaptation that is not universal [1], I also have many beliefs that humans differ massively in cognitive capacities in ways that can lead to major disparities in general competence. The difference between someone who does understand calculus and someone who does not, is the difference between someone who can build a rocket and someone who cannot. I've tried to teach people that kind of math, and sometimes succeeded, and sometimes failed to teach even basic fractions.

I can try to operationalise my hypothesis: it seems plausible to me that if the average human intelligence were such that they'd be considered to have an IQ of 75 in the world we live in, society could not have built rockets or done a lot of other engineering and science.

(Sidenote: I think the hope of iterated amplification is that this is false. That if I have enough humans with hard limits to how much thinking they can do, stacking lots of them can still produce all the intellectual progress we're going to need. My initial thought is that this doesn't make sense, because there are many intellectual feats like writing a book or coming up with special relativity that I generally expect individuals (situated within a conducive culture and institutions) to be much better at than groups of individuals (e.g. companies).

This is also my understanding of Eliezer's critique, that while it's possible to get humans with hard limits on cognition to make mathematical progress, it's by running an algorithm on them that they don't understand, not running an algorithm that they do understand, and only if they understand it do you get nice properties about them being aligned in the same way you might feel many humans are today.

It's likely I'm wrong about the motivation behind Iterated Amplification though.)

This hypothesis doesn't imply that someone who can do successful abstract reasoning is strictly more competent than a whole society of people who cannot. The Secret of our Success talks about how smart modern individuals stranded in forests fail to develop basic food preparation techniques that other, primitive cultures were able to build.

I'm saying that a culture with no people who can do calculus will in the long run score basically zero against the accomplishments of a culture with people who can.

One question is why we're in a culture so precariously balanced on this split between "can take off to the stars" and "mostly cannot". An idea I've heard before is the notion that if a culture is easily able to become technologically mature, it will come later than a culture that is only just able to become technologically mature, because evolution works over much longer time scales than culture + technological innovation. As such, if you observe yourself to be in a culture that is able to become technologically mature, you're probably "the stupidest such culture that could get there, because if it could be done at a stupider level then it would've happened there first."

As such, we're a species whereby if we try as hard as we can, if we take brains optimised for social coordination and make them do math, then we can just about reach technical maturity (i.e. build nanotech, AI, etc).

That may be true, but the question I want to ask about is what is it about humans, culture and brains that allows for such high variance within the species, that isn't true about mice and chimps? Something about this is still confusing to me. Like, if it is the case that some humans are able to do great feats of engineering like build rockets that land, and some aren't, what's the difference between these humans that causes such massive changes in outcome? Because, as above, it's not some big complex genetic adaptation some have and some don't. I think we're all running pretty similar genetic code.

Is there some simple amount of working memory that's required to do complex recursion? Like, 6 working memory slots makes things way harder than 7?

I can imagine that there are many hacks, and not a single thing. I'm reminded of the story of Richard Feynman learning to count time, where he'd practice being able to count a whole minute. He'd do it while doing the laundry, while cooking breakfast, and so on. He later met the mathematician John Tukey, who could do the same, but they had some fierce disagreements. Tukey said you couldn't do it while reading the newspaper, and Feynman said he could. Feynman said you couldn't do it while having a conversation, and Tukey said he could. They then both surprised each other by doing exactly what they said they could.

It turned out Feynman was hearing numbers being spoken, whereas Tukey was visualising the numbers ticking over. So Feynman could still read at the same time, and his friend could still listen and talk.

The idea here is that if you're unable to use one type of cognitive resource, you may make up for it with another. This is probably the same situation as when you make trade-offs between space and time in computational complexity.

So I can imagine different humans finding different hacky ways to build up the skill to do very abstract truth-tracking thinking. Perhaps you have a little less working memory than average, but you have a great capacity for visualisation, and primarily work in areas that lend themselves to geometric / spatial thinking. Or perhaps your culture can be very conducive to abstract thought in some way.

But even if this is right I'm interested in the details of what the key variables actually are.

[1] Note: humans can lack important pieces of machinery.

Logical Counterfactuals and Proposition graphs, Part 1

LessWrong.com News - August 23, 2019 - 01:06
Published on August 22, 2019 10:06 PM UTC

I will use Greek letters to represent an arbitrary symbol, upper case for single symbols, lower case for strings.

Respecifying Propositional logic

The goal of this first section is to reformulate first-order logic in a way that makes logical counterfactuals easier. Let's start with propositional logic.

We have a set of primitive propositions p,q,r,... as well as the symbols ⊤,⊥. We also have the symbols ∨,∧, which are technically functions from Bool²→Bool but will be written p∨q, not ∨(p,q). There is also ¬:Bool→Bool.

Consider the equivalence rules.

1. α≡¬¬α

2. α∧β≡β∧α

3. (α∧β)∧γ≡α∧(β∧γ)

4. ¬α∧¬β≡¬(α∨β)

5. α∧⊤≡α

6. ¬α∨α≡⊤

7. ⊥≡¬⊤

8. α∧(β∨γ)≡(α∧β)∨(α∧γ)

9. ⊥∧α≡⊥

10. α≡α∧α
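As a sanity check on the list above, one can confirm by brute force that each rule is a semantic equivalence, i.e. that both sides have the same truth table. The sketch below is only an illustration: the names RULES and rule_is_sound are made up here, and Python's True/False stand in for ⊤/⊥.

```python
from itertools import product

# Each rule is a (left side, right side) pair of Boolean functions of (a, b, c).
# Arguments a rule does not mention are simply ignored.
RULES = {
    1: (lambda a, b, c: a, lambda a, b, c: not (not a)),
    2: (lambda a, b, c: a and b, lambda a, b, c: b and a),
    3: (lambda a, b, c: (a and b) and c, lambda a, b, c: a and (b and c)),
    4: (lambda a, b, c: (not a) and (not b), lambda a, b, c: not (a or b)),
    5: (lambda a, b, c: a and True, lambda a, b, c: a),
    6: (lambda a, b, c: (not a) or a, lambda a, b, c: True),
    7: (lambda a, b, c: False, lambda a, b, c: not True),
    8: (lambda a, b, c: a and (b or c), lambda a, b, c: (a and b) or (a and c)),
    9: (lambda a, b, c: False and a, lambda a, b, c: False),
    10: (lambda a, b, c: a, lambda a, b, c: a and a),
}

def rule_is_sound(lhs, rhs):
    """True if both sides agree on all 8 truth assignments."""
    return all(lhs(a, b, c) == rhs(a, b, c)
               for a, b, c in product([False, True], repeat=3))

results = {number: rule_is_sound(lhs, rhs) for number, (lhs, rhs) in RULES.items()}
print(results)
```

This only shows that the rules never change the truth value of an expression. The theorem below is the converse claim, that the rules are also enough to reach every tautology.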

Theorem

Any tautology provable in propositional logic can be created by starting at ⊤ and repeatedly applying equivalence rules.

Proof

First consider α⟹β to be shorthand for ¬α∨β.

Lemma

We can convert ⊤ into any of the 3 axioms.

α⟹(β⟹α) is a shorthand for

¬α∨(¬β∨α)≡1

¬¬(¬α∨(¬β∨α))≡4

¬(¬¬α∧¬(¬β∨α))≡4

¬(¬¬α∧(¬¬β∧¬α))≡1

¬(¬¬α∧(β∧¬α))≡2

¬(¬¬α∧(¬α∧β))≡3

¬((¬¬α∧¬α)∧β)≡4

¬(¬(¬α∨α)∧β)≡6

¬(¬⊤∧β)≡7

¬(⊥∧β)≡9

¬⊥≡7

¬¬⊤≡1

⊤

Similarly

(α⟹(β⟹γ))⟹((α⟹β)⟹(α⟹γ))

(¬α⟹¬β)⟹(β⟹α)

(if these can't be proved, add that they ≡⊤ as axioms)

End Lemma

Whenever you have α∧(α⟹β), that is equivalent to

α∧(¬α∨β)≡8

(α∧¬α)∨(α∧β)≡1

(¬¬α∧¬α)∨(α∧β)≡4

¬(¬α∨α)∨(α∧β)≡6

¬⊤∨(α∧β)≡1

¬¬(¬⊤∨(α∧β))≡4

¬(¬¬⊤∧¬(α∧β))≡1

¬(⊤∧¬(α∧β))≡2

¬(¬(α∧β)∧⊤)≡5

¬¬(α∧β)≡1

α∧β
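The chain above can be checked mechanically: every intermediate expression should have the same truth table as α∧β. The following sketch transcribes each line of the derivation as a Python lambda (the list name steps and the helper table are illustrative, not from the post).

```python
from itertools import product

T = True  # stands in for ⊤

# One entry per line of the derivation, in order.
steps = [
    lambda a, b: a and ((not a) or b),                        # α∧(¬α∨β)
    lambda a, b: (a and not a) or (a and b),                  # after rule 8
    lambda a, b: ((not (not a)) and (not a)) or (a and b),    # after rule 1
    lambda a, b: (not ((not a) or a)) or (a and b),           # after rule 4
    lambda a, b: (not T) or (a and b),                        # after rule 6
    lambda a, b: not (not ((not T) or (a and b))),            # after rule 1
    lambda a, b: not ((not (not T)) and (not (a and b))),     # after rule 4
    lambda a, b: not (T and (not (a and b))),                 # after rule 1
    lambda a, b: not ((not (a and b)) and T),                 # after rule 2
    lambda a, b: not (not (a and b)),                         # after rule 5
    lambda a, b: a and b,                                     # after rule 1
]

def table(f):
    """Truth table of a two-variable Boolean function as a tuple."""
    return tuple(f(a, b) for a, b in product([False, True], repeat=2))

target = table(steps[-1])
chain_ok = all(table(step) == target for step in steps)
print(chain_ok)
```

A matching truth table confirms that no step changes the meaning of the expression. It does not by itself confirm that each step cites the right rule number.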

This means that you can create and apply axioms. For any tautology, look at the proof of it in standard propositional logic. Call the statements in this proof p_1, p_2, p_3, ...

Suppose we have already found a sequence of substitutions from ⊤ to p_1∧p_2∧...∧p_{i−1}.

Whenever p_i is a new axiom, use (5.) to get p_1∧p_2∧...∧p_{i−1}∧⊤, then convert ⊤ into the instance of the axiom you want (substituting arbitrary propositions for α and β in the proof schema above).

Using substitution rules (2.) and (3.) you can rearrange the terms representing lines in the proof and ignore their bracketing.

Whenever p_i is produced by modus ponens from the previous p_j and p_k = p_j⟹p_i, duplicate p_k with rule (10.), move one copy next to p_j, and use the previous procedure to turn p_j∧(p_j⟹p_i) into p_j∧p_i. Then move p_i to the end.

Once you reach the end of the proof, duplicate the result and unwind all the working back to ⊤, which can be removed by rule (5.)

Corollary

If {p,q,r}⊢s then p∧q∧r≡p∧q∧r∧s

Because p∧q∧r⟹s is a tautology and can be applied to get s.

Corollary

Any contradiction is reachable from ⊥

The negation of any contradiction k is a tautology.

⊥≡¬⊤≡¬¬k≡k

Intuitive overview perspective 1

An illustration of rule 4. ¬α∧¬β≡¬(α∨β) in action.

We can consider a proposition to be a tree with a root. The nodes are labeled with symbols. The axiomatic equivalences become local modifications to the tree structure, which are also capable of duplicating and merging identical subtrees by (10.). Arbitrary subtrees can be created or deleted by (5.).

We can merge nodes with identical subtrees into a single node. This produces a directed acyclic graph, as shown above. Under this interpretation, all we have to do is test node identity.
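Merging identical subtrees is the technique usually called hash-consing. Here is a minimal sketch, assuming expressions are written as nested tuples like ("and", left, right) with strings for primitive propositions; that encoding and the function names are choices made for this illustration.

```python
def tree_size(expr):
    """Node count when the expression is drawn as a tree (shared parts counted each time)."""
    if isinstance(expr, str):
        return 1
    return 1 + sum(tree_size(child) for child in expr[1:])

def intern_node(expr, table):
    """Return the DAG node id for expr, creating nodes in `table` only when needed."""
    if isinstance(expr, str):
        key = expr
    else:
        # A node is identified by its symbol plus the ids of its children.
        key = (expr[0],) + tuple(intern_node(child, table) for child in expr[1:])
    if key not in table:
        table[key] = len(table)
    return table[key]

shared = ("or", "p", "q")
expr = ("and", shared, ("not", shared))   # (p∨q)∧¬(p∨q)

table = {}
root = intern_node(expr, table)
print(tree_size(expr), len(table))   # 8 5
```

Once every subtree has been interned, two subexpressions are structurally identical exactly when they have the same node id, which is the node-identity test mentioned above.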

Intuitive overview perspective 2

Consider each possible expression to be a single node within an infinite graph.

Each axiomatic equivalence above describes an infinite set of edges. To get a single edge, substitute the generic α,β... with a particular expression. For example, if you take (2. α∧β≡β∧α) and substitute α:=p∨q and β:=¬q, you find an edge between the nodes (p∨q)∧¬q and ¬q∧(p∨q).

Here is a connected subsection of the graph. Note that, unlike the previous graph, this one is cyclic and edges are not directed.

All statements that are provably equivalent in propositional logic will be within the same connected component of the graph. All statements that can't be proved equivalent are in different components, with no path between them.
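To make the graph picture concrete, here is a small sketch that explores a finite corner of it by breadth-first search. It is deliberately partial: only rules 1, 2 and 4 are implemented, expressions are nested tuples, and a size cap (max_size) keeps the search finite. All of these are assumptions of the illustration, so a False result only means "no path found inside this corner".

```python
from collections import deque

def is_op(e, op):
    """True if expression e is a tuple whose head symbol is op."""
    return isinstance(e, tuple) and e[0] == op

def rewrites_at_root(e):
    """Expressions reachable by applying rule 1, 2 or 4 once at the root of e."""
    out = [("not", ("not", e))]                                   # rule 1: α → ¬¬α
    if is_op(e, "not") and is_op(e[1], "not"):
        out.append(e[1][1])                                       # rule 1: ¬¬α → α
    if is_op(e, "and"):
        out.append(("and", e[2], e[1]))                           # rule 2: α∧β → β∧α
        if is_op(e[1], "not") and is_op(e[2], "not"):
            out.append(("not", ("or", e[1][1], e[2][1])))         # rule 4: ¬α∧¬β → ¬(α∨β)
    if is_op(e, "not") and is_op(e[1], "or"):
        out.append(("and", ("not", e[1][1]), ("not", e[1][2])))   # rule 4: ¬(α∨β) → ¬α∧¬β
    return out

def neighbours(e):
    """All expressions one edge away: one rewrite at the root or inside any subterm."""
    result = rewrites_at_root(e)
    if isinstance(e, tuple):
        for i in range(1, len(e)):
            for sub in neighbours(e[i]):
                result.append(e[:i] + (sub,) + e[i + 1:])
    return result

def size(e):
    """Number of symbols in the expression tree."""
    return 1 if isinstance(e, str) else 1 + sum(size(c) for c in e[1:])

def connected(start, goal, max_size):
    """Breadth-first search, skipping expressions with more than max_size symbols."""
    seen, queue = {start}, deque([start])
    while queue:
        e = queue.popleft()
        if e == goal:
            return True
        for n in neighbours(e):
            if n not in seen and size(n) <= max_size:
                seen.add(n)
                queue.append(n)
    return False

print(connected(("or", "p", "q"), ("or", "q", "p"), max_size=6))   # True
print(connected(("or", "p", "q"), "p", max_size=6))                # False
```

The first query succeeds via the path p∨q, ¬¬(p∨q), ¬(¬p∧¬q), ¬(¬q∧¬p), ¬¬(q∨p), q∨p, so commutativity of ∨ never needs to be a rule of its own. The second fails, as it must, because p∨q and p are not equivalent and so lie in different components.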

Finding a mathematical proof becomes an exercise in navigating an infinite maze.

In the next Post

We will see how to extend the equivalence-based proof system to an arbitrary first-order theory. We will see what the connectedness does then. We might even get on to infinite-dimensional vector spaces and why any of this relates to logical counterfactuals.

Time Travel, AI and Transparent Newcomb

LessWrong.com News - August 23, 2019 - 01:04
Published on August 22, 2019 10:04 PM UTC

Epistemic status: has "time travel" in the title.

Let's suppose, for the duration of this post, that the local physics of our universe allows for time travel. The obvious question is: how are paradoxes prevented?

We may not have any idea how paradoxes are prevented, but presumably there must be some prevention mechanism. So, in a purely Bayesian sense, we can condition on paradoxes somehow not happening, and then ask what becomes more or less likely. In general, anything which would make a time machine more likely to be built should become less likely, and anything which would prevent a time machine being built should become more likely.

In other words: if we're trying to do something which would make time machines more likely to be built, this argument says that we should expect things to mysteriously go wrong.
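The conditioning step can be spelled out with a toy calculation. The numbers below are invented purely for illustration, and the sketch assumes that paradoxes can only occur if a time machine gets built.

```python
def p_machine_given_no_paradox(p_machine, p_paradox_given_machine):
    """Bayes' rule for P(machine built | no paradox), assuming no machine means no paradox."""
    no_paradox_and_machine = p_machine * (1 - p_paradox_given_machine)
    no_paradox_and_no_machine = 1 - p_machine
    return no_paradox_and_machine / (no_paradox_and_machine + no_paradox_and_no_machine)

prior = 0.5   # hypothetical prior that our project leads to a time machine
posterior = p_machine_given_no_paradox(prior, p_paradox_given_machine=0.9)
print(prior, posterior)   # the posterior is roughly 0.09, down from 0.5
```

The more likely a time machine is to induce a paradox, the harder conditioning on "no paradox" pushes down the probability that the machine is ever built.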

For instance, let's say we're trying to build some kind of powerful optimization process which might find time machines instrumentally useful for some reason. To the extent that such a process is likely to build time machines and induce paradoxes, we would expect things to mysteriously go wrong when trying to build the optimizer in the first place.

On the flip side: we could commit to designing our powerful optimization process so that it not only avoids building time machines, but also actively prevents time machines from being built. Then the mysterious force should work in our favor: we would expect things to mysteriously go well. We don't need time-travel-prevention to be the optimization process's sole objective here; it just needs to make time machines sufficiently less likely to get an overall drop in the probability of paradox.

Embedded Naive Bayes

LessWrong.com News - August 23, 2019 - 00:40
Published on August 22, 2019 9:40 PM UTC

Suppose we have a bunch of earthquake sensors spread over an area. They are not perfectly reliable (in terms of either false positives or false negatives), but some are more reliable than others. How can we aggregate the sensor data to detect earthquakes?

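It turns out that a simple procedure (assign each sensor an intuitive score, then add up the scores to get an overall "earthquake score", as a seismologist might do) is equivalent to a Naive Bayes model.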

Naive Bayes is a causal model in which there is some parameter $\theta$ in the environment which we want to know about - i.e. whether or not there’s an earthquake happening. We can’t observe $\theta$ directly, but we can measure it indirectly via some data $\{x_i\}$ - i.e. outputs from the earthquake sensors. The measurements may not be perfectly accurate, but their failures are at least independent - one sensor isn’t any more or less likely to be wrong when another sensor is wrong.

We can represent this picture with a causal diagram:

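From the diagram, we can read off the model’s equation: $P[\theta, \{x_i\}] = P[\theta] \prod_i P[x_i \mid \theta]$. We’re interested mainly in the posterior probability $P[\theta \mid \{x_i\}] = \frac{1}{Z} P[\theta] \prod_i P[x_i \mid \theta]$ or, in log odds form,

$L[\theta \mid \{x_i\}] = \ln\frac{P[\theta]}{P[\sim\theta]} + \sum_i \ln\frac{P[x_i \mid \theta]}{P[x_i \mid \sim\theta]}$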

Stare at that equation, and it’s not hard to see how the seismologist’s procedure turns into a Naive Bayes model: the seismologist’s intuitive scores for each sensor correspond to the “evidence” from the sensor $\ln\frac{P[x_i \mid \theta]}{P[x_i \mid \sim\theta]}$. The “earthquake score” then corresponds to the posterior log odds of an earthquake. The seismologist has unwittingly adopted a statistical model. Note that this is still true regardless of whether the scores used are well-calibrated or whether the assumptions of the model hold - the seismologist is implicitly using this model, and whether the model is correct is an entirely separate question.
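The log odds formula can be checked numerically against the direct posterior. The sketch below assumes binary sensors and uses made-up probabilities; the function names and the (P[fires | quake], P[fires | no quake]) encoding are choices made for this illustration.

```python
import math

def posterior_log_odds(prior, likelihoods, readings):
    """Prior log odds plus one additive 'evidence' term per sensor."""
    log_odds = math.log(prior / (1 - prior))
    for (p_fire_quake, p_fire_quiet), x in zip(likelihoods, readings):
        if x:
            log_odds += math.log(p_fire_quake / p_fire_quiet)
        else:
            log_odds += math.log((1 - p_fire_quake) / (1 - p_fire_quiet))
    return log_odds

def posterior_probability(prior, likelihoods, readings):
    """Direct computation: P[θ]∏P[x_i|θ] divided by the normaliser Z."""
    joint_quake, joint_quiet = prior, 1 - prior
    for (p_fire_quake, p_fire_quiet), x in zip(likelihoods, readings):
        joint_quake *= p_fire_quake if x else 1 - p_fire_quake
        joint_quiet *= p_fire_quiet if x else 1 - p_fire_quiet
    return joint_quake / (joint_quake + joint_quiet)

prior = 0.01                                          # hypothetical P[earthquake]
likelihoods = [(0.9, 0.1), (0.7, 0.2), (0.6, 0.5)]    # hypothetical sensor reliabilities
readings = [1, 1, 0]                                  # sensors 1 and 2 fire, sensor 3 does not

score = posterior_log_odds(prior, likelihoods, readings)
from_score = 1 / (1 + math.exp(-score))               # convert log odds back to a probability
direct = posterior_probability(prior, likelihoods, readings)
print(from_score, direct)
```

The two printed values agree up to floating-point error, which is the sense in which adding per-sensor scores is the same computation as Bayesian updating under the independence assumption.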

The Embedded Naive Bayes Equation

Let’s formalize this a bit.

We have some system which takes in data x, computes some stuff, and spits out some f(x). We want to know whether a Naive Bayes model is embedded in f(x). Conceptually, we imagine that f(x) parameterizes a probability distribution over some unobserved parameter θ - we’ll write P[θ;f(x)], where the “;” is read as “parameterized by”. For instance, we could imagine a normal distribution over θ, in which case f(x) might be the mean and variance (or any encoding thereof) computed from our input data. In our earthquake example, θ is a binary variable, so f(x) is just some encoding of the probability that θ=True.

Now let’s write the actual equation defining an embedded Naive Bayes model. We assert that P[θ;f(x)] is the same as P[θ|x] under the model, i.e.

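$P[\theta; f(x)] = P[\theta \mid x] = \frac{1}{Z} P[\theta] \prod_i P[x_i \mid \theta]$

We can transform to log odds form to get rid of the Z:

$L[\theta; f(x)] = \ln\frac{P[\theta]}{P[\sim\theta]} + \sum_i \ln\frac{P[x_i \mid \theta]}{P[x_i \mid \sim\theta]}$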

Let’s pause for a moment and go through that equation. We know the function $f(x)$, and we want the equation to hold for all values of $x$. $\theta$ is some hypothetical thing out in the environment - we don’t know what it corresponds to, we just hypothesize that the system is modelling something it can’t directly observe. As with $x$, we want the equation to hold for all values of $\theta$. The unknowns in the equation are the probability functions $P[\theta; f(x)]$, $P[\theta]$ and $P[x_i \mid \theta]$. To make it clear what’s going on, let’s remove the probability notation for a moment, and just use functions $G$ and $\{g_i\}$, with $\theta$ written as a subscript:

$\forall \theta, x: \; G_\theta(f(x)) = c_\theta + \sum_i g_{\theta i}(x_i)$

This is a functional equation: for each value of $\theta$, we want to find functions $G$, $\{g_i\}$, and a constant $c$ such that the equation holds for all possible $x$ values. The solutions $G$ and $\{g_i\}$ can then be decoded to give our probability functions $P[\theta; f(x)]$ and $P[x_i \mid \theta]$, while $c$ can be decoded to give our prior $P[\theta]$. Each possible $\theta$-value corresponds to a different set of solutions $G_\theta$, $\{g_{\theta i}\}$, $c_\theta$.

This particular functional equation is a variant of Pexider’s equation; you can read all about it in Aczel’s Functional Equations and Their Applications, chapter 3. For our purposes, the most important point is: depending on the function $f$, the equation may or may not have a solution. In other words, there is a meaningful sense in which some functions $f(x)$ do embed a Naive Bayes model, and others do not. Our seismologist’s procedure does embed a Naive Bayes model: let $G$ be the identity function, $c$ be zero, and $g_i(x_i) = s_i^{x_i}$, and we have a solution to the embedding equation with $f(x)$ given by our seismologist’s add-all-the-scores calculation (although this is not the only solution). On the other hand, a procedure computing $f(x) = x_1^{x_2^{x_3}}$ for real-valued inputs $x_1$, $x_2$, $x_3$ would not embed a Naive Bayes model: with this $f(x)$, the embedding equation would not have any solutions.
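One simple numerical probe, offered here as an illustration rather than as the test from Aczel's book: if a function is already a sum of one-variable terms (the $G$ = identity case), then its mixed second difference in any two inputs vanishes. The sketch below applies that probe to a hypothetical additive function and to the power tower; the function names and sample points are invented for the example.

```python
def mixed_difference(F, x, y):
    """Second-order difference in the first two inputs, with the third held at x[2].

    This is zero whenever F is a sum of one-variable terms.
    """
    return (F((x[0], x[1], x[2])) - F((y[0], x[1], x[2]))
            - F((x[0], y[1], x[2])) + F((y[0], y[1], x[2])))

def f_additive(x):
    """Hypothetical f that is already a sum of per-input terms."""
    return 2 * x[0] + x[1] ** 2 - x[2]

def f_tower(x):
    """The power tower x1^(x2^x3) from the text."""
    return x[0] ** (x[1] ** x[2])

print(mixed_difference(f_additive, (2, 2, 1), (3, 3, 1)))   # 0
print(mixed_difference(f_tower, (2, 2, 1), (3, 3, 1)))      # 14
```

A nonzero result only rules out the identity choice of $G$ for the power tower. The stronger claim in the text, that no choice of $G$ works, needs the functional-equation argument and is not established by this check.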

Intentional Bucket Errors

LessWrong.com News - August 22, 2019 - 23:02
Published on August 22, 2019 8:02 PM UTC

I want to illustrate a research technique that I use sometimes. (My actual motivation for writing this is to make it so that I don't feel as much like I need to defend myself when I use this technique.) I am calling it intentional bucket errors after a CFAR concept called bucket errors. The bucket errors concept is about noticing when multiple different concepts/questions are stored in your head as a single concept/question. Then, by noticing this, you can think about the different concepts/questions separately.

What are Intentional Bucket Errors

Bucket errors are normally thought of as a bad thing. It has "errors" right in the name. However, I want to argue that bucket errors can sometimes be useful, and you might want to consider having some bucket errors on purpose. You can do this by taking multiple different concepts and just pretending that they are all the same. This usually only works if the concepts started out sufficiently close together.

Like many techniques that work by acting as though you believe something false, you should use this technique responsibly. The goal is to pretend that the concepts are the same to help you gain traction on thinking about them, but then to also be able to go back to inhabiting the world where they are actually different.

Why use Intentional Bucket Errors

Why might you want to use intentional bucket errors? For one, maybe the concepts actually are the same, but they look different enough that you won't let yourself consider the possibility. I think this is especially likely to happen if the concepts are coming from very different fields or areas of your life. Sometimes it feels silly to draw strong connections between e.g. human rationality, AI alignment, evolution, economics, etc. but such connections can be useful.

Also I find this useful for gaining traction. There is something useful about constrained optimization for being able to start thinking about a problem. Sometimes it is harder to say something true and useful about X than it is to say something true and useful that simultaneously applies to X, Y, and Z. This is especially true when the concepts you are conflating are imagined solutions to problems.

For example, maybe I have an imagined solution to counterfactuals that has a hole in it that looks like understanding multi-level world models (MLWM). Then, maybe I also have an imagined solution to tiling that also has a hole in it that looks like understanding multi-level world models. I could view this as two separate problems. The desired properties of my MLWM theory for counterfactuals might be different from the desired properties for tiling. I have these two different holes I want to fill, and one strategy I have, which superficially looks like it makes the problem harder, is to try to find something that can fill both holes simultaneously. However, this can sometimes be easier because different use cases can help you triangulate the simple theory from which the specific solutions can be derived.

A lighter (maybe epistemically safer) version of intentional bucket errors is just to pay a bunch of attention to the connections between the concepts. This has its own advantages in that the relationships between the concepts might be interesting. However, I personally prefer to just throw them all in together, since this way I only have to work with one object, and it takes up fewer working memory slots while I'm thinking about it.

Examples

Here are some recent examples where I feel like I have used something like this, to varying degrees.

How the MtG Color Wheel Explains AI Safety is obviously the product of conflating many things together without worrying too much about how all the clusters are wrong.

In How does Gradient Descent Interact with Goodhart, the question at the top about rocket designs and human approval is really very different from the experiments that I suggested, but I feel like learning about one might help my intuitions about the other. This was actually generated at the same time as I was thinking about Epistemic Tenure, which for me was partially about the expectation that there is good research and a correlated proxy of justifiable research, and even though our group idea selection mechanism is going to optimize for justifiable research, it is better if the inner optimization loops in the humans do not directly follow those incentives. The connection is a bit of a stretch in hindsight, but believing the connection was instrumental in giving me traction in thinking about all the problems.

Embedded Agency has a bunch of this, just because I was trying to factor a big problem into a small number of subfields, but the Robust Delegation section can sort of be described as "Tiling and Corrigibility kind of look similar if you squint. What happens when I just pretend they are two instantiations of the same problem?"

Computational Model: Causal Diagrams with Symmetry

LessWrong.com News - August 22, 2019 - 20:54
Published on August 22, 2019 5:54 PM UTC

Consider the following program:

    def f(n):
        if n == 0:
            return 1
        return n * f(n - 1)

Let’s think about the process by which this function is evaluated. We want to sketch out a causal DAG showing all of the intermediate calculations and the connections between them (feel free to pause reading and try this yourself).

Here’s what the causal DAG looks like:

Each dotted box corresponds to one call to the function f. The recursive call in f becomes a symmetry in the causal diagram: the DAG consists of an infinite sequence of copies of the same subcircuit.
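Since the diagram itself is not reproduced here, the following sketch shows one way the expansion could be written down as an explicit edge list. The node names (n[d] for the argument of the d-th call, f[d] for its return value) and the choice of which intermediate values get their own node are assumptions of this illustration. For a concrete input only finitely many copies of the subcircuit are active, so the list is finite.

```python
def expand_dag(n):
    """Edges (cause, effect) of the computation of f(n), one repeated subcircuit per call."""
    edges = []
    depth = 0
    while True:
        arg, is_zero, out = f"n[{depth}]", f"n[{depth}]==0", f"f[{depth}]"
        edges.append((arg, is_zero))      # the argument feeds the base-case test
        edges.append((is_zero, out))      # the test selects which branch produces the output
        if n == 0:
            break                         # base case: no further copies are needed
        next_arg, next_out = f"n[{depth + 1}]", f"f[{depth + 1}]"
        edges.append((arg, next_arg))     # n - 1 becomes the argument of the next call
        edges.append((arg, out))          # n is one factor of the product
        edges.append((next_out, out))     # f(n - 1) is the other factor
        n -= 1
        depth += 1
    return edges

edges = expand_dag(3)
print(len(edges))   # 17
for cause, effect in edges[:5]:
    print(cause, "->", effect)
```

Every non-base call contributes the same five edges with the index shifted by one, which is the symmetry that lets the whole diagram be described by a single subcircuit.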

More generally, we can represent any Turing-computable function this way. Just take some pseudocode for the function, and expand out the full causal DAG of the calculation. In general, the diagram will either be finite or have symmetric components - the symmetry is what allows us to use a finite representation even though the graph itself is infinite.

Why would we want to do this?

For our purposes, the central idea of embedded agency is to take these black-box systems which we call “agents”, and break open the black boxes to see what’s going on inside.

Causal DAGs with symmetry are how we do this for Turing-computable functions in general. They show the actual cause-and-effect process which computes the result; conceptually they represent the computation rather than a black-box function.

In particular, a causal DAG + symmetry representation gives us all the natural machinery of causality - most notably counterfactuals. We can ask questions like “what would happen if I reached in and flipped a bit at this point in the computation?” or “what value would f(5) return if f(3) were 11?”. We can pose these questions in a well-defined, unambiguous way without worrying about logical counterfactuals, and without adding any additional machinery. This becomes particularly important for embedded optimization: if an “agent” (e.g. an organism) wants to plan ahead to achieve an objective (e.g. find food), it needs to ask counterfactual questions like “how much food would I find if I kept going straight?”.
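The second question can be answered by literally intervening on the computation. Below is a minimal sketch in which an optional dictionary (called interventions here, a name invented for the example) forces the return value of chosen calls, in the spirit of a do-operation on the corresponding node.

```python
def f(n, interventions=None):
    """Factorial, except that any call listed in `interventions` returns the forced value."""
    interventions = interventions or {}
    if n in interventions:
        return interventions[n]       # cut the node off from its usual causes
    if n == 0:
        return 1
    return n * f(n - 1, interventions)

print(f(5))             # 120
print(f(5, {3: 11}))    # 220
print(f(2, {3: 11}))    # 2
```

Forcing f(3) to 11 changes f(5) to 5 * 4 * 11 = 220 because f(3) is upstream of f(5), while f(2) is untouched because f(3) is not among its causes. No logical contradiction arises, since the intervention changes the computation being run rather than asserting a false fact about factorials.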

The other main reason we would want to represent functions as causal DAGs with symmetry is because our universe appears to be one giant causal DAG with symmetry.

Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG. We can write our programs in any language we please, but eventually they will be compiled down to machine code and run by physical transistors made of atoms which are themselves governed by a causal DAG. In most cases, we can represent the causal computational process at a more abstract level - e.g. in our example program, even though we didn’t talk about registers or transistors or electric fields, the causal diagram we sketched out would still accurately represent the computation performed even at the lower levels.

This raises the issue of abstraction - the core problem of embedded agency. My own main use-case for the causal diagram + symmetry model of computation is formulating models of abstraction: how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory? Can that work when the “map” is a subDAG of the territory DAG? It feels like causal diagrams + symmetry are the minimal computational model needed to get agency-relevant answers to this sort of question.

Learning

The traditional ultimate learning algorithm is Solomonoff Induction: take some black-box system which spews out data, and look for short programs which reproduce that data. But the phrase “black-box” suggests that perhaps we could do better by looking inside that box.

To make this a little bit more concrete: imagine I have some Python program running on a server which responds to HTTP requests. Solomonoff Induction would look at the data returned by requests to the program, and learn to predict the program’s behavior. But that sort of black-box interaction is not the only option. The program is running on a physical server somewhere - so, in principle, we could go grab a screwdriver and a tiny oscilloscope and directly observe the computation performed by the physical machine. Even without measuring every voltage on every wire, we may at least get enough data to narrow down the space of candidate programs in a way which Solomonoff Induction could not do. Ideally, we’d gain enough information to avoid needing to search over all possible programs.

Compared to Solomonoff Induction, this process looks a lot more like how scientists actually study the real world in practice: there’s lots of taking stuff apart and poking at it to see what makes it tick.

In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior. (In principle, we could shoehorn this whole thing into traditional Solomonoff Induction by treating information about the internal DAG structure as normal old data, but that doesn’t give us a good way to extract the learned DAG structure.)

We already have algorithms for learning causal structure in general. Pearl’s Causality sketches out some such algorithms in chapter 2, although they’re only practical for either very small systems or very large amounts of data. Bayesian structure learning can handle larger systems with less data, though sometimes at the cost of a very large amount of compute - i.e. estimating high-dimensional integrals.

However, in general, these approaches don’t directly account for symmetry of the learned DAGs. Ideally, we would use a prior which weights causal DAGs according to the size of their representation - i.e. infinite DAGs would still have nonzero prior probability if they have some symmetry allowing for finite representation, and in general DAGs with multiple copies of the same sub-DAG would have higher probability. This isn’t quite the same as weighting by minimum description length in the Solomonoff sense, since we care specifically about symmetries which correspond to function calls - i.e. isomorphic subDAGs. We don’t care about graphs which can be generated by a short program but don’t have these sorts of symmetries. So that leaves the question: if our prior probability for a causal DAG is given by a notion of minimum description length which only allows compression by specifying re-used subcircuits, what properties will the resulting learning algorithm possess? Is it computable? What kinds of data are needed to make it tractable?
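To make the proposed prior a little more concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: computations are written as nested tuples (a hypothetical encoding, not something from the post), "symmetry" is simplified to exact repetition of a sub-DAG with the same inputs, and the prior is an unnormalized 2^(-size). It does not capture function calls with different arguments, which the full question would need.

```python
# A node is either a leaf name (str) or a tuple: (operation, child, child, ...).
# This encoding and the 2**(-size) weighting are illustrative assumptions.

def expanded_size(node):
    """Node count when every repeated sub-DAG is written out in full."""
    if isinstance(node, str):
        return 1
    return 1 + sum(expanded_size(child) for child in node[1:])


def compressed_size(node):
    """Node count when each distinct sub-DAG is stored only once."""
    seen = set()

    def visit(n):
        if n in seen:
            return  # already stored; re-use costs nothing extra in this toy model
        seen.add(n)
        if not isinstance(n, str):
            for child in n[1:]:
                visit(child)

    visit(node)
    return len(seen)


def log2_prior(node):
    """Unnormalized log2 prior weight: shorter compressed form, higher weight."""
    return -compressed_size(node)


# The same sub-DAG f = x*x + 1 is used twice.
f = ("add", ("mul", "x", "x"), "1")
with_reuse = ("mul", f, f)

# Same expanded size, but no repeated sub-DAGs.
without_reuse = (
    "mul",
    ("add", ("mul", "a", "b"), "c"),
    ("add", ("mul", "d", "e"), "g"),
)

for name, dag in [("with_reuse", with_reuse), ("without_reuse", without_reuse)]:
    print(name, expanded_size(dag), compressed_size(dag), log2_prior(dag))
```

Both graphs unroll to the same number of nodes, but the one that re-uses a sub-DAG has a much shorter compressed form and so gets a higher prior weight under this toy scheme.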

Simulation Argument: Why aren't ancestor simulations outnumbered by transhumans?

LessWrong.com News - August 22, 2019 - 20:29
Published on August 22, 2019 9:07 AM UTC

This is a point of confusion I still have with the simulation argument: Upon learning that we are in an ancestor simulation, should we be any less surprised? It would be odd for a future civilization to dedicate a large fraction of their computational resources towards simulating early 21st century humans instead of happy transhumans living in base reality; shouldn't we therefore be equally perplexed that we aren't transhumans?

I guess the question boils down to the choice of reference classes, so what makes the reference class "early 21st century humans" so special? Why not widen the reference class to include all conscious minds, or narrow it down to the exact quantum state of a brain?

Furthermore, if you're convinced by the simulation argument, why not use the same line of argument to conclude that you're a Boltzmann brain instead?

[AN #62] Are adversarial examples caused by real but imperceptible features?

LessWrong.com News - August 22, 2019 - 20:10
Published on August 22, 2019 5:10 PM UTC

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Call for contributors to the Alignment Newsletter (Rohin Shah): I'm looking for content creators and a publisher for this newsletter! Apply by September 6.

Adversarial Examples Are Not Bugs, They Are Features (Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom et al) (summarized by Rohin and Cody): Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together.

Consider two possible explanations of adversarial examples. First, they could arise because the model "hallucinates" a signal that is not useful for classification, and becomes very sensitive to this feature. We could call these "bugs", since they don't generalize well. Second, they could be caused by features that do generalize to the test set, but can be modified by an adversarial perturbation. We could call these "non-robust features" (as opposed to "robust features", which can't be changed by an adversarial perturbation). The authors argue, based on two experiments, that at least some adversarial perturbations fall into the second category of informative but sensitive features.

If the "hallucination" explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, or the size of the dataset, but not by the type of data. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is already robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.

If the "non-robust features" explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and still generalize to a normal-looking test set. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y' to get image x', and then add (x', y') to their dataset (even though to a human x' looks like class y). They have two versions of this: in RandLabels, the target class y' is chosen randomly, whereas in DetLabels, y' is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance on the original test set, showing that the "non-robust features" do generalize.
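The dataset construction described above can be sketched roughly as follows. This is a toy illustration, not the paper's code: the attack is replaced by a single signed step against a hypothetical linear model (`weights`), the names `build_wrong_labels`, `perturb_towards` and `eps` are invented here, and wrapping y + 1 around with a modulo is an assumption about how the last class is handled.

```python
import random


def target_label(y, num_classes, mode, rng):
    """Pick the label y' that the perturbed image will be assigned."""
    if mode == "det":   # DetLabels: y' = y + 1 (wrap-around is an assumption)
        return (y + 1) % num_classes
    if mode == "rand":  # RandLabels: y' chosen uniformly at random
        return rng.randrange(num_classes)
    raise ValueError(f"unknown mode: {mode}")


def perturb_towards(x, y, y_target, weights, eps):
    """Toy stand-in for an adversarial attack on a linear model.

    Takes one signed step of size eps per coordinate in the direction that
    raises the score of y_target relative to the score of y.
    """
    direction = [wt - wo for wt, wo in zip(weights[y_target], weights[y])]
    return [xi + eps * ((d > 0) - (d < 0)) for xi, d in zip(x, direction)]


def build_wrong_labels(data, weights, num_classes, mode, eps=0.1, seed=0):
    """Return [(x', y')]: inputs nudged towards y' and relabeled as y'."""
    rng = random.Random(seed)
    relabeled = []
    for x, y in data:
        y_target = target_label(y, num_classes, mode, rng)
        x_adv = perturb_towards(x, y, y_target, weights, eps)
        relabeled.append((x_adv, y_target))
    return relabeled


# Hypothetical three-class linear model over two features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
data = [([0.0, 0.0], 0), ([1.0, 1.0], 2)]

det_labels = build_wrong_labels(data, weights, num_classes=3, mode="det")
rand_labels = build_wrong_labels(data, weights, num_classes=3, mode="rand")
print([y for _, y in det_labels])
```

The point of the sketch is only the data flow: every stored pair keeps an input that stays within eps of the original (so a human would still assign the original label), paired with the target label.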

Rohin's opinion: I buy this hypothesis. It's a plausible explanation for brittleness towards adversarial noise ("because non-robust features are useful to reduce loss"), and why adversarial examples transfer across models ("because different models can learn the same non-robust features"). In fact, the paper shows that architectures that did worse in WrongLabels (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I'll leave the rest of my opinion to the opinions on the responses.

Read more: Paper and Author response

Response: Learning from Incorrectly Labeled Data (Eric Wallace): This response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M' on D you get the same properties as M. Thus, we can interpret these experiments as showing that model distillation can work even with data points that we would naively think of as "incorrectly labeled". This is a more general phenomenon: we can take an MNIST model, select only the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that -- and get nontrivial performance on the original test set, even though the new model has never seen a "correctly labeled" example.
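The selection step in the MNIST example can be sketched as below. This is a minimal illustration only: `teacher_predict` is a hypothetical stand-in for the trained model's top-prediction function, and the toy inputs are integers rather than images.

```python
def mislabeled_distillation_set(inputs, true_labels, teacher_predict):
    """Keep only examples the teacher gets wrong, labeled with its wrong guess."""
    selected = []
    for x, y in zip(inputs, true_labels):
        predicted = teacher_predict(x)
        if predicted != y:
            selected.append((x, predicted))
    return selected


# Toy teacher (hypothetical): predicts the parity of an integer "image".
def toy_teacher(x):
    return x % 2


inputs = [0, 1, 2, 3]
true_labels = [0, 0, 1, 1]
distill_set = mislabeled_distillation_set(inputs, true_labels, toy_teacher)
print(distill_set)
```

A new model would then be trained on `distill_set` alone, so it never sees a pair whose label matches the ground truth.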

Rohin's opinion: I definitely agree that these results can be thought of as a form of model distillation. I don't think this detracts from the main point of the paper: the reason model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.

Response: Robust Feature Leakage (Gabriel Goh): This response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the original test set. This shouldn't be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you can get some accuracy with RandLabels, but you don't get much accuracy with DetLabels.

The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it's less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with accuracy. However, with random labels followed by an adversarial perturbation towards the label, there can be some correlation, because the adversarial perturbation can add "a small amount" of the robust feature. However, in DetLabels, the labels are wrong, and so the robust features are negatively correlated with the true label, and while this can be reduced by an adversarial perturbation, it can't be reversed (otherwise it wouldn't be robust).

Rohin's opinion: The original authors' explanation of these results is quite compelling; it seems correct to me.

Response: Adversarial Examples are Just Bugs, Too (Preetum Nakkiran): The main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly don't transfer between models, and then run WrongLabels with such adversarial perturbations, then the resulting model doesn't perform well on the original test set (and so it must not have learned non-robust features).

It also constructs a data distribution where every useful feature of the optimal classifier is guaranteed to be robust, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.

In their response, the authors clarify that they didn't intend to claim that adversarial examples could not arise due to "bugs", just that "bugs" were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.

Rohin's opinion: Amusingly, I think I'm more bullish on the original paper's claims than the authors themselves. It's certainly true that adversarial examples can arise from "bugs": if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.

However, I claim that with large and varied datasets with neural nets, we are typically not in the regime where models overfit to the data, and the presence of "bugs" in the model will decrease. (You certainly can get a neural net to be "buggy", e.g. by randomly labeling the data, but if you're using real data with a natural task then I don't expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.

It's also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.

Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’ (Justin Gilmer et al): This response argues that the results in the original paper are simply a consequence of a generally accepted principle: "models lack robustness to distribution shift because they latch onto superficial correlations in the data". This isn't just about L_p norm ball adversarial perturbations: for example, one recent paper shows that if the model is only given access to high frequency features of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we do adversarial training to become robust to L_p perturbations, then the model pays attention to different non-robust features and becomes more vulnerable to e.g. low-frequency fog corruption. The authors call for adversarial examples researchers to move beyond L_p perturbations and think about the many different ways models can be fragile, and to make them more robust to distributional shift.

Rohin's opinion: I strongly agree with the worldview behind this response, and especially the principle they identified. I didn't know this was a generally accepted principle, though of course I am not an expert on distributional robustness.

One thing to note is what is meant by "superficial correlation" here. It means a correlation that really does exist in the dataset, that really does generalize to the test set, but that doesn't generalize out of distribution. A better term might be "fragile correlation". All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features do generalize within-distribution. This response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.

Response: Two Examples of Useful, Non-Robust Features (Gabriel Goh): This response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).

Response: Adversarially Robust Neural Style Transfer (Reiichiro Nakano): The original paper showed that adversarial examples don't transfer well to VGG, and that VGG doesn't tend to learn the same non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps since VGG doesn't capture non-robust features as well, the results of style transfer look better to humans? This response and the authors' response investigate this hypothesis in more detail and find that it seems broadly supported, but there are still finicky details to be worked out.

Rohin's opinion: This is an intriguing empirical fact. However, I don't really buy the theoretical argument that style transfer works because it doesn't use non-robust features, since I would typically expect that a model that doesn't use L_p-fragile features would instead use features that are fragile or non-robust in some other way.

Technical AI alignment

Problems

Problems in AI Alignment that philosophers could potentially contribute to (Wei Dai): Exactly what it says. The post is short enough that I'm not going to summarize it -- it would be as long as the original.

Iterated amplification

Delegating open-ended cognitive work (Andreas Stuhlmüller): This is the latest explanation of the approach Ought is experimenting with: Factored Evaluation (in contrast to Factored Cognition (AN #36)). With Factored Cognition, the idea was to recursively decompose a high-level task until you reach subtasks that can be directly solved. Factored Evaluation still does recursive decomposition, but now it is aimed at evaluating the work of experts, along the same lines as recursive reward modeling (AN #34).

This shift means that Ought is attacking a very natural problem: how to effectively delegate work to experts while avoiding principal-agent problems. In particular, we want to design incentives such that untrusted experts under the incentives will be as helpful as experts intrinsically motivated to help. The experts could be human experts or advanced ML systems; ideally our incentive design would work for both.

Currently, Ought is running experiments with reading comprehension on Wikipedia articles. The experts get access to the article while the judge does not, but the judge can check whether particular quotes come from the article. They would like to move to tasks that have a greater gap between the experts and the judge (e.g. allowing the experts to use Google), and to tasks that are more subjective (e.g. whether the judge should get Lasik surgery).

Rohin's opinion: The switch from Factored Cognition to Factored Evaluation is interesting. While it does make it more relevant outside the context of AI alignment (since principal-agent problems abound outside of AI), it still seems like the major impact of Ought is on AI alignment, and I'm not sure what the difference is there. In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal. The switch would be useful if you expect the reinforcement learning to work significantly better than imitation learning.

However, with Factored Evaluation, the agent that you train iteratively is one that must be good at evaluating tasks, and then you'd need another agent that actually performs the task (or you could train the same agent to do both). In contrast, with Factored Cognition you only need an agent that is performing the task. If the decompositions needed to perform the task are different from the decompositions needed to evaluate the task, then Factored Cognition would presumably have an advantage.

Miscellaneous (Alignment)

Clarifying some key hypotheses in AI alignment (Ben Cottier et al): This post (that I contributed to) introduces a diagram that maps out important and controversial hypotheses for AI alignment. The goal is to help researchers identify and more productively discuss their disagreements.

Near-term concerns

Privacy and security

Evaluating and Testing Unintended Memorization in Neural Networks (Nicholas Carlini et al)

Machine ethics

Towards Empathic Deep Q-Learning (Bart Bussmann et al): This paper introduces the empathic DQN, which is inspired by the golden rule: "Do unto others as you would have them do unto you". Given a specified reward, the empathic DQN optimizes for a weighted combination of the specified reward, and the reward that other agents in the environment would get if they were a copy of the agent. They show that this results in resource sharing (when there are diminishing returns to resources) and avoiding conflict in two toy gridworlds.
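As a rough illustration of the "weighted combination" idea, the reward signal might look something like the sketch below. The exact formula in the paper may differ: the convex weighting, the averaging over other agents, and the names `empathic_reward` and `selfishness` are all assumptions made here for illustration.

```python
def empathic_reward(own_reward, imagined_other_rewards, selfishness=0.5):
    """Blend the agent's own reward with the reward it imagines others receive.

    imagined_other_rewards holds, for each other agent, the reward this agent
    would get if it were in that agent's position (the "copy of the agent"
    assumption). selfishness=1.0 recovers the ordinary specified reward.
    """
    if not imagined_other_rewards:
        return own_reward
    mean_other = sum(imagined_other_rewards) / len(imagined_other_rewards)
    return selfishness * own_reward + (1.0 - selfishness) * mean_other


# Taking a resource pays the agent 1.0 but leaves another agent with -2.0.
print(empathic_reward(1.0, [-2.0], selfishness=0.5))
```

With equal weighting, an action that helps the agent a little while hurting another agent a lot ends up with a negative combined reward, which is the kind of pressure towards sharing and conflict avoidance the summary describes.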

Rohin's opinion: This seems similar in spirit to impact regularization methods: the hope is that this is a simple rule that prevents catastrophic outcomes without having to solve all of human values.

AI strategy and policy

AI Algorithms Need FDA-Style Drug Trials (Olaf J. Groth et al)

Other progress in AI

Critiques (AI)

Evidence against current methods leading to human level artificial intelligence (Asya Bergal and Robert Long): This post briefly lists arguments that current AI techniques will not lead to high-level machine intelligence (HLMI), without taking a stance on how strong these arguments are.

News

Ought: why it matters and ways to help (Paul Christiano): This post discusses the work that Ought is doing, and makes a case that it is important for AI alignment (see the summary for Delegating open-ended cognitive work above). Readers can help Ought by applying for their web developer role, by participating in their experiments, and by donating.

Project Proposal: Considerations for trading off capabilities and safety impacts of AI research (David Krueger): This post calls for a thorough and systematic evaluation of whether AI safety researchers should worry about the impact of their work on capabilities.