I've recently started working through AI safety posts written on LessWrong 1-3 years ago; in doing so I occasionally have questions/comments about the material. Is it considered good practice/in line with LW norms to write these as comments on the original, old posts? On one hand I can see why "necro-ing" old posts would be frowned upon, but I'm not sure where else to bring them up. You can look at my comment history for examples of what I mean (before I realized it might not be a good idea).
Previously I asked about Solomonoff induction, but essentially I asked the wrong question. Richard_Kennaway pointed me in the direction of an answer to the question which I should have asked, but after investigating I still had questions.
If one has 2 possible models to fit to a data set, by how much should one penalise the model which has an additional free parameter?
A couple of options which I came across were:
AIC, which has a flat factor of e penalty for each additional parameter.
BIC, which has a factor of √n penalty for each additional parameter.
where n is the number of data points.
On the one hand, having a penalty which increases with n makes sense - a useful additional parameter should be able to provide more evidence the more data you have. On the other hand, having a penalty which increases with n means your prior will be different depending on the number of data points, which seems wrong.
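To make the two penalties concrete: in likelihood terms, AIC charges each extra parameter a flat factor of e, while BIC charges a factor of √n. Here is a minimal sketch; the linear-vs-quadratic setup and all numbers are made-up illustrations, not from any particular source:

```python
import numpy as np

def gaussian_log_likelihood(y, y_pred):
    # Maximized Gaussian log-likelihood, plugging in the MLE of the noise variance.
    n = len(y)
    sigma2 = np.mean((y - y_pred) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(log_lik, k):
    # Each extra parameter adds 2 to AIC, i.e. a flat likelihood factor of e.
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    # Each extra parameter adds ln(n) to BIC, i.e. a likelihood factor of sqrt(n).
    return k * np.log(n) - 2 * log_lik

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.1, size=x.size)  # the true model is linear

for degree in (1, 2):  # linear fit vs. a fit with one extra free parameter
    coeffs = np.polyfit(x, y, degree)
    ll = gaussian_log_likelihood(y, np.polyval(coeffs, x))
    k = degree + 2  # polynomial coefficients plus the noise variance
    print(degree, aic(ll, k), bic(ll, k, x.size))
```

The lower score wins under each criterion. Note that the BIC penalty gap between the two models grows with n while the AIC gap stays fixed, which is exactly the tension described above.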
So, count me confused. Maybe there are other options which are more helpful. I don't know if the answer is too complex for a blog post but, if so, any suggestions of good text books on the subject would be great.
Fast forward a few years, and imagine that we have a complete physical model of an E. coli bacterium. We know every function of every gene, the kinetics of every reaction, the physics of every membrane and motor. Computational models of the entire bacterium are able to accurately predict responses to every experiment we run.
Biologists say things like “the bacteria takes in information from its environment, processes that information, and makes decisions which approximately maximize fitness within its ancestral environment.” We have strong outside-view reasons to expect that the information processing in question probably approximates Bayesian reasoning (for some model of the environment), and the decision-making process approximately maximizes some expected utility function (which itself approximates fitness within the ancestral environment).
So presumably, given a complete specification of the bacteria’s physics, we ought to be able to back out its embedded world-model and utility function. How exactly do we do that, mathematically? What equations do we even need to solve?
As a computational biology professor I used to work with said, “Isn’t that, like, the entire problem of biology?”

Economics
Economists say things like “financial market prices provide the best publicly-available estimates for the probabilities of future events.” Prediction markets are an easy case, but let’s go beyond that: we have massive amounts of price data and transaction data from a wide range of financial markets - futures, stocks, options, bonds, forex... We also have some background general economic data, e.g. Fed open-market operations and IOER rate, tax code, regulatory code, and the like. How can we back out the markets’ implicit model of the economy as a whole? What equations do we need to solve to figure out, not just what markets expect, but markets’ implicit beliefs about how the world works?
Then the other half: aside from what markets expect, what do markets want? Can we map out the (approximate, local) utility functions of the component market participants, given only market data?

Neuro/Psych/FAI
Imagine we have a complete model of the human connectome. We’ve mapped every connection in one human brain, we know the dynamics of every cell type. We can simulate it all accurately enough to predict experimental outcomes.
Psychologists (among others) expect that human brains approximate Bayesian reasoning and utility maximization, at least within some bounds. Given a complete model of the brain, presumably we could back out the human’s beliefs, their ontology, and what they want. How do we do that? What equations would we need to solve?

ML/AI
Pull up the specifications for a trained generative adversarial network (GAN). We have all the parameters, we know all the governing equations of the network.
We expect the network to approximate Bayesian reasoning (for some model). Indeed, GAN training is specifically set up to mimic the environment of decision-theoretic agents. If anything is going to precisely approximate mathematical ideal agency, this is it. So, given the specification, how can we back out the network’s implied probabilistic model? How can we decode its internal ontology - and under what conditions do we expect it to develop nontrivial ontological structure at all?
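There is at least one known handle on the probabilistic-model half of this question: at the optimum of the GAN game, the discriminator converges to D(x) = p_data(x) / (p_data(x) + p_g(x)), so the generator's implied density ratio can be read straight off the discriminator's output. A minimal sketch, assuming (often unrealistically) that the discriminator has actually reached that optimum:

```python
import numpy as np

def implied_density_ratio(d_out):
    """Back out the generator's implied likelihood ratio p_g(x) / p_data(x)
    from an assumed-optimal discriminator output D(x) = p_data / (p_data + p_g)."""
    d_out = np.asarray(d_out, dtype=float)
    return (1.0 - d_out) / d_out

# If p_data(x) = 0.2 and p_g(x) = 0.1, the optimal discriminator outputs 2/3,
# and we recover the implied ratio p_g / p_data.
print(implied_density_ratio(2 / 3))
```

This recovers implied relative densities, not an ontology; the harder part of the question - under what conditions nontrivial internal structure develops - is untouched by it.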
Epistemic status: don't take it seriously
In a post apocalyptic setting, the world would be run by the socially skilled and the well connected, with corruption and nepotism ruling.
I say that at the start, because I've been trying to analyse the attraction of post-apocalyptic settings: why do we like them so much? Apart from the romanticism of old ruins, four things seem to stand out:
- Competence rewarded: the strong and the competent are the ones ruling, or at least making things happen. That must be the case, or else how could humans survive the new situation, where all luxuries are gone?
- Clear conflict: all the heroes are in it together, against some clear menace (evil tribe or leader, zombies, or just the apocalypse itself).
- Large freedom of action: instead of fitting into narrow jobs and following set bureaucratic procedures, always being careful to be polite, and so on, the heroes can let loose and do anything as long as it helps their goal.
- Moral lesson: the apocalypse happened because of some failing of past humans, and everyone agrees what they did wrong. "If only we'd listened to [X]!!"
(Some of these also explain the attraction of past "golden ages".)
And I can feel the draw of all of those things! There's a definite purity and attractiveness to them. Unfortunately, in a real post-apocalyptic setting, almost all of them would be false. For most of them, we're much closer to the ideal today than we would be in a post-apocalyptic world.
First of all, nepotism, corruption, and politics. The human brain is essentially designed for tribal politics, above all else. Tracking who's doing well, who's not, what coalition to join, who to repudiate, who to flatter, and so on - that's basically why our brains got so large. Tribal societies are riven with that kind of jostling and politics. We now live in an era where a lot of us have the luxury of ignoring politics at least some of the time. That luxury would be gone after an apocalypse; with no formal bureaucratic structures in place, our survival would depend on who we got along with, and who we pissed off. Competence might get rewarded - or it might get you singled out and ostracised (and ostracised = dead, in most societies). Influential groups and families would rule the roost, and most of the conflict would be internal. Forget redressing any injustice you'd be victim of; if you're not popular, you'll never have a chance.
As for the large freedom of action: that kinda depends on whether we go back to a tribal society, or a more agriculture-empire one. In both cases, we'd have less freedom in most ways than now (see above on the need for constantly playing the game of politics). Tribal societies do sometimes offer a degree of freedom and equality, in some ways beyond what we have today; but, unfortunately, the agriculture-empire groups will crush the tribes, relegating them to the edges and less productive areas (as has happened historically). This will be even more the case than historically; those empires will be the best placed to make use of the remnants of modern technology. And agriculture-empires are very repressive; any criticism of leaders could and would be met with death or torture.
Finally, forget about moral lessons. We're not doing enough today to combat, e.g., pandemics. But we are doing a lot more than nothing. So the moral lesson of a mass pandemic would be "do more of what the ancients were already doing, and do it better". The same goes for most risks that threaten humanity today; it's not that we fail to address them, it's that we fail to address them enough. Or suppose that it's a nuclear war that gets us; then the moral would be "we did too little against nuclear war, while doing too much for pandemics!"; if the dice had fallen the other way round, we'd get the opposite lesson.
In fact, there would be little moral lesson from our perspective; the post-apocalyptic people would be focused on their own ideologies and moralities, with the pre-apocalyptic world being mentioned only if it made a point relevant to those.
All in all, a post-apocalyptic world would be awful, and not just for the whole dying and ruin reasons, but just for living in the terrible and unequal societies it would produce.
I keep hearing this phrase, "collaborative truthseeking." Question: what kind of epistemic work is the word "collaborative" doing?
Like, when you (respectively I) say a thing and I (respectively you) hear it, that's going to result in some kind of state change in my (respectively your) brain. If that state change results in me (respectively you) making better predictions than I (respectively you) would have in the absence of the speech, then that's evidence for the hypothesis that at least one of us is "truthseeking."
But what's this "collaborative" thing about? How do speech-induced state changes result in better predictions if the speaker and listener are "collaborative" with each other? Are there any circumstances in which the speaker and listener being "collaborative" might result in worse predictions?
Epistemic spot checks typically consist of references from a book, selected by my interest level, checked against either the book’s source or my own research. This one is a little different in that I’m focusing on a single paragraph in a single paper. Specifically, as part of a larger review I read Ericsson, Krampe, and Tesch-Römer’s 1993 paper, The Role of Deliberate Practice in the Acquisition of Expert Performance (PDF), in an attempt to gain information about how long human beings can productively do thought work over a time period.
This paper is important because if you ask people how much thought work can be done in a day, if they have an answer and a citation at all, it will be “4 hours a day” and “Cal Newport’s Deep Work”. The Ericsson paper is in turn Newport’s source. So to the extent people’s beliefs are based on anything, they’re based on this paper.
In fact I’m not even reviewing the whole paper, just this one relevant paragraph:
When individuals, especially children, start practicing in a given domain, the amount of practice is an hour or less per day (Bloom, 1985b). Similarly, laboratory studies of extended practice limit practice to about 1 hr for 3-5 days a week (e.g., Chase & Ericsson, 1982; Schneider & Shiffrin, 1977; Seibel, 1963). A number of training studies in real life have compared the efficiency of practice durations ranging from 1-8 hr per day. These studies show essentially no benefit from durations exceeding 4 hr per day and reduced benefits from practice exceeding 2 hr (Welford, 1968; Woodworth & Schlosberg, 1954). Many studies of the acquisition of typing skill (Baddeley & Longman, 1978; Dvorak et al., 1936) and other perceptual motor skills (Henshaw & Holman, 1930) indicate that the effective duration of deliberate practice may be closer to 1 hr per day. Pirolli and J. R. Anderson (1985) found no increased learning from doubling the number of training trials per session in their extended training study. The findings of these studies can be generalized to situations in which training is extended over long periods of time such as weeks, months, and years.
Let’s go through each sentence in order. I’ve used each quote as a section header, with the citations underneath it in bold.

“When individuals, especially children, start practicing in a given domain, the amount of practice is an hour or less per day”
Generalizations about talent development, Bloom (1985)
“Typically the initial lessons were given in swimming and piano for about an hour each week, while the mathematics was taught about four hours each week…In addition some learning tasks (or homework) were assigned to be practiced and perfected before the next lesson.” (p513)
“…[D]uring the week the [piano] teacher expected the child to practice about an hour a day.” with descriptions of practice but no quantification given for swimming and math (p515).
The quote seems to me to be a simplification. “Expected an hour a day” is not the same as “did practice an hour or less per day.”

“…laboratory studies of extended practice limit practice to about 1 hr for 3-5 days a week”
This study focused strictly on memorizing digits, which I don’t consider to be that close to thought work.
This study had 8 people in it and was essentially an identification and reaction time trial.
3 subjects. This was a reaction time test, not thought work. No mention of duration studying.
“These studies show essentially no benefit from durations exceeding 4 hr per day and reduced benefits from practice exceeding 2 hr”
In a book with no page number given, I skipped this one.
This too is a book with no page number, but it was available online (thanks, archive.org) and I made an educated guess that the relevant chapter was “Economy in Learning and Performance”. Most of this chapter focused on recitation, which I don’t consider sufficiently relevant.
p800: “Almost any book on applied psychology will tell you that the hourly work output is higher in an eight-hour day than a ten-hour day.”(no source)
Offers this graph as demonstration that only monotonous work has diminishing returns.
p812: An interesting army study showing that students given telegraphy training for 4 hours/day (and spending 4 on other topics) learned as much as students studying 7 hours/day. This one seems genuinely relevant, although not enough to tell us where peak performance lies, just that four hours are better than seven. Additionally, the students weren’t loafing around for the excess three hours: they were learning other things. So this is about how long you can study a particular subject, not total learning capacity in a day.

“Many studies of the acquisition of typing skill (Baddeley & Longman, 1978; Dvorak et al., 1936) and other perceptual motor skills (Henshaw & Holman, 1930) indicate that the effective duration of deliberate practice may be closer to 1 hr per day”
“Four groups of postmen were trained to type alpha-numeric code material using a conventional typewriter keyboard. Training was based on sessions lasting for one or two hours occurring once or twice per day. Learning was most efficient in the group given one session of one hour per day, and least efficient in the group trained for two 2-hour sessions. Retention was tested after one, three or nine months, and indicated a loss in speed of about 30%. Again the group trained for two daily sessions of two hours performed most poorly. It is suggested that where operationally feasible, keyboard training should be distributed over time rather than massed”
“We found that fact retrieval speeds up as a power function of days of practice but that the number of daily repetitions beyond four produced little or no impact on reaction time”

Conclusion
Many of the studies were criminally small, and typically focused on singular, monotonous tasks like responding to patterns of light or memorizing digits. The precision of these studies is greatly exaggerated. There’s no reason to believe Ericsson, Krampe, and Tesch-Römer’s conclusion that the correct number of hours for deliberate practice is 3.5, much less the commonly repeated factoid that humans can do good work for 4 hours/day.
[This post supported by Patreon].
I was reading Tom Chivers' book "The AI Does Not Hate You". In a discussion about avoiding bad side effects when asking a magic broomstick to fill a water bucket, it was suggested that instead of asking the broomstick to fill the bucket, you could ask it to become 95 percent sure that the bucket was full, and that might make it less likely to flood the house.
Apparently Tom asked Eliezer at the time and he said there was no known problem with that solution.
Are there any posts on this? Is the reason why we don't know this won't work just because it's hard to make this precise?
I constructed my AI alignment research agenda piece by piece, stumbling around in the dark and going down many false and true avenues.
But now it is increasingly starting to feel natural to me, and indeed, somewhat inevitable.
What do I mean by that? Well, let's look at the problem in reverse. Suppose we had an AI that was aligned with human values/preferences. How would you expect that to have been developed? I see four natural paths:
- Effective proxy methods. For example, Paul's amplification and distillation, or variants of revealed preferences, or a similar approach. The point of this is that it reaches alignment without defining what a preference fundamentally is; instead it uses some proxy for the preference to do the job.
- Corrigibility: the AI is safe and corrigible, and along with active human guidance, manages to reach a tolerable outcome.
- Something new: a bold new method that works, for reasons we haven't thought of today (this includes most strains of moral realism).
- An actual grounded definition of human preferences.
So, if we focus on scenario 4, we need a few things. We need a fundamental definition of what a human preference is (since we know this can't be defined purely from behaviour). We need a method of combining contradictory and underdefined human preferences. We also need a method for taking into account human meta-preferences. And both these methods have to actually reach an output, and not get caught in loops.
If those are the requirements, then it's obvious why we need most of the elements of my research agenda, or something similar. We don't need the exact methods sketched out there; there may be other ways of synthesising preferences and meta-preferences together. But the overall structure - a way of defining preferences, and ways of combining them that produce an output - seems, in retrospect, inevitable. The rest is, to some extent, implementation details.
I've scraped http://arbital.com as the site is unusably slow and hard to search for me.
The scrape is locally browsable and plain HTML save for MathJax and a few interactive demos. Source code included (with git history).
(previously Arbital Scrape)
Updates: Included source code, MathJax and link formatting, cross-linking, missing pages, etc.
Often, people think about their self-worth/self-confidence/self-esteem/self-efficacy in ways which seem really strange from a simplistic decision-theoretic perspective. (I'm going to treat all those terms as synonyms, but, feel free to differentiate between them as you see fit!) Why might you "need confidence" in order to try something, even when it is obviously your best bet? Why might you constantly worry that you're "not good enough" (seemingly no matter how good you become)? Why do people especially suffer from this when they see others who are (in some way) much better than them, even when there is clearly no threat to their personal livelihood? Why might you think about killing yourself due to feeling worthless? (Is there an evo-psych explanation that makes sense, given how contrary it seems to survival of the fittest?)
There might be a lot of diverse explanations for the diverse phenomena. I think providing more examples of puzzling phenomena is an equally valuable way to answer (though maybe those should be a comment rather than an answer?).
This seems connected to the puzzling way people constantly seem to want to believe good things (even contrary to evidence) in order to feel good, and fear failure even when the alternative is not trying & essentially failing automatically.
Some sketchy partial explanations to start with:
- Maybe there is a sense in which we manage the news constantly. It could be that we have a mental architecture which looks a lot like a model-free RL agent connected up to a world model, being rewarded for taking actions which increase expected value according to the world-model. The model-free RL will fool the world-model where it can, but this will be ineffective in any case where the world-model understands such manipulation. So things basically even out to rational behavior, but there's always some self-delusion going on at the fringes. (This only has to do with the observation that people sometimes try to make themselves feel better by finding arguments/activities which boost self-esteem, not with other weird aspects of self-esteem.)
- There's a theory that, in order to be trustworthy bargaining partners, people evolved to feel guilty/shameful when they violate trust. You can tell who feels more guilt/shame after some interaction with them, and you can expect these people to violate trust less often since it is more costly for them. Therefore feelings of guilt/shame can be an advantage. Self-worth may be connected to how this is implemented internally. So, according to this theory, low self-worth is all about self-punishment.
- Previously, I thought that self-worth was like an estimate of how valuable you are to your peers, which serves as an estimate of what resources you can bargain for (or, how strong of a bid can you successfully make for the group to do what you want) and how likely you are to be thrown out of the coalition.
- Now I think there's an extra dimension which has to do with simpler dominance-hierarchy behavior. Many animals have dominance hierarchies; humans have more complicated coordination strategies which involve a lot of other factors, but still display very classic dominance-hierarchy behavior sometimes. In a dominance-hierarchy system, it just makes sense to carry around a little number in your head which says how great (/terrible) a person you are, and engage in a lot of varying behaviors depending on your place in the hierarchy. Someone who is low in the hierarchy has to walk with their tail between their legs, metaphorically, which means displaying caution and deference. Maybe you have trouble talking to people because you need to show fear to your superiors.
Epistemic status: this is a new model for me, certainly rough around the joints, but I think there’s something real here.
This post begins with a confusion. For years, I have been baffled that people, watching their loved ones wither and decay and die, do not clamor in the streets for more and better science. Surely they are aware of the advances in our power over reality in only the last few centuries. They hear of the steady march of technology, Crispr and gene editing and what not. Enough of them must know basic physics and what it allows. How are people so content to suffer and die when the unnecessity of it is so apparent?
It was a failure of mine that I didn’t take my incomprehension and realize I needed a better model. Luckily, RomeoStevens recently offered me an explanation. He said that most people live in social reality and it is only a minority who live in causal reality. I don’t recall Romeo elaborating much, but I think I saw what he was pointing at. The rest of this post is my attempt to elucidate this distinction.

Causal Reality
Causal reality is the reality of physics. The world is made of particles and fields with lawful relationships governing their interactions. You drop a thing, it falls down. You lose too much blood, you die. You build a solar panel, you can charge your phone. In causal reality, it is the external world which dictates what happens and what is possible.
Causal reality is the reality of mathematics and logic, reason and argument. These too, it would seem, exist independent of the human minds who understand them. Believing in the truth preservation of modus ponens is not so different from believing in Newton’s laws.
Necessarily, you must be inhabiting causal reality to do science and engineering.
In causal reality, what makes things good or bad are their effects and how much you like those effects. My coat keeps me warm in the cold winter, so it is a good coat.
All humans inhabit causal reality to some extent or another. We avoid putting our hands in fire not because it is not the done thing, but because of a prediction about that reality: that it will hurt.

Social Reality
Social reality is the reality of people, i.e. people are the primitive elements rather than particles and fields. The fundamentals of the ontology are beliefs, judgments, roles, relationships, and culture. The most important properties of any object, thing, or idea are how humans relate to it. Do humans think it is good or bad, welcome or weird?
Social reality is the reality of appearances and reputation, acceptance and rejection. The picture is other people and what they think the picture is. It is a collective dream. Everything else is backdrop. What makes things good or bad, normal or strange is only what others think. Your friends, your neighbors, your country, and your culture define your world, what is good, and what is possible.

Your reality shapes how you make your choices
In causal reality, you have an idea of the things that you like and dislike. You have an idea of what the external world allows and disallows. In each situation, you can ask what the facts on the ground are and which option you most prefer. Is it better to build my house from bricks or straw? Well . . . what are the properties of each, their costs and benefits, etc.? Maybe stone, you think. No one has built a stone house in your town, but you wonder if such a house might be worth the trouble.
In social reality, in any situation, you are evaluating and estimating what others will think of each option. What does it say about me if I have a brick house or straw house? What will people think? Which is good? And goodness here simply stands in for the collective judgment of others. If something is not done, e.g. stone houses, then you will probably not even think of the option. If you do, you will treat it with the utmost caution; there is no precedent here - who can say how others will respond?

An Example: Vibrams
Vibrams are a kind of shoe with individual “sections” for each of your toes, kind of like a glove for your feet. They certainly don’t look like most shoes, but apparently, they’re very comfortable and good for you. They’ve been around for a while now, so enough people must be buying them.
How you evaluate Vibrams will depend on whether you approach more from a causal reality angle or a social reality angle. Many of the thoughts in each case will overlap, but I contend that their order and intensity will still vary.
In causal reality, properties are evaluated and predictions are made. How comfortable are they? Are they actually good for you? How expensive are they? These are obvious “causal”/”physical” properties. You might, still within causal reality, evaluate how Vibrams will affect how others see you. You care about comfort, but you also care about what your friends think. You might decide that Vibrams are just so damn comfortable they’re worth a bit of teasing.
In social reality, the first and foremost questions about Vibrams are going to be what do others think? What kinds of people wear Vibrams? What kind of person will wearing Vibrams make me? Do Vibrams fit with my identity and social strategy? All else equal, you’d prefer comfort, but that really is far from the key thing here. It’s the human judgments which are real.

An Example: Arguments, Evidence, and Truth
Causal reality is typically accompanied by a notion of external truth. There is a way reality is, and that’s what determines what happens. What’s more, there are ways of accessing this external truth, as verified by these methods yielding good predictions. Evidence, arguments, and reasoning can often work quite well.
If you approach reality foremost with a conception of external truth and that broadly reasoning is a way to reach truth, you can be open to raw arguments and evidence changing your mind. These are information about the external world.
In social reality, truth is what other people think and how they behave. There are games to be played with “beliefs” and “arguments”, but the real truth (only truth?) that matters is how these arguments go down with others. The validity of an argument comes from its acceptance by the crowd, because the crowd is truth. I might accept that, within the causal reality game you are playing, you have a valid argument, but that’s just a game. The arguments from those games cannot move me and my actions independent of how they are evaluated in the social reality.
“Yes, I can’t fault your argument. It’s a very fine argument. But tell me, who takes this seriously? Are there any experts who will support your view?” Subtext: your argument within causal reality isn’t enough for me; I need social reality to pass judgment on this before I will accept it.

Why aren’t people clamoring in the streets for the end of sickness and death?
Because no one else is. Because the done thing is to be born, go to school, work, retire, get old, get sick, and die. That’s what everyone does. That’s how it is. It’s how my parents did, and their parents, and so on. That is reality. That’s what people do.
Yes, there are some people who talk about life extension, but they’re just playing at some group game the way goths are. It’s just a club, a rallying point. It’s not about something. It’s just part of the social reality like everything else, and I see no reason to participate in that. I’ve got my own game which doesn’t involve being so weird, a much better strategy.
In his book The AI Does Not Hate You, Tom Chivers recounts himself performing an Internal Double Crux with guidance from Anna Salamon. By my take, he is valiantly trying to reconcile his social and causal reality frames. [emphasis added, very slightly reformatted]

Anna Salamon: What’s the first thing that comes into your head when you think the phrase, “Your children won’t die of old age”?

Tom Chivers: “The first thing that pops up, obviously, is I vaguely assume my children will die the way we all do. My grandfather died recently; my parents are in their sixties; I’m almost 37 now. You see the paths of a human’s life each time; all lives follow roughly the same path. They have different toys - iPhones instead of colour TVs instead of whatever - but the fundamental shape of a human’s life is roughly the same. But the other thing that popped up is a sense of “I don’t know how I can argue with it”, because I do accept that there’s a solid chance that AGI will arrive in the next 100 years. I accept that there’s a very high likelihood that if it does happen then it will transform human life in dramatic ways - up to and including an end to people dying of old age, whether it’s because we’re all killed by drones with kinetic weapons, or uploaded into the cloud, or whatever. I also accept that my children will probably live that long, because they’re middle-class, well-off kids from a Western country. All these things add up to a very heavily non-zero chance that my children will not die of old age, but they don’t square with my bucolic image of what humans do. They get older, they have kids, they have grandkids, and they die, and that’s the shape of life. Those are two fundamental things that came up, and they don’t square easily.”
Most people primarily inhabit a social reality frame, and in social reality options and actions which aren’t being taken by other people who are like you and whose judgments you’re interested in don’t exist. There’s no extrapolation from physics and technology trends - those things are just background stories in the social game. They’re not real. Probably less real than Jon Snow. I have beliefs and opinions and judgments of Jon Snow and his actions. What is real are the people around me.

Obviously, you need a bit of both
If you read this post as being a little negative toward social reality, you’re not mistaken. But to be very clear, I think that modeling and understanding people is critically important. Heck, that’s exactly what this post is. For our own wellbeing and to do anything real in the world, we need to understand and predict others, their actions, their judgments, etc. You probably want to know what the social reality is (though I wonder if avoiding the distraction of it might facilitate especially great works, but alas, it’s too late for me). Yet if there is a moral to this post, it’s the following:
- Don’t get sucked in too much by social reality. There is an external world out there which has first claim on what happens and what is possible.
- What other people think is often Bayesian evidence, but it isn’t reality itself.
- If you primarily inhabit causal reality (like most people on LessWrong), you can be a bit less surprised that your line of reasoning fails to move many people. They’re not living in the same reality as you and they choose their beliefs based on a very different process. And heck, more people live in that reality than in yours. You really are the weirdo here.
We wrote a 20-page document that explains IDA and outlines potential Machine Learning projects about IDA. This post gives an overview of the document.

What is IDA?
Iterated Distillation and Amplification (IDA) is a method for training ML systems to solve challenging tasks. It was introduced by Paul Christiano. IDA is intended for tasks where:
The goal is to outperform humans at the task or to solve instances that are too hard for humans.
It is not feasible to provide demonstrations or reward signals sufficient for super-human performance at the task.
Humans have a high-level understanding of how to approach the task and can reliably solve easy instances.
The idea behind IDA is to bootstrap using an approach similar to AlphaZero, but with a learned model of steps of human reasoning instead of the fixed game simulator.
Our document provides a self-contained technical description of IDA. For broader discussion of IDA and its relevance to value alignment, see Ought's presentation, Christiano's blog post, and the Debate paper. There is also a technical ML paper applying IDA to algorithmic problems (e.g. shortest path in a graph).

ML Projects on IDA
Our document outlines three Machine Learning projects on IDA. Our goal in outlining these projects is to generate discussion and encourage research on IDA. We are not (as of June 2019) working on these projects, but we are interested in collaboration. The project descriptions are “high-level” and leave many choices undetermined. If you took on a project, part of the work would be refining the project and fixing a concrete objective, dataset and model.

Project 1: Amplifying Mathematical Reasoning
This project is about applying IDA to problems in mathematics. This would involve learning to solve math problems by breaking them down into easier sub-problems. The problems could be represented in a formal language (as in this paper) or in natural language. We discuss a recent dataset of high-school problems in natural language, which was introduced in this paper. Here are some examples from the dataset:
Question: Let u(n) = -n^3 - n^2. Let e(c) = -2*c^3 + c. Let f(j) = -118*e(j) + 54*u(j). What is the derivative of f(a)?
Answer: 546*a^2 - 108*a - 118
Question: Three letters picked without replacement from qqqkkklkqkkk. Give probability of sequence qql.
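The dataset omits the answer for this second example, so as a sanity check on the kind of computation involved (this snippet is ours, not from the paper), the probability can be worked out with exact fractions:

```python
# The dataset omits the answer here; this arithmetic check is ours.
from collections import Counter
from fractions import Fraction

pool = "qqqkkklkqkkk"
counts = Counter(pool)  # 4 q's, 7 k's, 1 l; 12 letters in total
n = len(pool)

# P(first draw q) * P(second draw q) * P(third draw l), without replacement
p = (Fraction(counts["q"], n)
     * Fraction(counts["q"] - 1, n - 1)
     * Fraction(counts["l"], n - 2))

print(p)  # 1/110
```

That is, 4/12 × 3/11 × 1/10 = 1/110, which is the kind of multi-step decomposition (count letters, then chain conditional probabilities) that an IDA-style system would need to learn.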
The paper showed impressive results on the dataset for a Transformer model trained by supervised learning (sequence-to-sequence). This suggests that a similar model could do well at learning to solve these problems by decomposition.

Project 2: IDA for Neural Program Interpretation
There’s a research program in Machine Learning on “Neural Program Interpretation” (NPI). Work on NPI focuses on learning to reproduce the behavior of computer programs. One possible approach is to train end-to-end on input-output behavior. However, in NPI, a model is trained to mimic the program’s internal behavior, including all the low-level operations and the high-level procedures which invoke them.
NPI has some similar motivations to IDA. This project applies IDA to the kinds of tasks explored in NPI and compares IDA to existing approaches. Tasks could include standard algorithms (e.g. sorting), algorithms that operate with databases, and algorithms that operate on human-readable inputs (e.g. text, images).

Project 3: Adaptive Computation
The idea of “adaptive computation” is to vary the amount of computation you perform for different inputs. You want to apply more computation to inputs that are hard but solvable.
Adaptive computation seems important for the kinds of problems IDA is intended to solve, including some of the problems in Projects 1 and 2. This project would investigate different approaches to adaptive computation for IDA. The basic idea is to decide whether to rely only on the distilled model (which is fast but approximate) or to additionally use amplification (which is more accurate but slower). This decision could be based on a calibrated model or based on a learned policy for choosing whether to use amplification.
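As a rough illustration of that fast-vs-slow decision, here is a toy sketch under invented assumptions (the task, threshold, and function names are ours, not from the document): the "distilled model" is only confident on tiny inputs, and everything else pays for amplification by decomposition.

```python
# Toy sketch of IDA-style adaptive computation (all names are hypothetical).
# The distilled model is fast but only trustworthy on easy inputs; the
# amplified path decomposes hard inputs into easier sub-problems.

EASY_SIZE = 2            # assumed size below which the fast model is trusted
CONFIDENCE_CUTOFF = 0.9  # assumed calibration threshold

def distilled(numbers):
    """Fast approximate model: returns (answer, confidence)."""
    if len(numbers) <= EASY_SIZE:
        return sum(numbers), 1.0
    return None, 0.0

def amplify(numbers, solve):
    """Slow but reliable path: decompose into sub-problems and recurse."""
    mid = len(numbers) // 2
    return solve(numbers[:mid]) + solve(numbers[mid:])

def solve(numbers):
    answer, confidence = distilled(numbers)
    if confidence >= CONFIDENCE_CUTOFF:
        return answer               # fast path: distilled model alone
    return amplify(numbers, solve)  # slow path: use amplification
```

For example, `solve([1, 2, 3, 4, 5])` returns 15, with only the length-1 and length-2 sub-lists answered directly by the distilled model. In a real system a calibrated confidence estimate or a learned policy would replace the hard-coded size check.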
The generalized efficient markets (GEM) principle says, roughly, that things which would give you a big windfall of money and/or status, will not be easy. If such an opportunity were available, someone else would have already taken it. You will never find a $100 bill on the floor of Grand Central Station at rush hour, because someone would have picked it up already.
One way to circumvent GEM is to be the best in the world at some relevant skill. A superhuman with hawk-like eyesight and the speed of the Flash might very well be able to snag $100 bills off the floor of Grand Central. More realistically, even though financial markets are the ur-example of efficiency, a handful of firms do make impressive amounts of money by being faster than anyone else in their market. I’m unlikely to ever find a proof of the Riemann Hypothesis, but Terry Tao might. Etc.
But being the best in the world, in a sense sufficient to circumvent GEM, is not as hard as it might seem at first glance (though that doesn’t exactly make it easy). The trick is to exploit dimensionality.
Consider: becoming one of the world’s top experts in proteomics is hard. Becoming one of the world’s top experts in macroeconomic modelling is hard. But how hard is it to become sufficiently expert in proteomics and macroeconomic modelling that nobody is better than you at both simultaneously? In other words, how hard is it to reach the Pareto frontier?
Having reached that Pareto frontier, you will have circumvented the GEM: you will be the single best-qualified person in the world for (some) problems which apply macroeconomic modelling to proteomic data. You will have a realistic shot at a big money/status windfall, with relatively little effort.
(Obviously we’re oversimplifying a lot by putting things like “macroeconomic modelling skill” on a single axis, and breaking it out onto multiple axes would strengthen the main point of this post. On the other hand, it would complicate the explanation; I’m keeping it simple for now.)
Let’s dig into a few details of this approach…

Elbow Room
There are many table tennis players, but only one best player in the world. This is a side effect of ranking people on one dimension: there’s only going to be one point furthest to the right (absent a tie).
Pareto optimality pushes us into more dimensions. There’s only one best table tennis player, and only one best 100-meter sprinter, but there can be an unlimited number of Pareto-optimal table tennis/sprinters.
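To make that concrete, here is a small illustrative simulation (our own sketch, with made-up random skill scores): on a single axis exactly one point is maximal, but on two axes many points survive.

```python
# Illustrative simulation (made-up skill scores): one best point per
# single axis, but many Pareto-optimal points once there are two axes.
import random

def pareto_frontier(points):
    """Points that no other point matches-or-beats on both coordinates."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

random.seed(0)
players = [(random.random(), random.random()) for _ in range(1000)]
frontier = pareto_frontier(players)

# Exactly one player tops each single axis, but the two-axis frontier
# typically holds several players (on the order of ln(n) for uniform draws).
print(len(frontier))
```

With continuous random scores, ties are negligible, so the simple `>=`-dominance check above suffices.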
Problem is, for GEM purposes, elbow room matters. Maybe I’m on the Pareto frontier of Bayesian statistics and gerontology, but if there’s one person just a little bit better at statistics and worse at gerontology than me, and another person just a little bit better at gerontology and worse at statistics, then GEM only gives me an advantage over a tiny little chunk of the skill-space.
This brings up another aspect…

Problem Density
Claiming a spot on a Pareto frontier gives you some chunk of the skill-space to call your own. But that’s only useful to the extent that your territory contains useful problems.
Two pieces factor in here. First, how large a territory can you claim? This is about elbow room, as in the diagram above. Second, what’s the density of useful problems within this region of skill-space? The table tennis/sprinting space doesn’t have a whole lot going on. Statistics and gerontology sounds more promising. Cryptography and monetary economics is probably a particularly rich Pareto frontier these days. (And of course, we don’t need to stop at two dimensions - but we’re going to stop there in this post in order to keep things simple.)

Dimensionality
One problem with this whole GEM-vs-Pareto concept: if chasing a Pareto frontier makes it easier to circumvent GEM and gain a big windfall, then why doesn’t everyone chase a Pareto frontier? Apply GEM to the entire system: why haven’t people already picked up the opportunities lying on all these Pareto frontiers?
Answer: dimensionality. If there are 100 different specialties, then there are only 100 people who are the best within their specialty. But there are 10k pairs of specialties (e.g. statistics/gerontology), 1M triples (e.g. statistics/gerontology/macroeconomics), and something like 10^30 combinations of specialties. And each of those Pareto frontiers has room for more than one person, even allowing for elbow room. Even if only a small fraction of those combinations are useful, there’s still a lot of space to stake out a territory.
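Those figures are order-of-magnitude; a quick check we added (using unordered combinations, so the counts come out a bit below the post's rounded numbers):

```python
# Order-of-magnitude check of the dimensionality argument, using
# unordered combinations of 100 specialties.
import math

specialties = 100
pairs = math.comb(specialties, 2)    # 4950 - roughly the post's "10k pairs"
triples = math.comb(specialties, 3)  # 161700 - the "1M triples" counts loosely
subsets = 2 ** specialties           # ~1.27e30 - "something like 10^30"
```

The exact counts matter less than the shape of the growth: the number of frontiers explodes combinatorially while the number of single-axis "best in the world" slots stays fixed at 100.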
And to a large extent, people do pursue those frontiers. It’s no secret that an academic can easily find fertile fields by working with someone in a different department. “Interdisciplinary” work has a reputation for being unusually high-yield. Similarly, carrying scientific work from lab to market has a reputation for high yields. Thanks to the “curse” of dimensionality, these goldmines are in no danger of being exhausted.
[AN #58] Mesa optimization: what it is, and why we should care
Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Highlights
Risks from Learned Optimization in Advanced Machine Learning Systems (Evan Hubinger et al): Suppose you search over a space of programs, looking for one that plays TicTacToe well. Initially, you might find some good heuristics, e.g. go for the center square, if you have two along a row then place the third one, etc. But eventually you might find the minimax algorithm, which plays optimally by searching for the best action to take. Notably, your outer optimization over the space of programs found a program that was itself an optimizer that searches over possible moves. In the language of this paper, the minimax algorithm is a mesa optimizer: an optimizer that is found autonomously by a base optimizer, in this case the search over programs.
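The minimax "mesa optimizer" in this example fits in a few lines. The sketch below is ours (a bare-bones implementation, not code from the paper), but it is exactly the kind of search-over-moves program that an outer search over programs might stumble on:

```python
# Bare-bones minimax for TicTacToe. From the outside this is just one
# candidate program; from the inside it is itself an optimizer, searching
# over moves for the best one.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) scored from X's perspective:
    +1 X wins, -1 O wins, 0 draw under optimal play."""
    w = winner(board)
    if w is not None:
        return (1 if w == 'X' else -1), None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None  # board full: draw
    best = None
    for m in moves:
        board[m] = player
        score, _ = minimax(board, 'O' if player == 'X' else 'X')
        board[m] = ' '  # undo the move
        if (best is None
                or (player == 'X' and score > best[0])
                or (player == 'O' and score < best[0])):
            best = (score, m)
    return best
```

For instance, with X one move from completing the top row on `list("XX OO    ")`, `minimax` finds the winning move (cell 2) with score +1.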
Why is this relevant to AI? Well, gradient descent is an optimization algorithm that searches over the space of neural net parameters to find a set that performs well on some objective. It seems plausible that the same thing could occur: gradient descent could find a model that is itself performing optimization. That model would then be a mesa optimizer, and the objective that it optimizes is the mesa objective. Note that while the mesa objective should lead to similar behavior as the base objective on the training distribution, it need not do so off distribution. This means the mesa objective is pseudo aligned; if it also leads to similar behavior off distribution, it is robustly aligned.
A central worry with AI alignment is that if powerful AI agents optimize the wrong objective, it could lead to catastrophic outcomes for humanity. With the possibility of mesa optimizers, this worry is doubled: we need to ensure both that the base objective is aligned with humans (called outer alignment) and that the mesa objective is aligned with the base objective (called inner alignment). A particularly worrying aspect is deceptive alignment: the mesa optimizer has a long-term mesa objective, but knows that it is being optimized for a base objective. So, it optimizes the base objective during training to avoid being modified, but at deployment when the threat of modification is gone, it pursues only the mesa objective.
As a motivating example, if someone wanted to create the best biological replicators, they could have reasonably used natural selection / evolution as an optimization algorithm for this goal. However, this then would lead to the creation of humans, who would be mesa optimizers that optimize for other goals, and don't optimize for replication (e.g. by using birth control).
The paper has a lot more detail and analysis of what factors make mesa-optimization more likely, more dangerous, etc. You'll have to read the paper for all of these details. One general pattern is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal algorithm for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are less like heuristics/proxies), but conditional on mesa-optimization arising, makes it more likely that it is pseudo aligned instead of robustly aligned (because now the pressure for heuristics/proxies leads to learning a proxy mesa-objective instead of the true base objective).
Rohin's opinion: I'm glad this paper has finally come out. The concepts of mesa optimization and the inner alignment problem seem quite important, and currently I am most worried about x-risk caused by a misaligned mesa optimizer. Unfortunately, it is not yet clear whether mesa optimizers will actually arise in practice, though I think conditional on us developing AGI it is quite likely. Gradient descent is a relatively weak optimizer; it seems like AGI would have to be much more powerful, and so would require a learned optimizer (in the same way that humans can be thought of as "optimizers learned by evolution").
There still is a lot of confusion and uncertainty around the concept, especially because we don't have a good definition of "optimization". It also doesn't help that it's hard to get an example of this in an existing ML system -- today's systems are likely not powerful enough to have a mesa optimizer (though even if they had a mesa optimizer, we might not be able to tell because of how uninterpretable the models are).
Read more: Alignment Forum version

Technical AI alignment

Agent foundations
Selection vs Control (Abram Demski): The previous paper focuses on mesa optimizers that are explicitly searching across a space of possibilities for an option that performs well on some objective. This post argues that in addition to this "selection" model of optimization, there is a "control" model of optimization, where the model cannot evaluate all of the options separately (as in e.g. a heat-seeking missile, which can't try all of the possible paths to the target separately). However, these are not cleanly separated categories -- for example, a search process could have control-based optimization inside of it, in the form of heuristics that guide the search towards more likely regions of the search space.
Rohin's opinion: This is an important distinction, and I'm of the opinion that most of what we call "intelligence" is actually more like the "control" side of these two options.

Learning human intent
Imitation Learning as f-Divergence Minimization (Liyiming Ke et al) (summarized by Cody): This paper frames imitation learning through the lens of matching your model's distribution over trajectories (or conditional actions) to the distribution of an expert policy. This framing of distribution comparison naturally leads to the discussion of f-divergences, a broad set of measures including KL and Jenson-Shannon Divergences. The paper argues that existing imitation learning methods have implicitly chosen divergence measures that incentivize "mode covering" (making sure to have support anywhere the expert does) vs mode collapsing (making sure to only have support where the expert does), and that the latter is more appropriate for safety reasons, since the average between two modes of an expert policy may not itself be a safe policy. They demonstrate this by using a variational approximation of the reverse-KL distance as the divergence underlying their imitation learner.
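The mode-covering vs mode-seeking asymmetry is easy to see numerically with discrete distributions. This is an illustrative toy we constructed (the action space and distributions are made up, not from the paper): forward KL blows up when the model misses an expert mode, while reverse KL blows up when the model puts mass where the expert never acts.

```python
# Toy illustration (made-up distributions): the expert has two safe modes;
# "averager" spreads mass over everything, including the unsafe middle
# action; "committer" picks one expert mode and sticks to it.
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; inf where q lacks p's support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

expert    = [0.5, 0.0, 0.0, 0.0, 0.5]
averager  = [0.2, 0.2, 0.2, 0.2, 0.2]
committer = [1.0, 0.0, 0.0, 0.0, 0.0]

# Forward KL(expert || model): infinite if the model misses an expert mode.
# Reverse KL(model || expert): infinite if the model acts where the expert never does.
print(kl(expert, averager), kl(averager, expert))    # finite, inf
print(kl(expert, committer), kl(committer, expert))  # inf, finite
```

On this toy, a reverse-KL-style objective punishes the averager's mass on the never-taken middle action infinitely, which matches the paper's safety argument for mode-seeking divergences.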
Cody's opinion: I appreciate papers like these that connect people's intuitions between different areas (like imitation learning and distributional difference measures). It does seem like this would even more strongly lead to a lack of ability to outperform the demonstrator, but that's honestly more a critique of imitation learning in general than of this paper in particular.

Handling groups of agents
Social Influence as Intrinsic Motivation for Multi-Agent Deep RL (Natasha Jaques et al) (summarized by Cody): An emerging field of common-sum multi-agent research asks how to induce groups of agents to perform complex coordination behavior to increase general reward, and many existing approaches involve centralized training or hardcoding altruistic behavior into the agents. This paper suggests a new technique that rewards agents for having a causal influence over the actions of other agents, in the sense that the actions of the pair of agents have high mutual information. The authors empirically find that having even a small number of agents who act as "influencers" can help avoid coordination failures in partial information settings and lead to higher collective reward. In one sub-experiment, they only add this influence reward to the agents' communication channels, so agents are incentivized to provide information that will impact other agents' actions (this information is presumed to be truthful and beneficial since otherwise it would subsequently be ignored).
Cody's opinion: I'm interested by this paper's finding that you can generate apparently altruistic behavior by incentivizing agents to influence others, rather than necessarily help others. I also appreciate the point that was made to train in a decentralized way. I'd love to see more work on a less asymmetric version of influence reward; currently influencers and influencees are separate groups due to worries about causal feedback loops, and this implicitly means there's a constructed group of quasi-altruistic agents who are getting less concrete reward because they're being incentivized by this auxiliary reward.

Uncertainty
ICML Uncertainty and Robustness Workshop Accepted Papers (summarized by Dan H): The Uncertainty and Robustness Workshop accepted papers are available. Topics include out-of-distribution detection, generalization to stochastic corruptions, label corruption robustness, and so on.

Miscellaneous (Alignment)
To first order, moral realism and moral anti-realism are the same thing (Stuart Armstrong)

AI strategy and policy
Grover: A State-of-the-Art Defense against Neural Fake News (Rowan Zellers et al): Could we use ML to detect fake news generated by other ML models? This paper suggests that models that are used to generate fake news will also be able to be used to detect that same fake news. In particular, they train a GAN-like language model on news articles, that they dub GROVER, and show that the generated articles are better propaganda than those generated by humans, but they can at least be detected by GROVER itself.
Notably, they do plan to release their models, so that other researchers can also work on the problem of detecting fake news. They are following a similar release strategy as with GPT-2 (AN #46): they are making the 117M and 345M parameter models public, and releasing their 1.5B parameter model to researchers who sign a release form.
Rohin's opinion: It's interesting to see that this group went with a very similar release strategy, and I wish they had written more about why they chose to do what they did. I do like that they are on the face of it "cooperating" with OpenAI, but eventually we need norms for how to make publication decisions, rather than always following the precedent set by someone prior. Though I suppose there could be a bit more risk with their models -- while they are the same size as the released GPT-2 models, they are better tuned for generating propaganda than GPT-2 is.
Read more: Defending Against Neural Fake News
The Hacker Learns to Trust (Connor Leahy): An independent researcher attempted to replicate GPT-2 (AN #46) and was planning to release the model. However, he has now decided not to release, because releasing would set a bad precedent. Regardless of whether or not GPT-2 is dangerous, at some point in the future, we will develop AI systems that really are dangerous, and we need to have adequate norms then that allow researchers to take their time and evaluate the potential issues and then make an informed decision about what to do. Key quote: "sending a message that it is ok, even celebrated, for a lone individual to unilaterally go against reasonable safety concerns of other researchers is not a good message to send".
Rohin's opinion: I quite strongly agree that the most important impact of the GPT-2 decision was that it has started a discussion about what appropriate safety norms should be, whereas before there were no such norms at all. I don't know whether or not GPT-2 is dangerous, but I am glad that AI researchers have started thinking about whether and how publication norms should change.

Other progress in AI

Reinforcement learning
A Survey of Reinforcement Learning Informed by Natural Language (Jelena Luketina et al) (summarized by Cody): Humans use language as a way of efficiently storing knowledge of the world and instructions for handling new scenarios; this paper is written from the perspective that it would be potentially hugely valuable if RL agents could leverage information stored in language in similar ways. They look at both the case where language is an inherent part of the task (example: the goal is parameterized by a language instruction) and where language is used to give auxiliary information (example: parts of the environment are described using language). Overall, the authors push for more work in this area, and, in particular, more work using external-corpus-pretrained language models and with research designs that use human-generated rather than synthetically-generated language; the latter is typically preferred for the sake of speed, but the former has particular challenges we'll need to tackle to actually use existing sources of human language data.
Cody's opinion: This article is a solid and useful version of what I would expect out of a review article: mostly useful as a way to get thinking in the direction of the intersection of RL and language, and it makes me more interested in digging into some of the mentioned techniques, since by design this review didn't go very deep into any of them.

Deep learning
Ray Interference: a Source of Plateaus in Deep Reinforcement Learning (Tom Schaul et al) (summarized by Cody): The authors argue that Deep RL is subject to a particular kind of training pathology called "ray interference", caused by situations where (1) there are multiple sub-tasks within a task, and the gradient update of one can decrease performance on the others, and (2) the ability to learn on a given sub-task is a function of its current performance. Performance interference can happen whenever there are shared components between notional subcomponents or subtasks, and the fact that many RL algorithms learn on-policy means that low performance might lead to little data collection in a region of parameter space, and make it harder to increase performance there in future.
Cody's opinion: This seems like a useful mental concept, but it seems quite difficult to effectively remedy, except through preferring off-policy methods to on-policy ones, since there isn't really a way to decompose real RL tasks into separable components the way they do in their toy example.

Meta learning
Alpha MAML: Adaptive Model-Agnostic Meta-Learning (Harkirat Singh Behl et al)

Copyright © 2019 Rohin Shah, All rights reserved.
At the moment there is a link on the Alignment Forum to apply for membership. You can apply by submitting papers, blog posts or comments. However, there is very little in the way of detail about what kind of work they expect in order for you to have a reasonable chance of being invited. I'm not saying that there should be specific criteria, as I'm in favour of the moderators using their judgement. However, more detail on what they are looking for would encourage people to work towards achieving the level of knowledge and good judgement that they expect of their members.
Topics we covered:
- How business school led me to rationality.
- The origin story and meaning of Putanumonit.
- How to put numbers on things where no numbers exist.
- Contextualizing and decoupling in the saga of Caster Semenya.
- Does putting numbers on dating make me an emotionless robot?
- Rationality, mindfulness, and poking your head above the river.
- The posts I regret writing.
- Antinatalism and the connection between emotion and philosophy.
- How intuition follows controversy, and why hunter-gatherers don’t have opinions on immigration.
- Fake frameworks as the key to rationality and why I prefer the Magic: The Gathering personality color wheel to the Big Five personality system or MBTI.
- Rationality alone and in a group.
- Why soccer is a supreme entertainment product, aesthetic experience, and showcase of virtue.
- MMA as a gateway drug to loving sports.