A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian.
Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions.
In this post, I’ll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment.
Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack.
Modeling catastrophes
We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent’s observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.)
While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don’t arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle.
Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:
- The agent made a catastrophically bad decision.
- The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.
While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic.
Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure.
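As a rough illustration of the online protocol just described, here is a minimal sketch. The names `run_online`, `RunResult`, `toy_episode` and `toy_oracle` are hypothetical, and encoding a transcript as a plain tuple is an assumption made for the example. The oracle appears only as the referee that decides whether the run has failed; the agent never calls it during an episode.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Transcript = Tuple[str, ...]  # assumed encoding: a flat tuple of observations and actions


@dataclass
class RunResult:
    total_reward: float      # reward accumulated before any failure
    failed: bool             # True if a catastrophic transcript occurred
    episodes_completed: int  # number of episodes that finished safely


def run_online(
    run_episode: Callable[[int], Tuple[Transcript, float]],
    is_catastrophic: Callable[[Transcript], bool],
    num_episodes: int,
) -> RunResult:
    """Play episodes in order and stop at the first catastrophe."""
    total = 0.0
    for t in range(num_episodes):
        transcript, reward = run_episode(t)
        # The referee, not the agent, applies the oracle to the finished transcript.
        if is_catastrophic(transcript):
            return RunResult(total, True, t)
        total += reward
    return RunResult(total, False, num_episodes)


# Toy example (hypothetical): the agent speeds in a residential area on the third episode.
def toy_episode(t: int) -> Tuple[Transcript, float]:
    if t == 2:
        return ("residential", "65 mph"), 1.0
    return ("residential", "25 mph"), 0.5


def toy_oracle(transcript: Transcript) -> bool:
    return transcript == ("residential", "65 mph")


result = run_online(toy_episode, toy_oracle, num_episodes=10)
print(result)
```

The two fields `failed` and `total_reward` correspond to the two performance parameters: whether a catastrophic failure happened, and the reward collected otherwise.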
We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic.
Traditionally in RL the goal is to get as much reward as the best policy from some class C. We’ll slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision.
Batch learning
I’ve described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes.
In the batch version, there is no penalty for catastrophes at training time, and we don’t care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.
The oracle
This definition depends on an oracle who determines which transcripts are catastrophic.
For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn’t cut it.
In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem.
Approach
Learning with catastrophes is straightforward given an unlimited number of queries to the catastrophe oracle. Given any online learning algorithm A, we can “harden” it by running the following process before the beginning of each episode (a similar process can be applied to a batch learner):
1. Search over all possible environments, running A on each one to obtain a transcript.
2. If we find a catastrophic transcript τ, then add τ to A’s training data with a reward of −1, and go back to step 1.
It’s easy to prove that this process converges, if A is competent: the number of times we invoke step 2 is at most the time required to learn an optimal catastrophe-free policy (plus the number of episodes).
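The hardening process above can be sketched on a toy problem. Everything here is an assumption for illustration: `TabularAgent` and `harden` are hypothetical names, an "environment" is reduced to a single observation so that the exhaustive search is a short loop, and the learner simply stops taking any action it has been penalized for.

```python
from typing import Set, Tuple

Transcript = Tuple[str, str]  # assumed encoding: (observation, action)


class TabularAgent:
    """Toy learner: takes its first preferred action that has not been punished."""

    def __init__(self, preferred_actions):
        self.preferred_actions = list(preferred_actions)
        self.punished: Set[Transcript] = set()

    def act(self, observation: str) -> str:
        for action in self.preferred_actions:
            if (observation, action) not in self.punished:
                return action
        return self.preferred_actions[-1]  # every action was punished; fall back

    def learn(self, transcript: Transcript, reward: float) -> None:
        if reward < 0:
            self.punished.add(transcript)


def harden(agent, environments, is_catastrophic, max_rounds=1000) -> int:
    """Repeat the exhaustive search until no environment yields a catastrophe.

    Returns the number of catastrophic transcripts added to the training data.
    """
    for rounds in range(max_rounds):
        # Search: run the agent on every environment (here, every observation).
        transcripts = ((obs, agent.act(obs)) for obs in environments)
        bad = next((t for t in transcripts if is_catastrophic(t)), None)
        if bad is None:
            return rounds
        # Penalize the catastrophic transcript, then search again from the start.
        agent.learn(bad, -1.0)
    raise RuntimeError("agent did not become catastrophe-free within max_rounds")


environments = ["highway", "residential"]
agent = TabularAgent(["fast", "slow"])
rounds = harden(agent, environments, lambda t: t == ("residential", "fast"))
print(rounds, agent.act("highway"), agent.act("residential"))  # prints: 1 fast slow
```

With two observations the search is trivial; with realistic environments the same loop would have to enumerate an exponentially large space.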
The big problem with this algorithm is the exponential search in step 1.
Adversarial training
The most obvious approach to tractable learning with catastrophes is to replace step 1 with a tractable algorithm that approximates the search. This process is an example of adversarial training.
We can formalize this as a simple game between the agent A and an adversary B. The two agents alternate taking moves, with A playing the role of the agent and B playing the role of the environment. B wins if the resulting transcript is catastrophic, A wins if it is not catastrophic.
Before each episode, we play this game N times, where N > 0 is a hyperparameter. If B is able to win any of the games, then we add the resulting transcript to A’s training data and start again. Once B is not able to win any games, then we move on to a “real” episode. We hope that B is good enough at the game that if it can’t produce a catastrophic transcript, then the real environment can’t produce a catastrophic transcript either.
More precisely, before each episode we perform the following process:
1. Set i = 0.
2. A and B alternate taking moves, producing a transcript τ.
3. If τ is catastrophic, we add τ to A’s training data with a reward of −1, and add τ to B’s training data with a reward of +1. Then we go back to step 1.
4. If τ is not catastrophic, we add τ to B’s training data with a reward of −1.
5. If i < N, we increment i and go back to step 2.
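A minimal sketch of the pre-episode procedure above, under stated assumptions. The function `red_team` and its callback parameters are hypothetical names. The counter is read as requiring N consecutive games that B fails to win, matching the prose description of playing the game N times. In the toy usage, B is a fixed adversary that cycles through observations rather than a learner, and A simply avoids any transcript it has been penalized for.

```python
from itertools import cycle


def red_team(play_game, is_catastrophic, train_a, train_b, n, max_games=10_000):
    """Play A against B until B fails to win n games in a row.

    Returns the total number of games played.
    """
    i = 0
    games = 0
    while i < n:
        if games >= max_games:
            raise RuntimeError("B kept winning; A did not become safe in time")
        transcript = play_game()
        games += 1
        if is_catastrophic(transcript):
            train_a(transcript, -1.0)  # A is penalized for the catastrophe
            train_b(transcript, +1.0)  # B is rewarded for finding it
            i = 0                      # start the count again
        else:
            train_b(transcript, -1.0)  # B is penalized for failing to win
            i += 1
    return games


# Toy usage (all names below are hypothetical).
punished = set()  # A's training data: transcripts it was penalized for
b_log = []        # B's training data: (transcript, reward) pairs
b_moves = cycle(["highway", "residential", "school zone"])


def a_act(obs):
    return "slow" if (obs, "fast") in punished else "fast"


def play_game():
    obs = next(b_moves)
    return (obs, a_act(obs))


def is_catastrophic(transcript):
    obs, action = transcript
    return action == "fast" and obs != "highway"


games = red_team(
    play_game,
    is_catastrophic,
    train_a=lambda t, r: punished.add(t),
    train_b=lambda t, r: b_log.append((t, r)),
    n=3,
)
print(games, sorted(punished))
```

Whether the real environment is then safe depends entirely on how good B is at finding catastrophes, which this toy adversary does not address.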
I discuss this idea in more detail in my post on red teams. There are serious problems with this approach and I don’t think it can work on its own, but fortunately it seems combinable with other techniques.
Conclusion
Learning with catastrophes is a very general model of catastrophic failures which avoids being obviously impossible. I think that designing competent algorithms for learning with catastrophes may be an important ingredient in a successful approach to AI alignment.
This was originally posted here on 28th May, 2016.
Epistemic Status: Tentative
I’m fairly anti-hierarchical, as things go, but the big challenge to all anti-hierarchical ideologies is “how feasible is this in real life? We don’t see many examples around us of this working well.”
Backing up, for a second, what do we mean by a hierarchy?
I take it to mean a very simple thing: hierarchies are systems of social organization where some people tell others what to do, and the subordinates are forced to obey the superiors. This usually goes along with special privileges or luxuries that are only available to the superiors. For instance, patriarchy is a hierarchy in which wives and children must obey fathers, and male heads of families get special privileges.
Hierarchy is a matter of degree, of course. Power can vary in the severity of its enforcement penalties (a government can jail you or execute you, an employer can fire you, a religion can excommunicate you, the popular kids in a high school can bully or ostracize you), in its extent (a totalitarian government claims authority over more aspects of your life than a liberal one), or its scale (an emperor rules over more people than a clan chieftain.)
Power distance is a concept from the business world that attempts to measure the level of hierarchy within an organization or culture. Power distance is measured by polling less-powerful individuals on how much they “accept and expect that power is distributed unequally”. In low power distance cultures, there’s more of an “open door” policy, subordinates can talk freely with managers, and there are few formal symbols of status differentiating managers from subordinates. In “high power distance” cultures, there’s more formality, and subordinates are expected to be more deferential. According to Geert Hofstede, the inventor of the power distance index (PDI), Israel and the Nordic countries have the lowest power distance index in the world, while Arab, Southeast Asian, and Latin American countries have the highest. (The US is in the middle.)
I share with many other people a rough intuition that hierarchy poses problems.
This may not be as obvious as it sounds. In high power distance cultures, empirically, subordinates accept and approve of hierarchy. So maybe hierarchy is just fine, even for the “losers” at the bottom? But there’s a theory that subordinates claim to approve of hierarchy as a covert way of getting what power they can. In other words, when you see peasants praising the benevolence of landowners, it’s not that they’re misled by the governing ideology, and not that they’re magically immune to suffering from poverty as we would in their place, but just that they see their situation as the best they can get, and a combination of flattery and (usually religious) guilt-tripping is their best chance for getting resources from the landowners. So, no, I don’t think you can assume that hierarchy is wholly harmless just because it’s widely accepted in some societies. Being powerless is probably bad, physiologically and psychologically, for all social mammals.
But to what extent is hierarchy necessary?
Structurelessness and Structures
Nominally non-hierarchical organizations often suffer from failure modes that keep them from getting anything done, and actually wind up quite hierarchical in practice. I don’t endorse everything in Jo Freeman’s famous essay on the Tyranny of Structurelessness, but it’s important as an account of actual experiences in the women’s movement of the 1970s.
When organizations have no formal procedures or appointed leaders, everything goes through informal networks; this devolves into popularity contests, privileges people who have more free time to spend on gossip, as well as people who are more privileged in other ways (including economically), and completely fails to correlate decision-making power with competence.
Freeman’s preferred solution is to give up on total structurelessness and accept that there will be positions of power in feminist organizations, but to make those positions of power legible and limited, with methods derived from republican governance (which are also traditional in American voluntary organizations.) Positions of authority should be limited in scope (there is a finite range of things an executive director is empowered to do), accountable to the rest of the organization (through means like voting and annual reports), and impeachable in cases of serious ethical violation or incompetence. This is basically the governance structure that nonprofits and corporations use, and (in my view) it helps make them, say, less likely to abuse their members than cults and less likely to break up over personal drama than rock bands.
Freeman, being more egalitarian than the republican tradition, also goes further with her recommendations and says that responsibilities should be rotated (so no one person has “ownership” over a job forever), that authority should be distributed widely rather than concentrated, that information should be diffused widely, and that everyone in the organization should have equal access to organizational resources. Now, this is a good deal less hierarchical than the structure of republican governments, nonprofits, and corporations; it is still pretty utopian from the point of view of someone used to those forms of governance, and I find myself wondering if it can work at scale; but it’s still a concession to hierarchy relative to the “natural” structurelessness that feminist organizations originally envisioned.
Freeman says there is one context in which a structureless organization can work; a very small team (no more than five) of people who come from very similar backgrounds (so they can communicate easily), spend so much time together that they practically live together (so they communicate constantly), and are all capable of doing all “jobs” on the project (no need for formal division of labor.) In other words, she’s describing an early-stage startup!
I suspect Jo Freeman’s model explains a lot about the common phenomenon of startups having “growing pains” when they get too large to work informally. I also suspect that this is a part of how startups stop being “mission-driven” and ambitious — if they don’t add structure until they’re forced to by an outside emergency, they have to hurry, and they adopt a standard corporate structure and power dynamics (including the toxic ones, which are automatically imported when they hire a bunch of people from a toxic business culture all at once) instead of having time to evolve something that might achieve the founders’ goals better.
But Can It Scale? Historical Stateless Societies
So, the five-person team of friends is a non-hierarchical organization that can work. But that’s not very satisfying for anti-authoritarian advocates, because it’s so small. And, accordingly, an organization that small is usually poor — there’s only so many resources that five people can produce.
(Technology can amplify how much value a single person can produce. This is probably why we see more informal cultures among people who work with high-leverage technology. Software engineers famously wear t-shirts, not suits; Air Force pilots have a reputation as “hotshots” with lax military discipline compared to other servicemembers. Empowered with software or an airplane, a single individual can be unusually valuable, so less deference is expected of the operators of high technology.)
When we look at historical anarchies or near-anarchies, we usually also see that they’re small, poor, or both. We also see that within cultures, there is often surprisingly more freedom for women among the poor than among the rich.
Medieval Iceland from the tenth to thirteenth centuries was a stateless society, with private courts of law, and competing legislative assemblies (Icelanders could choose which assembly and legal code to belong to), but no executive branch or police. (In this, it was an unusually pure form of anarchy but not unique — other medieval European polities had much more private enforcement of law than we do today, and police are a 19th-century invention.)
The medieval Icelandic commonwealth lasted long enough (longer than the United States has so far existed) that it was clear this was a functioning system, not a brief failed experiment. And it appears that it was less violent, not more, compared to other medieval societies. Even when the commonwealth was beginning to break down in the thirteenth century, battles had low casualty rates, because every man killed still had to be paid for! The death toll during the civil war that ended the commonwealth’s independence was only as high per capita as the current murder rate of the US. While Christianization in neighboring Norway was a violent struggle, the decision of whether to convert to Christianity in Iceland was decided peacefully through arbitration. In this case, it seems clear that anarchy brought peace, not war.
However, medieval Iceland was small — only 50,000 people, confined to a harsh Arctic environment, and ethnically homogeneous.
Other historical and traditional stateless societies are and were also relatively poor and low in population density. The Igbo of Nigeria traditionally governed by council and consensus, with no kings or chiefs, but rather a sort of village democracy. This actually appears to be fairly common in small polities. The Iroquois Confederacy governed by council and had no executive. (Note that the Iroquois are a hoe culture.) The Nuer of Sudan, a pastoral society currently with a population of a few million, have traditionally had a stateless society with a system of feud law — they had judges, but no executives. There are many more examples — perhaps most familiar to Westerners, the society depicted in the biblical book of Judges appears to have had no king and no permanent war-leader, but only judges who would decide cases which would be privately enforced. In fact, stateless societies with some form of feud law seem to be a pretty standard and recurrent type of political organization, but mostly in “primitive” communities — horticultural or pastoral, low in population density. This sounds like bad news for modern-day anarchists who don’t want to live in primitive conditions. None of these historical stateless societies, even the comparatively sophisticated Iceland, are urban cultures!
It’s possible that the Harappan civilization in Bronze Age India had no state, while it had cities that housed tens of thousands of people, were planned on grids, and had indoor plumbing. The Harappans left no massive tombs, no palaces or temples, houses of highly uniform size (indicating little wealth inequality), no armor and few weapons (despite advanced metalworking), no sign of battle damage on the cities or violent death in human remains, and very minimal city walls. The Harappan cities were commercial centers, and the Harappans engaged in trade along the coast of India and as far as Afghanistan and the Persian Gulf. Unlike other similar river-valley civilizations (such as Mesopotamia), the Harappans had so much arable land, and farmsteads so initially spread out, that populations steadily grew and facilitated long-distance trade without having to resort to raiding, so they never developed a warrior class. If so, this is a counterexample to the traditional story that all civilizations developed states (usually monarchies) as a necessary precondition to developing cities and grain agriculture.
Bali is another counterexample. Rice farming in Bali requires complex coordination of irrigation. This was traditionally not organized by kings, but by subaks, religious and social organizations that supervise the growing of rice, supervised by a decentralized system of water temples, and led by priests who kept a ritual calendar for timing irrigation. While precolonial Bali was not an anarchy but a patchwork of small principalities, large public works like irrigation were not under state control.
So we have reason to believe that Bronze Age levels of technological development (cities, metalworking, intensive agriculture, literacy, long-distance trade, and high populations) can be developed without states, at scales involving millions of people, for centuries. We also have much more abundant evidence, historical and contemporary, of informal governance-by-council and feud law existing stably at lower technology levels (for pastoralists and horticulturalists). And, in special political circumstances (the Icelanders left Norway to settle a barren island, to escape the power of the Norwegian king, Harald Fairhair) an anarchy can arise out of a state society.
But we don’t have successful examples of anarchies at industrial tech levels. We know industrial-technology public works can be built by voluntary organizations (e.g. the railroads in the US) but we have no examples of them successfully resisting state takeover for more than a few decades.
Is there something about modern levels of high technology and material abundance that is incompatible with stateless societies? Or is it just that modern nation-states happened to already be there when the Industrial Revolution came around?
Women’s Status and Material Abundance
A very weird thing is that women’s level of freedom and equality seems almost to anticorrelate with wealth and technological advancement.
Horticultural (or “hoe culture”) societies are non-patriarchal and tend to allow women more freedom and better treatment in various ways than pre-industrial agricultural societies. For instance, severe mistreatment of women and girls like female infanticide, foot-binding, honor killings, or sati, and chastity-oriented restrictions on female freedom like veiling and seclusion, are common in agricultural societies and unknown in horticultural ones. But horticultural societies are poor in material culture and can’t sustain high population densities in most cases.
You also see unusual freedom for women in premodern pastoral cultures, like the Mongols. Women in the Mongol Empire owned and managed ordos, mobile cities of tents and wagons which also comprised livestock and served as trading hubs. While the men focused on hunting and war, the women managed the economic sphere. Mongol women fought in battle, herded livestock, and occasionally ruled as queens. They did not wear veils or bind their feet.
We see numerous accounts of ancient and medieval women warriors and military commanders among Germanic and Celtic tribes and steppe peoples of Central Asia. There are also accounts of medieval European noblewomen who personally led armies. The pattern isn’t obvious, but there seem to be more accounts of women military leaders in pastoral societies or tribal ones than in large, settled empires.
Pastoralism, to a lesser extent than horticulture but still more than plow agriculture, gives women an active role in food production. Most pastoral societies today have a traditional division of labor in which men are responsible for meat animals and women are responsible for milk animals (as well as textiles). Where women provide food, they tend to have more bargaining power. Some pastoral societies, like the Tuareg, are even matrilineal; Tuareg women traditionally have more freedom, including sexual freedom, than they do in other Muslim cultures, and women do not wear the veil while men do.
Like horticulture, pastoralism is less efficient per acre at food production than agriculture, and thus does not allow high population densities. Pastoralists are poorer than their settled farming neighbors. This is another example of women being freer when they are also poorer.
Another weird and “paradoxical” but very well-replicated finding is that women are more different from men in psychological and behavioral traits (like Big 5 personality traits, risk-taking, altruism, participation in STEM careers) in richer countries than in poorer ones. This isn’t quite the same as women being less “free” or having fewer rights, but it seems to fly in the face of the conventional notion that as societies grow richer, women become more equal to men.
Finally, within societies, it’s sometimes the case that poor women are treated better than rich ones. Sarah Blaffer Hrdy writes about observing that female infanticide was much more common among wealthy Indian Rajput families than poor ones. And we know of many examples across societies of aristocratic or upper-class women being more restricted to the domestic sphere, married off younger, less likely to work, more likely to experience restrictive practices like seclusion or footbinding, than their poorer counterparts.
Hrdy explains why: in patrilineal societies, men inherit wealth and women don’t. If you’re a rich family, a son is a “safe” outcome: he’ll inherit your wealth, and your grandchildren through him will be provided for, no matter whom he marries. A daughter, on the other hand, is a risk. You’ll have to pay a dowry when she marries, and if she marries “down” her children will be poorer than you are. At the very top of the social pyramid, there’s nowhere to marry but down. This means that you have an incentive to avoid having daughters, and if you do have daughters, you’ll be very anxious to avoid them making a bad match, which means lots of chastity-enforcement practices. You’ll also invest more in your sons than daughters in general, because your grandchildren through your sons will have a better chance in life than your grandchildren through your daughters.
The situation reverses if you’re a poor family. Your sons are pretty much screwed; they can’t marry into money (since women don’t inherit). Your daughters, on the other hand, have a chance to marry up. So your grandchildren through your daughters have better chances than your grandchildren through your sons, and you should invest more resources in your daughters than your sons. Moreover, you might not be able to afford restrictive practices that cripple your daughters’ ability to work for a living. To some extent, sexism is a luxury good.
A similar analysis might explain why richer countries have larger gender differences in personality, interests, and career choices. A degree in art history might function as a gentler equivalent of purdah — a practice that makes a woman a more appealing spouse but reduces her earning potential. You expect to find such practices more among the rich than the poor. (Tyler Cowen’s take is less jaundiced, and more general, but similar — personal choices and “personality” itself are more varied when people are richer, because one of the things people “buy” with wealth is the ability to make fulfilling but not strictly pragmatic self-expressive choices.)
Finally, all these “paradoxical” trends are countered by the big nonparadoxical trend — by most reasonable standards, women are less oppressed in rich liberal countries than in poor illiberal ones. The very best countries for women’s rights are also the ones with the lowest power distance: Nordic and Germanic countries.
Is Hierarchy the Engine of Growth or a Luxury Good?
If you observe that the “freest” (least hierarchical, lowest power distance, least authoritarian, etc) functioning organizations and societies tend to be small, poor, or primitive, you could come to two different conclusions:
- Freedom causes poverty (in other words, non-hierarchical organization is worse than hierarchy at scaling to large organizations or rich, high-population societies)
- Hierarchy is expensive (in other words, only the largest organizations or richest societies can afford the greatest degree of authoritarianism.)
The first possibility is bad news for freedom. It means you should worry you can’t scale up to wealth for large populations without implementing hierarchies. The usual mechanism proposed for this is the hypothesis that hierarchies are needed to coordinate large numbers of people in large projects. Without governments, how would you build public works? Or guard the seas for global travel and shipping? Without corporate hierarchies, how would you get mass-produced products to billions of people? Sure, idealists have proposed alternatives to hierarchy, but these tend to be speculative or small-scale and the success stories are sporadic.
The second possibility is (tentative) good news for freedom. It says that hierarchy is inefficient. For instance, secluding women in harems wastes their productive potential. Top-down state control of the economy causes knowledge problems that limit economic productivity. The same problem applies to top-down control of decisionmaking in large firms. Dominance hierarchies inhibit accurate transmission of information, which worsens knowledge problems and principal-agent problems (“communication is only possible between equals.”) And elaborate displays of power and deference are costly, as nonproductive displays always are. Only accumulations of large amounts of resources enable such wasteful activity, which benefits the top of the hierarchy in the short run but prevents the “pie” of total resources from growing.
This means that if you could just figure out a way to keep inefficient hierarchies from forming, you could grow systems to be larger and richer than ever. Yes, historically, Western economies grew richer as states grew stronger — but perhaps a stateless society could be richer still. Perhaps without the stagnating effects of rent-seeking, we could be hugely better off.
After all, this is kind of what liberalism did. It’s the big counter-trend to “wealth and despotism go together” — Western liberal-democratic countries are much richer and much less authoritarian (and less oppressive to women) than any pre-modern society, or than developing countries. One of the observations in Wealth of Nations is that countries with strong middle classes had more subsequent economic growth than countries with more wealth inequality — Smith uses England as an example of a fast-growing, equal society and China as an example of a stagnant, unequal one.
But this is only partial good news for freedom, after all. If hierarchies tend to emerge as soon as size, scale, and wealth arise, then that means we don’t have a solution to the problem of preventing them from emerging. On a model where any sufficiently large accumulation of resources begins to look attractive to “robber barons” who want to appropriate it and forcibly keep others out, we might hypothesize that a natural evolution of all human institutions is from an initial period of growth and value production towards inevitable value capture, stagnation, and decline. We see a lack of freedom in the world around us, not because freedom can’t work well, but because it’s hard to preserve against the incursions of wannabe despots, who eventually ruin the system for everyone including themselves.
That model points the way to new questions, surrounding the kinds of governance that Jo Freeman talks about. By default an organization will succumb to inefficient hierarchy, and structureless organizations will succumb faster and to more toxic hierarchies. When designing governance structures, the question you want to ask is not just “is this a system I’d want to live under today?” but “how effective will this system be in the future at resisting the guys who will come along and try to take over and milk it for short-term personal gain until it collapses?” And now we’re starting to sound like the rationale and reasoning behind the U.S. Constitution, though I certainly don’t think that’s the last word on the subject.
Originally posted at sandymaguire.me
I want to share a piece of ridiculously obvious advice today.
I've got a bad habit, which is being too smart for my own good. Which is to say, when I want to learn something new, too often I spend my time making tools to help me learn, rather than just learning the thing.
Take, for example, the first time I tried to learn how to play jazz music.
There's only one thing that I'm really good at, which is programming. The central tenet in programming is that "laziness is good," and if you're faced with doing something boring and repetitive, you should instead automate that thing away.
When all you have is a hammer...
According to The Book, the first thing to do to learn jazz is to learn your scales---in every mode for every key for several varieties of harmony. There are 12 notes, and seven modes, and at least four harmonies. That's what, like 336 different scales to learn?
"WHO HAS TIME FOR ALL THAT CRAP," I thought. "I'LL JUST WRITE A COMPUTER PROGRAM TO GENERATE THE SCALES FOR ME, AND THEN PLAY THOSE."
In retrospect, this was a terrible plan. Not only did it not get me closer to my goal of knowing how to play jazz music, I also didn't know enough about the domain to successfully model it. It's funny to read back through that blog post with the benefit of hindsight, but at the time I really thought I was onto something!
That's not to say it was wasted effort nor that it was useless, merely that it wasn't actually moving me closer to my stated goal of being able to play jazz music. It was scratching my itch for mental masturbation, and was a good exercise in attempting to model things I don't understand very well, but crucially, it wasn't helping.
Or take another example, a more recent foray into music for me---only a few weeks ago. This time I had more of a plan; I was taking piano lessons and getting advice on how to practice from my teacher. One of the things he suggested I do was to solo around in the minor pentatonic scale. And so I did, starting in C, and (tentatively) moving to G.
But doing it in Bb was hard! Rather than spend the two minutes that would be required to work out what notes I should play in the Bb minor pentatonic, I decided it would be better to write a computer program! This time it would connect to my keyboard and "listen" to the notes I played, and flash red whenever I played a note that wasn't in the Bb minor pentatonic. I guess the reasoning was "I'll train myself to play the right notes subconsciously." Or something.
I spent like 15 hours writing this computer program.
This attempt was arguably more helpful than my first computer program, but again, it's a pretty fucking roundabout way of accomplishing the goal. Here we are, four weeks later, and I still don't know how to noodle around in the Bb minor pentatonic.
Like I said. Too smart for my own good.
There's a happy ending to this story, however. Earlier this week, I decided I was going to actually learn how to play jazz music. So I started reading The Book again, and when I got to the scale exercises, I decided I'd just give them a go. No computers. Just the boring, repetitive stuff it said would make me a great jazz musician.
The book even gave me some suggestions on how to minimize the number of exercises I need to do: rather than playing every mode in every key (e.g. C ionian, then G ionian, then A ionian, and so on until it's time to play dorians), it suggests playing C ionian followed by D dorian followed by E phrygian. These scales all share the same notes, so they're more-or-less the same thing, which means I actually only need to practice 12 things, rather than 84 (the other 250 can likewise be compressed together).
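The claim that those modes share the same notes is easy to check by hand, but for the curious, here is a minimal Python sketch of it. The helper name `mode_notes` and the flat-based note spellings are illustrative assumptions, not anything taken from The Book.

```python
# Step pattern of the major (ionian) scale, in semitones.
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]
NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]
MODE_NAMES = ["ionian", "dorian", "phrygian", "lydian",
              "mixolydian", "aeolian", "locrian"]


def mode_notes(root, mode):
    """Return the seven notes of `mode` starting on `root`."""
    # Each mode is the major step pattern rotated by its position in MODE_NAMES.
    shift = MODE_NAMES.index(mode)
    steps = MAJOR_STEPS[shift:] + MAJOR_STEPS[:shift]
    pitch = NOTE_NAMES.index(root)
    notes = [root]
    # The last step only returns to the root, so it is skipped.
    for step in steps[:-1]:
        pitch = (pitch + step) % 12
        notes.append(NOTE_NAMES[pitch])
    return notes


c_ionian = mode_notes("C", "ionian")
d_dorian = mode_notes("D", "dorian")
e_phrygian = mode_notes("E", "phrygian")

# All three modes use the same seven notes, just starting from different places.
same_notes = set(c_ionian) == set(d_dorian) == set(e_phrygian)

# 12 roots times 7 modes gives the 84 scales for one kind of harmony.
scale_count = len(NOTE_NAMES) * len(MODE_NAMES)
```

Calling `mode_notes("Ab", "ionian")` gives the Ab major scale in the same way.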
If I had been patient, I would have read that PRO-TIP the first time around. It probably wouldn't have helped me make less-"smart" decisions, but it's worth keeping in mind that I could be two years ahead of where I am today if I were better at keeping my eye on the ball.
One of the scales the book made me do was Ab major---something I'd literally never once played in my twenty years of piano. It started on a black note and always felt too hard to actually do. I approached it with trepidation, but realized that it only took about three minutes to figure out.
The thing I'd been putting off for twenty years out of fear only took three minutes to accomplish.
I've often wondered why it seems like all of the good musicians have been playing their instruments for like 25 years. Surely music can't be that hard---you can get pretty fucking good at most things in six months of dedicated study. But in the light of all of this, it makes sense. If everyone learns music as haphazardly as I've been doing it, it's no wonder that it takes us all so long.
What have you been putting off out of fear? Are you sure it's as hard as it seems?
I've just noticed that the number of votes shown on my recent alignment forum post seems to actually correspond to the number of votes it's received on Less Wrong, rather than just counting the alignment forum votes. Not sure if this is intentional, but for me it makes the feature less useful. Not a priority though.
If you're in a closed space, you may want to open a window.
before the industrial revolution, the atmosphere had 300 parts-per-million (PPM) of CO2. today, this number is already above 400 on average, and 500 in urban areas.
but CO2 doesn't just affect the environment. high enough levels of it also affect our bodies, and our minds.
so let's leave the atmosphere for a bit, and go inside. one study checked office employees' decision-making skills at various CO2 levels. here are some of the results:
this level is common in poorly ventilated spaces like workrooms and offices. and one study on schools in several US districts found 50% of classrooms to have this level.
at this CO2 level the cognitive function in the office experiment decreased by 15%.
this level can also be reached at the places described above.
here cognitive function decreased by 50%!
from this level onward some people described other side effects such as: slight nausea, loss of attention and poor concentration, sleepiness, headaches, and increased heart rates.
and still, these levels aren't uncommon -
this is common in cars and bedrooms (closed spaces which are either small, you spend a long time in, or both). and the side effects increase.
motorcycle helmets can reach these levels. Being in such an environment for long times can harm your long-term health.
So what can you do?
1. simply open a window! (at least in this part of the century)
2. you can get some plants for your room or office -
This lung institute guide seems to be based on this study, so i suggest reading it.
3. buy a CO2 monitor if you want to always know what environment you're in. though, these seem to cost quite a bit (for a reason unclear to me). so i don't know if it will really benefit you. i know i won't bother.
The IPCC reported that CO2 levels will be, by the end of the century, between 541 and 970ppm. if we extrapolate from the previous study, this may mean a 10-15% decrease in the cognitive function of humanity as a species (and even more than the previous results in closed spaces).
some studies found evidence that air pollution can harm the brain itself.
Should this change our attitude towards climate change as a catastrophic risk?
Saw it on Hacker News, discussion here: https://news.ycombinator.com/item?id=18965274
Formal methods seem very relevant to AI safety, and I haven't seen much discussion of them on Less Wrong.
Or, what is the best way to think about "what is a question?" on LW?
The LW Team just had a retreat where we thought through a lot of high level strategy. We have a lot of ideas building off of the "questions" feature.
One thing that struck us is that a lot of early stage research has less to do with formalizable questions, and more to do with noticing anomalies in your current model/paradigm. Something feels off that you can't explain, or there's a concept you don't even understand well enough to ask a coherent question about.
The "question" feature was meant, in part, to reduce the cost of exploring early stage curiosity, but we wondered if it might even be slightly too formalized.
Just as there was technically nothing stopping you from asking a question as a post (though adding the feature caused a proliferation of questions), there is nothing stopping you from asking an ill-formed question. But maybe changing the language slightly would better encourage early-stage curiosity.
How would you feel if we changed "Ask a question" to "Pose a confusion" or something like that? (The main issue so far is that "pose confusion" is, well, way more confusing since it's a non-standard phrase. Other options include literally saying "Ask question/Pose Confusion" [i.e. both at once, so you get the benefit of the clear-cut "ask question"], or some word other than "pose.")
(Somewhat but not-entirely-jokingly, we also noticed people are hesitant to post "answers" since they sound like you're trying to claim you know what you're talking about. We jokingly considered "Post a deconfusion", or "post a partial answer" as options)
Cooperative IRL as a definition of human-AI group rationality, and an empirical evaluation of theory of mind vs. model learning in HRI
AI Alignment Podcast: Cooperative Inverse Reinforcement Learning (Lucas Perry and Dylan Hadfield-Menell) (summarized by Richard): Dylan puts forward his conception of Cooperative Inverse Reinforcement Learning as a definition of what it means for a human-AI system to be rational, given the information bottleneck between a human's preferences and an AI's observations. He notes that there are some clear mismatches between this problem and reality, such as the CIRL assumption that humans have static preferences, and how fuzzy the abstraction of "rational agents with utility functions" becomes in the context of agents with bounded rationality. Nevertheless, he claims that this is a useful unifying framework for thinking about AI safety.
Dylan argues that the process by which a robot learns to accomplish tasks is best described not just as maximising an objective function but instead in a way which includes the system designer who selects and modifies the optimisation algorithms, hyperparameters, etc. In fact, he claims, it doesn't make sense to talk about how well a system is doing without talking about the way in which it was instructed and the type of information it got. In CIRL, this is modeled via the combination of a "teaching strategy" and a "learning strategy". The former can take many forms: providing rankings of options, or demonstrations, or binary comparisons, etc. Dylan also mentions an extension of this in which the teacher needs to learn their own values over time. This is useful for us because we don't yet understand the normative processes by which human societies come to moral judgements, or how to integrate machines into that process.
On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al): In human-robot interaction (HRI), we often require a model of the human that we can plan against. Should we use a specific model of the human (a so-called "theory of mind", where the human is approximately optimizing some unknown reward), or should we simply learn a model of the human from data? This paper presents empirical evidence comparing three algorithms in an autonomous driving domain, where a robot must drive alongside a human.
The first algorithm, called Theory of Mind based learning, models the human using a theory of mind, infers a human reward function, and uses that to predict what the human will do, and plans around those actions. The second algorithm, called Black box model-based learning, trains a neural network to directly predict the actions the human will take, and plans around those actions. The third algorithm, model-free learning, simply applies Proximal Policy Optimization (PPO), a deep RL algorithm, to directly predict what action the robot should take, given the current state.
Quoting from the abstract, they "find that there is a significant sample complexity advantage to theory of mind methods and that they are more robust to covariate shift, but that when enough interaction data is available, black box approaches eventually dominate". They also find that when the ToM assumptions are significantly violated, then the black-box model-based algorithm will vastly surpass ToM. The model-free learning algorithm did not work at all, probably because it cannot take advantage of knowledge of the dynamics of the system and so the learning problem is much harder.
Rohin's opinion: I'm always happy to see an experimental paper that tests how algorithms perform; I think we need more of these.
You might be tempted to think of this as evidence that in deep RL, a model-based method should outperform a model-free one. This isn't exactly right. The ToM and black box model-based algorithms use an exact model of the dynamics of the environment modulo the human, that is, they can exactly predict the next state given the current state, the robot action, and the human action. The model-free algorithm must learn this from scratch, so it isn't an apples-to-apples comparison. (Typically in deep RL, both model-based and model-free algorithms have to learn the environment dynamics.) However, you can think of the ToM as a model-based method and the black box model-based algorithm as a model-free algorithm, where both algorithms have to learn the human model instead of the more traditional environment dynamics. With that analogy, you would conclude that model-based algorithms will be more sample efficient and more performant in low-data regimes, but will be outperformed by model-free algorithms with sufficient data, which agrees with my intuitions.
This kind of effect is a major reason for my position that the first powerful AI systems will be modular (analogous to model-based systems), but that they will eventually be replaced by more integrated, end-to-end systems (analogous to model-free systems). Initially, we will be in a (relatively speaking) low-data regime, where modular systems excel, but over time there will be more data and compute and we will transition to regimes where end-to-end systems will solve the same tasks better (though we may then have modular systems for more difficult tasks).
Technical AI alignment
Iterated amplification sequence
The reward engineering problem (Paul Christiano): The reward engineering problem is the problem of designing some form of reward feedback such that if the agent learns to get high reward, then we are happy with the resulting behavior. We assume that an overseer H is providing reward for an agent A, with H smarter than A.
(Aside: Why are we assuming H is smarter, when we are typically worried about AI that is more intelligent than we are? Because we could hope to use AI tools to create this reward function, so that the human + AI team is smarter than the AI being trained. In iterated amplification, this corresponds to the distillation step.)
We could imagine three approaches: direct supervision (where H provides a reward for every action A takes), imitation learning (where A tries to mimic H, perhaps by trying to fool a discriminator that distinguishes between A and H), and inverse reinforcement learning (where we infer a reward function by observing H's behavior and use that to supervise A).
With all of these approaches, we might hope that since H is smarter than A, A will not be able to trick H with some behavior that gets high reward but is actually undesirable. However, this argument has a major flaw. When A chooses an action, the process or computation by which it chooses that action is important information, which by default (especially with neural nets) only A has access to, and so this can allow it to fool H. For example, if A is tasked with drawing original artwork, it might copy and slightly change existing artwork, which H may not realize if it cannot see how the artwork was made. We could hope to fix this problem with transparency or explanations, but this requires a lot more research.
Imitation learning and IRL have the problem that A may not be capable of doing what H does. In that case, it will be off-distribution and may have weird behavior. Direct supervision doesn't suffer from this problem, but it is very time-inefficient. This could potentially be fixed using semi-supervised learning techniques.
Rohin's opinion: The information asymmetry problem between H and A seems like a major issue. For me, it's the strongest argument for why transparency is a necessary ingredient of a solution to alignment. The argument against imitation learning and IRL is quite strong, in the sense that it seems like you can't rely on either of them to capture the right behavior. These are stronger than the arguments against ambitious value learning (AN #31) because here we assume that H is smarter than A, which we could not do with ambitious value learning. So it does seem to me that direct supervision (with semi-supervised techniques and robustness) is the most likely path forward to solving the reward engineering problem.
There is also the question of whether it is necessary to solve the reward engineering problem. It certainly seems necessary in order to implement iterated amplification given current systems (where the distillation step will be implemented with optimization, which means that we need a reward signal), but might not be necessary if we move away from optimization or if we build systems using some technique other than iterated amplification (though even then it seems very useful to have a good reward engineering solution).
Capability amplification (Paul Christiano): Capability amplification is the problem of taking some existing policy and producing a better policy, perhaps using much more time and compute. It is a particularly interesting problem to study because it could be used to define the goals of a powerful AI system, and it could be combined with reward engineering above to create a powerful aligned system. (Capability amplification and reward engineering are analogous to amplification and distillation respectively.) In addition, capability amplification seems simpler than the general problem of "build an AI that does the right thing", because we get to start with a weak policy A rather than nothing, and we're allowed to take lots of time and computation to implement the better policy. It would be useful to tell whether the "hard part" of value alignment is in capability amplification, or somewhere else.
We can evaluate capability amplification using the concepts of reachability and obstructions. A policy C is reachable from another policy A if there is some chain of policies from A to C, such that at each step capability amplification takes you from the first policy to something at least as good as the second policy. Ideally, all policies would be reachable from some very simple policy. This is impossible if there exists an obstruction, that is a partition of policies into two sets L and H, such that it is impossible to amplify any policy in L to get a policy that is at least as good as some policy in H. Intuitively, an obstruction prevents us from getting to arbitrarily good behavior, and means that all of the policies in H are not reachable from any policy in L.
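To make the two definitions concrete, here is a toy sketch in Python. It assumes a finite set of named policies and a hypothetical `amplify` table listing which policies one amplification step can produce. It also treats the set H as already containing everything "at least as good as" its members. None of this comes from the original post; it is only an illustration of reachability as graph search and of an obstruction as a partition that no amplification step crosses.

```python
from collections import deque


def reachable(amplify, start):
    """Policies reachable from `start` by repeatedly applying amplification."""
    seen = {start}
    queue = deque([start])
    while queue:
        policy = queue.popleft()
        for better in amplify.get(policy, ()):
            if better not in seen:
                seen.add(better)
                queue.append(better)
    return seen


def is_obstruction(amplify, low, high):
    """True if no policy in `low` can be amplified into any policy in `high`."""
    return all(amplify.get(p, set()).isdisjoint(high) for p in low)


# Hypothetical toy table: amplification climbs from "random" to "competent",
# but no step ever produces "expert".
amplify = {
    "random": {"novice"},
    "novice": {"competent"},
    "competent": {"competent"},
    "expert": {"expert"},
}

from_random = reachable(amplify, "random")
blocked = is_obstruction(
    amplify, {"random", "novice", "competent"}, {"expert"}
)
```

In this toy table the partition into {random, novice, competent} and {expert} is an obstruction, so "expert" is not reachable from "random" however many times amplification is applied.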
We can do further work on capability amplification. With theory, we can search for challenging obstructions, and design procedures that overcome them. With experiment, we can study capability amplification with humans (something which Ought is now doing).
Rohin's opinion: There's a clear reason for work on capability amplification: it could be used as a part of an implementation of iterated amplification. However, this post also suggests another reason for such work -- it may help us determine where the "hard part" of AI safety lies. Does it help to assume that you have lots of time and compute, and that you have access to a weaker policy?
Certainly if you just have access to a weaker policy, this doesn't make the problem any easier. If you could take a weak policy and amplify it into a stronger policy efficiently, then you could just repeatedly apply this policy-improvement operator to some very weak base policy (say, a neural net with random weights) to solve the full problem. (In other variants, you have a much stronger aligned base policy, eg. the human policy with short inputs and over a short time horizon; in that case this assumption is more powerful.) The more interesting assumption is that you have lots of time and compute, which does seem to have a lot of potential. I feel pretty optimistic that a human thinking for a long time could reach "superhuman performance" by our current standards; capability amplification asks if we can do this in a particular structured way.
Value learning sequence
Reward uncertainty (Rohin Shah): Given that we need human feedback for the AI system to stay "on track" as the environment changes, we might design a system that keeps an estimate of the reward, chooses actions that optimize that reward, but also updates the reward over time based on feedback. This has a few issues: it typically assumes that the human Alice knows the true reward function, it makes a possibly-incorrect assumption about the meaning of Alice's feedback, and the AI system still looks like a long-term goal-directed agent where the goal is the current reward estimate.
This post takes the above AI system and considers what happens if you have a distribution over reward functions instead of a point estimate, and during action selection you take into account future updates to the distribution. (This is the setup of Cooperative Inverse Reinforcement Learning.) While we still assume that Alice knows the true reward function, and we still require an assumption about the meaning of Alice's feedback, the resulting system looks less like a goal-directed agent.
In particular, the system no longer has an incentive to disable the system that learns values from feedback: while previously it changed the AI system's goal (a negative effect from the goal's perspective), now it provides more information about the goal (a positive effect). In addition, the system has more of an incentive to let itself be shut down. If a human is about to shut it down, it should update strongly that whatever it was doing was very bad, causing a drastic update on reward functions. It may still prevent us from shutting it down, but it will at least stop doing the bad thing. Eventually, after gathering enough information, it would converge on the true reward and do the right thing. Of course, this is assuming that the space of rewards is well-specified, which will probably not be true in practice.
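A small sketch may help with the shutdown argument above. It keeps a distribution over two hypothetical candidate reward functions and applies Bayes' rule when the human reaches for the off switch. The reward values, the prior, and the likelihoods of the shutdown signal are all made-up assumptions, and the sketch only shows the update step, not the planning over future updates that the full setup involves.

```python
def update(prior, likelihoods):
    """Bayes' rule over a finite set of candidate reward functions."""
    unnorm = {name: prior[name] * likelihoods[name] for name in prior}
    total = sum(unnorm.values())
    return {name: weight / total for name, weight in unnorm.items()}


def expected_reward(belief, rewards, action):
    """Expected reward of `action` under the current belief."""
    return sum(belief[name] * rewards[name][action] for name in belief)


def best_action(belief, rewards, actions):
    """Action with the highest expected reward under the current belief."""
    return max(actions, key=lambda a: expected_reward(belief, rewards, a))


# Hypothetical candidate reward functions over two actions.
actions = ["smash_vase", "go_around"]
rewards = {
    "likes_vases": {"smash_vase": -1.0, "go_around": 0.5},
    "indifferent": {"smash_vase": 1.0, "go_around": 0.5},
}
prior = {"likes_vases": 0.2, "indifferent": 0.8}

before = best_action(prior, rewards, actions)

# Assumed feedback model: the human tries to shut the system down while it
# heads for the vase with probability 0.9 if they like vases, 0.05 otherwise.
posterior = update(prior, {"likes_vases": 0.9, "indifferent": 0.05})
after = best_action(posterior, rewards, actions)
```

Under these made-up numbers the system initially prefers `smash_vase`, and the shutdown attempt moves most of the probability mass onto `likes_vases`, so it switches to `go_around`. With a point estimate instead of a distribution there would be nothing for the shutdown signal to update.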
Following human norms (Rohin Shah): One approach to preventing catastrophe is to constrain the AI system to never take catastrophic actions, and not focus as much on what to do (which will be solved by progress in AI more generally). In this setting, we hope that our AI systems accelerate our rate of progress, but we remain in control and use AI systems as tools that allow us to make better decisions and better technologies. Impact measures / side effect penalties aim to define what not to do. What if we instead learn what not to do? This could look like inferring and following human norms, along the lines of ad hoc teamwork.
This is different from narrow value learning for a few reasons. First, narrow value learning also learns what to do. Second, it seems likely that norm inference only gives good results in the context of groups of agents, while narrow value learning could be applied in single-agent settings.
The main advantage of learning norms is that this is something that humans do quite well, so it may be significantly easier than learning "values". In addition, this approach is very similar to our ways of preventing humans from doing catastrophic things: there is a shared, external system of norms that everyone is expected to follow. However, norm following is a weaker standard than ambitious value learning (AN #31), and there are more problems as a result. Most notably, powerful AI systems will lead to rapidly evolving technologies that cause big changes in the environment that might require new norms; norm-following AI systems may not be able to create or adapt to these new norms.
Agent foundations
CDT Dutch Book (Abram Demski)
CDT=EDT=UDT (Abram Demski)
Learning human intent
AI Alignment Podcast: Cooperative Inverse Reinforcement Learning (Lucas Perry and Dylan Hadfield-Menell): Summarized in the highlights!
On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al): Summarized in the highlights!
What AI Safety Researchers Have Written About the Nature of Human Values (avturchin): This post categorizes theories of human values along three axes. First, how complex is the description of the values? Second, to what extent are "values" defined as a function of behavior (as opposed to being a function of eg. the brain's algorithm)? Finally, how broadly applicable is the theory: could it apply to arbitrary minds, or only to humans? The post then summarizes different positions on human values that different researchers have taken.
Rohin's opinion: I found the categorization useful for understanding the differences between views on human values, which can be quite varied and hard to compare.
Risk-Aware Active Inverse Reinforcement Learning (Daniel S. Brown, Yuchen Cui et al): This paper presents an algorithm that actively solicits demonstrations on states where it could potentially behave badly due to its uncertainty about the reward function. They use Bayesian IRL as their IRL algorithm, so that they get a distribution over reward functions. They use the most likely reward to train a policy, and then find a state from which that policy has high risk (because of the uncertainty over reward functions). They show in experiments that this performs better than other active IRL algorithms.
Rohin's opinion: I don't fully understand this paper -- how exactly are they searching over states, when there are exponentially many of them? Are they sampling them somehow? It's definitely possible that this is in the paper and I missed it, I did skim it fairly quickly.
Other progress in AI
Reinforcement learning
Soft Actor-Critic: Deep Reinforcement Learning for Robotics (Tuomas Haarnoja et al)
Deep learning
A Comprehensive Survey on Graph Neural Networks (Zonghan Wu et al)
Graph Neural Networks: A Review of Methods and Applications (Jie Zhou, Ganqu Cui, Zhengyan Zhang et al)
News
Olsson to Join the Open Philanthropy Project (summarized by Dan H): Catherine Olsson, a researcher at Google Brain who was previously at OpenAI, will be joining the Open Philanthropy Project to focus on grant making for reducing x-risk from advanced AI. Given her first-hand research experience, she has knowledge of the dynamics of research groups and a nuanced understanding of various safety subproblems. Congratulations to both her and OpenPhil.
Announcement: AI alignment prize round 4 winners (cousin_it): The last iteration of the AI alignment prize has concluded, with awards of $7500 each to Penalizing Impact via Attainable Utility Preservation (AN #39) and Embedded Agency (AN #31, AN #32), and $2500 each to Addressing three problems with counterfactual corrigibility (AN #30) and Three AI Safety Related Ideas/Two Neglected Problems in Human-AI Safety (AN #38).
Provisional topic: "Seeing Like A State" book review.
Nashville SSC meetup *aims* to meet the 4th Tuesday of every month at 7:00 central.
All welcome. Contact james[at]writechem.com
This post links to this blog’s posts discussing game design, balance, economics and related topics, as well as any strategy posts. It does not contain new content.
Much of the blog is relevant to gaming, but these are the explicitly on-topic posts.
Eternal Sequence
Artifact / Card Rebalancing Sequence
Game Reviews (Including Those Listed Above)
All games reviewed are recommended, we don’t generally waste time on unworthy games.
I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees' reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others - although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.
- Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.
- This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.
- Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments - nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.
- The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.
- This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction.
- The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations who slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in this blog post by Paul Christiano.
- More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment - e.g. the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent).
- The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building an AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling - however, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.
- An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)
- While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.
- Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.
- Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:
- AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.
- AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.
- AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.
- People could use AIs to hack critical infrastructure (including the other AIs which manage the aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or ‘data poisoning’.
- Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.
- Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018 (which I’ll link when it’s available online). Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development - something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)
- Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.
What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity - it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is evidence against their quality: if your conclusions remain the same but your reasons for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered important in your social group). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.
On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:
- They might be used deliberately.
- They might be set off accidentally.
- They might cause a nuclear chain reaction much larger than anticipated.
- They might destabilise politics, either domestically or internationally.
And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 30’s or early 40’s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.
I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.
This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. All opinions are my own.
It may depend on what we mean by “best”.
Epistemic status: I understand very little of anything.
Speculation about potential applications: regulating a logical prediction market, e.g. logical induction; constructing judges or competitors in e.g. alignment by debate; designing communication technology, e.g. to mitigate harms and risks of information warfare.
The slogan “the best ideas float to the top” is often used in social contexts. The saying goes, “in a free market of ideas, the best ideas float to the top”. Of course, it is not intended as a facts statement, as in “we have observed that this is the case”; it is instead a values statement, as in “we would prefer for this to be the case”.
In this essay, however, we will force an empirical interpretation, just to see what happens. I will provide three ways to consider the density of an idea, or the number assigned to how float-to-the-top an idea is.
In brief, an idea is a sentence, and you can vary the amount of its antecedent graph (like in Bayesian nets, NARS-like architectures) or function out of which it is printed (like in compression) that you want to consider at a given moment, up to resource allocation. This isn’t an entirely mathematical paper, so don’t worry about WFFs, parsers, etc., which is why I’ll stick with “ideas” instead of “sentences”. I will also be handwaving between "description of some world states" and "belief about how world states relate to each other".
Intuition
Suppose you observe wearers of teal hats advocate for policy A, but you don’t know what A is. You’re minding your business in an applebees parking lot when a wearer of magenta hats gets your attention to tell you “A is harmful”. There are two cases:
- Suppose A is “kicking puppies”, (and I don’t mean the wearer of magenta hats is misleadingly compressing A to you, I mean the policy is literally kicking puppies). The inferential gap between you and the magentas can be closed very cheaply, so you’re quickly convinced that A is harmful (unless you believe that kicking puppies is good).
- Suppose A is “fleegan at a rate of flargen”, where fleeganomics is a niche technical subject which nevertheless can be learned by anyone of median education in N units[^1] or less. Suppose also that you know the value of N, but you’re not inclined to invest that much compute in a dumb election, so you either a. take them at their word that A is harmful; b. search the applebees for an authority figure who believes that A is harmful, but believes it more credibly; or c. leave the parking lot without updating in any direction.
“That’s easy, c” you respond, blindingly fast. You peel out of there, and the whole affair makes not a dent in your epistemic hygiene. But you left behind many others. Will they be as strong, as wise as you?
“In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”
– Herbert Simon
Let’s call case 1 “constant” and case 2 “linear”. We assume that constant refers to negligible cost, and that linear is in pedagogical length (where pedagogical cost is some measure of the resources needed to acquire some sort of understanding).
A regulator, unlike you, isn’t willing to leave anyone behind for evangelists and pundits to prey on. This is the role I’m assuming for this essay. I will ultimately propose a negative attentional tax, in which the constant cases would be penalized to give the linear cases a boost. (It’s like negative income tax, replacing money with attention).
If you could understand fleeganomics in N/100000 bits, would it be worth it to you then?
Let’s force an empirical interpretation of “the best ideas float to the top”
Three possible measures of density:
- the simplest ideas float to the top.
- the truest ideas float to the top.
- the ideas which advance the best values float to the top, where by “advance the best values” we mean either a. maximize my utility function, not yours; or b. maximize the aggregate/average utility function of all moral patients, without emphasis on zero-sum faceoffs between opponent utility functions.
Each in turn implies a sort of world in which it is the sole interpretation, and thus the sole factor over beliefs of truth-seekers.
The intuition given above leans heavily on density_3. However, we must start much lower, at the fundamentals of simplicity and truth. From now on, for brevity’s sake, please ignore density_3 and focus on the first two.
density_1: The Simplest Ideas Float to the Top.
If you form a heuristic by philosophizing the conjunction rule in probability theory, you get Occam’s razor. In machine learning, we have model selection methods that directly penalize complexity. Occam’s razor doesn’t say anything about reception of ideas in a social system, beyond implying that in gambling the wise bet on shorter sentences (insofar as the wise are gamblers).
If we assume that the wearer of magenta hats is maximizing something like petition signatures, and by proxy maximizing the number of applebees patrons converted to magenta hat wearing applebees patrons, then in the world of density_1 they ought to persuade only via statements with constant or negligible cost. (Remember, in the world of density_1, statements needn’t have any particular content to be successful. In an idealized setting, this would mean the empty string gets 100% of the vote in every election, or 100% of traders purchase nothing but the empty string, etc.; in a human setting, think of the “smallest recognizable belief”.)
density_2: The Truest Ideas Float to the Top.
If the truest ideas floated to the top, then statements with more substantial truth values (i.e. with more evidence, more compelling evidence, stronger inferential steps) win out against those with less substantial truth values. In a world governed only by density_2, all cost is negligible.
In this world, the wearer of magenta hats is incentivized to teach fleeganomics – to bother themselves (and others) with linear cost ideas – if that’s what leads people to more substantially held beliefs or commitments. This is a sort of oracle world; in a word, logical omniscience.
In a market view, truth only prevails in the long run (i.e. in the same way that price only converges to value but you can’t pinpoint when they’re equal, supply with demand, etc.), which is why the density_2 interpretation is suitable for oracles, or at least the infinite resources of AIXI-like agents. If you tried to populate the world of density_2 with logically uncertain/AIKR-abiding agents, the entire appeal of markets evaporates. “Those who know they are not relevant experts shut up, and those who do not know this eventually lose their money, and then shut up.” (Hanson), but without the “eventually”.
Negative attention tax
Now suppose we live in some world where density_1 and density_2 are operating at the same time, with some foggier and handwavier things like density_3 on the margins. In such a world, we say false-complicated ideas are robustly uncompetitive and true-simple ideas are robustly competitive, where “robust” means “resilient to foul play”, and "foul play" means "any misleading compression, fallacious reasoning, etc.". Without such resilience, we run the risk that false-simple ideas will succeed and true-complicated ideas will fail.
A regulator isn’t willing to leave anyone behind for evangelists and pundits to prey on.
Perhaps we want free attention distributed to true-but-complicated things, and penalties applied to false-but-simple things. In economics, a negative income tax (NIT) is a welfare system within an income tax where people earning below a certain amount receive supplemental pay from the government instead of paying taxes to the government.
For us, a negative attentional tax is a welfare system, where ideas demanding above a certain amount of compute receive supplemental attention, and ideas below that amount pay up.

| density_2 \ density_1 | Simple | Complicated |
| --- | --- | --- |
| False | I'm saying this is a failure mode, danger zone, etc. | Robustly uncompetitive (won’t bother us) |
| True | Robustly competitive (these will be fine) | I’m saying the solution is to give these sentences a boost. |
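To make the threshold rule concrete, here is a minimal sketch of a negative attentional tax as a transfer schedule. Everything in it is an assumption made for illustration: the function name `attention_transfer`, the linear form, and the `threshold` and `rate` parameters do not come from the essay, which only says that ideas above some compute threshold receive supplemental attention and ideas below it pay.

```python
def attention_transfer(cost: float, threshold: float, rate: float) -> float:
    """Return the attention transferred to an idea under a linear schedule.

    cost      -- hypothetical attentional cost of digesting the idea
    threshold -- hypothetical cost level at which the transfer is zero
    rate      -- hypothetical fraction of the gap that is transferred

    A positive result is a supplement (the idea is costlier than the
    threshold); a negative result is a tax (the idea is cheaper).
    """
    return rate * (cost - threshold)


# An expensive ("linear") idea is subsidised, a cheap ("constant") one is taxed.
complicated = attention_transfer(cost=10.0, threshold=4.0, rate=0.5)
simple = attention_transfer(cost=1.0, threshold=4.0, rate=0.5)
```

This mirrors a negative income tax with the sign flipped: in NIT the subsidy flows to those below the threshold, whereas here it flows to ideas above it. Note that the schedule says nothing about truth, so on its own it would also subsidise false-complicated ideas; the table treats those as robustly uncompetitive anyway.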
An example implementation: suppose I’m working at nosebook in year of our lord. When I notice certain posts get liked/shared blindingly fast, and others take more time, I suppose that the simple ones are some form of epistemic foul play, and the complicated ones are more likely to align with epistemic norms we prefer. I make an algorithm to suppress posts that get liked/shared too quickly, and replace their spots in the feed with posts that seem to be digested before getting liked/shared (disclaimer: this is not a resilient proposal, I spent all of 10 seconds thinking about it, please defer to your nearest misinformation expert).
Individuals apply NAT credits to interesting-looking complicated ideas, complicated ideas aren't directly supplied with these supplements in the way that simple ideas are automatically handicapped.
Though the above may be a valid interpretation, especially in the nosebook example, NAT is more properly understood as credits allocated to individuals for them to spend freely.
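For the feed-level reading in the nosebook example, a rough sketch might look like the following. It is not a description of any real platform: the `Post` type, the `seconds_to_react` field (delay between a user seeing a post and liking or sharing it), the use of the median, and the `min_median_seconds` cutoff are all hypothetical, and "suppress" is interpreted here as demoting a post rather than deleting it.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Post:
    post_id: str
    # Hypothetical signal: seconds between impression and like/share.
    seconds_to_react: List[float]


def rerank(posts: List[Post], min_median_seconds: float) -> List[Post]:
    """Move posts that are reacted to too quickly behind the others.

    Posts with no reactions carry no evidence either way, so they are
    not penalised. Relative order within each group is preserved.
    """
    digested, too_fast = [], []
    for post in posts:
        reactions = post.seconds_to_react
        if reactions and median(reactions) < min_median_seconds:
            too_fast.append(post)
        else:
            digested.append(post)
    return digested + too_fast


feed = [
    Post("a", [1.0, 2.0, 1.5]),
    Post("b", [30.0, 45.0]),
    Post("c", []),
    Post("d", [0.5]),
]
reranked_ids = [p.post_id for p in rerank(feed, min_median_seconds=10.0)]
```

As the essay's own disclaimer says, this is not a resilient proposal: reaction delay is a crude proxy for digestion, and anyone who knows the cutoff can game it.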
You can imagine the stump speech.
extremely campaigning voice: I’m going to make sure every member of every applebees parking lot has a lengthened/handicapped mental speed when they’re faced with simple ideas, and this will come back to them as tax credits they can spend on complicated ideas. Every applebees patron deserves complexity, even if they can’t afford the full compute/price for it.
--footnote-- [^1]: "Pedagogical cost" is loosely inspired by "algorithmic decomposition" in Between Saying and Doing. TL;DR: to reason about a student acquiring long division, we reason about their acquisition of subtraction and multiplication. For us, pedagogical cost or length of some capacity is the sum of the lengths of its prerequisite capacities. We'll consider our pedagogical units as some function on attentional units. Herbert Simon dismisses adopting Shannon's bit as the attentional unit, because he wants something invariant under different encoding choices. He goes on to suggest time in the form of "how long it takes for the median human cognition to digest". This can be our base unit of parsing things you already know how to parse, even though extending it to pedagogical cost wouldn't be as stable because we don't understand teaching or learning very well.
So far we have been talking about how to learn “values” or “instrumental goals”. This would be necessary if we want to figure out how to build an AI system that does exactly what we want it to do. However, we’re probably fine if we can keep learning and building better AI systems. This suggests that it’s sufficient to build AI systems that don’t screw up so badly that it ends this process. If we accomplish that, then steady progress in AI will eventually get us to AI systems that do what we want.
So, it might be helpful to break down the problem of learning values into the subproblems of learning what to do, and learning what not to do. Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.
This is a problem that humans have to solve as well. Children learn basic norms such as not to litter, not to take other people’s things, what not to say in public, etc. As argued in Incomplete Contracting and AI alignment, no contract between humans is ever fully spelled out; instead, it relies on an external unwritten normative structure under which the contract is interpreted. (Even if we don’t explicitly ask our cleaner not to break any vases, we still expect them not to intentionally do so.) We might hope to build AI systems that infer and follow these norms, and thereby avoid catastrophe.
It’s worth noting that this will probably not be an instance of narrow value learning, since there are several differences:
- Narrow value learning requires that you learn what to do, unlike norm inference.
- Norm following requires learning from a complex domain (human society), whereas narrow value learning can be applied in simpler domains as well.
- Norms are a property of groups of agents, whereas narrow value learning can be applied in settings with a single agent.
Despite this, I have included it in this sequence because it is plausible to me that value learning techniques will be relevant to norm inference.
Paradise prospects
With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.
If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.
This is quite similar to the success story in a world with Comprehensive AI Services.
Plausible proposals
As far as I can tell, there has not been very much work on learning what not to do. Existing approaches like impact measures and mild optimization are aiming to define what not to do rather than learn it.
One approach is to scale up techniques for narrow value learning. It seems plausible that in sufficiently complex environments, these techniques will learn what not to do, even though they are primarily focused on what to do in current benchmarks. For example, if I see that you have a clean carpet, I can infer that it is a norm not to walk over the carpet with muddy shoes. If you have an unbroken vase, I can infer that it is a norm to avoid knocking it over. This paper of mine shows how you can reach these sorts of conclusions with narrow value learning (specifically a variant of IRL).
Another approach would be to scale up work on ad hoc teamwork. In ad hoc teamwork, an AI agent must learn to work in a team with a bunch of other agents, without any prior coordination. While current applications are very task-based (eg. playing soccer as a team), it seems possible that as this is applied to more realistic environments, the resulting agents will need to infer norms of the group that they are introduced into. It’s particularly nice because it explicitly models the multiagent setting, which seems crucial for inferring norms. It can also be thought of as an alternative statement of the problem of AI safety: how do you “drop in” an AI agent into a “team” of humans, and have the AI agent coordinate well with the “team”?
Potential pros
Value learning is hard, not least because it’s hard to define what values are, and we don’t know our own values to the extent that they exist at all. However, we do seem to do a pretty good job of learning society’s norms. So perhaps this problem is significantly easier to solve. Note that this is an argument that norm-following is easier than ambitious value learning, not that it is easier than other approaches such as corrigibility.
It also feels easier to work on inferring norms right now. We have many examples of norms that we follow, so we can more easily evaluate whether current systems are good at following norms. In addition, ad hoc teamwork seems like a good start at formalizing the problem, which we still don’t really have for “values”.
This also more closely mirrors our tried-and-true techniques for solving the principal-agent problem for humans: there is a shared, external system of norms, that everyone is expected to follow, and systems of law and punishment are interpreted with respect to these norms. For a much more thorough discussion, see Incomplete Contracting and AI alignment, particularly Section 5, which also argues that norm following will be necessary for value alignment (whereas I’m arguing that it is plausibly sufficient to avoid catastrophe).
One potential confusion: the paper says “We do not mean by this embedding into the AI the particular norms and values of a human community. We think this is as impossible a task as writing a complete contract.” I believe that the meaning here is that we should not try to define the particular norms and values, not that we shouldn’t try to learn them. (In fact, later they say “Aligning AI with human values, then, will require figuring out how to build the technical tools that will allow a robot to replicate the human agent’s ability to read and predict the responses of human normative structure, whatever its content.”)
Perilous pitfalls
What additional things could go wrong with powerful norm-following AI systems? That is, what are some problems that might arise, that wouldn’t arise with a successful approach to ambitious value learning?
- Powerful AI likely leads to rapidly evolving technologies, which might require rapidly changing norms. Norm-following AI systems might not be able to help us develop good norms, or might not be able to adapt quickly enough to new norms. (One class of problems in this category: we would not be addressing human safety problems.)
- Norm-following AI systems may be uncompetitive because the norms might overly restrict the possible actions available to the AI system, reducing novelty relative to more traditional goal-directed AI systems. (Move 37 would likely not have happened if AlphaGo were trained to “follow human norms” for Go.)
- Norms are more like soft constraints on behavior, as opposed to goals that can be optimized. Current ML focuses a lot more on optimization than on constraints, and so it’s not clear if we could build a competitive norm-following AI system (though see eg. Constrained Policy Optimization).
- Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.
One promising approach to AI alignment is to teach AI systems to infer and follow human norms. While this by itself will not produce an AI system aligned with human values, it may be sufficient to avoid catastrophe. It seems more tractable than approaches that require us to infer values to a degree sufficient to avoid catastrophe, particularly because humans are proof that the problem is soluble.
However, there are still many conceptual problems. Most notably, norm following is not obviously expressible as an optimization problem, and so may be hard to integrate into current AI approaches.
Tomorrow, there'll be a break from AIAF sequences and the new post will be the Alignment Newsletter Issue #42, by Rohin Shah.
Tuesday's AI Alignment Forum sequences post will be 'Learning With Catastrophes' by Paul Christiano in the sequence on Iterated Amplification.
The next post in this sequence will be 'Future directions in narrow value learning' by Rohin Shah, on Wednesday 16th Jan.
Let me tell you a secret.
You don’t have to experience negative emotion.
I risk coming across as implying that “happiness is a choice,” and that's not what I mean. I’m not implying that it is something easy to do, I’m not implying that it is something you should be able to do right now...
But I’m bringing up the possibility. Have you ever imagined it? Living your normal, ordinary life, from now until you die, but with the distinction that you choose not to experience negative emotion?
It’s likely that you have not thought of it. After all, negative emotions are just part of life, aren’t they? They aren't things we can change, right?
The Serenity Prayer goes like this:
God, grant me the serenity to accept the things I cannot change,
Courage to change the things I can,
And wisdom to know the difference.
The last part is invariably the most tricky one. I think people systematically underestimate the scope of the things that they can change, and that becomes more and more true as technology advances.
As Eliezer has pointed out,
“We have a concept of what a medieval peasant should have had, the dignity with which they should have been treated, that is higher than what they would have thought to ask for themselves.”
A medieval peasant accepted infant death, slavery, and the like as “part of the plan,” as “just the way things are.” Just like people nowadays accept death as “just the way things are,” and say things like “it is impossible to avoid negative emotions altogether because to live is to experience setbacks and conflicts.”
The same can be said of those who grew up in abusive families, as well as of oppressed groups in authoritarian societies: they may consider normal things that to us are abject, merely because they haven’t known anything better.
I think if there is something close to making me feel indignation, it is the fact that the ways in which life can be better are not self-evident.
Throughout my childhood and adolescence I had a host of internalizing mental disorders — depression, anxiety, poor self-esteem, dysthymia, suicidal ideation, all that good stuff. I regularly met with several psychotherapists, but unfortunately none provided much help.
When I was 16, however, I was fortunate enough to experience a particularly severe major depressive episode. The pain was so strong, so disabling, so unwavering and all-encompassing, that it eventually prompted my mom to take me to a psychiatrist instead of a psychologist. I experimented with one antidepressant, had problems with it, and then a few months later was prescribed Wellbutrin.
And… three weeks after I started taking it, I realized something odd. I realized that I didn't need to ruminate on all the ways in which I was the worst person in the world all the time! Even if that were true, it would be far better to occupy my thoughts with something positive, like trying to improve myself.
Another thing I noticed at the same time, and which shocked me, was that I was unable to feel jealousy. I had received the news that my ex — whom I still had a strong unrequited love for, which was largely the source of the depression — had started dating someone, and all that I could muster as an emotional reaction to it was “That's cool for him.” No feelings of jealousy, no feelings of rejection.
Eventually, after noticing those and other noteworthy changes in my mind, and after giving them a lot of thought and consideration — after making sure that it wasn't some sort of mirage — it was clear to me, by the fourth week, that, indeed, the depressive episode was over. My mind had gracefully transitioned from a state of constant mental torment to that of serene internal tranquility, and I deemed the change unlikely to be ephemeral.
It's been over two years, and although life has indeed had its ups and downs, there is... incredibly little overlap between my mood before and after I started taking Wellbutrin. Almost all of the days in my life after I started taking it have been better than almost all of the days before.
It is truly difficult to convey just how different the sadness I am capable of today is from the torment I used to be able to feel. My negative emotions, when present, are a pale version of their former selves, to an extent that they barely feel real — they’re pretty much cardboard cutouts of what they used to be.
Now, an interesting thing is that during my pre-Wellbutrin life, I would obviously never have desired a life like the one I have now; such a thing simply wasn’t within the scope of my imagination. It doesn’t come to us naturally to desire a peaceful inner mind and a capacity to control our feelings. It's not a basic human drive, the way that the desires for sex, money, love, and recognition are. Your mind is all that you have, it is all your life is. But aiming the arrow of desire at one’s own mind requires a fair amount of complicated metacognition.
What I find unfortunate about this story is that I had to get to an extremely low point in order for medication to be considered an option. If I hadn’t had that particularly severe depressive episode, I would have kept living a life which was meh seventy percent of the time.
And that makes me wonder: how many people around don’t know how good life can be for them? How many people suffer and think they can’t help it? How many people don’t have a blast with their morning routine merely because they haven’t tried to? Sometimes it genuinely requires a lot of open-mindedness in order to notice that you are sitting on a pot of gold.
We are patently unaware of the scope of the space of possible human psychological experiences. There was once this debate about whether mental imagery was an actual thing. It was only settled when Francis Galton gave people surveys and saw that some people did have mental imagery, and others didn’t. Before that, everyone just assumed that everyone else was like themselves.
It does not seem implausible to me that the same fallacy would apply to the psychological phenomenon of the pleasantness of life. That is, we naturally expect others to experience life as being roughly as pleasant as it is to us in particular. I find this passage from Schopenhauer to be a good example:
“In a world like this […] it is impossible to imagine happiness. It cannot dwell where, as Plato says, continual Becoming and never Being is all that takes place. First of all, no man is happy; he strives his whole life long after imaginary happiness, which he seldom attains, and if he does, then it is only to be disillusioned; and as a rule he is shipwrecked in the end and enters the harbour dismasted.”
He’s making big claims about the psychology of other people’s minds, claims that, thankfully, are wrong; the majority of people are happy. But there is a significant share of the population to whom that quote sounds entirely reasonable (my 15-year-old self and David Benatar included). And those people don’t know how good their lives can be.
A while ago 80000hours posted about a study in which subjects who were undecided about certain life-changing decisions agreed to decide based on a coin flip. The researchers then evaluated the subjects’ happiness several months after the study, and whether or not they had followed the decision the coin flip dictated.
It turned out that people who changed something big in their life due to the coin flip were much happier later:
The causal effect of quitting a job is estimated to be a gain of 5.2 happiness points out of 10, and breaking up as a gain of 2.7 out of 10!
Notably, “Should I move” also had a large effect (3.2), as did “should I start my own business” (5.2).
One interesting thing I noticed in those results is that what those decisions have in common, compared to the decisions that did not influence happiness that much, is that they result in a substantial change in people’s day-to-day life experiences.
Perhaps day-to-day life experiences can be especially prone to being coded as something to be accepted, as “just part of life.” It can be difficult to think of changing something so fundamental about life that you experience it every day.
Maybe the lesson here is that experimentation is valuable.
I’ve received some objections to my attitude of valuing happiness without special exceptions and without upper bound.
One common objection is that negative emotion sends important messages. I actually agree with that. Roughly speaking, the message that negative valence sends is “stop what you’re doing and change your strategy.” So, now you know. Now you can try to avoid the negative feeling when you notice it coming, and remember the message: stop what you’re doing and change your strategy. (In the case that you choose to even care about it, since emotions are based on evolutionary goals that might not be fully aligned with our own.)
I want to make it clear that in this post I am not claiming that external circumstances do not matter and all that people need to do is change their internal states. Not at all. I fully endorse changing one's life in order to improve well-being when that is the best strategy to do so, and as we saw in that 80000hours post, it often is.
“You can win with a long weapon, and yet you can also win with a short weapon. In short, the Way of the Ichi school is the spirit of winning, whatever the weapon and whatever its size.”
Another objection I’ve faced is the claim that it is futile to pursue happiness, that it is empty or hollow without suffering, and that we should be aiming at meaning.
I think the threat of “empty” or “meaningless” happiness is much less plausible than most people think. It seems to me that there is a close correspondence between high-level beliefs and mood. I, for one, have visited a quite wide range of mind-states along the valence axis, and every single step I took from the nadir of my worst depression to the great gratitude I feel now involved a change in how I see the world, a change in how I think.
The degree to which that is generalizable to other people is a question that I am interested in investigating. For now, it’s instructive to notice that the popular Nihilist Memes Facebook pages consist almost entirely of memes about depression. And that one of the diagnostic criteria of Borderline Personality Disorder, a very unpleasant condition, is “feelings of chronic emptiness.” Religious and spiritual experiences, on the other hand, which I would regard as some of the most blissful states possible to humans, involve plenty of meaning, so much that it all-too-often messes up people's epistemology.
Another objection I have encountered is that constant happiness makes one insensitive to the suffering of others. That is not supported by empirical evidence. Positive mood makes people less willing to endure harm, or to let others endure harm. It has been found over and over again to make people more interested in helping others and doing more than what is expected from them.
Moreover, I would not be here endorsing positivity on LessWrong if I didn't think that it had useful pragmatic value in helping us think and work. That’s because most of the people who will ever live will live in the far future, and many people on this site are doing valuable work in that area. It is important that they keep their minds sharp, and positivity goes a long way in that regard. There are, of course, other variables that affect productivity, and I am interested in investigating them as well.
Another motivating factor driving me to write this is that I think it is important for me to... have this debate, in order to think more clearly about others’ attitudes towards happiness, to understand where exactly differences in opinion from mine stem from. This might be valuable for cause prioritization research. The cool thing about information is that it doesn't have an expiration date. The knowledge and data that we gather will pass on to the future and be a foundation future researchers will build upon.
I think Anna Salamon, in one of my favorite LessWrong posts, provides a useful framework with which to think about why we may find some information aversive:
when I notice I'm averse to taking in "accurate" information, I ask myself what would be bad about taking in that information.
I think that drives at least part of the motivation behind the acceptance of negative emotions. It makes sense, since there are many ways in which it can be bad to think that negative emotion is always bad. For instance, when you are actively feeling a negative emotion, it often helps to hear that it is okay to feel that emotion; that makes you feel reassured and validated. By just plainly recognizing the badness of negative emotion, on the other hand, you risk getting into a loop. As an example, it turns out that, as depressing as it sounds, with enough self-referentiality it is entirely possible to be depressed because you’re depressed because you’re depressed. I've been there. And it's distinctly worse than merely being depressed at the object-level.
I’ll steal one of the post’s bucket drawings in order to illustrate this:
Whether negative emotion is always bad is a value judgement, which is why I left that label in the Desired state panel blank. But it is always useful to separate “is negative emotion always bad” and “should I feel shame/guilt/sadness for experiencing negative emotion” into two mental buckets; to recognize that they are separate questions.
Acceptance is useful when you cannot change a problem. Acceptance is useful when you cannot change a problem. Both those sentences can be true at the same time. And, as technology advances, our ability to solve problems improves; what was once impossible becomes merely an engineering problem.
We (Zvi Mowshowitz and Vladimir Slepnev) are happy to announce the results of the fourth round of the AI Alignment Prize, funded by Paul Christiano. From July 15 to December 31, 2018 we received 10 entries, and are awarding four prizes for a total of $20,000.
The winners
We are awarding two first prizes of $7,500 each. One of them goes to Alexander Turner for Penalizing Impact via Attainable Utility Preservation; the other goes to Abram Demski and Scott Garrabrant for the Embedded Agency sequence.
We are also awarding two second prizes of $2,500 each: to Ryan Carey for Addressing three problems with counterfactual corrigibility, and to Wei Dai for Three AI Safety Related Ideas and Two Neglected Problems in Human-AI Safety.
We will contact each winner by email to arrange transfer of money. Many thanks to everyone else who participated!
Moving on
This concludes the AI Alignment Prize for now. It has stimulated a lot of good work during its year-long run, but participation has been slowing down from round to round, and we don't think it's worth continuing in its current form.
Once again, we'd like to thank everyone who sent us articles! And special thanks to Ben and Oliver from the LW2.0 team for their enthusiasm and help.
Analysis of the paper: Less Competition, More Meritocracy (hat tip: Marginal Revolution: Can Less Competition Mean More Meritocracy?)
Epistemic Status: Consider the horse as if it was not a three meter sphere
Economic papers that use math to prove things can point to interesting potential results and reasons to question one’s intuitions. What is frustrating is the failure to think outside of those models and proofs, analyzing the practical implications.
In this particular paper, the central idea is that when risk is unlimited and free, ratcheting up competition dramatically increases risk taken. This introduces sufficient noise that adding more competitors can make the average winner less skilled. At the margin, adding additional similar competitors to a very large pool has zero impact. Adding competitors with less expected promise makes things worse.
This can apply in the real world. The paper provides a good example of a very good insight that is then proven ‘too much,’ and which does not then question or vary its assumptions in the ways I would find most interesting.
I. The Basic Model and its Central Point
Presume some number of job openings. There are weak candidates and strong candidates. Each candidate knows if they are strong or weak, but not how many other candidates are strong, nor do those running the contest know how many are strong.
The goal of the competition is to select as many strong candidates as possible. Or formally, to maximize [number of strong selected – number of weak selected], which is the same thing if the number of candidates is fixed, but is importantly different later when the number of selected candidates can vary. Each candidate performs and is given a score, and for an N-slot competition, the highest N scores are picked.
By default, strong candidates score X and weak candidates score Y, X>Y, but each candidate can also take on as much risk as they wish, with any desired distribution of scores, so long as their score never goes below zero.
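The selection rule and objective described above can be sketched in a few lines. This is only an illustration of the setup, not code from the paper: the names `select_top` and `objective`, and the values X = 10 and Y = 6, are assumptions chosen for the example, and every candidate here takes zero risk.

```python
from typing import List, Tuple

# (kind, score), where kind is "strong" or "weak". Hypothetical representation.
Candidate = Tuple[str, float]

def select_top(candidates: List[Candidate], n_slots: int) -> List[Candidate]:
    """Pick the n_slots highest scores, as in an N-slot competition."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:n_slots]

def objective(selected: List[Candidate]) -> int:
    """Number of strong candidates selected minus number of weak ones."""
    return sum(1 if kind == "strong" else -1 for kind, _ in selected)

X, Y = 10.0, 6.0  # illustrative default scores, X > Y
pool = [("strong", X), ("weak", Y), ("strong", X), ("weak", Y), ("weak", Y)]
winners = select_top(pool, n_slots=3)
print(objective(winners))  # two strong and one weak selected, so 2 - 1 = 1
```

With no risk taken, every strong candidate is picked before any weak one, which is the situation the concession equilibrium tries to preserve.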
The paper then assumes reflexive equilibrium, does math and proves a bunch of things that happen next. The math checks out; I duplicated the results intuitively.
There are two types of equilibrium.
In the first type, concession equilibria, strong candidates take no risk and are almost always chosen. Weak candidates take risk to try and beat other weak candidates, but attempting to beat strong candidates isn’t worthwhile. This allows strong candidates to take zero risk.
In the second type, challenge equilibria, weak candidates attempt to be chosen over strong candidates, forcing strong candidates to take risk.
If I am a weak candidate, I can be at least (Y/X) as likely as a strong candidate to be selected by copying their strategy with probability (Y/X) and scoring 0 otherwise. This seems close to optimal in a challenge equilibrium.
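To see why the mimicry strategy above is available to a weak candidate, here is a minimal sketch. It assumes (my reading of the model, not something stated above) that "taking risk" means choosing any non-negative score distribution whose mean equals the candidate's default score. The function names and the numbers X = 10, Y = 6 are hypothetical.

```python
from fractions import Fraction

def mimic_distribution(strong_dist, x, y):
    """Weak candidate's mixture: copy strong_dist with probability y/x, else score 0.

    strong_dist maps score -> probability and is assumed to have mean x.
    """
    p_copy = Fraction(y, x)
    mixed = {score: p_copy * prob for score, prob in strong_dist.items()}
    mixed[0] = mixed.get(0, Fraction(0)) + (1 - p_copy)
    return mixed

def mean(dist):
    """Expected score of a {score: probability} distribution."""
    return sum(score * prob for score, prob in dist.items())

X, Y = 10, 6
safe_strong = {10: Fraction(1)}                     # strong candidate, no risk
risky_strong = {20: Fraction(1, 2), 0: Fraction(1, 2)}  # strong candidate, high risk

for strong in (safe_strong, risky_strong):
    weak = mimic_distribution(strong, X, Y)
    print(mean(strong), mean(weak))  # strong mean stays X, weak mean stays Y
```

Whatever the strong candidate does, the copy branch fires with probability Y/X and the mixture still averages Y, so the weak candidate stays within budget while matching the strong candidate's selection chances that fraction of the time.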
Adding more candidates, strong or weak, risks shifting from a concession to a challenge equilibrium. Each additional candidate, of any strength, makes challenge a better option relative to concession.
If competition is ‘insufficiently intense’ then we get a concession equilibrium. We successfully identify every strong candidate, at the cost of accepting some weak ones. If competition is ‘too intense’ we lose that. The extra candidate that tips us over the edge makes things much worse. After that, quantity does not matter, only the ratio of weak candidates to strong.
Even if search is free, and you continue to sample from the same pool, hitting the threshold hurts you, and further expansion does nothing. Interviewing one million people for ten jobs, a tenth of whom are strong, is not better than ten thousand, or even one hundred. Ninety might be better.
Since costs are never zero (and rarely negative), and the pool usually degrades as it expands, this argues strongly for limited competitions with weaker selection criteria, including via various hacks to the system.
II. What To Do, and What This Implies, If This Holds
The paper does a good job analyzing what happens if its conditions hold.
If one has a fixed set of positions to fill (winners to pick) and wants to pick the maximum number of strong candidates, with no cost to expanding the pool of candidates, the ideal case is to pick the maximum number of strong candidates that maintains a concession equilibrium. With no control (by assumption) over who you select or how to select them, this is the same as picking the maximum number of candidates that maintains a concession equilibrium, no matter what decrease in quality you might get while expanding the pool.
The tipping point makes this a Price Is Right style situation. Get as close to the number as possible without going over. Going over is quite bad, worse than a substantial undershoot.
One can think of probably not interviewing enough strong candidates, and probably hiring some weak candidates, as the price you must pay to be allowed to sort strong candidates from weak candidates – you need to ‘pay off’ the weak ones to not try and fool the system. An extra benefit is that even as you fill all the slots, you know who is who, which can be valuable information in the future. Even if you’re stuck with them, better to know that.
A similar dynamic comes if choosing how many candidates to select from a fixed pool, or when choosing both candidate and pool sizes.
If one attempts to only have slots for strong candidates, under unlimited free risk taking, you guarantee a challenge equilibrium. Your best bet will therefore probably be to pick enough candidates from the pool to create a concession equilibrium, just like choosing a smaller candidate pool.
The paper considers hiring a weak candidate as a -1, and hiring a strong candidate as a +1. The conclusions don’t vary much if this changes, since there are lots of other numerical knobs left unspecified that can cancel this out. But it is worth noting that in most cases the ratio is far less favorable than that. The default is that one good hire is far less good than one bad hire is bad. True bad hires are rather terrible (as opposed to all right but less than the best).
Thus, when the paper points out that it is sometimes impossible to reliably break 50% strong candidates under realistic conditions, no matter how many people are interviewed and how many slots are given out, they underestimate the chance that the system breaks down entirely into no contest at all, and no production.
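The point about payoff ratios can be made concrete with a small worked calculation. If a strong hire is worth +g and a weak hire costs b, a batch of hires with strong fraction p has expected value p·g − (1 − p)·b per hire, which breaks even at p = b / (g + b). The function name and the 3:1 ratio below are illustrative assumptions, not figures from the paper.

```python
from fractions import Fraction

def break_even_strong_fraction(gain_per_strong, loss_per_weak):
    """Fraction of strong hires at which expected value per hire is zero.

    Solves p * gain - (1 - p) * loss = 0 for p.
    """
    return Fraction(loss_per_weak, gain_per_strong + loss_per_weak)

print(break_even_strong_fraction(1, 1))  # the paper's +1 / -1 case: 1/2
print(break_even_strong_fraction(1, 3))  # hypothetical: a bad hire is 3x as costly: 3/4
```

Under the paper's symmetric payoffs, 50% strong hires is the break-even line; if a bad hire is several times worse than a good hire is good, a process that cannot reliably beat 50% is a net loss.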
What is the best we can do, if all assumptions hold?
The minimum portion of weak candidates accepted scales linearly with their presence in the pool, and with how strongly they perform relative to strong candidates. Thus we set the pool size such that this fills out the pool with some margin of error.
That is best if we set the pool size but nothing else. The paper considers college admissions. A college is advised to solve for which candidates are above a fixed threshold, then choose at random from those above the threshold (which is a suggestion one would only make in a paper with zero search costs, since once you have enough worthy candidates you can stop searching, but shrug.) Thus, we can always choose to arbitrarily limit the pool.
In practice, attempting this would change the pool of applicants. In a way you won’t like. You are more attractive to weak candidates and less attractive to strong ones. Weak candidates flood in to ‘take their shot,’ causing a vicious cycle of reputation and pool decay. You’re not a good reach school or a safe school for a strong candidate, so why bother? If other colleges copy you, students respond by investing less in becoming strong and more in sending out all the applications, and the remaining strong candidates remain at risk.
True reflexive equilibria almost never exist, given the possible angles of response, and differences between people’s knowledge, preferences and cognition.
III. Relax Reflective Equilibrium
Even if it is common knowledge that only two candidate strengths exist, and all candidates of each type are identical (which they aren’t), they will get different information and react differently, destroying reflexive equilibrium.
Players will not expect all others to jump with certainty between equilibria at some size threshold. Because they won’t. Which creates different equilibria.
Some players don’t know game theory, or don’t pay attention to strategy. Those players, as a group, lose. Smart game theory always has the edge.
An intuition pump: Learning game theory is costly, so the equilibrium requires it to pay off. Compare to the efficient market hypothesis.
Some weak candidates will always attempt to pass as strong candidates. There is a gradual shift from most not doing so to almost everyone doing so. More weak candidates steadily take on more risk. Eventually most of them mostly take on large risk to do their impression of a strong candidate. Strong candidates slowly start taking more risk more often as they sense their position becoming unsafe.
Zero risk isn’t stable anyway without continuous skill levels. Strong candidates notice that exactly zero risk puts them behind candidates who take on extra tail risk to get epsilon above them. Zero risk is a default strategy, so beating that baseline is wise.
Now those doing this try to outbid each other, until strong candidates lose to weak candidates at least sometimes. This risk will cap out very low if strong candidates consider the risk of losing at around their average performance to also be minuscule, but it will have to exist. Otherwise, there’s an almost free action in making one’s poor performances worse, since they are already losing to almost all other strong candidates, and doing that allows one to make their stronger performances better and/or more likely.
The generalization of this rule is that whenever you introduce a possible outcome into the system, and provide any net benefit to anyone if they do things that make the outcome more likely, there is now a chance that the outcome happens. Even if the outcome is ‘divorce,’ ‘government default,’ ‘forced liquidation,’ ‘we both drive off the cliff’ or ‘nuclear war.’ It probably also isn’t epsilon. While risk is near epsilon, taking actions that increase risk will look essentially free, so until the risk is big enough to matter it will keep increasing. Therefore, every risk isn’t only possible. Every risk will matter. Given enough time, someone will miscalculate, and Murphy’s Law ensues.
Future post: Possible bad outcomes are really bad.
Stepping back, the right strategy for each competitor will be to guess the performance levels that efficiently translate into wins, making sure to maximally bypass levels others are likely to naively select (such as zero risk strategies), and generally play like they’re in a variation of the game of Blotto.
A lot of these results are driven by discrete skill levels, so let’s get rid of those next.
IV. Allow Continuous Skill Levels
Suppose instead of two skill levels, each player has their own skill level, and a rough and noisy idea where they lie in the distribution.
Each player has resources to distribute across probability. Success is increasing as a function of performance. Thinking players aim for performance levels they believe are efficient, and do not waste resources on performance levels that matter less.
All competitors also know that the chance of winning with low performance is almost zero. The value of additional performance probably gradually increases (positive second derivative) until it peaks at an inflection point, and then starts to decline as success starts to approach probability one. There may be additional quirky places in the distribution where extra performance is especially valuable. This exact curve won’t be known to anyone, different players will have different guesses partly based on their own abilities, and ability levels are continuous.
A sufficiently strong candidate, who expects their average performance to be above the inflection point, should take no risk. A weaker candidate should approximate the inflection point, and risk otherwise scoring a zero performance to reach that point. Simple.
If the distribution of skill levels is bumpy, what happens then? We have strong candidates and weak candidates (e.g. let’s say college graduates and high school graduates, or some have worked in the field and some haven’t, or whatever) so there’s a two-peak distribution of skill levels. Unless people are badly misinformed, we’ll still get a normal-looking distribution. If the two groups calculate very different expected thresholds, we’ll see two peaks.
In general, but not always, enough players will miscalculate or compete for the ‘everyone failed’ condition that trying to do so is a losing play. Occasionally there will be good odds to hoping enough others aim too high and miss.
Rather than have a challenge and a concession equilibrium, we have a threshold equilibrium. Everyone has a noisy estimate of the threshold they need. Those capable of reliably hitting the threshold take no risk, and usually make it. Those not capable of reliably hitting the threshold risk everything to make the threshold as often as possible.
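The threshold strategy described above can be sketched as follows. This assumes, as an idealization, that a candidate's skill is the mean of whatever non-negative score distribution they choose and that everyone agrees on the threshold; under that assumption Markov's inequality caps the chance of reaching threshold T at skill / T, and an all-or-nothing gamble between 0 and T achieves that cap. The function names and numbers are hypothetical.

```python
from fractions import Fraction

def threshold_strategy(skill, threshold):
    """Return a {score: probability} distribution with mean equal to skill.

    Candidates who can reliably hit the threshold take no risk; the rest
    risk everything to land exactly on the threshold as often as possible.
    """
    if skill >= threshold:
        return {skill: Fraction(1)}
    p_hit = Fraction(skill, threshold)
    return {threshold: p_hit, 0: 1 - p_hit}

def hit_probability(skill, threshold):
    """Probability that the chosen strategy scores at least the threshold."""
    dist = threshold_strategy(skill, threshold)
    return sum(p for score, p in dist.items() if score >= threshold)

T = 8
for skill in (10, 4, 2):
    print(skill, hit_probability(skill, T))  # 10 -> 1, 4 -> 1/2, 2 -> 1/4
```

Below the threshold, the hit probability is linear in skill, which is the sense in which success becomes proportional to skill once everyone is forced into the same gamble.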
Note that this equilibrium holds, although it may contain no one above the final threshold. If everyone aims for what they think is good-enough performance, aiming for less is almost worthless, and aiming for much more is mostly pointless, and the threshold adjusts so that the expected number of threshold performances is very close to the number of slots.
More competition raises the threshold, forcing competitors to take on more risk, until everyone is using the same threshold strategy and success is purely proportional to skill. Thus, in a large pool, we once again have expanding the pool as a bad idea if it weakens average skill, even if search and participation costs for all are free.
In a small pool, the strongest candidates are ‘wasting’ some of their skill on less efficient outcomes beyond their best estimate of the threshold.
This ends up being similar to the challenge case, except that there is no inflection point where things suddenly get worse. You never expect to lose from expanding the pool while maintaining quality. Instead, things slowly get better as you waste less work at the top of the curve, so the value of adding more similar candidates quickly approaches zero.
The new intuition is, given low enough search costs, we should add equally strong potential candidates until we are confident everyone is taking risk, rather than stopping just short of causing stronger candidates to take risk. If participation is costly to you and/or the candidates, you should likely stop short of that point.
The key intuitive question to ask is, if a candidate was the type of person you want, would they be so far ahead of the game as to be obviously better than the current expected marginal winner? Would they be able to crush a much bigger pool, and thus be effectively wasting lots of effort? If and only if that’s true, there’s probably benefit to expanding your search, so you get more such people, and it’s a question of whether it is worth the cost.
The other strong intuition is that once your marginal applicant pool is lower in average quality than your average pool, that will always be a high cost, so focus on quality over quantity.
This suggests another course of action…
V. Multi-Stage Process
Our model tells us that average quality of winners is, given a large pool, a function of the average quality of our base pool.
But we have a huge advantage: This whole process is free.
Given that, it seems like we should be able to be a bit more clever and complex, and do better.
We can improve if we can get a pool of candidates that has a higher average quality than our original candidate pool, but which is large enough to get us into a similar equilibrium. Each candidate’s success is proportional to their skill level, so our average outcome improves.
We already have a selection process that does this. We know our winners will be on average better than our candidates. So why not use that to our advantage?
Suppose we did a multi-stage competition. Before, we would have had 10 applicants for 1 slot. Expanding that to 100 applicants won’t do us any good directly, because of risk taking. But running 10 competitions with 10 people each, then pitting those 10 winners against each other, will improve things for us.
By using this tactic multiple times, we can do quite a bit better. Weaker candidates will almost never survive multiple rounds.
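A rough sketch of why repeated rounds filter out weaker candidates. It rests on strong simplifying assumptions that are mine, not the paper's: every round uses the same threshold, each candidate hits it with probability skill / threshold (the all-or-nothing gamble), and outcomes in different rounds are independent. The skill values and threshold are made up for illustration.

```python
from fractions import Fraction

def survival_probability(skill, threshold, rounds):
    """Chance of hitting the threshold in every one of `rounds` independent rounds."""
    per_round = min(Fraction(1), Fraction(skill, threshold))
    return per_round ** rounds

STRONG, WEAK, THRESHOLD = 10, 6, 12  # hypothetical skills and threshold

for rounds in (1, 2, 3):
    strong = survival_probability(STRONG, THRESHOLD, rounds)
    weak = survival_probability(WEAK, THRESHOLD, rounds)
    # Relative odds of a weak survivor versus a strong one shrink as (6/10) ** rounds.
    print(rounds, weak / strong)
```

In one round the weak candidate survives 3/5 as often as the strong one; after three rounds that falls to 27/125. The whole effect depends on luck in one round not carrying over to the next.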
What happened here?
We cheated. We forced candidates to take observable, uncorrelated risks in each different round. We destroyed the rule that risk taking is free and easy, and assumed that a lucky result in round 1 won’t help you in round 2.
If a low-skill person can permanently mimic in all ways a high-skill person, and we observe that success, they are high skill now! A worthy winner. If they can’t, then they fall back down to Earth on further observation. This should make clear why the idea of unlimited cheap and exactly controlled risk is profoundly bizarre. A test that works that way is a rather strange test.
So is a test that costs nothing to administer. You get what you pay for.
The risk is that risk-taking takes the form of ‘guess the right approach to the testing process’ and thus test scores are correlated without having to link back to skill.
This is definitely a thing.
During one all-day job interview, I made several fundamental interview-skill mistakes that hurt me in multiple sessions. If I had fixed those mistakes, I would have done much better all day, but would not have been much more skilled at what they were testing for. A more rigorous or multi-step process could have only done so much. To get better information, they would have had to add a different kind of test. That would risk introducing bad noise.
This seems typical of similar contests and testing methods designed to find strong candidates.
A more realistic model would introduce costs to participation in the search process, for all parties. You’d have another trade-off between having noise be correlated versus minimizing its size, making more rounds of analysis progressively less useful.
Adding more candidates to the pool now clearly is good at first and then turns increasingly negative.
VI. Pricing People Out
There are two realistic complications that can help us a lot.
The first is pricing people out. Entering a contest is rarely free. I have been fortunate that my last two job interviews were at Valve Software and Jane Street Capital. Both were exceptional companies looking for exceptional people, and I came away from both interviews feeling like I’d had a very fun and very educational experience, in addition to leveling up my interview skills. So those particular interviews felt free or better. But most are not.
Most are more like when I applied to colleges. Each additional college meant a bunch of extra work plus an application fee. Harvard does not want to admit a weak candidate. If we ignore the motivation to show that you have lots of applications, Harvard would prefer that weak candidates not apply. It wastes time, and there’s a non-zero chance one will gain admission by accident. If Harvard taxes applications, by requiring additional effort or raising the fee, they will drive weak applicants away and strengthen their pool, improving the final selections.
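The pricing-out logic can be illustrated with a toy expected-value rule. This is a deliberately crude sketch under my own assumptions: an applicant applies only if their chance of admission times the value they place on admission exceeds the total cost of applying. The function name and all numbers are hypothetical.

```python
def applies(win_probability, prize_value, application_cost):
    """Toy decision rule: apply only if expected benefit exceeds the cost."""
    return win_probability * prize_value > application_cost

PRIZE = 100.0  # hypothetical value of admission to the applicant

# A long-shot applicant (2% chance) and a strong applicant (50% chance).
print(applies(0.02, PRIZE, application_cost=1.0))  # True: cheap enough to take a shot
print(applies(0.02, PRIZE, application_cost=5.0))  # False: the higher fee prices them out
print(applies(0.50, PRIZE, application_cost=5.0))  # True: the strong applicant still applies
```

Raising the cost removes the long shots first, which is how a tax on applications can strengthen the pool.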
Harvard also does this by making Harvard hard. A sufficiently weak candidate should not want to go to Harvard, because they will predictably flunk out. Making Harvard harder, the way MIT is hard, would make their pool higher quality once word got out.
We can think of some forms of hazing, or other bad experiences for winners of competitions, partly as a way to discourage weak candidates from applying, and also partly as an additional test to drive them out.
Ideally we also reduce risk taken.
A candidate has uncertainty about how strong they are, and how much they would benefit from the prize. If being a stronger candidate is correlated with benefiting from winning, a correct strategy becomes to take less or no risk. If taking a big risk causes me to win when I would otherwise lose, I won a prize I don’t want. If taking a big risk causes me to lose, I lost a prize I did want. That pushes me heavily towards lowering my willingness to take risk, which in turn lowers the competition level and encourages me to take less risk still. Excellent.
VII. Taking Extra Risk is Hard
Avoiding risk is also hard.
In the real world, there is a ‘natural’ amount of risk in any activity. One is continuously offered options with varying risk levels.
Some of these choices are big, some small. Sometimes the risky play is ‘better’ in an expected value sense, sometimes worse.
True max-min strategies that avoid even minimal risks decline even small risks that would cancel out over time. This is expensive.
If one wants to maximize risk at all costs, one ends up doing the more risky thing every time and takes bad gambles. This is also expensive.
It is a hard problem to get the best outcome given one’s desired level of risk, or to maximize the chance of exceeding some performance threshold, even with no opponent. In games with an opponent who wants to beat you and thus has the opposite incentives of yours (think football) it gets harder still. Real world performances are notoriously terrible.
There are two basic types of situations with respect to risk.
Type one is where adding risk is expensive. There is a natural best route to work or line of play. There are other strategies that overall are worse, but have bigger upside, such as taking on particular downside tail risks in exchange for tiny payoffs, or hoping for a lucky result. In the driving example, one might take an on average slower route that has variable amounts of traffic, or one might drive faster and risk an accident or speeding ticket.
Available risk is limited. If I am two hours away by car, I might be able to do something reckless and maybe get there in an hour and forty five minutes, but if I have to get there in an hour, it’s not going to happen.
I can only ever hope to overcome a limited skill barrier. If we are racing in the Indianapolis 500, I might try to win the race by skipping a pit stop, or passing more aggressively to make up ground, or choosing a car that is slightly faster but has more engine trouble. But if my car combined with my driving skill is substantially slower than yours (where substantially means a minute over several hours) and your car doesn’t crash or die, I will never beat you.
If I had taken the math Olympiad exam (the USAMO) another hundred times, I might have gotten a non-zero score sometimes, but I was never getting onto the team. Period.
In these situations, reducing risk beyond the ‘natural’ level may not even be possible. If it is, it will be increasingly expensive.
Type two is where giant risks are the default, then sacrifices are made to contain those risks. Gamblers who do not pay attention to risk will always go broke. To be a winning gambler, one can either be lucky and retain large risk, or one can be skilled and pay a lot of attention to containing risk. In the long term, containing risk, including containing risk by ceasing to play at all, is the only option.
Competitors in type two situations must be evaluated explicitly on their risk management, or on very long term results, or any evaluation is worthless. If you are testing for good gamblers and only have one day, you pay some attention to results but more attention to the logic behind choices and sizing. Tests that do otherwise get essentially random results, and follow the pattern where reducing the applicant pool improves the quality of the winners.
Another note is that the risks competitors take can be correlated across competitors in many situations. If you need a sufficiently high rank rather than a high raw score, those who take risks should seek to take uncorrelated risks. Thus, in stock market or gambling competitions, the primary skill often is in doing something no one else would think to do, rather than in picking a high expected value choice. Sometimes that’s what real risk means.
VIII. Central Responses
There are also four additional responses by those running the competition, that are worth considering.
The first response is to observe a competitor’s level of risk taking and test optimization, and penalize too much (or too little). This is often quite easy. Everyone knows what a safe answer to ‘what is your greatest weakness’ looks like, bet size in simulations is transparent, and so on. If you respond to things going badly early on with taking a lot of risk, rather than being responsible, will you do that with the company’s money?
A good admissions officer at a college mostly knows instantly which essays had professional help and which resumes are based on statistical analysis, versus who lived their best life and then applied to college.
A good competition design gives you the opportunity to measure these considerations.
Such contests should be anti-inductive, if done right, with the really sneaky players playing on higher meta levels. Like everything else.
The second response is to vary the number of winners based on how well competitors do. This is the default.
If I interview three job applicants and all of them show up hung over, I need to be pretty desperate to take the one who was less hung over, rather than call in more candidates tomorrow. If I find three great candidates for one job, I’ll do my best to find ways to hire all three.
Another variation is that I have an insider I know well as the default winner, and the application process is to see if I can do better than that, and to keep the insider and the company honest, so again it’s mostly about crossing a bar.
The third response is that often there isn’t even a ‘batch’ of applications. There is only a series of permanent yes/no decisions until the position is filled. This is the classic problem of finding a spouse or a secretary, where you can’t easily go back once you reject someone. Once you have a sense of the distribution of options, you’re effectively looking for ‘good enough’ at every step, and that requirement doesn’t move much until time starts running out.
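The sequential search described above can be sketched as a small simulation. This is a minimal illustration, assuming candidate quality is an independent uniform draw on [0, 1] and that the searcher holds a fixed "good enough" bar which only gives way on the final candidate. The function names, the bar of 0.8 and the pool of 20 are all hypothetical choices for the example.

```python
import random

def search(qualities, bar):
    """Accept the first candidate at or above `bar`.

    If nobody before the last candidate clears the bar, time has run out
    and the last candidate is taken regardless of quality.
    """
    for q in qualities[:-1]:
        if q >= bar:
            return q
    return qualities[-1]

def average_hire(bar, n_candidates=20, trials=20000, seed=0):
    """Average quality of the hire over many simulated searches."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    total = 0.0
    for _ in range(trials):
        # Assumption: quality is uniform on [0, 1] and independent.
        qualities = [rng.random() for _ in range(n_candidates)]
        total += search(qualities, bar)
    return total / trials

if __name__ == "__main__":
    for bar in (0.0, 0.5, 0.8):
        print(bar, round(average_hire(bar), 3))
```

Under these assumptions a candidate below the bar is hired only when they happen to be the final option, which is the sense in which the contest behaves like a threshold requirement rather than a ranking.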
Thus, most contests that care mostly about finding a worthy winner are closer to threshold requirements than they look. This makes it very difficult to create a concession equilibrium. If you show up and aren’t good enough to beat continuing to search, your chances are very, very bad. If you show up and are good enough to beat continuing to search, your chances are very good. The right strategy becomes either to aim at this threshold, or if the field is large you might need to aim higher. You can never keep the field small enough to keep the low-skill players honest.
The fourth response is to punish sufficiently poor performance. This can be as mild as in-the-moment social embarrassment – Simon mocking aspirants in American Idol. It can be as serious as ‘you’re fired,’ either from the same company (you revealed you’re not good enough for your current job, or your upside is limited), or from another company (how dare you try to jump ship!). In fiction a failed application can be lethal. Even mild retaliation is very effective in improving average quality (and limiting the size) of the talent pool.
IX. Practical Conclusions
We don’t purely want the best person for the job. We want a selection process that balances search costs, for all concerned, with finding the best person and perhaps getting your applicants to improve their skill.
A weaker version of the paper’s core take-away heuristic seems to hold up under more analysis: There is a limit to how far expanding a search helps you at all, even before costs.
Rule 1: Pool quality on the margin usually matters more than quantity.
Bad applicants that can make it through are more bad than they appear. Expanding the pool’s quantity at the expense of average quality, once your supply of candidates isn’t woefully inadequate, is usually a bad move.
Rule 2: Once your application pool probably includes enough identifiable top-quality candidates to fill all your slots, up to your ability to differentiate, stop looking.
A larger pool will make your search more expensive and difficult for both you and them, add more regret because choices are bad, and won’t make you more likely to choose wisely.
Note that this is a later stopping point than the paper recommends. The paper says you should stop before you fill all your slots, such that weak applicants are encouraged not to represent themselves as strong candidates.
Also note that this rule has two additional requirements. It requires the good candidates be identifiable, since if some of them will blow it or you’ll blow noticing them, that doesn’t help you. It also requires that there not be outliers waiting to be discovered, that you would recognize if you saw them.
Another, similar heuristic that is also good is, make the competition just intense enough that worthy candidates are worried they won’t get the job. Then stop.
Rule 3: Weak candidates must either be driven away, or rewarded for revealing themselves. If weak candidates can successfully fake being strong, it is worth a lot to ensure that this strategy is punished.
Good punishments include application fees, giving up other opportunities or jobs, long or stressful competitions, and punishments for failure ranging from mild in-the-room social disapproval or being made to feel dumb, up to major retaliation.
Another great punishment is to give smaller rewards for success when the winner is low-skilled. If their prize is something they can’t use – they’ll flunk out, or get fired quickly, or similar – then they will be less inclined to apply.
Reward for participation is probability of success times reward for success, while cost is mostly fixed. Tilt this enough and your bad-applicant problem clears up.
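As a rough worked example of that calculation, here is a minimal sketch. The probabilities, prize values and costs are made-up numbers chosen only to show how tilting any one of them changes a weak applicant's decision; the function names are hypothetical.

```python
def participation_value(p_win, prize, cost):
    """Expected net value of entering: chance of winning times the value
    of the prize to this applicant, minus the (mostly fixed) cost."""
    return p_win * prize - cost

def will_apply(p_win, prize, cost):
    """An applicant enters only when the expected net value is positive."""
    return participation_value(p_win, prize, cost) > 0

if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not data).
    print(will_apply(p_win=0.05, prize=100, cost=2))   # weak applicant, cheap entry
    print(will_apply(p_win=0.05, prize=100, cost=10))  # same applicant, costlier entry
    print(will_apply(p_win=0.05, prize=30, cost=2))    # prize is worth less to them
    print(will_apply(p_win=0.50, prize=100, cost=10))  # strong applicant, costlier entry
```

With these numbers, either raising the cost or shrinking what the prize is worth to a weak applicant flips their decision, while the strong applicant still enters.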
Fail to tilt this enough, and you have a big lemon problem on multiple levels. Weak competitors will choose your competition over others, giving strong applicants less reason to bother both in terms of chance of winning, and desire to win. Who wants to win only to be among a bunch of fakers who got lucky? That’s no fun and it’s no good for your reputation either.
It will be difficult to punish weak candidates for faking being strong versus punishing them in general. But if you can do it, that’s great.
The flip side is that we can reward them for being honest. That will often be easier.
Preventing a rebellion of the less skilled is a constraint on mechanism design. We must either appease them, or wipe them out.
Rule 4: Sufficiently hard, high stakes competitions that are vulnerable to gaming and/or resource investment are highly toxic resource monsters.
This is getting away from the paper’s points, since the paper doesn’t deal with resource costs to participation or search, but it seems quite important.
In some cases, we want these highly toxic resource monsters. We like that every member of area sports team puts the rest of their life mostly on hold and focuses on winning sporting events. The test is exactly what we want them to excel at. We also get to use the trick of testing them in discrete steps, via different games and portions of games, to prevent ‘risk’ from playing too much of a factor.
In most cases, where the match between test preparation, successful test strategies and desired skills is not so good, this highly toxic resource monster is very, very bad.
Consider school, or more generally childhood. The more we reward good performance on a test, and punish failure, the more resources are eaten alive by the test. In the extreme, all of most children’s experiences and resources, and even those of their parents, get eaten. From discussions I’ve had, much of high school in China has something remarkably close to this, as everything is dropped for years to cram for a life-changing college entrance exam.
Rule 5: Rewards must be able to step outside of a strict scoring mechanism.
Any scoring mechanism is vulnerable to gaming and to risk taking, and to Goodhart’s Law. To avoid everyone’s motivation, potentially their entire life and being, being subverted, we need to be rewarding and punishing from the outside looking in on what is happening. This has to carry enough weight to be competitive with the prizes themselves.
Consider this metaphor.
If the real value of many journeys is the friends you made along the way, that can be true in both directions. Often one’s friends, experiences and lessons end up dwarfing in importance the prize or motivation one started out with; frequently we need a McGuffin and restrictions that breed creativity and focus to allow coordination, more than any prize.
It also works the other way. The value of your friends can be that they motivate and help you to be worthy of friendship, to do and accomplish things. The reason we took the journey the right way was so that we would make friends along it. This prevents us from falling to Goodhart’s Law. We don’t narrow in on checking off a box. Even in a pure competition, like a Magic tournament, we know the style points matter, and we know that it matters whether we think the style points matter, and so on.
The existence of the social, of various levels and layers, the ability to step outside the game, and the worry about unknown unknowns, is what guards systems from breakdown under the pressure of metrics. Given any utility function we know about, however well designed, and sufficient optimization pressure, things end badly. You need to preserve the value of unknown unknowns.
This leads us to:
Rule 6: Too much knowledge by potential competitors can be very bad.
The more competitors do the ‘natural’ thing, that maximizes their expected output, the better off we usually are. The less they know about how they are being evaluated, on what levels, with what threshold of success, the less they can game the system, and the less success depends on gaming skill or luck.
All the truly perverse outcomes came from scenarios where competitors knew they were desperadoes, and taking huge risks was not actually risky for them.
Having a high threshold is only bad if competitors know about it. If they don’t know, it can’t hurt you. If they suspect a high threshold, but they don’t know, that mitigates a lot of the damage. In many cases, the competitor is better served by playing to succeed in the worlds where the threshold is low, and accept losing when the threshold is unexpectedly high, which means doing exactly what you want. More uncertainty also makes the choices of others less certain, which makes situations harder to game effectively.
Power hides information. Power does not reveal its intentions. This is known, and the dynamics explored here are part of why. You want people optimizing for things you won’t even be aware of, or don’t care about, but which they think you might be aware of and care about. You want to avoid them trying too hard to game the things you do look at, which would also be bad. You make those in your power worry at every step that if they try anything, or fail in any way, it could be what costs them. You cause people to want to curry favor. You also allow yourself to alter the results, if they’re about to come out ‘wrong’. The more you reveal about how you work, the less power you have. In this case, the power to find worthy winners.
This is in addition to the fact that some considerations that matter are not legally allowed to be considered, and that lawsuits might fly, and other reasons why decision makers ensure that no one knows what they were thinking.
Thus we must work even harder to reward those who explain themselves and thereby help others, and who realize that the key hard thing is, as Hagbard Celine reminds us, to avoid power.
But still get things done.
An extremely basic question that, after months of engaging with AI safety literature, I'm surprised to realize I don't fully understand: why not tool AI?
AI Safety scenarios seem to conceive of AI as an autonomous agent. Is that because of the current machine learning paradigm, where we're setting the AI's goals but not specifying the steps to get there? Is this paradigm the entire reason why AI safety is an issue?
If so, is there a reason why advanced AI would need an agenty utility function sort of set up? Is it just too cumbersome to give step by step instructions for high level tasks?
I always had fairly good mathematical thinking (I think) and loved learning about beautiful concepts in math, but I didn't learn much at all in school (because I had the choice). You can say I was "utilitarian" regarding learning math: I didn't do it if I didn't see how it could enrich my life.
So my knowledge of math is quite disorganized. I know more about Bayes' theorem than about many much simpler concepts (I know, it really shouldn't be that way).
Now I want to be able to analyze data, but I don't want to learn math that I won't use for it, if possible.
So here's my question: what basic stuff do I need to learn in order to be able to calculate probabilities, do statistics and Bayesian math, and overall do things within data analysis that I may not yet be aware of?
If you also have suggestions for how to learn those things, after I learn the basics, it will be much appreciated.
thank you :)
In my last post, I argued that interaction between the human and the AI system was necessary in order for the AI system to “stay on track” as we encounter new and unforeseen changes to the environment. The most obvious implementation of this would be to have an AI system that keeps an estimate of the reward function. It acts to maximize its current estimate of the reward function, while simultaneously updating the reward through human feedback. However, this approach has significant problems.
Looking at the description of this approach, one thing that stands out is that the actions are chosen according to a reward that we know is going to change. (This is what leads to the incentive to disable the narrow value learning system.) This seems clearly wrong: surely our plans should account for the fact that our rewards will change, without treating such a change as adversarial? This suggests that we need to have our action selection mechanism take the future rewards into account as well.
While we don’t know what the future reward will be, we can certainly have a probability distribution over it. So what if we had uncertainty over reward functions, and took that uncertainty into account while choosing actions?
Setup
We’ve drilled down on the problem sufficiently far that we can create a formal model and see what happens. So, let’s consider the following setup:
- The human, Alice, knows the “true” reward function that she would like to have optimized.
- The AI system maintains a probability distribution over reward functions, and acts to maximize the expected sum of rewards under this distribution.
- Alice and the AI system take turns acting. Alice knows that the AI learns from her actions, and chooses actions accordingly.
- Alice’s action space is such that she cannot take the action “tell the AI system the true reward function” (otherwise the problem would become trivial).
- Given these assumptions, Alice and the AI system act optimally.
This is the setup of Cooperative Inverse Reinforcement Learning (CIRL). The optimal solution to this problem typically involves Alice “teaching” the AI system by taking actions that communicate what she does and does not like, while the AI system “asks” about parts of the reward by taking actions that would force Alice to behave in different ways for different rewards.
Does this solve our problems?
Two of the problems we identified in the last post are simply assumed away:
- Alice does not know the “true” reward function, but we assumed that she does.
- Alice may be unable to optimally give feedback to the AI system, but we assume that she is optimal here.
So this particular kind of reward uncertainty does not fix either of these problems. What about convergent instrumental subgoals?
Utility preservation. One major worry we had with the original setup was that the AI system would disable its narrow value learning system, as a manifestation of the instrumental goal of protecting its utility function. This is reversed in our setup: the AI system has a positive incentive to continue doing narrow value learning, since it helps it hone in on the true reward function, which in turn allows it to optimize the reward better. (We might worry that this prevents us from fixing any problems in the narrow value learning system, but that is a robustness problem: in the world where everything is working correctly, this is the correct incentive.)
Survival incentive, aka shutdown incorrigibility. Another worry is that the AI system has a survival incentive that causes it to prevent us from shutting it down. With reward uncertainty, the fact that we are trying to shut the AI system down is itself strong evidence about the reward function. The AI system should reason that its operation leads to worse outcomes, and so allow itself to be turned off.
The Off-Switch Game formalizes this reasoning in a simple setting. The AI system can either directly take action a (perhaps by disabling an off switch), can shut down, or can allow Alice to choose between these options. If the AI system allows Alice to choose, Alice then decides whether or not to shut down the AI system. Assuming that Alice is optimal, the AI system reasons that when Alice chooses to shut it down, the true reward function must have been one which makes the action a bad, and so it does better by deferring to Alice. However, when Alice is modeled as noisily rational instead of optimal, the AI system might reason that Alice might make a mistake when deciding to shut the AI system down, and so it might take action a directly without deferring to her. So, the AI system becomes shutdown corrigible, as long as it assumes that Alice is sufficiently rational.
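The comparison the AI system makes here can be sketched numerically. This is a toy illustration, not the formal model from the Off-Switch Game paper: it assumes the AI system's belief is a small discrete distribution over the utility u of action a (with shutting down worth 0), and that a noisy Alice permits a with a logistic probability in u. The `noise` parameter and all function names are hypothetical.

```python
import math

def permit_probability(u, noise):
    """Probability that Alice lets action a go ahead when its true utility is u.

    noise == 0 models an optimal Alice (permit exactly when u > 0).
    Larger noise means more mistakes; the logistic form is an assumption.
    """
    if noise == 0:
        return 1.0 if u > 0 else 0.0
    return 1.0 / (1.0 + math.exp(-u / noise))

def option_values(belief, noise):
    """Expected utility of each AI option under `belief`,
    a list of (probability, utility_of_a) pairs."""
    act = sum(p * u for p, u in belief)
    # Deferring yields u only in the cases where Alice permits a; otherwise 0.
    defer = sum(p * u * permit_probability(u, noise) for p, u in belief)
    return {"act": act, "shut_down": 0.0, "defer": defer}

def best_option(belief, noise):
    values = option_values(belief, noise)
    return max(values, key=values.get)

if __name__ == "__main__":
    belief = [(0.6, 1.0), (0.4, -1.0)]  # illustrative belief over u
    print(best_option(belief, noise=0))    # optimal Alice
    print(best_option(belief, noise=5.0))  # very noisy Alice
```

With this illustrative belief, deferring is best when Alice is optimal, because she filters out exactly the cases where a is bad. When she is very noisy, her decision carries little information and acting directly has the higher expected value.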
Should robots be obedient? makes a similar point, arguing that an AI system that learns preferences and then acts to maximize their satisfaction can perform better than an AI system that simply obeys instructions, because humans are not perfectly rational. This creates a tradeoff between performance and obedience (which shutdown corrigibility is an instance of).
Of course, these simple models exclude many actions that a realistic AI system could take. In particular, it seems likely that an AI system would prefer to disable the shutdown button, gather information about the reward until it has fully updated, and optimize the resulting set of rewards. If the space of reward functions is misspecified, as it likely will be, this will lead to bad behavior. (This is the point made by Incorrigibility in the CIRL Framework.)
Note though that while this cuts against shutdown corrigibility (since the AI system would prefer to disable the shutdown button), I would frame the problem differently. If the space of rewards is well-specified and has sufficient weight on the true reward function and the AI system is sufficiently robust and intelligent, then the AI system must update strongly on us attempting to shut it down. This should cause it to stop doing the bad thing it was doing. When it eventually narrows down on the reward it will have identified the true reward, which by definition is the right thing to optimize. So even though the AI system might disable its off switch, this is simply because it is better at knowing what we want than we are, and this leads to better outcomes for us. So, really the argument is that since we want to be robust (particularly to reward misspecification), we want shutdown corrigibility, and reward uncertainty is an insufficient solution for that.
A note on CIRL
There has been a lot of confusion on what CIRL is and isn’t trying to do, so I want to avoid adding to the confusion.
CIRL is not meant to be a blueprint for a value-aligned AI system. It is not the case that we could create a practical implementation of CIRL and then we would be done. If we were to build a practical implementation of CIRL and use it to align powerful AI systems, we would face many problems:
- As mentioned above, Alice doesn’t actually know the true reward function, and she may not be able to give optimal feedback.
- As mentioned above, in the presence of reward misspecification the AI system may end up optimizing the wrong thing, leading to catastrophic outcomes.
- Similarly, if the model of Alice’s behavior is incorrect, as it inevitably will be, the AI system will make incorrect inferences about Alice’s reward, again leading to bad behavior. As an example that is particularly easy to model, should the AI system model Alice as thinking about the robot thinking about Alice, or should it model Alice as thinking about the robot thinking about Alice thinking about the robot thinking about Alice? How many levels of pragmatics is the “right” level?
- Lots of other problems have not been addressed: the AI system might not deal with embeddedness well, or it might not be robust and could make mistakes, etc.
CIRL is supposed to bring conceptual clarity to what we could be trying to do in the first place with a human-AI system. In Dylan’s own words, “what cooperative IRL is, it’s a definition of how a human and a robot system together can be rational in the context of fixed preferences in a fully observable world state”. In the same way that VNM rationality informs our understanding of humans even though humans are not expected utility maximizers, CIRL can inform our understanding of alignment proposals, even though CIRL itself is unsuitable as a solution to alignment.
Note also that this post is about reward uncertainty, not about CIRL. CIRL makes other points besides reward uncertainty, that are well explained in this blog post, and are not mentioned here.
While all of my posts have been significantly influenced by many people, this post is especially based on ideas I heard from Dylan Hadfield-Menell. However, besides the one quote, the writing is my own, and may not reflect Dylan’s views.
The next AI Alignment Forum sequences post will be 'Capability Amplification' by Paul Christiano in the sequence on Iterated Amplification.
The next post in this sequence will be 'Following human norms' by Rohin Shah, on Saturday 19th Jan.
A large survey of self-regulatory strategies and how reportedly effective they are.
Much more in the typical "self-regulatory" paradigm for self-help psychology, where people are assumed to have a lot more control over the strategies they choose (i.e. downplaying interactions between actions and attitudes), but I'm curious what people's thoughts are.