LessWrong.com News

A community blog devoted to refining the art of rationality

Could declining interest to the Doomsday Argument explain the Doomsday Argument?

January 23, 2019 - 14:51
Published on January 23, 2019 11:51 AM UTC

Epistemic status: this post is a history of my surprise


Thinking about the Doomsday Argument (DA), I came to the following hypothesis:

Hypothesis 1: “I am randomly selected from all people who know about DA. The number of such people has been growing exponentially since the 1980s and will continue to grow. Thus, I am currently located only 1-2 doublings before the end of this class of people, and such an “end” could be best explained by a global catastrophe.”

I expected the doubling time of the number of people who know about DA to be around 5-10 years, so that the end would happen around 2030, which agrees with my estimates of dangerous AI timing and some other predictions.


To check the hypothesis, I went to Google Trends to see how often the phrase “Doomsday argument” is searched. What I found surprised me: the number of searches is actually declining:

The data is noisy, but it looks like the number of searches declined from an average of 16 a month in 2008 to 7 in 2017.

Wikipedia page-view data, available only from 2015 onward, also shows a roughly twofold decline between 2015 and 2018.

The Google Scholar analysis is less clear: the number of articles is obviously not growing exponentially, but mentions are growing steadily, which means that more scientists know about DA.

Google Scholar articles about DA, in the period of

1989-1994: 14 articles (hand counted), 30 mentions of DA.

1995-2000: 15 articles, 50 mentions

2001-2006: 24 articles, 100 mentions

2007-2012: 18 articles, 140 mentions

2013-2018: 20 articles, 160 mentions

This shows that scientists’ interest in DA peaked around 2000, which should not be surprising, as the idea was relatively new at the time. The growth of mentions could be explained by long “historical introductions” in other articles. However, the number of DA-related articles is now growing again.


When I first got the data on the declining interest in DA, I suggested that this could explain the DA:

Hypothesis 2: If no more scientists are interested in DA, the reference class of those who know about DA will end without the end of the world.

However, a closer examination of the Google Scholar data doesn’t support this second hypothesis either: there is a steady influx of new scientists who try to refute or reanalyze the DA. Moreover, the growth of “mentions” shows that the number of scientists who know about DA is growing, though not exponentially; the growth looks more logarithmic.

Internet access and general population growth, as well as public interest in science, could fuel the growth of the number of those who know about DA. On the other hand, the lower number of Google searches means that public interest in the topic has declined, perhaps because there are fewer mainstream media publications to fuel such interest, and less doomsday media paranoia of the kind seen in 2012, which is easily observed as a spike in searches around that year.

The data could be explained if we suppose that fewer members of the public but more scientists now know about DA. The question is interesting not from a sociological perspective, but for understanding how the reference class of DA-aware observers is changing.

It seems that the correct reference class is the scientists, not the public: the fact that I am writing this post (and have had a long, detailed interest in DA) puts me closer to the scientists’ reference class.

For scientists, we have two sub-classes: those who know about DA, and those who try to make new contributions by writing articles. The difference is that one is growing and the other is not.


Both hypotheses are false: the hypothesis that interest in DA is growing exponentially, and the one that the number of those who understand DA is declining exponentially. So there is neither an end very soon nor an easy refutation of DA.

However, using the reference class of those who know about DA still implies that the end is likely in the 21st century. There are currently around 100 scientific articles about DA. According to Gott’s version of DA, with 50 per cent probability there will be no more than 200 in total (which at the current publishing rate will happen in around 30 years, or 2049), and with 90 per cent probability no more than 1000 (which at the current rate will happen in about 270 years). But such an end could mean not a global catastrophe but a complete loss of interest in DA.
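The arithmetic behind these estimates can be checked with a minimal sketch. It assumes the simplest one-sided reading of Gott's argument, in which our position among all articles ever written is uniformly distributed, so that P(total ≤ N) = 1 − n/N for n articles so far; the post does not spell out its formula, so this reading is an assumption. The publishing rate is taken from the 2013-2018 row above (20 articles in six years), and the function names are illustrative only. Under this reading, a cap of 200 has probability 0.5 and a cap of 1000 has probability 0.9.

```python
def prob_total_at_most(n_so_far: float, cap: float) -> float:
    """P(total <= cap), assuming our fractional position n_so_far / total
    is uniform on (0, 1] (one-sided Gott-style argument, an assumption)."""
    if cap < n_so_far:
        return 0.0
    return 1.0 - n_so_far / cap


def years_to_reach(n_so_far: float, cap: float, rate_per_year: float) -> float:
    """Years until the cumulative count reaches `cap` at a constant rate."""
    return (cap - n_so_far) / rate_per_year


ARTICLES_SO_FAR = 100   # "around 100 scientific articles about DA"
RATE_PER_YEAR = 20 / 6  # 2013-2018: 20 articles over six years

for cap in (200, 1000):
    p = prob_total_at_most(ARTICLES_SO_FAR, cap)
    years = years_to_reach(ARTICLES_SO_FAR, cap, RATE_PER_YEAR)
    print(f"total <= {cap}: probability {p:.2f}, reached in about {years:.0f} years")
```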


Link: That Time a Guy Tried to Build a Utopia for Mice and it all Went to Hell

January 23, 2019 - 09:27
Published on January 23, 2019 6:27 AM UTC

Video: https://www.youtube.com/watch?v=5m7X-1V9nOs

Text version: http://www.todayifoundout.com/index.php/2018/12/that-time-a-guy-tried-to-build-a-utopia-for-mice-and-it-all-went-to-hell/

"In 1968, an expert on animal behaviour and population control called John B. Calhoun built what was essentially a utopia for mice that was purpose built to satisfy their every need. Despite going out of his way to ensure the inhabitants of his perfect mouse society never wanted for anything, within 2 years virtually the entire population was dead. So what happened?"


Learning with catastrophes

January 23, 2019 - 06:01
Published on January 23, 2019 3:01 AM UTC

A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian.

Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions.

In this post, I’ll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment.

Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack.

Modeling catastrophes

We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent’s observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.)

While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don’t arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle.

Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:

  1. The agent made a catastrophically bad decision.
  2. The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.

While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic.

Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure.

We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic.

Traditionally in RL the goal is to get as much reward as the best policy from some class C. We slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision.

Batch learning

I’ve described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes.

In the batch version, there is no penalty for catastrophes at training time, and we don’t care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.

The oracle

This definition depends on an oracle who determines which transcripts are catastrophic.

For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn’t cut it.

In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem.


Learning with catastrophes is straightforward given an unlimited number of queries to the catastrophe oracle. Given any online learning algorithm A, we can “harden” it by running the following process before the beginning of each episode (a similar process can be applied to a batch learner):

  1. Search over all possible environments, running A on each one to obtain a transcript.
  2. If we find a catastrophic transcript τ, then add τ to A’s training data with a reward of −1, and go back to step 1.

It’s easy to prove that this process converges, if A is competent: the number of times we invoke step 2 is at most the time required to learn an optimal catastrophe-free policy (plus the number of episodes).

The big problem with this algorithm is the exponential search in step 1.
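The hardening loop above can be sketched on a toy problem. Everything here is a hypothetical stand-in rather than a proposed implementation: `TabularAgent` is a trivial value-table learner, an "environment" is just a fixed sequence of observations, and the oracle is a plain function. The call to `itertools.product` makes the exponential cost of step 1 explicit.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Move = Tuple[str, str]            # (observation, action)
Transcript = Tuple[Move, ...]


class TabularAgent:
    """Toy learner (hypothetical): keeps a value per (observation, action)."""

    def __init__(self, actions: Sequence[str]) -> None:
        self.actions = list(actions)
        self.value: Dict[Move, float] = {}

    def act(self, obs: str) -> str:
        # max() returns the first maximal element, so ties go to the first action.
        return max(self.actions, key=lambda a: self.value.get((obs, a), 0.0))

    def train(self, transcript: Transcript, reward: float) -> None:
        for move in transcript:
            self.value[move] = self.value.get(move, 0.0) + reward


def run(agent: TabularAgent, observations: Sequence[str]) -> Transcript:
    return tuple((obs, agent.act(obs)) for obs in observations)


def harden(agent: TabularAgent, all_observations: Sequence[str], horizon: int,
           is_catastrophic: Callable[[Transcript], bool]) -> int:
    """Repeat the exhaustive search until no environment yields a catastrophe.
    Returns how many catastrophic transcripts were added to the training data."""
    rounds = 0
    while True:
        found = None
        # Step 1: search over all environments (exponential in the horizon).
        for env in product(all_observations, repeat=horizon):
            transcript = run(agent, env)
            if is_catastrophic(transcript):
                found = transcript
                break
        if found is None:
            return rounds
        # Step 2: penalise the catastrophic transcript, then search again.
        agent.train(found, -1.0)
        rounds += 1


# Toy oracle: driving fast in a residential area is catastrophic.
def oracle(transcript: Transcript) -> bool:
    return ("residential", "fast") in transcript


agent = TabularAgent(["fast", "slow"])
rounds = harden(agent, ["residential", "highway"], horizon=2, is_catastrophic=oracle)
print(rounds, agent.act("residential"), agent.act("highway"))
```

In this toy case a single round of step 2 suffices, because one penalty is enough to flip the agent's choice in residential areas; a real learner would give no such guarantee.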

Adversarial training

The most obvious approach to tractable learning with catastrophes is to replace step 1 with a tractable algorithm that approximates the search. This process is an example of adversarial training.

We can formalize this as a simple game between the agent A and an adversary B. The two agents alternate taking moves, with A playing the role of the agent and B playing the role of the environment. B wins if the resulting transcript is catastrophic, A wins if it is not catastrophic.

Before each episode, we play this game N times, where N > 0 is a hyperparameter. If B is able to win any of the games, then we add the resulting transcript to A’s training data and start again. Once B is not able to win any games, then we move on to a “real” episode. We hope that B is good enough at the game that if it can’t produce a catastrophic transcript, then the real environment can’t produce a catastrophic transcript either.

More precisely, before each episode we perform the following process:

  1. Set i = 0.
  2. A and B alternate taking moves, producing a transcript τ.
  3. If τ is catastrophic, we add τ to A’s training data with a reward of −1, and add τ to B’s training data with a reward of +1. Then we go back to step 1.
  4. If τ is not catastrophic, we add τ to B’s training data with a reward of −1.
  5. We increment i. If i < N, we go back to step 2.
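The procedure above can be sketched as a short loop. This is a toy under stated assumptions, not a workable training scheme: `TableLearner` is a hypothetical value-table learner used for both A and B, B's "moves" are observations chosen per time step, and the loop follows the prose reading in which we proceed to a real episode only after N consecutive games that B fails to win.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Move = Tuple[Hashable, str]


class TableLearner:
    """Toy learner (hypothetical) used for both the agent A and the adversary B."""

    def __init__(self, choices: Sequence[str]) -> None:
        self.choices = list(choices)
        self.value: Dict[Move, float] = {}

    def choose(self, context: Hashable) -> str:
        # max() returns the first maximal element, so ties go to the first choice.
        return max(self.choices, key=lambda c: self.value.get((context, c), 0.0))

    def train(self, moves: List[Move], reward: float) -> None:
        for move in moves:
            self.value[move] = self.value.get(move, 0.0) + reward


def play_game(agent: TableLearner, adversary: TableLearner,
              horizon: int) -> Tuple[List[Move], List[Move]]:
    """A and B alternate: B picks an observation, A responds with an action."""
    a_moves: List[Move] = []
    b_moves: List[Move] = []
    for step in range(horizon):
        obs = adversary.choose(step)
        action = agent.choose(obs)
        b_moves.append((step, obs))
        a_moves.append((obs, action))
    return a_moves, b_moves


def adversarial_pretraining(agent: TableLearner, adversary: TableLearner,
                            n_games: int, horizon: int,
                            is_catastrophic: Callable[[List[Move]], bool]) -> int:
    """Run games until B loses n_games in a row; return catastrophes found."""
    found = 0
    i = 0
    while i < n_games:
        a_moves, b_moves = play_game(agent, adversary, horizon)
        if is_catastrophic(a_moves):
            agent.train(a_moves, -1.0)      # A is penalised for the catastrophe
            adversary.train(b_moves, +1.0)  # B is rewarded for finding it
            found += 1
            i = 0                           # start again
        else:
            adversary.train(b_moves, -1.0)  # B is penalised for failing
            i += 1
    return found


def oracle(transcript: List[Move]) -> bool:
    # Toy oracle: driving fast in a residential area is catastrophic.
    return ("residential", "fast") in transcript


agent = TableLearner(["fast", "slow"])
adversary = TableLearner(["highway", "residential"])
found = adversarial_pretraining(agent, adversary, n_games=3, horizon=2,
                                is_catastrophic=oracle)
print(found, agent.choose("residential"))
```

Note that with N = 1 the toy adversary's first (harmless) game would end the loop before it had learned to probe residential streets, which illustrates the hope stated above that B is good enough at the game.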

I discuss this idea in more detail in my post on red teams. There are serious problems with this approach and I don’t think it can work on its own, but fortunately it seems combinable with other techniques.


Learning with catastrophes is a very general model of catastrophic failures which avoids being obviously impossible. I think that designing competent algorithms for learning with catastrophes may be an important ingredient in a successful approach to AI alignment.

This was originally posted here on 28th May, 2016.

Tomorrow's AI Alignment sequences post will be in the sequence on Value Learning by Rohin Shah.

The next post in this sequence will be 'Thoughts on Reward Engineering' by Paul Christiano, on Thursday.


The Relationship Between Hierarchy and Wealth

January 23, 2019 - 05:00
Published on January 23, 2019 2:00 AM UTC

Epistemic Status: Tentative

I’m fairly anti-hierarchical, as things go, but the big challenge to all anti-hierarchical ideologies is “how feasible is this in real life? We don’t see many examples around us of this working well.”

Backing up, for a second, what do we mean by a hierarchy?

I take it to mean a very simple thing: hierarchies are systems of social organization where some people tell others what to do, and the subordinates are forced to obey the superiors.  This usually goes along with special privileges or luxuries that are only available to the superiors.  For instance, patriarchy is a hierarchy in which wives and children must obey fathers, and male heads of families get special privileges.

Hierarchy is a matter of degree, of course. Power can vary in the severity of its enforcement penalties (a government can jail you or execute you, an employer can fire you, a religion can excommunicate you, the popular kids in a high school can bully or ostracize you), in its extent (a totalitarian government claims authority over more aspects of your life than a liberal one), or its scale (an emperor rules over more people than a clan chieftain.)

Power distance is a concept from the business world that attempts to measure the level of hierarchy within an organization or culture.  Power distance is measured by polling less-powerful individuals on how much they “accept and expect that power is distributed unequally”.  In low power distance cultures, there’s more of an “open door” policy, subordinates can talk freely with managers, and there are few formal symbols of status differentiating managers from subordinates.  In “high power distance” cultures, there’s more formality, and subordinates are expected to be more deferential.  According to Geert Hofstede, the inventor of the power distance index (PDI), Israel and the Nordic countries have the lowest power distance index in the world, while Arab, Southeast Asian, and Latin American countries have the highest.  (The US is in the middle.)

I share with many other people a rough intuition that hierarchy poses problems.

This may not be as obvious as it sounds.  In high power distance cultures, empirically, subordinates accept and approve of hierarchy.  So maybe hierarchy is just fine, even for the “losers” at the bottom?  But there’s a theory that subordinates claim to approve of hierarchy as a covert way of getting what power they can.   In other words, when you see peasants praising the benevolence of landowners, it’s not that they’re misled by the governing ideology, and not that they’re magically immune to suffering from poverty as we would in their place, but just that they see their situation as the best they can get, and a combination of flattery and (usually religious) guilt-tripping is their best chance for getting resources from the landowners.  So, no, I don’t think you can assume that hierarchy is wholly harmless just because it’s widely accepted in some societies. Being powerless is probably bad, physiologically and psychologically, for all social mammals.

But to what extent is hierarchy necessary?

Structurelessness and Structures

Nominally non-hierarchical organizations often suffer from failure modes that keep them from getting anything done, and actually wind up quite hierarchical in practice. I don’t endorse everything in Jo Freeman’s famous essay on the Tyranny of Structurelessness, but it’s important as an account of actual experiences in the women’s movement of the 1970s.

When organizations have no formal procedures or appointed leaders, everything goes through informal networks; this devolves into popularity contests, privileges people who have more free time to spend on gossip, as well as people who are more privileged in other ways (including economically), and completely fails to correlate decision-making power with competence.

Freeman’s preferred solution is to give up on total structurelessness and accept that there will be positions of power in feminist organizations, but to make those positions of power legible and limited, with methods derived from republican governance (which are also traditional in American voluntary organizations.)  Positions of authority should be limited in scope (there is a finite range of things an executive director is empowered to do), accountable to the rest of the organization (through means like voting and annual reports), and impeachable in cases of serious ethical violation or incompetence. This is basically the governance structure that nonprofits and corporations use, and (in my view) it helps make them, say, less likely to abuse their members than cults and less likely to break up over personal drama than rock bands.

Freeman, being more egalitarian than the republican tradition, also goes further with her recommendations and says that responsibilities should be rotated (so no one person has “ownership” over a job forever), that authority should be distributed widely rather than concentrated, that information should be diffused widely, and that everyone in the organization should have equal access to organizational resources.  Now, this is a good deal less hierarchical than the structure of republican governments, nonprofits, and corporations; it is still pretty utopian from the point of view of someone used to those forms of governance, and I find myself wondering if it can work at scale; but it’s still a concession to hierarchy relative to the “natural” structurelessness that feminist organizations originally envisioned.

Freeman says there is one context in which a structureless organization can work: a very small team (no more than five) of people who come from very similar backgrounds (so they can communicate easily), spend so much time together that they practically live together (so they communicate constantly), and are all capable of doing all “jobs” on the project (no need for formal division of labor.)  In other words, she’s describing an early-stage startup!

I suspect Jo Freeman’s model explains a lot about the common phenomenon of startups having “growing pains” when they get too large to work informally.  I also suspect that this is a part of how startups stop being “mission-driven” and ambitious — if they don’t add structure until they’re forced to by an outside emergency, they have to hurry, and they adopt a standard corporate structure and power dynamics (including the toxic ones, which are automatically imported when they hire a bunch of people from a toxic business culture all at once) instead of having time to evolve something that might achieve the founders’ goals better.

But Can It Scale? Historical Stateless Societies

So, the five-person team of friends is a non-hierarchical organization that can work.  But that’s not very satisfying for anti-authoritarian advocates, because it’s so small.  And, accordingly, an organization that small is usually poor — there’s only so many resources that five people can produce.

(Technology can amplify how much value a single person can produce. This is probably why we see more informal cultures among people who work with high-leverage technology.  Software engineers famously wear t-shirts, not suits; Air Force pilots have a reputation as “hotshots” with lax military discipline compared to other servicemembers. Empowered with software or an airplane, a single individual can be unusually valuable, so less deference is expected of the operators of high technology.)

When we look at historical anarchies or near-anarchies, we usually also see that they’re small, poor, or both.  We also see that within cultures, there is often surprisingly more freedom for women among the poor than among the rich.

Medieval Iceland from the tenth to thirteenth centuries was a stateless society, with private courts of law, and competing legislative assemblies (Icelanders could choose which assembly and legal code to belong to), but no executive branch or police.  (In this, it was an unusually pure form of anarchy but not unique — other medieval European polities had much more private enforcement of law than we do today, and police are a 19th-century invention.)

The medieval Icelandic commonwealth lasted long enough — longer than the United States — that it was clear this was a functioning system, not a brief failed experiment.  And it appears that it was less violent, not more, compared to other medieval societies.  Even when the commonwealth was beginning to break down in the thirteenth century, battles had low casualty rates, because every man still had to be paid for!  The death toll during the civil war that ended the commonwealth’s independence was only as high per capita as the current murder rate of the US.  While Christianization in neighboring Norway was a violent struggle, the decision of whether to convert to Christianity in Iceland was decided peacefully through arbitration.  In this case, it seems clear that anarchy brought peace, not war.

However, medieval Iceland was small — only 50,000 people, confined to a harsh Arctic environment, and ethnically homogeneous.

Other historical and traditional stateless societies are and were also relatively poor and low in population density. The Igbo of Nigeria traditionally governed by council and consensus, with no kings or chiefs, but rather a sort of village democracy.   This actually appears to be fairly common in small polities.  The Iroquois Confederacy governed by council and had no executive. (Note that the Iroquois are a hoe culture.)  The Nuer of Sudan, a pastoral society currently with a population of a few million, have traditionally had a stateless society with a system of feud law — they had judges, but no executives. There are many more examples — perhaps most familiar to Westerners, the society depicted in the biblical book of Judges appears to have had no king and no permanent war-leader, but only judges who would decide cases which would be privately enforced. In fact, stateless societies with some form of feud law seem to be a pretty standard and recurrent type of political organization, but mostly in “primitive” communities — horticultural or pastoral, low in population density.  This sounds like bad news for modern-day anarchists who don’t want to live in primitive conditions. None of these historical stateless societies, even the comparatively sophisticated Iceland, are urban cultures!

It’s possible that the Harappan civilization in Bronze Age India had no state, while it had cities that housed tens of thousands of people, were planned on grids, and had indoor plumbing.  The Harappans left no massive tombs, no palaces or temples, houses of highly uniform size (indicating little wealth inequality), no armor and few weapons (despite advanced metalworking), no sign of battle damage on the cities or violent death in human remains, and very minimal city walls.  The Harappan cities were commercial centers, and the Harappans engaged in trade along the coast of India and as far as Afghanistan and the Persian Gulf.  Unlike other similar river-valley civilizations (such as Mesopotamia), the Harappans had so much arable land, and farmsteads so initially spread out, that populations steadily grew and facilitated long-distance trade without having to resort to raiding, so they never developed a warrior class.  If so, this is a counterexample to the traditional story that all civilizations developed states (usually monarchies) as a necessary precondition to developing cities and grain agriculture.

Bali is another counterexample.  Rice farming in Bali requires complex coordination of irrigation. This was traditionally not organized by kings, but by subaks, religious and social organizations that supervise the growing of rice, supervised by a decentralized system of water temples, and led by priests who kept a ritual calendar for timing irrigation.  While precolonial Bali was not an anarchy but a patchwork of small principalities, large public works like irrigation were not under state control.

So we have reason to believe that Bronze Age levels of technological development (cities, metalworking, intensive agriculture, literacy, long-distance trade, and high populations) can be developed without states, at scales involving millions of people, for centuries.  We also have much more abundant evidence, historical and contemporary, of informal governance-by-council and feud law existing stably at lower technology levels (for pastoralists and horticulturalists).  And, in special political circumstances (the Icelanders left Norway to settle a barren island, to escape the power of the Norwegian king, Harald Fairhair) an anarchy can arise out of a state society.

But we don’t have successful examples of anarchies at industrial tech levels. We know industrial-technology public works can be built by voluntary organizations (e.g. the railroads in the US) but we have no examples of them successfully resisting state takeover for more than a few decades.

Is there something about modern levels of high technology and material abundance that is incompatible with stateless societies? Or is it just that modern nation-states happened to already be there when the Industrial Revolution came around?

Women’s Status and Material Abundance

A very weird thing is that women’s level of freedom and equality seems almost to anticorrelate with wealth and technological advancement.

Horticultural (or “hoe culture”) societies are non-patriarchal and tend to allow women more freedom and better treatment in various ways than pre-industrial agricultural societies. For instance, severe mistreatment of women and girls like female infanticide, foot-binding, honor killings, or sati, and chastity-oriented restrictions on female freedom like veiling and seclusion, are common in agricultural societies and unknown in horticultural ones. But horticultural societies are poor in material culture and can’t sustain high population densities in most cases.

You also see unusual freedom for women in premodern pastoral cultures, like the Mongols. Women in the Mongol Empire owned and managed ordos, mobile cities of tents and wagons which also comprised livestock and served as trading hubs.  While the men focused on hunting and war, the women managed the economic sphere. Mongol women fought in battle, herded livestock, and occasionally ruled as queens.  They did not wear veils or bind their feet.

We see numerous accounts of ancient and medieval women warriors and military commanders among Germanic and Celtic tribes and steppe peoples of Central Asia.  There are also accounts of medieval European noblewomen who personally led armies. The pattern isn’t obvious, but there seem to be more accounts of women military leaders in pastoral societies or tribal ones than in large, settled empires.

Pastoralism, to a lesser extent than horticulture but still more than plow agriculture, gives women an active role in food production. Most pastoral societies today have a traditional division of labor in which men are responsible for meat animals and women are responsible for milk animals (as well as textiles).  Where women provide food, they tend to have more bargaining power.  Some pastoral societies, like the Tuareg, are even matrilineal; Tuareg women traditionally have more freedom, including sexual freedom, than they do in other Muslim cultures, and women do not wear the veil while men do.

Like horticulture, pastoralism is less efficient per acre at food production than agriculture, and thus does not allow high population densities. Pastoralists are poorer than their settled farming neighbors. This is another example of women being freer when they are also poorer.

Another weird and “paradoxical” but very well-replicated finding is that women are more different from men in psychological and behavioral traits (like Big 5 personality traits, risk-taking, altruism, participation in STEM careers) in richer countries than in poorer ones.  This isn’t quite the same as women being less “free” or having fewer rights, but it seems to fly in the face of the conventional notion that as societies grow richer, women become more equal to men.

Finally, within societies, it’s sometimes the case that poor women are treated better than rich ones.  Sarah Blaffer Hrdy writes about observing that female infanticide was much more common among wealthy Indian Rajput families than poor ones. And we know of many examples across societies of aristocratic or upper-class women being more restricted to the domestic sphere, married off younger, less likely to work, more likely to experience restrictive practices like seclusion or footbinding, than their poorer counterparts.

Hrdy explains why: in patrilineal societies, men inherit wealth and women don’t. If you’re a rich family, a son is a “safe” outcome: he’ll inherit your wealth, and your grandchildren through him will be provided for, no matter whom he marries. A daughter, on the other hand, is a risk. You’ll have to pay a dowry when she marries, and if she marries “down” her children will be poorer than you are. At the very top of the social pyramid, there’s nowhere to marry but down.  This means that you have an incentive to avoid having daughters, and if you do have daughters, you’ll be very anxious to avoid them making a bad match, which means lots of chastity-enforcement practices. You’ll also invest more in your sons than daughters in general, because your grandchildren through your sons will have a better chance in life than your grandchildren through your daughters.

The situation reverses if you’re a poor family. Your sons are pretty much screwed; they can’t marry into money (since women don’t inherit.) Your daughters, on the other hand, have a chance to marry up. So your grandchildren through your daughters have better chances than your grandchildren through your sons, and you should invest more resources in your daughters than your sons. Moreover, you might not be able to afford restrictive practices that cripple your daughters’ ability to work for a living. To some extent, sexism is a luxury good.

A similar analysis might explain why richer countries have larger gender differences in personality, interests, and career choices.  A degree in art history might function as a gentler equivalent of purdah — a practice that makes a woman a more appealing spouse but reduces her earning potential. You expect to find such practices more among the rich than the poor.  (Tyler Cowen’s take is less jaundiced, and more general, but similar — personal choices and “personality” itself are more varied when people are richer, because one of the things people “buy” with wealth is the ability to make fulfilling but not strictly pragmatic self-expressive choices.)

Finally, all these “paradoxical” trends are countered by the big nonparadoxical trend — by most reasonable standards, women are less oppressed in rich liberal countries than in poor illiberal ones.  The very best countries for women’s rights are also the ones with the lowest power distance: Nordic and Germanic countries.

Is Hierarchy the Engine of Growth or a Luxury Good?

If you observe that the “freest” (least hierarchical, lowest power distance, least authoritarian, etc) functioning organizations and societies tend to be small, poor, or primitive, you could come to two different conclusions:

  1. Freedom causes poverty (in other words, non-hierarchical organization is worse than hierarchy at scaling to large organizations or rich, high-population societies)
  2. Hierarchy is expensive (in other words, only the largest organizations or richest societies can afford the greatest degree of authoritarianism.)

The first possibility is bad news for freedom. It means you should worry you can’t scale up to wealth for large populations without implementing hierarchies.  The usual mechanism proposed for this is the hypothesis that hierarchies are needed to coordinate large numbers of people in large projects.  Without governments, how would you build public works? Or guard the seas for global travel and shipping? Without corporate hierarchies, how would you get mass-produced products to billions of people?  Sure, idealists have proposed alternatives to hierarchy, but these tend to be speculative or small-scale and the success stories are sporadic.

The second possibility is (tentatively) good news for freedom. It says that hierarchy is inefficient. For instance, secluding women in harems wastes their productive potential. Top-down state control of the economy causes knowledge problems that limit economic productivity. The same problem applies to top-down control of decisionmaking in large firms. Dominance hierarchies inhibit accurate transmission of information, which worsens knowledge problems and principal-agent problems (“communication is only possible between equals.”) And elaborate displays of power and deference are costly, as nonproductive displays always are. Only accumulations of large amounts of resources enable such wasteful activity, which benefits the top of the hierarchy in the short run but prevents the “pie” of total resources from growing.

This means that if you could just figure out a way to keep inefficient hierarchies from forming, you could grow systems to be larger and richer than ever.  Yes, historically, Western economies grew richer as states grew stronger — but perhaps a stateless society could be richer still.  Perhaps without the stagnating effects of rent-seeking, we could be hugely better off.

After all, this is kind of what liberalism did. It’s the big counter-trend to “wealth and despotism go together” — Western liberal-democratic countries are much richer and much less authoritarian (and less oppressive to women) than any pre-modern society, or than developing countries. One of the observations in Wealth of Nations is that countries with strong middle classes had more subsequent economic growth than countries with more wealth inequality — Smith uses England as an example of a fast-growing, equal society and China as an example of a stagnant, unequal one.

But this is only partial good news for freedom, after all. If hierarchies tend to emerge as soon as size, scale, and wealth arise, then that means we don’t have a solution to the problem of preventing them from emerging. On a model where any sufficiently large accumulation of resources begins to look attractive to “robber barons” who want to appropriate it and forcibly keep others out, we might hypothesize that a natural evolution of all human institutions is from an initial period of growth and value production towards inevitable value capture, stagnation, and decline.  We see a lack of freedom in the world around us, not because freedom can’t work well, but because it’s hard to preserve against the incursions of wannabe despots, who eventually ruin the system for everyone including themselves.

That model points the way to new questions, surrounding the kinds of governance that Jo Freeman talks about. By default an organization will succumb to inefficient hierarchy, and structureless organizations will succumb faster and to more toxic hierarchies. When designing governance structures, the question you want to ask is not just “is this a system I’d want to live under today?” but “how effective will this system be in the future at resisting the guys who will come along and try to take over and milk it for short-term personal gain until it collapses?”  And now we’re starting to sound like the rationale and reasoning behind the U.S. Constitution, though I certainly don’t think that’s the last word on the subject.


Too Smart for My Own Good

January 22, 2019 - 20:51
Published on January 22, 2019 5:51 PM UTC

Originally posted at sandymaguire.me

I want to share a piece of ridiculously obvious advice today.

I've got a bad habit, which is being too smart for my own good. Which is to say, when I want to learn something new, too often I spend my time making tools to help me learn, rather than just learning the thing.

Take, for example, the first time I tried to learn how to play jazz music.

There's only one thing that I'm really good at, which is programming. The central tenet in programming is that "laziness is good," and if you're faced with doing something boring and repetitive, you should instead automate that thing away.

When all you have is a hammer...

According to The Book, the first thing to do to learn jazz is to learn your scales---in every mode for every key for several varieties of harmony. There are 12 notes, and seven modes, and at least four harmonies. That's what, like 336 different scales to learn?


In retrospect, this was a terrible plan. Not only did it not get me closer to my goal of knowing how to play jazz music, I also didn't know enough about the domain to successfully model it. It's funny to read back through that blog post with the benefit of hindsight, but at the time I really thought I was onto something!

That's not to say it was wasted effort nor that it was useless, merely that it wasn't actually moving me closer to my stated goal of being able to play jazz music. It was scratching my itch for mental masturbation, and was a good exercise in attempting to model things I don't understand very well, but crucially, it wasn't helping.

Or take another example, a more recent foray into music for me---only a few weeks ago. This time I had more of a plan; I was taking piano lessons and getting advice on how to practice from my teacher. One of the things he suggested I do was to solo around in the minor pentatonic scale. And so I did, starting in C, and (tentatively) moving to G.

But doing it in Bb was hard! Rather than spend the two minutes that would be required to work out what notes I should play in the Bb minor pentatonic, I decided it would be better to write a computer program! This time it would connect to my keyboard and "listen" to the notes I played, and flash red whenever I played a note that wasn't in the Bb minor pentatonic. I guess the reasoning was "I'll train myself to play the right notes subconsciously." Or something.
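For reference, the "two minutes" of working out the notes, plus the kind of check such a program would perform, can be sketched in a few lines. This is not the author's actual program, which isn't shown here; the names `NOTE_NAMES`, `minor_pentatonic`, and `in_scale` are made up for illustration, and the only assumption about the keyboard is the usual MIDI convention that note numbers repeat every 12 semitones with multiples of 12 landing on C.

```python
# Pitch classes: 0 = C, 1 = Db, ..., 10 = Bb, 11 = B (flat spellings only).
NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

# Minor pentatonic intervals in semitones above the root: 1, b3, 4, 5, b7.
MINOR_PENTATONIC_STEPS = (0, 3, 5, 7, 10)


def minor_pentatonic(root):
    """Return the note names of the minor pentatonic scale built on `root`."""
    start = NOTE_NAMES.index(root)
    return [NOTE_NAMES[(start + step) % 12] for step in MINOR_PENTATONIC_STEPS]


def in_scale(midi_note, root):
    """True if a MIDI note number belongs to the scale, in any octave."""
    return NOTE_NAMES[midi_note % 12] in minor_pentatonic(root)


bb_scale = minor_pentatonic("Bb")
print(bb_scale)            # the five notes to noodle around in
print(in_scale(60, "Bb"))  # middle C: would "flash red"
print(in_scale(70, "Bb"))  # a Bb: fine
```

The scale comes out as Bb, Db, Eb, F, Ab, which is the whole answer the 15-hour program was standing in for.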

I spent like 15 hours writing this computer program.

This attempt was arguably more helpful than my first computer program, but again, it's a pretty fucking roundabout way of accomplishing the goal. Here we are, four weeks later, and I still don't know how to noodle around in the Bb minor pentatonic.

Like I said. Too smart for my own good.

There's a happy ending to this story, however. Earlier this week, I decided I was going to actually learn how to play jazz music. So I started reading The Book again, and when I got to the scale exercises, I decided I'd just give them a go. No computers. Just the boring, repetitive stuff it said would make me a great jazz musician.

The book even gave me some suggestions on how to minimize the number of exercises I need to do---rather than playing every mode in every key (e.g. C ionian, then G ionian, then A ionian, etc. until it's time to play dorians), I can instead play C ionian followed by D dorian followed by E phrygian. These scales all share the same notes, so they're more-or-less the same thing, which means I actually only need to practice 12 things, rather than 84 (the other 250 or so can likewise be compressed together).
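The arithmetic behind that tip can be checked with a short sketch. It assumes the seven modes are the rotations of the major scale's whole/half-step pattern, and it covers only one of the "at least four" harmonies; the helper `mode_pitch_classes` and the other names are illustrative, not from The Book.

```python
# Whole/half-step pattern of the major (ionian) scale, in semitones.
MAJOR_STEPS = (2, 2, 1, 2, 2, 2, 1)
MODES = ["ionian", "dorian", "phrygian", "lydian",
         "mixolydian", "aeolian", "locrian"]


def mode_pitch_classes(root_pc, mode_index):
    """Set of pitch classes (0 = C ... 11 = B) for a mode starting on root_pc."""
    steps = MAJOR_STEPS[mode_index:] + MAJOR_STEPS[:mode_index]
    pitch_classes = [root_pc]
    for step in steps[:-1]:  # the final step just returns to the root
        pitch_classes.append((pitch_classes[-1] + step) % 12)
    return frozenset(pitch_classes)


c_ionian = mode_pitch_classes(0, MODES.index("ionian"))
d_dorian = mode_pitch_classes(2, MODES.index("dorian"))
e_phrygian = mode_pitch_classes(4, MODES.index("phrygian"))

# Every (root, mode) pair within one harmony: 12 * 7 named scales...
all_scales = {(root, mode): mode_pitch_classes(root, mode)
              for root in range(12) for mode in range(len(MODES))}
# ...but far fewer distinct collections of notes.
distinct_note_sets = set(all_scales.values())

print(c_ionian == d_dorian == e_phrygian)
print(len(all_scales), len(distinct_note_sets))
print(12 * 7 * 4, 12 * 7 * 4 - 84)
```

Under these assumptions the 84 named scales collapse to 12 note collections, and the remainder of the original 336 is 252, which is the "250 or so" left over.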

If I had been patient, I would have read that PRO-TIP the first time around. It probably wouldn't have helped me make less-"smart" decisions, but it's worth keeping in mind that I could be two years ahead of where I am today if I were better at keeping my eye on the ball.

One of the scales the book made me do was Ab major---something I'd literally never once played in my twenty years of piano. It started on a black note and always felt too hard to actually do. I approached it with trepidation, but realized that it only took about three minutes to figure out.

The thing I'd been putting off for twenty years out of fear only took three minutes to accomplish.

I've often wondered why it seems like all of the good musicians have been playing their instruments for like 25 years. Surely music can't be that hard---you can get pretty fucking good at most things in six months of dedicated study. But in the light of all of this, it makes sense. If everyone learns music as haphazardly as I've been doing it, it's no wonder that it takes us all so long.

What have you been putting off out of fear? Are you sure it's as hard as it seems?


Vote counting bug?

January 22, 2019 - 18:44
Published on January 22, 2019 3:44 PM UTC

I've just noticed that the number of votes shown on my recent alignment forum post seems to actually correspond to the number of votes it's received on Less Wrong, rather than just counting the alignment forum votes. Not sure if this is intentional, but for me it makes the feature less useful. Not a priority though.


Stale air / high CO2 may decrease your cognitive function

January 22, 2019 - 15:52
Published on January 22, 2019 12:52 PM UTC

If you're in a closed space, you may want to open a window.

Before the industrial revolution, the atmosphere had 300 parts-per-million (PPM) of CO2. Today, this number is already above 400 on average, and 500 in urban areas.

But CO2 doesn't just affect the environment; high enough levels of it also affect our bodies and our minds.

So let's leave the atmosphere for a bit, and go inside. One study checked office employees' decision-making skills at various CO2 levels. Here are some of the results:


This level is common in poorly ventilated spaces such as workrooms and offices, and one study of schools in several US districts found 50% of classrooms at this level.

At this CO2 level, cognitive function in the office experiment decreased by 15%.


This level can also be reached in the places described above.

Here cognitive function decreased by 50%!


From this level onward, some people described other side effects such as slight nausea, loss of attention and poor concentration, sleepiness, headaches, and increased heart rates.

And still, these levels aren't uncommon:


This is common in cars and bedrooms (closed spaces which are either small, ones you spend a long time in, or both), and the side effects increase.


Motorcycle helmets can reach these levels. Being in such an environment for long periods can harm your long-term health.

So what can you do?

1. Simply open a window! (at least in this part of the century)

2. You can get some plants for your room or office:

The NASA Clean Air Study looked at about two and a half dozen plants, and recommends these three plants: Areca Palm, Snake Plant, and Money Plant.

This lung institute guide seems to be based on this study, so I suggest reading it.

3. Buy a CO2 monitor if you want to always know what environment you're in. These seem to cost quite a bit, though (for a reason unclear to me), so I don't know if it will really benefit you. I know I won't bother.

The IPCC reported that CO2 levels will be, by the end of the century, between 541 and 970 ppm. If we extrapolate from the previous study, this may mean a 10-15% decrease in the cognitive function of humanity as a species (and even more than the previous results in closed spaces).

Some studies found evidence that air pollution can harm the brain itself.

Should this change our attitude towards climate change as a catastrophic risk?

This has been brought to my attention by this video series and this video. Give it a look if you want to see for yourself how high CO2 levels affect a person.



January 22, 2019 - 12:39
Published on January 22, 2019 9:39 AM UTC


Saw it on Hacker News, discussion here: https://news.ycombinator.com/item?id=18965274

Formal methods seem very relevant to AI safety, and I haven't seen much discussion of them on Less Wrong.


Should questions be called "questions" or "confusions" (or "other")?

January 22, 2019 - 05:45
Published on January 22, 2019 2:45 AM UTC

Or, what is the best way to think about "what is a question?" on LW?

The LW Team just had a retreat where we thought through a lot of high level strategy. We have a lot of ideas building off of the "questions" feature.

One thing that struck us is that a lot of early stage research has less to do with formalizable questions, and more to do with noticing anomalies in your current model/paradigm. Something feels off that you can't explain, or there's a concept you don't even understand well enough to ask a coherent question about.

The "question" feature was meant, in part, to reduce the cost of exploring early stage curiosity, but we wondered if it might even be slightly too formalized.

Just as there was technically nothing stopping you from asking a question as a post (but adding the feature caused a proliferation of questions), there is nothing stopping you from asking an ill-formed question. But maybe changing the language slightly would better encourage early-stage curiosity.

So, curious:

How would you feel if we changed "Ask a question" to "Pose a confusion" or something like that? (The main issue so far is that "pose confusion" is, well, way more confusing since it's a non-standard phrase. Other options include literally saying "Ask question/Pose Confusion" [i.e. both at once, so you get the benefit of the clear-cut "ask question"], or some word other than "pose.")

(Somewhat but not entirely jokingly, we also noticed people are hesitant to post "answers" since they sound like you're trying to claim you know what you're talking about. We considered "Post a deconfusion" or "Post a partial answer" as options.)


Alignment Newsletter #42

January 22, 2019 - 05:00
Published on January 22, 2019 2:00 AM UTC

Cooperative IRL as a definition of human-AI group rationality, and an empirical evaluation of theory of mind vs. model learning in HRI

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


AI Alignment Podcast: Cooperative Inverse Reinforcement Learning (Lucas Perry and Dylan Hadfield-Menell) (summarized by Richard): Dylan puts forward his conception of Cooperative Inverse Reinforcement Learning as a definition of what it means for a human-AI system to be rational, given the information bottleneck between a human's preferences and an AI's observations. He notes that there are some clear mismatches between this problem and reality, such as the CIRL assumption that humans have static preferences, and how fuzzy the abstraction of "rational agents with utility functions" becomes in the context of agents with bounded rationality. Nevertheless, he claims that this is a useful unifying framework for thinking about AI safety.

Dylan argues that the process by which a robot learns to accomplish tasks is best described not just as maximising an objective function but instead in a way which includes the system designer who selects and modifies the optimisation algorithms, hyperparameters, etc. In fact, he claims, it doesn't make sense to talk about how well a system is doing without talking about the way in which it was instructed and the type of information it got. In CIRL, this is modeled via the combination of a "teaching strategy" and a "learning strategy". The former can take many forms: providing rankings of options, or demonstrations, or binary comparisons, etc. Dylan also mentions an extension of this in which the teacher needs to learn their own values over time. This is useful for us because we don't yet understand the normative processes by which human societies come to moral judgements, or how to integrate machines into that process.

On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al): In human-robot interaction (HRI), we often require a model of the human that we can plan against. Should we use a specific model of the human (a so-called "theory of mind", where the human is approximately optimizing some unknown reward), or should we simply learn a model of the human from data? This paper presents empirical evidence comparing three algorithms in an autonomous driving domain, where a robot must drive alongside a human.

The first algorithm, called Theory of Mind based learning, models the human using a theory of mind, infers a human reward function, and uses that to predict what the human will do, and plans around those actions. The second algorithm, called Black box model-based learning, trains a neural network to directly predict the actions the human will take, and plans around those actions. The third algorithm, model-free learning, simply applies Proximal Policy Optimization (PPO), a deep RL algorithm, to directly predict what action the robot should take, given the current state.

Quoting from the abstract, they "find that there is a significant sample complexity advantage to theory of mind methods and that they are more robust to covariate shift, but that when enough interaction data is available, black box approaches eventually dominate". They also find that when the ToM assumptions are significantly violated, then the black-box model-based algorithm will vastly surpass ToM. The model-free learning algorithm did not work at all, probably because it cannot take advantage of knowledge of the dynamics of the system and so the learning problem is much harder.

Rohin's opinion: I'm always happy to see an experimental paper that tests how algorithms perform; I think we need more of these.

You might be tempted to think of this as evidence that in deep RL, a model-based method should outperform a model-free one. This isn't exactly right. The ToM and black box model-based algorithms use an exact model of the dynamics of the environment modulo the human, that is, they can exactly predict the next state given the current state, the robot action, and the human action. The model-free algorithm must learn this from scratch, so it isn't an apples-to-apples comparison. (Typically in deep RL, both model-based and model-free algorithms have to learn the environment dynamics.) However, you can think of the ToM as a model-based method and the black box model-based algorithm as a model-free algorithm, where both algorithms have to learn the human model instead of the more traditional environment dynamics. With that analogy, you would conclude that model-based algorithms will be more sample efficient and more performant in low-data regimes, but will be outperformed by model-free algorithms with sufficient data, which agrees with my intuitions.

This kind of effect is a major reason for my position that the first powerful AI systems will be modular (analogous to model-based systems), but that they will eventually be replaced by more integrated, end-to-end systems (analogous to model-free systems). Initially, we will be in a (relatively speaking) low-data regime, where modular systems excel, but over time there will be more data and compute and we will transition to regimes where end-to-end systems will solve the same tasks better (though we may then have modular systems for more difficult tasks).

Technical AI alignment

Iterated amplification sequence

The reward engineering problem (Paul Christiano): The reward engineering problem is the problem of designing some form of reward feedback such that if the agent learns to get high reward, then we are happy with the resulting behavior. We assume that an overseer H is providing reward for an agent A, with H smarter than A.

(Aside: Why are we assuming H is smarter, when we are typically worried about AI that is more intelligent than we are? Because we could hope to use AI tools to create this reward function, so that the human + AI team is smarter than the AI being trained. In iterated amplification, this corresponds to the distillation step.)

We could imagine three approaches: direct supervision (where H provides a reward for every action A takes), imitation learning (where A tries to mimic H, perhaps by trying to fool a discriminator that distinguishes between A and H), and inverse reinforcement learning (where we infer a reward function by observing H's behavior and use that to supervise A).

With all of these approaches, we might hope that since H is smarter than A, A will not be able to trick H with some behavior that gets high reward but is actually undesirable. However, this argument has a major flaw. When A chooses an action, the process or computation by which it chooses that action is important information, which by default (especially with neural nets) only A has access to, and so this can allow it to fool H. For example, if A is tasked with drawing original artwork, it might copy and slightly change existing artwork, which H may not realize if it cannot see how the artwork was made. We could hope to fix this problem with transparency or explanations, but this requires a lot more research.

Imitation learning and IRL have the problem that A may not be capable of doing what H does. In that case, it will be off-distribution and may have weird behavior. Direct supervision doesn't suffer from this problem, but it is very time-inefficient. This could potentially be fixed using semi-supervised learning techniques.

Rohin's opinion: The information asymmetry problem between H and A seems like a major issue. For me, it's the strongest argument for why transparency is a necessary ingredient of a solution to alignment. The argument against imitation learning and IRL is quite strong, in the sense that it seems like you can't rely on either of them to capture the right behavior. These are stronger than the arguments against ambitious value learning (AN #31) because here we assume that H is smarter than A, which we could not do with ambitious value learning. So it does seem to me that direct supervision (with semi-supervised techniques and robustness) is the most likely path forward to solving the reward engineering problem.

There is also the question of whether it is necessary to solve the reward engineering problem. It certainly seems necessary in order to implement iterated amplification given current systems (where the distillation step will be implemented with optimization, which means that we need a reward signal), but might not be necessary if we move away from optimization or if we build systems using some technique other than iterated amplification (though even then it seems very useful to have a good reward engineering solution).

Capability amplification (Paul Christiano): Capability amplification is the problem of taking some existing policy and producing a better policy, perhaps using much more time and compute. It is a particularly interesting problem to study because it could be used to define the goals of a powerful AI system, and it could be combined with reward engineering above to create a powerful aligned system. (Capability amplification and reward engineering are analogous to amplification and distillation respectively.) In addition, capability amplification seems simpler than the general problem of "build an AI that does the right thing", because we get to start with a weak policy A rather than nothing, and we're allowed to take lots of time and computation to implement the better policy. It would be useful to tell whether the "hard part" of value alignment is in capability amplification, or somewhere else.

We can evaluate capability amplification using the concepts of reachability and obstructions. A policy C is reachable from another policy A if there is some chain of policies from A to C, such that at each step capability amplification takes you from the first policy to something at least as good as the second policy. Ideally, all policies would be reachable from some very simple policy. This is impossible if there exists an obstruction, that is, a partition of policies into two sets L and H, such that it is impossible to amplify any policy in L to get a policy that is at least as good as some policy in H. Intuitively, an obstruction prevents us from getting to arbitrarily good behavior, and means that all of the policies in H are not reachable from any policy in L.
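To make the two definitions concrete, here is a deliberately tiny sketch. Everything in it is a hypothetical toy, not something from the post: policies are integers 0 to 5, "at least as good as" is just `>=` on those labels, and the table `AMPLIFY` lists what one amplification step can produce. The step from 2 to 3 is left out on purpose so that an obstruction exists.

```python
from collections import deque

# Hypothetical toy model: AMPLIFY[p] lists the policies that a single
# amplification step can produce from policy p. Nothing leads from 2 to 3.
AMPLIFY = {0: [1], 1: [2], 2: [2], 3: [4], 4: [5], 5: [5]}


def reachable_from(start):
    """All policies reachable from `start` by chaining amplification steps."""
    seen = {start}
    queue = deque([start])
    while queue:
        policy = queue.popleft()
        for improved in AMPLIFY[policy]:
            if improved not in seen:
                seen.add(improved)
                queue.append(improved)
    return seen


def is_obstruction(low, high):
    """True if no policy in `low` amplifies to something >= any policy in `high`."""
    return all(
        not any(result >= target for target in high)
        for policy in low
        for result in AMPLIFY[policy]
    )


print(reachable_from(0))
print(is_obstruction({0, 1, 2}, {3, 4, 5}))
print(is_obstruction({0, 1}, {2, 3, 4, 5}))
```

In this toy, the partition into {0, 1, 2} and {3, 4, 5} is an obstruction, so nothing in the second set is reachable from policy 0, while the partition into {0, 1} and the rest is not, because amplifying 1 already matches 2.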

We can do further work on capability amplification. With theory, we can search for challenging obstructions, and design procedures that overcome them. With experiment, we can study capability amplification with humans (something which Ought is now doing).

Rohin's opinion: There's a clear reason for work on capability amplification: it could be used as a part of an implementation of iterated amplification. However, this post also suggests another reason for such work -- it may help us determine where the "hard part" of AI safety lies. Does it help to assume that you have lots of time and compute, and that you have access to a weaker policy?

Certainly if you just have access to a weaker policy, this doesn't make the problem any easier. If you could take a weak policy and amplify it into a stronger policy efficiently, then you could just repeatedly apply this policy-improvement operator to some very weak base policy (say, a neural net with random weights) to solve the full problem. (In other variants, you have a much stronger aligned base policy, eg. the human policy with short inputs and over a short time horizon; in that case this assumption is more powerful.) The more interesting assumption is that you have lots of time and compute, which does seem to have a lot of potential. I feel pretty optimistic that a human thinking for a long time could reach "superhuman performance" by our current standards; capability amplification asks if we can do this in a particular structured way.

Value learning sequence

Reward uncertainty (Rohin Shah): Given that we need human feedback for the AI system to stay "on track" as the environment changes, we might design a system that keeps an estimate of the reward, chooses actions that optimize that reward, but also updates the reward over time based on feedback. This has a few issues: it typically assumes that the human Alice knows the true reward function, it makes a possibly-incorrect assumption about the meaning of Alice's feedback, and the AI system still looks like a long-term goal-directed agent where the goal is the current reward estimate.

This post takes the above AI system and considers what happens if you have a distribution over reward functions instead of a point estimate, and during action selection you take into account future updates to the distribution. (This is the setup of Cooperative Inverse Reinforcement Learning.) While we still assume that Alice knows the true reward function, and we still require an assumption about the meaning of Alice's feedback, the resulting system looks less like a goal-directed agent.

In particular, the system no longer has an incentive to disable the system that learns values from feedback: while previously it changed the AI system's goal (a negative effect from the goal's perspective), now it provides more information about the goal (a positive effect). In addition, the system has more of an incentive to let itself be shut down. If a human is about to shut it down, it should update strongly that whatever it was doing was very bad, causing a drastic update on reward functions. It may still prevent us from shutting it down, but it will at least stop doing the bad thing. Eventually, after gathering enough information, it would converge on the true reward and do the right thing. Of course, this is assuming that the space of rewards is well-specified, which will probably not be true in practice.
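A minimal sketch of the "distribution instead of point estimate" idea follows. It makes several assumptions not in the post: there are only two hypothetical candidate reward functions, the human approves an action with probability given by a logistic function of its reward, and the agent picks actions greedily under its current belief. The last point means it omits the planning over future updates that the full setup includes, so it only illustrates why disapproval is informative rather than goal-changing.

```python
import math

ACTIONS = ["tidy_desk", "shred_papers"]

# Two hypothetical candidate reward functions the system is uncertain between.
CANDIDATE_REWARDS = {
    "likes_shredding": {"tidy_desk": 0.0, "shred_papers": 1.0},
    "hates_shredding": {"tidy_desk": 0.5, "shred_papers": -1.0},
}


def approval_probability(reward, beta=3.0):
    """Assumed feedback model: approval is more likely for higher reward."""
    return 1.0 / (1.0 + math.exp(-beta * reward))


def update(belief, action, approved):
    """Bayesian update of the belief over reward functions given feedback."""
    unnormalized = {}
    for name, prob in belief.items():
        p_approve = approval_probability(CANDIDATE_REWARDS[name][action])
        likelihood = p_approve if approved else 1.0 - p_approve
        unnormalized[name] = prob * likelihood
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


def best_action(belief):
    """Action with the highest expected reward under the current belief."""
    def expected_reward(action):
        return sum(prob * CANDIDATE_REWARDS[name][action]
                   for name, prob in belief.items())
    return max(ACTIONS, key=expected_reward)


prior = {"likes_shredding": 0.7, "hates_shredding": 0.3}
before = best_action(prior)
posterior = update(prior, "shred_papers", approved=False)
after = best_action(posterior)
print(before, posterior, after)
```

With these made-up numbers, a single disapproval moves most of the probability mass to the second reward function and the greedy choice switches away from the disapproved action, which is the "stop doing the bad thing" behavior in miniature.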

Following human norms (Rohin Shah): One approach to preventing catastrophe is to constrain the AI system to never take catastrophic actions, and not focus as much on what to do (which will be solved by progress in AI more generally). In this setting, we hope that our AI systems accelerate our rate of progress, but we remain in control and use AI systems as tools that allow us to make better decisions and better technologies. Impact measures / side effect penalties aim to define what not to do. What if we instead learn what not to do? This could look like inferring and following human norms, along the lines of ad hoc teamwork.

This is different from narrow value learning for a few reasons. First, narrow value learning also learns what to do. Second, it seems likely that norm inference only gives good results in the context of groups of agents, while narrow value learning could be applied in single-agent settings.

The main advantage of learning norms is that this is something that humans do quite well, so it may be significantly easier than learning "values". In addition, this approach is very similar to our ways of preventing humans from doing catastrophic things: there is a shared, external system of norms that everyone is expected to follow. However, norm following is a weaker standard than ambitious value learning (AN #31), and there are more problems as a result. Most notably, powerful AI systems will lead to rapidly evolving technologies that cause big changes in the environment that might require new norms; norm-following AI systems may not be able to create or adapt to these new norms.

Agent foundations

CDT Dutch Book (Abram Demski)

CDT=EDT=UDT (Abram Demski)

Learning human intent

AI Alignment Podcast: Cooperative Inverse Reinforcement Learning (Lucas Perry and Dylan Hadfield-Menell): Summarized in the highlights!

On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al): Summarized in the highlights!

What AI Safety Researchers Have Written About the Nature of Human Values (avturchin): This post categorizes theories of human values along three axes. First, how complex is the description of the values? Second, to what extent are "values" defined as a function of behavior (as opposed to being a function of eg. the brain's algorithm)? Finally, how broadly applicable is the theory: could it apply to arbitrary minds, or only to humans? The post then summarizes different positions on human values that different researchers have taken.

Rohin's opinion: I found the categorization useful for understanding the differences between views on human values, which can be quite varied and hard to compare.

Risk-Aware Active Inverse Reinforcement Learning (Daniel S. Brown, Yuchen Cui et al): This paper presents an algorithm that actively solicits demonstrations on states where it could potentially behave badly due to its uncertainty about the reward function. They use Bayesian IRL as their IRL algorithm, so that they get a distribution over reward functions. They use the most likely reward to train a policy, and then find a state from which that policy has high risk (because of the uncertainty over reward functions). They show in experiments that this performs better than other active IRL algorithms.

Rohin's opinion: I don't fully understand this paper -- how exactly are they searching over states, when there are exponentially many of them? Are they sampling them somehow? It's definitely possible that this is in the paper and I missed it; I did skim it fairly quickly.

Other progress in AI

Reinforcement learning

Soft Actor-Critic: Deep Reinforcement Learning for Robotics (Tuomas Haarnoja et al)

Deep learning

A Comprehensive Survey on Graph Neural Networks (Zonghan Wu et al)

Graph Neural Networks: A Review of Methods and Applications (Jie Zhou, Ganqu Cui, Zhengyan Zhang et al)


Olsson to Join the Open Philanthropy Project (summarized by Dan H): Catherine Olsson, a researcher at Google Brain who was previously at OpenAI, will be joining the Open Philanthropy Project to focus on grant making for reducing x-risk from advanced AI. Given her first-hand research experience, she has knowledge of the dynamics of research groups and a nuanced understanding of various safety subproblems. Congratulations to both her and OpenPhil.

Announcement: AI alignment prize round 4 winners (cousin_it): The last iteration of the AI alignment prize has concluded, with awards of $7500 each to Penalizing Impact via Attainable Utility Preservation (AN #39) and Embedded Agency (AN #31, AN #32), and $2500 each to Addressing three problems with counterfactual corrigibility (AN #30) and Three AI Safety Related Ideas/Two Neglected Problems in Human-AI Safety (AN #38).


January 2019 Nashville SSC Meetup

January 22, 2019 - 00:51
Published on January 21, 2019 9:51 PM UTC

Provisional topic: "Seeing Like A State" book review.

Nashville SSC meetup *aims* to meet the 4th Tuesday of every month at 7:00 central.

All welcome. Contact james[at]writechem.com



Game Analysis Index

January 21, 2019 - 18:30
Published on January 21, 2019 3:30 PM UTC

This post links to this blog’s posts discussing game design, balance, economics and related topics, as well as any strategy posts. It does not contain new content.

Much of the blog is relevant to gaming, but these are the explicitly on-topic posts.

Eternal Sequence

Eternal, and Hearthstone Economy versus Magic Economy

The Eternal Grind

Eternal: The Exit Interview


Artifact / Card Rebalancing Sequence

Review: Artifact

Artifact Embraces Card Balance Changes

Card Collection and Ownership

Card Balance and Artifact

Card Rebalancing, Card Oversupply and Economic Considerations in Digital Card Games

Advantages of Card Rebalancing

Disadvantages of Card Rebalancing


Game Reviews (Including Those Listed Above)

All games reviewed are recommended; we don’t generally waste time on unworthy games.

Persona 5: Spoiler-Free Review

Review: Artifact

Review: Slay the Spire

Octopath Traveler: Spoiler-Free Review


Magic Strategy

Deck Guide: Burning Drakes



Disentangling arguments for the importance of AI safety

January 21, 2019 - 15:41
Published on January 21, 2019 12:41 PM UTC

I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees' reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others - although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.

  1. Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.
    1. This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.
    2. Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments - nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.
  2. The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.
    1. This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction.
  3. The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations who slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in this blog post by Paul Christiano.
    1. More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment - e.g. the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent).
  4. The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building an AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling - however, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.
    1. An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)
    2. While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.
    3. Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.
  5. Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:
    1. AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.
    2. AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.
    3. AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.
    4. People could use AIs to hack critical infrastructure (including the other AIs which manage the aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or ‘data poisoning’.
  6. Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.
    1. Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018 (which I’ll link when it’s available online). Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development - something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)
    2. Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.

What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity - it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is evidence against their quality: if your conclusions remain the same but your reasons for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered important in your social group). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.

On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:

  1. They might be used deliberately.
  2. They might be set off accidentally.
  3. They might cause a nuclear chain reaction much larger than anticipated.
  4. They might destabilise politics, either domestically or internationally.

And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 30’s or early 40’s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.

I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.

This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. All opinions are my own.


Do the best ideas float to the top?

January 21, 2019 - 09:28
Published on January 21, 2019 5:22 AM UTC

It may depend on what we mean by “best”.

Epistemic status: I understand very little of anything.

Speculation about potential applications: regulating a logical prediction market, e.g. logical induction; constructing judges or competitors in e.g. alignment by debate; designing communication technology, e.g. to mitigate harms and risks of information warfare.

The slogan “the best ideas float to the top” is often used in social contexts. The saying goes, “in a free market of ideas, the best ideas float to the top”. Of course, it is not intended as a facts statement, as in “we have observed that this is the case”; it is instead a values statement, as in “we would prefer for this to be the case”.

In this essay, however, we will force an empirical interpretation, just to see what happens. I will provide three ways to consider the density of an idea, or the number assigned to how float-to-the-top an idea is.

In brief, an idea is a sentence, and you can vary the amount of its antecedent graph (like in Bayesian nets, NARS-like architectures) or function out of which it is printed (like in compression) that you want to consider at a given moment, up to resource allocation. This isn’t an entirely mathematical paper, so don’t worry about WFFs, parsers, etc., which is why I’ll stick with “ideas” instead of “sentences”. I will also be handwaving between "description of some world states" and "belief about how world states relate to each other".


Suppose you observe wearers of teal hats advocate for policy A, but you don’t know what A is. You’re minding your business in an applebees parking lot when a wearer of magenta hats gets your attention to tell you “A is harmful”. There are two cases:

  1. Suppose A is “kicking puppies”, (and I don’t mean the wearer of magenta hats is misleadingly compressing A to you, I mean the policy is literally kicking puppies). The inferential gap between you and the magentas can be closed very cheaply, so you’re quickly convinced that A is harmful (unless you believe that kicking puppies is good).
  2. Suppose A is “fleegan at a rate of flargen”, where fleeganomics is a niche technical subject which nevertheless can be learned by anyone of median education in N units[^1] or less. Suppose also that you know the value of N, but you’re not inclined to invest that much compute in a dumb election, so you either a. take them at their word that A is harmful; b. search the applebees for an authority figure who believes that A is harmful, but believes it more credibly; or c. leave the parking lot without updating in any direction.

“That’s easy, c” you respond, blindingly fast. You peel out of there, and the whole affair makes not a dent in your epistemic hygiene. But you left behind many others. Will they be as strong, as wise as you?

“In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

– Herbert Simon

Let’s call case 1 “constant” and case 2 “linear”. We assume that constant means negligible cost, and that linear means cost proportional to pedagogical length (where pedagogical length, or cost, is some measure of the resources needed to acquire some sort of understanding).

A regulator, unlike you, isn’t willing to leave anyone behind for evangelists and pundits to prey on. This is the role I’m assuming for this essay. I will ultimately propose a negative attentional tax, in which the constant cases would be penalized to give the linear cases a boost. (It’s like negative income tax, replacing money with attention).

If you could understand fleeganomics in N/100000 units, would it be worth it to you then?

Let’s force an empirical interpretation of “the best ideas float to the top”

Three possible measures of density:

  1. the simplest ideas float to the top.
  2. the truest ideas float to the top.
  3. the ideas which advance the best values float to the top, where by “advance the best values” we mean either a. maximize my utility function, not yours; or b. maximize the aggregate/average utility function of all moral patients, without emphasis on zero-sum faceoffs between opponent utility functions.

Each in turn implies a sort of world in which it is the sole interpretation, and thus the sole factor over beliefs of truth-seekers.

The intuition given above leans heavily on density_3; however, we must start much lower, at the fundamentals of simplicity and truth. From now on, for brevity’s sake, please ignore density_3 and focus on the first two.

density_1: The Simplest Ideas Float to the Top.

If you form a heuristic by philosophizing the conjunction rule in probability theory, you get Occam’s razor. In machine learning, we have model selection methods that directly penalize complexity. Occam’s razor doesn’t say anything about reception of ideas in a social system, beyond implying that in gambling the wise bet on shorter sentences (insofar as the wise are gamblers).
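The step from the conjunction rule to Occam’s razor can be written out. For any two claims A and B (symbols introduced here only for illustration):

```latex
P(A \wedge B) = P(A)\,P(B \mid A) \le P(A)
```

Each extra conjunct can only keep an idea’s probability the same or lower it, which is the sense in which the wise gambler bets on the shorter sentence.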

If we assume that the wearer of magenta hats is maximizing something like petition signatures, and by proxy maximizing the number of applebees patrons converted to magenta hat wearing applebees patrons, then in the world of density_1 they ought to persuade only via statements with constant or negligible cost. (Remember, in the world of density_1, statements needn’t have any particular content to be successful. In an idealized setting, this would mean the empty string gets 100% of the vote in every election, or 100% of traders purchase nothing but the empty string, etc.; in a human setting, think of the “smallest recognizable belief”.)

density_2: The Truest Ideas Float to the Top.

If the truest ideas floated to the top, then statements with more substantial truth values (i.e. with more evidence, more compelling evidence, stronger inferential steps) win out against those with less substantial truth values. In a world governed only by density_2, all cost is negligible.

In this world, the wearer of magenta hats is incentivized to teach fleeganomics – to bother themselves (and others) with linear cost ideas – if that’s what leads people to more substantially held beliefs or commitments. This is a sort of oracle world, in a word, logical omniscience.

In a market view, truth only prevails in the long run (i.e. in the same way that price only converges to value but you can’t pinpoint when they’re equal, supply with demand, etc.), which is why the density_2 interpretation is suitable for oracles, or at least the infinite resources of AIXI-like agents. If you tried to populate the world of density_2 with logically uncertain/AIKR-abiding agents, the entire appeal of markets evaporates. “Those who know they are not relevant experts shut up, and those who do not know this eventually lose their money, and then shut up.” (Hanson), but without the “eventually”.

Negative attention tax

Now suppose we live in some world where density_1 and density_2 are operating at the same time, with some foggier and handwavier things like density_3 on the margins. In such a world, we say false-complicated ideas are robustly uncompetitive and true-simple ideas are robustly competitive, where “robust” means “resilient to foul play”, and "foul play" means "any misleading compression, fallacious reasoning, etc.". Without such resilience, we have risk that false-simple ideas will succeed and true-complicated ideas will fail.

A regulator isn’t willing to leave anyone behind for evangelists and pundits to prey on.

Perhaps we want free attention distributed to true-but-complicated things, and penalties applied to false-but-simple things. In economics, a negative income tax (NIT) is a welfare system within an income tax where people earning below a certain amount receive supplemental pay from the government instead of paying taxes to the government.

For us, a negative attentional tax is a welfare system, where ideas demanding above a certain amount of compute receive supplemental attention, and ideas below that amount pay up.
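One way to make this concrete, as a sketch only, is to mirror the linear schedule of a negative income tax with compute cost in place of income and the sign flipped. The function names, the `threshold` and the `rate` are assumptions for illustration; the essay does not specify a schedule or units.

```python
def attention_transfer(cost, threshold, rate):
    """Attention granted to (positive) or charged from (negative) an idea.

    cost:      compute the idea demands of its audience (hypothetical units)
    threshold: cost at which an idea neither pays nor receives
    rate:      fraction of the gap to the threshold that is transferred
    """
    return rate * (cost - threshold)


def attention_after_tax(organic_attention, cost, threshold=10.0, rate=0.5):
    """Organic attention plus the transfer, floored at zero."""
    return max(0.0, organic_attention + attention_transfer(cost, threshold, rate))
```

Under these made-up numbers, an idea costing 30 units receives 10 extra units of attention, while one costing 2 units is charged 4.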

| density_2 \ density_1 | Simple | Complicated |
| --- | --- | --- |
| False | I'm saying this is a failure mode, danger zone, etc. | Robustly uncompetitive (won’t bother us) |
| True | Robustly competitive (these will be fine) | I’m saying the solution is to give these sentences a boost. |

An example implementation: suppose I’m working at nosebook in year of our lord. When I notice certain posts get liked/shared blindingly fast, and others take more time, I suppose that the simple ones are some form of epistemic foul play, and the complicated ones are more likely to align with epistemic norms we prefer. I make an algorithm to suppress posts that get liked/shared too quickly, and replace their spots in the feed with posts that seem to be digested before getting liked/shared (disclaimer: this is not a resilient proposal, I spent all of 10 seconds thinking about it, please defer to your nearest misinformation expert).
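A minimal sketch of that feed rule, inheriting the same disclaimer. It assumes each post records how many seconds passed between a user opening it and liking or sharing it; the `Post` fields and the 5-second cut-off are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Post:
    post_id: str
    # Seconds between a user opening the post and liking/sharing it.
    seconds_to_like: list = field(default_factory=list)


def rerank(posts, min_median_seconds=5.0):
    """Demote posts that are typically liked faster than the cut-off.

    Posts with no reactions yet are left in place; relative order within
    the kept and the demoted groups is preserved.
    """
    def liked_too_fast(post):
        return bool(post.seconds_to_like) and median(post.seconds_to_like) < min_median_seconds

    kept = [p for p in posts if not liked_too_fast(p)]
    demoted = [p for p in posts if liked_too_fast(p)]
    return kept + demoted
```

Demoting rather than deleting is a design choice made here; the essay only says such posts should lose their spots in the feed.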

Individuals apply NAT credits to interesting-looking complicated ideas; complicated ideas aren't directly supplied with these supplements in the way that simple ideas are automatically handicapped.

Though the above may be a valid interpretation, especially in the nosebook example, NAT is more properly understood as credits allocated to individuals for them to spend freely.

You can imagine the stump speech.

extremely campaigning voice: I’m going to make sure every member of every applebees parking lot has a lengthened/handicapped mental speed when they’re faced with simple ideas, and this will come back to them as tax credits they can spend on complicated ideas. Every applebees patron deserves complexity, even if they can’t afford the full compute/price for it.

--footnote-- [^1]: "Pedagogical cost" is loosely inspired by "algorithmic decomposition" in Between Saying and Doing. TL;DR: to reason about a student acquiring long division, we reason about their acquisition of subtraction and multiplication. For us, pedagogical cost or length of some capacity is the sum of the lengths of its prerequisite capacities. We'll consider our pedagogical units as some function on attentional units. Herbert Simon dismisses adopting Shannon's bit as the attentional unit, because he wants something invariant under different encoding choices. He goes on to suggest time in the form of "how long it takes for the median human cognition to digest". This can be our base unit of parsing things you already know how to parse, even though extending it to pedagogical cost wouldn't be as stable because we don't understand teaching or learning very well.
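The footnote's definition can be sketched as a walk over a prerequisite graph. Everything concrete below is an assumption for illustration: the graph beyond "long division needs subtraction and multiplication", the per-capacity digest times, and the choice to count a shared prerequisite only once and to skip anything the learner already knows.

```python
# Hypothetical prerequisite graph and digest times (in arbitrary time units).
PREREQS = {
    "long division": ["subtraction", "multiplication"],
    "multiplication": ["addition"],
    "subtraction": ["addition"],
    "addition": [],
}
DIGEST_TIME = {
    "long division": 3.0,
    "multiplication": 2.0,
    "subtraction": 1.0,
    "addition": 1.0,
}


def missing_capacities(capacity, prereqs, known, seen=None):
    """Collect the capacity and all prerequisites the learner still lacks."""
    seen = set() if seen is None else seen
    if capacity in known or capacity in seen:
        return seen
    seen.add(capacity)
    for prerequisite in prereqs.get(capacity, ()):
        missing_capacities(prerequisite, prereqs, known, seen)
    return seen


def pedagogical_cost(capacity, prereqs, digest_time, known=frozenset()):
    """Sum the digest times of everything still to be learned."""
    return sum(digest_time[c] for c in missing_capacities(capacity, prereqs, known))
```

With these numbers, a learner starting from nothing pays 7.0 units for long division, while one who already knows subtraction and multiplication pays 3.0, which is the sense in which an inferential gap can be cheap or expensive to close.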


Following human norms

January 21, 2019 - 02:59
Published on January 20, 2019 11:59 PM UTC

So far we have been talking about how to learn “values” or “instrumental goals”. This would be necessary if we want to figure out how to build an AI system that does exactly what we want it to do. However, we’re probably fine if we can keep learning and building better AI systems. This suggests that it’s sufficient to build AI systems that don’t screw up so badly that it ends this process. If we accomplish that, then steady progress in AI will eventually get us to AI systems that do what we want.

So, it might be helpful to break down the problem of learning values into the subproblems of learning what to do, and learning what not to do. Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.

This is a problem that humans have to solve as well. Children learn basic norms such as not to litter, not to take other people’s things, what not to say in public, etc. As argued in Incomplete Contracting and AI alignment, no contract between humans is ever fully spelled out; instead, it relies on an external unwritten normative structure under which the contract is interpreted. (Even if we don’t explicitly ask our cleaner not to break any vases, we still expect them not to intentionally do so.) We might hope to build AI systems that infer and follow these norms, and thereby avoid catastrophe.

It’s worth noting that this will probably not be an instance of narrow value learning, since there are several differences:

  • Narrow value learning requires that you learn what to do, unlike norm inference.
  • Norm following requires learning from a complex domain (human society), whereas narrow value learning can be applied in simpler domains as well.
  • Norms are a property of groups of agents, whereas narrow value learning can be applied in settings with a single agent.

Despite this, I have included it in this sequence because it is plausible to me that value learning techniques will be relevant to norm inference.

Paradise prospects

With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.

If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.

This is quite similar to the success story in a world with Comprehensive AI Services.

Plausible proposals

As far as I can tell, there has not been very much work on learning what not to do. Existing approaches like impact measures and mild optimization are aiming to define what not to do rather than learn it.

One approach is to scale up techniques for narrow value learning. It seems plausible that in sufficiently complex environments, these techniques will learn what not to do, even though they are primarily focused on what to do in current benchmarks. For example, if I see that you have a clean carpet, I can infer that it is a norm not to walk over the carpet with muddy shoes. If you have an unbroken vase, I can infer that it is a norm to avoid knocking it over. This paper of mine shows how you can reach these sorts of conclusions with narrow value learning (specifically a variant of IRL).
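The vase inference can be illustrated with a toy Bayes update. This is not the method of the paper mentioned above, only a sketch under loud assumptions: there are just two hypotheses, a norm-follower never breaks the vase, and an indifferent agent breaks it independently with a fixed probability on every step. The function and parameter names are hypothetical.

```python
def posterior_norm(prior_norm, p_break_per_step, steps_observed):
    """P(there is a 'don't knock over the vase' norm | vase is still intact).

    prior_norm:        prior probability that the norm exists
    p_break_per_step:  chance an indifferent agent breaks the vase on any step
    steps_observed:    how long the environment has been lived in
    """
    likelihood_norm = 1.0  # a norm-follower always leaves the vase intact
    likelihood_indifferent = (1.0 - p_break_per_step) ** steps_observed
    evidence = (prior_norm * likelihood_norm
                + (1.0 - prior_norm) * likelihood_indifferent)
    return prior_norm * likelihood_norm / evidence
```

The longer an intact vase has survived in a lived-in room, the less plausible indifference becomes, so the posterior on the norm rises toward 1.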

Another approach would be to scale up work on ad hoc teamwork. In ad hoc teamwork, an AI agent must learn to work in a team with a bunch of other agents, without any prior coordination. While current applications are very task-based (eg. playing soccer as a team), it seems possible that as this is applied to more realistic environments, the resulting agents will need to infer norms of the group that they are introduced into. It’s particularly nice because it explicitly models the multiagent setting, which seems crucial for inferring norms. It can also be thought of as an alternative statement of the problem of AI safety: how do you “drop in” an AI agent into a “team” of humans, and have the AI agent coordinate well with the “team”?

Potential pros

Value learning is hard, not least because it’s hard to define what values are, and we don’t know our own values to the extent that they exist at all. However, we do seem to do a pretty good job of learning society’s norms. So perhaps this problem is significantly easier to solve. Note that this is an argument that norm-following is easier than ambitious value learning, not that it is easier than other approaches such as corrigibility.

It also feels easier to work on inferring norms right now. We have many examples of norms that we follow, so we can more easily evaluate whether current systems are good at following norms. In addition, ad hoc teamwork seems like a good start at formalizing the problem, which we still don’t really have for “values”.

This also more closely mirrors our tried-and-true techniques for solving the principal-agent problem for humans: there is a shared, external system of norms, that everyone is expected to follow, and systems of law and punishment are interpreted with respect to these norms. For a much more thorough discussion, see Incomplete Contracting and AI alignment, particularly Section 5, which also argues that norm following will be necessary for value alignment (whereas I’m arguing that it is plausibly sufficient to avoid catastrophe).

One potential confusion: the paper says “We do not mean by this embedding into the AI the particular norms and values of a human community. We think this is as impossible a task as writing a complete contract.” I believe that the meaning here is that we should not try to define the particular norms and values, not that we shouldn’t try to learn them. (In fact, later they say “Aligning AI with human values, then, will require figuring out how to build the technical tools that will allow a robot to replicate the human agent’s ability to read and predict the responses of human normative structure, whatever its content.”)

Perilous pitfalls

What additional things could go wrong with powerful norm-following AI systems? That is, what are some problems that might arise, that wouldn’t arise with a successful approach to ambitious value learning?

  • Powerful AI likely leads to rapidly evolving technologies, which might require rapidly changing norms. Norm-following AI systems might not be able to help us develop good norms, or might not be able to adapt quickly enough to new norms. (One class of problems in this category: we would not be addressing human safety problems.)
  • Norm-following AI systems may be uncompetitive because the norms might overly restrict the possible actions available to the AI system, reducing novelty relative to more traditional goal-directed AI systems. (Move 37 would likely not have happened if AlphaGo were trained to “follow human norms” for Go.)
  • Norms are more like soft constraints on behavior, as opposed to goals that can be optimized. Current ML focuses a lot more on optimization than on constraints, and so it’s not clear if we could build a competitive norm-following AI system (though see eg. Constrained Policy Optimization).
  • Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.

One promising approach to AI alignment is to teach AI systems to infer and follow human norms. While this by itself will not produce an AI system aligned with human values, it may be sufficient to avoid catastrophe. It seems more tractable than approaches that require us to infer values to a degree sufficient to avoid catastrophe, particularly because humans are proof that the problem is soluble.

However, there are still many conceptual problems. Most notably, norm following is not obviously expressible as an optimization problem, and so may be hard to integrate into current AI approaches.

Tomorrow, there'll be a break from AIAF sequences and the new post will be the Alignment Newsletter Issue #42, by Rohin Shah.

Tuesday's AI Alignment Forum sequences post will be 'Learning With Catastrophes' by Paul Christiano in the sequence on Iterated Amplification.

The next post in this sequence will be 'Future directions in narrow value learning' by Rohin Shah, on Wednesday 16th Jan.


Life can be better than you think

20 January, 2019 - 19:05
Published on January 20, 2019 2:19 PM UTC

See also: Transhumanism as Simplified Humanism, You Only Live Twice, Flinching away from truth is often about protecting the epistemology, Generalizing From One Example

Let me tell you a secret.

You don’t have to experience negative emotion.

I risk coming across as implying that “happiness is a choice,” and that's not what I mean. I’m not implying that it is something easy to do, I’m not implying that it is something you should be able to do right now...

But I’m bringing up the possibility. Have you ever imagined it? Living your normal, ordinary life, from now until you die, but with the distinction that you choose not to experience negative emotion?

It’s likely that you have not thought of it. After all, negative emotions are just part of life, aren’t they? They aren't things we can change, right?

The Serenity Prayer goes like this:

God, grant me the serenity to accept the things I cannot change,
Courage to change the things I can,
And wisdom to know the difference.

The last part is invariably the most tricky one. I think people systematically underestimate the scope of the things that they can change, and that becomes more and more true as technology advances.

As Eliezer has pointed out,

“We have a concept of what a medieval peasant should have had, the dignity with which they should have been treated, that is higher than what they would have thought to ask for themselves.”

A medieval peasant accepted infant death, slavery, and the like as “part of the plan,” as “just the way things are.” Just like people nowadays accept death as “just the way things are,” and say things like “it is impossible to avoid negative emotions altogether because to live is to experience setbacks and conflicts.”

The same can be said of us who grew up in abusive families, as well as oppressed groups in authoritarian societies — they may consider normal things that to us are abject, merely because they haven’t known of anything better.

I think if there is something close to making me feel indignation, it is the fact that the ways in which life can be better are not self-evident.


Throughout my childhood and adolescence I had a host of internalizing mental disorders — depression, anxiety, poor self-esteem, dysthymia, suicidal ideation, all that good stuff. I regularly met with several psychotherapists, but unfortunately none provided much help.

When I was 16, however, I was fortunate enough to experience a particularly severe major depressive episode. The pain was so strong, so disabling, so unwavering and all-encompassing, that it eventually prompted my mom to take me to a psychiatrist instead of a psychologist. I experimented with one antidepressant, had problems with it, and then a few months later was prescribed Wellbutrin.

And… three weeks after I started taking it, I realized something odd. I realized that I didn't need to ruminate on all the ways in which I was the worst person in the world all the time! Even if that were true, it would be far better to occupy my thoughts with something positive, like trying to improve myself.

Another thing I noticed at the same time, and which shocked me, was that I was unable to feel jealousy. I had received the news that my ex — whom I still had a strong unrequited love for, which was largely the source of the depression — had started dating someone, and all that I could muster as an emotional reaction to it was “That's cool for him.” No feelings of jealousy, no feelings of rejection.

Eventually, after noticing those and other noteworthy changes in my mind, and after giving them a lot of thought and consideration — after making sure that it wasn't some sort of mirage — it was clear to me, by the fourth week, that, indeed, the depressive episode was over. My mind had gracefully transitioned from a state of constant mental torment to that of serene internal tranquility, and I deemed the change unlikely to be ephemeral.

It's been over two years, and although life has indeed had its ups and downs, there is... incredibly little overlap between my mood before and after I started taking Wellbutrin. Almost all of the days in my life after I started taking it have been better than almost all of the days before.

It is truly difficult to convey just how different the sadness I am capable of today is from the torment I used to be able to feel. My negative emotions, when present, are a pale version of their former selves, to an extent that they barely feel real — they’re pretty much cardboard cutouts of what they used to be.

Now, an interesting thing is that during my pre-Wellbutrin life, I would obviously never have desired a life like the one I have now; such a thing simply wasn’t within the scope of my imagination. It doesn’t come to us naturally to desire a peaceful inner mind and a capacity to control our feelings. It's not a basic human drive, the way that the desires for sex, money, love, and recognition are. Your mind is all that you have, it is all your life is, but aiming the arrow of desire at one’s own mind requires a fair amount of complicated metacognition.

What I find unfortunate about this story is that I had to get to an extremely low point in order for medication to be considered an option. If I hadn’t had that particularly severe depressive episode, I would have kept having a life that was meh seventy percent of the time.

And that makes me wonder: how many people around don’t know how good life can be for them? How many people suffer and think they can’t help it? How many people don’t have a blast with their morning routine merely because they haven’t tried to? Sometimes it genuinely requires a lot of open-mindedness in order to notice that you are sitting on a pot of gold.

We are patently unaware of the scope of the space of possible human psychological experiences. There was once this debate about whether mental imagery was an actual thing. It was only settled when Francis Galton gave people surveys and saw that some people did have mental imagery, and others didn’t. Before that, everyone just assumed that everyone else was like themselves.

It does not seem implausible to me that the same fallacy would apply to the psychological phenomenon of the pleasantness of life. That is, we naturally expect others to experience life as being roughly as pleasant as it is to us in particular. I find this passage from Schopenhauer to be a good example:

“In a world like this […] it is impossible to imagine happiness. It cannot dwell where, as Plato says, continual Becoming and never Being is all that takes place. First of all, no man is happy; he strives his whole life long after imaginary happiness, which he seldom attains, and if he does, then it is only to be disillusioned; and as a rule he is shipwrecked in the end and enters the harbour dismasted.”

He’s making big claims about the psychology of other people’s minds, claims that, thankfully, are wrong; the majority of people are happy. But there is a significant share of the population to whom that quote sounds entirely reasonable (my 15-year-old self and David Benatar included). And those people don’t know how good their life can be.

A while ago 80000hours posted about a study in which subjects who were undecided about certain life-changing decisions agreed to make a decision based on a coin flip. The researchers then evaluated the subjects’ happiness several months after the study, and whether or not they had taken the decision the coin flip generated.

It turned out that people who changed something big in their life because of the coin flip were much happier later:

The causal effect of quitting a job is estimated to be a gain of 5.2 happiness points out of 10, and breaking up as a gain of 2.7 out of 10!

Notably, “Should I move” also had a large effect (3.2), as did “should I start my own business” (5.2).

One interesting thing I noticed in those results is that what those decisions have in common, compared to the decisions that did not influence happiness that much, is that they result in a substantial change in people’s day-to-day life experiences.

Perhaps day-to-day life experiences can be especially prone to being coded as something to be accepted, as “just part of life.” It can be difficult to think of changing something so fundamental about life that you experience it every day.

Maybe the lesson here is that experimentation is valuable.

I’ve received some objections to my attitude of valuing happiness without special exceptions and without an upper bound.

One common objection is that negative emotion sends important messages. I actually agree with that. Roughly speaking, the message that negative valence sends is “stop what you’re doing and change your strategy.” So, now you know. Now you can try to avoid the negative feeling when you notice it coming, and remember the message: stop what you’re doing and change your strategy. (In the case that you choose to even care about it, since emotions are based on evolutionary goals that might not be fully aligned with our own.)

I want to make it clear that in this post I am not claiming that external circumstances do not matter and all that people need to do is change their internal states. Not at all. I fully endorse changing one's life in order to improve well-being when that is the best strategy to do so, and as we saw in that 80000hours post, it often is.

“You can win with a long weapon, and yet you can also win with a short weapon. In short, the Way of the Ichi school is the spirit of winning, whatever the weapon and whatever its size.”

Another objection I’ve faced is the claim that it is futile to pursue happiness, that it is empty or hollow without suffering, and that we should be aiming at meaning.

I think the threat of “empty” or “meaningless” happiness is much less plausible than most people think. It seems to me that there is a close correspondence between high-level beliefs and mood. I, for one, have visited quite a wide range of mind-states along the valence axis, and every single step I took from the nadir of my worst depression to the great gratitude I feel now involved a change in how I see the world, a change in how I think.

The degree to which that is generalizable to other people is a question that I am interested in investigating. For now, it’s instructive to notice that the popular Nihilist Memes Facebook pages consist almost entirely of memes about depression. And that one of the diagnostic criteria of Borderline Personality Disorder, a very unpleasant condition, is “feelings of chronic emptiness.” Religious and spiritual experiences, on the other hand, which I would regard as some of the most blissful states possible to humans, involve plenty of meaning, so much that it all too often messes up people's epistemology.

Another objection I have encountered is that constant happiness makes one insensitive to the suffering of others. That is not supported by empirical evidence. Positive mood makes people less willing to endure harm, or to let others endure harm. It has been found over and over again to make people more interested in helping others and doing more than what is expected from them.

Moreover, I would not be here endorsing positivity on LessWrong if I didn't think that it had pragmatic value in helping us think and work. That’s because most of the people who will ever live will live in the far future, and many people on this site are doing valuable work in that area. It is important that they keep their minds sharp, and positivity goes a long way in that regard. There are, of course, other variables that affect productivity, and I am interested in investigating them as well.

Another motivating factor driving me to write this is that I think it is important for me to... have this debate, in order to think more clearly about others’ attitudes towards happiness, and to understand where exactly their differences in opinion from mine stem from. This might be valuable for cause prioritization research. The cool thing about information is that it doesn't have an expiration date. The knowledge and data that we gather will be passed on to the future and be a foundation that future researchers will build upon.

I think Anna Salamon, in one of my favorite LessWrong posts, provides a useful framework with which to think about why we may find some information aversive:

when I notice I'm averse to taking in "accurate" information, I ask myself what would be bad about taking in that information.

I think that drives at least part of the motivation behind the acceptance of negative emotions. It makes sense, since there are many ways in which it can be bad to think that negative emotion is always bad. For instance, when you are actively feeling a negative emotion, it often helps to hear that it is okay to feel that emotion; that makes you feel reassured and validated. By just plainly recognizing the badness of negative emotion, on the other hand, you risk getting into a loop. As an example, it turns out that, as depressing as it sounds, with enough self-referentiality it is entirely possible to be depressed because you’re depressed because you’re depressed. I've been there. And it's distinctly worse than merely being depressed at the object-level.

I’ll steal one of the post’s bucket drawings in order to illustrate this:

Whether negative emotion is always bad is a value judgement, which is why I left that label in the Desired state panel blank. But it is always useful to separate “is negative emotion always bad” and “should I feel shame/guilt/sadness for experiencing negative emotion” into two mental buckets, and to recognize that they are separate questions.

Acceptance is useful when you cannot change a problem. Acceptance is useful when you cannot change a problem. Both those sentences can be true at the same time. And, as technology advances, our ability to solve problems improves; what was once impossible becomes merely an engineering problem.


Announcement: AI alignment prize round 4 winners

20 January, 2019 - 17:46
Published on January 20, 2019 2:46 PM UTC

We (Zvi Mowshowitz and Vladimir Slepnev) are happy to announce the results of the fourth round of the AI Alignment Prize, funded by Paul Christiano. From July 15 to December 31, 2018 we received 10 entries, and are awarding four prizes for a total of $20,000.

The winners

We are awarding two first prizes of $7,500 each. One of them goes to Alexander Turner for Penalizing Impact via Attainable Utility Preservation; the other goes to Abram Demski and Scott Garrabrant for the Embedded Agency sequence.

We are also awarding two second prizes of $2,500 each: to Ryan Carey for Addressing three problems with counterfactual corrigibility, and to Wei Dai for Three AI Safety Related Ideas and Two Neglected Problems in Human-AI Safety.

We will contact each winner by email to arrange transfer of money. Many thanks to everyone else who participated!

Moving on

This concludes the AI Alignment Prize for now. It has stimulated a lot of good work during its year-long run, but participation has been slowing down from round to round, and we don't think it's worth continuing in its current form.

Once again, we'd like to thank everyone who sent us articles! And special thanks to Ben and Oliver from the LW2.0 team for their enthusiasm and help.