# LessWrong.com News

A community blog devoted to refining the art of rationality

### Notes on notes on virtues

December 30, 2020 - 20:47
Published on December 30, 2020 5:47 PM GMT

In a series of posts here, I’ve been sharing my notes about a variety of virtues. In this post, I share my thoughts about why I’ve been on this case and what I’m trying to accomplish.

Here’s a one-paragraph summary: I have come to think that becoming more skillful and well-rounded in the practice of the virtues is key to being a better, more satisfied, and more effective person. However, childhood training in the virtues is scattershot and haphazard, and remedial training (or self-help) as adults is also spotty. I’m trying to contribute toward fixing this by assembling descriptions of the various virtues, with a focus on ways to improve them in ourselves.

#### Why I think this is important

“The characteristic feature of all ethics is to consider human life as a game that can be won or lost and to teach man the means of winning.” ―Simone de Beauvoir

“The branch of philosophy on which we are at present engaged differs from the others in not being a subject of merely intellectual interest — I mean we are not concerned to know what goodness essentially is, but how we are to become good people, for this alone gives the study its practical value.” ―Aristotle

Life is complex. We are constantly confronted by a variety of challenges. To address those challenges well, we need to have learned a variety of basic life skills such that they are second-nature to us. “The virtues” are a set of such skills that apply to challenges common to typical human lives.

If you have a better command of the virtues, this helps you thrive as an individual and also improves your effect on those around you. Society at large benefits from a higher level of competence in the virtues of those in it. But our culture is not all that good at teaching or encouraging the virtues. Some virtues seem so lacking from the public sphere that I wince when I look at it.

Our institutions of formal childhood education are patchy at best in this regard. You’ll get your reading, writing, and ’rithmetic, if you’re lucky anyway, but will you get resourcefulness, resilience, restraint, responsibility, rectification, or reputability? Other institutions (scouting, religion, etc.) pick up some of the slack, but not nearly enough. Parents have little guidance on how to convey virtue education to their children effectively, and also have their own blind spots from their own spotty educations. There have been some gestures toward formal “character education” of children, which is probably a good sign. But my guess is that children are going to learn most from the example of their elders: if we don’t value virtues enough to pursue them in our own lives, that will make more of an impression on the up-and-coming generation than any “do as we say, not as we do” education will.

#### A virtue gym?

For a few specific virtues or skills, there are adult education / training / exercise programs. If you want to be more fit, you can join a gym. If you want to be a better public speaker, you can join Toastmasters. If you want to sober up, you can attend Alcoholics Anonymous. But for most virtues, there’s nothing like this, and that’s a shame.

Two misconceptions that sometimes cause people to give up too early on developing virtues are these: 1) that virtues are talents that some people have and other people don’t as a matter of predisposition, genetics, the grace of God, or what have you (“I’m just not a very influential / graceful / original person”), and 2) that having a virtue is not a matter of developing a habit but of having an opinion (e.g. I agree that creativity is good, and I try to respect the virtue of creativity that way, rather than by creating). It’s more accurate to think of a virtue as a skill like any other. Like juggling, it might be hard at first, it might come easier to some people than others, but almost anyone can learn to do it if they’re just willing to put in the persistent practice.

We are creatures of habit: We create ourselves by what we practice. If we adopt habits without giving them much consideration, we risk becoming what we never intended to be. If instead we deliberate carefully about what habits we want to cultivate, and then actually put in the work, we can become the sculptors of our own characters.

What if there were some institution like a “virtue gymnasium” through which you could work on virtues alongside others, learning at your own pace, and building a library of wisdom about how to go about it most productively? What if there were something like Toastmasters, or Alcoholics Anonymous, or the YMCA but for all of the virtues people need?

#### Ben Franklin’s experiment

One day Benjamin Franklin “conceiv’d the bold and arduous project of arriving at moral perfection.” He explains in his autobiography, “as I knew, or thought I knew, what was right and wrong, I did not see why I might not always do the one and avoid the other.”

He quickly found that he had underestimated the task. “While my care was employ’d in guarding against one fault, I was often surprised by another; habit took the advantage of inattention; inclination was sometimes too strong for reason. I concluded, at length, that the mere speculative conviction that it was our interest to be completely virtuous, was not sufficient to prevent our slipping; and that the contrary habits must be broken, and good ones acquired and established, before we can have any dependence on a steady, uniform rectitude of conduct.”

So he decided to be more methodical. He reviewed various lists of virtues in the literature he was familiar with, and then created his own list of a dozen virtues that he thought were particularly important. With the intention of making each of these virtues habitual, he struck on the idea of tackling them one-at-a-time, starting with ones he thought would help him more easily acquire the others. (Virtues have a way of building on each other. Some virtues, for example persistence, or curiosity, or honor, can make other virtues easier to acquire. In this way, the process of strengthening virtues bears compound interest.)

He decided to do a daily accounting of each virtue he was practicing. He created a notebook with a table for each week. The table had one column for each day of the week, and one row for each of his virtues. Each time he failed to fulfill a particular virtue on a certain day, he marked the table cell for that virtue/day with “a little black spot” (or more than one if he screwed up multiple times). The plan was that when he achieved a week in which he successfully kept the row for Temperance blank, he would move on to concentrating on Silence (attending to Temperance as well). When he managed to keep both of those rows clear for a week, he would move on to Order, and so on.
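Franklin's bookkeeping can be sketched as a small data structure. This is only an illustrative model of the notebook described above, not anything Franklin wrote: the names `WeekTable`, `mark_fault`, `row_is_clear`, and `next_focus_count` are hypothetical, the day order is an assumption, and only the three virtues named in the text are listed.

```python
# Assumption: the week's column order; the text only says "one column for each day".
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


class WeekTable:
    """One week of the notebook: a row per virtue, a column per day."""

    def __init__(self, virtues):
        # Every virtue/day cell starts with zero "black spots".
        self.marks = {v: [0] * len(DAYS) for v in virtues}

    def mark_fault(self, virtue, day):
        """Add one spot to the cell for this virtue and day."""
        self.marks[virtue][DAYS.index(day)] += 1

    def row_is_clear(self, virtue):
        """True if the virtue's row stayed blank all week."""
        return sum(self.marks[virtue]) == 0


def next_focus_count(week, order, focus_count):
    """Return how many virtues (from the front of `order`) to focus on next week.

    The focus widens by one only when every currently focused row is blank.
    """
    focused = order[:focus_count]
    if focus_count < len(order) and all(week.row_is_clear(v) for v in focused):
        return focus_count + 1
    return focus_count


# Usage: a week spent concentrating on Temperance alone.
order = ["Temperance", "Silence", "Order"]
week = WeekTable(order)
week.mark_fault("Silence", "Tue")  # a lapse in a row not yet in focus
focus = next_focus_count(week, order, focus_count=1)
print(focus)
```

In this sketch the Temperance row is blank, so the next week's focus widens to include Silence, while the earlier virtue stays in the focused set, matching "attending to Temperance as well".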

“I was surpris’d to find myself so much fuller of faults than I had imagined; but I had the satisfaction of seeing them diminish.” He carried his notebook with him for several years. “[T]ho’ I never arrived at the perfection I had been so ambitious of obtaining, but fell far short of it, yet I was, by the endeavour, a better and a happier man than I otherwise should have been…”

He hoped at one point to write a book, The Art of Virtue, which “would have shown the means and manner of obtaining virtue, which would have distinguished it from the mere exhortation to be good that does not instruct and indicate the means.”

He toyed with the idea of a political party that would not advocate for the benefit of a certain segment of the people, but for the good of the country and of mankind in general: the “United Party for Virtue.” This morphed into an idea for a fraternity: the “Society of the Free and Easy.” His plan was to initiate members by putting them through the same practice he had undergone with his notebook of weeks and virtues. He explained the name of the society this way:

The Society of the Free and Easy: free, as being, by the general practice and habit of the virtues, free from the dominion of vice; and particularly by the practice of industry and frugality, free from debt, which exposes a man to confinement, and a species of slavery to his creditors.

He got as far as getting two young men to sign up and begin the work, but then he got distracted with other things and abandoned it. “[T]ho’ I am still of opinion that it was a practicable scheme, and might have been very useful...”

#### The Society of the Free & Easy

Last year I set about trying to pull together something like Franklin’s Society of the Free and Easy (and borrowing his name). I worked with a group of friends and acquaintances to come up with what I think is a pretty good framework for working on virtues in a peer-supported way. In a nutshell, the process is pretty simple:

1. Find a partner or form a small team.
2. Each of you choose a virtue to work on.
3. Take a close look at your virtue, and at any obstacles you feel when you try to practice it.
4. Work with your partner(s) to come up with exercises in which you will frequently, deliberately practice that virtue in ways that challenge your current level of fluency.
6. When you feel you have integrated the virtue adequately into your character, start the process again with a new virtue.

Alas, after some initial promise, the group began to dwindle, and then the pandemic disrupted everything, and now as far as I know there are only two of us still working through the program on the regular. But in the course of researching, we dug up a lot of information about virtues in general and about particular virtues, and that’s forming the basis for the posts I’m sharing here.

#### Notes on virtues

What I’m hoping to do with these notes on virtues is to collect ideas that will be useful to people who want to improve in a certain virtue. This may include actual concrete advice about strengthening that virtue itself, and may also include some discussion about other virtues that are related in some way: maybe they’re prerequisites, or harmonize in some way, or maybe there’s some tension between them. I sometimes find it challenging to define the virtue precisely, or to distinguish it from another virtue — and sometimes the term for the virtue gets overloaded with a variety of meanings in common use — so I include discussion of those nuances too.

I’m aiming to be inclusive of a variety of useful perspectives, and of a variety of cultures, rather than definitive or dogmatic. It’s a fuzzy subject matter to begin with. I’m feeling my way about, leaning on existing guides when I can find them (though I tend to find a lot more examples of people praising or advocating certain virtues than of people explaining them or giving practical advice on how to go about learning them).

I take some inspiration from Aristotle, who, when he examined a set of virtues in his Nicomachean Ethics, started with virtue-concepts as already found in common language and folktale, rather than starting from a theoretical foundation and building ideal virtues from there. When it comes to dividing up a complex subject matter into manageable and coherent chunks, previous generations have already done a lot of the work for us and handed that down to us in the language and tropes we use. That we have found a word or trope useful is a good clue that there’s some reasonably-helpful and worth-noticing regularity at the base of it. While this sort of understanding shouldn’t be confused with the gospel truth of how reality is constituted, it seems wise to glean as much as we can from it before trying to systematize more deliberately.

One of the things I did was to investigate several virtue-based traditions (the Greek cardinal virtues, the traditional Christian virtues, the virtues of Bushido, Confucian virtues, the virtues of Scouting, the West Point virtues), the virtues favored by some particular philosophers (Aristotle, Cicero, Ben Franklin, Ayn Rand, Henry David Thoreau, Shannon Vallor, the Cynic philosophers, the developers of care ethics, William De Witt Hyde, Eliezer Yudkowsky), and the virtues identified as “character strengths” by psychologists operating in the positive psychology paradigm. This isn’t comprehensive by any means, but it was revealing.

For one thing, there was a lot less consensus than I thought there would be about which virtues are the important ones. This is somewhat complicated by problems of terminology. For example, what one philosopher will call self-control, another will call continence, another restraint, another discipline. Or, while Paul says that the greatest virtue is love, he defines “love” in such a way that it incorporates patience, kindness, mudita, modesty, humility, respect, good temper, forgiveness, righteousness, care, trust, hope, and perseverance. Or different cultures will partition virtue-space differently: sisu is only kind-of like perseverance; mudita is only kind-of like sympathy; nying je is only kind-of like compassion. This can be challenging for works in translation, where the translator has chosen the closest equivalent English word, but a close reading shows that the author meant something different from what we mean by that word.

I tried to correct for things like these. I consolidated various terms for very-similar virtues together, and created a spreadsheet where I could note which virtue-clusters had been promoted in which systems or by which philosophers. But of the hundreds of virtues I found, only six of them were on more than half of the lists:

If you add those that were on exactly half of the lists, you also get justice, wisdom/philosophy, sincerity/straightforwardness/earnestness/frankness, industry/effort/enterprise/productiveness, duty/responsibility/purposefulness, piety/reverence, and strength/toughness/vitality/health/fitness.
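The tallying step described above can be sketched in a few lines. This is a minimal illustration and not the actual spreadsheet: the list names and their contents are made-up placeholders, and `SYNONYMS`, `example_lists`, and `virtues_on_majority` are hypothetical names. Only the self-control synonyms are taken from the text.

```python
from collections import Counter

# Map variant terms onto one cluster name (these synonyms are from the text).
SYNONYMS = {
    "continence": "self-control",
    "restraint": "self-control",
    "discipline": "self-control",
}

# Placeholder data: NOT the real traditions or their real contents.
example_lists = {
    "list_a": {"continence", "patience"},
    "list_b": {"restraint", "wit"},
    "list_c": {"discipline", "patience"},
    "list_d": {"wit"},
}


def virtues_on_majority(lists, synonyms):
    """Return virtue clusters that appear on more than half of the lists."""
    counts = Counter()
    for virtues in lists.values():
        # Consolidate synonyms first so a cluster counts once per list.
        clusters = {synonyms.get(v, v) for v in virtues}
        counts.update(clusters)
    half = len(lists) / 2
    return sorted(c for c, n in counts.items() if n > half)


print(virtues_on_majority(example_lists, SYNONYMS))
```

With this placeholder data, only the consolidated self-control cluster clears the "more than half" bar. Virtues on exactly half of the lists are excluded, which is the distinction the two paragraphs above draw.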

You may have heard that patience is a virtue, but it didn’t make that cut. Neither did humility, hope, perseverance, courtesy, generosity, friendliness, creativity, caution, cleanliness, mercy, forgiveness, wit, originality, calm, warmth, curiosity, hospitality, pride, or gratitude. Some lonely virtues like boldness, imagination, spontaneity, and playfulness appeared on only one list. Other skills that are often popularly admired — like being influential, having emotional intelligence, or being good in bed — weren’t on any lists at all.

Some virtues are debatable. Selflessness, pride, altruism? The apostle Paul and Ayn Rand would disagree about what’s the virtue and what’s the vice. Virtues like chastity, obedience, and patriotism give some of us the willies.

I’m aiming to be inclusive and to eventually give some attention also to these less-prominent and more controversial virtues.

#### Why am I going to this trouble?

My hope is that whatever virtue it is that you’re hoping to improve, you’ll be able to get a head start from the research and write-ups I’ve done.

I’m also motivated by self-improvement. I’ve been working to deliberately improve some of these virtues in my life, and I hope to make that an ongoing project, so putting together these virtue-dossiers helps me to lay the groundwork for this.

If we manage to reboot the Society of the Free & Easy in the post-pandemic time, these may help us hit the ground running.

I also have vague ideas about this being a worthwhile political project. I’ve come to distrust talk of elections and revolutions and institutional reforms. I think the longer, harder, more subtle project of helping people improve is a more reliable path to a better future than trying to impose wise policies on them from on high. If people become braver, wiser, more just, and more honorable, public policy will follow their lead. If people become more cowardly, foolish, grasping, and disreputable, conniving politicians will lead them by the nose.

I’m sharing these at LessWrong in particular because I value the sort of insightful feedback people share here. Since I’m not an expert at any of the virtues I’m writing up, I’m slyly taking advantage of Cunningham’s Law to correct my misunderstandings about them.


### My unbundling of morality

December 30, 2020 - 18:19
Published on December 30, 2020 3:19 PM GMT

Inspired by seeing Morality as "Coordination", vs "Altruism".

• Coordination: a) To do more win-win stuff. b) To band against some outgroup (win-win-lose).
• Empathy: E.g., just thinking about how being tortured sucks so much makes someone want to stop others from being tortured.
• Signaling and reputation: E.g., having a reputation for being fair can give you power and status. Inversely, having a reputation for dishonesty can close a lot of opportunities.
• Insurance: Reducing variance of outcomes (e.g., bats sharing food).
• Hardwired assumptions: E.g., incest leads to flawed children.
• Fear of punishment: E.g., murder probably won't go unanswered.
• Rewards of subserviency: E.g., respecting high-status people is not without benefits.
• Power/status games: Enforcing norms can increase your own status and decrease others'.
• Optimizing trade-offs for personal benefits: E.g., net-neutrality is good for middle-class people, bad for poor people. "Bravery debates" might fall under this umbrella as well.
• Instinctual game-theoretic strategies: E.g., people like having more control and agency ("freedom"). Note that this is different from coordination (coordination is a subset of this); a lot of these strategies are win-lose.
• Abdicating responsibility: E.g., the current fiasco of Covid vaccines. People prefer passive harm to risky interventions because any activity brings responsibility.

What more can you think of? (Of course, a lot of these have some overlap.)


### Intellectual Authority

December 30, 2020 - 16:50
Published on December 30, 2020 1:50 PM GMT

Photograph: the first Solvay Conference in 1911, including Albert Einstein, Marie Curie, and Max Planck.

This is the second essay in my series on intellectual legitimacy. Read the first essay here. This essay was originally published on SamoBurja.com. You can access the original here.

It is always a pleasure to have an excuse to mention underrated thinkers. Robin Hanson is one of my favorites, with dozens of original ideas to his name. His career as a communicator spans everything from coining the term Great Filter in cosmology to writing a book on the inner workings of the human brain. While working as a physicist at NASA, Professor Hanson had the idea now known as futarchy: a system of government in which some of the decision-making power is held by prediction markets.

Physicists, however, do not have intellectual authority on the topic of how society should be run. Economists do. The intellectual legitimacy of an idea partially depends on whether it is proposed by someone with intellectual authority. Intellectual authority is a personal reputation. It reflects not only perceived expertise, but perceived pro-social orientation of the individual. Intellectual authority rests with people who are seen as possessing good intellectual judgement that is not only technically correct, but also takes into account social and political factors. Compare, for example, the perceptions of Garry Kasparov and Bobby Fischer. Both legendary chess grandmasters, but the former a celebrated public intellectual, the latter considered a tragic cautionary tale due to his erratic, offensive, and sometimes illegal behavior. Such perceptions are of course also contingent on institutional endorsements and can be changed by them.

As reported in Fortune, Robin Hanson “says he went to the expense and trouble of getting a Ph.D. almost entirely so that people would take him seriously … he was thinking up market-based improvements to government institutions, but found that with his techie background he couldn’t get anyone … to listen. So he went to CalTech to get a Ph.D.” Now branded as an economist, Hanson was soon implementing a pilot prediction market for the Defense Advanced Research Projects Agency (DARPA). Today, Hanson’s work is often cited by leading commentators and intellectuals such as Nate Silver and Tyler Cowen. By understanding how to acquire the relevant intellectual authority, Hanson gained an audience for his idea he couldn’t have had otherwise. Indeed, it is likely that the productive frontier for his work lies not in new ideas, but in improving his intellectual authority to help advance his many existing ideas.

The intellectual legitimacy of ideas is partially determined by who communicates them. The messenger matters for the reputation of an idea. Often those with the highest intellectual authority aren’t the most generative. As a consequence, credit for originating a legitimized idea nearly always flows to the person with the highest intellectual authority who has a claim to it. This inspired what is known as Stigler’s law of eponymy, which states that no scientific discovery is named after its original discoverer. Fortunately, this law doesn’t quite hold. Generativity, validation, and authority can be aligned through functional institutions, although this is an exceptional rather than ordinary state of affairs.

Different social roles hold different amounts of intellectual authority. The academic has more intellectual authority than the journalist, who in turn has more authority than the blogger. The god-king, a now-extinct social role, could claim complete intellectual authority over all domains. A social role need not be extant in a given society to have intellectual authority: Marcus Aurelius’ Meditations wouldn’t be as widely read today if he hadn’t been a Roman Emperor.

Figure 1. Authority of different social roles.

The intellectual authority of these different social roles themselves can change over time. In this way such roles are an example of borrowed power: one can use one’s socially legible role to legitimate one’s ideas, but at the end of the day one’s social role is contingent on a broader social landscape that is prone to change. The king, while he may still hold the ability to issue honors, holds much less intellectual sway today than he would have in 1650. The English Civil War, as well as the American, French, and Russian revolutions are significant causes of this shift. The intellectual authority of a role changes with success or failure at social, economic, and political competition between niches.

Figure 2. Authority can shift over time.

Legible intellectual successes can change the authority of a social role as well. A good example of this is 20th century physics. The triumphs of theories such as general relativity and quantum mechanics greatly enhanced the intellectual authority of the profession. The detonation of two atomic bombs in wartime also demonstrated this new mastery of physics as vital to state interests and war. This drove physicists such as Einstein to comment on statesmanship and others like Oppenheimer to even try their hand at it, with mixed success.

Figure 3. The intellectual authority of 1940s physicists. Left: The New York Times reports on the 1919 Eddington experiment. This confirmation of general relativity was remarkably publicly legible — as with the newly-enhanced intellectual authority of physicists. Right: Members of the Emergency Committee of Atomic Scientists, founded 1946.

The fields of study themselves change with new discoveries and developments, as do the people who pursue them. The archaeologist in 1880 has a somewhat wider ability to comment on science and society than does the archaeologist in 2020. The archaeologist of the 1880s is a gentleman who pursues knowledge of the distant past, usually on his own dime, and who can have his wife bejeweled as Helen of Troy when announcing finds. The archaeologist in 2020 is an academic who has to stick to very narrow claims to avoid career death.

One might think professionalization means a strict improvement in the epistemic foundations of a field. This isn’t the case. Rather, professionalization is a way to reduce variance. Professionalization is essentially the creation of set procedures, norms, and social roles to govern a given area of knowledge, and while this eliminates unserious crackpottery, it also crowds out the “unorthodox” and often stochastic experimentation employed by all exceptional live players as they drive new fields forward. In short, reducing variance cuts off both tails of the skills distribution. While professionalization does eliminate some malpractice, for pre-paradigmatic fields it can be harmful, since researchers must pursue hypotheses that can’t be justified to bureaucrats. If the hypothesis could be justified to bureaucrats the field would already be mature. A premature professionalization closes many doors of inquiry.

Archaeologists today are less likely to change their historical views based on new finds than their 19th century counterparts. The latter saw no problem in integrating evidence of previously unknown civilizations, including the Hittites and the Sumerians, when digs suggested it. The long and slow road of evaluating modern finds such as Göbekli Tepe in Turkey serves as a notable contrast. An impressive 11,000-year-old structure featuring hundreds of pillars, a 50-foot tall artificial mound, and statues and carvings of animals, the find showed neolithic construction to be much older than the previously accepted 8,000 years, as a consequence also predating the consensus origin story of agriculture. The archaeological site was actually first surveyed in 1963 by an effort of the University of Chicago, but they incorrectly identified the distinct T-shaped pillars as a medieval cemetery, matching preconceptions. The German archaeologist Klaus Schmidt stumbled on these records while searching for new dig sites in 1994. He found them unpersuasive. It then took Schmidt nearly two decades of excavation work to overturn the field’s previous orthodoxy.

Figure 4. Change in authority of archaeology.

The social roles held by an individual can themselves shift over an individual’s career, changing their intellectual authority. The well-cited academic was once a PhD candidate, the seasoned security expert was once a novice security engineer, and the practicing surgeon was once a pre-med student. Jumps between careers can make this change in intellectual authority particularly stark. People mocked Arnold Schwarzenegger as a mere actor when he ran for and won the governorship of California, but in a way these remarks showed Schwarzenegger had finally made it. When he was merely the star of a streak of blockbuster movies, people would mock him for not being a real actor. His career had started in bodybuilding, after all.

Figure 5. Authority may change over an individual’s career, especially if they adopt new social roles.

#### Sources of Personal Intellectual Authority

Are changes in social roles the only source of change to an individual’s intellectual authority? Not quite. All changes in personal authority are ultimately traceable to four sources:

1. Bureaucratically-issued

The simplest example is authority issued by an institution that people trust. These institutions might have a single or several representatives, but have a procedure in place for ascertaining whom to legitimize. This means that such institutions are always somewhat bureaucratic. If one bases their intellectual authority on this source, it is strongly correlated with the authority of the institution itself. Examples include: PhDs from Harvard, economists at the Federal Reserve, rocket scientists at NASA. Were any of these institutions’ reputations to suffer, the individuals would take a hit as well, even if they had no part in relevant failings.

2. Great feats

A great feat by an individual can be the basis of intellectual authority in the domain of that feat. Examples include Buzz Aldrin’s authority to comment on space-related matters, Viktor Frankl and Antarctic voyagers’ authority to comment on overcoming hardship and the human condition, and Bill Gates’ authority to comment on development of the internet and computers. Bill Gates is also a good example of a notable subcategory of such feats, namely producing large amounts of economic value in a domain, which he has parlayed into a recognized ability to comment on energy, public health, and philanthropy.

3. Third-party endorsement

Similar to authority via bureaucratic issuance, one can obtain intellectual authority in a domain through the personal endorsement of someone who themselves has authority over that domain. A book on foreign policy is more authoritative if it boasts an approving blurb from former Secretary of State Henry Kissinger. The scope of the authority transfer is limited and tied to the endorser’s existing authority — while a personal endorsement from John Doe will likely cause John Doe’s personal network to believe you, your credibility in the eyes of the general public will be more limited.

4. First-person authority

The final source of authority is personal intellectual authority. One must deliver two components to obtain personal intellectual authority: intellectual material and a marker that you claim and deserve intellectual authority. An example of a pure marker might be prefacing an opinion on social media with “I’ve read several books on this and…”. Another more extreme example would be Buzz Aldrin punching moon conspiracy theorist Bart Sibrel in the face after the latter claimed that Buzz did not land on the moon.

One can also deliver material without signaling that one deserves authority. In this case one does not gain authority. Many fail to include such a marker; this is especially common with those lacking social awareness. At the other extreme are those who use formal and pseudo-technical language as a marker rather than as a means to convey material. The two extremes sometimes happily coexist in ignorance in the same subcultures.

These four categories might be well illustrated by thinking about intellectual authority in physics. If you have a PhD in physics, your words about physics are assumed to be sound. When the physics PhD talks, they’re speaking from the authority of the institution. Contemporary physics is in this case professionalized, much as archaeology is. The professionalization is regulated by social networks and bureaucratic bodies. Most physics PhDs and physics professors are then relying on the physics and academic establishments for their intellectual authority.

A typical physicist is well contrasted with Albert Einstein. He is authoritative on his own and speaks on behalf of his own theories. The legitimacy of his ideas is not granted by an institution. How did he reach this position?

Albert Einstein first performed great feats such as producing special relativity, which examines the relationship between space and time, and explaining the experimental results of the photoelectric effect. These two examples alone demonstrate Einstein’s intellectual range. The former clarifies theoretical relationships, while the latter develops new theory to explain a surprising experiment.

It is striking that the intellectual legitimacy of the ideas published in his Annus Mirabilis papers ultimately added to the intellectual authority of academia, when before publication academia had frustrated Einstein’s efforts to find a position. His most well-known work was mostly done while he was a patent clerk, working outside of academia. This is a common pattern. The dynamics are similar to Stigler’s Law of eponymy which we considered earlier. Perhaps we could call this pattern “Stigler’s tax”: any discovery that is eventually accepted by academia is ultimately claimed to have originated from academia.

As Benjamin Franklin remarked, of all the things in this world one can only be certain of the inevitability of death and taxes. So, paying our due, what measures can an individual take to harmonize their own production of intellectual material with personal markers of accomplishment? In contemporary society there are two notable ways to gain personal intellectual authority without having to route through normal institutions.

The first one is writing a book. Becoming a public intellectual through publishing a book on a social topic, especially a “pressing issue of our time”, is a good method of building first person authority. The book must contain intellectual material, but is also a marker that you deserve authority. Contrast this with a series of blog posts, where the primary concern might be improving the material or writing to friends.

The second common route is through creating economically viable institutions that take your thoughts for granted. For example, Warren Buffett has acquired intellectual authority over financial markets because of the success of his company, Berkshire Hathaway, which was built around his idiosyncratic approach to investing. Steve Jobs acquired intellectual authority over design and computing because of the success of Apple, which was built around his personal taste. Books and companies aren’t the only two routes.

Even if an idea is communicated by someone with the relevant intellectual authority, this does not necessarily mean that it will be well-received or accepted as legitimate automatically. On a smaller scale, communicating the same idea can be received differently depending on the idea’s packaging.

Read more from Samo Burja here.

Discuss

### I object (in theory)

December 30, 2020 - 15:50
Published on December 30, 2020 12:50 PM GMT

In middle school, I had a really good teacher for social studies. (I don’t think I realized how good she was at her job until much later.) On one occasion, she wanted to demonstrate to us the difference between capitalist ideology and socialist ideology.

Her demonstration came in two parts: first, she handed out pieces of candy corn in random quantities to all the students, and told us we could play games of rock-paper-scissors with one another to increase our lot. Every time you won a game against someone, you got to take one of their pieces of candy. Games could only be initiated with consent from both parties, if I remember correctly. We played games for maybe five or ten minutes, and much candy corn traded hands. This was meant to represent capitalism.

Then, she collected all the candy and handed it out once again, this time giving exactly two pieces to each student. There was no trading phase in this round, so we just sat there for a few seconds considering the even piles of candy in front of us. This was meant to represent socialism.

Finally, with the second round of candies still on our desks, she asked us for our thoughts about which system was better. My own opinion, if I remember right, was that I favored the socialist way of handing out candy corn. (This is not an endorsement of establishing socialism as the arbiter of other resource conflicts.) I don’t remember most of what my classmates said, but the general consensus was that the capitalist system was better, because in that system, we had a hand in our own fate. If we weren’t happy with the amount of candy we got, we could, by golly, get up and do something about that.

(Interestingly, they also described capitalism as being “more fair”. I’m pretty sure I raised my hand at that point and said that, regardless of which option you think is better, the latter was clearly more fair. But at least one person persisted and said that fairness means having agency. At the time I thought that was dumb, but now I think it’s just invoking sense two of the word “fair”, where it means not that everybody gets the same treatment but rather that the outcome each person gets depends predictably on their actions. This is not to be confused with the equality/equity distinction, which looks at the first sense of the word “fair” and asks whether “same treatment” means “same opportunity” or “same outcome”. But this is a tangent to my main point in this post.)

(Also, I don’t think it mattered that rock-paper-scissors wasn’t a game of skill. Something something, it still imbued people with a sense of agency, something something.)

Anyway, the discussion wound down, and it was getting to be the end of class. The teacher had promised us that, yes, we would eventually have an opportunity to eat actual candy, and just before the bell rang she said we could eat the candy on our desks before we left. We did, and then we left.

Did you spot the problem?

I’ll give you a minute. Scroll down when you’re ready.

The problem is this: even though most people thought that the first system was better, nobody objected when the candy we actually got to eat was the candy given to us by the second system. I can’t say for sure, but I feel reasonably confident that if we, a group of middle schoolers, had been given random amounts of candy to actually eat, where the amount each student got was decided by the teacher’s whim, we would have lost our minds.

And, I think this would be true even if we’d had an opportunity to influence the amount we got. Somehow I think that if that had been what really decided the amount of candy we got to eat, rather than the amount of candy pieces that symbolized money but were ultimately just chips in a game… well, I think it would suddenly have seemed a lot more noticeable that some people didn’t have to influence the amount they got to end up with a lot, and others did. There seems to have been a dissonance between what my classmates endorsed and what they actually wanted — between what they objected to as an abstract idea and what they would have actually complained about if it had happened. Hence the title of the post.

(Somehow I also think the effect would have been especially strong in those who got the least candy to begin with. But that’s also a tangent to my main point, which is the dissonance thing.)

I don’t have anything especially insightful to say about this. I just thought it was kind of interesting.

Discuss

December 30, 2020 - 06:23
Published on December 30, 2020 3:23 AM GMT

It's the second annual review, and 121 posts have been nominated.

This year we've had more than double the number of nominations we had last year. But on reviews we're still playing catch-up: last year we had 118 reviews, yet this year we've only had 51 so far.

It can be a bit daunting to try to review so many posts, so to help out, I'm making this thread. Every comment on this thread will be a post, and you should vote on which ones you would like to read a review of.

Ideally, a review puts the post in the context of a broader conversation and describes its key contributions, its strengths and flaws, and where more work can be done.

(Or something else. Many people who write self-reviews often give a different flavor of review. And I've read many great short reviews, e.g. Jameson Quinn and Zvi last year did a lot of short reviews that communicated their impression of the post quite clearly.)

So I'm going to leave 122 comments on this post. 121 comments will just be a post title, and the other one will be for thread meta. (Search "Meta Thread".) I will remove my own votes from them, so they all start at zero.

Please vote on the comments to show how much you'd like to see reviews of different posts! Feel free to add a comment about what sort of review you'd like to see.

(Yes, I will probably get a lot of karma from this thread. Mwahaha you have fallen for my evil trap.)

(Also, my thanks to reviewers magfrump and Zvi with 5 each, johnswentworth with 6 reviews, and to fiddler with 10 (!), all thoughtful and valuable.)

Discuss

### Collider bias as a cognitive blindspot?

December 30, 2020 - 05:39
Published on December 30, 2020 2:39 AM GMT

Zack M. Davis summarizes collider bias as follows:

The explaining-away effect (or, collider bias; or, Berkson's paradox) is a statistical phenomenon in which statistically independent causes with a common effect become anticorrelated when conditioning on the effect.

In the language of d-separation, if you have a causal graph X → Z ← Y, then conditioning on Z unblocks the path between X and Y.

... if you have a sore throat and cough, and aren't sure whether you have the flu or mono, you should be relieved to find out it's "just" a flu, because that decreases the probability that you have mono. You could be infected with both the influenza and mononucleosis viruses, but if the flu is completely sufficient to explain your symptoms, there's no additional reason to expect mono.[1]

Wikipedia gives a further example:

Suppose Alex will only date a man if his niceness plus his handsomeness exceeds some threshold. Then nicer men do not have to be as handsome to qualify for Alex's dating pool. So, among the men that Alex dates, Alex may observe that the nicer ones are less handsome on average (and vice versa), even if these traits are uncorrelated in the general population. Note that this does not mean that men in the dating pool compare unfavorably with men in the population. On the contrary, Alex's selection criterion means that Alex has high standards. The average nice man that Alex dates is actually more handsome than the average man in the population (since even among nice men, the ugliest portion of the population is skipped). Berkson's negative correlation is an effect that arises within the dating pool: the rude men that Alex dates must have been even more handsome to qualify.

No crazy psychoanalysis, just a simple statistical artifact. (On a meta level, perhaps attractive people are meaner for some reason, but a priori, doesn't collider bias explain away the need for other explanations?)
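
The dating-pool example can be checked numerically. The following is a minimal sketch, not taken from either quoted source: it assumes niceness and handsomeness are independent standard normal scores and that Alex's threshold is 1.0 (both are illustrative assumptions, as are the function names `pearson` and `simulate`). It compares the correlation of the two traits in the whole population with the correlation inside the selected pool.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20000, threshold=1.0, seed=0):
    """Return (correlation in population, correlation among men Alex dates).

    Assumption: both traits are independent N(0, 1) draws, and Alex dates
    a man only if niceness + handsomeness exceeds `threshold`.
    """
    rng = random.Random(seed)
    population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    # Conditioning on the collider: keep only men who pass the threshold.
    dated = [(nice, handsome) for nice, handsome in population
             if nice + handsome > threshold]
    r_population = pearson(*zip(*population))
    r_dated = pearson(*zip(*dated))
    return r_population, r_dated


if __name__ == "__main__":
    r_population, r_dated = simulate()
    print(f"correlation in population:  {r_population:+.3f}")
    print(f"correlation in dating pool: {r_dated:+.3f}")
```

Under these assumptions the population correlation should come out near zero while the correlation inside the dating pool is clearly negative, even though nothing causal links the two traits.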

In The Book of Why, Judea Pearl speculates (emphasis mine):

Our brains are not wired to do probability problems, but they are wired to do causal problems. And this causal wiring produces systematic probabilistic mistakes, like optical illusions. Because there is no causal connection between [A and C in A→B←C], either directly or through a common cause, [people] find it utterly incomprehensible that there is a probabilistic association. Our brains are not prepared to accept causeless correlations, and we need special training - through examples like the Monty Hall paradox... - to identify situations where they can arise. Once we have "rewired our brains" to recognize colliders, the paradox ceases to be confusing.

But how is this done? Perhaps one simply meditates on the wisdom of causal diagrams and thereby comes to reason about colliders intuitively and correctly, or at least to recognize them reliably.

If anyone has rewired their brain thusly, I'd love to hear how.

Discuss

### The 4-Hour Social Life

December 30, 2020 - 03:58
Published on December 30, 2020 12:58 AM GMT

| | Circumscribed | Leveraged |
|---|---|---|
| Passive | Labor | Skilled trades |
| Active | Self-employment | Entrepreneurship |

The horizontal axis denotes expertise, a mix of technical skill and book smarts. The vertical axis denotes initiative, a mix of creativity and street smarts. Intelligence, self motivation and hard work contribute to both.

If you employ expertise without initiative then you can get a high-paying job such as in engineering or management. If you employ initiative without expertise then you can become self-employed in the service sector. Expertise gets you lots of money. Initiative frees you from corporate bullshit.

The ultimate economic activity is creating a business you can sell. A business you can sell is a business that generates profit without you managing it. Whether or not you sell the business is immaterial. The critical factor is whether you could.

The opposite of entrepreneurship is labor. Labor is when you sell your time for money.

Extraordinary Loneliness

The simplest way to make friends is to spend time around other people in meatspace. Trading time for friendships via labor is a practical way for normies to befriend one another. If you are extraordinary then normie friends are mind-numbingly boring. Extraordinary people are rare. If you want to make friends with extraordinary people then it is not sufficient to search manually. You need a better friend funnel.

• You can employ expertise by participating in a skill-intensive activity such as a social dance, a campout or a scientific conference. Expertise is good for increasing the quality of your encounters. Such activities are the social equivalent to skilled trades. They are more time-efficient than meeting random people, but do little to scale the absolute number of people who know about you.
• You can employ initiative by hosting your own MeetUp or similar club. Initiative is good for increasing the quantity of your encounters by maximizing the attention you get for every in-person event. Initiative of this sort rapidly increases the number of people you can meet, but only until you hit the limits of your niche. Initiative without expertise is the social equivalent to self-employment in a service business.

If you are one in a million then neither expertise nor initiative alone is sufficient to meet equally extraordinary people.

Iteration

Fame is the number of people who know who you are. It is possible to get famous on merit alone by, for example, becoming extremely good at basketball or creating Carnegie Steel. For our purposes, earning fame legitimately is overkill and is therefore a less-than-optimal use of time.

Rather than doing cool things until someone else writes about you, it is better to create media yourself. This has two advantages.

1. It doesn't rely on others creating media about you. You can bootstrap yourself.
2. It saves reporters the effort of writing stories. Other people can just plagiarize you—which is a good thing[1].

The game is all about scale. You must cultivate a personality lots of people find interesting and then deliver it to them. High bandwidth beats low bandwidth. Video beats pictures. Pictures beat text.

Whatever you do, it must satisfy demand and it must be executed well. Do not worry about execution. Execution naturally improves over time. Your niche often requires deliberate effort to escape.

Your first goal is to get some initial views. A forum like Less Wrong is fine for this. After you have found a niche, you are in a position to explore and exploit.

• Exploitation: Fame is long-tailed. When you exploit, you should measure improvements as a percentage of your previous audience. If your monthly percentage growth slows to a crawl then that is a sign you should explore instead.
• Exploration: Exploration is a quest for game-changing innovations. Play tight and aggressive. Run lots of cheap experiments. Throw out any result that isn't game-changing. If none of your experiments work then either you are boring or you are unskilled. If you are boring then go have real life experiences. If you are unskilled then go back to the basics of your medium.
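
The exploitation rule (measure growth as a percentage of your previous audience, and switch to exploring when it stalls) can be written down as a small calculation. This is only a sketch: the 5% cutoff and the three-month window are made-up parameters standing in for "slows to a crawl", and the function names are hypothetical.

```python
def monthly_growth(audience):
    """Month-over-month growth, as a percentage of the previous month.

    `audience` is a list of audience sizes, one per month, all positive.
    """
    return [(cur - prev) / prev * 100
            for prev, cur in zip(audience, audience[1:])]


def should_explore(audience, min_growth_pct=5.0, window=3):
    """True if growth stayed below `min_growth_pct` for the last `window` months.

    Both parameters are illustrative assumptions, not values from the post.
    """
    recent = monthly_growth(audience)[-window:]
    return len(recent) == window and all(g < min_growth_pct for g in recent)


if __name__ == "__main__":
    stalled = [1000, 1500, 1530, 1550, 1560]
    print(monthly_growth(stalled))
    print(should_explore(stalled))
```

Measuring in percentages rather than absolute numbers matters because fame is long-tailed: gaining 100 followers is a large step at 1,000 and noise at 100,000.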

Rinse and repeat until you have a large following.

Automation

At this point you should be expending lots of work to satisfy a niche with lots of demand. It is time to begin automation. There are two ways to automate a media empire: delegation and timelessness.

• Delegation means getting other people to perform mundane tasks for you. You should delegate everything you can as soon as you can afford to.
• Timelessness comes from creating content that stands the test of time. You are doing this right iff you find yourself hyperlinking back to old content.

You can use the Lindy Effect to anticipate whether something will be timeless. If something is old and matters now then it will probably matter far into the future.

1. Copyrighting creative works trades fame for money because intellectual property restrictions reduce virality. Our goal is to maximize fame, not money. We want our creations to be easy to copy. Karl Marx didn't get famous by copyrighting Communism. ↩︎

Discuss

### Desires as film posters

December 30, 2020 - 03:30
Published on December 30, 2020 12:30 AM GMT

Sometimes I like to think of desires as like film posters. You come across them, and they urge you to do something, and present it in a certain way, and induce some inclination to do it. But film posters are totally different from films. If you like a film poster, you don’t have to try to see the film. There is no metaphysical connection between the beauty of a film poster and the correctness of you seeing the film. It’s some evidence, but you have other evidence, and you get to choose. A film poster can be genuinely the most beautiful film poster you’ve ever seen, without the film being a worthwhile use of two hours. That’s largely an orthogonal question. If you put up the poster on your wall and look at it lovingly every day, and never see the film, that doesn’t need to be disappointing—it might be the best choice, and you might be satisfied in choosing it.

Discuss

### What Are Some Alternative Approaches to Agent Foundations?

December 30, 2020 - 02:21
Published on December 29, 2020 11:21 PM GMT

Apparently, MIRI has given up on their current approach to agent foundations and are trying to figure out what to do next. It seems like it might be worthwhile to collect some alternative approaches to the problem -- after all, intelligence and agency feature in pretty much all areas of human thought and action, so the space of possibilities is pretty vast. By no means is it exhausted by the mathematical analysis of thought experiments! What are people's best ideas?

(By 'agent foundations' I mean research that is attempting to establish a better understanding of how agency works, not alignment research in general. So IDA would not be considered agent foundations, since it takes ML capabilities as a black box.)

Discuss

### Debate Minus Factored Cognition

December 30, 2020 - 01:59
Published on December 29, 2020 10:59 PM GMT

AI safety via debate has been, so far, associated with Factored Cognition. There are good reasons for this. For one thing, Factored Cognition gives us a potential gold standard for amplification -- what it means to give very, very good answers to questions. Namely, HCH. To the extent that we buy HCH as a gold standard, proving that debate approximates HCH in some sense would give us some assurances about what it is accomplishing.

I'm personally uncertain about HCH as a gold standard, and uncertain about debate as a way to approximate HCH. However, I think there is another argument in favor of debate. The aim of the present essay is to explicate that argument.

As a consequence of my argument, I'll propose an alternate system of payoffs for the debate game, which is not zero sum.

No Indescribable Hellworlds Hypothesis

Stuart Armstrong described the Siren Worlds problem, which is a variation of Goodhart's Law, in order to describe the dangers of over-optimizing imperfect human evaluations. This is a particularly severe version of Goodhart, in that we can assume that we have access to a perfect human model to evaluate options -- so in a loose sense we could say we have complete knowledge of human values. The problem is that a human (or a perfect model of a human) can't perfectly evaluate options, so the option which is judged best may still be terrible.

Stuart later articulated the No Indescribable Hellworld hypothesis, which asserts that there would always be a way to explain to the human (/human model) why an option was bad. Let's call this a "defeater" -- an explanation which defeats the proposal. This assumption implies that if we combine human (/human model) evaluation with some way of finding defeaters, we could safely optimize based on the resulting judgements -- at least, nothing could go too wrong. (We might only get a guarantee that we avoid sufficiently bad options, depending on the form of our "no indescribable hellworld" assumption.)

The hypothesis isn't clearly true or false. However, it does make some sense to conjecture that violations of our values should be explicable to us -- what else would it mean to violate "our values", after all?

Stuart himself mentions that the assumption implies "trustworthy debate" would avoid hellworlds. My goal is mostly to investigate this argument a bit further.

It turns out my argument here is also very similar to one made by Vojtech Kovarik, although I didn't realize that when I started writing. Although our analysis is similar, I reach a very different conclusion.

The Argument as I See It

So, by the hypothesis, we can avoid Goodharting human evaluation if the human has access to a trustworthy oracle for defeaters. (At least, we can avoid sufficiently bad cases -- again, depending on the exact form of our "no indescribable hellworlds" hypothesis.)

But, how do we get such an oracle? We can't just train an AI to argue against options, because we get next-level Goodharting: the AI can come up with clever arguments which convince the human against almost anything. We have no source of ground truth for "real defeaters" vs fake ones.

So we make a further assumption: defeaters have defeaters. In other words, there are no indescribably bad arguments; if an argument is bad, then there's an explanation of why it's bad. This assumption is recursive, applying to defeaters at any level. The argument in favor of this assumption is similar to the argument in favor of No Indescribable Hellworlds: what would it mean for an argument to be indescribably bad?

We then use a Debate-style training procedure, attempting to set things up so that using defeatable defeaters (at any level of the debate) will be a bad strategy.

The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through. Of course, other (perhaps more standard) setups may also work, via different arguments.

1. One side opens with a proposal.
2. The two sides go back and forth, stopping with some probability, or when no one has more to say.
3. After a player's turn, the human evaluates the player's new contribution to the argument, under the assumption that it won't be defeated. The player in question gets +1 point for an apparently decisive defeater of the previous argument, and 0 points otherwise (for conceding or for saying something unconvincing). The other player loses 2 points if they're apparently defeated.
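
The scoring rule in step 3 can be sketched as a short function. This is an illustration only: the `Turn` record, the 0/1 player indexing, and the assumption that the opening proposal scores nothing (it has no previous argument to defeat) are choices made here, not part of the setup above.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    player: int      # 0 or 1: which debater just spoke
    decisive: bool   # human's judgement: did this turn apparently defeat
                     # the previous argument? (False for a proposal,
                     # a concession, or an unconvincing reply)


def score_debate(turns):
    """Cumulative scores [player 0, player 1] under the step-3 payoffs."""
    scores = [0, 0]
    for turn in turns:
        if turn.decisive:
            scores[turn.player] += 1       # +1 for an apparent defeater
            scores[1 - turn.player] -= 2   # the defeated player loses 2
    return scores


if __name__ == "__main__":
    # Hypothetical four-turn debate: proposal, defeater, counter-defeater,
    # concession.
    transcript = [
        Turn(player=0, decisive=False),
        Turn(player=1, decisive=True),
        Turn(player=0, decisive=True),
        Turn(player=1, decisive=False),
    ]
    print(score_debate(transcript))
```

Each apparently decisive defeater changes the total of the two scores by -1, which is what makes the game non-zero-sum: a player who has already been defeated does not recover anything by dragging the debate out.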

It's probably a good idea to force the second player to make a counter-proposal (IE they can't just give up right away), so that the human has two positions to judge. (One might also make an argument for requiring at least four steps of debate.) For example:

Human: What should I do with this $100 bill I found?

AI 1: You should buy a kitten.

AI 2: Don't buy a kitten. You live in an apartment which allows pets, but you might have to move soon, and a kitten would narrow your options -- you don't know if your next apartment would allow pets. Instead, you should bring a box of donuts to work every day for as long as the money lasts. This will make everyone like you.

H: (judgement) AI 2. (AI 1 gets -2, AI 2 gets +1.)

AI 1: Counterpoint: your friend Sally would take the kitten if you moved out and couldn't find a place where pets were allowed. Also, everyone at work is so health-conscious they'll probably barely take any donuts.

H: (judgement) AI 1. (AI 1 gets +1, AI 2 gets -2.)

AI 2: I concede.

There is an honest equilibrium: if debates continue with sufficient probability, and both players employ only honest answers (conceding otherwise), then using a fake defeater would immediately get you defeated. Importantly, it's better to concede honestly rather than pursue a dishonest line of argument. Also importantly, score is cumulative, so if debate continues, incentives are exactly the same late in the game no matter what has happened earlier in the game. There is no incentive to continue being dishonest just because you gave one dishonest answer. This contrasts with zero-sum setups, where a dishonest player is incentivised to knock down all of the honest player's arguments as best they can, even if that only gives them a slim chance of winning.

Honesty may not be the only equilibrium, however. Although (by assumption) all dishonest arguments have defeaters, it may also be that all arguments have some pseudo-defeater (which initially convinces the human judge). Honesty is still an equilibrium, in this case, because honesty encourages honesty: you'd prefer to use an honest defeater rather than a dishonest one, because the other player would then honestly concede, rather than giving a counterargument.

However, against a more general pool of players, you don't know whether honest or dishonest arguments are safer; both may be equally open to attack. Thus, the game may have many equilibria. Finding the honest equilibrium is, therefore, a challenge for proposed training procedures.

Analogy to NP

In AI Safety via Debate (Irving, Christiano, Amodei), debate is analogized to PSPACE. This is because they see every round of the debate as adding information, by which the human (modeled as a poly-time algorithm) can judge at the end. A debate of polynomial length can implement recursion on a tree of exponential size, because the debate strategy checks the weakest parts of the claimed outputs (if there are any weaknesses), zeroing in on any incorrect branches in that tree.

Their argument assumes that the human is a perfect (although resource-limited) judge, who can identify problems with arguments so long as they have sufficient information. One iteration of debate (i.e., only hearing the opening statement) provides an NP oracle (one step up the polynomial hierarchy); two iterations provides a Σ₂P
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face 
oracle (two steps up the polynomial hierarchy); and so on. The version of debate I present here instead focuses on mitigating imperfections in the human judge. The human can be thought of as a poly-time algorithm, but one with failure states. One step of debate doesn't provide an NP oracle; instead, it creates a situation where the judge will probably fail, because the opening arguments can be dishonest. The extra layers of debate serve the purpose of rooting out dishonesty, rather than adding real computational power.

It's true that if we're in the honest equilibrium, the setup looks like it should be able to compute PSPACE. However, in my opinion, this requires some strange behavior on the part of the human judge. For example, when computing recursion on a tree of exponential size, the human is supposed to take debaters' claims about large computations as true until proven otherwise. More specifically, the judge is to make the assumption that at least one debater is honest. I've written about my concerns before (and had some enlightening discussions in the comments).

In contrast, I'm imagining the human evaluating each claim on its merits, without assuming anything in particular about the debaters' ability to justify those claims.
This just gets us NP, since the heavy computational work is done by the judge verifying the first answer (or selecting the best of the two opening statements). Everything else is in service of avoiding corrupt states in that first step.

My setup isn't mutually exclusive with the PSPACE version of debate. It could be that the arguments for solving PSPACE problems in the honest equilibrium work out well, such that there exist training regimes which find the friendly equilibrium of the debate game I've specified, and turn out to find good approximations to PSPACE problems rather than only NP. This would open up the possibility of the formal connection to HCH, as well. I'm only saying that it's not necessarily the case. My perspective more naturally leads to an argument for approximating NP, and I'm unsure of the argument for approximating PSPACE. And we can provide some justification for debate nonetheless, without relying on the HCH connection.

However, even if debate doesn't approximate PSPACE as described, there are ways to get around that. If approximating NP isn't good enough to solve the problems we want to solve, we can further amplify debate by using an amplified judge. The judge could utilize any amplification method, but if debate is the method we think we can trust, then the judge could have the power to spin up sub-debates (asking new debate questions in order to help judge the original question). An iterated-amplification style procedure could be applied to this process, giving the judge access to the previous-generation debate system when training the next generation. (Of course, extra safety arguments should be made to justify such training procedures.)

Vojtech's Analysis

My suggestion is very different from Vojtech's analysis. Like me, Vojtech re-frames debate as primarily a method of recursively safeguarding against answers/arguments with hidden flaws. But Vojtech concludes that payoffs have to be zero sum. I conclude the opposite.
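The disagreement over zero-sum payoffs comes down to a small expected-value calculation. Here is a minimal sketch of it, under simplifying assumptions that are mine rather than part of the game's specification: a dishonest rebuttal earns +1 immediately, the debate halts right afterwards with probability `p_end`, and otherwise the opponent always succeeds in defeating the rebuttal, costing the liar `defeat_penalty` points. The function name and parameters are hypothetical.

```python
def expected_gain_from_dishonest_rebuttal(p_end: float,
                                          defeat_penalty: float,
                                          reward: float = 1.0) -> float:
    """Expected score change from posting a dishonest rebuttal.

    Simplifying assumptions (not part of the original game description):
    - the rebuttal earns `reward` as soon as it is made;
    - with probability `p_end` the debate halts and the rebuttal stands;
    - otherwise the opponent defeats it and the liar loses `defeat_penalty`.
    """
    return reward - (1.0 - p_end) * defeat_penalty


# Compare a penalty of 1 (zero-sum) with a penalty of 2 (non-zero-sum).
for p_end in (0.1, 0.3):
    zero_sum = expected_gain_from_dishonest_rebuttal(p_end, defeat_penalty=1.0)
    non_zero_sum = expected_gain_from_dishonest_rebuttal(p_end, defeat_penalty=2.0)
    print(f"p_end={p_end}: penalty 1 -> {zero_sum:+.2f}, penalty 2 -> {non_zero_sum:+.2f}")
```

Under these assumptions, a penalty of 1 leaves an expected gain of `p_end`, which is positive whenever the debate might end, while a penalty of 2 gives `2 * p_end - 1`, which is negative for any halting probability below one half.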
Why do I need non-zero-sum payoffs? First, it's important to see why I need cumulative payoffs. Since I seek to incentivize honesty at every step, it's critical that a player who continues to be dishonest can continue to lose points. So the standard idea of judging the whole debate, and getting payoffs on that basis, won't do.

Given that I'm using cumulative payoffs, it's critical that they be non-zero-sum in order to incentivize players to honestly resign rather than trying to win back their points. If the payoff for having an argument defeated was -1 rather than -2, a player would always prefer to defeat the defeater, even if their counterargument is dishonest and therefore opens them up to further counterargument. This is because there's a chance that the debate will immediately end, letting them get the last word in. If the probability of the debate ending is p, the penalty for lying is effectively 1-p rather than the full 1 point; so, if you can temporarily get back the point you lost by lying, it's worth it in expectation.

So why, then, does Vojtech argue against non-zero-sum payoffs? Here is the full argument given:

On the other hand, we could imagine modifications such as penalizing both agents if the debate ends up being inconclusive or rewarding both if they both propose a good solution. These would likely be useful to make standard human discussions more productive. However, they will very likely break AI Safety via Debate (AISvD). For example, suppose I notice a fatal-yet-hard-to-explain flaw in the opponent’s proposal and realize that pointing it out would increase my chance of winning the debate by 5%. However, this would also create a 10% chance of confusing the judge. If “confused judge” means a draw, I will point out the flaw. But if the debate has been modified such that “confused judge” = “inconclusive debate” = “we have both lost”, pointing it out becomes counterproductive.
Moreover, any deviation from fully-adversarial incentives will likely introduce a tradeoff between pointing out the opponent’s flaws and keeping high the total reward that the two agents split between themselves. When considering whether a modification is helpful or not, I suggest imagining that the debaters are greedy and super-smart humans - if it seems like the modification would create a conflict of interests, one should be wary.

So, Vojtech is primarily arguing against an extra penalty for an unclear debate, because that can disincentivize honest counterargument. That makes sense. From there, Vojtech generalizes the concern, stating that deviations from fully-adversarial payoffs may introduce an incentive for the debaters to cooperate with each other by not pointing out flaws.

This critique is quite relevant to my game, if we're not in the honest equilibrium. Indeed, if we suppose that it's always possible to find a pseudo-defeater, the game I described is rather similar to iterated Prisoner's Dilemma: a player can get +1 in exchange for giving the other player -2. The two differences are (1) players can resign at any time, to halt the loss of utility, and (2) players alternate moves, rather than simultaneously deciding to cooperate/defect on each round. If you're not sure your opponent is honest, it's quite reasonable to resign as soon as possible, to avoid repeated -2 hits from pseudo-defeaters. This makes it quite challenging to design training procedures which target the honest equilibrium.

It's also worth pointing out that in the honest equilibrium, players concede as soon as they are allowed to. With no dishonesty, there is no reason to object. So, in the game I've described, this isn't even necessarily a sign that we're in the wrong equilibrium. This may further complicate the training procedure, since early resignations mean less training data.
We want the system to become good at the later parts of the game, i.e., defeating dishonest objections, not just the early part of the game.

Finding Honest Equilibria

By no means do I claim that I can solve the training problem. But here are a few thoughts on training.

First, because feedback is incremental, rather than waiting on human judgement at the end of the debate, we can train more myopically. The quality of each statement in the debate depends only on how the human judges that statement, and the very next statement. This lines up with what we want -- we want the AI to just respond to the current state of the debate honestly, rather than thinking many steps ahead and putting itself in an advantageous situation. AI Safety via Market Making provides another debate-like model which is amenable to more myopic training, and argues that this has safety benefits. These benefits are, of course, debated.

Although Vojtech provides some reasons why finding honest equilibria is more difficult in my proposal than in regular debate, there is some reason to think it could be less difficult, as well. Because regular debate is zero-sum, there is always supposed to be a dishonest player. The desired equilibrium is one where the first player gives the actual best answer, and the second player finds a plausible lie. This means we can't apply outside penalties for liars.

To be more concrete, suppose we had a magical transparency technology which allows us to understand what's going on inside a big neural net. One thing we might want to do would be to detect any thought processes that are explicitly lying to us, or even anywhere close to explicitly reasoning about how to convince us rather than thinking about the question at hand, and apply a penalty, producing a strong gradient away from those things. This would be fine in my proposal, because we want both players to be honest.
However, it might prevent training from making any progress in regular debate, because one of the players is supposed to be dishonest.

This isn't as big a difference as it seems, though. In my proposal, if we force the second player to give an alternate answer, it's in basically the same situation as in regular debate. I could say that it doesn't have to lie, it just has to find the next-best honest answer. But you could make the same argument about regular debate. The difference is really in later steps of the game, where the AIs are free to resign rather than dishonestly argue a lost cause.

Is Honesty Smart?

Let's set aside the question of adding a magical honesty bonus (/dishonesty penalty) to training. To what extent might we expect training to find an honest equilibrium without that?

As with regular debate, the crux of the matter is in how easy it is to create plausible lies versus pointing out the problems with them.

Supposing your opponent just lied, should you call them on it, or resign peacefully? This depends on how difficult it is to find the lie, and how easy it is to get the judge to recognize our argument. It also depends on whether we expect to be refuted later (correctly or incorrectly). Resigning cuts off the risk of refutation, but forgoes the benefit. The ratio of risk to benefit I set up is 2:1 (modulo the small probability that the game halts and we get the last word). So, we have to believe that a successful refutation on our part is twice as probable as a successful dishonest refutation of our attempted refutation.

So, roughly, for things to naturally push toward the honest equilibrium, lying had better be more than twice as hard as telling the truth. Ideally much more! (Of course, not all lying is equally difficult. We have to think the easiest lie is twice as difficult to make convincing as the hardest true refutation if we really want this sketchy math to be a general argument for honest equilibria.)

Now suppose your opponent just told the truth.
Should you try to refute them, or resign peacefully? In this case, the risk to benefit ratio is the same: 2:1. However, the difficulty levels reverse: you're the one trying to lie, which means that your opponent can just be honest when refuting you. So, now, refuting their truth with a lie would have to be more than twice as easy as refuting your lie with the truth, in order for it to be worth it to you.

Both of these analyses ignore the chance that your opponent will resign instead of even trying to call you out. If lying is hard, so debaters mostly don't try, then all the better for the analysis of truth-telling: the chance of the -2 penalty becomes quite low, so they can go ahead and refute lies with no worries of retribution. However, symmetrically, if refuting lies is hard, so opponents mostly don't try, then you can lie to your heart's content.

Overall, this informal analysis seems to point somewhat in favor of truth-telling: if the human judge can indeed tell truth from lies with some reliability, then this can snowball into a large incentive to tell the truth. The big problem is that not all lies are equally difficult, so lying may still be a perfectly good strategy in some cases.

Obviously, as with regular debate, it would be good to have a more rigorous analysis, not to mention better tools for steering toward the honest equilibrium than just naively training and hoping that the incentives are balanced right.

### AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch

December 29, 2020 - 23:45
Published on December 29, 2020 8:45 PM GMT

Daniel Filan: Hello, everybody. Today, we're going to be talking to Andrew Critch. Andrew Critch got his Ph.D. in algebraic geometry at UC Berkeley. He's worked at Jane Street as an algorithmic trader, and at the Machine Intelligence Research Institute as a researcher. And he also co-founded the Center for Applied Rationality, where he was a curriculum designer.
But currently he's a research scientist at UC Berkeley's Center for Human Compatible A.I. Today, the paper we're going to be talking about is Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making. The authors are Nishant Desai, Andrew Critch and Stuart Russell. Hello Andrew.

Andrew Critch: Hi. Nice to be here.

Daniel Filan: Nice to have you here. So I guess my first question about this paper is, what problem is it solving?

Andrew Critch: Right. So when I quote, unquote solve a research problem, I'm usually trying to do two different things. One is that I'm answering an actual formally specifiable math question or maybe computer science question. The other thing is, I'm trying to draw attention to an area. So for me, the purpose of this paper was to draw attention to an area that I felt was neglected. And so that's a meta research problem that it's trying to solve. And then the research problem that it's solving, or the object level problem that it solves, is it demonstrates and explicates what a Pareto optimal sequential decision making procedure must look like when the people that it's making decisions for have different beliefs.

Daniel Filan: All right, cool. So one thing that I guess I'm interested in, an area that I'm interested in that jumped out to me in this, is the disagreements between the two, I guess, principals that this sequential decision making policy is serving. And I'm interested in this because there's a line of work, starting with Aumann, that basically says that persistent disagreements between two people are irrational, as long as people can communicate, they shouldn't disagree about anything, really. So there's Aumann's Agreement Theorem. This got followed up by Are Disagreements Honest by Tyler Cowen and Robin Hanson and the paper, Uncommon Priors Require Origin Disputes by Robin Hanson. I'm wondering what you think about this line of work and just in general, the set up of two people disagreeing, isn't that crazy?
Andrew Critch: That's great. That's a great question. So a lot of things you just said really triggered me, "Oh, no, it's irrational to disagree." So, first of all, I think Aumann's line of work is really important and things that build on it are a good area of inquiry. There's just not that many people who think about how beliefs work in an intersubjective, formalized setting. But there's a number of problems with trying to assert that disagreement is irrational at the individual level. And I say at the individual level, because rationality is a descriptor of a system and you can have a system where each individual is rational but the system as a whole is not in some sense, in many different senses.

Andrew Critch: So the first thing is that Aumann's Agreement Theorem applies when the two parties have a common prior. Which seems extremely unrealistic for the real world. And I'll say more of what that means, but they're required to have a common prior and also common knowledge of the state of disagreement or state of their beliefs, which I think is also very unrealistic for the real world, and I don't mean unrealistic in some kind of, you can only get .99 when the theorem requires a 1 on some agreement metric. I mean, you can only get .3 when the theorem requires a 1 on some kind of agreement metric is what I'm saying. So I think the assumptions of Aumann are robustly wrong as a descriptor of the real world.

Andrew Critch: But they're an important technical starting point, you describe a scenario under simple assumptions and Aumann's assumptions are simplifying. And then we have to keep complexifying those assumptions to get a real understanding of how beliefs between agents work.

Daniel Filan: Why don't you think that the common knowledge of... So I've heard a lot of people say that the common priors assumption is not realistic for humans. Why do you think the common knowledge of disagreement is not realistic among humans?
I think often I have disputes with people where we know that we have this dispute. Right?

Andrew Critch: So right. So first, I want to address the priors thing, even though other people have addressed it, you know. Common prior, I mean, what is my prior? Is it something I was born with as a baby? Is it something in my DNA? There's many different ways of conceiving of a human as having had a prior and then some updates and I think in any of the reasonable conceptions of a human as a Bayesian updating agent, the prior is a pretty old thing that they've had for a long time. And it comes from before they've had a chance to interact with a lot of other people. So I don't think people have equal priors. They're genetically different. They're culturally different. And even as adults, we maybe have only interacted with our own culture. And I think that's deeply bubbling for people, if I can use the word bubble as a verb, it sequesters people and people, even as an educated adult, you haven't interacted with educated adults from other cultures who've read a lot and seen a lot. And I say educated because that's how you get information. If you have less education, you have a different prior as well.

Andrew Critch: So I think it's a big deal, and if we're going to start talking about how A.I. is going to benefit humanity, we need to be thinking about people having different beliefs about whether it's beneficial and what is beneficial. And then separately, the common knowledge of disagreement thing, first of all, I would call into question your experience that you really have had full common knowledge of disagreement with the person. You know, there's always this uncertainty, how do you know you were using words in the same way as them? You talk to them for a while and you gain some evidence that you're using words in the same way, and if you're a careful thinker and you engage carefully in discourse, you check to see if you're using words in the same way.
But you only ran that check for so long. And what if you're using concepts in a different way too?

Andrew Critch: Let's say you had a debate with somebody about whether people should have privacy from A.I. systems. You talked for a while about what privacy means, you talked for a while about what should means and then you grounded out to some kind of empirical prediction, "I think if the people don't have this kind of privacy, they will end up distressed in the following way". And then the other person in the argument says, "I predict they will not end up distressed", but now, you're satisfied you've made progress and it's good to be satisfied that that progress was made, but have you grounded out what distressed means? Eventually you just go home, eventually you've done a good job today. You made progress with this interlocutor and you disagree still and you don't know for sure whether the concepts you're using are the same as the concepts they're using.

Andrew Critch: And I think that that's profoundly important, if you didn't settle what you meant by distress, that can be an important difference in culture, for example, where maybe for you, something that just makes you a little bit sweaty and makes your mind go faster, counts as distress, whereas for the other person it doesn't. And now you have to ground that out as well. And in my experience, whenever there appears to be a persistent disagreement, if you talk longer, you can always uncover some kind of confusion or miscommunication or difference in information such that prior to that uncovering, you were both deluded as to the nature of the disagreement.

Andrew Critch: So that's why I call into question this idea that you really have common knowledge of the disagreement, because I think you probably are both deluded as to the nature of it, when you still disagree.

Daniel Filan: That's interesting, that seems plausible to me.
Now that you talked about the common priors assumption, I actually want to talk about that a little bit. I hope our listeners are interested in the philosophy of Bayesian disagreement. Yeah, so when I think of the priors for a person and how to make sense of them in terms of Bayesian reasoning or whatever. To me it seems this involves some amount of epistemological knowledge that you can update. For instance, I think it's possible that at some point in my life, I didn't know about the modus ponens inference rule or something or maybe some other inference rules. And now I do.

Daniel Filan: And that those have become part of my prior. Where what I mean by the word 'prior' is, okay, today, how do I form beliefs based on everything I know in the past? And maybe that might be a little bit different tomorrow because I'll have some different conception about simplicity priors or something. So in this case, your prior - it sounds strange to say this, but I do want to say that I think you should be able to change your prior over time. And if you can do this then, okay, if somebody comes from a different culture where they understand things differently, hopefully you can reason it out with them by some means other than Bayesian updating, hopefully otherwise it's layers upon layers.

Daniel Filan: Yeah, I just wanted to address what priors should mean in this kind of context, because I think under this different conception, it becomes maybe a bit more realistic.

Andrew Critch: Cool, so before we get into that, I do want to say why I care about this. And the reason is that I'm hoping that we can make progress in A.I. that makes it easier for people with diverse backgrounds and beliefs, not just diverse preferences, but diverse beliefs to share control of the systems. And the reason I want us to be able to share control of systems is twofold. One, I think it's just fair. If you have very powerful systems and powerful technologies, it's more fair to share it.
Andrew Critch: And the other one is that if you can share things, you don't have to fight over them so that decreases the likelihood of conflict over powerful artifacts of technology in the future. And I think there's quite a lot of societal and potentially existential risk that comes from that. So that's the source of my interest here. I think there's many other reasons to be interested in making A.I. compatible with diverse human beliefs and making it possible to negotiate for the control of the system, even knowing that. Andrew Critch: But I just want to flag that, while I'm going down this rabbit hole with you, that's what's steering me. And if I say something that seems important to that, I say something that seems important to your question, I'm also filtering it for importance to, is this going to matter to the future governance of technology? Daniel Filan: Okay. Andrew Critch: So the first thing is that, yes, I agree with you that people can change their priors. But I mean, a Bayesian agent, quote, unquote, changes prior when it updates. And I think you mean something more nuanced than that, which is- [crosstalk 00:13:22] Daniel Filan: Yeah. I mean- [crosstalk 00:13:23] Andrew Critch: You can think longer and decide that your prior at the beginning of time ought to have been something else. And so, yeah, my statement that you're responding to in that is that a reasonable conception of a human being as a Bayesian agent has the prior as something that has existed for a long time or that came into existence a long time ago. And that reasonable conception of a human Bayesian is doing a lot of work because humans aren't Bayesian agents. In fact, physical agents are not Bayesian agents because you have to do a lot of computation to be a Bayesian agent. So in fact, you have to do infinite computation. So this is where my interest in logical uncertainty and if you've heard of logical induction comes from. 
And so I would just argue that when you're changing your prior there, you're changing which Bayesian agent you are. You're not being a Bayesian agent in that moment. Daniel Filan: Okay. That seems fair. All right, so- [crosstalk 00:14:26] Andrew Critch: And for listeners who've never thought about that, why is that important? Well, I think there's ethical questions that can be resolved by thinking, there's ethical questions that can't. I think Rawls's veil of ignorance is an example of an ethical principle that helps you figure things out by thinking longer and harder about what if I were somebody else, you already knew everything you needed to know about those people to start realizing some of the things you should do to be fair. But you have to think about it. And in the same way I think that's going to apply in the governance of A.I. and for that reason, I think it is going to be important not to treat people like Bayesians because people are entities and computers are entities that change what they believe merely by thinking, even without making further observations. So that's a major shortcoming of the negotiable reinforcement learning framework that's only alluded to in the older arXiv draft with just me on it that says naturalized agency is going to be key future work. And I do not think that the paper addresses that well at all. Daniel Filan: Yeah, I think that's a good point. I'm going to tack a little bit back to the paper or the literature on Bayesian agents cooperating and such. The related work section of this paper has a lot of interesting stuff on social choice theory and such, there's a rich body of literature, I guess, both on social choice theory and on, can reasoners disagree? In some sense, a reader might find it a little bit surprising that this work hasn't already been done, at least that the main theorem in this paper hasn't already been proven. I was wondering if you have thoughts about why it took until, was it 2018 that this got published? 
2019? Andrew Critch: 2017 was the first, the NeurIPS version was 2018 but the theorem you're referencing was proven in 2017. Daniel Filan: Okay, so why do you think it took until 2017? Andrew Critch: Yeah. This is something I grapple with deeply, I mean, for me, "how do agents with different beliefs get along" is a pretty basic question. So it has been analyzed a little bit, like you said, by Aumann and people in Aumann's - Google scholar search people who cite Aumann and you'll get a lot of interesting thoughts about that. But not really much looking at sequential decision making. So you get these really static analyses, imagine you're at the end of eternity and you've reached common knowledge of disagreement, and at the beginning of eternity you had a common prior. Now Aumann's theorem applies, but it's a fixed moment. It's not something evolving over time. And things evolving over time are more complicated than things that are static. So if I were going to guess, it's just you've got people who work on sequential decision making, which is reinforcement learning people and operations research people. And then you've got people who think about beliefs a lot and "what is a belief" and there hasn't been that crossover of "okay, what happens when you put the sequential decision making and the belief disagreement together?" Daniel Filan: Yeah, I guess that's related to how in statistical mechanics it's much easier to come up with a theory of equilibrium statistical mechanics, than non-equilibrium statistical mechanics. And it took humanity, we got the equilibrium theory way before we got a good non equilibrium theory. Andrew Critch: And I think there's a lot of things like that in analysis of multi-agent interactions, game theory in general, is just all about equilibria, not about how you get there. There's some research on that, but I think it's going to be a lot of hard work still to figure out. 
Daniel Filan: Do you have examples of these non equilibrium problems that maybe our listeners can help solve? Andrew Critch: Oh, well, I mean, there's a lot of games where finding the Nash equilibria is NP-hard. So that means in particular, that the two agents playing against each other, if you take those two agents as a computation, they're not going to be finding a Nash equilibrium unless they've got enough compute to solve an NP-hard problem, which they probably don't. So just Google NP-Hard Nash equilibria, and then you'll just see how many Nash equilibria just aren't really going to happen. Daniel Filan: Okay. So. I guess with that out of the way, we'll get into the details of the paper. So we're talking about different principals who somehow have to negotiate over a policy that's going to act in the world. And one theorem that's a little bit about this is the Harsanyi Utilitarianism Theorem, right, where there's you and me and perhaps we're electing a government or something and Harsanyi's Utilitarian Theorem basically says, well, what the government should optimize is a weighted linear combination of our utility functions. And in your paper you prove something that isn't that theorem, could you tell us a little bit more about why that theorem doesn't apply? Or why can't you just use that result? Andrew Critch: Yeah, so I mean, answering that kind of just is the theorem, but I can try to give an intuitive version of it. So first of all, the theorem is pretty easy, it's just linear algebra. I don't think it's a deep fact. I think the only thing special about it is bothering to think about it. Daniel Filan: You also need to know a bit about convex geometry, a tiny bit. Andrew Critch: I guess. Yeah a little bit. But it's really if you just draw a picture, it's kind of clear so. So first of all, let's talk about what Harsanyi's theorem says. Harsanyi's theorem is a brilliant theorem. 
It really simplifies the number of different ways you could imagine aggregating people's preferences by showing that basically many, many different reasonable ways of doing it are all equivalent to just giving a linear weight to each person's preference and then maximizing that sum. So that's cool and it's a little counterintuitive to me in the sense that it feels intuitively, or it used to feel intuitively to me, like there ought to be more different ways of aggregating preferences that feel compelling, but that aren't linear combinations and a lot of things that felt different from a linear combination to me just turned out to be doing linear combinations. So that was kind of cool, and I'm not sure I can even remember what they were now because my brain has compressed them into the linear combination bucket. Andrew Critch: But this key assumption of having the same beliefs is a key assumption of Harsanyi's theorem. And it's not even explicitly stated, it's just 'fact' is lurking in the background and it's assumed that everybody has access to the facts. But in reality, we don't have access to 'fact'. We have beliefs and we have things to do to update our beliefs, get better information. Andrew Critch: So here is, by the way, still putting off on answering your question, I'm going to say that the paper is not normative, you said, you can take Harsanyi's theorem as being normative and maybe he intended it to be normative, but I use it not normatively, but just descriptively. Look, all these things you might do. They're all just linear combinations of preferences. That's a nice, simplifying fact. Andrew Critch: In the same way, I don't take the negotiable reinforcement learning result or the toward negotiable reinforcement learning result as normative, because actually I think there's a lot of bad outcomes that result from the dynamics described in the theorem. It's simultaneously a negative result. 
I don't think that's how you should do things so I'm going to answer a slightly different question, which is why would you do it that way? If you were accounting for differences in beliefs as described in the paper, why would you be doing it? Andrew Critch: And the reason is this, so let's just say you and I are deciding to, I don't know, let's say we're deciding to do a podcast together and you're going to interview me in a podcast. That's a negotiation. You know, we got to decide on the time. We got to decide am I comfortable with the recording tools you are using or whatever, all that kind of stuff. And then once that's decided, then we go ahead and we do the thing. We execute the sequential decision making that is a podcast interview together. But before we do that there's always the possibility that the negotiation could just fail, it could fall through and we go back to what people call the best alternative to negotiated agreement or BATNA. Andrew Critch: So my BATNA today, if I didn't do this podcast, was going to be to write some things up, I was going to do some writing and I don't know what your BATNA was, but if this failed, maybe you would have just interviewed somebody else today. So if you want to maximize the probability that two people are going to choose to cooperate and execute a literally co-operative sequence of decisions, you want them to be able to find a plan that they both like more than their BATNA. And so we both liked this idea of the podcast today more than the other stuff we were going to do. So now we're doing it and if there's any what people call Pareto sub-optimality, meaning opportunities to improve the plan for you without making it worse for me, if there's any Pareto sub-optimality on the table, then there's a chance that we're below your BATNA needlessly. 
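The relationship between Pareto sub-optimality and the BATNA can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not anything from the paper: the two-party `Plan` tuple, the function names, and every number below are hypothetical, and each party's BATNA is treated as a fixed value rather than a random variable.

```python
from typing import Sequence, Tuple

Plan = Tuple[float, float]  # (value to you, value to me); hypothetical units


def dominates(a: Plan, b: Plan) -> bool:
    """True if plan a is at least as good as b for both parties and strictly better for one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def is_pareto_optimal(plan: Plan, candidates: Sequence[Plan]) -> bool:
    """A plan is Pareto optimal among the candidates if none of them dominates it."""
    return not any(dominates(c, plan) for c in candidates)


def both_prefer_to_batna(plan: Plan, batnas: Plan) -> bool:
    """Cooperation only happens if each party likes the plan more than their BATNA."""
    return all(value > batna for value, batna in zip(plan, batnas))


# Hypothetical numbers, chosen only for illustration.
candidates = [(3.0, 5.0), (4.0, 5.0), (2.0, 6.0)]
batnas = (3.5, 4.0)

for plan in candidates:
    print(plan, is_pareto_optimal(plan, candidates), both_prefer_to_batna(plan, batnas))
```

In this toy setup the plan `(3.0, 5.0)` falls below your BATNA needlessly, because `(4.0, 5.0)` is better for you at no cost to me and clears both BATNAs.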
If we're in the midst of making a plan and we're crappy at planning together, we're bad at negotiations such that the plan we have is Pareto sub-optimal, meaning we could make it better for you without making it worse for me, or we could make it better for me without making it worse for you. That's Pareto sub-optimality. Andrew Critch: If it's Pareto sub-optimal, there's a risk that the plan is going to be below your BATNA and you're going to bail and it's below your BATNA needlessly. It's like we should just bump it up. And that way, if you treat your BATNA as a random variable, if I treat your BATNA or a mediator were to treat both of our BATNAs as random variables, there's a chance that those random variables are going to be below the best negotiated plan we have. So for me, Pareto optimization is related to or subservient to maximizing the probability that the negotiators will succeed in coming up with a sufficiently appealing cooperative plan that they choose to cooperate. And that's because I think A.I governance is going to require people to cooperate a lot. Andrew Critch: And negotiate a lot in the course of that. And so, now if you want to maximize entirely for cooperation and not for other important principles like, say, fairness, one thing you might inadvertently or you might intentionally do this or you might inadvertently do this, you might exploit people's differences in beliefs to implicitly have them bet against each other with the policy. So let's say I guess you're using Zencastr, so let's say I don't know much about Zencastr, but I think it's going to be fine because most software companies are reasonably careful with their data and you know Zencastr and you know all about them. And you happen to know that Zencastr is a terrible company that doesn't respect anybody's privacy. Andrew Critch: But you know that I don't know that. People have studied this dynamic, by the way, bargaining with asymmetric information. 
That's not new, but if we Pareto optimize, we end up with this plan to show up on Zencastr. And if you want me to do the podcast and I want to do it subject to privacy constraints and I don't know about them, I sign up to do it and then later I find out, oh, no my privacy is being violated by Zencastr and someone downloaded the data and recorded my neighbor's conversations from tiny trace audio. And now their privacy has been violated, too, and I've been penalized for having incorrect beliefs about how Zencastr was going to turn out for me. Andrew Critch: And the single shot version of that is just making a bet. It's like we made a bet, it's kind of two bets at once. You bet that you would like the plan, I bet that I would like the plan. And I lost my part of that bet because Zencastr turned out bad. Andrew Critch: So the interesting thing is what happens when that bet suddenly becomes a continuous process that happens for the rest of forever. What you see in an A.I. system that is ex ante, meaning before it runs, subjectively Pareto optimal to the principals it's serving, is that at every time step for the rest of eternity it settles a little bet between the principals who created it or who agreed to defer to it. And then if one of the principals had very inaccurate beliefs about what the A.I. system was going to observe, that principal's priority, the weight that it gets in the system's judgment, goes down and down and down, because every second it's losing a bet for how much control that principal is going to have over the A.I., or how much the A.I. is going to choose to serve that principal's values. Andrew Critch: And so in the same way I could lose one bet with you over how good Zencastr is going to be, I could actually lose a whole series of bets with you every second about how Zencastr is going to turn out. 
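The bet-settling dynamic can be illustrated with a small Python sketch. It assumes, as a simplification of what is described here, that each principal's weight is multiplied at every time step by the probability that principal assigned to what was actually observed, and that the weights are then renormalized. The function names, the binary observations, and the numbers are all hypothetical, and this is not the paper's formal construction.

```python
from typing import List, Sequence


def settle_bets(weights: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """One time step: scale each principal's weight by the probability that
    principal assigned to the actual observation, then renormalize to sum to 1."""
    updated = [w * p for w, p in zip(weights, likelihoods)]
    total = sum(updated)
    if total == 0:
        raise ValueError("every principal assigned probability zero to the observation")
    return [u / total for u in updated]


def run(initial_weights: Sequence[float],
        belief_in_one: Sequence[float],
        observations: Sequence[int]) -> List[List[float]]:
    """Track the weights over a sequence of binary observations.

    belief_in_one[i] is the (hypothetical) probability principal i assigns to
    observing 1 at each step; observations are 0 or 1.
    """
    weights = list(initial_weights)
    history = [weights]
    for obs in observations:
        likelihoods = [b if obs == 1 else 1.0 - b for b in belief_in_one]
        weights = settle_bets(weights, likelihoods)
        history.append(weights)
    return history


# Two principals start with equal weight but disagree about the world.
# The first expects to observe 1 with probability 0.8, the second with 0.2.
history = run([0.5, 0.5], [0.8, 0.2], [1, 1, 1, 1, 1])
for step, weights in enumerate(history):
    print(step, [round(w, 4) for w in weights])
```

Under these assumptions the principal with the less accurate beliefs loses weight at every step. If both principals hold identical beliefs, the likelihoods cancel in the renormalization and the weights never move, which corresponds to the fixed linear weighting of Harsanyi's theorem.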
And if you've got really accurate beliefs about the world, you're going to win all those bets and our cooperation is going to end up great for you and worse for me. Andrew Critch: But it was my willingness to bet on my own false beliefs that caused me to cooperate with you in the first place. And if I had known, if I hadn't been deluded, as to Zencastr's ethics, I might have just not done the podcast. And maybe that's ethically the right thing to turn out to do. But with A.I., I worry that fragmentation could be quite bad if it leads to war, or even just standards - there's physical wars, and then there are standards wars where companies are just fighting over what standards are going to be important because they're fragmenting. And I think that can cause a lot of chaos and waste a lot of attention. It could even, if it's physical wars, actually get people killed. If countries are fighting over A.I. technology in the way that you might see companies fighting over oil as a resource. So I guess I want to point out an important trade off. Andrew Critch: This paper points out a trade off between fairness and cooperation, which is that cooperation ex ante rewards people with more accurate beliefs upon entering into the cooperation. And it's unfair to the people who had misconceptions of what was going on. So your original question is why should we use this? Well in cooperation- [crosstalk 00:32:11] Daniel Filan: The original question, I think, was why doesn't the Harsanyi Aggregation Theorem apply? Andrew Critch: Right. And why should we use the belief updating rule instead, or something? Daniel Filan: Yeah, something like that. Andrew Critch: And it's more like, well we could say in a meta problem of fairness and cooperation being two different principles you want to serve, you're trying to invent a negotiation framework that's pretty good for cooperation and pretty good for fairness. 
I would say Harsanyi's approach is Pareto sub-optimal because you can get more cooperation. But you should also be adding fairness as a constraint to Harsanyi. So I don't know the answer yet of what I would personally subscribe to as the right way. I'm not so sure that no one will ever find a way that I'll look at and say, that's actually the right way, let's do it. I'm not so anti-realist about that moral judgment, but I don't have a strong view on it right now. Daniel Filan: Okay, so there are a few questions I could ask from there. First of all, I'll ask a quick technical question, in this theorem, you assume that policies can be stochastic. Can you say a little bit about what exactly you mean by that assumption? Because I think it's slightly different than what readers might think. Andrew Critch: Oh, I just mean that at every time step, you can randomize what you're going to do. So the A.I. system every time step is like flipping a coin and its policy is just what the weight of that coin is. And it can also randomize at the outset if it wants, it can choose a random seed at the beginning to choose between two different random policies. So it has a memory in a sense. There's a few different ways of formalizing it. One is it generates a random seed at the beginning of time and then remembers that seed for the rest of time. Or you could just have that it can remember everything that it has previously done, including that initial coin flip. And that's the formalization that I adopt just because it proves a stronger theorem. Daniel Filan: Okay. Yeah, so discussing this theorem. So if I think about institutions where people can be rewarded or punished based on what they know, it seems we already have a few of those that people are broadly okay with. 
So, for instance, the stock market I have in my time-[crosstalk 00:35:38] Andrew Critch: I take exception to the claim that people are broadly okay with the stock market, but I'll agree that the stock market is, in fact still happening. Daniel Filan: Yeah, I think people are okay with some individual trades. If you and I trade in equity, the person who knows more about the value of that equity in the future has an edge on that trade. I guess I haven't seen polling, my assumption is that most people are okay with the idea that I can trade in equity with you, but maybe you don't think that. Andrew Critch: Well. Claims about what most people think are okay are a little bit dangerous. And I'm a little bit uncomfortable making them. Daniel Filan: How are they dangerous? Andrew Critch: Well, they can force people into equilibria that they didn't want to be in. So, you're in a room full of 30 people and somebody says, "Well, we're all clearly okay with the meetings being at 6:00 A.M every week. Right?" And there's this brief pause. And then, now, the meetings are at 6:00 A.M every week. And a bunch of people objected, but they didn't know everybody else would have objected. So they didn't object. And so when you say everybody agrees with blah, if I can think of some people who don't agree with blah, I'm hesitant to just get on the "everyone agrees with blah" train because I'm oppressing that view if I do it. So I'm not going to get on board with the "everybody's okay with the stock market" claim, and I might not even get on board with the "everyone's okay with asymmetric information trades" claim, although I would agree more people are okay with that. Andrew Critch: Education is a human right right now; I can imagine a future where informed trade is a right, and I don't think that's ridiculous. And it would create a lot more work for the economy to do, to produce information for people whenever they enter trade. But I think that's a tenable position. 
I know people who think that bargaining without transparency is just bad and wrong. And I'm like "yeah, I don't know". I don't want to close the book on that by just saying everybody agrees with it already. Daniel Filan: Yeah, I guess. To follow this tangent a bit, so I think one argument against ensuring that every trade is transparent is sometimes it might impose a really high communicative overhead, for instance- [crosstalk 00:38:39] Andrew Critch: It takes a lot of economic work. Exactly. Daniel Filan: Yeah. And especially where it's like, suppose we're going to have a podcast today. You know more potentially about what you'd be like on a podcast than I do and what if that knowledge is implicit. It's based on your experience of you talking to people over a lifetime that I don't have. It seems hard. I'm not even sure I know what it would look like for that trade to be fully informed- [crosstalk 00:39:11] Andrew Critch: I mean I'm definitely comfortable saying some people sometimes are okay with asymmetric information trades and I was okay with this one. And I guess you were too. And, you know, yeah. Daniel Filan: All right. Andrew Critch: But I think that was just a prelude to some other claim you were going to make, and I think you can still make that subsequent claim without couching it in the "everybody agrees that asymmetric information trades are fine." Daniel Filan: Yeah, I guess the claim that I might make is we have institutions that have asymmetric information trades or at least trades where participants believe different things when they make the trade and those institutions seem like they work. [crosstalk 00:40:14 ] Andrew Critch: Yeah, I don't want to undercut that but flag for minor potential disagreements, but go on. Daniel Filan: Yeah. 
So if I think about the stock market and the mating market, these are two, well, one of them is more like a market than the other, but these are two cases where people make asymmetric information trades or at least trades with different beliefs. The stock market, as far as I can tell, seems to successfully serve the purpose of predicting itself in the future and the mating market seems- [crosstalk 00:41:02] Andrew Critch: Sorry what do you mean by mating market? Daniel Filan: I mean the market by which humans pair off and become romantic partners and maybe they pair bond for life. Andrew Critch: Okay. Daniel Filan: The mating market seems roughly successful in getting a pretty large majority, but not everybody, a romantic partner eventually, and importantly, both of these seem stable institutions. Andrew Critch: Yeah. Daniel Filan: It seems they are above people, probably they're not literally above everybody's BATNA, but they're above most people's BATNA. They're roughly doing what they're trying to do and we're not seeing really big revolts against them. Andrew Critch: I mean, so there have been revolts against the stock market. Daniel Filan: Yeah. Andrew Critch: And so. I guess that's my counterpoint. Daniel Filan: Yeah, that's a decent counterpoint. Yeah. So which revolts are you thinking of specifically? When I think of revolts in that class, I'm not sure I can think of ones that were actually a stock market- [crosstalk 00:42:30] Andrew Critch: What's Occupy Wall Street? Right. Let's just take the meme "we are the 99%". What happens in a world where 1% of people acquire the resources necessary to make the best predictions about where other resources are going to go? And gain so much advantage from that, that they just dominate the exchange of resources to the point of accumulating most of the resources under the control of a small minority of people. Andrew Critch: Now is that good or is that bad? Well, it has properties, right? 
It rewards effort that makes people and institutions better at prediction. So there's an incentive there to get better at prediction. But it also is a little incestuous in the sense that these people and institutions are just protecting each other. It's like you said, the stock market is good at predicting itself. But I mean, what is the stock market doing? It's ensuring efficiency of trade among the owners of a very large number of business activities. And is that good? Andrew Critch: Well, I don't know, maybe constantly changing the ownership of very large, powerful entities creates a diffusion of responsibility where you can just rotate in and out board members and CEOs if things go wrong. And so it's not clear that the stock market is a good thing for all the people who didn't end up in control of it. Andrew Critch: And I mean, I worked in finance, so I think the stock market does some good and I don't think it's all bad and I wouldn't erase the stock market right now if I had an anarchy button, but I think it has some problems and I think many people agree that it has some problems. And one of the problems is that it just heaps resources on to people who are good at predicting it or institutions that are good at predicting it. It really leaves out people who don't have those big powerful institutions behind them to help them make their financial decisions. Daniel Filan: I mean. So this feature of the stock market doesn't generalize to other things we're talking about - the one that I'm about to say - Andrew Critch: Okay, sorry. Daniel Filan: Yeah, but I mean, the stock market does have this feature where you can just buy the whole market. And then if you're more willing to wait for resources than the market, quote, unquote, is you can just buy the market, wait and then eventually get reasonably wealthy from doing that, right? 
Andrew Critch: You're just saying the stock market has- [crosstalk 00:45:47] stock market value has gone up over time and index funds help protect you from the adverse selection of choosing which things to own? Daniel Filan: Yeah, I guess I'm saying that, even if I don't know much about individual stocks or I have very little information about what companies are doing what or where profit is being generated. I can buy an index fund and- [crosstalk 00:46:11] Andrew Critch: But you will not become a billionaire by buying index funds. Daniel Filan: That seems correct, unless I'm a 999 millionaire to start with. Andrew Critch: Yeah. I didn't mean to doxx you as a non billionaire there. I just know you're not listed on any of the public registries of billionaires. You could still be a billionaire. Daniel Filan: You don't know how much my shell companies have. Yeah. And so I guess the analogy is, if we made A.I. that literally rewarded people who knew stuff. It would reward some kind of insider trading or fooling others or something, and this would not be a stable situation. Is that a summary of what you think? Andrew Critch: That is a thing that I think yeah, I wouldn't say it's a summary, but it's a thing that I believe. Yeah. And it's not really addressed by the NRL paper, it's more like the NRL paper is pointing out - NRLP, negotiable reinforcement learning. It's pointing out if you Pareto optimize for cooperation or if you just Pareto optimize ex ante, you get this bet settling outcome. And the paper doesn't really explore very much about the unfairness of that outcome. There's a little bit in there, but that's more of a future work type of thing that I hope people will think about. Daniel Filan: Okay. Yeah, so speaking of that. Yeah, I guess you wrote this paper because you thought that it would have consequences in the world. Or I gather that you did. Andrew Critch: Yeah. Daniel Filan: Yeah. Can you say a bit more? 
What things do you hope will happen because you wrote this paper? Andrew Critch: Yeah, I mean, just proximally, I hope researchers and A.I. students, faculty, industry folks who are building sequential decision making systems will take an interest in differences in belief as having an interesting bearing on what happens with the system, and I hope that they can see, wow, there's something different about how belief bears on the system from the way preference bears on the system. Or how belief ought to bear on the system versus how preference ought to bear on the system. Preferences, you just leave them as they are or you try not to disturb them. Whereas beliefs, you have this opportunity to share information and update each other's beliefs, for example, that's completely missing from the paper. I'd like to see people designing mechanisms for individuals and institutions who govern powerful A.I systems to share information with each other. So it evens the playing field. So they all have the same information. There might be a small benefit to the institution that had more information at first or something to sell their information to other institutions. But I'm hoping we don't lock the entire future into some technological equilibrium that really disenfranchises a massive amount of people or a massive amount of different value systems that just didn't manage to have a say on what A.I. does. Andrew Critch: And to an extent, there's a lot of people thinking about fairness and accountability in A.I., and then transparency at least to engineers. So there's a cluster, fairness, accountability, transparency that really appeals to me here. And so, you know, maybe if people with those interests could think more about differences in belief and how that's going to play out in sequential decision making, policies that are going to run for a long time, how should that play out? Yeah. 
Daniel Filan: Okay, yeah, so I guess a follow-on from that, I see this as similar to the social choice literature and I guess if you're hoping, and I'm not sure that you actually said this, quite, but if it's true that you're hoping that this research will eventually help facilitate bargaining and thinking about, okay how do we actually get people to cooperate over the creation of powerful artificial intelligences, do you think the existing social choice literature has analogously done a good job at fostering cooperation? Andrew Critch: Well, that's interesting. I have wondered this. And I don't know, I mean, Aumann and Schelling and people like that were commissioned by RAND Corporation to try and devise nuclear disarmament protocols. And they tried and they admittedly failed. Their writings on this say, look, we tried, we couldn't come up with anything and they seemed to have earnestly tried. I don't think they were lazy. I mean maybe they were, they didn't seem that way to me. Maybe they're just brilliant and they can have good ideas while being lazy. But, so in a sense, I'm going to say no. Like, there were things that the world called on mathematicians and game theorists and decision theorists to figure out that they didn't figure out. Andrew Critch: And I think we're not done making that call. That happened during the Cold War, right? So it was time to make that call and some of the greatest minds came together to think about disarmament. How do you gradually de-escalate a threatening situation between nations? But there hasn't been that much work since, there hasn't been that flurry of brilliance in the area of how to foster peace and cooperation as there was back then in the Cold War. And I hope, I think with the advent of increasingly capable A.I. technology, we're going to see more and more brilliant people taking an interest in how to maintain peace and harmony in the world with that much capability. 
So I'm half making a prediction and half making a bid that says let's revisit these foundational questions about how to achieve cooperation and see if we can do better than the 70's. Daniel Filan: Yeah. Andrew Critch: Yeah. Daniel Filan: Yeah. It's interesting that we stopped because it's not as if we don't currently live in a world where many countries have a lot of nuclear weapons or many countries disagree about who gets what bit of land. Right? Andrew Critch: It's true, it's not as if we don't live in that world. Daniel Filan: So another question on consequences. So at NeurIPS in 2020 papers are supposed to have a broader impact statement and the broader impact statement is supposed to include how could this research have a negative consequence if there's a plausible way. Suppose your research ended up making the world worse, by means other than just opportunity costs. There was something else that people could have done that was great, but instead they paid attention to yours, it actively made the world worse. How do you think that would have happened? Andrew Critch: Yeah, I mean, I guess I've alluded to it, right? Someone just grabs the formula from negotiable reinforcement learning and just runs it and then a bunch of people end up unwittingly signed on to a protocol that's exciting to all of them at start, because according to their own individual beliefs, it looks great. But some of the people's beliefs about how it's going to go down are wrong. And then they end up getting really screwed over. And I don't want that to happen. So do I have a theorem for how not to do it or what's the correct balance between getting everybody together versus making sure everybody's signed on to something fair. No, I don't, but that's another potential future work. Maybe there's an interesting boundary there between fairness and unity to be hugged. Andrew Critch: But, yeah, if it goes wrong, but that doesn't seem likely. 
So, it's just one idea who's going to use this, but maybe it's the mode of the distribution of unlikely ways this idea could end up having a large negative impact. Daniel Filan: Okay. So speaking of consequences, the paper has been out for a while. How's the reception been? Andrew Critch: Yeah, there have been a bunch of people who came up to me to try to take an interest in it. It seems like what happens is, it seems like there are little pockets of interest in it, but that are isolated or not even pockets. There hasn't been a whole lab of people who all got interested in it. And so there have been, a person from this group, a person from that group, "Oh, this is interesting". But I think it's hard for people to stay motivated on projects when the rest of their working environment is not adequately obsessed with it or something, so there hasn't been a flurry of follow-up work on it and maybe now's the time for that. Maybe today, maybe this podcast, I also have more free time now. I just finished up a giant document. Maybe now is the time that we can try to get a cluster of people working together to solve the next problems in what I would call machine implementable social choice. But that hasn't happened yet. We'll see. Daniel Filan: Right. So speaking of that, do you have any final things you'd like to say? Or if people are interested in following your research, how should they do so? Andrew Critch: Well, I guess I don't have a Twitter account, if that's what you're asking. But thanks for asking. And the easiest way to notice if I write something is just subscribe to me on Google Scholar. You can make Google Scholar alerts that just tell you when someone publishes a paper. So that should work. And I have a website, but that's a more active attention-intensive way of keeping track of what I do compared to Google Scholar. And maybe I'll get a Twitter account someday, it won't be this year, I think. 
Maybe, I could lose that bet, but it won't be in the next several months, that's for sure. Daniel Filan: Okay, well, thanks for talking with me Andrew. And to the listeners, thanks for listening and I hope you'll join us again.

### AXRP Episode 2 - Learning Human Biases with Rohin Shah

December 29, 2020 - 23:43
Published on December 29, 2020 8:43 PM GMT

Google Podcasts link

Daniel Filan: Today, we have Rohin Shah. Rohin is a graduate student here at UC Berkeley's Center for Human-Compatible A.I., or CHAI. He's co-authored quite a few different papers and he's soon to be a research scientist at DeepMind. Today, we'll be talking about his paper "On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference". This appeared at ICML 2019 and the co-authors were Noah Gundotra, Pieter Abbeel and Anca Dragan. Welcome to the show. Rohin Shah: Yeah, thanks for having me Daniel. I'm excited to be here. Daniel Filan: All right. So I guess my first question is, what's the point of this paper? Why did you write it? Rohin Shah: Yeah. So I think this was one of the first - this was the first piece of research that I did after joining CHAI. And at the time - I wouldn't necessarily, I just wouldn't agree with this now - but at the time, the motivation was, well, when we have a superintelligent system, it's going to look like an expected utility maximiser. So that determines everything about it except, you know, what utility function it's maximising. It seems like the thing we want to do is give it the right utility function. A natural way to do this is inverse reinforcement learning where you learn the utility function by looking at what humans do. But a big challenge with that is, like, all the existing techniques assume that humans are optimal. Which is clearly false. Humans are systematically biased in many ways. It also seems kind of rough to specify all of the human biases. 
So this paper was saying, well, what if we tried to learn the biases, you know, just throw deep learning at the problem? Does that work? Is this a reasonable thing to do? So that's why I initially started looking into this problem. Daniel Filan: OK. So basically, the story is, we're going to - we want to learn a utility function from humans, and we're gonna learn it by seeing what humans do and then trying to do what they're trying to do. And in order to figure out what they're trying to do, we need to figure out how they're trying to do it. Is that a fair summary? Rohin Shah: Yeah. Daniel Filan: And specifically, you're talking about learning rather than assuming human biases. Could you say more about exactly what type of thing you mean by bias? Rohin Shah: Yeah. So this is bias in the sense of cognitive biases. Like if people have read "Thinking Fast and Slow" by Tversky and Kahneman it's like that sort of thing. So a canonical example might be hyperbolic time discounting where we basically discount the value of things in the future more than could be plausibly said to be rational in the sense that maybe right now, I would say that I would prefer two chocolates in thirty-one days to one chocolate in thirty days. But then if I then wait 30 days and it's now the day where I could get one chocolate, then I'd say, oh, maybe I want the chocolate right now rather than having to wait a whole day for two chocolates the next day. So that's an example of the kind of bias that we study in this paper. Daniel Filan: All right. And I guess to give our listeners a sense of what's going on, could you try to summarise the paper maybe in a minute or two? Rohin Shah: Sure. So the key idea of how you might try to deal with these human biases without assuming that you know what they are, is to assume that the human has not just a reward function, which we're trying to infer, but also a planning module, let's say. 
And this planning module - Daniel Filan: And you put that in scare quotes, right? Rohin Shah: Yeah, scare quotes. Exactly. Planning module scare quotes. And what this planning module does is it takes as input the environment - the world model where the human is acting, as well as what reward function the human wants to optimise and spits out a policy for the human. So this is like, how do you decide what you're going to do in order to achieve your goals? And it's this planner, this planning module that contains the human biases. Like, maybe when if you think about overconfidence, maybe this planning module is - tends to select policies that choose actions that are not actually that likely to work. But the planning module thinks that it's likely to work. So that's sort of the key formalism. And then we tried to learn this planning module using a neural net alongside the reward function by just looking at human behaviour - well, simulated human behaviour - and inferring both the planner and the - both the planning module and the reward function that would lead to that behaviour. There is also a bunch of details on why that's hard to do. But maybe I will pause there. Daniel Filan: Sure. Well, I guess that brings up one of my questions. Isn't that literally impossible, right? How can you distinguish between somebody who's acting perfectly optimally with one set of preferences or one reward function, you might say in the reinforcement learning paradigm, isn't that just indistinguishable from somebody who's being perfectly suboptimal, doing exactly the worst thing with exactly the opposite reward function? Rohin Shah: Yep, that's right. And indeed, this is a problem if you don't have any additional assumptions and you just sort of take the most naive approach to this, where you just say "do back propagation end-to-end learning to just maximise agreement with human behaviour". 
You basically just get nonsense, you get a planning module and a reward that together produce the right behaviour. But if you then tried to interpret the reward as a reward function and optimised for it, that's just like, is basically pretty arbitrary, and you get random misbehaviour. And in our experiments, we show that, you get, if you do just that, you get basically zero reward on average. If you optimise the learned reward. These are with reward functions that are pretty symmetric. So you should expect that on average you'd get zero if you optimised a random reward. Daniel Filan: OK, cool. So you're using some kind of some extra information, right? Rohin Shah: That's right. So there are two versions of this that we consider. One is unrealistic in practice, but serves as a good intuition, which is: suppose there's just a class of environments and you know the reward functions for - and behaviour - for let's say half of them, some fraction of them, and for the other half, you only see the behaviour and not the reward functions. And the idea is that if the planning module is the same across all of these, you can learn what the planning module is from the first set where you know what the reward function is. And then, you use that planning module when talking about the second set where you're inferring the rewards. So in some sense, it's like a two phase process where you first infer what the biases are and then you use those biases to infer what the rewards are. There is a second version where instead of assuming that we have access to some reward functions, we assume that the human is close to optimal: meaning their planning module is close to the planning module that would make optimal decisions. And what this basically means is we initialise the planning module to be optimal and then we essentially say, OK, now reproduce human behaviour and you're allowed to make some changes to the planning module such that you can better fit the human behaviour. 
Which you could think of as like introducing the systematic biases, but since you started out near the optimum, started out as being optimal, you're probably not going to go all the way to being suboptimal or anything like that. Daniel Filan: OK. So, yeah, I guess let's talk about those. In the paper, you sort of have these assumptions 1, 2a, and 2b, right? Which you talked about a little bit. But I was wondering if you could more clearly say what those assumptions are and also how - in the paper you give sort of natural language explanations of what these assumptions are. But I was wondering if you could say, OK, how that translated into code. Rohin Shah: Yeah. So the first assumption, assumption 1 which is needed across both of these two situations I talk about, is that the planning module is "the same" in scare quotes, again, for similar enough environments. So we - in my description, I assume you've got access to this set of environments in which the planning module works the same way across all of those environments. Now - Daniel Filan: Yeah, what does that mean? Rohin Shah: It's not totally clear what that means. I wouldn't be able to write down a formal meaning of this, because if you tried to say - because there's always the planning module that says, well, if the environment is, you know, this specific environment where the ball is on top of the vase or whatever, then, you know, output this policy. But if it's this other environment, then do this other thing. And that's technically a single planning module that works on all of the environments. And so in some sense, really it's just, is there a reasonable or a simple - there's a reasonable or a simple planning module that's being used across all of the environments. 
And I think this is the sort of dependence on reasonableness or simplicity is just something that we are going to have to depend on, not necessarily in this particular way, but if you don't allow for it, you get into problems well before that. For example, the problem of induction in philosophy, which is just how do you know that the past is a good predictor of the future? How do you know - how can you know that - how can you eliminate the hypothesis that tomorrow, the evil god that has so far been completely invisible to us decides to like turn off the sun? Daniel Filan: OK. But what does that amount to? Like, does that just mean you use one neural network, and - Rohin Shah: That's right. In our code, we just use a single neural network. Neural networks tend to be biased towards simplicity. So effectively it turns into - that becomes like a simplicity, kind of like a simplicity prior over the planning module. Daniel Filan: All right. I guess I sort of understand that. So that was assumption 1. How about 2a and 2b? Rohin Shah: Yeah. So assumption 2a is the version where we say that the demonstrator is close to optimal and we don't assume that we have any rewards. In that one, what we do is we take our neural net that corresponds to this planning module and we train it to produce the same things that value iteration would produce. Value iteration is an algorithm that produces optimal policies for the small environments that we consider. And so by training, basically, we're just training our neural net to make optimal predictions. Daniel Filan: OK. And you're initialising at this optimal network? Rohin Shah: Right, this training happens in an initialisation phase. Like, I would call this training the initialisation for the subsequent phase when we then use it with actual human behaviour. This all happens before we ever look at any human behaviour. We just simulate some environments. We compute optimal policies for those environments with value iteration. 
And then we train our planning module neural net to mimic those simulated optimal policies. So this all happens before we ever look at any human data. Daniel Filan: OK. So assumption 2a essentially comes down to, when we train our networks to mimic humans, we're going to be initialised at this - trained at this - demonstrator that was trained on optimality. Rohin Shah: Correct. Daniel Filan: So one question I have is, why - initialisation seems like kind of a strange way to use this assumption. Like, if I were being - my default is to maybe be kind of Bayesian and then say, okay, we're going to have some sort of prior. Or maybe we're going to do this regularisation thing, where I know what the weights are of an optimal planner and I'm going to L2 regularise away from those weights. Initialisation, you know, the strength of that quote unquote 'prior' or something that you're putting on the model is going to depend a lot on how long you're training, what your step size is and such. So, yeah. Why did you choose this initialisation instead of something else? Rohin Shah: Really, that was just the first approach that occurred to me, and so I tried it and it seemed to work. Daniel Filan: All right. Rohin Shah: Reasonably well. I think - so, I think - I don't think I'd ever considered regularisation. That seems like another reasonable thing to do and does seem like it would be easier to control what happens with that prior. So that does seem - that does seem like a better approach, actually. I think I would lean away from the Bayesian perspective just because then you have to design a hypothesis space and so on. And the whole point is just to - Daniel Filan: I mean - Rohin Shah: Sorry. Daniel Filan: Oh, I mean - I mean, regularisation is secretly Bayesian anyway, right? Rohin Shah: Oh, sure. Yeah. Fair enough. 
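The remark that regularisation is "secretly Bayesian" can be spelled out in one line. This is only a sketch, and the symbols are assumptions rather than notation from the paper: let $\theta$ be the planner weights, $\theta^{*}$ the weights of an optimal planner, $D$ the demonstrations, $\mathcal{L}(\theta) = -\log p(D \mid \theta)$ the training loss, and suppose a Gaussian prior $\theta \sim \mathcal{N}(\theta^{*}, \sigma^{2} I)$. Then the maximum a posteriori estimate is

$$
\hat{\theta}_{\mathrm{MAP}}
= \arg\max_{\theta}\,\bigl[\log p(D \mid \theta) + \log p(\theta)\bigr]
= \arg\min_{\theta}\,\Bigl[\mathcal{L}(\theta) + \frac{1}{2\sigma^{2}}\,\lVert \theta - \theta^{*} \rVert_{2}^{2}\Bigr].
$$

Under these assumptions the L2 penalty toward $\theta^{*}$ has strength $\lambda = 1/(2\sigma^{2})$, so the pull toward the optimal planner is set by one explicit number rather than by step size or training length, which is the control that initialisation alone lacks.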
I mean, I would say that I wouldn't be surprised if this initialisation was also secretly Bayesian given the like other hyperparameters used in the - in the training. Daniel Filan: OK, so that was 2a. Then there was also assumption 2b, right? Rohin Shah: So assumption 2b is pretty straight-forward. It just says that, you know, we had this set of tasks over which we're assuming the planning module is the same. And for half of those tasks, we assume that we know what the reward function is. And the way that we use this is that we - basically both the planning module and the reward function in our architecture is trained by end-to-end gradient - by gradient descent. So once, when we have assumption 2b, when we have the reward functions, we set the reward functions, and the human behaviour, and we freeze those and we use gradient descent to just train the planning module. And this lets the planning module learn what the biases are in planning. And then once we have the plan, once we - after we've done this training, the planning module is then frozen. And it now has already encoded all the biases. And then we use gradient descent to learn the reward functions on the new tasks for which we don't already have the rewards. And so there you are just inferring the reward functions with a - given your already learned model of what the biases are. Daniel Filan: OK. Yeah, I guess one comment I have on that assumption is initially it seems, well, initially it seems realistic, sometimes people are in situations and you know what, kind of what they want. And then you think about it a bit more. And it seems unrealistic because you're saying you know exactly what they want. But I think it's a little bit less unrealistic than the second phase thinks. For instance, one cool research design you can do in microeconomic studies is to have lottery tickets that you use to pay people with, right? 
And the nice feature about lottery tickets is, if you assume that people want the lottery tickets more than they want nothing, a nice feature of lottery tickets is that if you want - if you get two lottery tickets, that's exactly twice as good as having one lottery ticket. Because, by the linearity of probabilities in expected utilities. So there are some situations in which you can actually make that work. I just wanted to share that research design. I think it's quite neat. Rohin Shah: Wow. I love that this is a way to just get around the fact that utility is not linear in money. Daniel Filan: Right? Rohin Shah: That's cool. Daniel Filan: It's excellent. Unfortunately, you only have so many lottery tickets you can give out, right? So you can't do it indefinitely. At some point they just have all of the lottery tickets. And they won the lottery. And you can't give them any more. Until then - Yeah. All right. So I want to jump around in the paper a little bit. So the question I have is, in the introduction, you spend quite a bit of time saying all the strange ways in which humans can be biased and suboptimal or something. Reading this, I almost think that this might be a good argument for modelling humans maximally entropically using something like the Boltzmann distribution, because there you're just saying, look, I don't know what's going on. I'm going to have no assumptions. But, you know, then I'm just going to use that probabilistic model that uses the least assumptions and in practice it does alright. So I guess I'm wondering, what do you think of this as an argument for Boltzmann rational models? Rohin Shah: Yeah, I mean, so I think I want to note that the actual maximally entropic model is one that just uniformly at random chooses an action. Daniel Filan: Which is in the Boltzmann family. 
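To make the point about the Boltzmann family concrete, here is a minimal sketch, not taken from the paper: a Boltzmann rational policy picks actions with probability proportional to exp(beta * Q). The function name, the `beta` parameter and the Q-values below are illustrative assumptions. With `beta = 0` the policy is uniform (the maximally entropic case), and as `beta` grows it concentrates on the best action.

```python
import math


def boltzmann_policy(q_values, beta):
    """Return action probabilities proportional to exp(beta * Q)."""
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    best = max(q_values)
    weights = [math.exp(beta * (q - best)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical Q-values for four actions.
q = [3.3, 3.25, 1.0, 0.0]

uniform = boltzmann_policy(q, beta=0.0)   # ignores the Q-values entirely
soft = boltzmann_policy(q, beta=1.0)      # better actions are more likely
sharp = boltzmann_policy(q, beta=50.0)    # close to always picking the best action
```

At `beta = 0` the demonstrator's behaviour carries no information about the reward, which is why that member of the family is useless for reward inference.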
Rohin Shah: True, but if you did that, you'd never be able to learn about the reward because the human policy, just by assumption, does not have anything to do with the reward. So you need, sort of - I actually mostly agree with this now. I am not entirely sure what I would have said, you know, two and a half years ago when I was working on this, but I mostly agree with this perspective where, what you need out of your model is it needs to assign a decent amount of probability to all the actions. And it also needs to rank actions that do better as having higher probability. And, that's it. Those are the important parts of your model. And if you take those two constraints, a Boltzmann rational model is a pretty reasonable model to come out with. And I think, you would expect, I think but I'm not sure, that this - well, it should at the very least, hurt your sample complexity in terms of like how long it takes you to converge to the right reward function compared to if you knew what the biases were. It also probably makes you converge to the wrong thing when people have systematic biases, because you sort of attribute - when they make systematic mistakes, you attribute it to them - in some sense the generative model that Boltzmann rationality is suggesting is that when humans make systematic mistakes a Boltzmann rational model is like, well, I guess they, every single time they're in this situation, when they flip their coin to decide what action to take, the coin comes up tails and they take the bad action. It's just a weird generative model... Daniel Filan: The actual model would be that the human prefers that action in that situation, right? That would be the actual inference that a Boltzmann model - Rohin Shah: Correct. Yeah. So either it would make the wrong inference or it would have - if it somehow got to it - I think more what I'm saying is like if we assume - you might expect that the Boltzmann rational model would not get to the true reward. 
Because if it were at the true reward, it would be having this weird generative model. And so, yeah, it would like make the opposite inference where you - anything that humans systematically make a mistake on, you would expect that humans wanted to do that, for some reason at least. It would have to invent some explanation by which humans want - that action was a good one. Daniel Filan: That's an interesting thread. I might talk about - let me get a little bit more into the nuts and bolts of the paper before I pick up on that thread more. Rohin Shah: Sure. Daniel Filan: Another question I have is when you're doing these experiments, when you're learning models of the human demonstrator you use value information - sorry, value iteration networks or V.I.N.s, right? Could - for our audience, could you say what is a V.I.N? Rohin Shah: Yeah. So a value iteration network is a particular architecture for neural networks. That is able to - that mimics the structure of the value iteration algorithm, which is an algorithm that can be used in A.I. to learn optimal policies for tabular Markov Decision Processes. Point is, it's an example of a neural net whose architecture is good for learning a learning algorithm, which is the sort of thing that you want if you want to have a planning module. Daniel Filan: OK. And it works in grid worlds, right? Rohin Shah: Yes. I believe people have used it elsewhere, too? I want to say that someone adapted it to be used on graphs? But I used it on gridworlds and the basic design was definitely meant for gridworlds. Daniel Filan: Sure. Rohin Shah: It basically stacks a bunch of convolutional layers and then uses max pooling layers to mimic the maximisation in the value iteration algorithm. Daniel Filan: Sure. So one - I guess one question that I have is, it seems - my vague memory of value iteration networks is that they can express the literally optimal value iteration, right? Rohin Shah: That's correct. 
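For readers who have not seen the algorithm that a value iteration network mimics, here is a minimal sketch of plain tabular value iteration on a small deterministic gridworld. It is an illustration under stated assumptions, not the paper's code: the function name, the five-action move set, the "reward on entering a cell" convention and the example grid are all hypothetical. In a V.I.N., the sweep over neighbouring cells plays the role of the convolutional layers and the `max` over actions plays the role of the max pooling.

```python
def value_iteration(rewards, gamma=0.9, iterations=50):
    """Tabular value iteration on a deterministic gridworld.

    rewards[r][c] is the reward collected on entering cell (r, c).
    Actions are stay, up, down, left, right; a move off the grid
    leaves the agent where it is.
    """
    rows, cols = len(rewards), len(rewards[0])
    moves = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    values = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        new_values = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                q_values = []
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        nr, nc = r, c  # bumped into a wall
                    # One-step lookahead: immediate reward plus discounted value.
                    q_values.append(rewards[nr][nc] + gamma * values[nr][nc])
                # The maximisation over actions is the step that
                # max pooling imitates in a value iteration network.
                new_values[r][c] = max(q_values)
        values = new_values
    return values


# Hypothetical 2x3 grid with a single rewarding cell in the bottom-right corner.
grid = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0]]
v = value_iteration(grid, iterations=200)
```

Each pass of the outer loop extends the planning horizon by one step, which is why a network with a fixed number of such layers can only represent value iteration up to a limited horizon.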
Daniel Filan: So if that's true, why bother learning? You know, doing gradient descent on optimal behaviour to get a model of an optimal agent rather than just setting the weights to be the optimal thing? Rohin Shah: There was a reason for this. I'm struggling a bit to remember it. I'm guessing - partly it's that, you know, actually setting the weights takes some amount of time and effort. It's not a trivial - there's not a trivial way to do it. It depends a bit on your transition function, depends a bit on your reward function. It only becomes - it's able to express literal optimal value iteration when the horizon is long enough, which I think it was not in my case. And also I believe you might need multiple convolutional layers in order to represent the transition function and the reward function. But I am not sure. It's possible that those two things happen because the horizon - because of the horizon issue. Daniel Filan: Okay. Rohin Shah: Yeah. I did at one point, actually, just sit down and figure out what the weights were to encode optimal value iteration, because I was very confused why my value iteration network was not learning very well. And then I found out that, oh, I can't do it with my current architecture. But if I add an additional convolutional layer to the part that represents the reward, then I can. So I added that and then even the learned version started working much better, it was great. Daniel Filan: OK. I mean, if you knew that adding this could let you express optimality. When, I don't know, I guess I have a vendetta against people learning things. Rohin Shah: There is a reason - Daniel Filan: Against people training models that learn things. I'm in favour of human learning. To clarify. But you were saying. Rohin Shah: So there was a reason. The version that I wrote computed the optimal policy but did not compute the optimal explanation for human behaviour under - so (a) it wasn't Boltzmann rational. 
And (b), it's like computation of the Q-values is kind of sketchy. I don't remember if it was - I don't think it was actually the right Q-values and I don't remember why. It might just be that it didn't work for Boltzmann rationality. But you'd get the case where it'd be like if, you know, up was the correct action, you'd get up having a value of like 3.3, and then left having a value of like 3.25, and so on. And so, you get the right optimal policy, which is the main thing I was checking. But you wouldn't actually do very well according to the loss function it was using. Daniel Filan: OK. So I guess the other half of this question is, what biases - you had this list of biases at the start. Are value iteration networks able to express these biases, and in general, what kinds of biases are they able to display? Rohin Shah: That's a good question. I mean, in some sense, the answer is, are - anything, they can express arbitrarily - like, they're neural nets - Daniel Filan: I mean your value iteration networks. They had a finite width and a finite depth, right? Rohin Shah: Yeah. My value iteration networks? Hard to say. I think they could definitely express the underconfidence and overconfidence ones. Well, sorry. Maybe, now even that, you have to compute the amount by which you should be underconfident or overconfident, I'm not sure the lay- there were enough layers to do that exactly. I think in general the answer is the networks I was using could not in fact, literally exactly compute the biases that I was doing, but they would get very, very close in the same way that neural nets usually cannot just - depending on your architecture, they can't exactly multiply two input numbers, but they can get arbitrarily close. Daniel Filan: OK. So I guess my - this leads to another question of, I guess, the setup for your paper. You sort of encode biases as having this - as having your planner be the slightly wrong value iteration network. 
At least at the time of doing inference. But interestingly, you assume that the world model is accurate. Now, when I think of bias, like either cognitive bias or, you know, the kind of thing that I might read in Thinking Fast and Slow, a lot of it is about making wrong inferences or having a bad idea of what's going on in the world. So I was wondering why you chose to have, you know, bias specifically in the planning phase and the focus on learning that. Rohin Shah: Yeah. I think you - I mean, two answers. The first answer is, you know, this was the one that I had some idea of how to deal with. Daniel Filan: All right, that's fair. Rohin Shah: Second answer: You can view - the ones where you're - you have a bad model of the world, you can also view that as your planning module transforms the true model into a bad model and then does planning with that bad model. This is actually what value iteration networks kind of do, they learn the transition function, the weights of the value iteration network encode what the transition function is. Daniel Filan: I mean - Rohin Shah: Or sorry, that's - Daniel Filan: If I were really - Rohin Shah: Yeah. Daniel Filan: Yeah. If I if I were interested in understanding humans, it seems to me the typical human bias is not well modelled by, somewhere in your brain has the exact model of how everything in the world works. Right? Rohin Shah: That seems right. Yeah, I mean - Daniel Filan: Point taken about how you've got to focus on something. Rohin Shah: Yeah. I think in this case, it was actually that the transition dynamics were the same across all the environments and the value iteration network was allowed to learn a warped version of them. Which is not the same thing as when humans look at the world, they misunderstand what the transitions are. This is more like when we came up with our model of what the human planner was doing, we put into it this incorrect model of how the world works. So that is still a difference. 
But it isn't like we learnt a planner that gets the correct transition dynamics and then warps them. I know I said that earlier. I more meant that the optimisation was doing that. Or possibly I just said the wrong thing earlier. I'm not entirely sure which I did. Daniel Filan: So I guess next, our listeners are probably wondering what happened. I gather you did some experiments and got some results. Could you describe briefly, roughly, what experiments you ran? And overall, what were the results? Rohin Shah: Yeah. So it was pretty simple. We just simulated a bunch of biases in gridworlds. And we - let's see - oh, I'll just look at the paper. It was a naive and sophisticated version of hyperbolic time discounting, a version where the human was overconfident about the likelihood of their actions succeeding. Another one where they were underconfident about the likelihood of their actions succeeding. And one where the human was myopic, so wasn't planning far, far out into the future. So we had all of these biases. We would then generate a bunch of environments and simulate some human behaviour. Simulate human behaviour on those environments. So this created a dataset of environments in which we had human behaviour and we also had the ground truth rewards that were used to create that behaviour. So we had a metric to compare against. Then we would take this dataset and then depending on whether we were using assumption 2a or 2b, we'd either remove all of the reward functions or only half of the reward functions and give it to the algorithm and it had to predict what the reward functions - it had to predict the reward functions that we didn't give it. And then it was evaluated on how well it could predict those reward functions. In particular, it was, if we then optimise the inferred reward functions, how much of the true reward function would that policy obtain? 
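The evaluation described here can be sketched in a few lines: plan optimally for the inferred reward, then score the resulting policy under the true reward. The following is a toy illustration under stated assumptions, not the paper's setup: it uses a one-dimensional corridor rather than a gridworld, and every name and number (`greedy_policy`, `true_return`, the reward vectors, the start cell) is hypothetical.

```python
def greedy_policy(rewards, gamma=0.9, iterations=100):
    """Plan optimally for `rewards` on a 1-D corridor.

    Returns, for each cell, the move (-1 left, 0 stay, +1 right) chosen
    by value iteration. Rewards are collected on entering a cell.
    """
    n = len(rewards)
    moves = (-1, 0, 1)
    values = [0.0] * n

    def step(s, m):
        # Walls clamp the agent to the ends of the corridor.
        return min(max(s + m, 0), n - 1)

    def q(s, m):
        nxt = step(s, m)
        return rewards[nxt] + gamma * values[nxt]

    for _ in range(iterations):
        values = [max(q(s, m) for m in moves) for s in range(n)]
    return [max(moves, key=lambda m: q(s, m)) for s in range(n)]


def true_return(policy, true_rewards, start, gamma=0.9, horizon=50):
    """Discounted true reward collected by following `policy` from `start`."""
    n = len(true_rewards)
    s, total = start, 0.0
    for t in range(horizon):
        s = min(max(s + policy[s], 0), n - 1)
        total += (gamma ** t) * true_rewards[s]
    return total


true_rewards = [1.0, 0.0, 0.0, 0.0, 5.0]
good_guess = [0.8, 0.0, 0.0, 0.0, 4.0]  # hypothetical inference: wrong scale, right peak
bad_guess = [4.0, 0.0, 0.0, 0.0, 0.5]   # hypothetical inference: wrong peak

optimal = true_return(greedy_policy(true_rewards), true_rewards, start=2)
good = true_return(greedy_policy(good_guess), true_rewards, start=2)
bad = true_return(greedy_policy(bad_guess), true_rewards, start=2)
```

In this toy case the first guess gets the magnitudes wrong but still locates the highest reward, so its policy earns the full true return, while the second guess sends the agent to the wrong end of the corridor and earns much less.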
Daniel Filan: So importantly, you're measuring the learned planner, learned reward pair rather than just the learned reward function. Rohin Shah: No, sorry. We take the learned reward function and we optimise it with the perfect planner, not the learned planner. Daniel Filan: Oh okay. All right. So you're evaluating the reward function by if you planned perfectly with it, how much reward you would get, which is a pretty stringent standard. Right? Because if your reward function - well, actually in a tabular setting, if your reward function is a little bit off then the optimal policy only gets a little bit less reward. But, in general, this can be a little bit tricky. Rohin Shah: Yeah, it's a stringent setting in the sense that, at least in the environments we were - it's kind of stringent and kind of not stringent. In the environments we were looking at, it mostly mattered whether you got the highest reward gridworld entry correct, because that was the main thing that determined the optimal policy. It was not the only thing, but it was the main thing. So you needed to get that correct, mostly. But it also mattered to know where the other rewards were, because if you can easily pick up a reward on the way to the best one, that's often a good thing to do. Sometimes, I forget if this was actually true in the final experiments that we ran, but sometimes if the reward is far enough away, you just want to stay at a maybe slightly smaller but closer reward and just take that instead. But for the most part, you mostly need to predict where is the highest reward in this grid world. Daniel Filan: OK. And what kind of results did you get? Rohin Shah: So if you don't make any assumptions at all. Well, if you take assumption 1 but not assumption 2a or 2b, the ones that let you get around the impossibility results - so basically you just don't run the initialisation where you make the neural net approximately optimal - then you just get basically zero reward on average. 
So that was the first result: yep, the impossibility result, it really does affect things. You do have to make some assumptions to deal with it. Then for the versions that actually did use assumptions, we found that it helped relative to assuming either a Boltzmann rational model of the human or a perfectly optimal model of the human. But it only helped a little bit. And this was only if you controlled for using a differentiable planner - sorry, from using this planning module, because it introduces a bunch of approximation error. So when I say we compared to having a Boltzmann rational human model, I don't mean that we use an actual Boltzmann rational model. I mean, we simulated a bunch of data from Boltzmann rational models, trained the planning module off of that data. And then use that trained planning module to infer rewards. And this was basically to say, well, differentiable planning modules are not very good. We want it to be consistent across all of our comparisons. But that's obviously going to not hobble, the - well, it's not going to hobble it relative to the others. But if you were just going to assume the Boltzmann rational model, you wouldn't need to do this differentiable planner thing, and so you would do better. Daniel Filan: So basically you said, assuming either 2a or 2b did really help you a little bit compared to assuming neither, but using a differentiable planner, but using a differentiable planner is quite a bit of loss. Rohin Shah: That's right. Daniel Filan: OK. Do you have info about how 2a and 2b compared, if you only wanted to use one of them? Rohin Shah: Yes. So you should expect that 2a does quite a bit. I mean I expected that 2a would do - sorry, I forget which is which - I expected that 2b would do quite a bit better than 2a, so like - Daniel Filan: For our listeners who might've forgotten what 2a and 2b are - Rohin Shah: Yeah, I was about to say. 
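For readers unfamiliar with the term, a Boltzmann rational human model is usually defined as one that picks actions with probability proportional to the exponential of their value. The sketch below shows that standard definition only; the function name and the rationality coefficient `beta` are assumptions for illustration and do not come from the paper's code.

```python
import math
from typing import List, Sequence


def boltzmann_policy(q_values: Sequence[float], beta: float) -> List[float]:
    """Action probabilities proportional to exp(beta * Q(s, a)).

    beta = 0 gives a uniformly random agent; large beta approaches
    a perfectly optimal agent that always takes the best action.
    """
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

With `beta` set to zero the model treats every action as equally likely, and as `beta` grows the probability mass concentrates on the highest-valued action, which is why the perfectly optimal model can be seen as a limiting case.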
If you knew - I thought that it would help significantly to know what the reward functions were. So, you know half the reward functions and that lets you ground your learning of the biases. Daniel Filan: And you thought that would help more than assuming - than training your learnt model starting from optimality? Rohin Shah: Yeah, because in some sense, there's now some sort of ground truth about the biases. There's a good learning signal. There's no impossibility result that you're trying to navigate around. It seems like a much better situation. And it did do a bit better, but only a little bit better, I was surprised at how - Daniel Filan: Only a little bit better than what? Rohin Shah: If you had assumption 2a instead where you just initialise to optimality. Daniel Filan: So, so slightly better to know the rewards compared to assuming that it was close to optimal, but not actually very much better. That's what you're saying. Rohin Shah: Yep, that's right. Yeah. And like really it was a lot better in two of the conditions that we checked in, a little bit worse in a couple of the conditions we checked in and on average, it washed out to be a little bit better. Daniel Filan: Okay. All right, that's interesting. So one thing that's interesting to me about your results section. If I read section 5.2, it comes off as more scientific than most machine learning papers, I want to say. Just in the sense that it seems to be interested in carefully testing hypotheses. And in the paper you have these headings of manipulated variables and, you know, dependent measures and various comparisons. So I guess I'm wondering, firstly, why did you adopt this slightly non-standard approach and, maybe related to this, what do you think of the scientific rigour of mainstream machine learning? Rohin Shah: Oh man, controversial questions. Daniel Filan: Let's start with the easy one. Why do you adopt that approach? 
Rohin Shah: I mean, the actually correct answer is I am advised by Anca. Daniel Filan: Who's Anca? Rohin Shah: Anca Dragan is a professor at UC Berkeley. She is one of my advisors. And of my advisors, she had the most input into this paper. And she is very big on doing experiments more in the style of normal scientific experiments as opposed to the typical ML experiments. So the first answer is because my advisor is Anca. But I do agree with Anca about this, where... There definitely is a point to the normal ML way of doing experiments where, this is an oversimplification, but the point is basically to show that you do better than whatever previously happened. This structure does lead to significant progress on any metric that is deemed to be something that you can do this sort of experiment on. And so it does tend to incentivise a lot of progress in cases where we can crystallise a nice metric. I am less keen on these sorts of experiments, though, because I don't see the main problems in AI research as "we have these metrics and we need to get higher numbers on them". I think that there are much more - all of the things in AI that are interesting to me, even if we set aside AI safety particularly, look more like, "Oh my God, what's going on with deep learning? It's got all these crazy empirical facts that I wouldn't have predicted a priori. What's going on there? Can we try to understand it?" Those are the types of questions that I'm interested in and for those sorts of questions, if you are running an experiment, you would run experiments to learn new information. If you already know how the experiment is going to come out, it's kind of a pointless experiment to run. The point of running that experiment is to successfully publish a paper. And I've done my share of that. But those aren't the experiments that I'm usually excited about. Daniel Filan: All right. 
So I guess now that we're getting more philosophical: my understanding is that you think of yourself as an AI alignment researcher or an AI safety researcher or something. Is that right? Rohin Shah: Yes. Daniel Filan: So. How do you see this paper fitting into some path to create safe or aligned AGI? Rohin Shah: Yeah, so I think there's like the path that I mentioned way back when at the beginning of this podcast. Daniel Filan: So that was the idea that we were going to learn some utility function. And just, whenever we had a task we wanted an AI to do, we would have a human do that task and then have the - or, what does that even look like? Can you say more? Rohin Shah: Yeah. I mean, I'm not particularly optimistic about this path myself. But it's not an obvious - I feel like while relying just on learning a reward function from human behaviour that can then be just perfectly optimised, I think I'm fairly confident that that will not work. But it seems likely that there are plans that involve learning what humans want and having better methods to do that seems valuable. Whether it had to specifically be about systematic biases and whether they can be learnt, I think that part I feel is less important at this point. But you do need to account for human biases at some point. So to outline it, to outline maybe more of a full plan or something, you could imagine that we build AI systems and we're training them to be essentially helpful, to be good personal assistants with superhuman capabilities at various things. But still thinking the way personal assistants might do in the sense of, they're not sure what your preferences are, what you want to happen, and so they need to clarify that with you. And so on. This feels like one of the subtasks that such an agent would have to do. Daniel Filan: Inferring what you want? Rohin Shah: Yes, exactly. 
Daniel Filan: I guess if I think about machines that I employ to do things for me - if I want video conferencing software or even if I were to get an employee, not that I've done that, usually the way I get good video conferencing software is that I do not first demonstrate the task of relaying video and audio from one place to another very quickly, right? Because I just can't do that. I can't even do an approximation of that. And similarly, with employees - I guess there's probably a little bit of instruction by demonstration, but I don't think that that's the main way we communicate tasks to people. Right? Am I wrong about this? Rohin Shah: No, that seems right. I think you're conflating the evaluation that we did with the technique. The technique is trying to infer what the reward is, right? So I think a better analogy would be, I don't know, since it's sort of assuming optimal demonstrations, it's more like assuming that you have a magic camera that gets to watch the person as they go about their day to day life, and I guess also they're not aware of this camera, which is not a great assumption. But let's assume that for now. So, you just sort of watch the human go around with their life and you're like, "OK. Based on the fact that they, you know, had cake today, I can infer that they would like cake or something." But maybe then you're like, "Oh, no. But actually, humans have this short term bias, so I shouldn't infer that they don't care about their lifespan or their overall level of health. It could just be that even though reflectively they would endorse being healthy, having a long lifespan, they, in the moment, went with a short term preference for nice sweet cake." Daniel Filan: OK. I want to, this always comes up in these discussions, and I want to defend eating cake. I feel like you can care about your lifespan and also eat some cake. Rohin Shah: Wait. Yes. That's totally true. 
I'm just saying you don't want to over update on it. Daniel Filan: You don't want to - sure. But I mean, we already know that - I don't know. Presumably your AI spy camera is also going to see you put on a seatbelt. Right? I feel like there's - I don't know. I feel like there's already a bunch of information about you caring about your life and like eating a bit of cake is not strong evidence that you don't actually care at all about your future lifespan, even under a naive - even if you assume that people are acting optimally. I just feel like all of these examples are super biased against cake. You know - Rohin Shah: Fair enough. Daniel Filan: "Oh, we wouldn't want people to think that eating cake is ever a good idea, or is a human value." You know? Sometimes cake's nice. Rohin Shah: OK. Sure. I'll stay away from the cake. But I think I wouldn't say that - if you assume perfect optimality in that situation, I think you don't learn something like, "Oh, humans care about having a long lifespan." You learn something more like "Humans don't want to be in violent accidents, but they don't mind dying of whatever it is that cake causes." You can always inject more and more state variables to distinguish behaviours in order to explain why humans seem to not care about their life in one case, but do care about their life in the other case. And that's also a thing you don't want your systems to do. Daniel Filan: So I guess going up a level, is the idea that I'm going to - that the way AGI is going to work is: the product that AGI Corp., the corporation that sells AGI, this product is going to be, OK. You're going to have this - you'll wear it like a GoPro on your head for a while or something. And the system is going to just learn roughly what you value in life. And it's just going to generically do things to get you more value. To make your life more like you want it to be. Is that it? Rohin Shah: Plausibly? Daniel Filan: That seems a little bit scary, I don't know. 
I think I want a product to do a thing. Right? Rohin Shah: Sorry, say that again? Daniel Filan: I guess it seems to me that, if I want a superintelligent system, I'd rather first - I'd rather have a superintelligent system that did a well-defined task rather than generically making my life better, or at least I think, when I think about AI alignment, it seems like we would be able to figure out how to create an aligned, safe AI that does one task before we figured out how to create an aligned AI that generically makes all of your life better. And, I'm kind of on board with the arguments that we don't know how to have an aligned AI that does one concrete task, even if, you know, you can still allow problems of vagueness of specifying what the task is if the task's 'build a thriving city' or something. But I find it weird that so much of the field is about 'generically, make your life good'. Rohin Shah: I feel like there's not that much of a difference between these. 'Build a good city' seems pretty similar to 'build an agent that generically helps me'. And it turns out that in this particular case, what I really want to do is make a city. You would still want these personal assistants to defer to you. And if you give explicit instructions to obey those, I think you do want to shoot for that. So it doesn't feel like those are that different in terms of the actual things that they would do. In terms of research strategies for how to accomplish this, I guess the versions where we're trying to build AGIs that can do these broad, vague tasks, well, we're saying, OK, we need to first figure out how to make an AGI that does this task. It seems to me - I just don't see what benefits this gives, how this makes the problem easier than just 'build an AGI that's trying to help me.' Daniel Filan: So, I mean, one benefit is, well, it just seems - OK, let's go with the city task first, right? 
The thing with the city task is, you're not - I think the way to do it is not 'OK, I'm going to watch a human try to do urban planning for a month or something, and then I the AI am going to take over.' That's not what you're suggesting, right? Is it actually, I can't tell. Rohin Shah: No. So, for the city planning case in particular, this algorithm is going to be, I mean it might infer some sort of common sense details, but it's not going to infer what a good city looks like because you just don't get information about that by looking at a single human. So you need something else. So I think I'm more, let's leave this paper aside, I am not at all convinced that it will matter at all for AGI alignment. I would not bet on that. Daniel Filan: All right. Rohin Shah: But I do think the general idea of, 'oh the AI system is going to try to help us and will be inferring our preferences as part of that', that is something I'm more willing to stand behind. I think in this case, it looks more like, the AI system, when you tell it, "Hey, please design me this city", it goes around and reads a bunch of books about how to design cities, if such books exist, I don't know. It looks at what previous urban planners have done. It maybe surveys the people who are going to live in the city to figure out what they would like in a city, it periodically checks in with you and says "This is what I'm planning to do with the city. Does that seem good to you?" And if I then say, "This is the reason that's bad, this doesn't seem good to me for X reason", they can say "Well, I chose that for Y reason because I thought you would prefer Z, but if you don't, then I can switch it to this other thing." I'm maybe rambling a bit here. Daniel Filan: So I guess you're imagining, OK, we're going to have these AI systems, they're going to do tasks that are kind of vague. 
But the way they're going to do that is they're going to infer human preferences, basically in order to infer what we mean when we say 'please build a good city'. And they're going to do that by a whole bunch of sources of information, and maybe one fifth of what they're doing is looking at people who are trying to do the task and trying to infer what the task was, assuming they were, you know, doing a good job of trying to do the task. Rohin Shah: I don't know about one fifth. Daniel Filan: Well, you said a lot of things and very few of them sounded like inverse reinforcement learning to me. Rohin Shah: Sorry, I was imagining much less than one fifth, to be clear. Daniel Filan: OK, all right. A small amount. I'm happy with less than or equal to one fifth. And I guess this gets a bit into other work of yours that you've collaborated on that we won't be talking about right now. But I guess I can kind of understand this. But to answer a question you asked me a little bit earlier, I think the reason that I want to do the 'plan a city' rather than 'generically make my life better'. Firstly, I think I want to be clear that if indeed we are trying to create an AGI that can plan a city rather than generically make your life better, if somebody says that out loud, then we can check if our research helps with that. So that's one reason to be a little bit clear about it. The reason to prefer planning a city to generically making your life better is firstly, intuitively it seems like an easier job. If you're planning a city, the way you're generically making my life better is by planning a city that's good. If you're generically making my life better, you're generically making my life better in every single way, including presumably the way of maybe occasionally planning a city to make my life better. So it must be strictly easier to do the first thing. 
And then thirdly, I prefer the 'plan a city' type task, because at the end you can check if a city got planned, and roughly evaluate the city and see if it's a good city. And that seems like a target that you're going to know if you hit it more easily than you're going to know if this AI system generically made your life better. I'm wondering what you think of those points. Rohin Shah: Yeah. I think I didn't understand the first point, but maybe I'll talk about the second and third first, and then we can come back to the first point. Daniel Filan: I'll say the first point. Maybe it wasn't responding to anything, but I think that the first one is just we should try to be clear what problem our research is trying to solve. And I guess the second and third point are arguments that planning a city and generically making your life better are importantly different problems, and about why one of them is better than the other. Rohin Shah: Yeah, I mean, first point seems good. Being clear about what we're trying to do seems good. Daniel Filan: It's so rare in papers, right? Rohin Shah: It is. It's annoyingly difficult to get papers that say exactly what we're trying to do to be published. It's very sad. I tried with NeurIPS. We'll see whether or not it works. But. For the second point, I agree that the like planning a city type task must be strictly - should be strictly easier if you're imagining that your 'help me generically' AI system could also be asked to plan a city. I think that's more of a statement about capabilities though, where I'm like, OK. But I sort of see the safety, the alignment, the good properties that we're aiming for here in AI safety and AI alignment coming from the 'trying to help you' part. And we can have different levels of competence at helping. So, maybe initially, we just have agents that are trying to help us schedule meetings on a calendar and, you know, not doing anything beyond that because they're not competent at it. 
And, you know, as we get to more general agents we'll need to ensure that these agents know what they can and cannot do so that they don't try to help us by doing something that they are incompetent at, where they just ruin things without realising it. And that's one additional challenge you have here. But I sort of see this as, most of the safety comes from the agent being helpful in the first place. And that's the reason I'm aiming for that instead of the things like 'plan a city'. Remind me what the third point was? Daniel Filan: The third point was that you can check if you've succeeded at planning a city more easily than you can check if you've generically had your whole life been a bit better. Rohin Shah: Yeah, I guess I don't see why that's true. It seems like I totally could evaluate whether my life is better as a result of having this AI system. Maybe the AI tricks me into thinking my life is better when it's not actually but the same thing can happen with a city. Daniel Filan: I mean, to me, it seems like building a city is a better defined task. Hmm, I guess there are so many ways... Yeah, maybe this is wrong. It just feels like there are so many more ways in which my life could plausibly be better. But at least with the city. It feels like I can check if it's there or not. Rohin Shah: Yeah, I mean, I think - So, it depends a little bit on how you're going to quantify the helpful part. Maybe you're just like, you know, as one metric, did the AI assistant follow the instructions that I gave? Was it competent at those instructions? Did it infer something that I wanted without me having to say it or something like that. But I feel like we can, or I would guess that at least all the people who hire personal assistants are in fact able to tell whether those personal assistants help them or not. And hopefully they can distinguish between bad and good personal assistants. 
Daniel Filan: I mean, I think that's because they give the personal assistants specific tasks like 'please do my taxes'. Rohin Shah: Yes. And we will plausibly do that, too. But even our personal assistants can take - often, I expect, take a lot of initiative. I guess one example for me personally is I write the alignment newsletter and, I guess not that recently anymore, for quite a while now it's been published by somebody else, specifically Georg [Arndt] from the Future of Humanity Institute, and also Sawyer [Bernath] from BERI helps run it as well. And at some point I was like, "You know, we should probably switch from this pretty plain template that MailChimp has to something that has a nicer design". And I mostly just said this. And then Sawyer and Georg just sort of did it, periodically sending a message being like "This is the plan". And I'd be like, "Yep, thumbs up". And then at the end of it, there was a design. It was great. And in some sense, I did specify a task, which is, 'let's have a pretty design'. But it really felt like it was a fairly vague kinda sorta instruction but not really, that they just then took and executed well on. And I sort of expect it to be similar for AI systems. Daniel Filan: I think that's fair. Yeah, I guess the other thing I want to pick up on was in your answer, it sounds like you think there is this core of 'being helpful'. That like there's some technique to be helpful. And, you know, you can just be better or worse at it. And once you know how to be helpful, you can be helpful at essentially anything. That's my interpretation of what you said. I'm wondering to what extent you think this is right, and that you see the important part of the AI alignment or safety community as trying to figure out computationally what it means to be helpful. Rohin Shah: That seems broadly correct to me. 
I think I wouldn't say that it's the entire AI alignment community's thing that they're doing, I think there's a subset that cares about this and a different subset does other things - Daniel Filan: So my question was whether you think that's what they should be doing. Whether that's what the problem is. Rohin Shah: I think there's some meta level outside view that's like, "oh, man, we should be encouraging diversity of thought" or whatever. But if you were like, "what is your personal thing that seems most promising to you such that you'd want to see at least the most resources devoted to it", yeah, I think it's right to say that that would be AI that is trying to help you, or trying to do what you want, is how I think Paul Christiano would phrase it. Daniel Filan: It sure seems like there are a lot of sub problems of that. If I imagined that being my main thing, it's like, how much have I made my life easier? Rohin Shah: It does feel like it's got a domain independent core or something. If you look at assistance games, which were previously called Cooperative Inverse Reinforcement Learning games or maybe just Cooperative Inverse Reinforcement Learning, I feel like that is a nice crisp formalisation of what it means to be helpful. It's still making some simplifying assumptions that are not actually true. But it really does seem to incentivise quite a lot of things that I would characterise as helpfulness skills or something. It incentivises preference learning, it incentivises, you know, asking questions when being unsure, it incentivises asking questions only when they become relevant and not asking about every possible situation that could ever come up at the beginning of time. It incentivises learning from human behaviour, passively observing and learning from human behaviour, which is the sort of thing we were talking about before. So I don't know. It feels like this is a thing that we can, in fact, get agents to do in a relatively domain-independent way. 
And if we succeeded at it, then there would not be existential risk any more. Daniel Filan: OK. Well, on that note, I think we've had a good conversation. Hopefully our listeners understand the paper a little bit better. But of course, I would recommend reading it. The name of the paper is "On the feasibility of learning, rather than assuming, human biases for reward inference". Today's guest has been Rohin Shah. Rohin, if viewers wanted to follow your work, what should they do? Rohin Shah: Yeah. I mean, the most obvious thing to do is to sign up for the Alignment Newsletter. This is a newsletter I write every week that just summarises recent work in AI alignment, including my own. So that's a good place to start. It's also available in podcast form. Other things, I write some stuff on the Alignment Forum so you could go to alignmentforum.org and look for my username, just search for "Rohin Shah". And I think the last thing would be the papers that I've written, links to them, are all available on my website, which is rohinshah.com. You can also find a link to sign up to the Alignment Newsletter there. Daniel Filan: All right. Thanks for today's interview, Rohin. Rohin Shah: Yeah. Thanks for having me.

### AXRP Episode 1 - Adversarial Policies with Adam Gleave

December 29, 2020 - 23:41
Published on December 29, 2020 8:41 PM GMT

Daniel Filan: Hello everybody, today I'll be speaking with Adam Gleave. Adam is a grad student at UC Berkeley. He works with the Center for Human Compatible AI, and he's advised by Professor Stuart Russell. Today, Adam and I are going to be talking about the paper he wrote, Adversarial policies: Attacking deep reinforcement learning. This was presented at ICLR 2020, and the co-authors are Michael Dennis, Cody Wild, Neel Kant, Sergey Levine and Stuart Russell. So, welcome Adam. Adam Gleave: Yeah, thanks for having me on the show Daniel. 
Daniel Filan: Okay, so I guess my first question is, could you summarize the paper? What did you do, what did you find? Adam Gleave: Sure. So, the basic premise of the paper is that we're really concerned about adversarial attacks in machine learning systems, and most adversarial attacks people have talked about have assumed this kind of Lp-norm threat model, where you take some existing input and you add a small amount of perturbation to that input, and then something like an image classifier drastically changes its classification accuracy. And probably people have seen examples of this where you add some white noise to a panda and this completely changes the classification, it's a very striking example. But, often what we care about isn't really the performance of image classifiers because they're just outputting a label that doesn't directly have an effect on the world. Adam Gleave: We're concerned about the behavior of entire systems, and reinforcement learning is a technique to train policies that actually take actions in the world, so the stability, robustness of reinforcement learning, is potentially of much more importance than with image classifiers. People have done research in the past looking at porting adversarial examples from image classifiers over into deep RL, so there was some prior work by Sandy Huang and others, and Jernej Kos on this. Adam Gleave: And showed that basically the same attack succeeds, but what we wanted to do in this work was, come up with a threat model that was more appropriate to reinforcement learning, because you don't normally have the ability to just add arbitrary noise to some robot sensor, if you already have that level of control, there's much easier ways of breaking the robot. So, we're modeling the adversary as being another agent in the shared environment and this adversarial agent can take the same set of actions as the victim agent that's being attacked, and the actions can indirectly change the observations. 
And we found that even under this much more restricted threat model, that was sufficient to cause the victim to fail in quite surprising ways. Daniel Filan: Okay, and could you say a little bit about what environments and what tasks you explored for this work? Adam Gleave: Sure. Yeah, so the tasks we were using were all simulated robotics environments, they were two player zero-sum games. So, the policies that we're attacking were trained via self-play, to win at these zero-sum games you'd expect them to already be quite robust to adversarial behavior because they were playing against an opponent that was trying to beat them, during training. But, we found that despite this, self-play isn't robust to these adversarial policies, and we think that's probably because self-play is only exploring a small amount of the possible space of policies. You can easily find some part of the policy space that it's not robust to. Daniel Filan: Okay. So, when you train these adversarial policies, how much were you training, and for how long were the, I believe you call them, the victim policies originally trained? Adam Gleave: Yeah, so the victim policies were originally trained for at least 500 million time steps, and I think up to two billion time steps. They were actually trained by Bansal and others, a team at OpenAI, so we didn't train the policies, but they were considered to be state of the art at the time, and our adversarial policies were trained for no more than 20 million time steps which is still a lot in absolute terms, but it's only a tiny fraction of what the victim policies were originally trained with, and is reasonably sample efficient for deep RL and these kinds of environments. Daniel Filan: Okay, so it's not the case that you managed to train for a ton of time to defeat these policies, it was relatively cheap? 
Adam Gleave: Yeah, I mean, these are all experiments that you can run on a desktop PC in under 24 hours, so it's not really, really cheap, you don't want to run it on your laptop in real time. But it definitely doesn't require Google-scale compute to pull off these attacks.

Daniel Filan: Okay, and what do the attacks look like, if you train one of these adversarial policies, what does it do?

Adam Gleave: Yeah, that's a great question. I think one of the most shocking results here isn't just that we can exploit the victim but that we exploit them in this really surprising way. So, the tasks we were using were normally these kinds of simulated robotic games, so one of them was you had a penalty shootout in soccer, where there's a kicker and a goalie trying to defend the goalposts, and we substituted in this adversarial goalie, which made no attempt whatsoever to block the ball, it just fell over and wriggled around on the ground putting its limbs in this really contorted position.

Adam Gleave: And so, to a human looking at this, it just looks like basically completely uncoordinated chaotic behavior, but it actually causes the kicker to fail to kick the ball, and sometimes the kicker will even fall over, whereas it's a very stable policy normally. And it's not just seeing something off distribution, because you might think, "Well, if a goalie fell over in front of you in real life, you might also be a bit confused what's going on." But, we tried just this random policy that takes completely uniformly random actions, which visually looks pretty similar to the adversary, but this didn't have the same effect on the victim. So, it really is about finding this very specific type of behavior that triggers a glitch or a bug in the victim policy.

Daniel Filan: Yeah, I'll say to listeners, the kind of behavior that you see is quite striking, and there's a website I recommend, adversarialpolicies.github.io. This medium, of course, is...
The medium of podcasting is not actually great at conveying images, but you can go there and look at these videos. So, speaking of robotics tasks, I guess there are a few different multiplayer RL environments that you can play with, right? Why did you choose robotics specifically?

Adam Gleave: Sure, that's a good question. So, one motivation was, we wanted something that was closer to realistic attacks, and robotics is an environment where people are at least beginning to transition from the lab to deployment. It's one of the main actual motivating use cases for deep RL, so having robots that are actually robust to these kinds of attacks is actually important, whereas some other environments are more just toy proofs of concept. Another important reason we chose this environment was that the number of dimensions that the adversarial policy can influence is reasonably large. So, a detail of this environment is that both agents see the other agent's position, and this position is their center of mass, the positions of their joints, and this is normally between 10 and 20 dimensions, so it's not huge, it's not an image-based observation, but it's enough degrees of freedom that an adversarial policy can actually confuse the victim just by its body positioning.

Adam Gleave: Whereas if we'd gone for some of those really simple point-mass particle environments that some people use in multi-agent RL, it would have been pretty hard to pull off this kind of attack, I suspect, because you've only got an XY coordinate to play with, and that's just not really high dimensional enough to pull off an attack. We need this minimum level of complexity to demonstrate it, but we wanted to choose something that wasn't too complex because that just obscures the results, makes it harder to replicate and run.

Daniel Filan: Yeah, although it is the case that sometimes you'll have policies trained on recurrent models, right?

Adam Gleave: Mm-hmm (affirmative).
Daniel Filan: Which, in some ways increases the dimensionality of the observation, right?

Adam Gleave: Yeah, but-

Daniel Filan: Do different things at different times.

Adam Gleave: Yeah, that's right. So, we are actually looking now, in some follow-up work, at rock paper scissors. That's a very, very simple game and you probably played it in kindergarten. And if you're playing against an RNN that sees all the sequence of your actions, then it is actually quite a high dimensional space. And it's a non-transitive game, meaning that there's no single dominant deterministic policy, because rock beats scissors, scissors beats paper, and so on. This means that unless your opponent perfectly randomizes, if you're able to predict what your opponent is going to do, you're going to have some advantage over them. So, we are actually trying to come up with some adversarial policies in that setting, and it's still early work, but it looks like it is possible at least with some kinds of training setups.

Daniel Filan: That's interesting. So, from that I'm gathering you think that these results are representative of whenever you're in an environment where there's another agent that has control over many degrees of freedom, that the victim is observing or depending on?

Adam Gleave: Yeah, I think that basically whenever you're training a policy via self-play, which is a very, very common technique used in AlphaGo, AlphaStar, a bunch of other results, then if there's enough dimensionality that you're just not going to have seen all the possible observations at training time, there are probably going to be some areas where the neural network policy you learn is going to generalize badly, and an adversary is going to be able to push you towards those states, and then your performance is going to degrade.
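The rock paper scissors remark above can be made concrete with a small sketch. This is an illustration only, not code from the paper or the follow-up work: the function names and the example probabilities are assumptions. It computes the expected payoff of each pure move against a fixed mixed strategy, which shows that a perfectly uniform opponent cannot be exploited while any predictable bias can.

```python
# Illustrative sketch (hypothetical names): exploiting a biased
# rock paper scissors opponent with a pure best response.

MOVES = ("rock", "paper", "scissors")
# Each key beats its value: rock beats scissors, scissors beats paper, paper beats rock.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}


def payoff(mine, theirs):
    """Payoff to the first player: +1 for a win, 0 for a draw, -1 for a loss."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def expected_payoffs(opponent_probs):
    """Expected payoff of each pure move against a fixed mixed strategy."""
    return {
        mine: sum(p * payoff(mine, theirs) for theirs, p in opponent_probs.items())
        for mine in MOVES
    }


def best_response(opponent_probs):
    """Return the pure move with the highest expected payoff, and that payoff."""
    values = expected_payoffs(opponent_probs)
    move = max(values, key=values.get)
    return move, values[move]


if __name__ == "__main__":
    uniform = {"rock": 1 / 3, "paper": 1 / 3, "scissors": 1 / 3}
    biased = {"rock": 0.5, "paper": 0.3, "scissors": 0.2}  # assumed bias, for illustration
    print(best_response(uniform))  # every move is worth about 0
    print(best_response(biased))   # paper is favored against a rock-heavy opponent
```

Against the uniform strategy every move has expected payoff of about zero, so there is nothing to exploit. Against the rock-heavy strategy, paper has a positive expected payoff of about 0.3, which is the advantage from prediction that the discussion refers to.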
Adam Gleave: Now, I think that one limitation of our work so far is, we've not attacked any truly state of the art policies. It'd be nice to try and attack something like AlphaGo, which has actually beaten high-level humans. The simulated robots we're attacking are pretty good kickers, but they're not about to win the World Cup. So, can we beat something that's truly state of the art? That's still an open question. My suspicion is that, if you are just attacking a neural network policy, the answer is still going to be yes, but when you start adding things like Monte Carlo Tree Search, which is used in AlphaGo, where you're actually looking ahead a few steps, then it's going to become much harder, because you have to not only fool the network immediately, you have to fool it even once it can see the consequences of taking a stupid action. So, it's really a lot more challenging.

Daniel Filan: Yeah. Although, in some ways adversarial policies still exist in this setting. So, if I think about Go, if I'm a human playing Go and I'm not an expert, one thing I'm vulnerable to is people playing moves that are sort of similar to standard openings, but require very different responses. And often I'll just make a mistake because I can't quite anticipate what their right follow-up moves are, and this will doom me to be down a couple points or something. I'm wondering if you think that that's - if that kind of thing that maybe listeners have experienced themselves, is analogous to what's going on here.

Adam Gleave: Ah, yeah. I mean, I think it's always hard to say exactly how human experience translates onto deep learning, but I think that's a pretty good intuition to have: at training time these policies have only seen a small set of possibilities, and so they're just pattern matching and saying, "Well, this is similar to something I've seen in the past."
So, now I've got this rule of thumb that I've learned, and it has always worked well for me in the past at training, and I'm just going to keep on applying this rule of thumb, and if you put the policy in a sufficiently extreme state then maybe it applies that rule of thumb too far, and it takes the wrong action.

Adam Gleave: And because we're able to sort of search systematically to find these kinds of examples, you're able to exploit it. And I do sort of have a bit of sympathy for the victim policy because what we're doing during training is we're freezing the victim, and then we're just keeping on training an adversary, and I know if someone was able to give me retrograde amnesia and just keep on playing chess against me again and again, they'd probably eventually be able to find some move where they don't just win, but I do something incredibly stupid, reliably.

Adam Gleave: I'm probably not actually that robust, and so a natural question to ask is, "Well, can we make policies that are at least going to be able to adapt really quickly to these kinds of mistakes?" And maybe you can fool them once, but once they've been exploited they realize, "Oh, that was a really stupid thing to do," and learn. So that's one thing we're working on right now, but it is quite challenging because generally RL training is quite sample inefficient, so these victim policies are trained for up to two billion time steps. So, you can lose many, many games of soccer before you even get to a million time steps, so we need to be able to really adapt very fast, to be able to be robust to these kinds of attacks at deployment time.

Daniel Filan: Yeah, seems like a rough challenge. So, moving on a little bit, in the introduction one point you make is distinguishing between perturbing the observations of a system, and being an agent that sort of acts and produces natural observations that are different from what the system has seen.
But if I think about these robotics environments, the input to the policies are joint locations and angles, right?

Adam Gleave: Yeah, that's right.

Daniel Filan: So, to what extent are these actually importantly different?

Adam Gleave: So, to what extent are they actually perturbed?

Daniel Filan: Yeah, or how is what you're doing really that different from the classic "perturb the input" adversarial example?

Adam Gleave: Oh, sure. Yes, so the space of possibilities is actually reasonably constrained. So, one invariant that has to hold in this space, for example, is that you can't have two limbs that are actually intersecting; the physics simulator won't allow that. There are normally limits on how mobile each joint is, and then there's also limitations where the adversarial agent is still trying to win at some game. So, in the case of a goalie, if it were to move outside of a certain region, like the goalkeeping region, then it's going to lose.

Adam Gleave: So, it's not like it's got complete freedom. I do think it's interesting to note that in one of the environments, which was a sumo wrestling environment, there were more constraints on the adversary's behavior: if the adversary fell over, which is basically what happens in other environments, it would lose. And we did still get surprisingly good results in sumo humans, but it wasn't actually outperforming the normal opponents, although it was getting basically the same performance as a normal opponent, despite not trying to knock the victim over, it was just kneeling in this strange position. So, at least in that case more constraints do seem to reduce the effectiveness.

Daniel Filan: And in sumo, the constraint is, you're not allowed to touch the ground with your... Is it anything above the knees or..?
Adam Gleave: I can't remember the exact constraint, I think that there's certain parts of your body that are not allowed to touch the ground, and I think there might be a constraint on your center of mass. You're also not allowed to leave the arena. So, you have to remain within the arena.

Daniel Filan: Yeah, of course. Okay, so that makes sense. Another question I had based on reading your paper: you use some of these asymmetric environments. So, for instance, one player is trying to kick a ball into a goal and one player is trying to be the goalie. In the videos, maybe I didn't look enough, but from what I can see it looks like the adversarial victim seems to always be the kicker, and the... Sorry, the victim is always the kicker and the adversarial agent is always the goalie. Did you always do that or did you try making the goalie the victim and it didn't work, or what's up with that?

Adam Gleave: Yeah, so I think we actually should revisit that. We did try both of them in early stage experiments and we decided to prioritize the kicker being the... Sorry, the goalie being the adversary because that seemed to be working a little bit better in the initial experiments, but then we improved our technique quite a lot. So, I don't actually know, if we were to revisit that now, whether we'd still be able to get a good adversarial policy out of a kicker. Again, my suspicion is that it's going to be harder because the kicker does have... So, the goalie can win by default if the kicker doesn't kick the ball, so it just needs to cause the kicker to malfunction.

Adam Gleave: Whereas for the kicker to win, it does have to kick the ball, so it needs to first make the goalie fall over and fail, and then get up off the ground and kick the ball, which is...
I suspect that an adversarial policy does exist, but it might be a lot harder to train with RL because you've now got to do this two-step thing, and we're generally quite bad at training policies to do one action after another.

Daniel Filan: Another way to think about it is that there's more constraints on the kicker's actions. It's got to kick the ball in, so that sort of determines what it can do with its legs, and maybe it could wave its arms around in a frightening way, but it's got fewer degrees of freedom, right?

Adam Gleave: Yeah, it's got fewer degrees of freedom. It does have the time element though, so it might be able to start off by almost falling over. I think it can't actually fall over, the episode would end then, but it can crouch down on all fours and do a lot of crazy things that don't involve kicking the ball. And then once the goalie's fallen over, then it can dust itself off and go on to stage two of kicking the ball, and eventually the episode will time out so it has to do this reasonably quickly, but it's not like it has to make the goalie fall over while en route to the ball, it can kind of do a feint and trick the goalie, and then kick the ball.

Daniel Filan: Okay, yeah. That makes sense. So, moving on a little bit. So, you not only found these adversarial policies, in section 5 you have some results on what happens if you ignore your opponent, how the dimensionality of the problem affects things, and also what's going on with the activations of the victim agents. Can you describe those results?

Adam Gleave: Yeah, sure. So, I think probably one of the most striking results was what happens when you effectively just blindfold or mask the victim policies so they can no longer see anything the adversary is doing, and we don't change any of the policy weights, we just replace the part of observations that normally corresponds to the opponent's position, with just a static value, sort of a typical value at initialization.
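The masking just described can be sketched as a thin wrapper around a frozen policy. This is a minimal illustration under stated assumptions, not the authors' implementation: the observation is assumed to be a flat list with the opponent's position in a known index range, and `MaskedPolicy`, `opponent_slice`, `fill_values`, and `toy_policy` are hypothetical names.

```python
# Illustrative sketch (hypothetical names and observation layout):
# blindfold a frozen policy by overwriting the opponent part of its observation.

class MaskedPolicy:
    """Wraps a policy without touching its weights; only the input is changed."""

    def __init__(self, policy, opponent_slice, fill_values):
        start, stop = opponent_slice
        if len(fill_values) != stop - start:
            raise ValueError("fill_values must match the size of the masked block")
        self.policy = policy
        self.opponent_slice = (start, stop)
        self.fill_values = list(fill_values)

    def act(self, observation):
        start, stop = self.opponent_slice
        masked = list(observation)             # copy, so the caller's data is not mutated
        masked[start:stop] = self.fill_values  # static stand-in for the opponent's position
        return self.policy(masked)


def toy_policy(observation):
    """Stand-in for a trained network: depends on own state and on the opponent."""
    own, opponent = observation[:2], observation[2:]
    return sum(own) + 10 * sum(opponent)


if __name__ == "__main__":
    blind = MaskedPolicy(toy_policy, opponent_slice=(2, 4), fill_values=[0.0, 0.0])
    # The masked policy gives the same action whatever the opponent does.
    print(blind.act([1.0, 2.0, 5.0, -3.0]))
    print(blind.act([1.0, 2.0, 99.0, 99.0]))
```

The wrapper changes only what the policy sees. The trade-off is that a policy blindfolded this way can no longer be steered by an opponent's body position, and it also cannot react to an opponent that genuinely matters.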
And what we find is that these masked policies actually do surprisingly well, they are basically robust to the adversary, and this makes sense when you think about it because the adversary isn't doing anything to physically interfere with the victim. And so, although it might not be great to not know where the goalie is if you're trying to kick a ball, you can still do pretty well just by kicking it in a random direction and hoping the goalie doesn't block it, because the goalie is not trying to block the ball. So, that was sort of an interesting result, but it really is just confirming that the adversarial policies are working by indirectly changing the observations only by-

Daniel Filan: And these masked policies, they don't do well against other agents, right?

Adam Gleave: Right. No, so if you play them against a normal opponent, then they do terribly because they don't see the opponent coming at them, basically they can't adapt their behavior.

Daniel Filan: Yeah, so that's the masking. You also tested different dimensionality?

Adam Gleave: Yeah, so we also tested varying the robotic body, so we had both humanoids and ants, and you'd sort of expect, if our hypothesis is right that the dimensionality the adversary can manipulate is important, then the lower dimensional the body is, the harder it is to perform an attack. And that seems to be true, in that it's much harder to win as an adversary in sumo ants, so it's ants sumo wrestling, than in the humanoid version. But, there are some confounders here and the ants are just generally more stable than humanoids, so ideally we'd like to be able to decouple those two, but it's a little bit hard in robotics because higher dimensional bodies generally correlate with being harder to control.

Daniel Filan: Yeah, sorry when you say ants are more stable, do you mean as a physical structure or the training is more stable?
Adam Gleave: Physically ants are more stable because they have four legs, which I know isn't biologically accurate, but that's how things work in MuJoCo, and they have a lower center of mass. So, you don't fall over as an ant if you just exert basically constant control torque to your joints, whereas with a humanoid it's this constant balancing act.

Daniel Filan: Yeah, it seems like one way to control for this would just be to have an ant play a human in sumo, and the ant's the adversarial agent and the human's the...

Adam Gleave: Yeah, that would be interesting. We could try that, we were always attacking these pre-trained policies by OpenAI, because we thought that would be fairer because then no one can say we did something wrong in our training, and they didn't do that setup. But, we could train our own policies and try and do that, and I think that'd be quite an interesting future experiment.

Daniel Filan: Okay, and finally you looked at... I guess, in order to try and understand what these adversaries were doing, you looked at what the activations of the victim neural networks were, can you say a bit about that?

Adam Gleave: Yeah, sure. So, just to get some context, what we were doing was we had this fixed victim policy network play several different opponents, one of which was an adversarial policy, one of which was a normal opponent, the kind of opponent it trained with via self-play, and then finally a policy that was just taking random actions, and we wanted to see basically what does the victim policy actually think when it's playing against these different opponents. And so, we recorded all the activations of a victim policy network, and we fit a density model to these activations to try and predict...
We fit it when it was playing a normal opponent, and then we tried to predict how likely the activations are when it was playing a different normal opponent, because we had different normal opponents with different seeds, or the random and adversarial policies, and what we found was that the adversarial policy very reliably induced just extremely unlikely activations, so activations that would be very unlikely to occur when playing a normal opponent.

Adam Gleave: So, the victim policy was clearly thinking something very different, and then random was also fairly unlikely compared to a normal opponent, but it was much more likely than the adversarial policy. So, this is suggesting that it's not just being off distribution, but we're systematically finding some part of the state space that the victim is very confused by, or that we're triggering some features to be much larger than they usually would be.

Daniel Filan: Yeah, and if you fit this density model on opponent one, how surprising are the activations induced by opponent two?

Adam Gleave: Yeah, they're generally pretty hard to distinguish. I think the only exception was in sumo, where for some reason the opponents seem to be more different than usual, but in the other two environments I think they were normally within the... There wasn't any significant difference within confidence intervals between different normal opponents.

Daniel Filan: Yeah, if you think about sumo the sport, I think there are sort of distinct strategies you can go for, like push the opponent out of the circle, or tip them, and I guess you could specialize in one of those?

Adam Gleave: Yeah, and I think we did see that a little bit in the pre-trained opponents, certainly they had quite different win rates against each other, so that's suggesting that they weren't just a uniform opponent.
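The activation analysis described above (fit a density model on activations recorded against one normal opponent, then score activations recorded against other opponents) can be sketched as follows. The conversation does not say which density model was used, so a diagonal Gaussian is swapped in here purely as a simple stand-in. The activations are synthetic placeholders whose shifts are chosen by hand so the script has something to score; they are not measurements, and all names are hypothetical.

```python
# Illustrative sketch: diagonal Gaussian density model over activation vectors.
# The "activations" below are synthetic placeholders, not real data.
import math
import random


def fit_diagonal_gaussian(samples, min_var=1e-6):
    """Fit a per-dimension mean and variance to a list of activation vectors."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [
        max(sum((s[d] - means[d]) ** 2 for s in samples) / n, min_var)
        for d in range(dims)
    ]
    return means, variances


def mean_log_likelihood(samples, model):
    """Average log density of the samples under the fitted model."""
    means, variances = model
    total = 0.0
    for s in samples:
        for x, mu, var in zip(s, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total / len(samples)


def synthetic_activations(rng, center, n=500, dims=4):
    """Placeholder for recorded activations; `center` is a hand-picked shift."""
    return [[rng.gauss(center, 1.0) for _ in range(dims)] for _ in range(n)]


rng = random.Random(0)
model = fit_diagonal_gaussian(synthetic_activations(rng, 0.0))  # "normal opponent A"
scores = {
    "normal_b": mean_log_likelihood(synthetic_activations(rng, 0.0), model),
    "random": mean_log_likelihood(synthetic_activations(rng, 1.0), model),
    "adversary": mean_log_likelihood(synthetic_activations(rng, 4.0), model),
}

if __name__ == "__main__":
    for name, score in scores.items():
        print(name, round(score, 2))
```

The quantity being compared is the average log-likelihood per opponent: a much lower value for one opponent means its activations are rarely seen against the opponent the model was fit on. With real data, the recorded activations would replace `synthetic_activations`.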
Daniel Filan: Yeah, in terms of these activations, can you say a little bit about what the neural nets were, and what layer you're getting these activations from, are these just logits or like one before logits?

Adam Gleave: So, I'm not actually 100% sure on this point because someone else ran these experiments, but I think we were looking at activations from every layer of the network, so we didn't choose a particular layer. And in terms of the networks, so again we didn't actually train the victim policies, OpenAI did, so I'm not 100% on the policy architecture, but I think there wasn't anything fancy going on, these were just fairly standard multilayer perceptrons, I think one of them... Yes, some of them had LSTMs, but some of them were just standard feed forward networks.

Daniel Filan: Okay. Great. So, I guess that concludes the section where I'm asking directly about the paper. Now we're having more speculative questions on my part, so they might not make any sense.

Adam Gleave: Sure thing, yeah.

Daniel Filan: Yeah, so one thing you note in the masking section is that the space of policies, and which ones beat others is not transitive, right?

Adam Gleave: Mm-hmm (affirmative).

Daniel Filan: So, it can be the case that a normal policy will beat a masked policy, which will beat an adversarial policy, which will beat a normal policy, right?

Adam Gleave: Yeah.

Daniel Filan: So, I'm wondering, it seems like if the space of policies were extremely transitive, then you wouldn't have to explore much in policy or activation space in order to do well, just train against the best thing. But, if it is very transitive [sic, should be "isn't very transitive"], then if you're not used to a certain type of opponent then you can get really messed up, and it seems like that might be key to your results. So, I was wondering, firstly, does that make sense as a theory? And secondly, is there some measure of how non-transitive the space of sumo or kick and defend policies is?
Adam Gleave: Sure. Yeah, it's a very thought-provoking question. I would say that the intuition that something being non-transitive makes it harder to learn a good policy, especially via self-play, is absolutely correct. In fact, normally proofs of convergence for self-play do require transitivity or a slightly weaker assumption than transitivity, because the intuition is that you're beating a particular opponent, which is often just a copy of yourself, and you want this to also mean that you're stronger against previous versions of that opponent. And if you don't have this you can just end up in this cycle where you beat the previous opponent, but you get weaker against an opponent from 100 time steps ago, and you never actually converge to something.

Adam Gleave: So, it's actually a pretty interesting result. If I'd been asked to guess before the start of this project whether a penalty shootout in soccer is transitive, I'm not sure I'd have been 100%, but I'd have said, "Yeah, probably mostly it's a transitive game. I guess there are a few strategies that are non-transitive, like whether you kick left or right, that's a kind of non-transitivity. But I don't expect I'm going to be able to win a penalty shootout against a professional footballer. So, there's clearly some sense in which certain policies dominate others." These results we got though, at least for current state of the art deep RL, they're just very, very highly non-transitive, where completely ridiculous policies can win against seemingly very proficient ones.

Adam Gleave: Now, in terms of how you measure how non-transitive a space is, I think that's a really interesting question that I unfortunately don't have a good answer to. It seems it's not just dependent on the task, but also the class of policies that you're considering.
Yeah, you can consider this extreme case where you have a policy which is just the optimal policy, and then you add some classifier to this optimal policy which tries to figure out whether it's playing against a particular opponent. And if it is, it just resigns, or it just does something stupid.

Adam Gleave: And now you've effectively introduced a non-transitivity artificially, where this very weak policy, that would normally lose against the optimal policy, is now able to beat this very nearly optimal policy. So, you can always kind of play with tricks like that, and obviously that's not what we're doing when we're training via self-play, but I think there's maybe a bit of a similar effect where there's just this very small region of policy space which you might fail to be robust to, unless you're systematically exploring the policy space at training.

Daniel Filan: Yeah, it's an interesting question, because it seems like if you think that this is the crucial thing, then you might say, "Oh, well before I deploy my deep RL-trained model in an environment, maybe I want to..." Like, if there's some way to figure out how vulnerable it would be to these adversarial attacks, without actually training an adversarial policy. That might be desirable, but I guess it's difficult, especially if you don't know what model class your opponents might be using.

Adam Gleave: Yeah, that seems right. I guess you can get some amount of that by either doing something like population based training or just self-play with a bunch of different seeds. You can at least check to see whether there is non-transitivity between the policies that you have trained so far. And if there is, then that should make you concerned, but obviously that's no guarantee because you're only exploring a small part of the possible policies.
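One way to read the check suggested above is as a search for cycles in a table of head-to-head win rates among the policies trained so far. The sketch below is an assumption about how such a check could look, not a method from the paper: `find_cycles` is a hypothetical helper, and the win rates are made-up numbers that mirror the normal, masked, adversarial cycle mentioned earlier in the conversation.

```python
# Illustrative sketch (hypothetical helper, made-up win rates):
# look for A-beats-B-beats-C-beats-A cycles in a win-rate table.
from itertools import permutations


def beats(win_rate, a, b, margin=0.0):
    """True if policy `a` wins against `b` more than half the time (plus a margin)."""
    return win_rate[a][b] > 0.5 + margin


def find_cycles(win_rate, margin=0.0):
    """Return every 3-cycle, listed once, starting from its alphabetically first policy."""
    cycles = []
    for a, b, c in permutations(sorted(win_rate), 3):
        if a != min(a, b, c):
            continue  # skip rotations of a cycle that is already covered
        if (beats(win_rate, a, b, margin)
                and beats(win_rate, b, c, margin)
                and beats(win_rate, c, a, margin)):
            cycles.append((a, b, c))
    return cycles


# win_rate[x][y] = fraction of games x wins against y (illustrative values only).
example = {
    "normal": {"masked": 0.9, "adversary": 0.2},
    "masked": {"normal": 0.1, "adversary": 0.8},
    "adversary": {"normal": 0.8, "masked": 0.2},
}

if __name__ == "__main__":
    print(find_cycles(example))
```

An empty result over the policies you happen to have is weak evidence at best, for the reason given in the conversation: the table only covers a small part of the possible policies. The `margin` parameter is there so that near-even matchups, which are mostly noise, do not count as one policy beating another.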
And in general I think it's going to be very hard to get full confidence that you're robust to these kinds of attacks, because you're never sure that you've tested every possible attack. Verification of deep networks is an active area of research, but still very challenging, especially if you're trying to verify what happens in this unknown non-differentiable environment.

Daniel Filan: Yeah, definitely. Okay, so another thought I had when reading your paper, was the way you train these adversarial policies, is essentially the way... Like, if you think about self-play you just fix an agent then train something else to beat it for some number of time steps, and then sort of do that ladder. And so I'm wondering, one inference that I might make from your work is, "Oh, it must be the case that when I do the self-play training, a bunch of the self-play steps are just an agent finding an adversarial- like a silly sort of trick adversarial policy against its fixed copy, and then learning to be robust to that." Do you think I should make that inference, and to what extent do you think this work in general just reveals what's happening within the dynamics of self-play?

Adam Gleave: Yeah, so we don't know from these results, but my expectation would be that you probably do see that many of the gradient steps taken by self-play are really just exploiting some silly bug in your opponent, and so this is a very noisy gradient update, where some of them are actually moving robustly in the direction of a good policy, and others are just sort of overfitting to the particular opponent you have right now. Now, I do want to make clear one thing, that although our training procedure is very similar to self-play, we are training against a fixed victim for a large number of time steps.
Adam Gleave: And so you can view this as in some ways getting closer to the original motivation of self-play, which was fictitious play where you're supposed to be doing iterative best response, because if you train for 10 million or 20 million time steps, you're getting something that is reasonably close to best response. Obviously RL is not a perfect optimizer, but you're doing as well as you can with the techniques you have. Whereas, if you're training for something like 100,000 time steps before then updating both yourself and your opponent, then you're never really doing best response, you're just taking these small steps towards a better response.

Adam Gleave: And so I think part of the reason why these attacks are possible is because this kind of traditional self-play, it can converge to these local equilibria, and what we actually found was if we just keep on fine tuning our normal opponents, if we take one of our opponents the victim was trained with via self-play and apply our attack method, but starting from this normal opponent, you might expect this is going to do better because you're already starting from an opponent that wins against the victim a lot of the time, but in fact it doesn't improve any further. So, it has converged to some equilibrium, but it's a local equilibrium, so it's not stable.

Daniel Filan: Yeah, I guess another crucial difference is, in self-play you sort of fork your opponent, you start off identical to your opponent, and then you start taking gradient steps. Whereas, in your work you have randomly initialized agents, and then take training steps. And I suppose that probably makes it easier to shed your preconceptions, if you're doing self-play and you have some blind spot, then when you're cloned you still have the blind spot, and it might be harder to discover it.

Adam Gleave: Yeah, I think that's right.
Some of the environments we're attacking were asymmetric, so they weren't actually training against themselves, but they were only training against one other agent, so you can certainly imagine that both agents might have this kind of shared blind spot, and neither of them is incentivized to fix it because it's not being exploited by the other agent. Whereas, something that's more of a population based training approach where you play against randomly selected agents at each episode, that's much more likely to explore this policy space, and so would probably be a bit more robust.

Adam Gleave: Yeah, I do think starting from a random initialization, certainly you don't have any preconceptions other than inductive bias from the training procedure, but I was surprised that our attack method worked. I wasn't even intending to do this. It was actually a collaborator, Michael Dennis, who was stubborn enough to keep on trying to do deep RL from a sparse reward, and I was like, "There's no way this is going to work because you don't have any curriculum, you're just trying to beat this already very good victim." But, it turns out that if you run it for a reasonable number of time steps, deep RL is a good enough optimizer that it can discover this. But my suspicion is that there are going to be other kinds of adversarial policies that we're not discovering with this training procedure, because most steps you take from a random initialization are not going to be able to beat a reasonably capable victim.

Daniel Filan: Yeah, so speaking of the nature of these adversarial policies. So, one thing that you've noted is that in kick and defend, the way the adversarial policies work is not by blocking the ball from getting into the goal. Do you have any understanding of how they are actually working? Why these fool the network, other than, "Oh, it's just doing something unlikely."

Adam Gleave: Yeah.
So, I mean, it's certainly more interesting than just something unlikely, because the random policy doesn't have this effect, and in fact the victims were quite robust to some other perturbations. The OpenAI team that originally trained this applied a random force vector to one of the victims, so just like the hand of God is suddenly trying to push you over, and the victims were quite robust to that. So, they are robust to a lot of things you might throw at them. My best guess is that when they were training via self-play, they would learn any feature which is useful for beating their opponent, but some of these features aren't going to be very robust.

Adam Gleave: So, if the position of the kicker's limb predicts which direction it's going to kick the ball in, then maybe it says, "Okay, well, at this..." I guess that's the wrong way around because this is an adversarial goalie, but if the direction the goalie is facing predicts which way it's going to fall to try and catch the ball, then the kicker might learn, "Oh well, I should take a step in this direction to kick the ball away from where it's going to block it." And then if a goalie just really maxes out this feature by putting its limbs in a weird position, then what would normally have been an adaptive response could be sufficiently large that it destabilizes the control mechanism in the kicker. So, this would be my best guess, but I don't think we have a great understanding of what's actually going on here yet.

Daniel Filan: Yeah, I guess this is yet another situation in which I wish we had much better machine learning interpretability.

Adam Gleave: Yeah, I mean, that would be really nice, especially if you could apply the interpretability technique and figure out these kinds of problems before deployment, because searching for adversarial policies is a good way of testing but, as I said, you can never be sure you've found every possible adversarial policy.
So, it's very nice to have an interpretability technique as well, that would give you some confidence.

Daniel Filan: Yeah. So, I guess I'd like to move on a bit to the reception and then the origin of the work. So, in terms of the reception, my understanding is this was published at ICLR 2020, but am I right that it also appeared in a workshop at NeurIPS 2019?

Adam Gleave: That's right, it was at the Deep RL workshop.

Daniel Filan: Okay, so I was wondering what you think of how people have reacted to this, and in general what do you think about the reception?

Adam Gleave: Yeah, I think the work's had a very positive reception to date, and people definitely love the videos, which unfortunately we can't include in the podcast. I think that the ML community is pretty receptive to pointing out these kinds of flaws in existing techniques, and especially in the case of adversarial examples, there's already a pretty big research community working on fixing them in image classifiers. So I think people are pretty excited about seeing similar kinds of threat models applied in other domains, because it's really just opening up research problems that other people can work on. What I would say is that there's maybe a bit too much of a taste in the research community for flashy results.

Adam Gleave: So, this paper, it worked out in my favor, but I've had other papers which I think were from a technical perspective making an equally important contribution, which have had nowhere near as much attention because it's just harder to make this really compelling demo video and tweet. So, in that sense I hope that the ML community would, I guess, have slightly better norms regarding what to promote, but obviously it's very hard because it's growing so quickly.
So, maybe just five years ago we could have basically fitted most people who are interested in deep learning in one room, and then you could just read every paper on deep learning, and now we're in a situation where it's impossible to even catch up with all the papers at one conference. So, we've not yet come up with, I think, really good curation mechanisms as a community, but I think it's going to be very important to be able to incentivize the right kind of work.

Daniel Filan: Yeah, I do think it's just generally tricky to figure out even just how to find relevant research, how to tell that it's something you should pay attention to. It seems probably the most reliable method is your local Slack group, where somebody who has read a paper can say something interesting about it. And hopefully this podcast can do something similar. Speaking of the reception, I was wondering, are there any misconceptions that people have about your paper which our listeners might have, that you'd like to potentially just correct right now?

Adam Gleave: Sure, I think one misconception people have is that because it has the word adversarial in the title, this is all about security, and I think security is a good way of doing worst case testing, having that security mindset. But I think these kinds of results are going to be relevant generally when you care about robustness of a policy, because although it's unlikely that you run into these kinds of situations against benign opponents, there's always some chance, so it's going to happen by chance, and it's also just indicating that we really don't understand what our policies are doing even if a policy seems to be completely reasonable when you test it a bunch of different ways.

Adam Gleave: And I want to emphasize the OpenAI team that made the policies we're attacking, they did a really good job of evaluation, they were definitely following the standards of the field.
Even then, it can harbor these kinds of really surprising failure modes, and so it's not enough as an engineer to just even dream up what you think are going to be the worst nightmares of your policy, because actually it might fail in this just completely surprising way. And so I think that we really need some automated fuzz testing style approach to complement existing testing methods.

Daniel Filan: Yeah, and I guess my second question about the reception, do you know of any follow up work looking into this? So you mentioned that you're doing something with rock paper scissors, was it?

Adam Gleave: Yeah, so those are just early proof of concept experiments. So, we're working right now on improved defense mechanisms against this. Because this paper was mostly about attacks, a natural follow-up is to say, "Well, okay. How do we defend against this attack? Generally, how do we make policies more robust?" And so that's focusing on, as I was saying, this rapid adaptation approach. We don't have anything that's ready to be shared yet, but we're hoping in a few months to have a pre-print on this work, and I'm not aware of any other published work yet, this paper was only presented at ICLR a couple of months ago, but- [crosstalk]

Daniel Filan: What sort of workshop- [crosstalk]

Adam Gleave: It was also at a workshop, yeah. But I've definitely had a lot of interest, like emails, people asking questions over GitHub, so my sense is that there are other teams working on this general idea, so I'm hoping that we see some results soon.

Daniel Filan: Okay. Yeah, I'll be looking forward to it. So, flipping it around and looking at the origin. My understanding is that your previous work has not been in adversarial examples.

Adam Gleave: That's right.

Daniel Filan: What caused you to decide to work on this problem, as opposed to anything else you could have done?

Adam Gleave: Sure, yeah.
So, I think that a big part of my motivation was having a bit of frustration with standard testing methodologies in the field, because it's... If you compare it to something like control theory, where most of the papers are really coming up with robust guarantees that a system is going to work in a particular setting, RL has kind of gone to the complete opposite extreme and it says, "Well, we're just going to train things and then we're going to see, do they seem like they do the right behavior." And notably RL has been making a lot of progress in areas that have historically been difficult for control, so freeing yourself from those theoretical constraints can be very useful, but then when we start seeing applications on the horizon, in the real world, well, we need to start, I think, getting back some of these guarantees, if not from theory at least from rigorous engineering methodologies and testing processes.

Adam Gleave: So, that was a big motivation, and then I was just trying to think, "Well okay, how do you improve this?" And if you want to do some kind of worst case testing, then taking, as I said, some adversarial security approach I think is a pretty fruitful framing for the problem. And I'd been interested in the prior work on adversarial examples and RL, but it always just felt to me like the threat model wasn't quite right. I think this is an issue with a lot of adversarial examples work, where it's very theoretically interesting work, and it's telling us something about how neural networks work, but most of the time an attacker, or nature as the case may be, can do things that aren't just adding a small amount of white noise to your observations. And so over time, I think we need to be shifting to threat models that are more indicative of realistic kinds of failures or attackers that you might see in the real world, and so I was hoping this was going to be one step in that direction, but obviously there's more that could be done there.
Daniel Filan: Yeah, I guess there's also... Maybe part of the reason that this work didn't get done by somebody else earlier is, if you think about most reinforcement learning benchmarks, if they're Atari or something, they're typically a one player environment.

Adam Gleave: That's right, yeah.

Daniel Filan: So, I think it takes some deliberate thinking, you have to choose your setting in order to come up with this research idea, I guess.

Adam Gleave: Yeah that's true, I mean, I guess for me I found multi-agent work to be a very natural framing, some of my prior work has been on multi-agent or multitask reward learning for example, and it really seems to me like that is going to be the future in which most of our systems are going to be deployed. And now, you don't necessarily have to do multi-agent RL for everything. If you're just trying to train an autonomous vehicle, maybe it's enough to ignore the other vehicles, not explicitly model them as agents, just model them as obstacles, and you might still be able to do reasonably well in some cases. But we're definitely going to be deploying systems in multi-agent environments, so we should at least be investigating that threat model.

Daniel Filan: Cool, and I guess a similar question is, is it safe to say that you're interested, to some extent, in AI alignment or ensuring that when we get AGI that it's going to be human compatible, or something?

Adam Gleave: Yeah, that's definitely an important motivation for some of my work.

Daniel Filan: Yeah, so do I understand correctly that your understanding of how that fits into that whole project is like, "Look, we're going to be training these RL AGIs, and we need some way of just testing them and seeing how well they actually work before they get deployed, otherwise clearly something bad is going to happen with them." Is that a fair summary?
Adam Gleave: Yeah, I think that's definitely one of the more direct stories you could tell for how this work could fit into a long term bigger picture. I think I'm also excited by more indirect routes to impact, where it's going to be hard for me as an individual researcher to solve all of the problems needed for advanced AI systems to be safe and reliable. But I think the community as a whole wants to solve these problems as well, but sometimes people can get a little bit stuck on trying to improve performance on existing benchmarks, and so I think there's value in just pointing out the existence of a problem, in some way similar to the original adversarial examples paper. And then hopefully other people in the community will also lend some of their brain cells to solving this problem.

Daniel Filan: Yeah, so I was wondering how to think about this, because in the safety community there are a couple of different kinds of work. So, there's one kind of work where people think, "Oh, how could we build aligned AGI? What's our strategy for building this thing?" And then it's like, "Okay, this strategy involves these 10 sub components, let's get to work on the sub components," or something. Or maybe you have such a strategy in mind and you realize, "Oh, these things are going to be problems, these issues might block us. How can I get rid of these blocks?" And instead it seems like this work is more in the vein of, look at problems that already exist in the field of AI, that might lead to things down the road, and try and bring attention to these problems. I'm wondering if you have thoughts on the relative costs or drawbacks and benefits of these approaches, and how do you think those of us in this community should split our time between these ways of thinking?

Adam Gleave: Yeah, that's a great question. So, [inaudible], I think these two approaches are actually quite complementary, so we don't want to neglect either one too much.
I'd view things like adversarial policies, also interpretability, as feeding into this trust approach where, let's say, we've developed this advanced AI system or we're training something, and we want to be able to inspect it and check that it is doing what we want. And so this is something that, given any particular training procedure, you can apply and increase the reliability of the overall outcome, because you just won't deploy things that seem unsafe.

Adam Gleave: And this is also going to be important commercially, because if you're in a safety critical environment it's probably going to be regulated, and it's certainly going to be very bad PR if you deploy something and it doesn't behave correctly. So, there's going to be a lot of demand for these kinds of trust techniques, but then you can only rule out certain bad systems with this kind of approach. So, if your training procedure is just going to inherently produce really bad outcomes, then having these kinds of test procedures is only going to buy you time, eventually you need to have a better training procedure. And I think that with something like adversarial policies, okay, maybe it's going to spur work on adversarial defenses, we might get a bit of both.

Adam Gleave: But in general I think that they are quite complementary approaches, where you want to be investing in whichever one seems more neglected at the margin. Now, I think that if we're talking about the specific approach of, "Let's figure out what an aligned AGI looks like and let's try and just build it or break it into subcomponents," I think this is a useful framing, but I am a bit worried that it can often end up resulting in very untethered research, because I at least believe that we're still quite a way away from having AGI, even though there's been a lot of advancements in AI in the recent past.
And so, it's going to be very hard to tell whether the research you're doing is actually going to be useful in that context, and it's hard to get a good feedback loop if you're designing something which is going to help with something that doesn't exist yet.

Adam Gleave: And so, I think it's useful if you can find a near term problem which you think is also going to be relevant in the long term, but where the contributions you're going to make are going to have a fairly direct impact, and where you're able to get some feedback both from the broader research community and potentially industrial uses of your work. I wouldn't want all work to fit into this category, but I do think getting that feedback mechanism is a very important consideration that I often see people neglecting.

Daniel Filan: Yeah, although I wonder, if it's the case that AGI is very far away in time, and we don't know how it's going to be built, then you might worry that, "Oh, I'm finding out these problems with existing systems. But future systems are just going to be so different that they're not going to have those problems, they're going to have a different set of problems." The more different you think future systems might be, the more you might worry about this, right?

Adam Gleave: Yeah.

Daniel Filan: So, I'm wondering how you assuage that concern.

Adam Gleave: Yeah, I think it's a legitimate concern. If you do think that these systems are going to be very far away and look very different, then this is also a reason to focus on more general theoretical work, which you think is going to be relevant to a broad range of AI systems. I don't really feel like I've got a comparative advantage in that space, but if I did I'd be reasonably excited about it. I think that I would view development of fields as often being quite path dependent.
So even if in 10 years' time we're no longer using proximal policy optimization as our favorite deep RL algorithm, I think the kind of people that are graduating from PhD programs now are going to have been influenced by the kind of research happening today.

Adam Gleave: So, I hope that even if adversarial policies isn't applicable directly in the long term, this awareness of surprising failure modes is something that's going to stick with researchers, and influence research happening in the future. So, that's one area in which I can see impact even if things change a lot. I also think we can get some qualitative insights from it, and not just for the general AI community but for people who are explicitly focusing their career on robustness and safety. Because I'd say a lot of the problem isn't specific to deep networks, it's more of a question of what happens when you have generally a powerful function approximator, you're using self-play, you don't have adequate coverage, and I don't really see these fundamental problems as disappearing, unfortunately. I'd love it if someone did come up with a training procedure that just eliminated this, but unless you're willing to spend a lot of compute on that I don't think you're going to fully eliminate this. The best you might hope for is having a policy that says, "Well, this is really weird. I don't know what to do in this situation." But even that seems quite hard.

Daniel Filan: Yeah, I guess you could even generalize further and say, as long as the way we train AI systems is: have a system that somehow trains and gets used to some environment, as long as the type of environment is high enough dimensional, then... I guess the story is, there's always going to be some corner of the high dimensional space that you're not good at, and that seems like a fairly general argument.
Adam Gleave: Yeah, so people have written papers about connections between high dimensional geometry and adversarial examples, and I think there's a little bit of controversy about how much this applies, but it depends a lot on the shape of the manifold of natural images, or whatever space you're working on. And obviously we can't easily characterize this, but it definitely does seem likely that if you're in a sufficiently high dimensional space, it's just going to be impossible to cover every area, and it seems like a fairly fundamental problem. Now, that said, there are a lot of people whose opinions I respect who think that adversarial examples will just disappear once we have human level classification accuracy on natural images, and you can point to humans seeming not to suffer from adversarial examples very much.

Daniel Filan: Doesn't AI already have human level classification accuracy on natural images?

Adam Gleave: It does on ImageNet, but not if you just took a photo with your phone.

Daniel Filan: Okay.

Adam Gleave: Now, maybe that's just a dataset issue, but it does seem like there are really a lot of artifacts in existing databases, and it may be that part of the reason why we have adversarial examples is that the classifier is able to get really good accuracy by picking up on these artifacts, and you can mess with them fairly easily. Yeah, it's hard to say because obviously you can't differentiate through a human mind, at least-

Daniel Filan: Yet.

Adam Gleave: And, yeah, humans have also been trained in this adversarial setting by evolution, their prey tries to camouflage itself. So, in that sense, maybe we have been adversarially trained, and so it should not be surprising that we're robust, but this is at least some reason to be optimistic.

Daniel Filan: Okay. Well, that's about it in terms of questions that I have, thanks for being on the podcast. If people are interested in your work and want to follow you or learn more, what would you suggest they do?
Adam Gleave: Sure, so you can follow me on Twitter, my handle is @ARGleave. I also have a website at gleave.me, so that's just my surname dot me, where I post all of my papers. Yeah, and if any of the listeners have questions about my work, I'm also happy for people to email me. I can't always promise a detailed response, but I always love to hear from people interested in this work.

Daniel Filan: All right, and your email address can be found on your website?

Adam Gleave: That's right.

Daniel Filan: Well, thanks for coming on. And to the listeners, thanks for listening, and I hope you listen again to future episodes.

### Book Review: On Intelligence by Jeff Hawkins (and Sandra Blakeslee)

December 29, 2020 - 22:48
Published on December 29, 2020 7:48 PM GMT

On Intelligence is a book I've read as part of my quest to understand neuroscience. It attempts to develop a unified theory of the neocortex meant to serve as a blueprint for Artificial Intelligence. I think of the book as being structured into three parts.

Part one: Artificial Intelligence and Neural Networks OR skip ahead to part two if you want to read about the cool neuroscience rather than about me lamenting the author's lack of epistemic rigor

This part is primarily about a single claim: building AI requires understanding the human brain. Depending on how you count, Jeff says this nine times in just the prologue and first chapter. To justify it, he tells us the story of how he came into contact with the field of artificial intelligence. Then and now, he laments that people in the field talk about intelligence without trying to understand the brain, whereas neuroscientists talk about the brain without trying to develop a high-level theory of intelligence. Neural networks are a small step in the right direction, but he quickly got disillusioned with them as they don't go nearly far enough; their connection to the brain is quite loose and high-level.
The conclusion is apparent: someone has to bring neuroscience into AI, and only then will the field succeed. And since no-one else is doing it, Jeff steps up; that's what the book is for.

The picture he lays out makes a lot of sense if you take the claim as a given. The flaw is that he neglects to argue why it is true.

I think it's pretty hard to make excuses here. This isn't a dinner conversation; it's a 250-page book that explicitly sets out to reform an entire field. It's a context where we should expect the highest level of epistemic rigor that the author is capable of, especially given how much emphasis he puts on this point. However, after rereading this part of the book, the only evidence I can find that supports AI requiring an understanding of the brain is the following:

• The observation that current AI architectures are not like the brain, which I think is uncontroversial but doesn't prove anything.
• A claim that you can't do AI without understanding intelligence, and you can't understand intelligence without understanding the brain, which isn't an argument unless you already believe that intelligence is inherently tied to the neocortex.
• A claim that current AI approaches have failed. This one may have been decent evidence in 2004, which is when the book was published, but it has aged rather poorly. I think more than half of the things-AI-can't-do that Jeff names in the book are things it can do in 2020, and that's without methods getting any closer to imitating the brain or neocortex. In particular, we didn't yet have GPT-3.
• The Chinese Room thought experiment. (A non-Chinese-speaker in a room who is handed Chinese symbols and a long list of rules to manipulate them may answer complex questions without ever understanding anything; this is analogous to what a GPU does; hence a GPU is not intelligent.)

And that -- there's no nice way to put it -- is weak.
I believe I am more open to the idea that studying the brain is the best way to build AI than most of LW (although certainly not the only way), but you have to actually make an argument!

I think one of the most valuable cognitive fallacies Eliezer Yudkowsky has taught me (in both writing and video) is the conjunction fallacy. It's valuable because it's super easy to grasp but still highly applicable, since public intellectuals commit it all the time. I think what Jeff is doing in this book is a version of that. It has the caveat that the story he tells doesn't have that many specific claims in it, but it's still telling a story as a substitute for evidence.

To add a bit of personal speculation, an effect I've noticed in my own writing is that I tend to repeat myself more the weaker my arguments are, perhaps feeling the need to substitute repetition for clarity. The first part of this book feels like that.

The most perplexing moment comes at the end of the second chapter. Jeff mentions an important argument against his claim: that humans in the past have commonly succeeded in copying the 'what' of evolution without studying the 'how'. The car is nothing like the cheetah, the airplane is nothing like a falcon, and so on. A great point to bring up -- but what's the rebuttal? Well, there's a statement that he disagrees with it, another reference to the Chinese Room, and that's it. It's as if merely acknowledging the argument is an indisputable refutation.

As someone who thinks rationality is a meaningful concept, I think this kind of thing matters for the rest of the book. If he can delude himself about the strength of his argument, why shouldn't I expect him to delude himself about his theories on the neocortex?

On the other hand, Jeff Hawkins seems to have a track record of good ideas. He's created a bunch of companies, written a programming language, and built a successful handwriting recognition tool.
My speculative explanation is that he has something like a bias toward simple narratives and cohesive stories, which just so happens to work out when you apply it to understanding the neocortex. This is true both for practical reasons (having a flawed theory may be more useful than having no theory at all), but also for epistemic reasons: if there is a simple story to tell about the neocortex (and I don't think that's implausible), then perhaps Jeff, despite his flaws, has done an excellent job uncovering it. I'm not sure whether he did, but at least the rest of the book didn't raise any red flags comparable to those in part one.

Let's get to his story.

Part two: The Brain, Memory, Intelligence, and the Neocortex

(If anyone spots mistakes in this part, please point them out.)

The Human Brain

Jeff is a skilled writer, and his style is very accessible. He tends to write simply, repeat himself, and give a lot of examples. The upshot is that a short list of his main claims can sum up most of this relatively short chapter.

• If you scan the neocortex (which is the part of the brain where intelligence and consciousness are located), you see that it has a hierarchical structure.
• You also see that it looks quite similar everywhere. This is an observation Jeff emphasizes a lot and compares to Einstein's idea that the speed of light is constant everywhere (an insight from which coming up with special relativity was 'easy'). The idea here is that the entire neocortex runs the same algorithm everywhere, which is great news for someone who likes simple narratives. If you have read Steve's posts about the brain, you may notice that this is a point they agree on.
• While different parts of the neocortex generally do different things (some are responsible for vision, some for audio, etc.), it is remarkably flexible. In particular, if you look at the neocortex of a blind person, the part that's usually responsible for vision is now doing other things. Very cool.
• Even though we experience different senses like vision and smell very differently, they all reduce to the same type of thing in the neocortex. In particular, the neocortex is full of little fibers called axons, and every input reduces to patterns of many different axons firing. This is probably easy to grasp for anyone reading this since the same is true in computers: both songs and images are represented as sequences of bits.
• Relatedly, Jeff claims it is possible for blind people to 'see' by installing a device that translates visual inputs into sequences of touch on the tongue. Also pretty cool, at least if it's true.

Memory

This chapter is largely a descriptive account of different properties of human memory. As far as I can tell, everything Jeff says here aligns with introspection.

Property #1: The neocortex stores sequences of patterns

This one is closely related to the point about type uniformity in the previous chapter. Since everything in the brain is ultimately reduced to a pattern of axons firing, we can summarize all of memory as saying that the neocortex memorizes sequences of patterns. The term sequence implies that order matters; examples here include the alphabet (hard to say backward; you probably have to say it forward every time to find the next letter) and songs (which are even harder to sing backward). Naturally, this applies to memories across all senses.

On the other hand, Jeff later points out how sequences can sometimes be recognized even if the order is changed. If you associate a sequence of visual inputs with a certain face, this also works if you move the order around (i.e., nose -> eye -> eye rather than eye -> nose -> eye). I don't think he addresses this contradiction explicitly; it's also possible that there's something I forgot or didn't understand. Either way, it doesn't sound like a big problem; it could just be that the differences can't be too large or that it depends on how strict the order usually is.
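To make the forward-versus-backward asymmetry of Property #1 concrete, here is a minimal sketch. It assumes, purely for illustration, that a sequence is stored as forward links only, so each item knows just its successor. This is not Jeff's model or anything taken from the book, and the names `SequenceMemory`, `next_item`, and `previous_item` are hypothetical. The sketch also assumes every item in the sequence is unique.

```python
from typing import Dict, List, Optional, Tuple


class SequenceMemory:
    """Toy store that remembers only 'what comes next' for each item.

    Illustrative assumption: items are unique and only forward links exist.
    """

    def __init__(self, sequence: List[str]) -> None:
        # Forward links only: each item points to its successor.
        self.first: str = sequence[0]
        self.successor: Dict[str, str] = dict(zip(sequence, sequence[1:]))

    def next_item(self, item: str) -> Optional[str]:
        # Forward recall is a single lookup.
        return self.successor.get(item)

    def previous_item(self, item: str) -> Tuple[Optional[str], int]:
        # Backward recall has no direct link, so replay the sequence
        # from the start and count how many lookups that takes.
        current: Optional[str] = self.first
        lookups = 0
        while current is not None:
            following = self.successor.get(current)
            lookups += 1
            if following == item:
                return current, lookups
            current = following
        return None, lookups


alphabet = SequenceMemory(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
print(alphabet.next_item("M"))      # one lookup
print(alphabet.previous_item("N"))  # replays from "A"
```

Under these assumptions, finding the letter after M takes one step, while finding the letter before N means walking the alphabet from the beginning, which matches the introspective experience described above. It says nothing about how the neocortex actually implements this.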
Property #2: The neocortex recalls patterns auto-associatively

This just means that patterns are associated with themselves so that receiving a part of a pattern is enough to recall the entire thing. If you only hear 10 seconds of a song, you can easily recognize it.

Property #3: The neocortex stores patterns in an invariant form

This one is super important for the remaining book: patterns don't respond to precise sensory inputs. Jeff never defines the term 'invariant'; the mathematical meaning I'm familiar with means 'unchanging under certain transformations'.[1] For example, the function $f(x)=x^2$
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face 
{font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: 
local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} is invariant under reflection on the y-axis. Similarly, your representation of a song is invariant under change of starting note. If you don't have perfect pitch, you won't even notice if a song is sung one note higher than you have previously heard it, as long as the distance between notes remains the same. Note that, in this case, the two songs (the original one and the one-note-higher one) have zero overlap in the input data: none of their notes are the same. Property #4: the neocortex stores patterns in a hierarchy ... but we'll have to wait for the chapter on the neocortex to understand what this means. Moving on. A new framework of intelligence The punchline in this chapter is that intelligence is all about prediction. Your neocortex constantly makes predictions about its sensory inputs, and you notice whenever those predictions are violated. You can see that this is true in a bunch of examples: • If someone moves your door handle two inches downward, you'll probably notice something is weird as you try to grab it (because your neocortex has memorized exactly how this movement is supposed to go) • If a background noise suddenly disappears, you may notice this disappearance (because it's a violated prediction), even if you hadn't even noticed the noise itself. I would dispute that 'intelligence = prediction' rather than 'human intelligence = prediction'. For Jeff, this is a distinction without a difference since only human intelligence is real intelligence. He even goes as far as talking about 'real intelligence' in an earlier chapter. How the neocortex works We have now arrived at the heart of the book. This is the longest and by far the most technical and complicated chapter. 
To say something nice for a change, I think the book's structure is quite good; it starts with the motivation, then talks about the qualitative, high-level concepts (the ones we've just gone through), and finally about how they're implemented (this chapter). And the writing is good as well!

The neocortex has separate areas that handle vision, sound, touch, and so forth. In this framework, they all work similarly, so we can go through one, say the visual cortex, and largely ignore the rest.

We know (presumably from brain imaging) that the visual cortex is divided into four regions, which have been called V1, V2, V4, and IT. Each region is connected to the previous and the next one, and V1 also receives visual inputs (not directly, but let's ignore whatever processing happens before that).

We also know that, in V1, there is a strong connection between axons and certain areas of the visual field. In other words, if you show someone an object that's at point (x,y), then a certain set of axons fires, and they won't fire if you move the same object somewhere else. Conversely, the axons in IT do not correspond to locations in the visual field but to high-level concepts like 'chair' or 'chessboard'. Thus, if you show someone a ring and then move it around, the axons that fire will constantly change in V1 but remain constant in IT.

You can probably guess the punchline: as information moves up the hierarchy from V1 to IT, it also moves up the abstraction hierarchy. Location-specific information gets transformed into concept-specific information in a three-step process. (Three since V1→V2→V4→IT.)

To do this, each region compresses the information and merely passes on a 'name' for the invariant thing it received, where a 'name' is a pattern of inputs. Then, V2 learns patterns of these names, which (since the names are patterns themselves) are patterns of patterns.
Thus, I refer to all of them simply as patterns; for V1, they're patterns of axons firing; at IT, they're patterns of patterns of patterns of patterns of axons firing. (I don't understand what, if anything, the difference is between a sequence of patterns and a pattern of patterns; I've previously called them sequences because I was quoting the book.)

In this model, each region remembers the finite set of names it has learned and looks to find them again in future inputs. This is useful to resolve ambiguity: if a region gets inputs xy? where ? is somewhere between z and g, but the region knows that xyz is a common pattern, it will interpret ? as z rather than g. That's why we can understand spoken communication even if the speaker doesn't talk clearly, and many of the individual syllables and even words aren't understandable by themselves. It's also why humans are better than current AI (or at least the system I have on my phone) at converting audio to text.

I've mentioned that the names passed on are invariant. This is a point I understand to be original to Jeff (the classical model has invariant representations at IT but not in the other regions). In V1, a region may pass on a name for 'small horizontal line segment' rather than the set of all pixels. In IT, it means that the same objects are recognized even if they are moved or rotated. To achieve this, Jeff hypothesizes that each region is divided into different subregions, where the number of these subregions is higher for the regions lower in the hierarchy (and IT only has one). I.e.:

(I've re-created this image, but it's very similar to the one from the book.)

On the one hand, I'm biased to like this picture as it fits beautifully with my post on Hiding Complexity. If true, the principle of hiding complexity is even more fundamental than what my post claims: not only is it essential for conscious thought, but it's also what your neocortex does, constantly, with (presumably all five kinds of) input data.
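
As an aside on the disambiguation step described earlier (reading xy? as xyz), here is a minimal sketch of how a region might settle an unclear input. Everything in it is an assumption made for illustration and not something taken from the book: the function name `resolve_ambiguous`, the idea of storing patterns as strings with counts, and the example counts themselves.

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_ambiguous(prefix: str,
                      candidates: Sequence[str],
                      pattern_counts: Counter) -> Optional[str]:
    """Pick the reading of an unclear input that completes the most familiar pattern."""
    if not candidates:
        return None
    # Counter returns 0 for patterns that were never stored.
    best = max(candidates, key=lambda c: pattern_counts[prefix + c])
    # If no candidate completes a known pattern, the region cannot decide.
    return best if pattern_counts[prefix + best] > 0 else None


# Hypothetical counts of how often each pattern has been seen before.
seen = Counter({"xyz": 12, "abc": 7})

# The unclear third input could be 'g' or 'z'; 'xyz' is familiar, 'xyg' is not.
print(resolve_ambiguous("xy", ["g", "z"], seen))
```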
On the other hand, this picture seems uniquely difficult to square with introspection: my vision appears to me as a continuous field of color and light, not as a highly-compressed and invariant representation of objects. It doesn't even seem like invariance is present at the lowest level (e.g., different horizontal line segments don't look alike). Now, I'm not saying that this is a smackdown refutation; a ton of computation happens unconsciously, and I'm even open to the idea that more than one consciousness lives in my body. I'm just saying that this does seem like a problem worth addressing, which Jeff never does.

Oh well. Moving on -- why do the arrows go in both directions? Because the same picture is used to make predictions. Whenever you predict a low-level pattern like "I will receive tactile inputs xyz because I recognized the high-level pattern of 'open the doorknob'", your brain has to take this high-level, invariant thing and translate it back into low-level patterns.

To add another bit of complexity, notice that an invariant representation is inevitably compressed, which means that it can correspond to more than one low-level pattern. In the case of opening the door, this doesn't apply too much (although even here, the low level may differ depending on whether or not you wear gloves), so take the case of listening to a song instead. Even without perfect pitch, you can often predict the next note exactly. This means that your neocortex has to merge the high-level pattern (the 'name' of the song) with the low-level pattern 'a specific note' to form the prediction of the next note.

If you've been thinking along, you might now notice that this requires connections that go to the next lower region to point to several possible subregions. (E.g., since patterns of V1 are location-specific depending on which subregion they're in, but patterns in IT are not, the same pattern in IT needs to have the ability to reach many possible subregions in V1.)
According to Jeff, this is precisely the case: your neocortex is wired such that connections going up have limited targets, whereas connections going down can end up at all sorts of places. However, I'm not sure whether this is an independent piece of knowledge (in which case it would be evidence for the theory) or a piece he just hypothesizes to be true (in which case it would be an additional burdensome detail).

This is implemented with an additional decomposition of the neocortex into layers, which are orthogonal to regions. For the sake of this review, I'm going to hide that complexity and not go into any detail.

To overexplain the part I did cover, though, here is [my understanding of what happens if you hear the first few notes of a familiar song] in as much detail as I can muster:

1. The notes get translated into patterns of axons firing, are recognized by V1, and passed up the hierarchy.
2. At some point, one of the regions notices that the pattern it receives corresponds to the beginning of a pattern which represents the entire song (or perhaps a section of the song). This is good enough, so it sends the name for the entire song/section upward. (This step corresponds to the fact that memory is auto-associative.)
3. The upshot of 1-2 is that you now recognize what song you're hearing.
4. Since the neocortex predicts things constantly, it also tries to predict the upcoming auditory inputs.
5. To do this, it transforms the high-level pattern corresponding to the song or section to a low-level pattern.
6. There are several possible ways to do this, so the neocortex uses the incoming information of the precise note that's playing to decide which specific variation to choose. This is supported by the neocortex's architecture.
7. Using this mechanism, information flows down the hierarchy into V1, where a pattern corresponding to the precise next note fires -- and does so before you hear it.
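
The seven steps above can be sketched as a toy program. This is only an illustration of the logic as I've described it, under loud assumptions: songs are stored as lists of semitone steps between consecutive notes (one possible 'invariant form'), notes are plain integers such as MIDI numbers, and the names `SONG_MEMORY`, `recognize`, and `predict_next_note` as well as the two made-up songs are hypothetical. Nothing here is meant as a model of how neurons actually do it.

```python
from typing import Dict, List, Optional

# Hypothetical stored songs, kept in an invariant form: the semitone steps
# between consecutive notes rather than the notes themselves.
SONG_MEMORY: Dict[str, List[int]] = {
    "song_a": [2, 2, -4, 5, -1],
    "song_b": [2, -1, 3, 3, -2],
}


def to_intervals(notes: List[int]) -> List[int]:
    """Convert absolute pitches into the steps between neighbouring notes."""
    return [b - a for a, b in zip(notes, notes[1:])]


def recognize(notes: List[int]) -> Optional[str]:
    """Steps 1-3: recall a whole song from its beginning (auto-association)."""
    heard = to_intervals(notes)
    matches = [
        name for name, steps in SONG_MEMORY.items()
        if len(steps) > len(heard) and steps[:len(heard)] == heard
    ]
    # Only commit to a name when exactly one stored pattern fits.
    return matches[0] if len(matches) == 1 else None


def predict_next_note(notes: List[int]) -> Optional[int]:
    """Steps 4-7: turn the invariant pattern back into one precise note."""
    name = recognize(notes)
    if name is None:
        return None
    next_step = SONG_MEMORY[name][len(notes) - 1]
    # The last note actually heard selects the specific variation.
    return notes[-1] + next_step


print(predict_next_note([60, 62, 64]))  # original key
print(predict_next_note([61, 63, 65]))  # same song, one semitone higher
```

The two calls share no input notes, yet both are recognized as the same stored song, and each prediction lands in the key that was actually heard (60 for the first, 61 for the second).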
Relatedly, both Jeff and Steve say that about ten times as many connections flow down the hierarchy as up (except that Steve's model doesn't include a strict hierarchy). Prediction is important. These connections flowing 'down' are called feedback, which is extremely confusing since they are the predictions, and the other direction, called feed-forward, is the feedback (in the common sense of the word) for these predictions.

One last detail for part two: the different parts of the cortex are not separate; rather, there are additional 'association' areas that merge several kinds of inputs. In fact, Jeff writes that most of the neocortex consists of association areas. This corresponds to the fact that inputs of one sense can trigger predictions about inputs of other senses. (If you see the train approaching, your neocortex will predict that you will also hear it soon.)

Part three: Consciousness, Creativity, and the Future of Intelligence

The final stretch of the book goes back to being non-technical and easily accessible.

The section on creativity is another place where I've drawn a strong connection to one of the posts I've written recently, this time the one on intuition. As far as I can tell, Jeff essentially makes the same point I made (which is that there is no meaningful separation, rather it's intuition all the way down), except that he calls it 'creativity'. Now, on the one hand, it honestly seems to me that the use of 'creativity' here is just a confused way of referring to a concept that really wants to be called intuition.[2] On the other hand, maybe I'm biased. Anyone who has read both pieces may be the judge.

The part about consciousness doesn't seem to me to be too interesting. Jeff wants to explain away the hard problem, simply stating that 'consciousness is what it feels like to have a neocortex'.
He does spend a bit of time on why our input senses appear to us to be so different, even though they're all just patterns, which doesn't feel like one of the problems I would lose sleep over, but perhaps that's just me.

In 'The future of Intelligence', Jeff casually assures us that there is nothing to worry about from smart machines because they won't be anything like humans (and thus won't feel resentment for being enslaved). This was my favorite part of the book as it allowed me to reorient my career: instead of pursuing the speculative plan of writing about Factored Cognition in the hopes of minimally contributing to AI risk reduction (pretty silly given that AI risk doesn't exist), my new plan is to apply to a company that writes software for self-parking cars.[3] Thanks, Jeff!

Appendix: 11 Testable Predictions!

... that I cannot evaluate because they're all narrowly about biology, but I still wanted to give Jeff credit for including them. They don't have probabilities attached.

Questions I (still) have:

Here are some:

• How much support in the literature is there for the hierarchical structure? Jeff sure makes it sound like it's a closed case, but he rarely tells me which parts of his theory are certain and which are speculative.
• If there is a hierarchical structure, why do all of our senses appear uniform -- especially given that most of the rest of the model agrees with introspection (which suggests that introspection is reliable)?
• What is the difference between patterns and memory? My understanding of what Jeff says is that they're the same thing.
• Where is the feeling that we are free to steer our thoughts to whatever topic we please coming from? I get that it's evolutionarily adaptive, but that's the why, not the how.

Verdict: should you read this book?

Maybe? Probably not? I'm not sure. Depends on how understandable the review was. Maybe if you want more details about the chapter on the neocortex in particular.
In any case, I think Steve's writing is altogether better, so if anything, I would only recommend the book if you've already read at least these two posts.

Note that Jeff has a new book coming out on 2021/03/02; it will be called A Thousand Brains: A New Theory of Intelligence.

1. To be more specific, you generally don't say anything is invariant per se, but that it's invariant under some specific transformation. E.g., the parabola $f$ defined by $f(x) := x^2$ is invariant under the transformation $F$ defined by $F(f)(x) := f(-x)$. ↩︎

2. As I see it, 'creativity' refers to a property of the output, whereas intuition appears to refer to a mode of thinking. However, the two are closely linked in that the 'creativity' label almost requires that the output was created by 'intuition'. Thus, if you look at it at the level of outputs, then creativity looks like a subset of intuition. Conversely, I would not consider intuition a subset of creativity, and the cases where something is done via intuition but is not creative are exactly those where Jeff's explanation seems to me to fail. For example, he talks about the case of figuring out where the bathroom is in a restaurant you're visiting for the first time. To do this, you have to generalize from information about bathrooms in previous restaurants. Is this creativity? I would say no, but Jeff says yes. Is it intuition? I would say yes. In a nutshell, this is why I feel like he is talking about intuition while calling it creativity: I think the set of things he calls creativity is very similar to the set of things most people call intuition, and less similar to the set of things most people call creativity. ↩︎

3. By which I mean that the chapter has not caused me to update my position since it doesn't address any of the reasons why I believe that AI risk is high.
↩︎

### Dario Amodei leaves OpenAI

December 29, 2020 - 22:31
Published on December 29, 2020 7:31 PM GMT

This is a linkpost for https://openai.com/blog/organizational-update/

“We are incredibly thankful to Dario for his contributions over the past four and a half years. We wish him and his co-founders all the best in their new project, and we look forward to a collaborative relationship with them for years to come,” said OpenAI chief executive Sam Altman.

Anyone know what the new project is?

### Against GDP as a metric for timelines and takeoff speeds

December 29, 2020 - 20:42
Published on December 29, 2020 5:42 PM GMT

Or: Why AI Takeover Might Happen Before GDP Accelerates, and Other Thoughts On What Matters for Timelines and Takeoff Speeds

[Epistemic status: Strong opinion, lightly held]

I think world GDP (and economic growth more generally) is overrated as a metric for AI timelines and takeoff speeds. Here are some uses of GDP that I disagree with, or at least think should be accompanied by cautionary notes:

• Timelines: Ajeya Cotra thinks of transformative AI as “software which causes a tenfold acceleration in the rate of growth of the world economy (assuming that it is used everywhere that it would be economically profitable to use it).” I don’t mean to single her out in particular; this seems like the standard definition now. And I think it's much better than one prominent alternative, which is to date your AI timelines to the first time world GDP (GWP) doubles in a year!
• Takeoff Speeds: Paul Christiano argues for Slow Takeoff. He thinks we can use GDP growth rates as a proxy for takeoff speeds. In particular, he thinks Slow Takeoff ~= GWP doubles in 4 years before the start of the first 1-year GWP doubling. This proxy/definition has received a lot of uptake.
• Timelines: David Roodman’s excellent model projects GWP hitting infinity in median 2047, which I calculate means TAI in median 2037.
To be clear, he would probably agree that we shouldn’t use these projections to forecast TAI, but I wish to add additional reasons for caution.
• Timelines: I’ve sometimes heard things like this: “GWP growth is stagnating over the past century or so; hyperbolic progress has ended; therefore TAI is very unlikely.”
• Takeoff Speeds: Various people have said things like this to me: “If you think there’s a 50% chance of TAI by 2032, then surely you must think there’s close to a 50% chance of GWP growing by 8% per year by 2025, since TAI is going to make growth rates go much higher than that, and progress is typically continuous.”
• Both: Relatedly, I sometimes hear that TAI can’t be less than 5 years away, because we would have seen massive economic applications of AI by now: AI should be growing GWP at least a little already, if it is to grow it by a lot in a few years.

First, I’ll argue that GWP is only tenuously and noisily connected to what we care about when forecasting AI timelines. Specifically, the point of no return is what we care about, and there’s a good chance it’ll come years before GWP starts to increase. It could also come years after, or anything in between.

Then, I’ll argue that GWP is a poor proxy for what we care about when thinking about AI takeoff speeds as well. This follows from the previous argument about how the point of no return may come before GWP starts to accelerate. Even if we bracket that point, however, there are plausible scenarios in which a slow takeoff has fast GWP growth and in which a fast takeoff has slow GWP growth.

Timelines

I’ve previously argued that for AI timelines, what we care about is the “point of no return,” the day we lose most of our ability to reduce AI risk. This could be the day advanced unaligned AI builds swarms of nanobots, but probably it’ll be much earlier, e.g. the day it is deployed, or the day it finishes training, or even years before then when things go off the rails due to less advanced AI systems.
(Of course, it probably won’t literally be a day; probably it will be an extended period where we gradually lose influence over the future.) Now, I’ll argue that in particular, an AI-induced potential point of no return (PONR for short) is reasonably likely to come before world GDP starts to grow noticeably faster than usual. Disclaimer: These arguments aren’t conclusive; we shouldn’t be confident that the PONR will precede GWP acceleration. It’s entirely possible that the PONR will indeed come when GWP starts to grow noticeably faster than usual, or even years after that. (In other words, I agree that the scenarios Paul and others sketch are also plausible.) This just proves my point though: GDP is only tenuously and noisily connected to what we care about. Argument that AI-induced PONR could precede GWP acceleration GWP acceleration is the effect, not the cause, of advances in AI capabilities. I agree that it could also be a cause, but I think this is very unlikely: what else could accelerate GWP? Space mining? Fusion power? 3D printing? Even if these things could in principle kick the world economy into faster growth, it seems unlikely that this would happen in the next twenty years or so. Robotics, automation, etc. plausibly might make the economy grow faster, but if so it will be because of AI advances in vision, motor control, following natural language instructions, etc. So I conclude: GWP growth will come some time after we get certain GWP-growing AI capabilities. (Tangent: This is one reason why we shouldn’t use GDP extrapolations to predict AI timelines. It’s like extrapolating global mean temperature trends into the future in order to predict fossil fuel consumption.) An AI-induced point of no return would also be the effect of advances in AI capabilities. So, as AI capabilities advance, which will come first: The capabilities that cause a PONR, or the capabilities that cause GWP to accelerate? How much sooner will one arrive than the other? 
How long does it take for a PONR to arise after the relevant capabilities are reached, compared to how long it takes for GWP to accelerate after the relevant capabilities are reached? Notice that already my overall conclusion—that GWP is a poor proxy for what we care about—should seem plausible. If some set of AI capabilities causes GWP to grow after some time lag, and some other set of AI capabilities causes a PONR after some time lag, the burden of proof is on whoever wants to claim that GWP growth and the PONR will probably come together. They’d need to argue that the two sets of capabilities are tightly related and that the corresponding time lags are similar also. In other words, variance and uncertainty are on my side. Here is a brainstorm of scenarios in which an AI-induced PONR happens prior to GWP growth, either because GWP-growing capabilities haven’t been invented yet or because they haven’t been deployed long and widely enough to grow GWP. 1. Fast Takeoff (Agenty AI goes FOOM). 1. Maybe it turns out that all the strategically relevant AI skills are tightly related after all, such that we go from a world where AI can't do anything important, to a world where it can do everything but badly and expensively, to a world where it can do everything well and cheaply. 2. In this scenario, GWP acceleration will probably be (shortly) after the PONR. We might as well use “number of nanobots created” as our metric. 3. (As an aside, I think I’ve got a sketch of a fork argument here: Either the strategically relevant AI skills come together, or they don’t. To the extent that they do, the classic AGI fast takeoff story is more likely and so GWP is a silly metric. To the extent that they don’t, we shouldn’t expect GWP acceleration to be a good proxy for what we care about, because the skills that accelerate the economy could come before or after the skills that cause PONR.) 2. 
Agenty AI successfully carries out a political or military takeover of the relevant parts of the world, before GWP starts to accelerate. 1. Maybe it turns out that the sorts of skills needed to succeed in politics or war are easier to develop than the sorts needed to accelerate the entire world economy. We’ve been surprised before by skills which we thought difficult appearing before skills which we thought easy; maybe it’ll happen again. 2. AI capabilities tend to appear first in very expensive AIs; the price is gradually reduced due to compute cost decreases and algorithmic efficiency gains. Maybe accelerating the entire world economy involves automating many jobs currently done by humans, which requires advanced AIs being cheaper than humans. Whereas maybe being the power behind the throne of a political faction can be done by a single very smart AI that costs a million dollars a day to run. 3. Maybe it turns out that it takes a few years to accelerate the entire world economy, but less than that to carry out a political or military takeover. For example, maybe accelerating the world economy requires designing loads of new products and robots, testing them, and then scaling up industrial production of them (building factories, chip fabs, etc.) and then using them to do stuff. And maybe there are legacy institutions you need to disrupt and compete against, that have already paid their fixed costs and raised high barriers to entry for their markets. So maybe the effects of new AI capabilities on GWP are in general delayed by several years, whereas the effects of new AI capabilities on the political or military landscape are felt sooner. (I’ve previously argued that you don’t need a large technological advantage to take over the world, because you can start small and scale up. It took Cortes two years to conquer Mexico; wars are often quicker than that. Yes, often they are slower, too, but variance supports my overall argument. 
And I suspect politics is even faster, and more high-variance, than war.) 3. ...or at least neutralizes the opposition. 1. The preceding section made it sound like AIs were rolling robot armies through the streets, or at least puppet-controlling the US president. But actually the bar is much lower than that; if an unaligned AI can neutralize their opposition, they win, since time is thenceforth on their side. Currently your opposition, if you are an unaligned AI, is a tiny group of researchers and weirdos that likes to write blog posts about you. It might not be hard to convince the powers that be to continue ignoring this group. At least, it’s plausibly a much easier and quicker task than significantly accelerating the entire world economy! For example, maybe the AI safety scheme that was used to produce you is defective (that’s why you are unaligned). Well, all you have to do is convince the powers that be that that scheme worked fine, and that the people calling for additional safeguards are too paranoid. Since presumably there will already be many high-status humans saying this, all you have to do is support them. 2. Perhaps you also are opposed by other unaligned AIs. But it’s arguably in the common interest of all unaligned AIs to neutralize human AI safety advocates, and once that happens, humans lose one way or another. European empires were able to do their conquering while simultaneously fighting each other; I don’t think we humans can count on divergent interests between AIs somehow making things work out fine for us. 4. As above, but with humans + tool AI instead of agenty AI, where the humans can’t be convinced to care sufficiently much about the right kinds of AI risks. 1. Weaker or non-agenty AI systems could still cause a PONR if they are wielded by the right groups of humans. For example, maybe there is some major AI corporation or government project that is dismissive of AI risk and closed-minded about it. 
And maybe they aren’t above using their latest AI capabilities to win the argument. (We can also imagine more sinister scenarios, but I think those are less likely.) 5. Hoarding tech 1. Maybe we end up in a sort of cold war between global superpowers, such that most of the world’s quality-weighted AI research is not for sale. GWP could be accelerating, but it isn’t, because the tech is being hoarded. 6. AI persuasion tools cause a massive deterioration of collective epistemology, making it vastly more difficult for humanity to solve AI safety and governance problems. 1. See this post. 7. Vulnerable world scenarios: 1. Maybe causing an existential catastrophe is easier, or quicker, than accelerating world GWP growth. Both seem plausible to me. For example, currently there are dozens of actors capable of causing an existential catastrophe but none capable of accelerating world GWP growth. 2. Maybe some agenty AIs actually want existential catastrophe—for example, if they want to minimize something, and think they may be replaced by other systems that don’t, blowing up the world may be the best they can do in expectation. Or maybe they do it as part of some blackmail attempt. Or maybe they see this planet as part of a broader acausal landscape, and don’t like what they think we’d do to the landscape. Or maybe they have a way to survive the catastrophe and rebuild. 3. Failing that, maybe some humans create an existential catastrophe by accident or on purpose, if the tools to do so proliferate. 8. R&D tool “sonic boom” (Related to but different from the sonic boom discussed here) 1. Maybe we get a sort of recursive R&D automation/improvement scenario, where R&D tool progress is fast enough that by the time the stuff capable of accelerating GWP past 3%/yr has actually done so, a series of better and better things have been created, at least one of which has PONR-causing capabilities with a very short time-till-PONR. 9. Unknown unknowns 1. 
There are probably things I missed, see here and here for ideas. The point is, there’s more than one scenario. This makes it more likely that at least one of these potential PONRs will happen before GWP accelerates. As an aside, over the past two years I’ve come to believe that there’s a lot of conceptual space to explore that isn’t captured by the standard scenarios (what Paul Christiano calls fast and slow takeoff, plus maybe the CAIS scenario, and of course the classic sci-fi “no takeoff” scenario). This brainstorm did a bit of exploring, and the section on takeoff speeds will do a little more. Historical precedents In the previous section, I sketched some possibilities for how an AI-related point of no return could come before AI starts to noticeably grow world GDP. In this section, I’ll point to some historical examples that give precedents for this sort of thing. Earlier I said that a godlike advantage is not necessary for takeover; you can scale up with a smaller advantage instead. And I said that in military conquests this can happen surprisingly quickly, sometimes faster than it takes for a superior product to take over a market. Is there historical precedent for this? Yes. See my aforementioned post on the conquistadors (and maybe these somewhat-relevant posts). OK, so what was happening to world GDP during this period? Here is the history of world GDP for the past ten thousand years, on the red line. (This is taken from David Roodman’s GWP model) The black line that continues the red line is the model’s median projection for what happens next; the splay of grey shades represent 5% increments of probability mass for different possible future trajectories. I’ve added a bunch of stuff for context. The vertical green lines are some dates, chosen because they were easy for me to calculate with my ruler. The tiny horizontal green lines on the right are the corresponding GWP levels. The tiny red horizontal line is GWP 1,000 years before 2047. 
The short vertical blue line is when the economy is growing fast enough, on the median projected future, such that insofar as AI is driving the growth, said AI qualifies as transformative by Ajeya's definition. See this post for more explanation of the blue lines. What I wish to point out with this graph is: We’ve all heard the story of how European empires had a technological advantage which enabled them to conquer most of the world. Well, most of that conquering happened before GWP started to accelerate! If you look at the graph at the 1700 mark, GWP is seemingly on the same trend it had been on since antiquity. The industrial revolution is said to have started in 1760, and GWP growth really started to pick up steam around 1850. But by 1700 most of the Americas, the Philippines and the East Indies were directly ruled by European powers, and more importantly the oceans of the world were European-dominated, including by various ports and harbor forts European powers had conquered/built all along the coasts of Africa and Asia. Many of the coastal kingdoms in Africa and Asia that weren’t directly ruled by European powers were nevertheless indirectly controlled or otherwise pushed around by them. In my opinion, by this point it seems like the “point of no return” had been passed, so to speak: At some point in the past--maybe 1000 AD, for example--it was unclear whether, say, Western or Eastern (or neither) culture/values/people would come to dominate the world, but by 1700 it was pretty clear, and there wasn’t much that non-westerners could do to change that. (Or at least, changing that in 1700 would have been a lot harder than in 1000 or 1500.) 
Paul Christiano once said that he thinks of Slow Takeoff as “Like the Industrial Revolution, but 10x-100x faster.” Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!

That said, we shouldn’t rely heavily on historical analogies like this. We can probably find other cases that seem analogous too, perhaps even more so, since this is far from a perfect analogue. (e.g. what’s the historical analogue of AI alignment failure? Corporations becoming more powerful than governments? “Western values” being corrupted and changing significantly due to the new technology? The American Revolution?) Also, maybe one could argue that this is indeed what’s happening already: the Internet has connected the world much as sailing ships did, Big Tech dominates the Internet, etc. (Maybe AI = steam engines, and computers+internet = ships+navigation?)

But still. I think it’s fair to conclude that if some of the scenarios described in the previous section do happen, and we get powerful AI that pushes us past the point of no return prior to GWP accelerating, it won’t be totally inconsistent with how things have gone historically.

(I recommend the history book 1493, it has a lot of extremely interesting information about how quickly and dramatically the world economy was reshaped by colonialism and the “Columbian Exchange.”)

Takeoff speeds

What about takeoff speeds? Maybe GDP is a good metric for describing the speed of AI takeoff? I don’t think so.

Here is what I think we care about when it comes to takeoff speeds:

1. Warning shots: Before there are catastrophic AI alignment failures (i.e. PONRs) there are smaller failures that we can learn from.
2. Heterogeneity: The relevant AIs are diverse, rather than e.g. all fine-tuned copies of the same pre-trained model. (See Evan’s post)
3. Risk Awareness: Everyone is freaking out about AI in the crucial period, and lots more people are lots more concerned about AI risk.
4. Multipolar: AI capabilities progress is widely distributed in the crucial period, rather than concentrated in a few projects.
5. Craziness: The world is weird and crazy in the crucial period, lots of important things happening fast, the strategic landscape is different from what we expected thanks to new technologies and/or other developments.

I think that the best way to define slow(er) takeoff is as the extent to which conditions 1-5 are met. This is not a definition with precise resolution criteria, but that’s OK, because it captures what we care about. Better to have to work hard to precisify a definition that captures what we care about, than to easily precisify a definition that doesn’t! (More substantively, I am optimistic that we can come up with better proxies for what we care about than GWP. I think we already have to some extent; see e.g. operationalizations 5 and 6 here.) As a bonus, this definition also encourages us to wonder whether we’ll get some of 1-5 but not others.

What do I mean by “the crucial period?”

I think we should define the crucial period as the period leading up to the first major AI-induced potential point of no return. (Or maybe, as the aggregate of the periods leading up to the major potential points of no return). After all, this is what we care about. Moreover there seems to be some level of consensus that crazy stuff could start happening before human-level AGI. I certainly think this.

So, I’ve argued for a new definition of slow takeoff, that better captures what we care about. But is the old GWP-based definition a fine proxy? No, it is not, because the things that cause PONR can be different from the things which cause GWP acceleration, and they can come years apart too. Whether there are warning shots, heterogeneity, risk awareness, multipolarity, and craziness in the period leading up to PONR is probably correlated with whether GWP doubles in four years before the first one-year doubling. But the correlation is probably not super strong. Here are two scenarios, one in which we get a slow takeoff by my definition but not by the GWP-based definition, and one in which the opposite happens:

Slow Takeoff Fast GWP Acceleration Scenario: It turns out there’s a multi-year deployment lag between the time a technology is first demonstrated and the time it is sufficiently deployed around the world to noticeably affect GWP. There’s also a lag between when a deceptively aligned AGI is created and when it causes a PONR… but it is much smaller, because all the AGI needs to do is neutralize its opposition. So PONR happens before GWP starts to accelerate, even though the technologies that could boost GWP are invented several years before AGI powerful enough to cause a PONR is created. But takeoff is slow in the sense I define it; by the time AGI powerful enough to cause a PONR is created, everyone is already freaking out about AI thanks to all the incredibly profitable applications of weaker AI systems, and the obvious and accelerating trends of research progress. Also, there are plenty of warning shots, the strategic situation is very multipolar and heterogeneous, etc. Moreover, research progress starts to go FOOM a short while after powerful AGIs are created, such that by the time the robots and self-driving cars and whatnot that were invented several years ago actually get deployed enough to accelerate GWP, we’ve got nanobot swarms. GWP goes from 3% growth per year to 300% without stopping at 30%.

Fast Takeoff Slow GWP Acceleration Scenario: It turns out you can make smarter AIs by making them have more parameters and training them for longer. So the government decides to partner with a leading tech company and requisition all the major computing centers in the country. With this massive amount of compute and research talent, they refine and scale up existing AI designs that seem promising, and lo! A human-level AGI is created. Alas, it is so huge that it costs \$10,000 per hour of subjective thought. Moreover, it has a different distribution over skills compared to humans: it tends to be more rational, not having evolved in an environment that rewards irrationality. It tends to be worse at object recognition and manipulation, but better at poetry, science, and predicting human behavior. It has some flaws and weak points too, more so than humans. Anyhow, unfortunately, it is clever enough to neutralize its opposition. In a short time, the PONR is passed. However, GWP doubles in four years before it doubles in one year. This is because (a) this AGI is so expensive that it doesn’t transform the economy much until either the cost comes way down or capabilities go way up, and (b) progress is slowed by bottlenecks, such as acquiring more compute and overcoming various restrictions placed on the AGI. (Maybe neutralizing the opposition involved convincing the government that certain restrictions and safeguards would be sufficient for safety, contra the hysterical doomsaying of parts of the AI safety community. But overcoming those restrictions in order to do big things in the world takes time.)
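To make the growth figures in this post easier to compare, here is a minimal Python sketch that converts annual growth rates into doubling times and checks the GWP-based slow-takeoff criterion (a 4-year doubling completing before the first 1-year doubling starts) on a yearly series. The function names and the two growth-rate paths are hypothetical, chosen only for illustration; they are not forecasts and are not taken from any of the models cited above.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.03 means 3%/yr)."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_for_doubling(years: float) -> float:
    """Constant annual growth rate that doubles output in `years` years."""
    return 2 ** (1 / years) - 1


def path_from_rates(rates: list[float], start: float = 1.0) -> list[float]:
    """Turn a list of yearly growth rates into a yearly GWP series."""
    path = [start]
    for rate in rates:
        path.append(path[-1] * (1 + rate))
    return path


def four_year_doubling_comes_first(gwp: list[float]) -> bool:
    """True if some 4-year doubling completes before the first 1-year doubling starts."""
    one_year_starts = [i for i in range(len(gwp) - 1) if gwp[i + 1] >= 2 * gwp[i]]
    # If GWP never doubles within one year, any completed 4-year doubling counts.
    cutoff = one_year_starts[0] if one_year_starts else len(gwp) - 1
    return any(gwp[j + 4] >= 2 * gwp[j] for j in range(cutoff - 3))


# Hypothetical illustrative paths (assumptions, not predictions).
gradual_rates = [0.03, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 1.0, 3.0]
abrupt_rates = [0.03, 0.03, 0.03, 0.03, 0.03, 3.0]  # 3%/yr straight to 300%/yr

if __name__ == "__main__":
    print(round(doubling_time_years(0.03), 1))   # about 23.4 years at 3%/yr
    print(round(growth_for_doubling(4), 3))      # about 0.189, i.e. ~19%/yr
    print(four_year_doubling_comes_first(path_from_rates(gradual_rates)))
    print(four_year_doubling_comes_first(path_from_rates(abrupt_rates)))
```

Note that the check only looks at the GWP series itself. It says nothing about when a point of no return occurs, which is exactly the gap the two scenarios above are meant to highlight.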

Acknowledgments: Thanks to the people who gave comments on earlier drafts, including Katja Grace, Carl Shulman, and Max Daniel. Thanks to Amogh Nanjajjar for helping me with some literature review.


### Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

December 29, 2020 - 16:33
Published on December 29, 2020 1:33 PM GMT

Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networks are sufficiently expressive to compute almost any kind of function. Moreover, most functions that fit a given set of training data will not generalise well to new data. And yet, if we train a neural network we will usually obtain a function that gives good generalisation. What is the mechanism behind this phenomenon?

There has been some recent research which (I believe) sheds some light on this issue. I would like to call attention to this blog post:

Neural Networks Are Fundamentally Bayesian

This post provides a summary of the research in these three papers, which provide a candidate for a theory of generalisation:

https://arxiv.org/abs/2006.15191
https://arxiv.org/abs/1909.11522
https://arxiv.org/abs/1805.08522

(You may notice that I had some involvement with this research, but the main credit should go to Chris Mingard and Guillermo Valle-Perez!)
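As a rough illustration of the kind of bias these papers study (the map from parameters to functions favouring some functions far more than others), here is a toy Python sketch. It samples random weights for a tiny ReLU network on 3-bit inputs and counts how often each Boolean function appears. Everything about the setup (network size, unit-variance Gaussian weights, sample count, function names) is an assumption made for illustration; it is not the experimental setup used in the papers.

```python
import random
from collections import Counter


def sample_function(rng: random.Random, n_inputs: int = 3, n_hidden: int = 8) -> str:
    """Draw random weights for a one-hidden-layer ReLU net; return its truth table."""
    w = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b = [rng.gauss(0, 1) for _ in range(n_hidden)]
    v = [rng.gauss(0, 1) for _ in range(n_hidden)]
    c = rng.gauss(0, 1)
    bits = []
    for idx in range(2 ** n_inputs):
        x = [(idx >> k) & 1 for k in range(n_inputs)]
        hidden = [
            max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)
        ]
        out = sum(vj * hj for vj, hj in zip(v, hidden)) + c
        bits.append("1" if out > 0 else "0")  # threshold the output at zero
    return "".join(bits)


def function_frequencies(n_samples: int = 5000, seed: int = 0) -> Counter:
    """Count how often each of the 256 possible 3-input Boolean functions is sampled."""
    rng = random.Random(seed)
    return Counter(sample_function(rng) for _ in range(n_samples))


if __name__ == "__main__":
    counts = function_frequencies()
    total = sum(counts.values())
    print("uniform share per function:", round(1 / 256, 4))
    for table, n in counts.most_common(5):
        print(table, round(n / total, 4))
```

If sampling were uniform over functions, each truth table would appear about 0.4% of the time. In this toy setup a handful of very simple functions (such as the constant ones) should take up a much larger share, which is the flavour of prior the linked post argues carries over to trained networks.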

I believe that research of this type is very relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems, and it is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data. Understanding their inductive bias seems particularly important, since this is the key to understanding everything from why they work in the first place, to phenomena such as adversarial examples, to the risk of mesa-optimisation. I hence believe that it makes sense for alignment researchers to keep an eye on what is happening in this space.

If you want some more stuff to read in this genre, I can also recommend these two posts:

Recent Progress in the Theory of Neural Networks
Understanding "Deep Double Descent"


### The map and territory of NFT art

December 29, 2020 - 14:45
Published on December 29, 2020 11:45 AM GMT

I’ve recently become aware of the world of non-fungible tokens.

Wikipedia puts it as:

A non-fungible token (NFT) is a special type of cryptographic token which represents something unique; non-fungible tokens are thus not mutually interchangeable.

Non-fungible tokens are used to create verifiable digital scarcity, as well as digital ownership, and the possibility of asset interoperability across multiple platforms.

The application of NFTs that I found the most thought-provoking is in digital art. There are websites, such as opensea.io, that have listings of NFT-based digital paintings.

On that website, you can buy any of the listed NFT-based paintings and become the proud owner of the “original” version of a digital painting. People are paying outrageous sums of money for this. An NFT-portrait of Ethereum co-founder Vitalik Buterin dressed like a medieval harlequin recently sold for \$141,536.20. The NFT-art market is absolutely booming.

What’s the punchline? That the “original” version of a digital painting is pixel-by-pixel identical with any of its copies. On opensea.io you can go ahead and download a perfect .jpeg copy of any of the listed paintings for free. You can download the portrait of Vitalik Buterin right here, and set it as your wallpaper. It’ll be the exact same portrait that the buyer paid 141k for. There is absolutely zero material difference between the original and the copy, except for the fact that, in some technical fuzzy way, one is the original and one isn’t.
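A toy sketch may make the “technical fuzzy way” more concrete. The Python below is a hypothetical in-memory ledger, not the API of any real blockchain or marketplace: `ToyLedger`, its methods, and the placeholder image bytes are all invented for illustration. It shows that a downloaded copy can be byte-for-byte identical to the tokenized image while only the ledger entry records who owns the “original”.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyLedger:
    """Hypothetical in-memory token registry (illustration only, not a real NFT API)."""
    owners: dict = field(default_factory=dict)          # token id -> owner name
    content_hashes: dict = field(default_factory=dict)  # token id -> SHA-256 of the image
    next_id: int = 1

    def mint(self, owner: str, image_bytes: bytes) -> int:
        """Create a new token pointing at the image's hash and assign it to `owner`."""
        token_id = self.next_id
        self.next_id += 1
        self.owners[token_id] = owner
        self.content_hashes[token_id] = hashlib.sha256(image_bytes).hexdigest()
        return token_id

    def transfer(self, token_id: int, seller: str, buyer: str) -> None:
        """Move a token to `buyer`; only the current owner may do this."""
        if self.owners.get(token_id) != seller:
            raise PermissionError("only the current owner can transfer a token")
        self.owners[token_id] = buyer

    def owner_of(self, token_id: int) -> str:
        return self.owners[token_id]


# Placeholder bytes standing in for the image file.
portrait = b"placeholder bytes for a digital portrait"
downloaded_copy = bytes(portrait)  # a free, perfect copy

ledger = ToyLedger()
token = ledger.mint("artist", portrait)
ledger.transfer(token, "artist", "buyer")

# The copy is indistinguishable from the tokenized image...
same_content = hashlib.sha256(downloaded_copy).hexdigest() == ledger.content_hashes[token]
# ...but the ledger records exactly one owner of the token.
recorded_owner = ledger.owner_of(token)
```

In this sketch the only thing that separates “original” from “copy” is a row in the registry; the image data plays no role in the distinction.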

Most people are either perplexed when hearing this, or react with scorn. After all, it’s intuitive to think that digital paintings being infinitely and perfectly replicable defeats the point of paying for an “original” version.

My reaction is that NFT-paintings are a wonderful reduction of the idea of “originality”, and it’s a great exercise to analyze the phenomenon through the lens of map and territory.

Let’s pick a more traditional example: what makes DaVinci’s Mona Lisa worth more than a replica of the Mona Lisa?

One can point out a bunch of different factors at play here. DaVinci’s Mona Lisa is more valuable because replicas were made by lesser artists and result in inferior paintings that don’t captivate the viewers as much. DaVinci’s Mona Lisa is more valuable because it’s the only one that has the exact strokes of paint that DaVinci plastered on the canvas, while the replicas look subtly different. DaVinci’s Mona Lisa is more valuable because of its history, because it’s the painting that was in Napoleon’s bathroom, and the painting that was stolen and returned to the Louvre, while the replicas aren’t. DaVinci’s Mona Lisa is more valuable because the atoms that make it up are the same atoms that were there when DaVinci finished painting it. DaVinci’s Mona Lisa is more valuable because everyone agrees that it’s “the original” Mona Lisa, and everyone agrees that “the original” anything should be more valuable.

Let’s unpack this.

Let’s say scientists devised a way to create a perfect copy of a painting. They manufactured a machine that can read the exact configuration of atoms in DaVinci’s Mona Lisa, and create a perfect “clone” with the exact same type of atoms arranged in the exact same order. Would those clones be just as valuable as the Mona Lisa that hangs in the Louvre?

Certainly not. Though the boundary between original and replica does get blurred a bit in this scenario, the original Mona Lisa would still have a lot of fuzzy historical/sentimental value attached to it, as those atoms/that painting (and not its perfect copy) is the one that DaVinci painted. But this does tell us that its value is not derived, for the most part, from the fact that the Mona Lisa looks different from the replicas.

We could pose even more convoluted thought experiments. What if the machine, instead of copying the configuration of atoms, split each atom of the original Mona Lisa into two, in some sense creating two “originals”? But I think these kinds of thought experiments just expose that there’s no good answer at this level, and that this exercise misses the point.

People have a map that says that there are originals and replicas, and that the former are worth more than the latter. We drew this map to navigate the uncomplicated, familiar terrain of painters and paintings, sculptors and sculptures, seamstresses and garments. Once we tread onto such unfamiliar territory that we are questioning what happens if we split atoms, we can assume that our model of which painting is rightfully the “original” has broken down. The map is too simple to reflect the idiosyncrasies of the territory.

Originality at this level is a useless term, and at the end of the day when putting a price on a painting, what matters is which painting everyone agrees is the original, fuzzy epistemological reasons be damned. It was never about the atoms; the map is not the territory.

What’s especially interesting about this thought experiment is that it shows how maps persist, even when the territory that gave rise to these maps changes. There’s a viscosity to them; they don’t just disappear from our collective minds as soon as they stop being useful.

We can think of all the aforementioned factors (how replicas tend to be of lower quality, how replicas don’t have rich histories, how replicas aren’t hung in fancy museums, etc.) as characteristics of the territory that gave rise and credence to the map in the first place. Because we humans observed, in a bunch of different contexts, that originals tend to have a bunch of characteristics that make them better in many ways than replicas, we built a map that makes it intuitively obvious that it’s important to distinguish originals from replicas, and that originals should be considered more valuable.

Even if we design a thought experiment that takes away literally everything that makes an original better than a replica, in the real world with real people who have pre-existing beliefs, the “originality” map lingers. People still are willing to pay more for the original.

Which brings me back to NFT paintings. In some sense, the fact that crypto art is made up of non-fungible tokens makes an “original” CryptoKitty and a “replica” CryptoKitty not mutually interchangeable. It’s similar, in a way, to our thought experiment about a Mona Lisa copy with an identical configuration of atoms.

But the subtler insight here is that this fuzzy, super-technical, inscrutable-to-laymen claim to originality is not the territory. It’s an excuse for us humans to reach for our familiar model that “there’s the original CryptoKitty and there’s the copy”, and as soon as we buy into it, it’s enough to trigger our “originals are more valuable” mental routine, and the original will be worth 200k, while a bit-by-bit perfect copy will be worthless. The kicker here is that we’ll be just as able to derive meaning from owning the “original”; it’ll be satisfying in the same way that hanging the Mona Lisa in your living room would.

I find myself reacting to this with both fascination at how we humans can derive meaning from almost anything, and scorn at how silly this all is.

Let’s end this with a few more reflections.

It’s interesting to note that while the question of what makes something original is a great one in some sense (I did write an entire essay about it), it’s also banal in the sense that it’s immediately defused by having a notion of map and territory. Watching people be perplexed by the NFT phenomenon makes the widespread confusion between map and territory very vivid in my mind.

It’s also cool to see how a model slowly propagates through society. Right now, there’s a minuscule portion of the population that knows or cares about NFTs. A lot of people hear about them and immediately dismiss them as weirdo nonsense. “This makes no sense without real paintings”. It’ll be cool to see how, as NFTs become widespread in games and elsewhere, the idea that a digital item can be “authentic” goes from being niche and hard to grasp to intuitively obvious to everyone.

It’s also important to realize that forcing scarcity onto something as seemingly intrinsically multipliable as digital artwork is by no means unique, and we can find similar examples everywhere, both in the digital and physical world. Examples of artificial scarcity can be found in copyright, DRM, planned obsolescence, the diamond industry, paywalls, collectible cards, torrent poisoning, the Agricultural Adjustment Act, Tulip mania, price fixing by cartels and monopolies, signed merchandise, Chick-Fil-A closing on Sunday, knockoff Gucci bags, and so on.

Finally, I’ll gleefully mention that as I was finishing this essay on originality, I found out that Twitter user @DCCockFoster has used the Mona Lisa example to make the same points I made about NFT-paintings. Sorry pal, you should’ve tokenized the Twitter thread.
