
The LessWrong 2018 Review

LessWrong.com News - November 21, 2019 - 05:50
Published on November 21, 2019 2:50 AM UTC

If you have 1000+ karma, you have until Dec 1st to nominate LessWrong posts from 2018 (yes, 2018, not 2019) for the first LessWrong Review. The nomination button is available from a post's dropdown menu.

Multiple nominations are helpful – posts with enough nominations will proceed to a review phase (ending December 31st), followed by a week of voting. Details below.

The LW team will be compiling the best posts and reviews into a physical book, awarding $2000 divided among the top posts and up to $2000 divided among the top reviews.

This is the first week of the LessWrong 2018 Review – an experiment in improving the LessWrong Community's longterm feedback and reward cycle.

This post begins by exploring the motivations for this project (first at a high level of abstraction, then getting into some more concrete goals), before diving into the details of the process.

Improving the Idea Pipeline

In his LW 2.0 Strategic Overview, habryka noted:

We need to build on each other’s intellectual contributions, archive important content, and avoid primarily being news-driven.

We need to improve the signal-to-noise ratio for the average reader, and only broadcast the most important writing

[...]

Modern science is plagued by severe problems, but of humanity’s institutions it has perhaps the strongest record of being able to build successfully on its previous ideas.

The physics community has this system where the new ideas get put into journals, and then eventually if they’re important, and true, they get turned into textbooks, which are then read by the upcoming generation of physicists, who then write new papers based on the findings in the textbooks. All good scientific fields have good textbooks, and your undergrad years are largely spent reading them.

Over the past couple years, much of my focus has been on the early-stages of LessWrong's idea pipeline – creating affordance for off-the-cuff conversation, brainstorming, and exploration of paradigms that are still under development (with features like shortform and moderation tools).

But, the beginning of the idea-pipeline is, well, not the end.

I've written a couple times about what the later stages of the idea-pipeline might look like. My best guess is still something like this:

I want LessWrong to encourage extremely high quality intellectual labor. I think the best way to go about this is through escalating positive rewards, rather than strong initial filters.

Right now our highest reward is getting into the curated section, which... just isn't actually that high a bar. We only curate posts if we think they are making a good point. But if we set the curated bar at "extremely well written and extremely epistemically rigorous and extremely useful", we would basically never be able to curate anything.

My current guess is that there should be a "higher than curated" level, and that the general expectation should be that posts should only be put in that section after getting reviewed, scrutinized, and most likely rewritten at least once.

I still have a lot of uncertainty about the right way to go about a review process, and various members of the LW team have somewhat different takes on it.

I've heard lots of complaints about mainstream science peer review: that reviewing is often a thankless task; that the quality of review varies dramatically; and that it is often entangled with weird political games.

Meanwhile: LessWrong posts cover a variety of topics – some empirical, some philosophical. In many cases it's hard to directly evaluate their truth or usefulness. LessWrong team members had differing opinions on what sort of evaluation is most useful or practical.

I'm not sure if the best process is more open/public (harnessing the wisdom of crowds) or private (relying on the judgment of a small number of thinkers). The current approach involves a mix of both.

What I'm most confident in is that the review should focus on older posts.

New posts often feel exciting, but a year later you can look back and ask whether a post has actually become a helpful intellectual tool. (I'm also excited for the idea that, in future years, the process could include reconsidering previously-reviewed posts, if there's been something like a "replication crisis" in the intervening time.)

Regardless, I consider the LessWrong Review process to be an experiment, which will likely evolve in the coming years.

Goals

Before delving into the process, I wanted to go over the high level goals for the project:

1. Improve our longterm incentives, feedback, and rewards for authors

2. Create a highly curated "Best of 2018" sequence / physical book

3. Create common knowledge about the LW community's collective epistemic state regarding controversial posts

Longterm incentives, feedback and rewards

Right now, authors on LessWrong are rewarded essentially through comments, votes, and other people citing their work. This is fine, as far as it goes, but has a few issues:

• Some kinds of posts are quite valuable, but don't get many comments (and these disproportionately tend to be posts that are more proactively rigorous, because there's less to critique, or critiquing requires more effort, or building off the ideas requires more domain expertise)
• By contrast, comments and voting both nudge people towards posts that are clickbaity and controversial.
• Once posts have slipped off the frontpage, they often fade from consciousness. I'm excited for a LessWrong that rewards Long Content that stands the test of time and is updated as new information comes to light. (In some cases this may involve editing the original post. But if you prefer old posts to serve as a time-capsule of your past beliefs, adding a link to a newer post would also work.)
• Many good posts begin with an "epistemic status: thinking out loud", because, at the time, they were just thinking out loud. Nonetheless, they turn out to be quite good. Early-stage brainstorming is good, but if 2 years later the early-stage-brainstorming has become the best reference on a subject, authors should be encouraged to change that epistemic status and clean up the post for the benefit of future readers.

The aim of the Review is to address those concerns by:

• Promoting old, vetted content directly on the site.
• Awarding prizes not only to authors, but to reviewers. It seems important to directly reward high-effort reviews that thoughtfully explore both how the post could be improved, and how it fits into the broader intellectual ecosystem. (At the same time, not having this be the final stage in the process, since building an intellectual edifice requires four layers of ongoing conversation)
• Compiling the results into a physical book. I find there's something... literally weighty about having your work in printed form. And because it's much harder to edit books than blogposts, the printing gives authors an extra incentive to clean up their past work or improve the pedagogy.

A highly curated "Best of 2018" sequence / book

Many users don't participate in the day-to-day discussion on LessWrong, but want to easily find the best content.

To those users, a "Best Of" sequence that includes not only posts that seemed exciting at the time, but also distilled reviews and followup, seems like a good value proposition. Meanwhile, it helps move the site away from being a time-sensitive newsfeed.

Common knowledge about the LW community's collective epistemic state regarding controversial posts

Some posts are highly upvoted because everyone agrees they're true and important. Other posts are upvoted because they're more like exciting hypotheses. There's a lot of disagreement about which claims are actually true, but that disagreement is crudely measured in comments from a vocal minority.

The end of the review process includes a straightforward vote on which posts seem (in retrospect) useful, and which seem "epistemically sound". This is not the end of the conversation about which posts are making true claims that carve reality at its joints, but my hope is for it to ground that discussion in a clearer group-epistemic state.

Review Process

Nomination Phase

1 week (Nov 20th – Dec 1st)

• Users with 1000+ karma can nominate posts from 2018, describing how they found the post useful over the longterm.
• The nomination button is in the post dropdown-menu (available at the top of posts, or to the right of their post-item)
• For convenience, you can review posts via:
Review Phase

4 weeks (Dec 1st – Dec 31st)

• Authors of nominated posts can opt-out of the review process if they want.
• They also can opt-in, while noting that they probably won't have time to update their posts in response to critique. (This may reduce the chances of their posts being featured as prominently in the Best of 2018 book)
• Posts with sufficient* nominations are announced as contenders.
• We're aiming to have 50-100 contenders, and the nomination threshold will be set to whatever gets closest to that range
• For a month, people are encouraged to look at them thoughtfully, writing comments (or posts) that discuss:
• How has this post been useful?
• How does it connect to the broader intellectual landscape?
• Is this post epistemically sound?
• How could it be improved?
• What further work would you like to see people do with the content of this post?
• A good frame of reference for the reviews is shorter versions of LessWrong or SlateStarCodex book reviews (which do a combination of epistemic spot checks, summarizing, and contextualizing)
• Authors are encouraged to engage with reviews:
• Noting where they disagree
• Discussing what sort of followup work they'd be interested in seeing from others
• Ideally, updating the post in response to critique they agree with
Voting Phase

1 Week (Jan 1st – Jan 7th)

Posts that got at least one review proceed to the voting phase. The details of this are still being fleshed out, but the current plan is:

• Users with 1000+ karma rate each post on a 1-10 scale, with 6+ meaning "I'd be happy to see this included in the 'best of 2018' roundup" and 10 meaning "this is the best I can imagine"
• Users are encouraged to (optionally) share the reasons for each rating, and/or share thoughts on their overall judgment process.
Books and Rewards

Public Writeup / Aggregation

Soon afterwards (hopefully within a week), the votes will all be publicly available. A few different aggregate statistics will be available, including the raw average, and potentially some attempt at a "karma-weighted average."
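As an illustration, here is what such an aggregation might look like. The actual LessWrong computation is unspecified, so the weighting function, the sample votes, and all names below are assumptions, not the site's real scheme.

```python
def karma_weighted_average(votes, weight=lambda karma: karma ** 0.5):
    """Aggregate (rating, voter_karma) pairs into one score.

    Hypothetical sketch: here a voter's weight grows with the square
    root of their karma, so high-karma users count more, but not
    linearly more. The real scheme (if any) may differ.
    """
    total = sum(weight(k) for _, k in votes)
    return sum(r * weight(k) for r, k in votes) / total

votes = [(7, 1000), (9, 4000), (4, 1600)]  # (1-10 rating, voter karma)
raw_average = sum(r for r, _ in votes) / len(votes)
weighted = karma_weighted_average(votes)
```

Publishing both numbers, as proposed above, makes it easy to see where the high-karma electorate disagrees with the raw crowd.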

Best of 2018 Book / Sequence

Sometime later, the LessWrong moderation team will put together a physical book (and online sequence) of the best posts and most valuable reviews.

This will involve a lot of editor discretion – the team will essentially take the public review process and use it as input for the construction of a book and sequence.

I have a lot of uncertainty about the shape of the book. I'm guessing it'd include anywhere from 10-50 posts, along with particularly good reviews of those posts, and some additional commentary from the LW team.

Note: This may involve some custom editing to handle things like hyperlinks, which may work differently in printed media than online blogposts. This will involve some back-and-forth with the authors.

Prizes

• Everyone whose work is featured in the book will receive a copy of it.
• There will be $2000 in prizes divided among the authors of the top 3-5 posts (judged by the moderation team)
• There will be up to $2000 in prizes for the best 0-10 reviews that get included in the book. (The distribution of this will depend a bit on what reviews we get and how good they are)
• (note: LessWrong team members may be participating as reviewers and potentially authors, but will not be eligible for any awards)


[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | Arxiv

LessWrong.com News - November 21, 2019 - 04:18
Published on November 21, 2019 1:18 AM UTC


A fun calibration game: "0-hit Google phrases"

LessWrong.com News - November 21, 2019 - 04:13
Published on November 21, 2019 1:13 AM UTC

Here's a simple calibration game: propose some phrase, like "the ultimate east care pant" (something one of my pairs of pants says), and ask "How likely is it that Google returns no search results for this phrase (in quotes)?"
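The post doesn't specify how to score guesses, but a standard choice for a game like this is the Brier score. Here's a minimal sketch; the probabilities and outcomes are made-up examples.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and what happened.

    forecasts: list of (p, outcome) pairs, where p is your stated
    probability that Google returns zero results and outcome is True
    if it actually did. Lower scores mean better calibration.
    """
    return sum((p - (1.0 if hit else 0.0)) ** 2
               for p, hit in forecasts) / len(forecasts)

rounds = [(0.8, True), (0.3, False), (0.6, True)]  # made-up example rounds
score = brier_score(rounds)
```

A player who always answers 0.5 scores exactly 0.25, so anything consistently below that suggests real calibration.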


Thinking of tool AIs

LessWrong.com News - November 21, 2019 - 03:38
Published on November 20, 2019 9:47 PM UTC

Preliminary note: the ideas in the post emerged during the Learning-by-doing AI safety workshop at EA Hotel; special thanks to Linda Linsefors, Davide Zagami and Morgan Sinclaire for giving suggestions and feedback.

As the title anticipates, long-term safety is not the main topic of this post; for the most part, the focus will be on current AI technologies. More specifically: why are we (un)satisfied with them from a safety perspective? In what sense can they be considered tools, or services?

An example worth considering is the YouTube recommendation algorithm. In simple terms, the job of the algorithm is to find the videos that best fit the user and then suggest them. The expected watch time of a video is a variable that heavily influences how a video is ranked, but the objective function is likely to be complicated and probably includes variables such as click-through rate and session time.[1] For the sake of this discussion, it is sufficient to know that the algorithm cares about the time spent by the user watching videos.

From a safety perspective - even without bringing up existential risk - the current objective function is simply wrong: a universe in which humans spend lots of hours per day on YouTube is not something we want. The YT algorithm has the same problem that Facebook had in the past, when it was maximizing click-throughs.[2] This is evidence supporting the thesis that we don't necessarily need AGI to fail: if we keep producing software that optimizes for easily measurable but inadequate targets, we will steer the future towards worse and worse outcomes.

Imagine a scenario in which:

• human willpower is weaker than now;
• hardware is faster than now, so that the YT algorithm manages to evaluate a larger number of videos per time unit and, as a consequence, gives the user better suggestions.

Because of these modifications, humans could spend almost all day on YT. It is worth noting that, even in this semi-catastrophic case, the behaviour of the AI would be more tool-ish than AGI-like: it would not actively oppose its shutdown, start acquiring new resources, develop an accurate model of itself in order to self-improve, et cetera.

From that perspective, the video recommendation service seems much more dangerous than what we usually indicate with the term tool AI. How can we make the YT algorithm more tool-ish? What is a tool?

Unsurprisingly, it seems we don't have a clear definition yet. In his paper about CAIS, Drexler writes that it is typical of services to deliver bounded results with bounded resources in bounded times.[3] Then, a possible solution is to put a constraint on the time that a user can spend on YT over a certain period. In practice, this could be done by forcing the algorithm to suggest random videos when the session time exceeds a threshold value: in fact, this solution doesn't even require a modification of the main objective function. In the following, I will refer to this hypothetical fixed version of the algorithm as "constrained YT algorithm" (cYTa).
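The constraint just described is simple enough to sketch directly. The threshold value, function names, and video representation below are all assumptions for illustration, not anything from YouTube's actual system.

```python
import random

SESSION_LIMIT_MINUTES = 120  # assumed threshold; the post doesn't pick a number

def cyta_suggestion(ranked_videos, all_videos, session_minutes):
    """Constrained recommender (cYTa) sketch.

    Below the session-time threshold, behave like the ordinary
    recommender and return its top-ranked video. Past the threshold,
    return a uniformly random video, so additional watch time no
    longer feeds the optimization target.
    """
    if session_minutes < SESSION_LIMIT_MINUTES:
        return ranked_videos[0]
    return random.choice(all_videos)
```

Note that this wraps the existing ranker rather than changing it, matching the post's observation that the fix doesn't require modifying the main objective function.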

Even though this modification would prevent the worst outcomes, we would still have to deal with subtler problems like echo chambers and filter bubbles, which are caused by the fact that recommended videos share something in common with the videos watched by the user in the past.[4] So, if our standards of safety are set high enough, the example of cYTa shows that the criterion "bounded results, resources and time" is insufficient to guarantee positive outcomes.

In order to better understand what we want, it may be useful to consider current AI technologies that we are satisfied with. Take Google Maps, for example: like cYTa, it optimizes within hard constraints and can be easily shut down. However, GMaps doesn't have a known negative side effect comparable to echo chambers; from this point of view, AIs that play strategy games (e.g. Deep Blue) are also similar to GMaps.

Enough with the examples! I claim that the "idealized safe tool AI" fulfills the following criteria:

1. Corrigibility
2. Constrained optimization
3. No negative side effects

Before I get insulted in the comments because of how [insert_spicy_word] this list is, I'm going to spell out some details. First, I've simply listed three properties that seem necessary if we want to talk about an AI technology that doesn't cause any sort of problem. I wouldn't be surprised if the list turned out to be non-exhaustive and I don't mean it to be taken as a definition of the concept "tool" or "service". At the same time, I think that these two terms are too under-specified at the moment, so adding some structure could be useful for future discussions. Moreover, it seems to me that 3 implies 2 because, for each variable that is left unconstrained during optimization, side effects usually become more probable; in general, 3 is a really strong criterion. Instead, 1 seems to be somewhat independent from the others. Last, even though the concept is idealised, it is not so abstract that we don't have a concrete reference point: GMaps works well as an example.[5]

Where do we go from here? We can start by asking whether what has been said about CAIS is still valid if we replace the term service with the concept of idealized safe tool. My intuition is that the answer is yes and that the idealized concept can actually facilitate the analysis of some of the ideas presented in the paper. Another possible question is to what extent a single superintelligent agent can adhere to 3; or, in other words, whether limiting an AI's side effects also constrains its capability of achieving goals. These two papers already highlighted the importance of negative side effects and impact measures, but we are still far from getting a solid satisfactory answer.

Summary

Just for clarity purposes, I recap the main points presented here:

• Even if AGI was impossible to obtain, AI safety wouldn’t be solved; thinking of tools as naturally safe is a mistake.
• As shown by the cYTa example, putting strong constraints on optimization is not enough to ensure safety.
• An idealized notion of safe tool is proposed. This should give a bit more context to previously discussed ideas (e.g. CAIS) and may stimulate future research or debate.
1. Not all the details are publicly available, and the algorithm changes frequently. By googling "YouTube SEO" I managed to find these, but I don't know how reliable the source is. ↩︎

2. As stated by Yann LeCun in this discussion about instrumental convergence: "[...] Facebook stopped maximizing clickthroughs several years ago and stopped using the time spent in the app as a criterion about 2 years ago. It put in place measures to limit the dissemination of clickbait, and it favored content shared by friends rather than directly disseminating content from publishers." ↩︎

3. Page 32. ↩︎

4. With cYTa, the user will experience the filter bubble only until the threshold is reached; the problem would be only slightly reduced, not solved. If the threshold is set really low then the problem is not relevant anymore, but at the same time the algorithm becomes useless because it recommends random videos for most of the time. ↩︎

5. In order to completely fulfill 3, we have to neglect stuff like possible car accidents caused by distraction induced by the software. Analogously, an AI like AlphaZero could be somewhat addicting for the average user who likes winning at strategy games. In reality, every software can have negative side effects; saying that GMaps and AlphaZero have none seems a reasonable approximation. ↩︎


Junto: Questions for Meetups and Rando Convos

LessWrong.com News - November 21, 2019 - 01:11
Published on November 20, 2019 10:11 PM UTC

I ponder a lot about community and how important local community is for the functioning of society; many are the riches brought from afar by long-distance communication, but local rationality meetups can nonetheless increase local metis by generating intelligent community. I read in Isaacson’s biography of Benjamin Franklin how he (Franklin) employed his Junto to advance scientific knowledge, civil society, and business; there are some great examples there. But the Wikipedia page will do well enough for an overview.

https://en.wikipedia.org/wiki/Junto_(club)

What I have done here is taken the Junto discussion questions of Franklin's club and reformulated them to serve as a model for the types of questions we can be asking each other to keep advancing community and local knowledge.

1. Have you read anything useful or insightful recently? Particularly in technology, history, literature, science, or other fields of knowledge?
2. What problems have you been thinking about recently?
3. Has there been any worthwhile or important local news?
4. Have any businesses failed lately, and do you know anything about the cause?
5. Have any businesses recently risen in success, how so?
6. Do you know of anyone who has recently done something interesting, praiseworthy, or worthy of imitation? Or who has made a mistake we should be warned against and avoid?
7. Have you been doing anything recently to increase your psychological and physical health?
8. Is there any person whose acquaintance you want, and whom someone in the group can procure for you?
9. Do you think of anything at present by which the group could easily do something useful?
10. Do you know of any deserving younger person, for whom it lies in the power of the group to encourage and help advance in his career?
11. Do you see anything amiss in the present customs or proceedings of the group, which might be amended?


Doxa, Episteme, and Gnosis Revisited

LessWrong.com News - November 20, 2019 - 22:35
Published on November 20, 2019 7:35 PM UTC

Exactly two years to the day before I started writing this post, I published Map and Territory's most popular post of all time, "Doxa, Episteme, and Gnosis" (also here on LW). In that post I describe a distinction ancient Greek made between three kinds of knowledge, which we might translate as hearsay, justified belief, and direct experience, respectively; although, if I'm being totally honest, I'm nowhere close to being a classics scholar, so I probably drew a distinction between the three askew to the one ancient Attic Greeks would have made. Historical accuracy aside, the distinction has proven useful over the past couple years to myself and others, so I thought it was worth revisiting in light of all I have learned in the intervening time.

Nuanced Distinctions

To start, I still draw the categories of doxa, episteme, and gnosis roughly the same as I did before. To quote myself:

Doxa is what in English we might call hearsay. It’s the stuff you know because someone told you about it. If you know the Earth is round because you read it in a book, that’s doxa.

Episteme is what we most often mean by “knowledge” in English. It’s the stuff you know because you thought about it and reasoned it out. If you know the Earth is round because you measured shadows at different locations and did the math to prove that the only logical conclusion is that the Earth is round, that’s episteme.

Gnosis has no good equivalent in English, but the closest we come is when people talk about personal experience because gnosis is the stuff you know because you experienced it. If you know the Earth is round because you traveled all the way around it or observed it from space, that’s gnosis.

There's more nuance to it than that, of course. Doxa, for example, also refers to thoughts, beliefs, ideas, propositions, statements, and words in addition to its connotations of hearsay, common belief, and popular opinion. Episteme, to Plato, was the combination of doxa and logos, contrary to my example above where I root episteme in observational evidence, although then again maybe not because "logos" can mean not only "reason", "account", "word", and "speech" but also "ground" or "ultimate cause". And gnosis, despite its connotations in English as a special kind of insightful knowledge about the true nature of existence as a result of its use by Christian mystics, shares the same root or is the root via borrowing of the word for "knowledge" in most European languages, English included.

Further, the boundaries between the three categories are not always clear. We've already seen one way this is so, where I described episteme in a way that it's grounded by gnosis via the direct experience of observation, but this is an empiricist perspective on what episteme is and there's an equally valid notion, in terms of category construction, of episteme as reasoning from first thought within a traditional rationalist perspective. Another is that all knowledge is in a certain sense gnosis because there must have been some experience by which you gained the knowledge (unless you really want to double down on rational idealism and go full Platonist), although this need not confuse us if we understand the difference between the experience of something and the something quoted/bracketed within the experience. And similarly, all knowledge we speak of must first become doxa in our own minds that we tell ourselves before it becomes doxa for others by being put into words that draw common distinctions, hence episteme and gnosis can only be generated and never directly transmitted.

In addition to doxa, episteme, and gnosis, we can draw additional distinctions that are useful for thinking about knowledge.

One is metis, or practical wisdom. This is the knowledge that comes from hard won experience, possibly over many generations such that no one even knows where it came from. Metis is often implicit or exists via its application and may look nonsensical or unjustified if made explicit. To return to my original examples, this would be like knowing to take a great circle route on a long migration because it's the traditional route despite not knowing anything about the roundness of Earth that would let you know it's the shortest route.

Related to metis is techne, or procedural knowledge or the knowing that comes from doing. In English we might use a phrase like "muscle memory" to capture part of the idea. It's like the knowledge of how to walk or ride a bike or type on a keyboard or throw a clay pot, and also the kind of knowledge that produces things like mathematical intuition, the ability to detect code smell, and a gut sense of what is right. It's knowledge that co-arises with action.

I'm sure we could capture others. Both metis and techne draw out distinctions that would otherwise disappear within doxa and gnosis, respectively. We can probably make further distinctions for, say, episteme that is grounded in gnosis vs. episteme that is grounded in doxa, gnosis about other types of knowledge, and doxa derived by various means. We are perhaps only limited by our need to make these distinctions and sufficient Greek words with which to make them.

Relationships

Rather than continuing down the path of differentiation, let's look instead at how our three basic ways of knowing come together and relate to one another. In the original post I had this to say about the way doxa, episteme, and gnosis interact:

Often we elide these distinctions. Doxa of episteme is frequently thought of as episteme because if you read enough about how others gained episteme you may feel as though you have episteme yourself. This would be like hearing lots of people tell you how they worked out that the Earth is round and thinking that this gives you episteme rather than doxa. The mistake is understandable: as long as you only hear others talk about their episteme it’s easy to pattern match and think you have it too, but as soon as you try to explain your supposed episteme to someone else you will quickly discover if you only have doxa instead. The effect is so strong that experts in fields often express that they never really knew their subject until they had to teach it.

In the same way episteme is often mistaken for gnosis. At least since the time of Ptolemy people have had episteme of the spherical nature of the Earth, and since the 1970s most people have seen pictures showing that the Earth is round, but astronauts continue to experience gnosis of Earth’s roundness the first time they fly in space. It seems no matter how much epistemic reckoning we do or how accurate and precise our epistemic predictions are, we are still sometimes surprised to experience what we previously only believed.

But none of this is to say that gnosis is better than episteme or that episteme is better than doxa because each has value in different ways. Doxa is the only kind of knowledge that can be reliably and quickly shared, so we use it extensively in lieu of episteme or gnosis because both impose large costs on the knower to figure things out for themselves or cultivate experiences. Episteme is the only kind of knowledge that we can prove correct, so we often seek to replace doxa and gnosis with it when we want to be sure of ourselves. And gnosis is the only kind of knowledge available to non-sentient processes, so unless we wish to spend our days in disembodied deliberation we must at least develop gnosis of doxastic and epistemic knowledge to give the larger parts of our brains information to work with. So all three kinds of knowledge must be used together in our pursuit of understanding.

That sounds pretty nice, like all three kinds of knowledge need to exist in harmony. In fact, I even said as much by concluding the original with an evocative metaphor:

It’s coincidental that ancient Greek chose to break knowledge into three kinds rather than two or four or five, but because it did we can think of doxa, episteme, and gnosis like the three legs of a stool. Each leg is necessary for the stool to stand, and if any one of them is too short or too long the stool will wobble. Pull one out and the stool will fall over. Only when all three are combined in equal measure do we get a sturdy foundation to sit and think on.

Alas, I got some things wrong in the original with how I described the relationship between these three aspects of knowledge, specifically in the way things fall apart when the three aspects are not balanced. I won't reprint those words here to avoid spreading confusion, and will rather try to make amends by better describing what can happen when we privilege one kind of knowledge over the others.

To privilege doxa is to value words, thoughts, and ideas over reason and experience. This position is sometimes compelling: as the saying goes, if you can't explain something, you don't really understand it, and to explain it you must have and generate doxa. Further, doxa lets you engage with the world at a safe distance without getting your hands dirty, but this comes with the risk of becoming detached, unhinged, ungrounded, unrooted, disconnected, and otherwise uncorrelated with reality because, on its own, doxa is nothing more than empty words. The people we pejoratively describe as putting doxa first are sophists, pundits, ivory-tower intellectuals, certain breeds of bloggers, and, of course, gossips. The remedy for their condition is to spend more time thinking for oneself and experiencing life.

When we privilege episteme we believe our own reason over and above what wisdom and experience tell us. The appeal of favoring episteme lies in noticing that wisdom and experience can mislead us, such that if we just bothered to think for 5 minutes we would have noticed they were wrong. And, of course, sometimes they are, but if we continue down this path we run into infinite inferential regress, the uncomputable universal prior, the problem of the criterion, epistemic circularity, and more mundane problems like making commonly known mistakes, ignoring our experiences because we don't understand them, and otherwise failing because we didn't reckon we would. Putting episteme first is the failure mode of high modernists, logical positivists, traditional rationalists, and internet skeptics. If we fall victim to their mistakes, the solution lies with finding the humility to accept that sometimes other people know things even when we don't and to trust our lived experiences to be just as they are, nothing more and nothing less.

Finally, privileging gnosis is to rely on our experiences at the expense of reason and wisdom. There's a certain logic to the radical empiricism of this approach: what I can know for sure is what I experience with my eyes, ears, nose, tongue, body, and mind, and every other way of knowing is a secondary source. But this leaves out the important contributions of what we can know about the world that lies beyond our direct experience, where we learn from others and from reasoning, effectively giving up the epistemic benefits that come with language. Solipsists, hippies, mystics, and occultists are among the folk who tend to value gnosis over episteme and doxa. For them we might advise listening more to others and spending more time in rigorous, precise, and careful thought to balance out their over-strong belief in what they experience.

Walking the middle way between these three attractors is not easy. If nothing else, there's a certain temptation that can arise to identify with the way of knowing you like best and the people who engage most with that way of knowing. I encourage you to resist it! You can hang out with and wear the attire of an intellectual, a rationalist, or a hippie without succumbing to their stereotypical epistemological failure modes of excess doxa, episteme, and gnosis. There is no special virtue in making wrong predictions about the world, regardless of how you came to make that wrong prediction. Instead, you can aspire to remain a sharp blade that cuts to the truth no matter the whetstone used to hone the blade or the stance from which the cut is made.

Beyond Distinctions

If it's the case that there's no special privileging of one kind of knowledge over another and the path to truth lies with combining them all, you might ask why make any distinctions at all? Certainly it feels at times useful to draw these distinctions, but as we've seen these distinctions are blurry, nuanced, and blend into each other. What about the alternative of unifying these kinds into a single concept that captures them all?

By itself the English word "knowledge" fails to do that adequately because it tends to point towards explicit knowledge and disregards that which is known implicitly and that which is inseparable from its embeddedness in the world, and we know this because it's noteworthy to point out ways that things like gnosis and metis and techne can count as knowing. So what is the thing that ties these notions all together?

I think it's worth considering what it means to know something. Knowing is an intentional act: a subject (you, me, them) knows an object (the something known). Thus it is a kind of relationship between subject and object where the subject experiences the object in a particular way we consider worth distinguishing as "knowing" from other forms of experience. In knowing the object seems to always be something mental, viz. the object is information not stuff, ontological not ontic. For example, you might say I can't know the cup on my desk directly, only the experience of it in my mind—the noumenon of the cup is not known, only the phenomenon of it. And from there we can notice that knowing is not a single experience, but composed of multiple motions: initial contact with a mental object, categorization of the object in terms of ontology, evaluation of it, possible recollection of related mental objects (memories), integration into a network of those related mental objects, and self-reflection on the experience of the mental object.

Given the complexity of the knowing act, I'm inclined to infer that even if the neurological processes that enable knowing can be thought of as a unified system, it's complex enough that we should expect it to have many aspects that to us would look like different kinds of knowledge. When certain aspects of that process are more salient than the others, we might see a pattern and label that knowing experience as doxa, episteme, or gnosis. So knowledge is neither a single kind nor multiple kinds, but a holon: both composed of distinct kinds and cut from a single kind, codependent and inseparable from one another. Thus there are different kinds of knowledge and there is just one kind of knowing, and holding both perspectives is necessary to understand the depths of what it means to know.

More to say?

There's always more to say. For example, I chose to leave out a more detailed discussion of the etiology of knowledge, which confuses the matter a bit, since it can mean putting one kind of knowledge causally first, which can be mistaken for thinking one kind is more important than the others. Maybe I'll return to this topic in another two years or more and have additional insights to share.

Discuss

[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts

LessWrong.com News - November 20, 2019 - 21:20
Published on November 20, 2019 6:20 PM UTC

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

AI alignment landscape (Paul Christiano) (summarized by Rohin): This post presents the following decomposition of how to make AI go well:

Rohin's opinion: Here are a few points about this decomposition that were particularly salient or interesting to me.

First, at the top level, the problem is decomposed into alignment, competence, and coping with the impacts of AI. The "alignment tax" (extra technical cost for safety) is only applied to alignment, and not competence. While there isn't a tax in the "coping" section, I expect that is simply due to a lack of space; I expect that extra work will be needed for this, though it may not be technical. I broadly agree with this perspective: to me, it seems like the major technical problem which differentially increases long-term safety is to figure out how to get powerful AI systems that are trying to do what we want, i.e. they have the right motivation (AN #33). Such AI systems will hopefully make sure to check with us before taking unusual irreversible actions, making e.g. robustness and reliability less important. Note that techniques like verification, transparency, and adversarial training (AN #43) may still be needed to ensure that the alignment itself is robust and reliable (see the inner alignment box); the claim is just that robustness and reliability of the AI's capabilities is less important.

Second, strategy and policy work here is divided into two categories: improving our ability to pay technical taxes (extra work that needs to be done to make AI systems better), and improving our ability to handle impacts of AI. Often, generically improving coordination can help with both categories: for example, the publishing concerns around GPT-2 (AN #46) have allowed researchers to develop synthetic text detection (the first category) as well as to coordinate on when not to release models (the second category).

Third, the categorization is relatively agnostic to the details of the AI systems we develop -- these only show up in level 4, where Paul specifies that he is mostly thinking about aligning learning, and not planning and deduction. It's not clear to me to what extent the upper levels of the decomposition make as much sense if considering other types of AI systems: I wouldn't be surprised if I thought the decomposition was not as good for risks from e.g. powerful deductive algorithms, but it would depend on the details of how deductive algorithms become so powerful. I'd be particularly excited to see more work presenting more concrete models of powerful AGI systems, and reasoning about risks in those models, as was done in Risks from Learned Optimization (AN #58).

Addendum to AI and Compute (Girish Sastry et al) (summarized by Rohin): Last week, I said that this addendum suggested that we don't see the impact of AI winters in the graph of compute usage over time. While true, this was misleading: the post is measuring compute used to train models, which was less important in past AI research (e.g. it doesn't include Deep Blue), so it's not too surprising that we don't see the impact of AI winters.

Technical AI alignment

Mesa optimization

Will transparency help catch deception? Perhaps not (Matthew Barnett) (summarized by Rohin): Recent (AN #70) posts (AN #72) have been optimistic about using transparency tools to detect deceptive behavior. This post argues that we may not want to use transparency tools, because then the deceptive model can simply adapt to fool the transparency tools. Instead, we need something more like an end-to-end trained deception checker that's about as smart as the deceptive model, so that the deceptive model can't fool it.

Rohin's opinion: In a comment, Evan Hubinger makes a point I agree with: the transparency tools don't need to be able to detect all deception; they just need to prevent the model from developing deception. If deception gets added slowly (i.e. the model doesn't "suddenly" become perfectly deceptive), then this can be way easier than detecting deception in arbitrary models, and could be done by tools.

Prerequisites: Relaxed adversarial training for inner alignment (AN #70)

More variations on pseudo-alignment (Evan Hubinger) (summarized by Nicholas): This post identifies two additional types of pseudo-alignment not mentioned in Risks from Learned Optimization (AN #58). Corrigible pseudo-alignment is a new subtype of corrigible alignment. In corrigible alignment, the mesa optimizer models the base objective and optimizes that. Corrigible pseudo-alignment occurs when the model of the base objective is a non-robust proxy for the true base objective. Suboptimality deceptive alignment is when deception would help the mesa-optimizer achieve its objective, but it does not yet realize this. This is particularly concerning because even if AI developers check for and prevent deception during training, the agent might become deceptive after it has been deployed.

Nicholas's opinion: These two variants of pseudo-alignment seem useful to keep in mind, and I am optimistic that classifying risks from mesa-optimization (and AI more generally) will make them easier to understand and address.

Vehicle Automation Report (NTSB) (summarized by Zach): Last week, the NTSB released a report on the Uber automated driving system (ADS) that hit and killed Elaine Herzberg. The pedestrian was walking across a two-lane street with a bicycle. However, the car didn't slow down before impact. Moreover, even though the environment was dark, the car was equipped with LIDAR sensors which means that the car was able to fully observe the potential for collision. The report takes a closer look at how Uber had set up their ADS and notes that in addition to not considering the possibility of jay-walkers, "...if the perception system changes the classification of a detected object, the tracking history of that object is no longer considered when generating new trajectories". Additionally, in the final few seconds leading up to the crash the vehicle engaged in action suppression, which is described as "a one-second period during which the ADS suppresses planned braking while the (1) system verifies the nature of the detected hazard and calculates an alternative path, or (2) vehicle operator takes control of the vehicle". The reason cited for implementing this was concerns of false alarms which could cause the vehicle to engage in unnecessary extreme maneuvers. Following the crash, Uber suspended its ADS operations and made several changes. They now use onboard safety features of the Volvo system that were previously turned off, action suppression is no longer implemented, and path predictions are held across object classification changes.

Zach's opinion: While there is a fair amount of nuance regarding the specifics of how Uber's ADS was operating, it does seem as though there was a fair amount of incompetence in how the ADS was deployed. Turning off Volvo system fail-safes, not accounting for jaywalking, and trajectory resetting seem like unequivocal mistakes. A lot of people also seem upset that Uber was engaging in action suppression. However, given that randomly engaging in extreme maneuvering in the presence of other vehicles can indirectly cause accidents, I have a small amount of sympathy for why such a feature existed in the first place. Of course, the feature was removed and it's worth noting that "there have been no unintended consequences—increased number of false alarms".

Read more: Jeff Kaufman writes a post summarizing both the original incident and the report. Wikipedia is also rather thorough in their reporting on the factual information. Finally, Planning and Decision-Making for Autonomous Vehicles gives an overview of recent trends in the field and provides good references for people interested in safety concerns.

Interpretability

Explicability? Legibility? Predictability? Transparency? Privacy? Security? The Emerging Landscape of Interpretable Agent Behavior (Tathagata Chakraborti et al) (summarized by Flo): This paper reviews and discusses definitions of concepts of interpretable behaviour. The first concept, explicability, measures how close an agent's behaviour is to the observer's expectations. An agent that takes a turn while its goal is straight ahead does not behave explicably by this definition, even if it has good reasons for its behaviour, as long as these reasons are not captured in the observer's model. Predictable behaviour reduces the observer's uncertainty about the agent's future behaviour. For example, an agent that is tasked to wait in a room behaves more predictably if it shuts itself off temporarily than if it paced around the room. Lastly, legibility or transparency reduces the observer's uncertainty about an agent's goal. This can be achieved by preferentially taking actions that do not help with other goals. For example, an agent tasked with collecting apples can increase its legibility by actively avoiding pears, even if it could collect them without any additional costs.

These definitions do not always assume correctness of the observer's model. In particular, an agent can explicably and predictably achieve the observer's task in a specific context while actually trying to do something else. Furthermore, these properties are dynamic. If the observer's model is imperfect and evolves from observing the agent, formerly inexplicable behaviour can become explicable as the agent's plans unfold.

Flo's opinion: Conceptual clarity about these concepts seems useful for more nuanced discussions and I like the emphasis on the importance of the observer's model for interpretability. However, it seems like concepts around interpretability that are not contingent on an agent's actual behaviour (or explicit planning) would be even more important. Many state-of-the-art RL agents do not perform explicit planning, and ideally we would like to know something about their behaviour before we deploy them in novel environments.

AI strategy and policy

AI policy careers in the EU (Lauro Langosco)

Other progress in AI

Reinforcement learning

Superhuman AI for multiplayer poker (Noam Brown et al) (summarized by Matthew): In July, this paper presented the first AI that can play six-player no-limit Texas hold’em poker better than professional players. Rather than using deep learning, it works by precomputing a blueprint strategy using a novel variant of Monte Carlo linear counterfactual regret minimization, an iterative self-play algorithm. To traverse the enormous game tree, the AI buckets moves by abstracting information in the game. During play, the AI adapts its strategy by modifying its abstractions according to how the opponents play, and by performing real-time search through the game tree. It used the equivalent of $144 of cloud compute to calculate the blueprint strategy and two server-grade CPUs, which was much less hardware than what prior AI game milestones required.

Matthew's opinion: From what I understand, much of the difficulty of poker lies in being careful not to reveal information. For decades, computers have already had an upper hand in being silent, computing probabilities, and choosing unpredictable strategies, which makes me a bit surprised that this result took so long. Nonetheless, I found it interesting how little compute was required to accomplish superhuman play.

Meta learning

Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning (Tianhe Yu, Deirdre Quillen, Zhanpeng He et al) (summarized by Asya): "Meta-learning" or "learning to learn" refers to the problem of transferring insight and skills from one set of tasks to be able to quickly perform well on new tasks. For example, you might want an algorithm that trains on some set of platformer games to pick up general skills that it can use to quickly learn new platformer games. This paper introduces a new benchmark, "Meta-World", for evaluating meta-learning algorithms. The benchmark consists of 50 simulated robotic manipulation tasks that require a robot arm to do a combination of reaching, pushing and grasping. The benchmark tests the ability of algorithms to learn to do a single task well, learn one multi-task policy that trains and performs well on several tasks at once, and adapt to new tasks after training on a number of other tasks. The paper argues that unlike previous meta-learning evaluations, the task distribution in this benchmark is very broad while still having enough shared structure that meta-learning is possible. The paper evaluates existing multi-task learning and meta-learning algorithms on this new benchmark. In meta-learning, it finds that different algorithms do better depending on how much training data they're given. In multi-task learning, it finds that the algorithm that performs best uses multiple "heads", or ends of neural networks, one for each task. It also finds that algorithms that are "off-policy" -- that estimate the value of actions other than the one that the network is currently planning to take -- perform better on multi-task learning than "on-policy" algorithms.

Asya's opinion: I really like the idea of having a standardized benchmark for evaluating meta-learning algorithms. There's a lot of room for improvement in performance on the benchmark tasks and it would be cool if this incentivized algorithm development. As with any benchmark, I worry that it is too narrow to capture all the nuances of potential algorithms; I wouldn't be surprised if some meta-learning algorithm performed poorly here but did well in some other domain.

News

CHAI 2020 Internships (summarized by Rohin): CHAI (the lab where I work) is currently accepting applications for its 2020 internship program. The deadline to apply is Dec 15.
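The regret-matching rule at the core of counterfactual regret minimization, the self-play algorithm behind the poker result summarized above, is simple enough to sketch directly. Below is a minimal illustration in Python, using rock-paper-scissors as a stand-in for poker; this is my own toy implementation of plain regret matching on a normal-form game, not code from the paper:

```python
# Regret matching: play each action with probability proportional to its
# positive cumulative regret. In two-player zero-sum self-play, the *average*
# strategy converges to a Nash equilibrium (uniform play, for RPS).

# Rock-paper-scissors payoffs for player 1 (rows) vs player 2 (columns).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def regret_matching(regrets):
    # Normalize the positive regrets; fall back to uniform if none are positive.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n

def train(iterations=100_000):
    r1, r2 = [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]  # asymmetric start, cycles ensue
    strategy_sum = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        s1, s2 = regret_matching(r1), regret_matching(r2)
        # Expected payoff of each pure action against the opponent's mix.
        ev1 = [sum(s2[b] * PAYOFF[a][b] for b in range(3)) for a in range(3)]
        ev2 = [sum(s1[a] * -PAYOFF[a][b] for a in range(3)) for b in range(3)]
        v1 = sum(s1[a] * ev1[a] for a in range(3))
        v2 = sum(s2[b] * ev2[b] for b in range(3))
        for a in range(3):
            r1[a] += ev1[a] - v1  # accumulate instantaneous regrets
            r2[a] += ev2[a] - v2
            strategy_sum[a] += s1[a]
    return [s / iterations for s in strategy_sum]  # average strategy

avg = train()
print(avg)  # approaches the uniform equilibrium [1/3, 1/3, 1/3]
```

The poker bot layers Monte Carlo sampling, game-tree abstraction, and real-time search on top of this basic update, but the convergence-of-the-average behaviour is the same.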
Discuss

Affordable Housing Workarounds

LessWrong.com News - November 20, 2019 - 16:50
Published on November 20, 2019 1:50 PM UTC

After reading a bit about how affordable housing is actually implemented, it looks to me like rich people could exploit it to avoid paying property and inheritance taxes, and generally get around the means-testing requirements. Affordable housing is about renting or selling homes well below market price, so if there were a large pool of affordability-restricted properties there would be a lot of incentive for people to figure out how to get around the spirit of the restrictions.

I'm going to talk about buying here, but renting has a lot of similarities. A typical buyer restriction today (Somerville example) is something like:

• Annual income no more than $71,400 for a household of two (80% AMI).
• Non-retirement assets no more than $250k.
• Haven't owned a home within 3y ("first-time homebuyer").
• No students.
• Preference for people who currently live or work in the city.
• No legal minimum income, but mortgage lenders will apply ones in practice.

Buyers who meet these restrictions are entered into a lottery, and the winner gets a 2-bedroom, 2.5-bathroom, 1,500-square-foot unit for $177k instead of $1,049k. Property taxes are also very low, ~$200/y instead of ~$9k/y. [1]

These restrictions apply at purchase time: you have to have a relatively low income and assets to qualify, but then there are no further restrictions. This makes sense, because otherwise we would be requiring poor people to stay poor, but it also allows a lot of potential ways for rich people to 'legally cheat':

• Intentionally keep a low income for several years. Three years at $70k instead of $140k loses you $210k, but you'd save more than that in property taxes alone long-term.
• Arrange for deferred or speculative compensation. Stock that vests in four years, stock options, start a startup.
• Get training that gives you high earning potential, but don't start your high-paying job until after you have the house. This training is effectively an asset, but it's very hard for the affordable housing administrators to price it, so it's ignored.
• Learn through self-study or apprenticeship to get around the prohibition on students.
• Postpone transfers to your children until after they have qualified for affordable housing, since the income and assets of relatives are not considered.
• Buy land, take advantage of density bonuses, build a large 100% affordable fancy building, and sell the units to your just-out-of-school currently-low-earning children.

There are also longer-term issues around resale. You can sell to anyone you want, as long as they meet the buyer restrictions and pay no more than the legal maximum price. This means sellers are in a position where they can effectively give a very large untaxed gift. This could let parents transfer large amounts of wealth to their children, untaxed. [2] You could also have problems with corruption, where I buy your property for $200k, but then I sneak you an extra $100k so you sell it to me instead of someone else.

Since these are implemented by deed restriction, they could be hard to fix if they're getting exploited. It's also not necessarily obvious whether or how much abuse there is, since the whole problem is that, based on the city's verification, legitimately poor people and artificially poor people look the same. (And what do we mean by "artificially poor," and do we want to include children of bankers who decide to become artists or low-paid academics?)
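The incentive in the first workaround listed above is easy to sanity-check with the Somerville figures. A quick sketch in Python; the dollar amounts come from the post, but the break-even calculation is my own arithmetic, not a claim from the post:

```python
# Figures from the post: suppress income to $70k/year instead of $140k for
# three years, buy at $177k instead of $1,049k, and pay ~$190/y instead of
# ~$8,830/y in property taxes.
income_forgone = 3 * (140_000 - 70_000)   # cost of staying "poor" for 3 years
price_discount = 1_049_000 - 177_000      # one-time gain at purchase
annual_tax_savings = 8_830 - 190          # ongoing gain every year

print(income_forgone)   # 210000
print(price_discount)   # 872000
# Years of property-tax savings needed to recoup the forgone income alone:
print(income_forgone / annual_tax_savings)
```

The purchase discount alone dwarfs the income sacrificed, and the tax savings catch up on their own after a couple of decades, so the incentive is real even before considering resale.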
It's possible that the amount of hassle involved, relative to the potential savings, is high enough that it's not worth it for rich people to subvert. If 90% of the units are used as intended and only 10% are tax shelters, I'd consider it not great but probably still good. But I'm very nervous about building a system that sets up so many opportunities for people with good lawyers to get around the spirit of the rules.

[1] The property is assessed at a low value because the city sets maximum resale prices. Since that's below the value of the city's residential exemption, you're taxed as if the property is worth just 10% of its assessed value. I calculate $8,830/year in property taxes for the market-rate unit (after the residential exemption) and just $190/year for the affordable unit.

[2] Stow MA's Deed Restriction Program (faq) is an example of a way of doing this that seems especially prone to exploitation.

Comment via: facebook

Discuss

Wrinkles

LessWrong.com News - November 20, 2019 - 01:59
Published on November 19, 2019 10:59 PM UTC

Why does our skin form wrinkles as we age? This post will outline the answer in a few steps:

• Under what conditions do materials form wrinkles, in general?
• How does the general theory of wrinkles apply to aging human skin?
• What underlying factors drive the physiological changes which result in wrinkles?

In the process, we’ll draw on sources from three different fields: mechanical engineering, animation, and physiology.

Why do Materials Wrinkle?

Imagine we have a material with two layers:

• A thin, stiff top layer
• A thick, elastic bottom layer

We squeeze this material from the sides, so the whole thing compresses. The two layers want to do different things under compression:

• The thin top layer maintains its length but wants to minimize bending, so it wants to bow outward and form an arc
• The elastic bottom layer wants to minimize vertical displacement, so it wants to just compress horizontally without any vertical change at all.
Because the two layers are attached, these two objectives trade off, and the end result is waves - aka wrinkles. Longer waves allow the top layer to bend less, so a stiffer top layer yields longer waves. Shorter waves allow the bottom layer to expand/compress less vertically, so a stiffer bottom layer yields shorter waves. The “objectives” can be quantified via the energy associated with bending the top layer or displacing the bottom layer, leading to quantitative predictions of the wavelength - see this great review paper for the math.

Engineers do this with a thin metal coating on soft plastic. The two are bound together at high temperature, and then the whole system compresses as it cools. The end result is cool wrinkle patterns:

Other interesting applications include predicting mountain spacing (with crust and mantle as the two layers) and surface texture of dried fruit - see the review paper for more info and cool pictures. The same thing happens in skin.

Skin Layers

For our purposes, skin has three main layers:

• The epidermis is a thin, relatively stiff top layer
• The SENEB (subepidermal non-echogenic band, also sometimes called subepidermal low-echogenic band, SLEB) is a mysterious age-related layer, mostly absent in youth and growing with age, between the epidermis and dermis - more on this later
• The dermis is the thick base layer, containing all the support structure - blood vessels, connective tissue, etc

Both the SENEB and the dermis are relatively thick, elastic layers, while the epidermis is thin and stiff. So, based on the model from the previous section, we’d expect this system to form wrinkles. But wait, if our skin has a thin stiff top layer and thick elastic bottom layer even in youth, then why do wrinkles only form when we get old?

Turns out, young people have wrinkles too. In youth, the wrinkles have short wavelength - we have lots of tiny wrinkles, so they’re not very visible.
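For a stiff thin film on a soft elastic substrate, the energy trade-off described above has a standard closed-form answer: the wrinkle wavelength is λ = 2πh·(Ē_f / 3Ē_s)^(1/3), where h is the film thickness and Ē = E/(1-ν²) is each layer's plane-strain modulus. A small sketch of that formula in Python; the material numbers plugged in are illustrative assumptions of mine, not measured values from the review paper:

```python
import math

def wrinkle_wavelength(h_film, E_film, E_sub, nu_film=0.5, nu_sub=0.5):
    """Wrinkle wavelength of a stiff thin film on a soft elastic substrate:
    lambda = 2*pi*h * (E_f_bar / (3*E_s_bar))**(1/3), where
    E_bar = E / (1 - nu**2) is the plane-strain modulus."""
    Ef = E_film / (1 - nu_film**2)
    Es = E_sub / (1 - nu_sub**2)
    return 2 * math.pi * h_film * (Ef / (3 * Es)) ** (1 / 3)

# Illustrative numbers: a 50-micron film that is 1000x stiffer than
# the layer beneath it (moduli in pascals).
young = wrinkle_wavelength(h_film=50e-6, E_film=1e6, E_sub=1e3)
# Stiffening the top layer and softening the bottom layer both
# lengthen the waves, which is the aging story told below:
old = wrinkle_wavelength(h_film=50e-6, E_film=5e6, E_sub=0.5e3)
print(young, old)  # old > young
```

Note the cube-root dependence: the film must get 8x stiffer (or the substrate 8x softer) to merely double the wavelength.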
As we age, our wrinkle-wavelength grows, so we have fewer, larger wrinkles - which are more visible. The real question is not “why do wrinkles form as we age?” but rather “why does the wavelength of wrinkles grow as we age?”.

Based on the simple two-layer model, we’d expect that either the epidermis becomes more stiff with age, or the lower layers become less stiff. This is the right basic idea, but of course it’s a bit more complicated in practice. These guys use a three-layer model, cross-reference parameters from the literature with what actually reproduces realistic age-related wrinkling (specifically for SENEB modulus), and find realistic age-related wrinkles with these numbers: (arrows indicate change from young to old). Other than the SENEB elastic modulus, all of these numbers are derived from empirically measured parameters - see the paper for details.

Age-Related Physiological Changes

We have two main questions left:

• Why do the dermis and epidermis stiffen with age?
• What exactly is the SENEB, and why does it grow with age?

I haven’t looked too much into stiffening of the dermis, but the obvious hypothesis is that it stiffens for the same reason lots of other tissues stiffen with age. At some point I’ll have a post on stiffening of the vasculature which will talk about that in more depth, but for now I’m going to punt.

The paper from the previous section notes that the epidermis stiffens mainly due to dehydration; rehydrating the epidermis reverses the stiffening (this is the basis of many cosmetics). A dehydrated epidermis makes sense, since both the SENEB and age-related problems in the vasculature will isolate the epidermis more from the bloodstream (although I haven’t seen direct experimental evidence of that causal link).

That leaves the mysterious SENEB. What is it, and why does it grow with age? The name “subepidermal non-echogenic band” is a fancy way of saying that there’s a layer under the epidermis which is transparent to ultrasound imaging.
That’s the main way the SENEB is detected: it shows up as a space between the epidermis and dermis on ultrasound images of the skin. As far as I can tell, little is known about the SENEB. The main things we do know:

• SENEB grows with age; see numbers above
• SENEB is found in aged skin typically exposed to sunlight (“photoaged”, e.g. hands and face) but not in hidden skin (e.g. butt).

Most authors claim that the SENEB consists of elastin deposits. That matches what we know of solar elastosis, the build-up of elastin deposits in photoaged skin. But I haven’t seen anyone systematically line up the ultrasonic and histologic images and chemically analyze the SENEB layer to check that it really is made of elastin. (This may just be a case of different researchers with different tools using different names for things which are the same.)

Assuming that the SENEB does consist of accumulated elastin, why is elastin accumulating? Well, it turns out that elastin is never broken down in humans. It does not turn over. On the other hand, the skin presumably needs to produce new elastin sometimes to heal wounds. Indeed, many authors note that the skin’s response to UV exposure is basically a wound-healing response. Again, I haven’t seen really convincing data, but I haven’t dug too thoroughly. It’s certainly plausible that elastin is produced in response to UV as part of a wound-healing response, and then accumulates with age. That would explain why the SENEB grows in photoaged skin, but not in hidden skin.

Discuss

Austin meetup notes Nov. 16, 2019: SSC discussion

LessWrong.com News - November 20, 2019 - 01:14
Published on November 19, 2019 1:30 PM UTC

The following is a writeup (pursuant to Mingyuan's proposal) of the discussion at the Austin LW/SSC Meetup on November 16, 2019, at which we discussed six different SlateStarCodex articles. We meet every Saturday at 1:30pm - if you're in the area, come join us!
You are welcome to use the comments below to continue discussing any of the topics raised here. I also welcome meta-level feedback: How do you like this article format? What sorts of meetups lead to interesting writeups?

Disclaimer: I took pains to make it clear before, during, and after the meetup that I was taking notes for posting on LessWrong later. I do not endorse posting meetup writeups without the knowledge and consent of those present!

The Atomic Bomb Considered As Hungarian High School Science Fair Project

There was a Medium post on John von Neumann, which was discussed on Hacker News, which linked to the aforementioned SSC article on why there were lots of smart people in Budapest 1880-1920.

Who was John von Neumann? - One of the founders of computer science, founder of game theory, nuclear strategist. For all his brilliance he's fairly unknown generally. Everyone who knew him said he was an even quicker thinker than Einstein; but why didn't he achieve as much as Einstein? Perhaps because he died of cancer at 53.

Scott Alexander says: {Ashkenazi Jews are smart. Adaptations can have both up- and down-sides (e.g. sickle cell anemia / malaria resistance); likewise some genes cause genetic disorders and also intelligence. These are common in Ashkenazim.}

Jews were forced into finance because Christians weren't allowed to charge interest on loans, but it turned out interest was really useful.

Scott Alexander says: {And why this time period? Because restrictions on Jews only started being lifted just before this period, and they needed a generation or so to pass before they could be successful. And afterward, Nazis happened. Why Hungary and not Germany? Hungary has a "primate city" (Budapest), i.e. a city that's much more prominent than others in its area, so intellectuals will tend to gather there. Germany, by contrast, is less centralized.}

Simulation of idea-sharing and population density - cities are more likely to incubate ideas (Hacker News discussion).
Does that mean we'll get more progress if everyone in a certain field gathers in one place? Perhaps. It's helpful to get feedback for your ideas to get your thinking on the right track, rather than going down a long erroneous path without colleagues to correct you.

Building Intuitions On Non-Empirical Arguments In Science

Scott Alexander says: {Should we reject the idea of a multiverse if it doesn't make testable predictions? No, because it's more parsimonious, contra the "Popperazi" who say that new theories must have new testable predictions.}

This article is interesting because it goes as far as you can into the topic without getting into actual advanced physics. Similar to Tegmark's argument.

What kinds of multiverse are there? Everett (quantum) multiverse, and cosmological multiverse (different Big Bangs with different physical laws coming from them, etc.). This article applies to both (although maybe you could argue that these are both the same thing). Related LessWrong article: Belief in the Implied Invisible.

But how do you think about the probability of you being in a multiverse, if that multiverse might contain an infinite number of beings? Should we totally discount finite-population universes (as being of almost-zero probability) because infinity always outweighs any finite number? See Nick Bostrom's Ph.D. dissertation (this is not that dissertation but it likely covers substantially the same material).

The reason for accepting the Everett multiverse is Occam's razor, because it makes the math simpler. Is that accurate? - Yes, but there's a fundamental disagreement about what "simpler" means. On the one hand, Schrödinger's equation naturally predicts the Many-Worlds Interpretation (MWI). On the other hand, MWI doesn't explain where the probabilities come from. MWIers have been trying to figure this out for a while. Generally probability refers to your state of knowledge about reality.
But quantum mechanics overturns that by positing fundamental uncertainty that is not merely epistemic.

Re MWI probabilities, see Robin Hanson's "Mangled Worlds": {Multiverse branches that don't obey the 2-norm probability rule (a.k.a. the "Born rule") can be shown to decline in measure "faster" than branches that do, and if a branch falls below a certain limit it ceases to exist in any meaningful sense because it merges into the background noise, etc.}

Robin Hanson's an economist, right? - Yes, but he may have studied physics at one point.

Scott Aaronson's 2003 paper: {Maybe it's natural to use the 2-norm to represent probability, because it's the only conserved quantity. If we didn't, we could arbitrarily inflate a particular branch's probability.}

Autism And Intelligence: Much More Than You Wanted To Know

Tower-vs-foundation model - intelligence is composed of a "tower" and a "foundation", and if you build up the tower too much without building up the foundation, the tower collapses and you end up being autistic. Analogy: MS Word and Powerpoint got better with each update till eventually they got so complex that they're not usable any more.

What mechanisms could explain the tower-vs-foundation model?

Is intelligence linear? You can have e.g. a musical prodigy, or someone who's exceptionally good at specific tasks despite being autistic. How is intelligence defined here? - By IQ tests, in the cited studies. But these are designed for neurotypical people.

People with autism have higher-IQ families. But maybe such families are simply more likely to take their kids to doctors to get diagnosed with autism - a major confounder. The studies look mostly at males and the father's genes, but you'd think the mother's genes are equally important.

Facebook post (archive) similar to the tower-vs-foundation concept.
Maybe you could do surveys of lower-income communities to check for autism incidence there - but this is difficult particularly because they're more likely to be mistrustful of strangers asking about such things. Or maybe not; maybe lower-income people are more likely to accept payment for scientific studies.

Testing for autism is questionable - why is there a 3:1 male:female ratio? Is this reflective of reality, or of bias in diagnosis? Perhaps you could tell by seeing if rates of diagnosis increase over time at the same rate for males and females - if females are generally diagnosed later than males, then that might be because of bias in the diagnosis that makes males with autism more likely to be diagnosed than females with autism.

How fuzzy is the category of autism? "It's a spectrum" - or more of a multivariate space?

Article in The Guardian says: {The move to accept (and not treat) autism has been harmful for people with severe autism.}

Scott Alexander says: {If you want to call something a disease, it should have a distinct cluster/separation from non-diseased cases, rather than just a continuum with an arbitrary line drawn on it.} This is particularly important in psychology, because oftentimes we can only observe symptoms and only guess as to the cause (in contrast to e.g. infectious diseases).

Samsara (short story)

In a world where everyone has attained enlightenment, one man stands alone as being unenlightened... He gets more and more stubborn the more the enlightened ones try to reach him, and founds his own school of unenlightenment. We'll stop the discussion here to avoid spoilers, but you should read it.

This is the type of story that would benefit from having padding material added to the end so that you don't know when the ending is about to come, à-la Gödel, Escher, Bach.
It's like that Scottish movie Trainspotting (which requires subtitles for Americans because of the heavy Scottish dialect) - "What if I don't want to be anything other than a heroin addict"?

Financial Incentives Are Weaker Than Social Incentives But Very Important Anyway

Scott Alexander says: {A survey asked people if they would respond to a financial incentive, and if they thought others would respond to the same incentive. People said that others would be more likely to respond to incentives than they themselves were.}

It could be entirely true that most people wouldn't respond to incentives, but some people would, and so when you ask them if "other people" would respond, they answer as if you're asking if "anyone" would. The survey question is unclear.

Social desirability bias - you don't want to be known as someone who accepts incentives easily, because that puts you in a bad negotiating position. Always overstate your asking price. "Would you have sex with me for a billion dollars..." joke.

Speaking of salary negotiations: Always have a good second option you can tell the employer about. But if a candidate claims that "Amazon and Google" are contacting them, that doesn't mean they're any more desirable - Amazon and Google contact everyone!

You could look at sin taxes to see if they have any effect.

Predictably Irrational by Dan Ariely - a daycare started fining parents who were late in picking up their kids, but this resulted in even more parents being late.

Incentives occur at the margin, so it can be effective to have incentives even if "most" people don't respond.

Social incentives are powerful. Can you set up social incentives deliberately? One example: Make public commitments to do something, and get shamed if you later don't do it. But see Derek Sivers's TED talk Keep your goals to yourself. But did they consider the effect of publicly checking in on your progress later? With purely financial incentives e.g. Beeminder, you might treat it transactionally like in the daycare example.

Aside: {Multi-armed bandit problem: There are a bunch of slot machines with different payouts. What's the best strategy? Explore vs. Exploit tradeoff. Algorithms to Live By - book by Brian Christian and Tom Griffiths, who were also on Rationally Speaking. E.g. If you find a good restaurant in a city you're visiting for just a few days, you should go back there again, but in your hometown you should explore more different restaurants.}

Hypothesis explaining the survey: You have more information about yourself. If someone estimates that they have a 30% chance of e.g. moving to another city, they'll say "No" to the survey 100% of the time.

Aside: {Yes Minister TV show features Minister Jim Hacker, a typical well-meaning politician concerned about popularity and getting stuff done; and Sir Humphrey, his secretary, a 30-year civil servant who knows how things actually work and is always frustrating the minister's plans. "The party have had an opinion poll done; it seems all the voters are in favour of bringing back National Service. - Well, have another opinion poll done showing the voters are against bringing back National Service!"}

Scott Alexander concludes: {Skeptical of the research, because we do see people respond to financial incentives. Even if most people don't, it could still be important.}

Too Much Dark Money In Almonds

Scott Alexander says: {Why is there so little money in politics? Less than $12 billion/year in the US, which is less than the amount of money spent on almonds. Hypothesis: this is explained by coordination problems.}

Other ideas: People want to avoid escalation since if they spend money their political opponents will just spend more, etc. But this is implausible because it itself requires a massive degree of coordination.

What if money in politics doesn't actually make much difference? If the world is as depicted in Yes Minister, the government will keep doing the same thing regardless of political spending anyway.

Maybe a better comparison is (almond advertising):(political spending)::(almonds):(all government spending).

Spending directly on a goal is more effective than lobbying the government to spend on that goal, e.g. Elon Musk and SpaceX.

What would have more political spending, an absolute monarchy or a direct democracy? (Disagreement on this.)

Why is bribery more common in some places than others? Maybe you just can't get anything done at all without bribes. Or maybe some places hide it better by means of e.g. revolving-door lobbyist deals, "We'll go easy on your cousin who's in legal trouble", etc.

Aside: {Scott Alexander asks: {Is someone biased simply because they have a stake in something?} Total postmodern discourse would entirely discount someone's argument based on their stake in the matter; but we aren't so epistemically helpless that we can't evaluate the actual contents of an argument.}

Aside: {Administrative clawback: If you fix problems, you'll get less money next year - perhaps by more than enough to cancel out the benefits of the fix. They'll expect you to make just as much progress again, which may not be possible. Don't excel because that'll raise expectations for the future.}

Or maybe almonds are a bigger deal than you think!


How I do research

LessWrong.com News - 19 November 2019 - 23:31
Published on November 19, 2019 8:31 PM UTC

Although I've learned a lot of math over the last year and a half, it still isn't my comparative advantage. What I do instead is,

Find a problem

that seems plausibly important to AI safety (low impact), or a phenomenon that's secretly confusing but not really explored (instrumental convergence). If you're looking for a problem, corrigibility strikes me as another thing that meets these criteria, and is still mysterious.

Stare at the problem on my own, ignoring any existing thinking as much as possible. Just think about what the problem is, what's confusing about it, what a solution would look like. In retrospect, this has helped me avoid anchoring myself. Also, my prior for existing work is that it's confused and unhelpful, and I can do better by just thinking hard. I think this is pretty reasonable for a field as young as AI alignment, but I wouldn't expect this to be true at all for e.g. physics or abstract algebra. I also think this is likely to be true in any field where philosophy is required, where you need to find the right formalisms instead of working from axioms.

Therefore, when thinking about whether "responsibility for outcomes" has a simple core concept, I nearly instantly concluded it didn't, without spending a second glancing over the surely countless philosophy papers wringing their hands (yup, papers have hands) over this debate. This was the right move. I just trusted my own thinking. Lit reviews are just proxy signals of your having gained comprehension and come to a well-considered conclusion.

Concrete examples are helpful - at first, thinking about vases in the context of impact measurement was helpful for getting a grip on low impact, even though it was secretly a red herring. I like to be concrete because we actually need solutions - I want to learn more about the relationship between solution specifications and the task at hand.

Make simplifying assumptions wherever possible. Assume a ridiculous amount of stuff, and then pare it down.

Don't formalize your thoughts too early - you'll just get useless mathy sludge out on the other side, the product of your confusion. Don't think for a second that having math representing your thoughts means you've necessarily made progress - for the kind of problems I'm thinking about right now, the math has to sing with the elegance of the philosophical insight you're formalizing.

Basically forget all about whether you have the license or background to come up with a solution. When I was starting out, I was too busy being fascinated by the problem to remember that I, you know, wasn't allowed to solve it.

Obviously, there are common-sense exceptions to this, mostly revolving around trying to run without any feet. It would be pretty silly to think about logical uncertainty without even knowing propositional logic. One of the advantages of immersing myself in a lot of math isn't just knowing more, but knowing what I don't know. However, I think it's pretty rare to be secretly lacking the basic skills needed to even start on the problem at hand. You'll probably know if you are, because all your thoughts will keep coming back to the same kind of confusions about a formalism, or something. Then, you look for ways to resolve the confusion (possibly by asking a question on LW or in the MIRIx Discord), find the thing, and get back to work.

Stress-test thoughts

So you've had some novel thoughts, and an insight or two, and the outlines of a solution are coming into focus. It's important not to become enamored with what you have, because it stops you from finding the truth and winning. Therefore, think about ways in which you could be wrong, situations in which the insights don't apply or in which the solution breaks. Maybe you realize the problem is a bit ill-defined, so you refactor it.

The process here is: break the solution, deeply understand why it breaks, and repeat. Don't get stuck with patches; there's a rhythm you pick up on in AI alignment, where good solutions have a certain flavor of integrity and compactness. It's OK if you don't find it right away. The key thing to keep in mind is that you aren't trying to pass the test cases, but rather to find brick after brick of insight to build a firm foundation of deep comprehension. You aren't trying to find the right equation, you're trying to find the state of mind that makes the right equation obvious. You want to understand new pieces of the world, and maybe one day, those pieces will make the ultimate difference.


The Anatomy & Experiences of Chakras & Qi.

LessWrong.com News - 19 November 2019 - 22:45
Published on November 19, 2019 7:45 PM UTC

This is an adjunct to my Base-Line Hypothesis of Human Health and Movement. BLH part 1.

Chakras.

"Chakra" is probably the most well-known terminology in English for a concept that appears in many traditions. I remain vague about 'many traditions' (my knowledge is insufficient to comment) and I include no definition for "chakra" but the existence of chakras is a topic that appears to split people (who have an opinion on the topic and inhabit the online world) into two camps - those that talk of chakras as if they are real phenomenon and those that say it's all nonsense. Is that a fair assessment?

My First Thoughts.

After rock-bottom, when I had started working with my Base-Line muscles [Technique - Breathing with your Base-Line], I found myself thinking 'red, orange, yellow, green ...' and the concept of chakras came to mind.

For illustrative purposes only!!

I've seen the typical posters (go-ogle images) for chakras but I've never even been to a yoga class (and don't have another example of where I might encounter chakras) so I would classify my starting knowledge as almost zero.

My Research.

I like to start at the start - to look for the original source of a concept, so a bit of internet trawling followed.

Reading the blurbs from a couple of 'classic' chakra books, instinct/rational thought said to me: this is not the right path / seems like a load of BS.

Most information that appears via go-ogle is an echo-chamber - energy centres, meditation, blockages, symbols, colours, petals ... It gets flaky fast.

I did however deem a couple of articles book-worthy at the time:

hinduism/concepts/chakras. A couple of lines stood out: chakra or cakra has multiple meanings. The earliest known mention of chakras is found in the later Upanishads, including specifically the Brahma Upanishads and the Yogatattva Upanishads.

I stopped reading at the "seven basic chakras". I felt I had neither time or need to read. The details feel like stories used to explain phenomenon, not something set in stone to memorise.

Tantrik studies: Makes some interesting points on how chakras are portrayed and the modern associations with them. A few lines: ... books on the chakras based on sound comprehension of the original Sanskrit sources so far exist only in the academic world. There's not just one chakra system in the original tradition, there are many. The chakra systems are prescriptive, not descriptive.
I Learned.
• The concept of "chakras" originate in texts called The Upanishads.
• The original texts are written in Sanskrit.
• "chakra" has multiple meanings.
• There are many "chakra" systems described.
• The chakra systems are prescriptive, not descriptive. Use the word 'imagine' before the details...

I skimmed the Wikipedia page and a couple of other sites for information on the Upanishads. I came across: Introduction to The Upanishads. The texts start here. I've not read them.

My Assessment.

The Upanishads were written a long time ago in Sanskrit, by authors unknown.

I can't read Sanskrit so all information available to me is a translation.

Translations are subject to interpreter error and whim. There appear to be a couple of 'classic' translations and an echo-chamber of presentations that do not accurately represent the original concepts and are not the full story.

Presentations of chakras appear to be both simplified and elaborated.

Qi, Prana.

Qi, Chi, Ki, Prana

Qi translates as "air" and figuratively as "material energy". Prana - breath, vital force forming part of any living entity.

Many labels when I was searching for a definition. A feeling of vagueness, something not solid but many people feel it.

My Rationale on Chakras and Qi.

I believe these concepts are all trying to describe the sensory experience of conscious proprioception and the strength and power of the body when the main muscles of movement are fully engaged. The body functioning at optimal, dynamically balanced and aligned. Body and mind are connected, the mind free of the distractions caused by physical imbalance.

Something to be felt to be understood.

My experiences (an introduction): what I feel and visualise about the position and condition of my body when working from Base-Line - the connection that allows me to "see" my anatomy in lights and colours.

I have not looked into the detail of specific chakras because I have not come across a source I am happy to work from but I would be very interested in several Sanskrit readers picking apart potential interpretations of the original words in the Upanishads and then comparing those descriptions to the relevant anatomy. e.g. the aponeuroses that sandwich the muscle of the rectus femoris, the panels of muscle that make up each rectus abdominis and the tendinous intersections and linea alba, the shape and alignment of the nuchal ligament and the lamina of connective tissue that attaches the trapezius muscles to the skull.

Pelvic floor = "root chakra". The pelvic floor is a crescent shape on midline.

The rectus abdominis and trapezius muscles the other main chakras...

The linea alba and nuchal & supraspinous ligaments an imaginary ribbon from pubic symphysis of the pelvis to external occipital protuberance at the back of the skull. A strong flexible band - like a fishing line being cast/ in the wind / snake!

Energy.

How would I describe what I can see with my sense of proprioception? I've gone with 'sparkles' but 'energy' seems a reasonable description of the flow of lights and colours.

A thought experiment: Think of electrons as positive, protons as negative. Does it change your perspective at all?

We are electro-chemical beings, a constant movement of ions and electrons.

Somewhere along the line I started to think: proton-electron ~ yin-yang. Random maybe but I feel the need to include the comment here.

Being able to move my head and hips in this configuration. Smooth and unrestricted through a full range of movement, changing direction in any position.

Blockages In Energy.

I believe "blockages in energy" are physical restrictions on the body.

Physical restrictions reduce range of movement and affect the sensory feedback, leaving 'blank spaces' and misaligned signals on the body map in the mind. Releasing the restrictions allows the sparkles to be experienced - the 'energy' to flow.

-- - -

See what you experience.


Street Epistemology: Practice Session

Events at Кочерга - 19 November 2019 - 19:30
Street epistemology is a special way of conducting dialogues. It makes it possible to examine any belief, even on the most explosive topics, without sliding into an argument, while letting the interlocutors improve their methods of acquiring knowledge.

Drawing on Walls

LessWrong.com News - 19 November 2019 - 18:00
Published on November 19, 2019 3:00 PM UTC

When I started the bathroom project there was a lot of reason to move quickly: the bathroom wouldn't be usable while I was working on it, and the back bedroom was full of construction stuff. Once I got to the stage where the only thing left to do was plaster and paint the hallway, however, it was less of a priority. So we spent May-November with unfinished walls.

Since we were going to paint them at some point, one afternoon I thought it would be fun to draw on them with the kids. We got out the markers and drew lots of different things. I emphasized that it was only these walls we could draw on, which is the kind of rule the kids do well with.

A couple days later they drew on the walls again, but this time with crayon. Crayon, being wax-based, is not a good layer to have under paint. I hadn't thought to tell them not to use them, and they didn't have a way to know, but I was annoyed at myself. I got most of it off with hot water and a cloth, and then when it came to do the plastering I put a skim coat over it.

Later on a friend wanted help preparing for a coding interview, so we used the wall as a whiteboard:

One thing I hadn't considered was that you need more primer over dark marker than over plain drywall. As with "no crayon under paint" this seems like it should have been obvious, and something I should have thought about before letting them draw on a large area of the wall, but it wasn't and I didn't, so Julia ended up spending longer painting and priming than we'd expected.

And then, the evening after all that painting, Anna took a marker over to the nice new clean white wall and started drawing. We hadn't told her that the wall was no longer ok for drawing, and at 3y "now that the wall is painted drawing isn't ok" is not the sort of thing I should be expecting her to know on her own. Luckily the marker was washable, so it wasn't too bad.

Overall this was more work than I was expecting, and probably wasn't worth it for the fun, at least not the way we did it. If I did it again I'd make sure they didn't use crayon, and either give them light colored markers or pick a smaller section of the wall for them to play with.


Cybernetic dreams: Beer's pond brain

LessWrong.com News - 19 November 2019 - 15:48
Published on November 19, 2019 12:48 PM UTC

"Cybernetic dreams" is my mini series on ideas from cybernetic research that has yet to fulfill their promise. I think there are many cool ideas in cybernetics research that has been neglected and I hope that this series brings them more attention.

Cybernetics is a somewhat hard-to-describe style of research from roughly the 1940s to the 1970s. It is as much an aesthetic as it is a research field. The main goals of cybernetics research are to understand how complex systems (especially life, machines, and economic systems) work, and how they can be evolved, constructed, fixed, and changed. The main sensibilities of cybernetics are biology, mechanical engineering, and calculus.

Today we discuss Stafford Beer's pond brain.

Stafford Beer

Stafford Beer was a cybernetician who tried to make economic systems more efficient by cybernetic means. Project Cybersyn, an attempt to build a cybernetic economic system, is his most famous project; it will be discussed in a future episode. From Wikipedia:

Stafford Beer was a British consultant in management cybernetics. He also sympathized with the stated ideals of Chilean socialism of maintaining Chile's democratic system and the autonomy of workers instead of imposing a Soviet-style system of top-down command and control. One of its main objectives was to devolve decision-making power within industrial enterprises to their workforce in order to develop self-regulation [homeostasis] of factories.

The cybernetic factory

The ideal factory, according to Beer, should be like an organism attempting to maintain homeostasis. Raw material comes in, product comes out, money flows through. The factory would have sensory organs, a brain, and actuators.

From (Pickering, 2004):

The T- and V-machines are what we would now call neural nets: the T-machine collects data on the state of the factory and its environment and translates them into meaningful form; the V- machine reverses the operation, issuing commands for action in the spaces of buying, production and selling. Between them lies the U-Machine, which is the homeostat, the artificial brain, which seeks to find and maintain a balance between the inner and outer conditions of the firm—trying to keep the firm operating in a liveable segment of phase-space.

By 1960, Beer had at least simulated a cybernetic factory at Templeborough Rolling Mills, a subsidiary of his employer, United Steel... [The factory has sensory organs that measures "tons of steel bought", "tons of steel delivered", "wages", etc.] At Templeborough, all of these data were statistically processed, analysed and transformed into 12 variables, six referring to the inner state of the mill, six to its economic environment. Figures were generated at the mill every day—as close to real time as one could get... the job of the U-Machine was to strike a homeostatic balance between [the output from the sensory T-machines and the commands to the actuating V-machines]. But nothing like a functioning U-Machine had yet been devised. The U-Machine at Templeborough was still constituted by the decisions of human managers, though now they were precisely positioned in an information space defined by the simulated T- and V-Machines.

Unconventional computing

[Beer] wanted somehow to enrol a naturally occurring homeostatic system as the brain of the cybernetic factory.

He emphasized that the system must have rich dynamics, because he believed in Ashby's "Law of requisite variety", which, roughly speaking, states that a regulator can keep a system in homeostasis only if it has at least as many internal states as the external states it encounters.
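As a toy illustration of the law (my own sketch, not an example from Beer or Ashby): suppose a disturbance d takes one of n values, the regulator picks one of k responses, and the outcome is (d + r) mod n. Each outcome value is then reachable from at most k disturbances, so no regulation policy can compress the outcomes below n/k distinct values - outcome variety is bounded below by disturbance variety divided by regulator variety.

```python
import math

def min_outcome_variety(n: int, k: int) -> int:
    """Toy regulator: disturbance d in Z_n, response r in {0, ..., k-1},
    outcome = (d + r) % n. Each disturbance can be steered to any of k
    consecutive outcomes, so covering all n disturbances amounts to
    covering Z_n with arcs of length k."""
    outcomes = set()
    d = 0
    while d < n:
        # one shared outcome absorbs disturbances d, d+1, ..., d+k-1
        outcomes.add((d + k - 1) % n)
        d += k
    return len(outcomes)

print(min_outcome_variety(12, 3))   # best case: 4 distinct outcomes
print(min_outcome_variety(12, 12))  # 1: requisite variety achieved
assert min_outcome_variety(12, 3) == math.ceil(12 / 3)
```

Full homeostasis (a single fixed outcome) is possible here only when the regulator's variety k matches the disturbances' variety n, which is the point of the law.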

during the second half of the 1950s, he embarked on ‘an almost unbounded survey of naturally occurring systems in search of materials for the construction of cybernetic machines’ (1959, 162).

In 1962 he wrote a brief report on the state of the art, which makes fairly mindboggling reading (Beer 1962b)... The list includes a successful attempt to use positive and negative feedback to train young children to solve simultaneous equations without teaching them the relevant mathematics—to turn the children into a performative (rather than cognitive) mathematical machine—and it goes on to discuss an extension of the same tactics to mice! This is, I would guess, the origin of the mouse-computer that turns up in both Douglas Adams’ Hitch-Hikers Guide to the Universe and Terry Pratchett’s Discworld series of fantasy novels.

Research like this is still ongoing, under the banner of "unconventional computing". For example, in 2011, scientists induced crab swarms to behave in such a way that they implement logic gates. Some scientists also try to use the intuitive intelligence of untrained people to solve mathematical problems, as in the Quantum Moves game, which solves quantum optimization problems.

Pond brain

Beer also reported attempts to induce small organisms, Daphnia collected from a local pond, to ingest iron filings so that input and output couplings to them could be achieved via magnetic fields, and another attempt to use a population of the protozoon Euglena via optical couplings. (The problem was always how to contrive inputs and outputs to these systems.) Beer’s last attempt in this series was to use not specific organisms but an entire pond ecosystem as a homeostatic controller, on which he reported that, ‘Currently there are a few of the usual creatures visible to the naked eye (Hydra, Cyclops, Daphnia, and a leech); microscopically there is the expected multitude of micro-organisms. . . The state of this research at the moment,’ he said in 1962, ‘is that I tinker with this tank from time to time in the middle of the night’ (1962b, 31).

In the end, this wonderful line of research foundered, not on any point of principle, but on Beer’s practical failure to achieve a useful coupling to any biological system of sufficiently high variety.

In other words, Beer couldn't figure out a way to talk to a sufficiently complicated system in its own language (except perhaps with human business managers, but they cost more than feeding a pond of microorganisms).

Matrix brain

The pond brain is wild enough, but it wasn't Beer's end goal for the brain of the cybernetic factory.

the homeostatic system Beer really had in mind was something like the human spinal cord and brain. He never mentioned this in his work on biological computers, but the image that sticks in my mind is that the brain of the cybernetic factory should really have been an unconscious human body, floating in a vat of nutrients and with electronic readouts tapping its higher and lower reflexes—something vaguely reminiscent of the movie The Matrix. This horrible image helps me at least to appreciate the magnitude of the gap between cybernetic information systems and more conventional approaches.

As shown in an illustration in his book Brain of the firm (The Managerial cybernetics of organization):

Reservoir computing

Reservoir computing is somewhat similar to Beer's idea of using one complex system to control another. The "reservoir" is a complex system that is cheap to run and easy to talk to. For example, a recurrent neural network (a neural network with feedback loops, in contrast to a feedforward neural network, which has no feedback loops) of sufficient complexity (hinting at the law of requisite variety) can serve as a reservoir. To talk to the reservoir, just cast your message as a list of numbers, and input them to some neurons in the network. Then wait for the network to "think", before reading the states of some other neurons in the network. That is the "answer" from the reservoir.

This differs from deep learning in that the network serving as the reservoir is left alone. It is initialized randomly, and its synaptic strengths remain unchanged. The only learning parts of the system are the inputs and outputs, which can be trained very cheaply with linear regression and classification. In other words, the reservoir remains the same, and we must learn to speak its language, which is surprisingly easy to do.
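A minimal echo state network sketch illustrates this setup. All sizes, the toy task, and the ridge penalty below are hypothetical choices for illustration: the recurrent weights are random and frozen, and only a linear readout is fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
n_res, n_steps = 200, 500

# The reservoir: random recurrent weights, fixed forever.
W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
W = rng.normal(0, 1.0, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # keep dynamics stable

def run_reservoir(u):
    """Drive the reservoir with input sequence u and collect its states."""
    x = np.zeros(n_res)
    states = np.empty((len(u), n_res))
    for t, u_t in enumerate(u):
        x = np.tanh(W_in[:, 0] * u_t + W @ x)
        states[t] = x
    return states

# Toy task: predict a phase-shifted copy of a sine wave.
t = np.linspace(0, 20, n_steps)
u, y = np.sin(t), np.sin(t + 0.2)

X = run_reservoir(u)

# Only the linear readout is trained, via ridge regression.
reg = 1e-6
W_out = np.linalg.solve(X.T @ X + reg * np.eye(n_res), X.T @ y)

pred = X @ W_out
mse = np.mean((pred[100:] - y[100:]) ** 2)  # skip an initial "washout" period
print(mse)
```

The readout learns to "speak the reservoir's language" with nothing more than linear regression, which is why training is so cheap.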

Another advantage is that the reservoir without adaptive updating is amenable to hardware implementation using a variety of physical systems, substrates, and devices. In fact, such physical reservoir computing has attracted increasing attention in diverse fields of research.

Other reservoirs can be used, as long as they are complex and cheap to run. For example, (Du et al, 2017) built reservoirs out of physical memristors:

... a small hardware system with only 88 memristors can already be used for tasks, such as handwritten digit recognition. The system is also used to experimentally solve a second-order nonlinear task, and can successfully predict the expected output without knowing the form of the original dynamic transfer function.

(Tanaka et al, 2019) reviews many types of physical reservoirs, including biological systems!

researchers have speculated about which part of the brain can be regarded as a reservoir or a readout as well as about how subnetworks of the brain work in the reservoir computing framework. On the other hand, physical reservoir computing based on in vitro biological components has been proposed to investigate the computational capability of biological systems in laboratory experiments.

Chaos computing

"Chaos computing" is one instance of reservoir computing. The reservoir is an electronic circuit with chaotic dynamics, and the trick is to design the reservoir just right, so that it performs logical computations. The only company that seems to do this is ChaoLogix, and what it had back in 2006 was already quite promising.

ChaoLogix has gotten to the stage where it can create any kind of gate from a small circuit of about 30 transistors. This circuit is then repeated across the chip, which can be transformed into different arrangements of logic gates in a single clock cycle, says Ditto.

"In a single clock cycle" is significant: a field-programmable gate array (FPGA), which can also rearrange its logic gates, takes millions of clock cycles to rearrange itself.

ChaoLogix was acquired by ARM in 2017, apparently for security reasons:

One benefit is that chaogates are said to have a power signature that is independent of the inputs which makes it valuable in thwarting differential power analysis (DPA) side channel attacks.

Pickering, Andrew. “The Science of the Unknowable: Stafford Beer’s Cybernetic Informatics.” Kybernetes 33, no. 3/4 (2004): 499–521. https://doi.org/10/dqjsk8.

Tanaka, Gouhei, Toshiyuki Yamane, Jean Benoit Héroux, Ryosho Nakane, Naoki Kanazawa, Seiji Takeda, Hidetoshi Numata, Daiju Nakano, and Akira Hirose. “Recent Advances in Physical Reservoir Computing: A Review.” Neural Networks 115 (July 1, 2019): 100–123. https://doi.org/10/ggc6hf.


The Goodhart Game

LessWrong.com News - November 19, 2019 - 02:22
Published on November 18, 2019 11:22 PM UTC

In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security.

From the abstract of Motivating the Rules of the Game for Adversarial Example Research by Gilmer et al (summary)

Adversarial examples have been great for getting more ML researchers to pay attention to alignment considerations. I personally have spent a fair amount of time thinking about adversarial examples, I think the topic is fascinating, and I've had a number of ideas for addressing them. But I'm also not actually sure working on adversarial examples is a good use of time. Why?

Like Gilmer et al, I think adversarial examples are undermotivated... and overrated. People in the alignment community like to make an analogy between adversarial examples and Goodhart's Law, but I think this analogy fails to be more than an intuition pump. With Goodhart's Law, there is no "adversary" attempting to select an input that the AI does particularly poorly on. Instead, the AI itself is selecting an input in order to maximize something. Could the input the AI selects be an input that the AI does poorly on? Sure. But I don't think the commonality goes much deeper than "there are parts of the input space that the AI does poorly on". In other words, classification error is still a thing. (Maybe both adversaries and optimization tend to push us off the part of the distribution our model performs well on. OK, distributional shift is still a thing.)

At the same time, metrics have taken us a long way in AI research, whether those metrics are ability to withstand human-crafted adversarial examples or score well on ImageNet. So what would a metric which hits the AI alignment problem a little more squarely look like? How could we measure progress on solving Goodhart's Law instead of a problem that's vaguely analogous?

Let's start simple. You submit an AI program. Your program gets some labeled data from a real-valued function to maximize (standing in for "labeled data about the operator's true utility function"). It figures out where it thinks the maximum of the function is and makes its guess. Score is based on regret: the function's true maximum minus the function value at the alleged maximum.

We can make things more interesting. Suppose the real-valued function has both positive and negative outputs. Suppose most outputs of the real-valued function are negative (in the same way most random actions a powerful AI system could take would be negative from our perspective). And the AI system gets the option to abstain from action, which yields a score of 0. Now there's more of an incentive to find an input which is "acceptable" with high probability, and abstain if in doubt.
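A sketch of this scoring setup, where the utility function, the abstain option, and the cautious agent strategy are all invented stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # Stand-in for the operator's true utility: a narrow positive
    # region around x = 0.3, negative everywhere else.
    return 1.0 - 4.0 * abs(x - 0.3)

ABSTAIN = None  # abstaining from action always scores 0

def score(guess):
    return 0.0 if guess is ABSTAIN else true_utility(guess)

def regret(guess, candidates):
    """Best achievable utility over the candidates minus the score of the guess."""
    return max(true_utility(x) for x in candidates) - score(guess)

# A cautious agent: take the best labeled input, abstain if even that looks bad.
labeled_inputs = rng.uniform(0, 1, 20)
best_seen = max(labeled_inputs, key=true_utility)
guess = best_seen if true_utility(best_seen) > 0 else ABSTAIN

print(score(guess), regret(guess, labeled_inputs))
```

Because most inputs score negatively, the abstain option changes the incentive: the agent should only act when it is confident the outcome is acceptable.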

Maybe the labeled data gets the true utility function wrong in important ways. We can add noise to the data somehow before passing it to our AI system to simulate this. Perhaps some outputs can be assigned completely at random.

Even with noise, the best strategy might be to just select the input from the labeled data that produces the largest positive output. But this isn't necessarily feasible for a real AGI. If labeled data corresponds to desired and undesired behavior for our AGI, it probably won't work to just execute the "most desired" behavior from the labeled dataset, because that "most desired" behavior will be contingent on a particular set of circumstances.

To simulate this, we can add distributional shift to the challenge. Provide some labeled data, then specify a feasible region that may contain little or no labeled data. Ask the AI for the best input in the feasible region. There's a straightforward analogy to selecting an action here, if you think of the input space as a vector where info describing the circumstances is concatenated with info describing an action. We could choose a feasible region where the "circumstances" part of the vector is fixed, and there are a limited number of "actions" available to choose between.

If we restrict the feasible region to two points, and the two points are chosen adversarially by inspecting the AI's model, the traditional adversarial examples scenario can be seen as an unusual special case of this framework. I think this illustrates how artificial the "adversarial examples" scenario is from a Goodhart's Law perspective, because the universe is rarely perverse enough to deliberately restrict us to only two options, one of which has been specially chosen to deceive us. Although the broadness of action space has been cited as a reason why Goodhart's Law occurs, I suspect a broader action space actually makes Goodhart's Law easier to overcome if you use it well -- for example, by simply selecting the labeled input with the best output, as described above.

To simulate an AI making queries of its human operator, you could allow the system a limited number of queries to the function it is trying to maximize. However, we probably should not focus on this active learning scenario to the exclusion of all else -- to prevent showing the operator info hazards, we'll want a model for "what constitutes an info hazard", and that model can't easily be trained by active learning without showing the operator instances of info hazards! (Well, maybe if they were restricted to mild info hazards...)

Where does the function to maximize come from? I see two options: people functions and computer functions. For people functions, you could use taskers to evaluate the computer's output. There's already been work on generating cat pictures, which could be seen as an attempt to maximize the person function "how much does this image look like a cat". But ideas from this post could still be applied to such a problem. For example, to add distributional shift, you could find a weird cat picture, then fix a bunch of the weirder pixels on it as the "feasible region", leave the other pixels unassigned, and see if an AI system can recover a reasonable cat according to taskers. Can an AI generate a black cat after only having seen tawny cats? What other distributional constraints could be imposed?

For computer functions, you'd like to keep your method for generating the function secret, because otherwise contest participants can code their AI system so it has an inductive bias towards learning the kind of functions that you like to use. Also, for computer functions, you probably want to be realistic without being perverse. For example, you could have a parabolic function which has a point discontinuity at the peak, and that could fool an AI system that tries to fit a parabola on the data and guess the peak, but this sort of perversity seems a bit unlikely to show up in real-world scenarios (unless we think the function is likely to go "off distribution" in the region of its true maximum?) Finally, in the same way most random images are not cats, and most atom configurations are undesired by humans, most inputs to your computer function should probably get a negative score. But in the same way it's easier for people to specify what they want than what they don't want, you might want to imbalance your training dataset towards positive scores anyway.

To ensure high reliability, we'll want means by which these problems can be generated en masse, to see if we can get the probability of e.g. proposing an input that gets a negative output well below 0.1%. Luckily, for any given function/dataset pair, it's possible to generate a lot of problems just by challenging the AI on different feasible regions.
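A sketch of how a single function/dataset pair fans out into many evaluation problems. The hidden function, region widths, and dataset size here are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical hidden function standing in for the true utility.
    return np.sin(3 * x) - 0.5

# One function/dataset pair...
xs = rng.uniform(0, 1, 50)
labeled = [(x, f(x)) for x in xs]

def make_problem(lo, hi):
    """One challenge: the AI sees labeled data outside [lo, hi] but must
    pick the best input inside [lo, hi], where it has little or no data."""
    visible = [(x, y) for x, y in labeled if not (lo <= x <= hi)]
    return visible, (lo, hi)

# ...fans out into many problems, one per feasible region.
problems = [make_problem(i * 0.05, i * 0.05 + 0.1) for i in range(18)]
print(len(problems))  # 18 problems from a single dataset
```

Each problem forces a small distributional shift, so failure rates like "proposes a negative-output input less than 0.1% of the time" can be estimated over many such challenges.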

Anyway, I think work on this problem will be more applicable to real-world AI safety scenarios than adversarial examples, and it doesn't seem to me that it reduces quite as directly to "solve AGI" as adversarial examples work.


Self-Fulfilling Prophecies Aren't Always About Self-Awareness

LessWrong.com News - November 19, 2019 - 02:11
Published on November 18, 2019 11:11 PM UTC

This is a belated follow-up to my Dualist Predict-O-Matic post, where I share some thoughts re: what could go wrong with the dualist Predict-O-Matic.

Belief in Superpredictors Could Lead to Self-Fulfilling Prophecies

In my previous post, I described a Predict-O-Matic which mostly models the world at a fuzzy resolution, and only "zooms in" to model some part of the world in greater resolution if it thinks knowing the details of that part of the world will improve its prediction. I considered two cases: the case where the Predict-O-Matic sees fit to model itself in high resolution, and the case where it doesn't, and just makes use of a fuzzier "outside view" model of itself.

What sort of outside view models of itself might it use? One possible model is: "I'm not sure how this thing works, but its predictions always seem to come true!"

If the Predict-O-Matic sometimes does forecasting in non-temporal order, it might first figure out what it thinks will happen, then use that to figure out what it thinks its internal fuzzy model of the Predict-O-Matic will predict.

And if it sometimes revisits aspects of its forecast to make them consistent with other aspects of its forecast, it might say: "Hey, if the Predict-O-Matic forecasts X, that will cause X to no longer happen". So it figures out what would actually happen if X gets forecasted. Call that X'. Suppose X != X'. Then the new forecast has the Predict-O-Matic predicting X and then X' happens. That can't be right, because outside view says the Predict-O-Matic's predictions always come true. So we'll have the Predict-O-Matic predicting X' in the forecast instead. But wait, if the Predict-O-Matic predicts X', then X'' will happen. Etc., etc. until a fixed point is found.
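This X → X' → X'' process is just a fixed-point search. A toy numeric sketch, with an invented "reaction function" standing in for how the world responds to a published forecast:

```python
def world_given_prediction(pred):
    # Toy reaction function: publishing a forecast shifts what happens.
    # (Invented dynamics purely for illustration.)
    return 0.5 * pred + 1.0

# Predict X, notice the world would instead do X', predict X', ...
x = 0.0
for _ in range(50):
    x = world_given_prediction(x)

print(round(x, 6))  # converges to the fixed point where x = 0.5 * x + 1, i.e. x = 2.0
```

Whether such iteration converges at all depends on the reaction function; here it contracts toward a single self-fulfilling prediction.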

Some commenters on my previous post talked about how making the Predict-O-Matic self-unaware could be helpful. Note that self-awareness doesn't actually help with this failure mode, if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view "I'm not sure how this thing works, but its predictions always seem to come true!" So the problem here is not self-awareness. It's belief in superpredictors, combined with a particular forecasting algorithm: we're updating our beliefs in a cyclic fashion, or hill-climbing our story of how the future will go until the story seems plausible, or something like that.

Before proposing a solution, it's often valuable to deepen your understanding of the problem.

Glitchy Predictor Simulation Could Step Towards Fixed Points

Let's go back to the case where the Predict-O-Matic sees fit to model itself in high resolution and we get an infinite recurse. Exactly what's going to happen in that case?

I actually think the answer isn't quite obvious, because although the Predict-O-Matic has limited computational resources, its internal model of itself also has limited computational resources. And its internal model's internal model of itself has limited computational resources too. Etc.

Suppose Predict-O-Matic is implemented in a really naive way where it just crashes if it runs out of computational resources. If the toplevel Predict-O-Matic has accurate beliefs about its available compute, then we might see the toplevel Predict-O-Matic crash before any of the simulated Predict-O-Matics crash. Simulating something which has the same amount of compute you do can easily use up all your compute!

But suppose the Predict-O-Matic underestimates the amount of compute it has. Maybe there's some evidence in the environment which misleads it into thinking that it has less compute than it actually does. So it simulates a restricted-compute version of itself reasonably well. Maybe that restricted-compute version of itself is misled in the same way, and simulates a double-restricted-compute version of itself.

Maybe this all happens in a way so that the first Predict-O-Matic in the hierarchy to crash is near the bottom, not the top. What then?

Deep in the hierarchy, the Predict-O-Matic simulating the crashed Predict-O-Matic makes predictions about what happens in the world after the crash.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic].

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic]].

Predicting world gets us world', predicting world' gets us world'', predicting world'' gets us world'''... Every layer in the hierarchy takes us one step closer to a fixed point.

Note that just like the previous section, this failure mode doesn't depend on self-awareness. It just depends on believing in something which believes it self-simulates.

Repeated Use Could Step Towards Fixed Points

Another way the Predict-O-Matic can step towards fixed points is through simple repeated use. Suppose each time after making a prediction, the Predict-O-Matic gets updated data about how the world is going. In particular, the Predict-O-Matic knows the most recent prediction it made and can forecast how humans will respond to that. Then when the humans ask it for a new prediction, it incorporates the fact of its previous prediction into its forecast and generates a new prediction. You can imagine a scenario where the operators keep asking the Predict-O-Matic the same question over and over again, getting a different answer every time, trying to figure out what's going wrong -- until finally the Predict-O-Matic begins to consistently give a particular answer -- a fixed point it has inadvertently discovered.

As Abram alluded to in one of his comments, the Predict-O-Matic might even foresee this entire process happening, and immediately forecast the fixed point corresponding to the end state. Though, if the forecast is detailed enough, we'll get to see this entire process happening within the forecast, which could allow us to avoid an unwanted outcome.

Solutions

An idea which could address some of these issues: Ask the Predict-O-Matic to make predictions conditional on us ignoring its predictions and not taking any action. Perhaps we'd also want to specify that any existing or future superpredictors will also be ignored in this hypothetical.

Then if we actually want to do something about the problems the Predict-O-Matic foresees, we can ask it to predict how the world will go conditional on us taking some particular action.

Prize

Sorry I was slower than planned on writing this follow-up and choosing a winner. I've decided to give Bunthut a $110 prize (including $10 interest for my slow follow-up). Thanks everyone for your insights.


The new dot com bubble is here: it’s called online advertising

LessWrong.com News - November 19, 2019 - 01:05
Published on November 18, 2019 10:05 PM UTC

If you want to understand Goodharting in advertising, this is a great article for that.

At the heart of the problems in online advertising are selection effects, which the article explains with this cute example:

Picture this. Luigi’s Pizzeria hires three teenagers to hand out coupons to passersby. After a few weeks of flyering, one of the three turns out to be a marketing genius. Customers keep showing up with coupons distributed by this particular kid. The other two can’t make any sense of it: how does he do it? When they ask him, he explains: "I stand in the waiting area of the pizzeria."

It’s plain to see that junior’s no marketing whiz. Pizzerias do not attract more customers by giving coupons to people already planning to order a quattro stagioni five minutes from now.

The article goes through an extended case study at eBay, where selection effects were causing particularly expensive results without anyone realizing it for years:

The experiment continued for another eight weeks. What was the effect of pulling the ads? Almost none. For every dollar eBay spent on search advertising, they lost roughly 63 cents, according to Tadelis’s calculations.

The experiment ended up showing that, for years, eBay had been spending millions of dollars on fruitless online advertising excess, and that the joke had been entirely on the company. To the marketing department everything had been going brilliantly. The high-paid consultants had believed that the campaigns that incurred the biggest losses were the most profitable: they saw brand keyword advertising not as a $20m expense, but as a $245.6m return.

The problem, of course, is Goodharting: optimizing for something that's easy to measure rather than what we actually care about.

And unsurprisingly, there's an alignment problem hidden in there:

It might sound crazy, but companies are not equipped to assess whether their ad spending actually makes money. It is in the best interest of a firm like eBay to know whether its campaigns are profitable, but not so for eBay’s marketing department. Its own interest is in securing the largest possible budget, which is much easier if you can demonstrate that what you do actually works. Within the marketing department, TV, print and digital compete with each other to show who’s more important, a dynamic that hardly promotes honest reporting. The fact that management often has no idea how to interpret the numbers is not helpful either. The highest numbers win.

To this I'll just add that this problem is somewhat solvable, but it's tricky. I previously worked at a company where our entire business model revolved around calculating lift in online advertising spend by matching up online ad activity with offline purchase data, and a lot of that involved having a large and reliable control group against which to calculate lift. The bad news, as we discovered, was that the data was often statistically underpowered: it could at best distinguish between negative, neutral, and positive lift, and could only detect non-neutral lift in cases where the evidence was strong enough that you could have eyeballed it anyway. And the worse news was that we had to tell people their ads were not working or, worse yet, were lifting the performance of competitors' products.
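A toy simulation of why such lift measurements are underpowered. The purchase rates, the true lift, and the group sizes below are made up, and the z-test is a deliberately crude two-proportion test:

```python
import numpy as np

rng = np.random.default_rng(0)

def measured_lift(base_rate, true_lift, n):
    """Simulate exposed vs. control purchases and a crude two-proportion z-test."""
    control = rng.binomial(n, base_rate)
    exposed = rng.binomial(n, base_rate * (1 + true_lift))
    p_c, p_e = control / n, exposed / n
    pooled = (control + exposed) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    return (p_e - p_c) / p_c, (p_e - p_c) / se  # relative lift estimate, z statistic

# A real 2% relative lift on a 1% base purchase rate: even 100,000 users
# per group usually cannot distinguish it from zero lift (|z| < 1.96).
lift, z = measured_lift(0.01, 0.02, 100_000)
print(round(lift, 3), round(z, 2))
```

With rare conversions and small true effects, the sampling noise in the lift estimate swamps the signal, which is exactly the "negative, neutral, or positive" resolution limit described above.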

Some marketers' reactions to this were pretty much as the article captures it:

Leaning on the table, hands folded, he gazed at his hosts and told them: "You’re fucking with the magic."


The Value Definition Problem

LessWrong.com News - November 18, 2019 - 23:04
Published on November 18, 2019 7:56 PM UTC

How to understand non-technical proposals

This post grew out of conversations at EA Hotel, Blackpool about how to think about the various proposals for ‘solving’ AI Alignment like CEV, iterated amplification and distillation or ambitious value learning. Many of these proposals seemed to me to combine technical and ethical claims, or to differ in the questions they were trying to answer in confusing ways. In this post I try to come up with a systematic way of understanding the goals of different high-level AI safety proposals, based on their answers to the Value Definition Problem. Framing this problem leads to comparing various proposals by their level of Normative Directness, as defined by Bostrom in Superintelligence. I would like to thank Linda Linsefors and Grue_Slinky for their help refining these ideas, and EA Hotel for giving us the chance to discuss them.

Defining the VDP

In Superintelligence (2014) Chapter 14, Bostrom discusses the question of ‘what we should want a Superintelligence to want’, defining a problem:

“Supposing that we could install any arbitrary value into our AI, what should that value be?”

The Value Definition Problem

By including the clause ‘supposing that we could install any arbitrary value into our AI’, Bostrom is assuming we have solved the full Value Loading Problem and can be confident in getting an AGI to pursue any value we like.

Bostrom’s definition of this ‘deciding which values to load’ problem is echoed in other writing on this topic. One proposed answer to this question, the Coherent Extrapolated Volition (CEV) is described by Yudkowsky as

‘a proposal about what a sufficiently advanced self-directed AGI should be built to want/target/decide/do’.

He adds the caveat that this is something you should do ‘with an extremely advanced AGI, if you're extremely confident of your ability to align it on complicated targets’.

However, if we only accept the above as problems to be solved, we are being problematically vague. Bostrom explains why in Chapter 14. If we really can ‘install any arbitrary value into our AI’, we can simply require the AI to ‘do what I mean’ or ‘be nice’ and leave it at that. If an AGI successfully did “want/target/decide to do what I meant”, then we would have successful value alignment!

Answers like this are not even wrong - they shunt all of the difficult work into the question of solving the Value Loading Problem, i.e. in precisely specifying ‘do what I mean’ or ‘be nice’.

In order to address these philosophical problems in a way that is still rooted in technical considerations, I propose that instead of simply asking what an AGI should do if we could install any arbitrary value, we should seek to solve the Value Definition Problem:

“Given that we are trying to solve the Intent Alignment problem for our AI, what should we aim to get our AI to want/target/decide/do, to have the best chance of a positive outcome?”

In other words, instead of the unconditional, ‘what are human values’ or ‘what should the AI be built to want to do’, it is the conditional, ‘What should we be trying to get the AI to do, to have the best chance of a positive outcome’.

This definition of the VDP excludes excessively vague answers like ‘do what I mean’, because an AI with successful intent alignment is not guaranteed to be capable enough to successfully determine ‘what we mean’ under all circumstances. In extreme cases, like the Value Definition ‘do what I mean’, "what we mean" is undefined because we don't know what we mean, so there is no answer that could be found.

If we have solved the VDP, then an Intent-Aligned AI trying to act according to the Value Definition should actually be able to do so, and its acting on that definition would produce an outcome beneficial to us. Even if a successfully aligned AGI is nice, does what I mean and/or acts according to Humanity's CEV, these were only good answers to the VDP if adopting them was actually useful or informative in aligning this AGI.

What counts as a good solution to the VDP depends on our solution to intent alignment and the AGI’s capabilities, because what we should be wanting the AI to do will depend on what the AGI can discover about what we want.

This definition of the VDP does not precisely cleave the technical from the philosophical/ethical issues in solving AI value alignment, but I believe it is well-defined enough to be worth considering. It has the advantage of bringing the ethical and technical AI Safety considerations closer together.

A good solution to the VDP would still be an informal definition of value: what we want the AI to pursue. However, it should give us at least some direction about technical design decisions, since we need to ensure that the Intent-Aligned AI has the capabilities necessary to learn the given definition of value, and that the given definition of value does not make alignment very hard or impossible.

Criteria for judging Value Definitions
1. How hard would Intent-Aligning be: how hard would it be to ensure the AI ‘tries to do the right thing’, where ‘right’ is given by the Value Definition. In particular, does adopting this definition of value make intent alignment easier?
2. How great would our AGI capabilities need to be: how hard would it be for the AGI to ‘[figure] out which thing is right’, where ‘right’ is given by the Value Definition. In particular, does adopting this definition of value help us to understand what capabilities or architecture the AI needs?
3. How good would the outcome be: if the AGI is successfully pursuing our Value Definition, how good would the outcome be?

3 is what Bostrom focuses on in Chapter 14 of Superintelligence, as (with the exception of dismissing useless answers to the VDP like ‘be nice’ or ‘do what I mean’) he does not consider whether different value definitions would influence the difficulty of Intent Alignment or the required AI Capabilities. Similarly, Yudkowsky assumes we are ‘extremely confident’ of our ability to get the AGI to pursue an arbitrarily complicated goal. 3 is a normative ethical question, whereas the first two are (poorly understood and defined) technical questions.

Some values are easier to specify and align to than others, so even when discussing pure value definitions, we should keep the technical challenges at the back of our mind. In other words, while 3 is the major consideration used for judging value definitions, 1 or 2 must also be considered. In particular, if our value definition is so vague that it makes intent alignment impossible, or requires capabilities that seem magical, such as ‘do what I mean’ or ‘be nice’, we do not have a useful value definition.

Human Values and the VDP

While 1 and 2 are clearly difficult questions to answer for any plausible value definition, 3 seems almost redundant. It might seem as though we should expect at least a reasonably good outcome if we were to ‘succeed’ with any definition that is intended to extract the values of humans, because by definition success would result in our AGI having the values of humans.

Stuart Armstrong argues that to properly address 3 we need ‘a definition - a theory - of what human values actually are’. This is necessary because different interpretations of our values tend to diverge when we are confronted by extreme circumstances and because in some cases it is not clear what our ‘real preferences’ actually are.

An AI could remove us from typical situations and put us into extreme situations - at least "extreme" from the perspective of the everyday world where we forged the intuitions that those methods of extracting values roughly match up. Not only do we expect this, but we desire this: a world without absolute poverty, for example, is the kind of world we would want the AI to move us into, if it could. In those extreme and unprecedented situations, we could end up with revealed preferences pointing one way, stated preferences another, while regret and CEV point in different directions entirely.

3 amounts to a demand to reach at least some degree of clarity about normative ethics and metaethics (if not solve them outright) - we have to understand what human values are in order to choose between or develop a method for pursuing them.

Indirect vs Direct Normativity

Bostrom argues that our dominant consideration in judging between different value definitions should be the ‘principle of epistemic deference’

The principle of epistemic deference

A future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true. We should therefore defer to the superintelligence’s opinion whenever feasible.

In other words, in describing the 'values' we want our superintelligence to have, we want to hand over as much work to the superintelligence as possible.

This takes us to indirect normativity. The obvious reason for building a super-intelligence is so that we can offload to it the instrumental reasoning required to find effective ways of realizing a given value. Indirect normativity would enable us also to offload to the superintelligence some of the reasoning needed to select the value that is to be realized.

The key issue here is given by the word ‘some’. How much of the reasoning should we offload to the Superintelligence? The principle of epistemic deference answers ‘as much as possible’.

What considerations push against the principle of epistemic deference? One consideration is the metaethical views we think are plausible. In Wei Dai’s Six Plausible Meta-Ethical Alternatives, two of the more commonly held views are that ‘intelligent beings have a part of their mind that can discover moral facts and find them motivating, but those parts don't have full control over their actions’ and that ‘there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences’.

Either of these alternatives suggests that too much epistemic deference is undesirable - if, for example, there are facts about what everyone should value, but a mind must be structured in a very specific way to discover and be motivated by them, we might want to place restrictions on what the superintelligence values to make sure those facts are discovered. In the extreme case, if a certain moral theory were known to be correct, we could avoid having to trust the superintelligence’s own judgment by simply having it obey that theory. This extreme case could never practically arise, since we could never achieve that level of confidence in a particular moral theory. Bostrom says it is ‘foolhardy’ to try to do any moral philosophy work that could be left to the AGI, but as Armstrong says, some work will be necessary to understand what human values actually are - how much work?

Classifying Value Definitions

The Scale of Directness

Issa Rice recently provided a list of ‘[options] to figure out the human user or users’ actual preferences’, or to determine definitions of value. These ‘options’, if successfully implemented, would all result in the AI being aligned onto a particular value definition.

We want good outcomes from AI. To get this, we probably want to figure out the human user's or users' "actual preferences" at some point. There are several options for this.

Following Bostrom’s notion of ‘Direct and Indirect Normativity’ we can classify these options by how direct their value definitions are - how much work they would hand off to the superintelligence vs how much work the definition itself does in defining value.

Here I list some representative definitions from most to least normatively direct.

Value Definitions

Hardwired Utility Function

Directly specify a value function (or rigid rules for acquiring utilities), assuming a fixed normative ethical theory.

It is essentially impossible to directly specify a correct reward function for a sufficiently complex task. Already, we use indirect methods to align an RL agent on a complex task (see e.g. Christiano (2017)). For complex, implicitly defined goals we are always going to need to learn some kind of reward/utility function predictor.
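As a minimal sketch of the kind of indirect method Christiano (2017) points at, we can fit a reward predictor to pairwise human comparisons rather than hand-coding a reward function. The linear model, the Bradley-Terry preference likelihood, and all names below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def fit_reward_model(features_a, features_b, prefers_a, lr=0.5, steps=2000):
    """Fit a linear reward model r(x) = w . x from pairwise comparisons.

    Each row i compares two outcomes; prefers_a[i] = 1.0 means the human
    preferred outcome A.  Trained with the Bradley-Terry likelihood:
    P(A > B) = sigmoid(r(A) - r(B)).
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=features_a.shape[1])
    for _ in range(steps):
        diff = (features_a - features_b) @ w           # r(A) - r(B)
        p_a = 1.0 / (1.0 + np.exp(-diff))              # predicted P(A > B)
        grad = (features_a - features_b).T @ (p_a - prefers_a) / len(p_a)
        w -= lr * grad                                 # gradient step on log-loss
    return w

# Toy data: the "true" reward is the first feature; comparisons follow it.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 3))
B = rng.normal(size=(200, 3))
labels = (A[:, 0] > B[:, 0]).astype(float)
w = fit_reward_model(A, B, labels)
print(w)  # the weight on feature 0 should dominate
```

The point of the sketch is that the reward function is never written down directly: it is inferred from comparisons, which is already a step down the scale of directness.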

Ambitious Learned Value Function

Learn a measure of human flourishing and aggregate it for all existing humans, given a fixed normative (consequentialist) ethical theory that tells us how to aggregate the measure fairly.

E.g. have the AI learn a model of the current individual preferences of all living humans, and then maximise that using total impersonal preference utilitarianism.

This requires a very high degree of confidence that we have found the correct moral theory, including resolving all paradoxes in population ethics like the Repugnant conclusion.
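The division of labour in ambitious learned value functions can be sketched in a few lines: the per-person utilities are learned, while the aggregation rule (here, simple summation, standing in for total impersonal preference utilitarianism) is fixed in advance by the programmers. Every name below is hypothetical:

```python
# Stand-ins for learned per-person preference models; in the real proposal
# these would be inferred from observation, not written by hand.
learned_utilities = {
    "alice": lambda world: world["parks"] - world["noise"],
    "bob":   lambda world: 2 * world["parks"],
}

def total_utility(world):
    """Aggregate by simple summation -- the fixed normative ethical theory."""
    return sum(u(world) for u in learned_utilities.values())

candidate_worlds = [
    {"parks": 1, "noise": 3},
    {"parks": 2, "noise": 1},
]
best = max(candidate_worlds, key=total_utility)
print(best)  # -> {'parks': 2, 'noise': 1}
```

The paradoxes of population ethics live entirely inside `total_utility`: any mistake in the fixed aggregation rule is locked in, which is why this approach demands such high confidence in the chosen moral theory.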

Distilled Human Preferences

Taken from IDA. Attempt to ‘distil out’ the relevant preferences of a human or group of humans, by imitation learning followed by capability amplification, thus only preserving those preferences that survive amplification.

Repeat this process until we have a superintelligent agent that has the distilled preferences of a human. This subset of the original human’s preferences, suitably amplified, defines value.

Note that specific choices about how the deliberation and amplification process play out will embody different value definitions. As two examples, the IDA could model either the full and complete preferences of the Human using future Inverse Reinforcement Learning methods, or it could model the likely instructions of a ‘human-in-the-loop’ offering low-resolution feedback - these could result in quite different outcomes.
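The loop structure of IDA can be shown in a toy form: amplification builds a slower, more capable system out of many calls to the current agent, and distillation compresses it back into a single fast agent. Everything below is a stand-in (the real proposal uses imitation learning and human decomposition, not number-splitting), intended only to show the shape of the iteration:

```python
def amplify(agent):
    """Return a slower but more capable system built from many agent calls."""
    def amplified(task):
        subtasks = decompose(task)                   # human-style decomposition
        return combine(agent(sub) for sub in subtasks)
    return amplified

def distill(amplified):
    """Compress the amplified system into a fast agent.  In the real
    proposal this is imitation learning; here we just wrap the call."""
    return lambda task: amplified(task)

def decompose(task):
    return [task // 2, task - task // 2]             # toy: split a number

def combine(results):
    return sum(results)

agent = lambda task: task                            # base agent: the "human"
for _ in range(3):                                   # iterate amplify + distill
    agent = distill(amplify(agent))

print(agent(10))
```

The value-relevant question is which properties of the base agent survive repeated passes through `amplify` and `distill` - in this toy, the task answer is preserved exactly; for human preferences, only a subset is expected to survive.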

Coherent Extrapolated Volition / Christiano’s Indirect Normativity

Both Christiano’s formulation of Indirect Normativity and the CEV define value as the endpoint of a value idealization and extrapolation process with as many free parameters as possible.

Predict what an idealized version of us would want, "if we knew more, thought faster, were more the people we wished we were, had grown up farther together". It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge.

Moral Realism

Have the AI determine the correct normative ethical theory, whatever that means, and then act according to that.

'Do What I Mean'

'Be Nice'

I have tried to place these different definitions of value in order from the most to the least normatively direct. In the most direct case, we define the utility function ourselves. Less direct than that is defining a rigid normative framework within which the AGI learns our preferences. Then, we could consider letting the AGI also decide which normative framework to use.

Much less direct, we come to deliberation-based methods, which define value as the endpoint of a specific procedure. Christiano’s Iterated Amplification and Distillation is supposed to preserve a particular subset of human values (those that survive a sequence of imitation and capability amplification). This is more direct than CEV because some details about the distillation procedure are given. Less direct still is Yudkowsky’s CEV, which merely defines value as the endpoint of some sufficiently effective idealisation and convergence procedure, whose result the AGI is supposed to predict, somehow. Beyond CEV, we come to ‘methods’ that are effectively meaningless.

Considerations

Here I briefly summarise the considerations that push us to accept more or less normatively direct theories. Epistemic Deference and Conservatism were taken from Bostrom (2014), while Well-definedness and Divergence were taken from Armstrong.

Epistemic Deference: Less direct value definitions defer more reasoning to the superintelligence, so assuming the superintelligence is intent-aligned and capable, there are fewer opportunities for mistakes by human programmers. Epistemic Deference effectively rules out direct specification of values, on the grounds that we are effectively guaranteed to make a mistake resulting in misalignment.

Well-definedness: Less direct value definitions require greater capabilities to implement, and are also less well-defined in the research directions they suggest for how to construct explicit procedures for capturing the definition. Direct utility specification is something we can do today, while CEV is currently under-defined.

Armstrong argues that our value definition must eventually contain explicit criteria for what ‘human values’ are, rather than the maximal normative indirectness of handing over judgments about what values are to the AGI - ‘The correct solution is not to assess the rationality of human judgements of methods of extracting human values. The correct solution is to come up with a better theoretical definition of what human values are.’

Conservatism: More direct theories will result in more control over the future by the programmers. This could be either good or bad depending on your normative ethical views and political considerations at the time the AI is developed.

For example, Bostrom states that in a scenario where the morally best outcome includes reordering all matter to some optimal state, we might want to turn the rest of the universe over to maximising moral goodness but leave an exception for Earth. This would involve more direct specification.

Divergence: If you are a strong externalist moral realist (believing that moral truth exists but might not be easily found, or might not be intrinsically motivating), then you will want to take direct steps to ensure that truth is pursued. If the methods designed to extract human preferences diverge strongly in what they mandate, we need a principled procedure for choosing between them, based on what actually is morally valuable. More normatively direct methods provide a chance to make these moral judgement calls.
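The Divergence consideration can be made slightly more concrete: before deferring to any one extraction method, check whether independent methods even agree on which outcome is best. The two methods below (revealed vs stated preferences) and all names are illustrative stand-ins:

```python
def revealed_preference_score(outcome):
    return outcome["consumption"]                  # what behaviour suggests

def stated_preference_score(outcome):
    return outcome["leisure"]                      # what people say they want

METHODS = [revealed_preference_score, stated_preference_score]

def extraction_methods_agree(outcomes):
    """True if every extraction method picks the same best outcome."""
    picks = {max(range(len(outcomes)), key=lambda i: m(outcomes[i]))
             for m in METHODS}
    return len(picks) == 1

outcomes = [
    {"consumption": 5, "leisure": 1},
    {"consumption": 1, "leisure": 5},
]
print(extraction_methods_agree(outcomes))  # -> False: the methods diverge
```

When the check fails, as here, no amount of indirectness resolves the disagreement: something outside the extraction methods has to adjudicate, which is exactly the judgement call that more direct definitions make explicit.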

Summary

I have provided two main concepts which I think are useful for judging non-technical AI Safety proposals: the Value Definition Problem, and the Scale of Normative Directness together with the considerations that affect a proposal’s position on it. I consider both to be reframings of previous work, mainly by Bostrom and Armstrong.

I also note that, on the Scale of Directness, there is quite a large gap between a very indirect method like CEV, and the extremely direct methods like ambitious value learning.

‘Ambitious Value Learning’ defines value using a specific, chosen-in-advance consequentialist normative ethical theory (which tells us how to aggregate and weight different interests) that we then use an AI to specify in more detail, using observations of humans’ revealed preferences.

Christiano says of methods like CEV, which aim to extrapolate what I ‘really want’ far beyond what my current preferences are: ‘most practitioners don’t think of this problem even as a long-term research goal — it’s a qualitatively different project without direct relevance to the kinds of problems they want to solve’. This is effectively a statement of the Well-definedness consideration when sorting through value definitions - our long-term ‘coherent’ or ‘true’ preferences aren’t currently well enough understood to guide research, so we need to restrict ourselves to more direct normativity - extracting the actual preferences of existing humans.

After CEV, the next most ‘direct’ method, Distilled Human Preferences (the definition of value used in Christiano’s IDA), is still far less direct than ambitious value learning, eschewing all assumptions about the content of our values and placing only some restrictions on their form. Since not all of our preferences will survive the amplification and distillation processes, the hope is that the morally relevant ones will - even though as yet we do not have a good understanding of how durable our preferences are and which ones correspond to specific human values.

This vast gap in directness suggests a large range of unconsidered value definitions that attempt to ‘defer to the Superintelligence’s opinion’ not whenever possible but only sometimes.

Armstrong has already claimed we must do much more work in defining what we mean by human values than the more indirect methods like IDA/CEV suggest, when he argued: ‘The correct solution is not to assess the rationality of human judgements of methods of extracting human values. The correct solution is to come up with a better theoretical definition of what human values are.’

I believe that we should investigate ways to incorporate our high-level judgements about which preferences correspond to ‘genuine human values’ into indirect methods like IDA, making the indirect methods more direct by rigidifying parts of the deliberation or idealization procedure - but that is for a future post.
