# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 1 hour 2 minutes ago

### The "Backchaining to Local Search" Technique in AI Alignment

September 18, 2020 - 18:05
Published on September 18, 2020 3:05 PM GMT

In the spirit of this post by John S. Wentworth, this is a reference for a technique I learned from Evan Hubinger. He's probably not the first to use it, but he introduced it to me, so he gets the credit.

In a single sentence, backchaining to local search is the idea of looking at how a problem of alignment could appear through local search (think gradient descent). So it starts with a certain problem (say reward tampering), and then tries to create a context where the usual training process in ML (local search) could create a system suffering from this problem. It’s an instance of backchaining in general, which just looks for how a problem could appear in practice.

Backchaining to local search has two main benefits:

• It helps decide whether this specific problem is something we should worry about.
• It forces you to consider your problem from a local search perspective, instead of the more intuitive human/adversarial perspective (how would I mess this up?).

Let's look at a concrete example: reward gaming (also called specification gaming). To be even more concrete, we have a system with a camera and other sensors, and its goal is to maximize the amount of time when my friend Tom smiles, as measured through a loss function that captures whether the camera sees Tom smiling. The obvious (for us) way to do reward gaming here is to put a picture of Tom’s smiling face in front of the camera -- then the loss function is minimized.

The backchaining to local search technique applied to this example asks: "How can I get this reward gaming behavior by local search?" Well, this reward gaming strategy is probably a local minimum of the loss function (changing the behavior just a little would increase the loss significantly), so local search could find it and stay there. It's also better than most simple strategies, since ensuring that someone smiles (not necessarily a good goal, mind you) requires rather complex actions in the world (like going full "Joker" on someone, changing someone's brain chemistry, or some other weird and impractical scheme). So there's probably a big region of model space from which local search ends up in our specific example of reward gaming.
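The "big region of model space" intuition can be made concrete with a toy sketch. Its assumptions are stated loudly here. The model space is one-dimensional. The loss is hand-made, with two equally deep minima: a wide one standing in for the photo-of-Tom trick and a narrow one standing in for genuinely making Tom smile. Plain gradient descent plays the role of local search. The well positions, widths, learning rate, and helper names (`loss`, `local_search`, `basin_counts`) are all hypothetical and chosen only for illustration. Nothing here models a real training run.

```python
import math

# Hypothetical 1-D "model space": each value of theta stands for a policy.
# Two wells of equal depth, so both strategies reach essentially zero loss.
# Entries are (name, center, width); all numbers are invented for illustration.
WELLS = [
    ("gaming", -2.0, 1.5),   # wide well: the easy photo-of-Tom trick
    ("genuine", 3.0, 0.5),   # narrow well: actually getting Tom to smile
]

def loss(theta: float) -> float:
    """Toy loss: 1 minus a sum of Gaussian wells."""
    return 1.0 - sum(math.exp(-(theta - mu) ** 2 / (2 * s ** 2)) for _, mu, s in WELLS)

def grad(theta: float) -> float:
    """Analytic derivative of loss() with respect to theta."""
    return sum(
        math.exp(-(theta - mu) ** 2 / (2 * s ** 2)) * (theta - mu) / s ** 2
        for _, mu, s in WELLS
    )

def local_search(theta: float, lr: float = 0.1, steps: int = 1000) -> float:
    """Plain gradient descent, standing in for any local search procedure."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

def basin_of(theta: float, tol: float = 0.1) -> str:
    """Name the well that a final parameter value sits in, if any."""
    for name, mu, _ in WELLS:
        if abs(theta - mu) < tol:
            return name
    return "neither"

def basin_counts(starts):
    """Run local search from each start and count where each run ends up."""
    counts = {"gaming": 0, "genuine": 0, "neither": 0}
    for start in starts:
        counts[basin_of(local_search(start))] += 1
    return counts

if __name__ == "__main__":
    starts = [-4.0 + 0.1 * i for i in range(81)]  # evenly spaced initializations
    print(basin_counts(starts))
```

Counting which minimum each evenly spaced starting point ends up in is a crude stand-in for the size of each basin of attraction. With these made-up widths, most starting points land in the wide "gaming" minimum, even though both minima reach essentially the same loss.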

All in all, the backchaining to local search technique tells us that this looks like a problem that should happen frequently in practice. Which lines up well with the evidence: see this list of reward gaming examples in the literature, and the corresponding post.

The last thing to point out in such a reference post is how to interpret this technique, because, just like models, no technique applies to every situation. If you cannot backchain to local search from your alignment issue, it might mean one of these things:

• There is a good scenario and you just failed to find it.
• There is no scenario, or it is very improbable. In that case, I would say that the question lies in whether there are settings in which scenarios do exist and are probable. If not, I personally would see that as saying that this problem could only happen in the case of a complete shift in learning algorithm, which might or might not be an assumption you're ready to accept.
• It makes no sense to apply backchaining to your problem. For example, if you think about the issue of non-realizability in learning, you don't need backchaining to tell you that it's important in practice. And no model space ever used in practice will contain "reality". So there's no point trying to backchain from this problem.

That is, this technique assumes that your problem is a specific behavior of a trained system (like reward gaming), and that learning algorithms will not shift completely before we reach AGI. So it has close ties with the prosaic AGI approach to AI Safety.

In conclusion, when you encounter or consider a new alignment problem which talks about the specific behavior of the AI (compared to a general issue of theory, for example), backchaining to local search means trying to find a scenario where a system suffering from your alignment problem emerges from local search in some model space. If you put decent probability on the prosaic AGI idea, it should tell you something important about your alignment problem.

Discuss

### Some thoughts on criticism

September 18, 2020 - 07:58
Published on September 18, 2020 4:58 AM GMT

Here are some somewhat unconnected unconfident thoughts on criticism that I’ve been thinking about recently.

---

A while ago, when I started having one-on-ones with people I was managing, I went into the meetings with a list of questions I was going to ask them. After the meetings, I’d look at my notes and realize that almost all the value of the meeting came from the part where I asked them what the worst parts of their current work situation were and what the biggest mistakes I was making as a manager were.

I started thinking that almost the whole point of meetings like that is to get your report to feel comfortable giving an honest answer to those questions when you ask them--everything else you talk about is just buttering them up.

I wish I knew how to make people I’m managing more enthusiastic about criticising me and telling me their insecurities. Maybe I could tell them to have a group chat with just them in it where they all had to name their biggest complaints about me? Maybe I should introduce them all to a former intern of mine who would promise not to repeat anything they said and would tell them about all the mistakes I made while managing that intern, as an attempt to credibly signal actual interest?

---

A lot of the helpful criticism I’ve gotten over the last few years was from people who were being kind of unreasonable and unfair.

One simple example of this is that one time someone (who I’d engaged with for many hours) told me he didn’t take my ideas seriously because I had blue hair. On the one hand, fuck that guy; on the other hand, it’s pretty helpful that he told me that, and I'm grateful to him for telling me. You’d think that being on the internet would expose you to all the relatively uninformed impolite criticism you’d possibly need, but in my experience this isn’t true.

Additionally I think that when people are annoyed at you, they look harder for mistakes you’re making and they speak more frankly about them. So it’s sometimes actually more likely that people will give you useful criticism if they get unreasonably annoyed at you first. This goes especially for people who know you and understand you well. (This is also a reason to think it’s probably helpful to sometimes get drunk around people you like a lot but don’t totally see eye to eye with. I haven’t actually experimented with this.)

I think I’ve often been insufficiently gracious about receiving criticism in this kind of case, which seems pretty foolish of me. This is even more foolish because I’ve often behaved this way in contexts where I had more social power than the person who was criticizing me. I wish that I basically always responded to criticism from people I don’t know extremely well by saying “that’s interesting, thanks for telling me you think that, I’ll think about it.” I’m working on it.

---

I think one of the central things that’s hard about criticism is that people often tie their identities to being good at various things, and it’s hard to predict exactly how they do this, so it’s tricky to know which criticism will hurt them deeply. For example, I think people often have pretty core beliefs like “I’m not that good at X and Y, but at least I can hold onto the fact that I’m good at Z”. Often, people like that will respond well to criticism about X and Y but not about Z. The problem is that it’s kind of hard to guess which things are in which category for someone.

I think it’s really really hard to be actually entirely open to criticism, and I don’t know if it’s even a good idea for most people to try to strive for it.

I think that if you tell people that it’s extremely virtuous to be open to deep criticism, they sometimes just become really good at not listening to or understanding criticism (like described here https://www.lesswrong.com/posts/byewoxJiAfwE6zpep/reality-revealing-and-reality-masking-puzzles#Disorientation_patterns).

A lot of the time, one of my biggest bottlenecks is that I’m not feeling secure enough to be properly open to criticism. This means both that I can’t properly criticise myself and that other people correctly conclude that they shouldn’t criticise me (and these people don’t even look as hard as they could for my weaknesses).

For example, this is true right now as I’m writing this. If I imagine getting an email from someone who I deeply admire where they’d written up their thoughts on the biggest mistakes I was making, I feel like I’d put off opening it, because I feel fragile enough that I’d worry that reading it would crush me and make me feel useless and depressed and unable to do the things I do. And when I go to a whiteboard and try to make lists of the most likely ways that I’m currently making big mistakes, I feel like I intuitively flinch away from looking directly at the question.

My guess is that this is true of most people most of the time.

I think that this is a major mechanism via which I’m less productive when I’m less happy--I’m less able to ask myself whether I’m really working on the most important problem right now.

Even in this state it’s pretty useful to get criticism from people, because they manage to do a pretty good job of filtering the criticism to be not too core to who you are as a person.

I definitely wouldn’t want anyone reading this to criticize me less as a result of reading this post.

---

One way of getting better criticism is to come up with a list of things you think you might be doing wrong, then ask specifically about them. This both credibly signals that you’re actually interested in criticism, and also communicates that that topic isn’t one of your weak points.

I think that it’s probably generally more helpful to come up with a list of twenty mistakes you’re most likely to be making, and then circulate an anonymous survey where people check the ones they think you’re indeed making, rather than to circulate an open ended criticism form.

---

I recently heard that someone I’ve spent between 10 and 100 hours talking to had done something related to their career that looked to me and to many of my friends like a blunder. I don’t think any of us told that person that we thought they’d fucked up. This was partly because it seemed like they’d already made the decision, and in my case it was because I had only heard about this indirectly and it felt a bit weird to reach out to someone to say that I’d heard they’d done something that I thought was dumb. It still feels a bit sad.

If you message me asking for it, I’ll tell you if I think you’re doing something that looks like it’s plausibly a bad mistake with your career or life at the moment. I can only think of a few people for whom my answer would be yes.

---

I think that thinking of yourself as better than other people is, in some ways, helpful for being more pleasant to talk to. I’ve basically never heard anyone make this point directly before.

One context in which I’m often unpleasant is when someone’s saying something I strongly disagree with in a way I dislike, and I lash out aggressively and unhelpfully. I think this is because I feel threatened--I think I intuitively feel like it’s really important for the people in the conversation to see me win the argument, so that they think that I’m smart and right.

If I felt more secure and more superior to the people in the conversation, I think it would be easier to behave better, because I’d feel more like I was proposing some ideas and then seeing if the people I was talking to were interested in them, and then inasmuch as they weren’t, I’d shrug and give up and quietly update against those people.

I have this attitude much less than a lot of the people I know. I think this makes them better than me at being pleasant.

However, I think that feeling more insecure makes it somewhat easier to connect with people, because it means that my heart is more on my sleeve and I can engage with their disagreements more wholeheartedly and openly, and this makes people more comfortable about having some kinds of conversations with me.

Probably there’s a happy medium here that is better than my current attitude.

Discuss

### For what X would you be indifferent between living X days, but forgetting your day at the end of every day, or living 10 days? (terminally)

September 18, 2020 - 07:06
Published on September 18, 2020 4:05 AM GMT

"Terminally" meaning in and of itself, as opposed to "instrumentally", meaning as a means to other ends.

Discuss

### Against Victimhood

September 18, 2020 - 04:58
Published on September 18, 2020 1:58 AM GMT

Cross-posted, as always, from Putanumonit.

I have written many posts in the shape of giving life advice. I hear back from readers who take it and those who refuse it. Either is good — I’m just a guy on the internet, to be consumed as part of a balanced diet of opinions.

But occasionally I hear: who are you to give life advice, your own life is so perfect! This sounds strange at first. If you think I’ve got life figured out, wouldn’t you want my advice? I think what they mean is that I haven’t had to overcome the hardships they have: hostile people and adverse circumstances.

I talk quite often about things that are going poorly for me, but only from the point of view of how I fucked up. I avoid talking about being wronged, oppressed, attacked, discriminated against, or victimized. If you assume that it’s because I live a charmed life where none of these things happen, you may need a refresher on the base rate fallacy.

The reason I never talk about being a victim is that I’m extremely averse to victim mentality. It’s an insidious mindset, one that’s self-reinforcing both internally and by outside feedback. I’ve seen it claim people, groups, entire nations. On the flip side, I’ve noticed that the less often I think of myself as a victim the less I am victimized, which in turn makes that mindset even rarer. If I do feel on occasion that I have been harmed through no fault of my own by hostile actors I keep it to myself. This is a bad time to be a victim on the internet.

What’s bad about victim mentality? Most obviously, inhabiting a narrative in which the world has committed a great injustice against you, and you are helpless against it, is extremely distressing. Whether the narrative is justified or not, it causes suffering.

Seeing yourself as a victim prevents you from improving your situation by your own power, since doing so would contradict the story of your own helplessness. In particular that’s true of the story you tell yourself. That story is your identity, how your mind predicts your future actions. If your story is helpless victimhood your mind will refuse to help — it wants to be vindicated more than it wants to do better.

When I was young I was a weird nerd, and while that maybe hasn’t changed much I do find myself in social circles where weird nerdiness is welcome. In school it was not very welcome, and people weren’t nice to me. But I never really fell into thinking that I was singled out for abuse, mostly because I reasoned that if my classmates really wanted to abuse me they could have done so much more than they did. I could come up with very creative ways to bully someone, and no one in my school seemed equally creative. I learned to avoid the worst people and slowly make friends with the rest.

Avoiding bad things is usually a great tactic, but it’s not available to victims. Avoiding the victimizer makes it hard to sustain the story of victimhood. It also leaves behind the lingering residue of injustice, knowing that the culprit did not get their comeuppance. That sense of injustice often haunts the victim long after they’re safe from the original source of harm.

Most importantly, victim mentality leaves no room for empathy. Victims can’t see anyone else’s struggles or suffering, especially those of their perceived victimizers.

When I write about dating I get replies from young men who feel maligned and victimized by women. They complain about impossible standards, ambiguous behavior, and dating norms as if these were set up on purpose to immiserate them. When I was younger and finally managed to reject this line of thinking for myself I started understanding women’s own difficulties, fears, and frustrations with dating, the real issues behind their seemingly-unfair complaints about men. That’s when my dating life improved dramatically.

People in general don’t like victims, and they certainly don’t want to date them.

Why do people claim victimhood despite the drawbacks? It makes sense in a small group where people know each other and reputations are tracked. The group will band together to punish the perpetrator and offer the victim restitution, knowing that this will redound to them in turn.

But in the world at large, and especially on the internet of beefs, there are a lot more punishers than restitutionists. Claiming publicly that you’ve been victimized by X will immediately attract everyone with an axe to grind against X. Any pure souls trying to help will get swallowed up by the sheer number and energy of anti-Xers.

The anti-Xers have a vested interest in your continued victimization by X. Nothing is more detrimental to their cause than X’s victims making peace with X on their own terms. Victimhood-mongering provides purpose and gainful employment to countless individuals. The victims end up doubly helpless and doubly beholden: both to their oppressors and their “liberators”.

Global recognition of one’s victimhood is pretty much the worst thing that can happen to anyone. It happened to the Palestinians.

The United Nations agrees that Palestinians are victims. That’s why in addition to UNHCR, the UN agency to support refugees worldwide, it has a special agency dedicated to Palestinian refugees: UNRWA. UNRWA differs from UNHCR in two main ways:

1. It does not share UNHCR’s mandate to “assist refugees in eliminating their refugee status by local integration in the current country, resettlement in a third country or repatriation”, thus keeping Palestinians and their descendants in refugee status in perpetuity.
2. It employs twice the number of staff.

Muslim leaders from Tunis to Kuala Lumpur agree that Palestinians are victims. They need to, because it plays well with public opinion and allows them to maintain an anti-Israel stance in the absence, for most of them, of any actual conflict with Israel. Until there’s something serious at stake like foreign investments or a weapons deal, that is, and then they sign a deal with Israel and tell the Palestinians to shut up and stop complaining.

Many Americans agree that Palestinians are victims, especially those on college campuses. Shortly after I arrived on campus in the US I was invited to a “conversation” about the ongoing operation in Gaza with a left-leaning Jewish student group.

I asked them whether they thought the killing of Hamas military chief Ahmed Jabari, which set off the operation, was justified. None of the 15 students in attendance knew who he was. Most of them couldn’t tell Hamas from the PLO, conflated the situation in Gaza with the settlements in the West Bank, and had little knowledge of or interest in actual matters of Palestinian life and governance such as elections, security arrangements, water and energy supply, etc.

I realized that the vast majority of them joined this organization to establish their progressive bona fides and differentiate themselves from Jewish conservatives. Israel-Palestine is the biggest game in town, but it probably could just as well be male circumcision or female rabbis.

I want to make it clear — I don’t particularly disagree that Palestinians are victims in many senses and that their ability to help themselves is constrained by outside forces, chief among which is Israel. I have real compassion for them. I just want to note that decades of global recognition of Palestinian victimhood have been a boon for UN staffers, Muslim politicians, and American progressives, along with many others. Surely such a broad and powerful coalition would bring Palestine peace and prosperity and an end to victimhood?

Shockingly, it hasn’t.

These situations apply mostly to identifiable groups, examples of which are abundant, but it can happen to individuals too. In families, schools, organizations there are people who like playing the savior role, and they have a sharp nose for victims in need.

Isn’t this all victim blaming? This is a reasonable objection, although I have some issues with the concept itself and its provenance. Here’s the headline of the Wikipedia article on victim blaming:

Victim blaming occurs when the victim of a crime or any wrongful act is held entirely or partially at fault for the harm that befell them. The study of victimology seeks to mitigate the prejudice against victims, and the perception that victims are in any way responsible for the actions of offenders.

Equating the perception that victims have even a modicum of responsibility with prejudice gives up the game before it starts. With this axiom in place “victimology” is not a field of inquiry, it’s a tool of advocacy to be used in competitions for victimhood status. These competitions have many losers and no winners.

But the concept of victim blaming as it’s naively understood still has value and needs to be addressed. To do so we need to clarify two distinctions.

The first difference is between being a victim in a particular instance and victimhood as an ongoing story. When you are the target of a crime you are a victim — you can appeal to an authority for help. You tell mom your older sister stole your toy and mom makes her return it.

This is entirely different from saying decades later that your career failures are a result of being victimized by your sister as a child. Even if the causal effect of your sister’s depredation is not literally zero, it has probably done less harm than the narrative of victimhood itself.

When I arrived in New York with little money I promptly lost $3,000 to a rental scam. If the cops had apprehended the scammer I would have confirmed that yes, I was their victim, and would have demanded the full amount back. But my main reaction was to analyze the situation, read up on similar scams, and build a plan for the future that would prevent me from falling for those again. I don’t even see it as an injustice: $3,000 is a fair price to pay for a class in scam resilience.

The second, related distinction is between public status and individual mindset. Blame is at its core a social concept, used to coordinate how we allocate responsibility and demand atonement as a group. We could very well want society to direct those at the perpetrator, and simultaneously advise the victim to hold themselves privately accountable, at least in part.

The problem here is that every public discussion of the issue is perceived as mostly an attempt to shape the reaction of society, rather than the attitude of the victim. I care more about the latter, which is why this entire essay avoided talking about individual victims and how they could improve. The main way to change individual mindset is to talk about your own experience, which is also what I would encourage you to do in the comments. (I will heavily moderate less-than-perfectly-charitable discussion of the behavior of actual victims and all sides of the Israel-Palestine conflict.)

In conclusion we must address privilege. It’s sure easy for someone who is safe from oppression to talk about the pitfalls of victim mentality; not so for the victim!

My first answer is that there is a strong bias in favor of overstating victimhood that needs to be corrected. This bias is caused by all the people I discussed who benefit from being seen as protectors of victims. This bias is especially powerful in the United States, where victimhood is more and more allocated on the basis of group identity (which is enduring and political) instead of individual circumstance (which can be helped and overcome). If we lived in ancient Sparta, I would be giving the reverse advice instead.

Victim mentality is manufactured en masse by the American education system. It can only be resisted by individual efforts to reject it for yourself, in the privacy of your own blog-browsing.

But yes — I am privileged. In my nationality, my social and professional status, and more. But for all of those I, or the group that I’m part of, decided not to be victims and to take responsibility even when the former option was on the table. Again, I will not go into detail about how the option of victimhood was available, since merely talking about it is claiming it.

I don’t care if I miss out on the desirable status of having overcome great adversity™. It’s a currency that certainly has its value. But it’s not worth the price it demands.

Victimhood is a vicious cycle. It leads to helplessness which leads to victimization which leads to external recognition of victimhood which in turn leads to helplessness and so on down the spiral. Rejecting victimhood works the same way, small decisions that compound until one reaches escape velocity.

Not being a victim is indeed a privilege. With time and a change of mindset, this privilege can be yours too.

Discuss

### Objective Dog Ratings: The Shiba Inu

September 18, 2020 - 03:12
Published on September 18, 2020 12:12 AM GMT

The Shiba Inu

Shibas are a spitz breed, and spitzes, if you remember from the original Dog Ratings post, are my projected winners for the All-Dog All-Stars Contest, but that doesn't mean that this spitz will take the trophy home.

Let's take a look:

First, it must be noted that shibas have the curly tail which is most distinctive of domesticated animals, which is a Good Thing. Props to you, noble shibas.

Behaviorally, shibas are very clean animals. They clean themselves, they're very easy to housebreak, and...well, they will track mud into the house and not wipe their feet off first, but they're still dogs. You can only expect so much from them. They're clever, but the jury is out on just how much so.

Shibas do not bark very much, instead preferring to emit a terrible, nearly goat-like noise called a "shiba scream," which has been described as "bloodcurdling" and like the "screams of the damned." The name sounds like an anime attack move, and I wholeheartedly approve.

I could not find much in the way of terrible diseases, which is a little surprising given their history. Shibas were extensively crossbred with other dogs in the 19th century and, between crossbreeding, distemper outbreaks, and WWII food shortages, only three "bloodlines" or "variations" of shiba still remained in the post-war era. It's hard for me to get details on this, so I can only assume that the gene pools were relatively large and healthy or the breeders did a good job, because, as I say, they aren't a tragedy of inbreeding.

This is, admittedly, a bit of conjecture, but I imagine that their healthiness has something to do with the story of their preservation as a breed. Shibas were bred to do a job (hunting and flushing out small game and, on occasion, boar), and hunters were among those who tried to preserve them.

Anyway, the point is, they are pretty healthy dogs, and that's pretty important here at Objective Dog Ratings. Shibas aren't exceptionally intelligent, but there's something awful and awesome about their unique vocal talents, so I'm going to award them a very tentative, by-the-tip-of-their-nose, three stars. A forward-thinking, politically progressive wolf would not be ashamed to know a shiba, though they might well be a little unsettled.

Rating: ★★★ (Good)

Original post (with more pictures) here.

Discuss

### Gems from the Wiki: Do The Math, Then Burn The Math and Go With Your Gut

September 18, 2020 - 01:41
Published on September 17, 2020 10:41 PM GMT

During the LessWrong 1.0 Wiki Import we (the LessWrong team) discovered a number of great articles that most of the LessWrong team hadn't read before. Since we expect many others not to have read these either, we are creating a series of the best posts from the Wiki to help give those hidden gems some more time to shine.

The original wiki article was fully written by riceissa, who I've added as a coauthor to this post. Thank you for your work on the wiki!

"Do the math, then burn the math and go with your gut"[1] is a procedure for decision-making that has been described by Eliezer Yudkowsky. The basic procedure is to go through the process of assigning numbers and probabilities that are relevant to some decision ("do the math") and then to throw away this calculation and instead make the final decision with one's gut feelings ("burn the math and go with your gut"). The purpose of the first step is to force oneself to think through all the details of the decision and to spot inconsistencies.

History

In July 2008, Eliezer Yudkowsky wrote the blog post "When (Not) To Use Probabilities", which discusses the situations under which it is a bad idea to verbally assign probabilities. Specifically, the post claims that while theoretical arguments in favor of using probabilities (such as Dutch book and coherence arguments) always apply, humans have evolved algorithms for reasoning under uncertainty that don't involve verbally assigning probabilities (such as using "gut feelings"), which in practice often perform better than actually assigning probabilities. In other words, the post argues in favor of using humans' non-verbal/built-in forms of reasoning under uncertainty even if this makes humans incoherent/subject to Dutch books, because forcing humans to articulate probabilities would actually lead to worse outcomes. The post also contains the quote "there are benefits from trying to translate your gut feelings of uncertainty into verbal probabilities. It may help you spot problems like the conjunction fallacy. It may help you spot internal inconsistencies – though it may not show you any way to remedy them."[2]

In October 2011, LessWrong user bentarm gave an outline of the procedure in a comment in the context of the Amanda Knox case. The steps were: "(1) write down a list of all of the relevant facts on either side of the argument. (2) assign numerical weights to each of the facts, according to how much they point you in one direction or another. (3) burn the piece of paper on which you wrote down the facts, and go with your gut." This description was endorsed by Yudkowsky in a follow-up comment. bentarm's comment claims that Yudkowsky described the procedure during summer of 2011.[3]

In December 2016, Anna Salamon described the procedure parenthetically at the end of a blog post. Salamon described the procedure as follows: "Eliezer once described what I take to be the a similar ritual for avoiding bucket errors, as follows: When deciding which apartment to rent (he said), one should first do out the math, and estimate the number of dollars each would cost, the number of minutes of commute time times the rate at which one values one's time, and so on. But at the end of the day, if the math says the wrong thing, one should do the right thing anyway."[4]
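The "do the math" half of the apartment example can be sketched as below. This is a minimal illustration, not a procedure given in any of the sources. The rents, commute times, the 21 workdays per month, and the $30/hour value of time are made-up numbers, and `Apartment`, `monthly_cost`, and `do_the_math` are hypothetical names. The "burn the math" half is deliberately left out of the code: the ranking is something to look at and notice surprises in, and then to set aside before deciding.

```python
from dataclasses import dataclass

@dataclass
class Apartment:
    # Hypothetical record for one option; all fields are illustrative.
    name: str
    monthly_rent: float             # dollars per month
    commute_minutes_per_day: float  # round trip
    workdays_per_month: int = 21    # assumed, not from the source

def monthly_cost(apt: Apartment, dollars_per_hour: float) -> float:
    """'Do the math': rent plus commute time converted into dollars."""
    commute_hours = apt.commute_minutes_per_day * apt.workdays_per_month / 60
    return apt.monthly_rent + commute_hours * dollars_per_hour

def do_the_math(options, dollars_per_hour: float):
    """Return (estimated cost, name) pairs, cheapest first."""
    return sorted((monthly_cost(a, dollars_per_hour), a.name) for a in options)

if __name__ == "__main__":
    options = [Apartment("A", 2000, 60), Apartment("B", 2300, 20)]
    for cost, name in do_the_math(options, dollars_per_hour=30):
        print(f"{name}: ${cost:,.0f} per month")
    # "Burn the math": look at the ranking, then decide with your gut.
```

With these numbers, apartment A has the cheaper rent but costs $2,000 + 21 h × $30 = $2,630 a month, while B costs $2,300 + 7 h × $30 = $2,510. The math therefore favors B. Under the procedure, that ranking informs the gut rather than overriding it.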

See also

• CFAR Exercise Prize – Andrew Critch's Bayes game, described on this page, gives another technique for dealing with uncertainty in real-life situations

References

1. Qiaochu Yuan. "Qiaochu_Yuan comments on A Sketch of Good Communication". March 31, 2018. LessWrong.
2. Eliezer Yudkowsky. "When (Not) To Use Probabilities". July 23, 2008. LessWrong.
3. bentarm. "bentarm comments on Amanda Knox: post mortem". October 21, 2011. LessWrong.
4. Anna Salamon. "'Flinching away from truth' is often about *protecting* the epistemology". December 20, 2016. LessWrong.

Discuss

### Let the AI teach you how to flirt

September 17, 2020 - 22:04
Published on September 17, 2020 7:04 PM GMT

"It's Not You, it's Me: Detecting Flirting and its Misperception in Speed-Dates" is a fascinating approach to the study of flirtation. It uses a machine learning model to parse speed-dating data and detect whether the participants were flirting. Here's a sci-hub link. I found four key insights in the paper.

First of all, people basically assume that others share their own intentions. If they were flirting, they assume their partner was too. They're quite bad at guessing whether their partner was flirting, but they do a bit better than chance.

Secondly, the machine learning model was about 70% accurate in detecting flirtation. That is much better than the speed-date participants themselves, even though the model had far less information to draw upon and the authors judged the participants' detection rates by a more forgiving standard of success than the model's.

Thirdly, storytelling and conversations about friends seem to be the strongest signals of flirtation. Talking about the mundane details of student life (this was on a college campus) was the strongest signal of non-flirtation.

Finally, men and women have quite different approaches to flirtation:

Men who say they are flirting ask more questions, and use more you and we. They laugh more, and use more sexual, anger [hate/hated, hell, ridiculous, stupid, kill, screwed, blame, sucks, mad, bother, shit], and negative [bad, weird, hate, crazy, problem*, difficult, tough, awkward, boring, wrong, sad, worry] emotional words. Prosodically they speak faster, with higher pitch, but quieter (lower intensity min). Features of the alter (the woman) that helped our system detect men who say they are flirting include the woman’s laughing, sexual words [love, passion, virgin, sex, screw] or swear words, talking more, and having a higher f0 (max).

Women who say they are flirting have a much expanded pitch range (lower pitch min, higher pitch max), laugh more, use more I and well, use repair questions [Wait, Excuse me] but not other kinds of questions, use more sexual terms, use far less appreciations [Wow, That’s true, Oh, great] and backchannels [Uh-huh., Yeah., Right., Oh, okay.], and use fewer, longer turns, with more words in general. Features of the alter (the man) that helped our system detect women who say they are flirting include the male use of you, questions, and faster and quieter speech.

This paper has changed the way I think about skillful heterosexual flirtation. I used to think that flirting was a unisex behavior, and that men and women were decently skilled at detecting it. In much the same way that it's harder to write a novel than to read one, I thought that the hard part was signalling your own intentions, not interpreting theirs.

Now, I think that a strategy for skillful flirtation is to get the other person to broadcast their intentions, and learn to interpret their signals correctly. Men and women have different flirting styles. Each person knows when they themselves are trying to flirt. But they're bad at guessing when their partner is trying to flirt. This suggests that if you can get your partner to engage in their own natural flirting style, and get good at detecting it, then you can guess their intentions with much more confidence than the average person is capable of.

Both men and women should try to make each other laugh, let their voices be more musical, and provoke each other to talk about love and sex. They should tell stories about their lives and friendships and try to avoid mundane details.

A man who wants to signal flirtation to a woman should ask lots of questions that provoke the woman to talk about herself at length. Note that the "appreciations" and "backchannels" that are negatively correlated with women's flirtation are responses that women tend to give to men who keep going on about themselves. This is the old standard advice.

A woman who wants to signal flirtation to a man should maybe find topics they can complain about together - hopefully in a lighthearted way. She could also talk about her life in such a way that it provokes him to be curious and ask questions about her or observe connections between himself and her.


### Covid 9/17: It’s Worse

September 17, 2020 - 18:10
Published on September 17, 2020 3:10 PM GMT

Last week we learned there is plausibly a simple, cheap and easy way out of this entire mess. All we have to do is take our Vitamin D. In case it needed to be said, no. We are not taking our Vitamin D. There are definitely some voices out there pushing it, including the nation’s top podcaster, Joe Rogan, but I don’t see any signs of progress.

Instead, as school restarts, the outside gets colder and pandemic fatigue sets in, people’s precautions are proving insufficient to the task. This week showed that we have taken a clear step backwards across the country.

I see three ways for things not to get steadily worse for a while: a vaccine arrives, which is unlikely; something else new (of which we see little sign) arrives to change behavior for the better; or this week was a blip. It’s only one week of data, it follows Labor Day, and it is wise not to extrapolate too quickly. The effect size seems too large, though, and too distributed among outcomes, to be coincidence.

In terms of news, it was a quiet week. There was some bluster, but little substance.

Let’s run the numbers. They’re not good.

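Positive Test Counts

| Date | West | Midwest | South | Northeast |
|---|---|---|---|---|
| July 23-July 29 | 110,219 | 67,903 | 240,667 | 26,008 |
| July 30-Aug 5 | 91,002 | 64,462 | 212,945 | 23,784 |
| Aug 6-Aug 12 | 93,042 | 61,931 | 188,486 | 21,569 |
| Aug 13-Aug 19 | 80,887 | 63,384 | 156,998 | 20,857 |
| Aug 20-Aug 26 | 67,545 | 66,540 | 132,322 | 18,707 |
| Aug 27-Sep 2 | 55,000 | 75,401 | 127,414 | 21,056 |
| Sep 3-Sep 9 | 47,273 | 72,439 | 106,408 | 21,926 |
| Sep 10-Sep 16 | 45,050 | 75,264 | 115,812 | 23,755 |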

This doesn’t look too bad on its own. Whether it’s good, bad or very bad news depends on whether testing is improving. If testing were still ramping up, it could easily count as good news, even with the worrying reversal in the South. Unfortunately, that’s not what happened. Testing actually declined this week to 4.63 million tests, the lowest value since the first week of July.

Alas, the positive test percentages:

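| Date | USA tests | USA positive % | NY tests | NY positive % | Cumulative positives (% of U.S. population) |
|---|---|---|---|---|---|
| July 16-July 22 | 5,456,168 | 8.6% | 454,995 | 1.1% | 1.20% |
| July 23-July 29 | 5,746,056 | 7.9% | 452,889 | 1.0% | 1.34% |
| July 30-Aug 5 | 5,107,739 | 7.8% | 484,245 | 1.0% | 1.46% |
| Aug 6-Aug 12 | 5,121,011 | 7.3% | 506,524 | 0.9% | 1.58% |
| Aug 13-Aug 19 | 5,293,536 | 6.2% | 548,421 | 0.8% | 1.68% |
| Aug 20-Aug 26 | 4,785,056 | 6.0% | 553,369 | 0.7% | 1.77% |
| Aug 27-Sep 2 | 5,042,113 | 5.5% | 611,721 | 0.8% | 1.85% |
| Sep 3-Sep 9 | 4,850,253 | 5.3% | 552,624 | 0.9% | 1.93% |
| Sep 10-Sep 16 | 4,632,005 | 5.8% | 559,463 | 0.9% | 2.01% |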

This was very surprising to me. It would not have been too surprising to see things level off around previous levels. But to have it also reverse so suddenly indicates a major change. The default hypothesis is that the reopening of schools is finally taking its toll, now that it has had time to accumulate sufficient compound damage. If that’s the case, we’re in for at least several more weeks of things getting worse.

What’s weirder is that the death counts are headed in the wrong direction, despite what were clearly positive trends in leading indicators several weeks ago.

Deaths by Region

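| Date | West | Midwest | South | Northeast |
|---|---|---|---|---|
| July 9-July 15 | 1,380 | 539 | 2,278 | 650 |
| July 16-July 22 | 1,469 | 674 | 3,106 | 524 |
| July 23-July 29 | 1,707 | 700 | 4,443 | 568 |
| July 30-Aug 5 | 1,831 | 719 | 4,379 | 365 |
| Aug 6-Aug 12 | 1,738 | 663 | 4,554 | 453 |
| Aug 13-Aug 19 | 1,576 | 850 | 4,264 | 422 |
| Aug 20-Aug 26 | 1,503 | 745 | 3,876 | 375 |
| Aug 27-Sep 2 | 1,245 | 759 | 3,631 | 334 |
| Sep 3-Sep 9 | 1,141 | 771 | 2,717 | 329 |
| Sep 10-Sep 16 | 1,159 | 954 | 3,199 | 373 |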

Labor Day weekend was too far in the past to provide much of an excuse here. The Midwest and Northeast are clearly headed in the wrong direction. The South and West could claim this is a backlog issue and things are still fine, especially the West, but it does not look good. If that’s what happened while leading indicators were improving, what’s going to happen over the next few weeks?

Extra No Good Very Bad Numbers: Meanwhile In Europe

I’ve mostly limited the scope of this column to the United States, but it needs to be pointed out that much of Europe looks like it’s got its own second wave at this point. Spain and France are already there, and the U.K. is well on its way. Germany is holding steady so far and we can hope that holds. When you don’t eradicate, vigilance can never end. Then eventually it does, or the seasons change and tip things over the edge as behaviors adjust to that.

Given all our advances, one hopes that this won’t come with too many deaths even if the infection numbers get out of control.

The numbers were the main story this week. The rest is more of a round-up.

United Arab Emirates Joins Vaccine Club

Best news of the week: UAE announces emergency approval for use of COVID-19 vaccine. One more country, albeit a relatively small one, sees the light and rolls the favorably weighted dice.

Here’s to you, UAE. Except that you don’t drink, and neither do I. So hats off, instead.

Football Coach Gives Us Some Straight Talk

I didn’t know I could love Coach O even more, but the results are in and it turns out I absolutely can do that. The man tells it like it is. LSU coach Ed Orgeron — ‘Most’ of team has contracted coronavirus

That’s the SEC. Here, it just means more.

The problem is not that the team has some players who have caught Covid-19. The problem is that the team has players that haven’t caught Covid-19! They might catch it in the future. So we need to have backups ready for those players.

My assumption is that LSU’s campus is full of college kids who don’t care if they get Covid-19, so a ton of them got Covid-19 right away, and none of this has anything to do with football. I saw stories saying it was all over the dorms.

Or you could take the other approach, look like you’re acting all responsible, and be the fun police without actually making anyone safer. I’m looking at you, PAC-12. But I’m not looking at you, Big 10, because unlike most people these days, I believe in forgiveness.

Play Ball!

You are the Big 10 conference. Cause you had a bad day. A really bad day. Even worse than when they added Rutgers and Maryland. You’ve taken one down. Cancelled your entire season, like many other things, over nothing.

You sing a sad song, hopefully while socially distanced, to turn it around.

Then you realize you’ve made a huge mistake. You get your shirt together. You mumble something about a ‘proper daily testing regime.’

It’s going to be a tight schedule. By waffling, they’ve made it so that an outbreak that causes delays could endanger several teams and their ability to play a full season – there’s only room for one off week. And like others, they made the mistake of scheduling that off-week rather than holding it in reserve to handle a crisis.

But what matters is, they’re back, and we’re playing. The PAC-12 is still not back. They’re really pushing the scheduling window to its breaking point, but they’re working on it. Besides, we all know they were going to get excluded from the playoff regardless, like they do every year, so it doesn’t matter that much if they play further into December or even January.

More football. Ergo, more peace.

Burn Baby Burn

The west coast of the United States is more than a little on fire. The air is not fit for humans to breathe. The sky is frequently the wrong color. Photos of this past month’s sky and its resemblance to something that isn’t part of a post-apocalyptic wasteland have been unfavorably compared to stills from the movie Blade Runner 2049. It’s pretty bad out there. Presumably this is having an effect on Covid-19, but it’s not obvious which way – if everyone does everything indoors that’s bad, but if they can’t even go outside to get to other people, that’s good for the moment, I guess?

The bigger point is that once again we have two distinct versions of ‘scientific consensus’ about what’s going on with these fires. From what I can tell, here’s the situation.

California used to burn periodically on its own. It wasn’t great, and it was sometimes bad enough to make the air bad, but it kept things in balance.

For about a century, California has been aggressively putting out every fire it can find. There has effectively been the mentality of a ‘war on fire.’ This has led to an accumulation of a massive amount of fuel.

We know that the way to deal with this is controlled burn. But when someone starts a controlled burn, they get punished for it. They have to file environmental impact statements (because the fire will damage the air today, and that’s no good, even though we now see the alternative), deal with lots of regulations and so on. If something goes wrong, they get the blame and the lawsuits. It’s much easier to just not burn, so mostly people don’t burn. Certainly no one private does controlled burns, and the public does maybe 30,000 acres a year even with extra efforts. But the historical average was millions of acres, so we’re doing essentially nothing.

Thus, lots of giant fires.

We then hire a combination of overpriced unionized labor that demands overtime pay so good they occasionally start the fires themselves, and prison labor we barely pay at all that’s now unavailable because of fear of Covid-19. And we use them to fight all the fires, including ones that don’t threaten anything of value and that it would be net positive to let burn. So we don’t have anything like the resources necessary to stop this.

Also, climate change is a thing, which is also making things somewhat worse.

So what do the Democratic politicians and most media outlets say?

That this is “a climate damn emergency” and that the “scientific consensus” is that this is all the result of climate change.

Thus. Everyone involved gets to act all righteous and feel like they’re scoring points in the political wars. They make Sacrifices to the Gods that, even if they work as intended, only mean that things will continue to get worse but ever so slightly slower. That’s their response to this “emergency.”

Because if it’s all due to climate change, they don’t have to actually do anything that might stop the fires. Like more controlled burns, or devoting more or smarter resources to protecting what needs protecting. The things that might help anyone actually alive today not need to flee across the country.

I bring this up because the parallel to how those same media and political sources deal with Covid-19 should be obvious. Claim they know what the “science” says. Blame things that don’t matter. Actively interfere with the things that might help, massively slow or block any useful action while denying its possibility or effectiveness. Call for gigantic long term sacrifices that offer little tangible gain. While simultaneously claiming it would have been impossible to actually prevent the problem or mitigate its effects.

Label anyone who says otherwise “anti-science” and irresponsible and just awful.

Covid-19 is not some outlier. These people lie. About everything. All the time.

Not every time. They do sometimes tell the truth, when it suits them. But if anything, that only makes it more difficult. As the old joke goes, it’s easy to tell the truth from Pravda, because everything in Pravda is a lie. But The New York Times is trickier, because sometimes it tells the truth.

Vaccine Responsibility

The main Covid-19 nominal headlines this week were about vaccines. Trump continues to promise a vaccine by late October. The head of the CDC says that’s not going to happen. Trump says the head of the CDC is ‘confused.’ The CDC walks the comments back. This showed some attempt by the CDC not to kowtow to Trump, followed by a kowtow, so on net it seems like a wash.

Gates and Fauci and others continue to say not to expect a vaccine. All this back and forth.

For those worried, yes, the halted vaccine trial from last week has resumed and never had a good reason to pause.

The net visible news on this was presumably bad, as indicated by the Good Judgment Project, which has us down to a 59% chance of 25 million doses administered by the end of March.

It’s all talk. None of this substantially changed my view of the current state of the vaccine research.

A vaccine will be available in October if Trump is able to override the CDC and FDA and make it happen by fiat to help his reelection chances. If he can’t do that, the vaccine will wait a few more months at minimum, and then we’ll see what happens. I continue to think distribution would be the right thing to do and the objections are deeply wrong, and of course that none of that has anything to do with why Trump is going to try to overrule those objections.

The other big comment from the head of the CDC was that ‘masks could be more effective than the vaccine.’ Which is the kind of thing one says when one thinks it is more important to look like a Very Serious and Responsible Person, who is saying Very Serious and Responsible Things, and is properly encouraging vigilant mask usage. We wouldn’t want people to think that something else might help them. Think of what they might do!

Trump pushed back hard on that as well, as he would regardless of the statement’s truth value. The same way the statement was made without regard to its truth value, beyond finding a way to show it could possibly be technically correct. If that.

I grow weary of it all. It’s the fatigue setting in. The same bluster. The same warnings that it will not be over until the Proper Authorities say it’s over. The same downplaying and dismissals by the White House. No authorities we can trust. Round and round and round we go. When it stops, I’m trying to guess, and the prognosis doesn’t look great.

Going Forward

We’re getting close to the election. I took a look at the betting odds, and found some instances of small free money available for those interested. As time goes by, focus will shift away from Covid-19 as a health problem, and towards the upcoming election and its details. Everything will more and more be in that light and only that light, from all sides. My guess is that this will decrease the amount of meaningful virus news, and we’ll be more focused on the pure numbers.

Meanwhile, the virus will ignore all that, because the virus doesn’t care.

This blog does its best to stay out of politics despite discussing politicized issues. It would many times have said something about ‘a plague on both your houses’ except that seems a little on the nose. I have a strong preference on outcomes, which readers can presumably guess – but saying it outright wouldn’t convince anyone. You have all the information you need to decide which candidate you prefer. Vote accordingly. If you are in a swing state and can afford to cast your vote in person, do so, to facilitate a quick, clear and peaceful resolution of the election – the Covid-19 risk involved will be minimal, and it helps reduce the tail risk from a disputed election.

I’ll also start another experiment this week. If you see news you think should make next week’s summary, throw it into the comments, with links if possible, and we’ll see if together we can cover things a little easier. I see no reason not to try that out. I’ll check both the LessWrong version of this post and the original for such comments.


### Making the Monty Hall problem weirder but obvious

September 17, 2020 - 15:10
Published on September 17, 2020 12:10 PM GMT

The Monty Hall problem is famously unintuitive. This post starts with an extreme version where the solution is blindingly obvious. We then go through a series of small changes. It will be clear that these don’t affect the solution.
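The post's own chain of variants isn't reproduced in this excerpt, so the sketch below assumes one common "extreme" setup, which may not be the author's: there are `n` doors, and the host, who knows where the prize is, opens every door except the player's pick and one other, never revealing the prize. Under that assumption, switching wins exactly when the first pick was wrong, which happens with probability (n-1)/n. The Python simulation estimates both strategies; the function names `play_round` and `win_rate` are hypothetical and chosen for illustration.

```python
import random


def play_round(n_doors: int, switch: bool, rng: random.Random) -> bool:
    """Play one round of an n-door Monty Hall game; return True if the player wins.

    Assumed rules: the host knows where the prize is and opens every door
    except the player's pick and one other, never revealing the prize.
    """
    prize = rng.randrange(n_doors)
    pick = rng.randrange(n_doors)
    if pick != prize:
        # The host may not open the prize door, so it is the one left closed.
        other = prize
    else:
        # The player already holds the prize; the host leaves a random losing door closed.
        other = rng.choice([d for d in range(n_doors) if d != pick])
    final = other if switch else pick
    return final == prize


def win_rate(n_doors: int, switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate a strategy's win probability by Monte Carlo simulation."""
    rng = random.Random(seed)
    wins = sum(play_round(n_doors, switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    for n in (3, 10, 100):
        print(f"{n:>3} doors: stay {win_rate(n, False):.3f}, switch {win_rate(n, True):.3f}")
```

With three doors the estimates should land near 1/3 for staying and 2/3 for switching; with 100 doors, switching should win about 99% of the time, which is the "blindingly obvious" end of the spectrum. Shrinking `n_doors` step by step is one way to see that the logic doesn't change, only how intuitive it feels.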


### How do you celebrate your birthday?

September 17, 2020 - 13:00
Published on September 17, 2020 10:00 AM GMT

Or: what's a cool idea to celebrate your birthday?

Or: why don't you celebrate your birthday?


### Curating and analyzing smart, impactful sports plays and strategies that don’t get the attention they deserve

September 17, 2020 - 10:24
Published on September 17, 2020 7:24 AM GMT

I had a shower thought the other day: someone should start a sports blog/YouTube channel/social profile that curates and breaks down smart, unflashy, winning plays and strategies that would never make a highlight reel. In other words, "medium lights".

I love watching athletes do incredible things as much as the next person, but I also particularly appreciate when they make cerebral, timely, +EV plays. The ones that help their teams win, but don’t get the attention they deserve. It’s a shame that so many great efforts and moments are forgotten in time.

So I've decided to do it! Check out my newsletter here.

I've written three posts so far:

If you're into the rational, strategic, analytical side of sports (especially basketball), you might find Medium Lights interesting and/or useful! Let me know what you think; I'm open to all meaningful feedback.


### Rationality for Kids?

September 17, 2020 - 05:39
Published on September 16, 2020 10:58 PM GMT

So I really appreciate the lessons I've learned from "Rationality", but I wish I had learned them earlier in life. We are now homeschooling my kids, and I want to volunteer to teach lessons about thinking rationally to my kids and any other interested children.

Does anyone have recommendations on how to put together a curriculum which gets at the core ideas of rationality, but is oriented towards young kids? Some criteria:

Children will likely range in age from 7 to 11, so the lessons should cover simple concepts and require very little prior knowledge and only the simplest math.

Lessons should be interactive.

Lessons should include TRUE experiments (not just doing fun stuff with chemicals).

Lessons should be fun and appealing enough that parents will want to sign their kids up.

Any other suggestions on the course (wording that will be appealing without sounding too "nerdy" or alarming to the conservative types who usually homeschool) are welcome.


### Noticing Wasted Motion

September 17, 2020 - 05:30
Published on September 17, 2020 2:30 AM GMT

This post is about a practical skill you need to effectively minimize wasted motion - noticing. Noticing is a key skill for changing behavior in general, because it allows you to respond deliberately instead of automatically. Yet in my experience, it is one of the more difficult skills to learn. Hopefully the following will give you a toehold.

In my last post, I talked about the cost of wasted motion - the lost opportunity to go straight for the goal and make the world a bit better. If you’re staring into the dark world of the cost of wasted motion, don’t despair. Minimizing wasted motion is a skill, and you can train that skill.

But before you can do that, you need to learn to notice wasted motion. This is hard. First, it’s hard to remember to pay attention, especially if you’re already doing something else. Second, it’s hard because -- deep down -- you might not want to.

It feels ughy to start working, because that project is big and hard. The immediate pain of starting work looms large in the hyperbolic-discounting agents we call brains. Maybe you're anxious that you’ll look bad if you try and fail. So you just...let your mind slip away to easier tasks. You may not even notice you’re doing it.

But, if you’re paying attention, there’s a little catch there, in your mind. When you stop and think, you know you're procrastinating a little bit. You know you're wasting motion toward your goal. You want to complete your goal, but you also want to avoid dealing with those bad feelings. And that internal conflict makes you feel guilty or frustrated or anxious.

But that little catch is small and easily ignored. If you haven't thought about it before, you may not even have realized it's there. Maybe that guilt is so small you never really notice it. Or maybe it's grown so large that you feel overwhelmed and ashamed.

Learn to notice that bit of frustration at how you’re doing the task, that ugh feeling around a certain plan.

If you’re not sure where to start, it might help to read more about noticing. A therapy-style before-after record can also be helpful for identifying emotions, what issues trigger that emotion, and reliable ways to respond.

Personally, I started learning how to notice by reflecting after something went wrong, and seeing what emotions I felt earlier that could have warned me. Once I had those emotions in mind, I started to pause and check in whenever I noticed that emotion. This led to successfully correcting course, which caused a virtuous cycle by reinforcing noticing that emotion.

For example, one time I got caught up in errands and realized I was supposed to meet a friend for dinner in 5 minutes. The only catch was that the friend was three miles away, and I was on a bike. Now for those of you who don’t know me in person, I’m not exactly the paragon of physical endurance. There was no way on earth I was getting there in 5 minutes. But I felt this tunnel vision pressure to bike as fast as I could and not think about the fact I was probably late.

I finally showed up, quite winded, about half an hour late. It turned out my friend had been pretty worried. I felt really bad, but the experience helped me recognize the feeling of “I can’t stop to reflect, I just have to press forward as fast as I can.” That tunnel-vision feeling reliably signals that I’m under time pressure to keep going while also suspecting I may be doing the wrong thing. Since then, I’ve gotten better at slowing down and checking my plan whenever I notice that feeling.

I experience emotional responses that tell me a poor choice is happening. Other people sometimes describe noticing physical sensations or directly noticing a poor choice, so feel free to try those instead.

You can try deliberately practicing noticing. If you already know of something that triggers the feeling, then you can deliberately cause that emotion and practice responding the way you want to. Or you can try making an intention to notice. Borrowed from psychology, these intentions are usually formatted as if/then sentences - if I notice X feeling, then I will do Y. There is high variance to how well these intentions work in practice, but they are extremely useful when successful.

It might take you a bit to identify what you’re trying to notice. Some people describe looking for what their thoughts slide away from, rather than going straight for the negative feeling. If you’re really struggling, deliberately plan not to fix the problem when you notice you’re wasting motion. This piece of advice might seem super weird. After all, the end goal is to save motion. But if you have to change something if you notice it, you might subconsciously stop yourself from ever noticing it.

As you start to pay attention to those cues, they can be confusing. It’s hard to tell if you’re feeling frustrated because your plan is bad or just because the task is hard. Try doing a brain dump when you feel that frustration. Let your thoughts spill onto the page, and see if there’s a different action you could take that would accomplish your goal better and make that feeling go away. If you’re not certain about this plan, try it for a week. Then reevaluate whether it’s useful for you.

As you practice, the feelings become clearer and more useful. You’ll get fewer false alarms.

But for now, notice when you feel that catch in your emotions. Pay attention to why you feel frustrated or trapped or unhappy. See if those emotions are saying, “Here, do it this way instead. That way won’t work, you’ll just waste time” or “You know it would be better if you did the other thing instead. Why don’t you go and do that?” Use those cues to become aware of when you’re wasting motion.

Once you’re aware, then you can stop wasting motion.


### Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

September 17, 2020 - 05:23
Published on September 17, 2020 2:23 AM GMT

The 4th edition of Artificial Intelligence: A Modern Approach came out this year. While the 3rd edition published in 2009 mentions the Singularity and existential risk, it's notable how much the 4th edition gives the alignment problem front-and-center attention as part of the introductory material (speaking in the authorial voice, not just "I.J. Good (1965) says this, Yudkowsky (2008) says that, Omohundro (2008) says this" as part of a survey of what various scholars have said). Two excerpts—

1.1.5 Beneficial machines

The standard model has been a useful guide for AI research since its inception, but it is probably not the right model in the long run. The reason is that the standard model assumes that we will supply a fully specified objective to the machine.

For an artificially defined task such as chess or shortest-path computation, the task comes with an objective built in—so the standard model is applicable. As we move into the real world, however, it becomes more and more difficult to specify the objective completely and correctly. For example, in designing a self-driving car, one might think that the objective is to reach the destination safely. But driving along any road incurs a risk of injury due to other errant drivers, equipment failure, and so on; thus, a strict goal of safety requires staying in the garage. There is a tradeoff between making progress towards the destination and incurring a risk of injury. How should this tradeoff be made? Furthermore, to what extent can we allow the car to take actions that would annoy other drivers? How much should the car moderate its acceleration, steering, and braking to avoid shaking up the passenger? These kinds of questions are difficult to answer a priori. They are particularly problematic in the general area of human–robot interaction, of which the self-driving car is one example.

The problem of achieving agreement between our true preferences and the objective we put into the machine is called the value alignment problem: the values or objectives put into the machine must be aligned with those of the human. If we are developing an AI system in the lab or in a simulator—as has been the case for most of the field's history—there is an easy fix for an incorrectly specified objective: reset the system, fix the objective, and try again. As the field progresses towards increasingly capable intelligent systems that are deployed in the real world, this approach is no longer viable. A system deployed with an incorrect objective will have negative consequences. Moreover, the more intelligent the system, the more negative the consequences.

Returning to the apparently unproblematic example of chess, consider what happens if the machine is intelligent enough to reason and act beyond the confines of the chessboard. In that case, it might attempt to increase its chances of winning by such ruses as hypnotizing or blackmailing its opponent or bribing the audience to make rustling noises during its opponent's thinking time.³ It might also attempt to hijack additional computing power for itself. These behaviors are not "unintelligent" or "insane"; they are a logical consequence of defining winning as the sole objective for the machine.

It is impossible to anticipate all the ways in which a machine pursuing a fixed objective might misbehave. There is good reason, then, to think that the standard model is inadequate. We don't want machines that are intelligent in the sense of pursuing their objectives; we want them to pursue our objectives. If we cannot transfer those objectives perfectly to the machine, then we need a new formulation—one in which the machine is pursuing our objectives, but is necessarily uncertain as to what they are. When a machine knows that it doesn't know the complete objective, it has an incentive to act cautiously, to ask permission, to learn more about our preferences through observation, and to defer to human control. Ultimately, we want agents that are provably beneficial to humans. We will return to this topic in Section 1.5.

And in Section 1.5, "Risks and Benefits of AI"—

At around the same time, concerns were raised that creating artificial superintelligence or ASI (intelligence that far surpasses human ability) might be a bad idea (Yudkowsky, 2008; Omohundro, 2008). Turing (1996) himself made the same point in a lecture given in Manchester in 1951, drawing on earlier ideas from Samuel Butler (1863):¹⁵

It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. ... At some stage therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler's Erewhon.

These concerns have only become more widespread with recent advances in deep learning, the publication of books such as Superintelligence by Nick Bostrom (2014), and public pronouncements from Stephen Hawking, Bill Gates, Martin Rees, and Elon Musk.

Experiencing a general sense of unease with the idea of creating superintelligent machines is only natural. We might call this the gorilla problem: about seven million years ago, a now-extinct primate evolved, with one branch leading to gorillas and one to humans. Today, the gorillas are not too happy about the human branch; they have essentially no control over their future. If this is the result of success in creating superhuman AI (that humans cede control over their future), then perhaps we should stop work on AI and, as a corollary, give up the benefits it might bring. This is the essence of Turing's warning: it is not obvious that we can control machines that are more intelligent than us.

If superhuman AI were a black box that arrived from outer space, then indeed it would be wise to exercise caution in opening the box. But it is not: we design the AI systems, so if they do end up "taking control," as Turing suggests, it would be the result of a design failure.

To avoid such an outcome, we need to understand the source of potential failure. Norbert Wiener (1960), who was motivated to consider the long-term future of AI after seeing Arthur Samuel's checker-playing program learn to beat its creator, had this to say:

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively ... we had better be quite sure that the purpose put into the machine is the purpose which we really desire.

Many cultures have myths of humans who ask gods, genies, magicians, or devils for something. Invariably, in these stories, they get what they literally ask for, and then regret it. The third wish, if there is one, is to undo the first two. We will call this the King Midas problem: Midas, a legendary King in Greek mythology, asked that everything he touched should turn to gold, but then regretted it after touching his food, drink, and family members.¹⁶

We touched on this issue in Section 1.1.5, where we pointed out the need for a significant modification to the standard model of putting fixed objectives into the machine. The solution to Wiener's predicament is not to have a definite "purpose put into the machine" at all. Instead, we want machines that strive to achieve human objectives but know that they don't know for certain exactly what those objectives are.

It is perhaps unfortunate that almost all AI research to date has been carried out within the standard model, which means that almost all of the technical material in this edition reflects that intellectual framework. There are, however, some early results within the new framework. In Chapter 16, we show that a machine has a positive incentive to allow itself to be switched off if and only if it is uncertain about the human objective. In Chapter 18, we formulate and study assistance games, which describe mathematically the situation in which a human has an objective and a machine tries to achieve it, but is initially uncertain about what it is. In Chapter 22, we explain the methods of inverse reinforcement learning that allow machines to learn more about human preferences from observations of the choices that humans make. In Chapter 27, we explore two of the principal difficulties: first, that our choices depend on our preferences through a very complex cognitive architecture that is hard to invert; and, second, that we humans may not have consistent preferences in the first place—either individually or as a group—so it may not be clear what AI systems should be doing for us.
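The switch-off result mentioned for Chapter 16 can be illustrated with a toy calculation. This is a minimal sketch, not the book's formulation: it assumes the machine holds a discrete belief over the utility U that its proposed action has for the human, and that the human is perfectly rational and switches the machine off exactly when U is negative. The names `off_switch_values`, `act`, `off`, and `defer` are hypothetical.

```python
from typing import Dict, List, Tuple

# (utility of the action to the human, probability under the machine's belief)
Outcome = Tuple[float, float]


def off_switch_values(belief: List[Outcome]) -> Dict[str, float]:
    """Expected value of the machine's three options in a simplified off-switch game."""
    act_now = sum(u * p for u, p in belief)          # act without asking: E[U]
    switch_off = 0.0                                 # turn itself off: utility 0
    # Defer: the (assumed rational) human lets the action through iff U >= 0,
    # so the machine gets E[max(U, 0)].
    defer = sum(max(u, 0.0) * p for u, p in belief)
    return {"act": act_now, "off": switch_off, "defer": defer}


uncertain = [(10.0, 0.6), (-20.0, 0.4)]  # the machine is unsure whether the action helps
certain = [(10.0, 1.0)]                  # the machine is sure the action helps

print(off_switch_values(uncertain))  # deferring beats both acting and switching off
print(off_switch_values(certain))    # deferring is no better than acting directly
```

Because E[max(U, 0)] is never smaller than max(E[U], 0), deferring is weakly better in every case in this toy model, and strictly better only when the machine's belief puts weight on both positive and negative utilities, that is, when it is genuinely uncertain about the human objective.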

Discuss

### What are examples of simpler universes that have been described in order to explain a concept from our more complex universe?

17 сентября, 2020 - 04:31
Published on September 17, 2020 1:31 AM GMT

Sometimes there's a concept that can be difficult to understand when entangled with everything else that needs to be understood about our physics.

If you isolate that concept in a simpler universe, it makes it easier to explain how the concept works.

What are such examples?

(I feel like I asked a similar question somewhere at some point, but can't find it)

Discuss

### Sunday September 20, 12:00PM (PT) — talks by Eric Rogstad, Daniel Kokotajlo and more

17 сентября, 2020 - 03:27
Published on September 17, 2020 12:27 AM GMT

This Sunday at 12pm (PT), we're running another session of "lightning talks" by curated LessWrong authors (see here for previous weeks' transcripts).

• For the first hour, we will have a series of lightning talks each lasting about 5 minutes followed by discussion. The talks will be short and focus on presenting one core idea well, rather than rushing through a lot of content.
• From 1PM to 2PM, we'll have a hangout in breakout rooms. If you are not interested in the talks, feel free to just show up for this part (or the other way around).
• We want to give top LessWrong writers an interesting space to discuss their ideas, and have more fruitful collaboration between users. Think of it like a cross between an academic colloquium and some friends chatting by a whiteboard.

If you're a curated author and interested in giving a 5-min talk at a future event, which will then be transcribed and edited, sign up here.

Details

When? Sunday September 20, 12:00PM (PT)

Discuss

### What Does "Signalling" Mean?

17 сентября, 2020 - 00:19
Published on September 16, 2020 9:19 PM GMT

I still feel a strong empathy for the post You Can't Signal to Rubes, which called out LessWrong for using the word "signalling" incorrectly. That post got heavily, and rightly, downvoted because it also got the definition wrong. :( But it had a point!

At the time of writing, the current definition of signalling on the LessWrong tag is:

Signaling is behavior whose main purpose is to demonstrate to others that you possess some desirable trait. For example, a bird performing an impressive mating display signals that it is healthy and has good genes.

I'm not even sure I should correct it, because this does seem to summarize the LessWrong consensus on what signalling means. But we already have a term for signalling desirable properties about yourself: virtue signalling! Maybe you'll object that "virtue signalling" doesn't have quite the right connotations. Ok. But, could you find another word? I would prefer for "signalling" to point to the subject of signalling theory, which I understand to be the game theory of communication (often focusing on evolutionary game theory).

Scott Alexander's What Is Signaling, Really? seems to get most things right:

In conclusion, a signal is a method of conveying information among not-necessarily-trustworthy parties by performing an action which is more likely or less costly if the information is true than if it is not true. Because signals are often costly, they can sometimes lead to a depressing waste of resources, but in other cases they may be the only way to believably convey important information.

Although all of his examples are about signalling self-properties, he never stipulates that, instead always using the more general conveying-information definition. He also avoids the "signalling is automatically bad" pitfall. Instead, he explains that signalling is often unfortunately costly, but is nonetheless a very useful tool.

However, reading it, I'm not sure whether he means to contrast signalling with "mere assertion", or whether he considers assertion to be a kind of signalling:

Life frequently throws us into situations where we want to convince other people of something. If we are employees, we want to convince bosses we are skillful, honest, and hard-working. If we run the company, we want to convince customers we have superior products. If we are on the dating scene, we want to show potential mates that we are charming, funny, wealthy, interesting, you name it.

In some of these cases, mere assertion goes a long way.

[...]

In other cases, mere assertion doesn't work.

[...]

I'll charitably assume that he meant both cases to be types of signalling. But for anyone who was misled by the wording: signalling is the theory of conveying information! Mere assertions, if they carry information, count as signalling!
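To make "carries information" concrete, here is a small illustration using a simple Bayesian picture, which is an assumption of this sketch rather than something either post formalizes. An action (a costly display or a plain assertion) conveys information about a claim exactly when it is more or less likely if the claim is true than if it is false; if the two likelihoods are equal, the listener's belief does not move.

```python
def posterior(prior: float, p_signal_if_true: float, p_signal_if_false: float) -> float:
    """Listener's belief in the claim after observing the signal (Bayes' rule)."""
    numerator = prior * p_signal_if_true
    return numerator / (numerator + (1.0 - prior) * p_signal_if_false)


# An assertion made 90% of the time when the claim is true, but only 60% when false,
# moves a 50% prior up to 60%: it carries information, so it counts as a signal.
print(posterior(0.5, 0.9, 0.6))

# An assertion made equally often either way leaves the prior unchanged.
print(posterior(0.3, 0.5, 0.5))
```

Nothing in this calculation depends on the claim being about the signaller, which matches the broader, information-theoretic reading of "signalling" argued for above.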

So, to summarize the points I've raised so far:

1. Sometimes people talk like signalling is just the bad thing (the dishonest or not-maximally-honest practice of making yourself look good).
2. Relatedly, people tend to exclude "mere assertion" from signalling, making signaling and literal use of language mutually exclusive.
3. Often people restrict signalling to signalling facts about yourself. (In fact, often restricted to status signalling.)

To be honest, I'm not even sure academic uses of the term "signalling" avoid the "mistakes" I'm pointing at! The Wikipedia article Signalling (economics) currently begins with the following:

In contract theory, signalling (or signaling; see spelling differences) is the idea that one party (termed the agent) credibly conveys some information about itself to another party (the principal).

[Note that I've defaulted to the Wikipedia spelling of signalling; spelling on LessWrong seems mixed.]

On the other hand, the page on Signalling Theory (a page which is very biology-focused, despite the broader applicability of the theory) includes examples such as alarm calls (e.g., birds warning each other that there is a snake in the grass). These signals cannot be interpreted as facts about the signaller.

Perhaps it is a quirk of economics which restricts the term "signalling" to hidden information about the agent, and LessWrong inherited this restricted sense via Robin Hanson?

Discuss

### [AN #117]: How neural nets would fare under the TEVV framework

16 сентября, 2020 - 20:20
Published on September 16, 2020 5:20 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).

SECTIONS
HIGHLIGHTS
TECHNICAL AI ALIGNMENT
TECHNICAL AGENDAS AND PRIORITIZATION
MISCELLANEOUS (ALIGNMENT)
OTHER PROGRESS IN AI
REINFORCEMENT LEARNING
NEWS

HIGHLIGHTS

Estimating the Brittleness of AI: Safety Integrity Levels and the Need for Testing Out-Of-Distribution Performance (Andrew L. John) (summarized by Flo): Test, Evaluation, Verification, and Validation (TEVV) is an important barrier for AI applications in safety-critical areas. Current TEVV standards have very different rules for certifying software and certifying human operators. It is not clear which of these processes should be applied for AI systems.

If we treat AI systems as similar to human operators, we would certify them by ensuring that they pass tests of ability. This does not give much of a guarantee of robustness (since only a few situations can be tested), and is only acceptable for humans because humans tend to be more robust to new situations than software. This could be a reasonable assumption for AI systems as well: while systems are certainly vulnerable to adversarial examples, the authors find that AI performance degrades surprisingly smoothly out of distribution in the absence of adversaries, in a plausibly human-like way.

While AI might have some characteristics of operators, there are good reasons to treat it as software. The ability to deploy multiple copies of the same system increases the threat of correlated failures, which is less true of humans. In addition, parallelization allows for the more extensive testing that is typical of software TEVV. For critical applications, a common standard is that of Safety Integrity Levels (SILs), which correspond to approximate failure rates per hour. Current AI systems fail far more often than current SILs for safety-critical applications demand. For example, an image recognition system would require an accuracy of 0.99999997 at 10 processed frames per second just to reach the weakest SIL used in aviation.

However, SILs are often used on multiple levels and it is possible to build a system with a strong SIL from weaker components by using redundant components that fail independently or by detecting failures sufficiently early, such that AI modules could still be used safely as parts of a system specifically structured to cope with their failures. For example, we can use out-of-distribution detection to revert to a safe policy in simple applications. However, this is not possible for higher levels of automation where such a policy might not be available.
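The redundancy argument can be made concrete with the textbook assumption that components fail independently: if each of n redundant copies fails on a given input with probability p, all of them fail together with probability p^n. The sketch below contrasts this with the opposite extreme, identical copies that fail on exactly the same inputs, which is the correlated-failure worry about deploying many copies of one AI system. Both extremes are simplifying assumptions, and the function name is hypothetical.

```python
def redundant_failure_prob(p_component: float, n_components: int, correlated: bool = False) -> float:
    """Probability that all n redundant components fail on the same input."""
    if correlated:
        # Perfectly correlated copies (e.g. the same model deployed n times) fail together,
        # so adding copies buys nothing.
        return p_component
    # Independent failures: all n must fail at once.
    return p_component ** n_components


print(redundant_failure_prob(1e-3, 2))                   # about 1e-6 with independent failures
print(redundant_failure_prob(1e-3, 2, correlated=True))  # still 1e-3 with correlated failures
```

Real systems sit between these extremes, so the benefit of redundancy depends on how independent the components' failure modes actually are.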

Flo's opinion: While I agree with the general thrust of this article, comparing image misclassification rates to rates of catastrophic failures in aviation seems a bit harsh. I am having difficulties imagining an aviation system that fails due to a single input that has been processed wrongly, even though the correlation between subsequent failures given similar inputs might mean that this is not necessary for locally catastrophic outcomes.

Rohin's opinion: My guess is that we’ll need to treat systems based primarily on neural nets similarly to operators. The main reason for this is that the tasks that AI systems will solve are usually not even well-defined enough to have a reliability rate like 0.99999997 (or even a couple of orders of magnitude worse). For example, human performance on image classification datasets is typically under 99%, not because humans are bad at image recognition, but because in many cases what the “true label” should be is ambiguous. For another example, you’d think “predict the next word” would be a nice unambiguous task definition, but then for the question “How many bonks are in a quoit?”, should your answer be “There are three bonks in a quoit” or “The question is nonsense”? (If you’re inclined to say that it’s obviously the latter, consider that many students will do something like the former if they see a question they don’t understand on an exam.)

TECHNICAL AI ALIGNMENT
TECHNICAL AGENDAS AND PRIORITIZATION

AI Paradigms and AI Safety: Mapping Artefacts and Techniques to Safety Issues (Jose Hernandez-Orallo et al) (summarized by Rohin) (H/T Haydn Belfield): What should prioritization within the field of AI safety look like? Ideally, we would proactively look for potential issues that could arise with many potential AI technologies, making sure to cover the full space of possibilities rather than focusing on a single area. What does prioritization look like in practice? This paper investigates, and finds that it is pretty different from this ideal.

In particular, they define a set of 14 categories of AI techniques (examples include neural nets, planning and scheduling, and combinatorial optimization), and a set of 10 kinds of AI artefacts (examples include agents, providers, dialoguers, and swarms). They then analyze trends in the amount of attention paid to each technique or artefact, both for AI safety and AI in general. Note that they construe AI safety very broadly by including anything that addresses potential real-world problems with AI systems.

While there are a lot of interesting trends, the main conclusion is that there is an approximately 5-year delay between the emergence of an AI paradigm and safety research into that paradigm. In addition, safety research tends to neglect non-dominant paradigms.

Rohin's opinion: One possible conclusion is that safety research should be more diversified across different paradigms and artefacts, in order to properly maximize expected safety. However, this isn’t obvious: it seems likely that if the dominant paradigm has 50% of the research, it will also have, say, 80% of future real-world deployments, and so it could make sense to have 80% of the safety research focused on it. Rather than try to predict which paradigm will become dominant (a very difficult task), it may be more efficient to simply observe which paradigm becomes dominant and then redirect resources at that time (even though that process takes 5 years to happen).

Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems (Sandhya Saisubramanian et al) (summarized by Rohin): This paper provides an overview of the problem of negative side effects, and recent work that aims to address it. It characterizes negative side effects based on whether they are severe, reversible, avoidable, frequent, stochastic, observable, or exclusive (i.e. preventing the agent from accomplishing its main task), and describes existing work and how it relates to these characteristics.

In addition to the canonical point that negative side effects arise because the agent’s model is lacking (whether about human preferences or environment dynamics or important features to pay attention to), they identify two other main challenges with negative side effects. First, fixing negative side effects would likely require collecting feedback from humans, which can be expensive and challenging. Second, there will usually be a tradeoff between pursuing the original goal and avoiding negative side effects; we don’t have principled methods for dealing with this tradeoff.

Finally, they provide a long list of potential directions for future side effect research.

MISCELLANEOUS (ALIGNMENT)

Foundational Philosophical Questions in AI Alignment (Lucas Perry and Iason Gabriel) (summarized by Rohin): This podcast starts with the topic of the paper Artificial Intelligence, Values and Alignment (AN #85) and then talks about a variety of different philosophical questions surrounding AI alignment.

Exploring AI Safety in Degrees: Generality, Capability and Control (John Burden et al) (summarized by Rohin) (H/T Haydn Belfield): This paper argues that we should decompose the notion of “intelligence” in order to talk more precisely about AI risk, and in particular suggests focusing on generality, capability, and control. We can think of capability as the expected performance of the system across a wide variety of tasks. For a fixed level of capability, generality can be thought of as how well the capability is distributed across different tasks. Finally, control refers to the degree to which the system is reliable and deliberate in its actions. The paper qualitatively discusses how these characteristics could interact with risk, and shows an example quantitative definition for a simple toy environment.
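The summary does not give the paper's formulas, so the following is a toy illustration only and not the authors' definitions: capability as the mean score across tasks, and generality as the normalized entropy of how that score is spread across tasks (1.0 when perfectly even, 0.0 when concentrated on a single task). Scores are assumed to be non-negative, and both function names are hypothetical.

```python
import math
from typing import Sequence


def capability(scores: Sequence[float]) -> float:
    """Toy capability: expected (mean) performance across tasks."""
    return sum(scores) / len(scores)


def generality(scores: Sequence[float]) -> float:
    """Toy generality: normalized entropy of the share of total performance on each task."""
    total, n = sum(scores), len(scores)
    if total == 0 or n < 2:
        return 0.0
    shares = [s / total for s in scores if s > 0]   # zero-score tasks contribute no entropy
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(n)                    # divide by the maximum possible entropy


specialist = [1.0, 0.0]   # perfect on one task, useless on the other
generalist = [0.5, 0.5]   # middling on both
print(capability(specialist), generality(specialist))
print(capability(generalist), generality(generalist))
```

The two hypothetical systems have the same capability under this toy definition but opposite generality, which is the kind of distinction the paper argues a single notion of "intelligence" hides.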

OTHER PROGRESS IN AI
REINFORCEMENT LEARNING

The Animal-AI Testbed and Competition (Matthew Crosby et al) (summarized by Rohin) (H/T Haydn Belfield): The Animal-AI testbed tests agents on the ability to solve the sorts of tasks that are used to test animal cognition: for example, is the agent able to reach around a transparent obstacle in order to obtain the food inside. This has a few benefits over standard RL environments:

1. The Animal-AI testbed is designed to test for specific abilities, unlike environments based on existing games like Atari.

2. A single agent is evaluated on multiple hidden tasks, preventing overfitting. In contrast, in typical RL environments the test setting is identical to the train setting, and so overfitting would count as a valid solution.

The authors ran a competition at NeurIPS 2019 in which submissions were tested on a wide variety of hidden tasks. The winning submission used an iterative method to design the agent: after using PPO to train an agent with the current reward and environment suite, the designer would analyze the behavior of the resulting agent, and tweak the reward and environments and then continue training, in order to increase robustness. However, it still falls far short of the perfect 100% that the author can achieve on the tests (though the author is not seeing the tests for the first time, as the agents are).

Rohin's opinion: I’m not sure that the path to general intelligence needs to go through replicating embodied animal intelligence. Nonetheless, I really like this benchmark, because its evaluation setup involves new, unseen tasks in order to prevent overfitting, and because of its focus on learning multiple different skills. These features seem important for RL benchmarks regardless of whether we are replicating animal intelligence or not.

Generalized Hindsight for Reinforcement Learning (Alexander C. Li et al) (summarized by Rohin): Hindsight Experience Replay (HER) introduced the idea of relabeling trajectories in order to provide more learning signal for the algorithm. Intuitively, if you stumble upon the kitchen while searching for the bedroom, you can’t learn much about the task of going to the bedroom, but you can learn a lot about the task of going to the kitchen. So even if the original task was to go to the bedroom, we can simply pretend that the trajectory got rewards as if the task was to go to the kitchen, and then update our kitchen-traversal policy using an off-policy algorithm.
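A minimal sketch of the relabeling idea for the goal-reaching case: pretend the goal was whatever state the trajectory actually ended in, and recompute the rewards accordingly. The integer room numbering, the (state, action, next_state) tuple layout, and the 0/1 reward are illustrative assumptions, not HER's exact formulation.

```python
from typing import List, Tuple

Transition = Tuple[int, str, int]               # (state, action, next_state); rooms as ints
Relabeled = Tuple[int, str, int, int, float]    # ... plus (substituted goal, new reward)


def her_relabel_final(trajectory: List[Transition]) -> List[Relabeled]:
    """Relabel a trajectory as if its final state had been the goal all along."""
    achieved_goal = trajectory[-1][2]
    relabeled = []
    for state, action, next_state in trajectory:
        reward = 1.0 if next_state == achieved_goal else 0.0   # sparse goal-reaching reward
        relabeled.append((state, action, next_state, achieved_goal, reward))
    return relabeled


# Hypothetical rooms: 0 = hallway, 1 = bathroom, 2 = kitchen. We were looking for the
# bedroom, never found it, and ended up in the kitchen.
episode = [(0, "right", 1), (1, "right", 2)]
for transition in her_relabel_final(episode):
    print(transition)
```

The relabeled transitions can then be fed to an off-policy learner as experience for the "go to the kitchen" goal.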

HER was limited to goal-reaching tasks, in which a trajectory would be relabeled as attempting to reach the state at the end of the trajectory. What if we want to handle other kinds of goals? The key insight of this paper is that trajectory relabeling is effectively an inverse RL problem: we want to find the task or goal for which the given trajectory is (near-)optimal. This allows us to generalize hindsight to arbitrary spaces of reward functions.

This leads to a simple algorithm: given a set of N possible tasks, when we get a new trajectory, rank how well that trajectory does relative to past experience for each of the N possible tasks, and then relabel that trajectory with the task for which it is closest to optimal (relative to past experience). Experiments show that this is quite effective and can lead to significant gains in sample efficiency. They also experiment with other heuristics for relabeling trajectories, which are less accurate but more computationally efficient.
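The ranking-based relabeling described above can be sketched as follows: score the new trajectory under each candidate task, express each return as a percentile of the returns previously seen for that task, and pick the task where it ranks highest. The percentile ranking, the state-only reward functions, and all names here are assumptions for illustration rather than the paper's code.

```python
from typing import Callable, List, Sequence

State = float
RewardFn = Callable[[State], float]   # hypothetical: reward depends on the state only


def percentile_rank(value: float, past: Sequence[float]) -> float:
    """Fraction of past returns for a task that `value` strictly exceeds (0.0 if no history)."""
    if not past:
        return 0.0
    return sum(1 for r in past if r < value) / len(past)


def relabel_task(trajectory: Sequence[State],
                 reward_fns: Sequence[RewardFn],
                 past_returns: Sequence[List[float]]) -> int:
    """Index of the task for which this trajectory looks closest to optimal,
    judged relative to previously seen returns for each task."""
    scores = []
    for task_id, reward_fn in enumerate(reward_fns):
        ret = sum(reward_fn(s) for s in trajectory)
        scores.append(percentile_rank(ret, past_returns[task_id]))
    return max(range(len(reward_fns)), key=lambda j: scores[j])


# Two hypothetical 1-D tasks: "move right" (reward = x) and "move left" (reward = -x).
tasks = [lambda x: x, lambda x: -x]
history = [[0.0, 1.0, 2.0, 3.0], [0.0, -1.0]]
print(relabel_task([1.0, 2.0, 3.0], tasks, history))  # 0: relabeled as "move right"
```

Ranking against past experience, rather than comparing raw returns, keeps tasks whose rewards happen to be on a larger scale from claiming every trajectory.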

Rohin's opinion: Getting a good learning signal can be a key challenge with RL. I’m somewhat surprised it took this long for HER to be generalized to arbitrary reward spaces -- it seems like a clear win that shouldn’t have taken too long to discover (though I didn’t think of it when I first read HER).

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement (Benjamin Eysenbach, Xinyang Geng et al) (summarized by Rohin): This paper was published at about the same time as the previous one, and has the same key insight. There are three main differences with the previous paper:

1. It shows theoretically that MaxEnt IRL is the “optimal” (sort of) way to relabel data if you want to optimize the multitask MaxEnt RL objective.

2. In addition to using the relabeled data with an off-policy RL algorithm, it also uses the relabeled data with behavior cloning.

3. It focuses on fewer environments and only uses a single relabeling strategy (MaxEnt IRL relabeling).
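The MaxEnt IRL relabeling in point 1 can be sketched as computing a posterior over tasks of the form q(task | trajectory) ∝ exp(R_task(trajectory) - log Z(task)), where the log-partition term keeps a task with uniformly high returns from absorbing every trajectory. As assumptions for this sketch, log Z is estimated with a log-sum-exp over the current batch of trajectories, the prior over tasks is uniform, and the temperature is 1; the names are hypothetical.

```python
import math
from typing import List, Sequence


def log_sum_exp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def maxent_relabel_probs(returns: Sequence[Sequence[float]]) -> List[List[float]]:
    """returns[i][j] is the return of trajectory i under task j.
    Output[i][j] is q(task j | trajectory i) under a uniform task prior and temperature 1."""
    n_traj, n_tasks = len(returns), len(returns[0])
    # Batch estimate of log Z(task). Using log-sum-exp rather than log-mean-exp shifts
    # every task by the same log(n_traj), which cancels in the softmax below.
    log_z = [log_sum_exp([returns[i][j] for i in range(n_traj)]) for j in range(n_tasks)]
    probs = []
    for i in range(n_traj):
        logits = [returns[i][j] - log_z[j] for j in range(n_tasks)]
        norm = log_sum_exp(logits)
        probs.append([math.exp(z - norm) for z in logits])
    return probs


# Task 0 gives every trajectory a high return, so it is "easy"; after normalizing by
# log Z, trajectory 0 leans toward task 1 despite its higher raw return on task 0.
print(maxent_relabel_probs([[5.0, 1.0], [5.0, 0.0]]))
```

Relabeled tasks can then be sampled from each row, and the relabeled data used with an off-policy RL algorithm or with behavior cloning, as described above.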

NEWS

FHI is hiring Researchers, Research Fellows, and Senior Research Fellows (Anne Le Roux) (summarized by Rohin): FHI is hiring for researchers across a wide variety of topics, including technical AI safety research and AI governance. The application deadline is October 19.

FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

Discuss

### Spoiler-Free Review: Orwell

16 сентября, 2020 - 16:30
Published on September 16, 2020 1:30 PM GMT

Orwell is another game in the discrete-choices-over-time genre. In this case, you are an investigator, and choose which ‘datachunks’ to upload into the system. From there, others will take action.

Like other games in the genre, if you are going to play, play it blind.

I’d rank the game as lower Tier 3 – it’s good at its job, but not essential. It mostly does what it sets out to do, creating an experience and an atmosphere. It has some big frustrations along the way.

Orwell has three problems that prevent it from doing better. It’s also short.

You should play Orwell if and only if the concept of Orwell seems like something you want to experience.

Problem one, which is not in any way a spoiler, is that a lot of the game effectively involves finding the datachunks, or links on pages that lead to new pages that in turn contain datachunks. Several times I got frustratingly stuck trying to figure out where the game wanted me to click. Similarly, there is a star by things that are new, which leads to furious “make the star go away” actions to allow for better searching.

Problem two, which is a minor spoiler, is that the game often gives you fewer choices than it appears to, or than it easily could. Events mostly seem to proceed in order, so you don’t really have the option to withhold most datachunks. Several times I wanted to not upload something, but the game would simply not proceed unless I did. This creates a dilemma: if I don’t upload something, I could spend a long time looking for some other way to advance the game without knowing whether one exists. I would have appreciated a lot more flexibility. Mostly, all the system gives you are some binary choices where two chunks conflict and you have to decide which one to go with.

Problem three requires spoiling the experience to talk about, so that would be a distinct post.

Discuss

### Applying the Counterfactual Prisoner's Dilemma to Logical Uncertainty

16 сентября, 2020 - 13:34
Published on September 16, 2020 10:34 AM GMT

The Counterfactual Prisoner's Dilemma is a symmetric version of the original: regardless of whether the coin comes up heads or tails, you are asked to pay $100, and you are then paid $10,000 if Omega predicts that you would have paid had the coin come up the other way. If you decide updatelessly, you will always receive $9,900, while if you decide updatefully, you will receive $0. So unlike in Counterfactual Mugging, pre-committing to pay ensures a better outcome regardless of how the coin flip turns out, suggesting that focusing only on your particular probability branch is mistaken.
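The payoffs can be checked by enumerating the four possible policies, where a policy says whether you pay after heads and whether you pay after tails. This sketch assumes Omega predicts perfectly, so its "prediction" for the other branch is simply what the policy does there; the function and variable names are hypothetical.

```python
from itertools import product
from typing import Dict

OUTCOMES = ("heads", "tails")


def other(outcome: str) -> str:
    return "tails" if outcome == "heads" else "heads"


def payoff(policy: Dict[str, bool], outcome: str, cost: int = 100, prize: int = 10_000) -> int:
    """Your winnings in the branch `outcome`, given a policy mapping outcome -> pay?"""
    total = -cost if policy[outcome] else 0
    # A perfect predictor pays out iff you would have paid in the other branch.
    if policy[other(outcome)]:
        total += prize
    return total


for pays_heads, pays_tails in product([True, False], repeat=2):
    policy = {"heads": pays_heads, "tails": pays_tails}
    print(policy, [payoff(policy, o) for o in OUTCOMES])
```

Always paying yields 9,900 in both branches and never paying yields 0 in both, so always paying wins whichever way the coin lands. An updateful agent reasons within its branch, where paying only costs $100, and so ends up with the never-pay row.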

The Logical Counterfactual Mugging doesn't use a coin flip, but instead looks at the parity of something beyond your ability to calculate, like the 10,000th digit of pi. You are told it is even and then asked to pay $100 on the basis that, if Omega predicts you would have paid, he would have given you $10,000 had it turned out to be odd.

You might naturally assume that you couldn't construct a logical version of the Counterfactual Prisoner's Dilemma. I certainly did at first. After all, you might say, the coin could have come up tails, but the 10,000th digit of pi couldn't have turned out to be odd. After all, that would be a logical impossibility.

But could the coin actually have come up tails? If the universe is deterministic, then the way it came up was the only way it could ever have come up. So is there less difference between these two scenarios than there appears at first glance?

Let's see. For the standard counterfactual mugging, you can't find the contradiction because you lack information about the world, while for the logical version, you can't find the contradiction because you lack processing power. In the former, we could actually construct two worlds, one where it is heads and one where it is tails, that are both consistent with the information you have about the scenario. In the latter, we can't.

Notice, however, that for the logical version to be well defined, you need to define what Omega is doing when it is making its prediction. In Counterfactuals for Perfect Predictors, I explained that when dealing with perfect predictors, the counterfactual is often undefined. For example, in Parfit's Hitchhiker a perfect predictor would never give a lift to someone who never pays in town, so it isn't immediately clear that predicting what such a person would do in town involves predicting something coherent.

However, even though we can't ask what the hitchhiker would do in an incoherent situation, we can ask what they would do when they receive an input representing an incoherent situation (see Counterfactuals for Perfect Predictors for a more formal description). Indeed, Updateless Decision Theory uses this technique (programs are defined as input-output maps), although I don't know whether Wei Dai was motivated by this concern or not.

Similarly, the predictor in Logical Counterfactual Mugging must be predicting something that is well defined. So we can assume that it is producing a prediction based on an input, which may represent a logically inconsistent situation. Given this, we can construct a logical version of the Counterfactual Prisoner's Dilemma. Writing this explicitly:

First, you are told the 10,000th digit of pi. Regardless of whether it is odd or even, you are asked for $100. You are then paid $10,000 if Omega predicts that you would produce the output corresponding to paying when fed the input corresponding to having been told that this digit has the opposite parity to the one you observed.

There really isn't any difference between how we make the logical case coherent and how we make the standard case coherent. At this point, we can see that, just as in the original Counterfactual Prisoner's Dilemma, always paying scores you $9,900, while never paying scores you nothing. You are guaranteed to do better regardless of the digit's parity (or, in Abram Demski's terms, we now have an all-upside updateless situation).
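The logical version can be checked the same way once the policy is treated as an input-output map, as described above. In this sketch Omega simply calls the policy on the observation string for the opposite parity; that input is perfectly well-formed even though the situation it describes is logically impossible. The observation strings and function names are hypothetical, and the sketch deliberately does not compute the actual digit: it just evaluates both possibilities.

```python
from typing import Callable

Policy = Callable[[str], bool]   # observation ("even" / "odd") -> pay?


def flip(parity: str) -> str:
    return "odd" if parity == "even" else "even"


def logical_cpd_payoff(policy: Policy, true_parity: str,
                       cost: int = 100, prize: int = 10_000) -> int:
    """Winnings when the digit really has `true_parity`."""
    total = -cost if policy(true_parity) else 0
    # Omega evaluates the policy on the input describing the other parity. Only the
    # input is needed, not a coherent world in which that parity holds.
    if policy(flip(true_parity)):
        total += prize
    return total


def always_pay(observation: str) -> bool:
    return True


def never_pay(observation: str) -> bool:
    return False


for parity in ("even", "odd"):
    print(parity, logical_cpd_payoff(always_pay, parity), logical_cpd_payoff(never_pay, parity))
```

Whichever parity turns out to be the real one, always paying scores $9,900 and never paying scores $0, mirroring the coin version.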

Discuss