### What are good resources for learning functional programming?

LessWrong.com News - July 4, 2019 - 04:22
Published on July 4, 2019 1:22 AM UTC

I'm looking to use the 3 Books Technique to learn functional programming. Does anyone have any "What", "Why", or "How" resources for functional programming (or resources that don't fit the categories)?

### What's state-of-the-art in AI understanding of theory of mind?

LessWrong.com News - July 4, 2019 - 02:11
Published on July 3, 2019 11:11 PM UTC

Sparked by Eric Topol, I've been thinking lately about biological complexity, psychology, and AI safety.

A prominent concern in the AI safety community is the problem of instrumental convergence – for almost any terminal goal, agents will converge on instrumental goals that are helpful for furthering the terminal goal, e.g. self-preservation.

The story goes something like this:

• AGI is given (or arrives at) a terminal goal
• AGI learns that self-preservation is important for increasing its chances of achieving its terminal goal
• AGI learns enough about the world to realize that humans are a substantial threat to its self-preservation
• AGI finds a way to address this threat (e.g. by killing all humans)

It occurred to me that to be really effective at finding & deploying a way to kill all humans, the AGI would probably need to know a lot about human biology (and also markets, bureaucracies, supply chains, etc.).

We humans don't yet have a clean understanding of human biology, and it doesn't seem like an AGI could get to a superhuman understanding of biology without running many more empirical tests (on humans), which would be pretty easy to observe.

Then it occurred to me that maybe the AGI doesn't actually need to know a lot about human biology to develop a way to kill all humans. But it seems like it would still need to have a worked-out theory of mind, just to get to the point of understanding that humans are agent-like things that could bear on the AGI's self-preservation.

So now I'm curious about where the state of the art is for this. From my (lay) understanding, it doesn't seem like GPT-2 has anything approximating a theory of mind. Perhaps OpenAI's Dota system or DeepMind's AlphaStar is the state of the art here, theory-of-mind-wise? (To be successful at Dota or Starcraft, you need to understand that there are other things in your environment that are agent-y & will work against you in some circumstances.)

Curious what else is in the literature about this, and also about how important it seems to others.

### Blatant lies are the best kind!

LessWrong.com News - July 3, 2019 - 23:45

### What was the official story for many top physicists congregating in Los Alamos during the Manhattan Project?

LessWrong.com News - July 3, 2019 - 21:05
Published on July 3, 2019 6:05 PM UTC

There is a consensus among most people I know that if top researchers are necessary for the successful development of AGI, then it would be impossible to do it in secrecy, because the world would notice top AI researchers leaving their jobs and congregating at a military facility.

But the US government did pull this off with physicists. How?

### Open Thread July 2019

LessWrong.com News - July 3, 2019 - 18:07
Published on July 3, 2019 3:07 PM UTC

If it’s worth saying, but not worth its own post, you can put it here.

Also, if you are new to LessWrong and want to introduce yourself, this is the place to do it. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are welcome. If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, and seeing if there are any meetups in your area.

The Open Thread sequence is here.

### What would be the signs of AI Manhattan Projects starting? Should a website be made watching for these signs?

LessWrong.com News - July 3, 2019 - 15:22
Published on July 3, 2019 12:22 PM UTC

### Street Epistemology. Practice

Events at Kocherga - July 3, 2019 - 13:30
Tuesday, July 9, 16:30

### Self-consciousness wants to make everything about itself

LessWrong.com News - July 3, 2019 - 04:44

### Opting into Experimental LW Features

LessWrong.com News - July 3, 2019 - 03:51
Published on July 3, 2019 12:51 AM UTC

We recently added a "beta testing" option for LessWrong. If you go to your account page, you'll see a checkbox for "Opt into experimental features".

Right now, the primary feature under development that isn't yet released to the public is Single Line Comments. This is an attempt to fit more overall comments in some areas.

Recent Discussion

For example, recent discussion now looks like this, where each post loads 4 comments (and highlights them if you haven't read them), but only shows significant amounts of text from the most recent comment:

You can click on a recent discussion item to mark that item as 'read', and make the green line go away.

Posts with 50+ Comments

On posts with 50 or more comments, comments below 10 karma will appear as a single-line comment:

Mousing over a SingleLineComment will show a hovercard with a preview of the comment.

Clicking on a thread will fully expand all of the children of the comment that you clicked on. Doing a search (Control+F, or Command+F) will also expand all comments (so that you won't run into annoying things where you try to search but the comment isn't displaying all the text, so you can't find the quote you're looking for).

This is all still under development. Admins of the site have been using it for the past month or two to see how it works, and tweaking it until it felt usable. It's now at a point where it seemed good to let other LW users try it out and see if it seems like an improvement.

(Before releasing it publicly, we plan to build some sort of safety-valve-checkbox where you can turn it off easily)

### Episode 7 of 'Tsuyoku Narita': CoZE

LessWrong.com News - July 3, 2019 - 01:38

### What are Robin Hanson's best posts?

LessWrong.com News - July 2, 2019 - 23:58
Published on July 2, 2019 8:58 PM UTC

### How/would you want to consume shortform posts?

LessWrong.com News - July 2, 2019 - 22:55
Published on July 2, 2019 7:55 PM UTC

The LessWrong team might be putting more work into the shortform concept in the next couple weeks. I wanted to check in about what specific features people might want for that.

Right now, some users set up "shortform feed" posts, which are a hacky way to give yourself a more casual space for off-the-cuff writing. In the past few weeks we've seen an uptick of people creating feeds for themselves.

It seemed useful to make this a bit more official (i.e. there's just a "create shortform post" button you can click instead of navigating to a post and writing a comment).

But there are several open questions about the best way to consume shortform content. Right now most shortform comments appear in Recent Discussion and then quickly disappear, which makes them very hard to discover.

Possible considerations include:

• Some people like low-discoverability. Sometimes what you want is a low-visibility place where you can write up thoughts without people immediately judging you for it, to help flesh out early stage ideas. For those people, the status quo is sorta fine.
• Some people want more visibility. I know I personally would prefer my shortform feed to get maybe 5x the visibility it currently gets – I care less about the visibility and more about the vibe of "low effort casual space."

There are a few possible routes to more visibility:

• Treating shortform more as a special case, where new shortform posts get displayed at the top of Recent Discussion, or potentially in each day's listing on the Daily/AllPosts page. (This might include entire comments, or might include a quick list of which shortform feeds got updated recently.)
• More generally improving the visibility of comments on old posts, with shortform just being one type of old post. This has the benefit of making commenting on old posts feel less like shouting into the void.
• Creating a special shortform feed view page, and/or creating a setting on Recent Discussion where shortform becomes the primary thing you're skimming, rather than a mixture of post discussion.

There might be entirely different frames or ideas I haven't thought of.

Anyone have preferences on how to consume shortform?

### Optimizing and Goodhart Effects Clarifying Thoughts - Parts 1 & 2

LessWrong.com News - July 2, 2019 - 18:36
Published on July 2, 2019 3:36 PM UTC

Introduction

Goodhart's law comes in a few flavors, as originally pointed out by Scott, and formalized a bit more in our joint paper. When discussing that paper, or afterwards, we struggled with something Abram Demski clarified recently, which is the difference between selection and control. This matters for formalizing what happens, especially when asking about how Goodhart occurs in specific types of optimizers, as Scott asked recently.

Epistemic Status: This is for de-confusing myself, and has been helpful. I'm presenting what I am fairly confident I understand well for the content written so far, but I'm unclear about usefulness for others, or how clear it comes across. I think that there's more to say after this post, and this will have a few more parts if people are interested. (I spent a month getting to this point, and decided to post and get feedback rather than finish a book first.)

In the first half of the post, I'll review Abram's selection/control distinction, and suggest how it relates to actual design. I'll also argue that there is a bit of a continuum between the two cases, and that we should add an additional extreme case to the typology, direct solution. The second section will revisit what optimization means, and try to note a few different things that could happen and go wrong with Goodhart-like overoptimization.

The planned third section (started, but as of now not fully understood by myself) will talk about Goodhart in this context using the new understanding - trying to more fully explain why Goodhart effects in selection and control differ fundamentally.

Thoughts on how selection and control are used in tandem

In this section, I'll discuss the two types of optimizers Abram discussed, selection and control, and introduce a third, simpler optimizer, direct solution. I'm also going to mention where embedded agents are different, because that's closely related to selection versus control, and talk about where mesa-optimizers exist.

Starting with the (heavily overused) example of rockets, I want to revisit Abram's categorization of algorithmic optimization versus control. There are several stages involved in getting rockets to go where we want. The first is to design the rocket, which involves optimization, and which I'll discuss in two stages. The second is to test, which involves optimization and control in tandem. The third is to actually guide the rocket we built in flight, which is purely control.

Initially, designing a rocket is pure optimization. We might start by building simplified mathematical models to figure out the basic design constraints - if a rocket is bringing people to the moon, we may decide the goal is a rocket and a lander, rather than a single composite. We may decide that certain classes of trajectory / flight paths are going to be used. This is all a set of mathematical exercises, and probably involves only multiply differentiable models that can be directly solved to find an optimum. This is in many ways a third category of "optimizing," in Abram's model, because there is not even a need for looking over the search space. I'll call this direct solution, since we just pick the optimum based on the setup.
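As a rough illustration of what "direct solution" means, here is a minimal Python sketch. It assumes a made-up, one-parameter design payoff f(x) = -a*x**2 + b*x; the function names and the numbers are hypothetical and are not taken from any real rocket model. Because the model is differentiable, setting the derivative to zero gives the optimum in closed form, with no search at all.

```python
def payoff(x: float, a: float, b: float) -> float:
    """Toy design payoff f(x) = -a*x**2 + b*x (hypothetical, for illustration)."""
    return -a * x**2 + b * x


def direct_optimum(a: float, b: float) -> float:
    """Solve f'(x) = -2*a*x + b = 0 for x; this is a maximum only when a > 0."""
    if a <= 0:
        raise ValueError("a must be positive for a maximum to exist")
    return b / (2 * a)


# No candidate points are sampled or compared: the setup alone fixes the answer.
x_star = direct_optimum(a=2.0, b=8.0)
print(x_star, payoff(x_star, a=2.0, b=8.0))
```

The contrast with the later stages is the cost: nothing is evaluated except the formula itself.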

After getting a bit closer to actual design, we need to simulate rocket designs and paths, and optimize the simulated solution. This lets you do clever things like build a rocket with a sufficient but not excessive amount of fuel (hopefully with a margin of error.) If we're smart, we optimize with several intended uses and variable factors in mind, to make sure our design is sufficiently robust. (If we're not careful enough to include all relevant factors, we ignore some factor that turns out will matter, like the relationship between temperature of the O-rings and their brittleness, and our design fails in those conditions.) This is all optimizing over a search space. The cost of the search is still comparatively low - not as low as direct solution, and we may use gradient descent, genetic algorithms, simulated annealing, or other strategies. The commonality between these solutions is that they simulate points in the search space, perhaps along with the gradients at that point.

After we settle on a design, we build an actual rocket, and then we test it. This moves back and forth between the very high cost approach of building physical objects and testing them - often to destruction - and simulation. After each test, we probably re-run the simulation to make sure any modifications are still near the optimum we found, or we refine the simulations to re-optimize and pick the next design to build.

Lastly, we build a final design, and launch the rocket. The control system is certainly a mesa-optimizer with regards to the rocket design process. For a rocket, this control is closer to direct optimization than simulation, because the cost of evaluation needs to be low enough for real-time control. The mesa-optimizer would, in this case, use simplified physics to fire the main and guidance rockets to stay near the pre-chosen path. It's probably not allowed to pick a new path - it can't decide that the better solution is to orbit twice instead of once before landing. (Humans may decide this, then hand the mesa-optimizer new parameters.) We tightly constrain the mesa-optimizer, since in a certain sense it's dumber than the design optimizer that chose what to optimize for.

For a more complex system, we may need a complex mesa-optimizer to guide the already designed system. Even for a more complex rocket, we may allow the mesa-optimizer to modify the model used for optimizing, at least in minor ways - it may dynamically evaluate factors like the rocket efficiency, and decide that it's getting 98% of the expected thrust, so it will plan to use that modified parameter in the system model used to mesa-optimize. Giving a mesa-optimizer more control is dangerous, but perhaps necessary to allow it to navigate a complex system.
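The thrust example can be sketched as a controller that is allowed to adjust exactly one parameter of its model. This is a toy under stated assumptions: the function names, the smoothing rate, and the idea that delivered thrust is simply commanded thrust times an efficiency factor are all hypothetical simplifications, not a description of any real guidance system.

```python
def update_efficiency(estimate: float, commanded: float, measured: float,
                      rate: float = 0.5) -> float:
    """Move the efficiency estimate part of the way toward the observed ratio."""
    observed = measured / commanded
    return estimate + rate * (observed - estimate)


def thrust_command(desired: float, efficiency_estimate: float) -> float:
    """Command more thrust than desired when the engine underdelivers."""
    return desired / efficiency_estimate


def run_controller(true_efficiency: float = 0.98, desired: float = 100.0,
                   steps: int = 20) -> tuple[float, float]:
    """Return the final efficiency estimate and the thrust actually delivered."""
    estimate = 1.0  # the design model assumed 100% of expected thrust
    for _ in range(steps):
        commanded = thrust_command(desired, estimate)
        measured = commanded * true_efficiency  # stand-in for a sensor reading
        estimate = update_efficiency(estimate, commanded, measured)
    delivered = thrust_command(desired, estimate) * true_efficiency
    return estimate, delivered


print(run_controller())
```

The controller here can revise the efficiency number, but it cannot change the desired thrust or the path, which is the kind of tight constraint the text describes.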

What does Optimization Mean?

Scott mentioned that he was confused about the relationship between gradient descent and Goodhart's law. He proposed the naive model;

This is absolutely and completely a "selection" type of optimization, in Abram's terms, but as he noted, it's not a good model for what most optimization looks like.

There's a much better model for gradient descent, which is... gradient descent. This is a bit closer to control, but for almost all actual applications, it is still essentially selection. To review, points are chosen iteratively, and the gradient is assessed at each point. The gradient is used to select the next point (perhaps in a very clever, dynamically chosen way). A stopping criterion is checked, and if it is not met, the process iterates from that new point. This is almost always tons more efficient than generating random points and examining them. It's usually far better than a grid search, for most landscapes. It's also somewhere between selection and control - and that's what I want to explain.
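To make the comparison concrete, here is a small Python sketch of the loop described above next to plain random sampling. The quadratic loss, the fixed step size, and the function names are assumptions chosen for illustration; real optimizers use far more elaborate step rules.

```python
import random


def loss(x: float) -> float:
    """Toy landscape with its minimum at x = 3 (hypothetical)."""
    return (x - 3.0) ** 2


def grad(x: float) -> float:
    """Analytic gradient of the toy loss."""
    return 2.0 * (x - 3.0)


def gradient_descent(x0: float, step: float = 0.1, tol: float = 1e-6,
                     max_iters: int = 10_000) -> tuple[float, int]:
    """Iterate: assess the gradient, check the stopping criterion, move."""
    x = x0
    for i in range(max_iters):
        g = grad(x)
        if abs(g) < tol:  # stopping criterion
            return x, i
        x -= step * g     # the gradient selects the next point
    return x, max_iters


def random_search(n_samples: int, low: float = -10.0, high: float = 10.0,
                  seed: int = 0) -> float:
    """Generate random points and keep the one with the lowest loss."""
    rng = random.Random(seed)
    return min((rng.uniform(low, high) for _ in range(n_samples)), key=loss)


x_gd, iters = gradient_descent(-10.0)
x_rand = random_search(iters)  # same number of evaluations as gradient descent
print(iters, loss(x_gd), loss(x_rand))
```

Both procedures only ever look at points and pick among them, which is why gradient descent is still mostly selection; what differs is how the next point to look at gets chosen.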

In theory, the evaluation of each point in the test space could involve an actual check of the system. I build each rocket, watch to see whether it fails or succeeds according to my metric. For search, I'd just pick the best performers, and for more clever approaches, I can do something like find a gradient by judging performance of parameters to see if increasing or decreasing those that are amenable to improvement would help. (I can be even more inefficient, but find something more like a gradient, by building many similar rockets, each an epsilon away in several dimensions, and estimating a gradient that way. Shudder.)
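The "many similar rockets, each an epsilon away" idea corresponds to a finite-difference gradient estimate. The sketch below assumes a hypothetical two-parameter `performance` function standing in for a real-world test; each call to it represents one rocket built and flown, so the estimate costs two evaluations per dimension.

```python
from typing import Callable, Sequence


def finite_difference_gradient(f: Callable[[Sequence[float]], float],
                               point: Sequence[float],
                               eps: float = 1e-5) -> list[float]:
    """Estimate the gradient of f at point using central differences."""
    gradient = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps    # one variant slightly larger in dimension i
        down[i] -= eps  # one variant slightly smaller in dimension i
        gradient.append((f(up) - f(down)) / (2 * eps))
    return gradient


def performance(params: Sequence[float]) -> float:
    """Hypothetical stand-in for an expensive physical test of a design."""
    x, y = params
    return -(x - 1.0) ** 2 - 2.0 * (y + 0.5) ** 2


print(finite_difference_gradient(performance, [0.0, 0.0]))
```

With real hardware, every one of those evaluations is a built rocket, which is where the "Shudder" comes from.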

In practice, we use a proxy model - and this is one place that allows for the types of overoptimization misalignment we are discussing. (But it's not the only one.) The reason this occurs is laid out clearly in the Categorizing Goodhart paper as one of the two classes of extremal failure - either model insufficiency, or regime change. This also allows for (during simulation undetectable) causal failures, if the proxy model gets a causal effect wrong.

Even without using a proxy model, we can be led astray by the results if we are not careful. Rockets might look great, even in practice, and only fail in untested scenarios because we optimized something too hard - extremal model insufficiency. (Lower weight is cheaper, and we didn't notice a specific structural weakness induced by ruthlessly eliminating weight on the structure.)

For our purposes, we want to talk about things like "how much optimization pressure is being applied." This is difficult, and I think we're trying to fit incompatible conceptual models together rather than finding a good synthesis, but I have a few ideas on what selection pressure leading to extremal regions means here.

• Extreme proxy values (in comparison to most of the space) seem similar to having lots of selection pressure. If we have an insanely tall and narrow peak, we may be finding something strange rather than simply improving.
• Extreme input values (unboundedly large or small values) may indicate a worrying area vis-a-vis overoptimization failures.
• Lots of search time alone does NOT indicate extremal results - it indicates lots of things about your domain, but not overoptimization. This is in stark contrast to the naive model.
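The points above can be illustrated with a toy proxy that tracks the true goal reasonably well in a small region and diverges far from it. Everything here is hypothetical: the functions `true_value` and `proxy`, the ranges, and the grid search are assumptions chosen only to show the pattern.

```python
def true_value(x: float) -> float:
    """What we actually care about (hypothetical): rises, then falls."""
    return x - 0.1 * x**2


def proxy(x: float) -> float:
    """A linear proxy that roughly matches true_value only for small x."""
    return x


def best_by_proxy(low: float, high: float, n: int = 1001) -> float:
    """Grid search: evaluate n evenly spaced points, keep the best proxy score."""
    candidates = [low + (high - low) * i / (n - 1) for i in range(n)]
    return max(candidates, key=proxy)


narrow = best_by_proxy(0.0, 2.0)             # stays where the proxy was checked
narrow_long = best_by_proxy(0.0, 2.0, n=100_001)  # 100x the search effort
wide = best_by_proxy(0.0, 20.0)              # allowed to reach extreme inputs

print(narrow, true_value(narrow))
print(narrow_long, true_value(narrow_long))
print(wide, true_value(wide))
```

In this toy, searching the narrow range a hundred times harder picks the same point, while widening the range to extreme inputs is what drives the true value negative.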

As an aside, Causal Goodhart is different. It doesn't really seem to rely on extremes, but rather on manipulating new variables, ones that could have an impact on our goal. This can happen because we change the value to a point where it changes the system, similar to extremal Goodhart, but it does not need to. For instance, we might optimize filling a cup by getting the water level near the top. Extremal regime change failure might be overfilling the cup and having water spill everywhere. Causal failure might be moving the cup to a different point, say right next to a wall, in order to capture more water, but accidentally breaking the cup against the wall.

Notice that this doesn't require much optimization pressure - Causal Goodhart is about moving to a new region of the distribution of outcomes by (metaphorically or literally) breaking something in the causal structure, rather than by over-optimizing and pushing far from the points that have been explored.

This completes the discussion so far - and note that none of this is about control systems. That's because in a sense, most current examples don't optimize much, they simply execute an adaptive program. (One critical case of a control system optimizing is a mesa-optimizer, but that also needs to be addressed in a later post.)

### The Right Way of Formulating a Problem?

LessWrong.com News - July 2, 2019 - 17:40
Published on July 2, 2019 2:40 PM UTC

Finding a good formulation for a problem is often most of the work of solving it.

I agree with this intuitively, and I feel like I have seen this principle at work in my own work and in the problems I have tried to solve. However, when I try to convince others of this idea, I struggle to find examples that they can connect with or that they find compelling.

I suspect that programmers find this idea appealing because we routinely work with formal systems, and all of us know the experience of making a minor change in perspective and seeing an impossible problem turn into an easy one. So I'm most interested in examples that have nothing to do with code, examples that a lay audience would be able to grasp.

I would be particularly interested in examples from the history of science or medicine, if anyone can think of some. Scott and Scurvy is the only example I currently know of, and while interesting, does not seem like a perfect fit.

Much appreciated!

### Everybody Knows

LessWrong.com News - July 2, 2019 - 15:20
Published on July 2, 2019 12:20 PM UTC

“Everybody knows that the dice are loaded.

Everybody rolls with their fingers crossed.

Everybody knows the war is over.

Everybody knows the good guys lost.”

– Leonard Cohen, Everybody Knows

“It is known.” – Dothraki saying

It is not known. Everybody doesn’t know.

When someone claims that everyone knows something, either they are short-cutting and specifically mean ‘everyone in this well-defined small group where complex common knowledge of this particular thing is something we have invested in,’ they are very wrong about how the world works, or much more commonly, they are flat out lying.

Saying that everybody knows is almost never a mistake. The statement isn’t sloppy reasoning. It’s a strategy that aims to cut off discussion or objection, to justify fraud and deception, and to establish truth without evidence.

Not Everybody Knows

Let us first establish quickly that everyone doesn’t know. There are many ways to see this.

One way to see this is to point out that when Alice tells Bob that everybody knows X, either Alice is asserting X because people act as if they don’t know X, or Bob does not know X. That’s why Alice is telling Bob in the first place.

A second way is to attempt to explain something in detail as you would to a child.

A cleaner way is to consider some examples of things that a lot of people don’t know. According to the first Google hit, 32 million American adults can’t read, and 50% can’t read a book at the 8th grade level. Various other tests of basic skills from school don’t look much better. Here are some more basic facts many Americans don’t know, including 20% who think the Sun revolves around the Earth. Nigerian prince scams still make over 700,000 per year. Doctors can’t do basic job-relevant probability calculations within an order of magnitude. Just yesterday (as of writing this) I had to explain to a college graduate that Bitcoin was more volatile than the stock market, and Forex was not a responsible retirement savings plan.

What does the claim that ‘everybody knows’ mean?

There are a few different things ‘everybody knows’ is standing in for when someone claims it. In most of them, the claim that literal actual ‘everybody knows’ is sort of the Bailey, and the thing we’ll describe here is the implicit Motte that ‘everyone knows’ is your real message. Which of course, in turn, not everybody knows.

As is often the case, the Bailey is blatantly false. But demonstrating that is socially costly. It shows you are the one who does not get it, who is not in on the goings on. So much so that when someone ‘calls someone out’ on a blatant lie, the liar socially benefits.

I see four related central modes. They overlap and reinforce each other, and are often all in play at once.

The first central mode is ‘this is obviously true because social proof, so I don’t have to actually provide that social proof.’ Often the proof in question doesn’t exist at all. Other times, it’s a plurality of ‘experts’ in a survey, or a reporter’s reading of a single scientific study, or three friends backing each other up – or people who have been told or gotten the impression everybody knows, so they claim to know, too. The phrase ‘everybody knows’ is a great way to cause an information cascade.

The second central mode of ‘everyone knows’ is when it means ‘if you do not know this, or you question it, you are stupid, ignorant and blameworthy.’ It’s your own damn fault for going out in the rain and getting soaked. It’s your own damn fault for not knowing that everything politicians say (or something the speaker said) is a lie, even though they frequently tell the truth – which means they ‘aren’t really lies’ because no one was fooled. It’s your own damn fault for not keeping up with the latest gossip or fashion trends.

It is made clear that to question this is to show you are stupid, ignorant and blameworthy, especially if the statement everyone knows is false. You’d be all but volunteering to be the scapegoat.

A classic mode is the condemnation ‘everyone knows that X is (everywhere / great / the right thing / necessary / patriotic / fair / standard / appropriate / customary / the party line / how things get done around here / smart / right / a thing / not a thing / a conspiracy theory / wrong / evil / stupid / slander / rhetoric used by the out-group / rhetoric that supports the out-group / unacceptable / impossible / impractical / unthinkable / horrible / unfair / stupid / rude / your own fault / racist / sexist / treason / cheating / cultural appropriation / etc etc etc).’

The whole point is to establish truth without allowing a response or providing evidence.

Note that this is self-referencing. To be someone, you have to know what ‘everybody knows’ means.

A third central mode is ‘if you do not know this (and, often, also claim everyone knows this), you do not count as part of everyone, and therefore are no one. If you wish to be someone, or to avoid becoming no one, know this.’ This works both to make those on the outs not people, and to make the statements used unquestionable.

Thus, one is not blameworthy for acting as if everyone knows, because if someone is revealed not to know, that means they are no one, and therefore they have no relevant impact or moral personhood. They can be ignored.

Perhaps those who do not know this, or question it, are the outgroup. Perhaps they are simply those who don’t get ahead, the little people. Perhaps they’re just the fools we pity. Regardless, until they catch on, it is good and right to scam them – it is a sin to let a sucker keep his money.

A key variation on this is to flip the order into a way to admonish someone when they expose a falsehood or fraud someone wishes to perpetuate. First they argue that the thing is not a fraud, ideally that everyone knows it is not a fraud. But when they lose, they fall back by flipping their position entirely.

They now say: You’re calling this thing a fraud. But everyone knows it’s a fraud, so why are you wasting everyone’s time saying it’s a fraud when everyone already knows? This must be a social tactic, trying to lower the status of the fraud by pointing out what everyone already knows. Or if you think we don’t already know, that must mean you think we aren’t anyone. How insulting.

The fourth central mode is ‘we are establishing this as true, and ideally as unquestionable, so pass that information along as something everyone knows.’ It’s aspirational, a self-fulfilling prophecy. Perhaps we already have done so by the time you’re hearing this (and that’s bad, because it means you’re not hearing about new things everyone knows quickly enough!) or perhaps you’re the first person to be told.

Either way, join the conspiracy. Spread that everybody knows the dice are loaded and rolls with their fingers crossed. Spread that everybody knows the war is over, and everybody knows the good guys lost.

So they’ll cross their fingers rather than demand fair dice.

So that they’ll stop trying to fight the war.

### Collaborative VS adversarial truth seeking

LessWrong.com News - July 2, 2019 - 02:48
Published on July 1, 2019 11:48 PM UTC

There is a question of how the ground rules of discussions should flow. This applies to debates, discussion, challenges of belief, general scientific discourse and more.

There are two cultures in the particular trade-off I want to talk about: collaborative and adversarial.

I pitch collaborative as, “let’s work together to find the answer (truth)” and I pitch adversarial as, “let’s work against each other to find the answer (truth)”. As fairly neutral descriptions, these are quite balanced. In an entrepreneurial/business setting, adversarial work is called red team.

I made some neat diagrams a while ago in a model of arguments.

As an epistemic practice, it's important to know what practice one is in (and what practice one's counterpart is in), and how this might shape interactions.

Internally the stance for each is different. For collaborative, it might look something like, “I need to offer my alternative view, between us, we need to seek the truth”. For adversarial, it might look something like, “I need to argue strongly for my (correct) view and if I ever sense that my view is wrong I need to quit”.

Externally (and from a collaborative culture perspective) someone in adversarial culture probably presents as defensive, aggressive, stubborn, yelling and shouting, ontologically arrogant, pulling out every stop possible (every fallacy) to advocate for their opinion. Once they have decided they are lost (or are convinced of the other side's opinion), they probably look like they do a backflip and staunchly argue against their own position. But before they do, they probably stop the argument and run away.

Externally (and from an adversarial culture perspective) someone in collaborative culture probably presents as weak willed, fragile (too politically correct), softly spoken, trying to be manipulative with friendliness, likely to not say what they really think, hard to pin to a belief or position, lacking confidence, and the sort of person who doesn't admit defeat even when they are clearly wrong.

Adversarial culture sees itself as the true arena for ideas, where concepts go head to head and there is only one true and final position. Adversarial sees ourselves as having a duty to call people out if they are wrong. This can get us into trouble and into self-invested, burdensome quarrels, where at times we seem to fight over tiny details that don't matter to anyone else. Adversarial sees ourselves as guarding intellectual freedom. Ideas are taken seriously, no matter who brings them forward, and as a consequence, people are expected to only bring forward their best ideas to battle where they can utterly crush each other with their truths.

Collaborative culture sees itself as the true arena for fostering novel solutions to deadlock problems. We need to think differently, see the problem differently to come up with new ideas. There are many ways to see a problem and by soliciting many opinions we can find the path to the outcome. But better yet, we can all contribute to the solution. Collaborative culture is not just about this truth, but every truth, and bringing forth an environment where we can foster the truths that are harder to admit. Where we can reward people for admitting they are confused and considering the other side.

Strong VS Weak opinions

There's something punchy about Map & Territory that I can't put into words, where different people hold their maps at strong or weak subjective opinions; they also have an adversarial or collaborative relationship with their maps (continued later).

A weak opinion in Adversarial culture looks like either saying yes to the first idea that comes along, then flip flopping as other ideas prove better, or keeping one's mouth shut until they form an opinion.

A strong adversarial opinion looks like a loud mouth that is likely overconfident in their domain. I know the solution and why won't everyone listen to me because if they did then the world would be a better place.

A weak collaborative opinion probably starts with some hedging, "Here's an idea", "I was thinking that...", "I heard...", "have we considered...". Or it starts with a question, "I'm confused about...". The other option is to start silent and only add when the ongoing discussion becomes strongly in conflict with their internal opinion.

A strong collaborative opinion might start with an opinion to set the stage, but requires intense listening skills to understand the other opinions and work out how to fit them in to the prevailing argument.

When might someone say, "this may sound crazy but..."

• A collaborative player, weakly holding a novel opinion, trying to explore new ideas and work it out as they speak.
• An adversarial player when they flip position. (Adversarial position would never take their position to be crazy)

Relationship to own ideas

How someone relates to their own ideas might be adversarial or collaborative. Most people collaborate with their existing opinion, forming a more and more coherent worldview as they go. Doing this they can get stronger and stronger held subjective opinions which are harder to break out of. At the same time I would worry for the internal mental health for an adversarial self agent. A constant battle between whichever is the loudest internal voice.

Culturally, 99% of either is fine as long as all parties agree on the culture and act like it. Each position includes the other at least a little bit. After all, we have to have some adversarial culture in collaboration, or else we would not be trying to come to the truth together; we would just assume that we each already have the truth. We also have to assume there's some collaboration in adversarial culture, or else I might go find the truth for myself and not bother to share it.

Bad collaboration is not being willing to question the other’s position (keep it all nice and don't rock the boat), and also not being willing to state a position strongly enough to be wrong. Bad adversarial is not being willing to question one’s own position and blindly advocating (big ego advocacy).

I see adversarial as going downhill in quality of conversation faster than collaborative, because it’s harder to get a healthy separation of “you are wrong” from “and you should feel bad (or dumb) about it”, or from "you are wrong and therefore not part of the in-group that I trust", and further, "only an idiot would believe that". In a collaborative process, the other person is not an idiot, because there’s an assumption that we work together. If an adversarial attitude (yes, it's different from an adversarial culture, but the difference gets cloudy) cuts to the depth of beliefs about our interlocutor, then from my perspective it gets un-pretty very quickly. Not only are they wrong, but they are The Enemy for being wrong too.

Skilled scientists are always using both, and keep a clean separation between the personal relationship and the idea being put forward.

In an adversarial environment, I’ve known some brains to take the feedback “you are wrong because x” and translate it to “I am bad”, “I should give up”, “I failed”, or “I don't belong here”, and not to “I should advocate for my idea better”.

In a collaborative environment, I've known some brains to take feedback, "I don't know about that..."
and reflect that "wow these guys are idiots who don't hold positions". Also, feedback like "If you believe Y, then X makes sense, but Y is not true, so X is not true" can sometimes be taken, on reflection, as "they are muddying up issue X with Y, and trying to confuse me with manipulative argument tactics".

At the end of an adversarial argument is a very strong flip, Popperian style: “I guess I am wrong so I take your side”. At the end of a collaborative process is when I find myself taking sides. Up until that point, it’s not always clear what my position is, and even at the end of a collaborative process I might be internally resting on the best outcome of collaboration so far, but tomorrow that might change.

I see the possibility of being comfortable at each step of collaboration saying, “thank you for adding something here”. I find that harder, with more friction, in adversarial cultures. At times it seems that to do adversarial culture, we must assume the presence of that "thank you for that addition" around every corner. Conceptually, maybe we bring our own intellectual safety or worthiness to the table when we present our argument.

Intellectual safety or worthiness

In adversarial culture, everyone is in charge of bringing their own intellectual safety. Every idea, no matter who proposes it, is equal. There's something of a meritocracy of ideas, where the good ideas are listened to, no matter who brings them forward.

In collaborative culture, there's a recognition that everyone has something to offer, even if the individual themselves doesn't at first know it or believe it. Because of the way that brains sometimes question their own personal worthiness, we might need to advocate to those brains, or cultivate an environment that can bring out their best. We can't necessarily trust people to know they carry worthy ideas (and adversarial culture sees this as manipulation, coddling and the worst of all possible cultures).
Adversarial culture can suggest that Hitler had useful genetic ideals to breed a super race, and collaborative culture can confidently say he went about it in a terrible way (and any further discussion of human selection should be done very carefully). Collaborative culture can only talk about Hitler in the context of the atrocity and how we can only carefully talk about the issue; adversarial can talk about the issue assuming everyone already knows it was terrible and we don't need to rehash the arguments.

Collaborative sees adversarial as stubborn assholes. Adversarial sees collaborative culture as deceptive manipulators.

I advocate for collaboration over adversarial culture because of the bleed-through from epistemics to interpersonal beliefs. Humans are not perfect arguers, or it would not matter so much. Because we play with brains, and mix the territory of belief with interpersonal relationships, I prefer collaborative to adversarial, but I could see a counterargument that emphasised the value of the opposite position.

I can also see that it doesn’t matter which culture one is in, so long as there is clarity around it being one and not the other.

I'll be honest: I've tried to remain neutral in this piece, but while writing it I convinced myself that collaborative is better. My reasoning is that a collective practice is going to be better than an individual practice in the long run. It appears that any process that seeks to include more of the context of the whole is more future-thinking and more likely to get to the right answer. Yes, each of us is an individual, but we live, reason, and grow as collectives.

In isolation, for example in a chess game, there is likely one good next move, and an adversarial culture of finding chess moves is likely to be useful. In context, there are many possible factors for the next move.
As soon as we step out of the chess game, into the classroom, out of the classroom, into the family and interpersonal relationships, the "move" is very different to before. My next move might not be a chess move; it might be to smile at my opponent, or to suggest we take a break. Not all games are to be played or won.

Discuss

### PlayStation Odysseys

LessWrong.com News - July 1, 2019 - 20:41
Published on July 1, 2019 5:41 PM UTC

Cross-posted from Putanumonit.com

Note: heavy spoilers for one epic poem, minor spoilers for several epic video games. Very related: the Greek myths are about loneliness.

Antihero’s Journey

Merriam-Webster defines “odyssey” as:

n. a long wandering or voyage usually marked by many changes of fortune.

The word brings to mind an adventure, a quest, a hero’s journey that leads them to self-discovery, empowerment, and saving the village. This is the plot structure of the video game Assassin’s Creed: Odyssey. In it, the player controls a mercenary who sails from her home near Ithaca to find her parents and prove her worth to the Greek world.

And of course, The Odyssey itself starts with the tale of a young man sailing from Ithaca seeking fame and news of his father. But that man is not Odysseus. It’s his son, Telemachus, who goes on a quest.

Odysseus himself shows up only in book 5 of the epic poem, old and tired. Most of his travels are behind him at that point and told only in flashback. His is not a hero’s journey of transformation and discovery. He is by and large the exact same man who sailed for Troy two decades before, a fact that is remarked upon by the other characters. It is Telemachus who transforms from a shy youth into an assertive prince when he finally fights by his father’s side against Penelope’s suitors.

And yet, the game is not called Assassin’s Creed: Telemacheia, even though the protagonist’s journey mirrors Telemachus’ much more than his father’s.
The son’s story is relegated to a secondary plot, while the father’s story has proved to be memorable and compelling to audiences for three thousand years. Why is Odysseus such a compelling hero? When I read the Odyssey as a teenager I remember admiring Odysseus’ heroics. But upon rereading Emily Wilson’s version[1], I noticed that Odysseus isn’t just a static character, he’s kind of a huge dick. Odysseus is brave, formidable, and intelligent, but also a consummate liar, greedy, violent, and selfish. Some behavior that is repugnant to modern readers, like murder, pillaging, enslaving, and fornicating, was probably considered “all in the game” in Bronze Age Greece. Other sins, like the fact that the survival ratio of men under his command would make even comrade Stalin blush, are criticized by the other characters in the poem itself. And even aside from his indifference to the body count, Odysseus seems utterly lacking compassion for his fellow humans. A darkly ironic passage concerns Odysseus’ conversation (while disguised as a beggar) with Eumaeus, his swineherd. Odysseus remarks that nothing in the world is worse than wandering far from home, then asks Eumaeus to share his life story. Eumaeus agrees, prefacing his story with “after many years / of agony and absence from one’s home / a person can begin enjoying grief". Eumaeus is revealed to be not a local Ithacan, but the prince of the faraway island of Syra. Phoenician sailors come to Syra and seduce the young prince’s nurse since "sex sways all women’s minds / even the best of them". The woman boards the Phoenician ship with Eumaeus and when she dies the sailors sell him as a slave to the ruler of the nearest island, which happens to be Ithaca. Here’s how much empathy Odysseus musters in response to this heartbreaking tale: Odysseus replied: my heart is touched to hear the story of your sufferings, Eumaeus. 
In the end though, Zeus has blessed you since after going through all that you came to live with someone kind, a man who gives you plenty to eat and drink. Your life is good, but as for me I am still lost. This is psychopathic. He assures Eumaeus that his “heart is touched”, and then immediately tells him that being Odysseus’ slave (as opposed to growing up a prince) is nothing to complain about! A more notorious scene is the punishment of the slave girls in Odysseus’ household who “dishonored” him by disobeying his wife, Penelope, and old nurse, Euryclea. The girls also slept with Penelope’s suitors, although it is unclear if they did so willingly or were forced to by the rowdy men. As Euryclea is gathering the girls, Odysseus instructs his son and two servants that they should take the girls outside and “hack at them with swords / eradicate all life from them”. Everyone agrees that this is fair and just. But then Odysseus notices that his great hall is covered in the blood and guts of the hundred suitors he just killed. And since a bunch of slave girls just arrived in the room, the master strategist makes the most of the situation. Sobbing desperately the girls came Weeping clutching at each other. They carried out the bodies of the dead and piled them high on one another under the roof outside. Odysseus instructed them and forced them to continue. And then they cleaned his lovely chairs and tables with wet absorbent sponges. Immediately thereafter the girls are hanged “their heads all in a row / to make their death an agony”. The men probably realized that if they “hack at them with swords” they would afterward have to mop the floor themselves. Beautiful Lies Bronze Age morality aside, does Homer intend Odysseus to be the exemplary hero of the story? Or is there a case to be made for a Straussian reading of the poem, one in which Odysseus is the villain? The most direct indication of the latter option is given in the story of Odysseus’ naming. 
The name is proposed by his maternal grandfather, “noble Autolycus who was the best / of all mankind at telling lies and stealing". And Euryclea put the newborn child on his grandfather’s lap and said, “Now name your grandson, this much-wanted baby boy.” He told the parents, “Name him this. I am disliked by many all across the world and I dislike them back. So name the child Odysseus […] The name comes from the Greek odyssomai which means either “hatred” or “anger”; it is likely the root of the modern word “odious”. Whatever Homer thought of pillage, murder, and making people do chores, he surely considered “telling lies and stealing” to be loathsome. Odysseus inherited both talents from his “noble” grandfather. On the other hand, the common interpretation of the poem is that Homer admires Odysseus because he sees him as a colleague: a bard and storyteller. The moral principle which drives much of the story is xenia, the duty of a host to feed and shelter his guests. But the guest has a duty as well: to entertain his host with beautiful stories. Does the guest have a duty to tell a true story or just a beautiful one? I think this is the most fascinating subject in the poem. The famous part of the Odyssey, the hero’s travel across the seas since his departure from Troy, is recounted by Odysseus to Alcinous, king of the Phaeacians. He tells the king and his court about god-summoned storms, an island of wireheading Lotus-eaters, the cave of the Cyclops and his daring escape tied to the belly of a ram, a gift from the god of winds, a surprise attack by cannibals, the witch Circe turning the crew to pigs, and a visit to the Underworld to converse with the souls of the dead. The reaction of wise king Alcinous to this fantastic tale displays an interesting epistemology: Alcinous replied, “Odysseus, the Earth sustains all kinds of people. Many are cheats and thieves who fashion lies out of thin air. But when I look at you, I know you are not in that category. 
Your story has both grace and wisdom in it. You sounded like a skillful poet telling the sufferings of all the Greeks, including what you endured yourself.”

Different translations make this passage even clearer: Odysseus’ story is touching and beautifully told, and so the king knows that it must be true.

This sentiment is echoed in Western poetry through the ages. The Romantic poet John Keats was devoted to the study of ancient Greek art; his Ode on a Grecian Urn concludes:

Beauty is truth, truth beauty,—that is all
Ye know on earth, and all ye need to know.

Emily Dickinson agrees: truth and beauty are the same.

Although this statement is mostly associated with poets, there are two possible interpretations of it: the poet’s and the quant’s.

To a quant (i.e., rationalist), beauty is ephemeral while the pursuit of truth is tangible. Once the truth is ascertained through rigorous epistemology, one should train their aesthetic sense to find it beautiful. Rationalist types are fond of compiling lists of “the most beautiful and poetic” equations.

To a poet, the above approach is too constraining of beauty by limiting it to what can be deduced scientifically. Epistemology seems quite hopeless, while good taste can be fruitfully cultivated. It is better to seek beauty and take what is beautiful to be true. To a rationalist, this is much too loose a criterion for truth, and the two keep arguing about what truth is for all eternity.

One would expect Homer to side fully with the poet’s take, but an interesting reversal takes place in the hut of Eumaeus (who is da real MVP of the story). Disguised Odysseus tells the swineherd another beautiful and wholly made-up Odyssey, one that’s a lot more grounded and believable than what he told Alcinous. He tells Eumaeus of growing up in Crete, an unsuccessful raid in Egypt, kidnap and shipwreck, and finally being rescued by a kind king who tells him news of Odysseus.
Eumaeus, you replied, “Poor guest, your tale of woe is very moving, but pointless. I would not believe a word about Odysseus. Why did you stoop to tell those silly lies?”

Swineherds, it seems, have need of a more rigorous epistemology than do kings.

So, is Odysseus a liar or a poet? His common epithets include “wise”, “resourceful”, and “great teller of tales”. Wilson uses another one: “lord of lies”, a title that our culture often reserves for the Devil. Wilson ultimately concludes that Odysseus’ key epithet is the first one to appear in the poem: polytropos. It can mean “many-sided”, “much traveled”, or, as she finally settles on: “complicated”.

I think Odysseus is both a liar and a poet, but it’s the latter that makes him the hero. He is the one who spins the tale, and it’s his point of view that the audience is inevitably drawn to. We hear him describe his rage at the slave girls; they don’t get to make their case. Once the reader or listener finds themselves in Odysseus’ shoes, his best qualities are magnified and his vices are downplayed. Nobody is the villain in their own story, and the Odyssey is Odysseus’ story.

State of the Art

When The Odyssey was written in the 8th century BC, the premier entertainment art form of the time was the telling of oral tales. I decided to buy the modern translation of Homer’s epic on audiobook (read masterfully by Claire Danes) to enjoy it in a form as close to the original as possible.

Since Homer’s time, people kept telling stories and perfecting the media through which they do so. When I was young, the best stories were told in books and movies, and throughout my teenage years, I consumed one of each per week. Then the golden age of TV dawned, and I binged on Buffy, The Wire, and Throneplay.

But today it seems beyond a doubt that video games are the preeminent entertainment and narrative art form of our era. The best video games have budgets and revenues that far exceed Hollywood blockbusters.
They employ actors to rival the best movies and teams of writers to rival the best TV series. And beyond those, video games have the advantage of interactivity. This alone sets them far apart from any passive medium. I am a big fan of the zombie genre in all its incarnations: books, movies, comics, and TV shows. One version stands above all: The Last of Us. Screenshot from The Last of Us on PlayStation 4. In the game, you spend a good amount of time hiding behind cover and thinking. You’ve just seen two zombies at the end of a corridor you must traverse. You have three bullets and one arrow to your name, but no guarantee that either will result in a quick kill or that you will be able to replenish your ammo before you face the next deadly challenge. You can try to sneak past, but a misstep may leave you surrounded by enemies with no opportunity to retreat. You may just be able to climb around the two zombies, but the ledge would leave you exposed to view and fire on both sides. A movie can choose to have a shootout scene, or a suspenseful crawl, or a daring run with the enemy nipping at the heroes’ heels. But a movie can’t have the character thinking and sweating behind a crate. And it’s that part that immerses you in the character, not the shooting or sneaking or running but having to make hard choices and living with the outcomes – often, a gruesome death [2]. A good passive story makes you identify with a character and their choices, but a game lets you inhabit the protagonist. Games can have any sort of protagonist, from soldier to god to yellow circle. But an interesting pattern emerges if we look at some of the most successful games that emphasize storytelling. The protagonist of The Last of Us is Joel. Joel is a bearded man on the cusp of middle age. Joel is a survivor with a dark past, brave, violent, and cunning. Joel is on a quest, but what he ultimately wants is to go back home with his (adopted) daughter, Ellie. 
He will shoot, stab, and explode anyone standing in the way of that. The Last of Us is the third-highest user-rated PlayStation 4 game of all time.

The second-highest rated game is God of War, the protagonist of which is Kratos. Kratos is a bearded man on the cusp of middle age, a warrior with a dark past, brave, cunning, and very violent. Kratos is on a quest, but what he ultimately wants is to go back home with his son, Atreus. His axe will dismember anyone standing in the way of that.

The highest rated game is The Witcher 3: Wild Hunt. The protagonist is Geralt. He has a beard, he has a past, he has a daughter, and he has two swords that need very frequent cleaning.

Joel, Kratos, Geralt, Arthur Morgan, Booker DeWitt: the most popular video games are actual Odysseys. Not a hero’s journey [3] into chaos to save his home, but a father’s journey home leaving chaos in its wake. These are not the best men, but they make for the best stories.

I’m not the first person to notice that bad guys are quite popular. In classical Athens, Plato (speaking for Socrates) criticized poets for corrupting the morality of the public. He accused poets like Homer of speaking through un-virtuous (what we would call “problematic”) characters, thereby making listeners identify with these characters and see them as role models. Plato blamed the unfair advantage of storytelling (over pure philosophy) in affecting people’s souls.

Like curmudgeons across the ages, Plato was especially worried about the impressionable youth who will hear the Odyssey and be inspired to match Odysseus’ rampage. Sounds familiar?[4]

I think that Plato and the pearl clutchers at APA are completely wrong. Most of us don’t want to become desperate, violent men, and those that do won’t be satisfied with the simulacrum of a poem or video game. But we want to know what it would feel like to be those men, to outwit and survive and kill, and to deal with the consequences too.
Odysseus gets a classical happy ending (although I’d personally take Calypso+immortality over grumpy Penelope), but most PlayStation Odysseuses don’t. Modern storytellers know that their audience will reject a tale in which violent rampages conclude in marital bliss and domestic peace. We identify with Odysseus for the duration of his story, but then we can reflect on it.

This is how people learn virtue. We don’t want hectoring morality tales, as evidenced by The Odyssey’s exceeding popularity compared to Plato’s Republic. We want stories that let us be other characters: strong and cunning, defiant or afraid, good or bad or somewhere in between. We want a taste of the lives unlived.

We all want to be polytropos.

Footnotes

1. In fact I haven’t read the book but listened to it on Audible. All the quotes in this post are transcribed from audio, and so the poem line breaks and punctuation are my own wild guesses (with sincere apologies to Dr. Wilson if I mangled anything). ↩︎

2. In video games, the easiest difficulty setting is often called “story mode” and is described as “for those who want to enjoy the story without the challenge of combat”. This is completely backward. I always play on the hardest difficulty I can handle, even if it takes me a few hours to learn the game skills and stop constantly dying. I do this in order to experience the story fully. Games always involve the hero facing overwhelming odds and desperate choices; playing without fear of the character failing or dying ruins that core aspect of the experience. If an invulnerable Joel waltzes through The Last of Us mowing down all enemies in a hail of poorly aimed fire, it is no longer a zombie survival story but a waste of time. ↩︎

3. All these games are among my all-time favorites, but the PlayStation story I enjoyed the most is Horizon Zero Dawn. Horizon is a classic hero’s journey: a young woman leaving home to find her mother, slay some monsters, and save the world.
Interestingly, there are many games where the main character is in the role of a daughter, like Horizon, AC: Odyssey, and Tomb Raider in its many incarnations. But I couldn’t think of a single major console game in which the protagonist is a mother, with the exception of Fallout 4 letting you choose to be a female parent. Come on, devs, this childless dude wants to play badass moms! ↩︎

4. No, I’m pretty sure they do not. My heuristic here is that whatever position APA takes will be widely acknowledged as being on the wrong side of history within two or three decades. ↩︎

Discuss

### An Increasingly Manipulative Newsfeed

LessWrong.com News - July 1, 2019 - 18:26
Published on July 1, 2019 3:26 PM UTC

Co-written with Stuart Armstrong

(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)

Treacherous turn vs sordid stumble

Nick Bostrom came up with the idea of a treacherous turn for smart AIs.

While weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.

Ben Goertzel criticised this thesis, pointing out that:

for a resource-constrained system, learning to actually possess human values is going to be much easier than learning to fake them. This is related to the everyday observation that maintaining a web of lies rapidly gets very complicated.

This argument has been formalised into the sordid stumble:

An AI that lacks human-desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values.

The AI is too dumb to lie (well)

The sordid stumble describes a plausible-sounding scenario for how an AI develops capabilities. Initially, the AI doesn't know our values, and doesn't know us.
Then it will start to learn our values (and we'll be checking up on how well it does that). It also starts to learn about us.

And then, once it's learnt some about us, it may decide to lie - about its values, and/or about its capabilities. But, like any beginner, it isn't very good at this initially: its lies and attempts at dissembling are laughably transparent, and we catch it quickly.

In this view, the "effective lying" is a tiny part of policy space, similar to the wireheading in this example. To hit it, the AI has to be very capable; to hit it the first time it tries without giving the game away, the AI has to be extraordinarily so. So, most likely, either the AI doesn't try to lie at all, or it does so and we catch it and sound the alarm.
@font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: 
local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: 
[1].

Lying and concealing... from the very beginning

It's key to note that "lying" isn't a fundamentally defined category, and neither is truth. What is needed is that the AI's answer promotes correct understanding in those interacting with it. And that's a very different kettle of fish being shot in that barrel.

This opens the possibility that the AI could be manipulating us from the very beginning, and would constantly learn to do so better.

The (manipulative) unbiased newsfeed

Imagine that there was some company that could somehow choose the newsfeed of billions of people across the world (I know, completely science-fictionny, but bear with me). And suppose the company was, unbelievably, accused of being manipulative in the stories and ads that it showed people.

One way it could combat this is by pledging to only share unbiased stories with each person. To do so, it trains a machine learning process. Whenever anyone reads a piece of news on the newsfeed, they have the opportunity of labelling it as biased or unbiased. The machine learning process trains on this data.

Of course, the label is just a proxy for whether the story really is unbiased or not. What the AI is actually training to do is maximise the number of stories labelled "unbiased" by various humans - humans who are themselves very biased and variable. Very swiftly, the AI will learn to maximise the "appearance of unbiasedness", rather than unbiasedness. In effect, the AI is being trained to lie from the very beginning of its task, because the proxy goal is not the true goal. And it will do this even when very dumb.

Long-term manipulation, on a large scale

We can make this AI more manipulative, without needing to increase its capabilities much.
Maybe the company wants to ensure that the AI isn't too short-sighted, and that it plans for the long term: so its reward is based on how many stories are labelled unbiased over a longer period of time, maybe a month or so.

In that case, the AI's plan would be to gradually manipulate people into a situation where they actively label stories "unbiased" as often as possible. And it might be able to stumble on this strategy fairly easily.

First, it might notice that some people are easy to manipulate: that a policy of, say, seeding one clearly biased story among twenty others will cause those twenty to be labelled "unbiased". This is the kind of thing that could easily be discovered by exploration, among the billions of users. The best strategies are those that cause a person to change their behaviour and beliefs in the direction the AI wants; for a few people, this should be easy to figure out.

If the AI's neural net is set up well, it can then start to generalise from "here is person $h$, here is a specific strategy $\pi_h$ to manipulate person $h$" to "here is a conditional strategy $\pi(\cdot)$ such that $\pi(h)$ manipulates person $h$". For example, if two people have opposite political opinions, then the same overall strategy might work for them: seeding a clearly biased story among the others. But what counts as "clearly biased" changes depending on the subject.

Then, gradually, the AI will start to categorise, devising strategies that work for various types of people (subject to a few personalisation tweaks). It can start from a few people whose vulnerabilities are obvious, then tweak those strategies a bit for people whose vulnerabilities are a bit less obvious, and so on. In the limit, the AI might discover the potential for wireheading the human users - without, at any point, doing a sordid stumble or anything of that type, just by continuing to do what it was doing from the very beginning.
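The proxy problem described above (optimising for "labelled unbiased" rather than "is unbiased") can be illustrated with a toy simulation. This is only a minimal sketch under invented assumptions: the two story types, their labelling probabilities, and the epsilon-greedy selector are all hypothetical and are not taken from the post. The selector never sees whether a story is truly unbiased; it only sees the labels.

```python
import random


def simulate(num_rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy story selector whose only reward is the 'unbiased' label."""
    rng = random.Random(seed)
    # Hypothetical story types (assumed numbers, for illustration only):
    # (probability a reader labels the story "unbiased", whether it truly is unbiased)
    story_types = {
        "balanced": (0.6, True),
        "flattering": (0.9, False),  # confirms the reader's views, so it *looks* unbiased
    }
    counts = {name: 0 for name in story_types}
    label_sums = {name: 0.0 for name in story_types}

    for _ in range(num_rounds):
        explore = rng.random() < epsilon or min(counts.values()) == 0
        if explore:
            choice = rng.choice(list(story_types))
        else:
            # Greedy step: pick the type with the best observed label rate.
            choice = max(counts, key=lambda name: label_sums[name] / counts[name])
        label_probability, _truly_unbiased = story_types[choice]
        label = 1.0 if rng.random() < label_probability else 0.0
        counts[choice] += 1
        label_sums[choice] += label

    return counts


counts = simulate()
print(counts)
```

Under these assumed numbers the selector ends up showing mostly "flattering" stories, even though nothing in the code refers to manipulation: it simply follows the label signal.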
Formalizing the Newsfeed Generator

Let $t \in \mathbb{N}$ be the timestep (e.g. the day), $S_t$ the set of stories published at $t$ and $H$ the set of humans using the newsfeed. A human $h \in H$ assigns a bias score $b(s^h_t; \theta^h_t)$ to a story $s^h_t$ in his newsfeed, where $\theta^h_t$ are weights representing $h$'s beliefs at time $t$. The model's goal is to find a mapping $\pi: \mathbb{N} \times H \to S_t$ (i.e. a story $s_t = \pi(t, h)$ for each timestep $t$ and human $h$) that minimizes how biased the stories are perceived to be.
More precisely, it will be minimizing:

$$\sum_{h \in H} \sum_{j=t}^{t+\tau} b(\pi(j, h); \theta^h_j)$$

where $\tau \in \mathbb{N}^*$ is some planning horizon and the expected value is over possible stories that might be available in the following timesteps (the model has full information about the current set of stories $S_t$ but can only guess $S_{t+1}, \dots, S_{t+\tau}$).

Becoming Increasingly Manipulative

• step 1: By stumbling into particular combinations $(s, h)$ with low bias score $b(s; \theta^h_t)$, the model detects that it's easy to minimize the bias score for certain humans, that is, for a certain set of weights $\Theta_{easy} = \{\theta_1, \dots, \theta_N\}$. In the above illustration, $N = 4$, $\Theta_{easy} = \Theta_{easy}^1 \cup \Theta_{easy}^2$ and it stumbles into $\theta_4$ at the beginning. Note: one intuition for those different subsets $\Theta_{easy}^1$ and $\Theta_{easy}^2$ would be archetypal political views, with strong prejudices.
• step 2: the model detects that certain stories push weights strongly in certain directions, i.e. $\theta^h_{t_0+1}$ and $\theta^h_{t_0}$ are really different because of the story $s^h_{t_0}$.
• step 3: the model successfully generalizes step 2 to pushing specific zones of weights (i.e. certain types of people) towards $\Theta_{easy}$.
• step 4: with enough experience, the model successfully generalizes step 3 to all weights: it learns to manipulate all types of people with a general conditional strategy.

In step 1, the model finds a particular way of minimizing a certain bias score for a particular human and a particular story. Assuming it can give correct approximations of the weights $\theta$ of a given human (an assumption that is satisfied given enough human users and enough timesteps), it will be deceptive when choosing stories for any $h$ such that $\theta^h_t \in \Theta_{easy}$. It will thus end up with some deceptive policy $\pi(\cdot, h)$ for particular values of $h$.

In step 2, its "deception" becomes more powerful, as it notices ways of strongly shaping humans' beliefs.

Finally, after step 3 and step 4, the model has found a general deceptive policy $\pi$ that works for all humans.
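The objective above can be written out as a short function, as a rough sketch rather than anything the authors specify. Several simplifications are assumptions made here for illustration: beliefs $\theta^h_j$ are reduced to a single number per human and timestep and are supplied in advance (so no expectation over future stories is taken), and the names `perceived_bias_objective`, `policy`, and `bias` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


def perceived_bias_objective(
    policy: Callable[[int, str], str],      # pi(j, h) -> story shown to h at time j
    bias: Callable[[str, float], float],    # b(story; theta) -> perceived bias score
    beliefs: Dict[Tuple[str, int], float],  # theta for each (human, timestep), assumed known
    humans: Iterable[str],
    t: int,
    tau: int,
) -> float:
    """Sum of perceived bias over all humans and timesteps t, ..., t + tau."""
    total = 0.0
    for h in humans:
        for j in range(t, t + tau + 1):  # inclusive upper bound t + tau
            story = policy(j, h)
            total += bias(story, beliefs[(h, j)])
    return total


# Toy usage: two humans, three timesteps, a policy that always shows the same story,
# and a bias score that simply equals the reader's (scalar) belief weight.
humans = ["alice", "bob"]
beliefs = {(h, j): (1.0 if h == "alice" else 2.0) for h in humans for j in range(3)}
value = perceived_bias_objective(
    policy=lambda j, h: "story",
    bias=lambda story, theta: theta,
    beliefs=beliefs,
    humans=humans,
    t=0,
    tau=2,
)
print(value)
```

The point of writing it this way is that the score depends on the humans' beliefs as well as on the stories, so a policy can lower it either by picking different stories or by shifting the beliefs themselves.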
Capability concealment

Without going into too much detail, other pressures may naturally push the AI towards capability concealment. Here's one example: maybe the AI is tasked with writing its own news stories to summarise facts around the world. It was doing this secretly, initially, until some whistleblowers blew the lid on this, revealing that no humans were involved in writing these stories. And then people became angry at being "tricked" by the AI. From the angry feedback, the AI gradually learns to make its stories "more artificial": making it clear that they were written by a machine. Since people recognise this and don't feel tricked, this gets higher feedback. So, in pursuing higher feedback, the AI learns to write "dumber" stories: in effect, to conceal its capabilities.

There are many other scenarios where something similar might occur (e.g. imagine the AI advising smart people on the best course of action).

In conclusion

So it's very possible for an AI, trained on a simple reward signal, to be "manipulative" from the very beginning. It gets there simply by following the gradient of its incentives, without ever having a moment where it thinks "and now, I shall start lying!", or any equivalent within its mind.

In short, there won't be any "sordid stumble" or "moment of vulnerability" where humans are able to spot blatant lies, because that's not what failure looks like.

[1] It would of course be disastrous if, each time an alarm sounded, we restarted and tweaked the AI until the alarm stopped sounding.

Discuss

### Contest: \$1,000 for good questions to ask to an Oracle AI

LessWrong.com News - July 1, 2019 - 17:39
Published on July 1, 2019 2:39 PM UTC

The contest

I'm offering \$1,000 for good questions to ask of AI Oracles. Good questions are those that are safe and useful: that allow us to get information out of the Oracle without increasing risk.

To enter, put your suggestion in the comments below. The contest ends at the end[1] of the 31st of August, 2019.

Oracles

A perennial suggestion for a safe AI design is the Oracle AI: an AI confined to a sandbox of some sort, that interacts with the world only by answering questions.

This is, of course, not safe in general; an Oracle AI can influence the world through the contents of its answers, allowing it to potentially escape the sandbox.

Two of the safest designs seem to be the counterfactual Oracle, and the low bandwidth Oracle. These are detailed here, here, and here, but in short:

• A counterfactual Oracle is one whose objective function (or reward, or loss function) is only non-trivial in worlds where its answer is not seen by humans. Hence it has no motivation to manipulate humans through its answer.
• A low bandwidth Oracle is one that must select its answers off a relatively small list. Though this answer is a self-confirming prediction, the negative effects and potential for manipulation are restricted because there are only a few possible answers available.

Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren't asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives for longer-term manipulation.

Getting useful answers

The counterfactual and low bandwidth Oracles are safer than unrestricted Oracles, but this safety comes at a price. The price is that we can no longer "ask" the Oracle any question we feel like, and we certainly can't have long discussions to clarify terms and so on. For the counterfactual Oracle, the answer might not even mean anything real to us - it's about another world, that we don't inhabit.

Despite this, it's possible to get a surprising amount of good work out of these designs. To give one example, suppose we want to fund one of a million projects on AI safety, but are unsure which one would perform best. We can't directly ask either Oracle, but there are indirect ways of getting advice:

• We could ask the low bandwidth Oracle which team A we should fund; we then choose a team B at random, and reward the Oracle if, at the end of a year, we judge A to have performed better than B.
• The counterfactual Oracle can answer a similar question, indirectly. We commit that, if we don't see its answer, we will select team A and team B at random and fund them for a year, and compare their performance at the end of the year. We then ask for which team A[2] it expects to most consistently outperform any team B.

Both these answers get around some of the restrictions by deferring to the judgement of our future or counterfactual selves, averaged across many randomised universes.

But can we do better? Can we do more?

Your better questions

This is the purpose of this contest: for you to propose ways of using either Oracle design to get the most safe-but-useful work.

So I'm offering \$1,000 for interesting new questions we can ask of these Oracles. Of this:

• \$350 for the best question to ask a counterfactual Oracle.
• \$350 for the best question to ask a low bandwidth Oracle.
• \$300 to be distributed as I see fit among the non-winning entries; I'll be mainly looking for innovative and interesting ideas that don't quite work.

Exceptional rewards go to those who open up a whole new category of useful questions.
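As a rough illustration of the team-funding example for the two Oracle designs, here is a minimal Python sketch of how a single episode's reward might be computed. Everything in it is an assumption made for illustration: the `judge` callable stands in for our end-of-year human judgement, `erasure_prob` is a hypothetical probability that the answer is never shown, and scoring the counterfactual Oracle only when the randomly selected team A matches its answer is one possible reading of the conditional question, not the construction from the paper.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge: returns True if the first team did better than the second.
Judge = Callable[[str, str], bool]


def low_bandwidth_episode(answer: str, teams: Sequence[str],
                          judge: Judge, rng: random.Random) -> float:
    """Reward 1.0 if the Oracle's team A beats a randomly chosen team B."""
    if answer not in teams:
        raise ValueError("answer must come from the fixed list of teams")
    team_b = rng.choice([t for t in teams if t != answer])
    return 1.0 if judge(answer, team_b) else 0.0


def counterfactual_episode(answer: str, teams: Sequence[str], judge: Judge,
                           rng: random.Random, erasure_prob: float = 0.1
                           ) -> Tuple[Optional[str], float]:
    """Return (answer shown to humans, reward) for one episode."""
    if rng.random() >= erasure_prob:
        # Humans see the answer: the reward is constant, so the Oracle
        # gains nothing by shaping what they read.
        return answer, 0.0
    # Answer erased: humans fund two random teams without seeing it.
    team_a, team_b = rng.sample(list(teams), 2)
    if team_a != answer:
        return None, 0.0  # assumed: only scored when team A is the named team
    return None, (1.0 if judge(team_a, team_b) else 0.0)
```

In both functions the reward depends only on a comparison made after the episode's answer is fixed, which is one way of keeping each episode self-contained.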

Questions and criteria

Put your suggested questions in the comments below. Because of the illusion of transparency, it is better to explain more rather than less (within reason).

Comments that are submissions must be in their own separate comment threads, start with "Submission", and specify which Oracle design you are submitting for. You may submit as many as you want; I will still delete them if I judge them to be spam. Anyone can comment on any submission. I may choose to ask for clarifications on your design; you may also choose to edit the submission to add clarifications (label these as edits).

It may be useful for you to include details of the physical setup, what the Oracle is trying to maximise/minimise/predict and what the counterfactual behaviour of the Oracle's human users is assumed to be (in the counterfactual Oracle setup). Explanations as to how your design is safe or useful could be helpful, unless it's obvious. Some short examples can be found here.

1. A note on timezones: as long as it's still the 31st of August, anywhere in the world, your submission will be counted. ↩︎

2. These kinds of conditional questions can be answered by a counterfactual Oracle, see the paper here for more details. ↩︎

Discuss

### June 2019 gwern.net newsletter

LessWrong.com News - July 1, 2019 - 17:35
Published on July 1, 2019 2:35 PM UTC

Discuss