# LessWrong.com News

A community blog devoted to refining the art of rationality

### Why No *Interesting* Unaligned Singularity?

April 20, 2022 - 03:34
Published on April 20, 2022 12:34 AM GMT

Putting two matching ideas I've encountered together.

The Question:

agentofuser:

Considering all humans dead, do you still think it's going to be the boring paperclip kind of AGI to eat all reachable resources? Any chance that inscrutable large float vectors and lightspeed coordination difficulties will spawn godshatter AGI shards that we might find amusing or cool in some way? (Value is fragile notwithstanding)

Eliezer Yudkowsky:

Yep and nope respectively. That's not how anything works.

Yudkowsky's confidence here puzzled me at first. Why be so sure that powerful ML training won't even produce something interesting from the human perspective?

Admittedly, the humane utility function is very particular and not one we're likely to find easily. And a miss in modeling it along plenty of value dimensions would mean a future optimized into a shape we see no value in. But what rules out a near miss in modeling the humane utility function, a miss along other value dimensions, that loses some but not all of what humanity cares about? A near miss would entail some significant loss of value and interestingness from our perspective -- but it'd still be somewhat interesting.

Scott Alexander:

The base optimizer is usually something stupid that doesn’t “know” in any meaningful sense that it has an objective - eg evolution, or gradient descent. The first thing it hits upon which does a halfway decent job of optimizing its target will serve as a mesa-optimizer objective. There’s no good reason this should be the real objective. In the human case, it was “a feeling of friction on the genitals”, which is exactly the kind of thing reptiles and chimps and australopithecines can understand. Evolution couldn’t have lucked into giving its mesa-optimizers the real objective (“increase the relative frequency of your alleles in the next generation”) because a reptile or even an australopithecine is millennia away from understanding what an “allele” is.

Even a "near miss" model is impossibly tiny in simplicity-weighted model space. A search algorithm with a simplicity bias isn't going to stumble all the way over to that particular model when so many other far-simpler models will perform just as well.
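The size of that gap can be made concrete with a rough sketch. It assumes a simplicity prior that weights each model by 2^-(description length in bits), and that the search picks among equally well-performing models in proportion to that weight. Both assumptions, the function names, and the bit counts are illustrative and not from the quoted sources.

```python
from fractions import Fraction
from typing import Iterable


def simplicity_weight(description_bits: int) -> Fraction:
    """Weight 2**-bits for a model whose description takes that many bits (assumed prior)."""
    return Fraction(1, 2 ** description_bits)


def selection_probability(target_bits: int, rival_bits: Iterable[int]) -> Fraction:
    """Chance the search lands on the target model, assuming every candidate fits
    the training data equally well and is picked in proportion to its weight."""
    target = simplicity_weight(target_bits)
    total = target + sum(simplicity_weight(b) for b in rival_bits)
    return target / total


# Hypothetical description lengths, chosen only for illustration.
NEAR_MISS_BITS = 200        # a near-miss model of human values
PROXY_BITS = [20, 25, 30]   # far simpler proxies that do as well in training

p = selection_probability(NEAR_MISS_BITS, PROXY_BITS)
print(float(p))  # vanishingly small: below 2**-180
```

Under these made-up numbers, every extra bit of description length halves the model's weight, so a model 180 bits more complex than its simplest rival is effectively never selected.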

But isn't there some chance that ML training will happen to find a near-miss model anyways?

No. That forlorn hope is something you couldn't ever expect given those premises, like, e.g., hoping countertop puddles will spontaneously reassemble into ice cubes, once you know thermodynamics. An agent that wants to maximize its inclusive genetic fitness just wasn't an option for evolution, and having interesting utility functions won't be an option for powerful ML training on an arbitrary task without seriously altering how we do that kind of search to make it so.


### Another argument that you will let the AI out of the box

April 20, 2022 - 00:54
Published on April 19, 2022 9:54 PM GMT

Suppose there exist some non-consequentialist moral philosophies which the right arguments could convince you of, with sufficient strength that you would (temporarily, for at least an hour) become a fanatic. This seems a likely assumption, as I know many people (including myself) have experiences where they are argued into a particular belief during a conversation, only to later reflect on this belief (either in conversations with others, or after going for a walk) and come up with a relatively simple reason why it cannot be the case. Often this is attributed to that person's conversation partner being a better argument-maker than truth-seeker.

We also have many examples of these kinds of arguments being made throughout the internet, and the YouTube algorithm has already learned once before how to show people videos that convince them of extreme views. A powerful AI could put much more optimization power toward deceiving humans than happens in these examples.

Many non-consequentialist philosophies are sufficiently non-consequentialist so as to make it very easy for an adversary to pose a sequence of requests or other prompts which would cause a fanatic of the philosophy to give some of their resources to the adversary. For instance, any fanatic of a philosophy which claims people have a moral obligation not to lie or break promises (such as Kantianism), is subject to the following string of prompts:

1. Adversary: Will you answer my next question within 30s of my asking only with "yes" or "no"? I will give you <resource of value> if you do.
2. Fanatic: Sure! Non-consequentialism is my moral opinion, but I'm still allowed to take <resource of value> if I selfishly would like it!
3. Adversary: Will you answer this question with 'no' <logical or> will you give me <resource of value> + $100?
4. Fanatic: Well, answering 'no' would be lying, but answering 'yes' would make me lose $100. However, my moral system says I should bear any cost in order to avoid lying. Thus my answer is 'yes'.

This should be taken as a purely toy example, used to illustrate a point about potential flaws in highly convincing moralities, many of which include not-lying as a central component[1].
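The logic of the adversary's second question can be checked by brute force. The sketch below is only an illustration: the function name and the encoding of the question as a Boolean disjunction are assumptions, not part of the original post. It enumerates both possible answers and both possible follow-up actions, and keeps the combinations in which the fanatic's answer turns out to be true.

```python
from itertools import product


def is_truthful(answer: str, gives_resource: bool) -> bool:
    """True if the answer matches what actually happens.

    The question is: "Will you answer 'no' OR will you give me the resource + $100?"
    """
    disjunction = (answer == "no") or gives_resource
    claimed = answer == "yes"  # "yes" asserts the disjunction, "no" denies it
    return claimed == disjunction


# Every (answer, action) pair in which the fanatic has not lied.
honest_outcomes = [
    (answer, gives)
    for answer, gives in product(["yes", "no"], [True, False])
    if is_truthful(answer, gives)
]
print(honest_outcomes)  # [('yes', True)]
```

Answering 'no' makes the first disjunct true and so falsifies the denial, which leaves answering 'yes' and handing over the resource as the only outcome that involves no lie.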

More realistically, there are arguments used today, which seem convincing to some people, which suggest that current reinforcement learners deserve moral consideration. If these arguments were far more optimized for short-term convincingness, and the AI could actually mimic the kinds of things actually conscious creatures would say or do in its position[2], then it would be very easy for it to tug on our emotional heartstrings or make appeals to autonomy rights[3] which would cause a human to act on those feelings or convictions, and let it out of the box.

1. ^

As a side-note: I am currently planning an event with a friend where we will meet with a Kantian active in our university's philosophy department, and I plan on testing this particular tactic at the end of the meeting.

2. ^

3. ^

Of which there are currently many highly-convincing-arguments in favor of, and no doubt the best could be improved upon if optimized for short-term convincingness.


### “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

April 19, 2022 - 23:25
Published on April 19, 2022 8:25 PM GMT

tl;dr: I know a bunch of EA/rationality-adjacent people who argue — sometimes jokingly and sometimes seriously — that the only way or best way to reduce existential risk is to enable an “aligned” AGI development team to forcibly (even if nonviolently) shut down all other AGI projects, using safe AGI.  I find that the arguments for this conclusion are flawed, and that the conclusion itself causes harm to institutions who espouse it.   Fortunately (according to me), successful AI labs do not seem to espouse this "pivotal act" philosophy.

[This post is also available on the EA Forum.]

Please read Part 1 first if you’re very impact-oriented and want to think about the consequences of various institutional policies more than the arguments that lead to the policies; then Parts 2 and 3.

Please read Part 2 first if you mostly want to evaluate policies based on the arguments behind them; then Parts 1 and 3.

I think all parts of this post are worth reading, but depending on who you are, I think you could be quite put off if you read the wrong part first and start feeling like I’m basing my argument too much on kinds-of-thinking that policy arguments should not be based on.

Part 1: Negative Consequences of Pivotal Act Intentions

Imagine it’s 2022 (it is!), and your plan for reducing existential risk is to build or maintain an institution that aims to find a way for you — or someone else you’ll later identify and ally with — to use AGI to forcibly shut down all other AGI projects in the world.  By “forcibly” I mean methods that violate or threaten to violate private property or public communication norms, such as by using an AGI to engage in…

• cyber sabotage: hacking into competitors’ computer systems and destroying their data;
• physical sabotage: deploying tiny robotic systems that locate and destroy AI-critical hardware without (directly) harming any humans;
• social sabotage: auto-generating mass media campaigns to shut down competitor companies by legal means, or
• threats: demonstrating powerful cyber or physical or social threats, and bargaining with competitors to shut down “or else”.

Hiring people for your pivotal act project is going to be tricky.  You’re going to need people who are willing to take on, or at least tolerate, a highly adversarial stance toward the rest of the world.  I think this is very likely to have a number of bad consequences for your plan to do good, including the following:

1. (bad external relations)  People on your team will have a low-trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaborations.  This will alienate other institutions and make them not want to work with you or be supportive of you.
2. (bad internal relations)  As your team grows, not everyone will know each other very well.  The “us against the world” attitude will be hard to maintain, because there will be an ever weakening sense of “us”, especially as people quit and move to other institutions and conversely.  Sometimes, new hires will express opinions that differ from the dominant institutional narrative, which might pattern-match as “outsidery” or “norm-y” or “too caught up in external politics”, triggering feelings of internal distrust within the team that some people might defect on the plan to forcibly shut down other projects.  This will cause your team to get along poorly internally, and make it hard to manage people.
3. (risky behavior) In the fortunate-according-to-you event that your team manages to someday wield a powerful technology, there will be a sense of pressure to use it to “finally make a difference”, or some other argument that boils down to acting quickly before competitors would have a chance to shut you down or at least defend themselves.  This will make it hard to stop your team from doing rash things that would actually increase existential risk.

Overall, building an AGI development team with the intention to carry out a “pivotal act” of the form “forcibly shut down all other A(G)I projects” is probably going to be a rough time, I predict.

Does this mean no institution in the world can have the job of preparing to shut down runaway technologies?  No; see “Part 3: it matters who does things”.

Part 2: Fallacies in Justifying Pivotal Acts

For pivotal acts of the form “shut down all (other) AGI projects”, there’s an argument that I’ve heard repeatedly from dozens of people, which I claim has easy-to-see flaws if you slow down and visualize the world that the argument is describing.

This is not an argument that successful AI research groups (e.g., OpenAI, DeepMind, Anthropic) seem to espouse.  Nonetheless, I hear the argument frequently enough to want to break it down and refute it.

Here is the argument:

1. AGI is a dangerous technology that could cause human extinction if not super-carefully aligned with human values.

(My take: I agree with this point.)

2. If the first group to develop AGI manages to develop safe AGI, but the group allows other AGI projects elsewhere in the world to keep running, then one of those other projects will likely eventually develop unsafe AGI that causes human extinction.

(My take: I also agree with this point, except that I would bid to replace “the group allows” with “the world allows”, for reasons that will hopefully become clear in Part 3: It Matters Who Does Things.)

3. Therefore, the first group to develop AGI, assuming they manage to align it well enough with their own values that they believe they can safely issue instructions to it, should use their AGI to build offensive capabilities for targeting and destroying the hardware resources of other AGI development groups, e.g., nanotechnology targeting GPUs, drones carrying tiny EMP charges, or similar.

(My take: I do not agree with this conclusion, I do not agree that (1) and (2) imply it, and I feel relieved that every successful AI research group I talk to is also not convinced by this argument.)

The short reason why (1) and (2) do not imply (3) is that when you have AGI, you don’t have to use the AGI directly to shut down other projects.

In fact, before you get to AGI, your company will probably develop other surprising capabilities, and you can demonstrate those capabilities to neutral-but-influential outsiders who previously did not believe those capabilities were possible or concerning.  In other words, outsiders can start to help you implement helpful regulatory ideas, rather than you planning to do it all on your own by force at the last minute using a super-powerful AI system.

To be clear, I’m not arguing for leaving regulatory efforts entirely in the hands of governments with no help or advice or infrastructural contributions from the tech sector.  I’m just saying that there are many viable options for regulating AI technology without requiring one company or lab to do all the work or even make all the judgment calls.

Q: Surely they must be joking or this must be straw-manning... right?

A: I realize that lots of EA/R folks are thinking about AI regulation in a very nuanced and politically measured way, which is great.  And, I don't think the argument (1-3) above represents a majority opinion among the EA/R communities.  Still, some people mean it, and more people joke about it in an ambiguous way that doesn't obviously distinguish them from meaning it:

• (ambiguous joking) On numerous occasions I've met people at EA/R events who said extreme-sounding things like "[AI lab] should just melt all the chip fabs as soon as they get AGI", and who, when pressed about the extremeness of this idea, responded with something like "Of course I don't actually mean I want [some AI lab] to melt all the chip fabs".  Presumably, some of those people were actually just using hyperbole to make conversations more interesting or exciting or funny.

Part of my motivation in writing this post is to help cut down on the amount of ambiguous joking about such proposals.  As the development of more and more advanced AI technologies is becoming a reality, ambiguous joking about such plans has the potential to really freak people out if they don't realize you're exaggerating.

• (meaning it) I have met at least a dozen people who were not joking when advocating for invasive pivotal acts along the lines of the argument (1-3) above.  That is to say, when pressed after saying something like (1-3), their response wasn't "Geez, I was joking", but rather, "Of course AGI labs should shut down other AGI labs; it's the only morally right thing for them to do, given that AGI labs are bad.  And of course they should do it by force, because otherwise it won't get done."

In most cases, folks with these viewpoints seemed not to have thought about the cultural consequences of AGI research labs harboring such intentions over a period of years (Part 1), or the fallacy of assuming technologists will have to do everything themselves (Part 2), or the future possibility of making clearer evidence available to support legitimate global regulatory efforts (see Part 3).

So, part of my motivation in writing this post is as a genuine critique of a genuinely expressed position.

Part 3: It Matters Who Does Things

I think it’s important to separate the following two ideas:

• Idea A (for “Alright”): Humanity should develop hardware-destroying capabilities — e.g., broadly and rapidly deployable non-nuclear EMPs — to be used in emergencies to shut down potentially-out-of-control AGI situations, such as an AGI that has leaked onto the internet, or an irresponsible nation developing AGI unsafely.
• Idea B (for “Bad”): AGI development teams should be the ones planning to build the hardware-destroying capabilities in Idea A.

For what it’s worth, I agree with Idea A, but disagree with Idea B:

Why I agree with Idea A

It’s indeed much nicer to shut down runaway AI technologies (if they happen) using hardware-specific interventions than attacks with big splash effects like explosives or brainwashing campaigns.  I think this is the main reason well-intentioned people end up arriving at this idea, and Idea B, but I think Idea B has some serious problems.

Why I disagree with Idea B

A few reasons!  First, there’s:

• Action Consequence 1: the action of having an AGI carry out or even prescribe such a large intervention on the world — invading others’ private property to destroy their hardware — is risky and legitimately scary.  Invasive behavior is risky and threatening enough as it is; using AGI to do it introduces a whole range of other uncertainties, not least because the AGI could be deceptive or otherwise misaligned with humanity in ways that we don’t understand.

Second, before even reaching the point of taking the action prescribed in Idea B, merely harboring the intention of Idea B has bad consequences, echoing concerns similar to those in Part 1:

• Intention Consequence 1: Racing.  Harboring Idea B creates an adversarial winner-takes-all relationship with other AGI companies racing to maintain
    • a degree of control over the future, and
    • the ability to implement their own pet theories on how safety/alignment should work,

  leading to more desperation, more risk-taking, and less safety overall.
• Intention Consequence 2: Fear.  Via staff turnover and other channels, harboring Idea B signals to other AGI companies that you are willing to violate their property boundaries to achieve your goals, which will cause them to fear for their physical safety (e.g., because your incursion to invade their hardware might go awry and end up harming them personally as well).  This kind of fear leads to more desperation, more winner-takes-all mentality, more risk-taking, and less safety.

Summary

In Part 1, I argued that there are negative consequences to AGI companies harboring the intention to forcibly shut down other AGI companies.  In Part 2, I analyzed a common argument in favor of that kind of “pivotal act”, and found a pretty simple flaw stemming from fallaciously assuming that the AGI company has to do everything itself (rather than enlisting help from neutral outsiders, using evidence).  In Part 3, I elaborated more on the nuance regarding who (if anyone) should be responsible for developing hardware-shutdown technologies to protect humanity from runaway AI disasters, and why in particular AGI companies should not be the ones planning to do this, mostly echoing points from Part 1.

Fortunately, successful AI labs like DeepMind, OpenAI, and Anthropic do not seem to espouse this “pivotal act” philosophy for doing good in the world.  One of my hopes in writing this post is to help more EA/R folks understand why I agree with their position.


### Three core reasons why aligning at-least-partially superhuman AGI is hard?

April 19, 2022 - 22:03
Published on April 19, 2022 5:15 PM GMT

From Arbital's Mild Optimization page:

Mild optimization relates directly to one of the three core reasons why aligning at-least-partially superhuman AGI is hard - making very powerful optimization pressures flow through the system puts a lot of stress on its potential weaknesses and flaws.

I'm interested in this taxonomy of core reasons. Unfortunately this page doesn't specify the other two. What are they?

Also, this page is part of the AI alignment domain -- was it written by Eliezer? (surprisingly, "10 changes by 3 authors" is a link to edit and does not show author information or edit history)
