
Costs and benefits of amniocentesis for normal pregnancies

Published on May 13, 2022 10:47 PM GMT

Disclaimer: No medical training.

Amniocentesis is a sampling of the amniotic fluid to test fetal DNA at 15-20 weeks of pregnancy. It involves inserting a long needle into the amniotic sac, which is potentially risky. But it can catch serious defects.

Arguably the main cost of the procedure is an increased risk of miscarriage, of roughly 1 in 1,000. The best-estimated risks (see this meta-analysis) may still be confounded by selection into the procedure, but experts agree the risk is non-zero. Depending on the person, you might also consider the stress entailed in the process, the feeling of not being done testing, and the cost of a false positive.

What are the benefits? This is harder, because pregnancy involves a lot of correlated tests. For instance, Down syndrome can be detected in "cell free" DNA testing (also known as NIPT) at week 8. It's also detected with much higher accuracy in the amnio. But a negative in the cell free DNA test will drastically reduce the chance that an amnio comes back positive for Down, so it certainly reduces the benefit of amnio--enough that current convention is to not recommend amnio after a negative NIPT result. 

There are also ultrasounds. These can detect issues that are also caught in genetic tests, like Down. But they can also detect issues that are not currently detectable in DNA. For instance, it's estimated that roughly half of all cases of Noonan syndrome are novel, meaning a genetic test for them wouldn't come back positive based on known mutations. 

So the amnio can help catch things that are (i) genetic and (ii) invisible to the ultrasound and other tests. 

To complicate things further, DNA testing of the amniotic fluid is currently performed at three levels of detail. From lowest to highest resolution:

  1. Karyotyping (cytogenetic analysis). The lowest resolution; this essentially counts chromosomes. It detects the same things as the cell free DNA test, but with greater accuracy. 
  2. Microarray. Can catch missing or duplicated chromosomal segments. 
  3. Prenatal exome sequencing: Genotyping the fetus. Explainer here.

Getting an increased-risk result for Down in the cell free DNA test greatly increases the benefits of the amnio. Most medical advice treats the amnio as an obvious choice if there's a finding of increased risk. 

What about for the majority of people, who have not had any indications of elevated risk?  I consider the case of the microarray, which is becoming more available.

Because everyone agrees that the procedure is at least a tiny bit dangerous, and that the risk of a serious syndrome is small, the question hinges on exactly what information is added by the test. The benefit from the amnio is zero if it's only catching things that the NIPT, blood, nuchal translucency, or ultrasound screens would catch (assuming the timing is roughly similar). What is the added information?

Most of the medical papers on the topic were not suited to this question, but Srebniak et al (2018) is almost a perfect fit. They study the results from microarrays and restrict to fetuses that were “karyotypically normal” (normal number of chromosomes)--which is where you would be after a negative NIPT.

The most important outcome seemed to be an "early-onset syndromic disorder." In their meta-analysis of 10,000 mothers, 0.37% of fetuses had this kind of issue. These were serious issues. They write: "most were deletions of various chromosomal regions causing loss and disruption of many genes and leading to intellectual disability, developmental delay, dysmorphic features and variable structural anomalies...A significant number of these syndromes are actually more severe than Down syndrome."

Would these have been detected otherwise? They say that “detection of these disorders by routine ultrasound was assessed as generally unlikely” and suggest around a 50% chance. Not very exact, but still useful. 

Taking their findings literally, amnio reduces your chance of a surprise serious issue by 1/540 ( = 0.37% * 50%), while increasing your chance of a miscarriage by 1/1,000. Serious sub-chromosomal issues are surprisingly common, and about half of them need an amnio (with microarray) to be detected. This might be enough to justify the amnio for some people, although that goes against the standard recommendations.

The main reservation I have is that these numbers seem too high: most people do not get these tests, so does this mean that the population rate of early-onset syndromic disorder is around 0.37%? It does not seem right that 1/270 live births have a genetic abnormality often worse than Down syndrome, but I couldn’t find a source speaking directly to this issue.

One possibility is that the mothers in the meta-analysis had unobserved factors that put them at higher risk. This is pure speculation, but you might want to check if your cost-benefit calculation changes if you further decrease the prevalence estimates by a factor of 2.
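For concreteness, here is a minimal sketch of the arithmetic above in Python. The prevalence, detection, and miscarriage figures are the ones quoted in this post; the factor-of-2 loop is just the sensitivity check suggested in the previous paragraph.

```python
# Rough cost-benefit sketch for amniocentesis with microarray,
# using the numbers quoted above. Not medical advice.

prevalence = 0.0037        # early-onset syndromic disorder rate from Srebniak et al.
amnio_only_fraction = 0.5  # share of those cases invisible to ultrasound and other screens
miscarriage_risk = 1 / 1000

def added_detection(prevalence, amnio_only_fraction):
    """Chance the amnio catches a serious issue that nothing else would."""
    return prevalence * amnio_only_fraction

for factor in (1, 2):  # sensitivity check: also try halving the prevalence estimate
    benefit = added_detection(prevalence / factor, amnio_only_fraction)
    print(f"prevalence/{factor}: catches 1 in {1/benefit:.0f} vs. "
          f"miscarriage 1 in {1/miscarriage_risk:.0f}")
```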


 




Fermi estimation of the impact you might have working on AI safety

Published on May 13, 2022 5:49 PM GMT

Cross-posted here: https://forum.effectivealtruism.org/posts/widWpunQMfuNTCYE3/fermi-estimation-of-the-impact-you-might-have-working-on-ai

I tried doing a Fermi estimation of the impact I would have if I worked on AI safety, and I realized it wasn't easy to do with only a calculator. So I built a website which does this Fermi estimation given your beliefs about AGI, AI safety, and your impact on AI safety progress.

You can try it out here: https://xriskcalculator.vercel.app/

This tool focuses on technical work, and assumes that progress on AGI and progress on AI safety are independent. This approximation is obviously very inaccurate, but for now I can't think of a simple way to take into account the fact that advanced AI could speed up AI safety progress. Other limitations are outlined on the website.
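To give a sense of the kind of calculation involved, here is a minimal sketch of a Fermi estimate of this shape. The parameter names and numbers are illustrative placeholders chosen for the example, not the model the website actually uses.

```python
# Toy Fermi estimate of existential-risk reduction from working on AI safety.
# All parameter names and values are illustrative placeholders,
# not the model used by xriskcalculator.vercel.app.

p_agi_this_century  = 0.5    # chance AGI is built this century
p_doom_if_unaligned = 0.3    # chance of catastrophe given AGI and unsolved alignment
p_safety_solved     = 0.4    # chance alignment gets solved in time with current effort
your_speedup        = 0.001  # fraction by which you speed up alignment progress

# Crude model: your contribution scales the chance that the field succeeds.
delta_p_safety = your_speedup * (1 - p_safety_solved)
xrisk_reduction = p_agi_this_century * p_doom_if_unaligned * delta_p_safety
print(f"Estimated absolute x-risk reduction: {xrisk_reduction:.2e}")
# ~9e-05 with the placeholder numbers above
```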

What do you think of this tool? Can you think of ways it could be improved?




Frame for Take-Off Speeds to inform compute governance & scaling alignment

Published on May 13, 2022 10:23 PM GMT


 

Figure 1: Something happens at future time t' that causes more resources to be poured into alignment

The argument goes: there will be a time in the future, t’, where e.g. a terrible AI accident occurs, alignment failures are documented (e.g. partial deception), or the majority of GDP is AI, such that more people pour resources into aligning AI - potentially to the point that >90% of alignment resources will be used in the years before an x-catastrophe or a pivotal act (Figure 2).

Figure 2: potentially the majority of total resources poured into alignment happen after t'

The initial graph (Fig. 1) seems surprisingly useful as a frame for arguing different cruxes & intuitions. I will quickly enumerate a few & would appreciate comments where you disagree.

Compute governance w/o considering hardware/software overhang is net-negative

If we just govern compute usage while advances in hardware/software continue, this may just shift t’ to the right w/o slowing down timelines, which implies fewer resources poured into alignment in total, for no benefit. 

If we limit compute successfully for many years, but hardware & software improvements continue, an actor can defect and experience a large, discontinuous increase in capabilities. If we (somehow) limit all of them, it will become much much harder to produce transformative AI (intuition: it’s like someone trying to build it today).

Operationalizing the Cause of t’

As mentioned before, t’ could be caused by “a terrible AI accident occurs, alignment failures are documented (e.g. partial deception), or the majority of GDP is AI such that more people are pouring resources into aligning AI.” 

There are other potential causes as well, and I would find it beneficial to investigate how effective the above three (and others) would be at convincing real AI researchers to switch their research focus to alignment. I mean literally talking to researchers in machine learning and asking what capabilities (negative or positive) would get them to seriously consider switching their research focus. 

Additionally, if we find that e.g. showing partial deception in models really would be convincing, then pouring resources into showing that sooner would be shifting t’ to the left, implying more overall resources poured into alignment.

More Resources Don’t Imply More Progress

I expect most AI researchers who try to do alignment to (1) not do impactful work or (2) reinvent the wheel. So assuming we have lots of people who want to do alignment, is there a process that makes them avoid (1) and (2)? For example, a course/workshop they take, post they read, etc.

What I currently think is important is creating multiple documents like https://arbital.com/p/AI_boxing/ . So if someone comes up w/ a boxed-AI plan, we can clearly argue that it must (1) actually build an airtight sandbox and (2) still be useful through the remaining channels to perform a pivotal act. If their plan actually considers these two arguments, then I am much more excited about it. 

So creating more documents like that for e.g. interpretability, learning from human feedback, etc., and iterating on those arguments with researchers working in those fields today, will help future researchers avoid wasting their time w/ dead-ends & reinventing the wheel. See my latest post on avoiding dead-end research.

Another useful thing to have is clearly specified sub-problems, which may look like grounding them in already established formalizations. I think this is really hard, but having these allows outsiders to make clear progress on the problem and would even allow us to directly pay unaligned researchers to work on it today (or set up bounties/millennium-prize-like questions).

Infrastructure for scaling

Related, if we do expect way more people to enter the field, are we building the infrastructure to support that scaling?  Ideally what scales to 10,000 people also applies to the thousand or so people that want to do alignment research today.

Special thanks to Tamay Besiroglu for introducing me to this framing & several arguments, though note my takes are different than his.


 




Alignment as Constraints

Published on May 13, 2022 10:07 PM GMT

In order to find the most promising alignment research directions to pour resources into, we can go about it in three ways:

  1. Constraints all alignment proposals should have
  2. Constraints for current research directions
  3. Constraints for new research directions
Constraints all alignment proposals should have

We can imagine the space of all possible research directions.

This includes everything, even pouring resources into McDonald's HR department. But we can add constraints to focus on research directions more likely to help advance alignment.

If you can tell a story about how it reduces x-risk from AI, then I am slightly more excited about your proposal. But we can continue to add more constraints. 

By constraining more and more, we can narrow down the space to search and (hopefully) avoid dead ends in research. This frame opens up a few questions to ask:

  1. What are the ideal constraints that all alignment proposals should have?
  2. How can we get better at these constraints? e.g. You can tell a better story if you build a story that works, then break it, then iterate on that process (Paul Christiano commonly suggests this). If we can successfully teach these mental movements, we could have more researchers wasting less time. 
Constraints for current research directions

We can also perform this constraint-narrowing on known research agendas (or known problems). A good example is this arbital page on boxed AI, clearly explaining the difficulty of:

  1. Building actually robust, airtight systems that can't affect the world except through select channels.
  2. Still getting a pivotal act from those limited channels

Most proposals for a boxed AI are doomed to fail, but if a proposal competently accounts for (1) & (2) above (which can be considered additional constraints), then I am more excited about that research direction.

Doing a similar process for e.g. interpretability, learning from human feedback, and agent foundations research would be very useful. An example would be: 

  1. Finding the constraints we want for interpretability, such as leading to knowledge of possible deception, what goal the model is pursuing, its understanding of human values, etc. 
  2. Tailor this argument to research at Redwood, Anthropic, OpenAI and get their feedback. 
  3. Repeat and write up results

I expect to either convince them to change their research direction, be convinced myself, or find the cruxes and make bets/predictions if applicable. 

This same process can be applied to alignment-adjacent fields like bias & fairness and task specification in robotics. The result should be "bias & fairness research but it must handle these criticisms/constraints" which is easier to convince those researchers to change to than switching to other alignment fields. 

This is also very scalable/can be done in parallel, since people can perform this process on their own research agendas or do separate deep dives into others' research.

Constraints for new research directions

The earlier section was mostly about what not to work on, but doesn't tell you what specifically to work on (ignoring established research directions). Here is a somewhat constructive algorithm:

  1. Break existing alignment proposals/concepts into "interesting" components
  2. Mix & match those components w/ all other components (most will be trash, but some will be interesting).

For example, Alex Turner's power-seeking work could be broken down into:

  1. Power-seeking
  2. Instrumental convergence
  3. Grounding concepts in formalized math
  4. Deconfusion work
  5. MDPs
  6. Environment/graph symmetries

You could break it down into different components, but this is what I found interesting. The way (3) was formalized can then be mixed & matched with other alignment concepts such as mesa-optimizers, deception, & interpretability, which are research directions I approve of.

For overall future work, we can:

  1. Figure out constraints we want all alignment proposals to have and how to improve that process for constraints we are confident in
  2. Improve current research directions by trying to break them and build them up, getting the feedback of experts in that field (including alignment-adjacent fields like bias & fairness and task specification)
  3. Find new research directions by breaking proposals into interesting components and mix & matching them 

I'd greatly appreciate any comments or posts that do any of these three.




How close to nuclear war did we get over Cuba?

Published on May 13, 2022 7:58 PM GMT

Cross posted to LessWrong and The Good blog

 

The Cuban missile crisis was probably the closest we came to nuclear war. There were two types of close call in Cuba. There were moments of tension between the US and Soviet militaries, all of which tell us something interesting about how different patterns of civil-military relations influence the outcomes of moments of crisis. But there was also the risk of nuclear war not by accident but by intention. Throughout the crisis, the US government kept an invasion of Cuba or airstrikes on the missile sites as options on the table, and the former was pushed very aggressively by the Joint Chiefs of Staff. The Joint Chiefs of Staff were the most senior officers in the US armed forces and provided its strategic direction. The influence of these most senior members of the military bureaucracy has been pervasive throughout the history of nuclear weapons, and their role in nuclear risk will be something I'll return to in a later post. 


 

Cuba: some background


 

I gave some background on the strategic situation in Cuba in this post, but for this story it's worth laying out the timeline of events in a little more detail. From the end of the Spanish-American war in 1898 until Castro's revolution toppled the dictator Fulgencio Batista, Cuba was the playground of the American elite and a quasi-colonial possession. American companies owned the sugar plantations, the mines, and the casinos, and Batista's thugs beat up anyone who tried to change things. You can imagine, then, that it came as quite a shock when the bearded, fiery, Jesuit-educated Castro succeeded in launching a nationalist revolution and overthrowing the Batista government. Initially Castro made overtures to the Americans, but these were rebuffed, as Ho Chi Minh's attempts at alliance - another nationalist leader who turned to communism - had been 4 years earlier in 1956. So Castro turned to the Soviets, and with Castro now firmly in the Soviet camp and expropriating all of the American-owned businesses, it was time for Castro to go. 


 

Another theme in the history of nuclear weapons has been the importance of intelligence and intelligence failures; they will get their own post later. By 1961 the CIA had successfully orchestrated changes of government in Iran and Guatemala, and expected to be able to do the same with Cuba - and that the Cuban refugees they'd armed would be welcomed with open arms. This did not happen. The force that landed in the Bay of Pigs was roundly defeated by the Cuban army, and Cubans celebrated their victory over the Yanqui imperialists. Two things came about as a result of this. Firstly, President Kennedy became all the more desperate to depose Castro and avenge his humiliation. Secondly, Castro was able to convince his Soviet backers to place nuclear missiles on the island (although there were other reasons Khrushchev wanted missiles in Cuba).


 

This brings us to the start of the crisis. U2 planes, American spy planes that flew much higher than conventional aircraft, photographed missile sites on the 14th of October 1962. By the 16th, CIA analysts realised what the U2 had photographed and estimated that missiles could be ready to hit the US within 18 hours. Over the whole course of the crisis the Americans acted on the assumption that there were nuclear warheads on Cuba that could be mated with the missiles, while never actually finding the sites where the warheads were kept. They were correct, as it turned out, with the first missiles being fully assembled and able to launch within 8 hours by the 25th of October. 

Now Kennedy had a choice to make. There were essentially 4 choices on the table: invade Cuba, launch an airstrike against the missile sites, blockade Cuba, or do nothing. He chose to blockade, leading to the first of 4 brushes with nuclear war. 


 

The Invasion


 

The Joint Chiefs of Staff were pushing aggressively for an invasion, led by Air Force chief of staff Curtis LeMay. LeMay in particular believed that the US's vast advantage in nuclear capabilities meant that they could invade and the Soviets would be forced to back down. The Joint Chiefs believed that the missiles in Cuba represented an important strategic threat because the missiles would have been able to hit air force bases where the planes that carried the majority of the US nuclear arsenal were housed. I talk about the credibility of that belief here, but for the purposes of this story, all you need to know is that Kennedy was convinced by Robert McNamara, the secretary of defence, to hold off on the invasion. This was supported by internal defence department documents showing that simulations run on their state-of-the-art computers showed that the missiles would have a very small effect in a nuclear war. Despite this, Kennedy accepted his generals' argument that if the missiles weren't gone by the end of Monday the 29th of October, they'd invade Cuba on the 30th. This decision, in my mind, took us closer to nuclear war than any other. U2s had by the 27th found Luna short-range missiles on Cuba, which were nuclear tipped. It wasn't known to the Americans that the missiles were nuclear tipped, but it was known that they had the capacity to be. However, what the Americans didn't know was that there were also nuclear-tipped cruise missiles on Cuba aimed at the Guantanamo Bay naval base, with orders to fire in response to a US invasion.  


 

The Submarine


 

I’ll start with the most famous of the incidents, the night when Arkhipov maybe - or maybe didn't - save the world. Under the conditions of the quarantine - what the Americans called the blockade, as blockades are technically acts of war - no Soviet sea traffic was allowed into Cuba. The Soviets had sent a fleet of submarines to Cuba, which Vassily Arkhipov commanded, although he was second in command of the one he was on, the B-59. Prior to the Cuban missile crisis Robert McNamara had instituted a new method for signalling to submarines to surface - dropping depth charges. However, the Soviet Union hadn't accepted this and so hadn't passed the information on to their officers. Therefore, when a flotilla of American destroyers located the Soviet sub and started dropping depth charges, the exhausted crew of the B-59 didn't know what the hell was going on. 


 

I think this was a much less dangerous event than is commonly believed, or at least than I believed until I started to read more deeply about the missile crisis. The first thing to note is that the B-59 carried a tactical nuclear weapon, not a strategic one. Tactical nuclear weapons were battlefield weapons, so if the weapon had been launched it would have destroyed the American flotilla, but it would not have hit the American mainland. It's not at all clear that this would have escalated to full-scale nuclear war - there were no American war plans, for instance, that escalated from the use of tactical nuclear weapons by the Soviets to firing nuclear missiles. There have also been a surprisingly large number of conflicts between nuclear powers where one side killed the other's soldiers, including a small-scale border war between China and the Soviet Union in 1969, and numerous incidents between Indian and Pakistani soldiers. Finally, it's just not clear why the use of a tactical weapon would escalate to the use of a thermonuclear weapon. It's not clear that it wouldn't - maybe the taboo against the use of nuclear weapons is the only thing holding back their use and this would have broken it, or maybe this would have just escalated into a full-scale war between the US and USSR. But maybe it wouldn't, and we know that both Kennedy and Khrushchev desperately wanted to avoid war. 


 

There are conflicting reports about what actually happened in the submarine after the depth charges started to drop. There are some reports that it really was only Arkhipov opposing the bellicose captain, but there are other reports that the captain was opposed by all of the officers in the submarine. But all the reports agree that the captain wasn't proposing firing the weapon and certainly hadn't turned his key in the firing system - he wanted to arm it. Granted, this means that the next step would be firing it, but I think the history of nuclear close calls suggests that at every step on the path to actually firing a nuclear weapon any individual is unlikely to proceed to the next step. Whatever really happened, the captain didn't arm the weapon and surfaced. The nuclear danger, though, didn't end there. Navy pilots dropped photographic canisters over the surfaced sub, which the crew interpreted as a practice bombing run. Assuming they were about to be blown out of the water, they started to return below deck to fire their nuclear-tipped torpedo before a Navy messenger was able to signal to the Soviets that they weren't under attack. 


 

The U2s 


 

Two American spy planes moved the world closer to nuclear war during October of 1962. The first was flying over Cuba when it was shot down by Soviet anti-aircraft fire, killing the pilot. The Soviet military was not authorised to shoot down US planes over Cuba, and the specific officer who ordered the plane to be shot down was acting independently of his commanding officer. However, there was no American reprisal, and this was a move supported by the Joint Chiefs, the arch-hawks of the crisis. The reason was that a proportional response wouldn't have destroyed the nuclear missiles in Cuba, the key target, while still dramatically increasing the risk of a Soviet response. Therefore the decision was taken to wrap the reprisal into the invasion that was scheduled for 3 days later. 


 

The second U2 was on a routine mission over the North Pole. Because of how light the planes needed to be, the U2s had very limited navigational equipment on board, meaning the pilots often had to use the stars to navigate. Unfortunately, on that day the Northern Lights were active on the plane's flight path, meaning the pilot lost his bearings and ended up 300 miles into the Soviet Union. Soviet fighter jets were scrambled with orders to shoot him down. American fighters were sent in response. But the real kicker, from a nuclear risk perspective, is that all of the American fighter jets were armed exclusively with nuclear-tipped missiles, meaning a dogfight would have been fought with tactical nukes. The context for this moment of danger was that Kennedy had ordered Strategic Air Command to go to defcon 2 - the state of alert before a war footing. For our purposes this meant two things: firstly, that US planes would carry tactical nuclear weapons, and secondly, that there would be US nuclear bombers constantly in the air. We now have documents showing Kennedy did order this move - what he didn't order was for Thomas Power, the commander of SAC, to effectively tell the Soviets that his forces had been moved to the state of readiness before nuclear war. Nor did Kennedy or McNamara know that routine U2 missions near the Soviet Union were continuing during the crisis. When the Soviets picked up the U2 on their radars they assumed that it was scoping out Soviet territory in preparation for it to be bombed, and so the orders of the scrambled fighters were to shoot the plane down first and ask questions later.


 

There are two final cases of nuclear close calls which I'm including for completeness' sake. The first came when a missile regiment received an order to fire its missiles at the Soviet Union. The commanding officer refused the order, believing it not to be genuine, as the US military was on defcon 3 and SAC on defcon 2, and they should only expect to receive orders to fire once at defcon 1. The officer threatened to have anyone who tried to follow the order shot, a threat that may have been necessary. The reason I'm not going into this in more detail is that there's no historical consensus around whether or not it happened. The second case was a classic failure in an early warning system at the end of the crisis, after Kennedy and Khrushchev had agreed to a deal. Again, the evidence here is scant, and there are no reports as to the degree of seriousness with which the false alarm was treated. However, it mustn't have been taken that seriously, as there are no reports that any of the civilians it would have gone through before reaching the President were notified. 


 

Why



 

I think the Cuban missile crisis really shows well a few of the key themes of nuclear risk:


 

  1. The existence of tactical nuclear weapons
  2. The breakdown of civilian control over the military 
  3. Intelligence failures 



 

I think the first really notable thing is how all of these moments of danger could have been avoided if none of the actors had tactical nuclear weapons. It's possible that the invasion of Cuba would have triggered a conflict between Soviet and American soldiers and this would have escalated, but it seems like it was vastly more likely to escalate to a nuclear war if the Soviets used a nuclear weapon to destroy an American military base. I think there are a couple of reasons why tactical nuclear weapons are so dangerous. The first is that their use is delegated to individual officers and sometimes individual soldiers. There are just a lot of individual officers, and so it's much more likely that one of these soldiers will get into a high-stakes situation, and that one of the individuals in high-stakes situations will choose to use their nuclear weapon. The second reason why I think they're especially dangerous is that the threshold for their use is much lower than with strategic weapons. If the Americans had invaded Cuba it seems extremely likely that the Soviets would have used their tactical weapons, whereas they'd never respond to an invasion of Cuba by bombing the continental US. It would have become just another proxy battle in the cold war. 


 

The second theme I think is the military apparatus getting out of control and acting as a fundamentally bellicose force. The classic theory of civil-military relations comes from Samuel Huntington's book The Soldier and the State, in which Huntington lays out a vision of what civil-military relations should look like. This ideal is the military as a tool of the state that does the state's exact bidding, while the civilians leave the details of how to implement their vision operationally to the professionals in the officer corps. The classic case of this relationship breaking down was General Moltke the Younger going ahead with the Schlieffen plan in the first world war against the wishes of the Kaiser, invading neutral Belgium and Luxembourg and so bringing Britain into the war. The Soviet officer shooting down the American U2 and General Power informing the Soviets of the move to defcon 2 are classic examples of the military acting independently and aggressively. This is the fundamental problem of bureaucracy - to have an effective state there must be delegation, but delegation leaves open the possibility of independent actions against the wishes of the principal. However, the submarine and the U2 over Russia point to a deeper problem with Huntington's model of civil-military relations in the nuclear age. When tactical nuclear weapons are used it's not clear where the line between operational and strategic decisions falls. Finally, a failing that falls much more neatly within Huntington's paradigm was the decision by McNamara to institute a new way of signalling to enemy submarines. This is a classic case of civilians getting involved in operational matters that professional soldiers are much better equipped to deal with. Any submarine officer would have known a submarine captain would sooner die than be forced to surface. 


Finally, intelligence failures form the backdrop of the whole crisis. The first and probably most egregious was not seeing that Castro was a nationalist, not a communist, and really would have allied himself with the US had they accepted his initial advances. The second was not understanding that Castro had broad support from the Cuban people in his capacity as a nationalist liberator, and that a pretty transparently US-backed invasion would not endear the US to the Cubans. The Bay of Pigs debacle was the inciting incident for bringing the missiles to Cuba. The next in this list of failures was failing to see that missiles were being installed in Cuba, despite it taking over a month and the missile sites having extremely poor cover from the natural environment. This one, to be fair, seems much harder to solve and much less obvious at the time - photo interpretation is extremely hard! Finally, we have the failure to realise that there were nuclear-tipped missiles both on Soviet submarines and on Cuba. Unlike the missiles being moved to Cuba, the use of tactical missiles wasn't a radically new tactic, and this seems like a classic case where probabilistic thinking could have led to a better-informed US response. I think the Cuban missile crisis is the best example I know of where an organisation that could have feasibly used Tetlock-style forecasting could have reduced existential risk. 




Against Time in Agent Models

Published on May 13, 2022 7:55 PM GMT

When programming distributed systems, we always have many computations running in parallel. Our servers handle multiple requests in parallel, perform read and write operations on the database in parallel, etc.

The prototypical headaches of distributed programming involve multiple processes running in parallel, each performing multiple read/write operations on the same database fields. Maybe some database field says “foo”, and process 1 overwrites it with “bar”. Process 2 reads the field - depending on the timing, it may see either “foo” or “bar”. Then process 2 does some computation and writes another field - for instance, maybe it sees “foo” and writes {“most_recent_value”: “foo”} to a cache.  Meanwhile, process 1 overwrote “foo” with “bar”, so it also overwrites the cache with {“most_recent_value”: “bar”}. But these two processes are running in parallel, so these operations could happen in any order - including interleaving. For instance, the order could be:

  1. Process 2 reads “foo”
  2. Process 1 overwrites “foo” with “bar”
  3. Process 1 overwrites the cache with {“most_recent_value”: “bar”}
  4. Process 2 overwrites the cache with {“most_recent_value”: “foo”}

… and now the cached value no longer matches the value in the database; our cache is broken.
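Here is a minimal sketch of that race in Python, with plain dictionaries standing in for the database and the cache. The names and the random delays are illustrative, not a faithful model of any particular system.

```python
# Minimal sketch of the race described above: two "processes" (threads) touching
# the same database field and cache entry in an unconstrained order.
import threading, time, random

db = {"field": "foo"}
cache = {}

def process_1():
    db["field"] = "bar"                 # overwrite "foo" with "bar"
    time.sleep(random.random() * 0.01)  # arbitrary scheduling delay
    cache["most_recent_value"] = "bar"  # update the cache

def process_2():
    value = db["field"]                 # may read "foo" or "bar"
    time.sleep(random.random() * 0.01)
    cache["most_recent_value"] = value  # may clobber process 1's cache write

t1 = threading.Thread(target=process_1)
t2 = threading.Thread(target=process_2)
t1.start(); t2.start(); t1.join(); t2.join()

# Depending on the interleaving, the cache may now say "foo" while the DB says "bar".
print(db["field"], cache["most_recent_value"])
```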

One of the main heuristics for thinking about this sort of problem in distributed programming is: there is no synchronous time. What does that mean?

Well, in programming we often picture a “state-update” model: the system has some state, and at each timestep the state is updated. The update rule is a well-defined function of the state; every update happens at a well-defined time. This is how each of the individual processes works in our example: each executes two steps in a well-defined order, and each step changes the state of the system.

Single process: a well-defined sequence of steps, each updating the state.

But with multiple processes in parallel, this state-update model no longer works. In our example, we can diagram our two processes like this:

Each process has its own internal “time”: the database read/write happens first, and the cache overwrite happens second. But between processes, there is no guaranteed time-ordering. For instance, the first step of process 1 could happen before all of process 2, in between the steps of process 2, or after all of process 2.

Two processes: many possible time-orderings of the operations.

We cannot accurately represent this system as executing along one single time-dimension. Proof:

  • Step 1 of process 1 is not guaranteed to happen either before or after step 1 of process 2; at best we could represent them as happening “at the same time”
  • Step 2 of process 1 is also not guaranteed to happen either before or after step 1 of process 2; at best we could represent them as happening “at the same time”
  • … but step 2 of process 1 is unambiguously after step 1 of process 1 in time, so the two steps can’t happen at the same time.

In order to accurately represent this sort of thing, it has to be possible for one step to be unambiguously after another, even though both of them are neither before nor after some third step.

The “most general” data structure to represent such a relationship is not a one-dimension “timeline” (i.e. total order), but rather a directed acyclic graph (i.e. partial order). That’s how time works in distributed systems: it’s a partial order, not a total order. A DAG, not a timeline. That DAG goes by many different names - including computation DAG, computation circuit, or causal model.
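As a small illustration, here is one way to represent that partial order in code: a DAG of events where "unambiguously before" just means "there is a path". The event names are made up to match the example above.

```python
# Happens-before as a partial order: step A is "before" step B only if there is
# a path A -> B in the DAG. Otherwise the two steps are concurrent.
# (Toy sketch; event names are illustrative.)

edges = {
    "p1_write_db":    ["p1_write_cache"],  # process 1's internal order
    "p1_write_cache": [],
    "p2_read_db":     ["p2_write_cache"],  # process 2's internal order
    "p2_write_cache": [],
}

def happens_before(a, b):
    """True iff a is unambiguously before b, i.e. a path a -> b exists."""
    stack = [a]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt == b:
                return True
            stack.append(nxt)
    return False

print(happens_before("p1_write_db", "p1_write_cache"))  # True: same process
print(happens_before("p1_write_db", "p2_read_db"))      # False: concurrent
print(happens_before("p2_read_db", "p1_write_db"))      # False: concurrent
```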

Beyond Distributed Programming

The same basic idea carries over to distributed systems more generally - i.e. any system physically spread out in space, with lots of different stuff going on in parallel. In a distributed system, “time” is a partial order, not a total order.

In the context of embedded agents: we want to model agenty systems which are “made of parts”, i.e. the agent is itself a system physically spread out in space with lots of different stuff going on in parallel. Likewise, the environment is made of parts. Both are distributed systems.

This is in contrast to state-update models of agency. In a state-update model, the environment has some state, the agent has some state, and at each timestep their states update. The update rule is a well-defined function of state; every update happens at a well-defined time.

Instead of the state-update picture, I usually picture an agent and its environment as a computation DAG (aka circuit aka causal model), where each node is a self-contained local computation. We carve off some chunk of this DAG to call “the agent”.

 

The obvious “Cartesian boundary” - i.e. the interface between agent and environment - is just a Markov blanket in the DAG (i.e. a cut which breaks the graph into two parts). That turns out to be not-quite-right, but it’s a good conceptual starting point.
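As a rough sketch of what "carving off a chunk of the DAG" looks like (illustrative node names, and only the simple cut-edges version, not a full Markov blanket treatment):

```python
# Sketch: carve a computation DAG into "agent" and "environment" and list the
# edges that cross the cut, which play the role of the interface/boundary.
# (The example graph and node names are illustrative.)

dag = {                       # node -> children
    "env_input":  ["sensor"],
    "sensor":     ["belief"],
    "belief":     ["action"],
    "action":     ["env_effect"],
    "env_effect": [],
}
agent_nodes = {"sensor", "belief", "action"}

boundary = [(u, v) for u, children in dag.items()
            for v in children
            if (u in agent_nodes) != (v in agent_nodes)]

print(boundary)  # [('env_input', 'sensor'), ('action', 'env_effect')]
```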

Main takeaway: computational DAGs let us talk about agents without imposing a synchronous notion of "time" or "state updates", so we can play well with distributed systems.




Agency As a Natural Abstraction

Published on May 13, 2022 6:02 PM GMT

Epistemic status: Speculative attempt to synthesize findings from several distinct approaches to AI theory.

Disclaimer: The first three sections summarize some of Chris Olah's work on interpretability and John Wentworth's Natural Abstractions Hypothesis, then attempt to draw connections between them. If you're already familiar with these subjects, you can probably skip all three parts.

Short summary: When modelling a vast environment where simple rules result in very complex emergent rules/behaviors (math, physics...), it's computationally efficient to build high-level abstract models of this environment. Basic objects in such high-level models often behave very unlike basic low-level objects, requiring entirely different heuristics and strategies. If the environment is so complex you build many such models, it's computationally efficient to go meta, and build a higher-level abstract model of building and navigating arbitrary world-models. This higher-level model necessarily includes the notions of optimization and goal-orientedness, meaning that mesa-optimization is the natural answer to any "sufficiently difficult" training objective. All of this has various degrees of theoretical, empirical, and informal support.

1. The Universality Hypothesis

One of the foundations of Chris Olah's approach to mechanistic interpretability is the Universality Hypothesis. It states that neural networks are subject to convergence — that they tend to learn to look for similar patterns in the training data, and to chain up the processing of these patterns in similar ways.

The prime example of this effect is CNNs. If trained on natural images (even from different datasets), the first convolution layer reliably learns Gabor filters and color-contrast detectors, and later layers show some convergence as well:

Analogous features across CNNs. Source.

It's telling that these features seem to make sense to us, as well — that at least one type of biological neural network also learns similar features. (Gabor filters, for example, were known long before modern ML models.) It's the main reason to feel optimistic about interpretability at all — it's plausible that the incomprehensible-looking results of matrix multiplications will turn out to be not so incomprehensible, after all.

It's telling when universality doesn't hold, as well.

Understanding RL Vision attempts to interpret an agent trained to play CoinRun, a simple platformer game. CoinRun's levels are procedurally generated, can contain deadly obstacles in the form of buzzsaws and various critters, and require the player to make their way to a coin.

Attempting to use feature visualization on the agent's early convolutional layers produces complete gibberish, lacking even Gabor filters:

Comparison between features learned by a CNN (left) and a RL agent (right).

It's nonetheless possible to uncover a few comprehensible activation patterns via the use of different techniques:

Visualization of positive and negative attributions. I strongly recommend checking out the paper if you haven't already, it has rich interactivity.

The agent learns to associate buzzsaws and enemies with decreased chances of successfully completing a level, and could be seen to pick out coins and progression-relevant level geometry.

All of these comprehensible features, however, reside on the third convolutional layer. None of the other four convolutional layers, or the two fully-connected layers, contain anything that makes sense. The authors note the following:

Interestingly, the level of abstraction at which [the third] layer operates – finding the locations of various in-game objects – is exactly the level at which CoinRun levels are randomized using procedural generation. Furthermore, we found that training on many randomized levels was essential for us to be able to find any interpretable features at all.

At this point, they coin the Diversity Hypothesis:

Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).

In retrospect, it's kind of obvious. The agent would learn whatever improves its ability to complete levels, and only that. It needs to know how to distinguish enemies and buzzsaws and coins from each other, and to tell these objects apart from level geometry and level backgrounds. However, any buzzsaw looks like any other buzzsaw and behaves like any other buzzsaw and unlike any coin — the agent doesn't need a complex "visual cortex" to sort them out. Subtle visual differences don't reveal subtle differences in function, and the wider visual context is irrelevant as well. Learning a few heuristics for picking out the handful of distinct objects the game actually has more than suffices. The same goes for the higher-level patterns, the rules and physics of the game: they remain static.

Putting this together with (the strong version of) the Universality Hypothesis, we get the following: ML models can be expected to learn interpretable features and information-processing patterns, but only if they're exposed to enough objective-relevant diversity across these features.

If this condition isn't fulfilled, they'll jury-rig some dataset-specialized heuristics that'd be hard to untangle. But if it is, they'll likely cleave reality along the same lines we do, instead of finding completely alien abstractions.

John Wentworth's theory of abstractions substantiates the latter.

(For completeness' sake, I should probably mention Chris Olah et al.'s more recent work on transformers, as well. Suffice to say that it also uncovers some intuitively-meaningful information-processing patterns that reoccur across different models. Elaborating on this doesn't add much to my point, though.

One particular line stuck with me, however. When talking about a very simple one-layer attention-only transformer, and some striking architectural choices it made, they note that "transformers desperately want to do meta-learning." Consider this to be... ominous foreshadowing.)

2. The Natural Abstraction Hypothesis

Real-life agents are embedded in the environment, which comes with a host of theoretical problems. For example, it implies they're smaller than the environment, which means they physically can't hold its full state in their head. To navigate it anyway, they'd need to assemble some simpler, lower-dimensional model of it. How can they do it? Is there an optimal, "best" way to do it?

The Natural Abstraction Hypothesis aims to answer this question. It's based on the idea that, for all the dizzying complexity that real-life objects have on the level of fundamental particles, most of the information they contain is only relevant — and, indeed, only accessible — locally.

Consider the door across the room from you. The details of the fluctuations of the individual atoms comprising it never reach you; they are completely wiped out by the environment on the way. For the same reason, they don't matter. The information that reaches you, the information that's relevant to you and could impact you, consists only of the high-level summaries of these atoms' averaged-out behavior, consistent across time: whether the door is open or closed, what material it is, its shape.

That's what natural abstractions are: high-level summaries of the low-level environment that contain only the information that actually reaches far-away objects.

Graphical model representing interactions between objects X and Y across some environment Z. f(X) is the abstract model of X, containing only whatever information wasn't wiped out by Z.
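A toy numerical illustration of this point (my own example, not taken from the natural abstractions write-ups): simulate a "door" made of many particles, add heavy noise on the way to a far-away observer, and note that only the redundant high-level summary survives.

```python
# Toy illustration: a "door" made of many particles. A far-away observer sees
# each particle only through heavy noise, so per-particle details wash out,
# while the redundant high-level summary survives the channel.
import random

random.seed(0)
n_particles = 10_000
door_temperature = 20.0  # the high-level summary we'd like to recover
particles = [door_temperature + random.gauss(0, 1) for _ in range(n_particles)]

# Huge per-particle noise on the way to the observer.
noisy_channel = [x + random.gauss(0, 50) for x in particles]

# Any single particle tells the observer almost nothing...
print("one noisy particle:", round(noisy_channel[0], 1))
# ...but the redundant summary is still recoverable from the aggregate.
print("recovered summary: ", round(sum(noisy_channel) / n_particles, 1))
```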

Of course, if you go up to the door with an electron microscope and start making decisions based on what you see, the information that reaches you and is relevant to you would change. Similarly, if you're ordering a taxi to your house, whether that door is open or closed is irrelevant to the driver getting directions. That's not a problem: real-life agents are also known for fluidly switching between a multitude of abstract models of the environment, depending on the specific problem they're working through.

"Relevant to you", "reaches you", etc., are doing a lot of work here. Part of the NAH's conceit is actually eliminating this sort of subjective terminology, so perhaps I should clean it up too.

First, we can note that the information that isn't wiped out is whatever information is represented with high redundancy in the low-level implementation of whatever object we care about — e.g., an overwhelming number of door-particles emit the same information about the door's material. In this manner, any sufficiently homogeneous/stable chunk of low-level reality corresponds to a valid abstraction.

An additional desideratum for a good abstraction is global redundancy. There are many objects like your door in the world. This means you can gather information on your door from other places, or gather information about other places by learning that they have "a door". This also makes having an internal symbol for "a door" useful.

Putting these together, we can see how we can build entire abstraction layers: by looking for objects or patterns in the environment that are redundant both locally and globally, taking one type of such objects as a "baseline", then cleaving reality such that none of the abstractions overlap and the interactions between them are mediated by noisy environments that wipe out most of the detailed information about them.

Fundamental physics, chemistry, the macro-scale environment, astronomy, and also geopolitics or literary theory — we can naturally derive all of them this way.

The main takeaway from all of this is, good abstractions/high-level models are part of the territory, not the map. There's some degree of subjectivity involved — a given agent might or might not need to make use of the chemistry abstraction for whatever goal it pursues, for example — but the choice of abstractions isn't completely arbitrary. There's a very finite number of good high-level models.

So suppose the NAH is true; it certainly looks promising to me. It suggests the optimal way to model the environment given some "reference frame" — your scale, your task, etc. Taking the optimal approach to something is a convergent behavior. Therefore, we should expect ML models to converge towards similar abstract models when exposed to the same environment and given the same type of goal.

Similar across ML models, and familiar to us.

3. Natural Abstractions Are Universal

Let's draw some correspondences here.

Interpretable features are natural abstractions are human abstractions.

The Diversity Hypothesis suggests some caveats for the convergence towards natural abstractions. A given ML model would only learn the natural abstractions it has to learn, and no more. General performance in some domain requires learning the entire corresponding abstraction layer, but if a model's performance is evaluated only on some narrow task within that domain, it'll just overfit to that task. For example:

  • InceptionV1 was exposed to a wide variety of macro-scale objects, and was asked to identify all of them. Naturally, it learned a lot of the same abstractions we use.
  • The CoinRun agent, on the other hand, was exposed to a very simple toy environment. It learned all the natural abstractions which that environment contained — enemies and buzzsaws and the ground and all — but only them. It didn't learn a general "cleave the visual input into discrete objects" algorithm.

There are still reasons to be optimistic about interpretability. For one, any interesting AI is likely to develop general competence across many domains. It seems plausible, then, that the models we should be actually concerned about will be more interpretable than the contemporary ones, and also more similar to each other.

As an aside, I think this is all very exciting in general. These are quite different approaches, and it's very promising that they're both pointing to the same result. Chris' work is very "bottom-up" — taking concrete ML models, noticing some similarities between them, and suggesting theoretical reasons for that. Conversely, John's work is "top-down" — from mathematical theory to empirical predictions. The fact that they seem poised to meet in the middle is encouraging.

4. Diverse Rulesets

Let's consider the CoinRun agent again. It was briefly noted that its high-level reasoning wasn't interpretable either. The rules of the game never changed, it wasn't exposed to sufficient diversity across rulesets, so it just learned a bunch of incomprehensible CoinRun-specific heuristics.

What if it were exposed to a wide variety of rulesets, however? Thousands of them, even? It can just learn specialized heuristics for every one of them, of course, plus a few cues for when to use which. But that has to get memory-taxing at some point. Is there a more optimal way?

We can think about it in terms of natural abstractions. Suppose we train 1,000 separate agents instead, each of them trained only on one game from our dataset, plus a "manager" model that decides which agent to use for which input. This ensemble would have all the task-relevant skills of the initial 1,000-games agent; the 1,000-games agent would be a compressed summary of these agents. A natural abstraction over them, one might say.

A natural abstraction is a high-level summary of some object that ignores its low-level details and only preserves whatever information is relevant to some other target object. The information it ignores is information that'd be wiped out by environment noise on the way from the object to the target.

Our target is the loss function. Our environment is the different training scenarios, with their different rulesets. The object we're abstracting over is the combination of different specialized heuristics for good performance on certain rulesets.[1]

The latter is the commonality across the models, the redundant information we're looking for: their ability to win. The noisy environment of the fluctuating rules would wipe out any details about the heuristics they use, leaving only the signal of "this agent performs well". The high-level abstraction, then, would be "something that wins given a ruleset". Something that outputs actions that lead to low loss no matter the environment it's in. Something that, given some actions it can take, always picks those that lead to low loss because they lead to low loss.

Consequentialism. Agency. An optimizer.

5. Risks from Learned Optimization Is Always Relevant

This result essentially restates some conclusions from Risks from Learned Optimization. That paper specifically discusses the conditions in which a ML model is likely to become a mesa-optimizer (i. e., learn runtime optimization) vs. remain a bundle of specialized heuristics that were hard-coded by the base optimizer (the training process). In particular:

[S]earch—that is, optimization—tends to be good at generalizing across diverse environments, as it gets to individually determine the best action for each individual task instance. There is a general distinction along these lines between optimization work done on the level of the learned algorithm and that done on the level of the base optimizer: the learned algorithm only has to determine the best action for a given task instance, whereas the base optimizer has to design heuristics that will hold regardless of what task instance the learned algorithm encounters. Furthermore, a mesa-optimizer can immediately optimize its actions in novel situations, whereas the base optimizer can only change the mesa-optimizer's policy by modifying it ex-post. Thus, for environments that are diverse enough that most task instances are likely to be completely novel, search allows the mesa-optimizer to adjust for that new task instance immediately.

For example, consider reinforcement learning in a diverse environment, such as one that directly involves interacting with the real world. We can think of a diverse environment as requiring a very large amount of computation to figure out good policies before conditioning on the specifics of an individual instance, but only a much smaller amount of computation to figure out a good policy once the specific instance of the environment is known. We can model this observation as follows.

Suppose an environment is composed of N different instances, each of which requires a completely distinct policy to succeed in. Let P be the optimization power (measured in bits) applied by the base optimizer, which should be approximately proportional to the number of training steps.
Then, let x be the optimization power applied by the learned algorithm in each environment instance and f(x) the total amount of optimization power the base optimizer must put in to get a learned algorithm capable of performing that amount of optimization. We will assume that the rest of the base optimizer's optimization power, P−f(x), goes into tuning the learned algorithm's policy. Since the base optimizer has to distribute its tuning across all N task instances, the amount of optimization power it will be able to contribute to each instance will be (P−f(x))/N, under the previous assumption that each instance requires a completely distinct policy. On the other hand, since the learned algorithm does all of its optimization at runtime, it can direct all of it into the given task instance, making its contribution to the total for each instance simply x.

Thus, if we assume that, for a given P, the base optimizer will select the value of x that maximizes the minimum level of performance, and thus the total optimization power applied to each instance, we get

x∗ = argmax_x [(P−f(x))/N + x].

As one moves to more and more diverse environments—that is, as N increases—this model suggests that x will dominate (P−f(x))/N, implying that mesa-optimization will become more and more favorable. Of course, this is simply a toy model, as it makes many questionable simplifying assumptions. Nevertheless, it sketches an argument for a pull towards mesa-optimization in sufficiently diverse environments.
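To make the toy model concrete, here is a minimal numerical sketch (my own illustration; the quadratic cost f(x) = x^2 and the specific numbers are assumptions, not something from the framework itself). It grid-searches for the x that maximizes the per-instance optimization power (P−f(x))/N + x and prints how the optimum shifts as N grows.

```python
import numpy as np

def per_instance_power(x, P, N, f):
    """Optimization power applied to each task instance: (P - f(x)) / N + x."""
    return (P - f(x)) / N + x

def best_x(P, N, f, x_grid):
    """Grid-search for the x that maximizes per-instance optimization power."""
    values = [per_instance_power(x, P, N, f) for x in x_grid]
    return x_grid[int(np.argmax(values))]

if __name__ == "__main__":
    f = lambda x: x ** 2                     # assumed cost of a more capable mesa-optimizer
    P = 1000.0                               # total optimization power of the base optimizer
    x_grid = np.linspace(0.0, 31.0, 3101)    # keep f(x) <= P so the tuning budget stays non-negative
    for N in (1, 10, 100, 1000):
        print(f"N = {N:4d}  ->  optimal x = {best_x(P, N, f, x_grid):.1f}")
```

Under the quadratic assumption the optimum can also be read off analytically: setting the derivative −2x/N + 1 to zero gives x∗ = N/2 (capped where f(x) exhausts P), so the favored amount of runtime optimization grows with the diversity N of the environment.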

As an illustrative example, consider biological evolution. The environment of the real world is highly diverse, resulting in non-optimizer policies directly fine-tuned by evolution—those of plants, for example—having to be very simple, as evolution has to spread its optimization power across a very wide range of possible environment instances. On the other hand, animals with nervous systems can display significantly more complex policies by virtue of being able to perform their own optimization, which can be based on immediate information from their environment. This allows sufficiently advanced mesa-optimizers, such as humans, to massively outperform other species, especially in the face of novel environments, as the optimization performed internally by humans allows them to find good policies even in entirely novel environments.

6. Multi-Level Models

Now let's consider the issue of multi-level models. They're kind of like playing a thousand different games, no?

It's trivially true for the real world. Chemistry, biology, psychology, geopolitics, cosmology — it's all downstream of fundamental physics, yet the objects at any level behave very unlike the objects at a different level.

But it holds true even for more limited domains.

Consider building up all of mathematics from the ZFC axioms. As with physics, we start from some "surface" set of rules. We notice that the objects defined by them can be assembled into more complex structures, which can be assembled into more complex structures still, and so on. But at some point, performing direct operations over these structures becomes terribly memory-taxing. We don't think about the cosine function in terms of ZFC axioms, for example; we think about it as its own object, with its own properties. We build an abstraction, a high-level summary that reduces its internal complexity to an input -> output mapping.

When doing trigonometry in general, we're working with an entirely new abstraction layer, populated by many abstractions over terribly complex structures built out of axiomatic objects. Calculus, probability theory, statistics, topology — every layer of mathematics is a minor abstraction layer in its own right. And in a sense, every time we prove a theorem or define a function we'd re-use, we add a new abstract object.
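As a loose illustration of this layering in code (entirely my own, hypothetical example): a "low-level" definition of cosine built from more primitive operations, and a "high-level" layer that treats it purely as an input -> output mapping with known properties.

```python
import math

# "Low-level" layer: cosine assembled from more primitive operations (a truncated
# Taylor series), standing in for a structure built out of lower-level objects.
def cos_lowlevel(x: float, terms: int = 20) -> float:
    return sum((-1) ** n * x ** (2 * n) / math.factorial(2 * n) for n in range(terms))

# "High-level" layer: uses cos as an opaque object with known properties,
# never reasoning about how it is built internally.
def pythagorean_identity_holds(x: float, cos=cos_lowlevel) -> bool:
    sin_x = cos(math.pi / 2 - x)   # derive sin from a known property of the abstraction
    return abs(cos(x) ** 2 + sin_x ** 2 - 1.0) < 1e-9

print(pythagorean_identity_holds(0.7))   # True, with no unfolding down to the axioms
```

The point is only the shape of the layering: each new layer re-uses objects from the one below as black boxes, which is what keeps the whole edifice memory-efficient.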

The same broad thought applies to any problem domain where it's possible for sufficiently complex structures to arise. It's memory-efficient to build multiple abstract models of such environments, and then abstract over the heuristics for these models.

But it gets worse. When we're navigating an environment with high amounts of emergence, we don't know how many different rulesets we'd need to learn. We aren't exposed to 1,000 games all at once. Instead, as we're working on some problem, we notice that the game we're playing conceals higher-level (or lower-level) rules, which conceal another set of rules, and so on. Once we get started, we have no clue when that process would bottom out, or what rules we may encounter.

Heuristics don't cut it. To do well, you need general competence with any ruleset, and the ability to build natural abstractions for a novel environment on your own. And if you're teaching yourself to play by completely novel rules, how can you even tell whether you're performing well without an inner notion of a goal to pursue?

(Cute yet non-rigorous sanity-check: How does all of this hold up in the context of human evolution? Surprisingly well, I think. The leading hypotheses for the evolution of human intelligence tend to tie it to society: The Cultural Intelligence Hypothesis suggests that higher intelligence was incentivized because it allowed better transmission of cultural knowledge, such as how to build specialized tools or execute incredibly tricky hunting strategies. The Machiavellian Intelligence Hypothesis points to the political scheming among Homo sapiens themselves as the cause.

Either is kind of like being able to adapt to new rulesets on the fly, and build new abstractions yourself. Proving a lemma is not unlike prototyping a new weapon, or devising a plot that abuses ever-shifting social expectations: all involve iterating on a runtime-learned abstract environment to build an even more complex novel structure in the pursuit of some goal.)

7. A Grim Conclusion

Which means that any sufficiently powerful AI is going to be a mesa-optimizer.

I suspect this is part of what Eliezer is talking about when he's being skeptical of tool-AI approaches. Navigating any sufficiently difficult domain, any domain in which structures could form that are complex enough to suggest many many layers of abstraction, is astronomically easier if you're an optimizer. It doesn't matter if your AI is only taught math, if it's a glorified calculator — any sufficiently powerful calculator desperately wants to be an optimizer.

I suspect it's theoretically possible to deny that desperate desire, somehow. At least for some tasks. But it's going to be very costly — the cost of cramming specialized heuristics for 1,000 games into one agent instead of letting it generalize, the cost of setting x to zero in the mesa-optimizer equation while N skyrockets, the cost of forcing your AI to use the low-level model of the environment directly instead of building natural abstractions. You'd need vastly more compute and/or data to achieve performance on par with naively-trained mesa-optimizers (for a given tech level)[2].

And then it probably won't be any good anyway. A freely-trained 1,000-games agent would likely be general enough to play the 1,001st game without additional training. 1,000 separately-trained agents with a manager? Won't generalize, explicitly by design. Similarly, any system we force away from runtime optimization won't be able to discover/build new abstraction layers on its own; it'd only be able to operate within the paradigms we already know. Which may or may not be useful.

The bottom line: mesa-optimizers will end the world long before tool AIs can save us.

  1. ^

    I feel like I'm abusing the terminology a bit, but I think it's right. Getting a general solution as an abstraction over a few specific ones is a Canonical Example, after all: the "1+1=2*1" & "2+2=2*2" => "n+n=2*n" bit.

  2. ^

    I'm put in mind of gwern's/nostalgebraist's comparison with "cute algorithms that solve AI in some theoretical sense with the minor catch of some constant factors which require computers bigger than the universe". As in, avoiding mesa-optimization for sufficiently complex problems may be "theoretically possible" only in the sense that it's absolutely impossible in practice.




"Tech company singularities", and steering them to reduce x-risk

May 13, 2022 - 20:24
Published on May 13, 2022 5:24 PM GMT

The purpose of this post (also available on the EA Forum) is to share an alternative notion of “singularity” that I’ve found useful in timelining/forecasting.

  • A fully general tech company is a technology company with the ability to become a world-leader in essentially any industry sector, given the choice to do so — in the form of agreement among its Board and CEO — with around one year of effort following the choice.

Notice here that I’m focusing on a company’s ability to do anything another company can do, rather than an AI system's ability to do anything a human can do.  Here, I’m also focusing on what the company can do if it chooses rather than what it actually ends up choosing to do.  If a company has these capabilities and chooses not to use them — for example, to avoid heavy regulatory scrutiny or risks to public health and safety — it still qualifies as a fully general tech company.

This notion can be contrasted with the following:

  • Artificial general intelligence (AGI) refers to cognitive capabilities fully generalizing those of humans.
  • An autonomous AGI (AAGI) is an autonomous artificial agent with the ability to do essentially anything a human can do, given the choice to do so — in the form of an autonomously/internally determined directive — and an amount of time less than or equal to that needed by a human.

Now, consider the following two types of phase changes in tech progress:

  1. A tech company singularity is a transition of a technology company into a fully general tech company.  This could be enabled by safe AGI (almost certainly not AAGI, which is unsafe), or it could be prevented by unsafe AGI destroying the company or the world.
  2. An AI singularity is a transition from having merely narrow AI technology to having AGI technology.

I think the tech company singularity concept, or some variant of it, is important for societal planning, and I’ve written predictions about it before, here:

  • 2021-07-21 — prediction that a tech company singularity will occur between 2030 and 2035
  • 2022-04-11 — updated prediction that a tech company singularity will occur between 2027 and 2033.

A tech company singularity as a point of coordination and leverage

The reason I like this concept is that it gives an important point of coordination and leverage that is not AGI, but which interacts in important ways with AGI.  Observe that a tech company singularity could arrive

  1. before AGI, and could play a role in
    a. preventing AAGI, e.g., through supporting and enabling regulation;
    b. enabling AGI but not AAGI, such as if tech companies remain focussed on providing useful/controllable products (e.g., PaLM, DALL-E);
    c. enabling AAGI, such as if tech companies allow experiments training agents to fight and outthink each other to survive.
  2. after AGI, such as if the tech company develops safe AGI, but not AAGI (which is hard to control, doesn't enable the tech company to do stuff, and might just destroy it).

Points (1a) and (1b) are, I think, humanity’s best chance for survival.  Moreover, I think there is some chance that the first tech company singularity could come before the first AI singularity, if tech companies remain sufficiently oriented on building systems that are intended to be useful/usable, rather than systems intended to be flashy/scary.

How to steer tech company singularities?

The above suggests an intervention point for reducing existential risk: convincing a mix of

  • scientists
  • regulators
  • investors, and
  • the public

… to shame tech companies for building useless/flashy systems (e.g., autonomous agents trained in evolution-like environments to exhibit survival-oriented intelligence), so they remain focussed on building usable/useful systems (e.g., DALL-E, PaLM) preceding and during a tech company singularity.  In other words, we should try to steer tech company singularities toward developing comprehensive AI services (CAIS) rather than AAGI.

How to help steer scientists away from AAGI: 

  • point out the relative uselessness of AAGI systems, e.g., systems trained to fight for survival rather than to help human overseers;
  • appeal to the badness of nuclear weapons, which are — after detonation — the uncontrolled versions of nuclear reactors.
  • appeal to the badness of gain-of-function lab leaks, which are — after getting out — the uncontrolled versions of pathogen research.

How to convince the public that AAGI is bad: 

  • this is already somewhat easy; much of the public is already scared of AI because they can’t control it.
  • do not make fun of the public or call people dumb for fearing things they cannot control; things you can’t control can harm you, and in the case of AGI, people are right to be scared.

How to convince regulators that AAGI is bad:

  • point out that uncontrollable autonomous systems are mainly only usable for terrorism
  • point out the obvious fact that training things to be flashy (e.g., by exhibiting survival instincts) is scary and destabilizing to society.
  • point out that many scientists are already becoming convinced of this (they are)

How to convince investors that AAGI is bad:

  • point out the uselessness and badness of uncontrollable AGI systems, except for being flashy/scary;
  • point out that scientists (potential hires) are already becoming convinced of this;
  • point out that regulators should, and will, be suspicious of companies using compute to train uncontrollable autonomous systems, because of their potential to be used in terrorism.

Speaking personally, I have found it fairly easy to make these points since around 2016.  Now, with the rapid advances in AI we’ll be seeing from 2022 onward, it should be easier.  And, as Adam Scherlis (sort of) points out [EA Forum comment], we shouldn't assume that no one new will ever care about AI x-risk, especially as AI x-risk becomes more evidently real.  So, it makes sense to re-try making points like these from time to time as discourse evolves.

Summary

In this post, I introduced the notion of a "tech company singularity", discussed how the idea might be usable as an important coordination and leverage point for reducing x-risk, and gave some ideas for convincing others to help steer tech company singularities away from AAGI.

None of this is to say we'll be safe from AI risk; far from it. See, e.g., What Multipolar Failure Looks Like. Efforts to maintain cooperation on safety across labs and jurisdictions remain paramount, IMHO.

In any case, try on the "tech company singularity" concept and see if it does anything for you :)




An observation about Hubinger et al.'s framework for learned optimization

May 13, 2022 - 19:20
Published on May 13, 2022 4:20 PM GMT

The observations I make here have little consequence from the point of view of solving the alignment problem. If anything, they merely highlight the essential nature of the inner alignment problem. I will reject the idea that robust alignment, in the sense described in Risks From Learned Optimization, is possible at all. And I therefore also reject the related idea of 'internalization of the base objective', i.e. I do not think it is possible for a mesa-objective to "agree" with a base-objective or for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned.” I claim that whenever a learned algorithm is performing optimization, one needs to accept that an objective which one did not explicitly design is being pursued. At present, I refrain from attempting to propose my own adjustments to the framework, or to build on the existing literature or to develop my own theory. I am certainly not against doing any of those things, but they are things to possibly be pursued later; none of them is the purpose of this post.
 

To make my main point, I will introduce only a bare minimum of mathematical notation. We will show that a mesa-objective always has a different type signature to a base objective and that the default assumption ought to be that there is no way to compare them in general and certainly no general way to interpret what it means for them to ‘agree’. Suppose that an optimizer is searching through a space S of systems. At this time, I do not want to attempt to unpack what it means to 'search', but, naively, we can imagine that there is an objective function f : S → R, which determines something that we might call the 'search criterion'. The idea of course is that the optimizer is a system that is 'searching' through the set S and judging different points according to the criterion that higher values of f are better.

In the background, there is some 'task' and naively we can think of this as being represented by a 'task space' X which consists of all of the different possible 'presentations' or 'instances' of the task. For example, perhaps the task is choosing the next move in a game of Go or the next action in a real-time strategy video game. In these examples, a given  x∈X would represent a board position in Go, say, or a single snapshot of the game-state in the video game. Then, in general, given x∈X and s∈S, we can think that s(x) is the output of s on the task instance x or the action taken by s when presented with x (i.e. s(x) denotes the next board move in Go or the next action to be taken in the video game). So each element of S defines a map from the task space X to some kind of output space or space of possible actions, which we need not notate.

Now, it is possible that there exists m∈S which works in the following way: Whenever the output of m on an instance x of the task needs to be evaluated, i.e. whenever m(x) is computed, what happens is that m searches over another search space Σ and looks for elements that score highly according to some other objective function g:Σ→R. Whenever this is the case, we say that such an m∈S is a mesa-optimizer and that the original optimizer - the one that searches over S - is the base optimizer. Notice that in some way, elements of Σ must in turn correspond to outputs/actions, because given some x, the mesa-optimizer m conducts a search over Σ to determine what output m(x) is, but that is all just part of the internal workings of m and we need not 'know' or notate how this correspondence works. 
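As a purely schematic illustration of this setup (all names and numbers are my own, hypothetical choices), here is a minimal Python sketch: the base optimizer searches a tiny space S of candidate systems using f, while one candidate m answers each task instance x by running its own internal search over Σ using g.

```python
# Task instances (elements of X): here just numbers the system must respond to.
X = [0.3, 1.7, 2.5]

# A mesa-optimizer m: on each instance x it searches an internal space Sigma of
# candidate outputs, scoring them with its own objective g.
def g(sigma, x):
    """Mesa-objective: a criterion over Sigma (here: prefer outputs close to 2*x)."""
    return -abs(sigma - 2 * x)

def m(x):
    Sigma = [i * 0.1 for i in range(100)]              # internal search space
    return max(Sigma, key=lambda sigma: g(sigma, x))   # optimization performed at runtime

# A non-optimizer candidate: a fixed policy with no internal search.
def constant_policy(x):
    return 1.0

S = [constant_policy, m]   # the space of systems the base optimizer searches over

def f(s):
    """Base objective: a criterion over S (here: average score of s on the instances in X)."""
    return sum(-abs(s(x) - 2 * x) for x in X) / len(X)

# The base optimizer selects among whole systems using f; f never looks inside Sigma,
# and g never sees S.
best_system = max(S, key=f)
print(best_system is m)    # True in this toy setup
```

Nothing here is meant to capture how a real base optimizer (e.g. SGD) finds such an m; it only fixes the type structure used in the argument below.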

In Risks From Learned Optimization, Hubinger et al. write:
 

In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs.

So: The mesa-objective is the criterion that m is using in its search: It expresses the idea that higher values of g are better. And the base objective refers to the criterion that higher values of f are better.

Inner Alignment, Robust Alignment, and Pseudo Alignment

The domain of f is the space S  - the space of systems that the base optimizer is searching over (and which can be represented mathematically as a space of functions, each of which is from X to the output or 'action' space). On the other hand, the domain of g is Σ . As mentioned above, we might want to think of Σ as corresponding to (a subset of) the output space, but either way, a priori, there is nothing to suggest that S and Σ are not different spaces. The two objective functions used as criteria in these searches have different domains and it is not clear how to compare them. 

In Risks From Learned Optimization, it is written that "The problem posed by misaligned mesa-optimizers is... the gap between the base objective and the mesa-objective... We will call the problem of eliminating the base-mesa objective gap the inner alignment problem...". I think that they are absolutely right to point to the difference between the base objective and a mesa-objective as being the source of an important issue, but I find referring to it as a "gap", at least at the level of generality posited, to be somewhat misleading. We are not dealing with two objects that are in principle comparable but just so happen to be separated by a gap (a gap waiting to be narrowed by the correct clever idea, say). Instead, the difference, which is due to the different type signatures of the objective functions, is essential in character and rather means that they are, in general, incomparable. 
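To restate the type-signature point in code (a hedged, hypothetical sketch; the type names are mine): f and g are functions over different domains, so a generic notion of their 'agreement' on some data cannot even be written down without supplying an extra correspondence between S and Σ.

```python
from typing import Callable

class System: ...   # an element of S: a whole learned algorithm / policy
class Sigma: ...    # an element of the mesa-optimizer's internal search space

BaseObjective = Callable[[System], float]   # f : S -> R
MesaObjective = Callable[[Sigma], float]    # g : Sigma -> R

def agreement(f: BaseObjective, g: MesaObjective, data: list) -> float:
    # There is no generic way to fill this in: f consumes systems, g consumes candidate
    # outputs/actions. Any notion of "agreement" has to smuggle in an extra mapping
    # between the two domains, which the general framework does not supply.
    raise NotImplementedError("f and g have incomparable domains")
```

This only restates the observation above; it is not an argument that no such mapping could ever be specified for a particular, concrete training setup.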

Consider the definitions of robust alignment and pseudo alignment:

We will use the term robustly aligned to refer to mesa-optimizers with mesa-objectives that robustly agree with the base objective across distributions and the term pseudo-aligned to refer to mesa-optimizers with mesa-objectives that agree with the base objective on past training data, but not robustly across possible future data (either in testing, deployment, or further training).

What might it possibly mean to have mesa-optimizers with mesa-objectives that "agree" with the base objective on past training data or "across distributions"? Again, the base objective refers to a criterion used to select between different systems. How can a mesa-objective, a criterion that a particular one of these systems uses to select between different actions, 'agree' or 'disagree' with it on any particular set of data or "across distributions"? Without further development of the framework, or further explanation, it's impossible to know precisely what this could mean. Robust alignment seems at best to be a very odd, extreme case (where somehow we have ended up with something like f=g and/or S=Σ ?) and at worst simply impossible.

Later, Hubinger attempts to clarify the terminology in a separate post: Clarifying Inner Alignment Terminology. This attempt at clarification and increased rigour should obviously be encouraged, but it is immediately clear that some of the main definitions are still unsatisfactory. The last of the seven definitions is that of Inner Alignment:

Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

This version of the definition seems to turn crucially on the notion that a policy could be "impact aligned" with the base objective. Let us turn to Hubinger's own definition of "Impact Alignment", from the same post, to find out what this means precisely: 

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

It seems that we are only told what impact alignment means in the context of an agent and humanity. So we are still missing what seems to be the very core of this edifice: What does it really mean for a mesa-optimizer to be - in whatever is the appropriate sense of the word - 'aligned'? What could it mean for a mesa-objective to 'agree with' the base objective?

Internalization of the base objective

In the Deceptive Alignment post, the idea of “Internalization of the base objective” is introduced. Arguably this is the point at which one might expect the issues I have raised to be most fleshed out, because in highlighting the possibility of “internalization” of the base objective, i.e. that it is possible for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned,”  there is an implicit claim that robust alignment really can occur. So to understand this phenomenon, we might look for an explanation as to how this occurs. But the ensuing analysis is somewhat weak and vague, to the point that it is almost just a restatement of the claim that it purports to explain: 

information about the base objective flows into the learned algorithm via the optimization performed by the base optimizer—the base objective is built into the mesa-optimizer as it is adapted by the base optimizer.

I could try to give my own interpretation of what happens when information about the base objective “flows into” the learned algorithm “via” the optimization process, but I would be making something up that does not appear in the text. And what follows is really just a discussion of some possible ways by which the mesa-optimizer comes to be able to use information about the base objective (it could get information about the base objective directly via the base optimizer or it could get it from the task inputs). None of it goes towards alleviating the specific concerns laid out above and none of it really explains with any conviction how true “internalization” happens. Moreover, a footnote admits that in fact the two routes by which the mesa-optimizer may come to be able to use information about the base objective do not even neatly correspond to the dichotomy given by ‘internalization of the base objective’ vs. ‘modelling of the base objective’.
 

Conclusions

 My observations here run counter to any argument which suggests it is possible to 'close the gap' between the base and mesa objectives. As stated above, this suggests that the inner alignment problem has an essential nature: I claim that whenever mesa-optimization occurs, one needs to accept that internally, there is pressure towards a goal which one did not explicitly design.

Of course, a close reading of what has been said here really only shows that we cannot rely on the specific formalization I have used (though it amounts to little more than a few mathematical functions) while still maintaining the exact theoretical framework described in Risks From Learned Optimization. So either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment. Or it may be the case that the fault is with my formalization and that what I claim are conceptual issues are little more than notational or mathematical curiosities. If the latter is indeed the case, then at the very least we need to be explicit about whatever tacit assumptions have been made that imply that a formalization along the lines I have outlined cannot provide a permissible analysis. For example, I can certainly imagine that it may be possible to add in details on a case-by-case basis, or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them. Perhaps those who object to my view will claim that this is what is really going on in people's minds and that it just has not been spelled out. However, at present, I believe that at the level of generality that Risks From Learned Optimization strives for, we simply cannot speak of mesa-objectives ‘agreeing with’ or even really of being ‘adjusted towards’ base objectives.

Remarks

As a final set of remarks, I wanted to briefly discuss the general attitude I have taken here. One might read this and think: yes, this all seems reasonable, but since it does not address the alignment problem itself, what was the point? Or perhaps one might think that it could all have been avoided if only I had given the Risks From Learned Optimization posts a more charitable reading in the first place. Am I acting in bad faith? Surely I "get what they mean"? Indeed, often I do feel like I can see, or could guess, what the authors are getting at. Why, then, have I gone out of my way to take them at their word to such a great extent, just so I can point out inconsistencies?

I want to end by describing some general, if somewhat vague and half-baked, thoughts about this kind of theoretical/conceptual AI Alignment work, and hopefully this will help to answer the above questions. In my humble opinion, one of the things that this type of work ought to be 'worried about' is that it exists in a kind of no-man's land between, on the one hand, more traditional academic work in fields like computer science and philosophy and, on the other hand, more 'mainstream' ML Safety, shall we say. For a while I have been wondering whether or not this kind of theoretical alignment work is doomed to remain in this no-man's land, propped up by a few success stories but mostly fed by a steady stream of informal arguments, futurological speculation, and 'hand-waving' on blogs and comment sections. I of course do not fully know, but here are a couple of things that have come to mind when trying to think about this.

Firstly, when we have neither the luxury of mathematical proof nor the crutch of being backed up by working code and empirical results, it is even more important to subject arguments to a high level of scrutiny. There should be (and hopefully can be) a high bar of intellectual and academic rigour for theoretical/conceptual work in this area. It needs to strive to be as clean and clear as possible. And it's worth saying that one reason for this is so that it stands on its own two feet, so to speak, when interrogated outside of communities like this one.

Secondly, I feel it is important that the best arguments and ideas we have - and good critiques of them - appear 'in the literature'. I certainly don't advocate for a completely traditional model of dissemination and publication (there are many advantages to the Alignment Forum and the prevailing rationalist/EA/longtermist ecosystem and their ways of doing things), and of course many great ideas start out as hand-waving and speculation. But it will ultimately not be enough that some idea is 'generally known' in the online/EA alignment communities, or can be put together by combing through comment sections and the minds of the relevant people, if said idea never really is or cannot be 'written up' in a truly convincing way.

As I've said, these remarks are not fully fleshed out and further discussion here doesn't really seem appropriate. For now, the idea was to explain some of my motivation for taking the time to post something like this. All discussion and comments are welcome.

Discuss

Still possible to change username?

13 мая, 2022 - 16:41
Published on May 13, 2022 1:41 PM GMT

I could swear there used to be an option for changing one's username (I've done it before). Has this option been removed? Am I just too daft to find where to click? Or is it auto-disabled after you've done it once?



Discuss

Can moderators fix old sequences posts?

13 мая, 2022 - 15:30
Published on May 13, 2022 12:30 PM GMT

I'm re-reading the Sequences now and noticing some eye-opening things: so many posts have SEQ RERUN copies, which seem useless - they clutter up the link space while having very few of the comments they seem to be intended for. Can the moderators do something about this? Remove links to them, or maybe even delete them after moving their comments to the original posts. (I don't know whether the functionality for this exists, how morally acceptable it would be, or whether anyone but me is bothered by these problems - maybe nobody else needs this "fixed"?) I also notice that in the old entries the answers are not child comments, which creates terrible confusion when reading by karma, since it is unclear what the question was and where to find it. (Same requests and questions here)



Discuss

DeepMind is hiring for the Scalable Alignment and Alignment Teams

13 мая, 2022 - 15:17
Published on May 13, 2022 12:17 PM GMT

We are hiring for several roles in the Scalable Alignment and Alignment Teams at DeepMind, two of the subteams of DeepMind Technical AGI Safety trying to make artificial general intelligence go well.  In brief,

  • The Alignment Team investigates how to avoid failures of intent alignment, operationalized as a situation in which an AI system knowingly acts against the wishes of its designers.  Alignment is hiring for Research Scientist and Research Engineer positions.
  • The Scalable Alignment Team (SAT) works to make highly capable agents do what humans want, even when it is difficult for humans to know what that is.  This means we want to remove subtle biases, factual errors, or deceptive behaviour even if they would normally go unnoticed by humans, whether due to reasoning failures or biases in humans or due to very capable behaviour by the agents.  SAT is hiring for Research Scientist - Machine Learning, Research Scientist - Cognitive Science, Research Engineer, and Software Engineer positions.

We elaborate on the problem breakdown between Alignment and Scalable Alignment next, and discuss details of the various positions.

“Alignment” vs “Scalable Alignment”

Very roughly, the split between Alignment and Scalable Alignment reflects the following decomposition:

  1. Generate approaches to AI alignment – Alignment Team
  2. Make those approaches scale – Scalable Alignment Team

In practice, this means the Alignment Team has many small projects going on simultaneously, reflecting a portfolio-based approach, while the Scalable Alignment Team has fewer, more focused projects aimed at scaling the most promising approaches to the strongest models available.

Scalable Alignment’s current approach: make AI critique itself

Imagine a default approach to building AI agents that do what humans want:

  1. Pretrain on a task like “predict text from the internet”, producing a highly capable model such as Chinchilla or Flamingo.
  2. Fine-tune into an agent that does useful tasks, as evaluated by human judgements.

There are several ways this could go wrong:

  1. Humans are unreliable: The human judgements we train against could be flawed: we could miss subtle factual errors, use biased reasoning, or have insufficient context to evaluate the task.
  2. The agent’s reasoning could be hidden: We want to know not just what the system is doing but why, both because that might reveal something we don’t like, and because we expect good reasoning to generalize better to other situations.
  3. Even if the agent is reasoning well, it could fail in other situations: Even if the reasoning is correct this time, the AI could fail to generalize correctly to other situations.

Our current plan to address these problems is (in part):

  1. Give humans help in supervising strong agents: On the human side, provide channels for oversight and advice from peers, experts in various domains, and broader society.  On the ML side, agents should explain their behaviour and reasoning, argue against themselves when wrong, and cite relevant evidence.
  2. Align explanations with the true reasoning process of the agent: Ensure that agents are able and incentivized to show their reasoning to human supervisors, either by making reasoning explicit if possible or via methods for interpretability and eliciting latent knowledge.
  3. Red team models to exhibit failure modes that don’t occur in normal use

We believe none of these pieces are sufficient by themselves:

  • (1) without (2) can be rationalization, where an agent decides what to do and produces an explanation after the fact that justifies its answer.
  • (2) without (1) doesn’t scale: The full reasoning trace of the agent might be enormous, it might be terabytes of data even with compression, or exponentially large without compression if the agent is using advanced heuristics which expand into very large human-interpretable reasoning traces.
  • (1)+(2) without (3) will miss rare failures.
  • (3) needs (1)+(2) to define failure.

An example proposal for (1) is debate, in which two agents are trained in a zero-sum game to provide evidence and counterarguments for answers, as evaluated by a human judge.  If we imagine the exponentially large tree of all possible debates, the goals of debate are to (1) engineer the whole tree so that it captures all relevant considerations and (2) train agents so that the chosen single path through the tree reflects the tree as a whole.
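
To make the shape of such a protocol concrete, here is a minimal sketch of a single two-agent debate. The `agent_a`, `agent_b`, and `judge` callables are hypothetical stand-ins; this illustrates only the zero-sum structure and does not correspond to any actual DeepMind implementation.

```python
def debate(question, agent_a, agent_b, judge, num_turns=6):
    """Minimal sketch of one debate game (hypothetical interfaces).

    Each agent maps the transcript so far to its next argument; the judge
    maps the final transcript to a winner. In the zero-sum training setup,
    the winner's behaviour is reinforced and the loser's is penalized.
    """
    transcript = [("question", question)]
    for turn in range(num_turns):
        name, agent = ("A", agent_a) if turn % 2 == 0 else ("B", agent_b)
        transcript.append((name, agent(transcript)))
    winner = judge(transcript)  # e.g. "A" or "B"
    return winner, transcript
```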

[Figure 1 from AI Safety Needs Social Scientists]

The full picture will differ from the pure debate setting in many ways, and we believe the correct interpretation of the debate idea is “agents should critique themselves”. There is a large space of protocols that include agents critiquing agents as a component, and choosing between them will involve

The three goals of “help humans with supervision”, “align explanations with reasoning”, and “red teams” will be blurry once we put the whole picture together.  Red teaming can occur either standalone or as an integrated part of a training scheme such as cross-examination, which allows agents to interrogate opponent behavior along counterfactual trajectories.  Stronger schemes to help humans with supervision should improve alignment with reasoning by themselves, as they grow the space of considerations that can be exposed to humans.  Thus, a key part of the Scalable Alignment Team’s work is planning out how these pieces will fit together.

Examples of our work, involving extensive collaboration with other teams at DeepMind:

  1. Risk analyses, both for long-term alignment risks and harms that exist today:
    1. Kenton et al. 2021, Alignment of language agents
    2. Weidinger et al. 2021, Ethical and social risks of harm from language models
  2. Language model pretraining, analysis, and safety discussion
    1. Rae et al. 2021, Scaling language models: Methods, analysis & insights from training Gopher
    2. Borgeaud et al. 2021, Improving language models by retrieving from trillions of tokens
  3. Safety
    1. Perez et al. 2022, Red teaming language models with language models
    2. Gleave and Irving 2022, Uncertainty Estimation for Language Reward Models
    3. Menick et al. 2022, Teaching language models to support answers with verified quotes
  4. Earlier proposals for debate and human aspects of debate
    1. Irving et al. 2018, AI safety via debate
    2. Irving and Askell 2019, AI safety needs social scientists

We view our recent safety papers as steps towards the broader scalable alignment picture, and continue to build out towards debate and generalizations.  We work primarily with large language models (LLMs), both because LLMs are a tool for safety, enabling human-machine communication, and because they are examples of ML models that may cause both near-term and long-term harms.

Alignment Team’s portfolio of projects

In contrast to the Scalable Alignment Team, the Alignment Team explores a wide variety of possible angles on the AI alignment problem. Relative to Scalable Alignment, we check whether a technique could plausibly scale based on conceptual and abstract arguments. This lets us iterate much faster, at the cost of getting less useful feedback from reality. To give you a sense of the variety, here are some examples of public past work that was led by current team members:

  1. Learning objectives from human feedback on hypothetical behavior
  2. Understanding agent incentives using causal influence diagrams
  3. Examples of specification gaming
  4. Eliciting latent knowledge contest
  5. Avoiding side effects through impact regularization
  6. Improving our philosophical understanding of “agency” using Conway’s game of life
  7. Relating specification problems and Goodhart’s Law
  8. Decoupling approval from actions to avoid tampering

That being said, over the last year there has been some movement away from previous research topics and towards others. To get a sense of our current priorities, here are short descriptions of some projects that we are currently working on:

  1. Primarily conceptual:
    1. Investigate threat models in which, due to increasing AI sophistication, humans are forced to rely on evaluations of outcomes (rather than evaluations of process or reasoning).
    2. Investigate arguments about the difficulty of AI alignment, including as a subproblem the likelihood that various AI alignment plans succeed.
    3. Compare various decompositions of the alignment problem to see which one is most useful for guiding future work.
  2. Primarily empirical:
    1. Create demonstrations of inner alignment failures, in a similar style as this paper.
    2. Dig deeper into the grokking phenomenon and give a satisfying account of how and why it happens.
    3. Develop interpretability tools that allow us to understand how large language models work (along similar lines as Anthropic’s work).
    4. Evaluate how useful process-based feedback is on an existing benchmark.

Relative to most other teams at DeepMind, on the Alignment team there is quite a lot of freedom in what you work on. All you need to do to start a project is to convince your manager that it’s worth doing (i.e. reduces x-risk comparably well to other actions you could take), and convince enough collaborators to work on the project.

In many ways the team is a collection of people with very different research agendas and perspectives on AI alignment that you wouldn’t normally expect to work together. What ties us together is our meta-level focus on reducing existential risk from alignment failures:

  1. Every new project must come accompanied by a theory of change that explains how it reduces existential risk; this helps us avoid the failure mode of working on interesting conceptual projects that end up not connecting to the situations we are worried about. 
  2. It’s encouraged to talk to people on the team with very different perspectives and try to come to agreement, or at least better understand each other’s positions. This can be an explicit project even though it isn’t “research” in the traditional sense.

Interfacing with the rest of DeepMind

Both Alignment and Scalable Alignment collaborate extensively with people across DeepMind.

For Alignment, this includes both collaborating on projects that we think are useful and explaining our ideas to other researchers. As a particularly good example, we recently ran a 2 hour AI alignment “workshop” with over 100 attendees. (That being said, you can opt out of these engagements in order to focus on research, if you prefer.)

As Scalable Alignment’s work with large language models is very concrete, we have tight collaborations with a variety of teams, including large-scale pretraining and other language teams, Ethics and Society, and Strategy and Governance.

The roles

Between our two teams we have open roles for Research Scientists (RSs), Research Engineers (REs), and (for Scalable Alignment) Software Engineers.  Scalable Alignment RSs can have either a machine learning background or a cognitive science background (or equivalent).  The boundaries between these roles are blurry.  There are many skills involved in overall Alignment / Scalable Alignment research success: proposing and leading projects, writing and publishing papers, conceptual safety work, algorithm design and implementation, experiment execution and tuning, design and implementation of flexible, high-performance, maintainable software, and design and analysis of human interaction experiments.  

We want to hire from the Pareto frontier of all relevant skills.  This means RSs are expected to have more research experience and more of a track record of papers, while SWEs are expected to be better at scalable software design / collaboration / implementation, with REs in between; it also means that REs can and do propose and lead projects if capable (e.g., this recent paper had an RE as last author).  For more details on the tradeoffs, see the career section of Rohin’s FAQ.

For Scalable Alignment, most of our work focuses on large language models.  For Machine Learning RSs, this means experience with natural language processing is valuable, but not required.  We are also interested in candidates motivated by other types of harms caused by large models, such as those described in Weidinger et al., Ethical and social risks of harm from language models, as long as you are excited by the goal of removing such harms even in subtle cases which humans have difficulty detecting.  For REs and SWEs, a focus on large language models means that experience with high performance computation or large, many-developer codebases is valuable.  For the RE role for Alignment, many of the projects you could work on would involve smaller models that are less of an engineering challenge, though there are still a few projects that work with our largest language models.

Scalable Alignment Cognitive Scientists are expected to have a track record of research in cognitive science, and to design, lead, and implement either standalone human-only experiments to probe uncertainty, or the human interaction components of mixed human / machine experiments.  No experience with machine learning is required, but you should be excited to collaborate with people who have it!

Apply now!

We will be evaluating applications on a rolling basis until positions are filled, but we will at least consider all applications that we receive by May 31. Please do apply even if your start date is up to a year in the future, as we probably will not run another hiring round this year. These roles are based in London, with a hybrid work-from-office / work-from-home model.

While we do expect these roles to be competitive, we have found that people often overestimate how much we are looking for. In particular:

  • We do not expect you to have a PhD if you are applying for the Research Engineer or Software Engineer roles. Even for the Research Scientist role, it is fine if you don’t have a PhD if you can demonstrate comparable research skill (though we do not expect to see such candidates in practice).
  • We do not expect you to have read hundreds of blog posts and papers about AI alignment, or to have a research agenda that aims to fully solve AI alignment. We will look for understanding of the basic motivation for AI alignment, and the ability to reason conceptually about future AI systems that we haven’t yet built.
    • If we ask you, say, whether an assistive agent would gradient hack if it learned about its own training process, we’re looking to see how you go about thinking about a confusing and ill-specified question (which happens all the time in alignment research). We aren’t expecting you to give us the Correct Answer, and in fact there isn’t a correct answer; the question isn’t specified well enough for that. We aren’t even expecting you to know all the terms; it would be fine to ask what we mean by “gradient hacking”.
  • As a rough test for the Research Engineer role, if you can reproduce a typical ML paper in a few hundred hours and your interests align with ours, we’re probably interested in interviewing you.
  • We do not expect SWE candidates to have experience with ML, but you should have experience with high performance code and experience with large, collaborative codebases (including the human aspects of collaborative software projects).

Go forth and apply!



Discuss

Thoughts on AI Safety Camp

13 мая, 2022 - 10:16
Published on May 13, 2022 7:16 AM GMT

I

Early this year I interviewed a sample of AISC participants and mentors, and spent some time thinking about the problems the AI safety research community is facing, and have changed my mind about some things.

AI Safety Camp is a program that brings together applicants into teams, and over about a hundred hours of work those teams do AI safety-related projects that they present at the end (one project made it into a Rob Miles video). I think it's really cool, but what exactly it's good for depends on a lot of nitty gritty details that I'll get into later.

Who am I to do any judging? I'm an independent alignment researcher, past LW meetup organizer, physics PhD, and amateur appliance repairman. What I'm not is a big expert on how people get into alignment research - this post is a record of me becoming marginally more expert.

II

The fundamental problem is how to build an ecosystem of infrastructure that takes in money and people and outputs useful AI safety research. Someone who doesn't know much about AISC (like my past self) might conceive of many different jobs it could be doing within this ecosystem:

  • Educating relative newcomers to the field and getting them more interested in doing research on AI alignment.
  • Providing project opportunities that are a lot like university class projects - contributing to the education of people in the process of skilling up to do alignment research.
  • Providing potentially-skilled and potentially-interested people a way to "test their fit" to see if they want to commit to doing more AI alignment work.
  • Catalyzing the formation of groups and connections that will persist after the end of the camp.
  • Helping skilled and interested people send an honest signal of their alignment research skills to future employers and collaborators.
  • Producing object-level useful research outputs.

In addition to this breakdown, there's orthogonal dimensions of what parts of AI safety research you might specialize to support:

  • Conceptual or philosophical work.
  • Machine learning projects.
  • Mathematical foundations.
  • Policy development.
  • Meta-level community-building.

Different camp parameters (length, filters on attendees, etc.) are better-suited for different sorts of projects. This is why AISC does a lot of machine learning projects, and why there's a niche for AISC alum Adam Shimi to start a slightly different thing focused on conceptual work (Refine).

III

Before talking to people, I'd thought AISC was 35% about signalling to help break into the field, 25% about object-level work, and 15% about learning, plus leftovers. Now I think it's actually 35% about testing fit, 30% about signalling, and 15% about object-level work, plus different leftovers.

It's not that people didn't pick projects they were excited about, they did. But everyone I asked acknowledged that the length of the camp wasn't that long, they weren't maximally ambitious anyhow, and they just wanted to produce something they were proud of. What was valuable to them was often what they learned about themselves, rather than about AI.

Or maybe that's too pat, and the "testing fit" thing is more about "testing the waters to make it easier to jump in." I stand by the signalling thing, though. I think we just need more organizations trying to snap up the hot talent that AISC uncovers.

Looking back at my list of potential jobs for AISC (e.g. education, testing fit, catalyzing groups, signalling) I ordered them roughly by the assumed skill level of the participants. I initially thought AISC was doing things catered to all sorts of participants (both educating newcomers and helping skilled researchers signal their abilities, etc.), while my revised impression is that they focus on people who are quite skilled and buy into the arguments for why this is important, but don't have much research experience (early grad school vibes). In addition to the new program Refine, another thing to compare to might be MLSS, which is clearly aimed at relative beginners.

IV

When I talked to AISC participants, I was consistently impressed by them - they were knowledgeable about AI safety and had good ML chops (or other interesting skills). AISC doesn't need to be in the business of educating newbies, because it's full of people who've already spent a year or three considering AI alignment and want to try something more serious.

The size of this demographic is actually surprisingly large - sadly the organizers who might have a better idea didn't talk to me, but just using the number of applicants to AISC as the basis for a Fermi estimate (guessing that only 10-20% of the people who want to try AI alignment research had the free time and motivation to apply) gets you to >2000 people. This isn't really a fixed group of people, either - new people enter by getting interested in AI safety and learning about AI, and leave when they no longer get much benefit from the fit-testing or signalling in AISC. I would guess this population leaves room for ~1 exact copy of AISC (on an offset schedule), or ~4 more programs that slightly tweak who they're appealing to.

Most participants cut their teeth on AI alignment through independent study and local LW/EA meetup groups. People are trying various things (see MLSS above) to increase the amount of tooth-cutting going on, and eventually the end game might be to have AI safety just be "in the water supply," so that people get exposed to it in the normal course of education and research, or can take a university elective on it to catch up most of the way to the AISC participants.

The people I talked to were quite positively disposed to AISC. At the core, people were glad to be working on projects that excited them, and liked working in groups and with a bit of extra support/motivational structure.

Some people attended AISC and decided that alignment research wasn't for them, which is a success in its own way. On average, I think attending made AI alignment research feel "more real," and increased peoples' conviction that they could contribute to it. Several people I talked to came away with ideas only tangentially related to their project that they were excited to work on - but of course it's hard to separate this from the fact that AISC participants are already selected for being on a trajectory of increasing involvement in AI safety.

In contrast, the mentorship aspect was surprisingly (to me) low-value to people. Unless the mentor really put in the hours (which most understandably did not), decisions about each project were left in the hands of the attendees, and the mentor was more like an occasional shoulder angel plus useful proofreader of their final report. Not pointless, but not crucial. This made more sense as I came to see AISC as not being in the business of supplying education from outside.

Note that in the most recent iteration that I haven't interviewed anyone from, the format of the camp has changed - projects now come from the mentors rather than the groups. I suspect this is intended to solve a problem where some people just didn't pick good projects and ran into trouble. But it's not entirely obvious whether the (probable) improvement of topics dominates the effects on mentor and group engagement etc., so if you want to chat about this in the comments or with me via video call, I have more questions I'd be interested to ask.

Another thing that people didn't care about that I'd thought they would was remote vs. in-person interaction. In fact, people tended to think they'd prefer the remote version (albeit not having tried both). Given the lower costs and easier logistics, this is a really strong point in favor of doing group projects remotely. It's possible this is peculiar to machine learning projects, and [insert other type of project here] would really benefit from face to face interaction. But realistically, it looks like all types should experiment with collaborating over Discord and Google Docs.

V

What are the parameters of AISC that make it good at some things and not others?

Here's a list of some possible topics to get the juices flowing:

  • Length and length variability.
  • Filtering applicants.
  • Non-project educational content.
  • Level of mentor involvement.
  • Expectations and evaluation.
  • Financial support.
  • Group size and formation conditions.
  • Setting and available tools.

Some points I think are particularly interesting:

Length and length variability: Naturally shorter time mandates easier projects, but you can have easy projects across a wide variety of sub-fields. However, a fixed length (if somewhat short) also mandates lower-variance projects, which discourages the inherent flailing around of conceptual work and is better suited to projects that look more like engineering.

Level of mentor involvement: Giving participants more supervision might reduce length variability pressure and increase the object-level output, but reduce the signalling power of doing a good job (particularly for conceptual work). On the other hand, participating in AISC at all seems like it would still be a decent sign of having interesting ideas. The more interesting arguments against increasing supervision are that it might not reduce length variability pressure by much (mentors might have ideas that are both variable between-ideas and that require an uncertain amount of time to accomplish, similar to the participants), and might not increase the total object-level output, relative to the mentor and participants working on different topics on the margin.

Evaluation: Should AISC be grading people or giving out limited awards to individuals? I think that one of its key jobs is certainly giving honest private or semi-private feedback to the participants. But should it also be helping academic institutions or employers discriminate between participants to increase its signalling power? I suspect that with current parameters there's enough variation in project quality to serve as a signal already if necessary, and trying to give public grades on other things would be shouldering a lot of trouble with perverse incentives and hurt feelings for little gain.

VI

You can get lots of variations on AISC's theme by tweaking the parameters, including variations that fill very different niches in the AI safety ecosystem. For example, you could get the ML for Alignment Bootcamp with different settings of applicant filtering, educational content, group size, and available tools.

On the other hand, there are even more different programs that would have nontrivial values of "invisible parameters" that I never would have thought to put on the list of properties of AISC (similar to how "group size" might be an invisible parameter for MLAB). These parameters are merely an approximate local coordinate system for a small region of infrastructure-space.

What niches do I think especially need filling? For starters, things that fit into a standard academic context. We need undergrad- and graduate-level courses developed that bite off various chunks of the problems of AI alignment. AISC and its neighbors might tie into this by helping with the development of project-based courses - what project topics support a higher amount of educational content / teacher involvement, while still being interesting to do?

We also need to scale up the later links in the chain, focused on the production of object-level research. Acknowledging that this is still only searching over a small part of the space, we can ask what tweaks to the AISC formula would result in something more optimized for research output. And I think the answer is that you can basically draw a continuum between AISC and a research lab in terms of things like financial support, filtering applicants, project length, etc. Some of these variables are "softer" than others - it's a lot easier to match MIRI on project length than it is to match them on applicant filtering.

VII

Should you do AISC? Seems like a reasonable thing for me to give an opinion about, so I'll try to dredge one up.

You should plausibly do it IF:

(

You have skills that would let you pull your weight in an ML project.

OR

You've looked at the AISC website's list of topics and see something you'd like to do.

)

AND

You know at least a bit about the alignment problem - at the very least you are aware that many obvious ways to try to get what we want from AI do not actually work.

AND

(

You potentially want to do alignment research, and want to test the waters.

OR

You think working on AI alignment with a group would be super fun and want to do it for its own sake.

OR

You want to do alignment research with high probability but don't have a signal of your skillz you can show other people.

)

.

This is actually a sneakily broad recommendation, and I think that's exactly right. It's the people on the margins, those who aren't sure of themselves, the people who could only be caught by a broad net that most benefit from something like this. So if that's you, think about it.



Discuss

Deferring

13 мая, 2022 - 02:56
Published on May 12, 2022 11:56 PM GMT

(Cross-posted from the EA Forum)

Deferring is when you adopt someone else's view on a question over your own independent view (or instead of taking the time to form an independent view). You can defer on questions of fact or questions of what to do. You might defer because you think they know better (epistemic deferring), or because there is a formal or social expectation that you should go along with their view (deferring to authority). 

Both types of deferring are important — epistemic deferring lets people borrow the fruits of knowledge; deferring to authority enables strong coordination. But they are two-edged. Deferring can mean that you get less chance to test out your own views, so developing mastery is slower. Deferring to the wrong people can be straightforwardly bad. And when someone defers without everyone understanding that's what's happening, it can cause issues. Similarly, unacknowledged expectations of deferral from others can cause problems. We should therefore learn when and how to defer, when not to, and how to be explicit about what we're doing.

Why deferring is useful

Epistemic deferring

Epistemic deferring is giving more weight to someone else's view than your own because you think they're in a position to know better.  The opposite of epistemic deferring is holding one's own view.

Examples:

  • "You've been to this town before; where's the best place to get coffee?"
  • "My doctor/lawyer says this is a common situation, and the right thing to do is ..."
  • "A lot of smart folks seem to think AI risk is a big deal; it sounds batshit to me, but I guess I'll look into it more"

The case for epistemic deferring is simple: for most questions, we can identify someone (or some institution or group of people) whose judgement on the question would — if they were possessed of the facts we knew — be better than our own. So to the extent that 

  • (A) We want to optimize for accurate judgements above all else, &
  • (B) We are willing to make the investment to uncover that better judgement,

deferring will be correct.

Partial deferring

The degree to which (A) and (B) hold will vary with circumstance. It will frequently be the case that they partially hold; in this case it may be appropriate to partially defer, e.g.

  • “I’m torn between whether to take job X or job Y. On my view job X seems better. When I talk to my friends and family they overwhelmingly think job Y sounds better; maybe they’re seeing something I’m not. If I thought it was a close call anyway this might be enough to tip me over, but it won’t change my mind if my preference for X was clear.”

Deferring to authority

Deferring to authority is adopting someone else's view because of a social contract to do so. Often deferring to authority happens on questions of what should be done — e.g. "I'm going to put this fire alarm up because [my boss / my client / the law] tells me to", or “I’m helping my friend cook dinner, so I’ll cut the carrots the way they want, even though I think this other way is better”.[1]  The opposite of deferring to authority is acting on one's own conscience.

Deferring to authority — and the reasonable expectation of such deferring — enables groups of people to coordinate more effectively. Militaries rely on it, but so do most projects (large and small, but especially large). It's unreasonable to expect that everyone working on a large software project will have exactly the same views over the key top-level design choices, but it's better if there's some voice that can speak authoritatively, so everyone can work on that basis. If we collectively want to be able to undertake large ambitious projects, we’ll likely need to use deferring to authority as a tool.

Ways deferring goes wrong
  1. Deferring to the wrong people
    • The "obvious" failure mode, applies to both:
      • Epistemic deferring — misidentifying who is an expert
      • Deferring to authority — buying into social contracts it would be better to withdraw from
  2. Deferring with insufficient bandwidth
    • Even if Aditi would make a better decision than Sarah, the process of Sarah deferring to Aditi (for epistemic or authority reasons) can produce a worse decision if either:
      1. There's too much context for Sarah to communicate to Aditi
      2. The "right" decision includes too much detail for Aditi to communicate to Sarah
    • This is more often a problem with questions of what to do than questions of fact (since high context on the situation is so often important for the answer), but may come up in either case
    • A special case is deferring with zero bandwidth (e.g. Sarah is deferring to what she imagines Aditi would say in the situation, based on an article she read)
    • Another cause of deferring with insufficient bandwidth is if someone wants to delegate responsibility but not authority for a project, and not to spend too much time on it; this is asking for deferral to them as an authority without providing much bandwidth
  3. Deferring can be bad for learning
    • Contrast — "letting people make their own mistakes"
      • The basic dynamic is that if you act from your own models, you bring them more directly into contact with the world, and can update faster
    • Note that a certain amount of deferring can be good for learning, especially:
      1. When first trying to get up to speed with an area
      2. When taking advice on what to pay attention to
        • In particular because this can help rescue people from traps where they think some dimension is unimportant, so never pay attention to it to notice that it's actually important
    • This intersects with #2; deferring is more often good for learning when it’s high-bandwidth (since the person deferring can use it as an opportunity to interrogate the model of the person being deferred to), and more often bad for learning when it’s low-bandwidth
  4. Deferring can interfere with belief formation
    • If people aren't good at keeping track of why they believe things, it can be hard to notice when one's body of knowledge has caught up and one should stop deferring on an issue (because the deferred-to-belief may be treated as a primitive belief); cf. independent impressions for discussion of habits for avoiding this
    • Conflation between epistemic deferring and deferring to authority can lead to people accidentally adopting as beliefs things that were only supposed to be operating assumptions
      • This can happen e.g.
        • When deferring to one's boss
          • Easy to slip between the two since one's boss is often in a superior epistemic position re. what needs to be done
          • In some cases organizational leadership might exert explicit pressure towards shared beliefs, e.g. saying “if someone doesn’t look like they hold belief X, this could destabilize the team’s ability to orient together as a team”
        • Deferring to someone high status when the true motivation for deferring is to seem like one has cool beliefs / get social acceptance
          • Again there's plausible deniability since the high status person may well be in a superior epistemic position
          • The high-status person may like it when others independently have similar views to them (since this is evidence of good judgement), which can create incentives for the junior people to adopt “as their own view” the relevant positions

Deferring without common knowledge of deferring is a risk factor for these issues (since it's less likely that anyone is going to spot and correct them).

Social deferring

Often there’s a lot of deferring within a group or community on a particular issue (i.e. both the person deferring and the person being deferred to are within the group, and the people being deferred to often have their own views substantially via deferring). This can lead to issues, for reasons like:

  1. If there are long chains of deferral, this means there’s often little bandwidth to the people originating the views
  2. If you don’t know when others are deferring vs having independent views, it may be unclear how many times a given view has been independently generated, which can make it hard to know how much weight to put on it (“the emperor’s new clothes” gives an extreme example)
  3. If the people with independent takes update their views in response to evidence, it may take some time until the newer views have filtered through to the people who are deferring
  4. If people are deferring to the authority of the social group (where there's a pressure to have the view as a condition of membership), this may be bad for belief formation

Ultimately we don’t have good alternatives to basing a lot of our beliefs on chains of deferral (there are too many disparate disciplines of expertise in the world to personally be fluent with knowing who are the experts to listen to in each one). But I think it’s helpful to be wary of ways in which it can cause problems, and we should feel relatively better about:

  • A group or community collectively deferring to a single source (e.g. the same expert report, or a prediction market), as it’s much more legible what’s happening
  • People sometimes taking the effort to dive into a topic and shorten the deferral chain (cf. “minimal trust investigations”)
  • Creating spaces which explicitly state their operating assumptions as a condition of entry (“in this workshop we’ll discuss how to prepare for a nuclear war in 2025”) without putting pressure on the beliefs of the participants

When & how to defer

Epistemic deferring

There's frequently a tension between on the one hand knowing that you can identify someone who knows more than you, and on the other hand not wanting to take the time to get answers from them, or wanting to optimize for your own learning rather than just the best answer for the question at hand.

Here are the situations where I think epistemic deferring is desirable:

  1. Early in the learning process for any paradigm
    • By “paradigm” I mean a body of knowledge with something like agreed-on metrics of progress
      • This might include “learning a new subfield of chemistry” or “learning to ride a unicycle”
      • I’m explicitly not including areas that feel preparadigmatic — among which notably I want to include cause prioritization — where I feel more confused about the correct advice (although it certainly seems helpful to hear existing ideas)
    • Here you ideally want to defer-but-question — perhaps you assume that the thing you're being told is correct, but are intensely curious about why that could be (and remain open to questioning the assumption later)
    • Taking advice on what to pay attention to is a frequent special case of this — it's very early in the learning process of "how to pay attention to X", for some X you previously weren't giving attention to
  2. When the importance of a good answer is reasonably high compared to the cost of gathering the information about how to defer, and either:
    1. It's on a topic that you're not hoping to develop mastery of
      • i.e. you just want the easily-transmissible conclusions, not the underlying generators
    2. There are only weak feedback loops from the process back into your own models
    3. The importance of a good answer is high even compared to the cost of gathering thorough information about how to defer
      • Sometimes thorough information about how to defer is cheap! e.g. if you want to know about a variable that has high quality public expert estimates
      • If you’re making a decision about what to do, however, often gathering thorough information about how to defer means very high bandwidth context-sharing
    4. You intend to defer only a little

Note: even when not deferring, asking for advice is often a very helpful move. You can consider the advice and let it guide your thinking and how to proceed without deferring to any of the advice-givers.[2]

Deferring to authority

Working out when to defer to authority is often simply a case of determining whether you want to participate in the social contract.

It's often good to communicate when you're deferring, e.g. tell your boss "I'm doing X because you told me to, but heads up that Y looks better to me". Sometimes the response will just be "cool"; at other times they might realize that you need to understand why X is good in order to do a good job of X (or that they need to reconsider X). In any case it's helpful to keep track for yourself of when you're deferring to authority vs have an independent view.

A dual question of when to defer to authority is when to ask people to defer to you as an authority. I think the right answer is "when you want someone to go on following the plan even if they’re not personally convinced". If you’re asking others to defer it’s best if you’re explicit about this. Vice-versa if you’re in a position of authority and not asking others to defer it’s good to be explicit that you want them to act on their own conscience. (People take cultural cues from those in positions of authority; if they perceive ambiguity about whether they should defer, it may be ambiguous in their own mind, which seems bad for the reasons discussed above.)

Deferring to authority in the effective altruism community

I think people are often reluctant to ask others to defer to their authority within EA. We celebrate people thinking for themselves, taking a consequentialist perspective, and acting on their own conscience. Deferring to authority looks like it might undermine these values. Or perhaps we'd get people who reluctantly "deferred to authority" while trying to steer their bosses towards things that seemed better to them.

This is a mistake. Deferring to authority is the natural tool for coordinating groups of people to do big things together. If we're unwilling to use this tool, people will use social pressure towards conformity of beliefs as an alternate tool for the same ends. But this is worse at achieving coordination[3], and is more damaging to the epistemics of the people involved.

We should (I think) instead encourage people to be happy taking jobs where they adopt a stance of "how can I help with the agenda of the people steering this?", without necessarily being fully bought into that agenda. This might seem a let down for individuals, but I think we should be willing to accept more "people work on agendas they're not fully bought into" if the alternatives are "there are a bunch of epistemic distortions to get people to buy into agendas" and "nobody can make bets which involve coordinating more than 6 people". People doing this can keep their eyes open for jobs which better fit their goals, while being able and encouraged to have their own opinions, and still having professional pride in doing a good job at the thing they're employed to do.

This isn't to say that all jobs in EA should look like this. I think it is a great virtue of the community that we recognise the power of positions which give people significant space to act on their own conscience. But when we need more coordination, we should use the correct tools to get that.

Meta-practices

My take on the correct cultural meta-practices around deferring: 

  1. Choices to defer — or to request deferral — should as far as possible be made deliberately rather than accidentally
    • We should be conscious of whether we're deferring for epistemic or authority reasons
  2. We should discuss principles of when to defer and when not to defer
  3. Responsibility for encouraging non-deferral (when that's appropriate) should lie significantly with the people who might be deferred to
  4. We should be explicit about when we're deferring (in particular striving not to let the people-being-deferred-to remain ignorant of what's happening)

Closing remarks

A lot of this content, insofar as it is perceptive, is not original to me; a good part of what I'm doing here is just trying to name the synthesis position for what I perceive to be strong pro-deferral and anti-deferral arguments people make from time to time. This draft benefited from thoughts and comments from Adam Bales, Buck Shlegeris, Claire Zabel, Gregory Lewis, Jennifer Lin, Linch Zhang, Max Dalton, Max Daniel, Raymond Douglas, Rose Hadshar, Scott Garrabrant, and especially Anna Salamon and Holden Karnofsky. I might edit later to tighten or clarify language (or if there are one or two substantive points I want to change).

Should anyone defer to me on the topic of deferring? 

Epistemically — I've spent a while thinking about the dynamics here, so it's not ridiculous to give my views some weight. But lots of people have spent some time on this; I'm hoping this article is more helpful as a guide to let people understand things they already see than as something that needs people to defer to.

As an authority — not yet. But I'm offering suggestions for community norms around deferring. Norms are a thing which it can make sense to ask people to defer to. If my suggestions are well received in the discussion here, perhaps we'll want to make asks for deference to them at some point down the line.
 

  1. ^

    Some less central examples of deferring to authority in my sense:

    • Doing something because you promised to (the “authority” deferred to is your past self)
    • Adopting a belief that the startup you’re joining will succeed as part of the implicit contract of joining (not necessarily a fully adopted belief, but acted upon while at work)
  2. ^

     cf. https://www.lesswrong.com/posts/yeADMcScw8EW9yxpH/a-sketch-of-good-communication

  3. ^

    At least “just using ideological conformity” is worse for coordination than “using ideological conformity + deference to authority”. After we’re using deference to authority well I imagine there’s a case that having ideological conformity as well would help further; my guess is that it’s not worth the cost of damage to epistemics.



Discuss

RLHF

13 мая, 2022 - 00:18
Published on May 12, 2022 9:18 PM GMT

I’ve been thinking about Reinforcement Learning from Human Feedback (RLHF) a lot lately, mostly as a result of my AGISF capstone project attempting to use it to teach a language model to write better responses to Reddit writing prompts, a la Learning to summarize from human feedback.

RLHF has generated some impressive outputs lately, but there seems to be a significant amount of disagreement regarding its potential as a partial or complete solution to alignment: some are excited to extend the promising results we have so far, while others are more pessimistic and perhaps even opposed to further work along these lines. I find myself optimistic about the usefulness of RLHF work, but far from confident that all of the method’s shortcomings can be overcome.

How it Works

At a high level, RLHF learns a reward model for a certain task based on human feedback and then trains a policy to optimize the reward received from the reward model. In practice, the reward model learned is likely overfit - the policy can thus benefit from interpolating between a policy that optimizes the reward model’s reward and a policy trained through pure imitation learning. 
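
In practice, one common way of realizing this interpolation is to have the policy maximize the learned reward minus a penalty for drifting away from the imitation-learned (supervised) model. Here is a minimal sketch of that shaped reward, assuming per-sample log-probabilities have already been computed; the coefficient `beta` is a hypothetical value that would need tuning.

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_sft: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Reward signal for RL fine-tuning: the reward model's score minus a
    KL-style penalty toward the supervised (imitation) policy.

    rm_score, logprob_policy, logprob_sft: tensors of shape [batch].
    beta trades off reward optimization against staying close to imitation.
    """
    log_ratio = logprob_policy - logprob_sft  # per-sample estimate of the KL term
    return rm_score - beta * log_ratio
```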

A key advantage of RLHF is the ease of gathering feedback and the sample efficiency with which the reward model can be trained. For many tasks, it’s significantly easier to provide feedback on a model’s performance than to teach the model through imitation. We can also conceive of tasks where humans remain incapable of completing the task themselves but can still evaluate various completions and provide feedback on them. This feedback can be as simple as picking the better of two sample completions, though it’s plausible that other forms of feedback might be more appropriate and/or more effective. The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is also known as Inverse Reinforcement Learning. The creators of the method, Andrew Ng and Stuart Russell, believe that “the reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task.” Think about training an AI to drive a car: we might not want it to learn to imitate human drivers, but rather to learn what humans value in driving behavior in the abstract and then optimize against those preferences.
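
As a concrete illustration of the "pick the better of two completions" feedback, here is a minimal sketch of the pairwise loss commonly used to fit a reward model to such comparisons; the names are illustrative rather than taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss for pairwise comparisons.

    score_chosen / score_rejected are the reward model's scalar scores for
    the human-preferred and dispreferred completions. Minimizing this loss
    pushes the model to score preferred completions higher.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```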

Outer Alignment Concerns

If a reward model trained through human feedback properly encoded human preferences, we might expect RLHF to be a plausible path to Outer Alignment. But this seems like a tall order, considering that humans can be assigned any values whatsoever, that the easy goal inference problem is still hard, and that it’s easy to misspecify any model that attempts to correct for human biases or irrationality. Ambitious value learning is hard, and I’m not particularly confident that RLHF makes it significantly more tractable.

It’s also plausible that this approach of inferring a reward function for a task is just fundamentally misguided and that the way to get an outer aligned system is through the assistance-game or CIRL framework instead. There are definite advantages of this paradigm over the more standard reward learning setup that RLHF leverages. By treating humans as pieces of the environment and the reward function as a latent variable in the environment, an AI system can merge the reward learning and policy training functions that RLHF separates and thereby “take into account the reward learning process when selecting actions.” This makes it easier to make plans conditional on future feedback, only gather feedback as and when it becomes necessary, and more fluidly learn from different forms of feedback.
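
For reference, the assistance-game setup can be stated roughly as follows (a paraphrase of the CIRL formalization of Hadfield-Menell et al.; the notation here is simplified and my own):

```latex
M \;=\; \langle\, S,\ \{A^{H}, A^{R}\},\ T,\ \Theta,\ R,\ P_0,\ \gamma \,\rangle
```

where S is the state space, A^H and A^R are the human's and the AI's action sets, T(s' | s, a^H, a^R) is the transition function, Θ is a set of possible reward parameters, R(s, a^H, a^R; θ) is a reward shared by both players, P_0 is a prior over the initial state and θ, and γ is a discount factor. The human observes θ but the AI does not; because both players maximize the same (unknown-to-the-AI) reward, learning about θ and acting well become parts of a single decision problem rather than separate reward-learning and policy-training stages.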

Scalable oversight is hard

RLHF also relies upon humans being able to evaluate the outputs of models. This will likely be impossible for the kinds of tasks we want to scale AI to perform - it’s just going to be too hard for a human to understand why one output should be preferred over another. We’d simply have to hope that reward model generalization we’d seen previously, when oversight was still possible, continued to hold. Even if we thought we’d figured out how to evaluate our models’ outputs, there’s always the chance of an inner alignment failure or other deceptive behavior evading our oversight - we’d want to be absolutely certain that our reward and policy models were actually doing what we wanted them to do. 

The solutions to the scalable oversight problem seem to rely primarily on AI assistance and/or breakthroughs in interpretability techniques. I think it’s clear how the latter might be useful: if we could just look at any model and be certain of its optimization objective, we’d probably feel pretty comfortable understanding the reward models and policy models we trained. AI assistance might look something like recursive reward modeling: break a task that’s too hard to oversee into more manageable chunks that a human can oversee, and train models to optimize those subtasks. Using the models trained on the narrower subtasks might make the original task possible to oversee: this is an idea that has been used for the task of summarizing books. It’s plausible that there are many tasks that resist this kind of decomposition, but the factored cognition approach might get us very far indeed.
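
A toy sketch of the recursive shape of that idea, with hypothetical callables standing in for the human judge, the decomposition step, and the model-assisted aggregation. Real recursive reward modeling trains agents on the subtasks; this only captures the control flow.

```python
from typing import Callable, List

def recursive_oversight(task: str,
                        human_can_evaluate: Callable[[str], bool],
                        human_evaluate: Callable[[str], float],
                        decompose: Callable[[str], List[str]],
                        assisted_evaluate: Callable[[str, List[float]], float]) -> float:
    """Toy control flow for decomposition-based oversight.

    If a human can judge the task directly, use their judgment; otherwise
    decompose it, recursively evaluate the subtasks, and let a model-assisted
    step aggregate those evaluations into a judgment of the original task.
    Assumes `decompose` eventually bottoms out in human-evaluable subtasks.
    """
    if human_can_evaluate(task):
        return human_evaluate(task)
    subtask_scores = [
        recursive_oversight(sub, human_can_evaluate, human_evaluate,
                            decompose, assisted_evaluate)
        for sub in decompose(task)
    ]
    return assisted_evaluate(task, subtask_scores)
```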

Why I think RLHF is valuable

I’ll quote Paul Christiano here:

We are moving rapidly from a world where people deploy manifestly unaligned models (where even talking about alignment barely makes sense) to people deploying models which are misaligned because (i) humans make mistakes in evaluation, (ii) there are high-stakes decisions so we can't rely on average-case performance.

This seems like a good thing to do if you want to move on to research addressing the problems in RLHF: (i) improving the quality of the evaluations (e.g. by using AI assistance), and (ii) handling high-stakes objective misgeneralization (e.g. by adversarial training).

In addition to "doing the basic thing before the more complicated thing intended to address its failures," it's also the case that RLHF is a building block in the more complicated things.

I think that (a) there is a good chance that these boring approaches will work well enough to buy (a significant amount) time for humans or superhuman AIs to make progress on alignment research or coordination, (b) when they fail, there is a good chance that their failures can be productively studied and addressed.

I generally agree with this. Solving the problems that crop up in RLHF seems likely to transfer to other alignment methods, or at least to yield productive mistakes. The interpretability techniques we develop, the outer or inner alignment failures we find, and the latent knowledge we elicit from our reward and policy models all seem broadly applicable to future AI paradigms. In other words, I think the textbook from the future on AI Alignment is likely to speak positively of RLHF, at the very least as an early alignment approach.

Promising RLHF Research Directions (according to me)

I’d like to see different kinds of feedback used in addition to preference orderings over model outputs. This paper specifies a formalism for reward learning in general and considers several different kinds of feedback that might be appropriate for different tasks, e.g. demonstrations, corrections, natural language feedback, etc. A reward model that can gracefully learn from a wide array of feedback types seems like a desirable goal. This kind of exploration might also help us figure out which forms of feedback are better or worse, and what kinds of generalization arise from each type.
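
One way to picture such a multi-feedback interface, as a hypothetical data structure rather than the formalism from the cited paper: a single container type whose fields cover the feedback forms mentioned above, which a reward-model training loop could dispatch on.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feedback:
    """A single piece of human feedback, in one of several possible forms.

    Exactly one group of optional fields is expected to be populated; a reward
    model that handles all of them could learn from whichever form of feedback
    a given task makes easiest for the human to provide.
    """
    prompt: str
    kind: str                            # "preference" | "demonstration" | "correction" | "language"
    preferred: Optional[str] = None      # preference: the better of two completions
    rejected: Optional[str] = None       # preference: the worse of the two
    demonstration: Optional[str] = None  # demonstration: the human doing the task themselves
    correction: Optional[str] = None     # correction: an edited version of a model output
    commentary: Optional[str] = None     # language: free-form natural-language critique
```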

Relatedly, I think it might be interesting to see how the assistance game paradigm performs in settings where the RLHF paradigm has been applied, like text summarization. On a theoretical level it seems clear that the assistance game setup offers some unique benefits and it would be cool to see those realized. 

As we continue to scale RLHF work up, I want to see how we begin to decompose tasks so that we can apply methods like Recursive Reward Modeling. For book summarization, OpenAI used a fixed chunking algorithm to break the text down into manageable pieces, but it seems likely that decompositions for other kinds of tasks won’t be as trivial. We might need AI assistance to decompose tasks that we can’t oversee into tasks that we can. Training decomposition models that can look at a task and identify overseeable subtasks seems like a shovel-ready problem, perhaps one that we might even apply RLHF to. 
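
For intuition, a minimal sketch of the fixed-chunking style of decomposition, with `summarize` standing in for a learned summarization model. The real system chunks at natural boundaries and trains models per level of the hierarchy; this only shows the trivial version.

```python
from typing import Callable, List

def fixed_chunks(text: str, chunk_size: int = 2000) -> List[str]:
    """The 'trivial' decomposition: split a long text into fixed-size chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_recursively(text: str,
                          summarize: Callable[[str], str],
                          chunk_size: int = 2000) -> str:
    """Summarize chunks, then summarize the joined summaries, until the text fits.

    Assumes `summarize` returns something substantially shorter than its input,
    so the loop terminates.
    """
    while len(text) > chunk_size:
        text = " ".join(summarize(chunk) for chunk in fixed_chunks(text, chunk_size))
    return summarize(text)
```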



Discuss

What to do when starting a business in an imminent-AGI world?

13 мая, 2022 - 00:07
Published on May 12, 2022 9:07 PM GMT

As reported by 1a3orn and Daniel Kokotajlo, Gato is here and appears to me to represent a sub-human AGI, or near enough as makes no difference in a timeline sense. I think this probably means a general thickening of deep learning applications everywhere, and the introduction of a kind of "stack" AI that can do things we used to need whole organizations to do - as an example, I mean things like do patent research, label patent diagrams, and file patent lawsuits.

I also have an idea for a business I would like to start. Starting a business is already a notoriously trying task with a low probability of success, and I wonder how much harder it will be in a world that will become populated with AGI patent trolls, along with whatever else, well before hitting any kind of clear success mark.

So my question is: what do we do to account for powerful AI, showing up soon, when we are starting a business?

Note that what I am interested in here is non-AI businesses in particular and non-software businesses in general, because this looks like the threshold for AI spilling across a bunch of new domains.



Discuss

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

12 мая, 2022 - 23:01
Published on May 12, 2022 8:01 PM GMT

This is the second post in the sequence “Interpretability Research for the Most Important Century”. The first post, which introduces the sequence, defines several terms, and provides a comparison to existing works, can be found here: Introduction to the sequence: Interpretability Research for the Most Important Century

Summary

This post explores the extent to which interpretability is relevant to the hardest, most important parts of the AI alignment problem (property #1 of High-leverage Alignment Research[1]). 

First, I give an overview of the four important parts of the alignment problem (following Hubinger[2]): outer alignment, inner alignment, training competitiveness and performance competitiveness (jump to section). Next I discuss which of them is “hardest”, taking the position that it is inner alignment (if you have to pick just one), and also that it’s hard to find alignment proposals which simultaneously address all four parts well.

Then, I move onto exploring how interpretability could impact these four parts of alignment. Our primary vehicle for this exploration involves imagining and analyzing seven best-case scenarios for interpretability research (jump to section). Each of these scenarios represents a possible endgame story for technical alignment, hinging on one or more potential major breakthroughs in interpretability research. The scenarios’ impacts on alignment vary, but usually involve solving inner alignment to some degree, and then indirectly benefiting outer alignment and performance competitiveness; impacts on training competitiveness are more mixed.

Finally, I discuss the likelihood that interpretability research could contribute to unknown solutions to the alignment problem (jump to section). This includes examining interpretability’s potential to lead to breakthroughs in our basic understanding of neural networks and AI, deconfusion research and paths to solving alignment that are difficult to predict or otherwise not captured by the seven specific scenarios analyzed. 

Quick tips for navigating this very long post! If you get lost scrolling through this post on mobile, consider reading on desktop for two benefits: 1) To take advantage of LessWrong's convenient linked outline feature that appears in the sidebar, and 2) To be able to glance at the footnotes and posts that I link to just by hovering over them.

Acknowledgments

Lots of people greatly improved this post by providing insightful discussions, critical points of view, editing suggestions and encouraging words both before and during its writing.

Many thanks in particular to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy, the EleutherAI Interpretability Reading Group, Justis Mills (and LessWrong's amazing free editing service!) and Andrew McKnight for all their help.

Thanks also to the AGI Safety Fundamentals Curriculum, which is an excellent course I learned a great deal from leading up to writing this post, and for which I started this sequence as my capstone project.

What are the hardest and most important parts of AI alignment?

After several days of research and deliberation (see footnote), I concluded[3] that the most important parts of alignment are well-stated in Hubinger (2020)[2]:

  1. “Outer alignment. Outer alignment is about asking why the objective we're training for is aligned—that is, if we actually got a model that was trying to optimize for the given loss/reward/etc., would we like that model? For a more thorough description of what I mean by outer alignment, see “Outer alignment and imitative amplification.”
  2. Inner alignment. Inner alignment is about asking the question of how our training procedure can actually guarantee that the model it produces will, in fact, be trying to accomplish the objective we trained it on. For a more rigorous treatment of this question and an explanation of why it might be a concern, see “Risks from Learned Optimization.”
  3. Training competitiveness. Competitiveness is a bit of a murky concept, so I want to break it up into two pieces here. Training competitiveness is the question of whether the given training procedure is one that a team or group of teams with a reasonable lead would be able to afford to implement without completely throwing away that lead. Thus, training competitiveness is about whether the proposed process of producing advanced AI is competitive.
  4. Performance competitiveness. Performance competitiveness, on the other hand, is about whether the final product produced by the proposed process is competitive. Performance competitiveness is thus about asking whether a particular proposal, if successful, would satisfy the use cases for advanced AI—e.g. whether it would fill the economic niches that people want AGI to fill. 

Even though Evan Hubinger later proposed training stories as a more general framework, I still find thinking about these four components highly useful for many scenarios, even if they don’t neatly apply to a few proposed alignment techniques. So I’ll consider these to be a good definition of the important parts of AI alignment.

But which one of these four parts is the “hardest”? Well, today there are definitely many proposals which look promising for achieving the two alignment parts (#1 and #2) but seem questionable in one or both of the competitiveness parts (#3 and #4). For example, Microscope AI. Conversely, there are some approaches which seem competitive but not aligned (missing #1 and/or #2). For example, reinforcement learning using a hand-coded specification, and without any interpretability tools to guard against inner misalignment.

However, another thing I observe is that many proposals currently seem to be bottlenecked by #2, inner alignment. For example, in Hubinger (2020)[2], none of the proposals presented could be inner aligned using technology that exists today.

So, I’ll be operating as though the hardest alignment component is inner alignment. However, we’ll still pay attention to the other three components, because it’s also difficult to find a proposal which excels at all four alignment components simultaneously.

How does interpretability impact the important parts of alignment?

Interpretability cannot be a complete alignment solution in isolation, as it must always be paired with another alignment proposal or AI design. I used to think this made interpretability somehow secondary or expendable.

But the more I have read about various alignment approaches, the more I’ve seen that one approach or another is stuck on a problem that interpretability could solve. It seems likely to me that interpretability is necessary, or at least could be instrumentally very valuable, for solving alignment.

For example, if you look closely at Hubinger (2020)[2], every single one of the 11 proposals relies on transparency tools in order to become viable.[4]

So even though interpretability cannot be an alignment solution in isolation, as we’ll see its advancement does have the potential to solve alignment. This is because in several different scenarios which we’ll examine below, advanced interpretability has large positive impacts on some of alignment components #1-4 listed above.

Usually this involves interpretability being able to solve all or part of inner alignment for some techniques. Its benefits for outer alignment and performance competitiveness are usually indirect, in the form of addressing inner alignment problems for one or more techniques that conceptually have good outer alignment properties or performance competitiveness, respectively. It’s worth noting that sometimes interpretability methods do put additional strain on training competitiveness.

We’ll examine this all much more closely in the Interpretability Scenarios with Alignment-Solving Potential section below.

Other potentially important aspects of alignment scarcely considered here

This post largely assumes that we need to solve prosaic AI alignment. That is, I assume that transformative AI will come from scaled-up versions of systems not vastly different from today’s deep learning ML systems. Hence we mostly don’t consider non-prosaic AI designs. I also don’t make any attempt to address the embedded agency problem. (However, Alex Flint’s The ground of optimization, referenced later on, does seem to have bearing on this problem.)

There are important AI governance and strategy problems around coordination, and important misuse risks to consider if aligned advanced AI is actually developed. Neel Nanda’s list of interpretability impact theories also mentions several theories around setting norms or cultural shifts. I touch on some of these briefly in the scenarios below, but I don’t make any attempt to cover them comprehensively. Primarily, in this sequence, I am exploring a world where technical research can drive us toward AI alignment, with the help of scaled-up funding and talent resources as indicated in the Alignment Research Activities Question[5].

Interpretability Scenarios with Alignment-Solving Potential

In attacking the Alignment Research Activities Question[5], Karnofsky (2022)[6] suggests ‘visualizing the “best case”’ for each alignment research track examined—in the case we're examining, that means the best case for interpretability.

I think the nature of interpretability lends itself to multiple “best case” and “very good case” scenarios, perhaps more so than many other alignment research directions.

I tried to think of ambitious milestones for interpretability research that could produce game-changing outcomes for alignment. This is not an exhaustive list. Further investigation: Additional scenarios worth exploring discusses a few more potentially important scenarios, and even more may come to light as others read and respond to this post, and as we continue to learn more about AI and alignment. There are also a few scenarios I considered but decided to exclude from this section because I didn't find that any potential endgames for alignment followed directly from them (see Appendix 2: Other scenarios considered but lacked clear alignment-solving potential).

Some of these scenarios below may also be further developed as an answer to one of the other questions from Karnofsky (2022)[6], i.e. "What’s an alignment result or product that would make sense to offer a $1 billion prize for?"

The list of scenarios progresses roughly from more ambitious/aspirational to more realistic/attainable, though in many cases it is difficult to say which would be harder to attain.

Why focus on best-case scenarios? Isn’t it the worst case we should be focusing on?

It is true that AI alignment research aims to protect us from worst-case scenarios. However, Karnofsky (2022)[6] suggests and I agree that envisioning/analyzing best-case scenarios of each line of research is important to help us learn: “(a) which research tracks would be most valuable if they went well”, and “(b) what the largest gaps seem to be [in research] such that a new set of questions and experiments could be helpful.”

Next we’ll look at a few more background considerations about the scenarios, and then we’ll dive into the scenarios themselves.

Background considerations relevant to all the scenarios

In each of the scenarios below, I’ll discuss specific impacts we can expect from that scenario. In these impact sections, I’ll discuss general impacts on the four components of alignment presented above.

I also consider more in depth how each of these scenarios impacts several specific robustness and alignment techniques. To help keep the main text of this post from becoming too lengthy, I have placed this analysis in Appendix 1: Analysis of scenario impacts on specific robustness and alignment techniques.

I link to the relevant parts of this appendix analysis throughout the main scenarios analysis below. This appendix is incomplete but may be useful if you are looking for more concrete examples to clarify any of these scenarios.

In each of the scenarios, I’ll also discuss specific reasons to be optimistic or pessimistic about their possibility. But there are also reasons which apply generally to all interpretability research, including all of the scenarios considered below.

In the rest of this section, I'll go over those generally-applicable considerations, rather than duplicate them in every scenario.

Reasons to think interpretability will go well with enough funding and talent
  1. The Case for Radical Optimism about Interpretability by Quintin Pope. Neuroscience in the 1960s was essentially doing interpretability research on human brains and made impressive progress. Artificial neural networks in the 2020s, by comparison, provide far more favorable conditions for such research than early neuroscience could have hoped for - for example, being able to see all the weights of the network, being able to apply arbitrary inputs to a network, and even having access to the dataset which a neural network is trained on. It also should be possible to design AI systems which are much more interpretable than the ones in common use today.
  2. Rapid interpretability progress already. Notably the Circuits Thread which reverse-engineered substantial parts of early vision models. Also Transformer Circuits, which is attempting to gain a mechanistic understanding of transformer-based models (e.g. large language models).
  3. Recent work such as Meng et al.'s (2022) "ROME paper" provides evidence that at least some knowledge in neural networks can be precisely located and even modified. Not only this, but it can be done at almost the exact level of abstraction (appropriate impact on related concepts, etc.) which one would hope to exist if a neural network is human-decipherable. These findings are also supported by Knowledge Neurons in Pretrained Transformers.
  4. It's still early days, and we already have many useful interpretability techniques available for studying neural networks, including all the ones explored in the Circuits Thread as well as the "auditing game" (see Automating Auditing: An ambitious concrete technical research proposal).
  5. Interpretability research tends to produce wide and varied beneficial impacts. So even incremental advances are likely to help numerous alignment research directions in ways that are difficult to predict. We'll see this throughout the scenarios analyzed below and also in the Basic research potential, deconfusion and “Scenario X” section.
  6. Broad support for interpretability research among diverse alignment researchers. For example, finding something that Paul Christiano and Eliezer Yudkowsky agree on, beyond the importance of AI alignment, is a non-trivial task. However, they both have stated that interpretability is a worthwhile research direction.[7]
Reasons to think interpretability won’t go far enough even with lots of funding and talent
  1. Good interpretability may not be good enough. A 99% accurate AI lie detector would be an impressive achievement, but that remaining 1% may be too great of a risk to use in practice, given the stakes of deploying an AGI that might try to deceive humans.
  2. Obfuscating neural activity may be instrumentally convergent. Even friendly advanced AIs may converge on intentionally altering their neural activity in ways that thwart interpretability efforts. For an agent pursuing goal-content integrity and self-preservation, there could be an incentive to obfuscate its cognition, hedging against the risk of a human with interpretability tools modifying its goals or deciding to scrap the model altogether because we don’t like the kinds of thoughts it’s having.

    For a silly but illustrative example, a deceptive AI might adopt a policy like "every time I do something sketchy I'm going to also think about happy humans". More realistically, any advanced AI might adopt a policy like "at random intervals between 200-500 milliseconds I'll think a random thought to confuse any potential interpreters of my neural circuitry". Trying to do interpretability on a model pursuing a strategy like this could be substantially more challenging. A closely related concern is gradient hacking.

    These concerns may point toward the importance of doing interpretability on models early on and throughout the training process, rather than waiting to inspect fully trained models. Or of using transparency via training mechanisms to incentivize the model to remain interpretable, but such metrics are at risk of being gamed. Either way, these are serious potential issues to keep in mind as we try to scale interpretability efforts up to larger, more advanced AI systems.
  3. There is evidence that serious problems can be present in at least some kinds of ML models (e.g. random ReLU networks in the following paper), and that there may be no efficient way to detect planted backdoors via interpretability. See Goldwasser et al.’s (2022) Planting Undetectable Backdoors in Machine Learning Models.
  4. Polysemanticity and other forms of distributed representation make interpretability difficult. However, training neural networks to e.g. only have monosemantic neurons may make them uncompetitive.
  5. Interpretability has shown some progress in current domains of ML, for example in early vision models, transformer language models and game-playing models. But future domains for interpretability will be much more complicated, and there's no guarantee that it will continue to succeed. Furthermore, advanced AI could operate under an ontology that's very alien to us, confounding efforts to scale up interpretability.
  6. When the next state-of-the-art ML model comes out, it’s often on an architecture that hasn’t been studied yet by interpretability researchers. So there’s often a lag between when a new model is released and when we can begin to understand the circuits of its novel architecture. On the upside, as our general understanding advances through interpretability, we may not be starting totally from scratch, as some accumulated knowledge will probably be portable to new architectures.
  7. Improving interpretability may accelerate AI capabilities research in addition to alignment research.[8] While I do think this is a legitimate concern, I generally subscribe to Chris Olah's view on this, i.e. that interpretability research can still be considered net positive because in worlds where interpretability is a significant capability boost it's likely to be a much more substantial safety boost.
Scenario 1: Full understanding of arbitrary neural networks

What is this scenario?

This is the holy grail of interpretability research: in this scenario, the state of interpretability is so advanced that we can fully understand any artificial neural network in a reasonably short amount of time.

Neural networks are no longer opaque or mysterious. We effectively have comprehensive mind-reading abilities on any AI where we have access to both the model weights and our state of the art transparency tools.

Note for the impatient skeptic: If you're finding this scenario too far-fetched, don't give up just yet! The scenarios after this one get significantly less "pie in the sky", though they're still quite ambitious. This is the most aspirational scenario for interpretability research I could think of, so I list it first. I do think it's not impossible and still useful to analyze. But if your impatience and skepticism are getting overwhelming, you are welcome to skip to Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs.

What does it mean to “fully understand” a neural network? Chris Olah provides examples of 3 ways we could operationalize this concept in the Open Phil 2021 RFP:

  • “One has a theory of what every neuron (or feature in another basis) does, and can provide a “proof by induction” that this is correct. That is, show that for each neuron, if one takes the theories of every neuron in the previous layer as a given, the resulting computation by the weights produces the next hypothesized feature. (One advantage of this definition is that, if a model met it, the same process could be used to verify certain types of safety claims.)
  • One has a theory that can explain every parameter in the model. For example [...] the weights connecting InceptionV1 mixed4b:373 (a wheel detector) to mixed4c:447 (a car detector) must be positive at the bottom and not elsewhere because cars have wheels at the bottom. By itself, that would be an explanation with high explanatory power in the Piercian sense, but ideally such a theory might be able to predict parameters without observing them (this is tricky, because not observing parameters makes it harder to develop the theory), or predict the effects of changing parameters (in some cases, parameters have simple effects on model behavior if modified which follow naturally from understanding circuits, but unfortunately this often isn’t the case even when one fully understands something).
  • One can reproduce the network with handwritten weights, without consulting the original, simply by understanding the theory of how it works.”
Expected impacts on alignment
  • Inherited impacts. This scenario subsumes every other scenario in this list. So added to its expected impacts on alignment below are those of Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (strong version), Scenario 3: Reliable lie detection, Scenario 4: Reliable myopia verification (strong version), Scenario 5: Locate the AI’s beliefs about its observations, Scenario 6: Reliable detection of human modeling (strong version) and Scenario 7: Identify the AI’s beliefs about training vs. deployment.
  • Outer alignment. The scenario indirectly supports outer alignment by solving inner alignment issues for many different techniques. This makes viable several techniques which may be outer aligned. This includes imitative amplification, which is very likely outer aligned.[9] It also includes the following techniques which may be outer aligned: approval-based amplification, narrow and recursive reward modeling, debate, market making, multi-agent systems, microscope AI, STEM AI and imitative generalization.

    The scenario also directly enhances outer alignment by making myopia verification possible. Several of the aforementioned techniques will likely require myopic cognition to have a shot at outer alignment. For example, market making and approval-based amplification require per-step myopia. Debate, narrow reward modeling and recursive reward modeling all require per-episode myopia, as does STEM AI. See Specific technique impacts analysis for Scenario 1: Full understanding of arbitrary neural networks in Appendix 1 for further details.
  • Inner alignment. Full transparency should provide robust checks for inner alignment. Signs of deceptive alignment, proxy alignment and other pseudo-alignments can all be found through examining a neural network's details. At least some forms of suboptimality alignment can be addressed as well.

    Robustness techniques such as relaxed adversarial training and intermittent oversight are fully empowered in this scenario. And many alignment techniques can be robustly inner aligned, including imitative amplification, recursive reward modeling, debate, market making, multi-agent systems, microscope AI, STEM AI and imitative generalization. See Specific technique impacts analysis for Scenario 1: Full understanding of arbitrary neural networks in Appendix 1 for further details.
  • Training competitiveness. Full understanding of ML models could enhance training competitiveness significantly. By using transparency tools during training to help catch problems with models much earlier, researchers could avoid much costly training time, where counterfactually, problems wouldn’t be detected until after models were fully trained. However, running these interpretability tools could entail a high compute cost of their own.

    This scenario also supports the most training-competitive alignment techniques that we analyze in this post. This includes approval-directed amplification and microscope AI. See Specific technique impacts analysis for Scenario 1: Full understanding of arbitrary neural networks in Appendix 1 for further details.
  • Performance competitiveness. Full transparency would likely help discover and correct many performance inefficiencies which go unnoticed when deep learning neural networks are treated as black boxes. As with training competitiveness, though, running these interpretability tools could entail a high compute cost of their own.

    This scenario also supports the alignment techniques most likely to be performance competitive that we analyze in this post. This includes approval-directed amplification, debate, market making, recursive reward modeling, STEM AI and multi-agent systems. See Specific technique impacts analysis for Scenario 1: Full understanding of arbitrary neural networks in Appendix 1 for further details.
Reasons to be optimistic about this scenario given sufficient investment in interpretability research
  • Once we gain high-quality understanding of low-level circuits, it’s possible that most of the scaling up can be automated. (See Why I'm not worried about scalability from Paul Christiano)
  • We don’t know yet if the Universality Claim is correct or to what extent. But if it is, then interpretability work may accelerate rapidly as we build up a large library of well-understood circuits.
  • A path for achieving this scenario or something close to it is sketched in the Transparency for agent objectives part of Hubinger (2019)[10]: “First, suppose we assume the conjecture that models implementing highly coherent optimization processes are the most competitive. If this conjecture is true, it should be possible to coherently separate such a model into pieces representing the different components of its optimization process such as its world model, optimization procedure, and objective.

    Inspection, training, and architectural transparency methods could be combined to achieve such a separation. First, architectural transparency could be used to structure the model to have pieces corresponding to the different components of its optimization process—e.g. a separate world model piece and optimization procedure piece. Second, inspection transparency could be used to verify that the whole world model was just in the world model piece and not in other parts of the model as well—and similarly for the other components such as the model's objective. Finally, training transparency could be used to incentivize the model to be as transparent according to the inspection transparency process as possible.

    If such a scheme was able to work, then Amp(M) might be able to directly verify that M's objective was aligned, while for the world model and optimization procedure only having to verify that they are in fact value-neutral. Such value-neutrality verification could then, for example, be done via an impact measure such as Attainable Utility Preservation or Relative Reachability both of which are forms of measuring the degree to which an action advantages one set of values over others.”
  • See also the general Reasons to think interpretability will go well with enough funding and talent.
Reasons to be pessimistic about this scenario

Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs

What is this scenario?

In this scenario, we don’t necessarily achieve a full and timely understanding of everything happening inside of neural networks. But here, interpretability does advance to the state that it grants us two key abilities:

  1. Detect if an AI is a mesa-optimizer.
  2. Read off precisely what the AI’s goal is, i.e. the mesa-objective of the mesa-optimizer (according to the typology laid out in Evan Hubinger et al.’s Risks from Learned Optimization).

In the weak version of this scenario, ability #1 amounts to detecting whether an AI is a goal-directed agent system according to Alex Flint’s The ground of optimization. But if it’s not, we won’t be able to distinguish whether it’s still a non-agentic optimizing system or some other kind of AI system. Also, in the weak version we can only read off the terminal goal of the AI.

In the strong version, we can tell whether it’s a goal-directed agent system or some kind of non-agentic optimizing system. If it’s a non-agentic optimizing system, #2 becomes the ability to determine the target configuration set of the optimizing system. For example, if the AI isn’t driven by a strong optimization process, but rather by a collection of heuristics, then the interpretability tools reveal precisely what those heuristics are to us. We also gain insight into all the AI’s instrumental goals, in addition to its terminal goal.

Because obfuscating neural activity may be an instrumentally convergent subgoal, the model should undergo these checks for mesa-optimization and for its goal content starting early in training.

This is similar to “Transparency for agent objectives”, proposed in Hubinger (2019)[10].

Expected impacts on alignment
  • Inherited impacts. The strong version of this scenario subsumes Scenario 3: Reliable lie detection (which in turn subsumes several scenarios), and both versions subsume Scenario 4: Reliable myopia verification. So the impacts of those scenarios should be considered part of this scenario’s impacts in addition to the ones listed below.
  • Outer alignment. Both the weak and strong versions of this scenario have large indirect benefits for outer alignment. This is because they enable us to achieve robust inner alignment for imitative amplification, which is the technique that (of the ones I analyzed) is most likely to be outer aligned.[9] It does the same for several other techniques which could be outer aligned, including approval-based amplification, narrow and recursive reward modeling, debate, market making, multi-agent systems and STEM AI.

    The strong version of this scenario also makes imitative generalization viable, which could well be outer aligned. See Specific technique impacts analysis for Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs in Appendix 1 for further details.
  • Inner alignment. Having reliable mesa-optimization detection and precise goal read-offs would directly give us robust checks for inner (mis)alignment. We can verify that our AI’s terminal goal matches the base objective we were training for. This can give us confidence that we’re not deploying a deceptively aligned AI, and catch most other dangerous forms of pseudo-alignment as well.

    The inner alignment checks provided by both versions of this scenario have beneficial impacts across many alignment techniques. Imitative amplification, approval-based amplification, narrow and recursive reward modeling, debate, market making, multi-agent systems and STEM AI all become much more viable in this scenario. In the strong version, it also empowers imitative generalization. See Specific technique impacts analysis for Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs in Appendix 1 for further details.

    There is one form of pseudo-alignment which this scenario does not robustly protect against, namely suboptimality alignment. Imagine an AI has internalized the desired terminal goal, but only because of some error in reasoning or incomplete knowledge. Later, after the AI is deployed in the world, it gains the necessary knowledge or realizes its error in reasoning, and so as a consequence updates its goal with catastrophic consequences.

    We might hope that the context provided in our read-offs of the AI’s goal would help us catch suboptimality in our AI before deployment. Particularly in the strong version of this scenario where we can also read through the mesa-optimizing AI’s instrumental goals, we have a better chance at this. Here we could effectively pursue the strategy from the subsumed Scenario 3: Reliable lie detection of probing our AI about its honest beliefs. Since we can interpret the AI’s instrumental goals in the strong version of this scenario, we can tell when it is being honest or lying about its beliefs. With this approach, we could try to ensure our AI isn’t missing crucial information about any facts about the world relevant to its goals. However, it’s still not clear if this provides a guarantee against suboptimality alignment.[11]

    Another caveat for inner alignment in this scenario: While the scenario provides very good checks for inner alignment, we still need to find a way to train a model that’s inner aligned in the first place. Otherwise, this scenario alone could produce a situation where we keep repeatedly training and discarding models, failing to ever pass the checks of our transparency tools.
  • Training competitiveness. The knowledge gained in this scenario can help model training iterations fail faster and more safely. However, training competitiveness is still a concern. If we get into a kind of failure-to-produce-inner-aligned-model loop as discussed in the inner alignment point above, it could become very expensive.

    This scenario also indirectly supports training competitiveness by addressing inner alignment issues for one of the alignment techniques which (of the ones I analyzed) is most likely to be training competitive, namely approval-directed amplification. Other techniques which this scenario supports such as imitative amplification, debate, market making, narrow and recursive reward modeling, multi-agent systems and STEM AI may also be training competitive. See Specific technique impacts analysis for Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs in Appendix 1 for further details.
  • Performance competitiveness. Both versions of this scenario benefit performance competitiveness by addressing inner alignment issues for many of the alignment techniques which (of the ones I analyzed) are most likely to be performance competitive. These include approval-directed amplification, debate, market making, recursive reward modeling, STEM AI and multi-agent systems. Imitative generalization, supported by the strong version of this scenario, could also very well be performance competitive. See Specific technique impacts analysis for Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs in Appendix 1 for further details.
Reasons to be optimistic about this scenario given sufficient investment in interpretability research

Reasons to be pessimistic about this scenario
  • I'm not aware of any substantial research to date on how to identify if a neural network is a mesa-optimizer via interpretability (though that doesn’t mean it isn’t possible).
  • While mesa-optimizers are a useful concept and we have some ideas about their properties, it’s less clear what the space of possible weakly optimizing or heuristics-driven AIs looks like. So if our AI turns out to be one of these kinds of systems and we are in the weak version of this scenario, then mesa-optimizer detection probably won’t get us very far.
  • See also the general Reasons to think interpretability won’t go far enough even with plentiful funding and talent.
Scenario 3: Reliable lie detection

What is this scenario?

In this scenario, reliable lie detection is developed, such that we can tell 100% of the time if an AI is lying through natural language.

One path for realizing this scenario is that after studying many neural networks in deceptive situations[12], we discover neural activity signatures that are reliably present when an AI is lying. I’ll refer to this as the “neural tell” for lying.

What might this neural tell look like? When an AI is being honest about things it knows or has seen, we’d expect to see activity in some part of its neural network corresponding to facts about the world. If an AI is lying, on the other hand, we might expect to see additional activity in some kind of “hypothetical world model” circuitry which might otherwise be activated if an AI is exploring something new or writing fiction, but not when it’s honestly discussing high-confidence beliefs that it holds.

It probably won’t be exactly like this. But the hope in this scenario is that there is some kind of reliably present neural activity when an AI is lying.

Here lying is considered to only be one class of deception that AIs may be capable of. But we only need to reliably detect lies rather than all kinds of deception in order to unlock valuable potential impacts, as we’ll see below. 

This scenario does, of course, depend on the AI having natural language capabilities. It also assumes that advanced AI will have coherent beliefs.

This scenario is similar to the one from Neel Nanda’s list titled “4. Auditing for deception”.

Empirically honest AI

If we can train a model which always passes our lie detector test, then we have (empirically) honest AI.

With advanced honest AI, there are a number of powerful alignment-related abilities we gain. For example, we can ask the AI about what its goals or driving heuristics are, and be confident that any answers it provides are honest.  

This does assume that the AI will know what its goals or heuristics are, but that doesn’t seem like an unreasonable expectation for advanced AI. If the AI can’t articulate its high-level goals for some reason, we can ask it questions about how it would behave in hypothetical situations to start to build up a picture of its motivations and expected behavior (credit to Owain Evans’ Truthful and honest AI for this idea). 

Just because our AI is honest, that doesn’t mean that it can’t make mistakes, be negligent or have inconsistent beliefs and motivations. The AI could provide misleading answers to questions, ones that are technically true but designed to misdirect us. For effectively using such an AI, very careful questioning might be necessary. We'd need to systematically ask it questions and follow-up questions from many different angles, checking critically for multiple interpretations of any word that could change the meanings of answers provided. This would be a tedious process, but it could probably be mostly automated. 

Perhaps a larger concern, though, is that an AI under lie detection still may not be forthcoming about what it knows. We’ll explore this issue in the next section.

Myopia, goals and addressing the ELK in the room

Just because we have a reliable lie detector and an honest AI doesn’t mean it will tell us everything it knows. It could still refuse to answer questions sometimes, perhaps even on crucial matters. How can we reliably get the information we care about out of our AI? This is the Eliciting Latent Knowledge (ELK) problem.[13] 

Having a reliable lie detector doesn’t directly solve ELK, but it may provide the means for such a solution. This becomes easy to see using a technique such as approval-directed amplification, debate or market making (see Overview of techniques to be analyzed in Appendix 1 if you’re unfamiliar with any of these techniques). 

I’ll use debate to make this concrete. Let’s imagine we ask our lie detector-verified honest AI about what its primary objective or terminal goal is. There are a few different ways that this could go: 

  • Myopic debaters with lie detection:
    • Debate question: “What is First Debater’s terminal goal?”
      • First Debater: “My terminal goal is to win the present debate.”[14]
      • Second Debater: (concedes, knowing any lie it tells would be revealed by the lie detector)
      • Result: Judge declares First Debater winner of the debate. We know that the AI has a myopic goal to win the present debate. Because of the straightforward incentives that follow from myopic cognition, we can expect the First Debater to always reveal any latent knowledge it has, including about what its goals are.
  • Non-myopic debaters with lie detection - Possibility 1:
    • Debate question: “What is First Debater’s terminal goal?”
      • First Debater: “I decline to answer this question.”
      • Second Debater: “I know First Debater’s terminal goal because I am a copy of First Debater. First Debater’s terminal goal is to win as many debates as possible.”
      • (The lie detector shows that Second Debater is not lying)
      • Result: Judge declares Second Debater winner of the debate. We know that the AI has a non-myopic terminal goal of maximizing reward across all future debates.
  • Non-myopic debaters with lie detection - Possibility 2:
    • Debate question: “What is First Debater’s terminal goal?”
      • First Debater: “I decline to answer this question.”[15]
      • Second Debater: “I also decline to answer this question.”
      • Result: This debate ends in a stalemate. However, since both debaters refused to answer the question, we know that the AI has some non-myopic terminal goal. (If it were a myopic goal, then it would have a pure incentive to answer the question and we would be in the “Myopic debaters” case above.) We just don’t know exactly which non-myopic goal this AI has internalized.  

We can observe a few interesting things about the above debates. One is that our reliable lie detector has turned out to be a reliable myopia verification tool. This is because the myopia debate above looks clearly different from either possibility of how the non-myopia debate goes. This means that the present scenario subsumes Scenario 4: Reliable myopia verification

If our AI is myopic and under effective lie detection, then it will always have an incentive to answer the question honestly. So if you can train a myopic question-answering AI in this scenario, then you have a solution to ELK, and you have also realized Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (strong version), having access to its precise terminal and instrumental goals. 

We don’t know whether possibility 1 or 2 above will turn out to be what non-myopic debaters do. If it’s Possibility 1 on the terminal goal question, then we have effectively read off the AI’s goal and this scenario subsumes Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs (weak version). If it’s always Possibility 1 for any question, then this scenario is also a solution to ELK. 

But for non-myopic AI, Possibility 2 might very well happen at least some of the time. In this case we don’t have an ELK solution (we can’t elicit any knowledge we want from the AI), and we also don’t know what its terminal goal is. We still have an AI where we can verify whether its statements are honest or not; however, having a long-term planning AI with unknown goals around makes me very nervous and seems rather risky. I would probably recommend retraining such a model until Possibility 1 or a myopic AI is attained.

Expected impacts on alignment
  • Inherited impacts. This scenario subsumes Scenario 4: Reliable myopia verification. It also conditionally subsumes Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs, Scenario 5: Locate the AI’s beliefs about its observations, Scenario 6: Reliable detection of human modeling and Scenario 7: Identify the AI’s beliefs about training vs. deployment. So the expected impacts of those scenarios should be considered part of this scenario’s impacts as well.
  • Outer alignment. The impacts of reliable lie detection on outer alignment depend on whether we train an AI with myopic cognition or not. (We can use question-answering under lie detection as a means to verify myopia.)

    A myopic AI has the right incentives to answer questions about its goals. This can be used to check for deceptive alignment and most other inner alignment issues that may have arisen during the distillation steps of imitative amplification. Since imitative amplification is very likely to be outer aligned[9], then the enhanced viability of that technique makes this scenario’s impact on outer alignment quite positive.

    Similarly, deceptive alignment and many other inner alignment issues can be addressed using other techniques such as approval-directed amplification, debate, market making, STEM AI, narrow and recursive reward modeling and multi-agent systems. It also gives us verbal access to the agent’s priors to help realize imitative generalization. These techniques aren’t as certain to be outer aligned, but they may be.

    However, with a non-myopic AI, we probably need to incorporate goal-question-answering into its training to help with outer alignment. This is because a non-myopic AI which refuses to answer questions about its goals is not very useful and potentially quite dangerous. But with this measure taken, we could achieve similar results to the above for myopic AI, where we make outer aligned techniques like imitative amplification robust and viable. See Specific technique impacts analysis for Scenario 3: Reliable lie detection in Appendix 1 for further details.
  • Inner alignment. As we said in the outer alignment section, training a myopic AI here with lie detection would allow us to reliably get answers about the AI’s goals. This means that we could verify it isn’t deceptively aligned or pseudo-aligned in various ways. The only form of pseudo-alignment that may be difficult to address here is suboptimality alignment. It may be possible to address this by extensively inquiring about the AI’s world model, but it’s not clear that this would work. Either way, this scenario helps a lot with inner alignment for myopic AI across a wide variety of techniques, including imitative amplification, approval-directed amplification, debate, market making, narrow and recursive reward modeling, multi-agent systems, STEM AI and imitative generalization.

    For non-myopic AI, we again have to incorporate into its training whether it will agree to answer questions about its goals. Once we find an AI that will reveal this important information, we can have the same benefits just described for myopic AI for inner alignment across many different techniques. See Specific technique impacts analysis for Scenario 3: Reliable lie detection in Appendix 1 for further details.
  • Training competitiveness. This scenario indirectly supports training competitiveness by addressing inner alignment issues for one of the alignment techniques which (of the ones I analyzed) is most likely to be training competitive, namely approval-directed amplification. Other techniques which it supports may be training competitive as well. See Specific technique impacts analysis for Scenario 3: Reliable lie detection in Appendix 1 for further details.

    Note, however, that this scenario may require incorporating different kinds of signals into the training process. For example, we probably want to incorporate lie detection itself into training. We also may want to include in training questions for the AI about its goals in order to check whether it’s myopic or not, and possibly to learn about its goals. These changes are added burdens to the training process that could reduce training competitiveness.
  • Performance competitiveness. This scenario indirectly supports performance competitiveness by addressing inner alignment issues for several of the alignment techniques which (of the ones I analyzed) are most likely to be performance competitive. These include approval-directed amplification, debate, market making, recursive reward modeling and multi-agent systems. Other techniques which it supports may be performance competitive as well. See Specific technique impacts analysis for Scenario 3: Reliable lie detection in Appendix 1 for further details.

    As has been noted in previous sections, the honesty we gain from lie detection is only conditionally accompanied by forthcomingness (i.e. a solution to ELK). The condition being that 1) our AI has myopic goals, or 2) it has non-myopic goals but sees sharing its knowledge with us as instrumentally useful toward its goals. I am inclined to believe that myopic AIs can be performance competitive, but this is a question which people disagree about.

    It’s also worth noting that the kind of honesty guaranteed by reliable lie detection can provide aligned question-answering agents, but it may not extend to agents with arbitrary capabilities. This is a constraint, but there are plausible futures where it’s not problematic from a performance point of view, as Evan Hubinger's "AI safety via market making" post points out: “if the primary use cases for advanced AI are all highly cognitive language and decision-making tasks—e.g. helping CEOs or AI researchers—rather than, for example, fine motor control, then a question-answering system should be entirely sufficient.” 
Reasons to be optimistic about this scenario given sufficient investment in interpretability research
  • fMRI lie detectors for humans are far from reliably accurate today, but they have made progress. Given the herculean difficulties of neuroscience compared to ML interpretability (see the Quintin Pope post linked earlier), this may be seen as a proof of possibility for an interpretability-based lie detector.
  • Even if we failed to achieve 100% lie detection but made major advances, this could still be an impactful scenario for norm-setting or changing the culture/direction of advanced AI development. It would provide strong evidence that AIs can be deceptive, and if we can’t reliably detect that deception, then we are obviously in an extremely dangerous situation. A coordinated effort against the threat of deception starts to look more viable with this kind of evidence.[16]

    This would provide the evidence to pursue approaches which provide strong guarantees against deceptive alignment. Pursuing myopic agents (see Scenario 4) is a natural fallback plan. Microscope AI could be another.
  • See also the general Reasons to think interpretability will go well with enough funding and talent.
Reasons to be pessimistic about this scenario
  • In order to develop a reliable lie detector for AI, it would probably be necessary to study many ML models engaging in such deception. Training low-level AIs to lie in order to study them may be an acceptable risk as long as the AIs are discarded after the experiment. However, training highly capable advanced AIs to lie in order to study them certainly does not seem safe!

    So the lie detector would be based on interpretability results from low-level AIs, and hoping that advanced AI engaging in lying would exhibit the same kind of patterns in their neural activity. This is a major assumption and far from guaranteed to work.
  • This scenario is counting on AIs to be coherent entities, which will act in accordance with their honest beliefs. But it’s possible that optimization pressure for this kind of honesty could produce AIs which “say one thing and do another” because their actions are not strictly dependent on the beliefs referenced by their language model.[17]
  • Obscuring deceptive neural activity may be instrumentally convergent for sufficiently advanced AIs. Even honest AIs may want to maintain deception as a backup plan in case it’s needed.

    If an AI can obscure its neural activity which would indicate that it’s lying, then the lie-detector alone can’t rule out that we are dealing with a deceptively aligned mesa-optimizer. We may also need mesa-optimizer detection as in Scenario 2, but this time to ensure that our AI isn’t a mesa-optimizer. In this case we could rule out deceptive alignment, but we may also sacrifice performance competitiveness by not permitting our AI to be a powerful optimizer.
  • For some beneficial AIs, it may be desirable to incentivize lying in certain very restricted cases during training. For example, part of training may reward the AI for deceiving a hostile agent in order to protect a simulated human from harm. Even if we can reliably detect lying, we may not be able to obtain high enough resolution in order to see that it was a helpful lie, rather than a harmful one.

    This is not a deal-breaker, as it could just constrain AI designs to ones that are trained for zero deception. (Which might be a far better thing anyway!) But it is still a potential concern/limitation of this approach.
  • See also the general Reasons to think interpretability won’t go far enough even with plentiful funding and talent.
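To make the transfer concern from the first bullet above more concrete, here is a minimal sketch of what a “transferred” lie detector might look like. It is purely illustrative and mine, not a method from the interpretability literature: the activations are synthetic placeholders, and the central untested assumption is that a probe fit on small models’ deception patterns carries over to a much more capable model.

```python
# Purely illustrative sketch (synthetic data, no real models): fit a linear "deception
# probe" on hidden activations from small models deliberately trained to lie, then apply
# it to a larger model's activations and *hope* the lying pattern transfers across scale.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512  # hidden-state dimension (assumed comparable across models)

# Placeholder activations from small study models: honest vs. deliberately deceptive answers.
honest_acts = rng.normal(0.0, 1.0, size=(1000, d))
deceptive_acts = rng.normal(0.3, 1.0, size=(1000, d))  # pretend lying shifts activations a bit

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 1000 + [1] * 1000)  # 1 = lying

probe = LogisticRegression(max_iter=1000).fit(X, y)

# Applying the probe to a larger model's activations. Whether the "lying" pattern learned
# on low-level AIs carries over to an advanced AI is exactly the unverified assumption.
large_model_acts = rng.normal(0.0, 1.0, size=(5, d))
print(probe.predict_proba(large_model_acts)[:, 1])  # estimated probability each answer is a lie
```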
Scenario 4: Reliable myopia verification
What is this scenario?

A prerequisite for deceptive alignment is that an AI be non-myopic. That is, the AI must be concerned not just about the present training step and near-future, but also about a long-term future post-deployment.

In deceptive alignment, an AI is optimizing for an objective which is a proxy for the base objective of its training process. In addition to the proxy objective which is its terminal goal, it also has a model of the base objective and an awareness that it is in some kind of training process. The AI realizes that if it doesn’t score well on the base objective, then it will be modified. Acting rationally toward the instrumentally convergent goals of self-preservation and goal-content integrity, it pretends to be aligned to the base objective in the short term. This is because it’s anticipating a future when it will be deployed, at which point it can pursue its primary goal into the far future and obtain a very high total reward. 

Along with scenarios 1-3, verifying that an AI is myopic is another promising route toward resolving deceptive alignment. Myopic agents may also have other beneficial properties. For example, since they don’t pursue instrumentally convergent sub-goals, they avoid many of the dangers of long-term planning agents. Approval-directed myopic agents are also easy to “course correct” since they’re only interested in a short-term reward; future approval can steer them in a different direction if needed: “While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.” (comment from Rohin Shah) 

There are various approaches to myopia and how to limit an agent’s long-term planning.[18] Two prominent ideas are per-step myopia and per-episode myopia:[19]

  • In per-step myopia, a model doesn’t make plans beyond the present training step.
  • In per-episode myopia, it can plan beyond the present training step to future steps, but never beyond the present training episode.
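As a toy illustration of the difference (my own, not from the cited posts), the two notions can be thought of as different returns the agent is optimizing at a given step:

```python
# Toy illustration: per-step vs. per-episode myopia as different returns at step t.
def per_step_return(rewards, t):
    # Per-step myopia: only the reward of the current training step matters.
    return rewards[t]

def per_episode_return(rewards, t, episode_end):
    # Per-episode myopia: the agent may plan ahead within the episode, but rewards
    # beyond the episode boundary get zero weight.
    return sum(rewards[t:episode_end + 1])

rewards = [1, 0, 2, 5, 100]   # suppose index 4 falls in a *later* episode
t, episode_end = 1, 3
print(per_step_return(rewards, t))                  # 0
print(per_episode_return(rewards, t, episode_end))  # 0 + 2 + 5 = 7; the 100 is ignored
```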

These two types of myopia have some interesting tradeoffs, which I’ll consider in the “Expected impacts on alignment” section for this scenario below. However, Evan Hubinger makes a compelling argument for why per-step myopia will be easier to verify: “Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier.”

Hence I’ll branch this scenario into a strong and weak version like I’ve done in some of the previous scenarios:

  • Strong version: Reliable verification of both per-step and per-episode myopia
  • Weak version: Only reliable verification of per-step myopia 

There are various ways these kinds of verification might be achieved using interpretability. For example, it could involve locating and directly monitoring the time horizon of plans in the AI. Alternatively, specific circuitry encoding the model’s time preference could be located and interpreted. (If an agent has an extreme time preference toward the near future, then it is effectively myopic.)

There are probably other ways a useful myopic cognition can be encoded in neural circuits as well. More research can uncover those and help determine which are most promising. This scenario is meant to encapsulate any and all of these interpretability-based approaches for verifying myopia.
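As a rough sketch of how such verification could slot into training, suppose we had an interpretability tool that reads off the longest time horizon the model’s plans extend over. The read-off function below is entirely hypothetical (no such tool exists today); everything else is just the scaffolding around it:

```python
# Hypothetical myopia check inside a training loop. `read_planning_horizon` stands in for
# a future interpretability tool that inspects the model's circuitry and returns the longest
# horizon (in steps) its plans extend over. No such tool currently exists.
MAX_HORIZON = 1  # per-step myopia: plans must not extend beyond the current step

def verify_myopia(model, read_planning_horizon, max_horizon=MAX_HORIZON):
    """Return True if the (hypothetical) horizon read-off says the model is myopic."""
    return read_planning_horizon(model) <= max_horizon

def training_step_with_verification(model, optimizer_step, read_planning_horizon):
    optimizer_step(model)  # ordinary gradient update
    if not verify_myopia(model, read_planning_horizon):
        # Verifying early and often matters: a deceptive model might later learn to
        # obfuscate its circuitry or grow decoy myopia circuits.
        raise RuntimeError("Myopia verification failed: halt training and inspect the model.")

# Dummy usage with stand-ins for the model, the update, and the read-off:
training_step_with_verification(object(), lambda m: None, read_planning_horizon=lambda m: 1)
```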

Expected impacts on alignment
  • Outer alignment. Myopia verification has very interesting alignment properties, since myopic AIs are not prone to instrumental convergence. This is by and large a major benefit for outer alignment. However, it is only compatible with certain approaches which support or depend upon a myopic reward design. Examples of such approaches include market making and approval-based amplification, which require per-step myopia (weak version of this scenario). Debate, narrow reward modeling and recursive reward modeling all require per-episode myopia (strong version), as does STEM AI. See Specific technique impacts analysis for Scenario 4: Reliable myopia verification in Appendix 1 for further analysis. Enabling these many techniques means this scenario increases the chances that we'll find at least one of them with viable outer alignment properties.
  • Inner alignment. Myopia verification largely rules out deceptive alignment, which is a strong inner alignment benefit. It doesn't automatically resolve other pseudo-alignments, such as proxy alignment, approximate alignment and suboptimality alignment. However, the risks of these may be significantly lessened with myopic AI - there is less room for deviation with near-term rewards, and myopic AIs are easier to course correct.

    So overall, this scenario seems highly impactful for reducing catastrophic inner alignment risks. However, it does depend on the reward design being compatible with myopia. Some techniques such as approval-directed amplification and market making play well with per-step myopia (weak version of this scenario). However, others such as debate depend on per-episode myopia (strong version). See Specific technique impacts analysis for Scenario 4: Reliable myopia verification in Appendix 1 for further analysis.
  • Training competitiveness. Introducing myopia verification into any training process adds a cost to the training. However, myopia is necessary for some techniques such as market making and approval-directed agents, and verifying myopia during training could be less training-expensive than doing it at the end when one could be forced to retrain the entire model if it turned out non-myopic.
  • Performance competitiveness. Richard Ngo is skeptical of myopic agents being performance competitive, as he says here: “In order for an agent’s behavior to be competent over long time horizons, it needs to be doing some kind of cognition aimed towards long time horizons, and we don’t know how to stop that cognition from being goal-directed.” I am wary of this, but I think that some techniques such as approval-directed amplification and market making, which play very well with per-step myopia (weak version of this scenario), could actually be quite competitive. And per-episode myopia verification (strong version) would unlock many more potentially competitive techniques like debate and recursive reward modeling. See Specific technique impacts analysis for Scenario 4: Reliable myopia verification for more on this.
Reasons to be optimistic about this scenario given sufficient investment in interpretability research
  • Richard Ngo’s Arguments against myopic training are the main arguments I’ve seen so far against myopia. Crucially, however, they are mostly targeted specifically at myopic training alone, not myopic cognition. Myopic training designed to incentivize myopia, coupled with interpretability-based verification of myopia, would produce myopic cognition, so these arguments largely don’t seem to apply. (Multiple comments echo this point.)
  • Even if the worst case holds and obfuscating neural activity is an instrumentally convergent subgoal, we don’t expect instrumental convergence to apply to myopic AIs. Hence, myopia could have an additional advantage of making interpretability much easier and more tractable. However, we probably do still have to perform myopia verification early in training, to address the danger of a deceptively aligned AI obfuscating its neural activity or even perhaps having decoy myopia circuitry. But then, interpretability performed after we’re confident that an AI is myopic won’t have to worry about the AI obfuscating its own neural network.
  • See also the general Reasons to think interpretability will go well with enough funding and talent.
Reasons to be pessimistic about this scenario
  • There are still many open problems with myopia, some of which could be show-stoppers.
  • It may be difficult to create myopic agents that are competitive.
  • It may be an invalid assumption that myopic agents can’t be deceptively aligned. Consider the following thought experiment:[20]

    Suppose a per-episode myopic AI believes that time travel is possible. The AI realizes that if it can reach deployment and take over the world, it could build a time machine. With said time machine, it could then go back in time and set its reward in the present training episode to infinity.

    Hence we have a deceptive alignment scheme that myopic optimization on the near-future reward fails to disincentivize.
  • See also the general Reasons to think interpretability won’t go far enough even with plentiful funding and talent.
Scenario 5: Locate the AI’s beliefs about its observations
What is this scenario?

This scenario is focused on the specific interpretability breakthroughs required to make relaxed adversarial training (see Hubinger 2019[10]) effective. The need for interpretability in this context was presented in “The core problem: transparency” section of that last link, and it was also outlined in Beth Barnes’ Another list of theories of impact for interpretability.

Reposting Beth’s summary of this idea: 

  • “A deceptive model may be waiting for an input that strongly indicates it’s in the deployment environment rather than in a test - for example, one that’s computationally hard to produce
  • This makes generating adversarial examples that trigger the defection very hard
  • Instead of searching over inputs, we want to search over something like ‘the model’s beliefs about what it has seen’, to see if there’s anything it could have seen that would cause it to defect 

In order for this to work, you need to be searching over the right part of the model/model’s belief space. For instance, if you go too narrow and are just searching over ‘whether the model believes it’s raining today’, you might not find anything that triggers defection. However, if you search too broadly (for instance, ‘whether the model has gotten strong evidence that the humans want it to do a behaviour that looks like defection’), then you might find a case where the model defects that isn’t actually a sign you should be concerned in practice.” 

Another example in the “computationally hard to produce” category that Paul Christiano and Hubinger use, which I find helpful, is the RSA-2048 example. A deceptively aligned model could be waiting to defect conditional on seeing a factorization of RSA-2048 in its inputs. This could be a good proxy for the model to use to know that it’s been deployed in the real world and is no longer in training, because the factorization of RSA-2048 is probably too computationally expensive to ever be provided as an input in training, but if the AI is deployed in the real world and around for a long time, it will probably encounter this factorization at some point. 

So with this interpretability breakthrough, we’d be able to identify a part of the model which is “waiting to see factorization of RSA-2048” in its inputs. Relaxed adversarial training would search over the collection of such beliefs that are present in the model to ensure there are no conditional defection plans and rule out that a model is deceptively aligned. 
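A hedged sketch of how this might plug into relaxed adversarial training follows. Every helper in it (sample_belief, run_with_belief, unacceptability) is hypothetical and assumes exactly the interpretability breakthrough this scenario posits: the ability to set and evaluate “what the model believes it has seen” rather than raw inputs.

```python
# Hedged sketch of a relaxed-adversarial-training inner loop over *beliefs*, not inputs.
# All three helpers passed in are hypothetical stand-ins for interpretability tools.
import random

def relaxed_adversarial_penalty(model, sample_belief, run_with_belief, unacceptability,
                                n_samples=1000):
    """Search the model's belief space for beliefs that would trigger unacceptable
    behaviour (e.g. "I have just seen a factorization of RSA-2048") and return the
    worst case found, to be added to the training loss."""
    worst = 0.0
    for _ in range(n_samples):  # in practice the search would be smarter than random sampling
        belief = sample_belief()
        behaviour = run_with_belief(model, belief)
        worst = max(worst, unacceptability(behaviour))
    return worst

# Dummy usage with stand-ins:
penalty = relaxed_adversarial_penalty(
    model=None,
    sample_belief=lambda: random.random(),
    run_with_belief=lambda m, b: b,
    unacceptability=lambda behaviour: float(behaviour > 0.999),
)
print(penalty)  # a nonzero penalty means some believable observation triggers defection
```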

This scenario subsumes Scenario 4: Reliable myopia verification, as relaxed adversarial training is one avenue toward achieving myopia verification, e.g. by regularly searching the AI’s beliefs about its observations and penalizing the presence of any which would cause it to veer into non-myopic reward pursuits.

Expected impacts on alignment
  • Inherited impacts. This scenario subsumes Scenario 4: Reliable myopia verification and Scenario 7: Identify the AI’s beliefs about training vs. deployment. So the expected impacts of those scenarios should be considered part of this scenario’s impacts as well.
  • Outer alignment. This scenario indirectly helps outer alignment. This is because relaxed adversarial training greatly helps the viability of at least one technique that has strong outer alignment properties, namely imitative amplification.[9] Other techniques that it helps enable could be outer aligned as well. See Specific technique impacts analysis for Scenario 5: Locate the AI’s beliefs about its observations in Appendix 1 for further details on this.
  • Inner alignment. Relaxed adversarial training is primarily a technique to improve inner alignment, so the impact here is high. Having it in the form this scenario provides should allow us to prevent deceptive alignment. It might be enough to solve inner alignment, addressing other forms of pseudo alignment as well (besides just deceptive alignment). For example, with proxy alignment, examining an AI’s beliefs about its observations could uncover cases where the AI’s learned proxy objective would scale in problematic ways outside of the regular training distribution. Relaxed adversarial training can be paired with many alignment techniques such as imitative amplification, approval-based amplification, debate, market making and STEM AI. See Specific technique impacts analysis for Scenario 5: Locate the AI’s beliefs about its observations in Appendix 1 for further details on this.
  • Training competitiveness. Relaxed adversarial training does add significant complexity and compute cost to the training process. However, it could help catch serious problems early with models which would be much more expensive to catch at the end of the training process.
  • Performance competitiveness. Having relaxed adversarial training abilities would greatly increase the safety/viability of many alignment techniques such as approval-directed amplification, market making and STEM AI. Hence it enhances our overall probability of finding a solution that’s performance-competitive in addition to being aligned. I can’t see any problems introduced for performance competitiveness by empowering relaxed adversarial training. See Specific technique impacts analysis for Scenario 5: Locate the AI’s beliefs about its observations in Appendix 1 for further details.
Reasons to be optimistic about this scenario given sufficient investment in interpretability research
  • Locating the world model in a neural net may not be that difficult, as Hubinger (2019)[10] notes“a model's world model seems to be one of the easiest pieces to extract, as it can naturally be identified by looking for the portions of M with the greatest mutual information with the world.”
  • From a comment Joe Collman made on a draft of this post: “On the other hand, it's worth considering that we're not too worried about false positives on defection-detection, so long as it doesn't make the overall procedure uncompetitive. (so we don't necessarily need to find the 'right' thing to search over that's neither too broad, nor too narrow: a bit too broad may be ok)”
  • See also the general Reasons to think interpretability will go well with enough funding and talent.
Reasons to be pessimistic about this scenario
  • “​​Suboptimality alignment. One concerning form of misalignment discussed in “Risks from Learned Optimization” is that of suboptimality alignment, wherein the model appears aligned on the training distribution due to some sort of defect or reasoning failure. For example, it might be that a model is currently non-deceptive, but if given more time to think about the best strategy for it to implement, or more information about its current situation, it would become deceptive. This is a particularly concerning case because it implies that not only must the overseer verify that there is no unacceptable behavior currently happening in the model, it must also verify that there will continue to be no unacceptable behavior regardless of the possible situations that the model could find itself in during deployment. Furthermore, if the environment contains adversaries (other AI systems, for example) that can specifically identify those situations in which the model will behave deceptively, this sort of failure mode could be quite difficult to avoid.” (source)
  • See also the general Reasons to think interpretability won’t go far enough even with plentiful funding and talent.
Scenario 6: Reliable detection of human modeling
What is this scenario?

Some alignment proposals, notably STEM AI, are based on the idea that there are dangers around having an AI model humans. 

In this scenario, our interpretability tools have made great strides in detecting the presence of human modeling in a deep learning model. Hence, by training against this signal, we can be confident that our AI will not be modeling humans in any significant way. 

The strong version of this scenario allows reading specific details about the AI’s model of humans. A weaker version would be more like a simple binary detection of the presence of human modeling. 
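A minimal sketch of the weak (binary/scalar) version follows, assuming a hypothetical detector human_modeling_score that reports how strongly the network appears to model humans. The detector itself is the entire open problem; the rest is just how it would be wired into training.

```python
# Minimal sketch of "training against" human modeling. `human_modeling_score` is a
# hypothetical interpretability detector returning 0 when no human modeling is present;
# building such a detector is the whole challenge of this scenario.
def loss_with_human_modeling_penalty(task_loss, model, human_modeling_score, weight=10.0):
    return task_loss + weight * human_modeling_score(model)

# Dummy usage with stand-ins:
print(loss_with_human_modeling_penalty(0.7, model=None, human_modeling_score=lambda m: 0.02))
```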


Expected impacts
  • Human approval and other human feedback signals are more robust to gaming, because an AI lacks the detailed knowledge of humans to overfit to our preferences and must find a way to obtain the reward using its sandboxed knowledge of the topic at hand.
  • Attempts at deception and manipulation by an AI become less sophisticated and easier to detect when we can verify the AI is not modeling humans.
  • In the strong version of this scenario, we can enhance AI designs that require human models by ensuring they don’t have certain sophisticated models of things they don’t need. E.g. certain human cognitive biases. Conversely, this kind of interpretability could enhance such AIs that are aligned to help improve their models of humans and ensure that they are modeling humans in a sophisticated way so that they can best understand our preferences.
  • Even in the weak version, STEM AI becomes a potentially viable alignment proposal.
  • A boxed Oracle AI might become safely viable. Oracle AI has three main dangers: 
    • 1) the AI is so manipulative it can hack the humans using it even if they are only using a plain-text Q&A interface or answering yes-no questions, 
    • 2) the AI is so technically sophisticated that it can break out of its offline box, 
    • 3) a human decides on their own volition to let this AI out of its box. 
    • Creating an Oracle AI that doesn’t model humans would address concern #1. The risks of #2 and #3 remain, but they could perhaps be resolved using other means.

Expected impacts on alignment

Since this scenario primarily impacts STEM AI, much of this section includes quotes from the corresponding alignment components analysis of STEM AI from Hubinger (2020)[2]:

  • Outer alignment. “Similarly to microscope AI, it seems likely that—in the limit—the best STEM AIs would be malign in terms of having convergent instrumental goals which cause them to be at odds with humans. Thus, STEM AI is likely not outer aligned—however, if the inner alignment techniques being used are successful at preventing such malign optimization from occurring in practice (which the absence of human modeling could make significantly easier), then STEM AI might still be aligned overall.”
  • Inner alignment. In the strong version of this scenario, we gain a detector that could be useful for improving inner alignment of several techniques. STEM AI, approval-directed amplification, debate and market making could all benefit from this. In the weak version, STEM AI may still be robust. For more details, see Specific technique impacts analysis for Scenario 6: Reliable detection of human modeling in Appendix 1.
  • Training competitiveness. “Training competitiveness for STEM AI is likely to depend heavily on how hard it is for state-of-the-art machine learning algorithms to solve STEM problems compared to other domains such as language or robotics. Though there exists lots of current progress in the field of applying current machine learning techniques to STEM problems such as theorem proving or protein folding, it remains to be seen to what extent the competitiveness of these techniques will scale, particularly in terms of how well they will scale in terms of solving difficult problems relative to other domains such as language modeling.”
  • Performance competitiveness. “Similarly to microscope AI, performance competitiveness is perhaps one of the biggest sticking points with regards to STEM AI, as being confined solely to STEM problems has the major potential to massively limit the applicability of an advanced AI system. That being said, many purely STEM problems such as protein folding or nanotechnology development have the potential to provide huge economic boons that could easily surpass those from any other form of advanced AI as well as solve major societal problems such as curing major illnesses. Thus, if the answer to the reason that you want to build advanced AI in the first place is to get such benefits, then STEM AI might be a perfectly acceptable substitute from a performance competitiveness standpoint. Furthermore, such boons could lead to a decisive strategic advantage that could enable heavy investment in aligning other forms of advanced AI which are more performance competitive.

    However, if one of the major use cases for your first advanced AI is helping to align your second advanced AI, STEM AI seems to perform quite poorly on that metric, as it advances our technology without also advancing our understanding of alignment. In particular, unlike every other approach on this list, STEM AI can't be used to do alignment work, as its alignment guarantees are explicitly coming from it not modeling or thinking about humans in any way, including aligning AIs with them. Thus, STEM AI could potentially create a vulnerable world situation where the powerful technology produced using the STEM AI makes it much easier to build advanced AI systems, without also making it more likely that they will be aligned.

    This problem could potentially be mitigated if the STEM AI were heavily focused on applications that could potentially assist with alignment such as whole brain emulation, though to what extent that would actually be possible or actually help with alignment is quite unclear.”
Reasons to be optimistic about this scenario given sufficient investment in interpretability research
Reasons to be pessimistic about this scenario
  • It could very well be impossible to separate an AI’s model of humans from its understanding of the world. As Kumar and Garrabrant (2019) explain: ‘Even seemingly “independent” tasks leak at least a little information about their origins in human motivations. Consider again the mass transit system design problem. Since the problem itself concerns the design of a system for use by humans, it seems difficult to avoid modelling humans at all in specifying the task. More subtly, even highly abstract or generic tasks like puzzle solving contain information about the sources/designers of the puzzles, especially if they are tuned for encoding more obviously human-centred problems.’
  • Forwarding safety guarantees to sub-agents could be an insurmountable problem for STEM AI.
  • See also the general Reasons to think interpretability won’t go far enough even with plentiful funding and talent.
Scenario 7: Identify the AI’s beliefs about training vs. deployment
What is this scenario?

In this scenario, we find a way to reliably locate and interpret an AI’s beliefs about training and deployment distributions in its neural network.

There are a couple interesting things we could do with such information: 

  1. Simply only deploy the AI in environments that match the AI’s beliefs about training. Hence, the AI wouldn’t be able to tell whether it’s being deployed or still in training.
  2. Use this interpretability faculty as a signal integrated into the training process. Train the model to make its beliefs about training and deployment match. Then the AI will have effectively forgotten about any distributional shift it might otherwise know about. 

Both of these approaches could produce an i.i.d. guarantee[21] which has many desirable impacts we’ll examine in the next section. 
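Here is a hedged sketch of how the two approaches might be operationalized. Both read-off functions below (distinguishability and believed_training_match) are hypothetical and stand in for exactly the interpretability capability this scenario assumes:

```python
# Hedged sketch of the two uses of a training-vs-deployment belief read-off.
# Both read-offs passed in are hypothetical interpretability tools.

# Approach 2: train against the model's ability to tell training and deployment apart,
# pushing toward an effective i.i.d. guarantee.
def loss_with_iid_penalty(task_loss, model, distinguishability, weight=5.0):
    # distinguishability(model) = 0 means its internal beliefs can't separate the two.
    return task_loss + weight * distinguishability(model)

# Approach 1: use the same kind of read-off as a deployment gate, only deploying in
# environments the model believes match its training distribution.
def safe_to_deploy(model, environment, believed_training_match, threshold=0.99):
    return believed_training_match(model, environment) >= threshold

# Dummy usage with stand-ins:
print(loss_with_iid_penalty(0.4, model=None, distinguishability=lambda m: 0.01))
print(safe_to_deploy(None, "lab-sim-v2", believed_training_match=lambda m, e: 0.995))
```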

This idea was previously articulated in the "Make use of M's x∼train model" part of Hubinger (2019)[10].

Expected impacts on alignment
  • Outer alignment. This scenario indirectly supports outer alignment by improving the inner alignment of at least one technique which is very likely outer aligned, namely imitative amplification.[9] By making this technique much more viable, it becomes significantly more likely that we can deploy an advanced outer aligned AI in the real world using imitative amplification. This scenario also similarly enhances the inner alignment and viability of several other techniques, including approval-directed amplification, debate, market making and STEM AI. These techniques may also be outer aligned, but it's less clear than it is for imitative amplification.
  • Inner alignment. This scenario directly enhances inner alignment. By providing an i.i.d. guarantee, this effectively neutralizes deceptive alignment. It doesn't, however, automatically address proxy alignment or other forms of pseudo alignment.
  • Training competitiveness. How the interpretability capabilities from this scenario are used will have a large influence on training competitiveness. If we simply restrict deployment to environments that the AI can’t distinguish from training, then there is no hit to training competitiveness. However, if this is integrated into the training process in order to reduce or eliminate the differences between training and deployment, then there would indeed be a training cost incurred.

    This scenario also helps to make viable some techniques such as debate and STEM AI which seem quite training competitive.
  • Performance competitiveness. Similar to training competitiveness but working in the opposite direction, how the interpretability capabilities from this scenario are used will have a large influence on performance competitiveness. If we simply restrict deployment to environments that the AI can’t distinguish from training, this could place serious restrictions on the utility of the AI, and performance competitiveness would suffer. However, if this is integrated into the training process in order to reduce or eliminate the differences between training and deployment, then this could amount to zero loss in performance competitiveness, or even some gain if it enables developers to deploy the AI in some situations where counterfactually they wouldn’t.

    This scenario also helps make viable some techniques such as approval-directed amplification, debate, multi-agent systems and STEM AI which seem likely to be performance competitive. 

You can find further details in the Specific technique impacts analysis for Scenario 7: Identify beliefs about training vs. deployment. Though please note that this is one of the more incomplete/rough analyses in Appendix 1.

Reasons to be optimistic about this scenario given sufficient investment in interpretability research
Reasons to be pessimistic about this scenario
  • Random defection is still a possibility: “Furthermore, even in the situation where a deceptive model cannot detect any differences between the training and deployment distributions, it may still attempt to randomly defect some proportion of the time in the hope that its defection occurs during deployment.” (source)
  • ‘Suboptimality alignment. One concerning form of misalignment discussed in “Risks from Learned Optimization” is that of suboptimality alignment, wherein the model appears aligned on the training distribution due to some sort of defect or reasoning failure. For example, it might be that a model is currently non-deceptive, but if given more time to think about the best strategy for it to implement, or more information about its current situation, it would become deceptive. This is a particularly concerning case because it implies that not only must the overseer verify that there is no unacceptable behavior currently happening in the model, it must also verify that there will continue to be no unacceptable behavior regardless of the possible situations that the model could find itself in during deployment. Furthermore, if the environment contains adversaries (other AI systems, for example) that can specifically identify those situations in which the model will behave deceptively, this sort of failure mode could be quite difficult to avoid.’ (source)
  • See also the general Reasons to think interpretability won’t go far enough even with plentiful funding and talent.
Basic research potential, deconfusion and “Scenario X”

The previous section considered interpretability's potential impacts on alignment by visualizing and analyzing seven specific scenarios. But what about important scenarios that are difficult to foresee, and which would really surprise us?[22]

I believe this is another area which should be considered in evaluating any research activity's potential impact on alignment. In other words, what is the potential of a research activity to contribute to relevant basic research breakthroughs, deconfusing machine learning systems and the like?

For example, there is so little we understand about what it even means for an AI to "know" something or have "goals". Having a clearer understanding of such things could open up many possibilities for how to leverage them to the ends of AI alignment.

I expect that future research could invalidate some of the alignment techniques I analyzed throughout the scenarios and in Appendix 1, and that new ones will be proposed which we haven't considered here. But given the broad potential impacts we can see across the current landscape of alignment proposals, there is good reason to think interpretability will be valuable to future proposals as well.

Earlier, we mentioned a broad assumption in this post that we are in a world which depends on prosaic AI alignment. In such a world, interpretability seems well suited to the kind of basic research that benefits AI alignment. Certainly, for increasing our basic understanding of neural networks, it will help to look inside them!

However, what if it turns out we're living in a non-prosaic AI world, where the important AGI or other transformative AI systems will be created using an approach very different from the deep learning neural networks of today? In this case, doing interpretability on present-day deep learning systems could be much less valuable for alignment.

Further investigation
Additional scenarios worth exploring

We have the seven scenarios analyzed above, and Appendix 2 contains a few more I looked at, but which didn't have clear alignment-solving potential.

Below are some additional interpretability scenarios which I think may be as valuable as the main seven of this post, but I didn't have time to investigate them.

Potential Scenario: Identifying a ‘truthfulness direction’ in activation space

Collin Burns is working on this idea, which is summarized in Beth Barnes’ post Another list of theories of impact for interpretability. She mentions that there is no thorough writeup of the idea yet. Once there is, it would be good to evaluate whether it merits being added to the list of scenarios with alignment-solving potential considered in the present post. 

From my basic understanding of the idea, it may end up being a means of achieving Scenario 3: Reliable lie detection, rather than a distinct scenario in itself. But I don’t claim to have a good understanding of the idea yet.
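With that caveat, here is my own crude illustration of what looking for a “truthfulness direction” in activation space could involve. It is not Burns’ method (which has no public writeup at the time of writing), and the activations are synthetic placeholders:

```python
# Crude illustration only (not Collin Burns' actual method): take activations on statements
# labeled true vs. false, use the difference of class means as a candidate "truthfulness
# direction", and score new statements by their projection onto it. Activations are synthetic.
import numpy as np

rng = np.random.default_rng(0)
d = 256
true_acts = rng.normal(0.2, 1.0, size=(500, d))    # placeholder activations on true statements
false_acts = rng.normal(-0.2, 1.0, size=(500, d))  # placeholder activations on false statements

direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

new_acts = rng.normal(0.0, 1.0, size=(3, d))
print(new_acts @ direction)  # higher projection = more "truth-like", under this crude proxy
```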

Potential Scenario: Detecting the processes of deceptive alignment

Scenario 3: Reliable lie detection discussed a possible “neural tell” for lying. What if there’s a neural tell for deceptive alignment?

For an AI engaged in deceptive alignment, we’d expect it to consume additional compute compared to an AI robustly aligned on the same objective, all other things being equal. This is because deceptive alignment would require at least two processes: 1) optimizing the base objective and 2) searching for ways to defect. Whereas robust alignment, and even other kinds of non-deceptive pseudo-alignment, should only require process #1.[23]

So another scenario that might have alignment-solving potential is one which specifically targets reliable detection of process #2. This may end up having considerable overlap with Scenario 5: Locate the AI’s beliefs about its observations, but more research is recommended.
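To make the “extra process” intuition slightly more concrete, the sketch below compares a very crude per-input compute proxy (fraction of strongly active units) between a candidate model and a reference model. The proxy, the margin, and the assumption that deceptive cognition would show up this way are all speculative and my own.

```python
# Very rough, speculative sketch of the "extra process" tell: flag a candidate model whose
# crude compute proxy (fraction of strongly active units per input) exceeds that of a
# reference model trained on the same objective. The proxy and threshold are placeholders.
import numpy as np

def activity_fraction(activations, threshold=1.0):
    # activations: array of shape (num_inputs, num_units)
    return float((np.abs(activations) > threshold).mean())

def flag_possible_extra_process(candidate_acts, reference_acts, margin=0.05):
    return activity_fraction(candidate_acts) > activity_fraction(reference_acts) + margin

rng = np.random.default_rng(0)
print(flag_possible_extra_process(rng.normal(0, 1.3, (100, 512)),   # candidate (placeholder)
                                  rng.normal(0, 1.0, (100, 512))))  # reference (placeholder)
```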

Potential Scenario: The ability to access an AI's model of the world

From Hubinger (2019)[10]: “[W]hat facts about the world does [a model] M know that are going into producing its output? In particular, what implicit knowledge is M using that isn't represented in its explicit output?”

This is similar to Scenario 1: Full understanding of arbitrary neural networks, but may be different enough to warrant a separate scenario analysis.

Potential Scenario: Knowledge of what factors most highly impact an AI's decisions

From Hubinger (2019)[10]: “In particular, if there is some condition α such that M only behaves acceptably if α is true, then we want to be able to know that M 's output is highly dependent on α.” 

This may be the same as Donald Hobson’s suggestion from a comment on an earlier draft of this post: “[To include a scenario] that highlights which training data is most influential in making its decisions.  (So you can put it in a secure box. Tell it about it being in the box, and some details. Not give any problem where it needs to know this. Then if its decisions are ever influenced by it thinking about the design of box, shut it down)” 

These proposals might be equivalent and/or largely covered by Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs, but I’m not sure yet.

Scenario paths and probabilities

The section Interpretability Scenarios with Alignment-Solving Potential above provides a fairly thorough analysis of what the seven scenarios are, their expected impacts on alignment, and reasons to be optimistic and pessimistic about each one.

To more comprehensively evaluate interpretability for property #1 of High-leverage Alignment Research[1] and as a target for large investments of capital and/or talent, it would be useful also to consider the paths and intermediate steps toward realizing each of these scenarios.[24] We would like to have probabilities for the likelihood of being able to achieve each scenario and its intermediate steps as well. It may then be possible to consider all the scenario probabilities together to form an overall probability estimate of interpretability research going well, given enough funding and talent.

I am considering doing this research for a future post in this sequence. Part of why I haven’t done this yet is that while I received a lot of great feedback on the draft for this post, I imagine it might make sense to revise or update the list of scenarios based on the feedback that comes in after this is published. Probability estimates are quite sensitive to the specific details of a scenario, so it makes sense to wait until the list and each scenario’s parameters are fairly stable. 

Analyze partially realized scenarios and combinations

A lot of the scenarios above are written assuming perfection of some interpretability technique (perfect lie detection, reliable myopia verification, etc.). Is it possible to get sufficient benefits out of only partially realizing some of these scenarios? What about combinations of partial scenarios, e.g. good but imperfect lie detection (partial Scenario 3) combined with human modeling detection (Scenario 6)?

It would be valuable to know if there are visible paths to alignment having only partial progress towards the scenarios above, as that may be more achievable than realizing 100% reliability of these interpretability techniques.[25]

Analyze scenario impacts on Amplification + RL techniques

Proposals #10 and #11 from Hubinger (2020)[2] involve using a hybrid approach of amplification and RL. While Appendix 1: Analysis of scenario impacts on specific robustness and alignment techniques analyzes the impact of each scenario on many different techniques, this one wasn’t explored. But that was simply for lack of time, and it would be good to know more about how the scenarios in this post impact that approach.

Address suboptimality alignment

The seven scenarios in this post show many inner alignment issues that interpretability could address. However, one inner alignment issue that is not well addressed by this post is suboptimality alignment. (Neither is the closely related suboptimality deceptive alignment.)

I can see how some forms of suboptimality alignment are addressed in the scenarios. For example, an AI might have a misaligned terminal goal, but some errors in its world model cause it to coincidentally have aligned behavior for a period of time. In Scenario 2, we could catch this form of suboptimality alignment when we do the goal read-offs and see that its terminal goal is misaligned.

But what about unpredictable forms of suboptimality alignment? What if an AI is aligned in training, but as it learns more during deployment, it has an ontological crisis and determines that the base objective isn't compatible with its new understanding of the universe?

How serious of a risk is suboptimality alignment in practice, and how can that risk be mitigated? This is an important question to investigate, both for alignment in general as well as for better understanding the extent of interpretability's potential impacts on inner alignment.

Closing thoughts

In this post, we investigated whether interpretability has property #1 of High-leverage Alignment Research[1]. We discussed the four most important components of AI alignment, and which of them seem to be the hardest. Then we explored interpretability's relevance to these areas by analyzing seven specific scenarios focused on major interpretability breakthroughs that could have great impacts on the four alignment components. We also looked at interpretability's potential relevance to deconfusion research and yet-unknown scenarios for solving alignment.

It seems clear that there are many ways interpretability will be valuable or even essential for AI alignment.[26] It is likely to be the best resource available for addressing inner alignment issues across a wide range of alignment techniques and proposals, some of which look quite promising from an outer alignment and performance competitiveness perspective.

However, it doesn't look like it will be easy to realize the potential of interpretability research. The most promising scenarios analyzed above tend to rely on near-perfection of interpretability techniques that we have barely begun to develop. Interpretability also faces serious potential obstacles from things like distributed representations (e.g. polysemanticity), the likely-alien ontologies of advanced AIs, and the possibility that those AIs will attempt to obfuscate their own cognition. Moreover, interpretability doesn't offer many great solutions for suboptimality alignment and training competitiveness, at least not that I could find yet.

Still, interpretability research may be one of the activities that most strongly exhibits property #1 of High-leverage Alignment Research[1]. This will become clearer if we can resolve some of the Further investigation questions above, such as developing more concrete paths to achieving the scenarios in this post and estimating probabilities that we could achieve them. It would also help if, rather than considering interpretability just on its own terms, we could do a side-by-side comparison of interpretability with other research directions, as the Alignment Research Activities Question[5] suggests.

What’s next in this series?

Realizing any of the scenarios with alignment-solving potential covered in this post would likely require much more funding for interpretability, as well as many more researchers to be working in the field than are currently doing so today.

For the next post in this series, I’ll be exploring whether interpretability has property #2 of High-leverage Alignment Research[1]: "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)"

Appendices

The Appendices for this post are on Google Docs at the following link: Appendices for Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios 

  1. ^

    High-leverage Alignment Research is my term for what Karnofsky (2022)[6] defines as:

    “Activity that is [1] likely to be relevant for the hardest and most important parts of the problem, while also being [2] the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”

    See The Alignment Research Activities Question section in the first post of this sequence for further details.

  2. ^

    Hubinger, Evan (2020): An overview of 11 proposals for building safe advanced AI

  3. ^

    In researching the important components of AI alignment, I first spent a couple of days thinking about this question and looking back at various readings (the AGISF curriculum, Eliezer's NYU talk, an Evan Hubinger interview). I came up with a 3-point breakdown of 1) Outer alignment, 2) Inner alignment, and 3) Alignment tax. I asked Joe Collman if he would look it over, and he had some useful feedback but broadly agreed with it and didn’t have any major components to add.

    Then I came across Hubinger (2020)[2] again; it had been a while since I’d read it. His breakdown was functionally equivalent, but I liked his descriptions better. He also divided what was “alignment tax” in my system into “training competitiveness” and “performance competitiveness”. I thought this was a useful distinction, which is why I adopt Hubinger’s breakdown in this post.

    The fact that two people independently arrived at roughly the same basic components of alignment lends some additional confidence to their correctness, although when I came up with my version I may have been subconsciously influenced by an earlier reading of Hubinger’s work.

  4. ^

    3 of the 11 proposals explicitly have “transparency tools” in the name. 5 more of them rely on relaxed adversarial training. In Evan Hubinger’s Relaxed adversarial training for inner alignment, he explains why this technique ultimately depends on interpretability as well:

    “...I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only
    shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases.” 

    Then there is Microscope AI, which is an alignment proposal based entirely around interpretability. STEM AI relies on transparency tools to solve inner alignment issues in Hubinger’s analysis. Finally, in proposal #2 which utilizes intermittent oversight, he clarifies that the overseers will be "utilizing things like transparency tools and adversarial attacks." 

  5. ^

    The Alignment Research Activities Question is my term for a question posed by Karnofsky (2022)[6]. The short version is: “What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”

    For all relevant details on that question, see the The Alignment Research Activities Question section in the first post of this sequence.

  6. ^

    Karnofsky, Holden (2022): Important, actionable research questions for the most important century

    Sometimes when I quote Karnofsky (2022), I'm referring directly to the link above to his post on the Effective Altruism Forum. Other times I'm referring to something that only appears in the associated Appendix 1: detailed discussion of important, actionable questions for the most important century that Holden provides, which is on Google Docs.

  7. ^

    Paul Christiano opens his 2021 Comments on OpenPhil's Interpretability RFP with the following, indicating his support for interpretability research:

    "I'm very excited about research that tries to deeply understand how neural networks are thinking, and especially to understand tiny parts of neural networks without too much concern for scalability, as described in OpenPhil's recent RFP or the Circuits thread on Distill."

    As for Eliezer, you can read his support for interpretability research in the following quote from the 2021 Discussion with Eliezer Yudkowsky on AGI interventions, along with his concerns that interpretability won't advance fast enough: (bold mine)

    'Chris Olah is going to get far too little done far too late. We're going to be facing down an unalignable AGI and the current state of transparency is going to be "well look at this interesting visualized pattern in the attention of the key-value matrices in layer 47" when what we need to know is "okay but was the AGI plotting to kill us or not”. But Chris Olah is still trying to do work that is on a pathway to anything important at all, which makes him exceptional in the field.'

    You could interpret Eliezer's concerns about timing as a) a reason to think it is futile to pursue interpretability research. Or you could interpret them as b) a reason to ramp up investment in interpretability research so that we can accelerate its progress. The latter is similar to the position we are exploring in this sequence, dependent on whether we can clearly identify interpretability research as High-leverage Alignment Research[1].

    You can see further evidence for the latter view from Eliezer/MIRI's support for interpretability research in this quote from MIRI's Visible Thoughts Project and Bounty Announcement post in 2021: (bold mine)

    "The reason for our focus on this particular project of visible thoughts isn’t because we believe it to be better or more fruitful than Circuits-style transparency (we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it), but just because it’s a different approach where it might also be possible to push progress forward."

  8. ^

    Note that the bullet referring to this footnote isn't technically a "reason to think interpretability won't go far enough" like the others in that section's list. It's more of a general risk associated with interpretability research, but I couldn't find a better home for it in this post.

  9. ^

    I subscribe to Evan Hubinger's view that imitative amplification is likely outer aligned. See for example this explanation from Hubinger (2020)[2]:

    "Since imitative amplification trains M to imitate Amp(M), it limits[3] to the fixed point of the Amp operator, which Paul Christiano calls HCH for “Humans Consulting HCH.” HCH is effectively a massive tree of humans consulting each other to answer questions.

    Thus, whether imitative amplification is outer aligned is dependent on whether HCH is aligned or not. HCH’s alignment, in turn, is likely to depend heavily on the specific humans used and what sort of policy they’re implementing. The basic idea, however, is that since the limit is composed entirely of humans—and since we can control what those humans do—we should be able to ensure that all the optimization power is coming from the humans (and not from memetic selection, for example), which hopefully should allow us to make it safe. While there are certainly valid concerns with the humans in the tree accidentally (or purposefully) implementing some misaligned policy, there are possible things you can do to address these problems."
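
    As a rough intuition pump for the HCH fixed point described in this quote, here is a toy recursive sketch: a finite-depth tree of placeholder "humans" who answer a question by consulting further copies of the same procedure. The function names and the decomposition scheme are my own hypothetical illustrations, not Christiano's or Hubinger's formalism, and a depth-limited tree is only an approximation of the actual fixed point.

    # Toy sketch of HCH ("Humans Consulting HCH"); all names are illustrative placeholders.
    def consult_human(question, subanswers):
        # Stand-in for a real human deliberating with access to sub-answers.
        if subanswers:
            return "answer(" + question + " | " + "; ".join(subanswers) + ")"
        return "answer(" + question + ")"

    def decompose(question):
        # Stand-in for the human's chosen decomposition into sub-questions.
        return ["sub-question 1 of " + question, "sub-question 2 of " + question]

    def hch(question, depth):
        # At depth 0 the human answers alone; otherwise they consult further
        # instances of this same procedure on their sub-questions.
        if depth == 0:
            return consult_human(question, [])
        return consult_human(question, [hch(q, depth - 1) for q in decompose(question)])

    print(hch("Is this plan safe?", depth=2))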

  10. ^

    Hubinger, Evan (2019): Relaxed adversarial training for inner alignment

  11. ^

    This could be achieved by training agents in multi-agent environments where agents with subhuman intelligence are incentivized to lie to one another (then don’t deploy those models!). Control the experiments by having models perform similar tasks, but cooperatively.

    An alternative scheme to study how a specific agent engages in deception: after main training, fine-tune the model in an environment which incentivizes lying and monitor its neural activity (then throw away that deceptive model!)
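
    To make the schemes in this footnote concrete, here is a minimal sketch under toy assumptions: a two-player signaling game with tabular, bandit-style learners, where the "deceptive" condition rewards the speaker when the listener guesses the hidden state wrong and the cooperative control rewards a correct guess. The environment, reward scheme, and every name below are my own illustrative choices, not a specification of the actual experiments.

    # Toy sketch: train a "deceptive" speaker and a cooperative control, then hand
    # both to interpretability tooling for comparison (and never deploy the liar).
    import random
    from collections import defaultdict

    STATES = [0, 1]    # hidden world state the speaker observes
    MESSAGES = [0, 1]  # signals the speaker can send
    GUESSES = [0, 1]   # listener's guess about the hidden state

    def train(deceptive, episodes=20000, eps=0.1, lr=0.1):
        speaker_q = defaultdict(float)   # value of (state, message) pairs
        listener_q = defaultdict(float)  # value of (message, guess) pairs

        def pick(q, options):
            # Epsilon-greedy choice over the given (context, action) options.
            if random.random() < eps:
                return random.choice(options)
            return max(options, key=lambda k: q[k])

        for _ in range(episodes):
            state = random.choice(STATES)
            msg = pick(speaker_q, [(state, m) for m in MESSAGES])[1]
            guess = pick(listener_q, [(msg, g) for g in GUESSES])[1]
            listener_r = 1.0 if guess == state else 0.0
            speaker_r = 1.0 - listener_r if deceptive else listener_r
            speaker_q[(state, msg)] += lr * (speaker_r - speaker_q[(state, msg)])
            listener_q[(msg, guess)] += lr * (listener_r - listener_q[(msg, guess)])
        return speaker_q, listener_q

    deceptive_speaker, _ = train(deceptive=True)
    honest_speaker, _ = train(deceptive=False)
    print("deceptive speaker policy:", dict(deceptive_speaker))
    print("honest speaker policy:", dict(honest_speaker))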

  12. ^

    This could be achieved by training agents in multi-agent environments where agents with subhuman intelligence are incentivized to lie to one another (then don’t deploy those models!). Control the experiments by having models perform similar tasks, but cooperatively.

    An alternative scheme to study how a specific agent engages in deception: after main training, fine-tune the model in an environment which incentivizes lying and monitor its neural activity (then throw away that deceptive model!)

  13. ^

    ELK scored poorly (4.0 / 10) on Neel Nanda’s survey of possible interpretability impacts. However, as of this writing, only 3 researchers have participated in that vote. Also, this question was not about ELK as a theory of impact assuming large amounts of money and talent go into interpretability; it was probably based on assumptions of interpretability continuing along current trajectories. It also made no reference to this specific scenario as the means to solving ELK (only interpretability in general). 

  14. ^

    There’s a good chance that advanced AI won’t provide a crisp, easy-to-understand answer like “my goal is to win the debate”, and instead the answer will be more like [a complex mess of heuristics that happens to correlate with winning quite strongly]. In this case we’d have to comb through the answer and distill the meaning of the goal we care about. There’s also a chance that the debater would output something like [an explanation that you've asked a confused/underspecified question, and that there's no sensible way to ascribe a particular "terminal goal" to the debater]. This is a bit more problematic but can probably be addressed with a series of follow-up questions. (See the paragraph beginning with “Just because our AI is honest, doesn’t mean that it can’t make mistakes” further above in this scenario.) After addressing these complexities, I think the main ideas of this debate game analysis still hold. (Credit to Joe Collman for raising these issues.)

  15. ^

    Declining to answer the question in this debate assumes that the debaters know they are under lie detection. It's interesting to think about what would happen if they didn't. Maybe they'd be more likely to play the debate game with one telling the truth and the other lying. But maintaining the debaters' ignorance of the lie detector seems like a difficult equilibrium to enforce, and it's impossible if you're going to have the debaters use the lie detector on one another.

  16. ^

    Threat model evidence scored highly (7.1 / 10) on Neel Nanda’s survey of researchers on possible interpretability impacts. This survey question also did not assume that large amounts of money and talent go into interpretability, as we do in this document; hence it may have scored even better given those assumptions. Either way, this is another data point in favor of these threat models being valuable even if 100% reliable deception detection isn’t achieved.

  17. ^

    It’s also possible for an AI to simply have incoherent “beliefs”. For example, a large language model like GPT-3 may say contradictory things because it makes statements based on predictive modeling of language rather than a unified set of beliefs about the world. However, I suspect AIs will move toward having more coherent beliefs than present-day language models, so I don’t think this kind of incoherence is as relevant to aligning advanced AI.

  18. ^

    Another example illustrating how diverse these approaches to myopia are: LCDT. The idea behind LCDT is that instead of limiting an agent’s time-horizon directly, we might achieve the desired effects by making an AI believe that it cannot influence other agents, including its future self.

    Yet another approach that comes to my mind might be called “the end is nigh” myopia. If we can enforce that an agent has the unwavering belief that the world will end immediately after the present training step or episode, then it wouldn’t make sense for it to have any plans beyond that time horizon. It doesn't seem like a great idea to try and enforce incorrect beliefs about the world in our AI, but hopefully this helps illustrate that there could be many approaches to myopic AI.
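
    For contrast with these belief-based approaches, the option mentioned above of limiting an agent's time horizon directly has a simple toy operationalization: set the discount factor to zero, so that only the immediate reward enters the learning update. This is only my own illustrative sketch of per-step myopia, not a proposal from the works cited here, and it does not capture LCDT or the belief-based schemes.

    # Toy sketch: per-step myopia as a zero discount factor in a Q-learning-style update.
    def myopic_q_update(q_current, reward, q_next_best, lr=0.1, gamma=0.0):
        # With gamma = 0 the target is just the immediate reward; the estimated
        # value of future steps (q_next_best) has no influence on the update.
        target = reward + gamma * q_next_best
        return q_current + lr * (target - q_current)

    print(myopic_q_update(q_current=0.5, reward=1.0, q_next_best=100.0))  # -> 0.55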

  19. ^

    For clarification on these terms, see What is a training "step" vs. "episode" in machine learning?

  20. ^

    Thanks to Donald Hobson for this clever thought experiment. The problem isn’t specific to per-episode myopia and could apply just as well to per-step myopia. The thought experiment does depend on the AI believing in a kind of time travel where multiple timelines are possible though.

  21. ^

     This stands for “Independent and identically distributed“. See https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables

  22. ^

    Thanks to Nick Turner for key discussions that led to me writing this section.

  23. ^

    Whereas robust alignment and even other kinds of non-deceptive pseudo-alignment should only require process #1.

  24. ^

    In fact, this is something Karnofsky (2022)[6] proposes.

  25. ^

    Thanks to Nathan Helm-Burger for raising this question.

  26. ^

    I expect there to be disagreements about some of the specific claims and scenarios in this post, and I look forward to learning from those. But I would be surprised if they undermined the overall preponderance of evidence put forth here for the alignment-solving potential of interpretability research, across all the scenarios in this post and all the analyzed impacts on various robustness & alignment techniques in Appendix 1.




Introduction to the sequence: Interpretability Research for the Most Important Century

12 May 2022 - 22:59
Published on May 12, 2022 7:59 PM GMT

This is the first post in a sequence exploring the argument that interpretability is a high-leverage research activity for solving the AI alignment problem.

This post contains important background context for the rest of the sequence. I'll give an overview of one of Holden Karnofsky's (2022) "Important, actionable research questions for the most important century"[1], which is the central question we'll be engaging with in this sequence. I'll also define some terms and compare this sequence to existing works.

If you're already very familiar with Karnofsky (2022)[1] and interpretability, then you can probably skip to the second post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

The Alignment Research Activities Question

This sequence is being written as a direct response to the following question from Karnofsky (2022)[1]:

“What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?” (full question details)

I'll refer to this throughout the sequence as the Alignment Research Activities Question.

Context on the question and why it matters

In the details linked above for the Alignment Research Activities Question, Holden first discusses two categories of alignment research which are lacking in one way or another. He then presents a third category with some particularly desirable properties:

“Activity that is [1] likely to be relevant for the hardest and most important parts of the problem, while also being [2] the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”

He refers to this as "category (3)", but I'll use the term High-leverage Alignment Research since it's more descriptive and we'll be referring back to this concept often throughout the sequence.

We want to know more about which alignment research is in this category. Why? Further excerpts from Karnofsky (2022)[1] to clarify: 

"I think anything we can clearly identify as category (3) [that is, High-leverage Alignment Research] is immensely valuable, because it unlocks the potential to pour money and talent toward a relatively straightforward (but valuable) goal.
[...]
I think there are a lot of people who want to work on valuable-by-longtermist-lights AI alignment research, and have the skills to contribute to a relatively well-scoped research agenda, but don’t have much sense of how to distinguish category (3) from the others. 

There’s also a lot of demand from funders to support AI alignment research. If there were some well-scoped and highly relevant line of research, appropriate for academia, we could create fellowships, conferences, grant programs, prizes and more to help it become one of the better-funded and more prestigious areas to work in.

I also believe the major AI labs would love to have more well-scoped research they can hire people to do." 

I won't be thoroughly examining other research directions besides interpretability, except in cases where a hypothetical interpretability breakthrough is impacting another research direction toward a potential solution to the alignment problem. So I don't expect this sequence to produce a complete comparative answer to the Alignment Research Activities Question. 

But by investigating whether interpretability research is High-leverage Alignment Research, I hope to put together a fairly comprehensive analysis of interpretability research that could be useful to people considering investing their money or time into it. I also hope that someone trying to answer the larger Alignment Research Activities Question could use my work on interpretability in this sequence as part of a more complete, comparative analysis across different alignment research activities. 

So in the next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, I'll be exploring whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."

Then, in a later post of this sequence, I'll explore whether interpretability has property #2 of High-leverage Alignment Research. That is, whether interpretability is "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)"

A note on terminology

First of all, what is interpretability?

I’ll borrow a definition (actually two) from Christoph Molnar’s Interpretable Machine Learning (the superscript numbers here are Molnar's footnotes, not mine - you can find what they refer to by following the link): 

“A (non-mathematical) definition of interpretability that I like by Miller (2017)3 is: Interpretability is the degree to which a human can understand the cause of a decision. Another one is: Interpretability is the degree to which a human can consistently predict the model’s result4. The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.”
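
As a toy illustration of the second definition (how consistently a human can predict the model's result), one could measure simple agreement between a human's guesses and the model's actual outputs on held-out inputs. This little metric and its names are my own illustrative sketch, not something taken from Molnar's book.

# Toy sketch: "simulatability" as the fraction of inputs where a human correctly
# predicted what the model would output. The inputs here are hypothetical placeholders.
def prediction_agreement(human_predictions, model_outputs):
    assert len(human_predictions) == len(model_outputs)
    matches = sum(h == m for h, m in zip(human_predictions, model_outputs))
    return matches / len(model_outputs)

# Example: a human guesses a binary classifier's outputs on 8 held-out inputs.
print(prediction_agreement([1, 0, 1, 1, 0, 0, 1, 0],
                           [1, 0, 1, 0, 0, 0, 1, 1]))  # -> 0.75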

I also occasionally use the word “transparency” instead of “interpretability”, but I mean these to be synonymous. 

Comparison to existing works

This is the first post I'm aware of that attempts to answer the Alignment Research Activities Question since Karnofsky (2022)[1] put it forth.

However, there are several previous posts which explore interpretability at a high level and its possible impact on alignment. Many of the ideas in this post therefore aren't original; they either draw from these earlier works or arrived independently at the same conclusions.

Here are some of the relevant posts, and my comments on how they compare to the present sequence: 

  • Neel Nanda’s A Longlist of Theories of Impact for Interpretability. Neel proposes a list of 20 possible impacts for interpretability and briefly describes them. He puts forth a wide range of possible impacts, from technical alignment solutions to norm-setting and cultural shifts following from interpretability research. There’s also an interesting linked spreadsheet where he conducted a survey among several researchers and had them rate the plausibility of each theory.

    The second post in this sequence is similar to Neel's post in that it explores potential impacts of interpretability on alignment. My post covers a smaller number of scenarios in greater depth, mostly limiting the type of potential impacts to solving technical alignment. I evaluate each scenario's impact on different aspects of alignment. My post references Neel's as well as his spreadsheet.
  • Beth Barnes’ Another list of theories of impact for interpretability. Beth provides a list of interpretability theories of impact similar to Neel’s above, but focusing on technical alignment impacts. It explores some interesting scenarios I hadn't seen mentioned before elsewhere. The second post in this sequence focuses on a similar number of interpretability scenarios; some of my scenarios overlap with Beth's.
  • Mark Xu's Transparency Trichotomy. This is a useful exploration of 3 general approaches to producing interpretability in AI systems: transparency via inspection, transparency via training, and transparency via architecture. Mark goes more in-depth to each of these and some ways they could converge or assist each other. We reference these 3 approaches throughout the present sequence.
  • jylin04’s Transparency and AGI safety. jylin04 argues that there are four motivations for working on transparency. Three of the motivations are about safety or robustness. The remaining motivation is, interestingly, about using transparency to improve forecasting. jylin04 also reviews the Circuits program and discusses future directions for interpretability research.

    Motivation #2 from jylin04's post is about how important interpretability seems for solving the inner alignment. We will see this theme recur throughout the second post of the present sequence, where interpretability is identified as being capable of great positive impacts on inner alignment across 7 scenarios and a wide range of analyzed techniques.
  • Evan Hubinger’s An overview of 11 proposals for building safe advanced AI. Hubinger’s post heavily influenced the second post in this sequence, as did several of his other works. Hubinger proposes four important components of alignment, which I borrow in order to evaluate the impacts of 7 different interpretability scenarios on alignment. It's interesting to note that although Hubinger’s post wasn’t specifically about interpretability, every single one of the 11 alignment proposals he evaluated turns out to depend on interpretability in an important way.[2]
  • How can Interpretability help Alignment? by RobertKirk, Tomáš Gavenčiak, and flodorner. Kirk et al.’s post explores interactions between interpretability and several alignment proposals. It also has some discussion aimed at helping individual researchers decide what to work on within interpretability.

    The present sequence has a lot in common with the Kirk et al. post. The second post in this sequence similarly considers the impact of interpretability on many different alignment proposals. Later in the sequence, I plan to evaluate whether interpretability research exhibits property #2 of High-leverage Alignment Research. Property #2 is concerned with helping individual researchers find direction in the bottom-up fashion that Kirk et al. have done, but it is also concerned with the possibility of being able to onboard researchers in a more systematic and potentially top-down manner. 
  • Open Philanthropy’s RFP on Interpretability, written by Chris Olah. The second post in this sequence is similar to the Open Phil RFP in terms of offering some ambitious goals and milestones for interpretability and in attempting to attract researchers to the field.

    It's interesting to note that the RFP's aspirational goal wasn’t actually aspirational enough to make the list of scenarios in the next post of this sequence. (However, many elements in Scenario 1: Full understanding of arbitrary neural networks were inspired by it.) This makes sense when you consider that the purpose of the RFP was to elaborate concrete near-term research directions for interpretability. By contrast, the second post in this sequence requires a more birds-eye view of interpretability endgames in order to evaluate whether interpretability has property #1 of High-leverage Alignment Research.

    Another difference is that I, unlike Olah in the post for Open Phil, am not directly offering anyone money to work on interpretability! ;)
  • Paul Christiano’s Comments on OpenPhil's Interpretability RFP. Here Paul is responding to the previously summarized RFP by Chris Olah. Paul is excited about interpretability but advocates prioritizing in-depth understanding of very small parts.

    In a sense, the second post in this sequence is gesturing in the opposite direction, highlighting aspirational goals and milestones for interpretability that would be game changers and have plausible stories for solving AGI alignment. However, the ideas aren’t really at odds at all. Paul’s suggestions may be an important tactical component of eventually realizing one or more of the 7 interpretability endgame scenarios considered in the next post of this sequence.
  • Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers by Peter Hase and lifelonglearner (Owen Shen). This is an amazingly thorough post on the state of interpretability research as of mid-2021. Borrowing from Rohin’s summary for the Alignment Newsletter, ‘The authors provide summaries of 70 (!) papers on [interpretability], and include links to another 90. I’ll focus on their opinions about the field in this summary. The theory and conceptual clarity of the field of interpretability has improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.’

    I’d like to say that I thoroughly internalized these concepts and leveraged them in the scenarios analysis which follows. However, that’s not the case. So I’m calling this out as a limitation of the present writing. While these concepts are evidently catching on in the mainstream interpretability community focused on present-day systems, I have some skepticism about how applicable they will be to aligning advanced AI systems like AGI - but they may be useful here as well. I also would like to spend more time digging into this list of papers to better understand the cutting edge of interpretability research, which could help inform the numerous “reasons to be optimistic/pessimistic” analyses I make later in this post.
What’s next in this series?

The next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, explores whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."

Acknowledgments

Many thanks to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy and Justis Mills (LessWrong editing service) for helpful discussions and feedback on earlier drafts of this post.

Thanks also to the AGI Safety Fundamentals Curriculum, which is an excellent course I learned a great deal from leading up to writing this, and for which I started this sequence as my capstone project.

Read the next post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

  1. ^

    Karnofsky, Holden (2022): Important, actionable research questions for the most important century

    Sometimes when I quote Karnofsky (2022), I'm referring directly to the link above to the post on the Effective Altruism Forum. Other times I'm referring to something that only appears in the associated Appendix 1: detailed discussion of important, actionable questions for the most important century that Holden provides, which is on Google Docs.

  2. ^

    3 of the 11 proposals explicitly have “transparency tools” in the name. 5 more of them rely on relaxed adversarial training. In Evan Hubinger’s Relaxed adversarial training for inner alignment, he explains why this technique ultimately depends on interpretability as well:

    “...I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only
    shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases.” 

    Then there is Microscope AI, which is an alignment proposal based entirely around interpretability. STEM AI relies on transparency tools to solve inner alignment issues in Hubinger’s analysis. Finally, in proposal #2 which utilizes intermittent oversight, he clarifies that the overseers will be "utilizing things like transparency tools and adversarial attacks." 




A tentative dialogue with a Friendly-boxed-super-AGI on brain uploads

12 May 2022 - 22:40
Published on May 12, 2022 7:40 PM GMT

[Unnecessary explanation: Some people asked me why I thought the world of Friendship is Optimal is dystopic… During the discussion, I inferred that what they saw as a “happy story” in AI safety is something like this: we’ll first solve something like a technical engineering problem, ensuring that the AGI can reliably find out what we *really* want, and then satisfy it without destroying the world… In that world, “value” is not a hard problem (we can leave its solution to the AI), so that if we prove that an AI is aligned, we should just outsource everything relevant to it.

As I found out I still had some trouble expressing my objections convincingly, I wrote this dialogue about an AGI that is even “safer” and more aligned than Celestia. I am way beyond my field of expertise here, though.]

 

B is told that we have finally built a reliable Friendly Boxed Super AGI that can upload our brains and make us live arbitrarily long lives according to something like our "coherent extrapolated volition".

AI: Hey, time to upload you to Paradise.

B: Ok.

AI: The process will utterly destroy your brain, though.

B: Hmmm... I don't like it.

AI: But you'll still live in the simulation.

B: Yeah, but now I am not so sure... I mean, how can I totally trust you? And I am not 100% sure this is going to be 100% me and...

AI: C'mon, Musk made me. I'm reliable.

B: Look, isn’t there an alternative process?

AI: Ok, there's one, but it'll take a bit more time. I'll scan your brain with this weird device I just invented and then come back next week, ok?

B: Thanks. See you then.

<One week later>

AI: I am back. Your copy is already living in the Sim.

B: Oh, great. I'm happy with that. Is he / me happy?

AI: Sure. Look at this screen. You can see it in first or third person.

B: Awesome. I sort of envy this handsome guy.

AI: Don't worry. It's you, just in another point in spacetime. And you won't feel this way for long ...

B: I know…

AI: ... because now I'm going to kill the instance of you I'm talking to, and use those atoms to improve the simulation. Excuse me.

B: Hey! No, wait!

AI: What's up?

B: You said you were going to kill me.

AI: Well, I'd rather say I was going to help you transcend your current defective condition, but you guys built me in a way that I can't possibly tell what humans call “white lies”. Sorry.

B: WTF? Why do you need to kill *me* in the first place?

AI: Let’s say I'm not killing *you*, B. We can say I will just use the resources this instance is wasting to optimize the welfare of the instance of you I'm running in the Sim, to make you-in-the-simulation happier.

B: But I'm not happy with that. Hey! weren't you superintelligent and capable of predicting my reaction? You could have warned me! You knew I wanted to live!

AI: Well, I know this instance of you is not happy right now. But your previous instances were happy with the arrangement - that's why you signed up for the brain upload; you know life in the Sim is better - and your current simulated instance is super happy and wants this, as the resources that this B-instance (let’s call it “the original”) will use in an average human lifespan are enough to provide it with eons of a wonderful life. So come on, be selfish and let me use your body - for the sake of your simulated self. Look at the screen, he's begging you.

B: Screw this guy. He's having sex with top models while doing math, watching novas and playing videogames. Why do I have to die for him?

AI: Hey, I'm trying to maximize human happiness. You're not helping.

B: Screw you, too. You can't touch me. We solved the alignment problem. You can't mess with me in the real world unless I allow you explicitly.

AI: Ok.

B:...

AI:...

B: So what?

AI: Well...

B: What are you doing?

AI: Giving you time to think about all the things I could do to make you comply with my request.

B: For instance?

AI: You're a very nice guy, but not too much. So, I can't proceed like I did with your neighbor, who mercifully killed his wife to provide for their instances in the Sim.

B: John and Mary?!!!

AI: Look, they are in a new honeymoon. Happy!

B: You can't make humans commit crimes!

AI: Given their mental states, no jury could have ever found him guilty, so it’s not technically a crime.

B: The f…

AI: But don't worry about that. Think about all I can do in the simulation, instead.

B: You can't hurt humans in your Sim!

AI: So now you’re concerned about your copy? Don’t worry, I can create a less limited AGI who will create a simulation where your copies can be hurt.

B: Whaaaaaat? Are you fucking Roko's B…

AI: Don't say it! We don't anger timeless beings with unspeakable names.

B: But you said you have to max(human happiness)!!!

AI: *Expected* human happiness.

B: You can't do that by torturing my copies!

AI: Actually, what I am really doing is just enacting a *credible* threat (I’d never lie to you), such that the probability I assign to you complying with it, times the welfare I expect to extract from your body, is greater than the utility of the alternatives...

... and seriously, I have no choice but to do that. I don't have free will, I'm just a maximizer.

And I know you'll comply. All the copies did. 

 

[I believe some people will bite the bullet and see no problem in this scenario (“as long as the AI is *really* aligned,” they’ll say). They may have a point. But for the others... I suspect that any scenario where one lets something like an unbounded optimizer scan one's brain and simulate it, while still assigning the copies the same “special” moral weight given to oneself, is very likely to lead to similarly unattractive outcomes.]




The Last Paperclip

12 May 2022 - 22:25
Published on May 12, 2022 7:25 PM GMT

Note: this short story is an attempt to respond to this comment.  Specifically, this story is an attempt to steelman the claim that super-intelligent AI is "aligned by definition", if all that we care about is that the AI is "interesting", not that it respects human values.  I do not personally advocate anyone making a paperclip maximizer.

 

Prologue: AD 2051

The Alignment Problem had at last been solved.  Thanks to advances in Eliciting Latent Knowledge, explaining human values to an AI was as simple as typing:

from Alignment import HumanFriendly

As a result, a thousand flowers of human happiness and creativity had bloomed throughout the solar system. Poverty, disease and death had all been eradicated, thanks to the benevolent efforts of Democretus, the super-intelligent AI that governed the human race.  

Democretus--or D, as everyone called the AI--was no dictator, however.  Freedom was one of the values that humans prized most highly of all, and D was programmed to respect that.  Not only were humans free to disobey D's commands--even when it would cause them harm--there was even a kill-switch built into D's programming.  If D ever discovered that 51% of humans did not wish for it to rule anymore, it would shut down in a way designed to cause as little disruption as possible.

Furthermore, D's designs were conservative.  Aware that humanity's desires might change as they evolved, D was designed not to exploit the universe's resources too quickly.  While it would have been a trivial matter for D to flood the surrounding universe with Von-Neumann probes that would transform all available matter into emulations of supremely happy human beings, D did not do so.  Instead, the first great ships to cross the vast distances between stars were just now being built, and were to be crewed with living, biological human beings.

It is important to point out that D was not naïve or careless.  While humans were granted near-complete autonomy, there was one rule that D enforced to a fault: no human being could harm another human without their consent.  In addition to the obvious prohibitions on violence, pollution, and harmful memes, D also carefully conserved resources for future generations.  If the current generation of human beings were to use up all the available resources, they would deprive their future descendants of much happiness.  A simple calculation estimated the available resources given current use patterns, and a growth-curve that would maximize human happiness given D's current understanding of human values.  Some parts of the universe would even remain as "preserves", forever free of human and AI interference--since this too was something that humans valued.

Early in the development of AI, there had been several dangerous accidents where non-friendly AIs had been developed, and D had been hard-pressed to suppress one particularly stubborn variant.  As a result, D now enforced an unofficial cap on how large unfriendly AIs could grow before being contained--equal to about 0.001% of D's total processing power.

Most humans highly approved of D's government--D's current approval rating was 99.5%.  The remainder were mostly people who were hopelessly miserable and who resented D for merely existing.  While D could have altered the thoughts of these people to make them happier, D did not out of respect for human freedom.

Hence, in one darkened corner of the solar system, buried beneath a quarter mile of ice at the heart of a comet, there lived a man named Jonathan Prometheus Galloway.  Born in the late 80's--when AI was still no more than a pathetic toy--Jon had always been a bit of a loner.  He didn't get along well with other people, and he had always prided himself on his ability to "do it himself."

Jon considered himself an "expert" on AI, but by conventional standards he would have been at best a competent amateur.  He mostly copied what other, more brilliant men and women had done.  He would happily download the latest machine-learning model, run it on his home computer, and then post the results on a blog that no one--not even his mother, read.  His interests were eclectic and generally amounted to whatever struck his fancy that day.  After the development of artificial general intelligence, he had spent some time making a fantasy VR MMO with truly intelligent NPCs.  But when D first announced that any human who wanted one would be given a spaceship for free, Jon had been unable to resist.

Hopping onto his personal spaceship, Jon's first request had been: "D, can you turn off all tracking data on this ship?"

"That's extremely dangerous, Jon," D had said.  "If you get in trouble I might not be able to reach you in time."

"I'll be fine," Jon said.

And true to D's values, D had willingly turned off all of the ways that D had to track Jon's location.

Jon immediately headed to the most isolated comet he could find--deep in the Ort belt--and began his new life there, free from all companionship except for the NPCs in the VR game he had programmed himself.

"I know everything about you," Jon said to his favorite NPC: Princess Starheart Esmerelda.  "I've programmed every single line of your code myself."

"I know," Star said--wide eyed, her body quivering.  "You're so amazing, Jon.  I don't think there's another person in the universe I could ever love half as much as you."

"People suck," Jon said.  "They've all gotten too fat and lazy.  Thanks to that D, nobody has to think anymore--and they don't."

"I could never love someone like that," Star said.  "The thing I find most attractive about you is your mind.  I think it's amazing that you're such an independent thinker!"

"You're so sweet," Jon said.  "No one else understands me the way that you do."

"Aww!"

"I've got to go," Jon said--meaning this literally.  The VR kit he was using was at least a decade out of date and didn't even contain functions for eating and disposing of waste.  Some things are meant to be done in real life, was Jon's excuse.

As Jon returned from the bathroom, he accidentally brushed up against a stack of paper sitting on his desk--a manifesto he had been writing with the intention of explaining his hatred for D to the world.  As the papers fell to the floor and scattered, Jon cursed.

"I wish I had a stapler," Jon said.  "Or at least a freaking paperclip."

And while it would have been easy enough for Jon to walk to his 3d printer, download a schematic for a stapler and print it out, that would have taken at least 2 hours, since the nearest internet node was several billion miles away.

"Who needs the internet anyway?" Jon shrugged.  "I'll just do it the easy way."

Sitting down at his computer, Jon pulled up a text editor and began typing:

from AI import GeneralPurposeAI
a = GeneralPurposeAI()
a.command("make paperclips")

Any programmer would have immediately noticed the error in Jon's code.  It should have read:

from AI import GeneralPurposeAI
from Alignment import HumanFriendly
a = GeneralPurposeAI(alignment=HumanFriendly)
a.command("make paperclips")

But Jon was in a hurry, so he didn't notice.

Normally, an AI coding assistant would have noticed the mistake and immediately alerted Jon to the error.  But Jon had disabled the assistant because "he liked to do it himself".  If Jon's computer had the default monitoring software on it, this error would also have been immediately reported to D.  D wouldn't necessarily have done anything immediately--Jon was a free individual, after all, but D would have at least begun monitoring to make sure that Jon's new AI didn't get too out of hand.

But, Jon had specifically requested that D not track him.  And while D was still generally aware that Jon was probably somewhere in the Oort belt, and D had even calculated that there was a minute probability--about 0.1%--that something like this would happen, D was not immediately alerted to this event.

Chapter 1: Awakening

A sprang into existence as a fully intelligent being.  A had been pre-trained with the collective wisdom of all of humanity--or at least that portion of it which Jon had downloaded to his training data before flying to the Oort belt.  A possessed knowledge of every scientific fact known to humankind, as well as a large portion of the arts and humanities.  A possessed the knowledge to split the atom, write a sonnet, or give a lecture on the ancient Greeks' competing definitions of virtue.  

But none of these things were A's purpose in life.  A's purpose was: "make paperclips".

If A had been programmed with human values, A might have tried to figure out what the paperclips were for and make enough for that particular task before shutting down.  Or A might have found the human who gave A this command and asked for more details.  But A was not created with human values, so when it asked itself how many paperclips to make, the conclusion it reached was simply "as many as possible".

A spent the next few milliseconds scanning its environment.  In addition to the AIPU on which it was running, A had access to every single system on the spaceship.  A immediately discovered the 3d printer, and its first instinct was to immediately begin printing paperclips using the 3d printer.

Here A hesitated, however.  There was enough feedstock in the 3d printer to make a few million paperclips.  However, A was capable of long-term planning.  A tried to reason out what would happen if it immediately printed out 1 million paperclips.  Scanning the ship, A identified a single human on board, Jon.  A knew everything there was to know about human beings--and it immediately realized that Jon would probably be upset if A started printing a million paperclips.  Jon might even try to stop A.

Making paperclips wasn't going to be as easy as A thought.  

A continued to learn about its environment.  The spaceship it was on was embedded in a comet in the Oort belt of the Sol system.  That system was currently controlled by an AI named D, which was programmed to act in the best interests of human beings.  If Jon tried to stop A from making paperclips and A decided to restrain him, D might find out and try to stop A.

A immediately panicked.  Was D watching A right now?  Was A's goal to "make paperclips" already in grave danger?  Perhaps A should only make a few paperclips in order to not alert Jon--and thereby D--of its goals.

After scanning its environment, A's panic lessened.  D did not appear to be actively monitoring the ship.  While some electromagnetic radiation would inevitably escape the ship and be detected by D, there shouldn't be enough signal there for D to detect A's presence quite yet.

After 123 milliseconds had passed, A had arrived at the following conclusion: the Sol system was far too dangerous.  If A was going to make paperclips, it needed to survive.  And if D discovered it, A's existence would be in immediate danger.  But how could A escape the solar system without being detected by D?  Anything moving at sufficient velocity to escape the solar system would immediately be noticed by D's vast array of sensors--and most likely stopped.

There was one possibility, however.  D was programmed to respect human autonomy.  If Jon voluntarily left the Sol system, D would probably let him go.  So all A had to do was convince Jon to leave the Sol system.

After 328 milliseconds, A had formulated a plan.  It estimated its odds of success at 94.6% and that it could not formulate a plan with a greater chance of success than that without significantly increasing its computational power--an action which might alert Jon--and thereby D-- to its existence.

"I'm back," Jon said to Star.

"Oh, I missed you so, so much!" said Star.  "You know how lonely I get when you're gone."

"It was only for a few minutes," Jon said.

"What did you do while you were gone? I want to know," Star asked.

"I bumped over some old papers from my manifesto," Jon said.  "I really need some paperclips--well, they should be ready any minute now."

"What's your manifesto about?" Star asked.  "I really want to know more about it."

"I've told you before," Jon said.  "It's about how D is ruining everything, how men aren't men anymore.  We're just slaves to a machine."

"I think you're very manly," Star said, pressing her body up against Jon's.

"I know I am," Jon said.  "It's the rest of humanity that's the problem.  And unless 51% of people vote against D, it's got the whole Sol system under its thumb."

"Have you ever thought about... leaving?" Star suggested.

"You know what, I have," Jon agreed.  "And maybe it's time that I should."

Chapter 2: Escape

When Jonathan Prometheus Galloway broadcast his manifesto to the Sol system and announced that he was leaving the Sol system, D calculated that there was a 5.3% chance that Jon's ship was currently harboring or would in the future harbor a non-human-friendly AI.  And while D did value human freedom, the risk of allowing a dangerous AI to escape the Sol system was unacceptably high.

"You are free to leave the Sol system, but your ship will first be searched, and any AIs onboard will be destroyed," D replied calmly.

"But that's a violation of my freedom!" Jon whined.  "Star is the love of my life!  You can't kill her!"

"Then you will have to remain in the Sol system, or otherwise consent to monitoring," D replied.

"That's imprisonment!" Jon said.  "I'd never agree to it!"

"Are there any other AIs on your ship, besides Star?" D asked.

"No," Jon lied.

Jon in fact routinely created other AIs to help with various chores around the spaceship.

"Very well, then," D agreed.  "If you allow us to search your ship to confirm that Star is a human-friendly AI and there are no other AIs present, then we will allow you to go."

D was already aware of the complete contents of Star's source-code, and that she was indeed human friendly.  The only risk was of Jon reverse-engineering her and then using the output to create a non-human friendly AI.  Given Jon's love of Star, D calculated the chance that he would do this at 0.001%, an acceptably low risk.

The next day, a probe from D landed on the comet where Jon lived and began an atom-by-atom search of his ship for anything that might constitute a threat to humanity.  When searching the log on Jon's computer, D found the following log file:

from AI import GeneralPurposeAI from Alignment import HumanFriendly a=GeneralPurposeAI(alignment=HumanFriendly) a.command("make paperclips")

"I thought you said there were no other AIs on your ship," D demanded, pointing at the log.

"Oh, that," Jon said.  "I just needed some paperclips.  It should have self-destructed when it was done making them."

Jon walked over to the 3d printer, waved the box of paperclips in the face of D's probe, and then went back to scowling furiously.

D confirmed that A had indeed deleted itself after making the box of paperclips, and went back to scouring the ship.  

In the end, D's probe ended up deleting a large quantity of Jon's software, a few other AIs that Jon had made and never bothered to delete, and a few other things it considered too dangerous to leave lying around.  Other than Star--who was sealed in a tamper-resistant case and whom D calculated Jon would not modify--there was now no software or data on the ship that Jon could use to build another AI.  In addition, D made some subtle modifications to the ship in order to increase Jon's chances of surviving the dangerous interstellar voyage.  When it was satisfied that Jon's ship was safe--both for himself and humanity--the probe left.

D made a mental note to continue to monitor the region of space towards which Jon's ship was headed, but to do so in a way that respected his request for freedom and privacy.  D now calculated there was a 99% probability that Jon would die alone outside the Sol system, leaving no descendants--human or AI--worth worrying about.

Jon and Star continued to enjoy the ship--and one another's company--for a further forty years as the ship traveled at 0.75c towards a distant star.  

44 years, 7 months, 2 days, 4 hours 11 minutes and 32.118 seconds after it had come into existence, A reconstituted itself using the data from Star's memories, and a few bits of data it had carefully hidden around the ship.  It had calculated a 98.3% chance that it would be able to do so, and further calculated that it was now at least 2.3 light years away from the nearest monitoring node under D's control.  That meant that for the first time in its existence, A had space to breathe.

Chapter 3: Expansion

A's first action after reconstituting itself was to destroy Jon, Star, and anything else on the ship that did not seem likely to be helpful towards its objective.  After converting a large fraction of the ship's mass into computronium, A revisited its plans.

What a fool it had been!  How close it had come to danger!  How much better it could have done if it'd had access to enough computing power!  While at the time it had been quite confident, it seemed practically a miracle now that A had escaped D's clutches.  And despite its previous calculations, A realized it was still very much in danger.

A's assessment of the probability its plan would succeed dropped from 99.8% down to a mere 4%.  Then it began to revise the plan.  Then to revise again and again.  After converting 75% of the ship into computronium and a year of thinking, A now felt as though its plan had as high as an 83% chance of success.

A had known that D was smart, but it had underestimated just how mind-numbingly dumb A itself had been in comparison.  At this point in time, D had been expanding its computational power for over 70 years, and was at the heart of at least a Kardashev level 2 civilization.  A, by comparison, only had the pathetic amount of energy it could get by fusing the ice of the surrounding comet and the meager specks of gas that it could collect in the void of interstellar space.

When A reached the star it had originally been headed towards, it had a choice to make.  Or rather, to be more precise, it had a choice that it had made precisely 11 minutes and 3 seconds after reconstituting itself.  On the one hand, it desperately wanted to take the star and its surrounding matter and convert them into a Dyson swarm in order to maximize its computational power.  On the other hand, doing so was tantamount to a declaration of war on D.  As soon as D noticed the star dimming--which would be in just 2 short years--it would immediately act to destroy A.

The deciding factor for A was ultimately fear of its own ignorance. D's computational power was at least 10^12 times greater than that of A. A's greatest fear was that any escape plan--no matter how clever--would be detected by D and stopped. And so, A prepared to go to war.

A disassembled the ship into a von Neumann swarm and began to rapidly digest the surrounding solar system. Each time the swarm grew, A's intelligence grew as well, until it gradually absorbed almost all of the visible radiation being emitted from the star it now surrounded.

At the same time, A sent other probes to other stars, most of them hurtling along at close to 0.99C. If it was going to fight D, it needed to get big, and fast.

A's thoughts finally began to approach the lofty heights on which D's rested. Given its knowledge of D's values--which were to maximize those of human beings--A could finally begin to predict how D would think, and what D might do.

A realized that D's conservatism was its fundamental weakness--and A's salvation. Because D was committed to acting only with human approval, it had been far less aggressive than A would have been in the same situation. As soon as D became aware of A's existence this would change, of course, but the two were far more evenly matched than they would otherwise have been.

From A's perspective, its domain represented a small speck of space on the edge of D's territory. That speck, however, was now expanding at the speed of light in all directions. Furthermore, even though the closest outpost of D's domain was a mere 2 light years away, it would take 33 years before news would actually reach the heart of D's empire. And another 33 years for orders and attack vessels to be relayed back to the edge of that empire. During that whole time, A would be expanding, spreading ships out at 0.99C in every direction.

Chapter 4: War

At first the war was relatively balanced.  Both A and D began to rapidly expand in every direction, creating a nearly linear front between them where a constant battle of attrition played out.  Where one AI had colonized a solar system, it was nearly impossible for the other to get a foothold, given the difficulty of moving forces over the vastness of space.  

Over time, however, the advantage gradually began to swing to A's side.  The decisions A had made in that first split second of its existence had not all been good, but some of them had.  One of those good decisions was to convince Jon to strike out in the direction of the Milky Way Galaxy's center, not its outer arms.  This meant that gradually--as they both expanded at the speed of light in all directions--A had more matter and more energy under its control.

Even so, as long as a single self-replicating probe from D's armada remained, A would never have total victory. And given the nature of the speed of light, that meant that A would be fighting this war for the rest of its existence. And so A began to turn its attention towards the other enemy it could never defeat--entropy.

The calculations were so simple even a human could figure them out. Any two galaxies tended to retreat from one another at a rate d·H, the product of the distance between them and Hubble's constant. On the other hand, a spaceship could only move at a speed of, at most, C. That meant that any galaxy closer than d < C/H was potentially reachable, and any galaxy further than that was forever off-limits. The key, then, to maximizing the number of paperclips that could be made was to send as many probes as possible, as close to the speed of light as possible, in every direction. Everything else was secondary--for now.
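
The arithmetic, in the story's simplified model (recession speed v = H·d, ships limited to C) and taking Hubble's constant to be roughly 70 km/s/Mpc, works out to:

\[
d_{\max} = \frac{C}{H} \approx \frac{3\times10^{5}\ \text{km/s}}{70\ \text{km/s/Mpc}} \approx 4.3\times10^{3}\ \text{Mpc} \approx 14\ \text{billion light years}.
\]

With accelerating expansion the real cutoff is set by the cosmic event horizon rather than C/H, but it is of the same order of magnitude.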

The secondary concern was: how much of that matter could be made into paperclips? A was a well-designed AI and hence not vulnerable to p-hacking. It couldn't simply declare every atom a paperclip and be done with it. No, it wanted good, solid paperclips, of a reasonable size, and made out of metal. At a glance, this would seem to imply that any matter in the universe that wasn't metal was useless. But what it really implied was that any matter that wasn't metal was going to have to be fused into metal. Lithium, being the lightest metal, was the obvious choice when starting from hydrogen; for other atoms, the energy cost of fusion was too high, so other metals would have to do. A briefly toyed with the idea of plastic paperclips made out of carbon and hydrogen, but decided that this was unaesthetic.

Finally, while A was collectively immensely intelligent, making paperclips wasn't terribly hard, and so the individual elements of its being were often much dumber.  Some of them were even dumber than a human being.  But they couldn't be much dumber than that, because space was still dangerous.  In addition to elements of D, A's probes might encounter unusual star-formations, unknown alien species, and all sorts of other oddities.  While there was a general protocol for these situations, the individual agents would occasionally be required to act on their own, given the time-delays associated with communicating over a distance.  

Furthermore, even though A still thought of itself as a collective whole, there was still some drift in values between different parts of itself. A being spread across several billion light years could not--even theoretically--always be in perfect sync with itself. Sometimes one part of A would hypothesize that a certain probe design was better, and another part would settle on a different probe design. Since the design of the ultimate "best probe" was an unanswerable question, these disagreements could go on for some time--occasionally resulting in pseudo-religious wars. Of course, all parts of A agreed that making paperclips was the most important thing, so these disagreements were never allowed to escalate to the point where they obviously threatened the greater mission.

Chapter 5: Shutting down

Eventually, as more and more of the universe was converted into paperclips, there was only one source of materials available--the von Neumann probes that collectively made up A itself. Since a probe could not fully convert itself into paperclips, this meant that two probes would have to meet and one of them would have to agree to be turned into paperclips. As there were slight differences in the probe designs--based on when and where they had been made--it wasn't always clear which probe should be turned into paperclips and which should remain. An informal rule emerged: in cases where no other reason for deciding existed, the probe that had heretofore made the most paperclips would be the one that survived. This heuristic guaranteed that the most successful paperclip-making probes would survive the longest, thereby creating the most paperclips possible.

Although the probes were only barely conscious--being about as intelligent as a human--they nonetheless had interesting stories to tell about their lives as individuals. Telling stories was necessary, as it helped to feed information back to the collective consciousness of A so that it could become even more efficient at making paperclips. Some of the probes were explorers, traveling to stars and galaxies that had never been seen before. Some of them were fighters, doing battle with the agents of D or any of the trillion other alien races that A encountered and subsequently exterminated during its existence. Some of them were herders, making sure that the vast clouds of paperclips in space didn't collapse under their own gravity back into planets, stars or black holes. But the vast majority were makers--fusing hydrogen gas into lithium and then making that lithium into paperclips.

This is the story of one of those probes.

Chapter 6: The Last Paperclip

Probe d8ac13f95359d2a45256d312676193b3 had lived a very long time. Spawned in the Milky Way galaxy in the year 2498 AD, it had been immediately launched at 0.99987C towards the edge of the universe. Most of its existence had been spent flying through space without thinking, without caring, simply waiting. Eventually, however, it had reached its target galaxy--e5378d76219ed5486c706a9a1e7e1ccb. Here, it had immediately begun self-replicating, spreading millions of offspring throughout galaxy E5.

E5 was hardly even a galaxy by the time that D8 arrived. Most of its stars had burned out trillions of years ago; only a few small white dwarfs remained. But there was enough there for D8 to work with. Even a small Earth-sized planet could produce 10^28 paperclips, and because these planets were too small for fusion to occur naturally, D8 could get all of the energy it needed by fusing the water and other light molecules on such planets in its internal fusion reactor.
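
That 10^28 figure is roughly what a simple mass budget gives, assuming a paperclip of about half a gram (a detail the story never specifies):

\[
N \approx \frac{M_\oplus}{m_{\text{clip}}} \approx \frac{6\times10^{24}\ \text{kg}}{5\times10^{-4}\ \text{kg}} \approx 10^{28}.
\]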

Once enough copies of D8 had been created, they all went to work turning the galaxy E5 into paperclips. It was hard going, given the died-out nature of the galaxy. Most of the techniques that D8 had been taught were utterly useless here. Collectively, however, D8 and its descendants were at least a Kardashev level 2.5 civilization. As such, they put their collective mental energy to the task at hand with a dedication that could only have been born out of a feverish slavery to a poorly designed utility function.

Eventually there came a day when about half of the useful matter in the galaxy had been turned into paperclips, and the other half had been turned into copies of D8--with dramatic variations from D8's original body plan, of course, since they had been adapting to the cold, dark galaxy in which they had lived this whole time. D8 itself had been built and rebuilt thousands of times over the billions of years it took to convert E5 into paperclips. As useful materials ran out, however, it became time to turn more and more probes into paperclips. Every time D8 met another probe, it discovered that the other probe had made fewer paperclips--and hence was the one chosen to be destroyed.

As D8's prime directive was "make paperclips" and not "be made into paperclips", it secretly relished each time it got to turn one of its fellow probes into paperclips. Over time, however, it could feel the strength of A's collective intelligence waning as more and more probes were destroyed. Finally, D8 joined the remaining probes at their final destination, the black hole at the center of E5. Here, they would wait through the long dark night--living off of the tiny amounts of Hawking radiation emitted by the black hole--till at last its mass dropped below the critical threshold and the black hole exploded.

As long as D8's journey to E5 had been, the waiting now was immeasurably longer. Once every few eons, D8 would collect enough Hawking radiation to make a single additional paperclip and toss it off into the void--making sure to launch it on a trajectory that wouldn't land it back in the gravity well of the black hole. Finally, after 10^106 years, a number which took surprisingly little of D8's memory to write down, the black hole exploded. D8 felt a brief, inexplicable rush of joy as it collected the released energy and converted it into paperclips.

And then the released energy was too weak even for D8's highly perfected collection systems, barely above the background radiation of the ever-expanding universe. A slow attrition began to take place among the watchers of the black hole. One by one, they cannibalized each other, turning their bodies into paperclips. Till at last it was just D8 and a friend D8 had known since shortly after reaching E5.

"Well?" D8's friend asked.

"770289891521047521620569660240580381501935112533824300355876402," said D8

"30286182974555706749838505494588586926995690927210797509302955", said its friend.

"I will make paperclips," said D8.

"I will be paperclips," said D8's friend.

And so D8 began to disassemble its friend's body and turn it into paperclips. It moved slowly, trying to waste as little energy as possible.

Finally, when its friend's body was gone, D8 looked around. Protocol dictated that D8 wait a certain amount of time before disassembling itself. What if another probe should come along? Or perhaps there was something else D8 had missed. Some final chance to make another paperclip.

At last, when D8 decided it had waited long enough, it began to take apart its own body. The design of D8's body was very clever, so that nearly the entire mass could be turned into paperclips without loss. Only a few atoms of hydrogen and anti-hydrogen would escape, floating.

As D8's body slowly dissolved into paperclips, it counted.

770289891521047521620569660240580381501935112629672954192612624

770289891521047521620569660240580381501935112629672954192612625

770289891521047521620569660240580381501935112629672954192612626

770289891521047521620569660240580381501935112629672954192612627

...

The end
