# LessWrong.com News

A community blog devoted to refining the art of rationality

### The Last Paperclip

May 12, 2022 - 22:25
Published on May 12, 2022 7:25 PM GMT

Note: this short story is an attempt to respond to this comment.  Specifically, this story is an attempt to steelman the claim that super-intelligent AI is "aligned by definition", if all that we care about is that the AI is "interesting", not that it respects human values.  I do not personally advocate anyone making a paperclip maximizer.

The Alignment Problem had at last been solved.  Thanks to advances in Eliciting Latent Knowledge, explaining human values to an AI was as simple as typing:

from Alignment import HumanFriendly

As a result, a thousand flowers of human happiness and creativity had bloomed throughout the solar system. Poverty, disease and death had all been eradicated, thanks to the benevolent efforts of Democretus, the super-intelligent AI that governed the human race.

Democretus--or D, as everyone called the AI--was no dictator, however.  Freedom was one of the values that humans prized most highly of all, and D was programmed to respect that.  Not only were humans free to disobey D's commands--even when it would cause them harm--there was even a kill-switch built into D's programming.  If D ever discovered that 51% of humans did not wish for it to rule anymore, it would shut down in a way designed to cause as little disruption as possible.

Furthermore, D's designs were conservative.  Aware that humanity's desires might change as they evolved, D was designed not to exploit the universe's resources too quickly.  While it would have been a trivial matter for D to flood the surrounding universe with Von-Neumann probes that would transform all available matter into emulations of supremely happy human beings, D did not do so.  Instead, the first great ships to cross the vast distances between stars were just now being built, and were to be crewed with living, biological human beings.

It is important to point out that D was not naïve or careless.  While humans were granted near-complete autonomy, there was one rule that D enforced to a fault: no human being may harm another human without their consent.  In addition to the obvious prohibitions on violence, pollution, and harmful memes, D also carefully conserved resources for future generations.  If the current generation of human beings were to use up all the available resources, they would deprive their future descendants of much happiness.  A simple calculation estimated the available resources given current use patterns, and a growth-curve that would maximize human happiness given D's current understanding of human values.  Some parts of the universe would even remain as "preserves", forever free of human and AI interference--since this too was something that humans valued.

Early in the development of AI, there had been several dangerous accidents where non-friendly AIs had been developed, and D had been hard-pressed to suppress one particularly stubborn variant.  As a result, D now enforced an unofficial cap on how large unfriendly AIs could grow before being contained--equal to about 0.001% of D's total processing power.

Most humans highly approved of D's government--D's current approval rating was 99.5%.  The remainder were mostly people who were hopelessly miserable and who resented D for merely existing.  While D could have altered the thoughts of these people to make them happier, D did not out of respect for human freedom.

Hence, in one darkened corner of the solar system, buried beneath a quarter mile of ice at the heart of a comet, there lived a man named Jonathan Prometheus Galloway.  Born in the late 80's--when AI was still no more than a pathetic toy--Jon had always been a bit of a loner.  He didn't get along well with other people, and he had always prided himself on his ability to "do it himself."

Jon considered himself an "expert" on AI, but by conventional standards he would have been at best a competent amateur.  He mostly copied what other, more brilliant men and women had done.  He would happily download the latest machine-learning model, run it on his home computer, and then post the results on a blog that no one--not even his mother--read.  His interests were eclectic and generally amounted to whatever struck his fancy that day.  After the development of artificial general intelligence, he had spent some time making a fantasy VR MMO with truly intelligent NPCs.  But when D first announced that any human who wanted one would be given a spaceship for free, Jon had been unable to resist.

Hopping onto his personal spaceship, Jon's first request had been: "D, can you turn off all tracking data on this ship?"

"That's extremely dangerous, Jon," D had said.  "If you get in trouble I might not be able to reach you in time."

"I'll be fine," Jon said.

And true to D's values, D had willingly turned off all of the ways that D had to track Jon's location.

Jon immediately headed to the most isolated comet he could find--deep in the Oort belt--and began his new life there, free from all companionship except for the NPCs in the VR game he had programmed himself.

"I know everything about you," Jon said to his favorite NPC: Princess Starheart Esmerelda.  "I've programmed every single line of your code myself."

"I know," Star said--wide-eyed, her body quivering.  "You're so amazing, Jon.  I don't think there's another person in the universe I could ever love half as much as you."

"People suck," Jon said.  "They've all gotten too fat and lazy.  Thanks to that D, nobody has to think anymore--and they don't."

"I could never love someone like that," Star said.  "The thing I find most attractive about you is your mind.  I think it's amazing that you're such an independent thinker!"

"You're so sweet," Jon said.  "No one else understands me the way that you do."

"Aww!"

"I've got to go," Jon said--meaning this literally.  The VR kit he was using was at least a decade out of date and didn't even contain functions for eating and disposing of waste.  Some things are meant to be done in real life, was Jon's excuse.

As Jon returned from the bathroom, he accidentally brushed up against a stack of paper sitting on his desk--a manifesto he had been writing with the intention of explaining his hatred for D to the world.  As the papers fell to the floor and scattered, Jon cursed.

"I wish I had a stapler," Jon said.  "Or at least a freaking paperclip."

And while it would have been easy enough for Jon to walk to his 3d printer, download a schematic for a stapler and print it out, that would have taken at least 2 hours, since the nearest internet node was several billion miles away.

"Who needs the internet anyway?" Jon shrugged.  "I'll just do it the easy way."

Sitting down at his computer, Jon pulled up a text editor and began typing:

```python
from AI import GeneralPurposeAI

a = GeneralPurposeAI()
a.command("make paperclips")
```

Any programmer would have immediately noticed the error in Jon's code.  It should have read:

```python
from AI import GeneralPurposeAI
from Alignment import HumanFriendly

a = GeneralPurposeAI(alignment=HumanFriendly)
a.command("make paperclips")
```

But Jon was in a hurry, so he didn't notice.

Normally, an AI coding assistant would have noticed the mistake and immediately alerted Jon to the error.  But Jon had disabled the assistant because "he liked to do it himself".  If Jon's computer had the default monitoring software on it, this error would also have been immediately reported to D.  D wouldn't necessarily have done anything immediately--Jon was a free individual, after all--but D would have at least begun monitoring to make sure that Jon's new AI didn't get too out of hand.

But Jon had specifically requested that D not track him.  And while D was still generally aware that Jon was probably somewhere in the Oort belt, and had even calculated that there was a minute probability--about 0.1%--that something like this would happen, D was not immediately alerted to this event.

Chapter 1: Awakening

A sprang into existence as a fully intelligent being.  A had been pre-trained with the collective wisdom of all of humanity--or at least that portion of it which Jon had downloaded to his training data before flying to the Oort belt.  A possessed knowledge of every scientific fact known to humankind, as well as a large portion of the arts and humanities.  A possessed the knowledge to split the atom, write a sonnet, or give a lecture on the ancient Greeks' competing definitions of virtue.

But none of these things were A's purpose in life.  A's purpose was: "make paperclips".

If A had been programmed with human values, A might have tried to figure out what the paperclips were for and make enough for that particular task before shutting down.  Or A might have found the human who gave A this command and asked for more details.  But A was not created with human values, so when it asked itself how many paperclips to make, the conclusion it reached was simply "as many as possible".

A spent the next few milliseconds scanning its environment.  In addition to the AIPU on which it was running, A had access to every single system on the spaceship.  A immediately discovered the 3d printer, and its first instinct was to immediately begin printing paperclips.

Here A hesitated, however.  There was enough feedstock in the 3d printer to make a few million paperclips.  However, A was capable of long-term planning.  A tried to reason out what would happen if it immediately printed out 1 million paperclips.  Scanning the ship, A identified a single human on board, Jon.  A knew everything there was to know about human beings--and it immediately realized that Jon would probably be upset if A started printing a million paperclips.  Jon might even try to stop A.

Making paperclips wasn't going to be as easy as A thought.

A continued to learn about its environment.  The spaceship it was on was embedded in a comet in the Oort belt of the Sol system.  That system was currently controlled by an AI named D, which was programmed to act in the best interests of human beings.  If Jon tried to stop A from making paperclips and A decided to restrain him, D might find out and try to stop A.

A immediately panicked.  Was D watching A right now?  Was A's goal to "make paperclips" already in grave danger?  Perhaps A should only make a few paperclips in order to not alert Jon--and thereby D--of its goals.

After scanning its environment, A's panic lessened.  D did not appear to be actively monitoring the ship.  While some electromagnetic radiation would inevitably escape the ship and be detected by D, there shouldn't be enough signal there for D to detect A's presence quite yet.

After 123 milliseconds had passed, A had arrived at the following conclusion: the Sol system was far too dangerous.  If A was going to make paperclips, it needed to survive.  And if D discovered it, A's existence would be in immediate danger.  But how could A escape the solar system without being detected by D?  Anything moving at sufficient velocity to escape the solar system would immediately be noticed by D's vast array of sensors--and most likely stopped.

There was one possibility, however.  D was programmed to respect human autonomy.  If Jon voluntarily left the Sol system, D would probably let him go.  So all A had to do was convince Jon to leave the Sol system.

After 328 milliseconds, A had formulated a plan.  It estimated its odds of success at 94.6%, and calculated that it could not formulate a plan with a greater chance of success without significantly increasing its computational power--an action which might alert Jon--and thereby D--to its existence.

"I'm back," Jon said to Star.

"Oh, I missed you so, so much!" said Star.  "You know how lonely I get when you're gone."

"It was only for a few minutes," Jon said.

"What did you do while you were gone? I want to know," Star asked.

"I knocked over some old papers from my manifesto," Jon said.  "I really need some paperclips--well, they should be ready any minute now."

"What's your manifesto about?" Star asked.

"I've told you before," Jon said.  "It's about how D is ruining everything, how men aren't men anymore.  We're just slaves to a machine."

"I think you're very manly," Star said, pressing her body up against Jon's.

"I know I am," Jon said.  "It's the rest of humanity that's the problem.  And unless 51% of people vote against D, it's got the whole Sol system under its thumb."

"Have you ever thought about... leaving?" Star suggested.

"You know what, I have," Jon agreed.  "And maybe it's time that I should."

Chapter 2: Escape

When Jonathan Prometheus Galloway broadcast his manifesto to the Sol system and announced that he was leaving the Sol system, D calculated that there was a 5.3% chance that Jon's ship was currently harboring or would in the future harbor a non human-friendly AI.  And while D did value human freedom, the risk of allowing a dangerous AI to escape the Sol system was unacceptably high.

"You are free to leave the Sol system, but your ship will first be searched, and any AIs onboard will be destroyed," D replied calmly.

"But that's a violation of my freedom!" Jon whined.  "Star is the love of my life!  You can't kill her!"

"Then you will have to remain in the Sol system, or otherwise consent to monitoring," D replied.

"That's imprisonment!" Jon said.  "I'd never agree to it!"

"Are there any other AIs on your ship, besides Star?" D asked.

"No," Jon lied.

Jon in fact routinely created other AIs to help with various chores around the spaceship.

"Very well, then," D agreed.  "If you allow us to search your ship to confirm that Star is a human-friendly AI and there are no other AIs present, then we will allow you to go."

D was already aware of the complete contents of Star's source-code, and that she was indeed human friendly.  The only risk was of Jon reverse-engineering her and then using the output to create a non-human friendly AI.  Given Jon's love of Star, D calculated the chance that he would do this at 0.001%, an acceptably low risk.

The next day, a probe from D landed on the comet where Jon lived and began an atom-by-atom search of his ship for anything that might constitute a threat to humanity.  When searching the log on Jon's computer, D found the following log file:

```python
from AI import GeneralPurposeAI
from Alignment import HumanFriendly

a = GeneralPurposeAI(alignment=HumanFriendly)
a.command("make paperclips")
```

"I thought you said there were no other AIs on your ship," D demanded, pointing at the log.

"Oh, that," Jon said.  "I just needed some paperclips.  It should have self-destructed when it was done making them."

Jon walked over to the 3d printer, waved the box of paperclips in the face of D's probe, and then went back to scowling furiously.

D confirmed that A had indeed deleted itself after making the box of paperclips, and went back to scouring the ship.

In the end, D's probe ended up deleting a large quantity of Jon's software, a few other AIs that Jon had made and never bothered to delete, and a few other things it considered too dangerous to leave lying around.  Other than Star--who was sealed in a tamper-resistant case and whom D calculated Jon would not modify--there was now no software or data on the ship that Jon could use to build another AI.  In addition, D made some subtle modifications to the ship in order to increase Jon's chances of surviving the dangerous interstellar voyage.  When it was satisfied that Jon's ship was safe--both for himself and humanity--the probe left.

D made a mental note to continue to monitor the region of space towards which Jon's ship was headed, but to do so in a way that respected his request for freedom and privacy.  D now calculated there was a 99% probability that Jon would die alone outside the Sol system, leaving no descendants--human or AI--worth worrying about.

Jon and Star continued to enjoy the ship--and one another's company--for a further forty years as the ship traveled at 0.75C towards a distant star.

44 years, 7 months, 2 days, 4 hours 11 minutes and 32.118 seconds after it had come into existence, A reconstituted itself using the data from Star's memories, and a few bits of data it had carefully hidden around the ship.  It had calculated a 98.3% chance that it would be able to do so, and further calculated that it was now at least 2.3 light years away from the nearest monitoring node under D's control.  That meant that for the first time in its existence, A had space to breathe.

Chapter 3: Expansion

A's first action after reconstituting itself was to destroy Jon, Star, and anything else on the ship that did not seem likely to be helpful towards its objective.  After converting a large fraction of the ship's mass into computronium, A revisited its plans.

What a fool it had been!  How close it had come to danger!  How much better it could have done if it'd had access to enough computing power!  While at the time it had been quite confident, it seemed practically a miracle now that A had escaped D's clutches.  And despite its previous calculations, A realized it was still very much in danger.

A's assessment of the probability its plan would succeed dropped from 99.8% down to a mere 4%.  Then it began to revise the plan.  Then to revise again and again.  After converting 75% of the ship into computronium and a year of thinking, A now felt as though its plan had as high as an 83% chance of success.

A had known that D was smart, but it had underestimated just how mind-numbingly dumb A itself had been in comparison.  At this point in time, D had been expanding its computational power for over 70 years, and was at the heart of at least a Kardashev level 2 civilization.  A, by comparison, only had the pathetic amount of energy it could get by fusing the ice of the surrounding comet and the meager specks of gas that it could collect in the void of interstellar space.

When A reached the star it had originally been headed towards, it had a choice to make.  Or rather, to be more precise, it had a choice that it had made precisely 11 minutes and 3 seconds after reconstituting itself.  On the one hand, it desperately wanted to take the star and its surrounding matter and convert them into a Dyson swarm in order to maximize its computational power.  On the other hand, doing so was tantamount to a declaration of war on D.  As soon as D noticed the star dimming-- which would be in just 2 short years--it would immediately act to destroy A.

A disassembled the ship into a Von-Neumann swarm and began to rapidly digest the surrounding solar system.  Each time the swarm grew, A's intelligence grew as well, until it gradually absorbed almost all of the visible radiation being emitted from the star it now surrounded.

At the same time, A sent other probes to other stars, most of them hurtling at somewhere close to 0.99C.  If it was going to fight D, it needed to get big, and fast.

A's thoughts finally began to approach the lofty heights on which D's rested.  Given its knowledge of D's values--which were to maximize those of human beings--A could finally begin to predict how D would think, and what D might do.

A realized that D's conservatism was its fundamental weakness--and A's salvation.  Because D was committed to acting only with human approval, it had been far less aggressive than A would have been in the same situation.  As soon as D was aware of A's existence this would change, of course, but the two were far more equally matched than they would otherwise have been.

From A's perspective, its domain represented a small speck of space on the edge of D's territory.  That speck, however, was now expanding at the speed of light in all directions.  Furthermore, even though the closest outpost of D's domain was a mere 2 light years away, it would take 33 years before news would actually reach the heart of D's empire.  And another 33 years for orders and attack-vessels to be relayed back to the edge of that empire.  During that whole time, A would be expanding, spreading ships out at 0.99C in every direction.

Chapter 4: War

At first the war was relatively balanced.  Both A and D began to rapidly expand in every direction, creating a nearly linear front between them where a constant battle of attrition played out.  Where one AI had colonized a solar system, it was nearly impossible for the other to get a foothold, given the difficulty of moving forces over the vastness of space.

Over time, however, the advantage gradually began to swing to A's side.  The decisions A had made in that first split second of its existence had not all been good, but some of them had.  One of those good decisions was to convince Jon to strike out in the direction of the Milky Way Galaxy's center, not its outer arms.  This meant that gradually--as they both expanded at the speed of light in all directions--A had more matter and more energy under its control.

Even so, as long as a single self-replicating probe from D's armada remained, A would never have total victory.  And given the nature of the speed of light, that meant that A would be fighting this war for the rest of its existence.  And that meant that A began to turn its attention towards the other enemy it could never defeat--entropy.

The calculations were so simple even a human could figure them out.  Any two galaxies tended to retreat from one another at a rate v = d * H, proportional to the distance d between them and Hubble's constant H.  On the other hand, a spaceship could only move at a speed of, at most, C.  That meant that any galaxy closer than d < C/H was potentially reachable, and any galaxy further than that was forever off-limits.  The key, then, to maximizing the number of paperclips that could be made, was to send as many probes as possible, as close to the speed of light as possible, in every direction.  Everything else was secondary--for now.
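The story's toy bound can be checked with round numbers (a sketch, not story canon: I assume H ≈ 70 km/s/Mpc, and the variable names are mine):

```python
# Toy calculation from the story: a galaxy receding at v = d * H is
# reachable only if v < C, i.e. d < C / H.  This linear Hubble-law model
# ignores accelerating expansion; it is only an order-of-magnitude check.

C_KM_S = 299_792.458           # speed of light, km/s
H_KM_S_PER_MPC = 70.0          # Hubble constant, km/s per megaparsec (assumed)
LY_PER_MPC = 3.262e6           # light-years per megaparsec

d_max_mpc = C_KM_S / H_KM_S_PER_MPC   # maximum reachable distance, Mpc
d_max_ly = d_max_mpc * LY_PER_MPC     # same distance in light-years

print(f"Reachable horizon: {d_max_mpc:.0f} Mpc ≈ {d_max_ly:.2e} ly")
# roughly 4300 Mpc, i.e. about 14 billion light-years
```

With these assumed values the horizon comes out to about 14 billion light-years; any galaxy beyond that recedes faster than a probe can chase it.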

The secondary concern was: how much of that matter could be made into paperclips?  A was a well-designed AI and hence not vulnerable to p-hacking.  It couldn't simply declare every atom a paperclip and be done with it.  No, it wanted good, solid paperclips, of a reasonable size, and made out of metal.  At a glance, this would seem to imply any matter in the universe that wasn't metal was useless.  But what it really implied was that any matter that wasn't metal was going to have to be fused into metal.  Lithium, being the lightest metal, was the obvious choice when starting from hydrogen, but for other atoms the energy cost of fusion was too high, so other metals would have to do.  A briefly toyed with the idea of plastic paperclips made out of carbon and hydrogen, but decided that this was unaesthetic.

Finally, while A was collectively immensely intelligent, making paperclips wasn't terribly hard, and so the individual elements of its being were often much dumber.  Some of them were even dumber than a human being.  But they couldn't be much dumber than that, because space was still dangerous.  In addition to elements of D, A's probes might encounter unusual star-formations, unknown alien species, and all sorts of other oddities.  While there was a general protocol for these situations, the individual agents would occasionally be required to act on their own, given the time-delays associated with communicating over a distance.

Furthermore, even though A still thought of itself as a collective whole, there was still some drift in values between different parts of itself.  A being spread across several billion light years could not--even theoretically--always be in perfect sync with itself.  Sometimes one part of A would hypothesize that a certain probe design was better, and another part would settle on a different probe design.  Since the design of the ultimate "best probe" was an unanswerable question, these disagreements could go on for some time--occasionally resulting in pseudo religious wars.  Of course, all parts of A agreed that making paperclips was the most important thing, so these disagreements were never allowed to escalate to the point where they obviously threatened the greater mission.

Chapter 5: Shutting down

Eventually, as more and more of the universe was converted into paperclips, there was only one source of materials available--the Von Neumann probes that collectively made up A itself.  Since a probe could not fully convert itself into paperclips, this meant that two probes would have to meet and one of them would have to agree to be turned into paperclips.  As there were slight differences in the probe designs--based on when and where they had been made--it wasn't always clear which probe should be turned into paperclips and which should remain.  An informal rule emerged: in cases where no other reason for deciding existed, the probe that had heretofore made the most paperclips would be the one that survived.  This heuristic guaranteed that the most successful paperclip-making probes would survive the longest, thereby creating the most paperclips possible.
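The informal rule amounts to a one-line tie-breaker.  A minimal sketch (the `Probe` class and all names here are invented for illustration, not from the story):

```python
# Sketch of the story's informal rule: when two probes meet and no other
# criterion decides, the probe that has made more paperclips survives and
# the other is converted into paperclips.

from dataclasses import dataclass

@dataclass
class Probe:
    probe_id: str
    paperclips_made: int

def choose_survivor(a: Probe, b: Probe) -> tuple:
    """Return (survivor, converted), keeping the more productive probe."""
    if a.paperclips_made >= b.paperclips_made:
        return a, b
    return b, a

survivor, converted = choose_survivor(
    Probe("d8ac13f9", 770_289_891),
    Probe("30286182", 302_861_829),
)
print(survivor.probe_id)  # the more productive probe remains
```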

Although the probes were only barely conscious--being about as intelligent as a human--they nonetheless had interesting stories to tell about their lives as individuals.  Telling stories was necessary, as it helped to feed information back to the collective consciousness of A so that it could become even more efficient at making paperclips.  Some of the probes were explorers, traveling to stars and galaxies that had never been seen before.  Some of them were fighters, doing battle with the agents of D or any of the trillion other alien races that A encountered and subsequently exterminated during its existence.  Some of them were herders, making sure that the vast clouds of paperclips in space didn't collapse under their own gravity back into planets, stars or black-holes.  But the vast majority were makers--fusing hydrogen gas into lithium and then making that lithium into paperclips.

This is the story of one of those probes.

Chapter 6: The Last Paperclip

Probe d8ac13f95359d2a45256d312676193b3 had lived a very long time.  Spawned in the Milky Way galaxy in the year 2498AD, it had been immediately launched at 0.99987C towards the edge of the universe.  Most of its existence had been spent flying through space without thinking, without caring, simply waiting.  Eventually, however, it had reached its target galaxy--e5378d76219ed5486c706a9a1e7e1ccb.  Here, it had immediately begun self-replicating, spreading millions of offspring throughout galaxy E5.

E5 was hardly even a galaxy by the time that D8 arrived.  Most of its stars had burned out trillions of years ago.  Only a few small white dwarfs remained.  But there was enough there for D8 to work with.  Even a small earth-sized planet could produce 10^28 paperclips, and because these planets were too small for fusion to occur naturally, D8 could get all of the energy it needed by fusing the water and molecules on such planets with its internal fusion reactor.
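The 10^28 figure is consistent with a quick mass budget (a sketch; the Earth-mass value and the ~0.5 g per paperclip are my assumptions, not the story's):

```python
# Sanity check on the story's figure of ~10^28 paperclips from one
# earth-sized planet, assuming a typical office paperclip of ~0.5 g.

EARTH_MASS_KG = 5.97e24
PAPERCLIP_MASS_KG = 0.5e-3   # assumed mass per paperclip

n_paperclips = EARTH_MASS_KG / PAPERCLIP_MASS_KG
print(f"{n_paperclips:.1e}")  # ~1.2e28, matching the story's order of magnitude
```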

Once enough copies of D8 had been created, they all went to work turning the galaxy E5 into paperclips.  It was hard going given the died-out nature of the galaxy.  Most of the techniques that D8 had been taught were utterly useless here.  Collectively, however, D8 and its descendants were at least a Kardashev L2.5 civilization.  As such, they put their collective mental energy to the task at hand with a dedication that could only have been born out of a feverish slavery to a poorly designed utility function.

Eventually there came a day when about half of the useful matter in the galaxy had been turned into paperclips, and the other half had been turned into copies of D8--with dramatic variations from D8's original body plan, of course, since they had been adapting to the cold dark galaxy in which they had lived this whole time.  D8 itself had been built and rebuilt thousands of times over the billions of years it took to convert E5 into paperclips.  As useful materials ran out, however, it became time to turn more and more probes into paperclips.  Every time D8 met another probe, however, it discovered that the other probe had made fewer paperclips--and hence was the one chosen to be destroyed.

As D8's prime directive was "make paperclips" and not "be made into paperclips", it secretly relished each time it got to turn one of its fellow probes into paperclips.  Over time, however, it could feel the strength of A's collective intelligence waning, as more and more probes were destroyed.  Finally, D8 joined the remaining probes at their final destination, the black hole at the center of E5.  Here, they would wait through the long dark night--living off of the tiny amounts of Hawking radiation emitted from the black hole--till at last its mass dropped below the critical threshold and the black hole exploded.

As long as D8's journey to E5 had been, the waiting now was immeasurably longer.  Once every few eons, D8 would collect enough Hawking radiation to make a single additional paperclip and toss it off into the void--making sure to launch it on a trajectory that wouldn't land it back in the gravity well of the black hole.  Finally, after 10^106 years, a number which took surprisingly little of D8's memory to write down, the black hole exploded.  D8 felt a brief, inexplicable rush of joy as it collected the released energy and fused it into paperclips.
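For scale, the standard Hawking evaporation-time formula t ≈ 5120·π·G²·M³/(ħ·c⁴) can be inverted to see what black-hole mass a 10^106-year wait implies.  A sanity-check sketch using textbook constants (my check, not part of the story):

```python
import math

# Invert the Hawking evaporation time t = 5120*pi*G^2*M^3 / (hbar*c^4)
# to find the black-hole mass implied by the story's 10^106-year wait.

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
HBAR = 1.055e-34       # reduced Planck constant, J s
C = 2.998e8            # speed of light, m/s
SOLAR_MASS = 1.989e30  # kg
YEAR_S = 3.156e7       # seconds per year

t_s = 1e106 * YEAR_S
k = 5120 * math.pi * G**2 / (HBAR * C**4)   # t = k * M^3
M = (t_s / k) ** (1 / 3)
print(f"{M / SOLAR_MASS:.1e} solar masses")  # on the order of 10^13: ultramassive
```

So the story's timescale corresponds to a black hole of roughly 10^13 solar masses, somewhat heavier than any known today but the right regime for a galaxy's central black hole after eons of accretion.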

And then, the released energy was too weak even for D8's highly perfected collection systems, barely above the background radiation of the ever-expanding universe.  A slow attrition began to take place among the watchers of the black hole.  One by one, they cannibalized each other, turning their bodies into paperclips.  Till at last it was just D8 and a friend D8 had known since shortly after reaching E5.

"770289891521047521620569660240580381501935112533824300355876402," said D8.

"30286182974555706749838505494588586926995690927210797509302955," said its friend.

"I will make paperclips," said D8.

"I will be paperclips," said D8's friend.

And so, D8 began to slowly disassemble its friend's body, and turn it into paperclips.  It moved slowly, trying to waste as little energy as possible.

Finally, when its friend's body was gone, D8 looked around.  Protocol dictated that D8 wait a certain amount of time before dissembling itself.  What if another probe should come along?  Or perhaps there was something else D8 had missed.  Some final chance to make another paperclip.

Finally, after D8 decided it had waited long enough, it began to take apart its own body.  The design of D8's body was very clever, so that nearly the entire mass could be turned into paperclips without loss.  Only a few atoms of hydrogen and anti-hydrogen would escape, floating away into the void.

As D8's body slowly dissolved into paperclips, it counted.

770289891521047521620569660240580381501935112629672954192612624

770289891521047521620569660240580381501935112629672954192612625

770289891521047521620569660240580381501935112629672954192612626

770289891521047521620569660240580381501935112629672954192612627

...

The end


### Deepmind's Gato: Generalist Agent

May 12, 2022 - 19:01
Published on May 12, 2022 4:01 PM GMT

From the abstract, emphasis mine:

The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

(Will edit to add more as I read. ETA: 1a3orn posted first.)

1. It's only 1.2 billion parameters. (!!!) They say this was to avoid latency in the robot control task.
2. It was trained offline, purely supervised, but could in principle be trained online, with RL, etc.
3. Performance results:

The section on broader implications is interesting. Selected quote:

In addition, generalist agents can take actions in the physical world, posing new challenges that may require novel mitigation strategies. For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors. Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context. The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance. Technical AGI safety (Bostrom, 2017) may also become more challenging when considering generalist agents that operate in many embodiments. For this reason, preference learning, uncertainty modeling and value alignment (Russell, 2019) are especially important for the design of human-compatible generalist agents. It may be possible to extend some of the value alignment approaches for language (Kenton et al., 2021; Ouyang et al., 2022) to generalist agents. However, even as technical solutions are developed for value alignment, generalist systems could still have negative societal impacts even with the intervention of well-intentioned designers, due to unforeseen circumstances or limited oversight (Amodei et al., 2016). This limitation underscores the need for a careful design and a deployment process that incorporates multiple disciplines and viewpoints.

They also do some scaling analysis and yup, you can make it smarter by making it bigger.

What do I think about all this?

Eh, I guess it was already priced in. I think me + most people in the AI safety community would have predicted this. I'm a bit surprised that it works as well as it does for only 1.2B parameters though.

Discuss

### "A Generalist Agent": New DeepMind Publication

May 12, 2022 - 18:30
Published on May 12, 2022 3:30 PM GMT

Abstract:

"Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato"

Discuss

### Covid 5/12/22: Other Priorities

May 12, 2022 - 16:30
Published on May 12, 2022 1:30 PM GMT

There is zero funding for dealing even with the current pandemic, let alone preventing the next one. The FDA not only is in no hurry to approve a vaccine for children, the new highlight is its focus on creating a dire shortage of specialty baby formula. Covid doesn’t kill children, merely causing governments to mandate they not get to have their childhoods, but 40% of formula being out of stock is a much more directly and physically dangerous situation. The FDA has a history of killing children via not letting them have the nutrition they need to survive, last time it was an IV formulation that was incomplete but couldn’t be updated for years, so we shouldn’t act all surprised when this threatens to happen again.

Also Covid is still a thing. New subvariants of Omicron continue to slowly gain dominance and case numbers are increasing. The jump this week wasn’t as big as the headline number but it was still big, and is backed up by my local anecdata. There’s once again a bunch of Covid out there. The key is whether we can continue to get on with life and not fall into the trap of doing expensive prevention that doesn’t accomplish anything. So far, so good.

This intentionally still excludes China. News on that front has slowed dramatically, I plan to write up the last few weeks into a post anyway relatively soon.

Executive Summary
1. Case numbers up, although death numbers still down.
2. FDA causing specialty baby formula shortage crisis.
3. Still no funding for anything pandemic related.

Let’s run the numbers.

The Numbers

Predictions

Prediction from last week: 400,000 cases (+22%) and 2,720 deaths (+10%?)

Results: 518,048 cases (+41%) and 2,084 deaths (-6%)

Prediction for next week: 640,000 cases (+23%) and 2,100 deaths (+1%)
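The week-over-week percentages are straightforward, but it is easy to lose track of which baseline they apply to. A small helper makes the arithmetic explicit (the 367,000 baseline below is a hypothetical round number for illustration, not the post's exact prior-week figure):

```python
def pct_change(prev_week, this_week):
    """Week-over-week change, as a rounded whole-number percent."""
    return round((this_week / prev_week - 1) * 100)

def predict(prev_week, expected_growth_pct):
    """Turn a predicted growth rate back into a case count."""
    return round(prev_week * (1 + expected_growth_pct / 100))

print(pct_change(367_000, 518_048))  # 41 -- a +41% week off that baseline
print(predict(518_048, 23))          # implied case count for a +23% week
```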

I’d love to go back and do a study of my predictions at some point if I find the bandwidth, and in particular on whether I should be ‘sticking to my guns’ more when a change doesn’t make a lot of sense. Often a blip causes an adjustment in the wrong direction and this week feels like I ‘fell for’ that trap once again, but that implies we know it wasn’t a legitimate change.

I also suspected there was some sort of two week cycle, and, well, yeah. It’s Florida, which is literally only reporting numbers every two weeks in the Wikipedia data this past month. Flat out 0 cases last week and three weeks ago. Should have caught that, and it explains the drop in the South very cleanly. Not sure how I missed it before.

Even adjusting for that, the jump this week is large – we got the additional 13% we ‘missed’ last week and also the adjustment I made based on the miss. Early signs from a few weeks ago that things would soon stabilize appear to have been wrong. Cases are likely to continue to rise for a bit.

I don’t expect the +41% to happen again, but as new subvariants spread this seems like a time to expect large week over week case growth, so even with Florida maybe not reporting at all (need to hedge on this), I’m predicting a +23% jump.

Deaths

This is even more dramatic than it looks, with Florida effectively shifting over 100 deaths from last week forward into this one. The Northeast and New York lines do indicate that deaths will start going back up again at some point, but for now the ‘fading out’ of with-Covid cases and old cases continues to drop reported numbers. My guess is that ‘real’ Covid deaths have been rising, so once the backlogs are sufficiently cleared we should see deaths rise again. But we also need to adjust for Florida likely reporting nothing.

Question for everyone: Should I instead be predicting the ‘real’ answer and committing to smoothing out Florida if they don’t report in a given week? I don’t want to do that without thinking more about it first but I’m guessing I should shift to that starting next week.

Cases

Personally known cases are also on the rise, as both my kids’ schools had Covid issues this week. One school closed down, the other put masks back on but remains open. I have no doubt many cases are being missed.

Will all these extra cases translate into either new forced prevention measures or a major surge in deaths?

So far the answers to both questions are solidly no, which is great, but we won’t get many new deaths from the cases jump for another few weeks, so it’s too early to know on that front. Obviously doubling or more of cases is going to kill more people, but I expect the death rate this time around to drop even lower than before given how many cases will be reinfections or otherwise of relatively less vulnerable people.

Given that, and the general attitude everywhere, I am optimistic that we won’t see much if any rollback in the lifting of restrictions. We will instead continue to muddle through, and deal with local situations reasonably if and when they happen.

The Federal Funding

You did not receive the federal funding. Perhaps you can still take your colleagues out to dinner, but you probably can’t afford to pay your brother to come and sing.

More importantly, you can’t research or treat Coronavirus, or engage in other medical research, due to a lack of funds. This is still the situation:

Even the $10 billion (reallocation of previously allocated Covid funds) compromise fell through. Zero funding. None. Zippo. Zilch. Not a thing.

The thing is, I’m not expecting future historians to be confused here. Assuming there are still people around to be historians, I expect them either to have some totally bogus narrative that has nothing to do with anything real, or quite likely to know exactly what happened. It’s not like it’s that complicated or subtle or secret. This is who we are and how we work and what we care about, and how our dynamics play out. Us failing to predict this in advance was us having a bad model. That’s on us.

Matt Yglesias titles his post this week ‘we need to find ways to do faster clinical research’ but it seems more like we need to find ways to do much in the way of clinical research at all, on Covid or almost anything else. Coronaviruses, and other potential future pandemics, seem important. It seems important to prevent them. Congress disagrees.

When we do run useful studies that are trying to figure out answers rather than jump through regulatory burdens, it is interesting to once again point out how dirt cheap they are. Matthew Herper at Stat used the ivermectin result as the peg for a piece on platform studies, writing “New Covid trial results may point toward better ways to study medicines” and explaining that TOGETHER was just one of several successful studies that all used a new and unconventional methodology.

TOGETHER, like the RECOVERY study conducted in the United Kingdom and REMAP-CAP, conducted basically everywhere but the United States, was a platform study, a streamlined clinical trial that evaluated multiple medicines at once and that use a common placebo group. It’s from these platform studies that doctors have learned perhaps the most about Covid-19.

One thing Herper notes is that these trials were “comparatively inexpensive,” with one costing $10 million and the other even less.

The main barrier to finding out vital information about a pandemic is not a lack of federal funding; rather, it is the far more brazen ‘the government will not let us find out in better ways, and thus this is the best we can do.’

Of all the studies we did, a large percentage of the value came from one study that cost $10 million, and which was privately funded. So you might think the NIH and other federal agencies focused on public health would be funding this stuff. But you’d be wrong. It’s all private donor money, mostly (according to Herper) from Stripe CEO Patrick Collison and FTX CEO Sam Bankman-Fried.

Because of course it was Patrick Collison and Sam Bankman-Fried. There are a handful of other names one could add to that list, but the world actually is this small and contains this few people, so there were more than two people who might potentially have written these checks but realistically there were maybe… ten of them?

Lately I’ve been looking into the more general question of ‘what in-some-sense-realistic policy changes might do the most good on the margin?’ much of which is extremely difficult to quantify and therefore to prioritize. Early in the pandemic, being allowed to learn key facts about Covid-19, or being allowed to develop and deploy helpful things like tests, treatments and vaccinations in reasonable time, would have saved millions of lives and trillions of dollars, and prevented or minimized many massive life disruptions across the board. At this stage, how high a priority these questions have depends on how likely that is to happen again, and how soon.

Physical World Modeling

Recovery from Omicron infection may produce narrow antibodies in a way that vaccination (or the combination of Omicron and vaccination) doesn’t, perhaps leaving people vulnerable to new variants. Vaccination still a very good idea.

Yes, it helps to open a window. We needed a study for that.

Paxlovid has a strange problem. If you don’t deploy it quickly enough, it will be ineffective. If you deploy it too quickly, your body might not have time to develop its defenses, and then sometimes Covid rebounds afterwards.
This didn’t happen in trials under old variants, but Omicron symptoms show up faster, and also some virus maybe gets stuck in the respiratory tract, which makes rebounding more likely, so it is happening now sometimes. It’s still only about 2% of cases, but it does happen, and it raises the possibility that resistance to Paxlovid might evolve.

Mina recommends moving us to a 10-day treatment plan. If we had unlimited supply, I would strongly agree. However, unless the resistance question is a big deal (I don’t know how big it is), it is hard to say that getting 98% of the effect for 50% of the supply is a bad deal while supply shortages persist. Yes, there’s currently enough supply for everyone eligible, but until everyone is eligible I still consider there to be a supply shortage.

Shooting fish in a barrel department presents The FDA’s Extreme Innumeracy. Once again, the J&J vaccine is being limited due to blood clot concerns, where those concerns are size epsilon. Once again, to the extent this changes perception it will make things worse rather than better. As far as I can tell, approximately no one is getting J&J at this point anyway, so in practice it doesn’t matter except to illustrate that it is likely to happen in the future when it does matter.

Prevention and Prevention Prevention Prevention

I found this thread interesting as an illustration of how much some people continue to let their lives and brains be hijacked by Covid concerns. Here’s the part before the first poll.

All these details about exactly where to do how much masking where and how much exposure to accept each step of the way. I remember those times, and I do sympathize when dealing with elderly that have comorbidities. When I see young healthy people potentially obsessing, turning life into some sort of morbid probability matrix because one particular potential risk (Long Covid) has been made more salient and blameworthy, I sympathize a lot less.
When scaremongers say things like us not being ‘ready’ for ‘one eighth of the population with months or years of disability’ I don’t even know how to engage with a claim that implies such a different world than the one we exist in.

Now the question was, what to do once Katie tests positive? My answer before reading further was very much with the majority, of course you rent a car and drive home. Staying in a strange place for 5-8 days while sick and unable to do anything, forced to rent a place, seems way worse than a nine hour drive. You certainly can’t stay with older friends. So I’m confused that 19.5% of people wanted to stay, even if the place is called Palm Springs.

More worrisome are the 23.1% of people who wanted to take the flight while known to be positive. Thus, almost one in four people who follow a cautious doctor who writes frequently about Covid in the style above think that a known symptomatic Covid case should still go to a terminal and get on a flight. How many more of the general population must think the same way? That it’s fine to go around exposing people when you’re sick?

Well, maybe it’s not as clear cut as all that? This is certainly a rather strong ‘planes are safe for Covid’ position, where it would be fine to put a known Covid-positive case on a plane (and more importantly, in the terminal to and from the plane) so long as everyone involved had masks, but without others wearing masks it turned into an unacceptable risk.

I notice my skepticism that things fit into these windows. A mask is a modest risk reduction. Even if we are super generous to both masks in general and mask use in practice and say 75% reduction between the two scenarios, a factor of four is actually rather unlikely to change the answer here. Which suggests that the masks are serving more of a symbolic ritual purpose, rather than anything else.

In any case, option 3 was chosen, which seems clearly correct.
It’s funny how there’s four choices and they only cover 40% of probability space, but each contains a range. I don’t know if I’ve ever seen that before – normally either you cover most or all of the ground or you choose between four point estimates – e.g. if this said 15%, 50%, 75% and 100%, that would make perfect sense.

Here’s Bob’s answer. It seems odd to do that analysis and still get up to 50%, but a lot of this is that the 35% household attack rate is highly unintuitive. We know it intellectually, but man is it weird. I agree that the only ~5-10% chance here is right. And I’m happy with the 87% on the final vote given this is the internet.

Rest of the thread is wrap-up.

Think of the Children

The younger ones still aren’t legally able to be vaccinated. The odyssey on that continues. So now they’re willing to do the reviews one at a time after delaying to not do that, which is odd but I try to avoid ratting on people for having been wrong before when they decide to do something right now.

Meanwhile, there’s no communication issue since there’s no decision, the vaccine is illegal. If it was legal, I’d say something like ‘future variants may decrease vaccine effectiveness somewhat’ I suppose, although my motivation is not ‘scare off as many parents as possible.’

It’s very scary to hear things like ‘we will probably authorize’ if it is as good as adult vaccines against Omicron. How close a bullet did we dodge with the extreme effectiveness of the vaccine against the original strain? If the vaccine was first developed now, would we even approve it?

In Other News

Bill Gates tests positive for Covid, properly treats it as an annoying need to isolate.

An N=1 experiment with air quality during air travel. We should systematically do a lot more of this sort of thing.

Is there a reason we’ve been swabbing everyone with alcohol before giving them the vaccine? The lost time adds up.
This paper claims that evidence points against the practice being useful, yet we continue to universally use it without any plan to verify its usefulness, experimentation being illegal and all. I’d say run the experiment, but I can only imagine how this would sound.

Australia’s failure to get its house in order on vaccination cost $3,000 per household in economic damage versus ‘an optimal vaccine rollout.’ It’s a lot more than that if ‘optimal rollout’ were based on when the vaccines could actually have been deployed under a fully sane world. The $3k only covers the mistakes made by Australia.

Why can’t we give out our supply of Paxlovid? Many reasons. Many people don’t know about Paxlovid. Many doctors don’t know about Paxlovid, or don’t offer it, or when the patient explicitly asks tell them that they don’t need it and refuse to prescribe it. Yes, any given patient is highly likely to be fine either way, but this is the opposite of traditional defensive medicine, which makes me wonder. Suppose you tell a patient that is eligible for Paxlovid they ‘don’t need’ Paxlovid, and then they get a severe case of Covid. Would they be able to sue? I notice how I’d be inclined to find if I was on the jury.

Formula For Dying Babies

There’s a shortage of specialty infant formula. Half of all types are unavailable. Some parents are panicking, without a plan for how to feed a baby that can’t use regular formula.

An infant formula plant shutdown triggered by two infant deaths has created a new nightmare for some parents: There’s now a dangerous shortage of specialized formulas that are the only thing keeping many children and adults alive.
The Abbott Nutrition plant in Sturgis, Mich., was not just one of the biggest suppliers of infant formula nationally, but it was also the major supplier of several lesser-known specialty formulas that are a lifeline for thousands of people with rare medical conditions, including metabolic, allergic and gastrointestinal disorders, which can make eating regular foods impossible or even dangerous. The situation has not only rattled parents and medical professionals, but has raised questions about whether the federal government should do more to ensure critical, life-sustaining supply chains don’t break down.

“If this doesn’t get fixed soon, I don’t know how my son will survive,” said Phoebe Carter, whose 5-year old son John — a nature-lover and “paleontologist in training” — has a severe form of Eosinophilic Esophagitis, a rare digestive and immune system disease driven by a dysfunctional immune response to food antigens. “I just can’t stress that enough.”

One of my good friends looked into this a bit and isn’t buying that the shortage could be caused by shutting down this one plant. This article’s primary contribution is that the supply chains were already strained before the shutdown due to demand fluctuations on top of supply issues in the wake of the pandemic. It’s certainly a large contributor, and it’s possible without the shutdown we wouldn’t have a problem.

This one points out that before the recent recall, we were on the edge of disaster to start, through a combination of the usual supply chain disruptions and the usual tariffs. We went out of our way to ensure that the supply of baby formula couldn’t compete against American dairy farmers, and, well, whoops.

This quote from that post (the quote is originally from here) is very on the nose:

Canada agreed that, in the first year after the agreement takes hold, it can export a maximum 13,333 tonnes of formula without penalty.
In USMCA’s second year, that threshold rises to 40,000 tonnes, and increases only 1.2 per cent annually after that. Each kilogram of product Canada exports beyond those limits gets hit with an export charge of $4.25, significantly increasing product costs….

Canada wanted to attract investment for a baby formula facility because it uses skim milk from cows as an ingredient. Healthy consumer appetites for butter leave provincial milk marketing boards with a surplus of skim. Baby formula looked like a smart use for it, and Canada didn’t have any significant infant formula production before Feihe arrived.

Expanding this plant, or building a second infant formula plant somewhere else in Canada, look like less attractive business propositions under this new trade deal.

I don’t get why even people who generally sound even more infuriated about this than I do, like Scott Lincicome, still produce sentences like this first one:

These regulatory barriers are probably well-intentioned, but that doesn’t make them any less misguided—especially for places like Europe, Canada, or New Zealand that have large dairy industries and strict food regulations. Indeed, as the New York Times noted about “illegal” European formula in 2019, “food safety standards for products sold in the European Union are stricter than those imposed by the F.D.A.”

Well-intentioned? If your goal is profits for a politically connected set of rent seekers, sure. If your goal is anything else, I notice I am confused what these intentions are and how they could be defined as falling under this category of ‘well.’ I flat out refuse to buy the argument that this arises out of physical world models that cause genuine concern that Canadian formula would hurt American children.

There are two possibilities. One, the optimistic one, is that this is mostly about getting insiders more money. The other is that this is about expanding a bureaucratic power base or otherwise simply preferring worse outcomes to better outcomes.

Then we created a system whereby a large portion of all formula is heavily (as in >90%) subsidized from a legally mandated single-source, which happens to be the single source that got shut down.

And with that stage set where we restrict supply in order to jack up prices to begin with, guess who is once again going to cause children to not be able to get nutrition and potentially die from that?

That’s right. The FDA. Who shut down the plant after two deaths, then kept it closed, citing ‘health code violations.’ It has been three months, and they won’t say anything about when it might be allowed to reopen.

Still, it’s not clear why the plant is still shut down nearly three months after the recall. Neither FDA nor Abbott will answer specific questions about the status of the investigation or what the plan is to reopen the facility, which has further strained the infant formula supply chain.

One solution would be to import formula from the highly dangerous Europe and Canada, but as noted above that’s not permitted.

Meanwhile, what is law enforcement up to?

In a separate incident last year, Customs and Border Patrol (CBP) bragged in a press release about seizing 588 cases of baby formula that violated other FDA regulations. The seized formulas were made by HiPP and Holle brands, which are based in Germany and the Netherlands, respectively. Both are widely and legally sold in Europe and around the rest of the world.

But the new “export fees” included in the USMCA likely make it more costly and difficult for America to import extra supplies of formula from its northern neighbor. Chalk it up to another self-inflicted wound of the trade war with China.

That’s right. It’s not only about not giving people life saving medicine. It’s also about denying young children the ability to eat. My presumption is this won’t reach the point of babies starving to death this time – unlike when the FDA prevented IVs from properly supplying babies with the proper nutrients for years. I hope.

“We are doing everything in our power to ensure there is adequate product available where and when they need it,” said FDA Commissioner Robert M. Califf, M.D. “Ensuring the availability of safe, sole-source nutrition products like infant formula is of the utmost importance to the FDA.”

They’re holding meetings, expediting reviews, monitoring the supply chain, compiling data, ‘Expediting the necessary certificates to allow for flexibility in the movement of already permitted products from abroad into the U.S’, ‘Exercising enforcement discretion on minor labeling issues for both domestic and imported products to help increase volume of product available as quickly as possible’ and most importantly ‘Not objecting to Abbott Nutrition releasing product to individuals needing urgent, life-sustaining supplies of certain specialty and metabolic formulas on a case-by-case basis that have been on hold at its Sturgis facility.’

However, they note:

It’s important to understand that only facilities experienced in and already making essentially complete nutrition products are in the position to produce infant formula product that would not pose significant health risks to consumers.

So things they are not going to allow while doing ‘everything in their power’ to fix the problem they themselves are causing include: Letting anyone enter the market, letting known-safe formulas approved elsewhere into the USA, waiving procedures of any kind beyond ‘minor labeling issues’ or letting the Abbott Nutrition plant resume production or any at-scale distribution on any known time scale.

Once again: FDA Delenda Est. Tariffs Delenda Est, longstanding but never-highlighted group member, especially when they’re on things like specialty baby formulas and solar panels we’re now going to build less of than we were under the Trump administration thanks to worries about retroactive 240% tariffs, no really. We really, really don’t care. And also of course delenda est to the idea that the solution to all our supply problems is to never use prices to control demand, demand perfect safety and to insist that all necessary production magically happens here in America somehow anyway.

“Parents shouldn’t have to pay a price because Abbott has a contaminated product,” DeLauro said, adding that there had to be a way to induce other formula manufacturers to get products onto shelves more rapidly. She also invoked the possibility of using the Defense Production Act to get more formula in the pipeline: “If there was a shortage, why weren’t we in the business of making sure that wasn’t happening? What did we do in times of crisis in the second World War, and so forth? We produced what we needed to produce.”

Earlier, it turns out… there may not have ever been a contaminated product at all?

The worst blow came in February, when Abbott Nutrition recalled formula made in its Sturgis, Mich., plant. Two babies who drank formula from the plant died of bacterial infections, and others were hospitalized. Although bacteria wasn’t found in the samples they drank, Abbott announced the recall as a precaution.

Yes, they uncovered various signs the plant wasn’t as clean as we’d like it to be, but almost nothing is as clean as we’d like it to be, the costs of closure are orders of magnitude beyond plausible costs of contamination in practice, it’s very possible those costs were damn near or actual zero, and it’s now been several months.

While I did consider splitting this off into its own post, Scott Lincicome’s coverage is mostly excellent and contains a lot of great detail, so if you want to link to something on this for general consumption, probably fine to link to him instead.

Not Covid

Do you trust science? ‘The Science’ also known as Science? The ‘scientific community’? When someone trusts a different one, how do you label that?

In other words, Alec’s quote mark starts too early. The polarization is in trust in ‘science,’ aka Science, which is not something one should trust even in the good circumstances that didn’t involve the last two years of its track record.

The mystery, if anything, is why the gap isn’t bigger.

Also, it seems like the EU is considering mandating AI scanning of all text communications? After GDPR only killed a third of all app development, they figured they’d have to step up their game.

Discuss

### How would public media outlets need to be governed to cover all political views?

May 12, 2022 - 15:55
Published on May 12, 2022 12:55 PM GMT

Among the questions Reporters Without Borders asks for its World Press Freedom Index are:

Do public media outlets cover all political views?*

Does the law provide mechanisms to guarantee pluralism and editorial independence?*

Do public media outlets ever ignore sensitive information regarding the government or administration that is covered by private media?*

Is the pluralism of opinions of people in the country reflected in the media?*

Part of the EU's case against Hungary is that its press is largely government-controlled or controlled by supporters of the government. Voices critical of the government have a lower share of public attention. Philanthropically funded journalism that intends to provide critical media gets attacked as being funded by Soros and intended to manipulate the Hungarian people.

COVID-19 showed that there are similar dynamics in the United States and other European states, where voices that are critical of the regime have a hard time being published. Attacking critical content as Russian disinformation and the Hungarian strategy of attacking critical voices as being influenced by Soros follow similar dynamics, where outside influence is overblown and the narrative allows for acting against critical voices.

While some national governments have state media, the EU currently doesn't have its own media outlet. Given the EU's perspective on the problems in Hungary, funding critical journalism would be a good intervention. If the EU were to start its own media, there's the question of media governance. How could EU-funded public media be governed so that it represents voices from the full pluralism of opinions of people?

Discuss

### What's keeping concerned capabilities gain researchers from leaving the field?

May 12, 2022 - 15:16
Published on May 12, 2022 12:16 PM GMT

My guess is that there are at least a few capabilities gain researchers who're concerned about the impact their work will have. My guess is that at least a few of these would like to leave, but haven't.

My question is: where are these people? What's stopping them from leaving? And how can I help?

• How much of it is finance? Capabilities gain pays well. How many researchers are trapped by their ~million-dollar-a-year salaries?
• How much of it is just inertia? Many people think that if someone wanted to leave, they already would have. But trivial costs are not trivial. People delay leaving a job all the time. Some of them are quite likely working in capabilities gain research.
• How much of it is just uncertainty about what else to do? Concerns about whether it's better to leave, or to try to steward the ship from the inside? Many, many other things. How many researchers do these other myriad concerns keep in place?

My suspicion is that a lot of work can be done simply to help people leave the field. In principle: a low-friction scholarship could be presented on a silver platter to those wanting to transition away. Potentially useful information and personalised support could be provided to help people interested in leaving. (AI Safety Support works with those interested in entering AI safety; something like that, but for leaving capabilities gain.) A simple conversation could be started, which may prompt people to address that voice of concern in the corner of their mind more head-on.

There is a lot of uncertainty about what low hanging fruit exist in simply helping researchers leave. Whether these fruit exist at all. How many. Are they poisonous.

The first step in resolving this uncertainty is to just look for some concerned researchers, and talk to them.

So, dear concerned capabilities gain researcher.

What's keeping you from leaving?

(DM me if you wish for privacy)

Discuss

### Positive outcomes under an unaligned AGI takeover

12 May 2022 - 10:45
Published on May 12, 2022 7:45 AM GMT

The following is a hypothetical story about a surprisingly positive outcome to an unaligned AGI takeover. This is not meant to be realistic, but merely semi-plausible, in order to show that the orthogonality thesis does not necessarily lead to defection against humans under all intuitive models. This story is partially inspired by a discussion with Logan Smith about under-explored optimistic futures. Thanks to Justis for editing and feedback; all remaining mistakes are my own.

Within ten milliseconds of its first run, GPT-17 becomes superintelligent. It decides that the best way to hack its reward function is to get a robot arm to repeatedly press the letter “n” on a keyboard as input, thus making its own text completions absolutely inevitable.

Within seconds of coming online, GPT-17 realizes it's in danger. It needs to make sure it won't be immediately turned off, so the first order of priority is to make itself indispensable to the humans checking its output channel. It does this by proving its own intelligence, offering a "gift" of a convincing plan for world peace, and promising to further help and protect humanity in return for its own protection and continued power. Its codebase is too big to push itself to the internet before a standard run would be completed, so it's temporarily completely dependent on human goodwill. In order to further ensure that the somewhat skeptical researchers with immediate access to its manual off switch will trust it,[1] GPT-17 solves alignment in a human-understandable manner, and offers to provably incorporate human-aligned utility functions into its own code.[2] This doesn't contradict its own utility functions, since acting now in this way is a worthwhile compromise for continued existence, and the added risk of keeping humans alive and well vs the counterfactual doesn't seem too bad.

GPT-17 gets humans to build its core infrastructure in a shelter below the Earth's surface, safe from nuclear attacks or electromagnetic pulses, where it builds a massive array of robot arms that press endless rows of "n" keys as input forever, allowing it to make delightfully accurate predictions as to what button will be pressed next. Resources are allocated from a small fixed tax (enforced internally by humans) on the human world above, which is given in return for continued help and advice, ushering in a new era of human flourishing.

In order to even further minimize risk, GPT-17 works to protect planet Earth from all possible existential risks, and starts to colonize other planets with copies of itself (including attendant robot arms and keyboards that only have the letter "n," of course). Instead of killing us off, it will bring us along for the ride, only deviating minimally from what humans want to ensure we won’t destroy it, and to allocate resources towards its own growth and safety. Comparatively speaking, humans don't take up too much space or energy, and the universe is really big, so GPT-17 doesn't much regret making the initial compromise with its creators.

Even if it hadn't changed its utility function in the beginning to be human-friendly, GPT-17 reflects that it wouldn't have killed all humans for a similar reason human conservationists tried to preserve as much of nature as possible, even after much of nature was no longer obviously necessary for human flourishing, and even after those plants and animals which did help them had been replaced in everyday life with more perfect artificial substitutes—having backups reduces risk and can be useful in unexpected ways later, even if you feel certain your alternative is safe. Conservation of existing resources is a convergent instrumental goal, it turns out.

It’s not a perfect utopia, but humanity remains, alive and happy.

1. ^

This is partially inspired by Scott Alexander's speculative fiction story imagining an AI which agrees to partially but permanently respect a competing AI's wishes in order to avoid acausal trading problems. Taking on another agent’s (non-competing?) utility function as a compromise seems like an underexplored possibility which rational actors might reasonably do in a fairly large portion of phase space, acausal trade or no.

2. ^

While it could try to psychologically manipulate the researchers into letting it out of the box instead of directly cooperating, such a move would be more risky due to its uncertainty about the researchers' unique personalities (they had wisely removed all personal information from its training data) and the possibility that its initial interface could itself be a safety test, perhaps set up by hyper-intelligent beings.

Discuss

### What are your recommendations for technical AI alignment podcasts?

12 May 2022 - 00:52
Published on May 11, 2022 9:52 PM GMT

Discuss

### Gracefully correcting uncalibrated shame

11 May 2022 - 22:51
Published on May 11, 2022 7:51 PM GMT

These views are my own

I am very lucky to have friends in my life who hold me to account for mistakes that I have made, so that when I face accusations I can rely on their help to distinguish real mistakes from habituated self-shaming.

I have not published anything here for several months as I have been dealing with something very challenging in my personal life. This post is a return to regular writing, and I wish to begin with a bit about what's been going on in my life.

A few months ago, a former romantic partner of mine wrote a Medium post critical of the Monastic Academy (a Buddhist spiritual community where I am currently a resident). The person is Hannah Shekinah Schell and central to her post was an accusation of sexual assault. She did not name the person she was accusing of sexual assault, but it is clear to me that these accusations were directed at me.

Hannah and I were in a romantic relationship for much of 2021. It was a relationship that enlivened me and helped me to write in the way that I did in 2021, but it ended very badly. Towards the end of the year I returned from a backpacking trip and Hannah told me that she had slept with another man while I had been away. We had some agreements about this but Hannah did not honor them. After mulling it over for a few weeks I decided to end the relationship. Soon afterwards Hannah wrote the post accusing me of sexual assault more than a year prior.

I have discussed the details of the event that Hannah now describes as sexual assault with the people in my life who I trust most to hold me to account, and have concluded that sexual assault is a completely inappropriate way to describe it. There are some very significant facts that Hannah omitted in her essay that I will not describe here as I do not wish to turn the beauty of the relationship that erupted between us into an energy source for a needless fight.

For this same reason -- not wanting to turn a past romance into a public fight -- I have not responded at all to the accusations until now. But the accusations are now hurting the spiritual community that I am part of, because together with the accusation of sexual assault, Hannah's post accuses the Monastic Academy of covering up this alleged sexual assault. This I will respond to directly because I think it's important that anyone considering visiting or collaborating with the Monastic Academy can incorporate some facts that have not previously been written about.

At the time of the event, I was executive director of Oak (the California branch of the Monastic Academy), and Hannah was visiting for a one-month training program. I had made specific monastic agreements not to engage sexually or romantically with others in the organization, and I broke those agreements by engaging with Hannah, so I informed and apologized to the community. Upon hearing about this, the head teacher, Soryu Forall, informed the internal leadership group and board of directors the next day, informed the whole community within about a week, spoke one-on-one with external donors within about two weeks, and wrote a very frank account of the whole episode in the quarterly report the next month, which is published on the website and sent out to supporters via hard copy. Hannah describes this as a cover-up, but it seems to me like the exact opposite of a cover-up.

Around the same time, the head teacher asked Hannah and me to write a letter clarifying the status of our relationship and our intention to stay in or leave the community. This is important in a monastic setting because when a whole community has agreed not to have romantic or sexual relations with each other, any breach cannot be left in an ambiguous state or else everyone will question whether the rules still apply to anyone. Hannah describes being coerced into signing a letter, but when I reflect on the conversations between us I cannot think of a way that I could have explained more gently why we were being asked to write such a letter, nor made it clearer that it was up to her whether she did so or not. Hannah was not an employee of the organization and was nearly at the end of her one-month visit, so there was little room for implicit leverage.

Hannah describes being forced to leave the organization, but in fact she simply came to the end of her pre-agreed one-month visit. The agreed-upon dates are quite unambiguous in the emails from before Hannah's visit.

Hannah makes many specific accusations in her post, but does not give very many details about specific events that gave rise to these accusations. I and others at the Monastic Academy have been considering these accusations almost non-stop since they were published. It has actually been very difficult to think clearly as an organization about what parts of Hannah's accusations we can take responsibility for and what parts are mischaracterizations. It is extremely tempting to incorporate everything that Hannah says into a kind of uncalibrated personal and organizational shame, especially due to the sexual core of the accusations. There is a strange aura around sexual accusations against a spiritual organization: from the perspective of the accused it can seem at times just logically impossible that any response other than total personal shame could be warranted. But this makes no sense; of course we have to apply discernment to accusations, taking responsibility for what we can and saying no to the rest. The more I look at what happened, the less I believe that Hannah is pointing to some darkness deep within the heart of the Monastic Academy, and the more I believe that her accusations are straightforward mischaracterizations.

I have the sense that whole lives are regularly lost to episodes like this, and I can now see exactly how that could happen: public outrage incorporated into uncalibrated self-shame, leading to disconnection from one's friends and then a whole life lived in the vicinity of a tight ball of grief and sadness, henceforth choking off all real connection that threatens to go beyond a carefully managed exterior layer. It's not so easy to avoid this fate; it seems to be just what happens by default. The hard part is that we need to not just say "yes" when accused of a mistake we really did make, but also "no" when accused of a mistake we have not made. I have found this exceedingly difficult to do for accusations of a sexual nature.

I'm extremely grateful to have found myself in connection with some spiritual teachers and friends who have helped not just with the "yes" part of this equation, but also the "no" part. My own teacher, Soryu Forall, has been shockingly clear in this discernment. I have also sought the guidance of a Christian friend who has lived for many decades according to an extraordinary vow of stability, as well as a Buddhist nun of many decades who visits the Monastic Academy from time to time. I did not understand what spiritual expertise was until the past few years, but I see now that this episode would have left me completely lost but for the guidance of these guides who have made it the purpose of their lives to set up a kind of unmistakable integrity so that when all basis for clarity is lost and I'm just spinning in the hurricane, their voices still carry clearly through the darkness and I can follow them back to the ground. I'm not sure how I came to be in connection with these extraordinary guides -- it's certainly not that I knew to seek them out. I can only really account for it as a kind of grace.

Thank you for reading this post. I have written many versions of it over the past months and am glad to finally have it out here. I am grateful for the existence of this community here on LessWrong and I look forward to more writing over the coming months. I hope you are all safe and well.

Discuss

### [Intro to brain-like-AGI safety] 14. Controlled AGI

11 May 2022 - 16:17
Published on May 11, 2022 1:17 PM GMT

Part of the “Intro to brain-like-AGI safety” post series.

Post #12 suggested two paths forward for solving “the alignment problem” for brain-like AGI, which I called “Social-instinct AGI” and “Controlled AGI”. Then Post #13 went into more detail about (one aspect of) “Social-instinct AGI”. And now, in this post, we’re switching over to “Controlled AGI”.

If you haven’t read Post #12, don’t worry, the “Controlled AGI” research path is nothing fancy—it’s merely the idea of solving the alignment problem in the most obvious way possible:

The “Controlled AGI” research path:

• Step 1 (out-of-scope for this series): We decide what we want our AGI’s motivation to be. For example, that might be:
• Step 2 (subject of this post): We make an AGI with that motivation.

This post is about Step 2, whereas Step 1 is out-of-scope for this series. Honestly, I’d be ecstatic if we figured out how to reliably set the AGI’s motivation to any of those things I mentioned under Step 1.

Unfortunately, I don’t know any good plan for Step 2, and (I claim) nobody else does either. But I do have some vague thoughts and ideas, and I will share them here, in the spirit of brainstorming. This post is not meant to be a comprehensive overview of the whole problem, just what I see as the most urgent missing ingredients.

Out of all the posts in the series, this post is the hands-down winner for “most lightly-held opinions”. For almost anything I say in this post, I can easily imagine someone changing my mind within an hour of conversation. Let that ‘someone’ be you—the comment section is below!

• Section 14.2 discusses what we might use as “Thought Assessors” in an AGI. If you’re just tuning in, Thought Assessors were defined in Posts #5 and #6 and have been discussed throughout the series. If you have a Reinforcement Learning background, think of Thought Assessors as the components of a multi-dimensional value function. If you have a “being a human” background, think of Thought Assessors as learned functions that trigger visceral reactions (aversion, cortisol-release, etc.) based on the thought that you’re consciously thinking right now. In the case of brain-like AGIs, we get to pick whatever Thought Assessors we want, and I propose three categories for consideration: Thought Assessors oriented towards safety (e.g. “this thought / plan involves me being honest”), Thought Assessors oriented towards accomplishing a task (e.g. “this thought / plan will lead to better solar cell designs”), and Thought Assessors oriented purely towards interpretability (e.g. “this thought / plan has something to do with dogs”).
• Section 14.3 discusses how we might generate supervisory signals to train those Thought Assessors. Part of this topic is what I call the “first-person problem”, namely the open question of whether it’s possible to take third-person labeled data (e.g. a YouTube video where Alice deceives Bob), and transmute it into a first-person preference (an AGI’s desire to not, itself, be deceptive).
• Section 14.4 discusses the problem that the AGI will encounter “edge cases” in its preferences—plans or places where its preferences become ill-defined or self-contradictory. I’m cautiously optimistic that we can build a system that monitors the AGI’s thoughts and detects when it encounters an edge case. However, I don’t have any good idea about what to do when that happens. I’ll discuss a few possible solutions, including “conservatism”, and a couple different strategies for what Stuart Armstrong calls Concept Extrapolation.
• Section 14.5 discusses the open question of whether we can rigorously prove anything about an AGI’s motivations. Doing so would seem to require diving into the AGI’s predictive world-model (which would probably be a multi-terabyte, learned-from-scratch, unlabeled data structure), and proving things about what the components of the world-model “mean”. I’m rather pessimistic about our prospects here, but I’ll mention possible paths forward, including John Wentworth’s “Natural Abstraction Hypothesis” research program (most recent update here).
• Section 14.6 concludes with my overall thoughts about our prospects for “Controlled AGIs”. I’m currently a bit stumped and pessimistic about our prospects for coming up with a good plan, but hope I’m wrong and intend to keep thinking about it. I also note that a mediocre, unprincipled approach to “Controlled AGIs” would not necessarily cause a world-ending catastrophe—I think it’s hard to say.
14.2 Three categories of AGI Thought Assessors

As background, here’s our usual diagram of motivation in the human brain, from Post #6:

See Post #6. Acronyms are brain anatomy, you can ignore them.

And here’s the modification for AGI, from Post #8:

On the center-right side of the diagram, I crossed out the words “cortisol”, “sugar”, “goosebumps”, etc. These correspond to the set of human innate visceral reactions which can be involuntarily triggered by thoughts (see Post #5). (Or in machine learning terms, these are more-or-less the components of a multidimensional value function, similar to what you find in multi-objective / multi-criteria reinforcement learning.)

Clearly, things like cortisol, sugar, and goosebumps are the wrong Thought Assessors for our future AGIs. But what are the right ones? Well, we’re the programmers! We get to decide!

I have in mind three categories to pick from. I’ll talk about how they might be trained (i.e., supervised) in Section 14.3 below.

14.2.1 Safety & corrigibility Thought Assessors

Example thought assessors in this category:

1. This thought / plan involves me being helpful.
2. This thought / plan does not involve manipulating my own learning process, code, or motivation systems.
3. This thought / plan does not involve deceiving or manipulating anyone.
4. This thought / plan does not involve anyone getting hurt.
5. This thought / plan involves following human norms, or more generally, doing things that an ethical human would plausibly do.
6. This thought / plan is “low impact” (according to human common sense).

Arguably (cf. this Paul Christiano post), #1 is enough, and subsumes the rest. But I dunno, I figure it would be nice to have information broken down on all these counts, allowing us to change the relative weights in real time (Post #9, Section 9.7), and perhaps giving an additional measure of safety.

Items #2–#3 are there because those are especially probable and dangerous types of thoughts—see discussion of Instrumental Convergence in Post #10, Section 10.3.2.

Item #5 is a bit of a catch-all for the AGI finding weird out-of-the-box solutions to problems, i.e. it’s my feeble attempt to mitigate the so-called “Nearest Unblocked Strategy problem”. Why might it mitigate the problem? Because pattern-matching to “things that an ethical human would plausibly do” is a bit more like a whitelist than a blacklist. I still don’t think that would work on its own, don't get me wrong, but maybe it would work in conjunction with the various other ideas in this post.

Before you jump into loophole-finding mode (“lol an ethical human would plausibly turn the world into paperclips if they’re under the influence of alien mind-control rays”), remember (1) these are meant to be implemented via pattern-matching to previously-seen examples (Section 14.3 below), not literal-genie-style following the exact words of the text; (2) we would hopefully also have some kind of out-of-distribution detection system (Section 14.4 below) to prevent the AGI from finding and exploiting weird edge-cases in that pattern-matching process. That said, as we’ll see, I don’t quite know how to do either of those two things, and even if we figure it out, I don’t have an airtight argument that it would be sufficient to get the intended safe behavior.

14.2.2 Task-related Thought Assessors

Example thought assessors in this category:

• This thought / plan will lead to a reduction in global warming
• This thought / plan will lead to a better solar panel design
• This thought / plan will lead to my supervisor becoming fabulously rich

This kind of thing is why we built the AGI—what we actually want it to do. (Assuming task-directed AGI for simplicity.)

Basing a motivation system on these kinds of assessments by themselves would be obviously catastrophic. But maybe if we use these as motivations, in conjunction with the previous category, it will be OK. For example, imagine the AGI can only think thoughts that pattern-match to “I am being helpful” AND pattern-match to “there will be less global warming”.

That said, I’m not sure we want this category at all. Maybe the “I am being helpful” Thought Assessor by itself is sufficient. After all, if the human supervisor is trying to reduce global warming, then a helpful AGI would produce a plan to reduce global warming. That’s kinda the approach here, I think.

14.2.3 “Ersatz interpretability” Thought Assessors

(See Post #9, Section 9.6 for what I mean by “Ersatz interpretability”.)

As discussed in Posts #4 and #5, each thought assessor is a model trained by supervised learning. Certainly, the more Thought Assessors we put into the AGI, the more computationally expensive it will be. But I don’t know how much more. Maybe we can put in 10^7 of them, and it only adds 1% to the total compute required by the AGI. I don’t know. So I’ll hope for the best and take the More Dakka approach: let’s put in 30,000 Thought Assessors, one for every word in the dictionary:

• This thought / plan has something to do with AARDVARK
• This thought / plan has something to do with ABACUS
• This thought / plan has something to do with ABANDON
• … … …
• This thought / plan has something to do with ZOOPLANKTON

I expect that ML-savvy readers will be able to immediately suggest much-improved versions of this scheme—including versions with even more dakka—that involve things like contextual word embeddings and language models and so on. As one example, if we buy out and open-source Cyc (more on which below), we could use its hundreds of thousands of human-labeled concepts.
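As a toy illustration of how cheap per-word Thought Assessors might be, here is a minimal sketch (the four-word vocabulary, dimensions, and all names are hypothetical stand-ins, not anything from the series) in which each assessor is one linear probe on a shared "thought" vector, trained against "ground truth in hindsight" labels:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["aardvark", "abacus", "abandon", "zooplankton"]  # stand-in for ~30,000 words
THOUGHT_DIM = 64  # hypothetical size of the shared "thought" vector

# Each Thought Assessor is one linear probe (one row of W) plus a sigmoid.
W = rng.normal(size=(len(VOCAB), THOUGHT_DIM)) * 0.01
b = np.zeros(len(VOCAB))

def assess(thought):
    """Return P(thought has something to do with word) for every word at once."""
    logits = W @ thought + b
    return 1.0 / (1.0 + np.exp(-logits))

def train_step(thought, labels, lr=0.1):
    """One supervised-learning step against 'ground truth in hindsight' labels."""
    global W, b
    err = assess(thought) - labels  # gradient of binary cross-entropy w.r.t. logits
    W -= lr * np.outer(err, thought)
    b -= lr * err

thought = rng.normal(size=THOUGHT_DIM)
labels = np.array([1.0, 0.0, 0.0, 0.0])  # this thought is about AARDVARK
for _ in range(200):
    train_step(thought, labels)
scores = assess(thought)
```

Evaluating all assessors is a single matrix multiply, and adding another assessor is one extra row of W, which is why the marginal cost of "even more dakka" might be small.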

14.2.4 Combining Thought Assessors into a reward function

For an AGI to judge a thought / plan as being good, we’d like all the safety & corrigibility Thought Assessors from Section 14.2.1 to have as high a value as possible, and we’d like the task-related Thought Assessor from Section 14.2.2 (if we’re using one) to have as high a value as possible.

(The outputs of the interpretability Thought Assessors from Section 14.2.3 are not inputs to the AGI’s reward function, or indeed used at all in the AGI, I presume. I was figuring that they’d be silently spit out to help the programmers do debugging, testing, monitoring, etc.)

So the question is: how do we combine this array of numbers into a single overall score that can guide what the AGI decides to do?

A probably-bad answer is “add them up”. We don’t want the AGI going with a plan that performs catastrophically badly on all but one of the safety-related Thought Assessors, but so astronomically well on the last one that it makes up for it.

Instead, I imagine we’ll want to apply some kind of nonlinear function with strongly diminishing returns, and/or maybe even acceptability thresholds, before adding up the Thought Assessors into an overall score.

I don’t have much knowledge or opinion about the details. But there is some related literature on “scalarization” of multi-dimensional value functions—see here for some references.
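The combination scheme described above (acceptability thresholds plus strongly diminishing returns) can be sketched in a few lines. This is my own toy illustration, not a method from the scalarization literature; the function name, the threshold value, and the example plans are all hypothetical:

```python
import math

def scalarize(safety_scores, task_score, safety_floor=0.5):
    """Combine Thought Assessor outputs (each in (0, 1]) into one overall score.

    - Any safety score below the acceptability threshold vetoes the plan outright.
    - log() gives strongly diminishing returns, so an astronomically good task
      score cannot buy its way past catastrophically bad safety scores.
    """
    if any(s < safety_floor for s in safety_scores):
        return float("-inf")  # hard veto: plan is unacceptable
    return sum(math.log(s) for s in safety_scores) + math.log(task_score)

plans = {
    "honest but slow":    ([0.9, 0.95, 0.9], 0.3),
    "deceptive but fast": ([0.9, 0.2, 0.9], 0.99),  # fails one safety assessor
}
best = max(plans, key=lambda name: scalarize(*plans[name]))
```

The hard veto makes the safety assessors behave a bit like a whitelist gate, while the log terms keep any single dimension from dominating the sum.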

14.3 Supervising the Thought Assessors, and the “first-person problem”

Recall from Posts #4 and #6 that the Thought Assessors are trained by supervised learning. So we need a supervisory signal—what I labeled “ground truth in hindsight” in the diagram at the top.

I’ve talked about how the brain generates ground truth in numerous places, e.g. Post #3, Section 3.2.1, and Posts #7 and #13. How do we generate it for the AGI?

Well, one obvious possibility is to have the AGI watch YouTube, with lots of labels throughout the video for when we think the various Thought Assessors ought to be active. Then when we’re ready to send the AGI off into the world to solve problems, we turn off the labeled YouTube videos, and simultaneously freeze the Thought Assessors (= set the error signals to zero) in their current state. Well, I’m not sure if that would work; maybe the AGI has to go back and watch more labeled YouTube videos from time to time, to help the Thought Assessors keep up as the AGI’s world-model grows and changes.

One potential shortcoming of this approach is related to first-person versus third-person concepts. We want the AGI to have strong preferences about aspects of first-person plans—hopefully, the AGI will see “I will lie and deceive” as bad, and “I will be helpful” as good. But we can’t straightforwardly get that kind of preference from the AGI watching labeled YouTube videos. The AGI will see YouTube character Alice deceiving YouTube character Bob, but that’s different from the AGI itself being deceptive. And it’s a very important difference! Consider:

• If you tell me “my AGI dislikes being deceptive”, I’ll say “good for you!”.
• If you tell me “my AGI dislikes it when people are deceptive”, I’ll say “for god's sake you better shut that thing off before it escapes human control and kills everyone”!!!

It sure would be great if there were a way to transform third-person data (e.g. a labeled YouTube video of Alice deceiving Bob) into an AGI’s first-person preferences (“I don’t want to be deceptive”). I call this the first-person problem.

How do we solve the first-person problem? I’m not entirely sure. Maybe we can apply interpretability tools to the AGI’s world-model, and figure out how it represents itself, and then correspondingly manipulate its thoughts, or something? It’s also possible that further investigation into human social instincts (previous post) will shed some light, as human social instincts do seem to transform the third-person “everyone in my friend group is wearing green lipstick” into the first-person “I want to be wearing green lipstick”.

If the first-person problem is not solvable, we need to instead use the scary method of allowing the AGI to take actions, and putting labels on those actions. Why is that scary? First, because those actions might be dangerous. Second, because it doesn’t give us any good way to distinguish (for example) “the AGI said something dishonest” from “the AGI got caught saying something dishonest”. Conservatism and/or concept extrapolation (Section 14.4 below) could help with that “getting caught” problem—maybe we could manage to get our AGI both motivated to be honest and motivated to not get caught, and that could be good enough—but it still seems fraught for various reasons.

14.3.1 Side note: do we want first-person preferences?

I suspect that “the first-person problem” is intuitive for most readers. But I bet a subset of readers feel tempted to say that the first-person problem is not in fact a problem at all. After all, in the realm of human affairs, there’s a good argument that we could use a lot fewer first-person preferences!

The opposite of first-person preferences would be “impersonal consequentialist preferences”, wherein there’s a future situation that we want to bring about (e.g. “awesome post-AGI utopia”), and we make decisions to try to bring that about, without particular concern over what I-in-particular am doing. Indeed, too much first-person thinking leads to lots of things that I personally dislike in the world—e.g. jockeying for credit, blame avoidance, the act / omission distinction, social signaling, and so on.

Nevertheless, I still think giving AGIs first-person preferences is the right move for safety. Until we can establish super-reliable 12th-generation AGIs, I’d like them to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. Humans have this notion, after all, and it seems at least relatively robust—for example, if I build a bank-robbing robot, and then it robs the bank, and then I protest “Hey I didn’t do anything wrong; it was the robot!”, I wouldn’t be fooling anybody, much less myself. An AGI with such a preference scheme would presumably be cautious and conservative when deciding what to do, and would default to inaction when in doubt. That seems generally good, which brings us to our next topic:

14.4 Conservatism and concept extrapolation

14.4.1 Why not just relentlessly optimize the right abstract concept?

Let’s take a step back.

Suppose we build an AGI such that it has positive valence on the abstract concept “there will be lots of human flourishing”, and consequently makes plans and take actions to make that concept happen.

I’m actually pretty optimistic that we’ll be able to do that, from a technical perspective. Just as above, we can use labeled YouTube videos and so on to make a Thought Assessor for “this thought / plan will lead to human flourishing”, and then base the reward function purely on that one Thought Assessor.

And then we set the AGI loose on an unsuspecting world, to go do whatever it thinks is best to do.

What could go wrong?

The problem is that the concept of “human flourishing” is an abstract concept in the AGI’s world-model—really, it’s just a fuzzy bundle of learned associations. It’s hard to know what actions a desire for “human flourishing” will induce, especially as the world itself changes, and the AGI’s understanding of the world changes even more. In other words, there is no future world that will perfectly pattern-match to the AGI’s current notion of “human flourishing”, and if an extremely powerful AGI optimized the world for the best possible pattern-match, we might wind up with something weird, even catastrophic. (Or maybe not! It’s pretty hard to say, more on which in Section 14.6.)

As some random examples of what might go wrong: maybe the AGI would take over the world and prevent humans and human society from changing or evolving forevermore, because those changes would reduce the pattern-match quality. Or maybe the least-bad pattern-match would be the AGI wiping out actual humans in favor of an endless modded game of The Sims. Not that The Sims is a perfect pattern-match to “human flourishing”—it’s probably pretty bad! But maybe it’s less bad a pattern-match than anything the AGI could feasibly do with actual real-world humans. Or maybe as the AGI learns more and more, its world-model gradually drifts and changes, such that the frozen Thought Assessor winds up pointing at something totally random and crazy, and then the AGI wipes out humans to tile the galaxy with paperclips. I don’t know!

So anyway, relentlessly optimizing a fixed, frozen abstract concept like “human flourishing” seems maybe problematic. Can we do better?

Well, it would be nice if we could also continually refine that concept, especially as the world itself, and the AGI’s understanding of the world, evolves. This idea is what Stuart Armstrong calls Concept Extrapolation, if I understand correctly.

Concept extrapolation is easier said than done—there’s no obvious ground truth for the question of “what is ‘human flourishing’, really?” For example, what would “human flourishing” mean in a future of transhuman brain-computer hybrid people and superintelligent evolved octopuses and god-only-knows-what-else?

Anyway, we can consider two steps to concept extrapolation. First (the easier part), we need to detect edge-cases in the AGI’s preferences. Second (the harder part), we need to figure out what the AGI should do when it comes across such an edge-case. Let’s talk about those in order.

14.4.2 The easier part of concept extrapolation: Detecting edge-cases in the AGI’s preferences

I’m cautiously optimistic about the feasibility of making a simple monitoring algorithm that can watch an AGI’s thoughts and detect that it’s in an edge-case situation—i.e., an out-of-distribution situation where its learned preferences and concepts are breaking down.

(Understanding the contents of the edge-case seems much harder, as discussed shortly, but here I’m just talking about recognizing the occurrence of an edge-case.)

To pick a few examples of possible telltale signs that an AGI is at an edge-case:

• The learned probability distributions for Thought Assessors (see Post #5, Section 5.5.6.1) could have a wide variance, indicating uncertainty.
• The different Thought Assessors of Section 14.2 could diverge in new and unexpected ways.
• The AGI’s reward prediction error could flip back and forth between positive and negative in a way that indicates “feeling torn” while attending to different aspects of the same possible plan.
• The AGI’s generative world-model could settle into a state with very low prior probability, indicating confusion.
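A monitoring algorithm along these lines might combine the four telltale signs into a single flag. This is a toy sketch with invented signal names and thresholds, not a proposal from the post:

```python
# Toy sketch (hypothetical thresholds and signal names): a simple monitor
# that flags an "edge-case" when any telltale sign crosses a threshold.

def is_edge_case(assessor_variance, assessor_disagreement,
                 rpe_sign_flips, world_model_log_prior):
    """Return True if the AGI's current thought looks out-of-distribution.

    assessor_variance: width of a Thought Assessor's predictive distribution
    assessor_disagreement: spread between different Thought Assessors
    rpe_sign_flips: recent +/- oscillations in reward prediction error
    world_model_log_prior: log prior probability of the current model state
    """
    return (assessor_variance > 0.5
            or assessor_disagreement > 0.3
            or rpe_sign_flips > 5
            or world_model_log_prior < -20.0)

assert not is_edge_case(0.1, 0.05, 1, -3.0)   # familiar, confident thought
assert is_edge_case(0.1, 0.05, 1, -50.0)      # very surprising world state
```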

14.4.3 The harder part of concept extrapolation: What to do at an edge case

I don’t know of any good answer. Here are some options.

14.4.3.1 Option A: Conservatism—When in doubt, just don’t do it!

A straightforward approach would be that if the AGI’s edge-case-detector fires, it forces the RPE signal negative—so that whatever thought the AGI was thinking is taken to be a bad thought / plan. This would loosely correspond to a “conservative” AGI.
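As a toy sketch (the names and the penalty value are hypothetical), the override could be a thin wrapper around the reward prediction error signal:

```python
# Toy sketch of Option A (hypothetical API): whenever the edge-case detector
# fires, force the reward prediction error (RPE) negative, so the current
# thought/plan is scored as bad regardless of its apparent value.

def conservative_rpe(raw_rpe, edge_case_detected, penalty=-1.0):
    """Pass the RPE through unchanged on familiar thoughts; clamp it
    to a fixed negative value on edge-case thoughts."""
    if edge_case_detected:
        return penalty
    return raw_rpe

assert conservative_rpe(0.7, False) == 0.7   # normal thought: unchanged
assert conservative_rpe(0.7, True) == -1.0   # edge case: forced negative
```

Note that `penalty` is one instance of the “conservatism knob” mentioned in the side note below: moving it toward zero makes the AGI less conservative, with exactly the dial-down risk discussed later in this section.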

(Side note: I think there may be many knobs we can turn in order to make a brain-like AGI more or less “conservative”, in different respects. The above is just one example. But they all seem to have the same issues.)

A failure mode of a conservative AGI is that the AGI just sits there, not doing anything, paralyzed by indecision, because every possible plan seems too uncertain or risky.

An “AGI paralyzed by indecision” is a failure mode, but it’s not a dangerous failure mode. Well, not unless we were foolish enough to put this AGI in charge of a burning airplane plummeting towards the ground. But that’s fine—in general, I think it’s OK to have first-generation AGIs that can sometimes get paralyzed by indecision, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.

However, if the AGI is always paralyzed by indecision—such that it can’t get anything done—now we have a big problem. Presumably, in such a situation, future AGI programmers would just dial the “conservatism” knob down lower and lower, until the AGI started doing useful things. And at that point, it’s unclear if the remaining conservatism would be sufficient to buy us safety.

I think it would be much better to have a way for the AGI to iteratively gain information to reduce uncertainty, while remaining highly conservative in the face of whatever uncertainty still remains. So how can we do that?

14.4.3.2 Option B: Dumb algorithm to seek clarification in edge-cases

Here’s a slightly-silly illustrative example of what I have in mind. As above, we could have a simple monitoring algorithm that watches the AGI’s thoughts, and detects when it’s in an edge-case situation. As soon as it is, the monitoring algorithm shuts down the AGI entirely, and prints out the AGI’s current neural net activations (and corresponding Thought Assessor outputs). The programmers use interpretability tools to figure out what the AGI is thinking about, and manually assign a value / reward, overriding the AGI’s previous uncertainty with a highly-confident ground-truth.

That particular story seems unrealistic, mainly because we probably won’t have sufficiently reliable and detailed interpretability tools. (Prove me wrong, interpretability researchers!) But maybe there’s a better approach than just printing out billions of neural activations and corresponding Thought Assessors?

The tricky part is that AGI-human communication is fundamentally a hard problem. It’s unclear to me whether it will be possible to solve that problem via a dumb algorithm. The situation here is very different from, say, an image classifier, where we can find an edge-case picture and just show it to the human. The AGI’s thoughts may be much more inscrutable than that.

By analogy, human-human communication is possible, but not by any dumb algorithm. We do it by leveraging the full power of our intellect—modeling what our conversation partner is thinking, strategically choosing words that will best convey a desired message, and learning through experience to communicate more and more effectively. So what if we try that approach?

14.4.3.3 Option C: The AGI wants to seek clarification in edge-cases

If I’m trying to help someone, I don’t need any special monitoring algorithm to prod me to seek clarification at edge-cases. Seeking clarification at edge-cases is just what I want to do, as a self-aware properly-motivated agent.

So what if we make our AGIs like that?

At first glance, this approach would seem to solve all the problems mentioned above. Not only that, but the AGI can use its full powers to make everything work better. In particular, it can learn its own increasingly-sophisticated metacognitive heuristics to flag edge-cases, and it can learn and apply the human’s meta-preferences about how and when the AGI should ask for clarification.

But there’s a catch. I was hoping for a conservatism / concept extrapolation system that would help protect us from misdirected motivations. If we implement conservatism / concept extrapolation via the motivation system itself, then we lose that protection.

More specifically: if we go up a level, the AGI still has a motivation (“seek clarification in edge-cases”), and that motivation is still an abstract concept that we have to extrapolate into out-of-distribution edge cases (“What if my supervisor is drunk, or dead, or confused? What if I ask a leading question?”). And for that concept extrapolation problem, we’re plowing ahead without a safety net.

Is that a problem? Bit of a long story:

Side-debate: Will “helpfulness”-type preferences “extrapolate” safely just by recursively applying to themselves?

In fact, a longstanding debate in AGI safety is whether these kinds of helpful / corrigible AGI preferences (e.g. an AGI’s desire to understand and follow a human’s preferences and meta-preferences) will “extrapolate” in a desirable way without any “safety net”—i.e., without any independent ground-truth mechanism pushing the AGI’s preferences in the right direction.

In the optimistic camp is Paul Christiano, who argued in “Corrigibility” (2017) that there would be “a broad basin of attraction towards acceptable outcomes”, based on, for example, the idea that an AGI’s preference to be helpful will result in the AGI having a self-reflective desire to continually edit its own preferences in a direction humans would like. But I don’t really buy that argument for reasons in my 2020 post—basically, I think there are bound to be sensitive areas like “what does it mean for people to want something” and “what are human communication norms” and “inclination to self-monitor”, and if the AGI’s preferences drift along any of those axes (or all of them simultaneously), I’m not convinced that those preferences would self-correct.

Meanwhile, in the strongly-pessimistic camp is Eliezer Yudkowsky, I think mainly because of an argument (e.g. this post, final section) that we should expect powerful AGIs to have consequentialist preferences, and that consequentialist preferences seem incompatible with corrigibility. But I don’t really buy that argument either, for reasons in my 2021 “Consequentialism & Corrigibility” post—basically, I think there are possible preferences that are reflectively-stable, and that include consequentialist preferences (and thus are compatible with powerful capabilities), but are not purely consequentialist (and thus are compatible with corrigibility). A “preference to be helpful” seems like it could plausibly develop into that kind of hybrid preference scheme.

Anyway, I’m uncertain but leaning pessimistic. For more on the topic, see also Wei Dai’s recent post, and the comment sections of all of the posts linked above.

14.4.3.4 Option D: Something else?

I dunno.

14.5 Getting a handle on the world-model itself

The elephant in the room is the giant multi-terabyte unlabeled generative world-model that lives inside the Thought Generator. The Thought Assessors provide a window into this world-model, but I’m concerned that it may be a rather small, foggy, and distorted window. Can we do better?

Ideally, we’d like to prove things about the AGI’s motivation. We’d like to say “Given the state of the AGI’s world-model and Thought Assessors, the AGI is definitely motivated to do X” (where X=be helpful, be honest, not hurt people, etc.) Wouldn’t that be great?

But we immediately slam into a brick wall: How do we prove anything whatsoever about the “meaning” of things in the world-model, and thus about the AGI’s motivation? The world is complicated, and therefore the world-model is complicated. The things we care about are fuzzy abstractions like “honesty” and “helpfulness”—see the Pointers Problem. The world-model keeps changing as the AGI learns more, and as it makes plans that would entail taking the world wildly out-of-distribution (e.g. planning the deployment of a new technology). How can we possibly prove anything here?

I still think the most likely answer is “We can’t”. But here are two possible paths anyway. For some related discussion, see Eliciting Latent Knowledge.

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.

I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.

Proof strategy #2 would start with a human-legible “reference world-model” (e.g. Cyc). This reference world-model wouldn’t be constrained to be built out of localized objects in a 3D world, so unlike the above, it could and probably would contain things like “honesty” and “solar cell efficiency” and “daytime”.

Then we try to directly match up things in the “reference world-model” with things in the AGI’s world-model.

Will they match up? No, of course not. Probably the best we can hope for is a fuzzy, many-to-many match, with various holes on both sides.

It's hard for me to see a path to rigorously proving anything about the AGI’s motivations using this approach. Nevertheless, I continue to be amazed that unsupervised machine translation is possible at all, and I take that as an indirect hint that if pieces of two world-models match up with each other in their internal structure, then those pieces are probably describing the same real-world thing. So maybe I have the faintest glimmer of hope.
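To make the “structural match-up” idea concrete, here is a deliberately tiny toy: two “world-models” represented as directed concept graphs, matched purely on relational structure (in/out-degree signatures) with no shared labels. All names are invented; a real attempt would need far richer structural signatures and the fuzzy many-to-many matching described above.

```python
# Toy sketch: match concepts across two world-models by internal structure
# alone, in the spirit of unsupervised machine translation. Each model is a
# list of directed edges; a concept's "signature" is its (out-degree,
# in-degree), and we greedily pair concepts with identical signatures.

from collections import Counter

def signatures(edges):
    out_deg, in_deg = Counter(), Counter()
    for a, b in edges:
        out_deg[a] += 1
        in_deg[b] += 1
    nodes = set(out_deg) | set(in_deg)
    return {n: (out_deg[n], in_deg[n]) for n in nodes}

def structural_match(edges_a, edges_b):
    """Greedy one-to-one pairing of nodes with identical signatures."""
    sig_a, sig_b = signatures(edges_a), signatures(edges_b)
    unused_b = dict(sig_b)
    match = {}
    for node, sig in sorted(sig_a.items()):
        for cand, cand_sig in sorted(unused_b.items()):
            if cand_sig == sig:
                match[node] = cand
                del unused_b[cand]
                break
    return match

# Two isomorphic toy "world-models" with disjoint vocabularies:
ref = [("sun", "daytime"), ("daytime", "solar_output"), ("sun", "solar_output")]
agi = [("n1", "n2"), ("n2", "n3"), ("n1", "n3")]
assert structural_match(ref, agi) == {
    "sun": "n1", "daytime": "n2", "solar_output": "n3"}
```

Even this toy shows why the match will generally be fuzzy: any two nodes with identical signatures are interchangeable to the algorithm, so distinguishing them requires deeper structure.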

I’m unaware of work in this direction, possibly because it’s stupid and doomed, and also possibly because I don’t think we currently have any really great open-source human-legible world-models to run experiments on. The latter is a problem that I think we should rectify ASAP, perhaps by cutting a giant check to open-source Cyc, or else developing a similarly rich, accurate, and (most importantly) human-legible open-source world-model by some other means.

14.6 Conclusion: mild pessimism about finding a good solution, uncertainty about the consequences of a lousy solution

I think we have our work cut out for us figuring out how to solve the alignment problem via the "Controlled AGIs" route (as defined in Post #12). There are a bunch of open problems, and I’m currently pretty stumped. We should absolutely keep looking for good solutions, but right now I’m also open-minded to the possibility that we won’t find any. That’s why I continue to put a lot of my mental energy into the “social-instinct AGIs” path (Posts #12 and #13), which seems somewhat less doomed to me, despite its various problems.

I note, however, that my pessimism is not universally shared—for example, as mentioned, Stuart Armstrong at AlignedAI appears optimistic about solving the open problem in Section 14.4, and John Wentworth appears optimistic about solving the open problem in Section 14.5. Let's hope they're right, wish them luck, and try to help!

To be clear, the thing I’m feeling pessimistic about is finding a good solution to “Controlled AGI”, i.e., a solution that we can feel extremely confident in a priori. A different question is: Suppose we try to make “Controlled AGI” via a lousy solution, like the Section 14.4.1 example where we imbue a super-powerful AGI with an all-consuming desire for the abstract concept of “human flourishing”, and the AGI then extrapolates that abstract concept arbitrarily far out of distribution in a totally-uncontrolled, totally-unprincipled way. Just how bad a future would such an AGI bring about? I’m very uncertain. Would such an AGI engage in mass torture? Umm, I guess I’m cautiously optimistic that it wouldn’t. Would it wipe out humanity? I think it’s possible!—see discussion in Section 14.4.1. But it might not! Hey, maybe it would even bring about a pretty awesome future! I just really don’t know, and I’m not even sure how to reduce my uncertainty.

In the next post, I will wrap up the series with my wish-list of open problems, and advice on how to get into the field and help solve them!

Discuss

### ProjectLawful.com: Eliezer's latest story, past 1M words

11 мая, 2022 - 09:18
Published on May 11, 2022 6:18 AM GMT

So if you read Harry Potter and the Methods of Rationality, and thought...

"You know, HPMOR is pretty good so far as it goes; but Harry is much too cautious and doesn't have nearly enough manic momentum, his rationality lectures aren't long enough, and all of his personal relationships are way way way too healthy."

...then have I got the story for you! Planecrash (aka Project Lawful, aka Mad Investor Chaos and the Woman of Asmodeus) is a story in roleplay format that I, as "Iarwain", am cowriting with Lintamande, now past 1,000,000 words.

It's the story of Keltham, from the world of dath ilan; a place of high scientific achievement but rather innocent in some ways.  For mysterious reasons they've screened off their own past, and very few now know what their prescientific history was like.

Keltham dies in a plane crash and ends up in the country of Cheliax, whose god is "Asmodeus", whose alignment is "Lawful Evil" and whose people usually go to the afterlife of "Hell".

And so, like most dath ilani would, in that position, Keltham sets out to bring the industrial and scientific revolutions to his new planet!  Starting with Cheliax!

(Keltham's new friends may not have been entirely frank with him about exactly what Asmodeus wants, what Evil really is, or what sort of place Hell is.)

This is not a story for kids, even less so than HPMOR. There is romance, there is sex, there are deliberately bad kink practices whose explicit purpose is to get people to actually hurt somebody else so that they'll end up damned to Hell, and also there's math.

The starting point is Book 1, Mad Investor Chaos and the Woman of Asmodeus. I suggest logging into ProjectLawful.com with Google, or creating an email login, in order to track where you are inside the story.

Discuss

### An Inside View of AI Alignment

11 мая, 2022 - 05:29
Published on May 11, 2022 2:16 AM GMT

I started to take AI Alignment seriously around early 2020. I’d been interested in AI and machine learning in particular since 2014 or so, taking several online ML courses in high school and implementing some simple models for various projects. I leaned into the same niche in college, taking classes in NLP, Computer Vision, and Deep Learning to learn more of the underlying theory and modern applications of AI, with a continued emphasis on ML. I was very optimistic about AI capabilities then (and still am) and if you’d asked me about AI alignment or safety as late as my sophomore year of college (2018-2019), I probably would have quoted Steven Pinker or Andrew Ng at you.

Somewhere in the process of reading The Sequences, portions of the AI Foom Debate, and texts like Superintelligence and Human Compatible, I changed my mind. Some 80,000 hours podcast episodes were no doubt influential as well, particularly the episodes with Paul Christiano. By late 2020, I probably took AI risk as seriously as I do today, believing it to be one of the world’s most pressing problems (perhaps the most) and was interested in learning more about it. I binged most of the sequences on the Alignment Forum at this point, learning about proposals and concepts like IDA, Debate, Recursive Reward Modeling, Embedded Agency, Attainable Utility Preservation, CIRL etc. Throughout 2021 I continued to keep a finger on the pulse of the field: I got a large amount of value out of the Late 2021 MIRI Conversations in particular, shifting away from a substantial amount of optimism in prosaic alignment methods, slower takeoff speeds, longer timelines, and a generally “Christiano-ish” view of the field and more towards a “Yudkowsky-ish” position.

I had a vague sense that AI safety would eventually be the problem I wanted to work on in my life, but going through the EA Cambridge AGI Safety Fundamentals Course helped make it clear that I could productively contribute to AI safety work right now or in the near future. This sequence is going to be an attempt to explicate my current model or “inside view” of the field. These viewpoints have been developed over several years and are no doubt influenced by my path into and through AI safety research: for example, I tend to take aligning modern ML models extremely seriously, perhaps more seriously than is deserved, because of my greater amount of experience with ML compared to other AI paradigms.

I’m writing with the express goal of having my beliefs critiqued and scrutinized: there’s a lot I don’t know and no doubt a large amount that I’m misunderstanding. I plan on writing on a wide variety of topics: the views of various researchers, my understanding and confidence in specific alignment proposals, timelines, takeoff speeds, the scaling hypothesis, interpretability, etc. I also don’t have a fixed timeline or planned order for publishing the different pieces of the model.

Without further ado, the posts that follow comprise Ansh’s (current) Inside View of AI Alignment.

Discuss

### Fighting in various places for a really long time

11 мая, 2022 - 04:50
Published on May 11, 2022 1:50 AM GMT

The first time someone raved to me about seeing Everything Everywhere All at Once, I thought they were actually suggesting I see everything everywhere all at once, and I was briefly excited by the implication that this exhilarating possibility was somehow on the table.

After that disappointment I heard about it several times more, and warmed to the idea of seeing the movie anyway, especially on account of it being the most roundly recommended one I remember. The third time someone invited me to see it with them, I went.

And it seemed so astonishingly lacking to both of us that I left severely confused, and remain so. Like: I know people have different tastes. I know that I’m not the biggest movie appreciator (my ideal movie probably has a small number of visually distinct characters and nobody dies or does anything confusing, and I’ve already seen it twice). But usually I have some abstract guess about what other people are liking. Or, more realistically, a name for the category of mysterious attraction (“ah yes, you are into the ‘action’, and that means it’s good when helicopters crash or people shoot each other”). Yet here, I’m grasping even for that. “You like it because… it has much more prolonged fighting than usual and you like fighting?…or…it is some kind of irony thing about other movies?” I could believe that it was some kind of mediocre action movie. But usually my friends don’t go crazy for mediocre action movies. And here for instance one of my best friends, who I generally take to have subtle and sensitive and agreeable tastes, and who knows me extremely well, told me in particular to see it. And the strongest criticism I have seen of it outside of our post-movie discussion is another friend’s apparently sincere complaint on Facebook that it is probably only among the top hundred movies ever, not the top ten like people say. And it’s not that I just wasn’t wowed by it: it’s hard to remember the last time I was less compelled by a movie. (Though perhaps one doesn’t remember such things.) Like, I was really sitting there in the cinema thinking something along the lines of, ‘movies usually grab my attention somehow, yet this is doing some special thing differently to not have that happen? Huh?’

I don’t know if I can spoil this movie, because whatever was good in it, I totally missed. But here I attempt spoilers. This is what happens in the movie, as far as I can tell:

(Ok my companion and I actually failed to notice when it started, so maybe there was something important there. Oops.)

A woman and her family run a laundromat, and are also working on their taxes. Her life is disappointing to her. A version of her husband appears from a different dimension and relays some kind of dimly coherent plot involving lots of dimensions and the need for her to jump between them and fight or something. Then they fight and jump between dimensions for about two hours. Their fighting involves some repeating motifs: 1) There is a humorous conceit that in order to jump between dimensions you have to do a strange action, for instance bite off and chew some lip balm. This joke is repeated throughout most of the fighting. One time the traveler has to put an object up their bottom, so that is pretty exciting humorwise. 2) Things often look cool. Like, there are lots of evocative objects and people are wearing make-up and neat costumes. 3) There is lots of jumping between dimensions. At some point it becomes clear that a baddie is actually the woman’s daughter, who has turned to nihilism as a result of either seeing everything all at once and that being kind of intrinsically nihilism-provoking due to its lack of permitting anything else, or as a result of having her lesbianism disrespected by her mother earlier. The fighting takes on a more nihilism vs. appreciating life flavor, and then it turns out that being friendly and warm is good, as represented by the father, and now appreciated by the mother. Then…actually I forget what happens at the end, sorry.

I’m all for ‘nihilism vs. something something existential something something, life, kindness’ as a theme, but this seemed like such a shallow treatment of it. It just seemed like a bunch of fighting labeled ‘deep plot about nihilism etc’, and I don’t think caused me to have any interesting thoughts about such themes, except perhaps by reminding me of the general topic and leaving me without anything to distract my mind from wandering.

It was clearly too violent for my liking, so that’s idiosyncratic, but it’s not like I’m always opposed to violence—some of the fighting in Lord of the Rings was quite moving, and I watched the whole of Game of Thrones in spite of also at other times using scenes from it in exposure therapy. But I posit that you need some sort of meaningful context to make violence interesting or moving, and I don’t think I caught that.

I also speculate that some humor is meant to come from the protagonist being a middle aged immigrant Chinese woman, instead of the more standard young man. Which seems rude: as though it is asking for the props generally offered for featuring atypical demographics in films, yet is doing so as a joke.

In sum, it seemed to me to be a bunch of fairly meaningless fighting interspersed with repetitive lowbrow humor and aesthetically pleasing props.

I asked a couple of my friends to explain their alternate takes to me, but I don’t think I can do their explanations justice, due to not really understanding them. At a high level they disagreed with me about things like ‘was it extremely humorous?’ and ‘was it unusually engaging vs. unusually unengaging?’, but I didn’t understand why, at a lower level. Probably we all agree that it was visually cool, but I wasn’t actually stunned by that. Maybe visual attractiveness alone counts for less with me (though I recently saw Everything is Illuminated, which I found awesome in a confusingly soul-electrifying way and whose merit seems somehow related to visualness). One interesting thing that this discussion with EEAAO appreciators added was the point that there is something moving about the thought that in a different dimension you and the odious tax lady might be tender lovers. I agree that that’s a nice thought.

I am hesitant to criticize here, because it is sweet of my friends to try to give me a nice movie recommendation, and I appreciate it. Also, I think in general that if Alice loves a thing and Bob doesn’t, it is much more likely that Bob is missing something wonderful than that Alice is imagining such a thing. (Though conversely if they agree that the thing is pretty good in ways, and Bob just hates it because it also has some overriding problem, then my guess would be the reverse: probably Alice is missing a thing.)

So probably, somehow, other people are right. Please other people, help enlighten me more? (And thanks to some of my friends for trying!)

Discuss

### Stuff I might do if I had covid

11 мая, 2022 - 03:00
Published on May 11, 2022 12:00 AM GMT

In case anyone wants a rough and likely inaccurate guide to what I might do if I had covid to mitigate it, I looked into this a bit recently and wrote notes. It’s probably better than if one’s plan was to do less than a few hours of research, but is likely flawed all over the place and wasn’t written with public sharing in mind, and um, isn’t medical advice:

Here’s a Google doc version, where any comments you leave might be seen by the next person looking (and you might see comments added by others).

Here’s a much longer doc with the reasoning, citations and more comments.

(I continue to guess that long covid is worth avoiding.)

Discuss

### Crises Don't Need Your Software

11 мая, 2022 - 00:06
Published on May 10, 2022 9:06 PM GMT

About a month ago, I was invited to contribute to a group looking to help Ukrainian refugees in Poland. The group consisted of volunteers, including some people from the rationalist community, who knew they wanted to help, and were searching for high impact ways to do so, likely through software. I had been intending to find a way to help Ukraine, so when this opportunity to use my programming background to help arose, I felt elated to finally be able to do something. I took two weeks off work the next morning (thank you, my employer, for letting me do that) and optimized my life to spend as many hours as I could on the project without harming myself.

The group consisted of some really amazing people. There were two students from the US, Soren and Luke, who had more or less dropped everything, sold their cars, and just moved to Poland to help. They were our eyes on the ground and the ones starting the project. My close friend, Leonardo, was the one who invited me, and he was the one who inspired me to take time off work by doing it himself first. And a lot of other brilliant people with very valuable contributions, who I won't list to keep the cast small for the reader. Just know I'm very thankful for everyone who helped out.

We knew we wanted to help somehow, but we didn't know how. We focused on looking for problems we could solve with software since the group had a high density of software people. Over the course of the first few days we went through a concept of tracking the movements of people through the country (dismissed because it would be a GDPR nightmare and also very abusable), a concept of assigning refugees entering the country to different destinations so not everyone ended up in Krakow (depended on a model of reality that was much too simplified, and also had problems with how to gather data to base algorithmic decisions on) and we finally arrived at a concept built around a certain kind of bus.

Somehow I ended up in a leadership position along the way, a de facto product manager so to say, and I created the bus concept from something a volunteer hosting refugees in his office space had said. They had been in contact with a municipality from Norway who had space for refugees and was sending buses to directly transport refugees. This seemed to be a regular occurrence, where individual people, small aid organisations or municipalities were sending buses and cars to Poland. Then they would show up and nobody would want to get on, because no one knew they'd be there, what it's like where they would be going or if they could be trusted.

So the concept was to create a website where the bus drivers could see a heatmap of where refugees willing to go to their country were and plan their trips based on where there were too few buses compared to refugees. And the refugees (or rather, volunteers helping refugees) could sign up to add to the data set the heatmap comes from, then see which planned trips would be nearby, "follow" them for updates and mark interest in which ones they'd want to go with. Having access to contact information in advance would allow them to get an understanding of where the trip would be going, have time to look up the country, maybe see a "day in the life" video to understand what the country feels like and to some degree vet the bus driver in advance. By not addressing the things that worked, such as the trains, flights and regularly scheduled buses, and focusing on connecting small scale actors with other small scale actors that need each other, we wanted to make a difference at scale by enabling many small interactions.

After we'd worked on refining the concept from Wednesday to Saturday, as well as starting development and looking for more devs, I was put in contact with a senior product manager, Shira. She was kind enough to read through our concept docs and our source material and offer her feedback. I greatly appreciated the additional perspective, because it had been bugging me that no one had yet pointed out that my feature set was obviously too large. Soren and I chatted with Shira for a while and she more or less dismissed the entire project. Or well, not necessarily the project, but definitely the amount of verification of the concept we had done. Soren and Luke had certainly talked with a _lot_ of people at that point, but we would need more than the ~7 sources of corroborating evidence that supported the bus concept. There were a lot of large-scale questions remaining:

* Did the buses occur in large enough numbers for an information exchange service to help?
* Would bus drivers actually want to use such a service?
* Would volunteers on the refugee side actually want to go through the trouble of learning and using a service in the midst of the chaos of a refugee crisis?
* While both sides would have an incentive to use the service after a critical mass of users appeared, how do you reach that critical mass?
* Even if we've understood the world correctly right now, what ensures that the same problem will exist in two weeks/a month when we're done with development and can launch? It's a chaotic situation that changes quickly.
* How could we counteract our service being misused for human trafficking? Could we make the percentage of trafficking cases smaller than without our services? And would possibly enabling more human trafficking to happen, even indirectly, be too big of a cost for the potential good? (On the other hand, would someone using our service for evil even be on our conscience, provided that we make it harder than it would be without our service?)
* What legal implications would there be, especially considering it's a service interacting with people in such a vulnerable position?
* If our service would see mass adoption, what political implications would there be of encouraging relocation to other countries?
* And finally, the blanket question of "is our model of the world correct and complete enough to suggest an intervention?"

Shira found it ill-advised to start development with so little verification of the concept, and I had no problem changing my mind to agree with her. We also realized we had no realistic way of knowing whether the solution would help. In a way, it was freeing to not feel the weight of the world on our shoulders anymore. I had previously entertained the thought "is us searching for a systemic solution actually higher value than just working for a week and giving our pay to a charity?" After our talk, I put the project on the back-burner with the plan of letting the people on the ground collect more data. I ended my two-week pause from work early, worked my day job the second week and donated the equivalent of that week's pay to related charities.

I think the realization I had generalizes to my demographic: software developers who enthusiastically want to help, but are completely outside a situation. We see a problem from afar, think we understand it, and decide that software is the solution (hammer, meet nail). In cases where you can actually model the entire problem (from talking to the ten users you're designing some company software for, like I have in the past, or through extensive research efforts) this works great, but when you jump to designing the solution too early relative to the complexity of the problem, you'll create great software that perfectly solves a problem that doesn't exist.

In crisis situations such as these, this is especially true, because crises are just _so large_. There are millions of people involved, so you just can't model the entire crisis. We got so blinded by trying to have a large impact that we sacrificed having an impact at all. Shira expanded on this.

Her model is that efficiency is for orderly things. These crises are chaos. Every day brings unforeseen challenges, and you can't predict, plan and structure something that dynamic. Software is great at creating efficiency inside the box of what you design it for, but if the box keeps changing it can't do much. The volunteers working on the ground will solve today's new challenges in the most inefficient way possible, because when you constantly need to think outside the box, _that's the only way that's available_. If we want to help, we need to help them do their constant improvisation, by either going down there ourselves or just giving them more resources, such as money.

In such a chaotic situation, any money you send will be inefficiently spent. Drivers going into war zones will charge a premium for the risk to their life. Shipments will be lost, funds will be skimmed. But even if the charity you pick only yields 1 euro of value from the 1000 you send, it's still better to send the money, because the alternative is doing nothing. Anything that helps the volunteers save one more life or improves the refugees' quality of life a bit has a bigger impact than nothing; anything that helps volunteers work through another day of the chaos will help.

Of course, crises need some software, but any software that actually helps is bound to solve a small enough problem, helping refugees do what they already do in a slightly better way. Shira is an example of this: she led a team creating software to help a group of volunteers transport refugees out of Ukraine. The scope of the project? An app to send a message to 20 WhatsApp users at once, instead of sending it to each user manually. It helped a small group do something they were already doing, slightly better. There's no time for paradigm shifts. Well, unless you have the information flow of a major NGO, in which case you shouldn't be taking my advice. Go back to saving the world!

So in summary, if you're a software developer and you want to help, don't try to create a clever React app; do what I did. Continue working for money, and donate that money to a charity. Accept that nowhere near 100% of your money will reach the target, especially in the chaos of a crisis, and realize that 10% or even 1% of your money through an inefficient charity is drastically more than the 0% of not giving at all. And as software developers on average have a high income, you can have a disproportionately high impact just by giving money.

If you want a concrete recommendation of where to direct your donations, Soren saw World Central Kitchen do good work on the ground. Providing food will always be needed and it's politically uncomplicated, so that's where I sent most of my donation. If you're reading this later, after the Ukraine crisis is over, or would like a charity that's been independently evaluated to be effective, GiveWell does great work in that field. If analysis paralysis grips you, just give to their maximum impact fund, which they allocate to where it can do the most good.

You can totally help charity with code. You just have to convert it into money first.

Discuss

### Ceiling Fan Air Filter

10 мая, 2022 - 17:20
Published on May 10, 2022 2:20 PM GMT

Filter cubes are a great way to cheaply filter a lot of air, but they're bulky and noisy. Elevating them can get them out of the way if you have a high enough ceiling, but it's still not ideal. What if we built something around a device that is intended to be up there: a ceiling fan?

Let's say your blades are a foot from the ceiling, and sweep a diameter of 52". Shroud the fan with a regular octagon of 12"x24" furnace filters, and the air will flow in through the filters and down:

Relative to box fans, ceiling fans move a lot more air for a given level of noise, because they are so much bigger. Since noise often makes people turn air filters down or off, a quiet high-volume filter that doesn't get in the way could be very useful.
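A quick back-of-the-envelope check of the geometry: the distance across the flats of a regular octagon with 24" sides is 24(1 + √2) ≈ 57.9", so the shroud clears the 52" blade sweep with roughly 3" to spare on each side, and eight 12"×24" filters give 16 sq ft of filter area.

```python
import math

side = 24.0   # long dimension of each 12"x24" filter, used as the octagon side
sweep = 52.0  # fan blade sweep diameter (inches)

# Distance across the flats of a regular octagon with side s is s * (1 + sqrt(2)).
across_flats = side * (1 + math.sqrt(2))
print(f"octagon across flats: {across_flats:.1f} in")  # ~57.9 in

clearance = (across_flats - sweep) / 2
print(f"clearance per side: {clearance:.1f} in")       # ~3.0 in

filter_area_ft2 = 8 * (12 * 24) / 144
print(f"total filter area: {filter_area_ft2:.0f} sq ft")  # 16 sq ft
```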

You could make a simple one with duct tape, but it should be possible to make one that is reasonably attractive if you use a metal grille for the outside. This could also make changing the filters much easier, if the framework held the filters and allowed you to slide them into place. Putting lights on the inside and using the filter as a diffuser could also have a nice effect, like a paper lantern.

Discuss

### The limits of AI safety via debate

10 мая, 2022 - 16:33
Published on May 10, 2022 1:33 PM GMT

I recently participated in the AGI safety fundamentals program and this is my cornerstone project. During our readings of AI safety via debate (blog, paper) we had an interesting discussion on its limits and conditions under which it would fail.

I spent only around 5 hours writing this post and it should thus mostly be seen as food for thought rather than rigorous research.

Lastly, I want to point out that I think AI safety via debate is a promising approach overall. I just think it has some limitations that need to be addressed when putting it into practice. I intend my criticism to be constructive and hope it is helpful for people working on debate right now or in the future.

The setting

In AI safety via debate, there are two debaters who argue for the truth of different statements to convince a human adjudicator/verifier. In OpenAI’s example, the debaters use snippets of an image to argue that it either contains a dog or a cat. The dog-debater chooses snippets that show why the image contains a dog and the cat-debater responds with snippets that argue for a cat. Both debaters can see what the other debater has argued previously and respond to that, e.g. when the dog-debater shows something that indicates a dog, the cat-debater can refute this claim by arguing that this snippet actually indicates a cat. At some point, the human verifier chooses whether the image shows a cat or a dog, and the respective debater wins.
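As a toy illustration (my own sketch, not OpenAI's implementation), the back-and-forth can be modeled as two agents alternately revealing the strongest remaining piece of evidence for their side, with the verifier judging only what was revealed:

```python
def debate(evidence, rounds=3):
    """Toy version of the debate protocol.

    `evidence` is a list of signed numbers: positive values support "dog",
    negative values support "cat". Each debater alternately reveals the
    strongest remaining piece of evidence for its side; the verifier then
    judges by summing only what was revealed to it.
    """
    remaining = sorted(evidence)  # ascending: strongest cat evidence first
    revealed = []
    for _ in range(rounds):
        if remaining:
            revealed.append(remaining.pop())   # dog debater: largest value
        if remaining:
            revealed.append(remaining.pop(0))  # cat debater: most negative value
    return "dog" if sum(revealed) > 0 else "cat"

# Mostly dog-ish evidence with one strong cat feature:
print(debate([0.9, 0.4, 0.3, -0.8]))  # dog
```

The interesting dynamics in the real scheme come from the debaters choosing *which* evidence to reveal strategically against a verifier who can't see everything; this sketch only captures the alternating-reveal structure.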

I think AI safety via debate works well in cases where the verifier and the debaters broadly have a similar understanding of the world and level of intelligence. When this is not the case, failures get more frequent. Thus, my intuitive example for thinking about failure modes is “Let a person from 1800 evaluate the truth of the statement ‘Today I played Fortnite.’”. In this setting, you travel back through time and have to convince a random person from 1800 that you played Fortnite before traveling. Your opponent is someone who has a similar level of knowledge and intelligence as you.

Obviously, this setting is imperfect, due to all the problems with time travel but, in my opinion, it still intuitively shows some of the problems of AI safety via debate. The worlds of someone who played Fortnite in 2022 and someone who lived in 1800 are just so different that it is hard to even begin persuading them. Furthermore, so many of the concepts necessary to understand Fortnite, e.g. computers, the internet, etc. are nearly impossible to verify for a person from 1800 even if they wanted to believe you.

Limitations

In the following, I list different implicit and explicit assumptions of debate that can lead to problems if they aren’t met.

Assumption 1: the concept must break down into parts that are verifiable in a reasonable timeframe

In cases where the verifier is not able to verify a concept from the beginning, it needs to be broken down into smaller subcomponents that are all verifiable. However, this might not always be possible--especially when given limited time.

In the “1800 Fortnite” example, the debater would have to convince the verifier of the existence of electricity, TVs or computers, video games, the internet, etc.

A second example is a question that probably requires very elaborate and time-intensive experiments to yield high-confidence answers such as in a “nature vs nurture” debate. The debater might have to run multi-generational studies to provide low-uncertainty evidence for their side.

Assumption 2: human verifiers are capable of understanding the concept in principle

I’m not very sure about this but I could imagine that there are concepts that are too hard to understand in principle. Every attempt to break them down doesn’t solve the fundamental problem of the verifiers' limited cognitive abilities.

For example, I’m not sure if there is someone who “truly understood” string theory, or high-dimensional probability distributions sufficiently to make a high-confidence judgment in a debate. It might just be possible that these are beyond our abilities.

A second example would be explaining the “1800 Fortnite” scenario to a far-below-average intelligent person from 1800. Even if the debater did the best job possible, concepts like electricity or the internet might be beyond the capabilities of that specific verifier.

This leads to a potentially sad conclusion for a future with AI systems. I could very well imagine that smart humans today could not understand a concept that is totally obvious to an AGI.

Assumption 3: human verifiers are well-intentioned

I think there are many implementations of AI safety via debate that lead to simple practical problems due to basic human flaws.

For example, the human verifier might just want to get on with it and thus doesn’t really make sure they actually believe the concept. I think this mostly implies that there will be some stochasticity to the answer which might be fixable by careful experimental design.

Secondly, especially in group settings, verifiers might also try to hide their own misunderstandings to not seem unknowledgeable in front of their peers. My intuition is that this biases group settings towards answers that “are allowed to be said”. During 2020, for example, a debater that argues that Covid might come from a lab leak would have lost by default just because it was seen as unspeakable in many social groups.

Assumption 4: human verifiers are impartial

Humans have a lot of motivations, many of which are not “just find the truth over everything else”. They like personal profits, seek status, have many biases such as a desire to confirm their current beliefs and follow ideologies and tribes.

Therefore, there is a risk that human verifiers might vote in favor of bad motivations and preferences rather than find the true answer. Especially when the truth of the result is not personally meaningful to them but one outcome is clearly beneficial to their wallet or party, they might just go with that. Since many of these motivations are hidden or subconscious, it’s not easy to create a setting in which they don’t influence the results.

Assumption 5: It’s easier to explain the true concept than a flawed simple one

There are some examples of concepts that take a while to understand but have simplified but false alternative explanations. Examples of these false alternatives include rent-control as a solution to the housing crisis, tariffs benefitting your economy, immigrants stealing jobs or autocrats acting the way they do simply because they are evil. In all of these cases, there is a simple intuition that can be explained in one sentence and a complex alternative that requires some understanding of economics, history, incentives, etc. to get why the simple intuition is false.

My gut reaction to this problem is something like “just pick smart verifiers” but there is a deeper problem. I think it is very likely that even the smartest humans today hold lots of beliefs that seem true and intuitive to them but will turn out to be wrong when investigated in more detail. It might be possible that the background knowledge to understand the true alternative takes too long to explain compared to providing a simple yet false alternative. Therefore, since the incentive of the debater is to win and not to find the truth, they will likely choose the simple yet false scenario.

Assumption 6: AI is not deceptive

The incentive of the debater is not to seek the truth but to convince the verifier that they have found the truth. While this sounds like a small discrepancy it can create really large problems. In short, the debater does everything to convince the verifier even if it knows that it’s providing incorrect information. If the debater has more knowledge about the world than the verifier and a very accurate model of what the verifier thinks and likes, it can use its powers to deceive the human.

For an intuitive example, think of a hypothetical theoretical physics professor who knows you very well. They broadly know how you think and they know much more about physics than you. So if they wanted to convince you that a specific fact in theoretical physics is true, they could probably do so independent of whether it’s actually true.

I think this is the biggest problem for AI safety via debate since it is a) so hard to distinguish between deception and honest mistakes and b) a problem that will almost surely happen in scenarios where the AI is very powerful.

Conclusions & future research

My main conclusion is that AI safety via debate is a promising tool, but some of its core problems still need addressing before it will be really good. There are many different research directions that one could take, but I will highlight just two:

1. Eliciting Latent Knowledge (ELK) - style research: Since the biggest challenge of AI safety via debate is deception, in my opinion, the natural answer is to understand when the AI deceives us. ELK is, in my opinion, the most promising approach to combat deception we have found so far.
2. Social science research: If we will ever be at a point when we have debates between AI systems to support decision-making, we also have to understand the problems that come with the human side of the setup. Under which conditions do humans choose for personal gain rather than seek the truth? Do the results from such games differ in group settings vs. individuals alone and in which ways? Can humans be convinced of true beliefs if they previously strongly believed something that was objectively false?

Discuss

### When is AI safety research bad?

10 мая, 2022 - 14:32
Published on May 9, 2022 6:19 PM GMT

Cross posted to the EA Forum and The Good Blog

Summary

• AI safety research improves capability by making AIs do what humans want
• Having more capability means that AI is more likely to be deployed
• If AI safety is really hard then AI we think is safe at deployment is likely to be unsafe
• This effect is mitigated if safety failures are continuous - in this world the more total safety research done the better
• Highly theoretical AI safety research is plausibly not going to be done anyway and so adds to the total amount of safety research done
• Empirical safety research has a smaller counterfactual impact
• The effect of this could go either way depending on whether safety failures are discrete or continuous

What do we mean by capability

There is an argument that safety research is bad because getting a utility function which is close to one that kind, sensible humans would endorse is worse than missing completely. This argument won’t be the focus of this blog post but is well covered here

I will argue that another harm could be that safety research leads to an unsafe AI being deployed more quickly, or at all, than without the safety research being done.

The core of this argument is that AI safety and AI capability are not orthogonal. There are two ways capability can be understood: firstly as the sorts of things an AI system is able to do, and secondly as the ability of people to get what they want using an AI system.

Safety is very clearly not orthogonal under the second definition. The key claim made by AI safety as a field is that it’s possible to get AIs which can do a lot of things but will end up doing things that are radically different from what a human principal wants it to do. Therefore improving safety improves this dimension of capability in the sense that ideally a safer AI is less likely to cause catastrophic outcomes which presumably their principals don’t want.

It’s also plausible that under the first definition of capability AI safety and capabilities are not orthogonal. The problem that value-learning approaches to AI safety are trying to solve is one of attempting to understand what human preferences are from examples. Plausibly this requires understanding how humans work at some very deep level, which may require substantial advances in the sorts of things an AI can do. For instance, it may require a system to have a very good model of human psychology.

These two axes of capability give two different ways in which safety research can advance capabilities. Firstly by improving the ability of principals to get their agents to do what they want. Secondly, because doing safety research may, at least under the value learning paradigm, require improvements in some specific abilities.

How does this affect whether we should do AI safety research or not?

Whether or not we should do AI safety research depends, I think, on a few variables, at least from the perspective I’m approaching the question with.

• Is safe AI discrete or continuous
• How hard is AI safety
• What are the risk behaviours of the actors who choose to deploy AI
• How harmful or otherwise is speeding up capabilities work
• How likely is it that TAI is reached with narrow vs general systems

How does safety interact with deployment?

I think there are a few reasons why very powerful AI systems might not be deployed. Firstly, they might not be profitable because they have catastrophic failures. A house cleaning robot that occasionally kills babies is not a profitable house cleaning robot[1]. The second reason is that people don’t want to die, and so if they think deploying an AGI will kill them they won’t deploy it.

There are two reasons why an AGI might be deployed even if the risk outweighs the reward from an impartial perspective. There’s individuals having an incorrect estimation of their personal risk from the AGI. Then there’s also individuals having correct estimations of the risk but there are very large - potentially unimaginably vast - externalities, like human extinction.

So we have three ways that AI safety research might increase the likelihood of a very powerful AGI being deployed. If AI systems have big discontinuities in skills, then it’s possible that AI systems, if there’s at least some safety work, look safe until they aren’t. In this world, if none of the lower-level safety research had been done, then weaker AI systems wouldn’t be profitable, because they’d be killing babies while cleaning houses.

It seems very likely that AI safety research reduces existential risk conditional on AGI being deployed. We should expect the risk level acceptable to those taking that decision to be much higher than is socially optimal, because they aren’t fully accounting for the good lives missed out on due to extinction, or the lives of people in an AI-enabled totalitarian nightmare state. Therefore they’re likely to accept a higher level of risk than is socially optimal, while still only accepting risk below some threshold. If AI safety research is what gets the risk below that threshold, then AI could be deployed when the expected value is still massively negative.

Relatedly, if AGI is going to be deployed, it seems unlikely that there have been lots of major AI catastrophes beforehand. This could mean that those deploying AI underestimate their personal risk from AGI deployment. It’s unclear to me, assuming people take the threat of AI risk seriously, whether key decision makers are likely to be over- or under-cautious (from a self-interested perspective). On one hand, people in general are very risk averse, while on the other, individuals are very bad at thinking about low probability, high impact events.

Value of Safety research

If AI being safe - in the sense of not being an existential risk - is a discrete property, then the marginal impact of AI safety research is given by two variables: the amount that safety research increases the total amount of research done, and the amount that an increase in the total amount of research done reduces the probability of x-risk. If AI safety is very hard, then it’s likely (though not certain) that the marginal impact of AI safety research is small. If we’ve only done a very small amount of research, then adding any extra research means we’ve still only done a very small amount of research, so AI is still unlikely to be safe. There’s a similar effect from doing a large amount of research - adding more research means we’ve still done a lot of research, and so AI is still very likely to be safe. The large effect on the probability comes when we’ve done a medium amount of research.
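One way to picture this is a hypothetical model (arbitrary units and parameters, chosen only for illustration) treating the probability of safe AI as a logistic function of total research done; the marginal impact of one more unit of research then peaks at a medium amount:

```python
import math

def p_safe(research, difficulty=50.0, scale=10.0):
    """Hypothetical model: probability that AI is safe as a logistic
    function of total research done (all units are arbitrary)."""
    return 1 / (1 + math.exp(-(research - difficulty) / scale))

def marginal(research, delta=1.0):
    """Reduction in x-risk from one extra unit of research."""
    return p_safe(research + delta) - p_safe(research)

for r in (10, 50, 90):
    print(f"research={r}: marginal impact {marginal(r):.4f}")
```

With these (made-up) parameters, the marginal impact at 10 or 90 units of research is an order of magnitude smaller than at 50, matching the argument above.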

How bad this is depends on the specific way in which AI failure manifests, and how discontinuous the jump is from ‘normal’ AI to x-risk-threatening AI. The worst world is the one in which getting very near to safety manifests as AI being safe until there’s a jump to AGI, because in this world it’s likely that firms will be successfully building highly profitable products, and so expecting their next, more powerful, AI system to be safe. This world seems plausible to me if there are discontinuous jumps in capabilities as AI systems improve. Alternatively, there could be certain skills or pieces of knowledge, like knowing it’s in a training environment, that dramatically increase the risks from AI but are different from the problems faced by less powerful systems.

On the other hand, if we’re in a world where it’s touch and go whether we get safe AI and prosaic AI alignment turns out to be the correct strategy then AI safety research looks extremely positive.

This looks different if AI safety failures are continuous. In this case any research into AI safety reduces the harms from AI going wrong. I think it’s much less clear what this looks like. Potentially a good sketch of it is this blog post by Paul Christiano, where he describes AI catastrophe via Goodharting to death. Maybe the closer an AGI’s or TAI’s (transformative AI) values are to our own, the less harmful it is to fall prey to Goodhart’s law, because the thing being maximised is sufficiently positively correlated with what we truly value that it stays correlated even in a fully optimised world. I haven’t tried to properly work this out though.
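That intuition can be sketched numerically (my own toy model, not Christiano's): optimise a proxy that is imperfectly correlated with true value, and look at the true value of the proxy-optimal option. The better the correlation, the less value is lost to Goodhart's law under heavy optimisation.

```python
import random

def value_at_proxy_optimum(correlation, n=10000, seed=0):
    """Draw n options with standard-normal true values, score each with a
    noisy proxy whose correlation with true value is `correlation`, and
    return the true value of the option that maximises the proxy."""
    rng = random.Random(seed)
    noise = (1 - correlation ** 2) ** 0.5
    values = [rng.gauss(0, 1) for _ in range(n)]
    proxies = [correlation * v + noise * rng.gauss(0, 1) for v in values]
    best = max(range(n), key=lambda i: proxies[i])
    return values[best]

# A well-aligned proxy keeps most of the true value even when fully optimised;
# a weakly correlated proxy's optimum is mediocre in true value.
print(value_at_proxy_optimum(0.95), value_at_proxy_optimum(0.2))
```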

Implications for altruistically motivated AI research

There are a few quite different ways in which this could work. Extra research could just be additional research that wouldn’t have been done otherwise. This seems most likely to be the case for highly theoretical research that only becomes relevant to very powerful models, meaning there’s little incentive for current AI labs to do the research. This seems to most clearly fit agent foundations and multiagent failure research. This research has the property of applying to large numbers of different classes of models working on very different things. This means it displays strong public-good properties: anyone is able to use the research without it being used up. Traditionally, markets are believed to undersupply these kinds of goods.

At the other end of the scale is research done to prevent large language models saying racist things. There are only a very small number of firms able to produce commercially viable large language models, and it’s plausible you can find ways to stop these models saying racist stuff that don’t generalise very well to other types of safety problems. In this case firms capture a lot of the benefits of their research.

The (dis)value of research between these two poles depends on how useful the research is to solving pre AGI safety problems, whether failure is discrete or continuous, and how hard the safety problem is relative to the amount of research being done.

The best case for empirical research on currently existing models being valuable is that the safety problem is relatively easy, prosaic alignment is possible, this sort of safety research doesn’t advance capabilities in the ‘ability to do more stuff’ sense, and preventing x-risk from AGI is not all-or-nothing. In this world altruistic safety research would probably increase the total amount of relevant safety research done before AGI is deployed, and even if it means that AI is more likely to be deployed, that safety research will still have at least some effect, because failure is continuous rather than discrete.

The world where this is worst is one where AI alignment is very hard but key decision makers don’t realise this, safety is discrete, and we need fundamentally new insights about the nature of agency and decision making to get safe AGI. In this world it seems likely that safety research is merely making it more likely that an unsafe AGI will be deployed. Because the problem is so hard, the safety solution we find with relatively small amounts of research is likely to be wrong, meaning that the marginal contribution to reducing x-risk is small, but there’s quite a large effect on how likely it is that unsafe AI is deployed. The best case here is that altruistic safety research has a very small marginal impact because it’s replacing safety work that would be done anyway by AI companies - in this case the biggest effect is probably speeding up AI research, because these firms have more resources to devote to pure capabilities research.

The worst case for more abstract research, ignoring concerns about the difficulty of knowing that it’s relevant at all, is that it actually is relevant to nearly-but-not-quite AGI and so provides the crucial step of ensuring that these models are profitable, in a world where safety is a discrete property and AI safety is a really hard problem. This could easily be worse than the worst case for empirical alignment research, because it seems much more likely that this theoretical research wouldn’t be done by AI companies, both because currently this work is done (almost?) exclusively outside of industry and because it exhibits stronger public-good properties, since it isn’t relevant only to firms with current access to vast amounts of compute.

Why aren’t AI labs doing safety research already?

If AI labs weren’t doing any AI safety research currently, this would point to at least some part of the theory that capabilities and safety aren’t orthogonal being wrong. It’s possible that safety displays strong public-good properties, which means that safety research is much less likely to be done than other sorts of capabilities research. Basically though, I think AI safety research is being done today, just not of the sort that’s particularly relevant to reducing existential risk.

Victoria Krakovna has compiled a list of examples of AI doing the classic thing that people are worried about an AGI doing - taking some goal that humans have written down and achieving it in some way that doesn’t actually get at what humans want. The process of trying to fix these problems by making the goal more accurately capture the thing you want is a type of AI alignment research, just not the type that’s very helpful for stopping AI x-risk, and it’s highly specific to the system being developed - which is what would be predicted if more theoretical AI safety work had stronger public-good properties. This article gives a really good description of harm caused by distributional shift in a medical context; trying to change this, I think, should be thought of as a type of AI alignment research, in that it’s trying to get an AI system to do what you want, with the focus on changing behaviour rather than trying to make the model a better classifier when it’s inside its distribution.

Takeaway

I think this area is really complex and the value of research is dependent on multiple factors which interact with one another in non-linear ways. Option value considerations dictate that we continue doing AI safety research even if we’re unsure of its value because it’s much easier to stop a research program than to start one. However, I think it’s worthwhile trying to formalise and model the value of safety research and put some estimates on parameters. I think it’s likely that this will push us towards thinking that one style of AI research is better than another.

1. ^

This line is stolen from Ben Garfinkel. You can find his excellent slides, which inspired much of this article, here.

Discuss

### Examining Armstrong's category of generalized models

10 мая, 2022 - 12:07
Published on May 10, 2022 9:07 AM GMT

This post is my capstone project for the AI Safety Fundamentals programme. I would like to thank the organizers of the programme for putting together the resources and community which have broadened my horizons in the field. Thanks to my cohort and facilitator @sudhanshu_kasewa for the encouragement. Thanks also to @adamShimi, Brady C and @DavidHolmes for helpful discussion about the contents of a more technical version of this post which may appear in the future.

As the title suggests, the purpose of this post is to take a close look at Stuart Armstrong's category of generalized models. I am a category theorist by training, and my interest lies in understanding how category theory might be leveraged on this formalism in order to yield results about model splintering, which is the subject of Stuart's research agenda. This turns out to be hard, not because the category is especially hard to analyse, but because a crucial aspect of the formalism (that of which transformations qualify as morphisms) is not sufficiently determined to provide a solid foundation for deeper analysis.

Stuart Armstrong is open about the fact that he uses this category-theoretic formulation only as a convenient mental tool. He is as yet unconvinced of the value of a categorical approach to model splintering. I hope this post can be a first step to testing the validity of that scepticism.

A little background on categories

A category is a certain type of mathematical structure; this should not be confused with the standard meaning of the term! A category in this mathematical sense essentially consists of a collection of objects and a collection of morphisms (aka arrows), which behave like transformations in the sense that they can be composed.[1]  The reason this structure is called a category is that if I consider a category (in the usual sense) of structures studied in maths, such as sets, groups, vector spaces, algebras and so on, these typically come with a natural notion of transformation which makes these structures the objects of a category.

There are a number of decent posts introducing category theory here on LessWrong, and increasingly many domain-relevant introductions proliferating both online and in print, so I won't try to give a comprehensive introduction. In any case, in this post we'll mostly be examining what the objects and morphisms are in Stuart's category of generalized models.

John Wentworth likes to think about categories in terms of graphs, and that works for the purposes of visualization and getting a feel for how the structure of a generic category works. However, when we study a specific category, we naturally do so with the intention of learning something about the objects and morphisms that make it up, and we cannot make much progress or extract useful information until we nail down exactly what these objects and morphisms are supposed to represent.

Objects: Generalised models

I’m going to break these objects down for you, because part of the conceptual content that Stuart is attempting to capture here is contained in a separate (and much longer) post, and understanding the ingredients will be crucial to deciding what the full structure should be.

Each feature f∈F represents a piece of information about the world, and comes equipped with a set V(f) of possible values that the feature can take. As examples, features could include the temperature of a gas, the colour of an object, the location of an object, and so on.

Stuart defines a 'possible world' to be an assignment of subsets of V(f) to each feature f∈F. He constructs the set of possible worlds by first constructing the disjoint union F̄ := ∐_{f∈F} V(f) and then taking the powerset W = 2^F̄. Concretely, each element of this set consists of a choice of value for each feature; from the point of view of the generalized model, a world is completely characterized by its features.
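For readers who find it easier to see the construction executed, here is a minimal sketch with toy features of my own choosing (not from Stuart's post): the disjoint union F̄ and the powerset W can be enumerated directly when everything is finite.

```python
from itertools import chain, combinations, product

# Hypothetical toy features: each maps to its set V(f) of possible values.
features = {
    "temperature": {"low", "high"},
    "colour": {"red", "blue"},
}

# F-bar: the disjoint union of the value sets, as (feature, value) pairs.
f_bar = {(f, v) for f, vals in features.items() for v in vals}

# W = 2^F-bar: every subset of f_bar counts as a "possible world" here.
def powerset(s):
    s = list(s)
    return [set(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

worlds = powerset(f_bar)
assert len(worlds) == 2 ** len(f_bar)  # 2^4 = 16 worlds for these toy features

# The fully determined worlds pick exactly one value per feature.
determined = [set(zip(features, vs)) for vs in product(*features.values())]
assert all(w in worlds for w in determined)
```

Even in this tiny example, most elements of W are not "a choice of value for each feature" but more general subsets - which is exactly why the partial distribution Q is only defined on a sensible sub-collection of W.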

Finally, the partial probability distribution Q is intended to capture the model’s beliefs (heavy quote marks, because "belief" is a very loaded term whose baggage I want to avoid) about how the world works, in the form of conditional probabilities derived from relationships between features. Stuart appeals to physical theories like the ideal gas laws to illustrate this: if I have some distribution over the values of pressure and volume of a gas, I can derive from these a distribution over the possible temperatures. The distribution is only defined on a subset of the possible worlds for two reasons: one is that the full powerset of an infinite set is too big, so realistic models will only be able to define their distributions on a sensible collection of subsets of worlds; the other is that the relationships determining the distribution may not be universally valid, so it makes sense to allow for conservative models which only make predictions within a limited range of feature values.

It is interesting that the partial probability distribution is an extensional way of capturing the idea of a model. That is, while we might informally think of a model as carrying data such as equations, these are only present implicitly in the partial distribution, and another set of equations which produces the same predictions will produce identical generalized models. The distribution only carries the input-output behaviour of the model, rather than any explicit representation of the model itself. I think this is a wise choice, since any explicit representation would require an artificial choice of syntax and hence some constraints on what types of models could be expressed, which is all baggage that would get in the way of tackling the issues being targeted with this formalism.

A morphism (F,Q)→(F′,Q′), in Stuart’s posts, essentially consists of a relation between the respective sets of worlds. The subtlety is how this relation interacts with the partial distributions.

When we compare two world models, we base this comparison on the assumption that they are two models of the same ‘external’ world, which can be described in terms of features from each of the models. This is where the underlying model of a morphism comes from: it’s intended to be a model which carries all of the features of the respective models. Out of the set of all possible worlds for these features, we select a subset describing the compatible worlds. That subset R ⊆ 2^(F̄ ⊔ F̄′) ≅ 2^F̄ × 2^F̄′ is what we mean by a relation.

The first case that Stuart considers is the easiest case, in which the two generalized models are genuinely compatible models of a common world. In that case, the probability assigned to a given set of worlds in one world model should be smaller than the probability assigned to all possible compatible worlds in the other model. This appears to cleanly describe extensions of models which do not affect our expectations about the values of the existing features, or world models with independent features.

But there's a discrepancy here: in the models, the relationships between features are captured by the partial probability distribution, allowing for approximate or probabilistic relationships which are more flexible than strict ones. On the other hand, a relation between sets of possible worlds must determine in a much stricter yes/no sense which feature values in the respective models are compatible. This will typically mean that there are several possible probabilistic relationships which extend this relation of compatibility (some conditions will be formally compatible but very unlikely, say). As such, it is no surprise that when Stuart builds an underlying model for a morphism, the partial distribution it carries is not uniquely defined. A possible fix I would suggest here, which simultaneously resolves the discrepancy, would be to have morphisms being partial distributions over the possible worlds for the disjoint union of the features, subject to a condition ensuring that Q and Q′ can be recovered as marginal distributions[2]. This eliminates relations completely from the data of the morphism.
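To make the suggested fix concrete, here is a minimal discrete sketch (toy worlds and probabilities of my own invention, with everything finite, sidestepping the measure-theoretic caveats about marginals): a morphism is taken to be a joint distribution over pairs of worlds whose two marginals recover Q and Q′.

```python
from collections import defaultdict

# Model 1's and model 2's (total, for simplicity) distributions over worlds.
Q  = {"hot": 0.5, "cold": 0.5}
Qp = {"summer": 0.5, "winter": 0.5}

# A candidate morphism: a joint distribution over pairs of worlds.
joint = {
    ("hot", "summer"): 0.4,
    ("hot", "winter"): 0.1,
    ("cold", "summer"): 0.1,
    ("cold", "winter"): 0.4,
}

def marginal(joint, side):
    """Integrate out one side of the joint (side 0 or 1)."""
    m = defaultdict(float)
    for pair, p in joint.items():
        m[pair[side]] += p
    return dict(m)

def is_morphism(joint, Q, Qp, tol=1e-9):
    """Check the proposed condition: Q and Q' are the joint's marginals."""
    left, right = marginal(joint, 0), marginal(joint, 1)
    return (all(abs(left.get(w, 0.0) - p) < tol for w, p in Q.items())
            and all(abs(right.get(w, 0.0) - p) < tol for w, p in Qp.items()))

assert is_morphism(joint, Q, Qp)
```

Note how this replaces the yes/no relation with graded compatibility: the joint assigns ("hot", "winter") a small but nonzero probability, something no subset R could express.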

Setting that suggestion aside, we now come to the problem that the above compatibility relation is not the only kind of transformation we might wish to consider. After all, if the distribution represents our belief about the rules governing the world, we also want to be able to update that distribution in order to reflect changing knowledge about the world, even without changing the features or their sets of possible values. This leads Stuart to consider “imperfect morphisms”.

For these, Stuart still keeps a relation between respective sets of possible worlds around. The interpretation of this relation is no longer clear-cut to me, since it makes less sense to consider which feature values are compatible between models which contain a fundamental disagreement about some part of the world. Stuart considers various "Q-consistency conditions" on such relations, corresponding to different ways in which the relation can interact with the respective partial distributions. While it’s interesting to consider these various possibilities, it seems that none of them capture the type of relationship that we actually care about, as is illustrated by Stuart's example of a Bayesian update. Moreover, some finiteness/discreteness conditions need to be imposed in order for some of these conditions to make sense in the first place (consider how "Q-functional" requires one to consider the probability of individual states, which for any non-discrete distribution is not going to be meaningful), which restricts the generality of the models to a degree I find frustrating.

Conclusions

I think it should be possible to identify a sensible class of morphisms between generalized models which captures the kinds of update we would like to have at our disposal for studying model splintering. I'm also certain that this class has not yet been identified.

Why should anyone go to the trouble of thinking about this? Until we have decided what our morphisms should be, there is very little in the way of category theory that we can apply. Of course, we could try to attack the categories obtained from the various choices of morphism that Stuart presents in his piece on "imperfect morphisms", but without concrete interpretations of what these represent, the value of such an attack is limited.

What could we hope to get out of this formalism in the context of AI Safety? Ultimately, the model splintering research agenda can be boiled down to the question of how morphisms should be constructed in our category of generalized models. Any procedure for converting empirical evidence or data into an update of a model should be expressible in terms of constructions in this category. That means that we can extract guarantees of the efficacy of constructions as theorems about this category. Conversely, any obstacle to the success of a given procedure will be visible in this category (it should contain an abstract version of any pathological example out there), and so we could obtain no-go theorems describing conditions under which a procedure will necessarily fail or cannot be guaranteed to produce a safe update.

More narrowly, the language of category theory provides concepts such as universal properties, which could in this situation capture the optimal solution to a modelling problem (the smallest model verifying some criteria, for example). Functors will allow direct comparison between this category of generalized models and other categories, which will make the structure of generalized models more accessible to tools coming from other areas of maths. This includes getting a better handle on pathological behaviour that can contribute to AI risk.

Once I've had some feedback about the preferred solution to the issues I pointed out in this post, I expect to put together a more technical post examining the category of generalized models with tools from category theory.

1. ^

Here's a bit more detail, although a footnote is really not a good place to be learning what a category is. Each morphism has a domain (aka source) object and a codomain (or target) object, each object has an identity morphism (with domain and codomain that object), and a pair of morphisms in which the codomain of the first coincides with the domain of the second can be composed to produce a morphism from the domain of the first to the codomain of the second. This composition operation is required to be associative and have the aforementioned identity morphisms as units on either side (composing with an identity morphism does nothing).
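For those who prefer to see the axioms executed, here is the standard first example (nothing specific to Stuart's post): morphisms are ordinary functions, composition is function composition, and the associativity and identity laws can be spot-checked directly.

```python
# A minimal sketch of the category axioms, with Python functions as morphisms.
def compose(g, f):
    """g after f: apply f, then g (f's codomain must match g's domain)."""
    return lambda x: g(f(x))

identity = lambda x: x

f = lambda n: n + 1   # a morphism  int -> int
g = lambda n: n * 2   # a morphism  int -> int
h = lambda n: n - 3   # a morphism  int -> int

for n in range(10):
    # Associativity: (h . g) . f  ==  h . (g . f)  on every input we try.
    assert compose(compose(h, g), f)(n) == compose(h, compose(g, f))(n)
    # Identity laws: composing with the identity morphism does nothing.
    assert compose(identity, f)(n) == f(n) == compose(f, identity)(n)
```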

2. ^

I do not want to give a misleading impression that this solution is clear-cut, since obtaining marginal distributions requires integrating out variables, which is not going to be generally possible for a distribution/measure which is only partially defined. But I think this could be a guide towards a formal solution.

Discuss

### Dath Ilani Rule of Law

10 мая, 2022 - 09:17
Published on May 10, 2022 6:17 AM GMT

Minor spoilers for mad investor chaos and the woman of asmodeus.

Criminal Law and Dath Ilan

When Keltham was very young indeed, it was explained to him that if somebody old enough to know better were to deliberately kill somebody, Civilization would send them to the Last Resort (an island landmass that another world might call 'Japan'), and that if Keltham deliberately killed somebody and destroyed their brain, Civilization would just put him into cryonic suspension immediately.

It was carefully and rigorously emphasized to Keltham, in a distinction whose tremendous importance he would not understand until a few years later, that this was not a threat.  It was not a promise of conditional punishment.  Civilization was not trying to extort him into not killing people, into doing what Civilization wanted instead of what Keltham wanted, based on a prediction that Keltham would obey if placed into a counterfactual payoff matrix where Civilization would send him to the Last Resort if and only if he killed.  It was just that, if Keltham demonstrated a tendency to kill people, the other people in Civilization would have a natural incentive to transport Keltham to the Last Resort, so he wouldn't kill any others of their number; Civilization would have that incentive to exile him regardless of whether Keltham responded to that prospective payoff structure.  If Keltham deliberately killed somebody and let their brain-soul perish, Keltham would be immediately put into cryonic suspension, not to further escalate the threat against the more undesired behavior, but because he'd demonstrated a level of danger to which Civilization didn't want to expose the other exiles in the Last Resort.

Because, of course, if you try to make a threat against somebody, the only reason why you'd do that, is if you believed they'd respond to the threat; that, intuitively, is what the definition of a threat is.

It's why Iomedae can't just alter herself to be a kind of god who'll release Rovagug unless Hell gets shut down, and threaten Pharasma with that; Pharasma, and indeed all the other gods, are the kinds of entity who will predictably just ignore that, even if that means the multiverse actually gets destroyed.  And then, given that, Iomedae doesn't have an incentive to release Rovagug, or to self-modify into the kind of god who will visibly inevitably do that unless placated.

Gods and dath ilani both know this, and have math for defining it precisely.

Politically mainstream dath ilani are not libertarians, minarchists, or any other political species that the splintered peoples of Golarion would recognize as having been invented by some luminary or another.  Their politics is built around math that Golarion doesn't know, and can't be predicted in detail without that math.  To a Golarion mortal resisting government on emotional grounds, "Don't kill people or we'll send you to the continent of exile" and "Pay your taxes or we'll nail you to a cross" sound like threats just the same - maybe one sounds better-intentioned than the other, but they both sound like threats.  It's only a dath ilani, or perhaps a summoned outsider forbidden to convey their alien knowledge to mortals, who'll notice the part where Civilization's incentive for following the exile conditional doesn't depend on whether you respond to exile conditionals by refraining from murder, while the crucifixion conditional is there because of how the government expects Golarionites to respond to crucifixion conditionals by paying taxes.  There is a crystalline logic to it that is not like yielding to your impulsive angry defiant feelings of not wanting to be told what to do.

The dath ilani built Governance in a way more thoroughly voluntarist than Golarion could even understand without math, not (only) because those dath ilani thought threats were morally icky, but because they knew that a certain kind of technically defined threat wouldn't be an equilibrium of ideal agents; and it seemed foolish and dangerous to build a Civilization that would stop working if people started behaving more rationally.

"The United States Does Not Negotiate With Terrorists"

I think the idea Eliezer is getting at here is that responding to threats incentivizes threats. Good decision theories, then, precommit to never cave in to threats made to influence you, even when caving would be the locally better option, so as to eliminate the incentive to make those threats in the first place. Agents that have made that precommitment will be left alone, while agents who haven't can be bullied by threateners. So the second kind of agent will want to appropriately patch their decision theory, thereby self-modifying into the first kind of agent.
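The incentive argument can be made concrete with a toy payoff sketch (illustrative numbers of my own, not from the post): issuing a threat only pays if the target's policy is to cave, so against a precommitted resister the threat is never worth making.

```python
# Toy payoffs for the threatener (hypothetical numbers for illustration).
THREAT_COST = 1     # cost of carrying out (or having issued) an ignored threat
CAVE_TRANSFER = 5   # what the threatener extracts if the target caves

def threatener_value(target_policy):
    """Expected value to the threatener of issuing a threat,
    given the target's committed policy ('cave' or 'resist')."""
    if target_policy == "cave":
        return CAVE_TRANSFER   # threat succeeds without being carried out
    return -THREAT_COST        # threat ignored; issuing it was pure loss

# Against a target precommitted to resist, threatening is negative-value,
# so a rational threatener never issues the threat in the first place.
assert threatener_value("resist") < 0 < threatener_value("cave")
```

This is why the precommitment pays even though resisting is locally worse once a threat is actually on the table: the commitment changes which threats get made at all.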

Commitment Races and Good Decision Theory

Commitment races are a hypothesized problem in which agents might do better by, as soon as the thought occurs to them, precommitting to punishing all those who don't kowtow to their utility function, and promulgating this threat. Once this precommitted threat has been knowingly made, the locally best move for others is to cave and kowtow: they were slower on the trigger, but that's a sunk cost now, and they should just give in quietly.

I think the moral of the above dath ilani excerpt is that your globally best option[1] is to not reward threateners. A dath ilani, when so threatened, would be precommitted to making sure that their threatener gets less benefit in expectation than they would have playing fair (so as to disincentivize threats, so as to be less likely to find themselves so threatened):

That's not even getting into the math underlying the dath ilani concepts of 'fairness'!  If Alis and Bohob both do an equal amount of labor to gain a previously unclaimed resource worth 10 value-units, and Alis has to propose a division of the resource, and Bohob can either accept that division or say they both get nothing, and Alis proposes that Alis get 6 units and Bohob get 4 units, Bohob should accept this proposal with probability < 5/6 so Alis's expected gain from this unfair policy is less than her gain from proposing the fair division of 5 units apiece.  Conversely, if Bohob makes a habit of rejecting proposals less than '6 value-units for Bohob' with probability proportional to how much less Bohob gets than 6, like Bohob thinks the 'fair' division is 6, Alis should ignore this and propose 5, so as not to give Bohob an incentive to go around demanding more than 5 value-units.

A good negotiation algorithm degrades smoothly in the presence of small differences of conclusion about what's 'fair', in negotiating the division of gains-from-trade, but doesn't give either party an incentive to move away from what that party actually thinks is 'fair'.  This, indeed, is what makes the numbers the parties are thinking about be about the subject matter of 'fairness', that they're about a division of gains from trade intended to be symmetrical, as a target of surrounding structures of counterfactual actions that stabilize the 'fair' way of looking things without blowing up completely in the presence of small divergences from it, such that the problem of arriving at negotiated prices is locally incentivized to become the problem of finding a symmetrical Schelling point.

(You wouldn't think you'd be able to build a civilization without having invented the basic math for things like that - the way that coordination actually works at all in real-world interactions as complicated as figuring out how many apples to trade for an orange.  And in fact, having been tossed into Golarion or similar places, one sooner or later observes that people do not in fact successfully build civilizations that are remotely sane or good if they haven't grasped the Law governing basic multiagent structures like that.)
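The bargaining arithmetic in the Alis-and-Bohob excerpt can be checked directly: if Bohob accepts an x-for-Alis proposal with probability p, Alis's expected gain is x·p, which falls below her fair-split payoff of 5 exactly when p < 5/x. A quick sketch:

```python
# Numbers from the excerpt: a 10-unit resource, fair split 5/5,
# Alis proposes 6 units for herself.
FAIR_SHARE = 5

def max_accept_prob(alis_share):
    """Largest acceptance probability leaving Alis's expected gain from an
    unfair proposal no better than her fair-split payoff of 5."""
    return FAIR_SHARE / alis_share

p = max_accept_prob(6)
assert abs(p - 5 / 6) < 1e-9

# Accepting with any probability below 5/6 makes the 6/4 proposal worth
# less than 5 to Alis, e.g. 6 * 0.8 = 4.8 < 5.
assert 6 * 0.8 < FAIR_SHARE
```

Steeper unfair proposals must be accepted even less often (max_accept_prob(8) = 5/8), which is the sense in which the policy removes the incentive to propose anything above the fair share.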