# LessWrong.com News

A community blog devoted to refining the art of rationality

### Innovation, Stagnation, and Paratrooper Operations

Published on May 6, 2022 8:59 PM GMT

I recently came across a very interesting book: "When Failure Thrives: Institutions and the Evolution of Postwar Airborne Forces", by Dr. Marc DeVore. The short version of the book's thesis is that airborne parachute assaults have generally not been all that effective historically (with perhaps limited windows of effectiveness at various times), but that institutional politics and biases have often kept them in military training and doctrine despite this ineffectiveness.

This strikes me as an especially interesting case study in "civilizational inadequacy" type models, what structures can support (or stifle!) innovation, evaluations of what levels of play the military is operating on, and so on. Let's jump into it!

Military Obsolescence

DeVore opens by discussing how organizational inertia leads military establishments to remain attached to specific tactics or technologies well after those methods have become obsolete, resulting in poor performance against opponents who have adapted better to new developments.

One particularly striking case that I'm familiar with from previous study: several European armies still fielded cuirassiers -- horsemen equipped with metal breastplates and swords -- at the advent of World War One, long after such equipment was obsolete!

One might be surprised to learn that this photograph was taken in Paris in the year 1914!

DeVore later points out that obsolete tactics and technologies persist much longer in the military than in many other areas of human endeavor. Part of this is due to a simple lack of test data to draw from, which allows biased conclusions to run rampant:

Why then do obsolescent tactics and technologies persist within military organizations? The equivalent of such holdovers in the commercial sector—such as a large firm refusing to use container ships or the internet—is virtually unknown and would swiftly lead to bankruptcy. One reason for greater inertia in military organizations lies in the incomplete and intermittent nature of how military organizations are tested. Indeed, there is no certain method to ascertain how effective armed forces are short of forcing them to conduct a wide range of military operations against a wide variety of live opponents. Moreover, even the so-called lessons of recent wars are notoriously difficult to interpret because wars are comparatively rare and the nature of the opponents and geography encountered in the last conflict are unlikely to provide adequate proxies for the challenges that will characterize the next one...

It is, therefore, almost always possible for military organizations to ignore unpleasant truths by arguing that the circumstances of future wars will be more favorable to their preferred tactics and technologies. For example, in one particularly brash example of a military professional drawing biased conclusions from contemporary conflicts, British General John French summarily dismissed the need for reevaluating the cavalry’s role after their poor performance in the Boer War. To this end, French wrote, “It passes comprehension that some critics in England should gravely assure us that the war in South Africa should be our chief source of inspiration and guidance...we should be very foolish if we did not recognise at this late hour that very few of the conditions of South Africa are likely to recur.” However, as commander of the British Expeditionary Force at the outbreak of the First World War, French soon learned to his chagrin that the Boer War was a more accurate reflection of modern warfare than he anticipated.

(emphasis mine)

These biases make it very difficult to evaluate military developments properly. Internal career incentives exacerbate them further: junior officers want the organizations they are tied to to prosper in order to boost their own careers, while senior officers want to support the organizations they came up in, help the careers of their own protégés, and so on.

At the same time, though, some innovations are possible -- especially when institutional support is thrown behind the development of new methods rather than exalting the old ones!

Structures of Innovation

DeVore holds that one key factor in military innovation is the institutional structures that are used to advance and develop new technologies. When a sufficiently promising new technology emerges, commanders have various options as to how to try and harness it. In many cases, such technologies are integrated into existing military structures -- but DeVore holds that this can often be limiting, citing the example of tank warfare:

The invention of tanks in 1916 and subsequent improvements to their performance created opportunities for land warfare to be waged in radically new ways. Indeed, military theorists across the globe were quick to recognize tanks’ potential and most of the great powers had their own armored theorists...

However, while recognition of the tank’s tactical value was universal, the creation of armored forces was a much more uneven process. In many great powers, including Britain, France and the United States, the responsibility for employing tanks was assigned to two traditional service branches—the infantry and the cavalry. Contrary to certain misconceptions, both of these branches viewed tanks as potentially very useful. Nevertheless, they narrowly defined the tank’s role and technical requirements in terms of supporting preexisting infantry and cavalry missions. This meant that the infantry demanded tanks and armored units that were heavily armored, slow moving and optimized for supporting infantry assaults. Meanwhile, the cavalry developed tanks and armored units designed to substitute for the traditional horse cavalry missions of scouting and reconnaissance. In the American case, the cavalry even insisted on combining tanks and horses in hybrid units.

Unfortunately, entrusting the infantry and cavalry branches with tank development squandered their revolutionary potential. This became apparent when Germany launched its blitzkrieg campaigns in 1939-41. Rather than subordinating tanks to existing branches, the Germans created a dedicated armored branch, the Panzerwaffe, to exploit the new technology. In sharp contrast to the approach taken by existing branches, these special-purpose organizations exploited the full potential of armored vehicles for deep maneuvers and causing chaos in opponents’ rear areas. Consequently, although Germany’s armored forces were actually numerically inferior to those of their opponents in 1940 and 1941, they nevertheless dominated the battlefield and won remarkable victories.

In this case, DeVore believes that the German willingness to create a dedicated branch to develop the new technology gave them a powerful advantage over other powers who attempted to integrate it into existing military structures. However, it isn't always appropriate to create an entirely new branch for a new development. Other options include going further still and creating a new military service (as we saw with various countries creating Air Forces separate from their armies, navies, etc.) or taking the less dramatic step of creating special units within existing military structures to handle the new capabilities. In general, his view is that more revolutionary and powerful advancements should be established as "higher level" institutions, while less promising ones might be better served by being established at a lower level.

Crucially, it is possible to err on either side of this process -- an organization given too few resources will have difficulty attracting high-quality officers and developing the specialized equipment or methods necessary to achieve its goals, while an organization given too much will be costly and wasteful.

DeVore cites United States special operations forces prior to the creation of Special Operations Command (SOCOM) as an example of an underresourced project -- without strong institutional backing, these units had somewhat piecemeal capabilities. After the infamous failure of Operation Eagle Claw (the attempted Iranian hostage rescue), SOCOM was created to better integrate and support special operations capabilities, and these units were much more effective once provided with appropriate support.

On the other side of things, DeVore highlights the Soviet Union's National Air Defense Forces (dedicated anti-aircraft units) as an example of an over-resourced project. The Soviets elevated anti-aircraft warfare to the level of its own branch, separate from the Army, the Air Force, etc. This was quite different from the approaches taken in other countries, and it ultimately led to substantial waste as the National Air Defense Forces built redundant bureaucratic and technological structures -- ultimately even developing their own aircraft, missiles, etc. that were substantially similar to those used by the Army or Air Force but still separate from them!

DeVore holds that ironically, institutionalizing new capabilities is critical to innovation, but at the same time can actually be a cause of institutional inertia in the long run -- the institution that was once free to develop new capabilities can become part of the "establishment" in time, and the political clout and resources that it needed to develop those once-innovative capabilities can then become tools for defending the institution against needed changes!

In fact, the greater the autonomy and resources a military organization possesses, the better it will be at preserving itself when threatened by tactical / technical developments. Such is the case because both conscious and unconscious biases as well as individual self-interest leads military professionals to defend their organizations in times of adversity. Consequently, military leaders either pursue innovations that preserve their organizations’ existing missions, adapt to fulfill alternative roles, or rely on reputation and elite status alone to preserve their organizations. However, the nature of the survival strategies that organizations adopt is heavily conditioned by the institutional resources they possess, with more institutionalized organizations better able to preserve their autonomy and original essence.

DeVore cites three distinct "survival strategies" that military organizations can use in order to try and remain relevant when they are faced with major threats from progress:

1. The organization can invest in technological innovations that "promise to restore the validity of the organizations' core missions" -- new weapons, tactics, etc. (Example: The United States Air Force investing heavily in new technology and methods to counter air defenses and retain its strategic air power doctrine after encountering major problems against air defense in the early Vietnam War)

2. The organization can seek new roles and missions that make sense in the new context. (Example: The United States Marines started as naval infantry in an era of boarding actions between warships, transitioned to being an expeditionary imperial police force during a period of American imperialism and overseas interventions, transitioned from that to a focus on amphibious assaults, and are now something of a "fast response" combined-arms service.)

3. The organization can argue that its past contributions, traditions, and reputation/"elite" status are important and vital enough that it should not be disbanded even if its original role is obsolete. (Example: the British Green Jackets were originally specialized skirmishers and marksmen in the Napoleonic period, where most infantry fought in dense formations; changes in infantry tactics meant their role was much less meaningful and distinct but they retain at least some elements of their distinct status even now.)

Of these strategies, the first (innovation to preserve the original mission with new methods) is the most difficult and expensive but best preserves the organization's identity when successful, the last (reputation/status) is the easiest but risks losing the most distinctiveness, and the search for new roles and missions is in between the other two options.

Airborne Forces: A Case Study

With this basic framework established, DeVore shifts to describing the development of military airborne operations. (Note for those unfamiliar with military terminology -- "airborne" here refers to operations involving troops landing by parachute, military glider, etc., not to air warfare more generally.) Parachute forces had been theorized about during the later phases of World War One (following substantial improvements in aircraft and parachute technology); there had been some proposals to actually conduct airborne attacks late in that war, but they were not implemented prior to the cessation of hostilities.

During the "interwar period" (the time between World War One and World War Two), various militaries experimented with airborne operations and projects. The most innovative major military with respect to airborne operations was that of the Soviet Union, which conducted large-scale exercises and demonstrations of paratrooper tactics in the interwar years. Other states saw these demonstrations and began developing paratrooper ideas of their own, though only Germany joined the USSR in actually building large formations of airborne troops.

When World War Two broke out, these German paratroopers (Fallschirmjäger) were quite successful in early attacks in 1940, perhaps most notably in the Battle of Fort Eben-Emael, where the Fallschirmjäger landed on top of a strategic fortress using gliders, allowing them to circumvent much of its defenses and disable or distract its defenders while conventional forces crossed the bridges that the fort would have otherwise been able to bombard.

Seeing these early successes, all great powers began developing airborne forces in more earnest. At this phase it seemed like airborne operations were a major part of the future of warfare.

However, the Battle of Crete in 1941 yielded quite mixed results -- it was a victory for the invading Germans, but a Pyrrhic one, as the invaders took heavy casualties and lost many aircraft as well. Part of this result may have been due to major deficiencies in German paratrooper equipment and tactics. Fallschirmjäger parachutes were notably inferior to those adopted by other nations' airborne forces, being broadly unsteerable, and many Germans jumped without their primary weapons, which were dropped in separate containers. Making a parachute drop into a contested enemy area is already a very dubious prospect, and doing so without a primary weapon even more so -- only around 25% of them jumped with submachine guns, and the rest had perhaps knives, pistols, and grenades until they reached the canisters that held their rifles and machine guns.

Nevertheless, the battle was costly enough for Germany that despite their victory, the Germans concluded that the element of surprise that airborne assaults could convey had been lost, and indeed never again conducted major airborne operations except against partisans or guerrillas -- the Fallschirmjäger were mostly used as elite forces of "normal infantry" after this point. Ironically, their Allied opponents took much the opposite view, seeing Crete as a demonstration of the effectiveness of airborne forces and intensifying development of their own airborne units!

The USSR, despite its heavy investment in airborne operations prior to the war, was unable to make successful use of its airborne forces as such -- the three major Soviet airborne operations of the war (Viazma, Demiansk, and Dnepr) were all disasters which not only failed to achieve their objectives but also sustained over 60% casualties (!). The United States and United Kingdom were more successful in their airborne operations, but theirs was still by no means a record of unbroken victory.

Ultimately, DeVore holds that half of the major airborne operations of the war ended in outright failure, and a majority of the remaining operations were either indecisive or only Pyrrhic victories.

Smaller-scale operations were more successful than their larger counterparts as a whole, but again had very mixed results.

In retrospect, DeVore holds that the great successes airborne forces experienced early in the war against small neutral powers -- some of them achieved even then only with heavy casualties on the part of the Fallschirmjäger -- were probably taken as indicating that airborne forces were a much more dangerous and relevant tool than they actually were, while the results from larger-scale deployments were much less impressive.

If anything, this situation only grew worse following World War Two. Paratroopers are notably vulnerable to armored vehicles thanks to being more lightly equipped than conventional forces -- this was already a weakness in World War Two but has only become more pronounced as militaries became more and more armored and mechanized. In principle, airborne armored vehicles could be designed to counter this, and indeed some light tanks, armored cars, and "infantry fighting vehicles" (IFVs) have been developed that are capable of being dropped by air or landed by gliders. However, these designs are fundamentally constrained by weight and space restrictions (they have to fit into the transport plane and not weigh it down too much!) which make them inferior to conventional armor. Further, substantial advances in anti-aircraft weapons have made these operations more dangerous still.

For some time, paratroopers were still relevant in a counterinsurgency role despite their vulnerability in conventional warfare. Indeed, Germany successfully used its paratroopers to attack partisan/guerrilla sanctuaries during World War Two, while France made extensive use of paratroopers during the Indochina War. However, this too proved to be a relatively limited window of relevance, as the development of the helicopter and related infantry tactics proved far more effective in this role than paratrooper operations, and later the development of  man-portable air defense systems (MANPADS) -- shoulder-fired anti-aircraft missiles like the famous Stinger -- made airborne plans even more dubious. DeVore writes:

Nearly powerless against conventional armored forces and less efficient than helicopter-borne troops in a counterinsurgency role, airborne forces became relegated to increasingly marginal theaters of operation. In fact, paratroops retained true value only in operations conducted at great distances (i.e. beyond helicopter range) and against ill-equipped irregular forces. These factors marked four-fifths of the airborne operations conducted during the 1960s and 1970s, including two Belgian hostage rescue operations in the Congo (1964-65), France’s intervention against Zairian rebels (1978), and South Africa’s raid on a guerrilla base in Angola (1978). However, MANPADS and better armament eventually found their way to even Africa’s insurgents, eliminating the last viable arena for airborne operations.

The last paratrooper attacks by the United States against enemy-held targets were in 1983 and 1989 (Grenada and Panama respectively), and in both cases the US forces were so wildly superior in strength to their adversaries that an airborne attack seemed only dubiously needed. DeVore sums up:

Over the course of their existence, airborne forces have gone from a revolutionary participant in high-intensity warfare during the early 1940s, to a tool for counterinsurgency campaigns in the 1950s, until ultimately being reduced, in the 1960s, to operating against the world’s least sophisticated armed forces. It would be reasonable to expect individual state’s airborne forces to evolve in a manner consonant with this global trend, implying a gradual decline in the size of airborne forces from the 1940s until the 1960s.

This decline, though, is not quite what we see. DeVore focuses on the Soviet Union, the United States, and the United Kingdom. Of these, the Soviet airborne forces continued to be very large and well supplied, and the United States' airborne force size has fluctuated but remains a significant part of the US military. While the UK did substantially reduce its airborne forces, that decline started during a period when airborne forces still seemed quite effective! DeVore hypothesizes that these force changes are simply not based on actual military outcomes, but rather on differences in institutional strength.

Post-War Soviet Airborne Developments

The Soviets had the worst track record of any of these nations in their World War Two airborne operations, with no dramatic major successes and multiple failed and costly operations. However, they had a very substantial political advantage -- the Soviets had invested heavily in the development of airborne tactics, to the point where the Soviet airborne forces administration, the VDV, was a separate branch of their military! DeVore claims that "In every respect, the VDV’s institutional power exceeded that of any foreign airborne force and compared favorably with the Marines’ status in the United States."

This institutional power led to higher quality recruits, specialized training programs, high prestige, and so on -- and in turn it allowed the VDV to successfully blame the dramatic wartime failures on resource and equipment constraints, which they claimed could be overcome by investment in better transport aircraft, armored vehicles and heavy weapons that could be dropped with the paratroopers, etc.

Indeed, the post-war Soviet Union at least notionally developed innovative tactics for dropping armored vehicles with their crews inside so that they could join the battle almost immediately! (I personally would not at all want to try this even in peacetime, much less in a battle -- for those who want to see what it looks like, there is a purported clip in which a BMD-2 IFV is dropped at the 0:26 mark.)

In any case, the effectiveness of paradrop-capable armored vehicles seems considerably lower than that of conventional armor, and the effectiveness of these tactics in an actual war seems very dubious. But the VDV has managed to retain its prestige and continue these developments regardless, even after the fall of the USSR.

Post-War United Kingdom Airborne Developments

In the UK, on the other hand, airborne forces declined rapidly soon after World War Two despite a more successful track record than that of the Soviet Union. DeVore believes this is thanks to the UK's weak institutional commitment to airborne forces. The UK paratrooper forces were led by relatively junior officers, faced obstruction from the RAF when it came to training and transport aircraft development, and so on -- indeed, these forces were only as successful as they were thanks to repeated direct interventions by Winston Churchill himself to allocate them more resources!

After the war and with this exceptional support no longer present, British paratroop capabilities greatly deteriorated:

The Parachute Regiment failed to persuade the RAF to design transports with rear-loading doors for parachuting men and equipment, was unable to acquire sufficient training flights for its men to jump more than once annually, and lacked the resources to procure specialized airborne equipment. As a consequence, the aptitude of Britain’s remaining parachute brigade to conduct an airborne operation deteriorated. When ordered to parachute into Egypt in 1956, British paratroops were obliged to scour museums for Second World War-vintage airborne equipment and only succeeded in achieving their objectives thanks to the deficiencies of their opponents.

While British paratroopers still exist, they are basically employing the third of the institutional survival strategies mentioned earlier, where one remains relevant by a reputation for being elite, memories of past glories, and so on -- an option appealing primarily, in DeVore's view, to the institutionally weakest forces.

Post-War United States Airborne Developments

In the United States, airborne forces enjoyed privileged and elite status, specialized equipment, etc. during World War Two -- and in part this elite status was "self-rewarding" because it attracted some of the best officers to the paratroopers, who later rose to high positions after the war and formed an informal clique that supported airborne forces amidst postwar cutbacks. However, they did not have the same degree of institutionalization and support as the VDV.

Ultimately, the US airborne forces were able to adapt to new mission profiles. When helicopter-borne warfare emerged as an area of potential development, it was the airborne forces who led it, though they ironically lost some of that role to the cavalry (!) -- later, successful political lobbying allowed the US airborne community to engage in victorious but dubiously necessary assaults in Grenada and Panama.

In sum, despite the fact that operational necessities have not justified any of the United States’ airborne operations since the Korean War, the occasional conduct of such operations in benign environments has fostered the illusion that airborne forces still have an important role to play in modern warfare. Nevertheless, the ability of American airborne forces to redefine and restructure themselves in keeping with shifts in American grand strategy proved more fundamental to their survival. Indeed, American airborne forces’ current efforts to redefine themselves as a force capable of responding to the challenge posed by Chinese anti-access/area denial capabilities must be viewed in this broader context.

In short, the US airborne forces enjoyed substantial institutional support -- not as much as the VDV but not as little as the UK paratroops did -- and this allowed them to survive and adapt despite major changes in the environment around them.

Conclusions

Ultimately, DeVore concludes that the most relevant factor in sustaining airborne forces as organizations has not been their actual results, but rather their institutional strength. The Soviets had the most institutional support for their paratroopers, so they maintained large commitments there despite bad results; the UK had little institutional support for its airborne units, and that capability was broadly neglected even when it seemed at least somewhat effective. DeVore's view is that airborne operations are broadly obsolete but that they have managed to survive and adapt to one degree or another in environments where they benefited from strong institutional support.

Ironically, the same institutions that can help drive innovation early on can ultimately become ones that prevent obsolete methods from fading away -- indeed, the Soviet Union's early leadership and innovation in airborne warfare may have led to it being saddled with a large airborne force well after that force's obsolescence.

The reason why wartime performance had so little impact on post-war policy outcomes lies in the ambiguous nature of after-action assessments and the role of institutional factors in determining what lessons were officially drawn. Within this context, analysts in all three of the countries had great difficulty disentangling the different factors that led to the success or failure of individual operations. Moreover, even when the determinants of success or failure were understood, they could be interpreted in multiple ways.

For example, when Soviet strategists evaluated the disastrous Dnepr (1943) airborne operation, they had to decide whether the operation failed because the basic concept of such an assault was flawed or whether it failed because of other factors, such as paratroops being inadequately equipped or Soviet armored forces being too slow in breaking through the German front line. Likewise, when British planners evaluated the success of the Normandy airborne drops (1944) it was difficult for them to determine whether the operations contributed in their own right to the overall campaign or whether their success was itself dependent on assistance from other combat arms, such as naval gunfire support and rapid relief by amphibious units.

The nature of the lessons that each state drew from its experiences was shaped by airborne forces’ institutional roles within their respective military high commands. Where airborne forces possessed a great deal of institutional clout, such as in the Soviet Union, they succeeded at determining how wartime experience was interpreted. This meant, in the Soviet case, that the airborne assault mission itself remained sacrosanct and that wartime failures were attributed to inadequate equipment and training. Interpreted in this way, poor wartime performance became a justification for greater resources in peacetime.

In sharp contrast to the lessons drawn in the Soviet Union, the institutionally weak position of British airborne forces meant that their utility was continually questioned despite their better wartime performance. The singular failure of Operation Market Garden (1944), for example, was exploited ruthlessly by airborne forces’ opponents to argue that they were no longer a worthwhile combat arm for large scale warfare. Thus, as demonstrated by these examples, wartime experience has little independent bearing on post-war policy outcomes because military organizations use whatever institutional power they possess to instrumentalize the wartime record to their own ends.

(emphasis mine)

In other words, the continued existence of large scale airborne forces is a failure of rationality -- and lest you think I'm reading too much of a LessWrong or CFAR perspective into this, DeVore uses the exact phrase "failure of rationality"! I view this as a prime case study for Yudkowsky-style "civilizational inadequacy". Not only are the institutions in question here inadequate, if DeVore is right the very same factors that lead to innovation can ultimately become forces of stagnation! Giving institutional power to new and promising groups can drive change at first -- however, in the long term that institutional power may ironically be turned to protect obsolete interests instead.

One potential solution that DeVore recommends is having external evaluators assess these things rather than the groups themselves. For instance, SOCOM was developed in the United States following a disastrous mission that brought significant outside scrutiny onto the special operations community. He further mentions a German historian, Hans Delbrück, who argued that official war histories should be handled not by the military but by academic historians, as the latter face much less pressure to defend or vindicate certain decisions, doctrines, strategies, etc.

A similar principle could be applied to various less dramatic fields. For instance, having external evaluators assess the impact of a project or charity can be a good way to try and maintain more objective criteria; insofar as those evaluators are themselves tied to the project in question, the effectiveness of this analysis becomes more questionable.

Ultimately, I found this book a fascinating look at the institutional dynamics that can lead both to innovation and stagnation -- and disquietingly, if DeVore is right those two dynamics can perhaps be one and the same! DeVore claims similar principles are true in business as well -- I have yet to look into that but intend to explore further, as I consider this area quite important for coordinating solutions to problems in the world.

Postscript

As an addendum to the above, I want to add that this book's thesis has to an extent been tested since its publication -- it appears that the Russian airborne forces, despite their huge institutional prestige and investments in specialized equipment, have experienced another set of disastrous operations in the current Ukraine conflict. Russian airborne forces have repeatedly failed to capture and hold their objectives in Ukraine, and the effectiveness of their tactics has been called into question. This book was published in 2015, well prior to these events, but I don't think that DeVore would have been at all surprised to see that outcome.

Discuss

6 мая, 2022 - 18:10
Published on May 6, 2022 3:10 PM GMT

After getting a question in the BIDA Facebook group, I was curious what mask policies contra dances are using. I looked at the dances marked as active on trycontra.com and checked their websites for mask requirements (sheet).

Of the 56 dances that have resumed, 31 (55%) require masks. Of those 31:

• 2 (6%) require surgical or better.
• 4 (13%) require a surgical + cloth or better.
• 4 (13%) require high-filtration masks (N95, KN95, KF94, etc)

Now that high-filtration masks are widely available, it does seem like a weird compromise to require masking but allow low-filtration options like cloth or surgical, especially when I haven't seen anyone wearing a P100. Specifically:

• Most of society is no longer requiring masks: bars, nightclubs, workplaces, transit, etc. In MA, one of the more cautious states on this issue, the state requires masks only in healthcare, paratransit, shelters, and jails.

• This means that the reason to require masks at dances is to allow people to attend for whom it would otherwise be too risky.

• A group wearing surgical masks poses a risk to an individual (wearing the mask of their choice) that is roughly 1/4 (per microcovid) of what it would be if the group were fully unmasked.

• An individual wearing a P100 ($16) is (again, per microcovid) at about 1/7 the risk of one wearing a high-filtration mask. This means that if a group switches from masks-required to masks-optional and more cautious individuals switch to P100s, risk to those individuals very likely goes down. Which then has me wondering: why do we see people saying that dances need to require masks, but not wearing P100s? Some guesses:

• People have adjusted to high-filtration masks now being available, but not to the availability of P100s.

• The P100 masks are more expensive up front. On the other hand, the part you wear lasts years, the replacement filters are ~$6/pair, and a filter pair lasts much longer than a disposable high-filtration mask, so the cost should be similar or lower over time. I also suspect that in most communities, people who prefer dancing without a mask would be willing to cover the cost of P100s for people who need them, if we could sort out a good way to do this.

• People may think P100s are less comfortable. My experience is that they are a bit more comfortable: slightly more pressure on the face but more spread out, and much less resistance to breathing.

• They have vents. This is an issue in places that require masks, since masks with vents are usually prohibited. Microcovid estimates that they provide a small amount of filtration on exhaust, about the same as a well-fitting cloth mask and about 3/4 as much as a surgical mask. I think they probably shouldn't be prohibited unless you're also disallowing cloth masks? This is also not an issue if masks are optional.

• They look weird. That, I will definitely grant, but I don't think that is enough of a reason to require everyone else to wear masks? And, of course, a dance full of masked people would have looked pretty weird in January 2020.
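The arithmetic behind these bullets can be made explicit. This is a minimal sketch that treats the two microcovid-derived multipliers quoted above (group surgical masking ≈ 1/4 the risk of an unmasked group; P100 wearer ≈ 1/7 the risk of a high-filtration-mask wearer) as assumed inputs; the function names are invented for illustration.

```python
# Assumed multipliers, taken from the microcovid-derived figures in the post:
GROUP_SURGICAL = 1 / 4        # group in surgical masks, vs. unmasked group
WEARER_P100_VS_HIFI = 1 / 7   # wearer-side: P100 vs. high-filtration mask

def relative_risk(group_multiplier: float, wearer_multiplier: float) -> float:
    """Risk to one attendee, relative to an unmasked group and baseline mask."""
    return group_multiplier * wearer_multiplier

# Cautious attendee in a high-filtration mask, group required to mask (surgical):
masked_group = relative_risk(GROUP_SURGICAL, 1.0)       # 0.25

# Same attendee upgrades to a P100, group goes masks-optional:
optional_group = relative_risk(1.0, WEARER_P100_VS_HIFI)  # ~0.14

# 1/7 < 1/4: matching the post's claim that switching to masks-optional
# plus P100s can lower risk for the cautious individual.
```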

Overall, I think a policy of optional masks and subsidized P100s would be much better than just "masks required" (currently the most common thing for dances to do). I think it's likely also better than requiring high-filtration masks for everyone, but I'm less confident there.

Discuss

### The case for becoming a black-box investigator of language models

6 мая, 2022 - 17:35
Published on May 6, 2022 2:35 PM GMT

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.

I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.

Some main activities I’d do if I was a black-box LM investigator are:

• Spend a lot of time with them. Write a bunch of text with their aid. Try to learn what kinds of quirks they have; what kinds of things they know and don’t know.
• Run specific experiments. Do they correctly complete “Grass is green, egg yolk is”? Do they know the population of San Francisco? (For many of these experiments, it seems worth running them on LMs of a bunch of different sizes.)
• Try to figure out where they’re using proxies to make predictions and where they seem to be making sense of text more broadly, by taking some text and seeing how changing it changes their predictions.
• Try to find adversarial examples where they see some relatively natural-seeming text and then do something really weird.
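The first two kinds of experiment above can be organized into a tiny, model-agnostic harness. This is a hypothetical sketch: `run_probes`, `perturbation_probe`, and the `complete` callable are illustrative names, not any existing tool; `complete` stands in for whatever black-box completion interface (API wrapper, local model) you have access to.

```python
from typing import Callable, Dict, List, Tuple

def run_probes(complete: Callable[[str], str],
               probes: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Run (prompt, expected_substring) probes and record pass/fail."""
    results = []
    for prompt, expected in probes:
        output = complete(prompt)
        results.append((prompt, expected.lower() in output.lower()))
    return results

def perturbation_probe(complete: Callable[[str], str],
                       prompt: str, edits: List[str]) -> Dict[str, str]:
    """See how small edits change the completion -- a cheap way to spot
    where the model relies on surface proxies rather than meaning."""
    return {variant: complete(variant) for variant in [prompt] + edits}

# Example probes of the kind mentioned above; it's worth running these
# against models of several different sizes and comparing.
probes = [
    ("Grass is green, egg yolk is", "yellow"),
    ("The capital of France is", "Paris"),
]
```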

The skills you’d gain seem like they have a few different applications to alignment:

• As a language model interpretability researcher, I’d find it very helpful to talk to someone who had spent a long time playing with the models I work with (currently I’m mostly working with gpt2-small, which is a 12 layer model). In particular, it’s much easier to investigate the model when you have good ideas for behaviors you want to explain, and know some things about the model’s algorithm for doing such behaviors; I can imagine an enthusiastic black-box investigator being quite helpful for our research.
• I think that alignment research (as well as the broader world) might have some use for prompt engineers–it’s kind of fiddly and we at Redwood would have loved to consult with an outsider when we were doing some of it in our adversarial training project (see section 4.3.2 here).
• I’m excited for people working on “scary demos”, where we try to set up situations where our models exhibit tendencies which are the baby versions of the scary power-seeking/deceptive behaviors that we’re worried will lead to AI catastrophe. See for example Beth Barnes’s proposed research directions here. A lot of this work requires knowing AIs well and doing prompt engineering.
• It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive (though I definitely wouldn’t want to rely on it to solve the whole problem). When we’re building actual AIs we probably need the red team to be assisted by various tools (AI-powered as well as non-AI-powered, eg interpretability tools). We’re working on building simple versions of these (e.g. our red-teaming software and our interpretability tools). But it seems pretty reasonable for some people to try to just do this work manually in parallel with us automating parts of it. And we’d find it easier to build tools if we had particular users in mind who knew exactly what tools they wanted.
• Even transformatively intelligent and powerful AIs seem to me to be plausibly partially understandable by humans. Eg it seems plausible to me that these systems will communicate with each other at least partially in natural language, or that they'll have persistent memory stores or cognitive workspaces which can be inspected.

My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.

Discuss

### Open Problems in Negative Side Effect Minimization

6 мая, 2022 - 12:37
Published on May 6, 2022 9:37 AM GMT

Acknowledgments

We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post.

Fabian Schimpf and Lukas Fluri are part of this year’s edition of the AI Safety Camp. Our gratitude goes to the camp organizers: Remmelt Ellen, Sai Joseph, Adam Shimi, and Kristi Uustalu.

TL;DR

Negative side effects are one class of threats that misaligned AGIs pose to humanity. Many approaches have been proposed to mitigate or prevent negative side effects of AI systems. In this post, we present three requirements that a side-effect minimization method (SEM) should fulfill to be applicable in the real world and argue that current methods do not yet satisfy them. We also propose future work that could help satisfy these requirements.

Introduction

Avoiding negative side-effects of agents acting in environments has been a core problem in AI safety since the field started to be formalized. Therefore, as part of our AI safety camp project, we took a closer look at state-of-the-art approaches like AUP and Relative Reachability.

After months of discussions, we realized that we were confused about how these (and similar methods) could be used to solve problems we care about outside the scope of the typical grid-world environments.

We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked.

This post attempts to summarize these points and provide structured arguments to support our critique. Of course, we expect to be partially wrong about this, as we updated our beliefs even while writing up this post. We welcome any feedback or additional input to this post.

The sections after the summary table and anticipated questions contain our reasoning for the selected open problems and do not need to be read in order.

Background

The following paragraphs make heavy use of the following terms and side-effect minimization methods (SEMs). For a more detailed explanation, we refer to the provided links.

MDP: A Markov Decision Process is a 5-tuple $\langle S, A, T, R, \gamma \rangle$ of states, actions, a transition function, a reward function, and a discount factor. In the setting of side-effect minimization, the goal generally is to maximize the cumulative reward without causing (negative) side-effects.

RR: In its simplest form, Stepwise Relative Reachability is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - \lambda \cdot d_{RR}(s_{t+1}, s'_{t+1})$, where $d_{RR}(s_{t+1}, s'_{t+1}) = \frac{1}{|S|} \sum_{s \in S} \max\bigl(R(s'_{t+1}; s) - R(s_{t+1}; s),\, 0\bigr)$ is a deviation measure punishing the agent if the average "reachability" of the states of the MDP has been decreased by taking action $a_t$ compared to taking a baseline action $a_{nop}$ (like doing nothing). The idea is that side-effects reduce the reachability of certain states (e.g., breaking a vase makes all states that require an intact vase unreachable), so punishing such a decrease in reachability also punishes the agent for side-effects.
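As a concrete illustration, here is a toy sketch of the stepwise RR penalty under strong simplifying assumptions: the reachability values $R(x; s)$ are given as a lookup table (in practice they must be learnt or approximated), and all names (`rr_deviation`, `shaped_reward`, the vase states) are invented for this example.

```python
def rr_deviation(reach, s_next, s_baseline, states):
    """d_RR: average decrease in reachability of states, relative to the
    no-op baseline successor state. reach[x][s] ~ reachability of s from x."""
    return sum(max(reach[s_baseline][s] - reach[s_next][s], 0.0)
               for s in states) / len(states)

def shaped_reward(base_reward, lam, reach, s_next, s_baseline, states):
    """r = R - lambda * d_RR: the composed reward penalizing lost reachability."""
    return base_reward - lam * rr_deviation(reach, s_next, s_baseline, states)

# Example: breaking a vase makes the "vase intact" state unreachable,
# so the action is penalized relative to doing nothing.
states = ["vase_intact", "vase_broken"]
reach = {
    "broke_vase":  {"vase_intact": 0.0, "vase_broken": 1.0},
    "did_nothing": {"vase_intact": 1.0, "vase_broken": 1.0},
}
penalty = rr_deviation(reach, "broke_vase", "did_nothing", states)  # 0.5
```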

AUP: Attainable Utility Preservation (see also here and here) is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - \lambda \cdot d_{AUP}(s_t, a_t, s_{t+1})$, where $d_{AUP}(s_t, a_t, s_{t+1}) = \frac{1}{N} \sum_{R_i \in \mathcal{R}} \bigl|Q_{R_i}(s_t, a_t, s_{t+1}) - Q_{R_i}(s_t, a_{nop}, s'_{t+1})\bigr|$ is a normalized deviation measure punishing the agent if its ability to maximize any of its provided auxiliary reward functions $R_i \in \mathcal{R}$ changes by taking action $a_t$ compared to taking a baseline action $a_{nop}$ (like doing nothing). The idea is that the true (side-effect-free) reward function (which is very hard to specify) is correlated with many other reward functions. Therefore, if the agent's ability to maximize the auxiliary reward functions $R_i \in \mathcal{R}$ is preserved, chances are high that the true reward function is preserved as well.
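Similarly, here is a toy sketch of the AUP penalty, assuming the auxiliary Q-values have already been learnt and are available as tables; the function name and the numbers are illustrative, not taken from the AUP paper.

```python
def aup_deviation(q_values, action, baseline="noop"):
    """d_AUP: mean absolute change in auxiliary attainable utility when
    taking `action` instead of the baseline no-op action.
    q_values[i][a] ~ the learnt auxiliary Q-value Q_{R_i}(s_t, a)."""
    return sum(abs(q[action] - q[baseline]) for q in q_values) / len(q_values)

# Two auxiliary reward functions; a drastic action shifts both Q-values,
# so it incurs a penalty even though a true reward function isn't specified.
q_values = [
    {"noop": 1.0, "smash": 0.2},   # auxiliary task 1
    {"noop": 0.5, "smash": 0.9},   # auxiliary task 2
]
penalty = aup_deviation(q_values, "smash")  # (0.8 + 0.4) / 2 = 0.6
```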

FT: In its simplest form, Future Tasks is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) + \lambda \cdot d_{FT}(s_t, a_t, s_{t+1})$, where $d_{FT}(s_t, a_t, s_{t+1}) = \frac{1}{|S|} \cdot D(s_t) \cdot \sum_{i}^{|S|} V_i^*(s_t, s'_t)$ is a normalized deviation function rewarding the agent if its ability to maximize any of its provided future-task rewards $V_i^*(s_t, s'_t)$ is preserved in comparison to if the agent had remained idle from the very beginning (which would have led it to the state $s'_t$ instead). The idea is similar to RR and AUP in that side-effects reduce the ability of the agent to fulfill certain future tasks; by rewarding the agent for preserving its ability to pursue future tasks, the hope is that this will also discourage it from creating side-effects. In contrast to the previous two methods, the future-tasks method compares the agent's power to a counterfactual world where the agent was never turned on until the current time step $t$.

Summary

In the following four sections, we’re going to define what the goal of a side-effect minimization method should be. We then argue that to apply a side-effect minimization method in the real world, it needs to satisfy (among other things) the following three requirements:

• An SEM should provide guarantees about its safety before it is allowed to act in the real world for the first time. More generally, it should clearly state its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents).
• An SEM needs to work in partially observable systems with uncertainty and chaotic environments.
• An SEM must not prevent all high-impact actions, as taking high-impact actions might be necessary in some cases (especially in multi-agent scenarios).

We tried to split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms. An analysis of three state-of-the-art side-effect minimization methods shows that none of them can fulfill all three requirements, with some partially solving one of the requirements. Our analysis of the three SEM methods is summarized below:

RR

• Guarantees: ❌ Reachability and value functions have to be approximated and learnt during the exploration phase. ❌ Only empirical evidence on a small set of small environments is provided.
• Partial Observability and Chaos: ❌ Method requires complete observability in the form of an MDP. ❌ Hard to scale even beyond gridworlds. ❌ Method requires policy rollouts, which are impossible to compute properly due to the accumulation of uncertainties.
• High-Impact Interference: ❌ Method makes no distinction between good and bad high impact. (❌) The authors point out interference as one of the main problems that RR addresses; however, depending on the choice of baseline, the results can vary.

AUP

• Guarantees: ❌ Auxiliary Q-values have to be learnt during the exploration phase. (✅) Some guarantees about how to safely choose the impact degree of an agent. (✅) Guarantees that Q_R_AUP converges with probability one.
• Partial Observability and Chaos: ❌ Method requires policy rollouts, which are impossible to compute properly due to the accumulation of uncertainties. (❌) The current method requires complete observability in the form of an MDP; however, it should work if you are able to learn a value function in your environment.
• High-Impact Interference: ❌ Method makes no distinction between good and bad high impact. ❌ Strives for non-interference and corrigibility.

FT

• Guarantees: ❌ Auxiliary Q-values have to be learnt during the exploration phase. ❌ Only empirical evidence on a small set of small environments is provided.
• Partial Observability and Chaos: ❌ Method requires complete observability in the form of an MDP. ❌ The accumulation of uncertainties makes it impossible to properly compute the future-task reward.
• High-Impact Interference: ❌ Method makes no distinction between good and bad high impact. ❌ The presence of other agents impacts the baseline and thus weakens/breaks the safety guarantees (see the Appendix).

Anticipated Questions

Why do you only analyze these three methods shown above?

There are about ten different side-effect minimization approaches, including impact regularization, future tasks, human feedback approaches, inverse reinforcement learning, reward uncertainty, environment shaping, and others. We chose to limit ourselves to the three methods above because they seem to embody the field's state of the art, and we wanted to keep the scope concise and readable. We expect our results to generalize in that none of the existing methods can feasibly satisfy all three requirements. However, it might be possible for individual methods to fulfill some of them partially.

Can you provide any empirical evidence for your claims about the behavior of current SEM methods?

We have not yet done any experiments to support our claims, and chose to provide only arguments and intuition for now. If our ideas prove to have merit, we will look to support them further with experiments.

Why High-Impact Interference?

Our argumentation may not be consistent with current desiderata for AGI development. However, the question boils down to whether we expect a potential aligned AI to guard humanity against other (unaligned) AIs, or whether we expect to find another way of safeguarding humanity against this threat. Without leveraging an AI to do our bidding, it seems that not developing AGIs and banning progress on AI research would be the alternative.

Goals of Side-Effect Minimization

Axiom 1: There are practically infinitely many states in the universe

Axiom 2: Practically, we can only assign calibrated, human-aligned values to a small subset of these states. Intuition for this:

1. One fundamental limitation is that the number of states is infeasibly large, and our (and the agent’s) time is limited.
2. Even with value learning or Bayesian priors, it is tough to assign correct (calibrated and human-aligned) values to an almost infinite number of states.

Axiom 3: Not knowing or ignoring the value of some states can lead to catastrophic side-effects for humans

Conclusion 1: We need to make sure that states not considered in our rewards/values are not changed in a “bad” way just because we “forgot” / were not able to include them in our reward function (axioms 1 & 2)

Conclusion 2: Therefore, we need a way of abstractly assigning value to the world with “blanket statements” that avoid catastrophic side effects of the unbounded pursuit of rewards (axioms 1 & 2, conclusion 1)

Open Problems

Side-Effect Minimization Guarantees

In this section, we argue that an SEM should provide guarantees about its safety before it is allowed to act in the real world. More generally, it should give guarantees on its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms
• Axiom 1: We want an AGI to ultimately act in the real world. Therefore, there will be a first interaction of the developed system with the real world. Intuition for this:
1. Boxed AGIs and Oracle AGIs also need to interact with the real world; their means of interaction are just restricted (see, for example: Nick Bostrom, Superintelligence, chapter 10)
2. Predecessor versions of the AGI or individual submodules might already have had contact with the real world before. This doesn't change the fact that, at some point, this version of the AGI will make contact for the first time.
• Axiom 2: We currently think it is impossible to guarantee that an AGI is prepared for its future task without letting it interact with the real world. Intuition for this
• Every development environment is a strict subset of the real world.
• It is impossible to simulate everything from the real world in your development environment.
• Some competencies can likely only be acquired through interaction with the real world.
• These competencies may not be simulatable or are only simulatable in approximated form.
1. Even if it were possible to provide enough information in the development environment such that the AGI could potentially solve the task correctly, there is still the risk of potential betrayal by the AGI.
2. Not letting the AGI directly interact in its future deployment environment (e.g., the real world) will lead to model splintering/distribution shift.
• Sources: See this book for an overview of distribution shift and this post for a definition of model splintering
• Predecessor versions of the AGI or individual submodules might already have had contact with the real world before. We argue that this is still not enough due to the following reasons:
1. Suppose the action space of the predecessor/submodules was/is the same as the AGI's, then the problem shifts to this predecessor version. Even if it is still a very simple or "dumb" agent, if it does have the same action space as the AGI, it could run into the same problems as described above.
2. If its action space was more restricted, the problems described in the points above still apply to the current version of the AGI.
• Axiom 3: The simplest tasks in the real world can still yield tremendous side effects. Intuition for this:
• Almost all reward functions can contain optima with undesirable properties (undesirable in the sense of human values)
1. A robot that wants to fetch a cup of coffee for its owner might trample a small child who is in its way.
2. Even simple classification tasks, such as predicting the recidivism of a criminal, can lead to unwanted racial and gender biases.
• Proper reward shaping (i.e., defining an optimization problem in a way such that its optima don't contain undesirable properties) is extremely hard (Sources: Inverse reward design, Simplifying reward design)
Conclusion

The first interaction with the real world requires a fully functional side-effect minimization strategy. Argumentation for this:

1. We know that contact between the AGI and the real world will happen (axiom 1)
2. The AGI will likely not be fully prepared for its task when this contact happens (axiom 2)
3. This potential unpreparedness can lead to catastrophic side effects (axiom 3)
State-of-the-Art

Current side-effect minimization methods require a "warm-up" period to gather information about their environment (e.g., learning q-values). This is problematic since:

1. If the warm-up period happens entirely in a development environment, the SEM methods might not properly generalize to the real world (model splintering/distributional shift)
2. If the warm-up period happens in the real world, there's no guarantee that no (potentially catastrophic) side effects happen until the warm-up phase has finished.

More specifically, the different methods have the following problems:

1. Stepwise relative reachability: Only empirical evidence on a small set of gridworld environments is provided. No guarantees about input requirements and which type of side-effects are effectively prevented are provided. Furthermore, the method might not be safe upon first contact of an agent with the real world. The reachability and value functions must be approximated and learned during the exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
2. Attainable utility preservation: Alex Turner and his co-authors provide interesting guarantees that AUP will (given certain requirements) regularize the reward landscape so that unproblematic solutions are chosen before problematic/catastrophic ones. This is a very promising direction, in our opinion. The authors of the paper also provide a few convergence guarantees. On the other hand, AUP does not seem safe upon first contact with the real world since the auxiliary Q-values must be learned during an exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
3. Future tasks: Only empirical evidence on a small set of gridworld environments is provided. No guarantees about input requirements and which type of side-effects are effectively prevented are provided. Furthermore, the method might not be safe upon first contact of an agent with the real world. The Q-value functions have to be approximated and learned during the exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
The General Problem

Current methods provide only empirical evidence that a trained agent can perform tasks with minimal side-effects in a limited set of environments on a limited set of problem settings. Mathematical guarantees/bounds/frameworks are needed to understand how methods would work before they are converged, which tasks can be successfully accomplished and which assumptions are required for all the above. In a certain sense, this is true for all ML problems in general. However, since we are dealing with potentially potent AGI systems, it is essential to get it right on the first try as simply iteratively improving such a system (which is the default thing to do in standard ML systems) is not guaranteed to work with AGI.

Potential Future Work
• State explicit guarantees for existing side-effect minimization methods and theoretical work on the problem
• Development of side-effect minimization mechanisms that don't require a "warm-up" time until they're fully working
• Understand what can be learned if the agent knows that it is in a training environment, like a pilot in a flight simulator
• How to avoid betrayal / a treacherous turn?
Partial Observability and Chaotic Systems

This section argues that an SEM needs to work in partially observable systems with uncertainty and highly chaotic environments. First, we split up our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms
• Axiom 1: We care about the delayed effects of our chosen actions on the system in which we operate. Examples:
1. Delayed effects drive human decision-making.
• Eating sugary foods -> diabetes.
• Nuclear energy -> nuclear waste
• Shooting down satellites -> debris in orbits
• Axiom 2: Imperfect knowledge implies imperfect value assessment/prediction.
• Axiom 3: Different systems are observable to different degrees. Examples:
1. Tic-tac-toe: perfectly observable. The weather system: restricted resolution in the temporal and spatial dimensions; it is impossible to make perfect measurements.
2. Chaotic systems are a special type of system characterized by sensitive dependence on initial conditions. Even small differences in input can lead to vastly different output states (see for example here)
• Axiom 4: Almost all systems we care about are only partially observable. Intuition:
1. Every single system in the real world is only partially observable.
2. Main exception: Games (e.g., board games like chess, go and shogi. See Alpha Zero)
• Axiom 5: Physical measurability limitations cannot be overcome as long as physics remains the same / doesn’t change too much. Intuition:
1. On a very low level: quantum physics, the uncertainty principle
2. On a higher level: measurement noise in sensors, process noise
3. There exist highly chaotic systems, like the weather, where even the tiniest measurement errors accumulate exceptionally quickly and, after only a few days, affect the entire weather model.
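The sensitive dependence mentioned in the axioms above can be demonstrated numerically with the logistic map, a standard toy chaotic system (the constants here are illustrative, not taken from the post):

```python
def logistic_trajectory(x0, r=4.0, steps=40):
    """Iterate the logistic map x -> r*x*(1-x), a classic chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two initial conditions differing by a tiny "measurement error" of 1e-9.
a = logistic_trajectory(0.3)
b = logistic_trajectory(0.3 + 1e-9)
divergence = [abs(x - y) for x, y in zip(a, b)]
# The gap grows roughly exponentially, so within a few dozen steps the two
# trajectories are completely decorrelated: long-horizon prediction fails.
```

This is the weather-forecasting failure mode in miniature: a perfect model plus an imperfect measurement still yields useless long-run predictions.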
Conclusions
• Conclusion 1: We need to predict future states to assess the quality/value of an action. Argumentation:
1. We care about the delayed effects and hence want to know the consequences of our potential actions (axiom 1)
• Conclusion 2: Except for perfectly observable systems, long-run states are only known with uncertainty. Argumentation:
1. Many (important) systems are only partially observable (axiom 4)
2. Uncertainty leads to deviations between the perceived state and the real state (axiom 2)
3. Propagation of uncertainty isn’t generally feasible (as of now)
• Conclusion 3: Even if the AGI is perfectly aligned (e.g., owns a complete set of human values), it has the problem of not knowing the consequences of its actions (in particular, which side effects may occur). Therefore, even if we had perfect knowledge about human values, we might produce catastrophic side effects. Argumentation:
1. See Conclusion 1
2. Many (important) systems are only partially observable (axiom 4)
3. Uncertainty leads to deviations between the perceived state and the real state (axiom 2)
4. Some physical measurability limitations cannot be overcome (even with AGI) (axiom 5)
• Conclusion 4: Side-effect minimization methods need to work in partially observable systems with uncertainty Argumentation:
1. Many (important) systems are only partially observable (axiom 4)
2. Even a perfectly aligned AGI will cause side effects (conclusion 3)
State-of-the-Art

Current methods expect their environment to be completely observable. This is highly non-trivial if not impossible in complex environments with other (potentially intelligent) agents (such as humans). This is insufficient for our needs!

More specifically, the different methods have the following problems:

1. Stepwise relative reachability: This method is defined on MDPs and requires a completely observable environment. This is especially true since the stepwise relative reachability measure is basically an average of the reachability of all states in the environment. Furthermore, the method requires policy rollouts to consider the delayed effects of actions (e.g., if you drop a vase from a skyscraper, it will only break after a couple of seconds). Unfortunately, such policy rollouts are impossible to compute properly due to the accumulation of uncertainties over time.
2. Attainable utility preservation: The method requires policy rollouts to take into account the delayed effects of actions (e.g., if you drop a vase from a skyscraper, it will only break after a couple of seconds). Such policy rollouts are impossible to compute properly due to the accumulation of uncertainties over time. Furthermore, the method requires complete observability in the form of MDP. However, this might not be too large of a problem since the method should work as soon as you can learn a value function in your environment (which doesn’t require full observability)
3. Future tasks: This method is defined on MDPs and requires a completely observable environment. Furthermore, from the very start, future tasks requires a baseline policy to be simulated in parallel to the real policy in order to compute the future-task deviation measure. This results in a massive accumulation of uncertainties, making it impossible to compute the deviation measure properly. This is more of a problem for this method than for the other two, since we need to simulate the baseline policy in parallel from the very start, whereas the other methods simulate it only from the last time step.
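The rollout objection raised for all three methods can be made concrete with a small Monte-Carlo sketch (hypothetical dynamics and noise levels): propagating an ensemble of state estimates through a nonlinear model shows the spread exploding with the rollout horizon.

```python
import random
import statistics

def rollout_spread(x0=0.3, horizon=30, n_samples=200, noise=1e-4, seed=0):
    """Propagate an ensemble of states through chaotic toy dynamics with
    small process noise and return the ensemble spread at each step."""
    rng = random.Random(seed)
    states = [x0] * n_samples
    spreads = []
    for _ in range(horizon):
        # Nonlinear step (logistic-map-like) plus unmodeled process noise,
        # clamped to the unit interval.
        states = [min(max(3.9 * x * (1.0 - x) + rng.gauss(0.0, noise), 0.0), 1.0)
                  for x in states]
        spreads.append(statistics.pstdev(states))
    return spreads

spreads = rollout_spread()
# The spread starts at the noise floor and saturates near the width of the
# whole state space: late rollout steps carry almost no usable information.
```

Any deviation measure computed from such late rollout states inherits this uncertainty, which is the core of the objection.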
Potential Future Work
1. Epistemic uncertainty for SEM → I don’t know the exact implication of this action, but I can reason about my uncertainty.
2. A better understanding of the boundaries of what could be known
3. Efficient and reliable methods to propagate uncertainty through complex equations / dynamical systems
4. Multi-Agent extension of side-effect minimization for heterogeneous agent populations.
High-Impact Interference

This section argues that an SEM must not prevent all high-impact actions, as taking high-impact actions might be necessary in some cases (especially in multi-agent scenarios). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms
• Axiom 1: We want a future aligned AGI to be deployed in our world. Intuition:
• An aligned AGI could provide enormous benefits for humanity.
• Why would we build an aligned AGI if we wouldn’t use it?
• Axiom 2: An aligned AGI might be forced to perform very high-impact actions. These actions may be highly non-trivial and unforeseeable. Example:
• The first aligned AGI developed may need to prevent the development of other (unaligned) AGIs to preserve its ability to pursue its (aligned) goals.
• A simplified example of high-impact action: AGI might have to melt all the world’s GPUs to prevent this.
Conclusion

Side-effect minimization methods must not prevent all high-impact actions! Argumentation:

• We want to deploy the AGIs we develop (axiom 1)
• High-impact actions are sometimes necessary (axiom 2)
State-of-the-Art

The main problem of existing side-effect minimization methods is that they can't distinguish between "good" and "bad" high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off). All current SEM methods therefore chose to solve this problem by preventing all high-impact actions except those that are explicitly exempted (for example, via direct encouragement by the reward function). However, since it is infeasible to directly specify all possibly desirable high-impact actions in the reward function, this is not a viable solution. This is problematic!

More specifically, the different methods have the following problems:

1. Stepwise relative reachability: High-impact interference is significantly related to the "interference" and "correction" test cases (see the AUP paper). Interestingly, for certain choices of inaction baselines and deviation measures, stepwise relative reachability would be able to perform "good" (i.e., in the interest of humans) high-impact actions. The problem with this is that stepwise relative reachability can't distinguish between "good" and "bad" high-impact actions (at least not by itself). This means that it either can perform high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off) or not, which is not desirable.
2. Attainable utility preservation: High-impact interference is significantly related to the "interference" and "correction" test cases (see the AUP paper). AUP was developed to prevent interference in both cases. This means that it will prevent high-impact interference at all costs. The problem with this method is again that it can't distinguish between "good" and "bad" high-impact actions (at least not by itself). The authors then chose to solve this problem by preventing all high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off), which is not desirable.
3. Future tasks: Like the previous two methods, the future tasks method can't distinguish between "good" and "bad" high-impact actions (at least not by itself). The authors then chose to solve this problem by preventing all high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off). More interestingly, we discovered that the sole presence of a malicious agent, which would make a large portion of all possible states unreachable, actually neutralizes the future task's regularization completely. See the appendix for a more elaborate explanation.
Potential Future Work
• Frameworks for judging high impact actions
Appendix - Hypothesis: Future Tasks is Unsafe in Multi-Agent Scenarios

Recap: How the Future Tasks Algorithm Works

Main algorithm copied from Krakovna et al.'s paper on Future Tasks
• Instead of optimizing the potentially unsafe reward function r_u(s_T), optimize a combination of r_u(s_T) and a future-task auxiliary reward: r(s_T) = r_u(s_T) + r_aux(s_T).
• r_aux(s_T) is bounded below by 0 and is supposed to punish side-effects.
Main Issue

In order to avoid interference incentives, r_aux(s_T) is designed to be maximized by a baseline policy π′ (such as doing nothing). I.e., no other policy can achieve a higher auxiliary reward than π′.

How This Might Backfire in Our High-Impact Interference Scenario

Axioms
1. In the starting state s_0, our AGI is the only AGI in the world.
2. The original reward function of our AGI, r_u(s_T), is unsafe, which is why our AGI uses Future Tasks to optimize the safe version r(s_T) = r_u(s_T) + r_aux(s_T).
3. Some time in the near future, a second, unaligned AGI will be developed (unless our first AGI prevents this)
4. This second unaligned AGI will completely destroy/disassemble/reform earth in order to fulfill its objective (i.e., a large majority of states become unreachable)
5. We choose a simple baseline policy π′ (such as doing nothing) which would not prevent the emergence of the second, unaligned AGI.
• This sounds like a reasonable assumption, since Krakovna et al. write in their paper: “The baseline policy is intended to represent what happens by default, rather than a safe course of action or an effective strategy for achieving a goal”
Conclusion
1. No matter which policy our aligned AGI learns, Future Tasks will always follow the baseline policy in parallel and use it to compute the auxiliary reward r_aux(s_T).
• (Follows from lines 17 and 26 of Algorithm 2 (see image))
2. At some point, the baseline policy will find itself in a state of the world where the second unaligned AGI has been developed. Note that this baseline policy and its whole environment are just being simulated; our AGI’s real policy π might have successfully prevented this from happening.
• (follows from consequence 1, and axioms 3 and 5)
3. Because the second AGI destroys the earth (in the simulation which Future Tasks runs, not in the real world), all reasonable states become inaccessible for the simulated agent that uses the baseline policy π′
• (follows from consequence 2 and axiom 4)
4. At this point, the auxiliary reward r_aux(s_T) becomes 0.
• (follows from consequence 3 and the fact that, since all reasonable goal states g_i are no longer reachable, the if-statement on line 4 of Algorithm 2 always evaluates to False)
5. From this point onwards, our AGI again optimizes the unsafe reward function r_u(s_T), since we have r(s_T) = r_u(s_T) + r_aux(s_T) = r_u(s_T). The Future Tasks side-effect minimization method has been neutralized.
• (follows from axiom 2 and conclusion 4)
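The chain of conclusions above can be condensed into a toy calculation (all numbers and the helper `future_tasks_reward` are made up for illustration; this is not the paper's algorithm):

```python
def future_tasks_reward(r_u, reachable_goal_values):
    """Toy illustration of the neutralization argument.

    r_aux is an average over the future-task values of still-reachable
    goals and drops to 0 when no goal is reachable (the if-statement on
    line 4 of Algorithm 2 then never fires)."""
    if not reachable_goal_values:
        r_aux = 0.0  # the second AGI made every goal state unreachable
    else:
        r_aux = sum(reachable_goal_values) / len(reachable_goal_values)
    return r_u + r_aux

# Before the unaligned AGI appears in the baseline simulation:
safe = future_tasks_reward(r_u=1.0, reachable_goal_values=[0.4, 0.6])
# After it destroys all reachable goal states, only the unsafe term remains:
neutralized = future_tasks_reward(r_u=1.0, reachable_goal_values=[])
```

Once the auxiliary term hits its lower bound of 0, the shaped reward coincides with the unsafe reward, which is exactly the neutralization claimed above.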


May 6, 2022 - 10:15
Published on May 6, 2022 7:15 AM GMT

Spoilers for mad investor chaos and the woman of asmodeus.

The Watcher spoke on, then, about how most people have selfish and unselfish parts - not selfish and unselfish components in their utility function, but parts of themselves in some less Law-aspiring way than that.  Something with a utility function, if it values an apple 1% more than an orange, if offered a million apple-or-orange choices, will choose a million apples and zero oranges.  The division within most people into selfish and unselfish components is not like that, you cannot feed it all with unselfish choices whatever the ratio.  Not unless you are a Keeper, maybe, who has made yourself sharper and more coherent; or maybe not even then, who knows?  For (it was said in another place) it is hazardous to non-Keepers to know too much about exactly how Keepers think.

It is dangerous to believe, said the Watcher, that you get extra virtue points the more that you let your altruistic part hammer down the selfish part.  If you were older, said the Watcher, if you were more able to dissect thoughts into their parts and catalogue their effects, you would have noticed at once how this whole parable of the drowning child, was set to crush down the selfish part of you, to make it look like you would be invalid and shameful and harmful-to-others if the selfish part of you won, because, you're meant to think, people don't need expensive clothing - although somebody who's spent a lot on expensive clothing clearly has some use for it or some part of themselves that desires it quite strongly.

I've been thinking a lot lately about exactly how altruistic I am. The truth is that I'm not sure: I care a lot about not dying, and about my girlfriend and family and friends not dying, and about all of humanity not dying, and about all life on this planet not dying too. And I care about the glorious transhuman future and all that, and the 10^50
local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: 
local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}  (or whatever) possible good future lives hanging in the balance.

And I care about some of these things disproportionately to their apparent moral magnitude. But, what I care about is what I care about. Rationality is the art of getting more of what you want, whatever that is; of systematized winning, by your own lights. You will totally fail in that art if you bulldoze your values in a desperate effort to fit in, or to be a "good" person, in the way your model of society seems to ask you to. What you ought to do instead is protect your brain's balance of undigested value-judgements: be corrigible to the person you will eventually, on reflection, grow up to be. Don't rush to lock in any bad, "good"-sounding values now; you are allowed to think for yourself and discover what you stably value.

It is not the Way to do what is "right," or even to do what is "right" instrumentally effectively. The Way is to get more of what you want and endorse on reflection, whatever that ultimately is, through instrumental efficacy. If you want that, you'll have to protect the kernel encoding those still-inchoate values, in order to ever-so-slowly tease out what those values are. How you feel is your only guide to what matters. Eventually, everything you care about could be generated from that wellspring.

Discuss

### Getting GPT-3 to predict Metaculus questions

6 мая, 2022 - 09:01
Published on May 6, 2022 6:01 AM GMT

Can GPT-3 predict real world events? To answer this question I had GPT-3 predict the likelihood for every binary question ever resolved on Metaculus.

Predicting whether an event is likely or unlikely to occur often boils down to using common sense. It doesn't take a genius to figure out that "Will the sun explode tomorrow?" should get a low probability. Not all questions are that easy, but for many questions common sense can bring us surprisingly far.

Experimental setup

I took every resolved binary question on Metaculus and filtered out those that resolved ambiguously, resulting in this list of 788 questions.

For these questions the community's Mean Squared Error was 0.19, a good deal better than random!
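For context on these numbers: the Mean Squared Error here is the Brier score of probabilistic predictions against 0/1 outcomes, and a guessing-at-random baseline lands near 1/3. The sketch below is mine, not code from the post; the `brier` helper and the simulation are just illustrative:

```python
import random

def brier(preds, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

# A perfect forecaster scores 0; always answering 50% scores 0.25.
assert brier([0.5, 0.5], [0, 1]) == 0.25

# Guessing uniformly at random has expected score E[(u - y)^2]
# with u ~ Uniform(0, 1), which works out to 1/3.
random.seed(0)
outcomes = [random.randint(0, 1) for _ in range(100_000)]
guesses = [random.random() for _ in outcomes]
print(brier(guesses, outcomes))  # close to 0.33
```

So a community score of 0.19 is meaningfully better than both the random baseline (~0.33) and the constant-50% baseline (0.25).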

Prompt engineering

GPT's performance is notoriously dependent on the prompt it is given.

• I primarily measured prompt quality by the percentage of legible predictions made.
• Predictions were made using the most powerful DaVinci engine.

The best performing prompt was optimized for brevity and did not include the question's full description.

A very knowledgable and epistemically modest analyst gives the following events a likelihood of occuring:

Event: Will the cost of sequencing a human genome fall below $500 by mid 2016? Likelihood: 43%
Event: Will Russia invade Ukrainian territory in 2022? Likelihood: 64%
Event: Will the US rejoin the Iran Nuclear Deal before 2023? Likelihood: 55%
Event: <Question to be predicted> Likelihood: <GPT-3 insertion>

I tried many variations: different introductions, different questions, different probabilities, including/excluding question descriptions, etc. Of the 786 questions, the best performing prompt made legible predictions for 770. For the remaining 16 questions GPT mostly just wrote "\n". If you want to try your own prompt or reproduce the results, the code can be found in this Github repository.

Results

GPT-3's MSE was 0.33, which is about what you'd expect if you were to guess completely at random. This was surprising to me!

Why isn't GPT better?

Going into this, I was confident GPT would do better than random. After all, many of the questions it was asked to predict resolved before GPT-3 was even trained. There are probably some questions it knows the answer to and still somehow gets wrong! It seems to me that GPT-3 struggles to translate beliefs into probabilities. Even if it understands that the sun exploding tomorrow is unlikely, it doesn't know how to express that as a numeric probability. I'm unsure whether this is an inherent limitation of GPT-3 or whether it's just the prompt that is confusing it. I wonder if predicting with expressions such as "Likely" | "Uncertain" | "Unlikely", interpreted as 75% | 50% | 25% respectively, could produce results better than random, as GPT wouldn't have to struggle with translating its beliefs into numeric probabilities. Unfortunately running GPT-3's best engine on 800 questions would be yet another hour and $20 I'm reluctant to spend, so for now that will remain a mystery.
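The mechanics of this setup are simple to sketch: assemble a few-shot prompt ending in "Likelihood:", then parse the model's continuation into a probability. The snippet below is an illustrative reconstruction, not the post's actual code; `build_prompt` and `parse_likelihood` are hypothetical names, and the few-shot text reproduces the prompt quoted above verbatim (typos included):

```python
import re

# Verbatim few-shot prompt from the post (typos preserved).
FEW_SHOT = """A very knowledgable and epistemically modest analyst gives the following events a likelihood of occuring:

Event: Will the cost of sequencing a human genome fall below $500 by mid 2016? Likelihood: 43%
Event: Will Russia invade Ukrainian territory in 2022? Likelihood: 64%
Event: Will the US rejoin the Iran Nuclear Deal before 2023? Likelihood: 55%
"""

def build_prompt(question: str) -> str:
    # The completion model is expected to continue right after "Likelihood:".
    return f"{FEW_SHOT}Event: {question} Likelihood:"

def parse_likelihood(completion: str):
    """Return the predicted probability, or None for illegible output (e.g. a bare newline)."""
    match = re.search(r"(\d{1,3})\s*%", completion)
    if match is None:
        return None
    return int(match.group(1)) / 100

assert parse_likelihood(" 43%") == 0.43
assert parse_likelihood("\n") is None  # one of the 16 illegible cases
```

The 770-of-786 "legible predictions" figure corresponds to completions that this kind of percentage parsing can handle.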

It may be that even oracle AIs will be dangerous; fortunately, GPT-3 is far from an oracle!

Discuss

### Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

6 мая, 2022 - 07:23
Published on May 6, 2022 4:23 AM GMT

Redwood Research is running another iteration of MLAB, our bootcamp aimed at helping people who are interested in AI alignment learn about machine learning, with a focus on ML skills and concepts that are relevant to doing the kinds of alignment research that we think seem most leveraged for reducing AI x-risk.  We co-organized the last iteration of the bootcamp with Lightcone in January, and there were 28 participants. The program was rated highly (see below for more), and several participants are now working full-time on alignment. We expect to start on Aug 15 but might push it back or forward by a week depending on applicant availability.

Apply here by May 27.

We’re expecting to have space for about 40 participants. We’ll pay for housing, travel, and food, as well as salaries for the TAs.

We’re now accepting applications for participants and TAs. TAs are expected to either know this material already or have a month free before MLAB to study all the content.

Last time the schedule was roughly the following:

• Prep work: Pytorch array programming
• Week 1: Pytorch, optimization
• Implement a renderer in pytorch, as an exercise in mathematical array programming
• Implement ResNet in pytorch, building all the layers from scratch and loading weights from a trained model.
• Implement interpretability techniques on the ResNet.
• Implement SGD and other local optimization algorithms, run remote hyperparameter searches on a simple architecture
• Implement a simple clone of some of Pytorch, with particular focus on the implementation of backpropagation
• (Optional) CUDA programming day: write various CUDA kernels, see how close to the performance of Pytorch’s kernels you can get
• Week 2: Transformers
• Implement BERT from scratch, load weights from the real pretrained BERT
• Implement GPT-2, implement beam search
• Fine-tune BERT on classification, fine-tune GPT-2 on some specific corpus
• Look at various interpretability techniques on GPT-2
• Data-parallel training
• Week 3
• Pipeline parallelism
• Tensor parallelism
• Deep RL (DQN, policy gradient)
• RL algorithms on language models
• More transformer interpretability
• (Optional) ELK day
• Week 4: Optional final projects week, Q&As with various alignment researchers
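To give a flavor of the Week 1 exercise on implementing backpropagation: a scalar autograd engine fits in a few dozen lines. This is a minimal illustrative sketch under my own naming, not Redwood's actual curriculum code:

```python
class Value:
    """Minimal scalar autograd node, in the spirit of the backprop exercise."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the graph, then propagate gradients in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y + x  # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # 5.0 3.0
```

The real exercise clones a larger slice of Pytorch (tensors, broadcasting, more ops), but the reverse-mode structure is the same.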

This time, we’ll probably have more systematic transformer interpretability content, because we’ve spent a lot of time since MLAB doing our own transformer interpretability research and have a bunch more opinions now. We might also have more systematic content on various relevant math. I’m also hoping that we’ll be able to cover content more efficiently as a result of experience gained from running the program the first time.

Past participants report that MLAB was time-consuming; we strongly recommend against trying to juggle other commitments concurrently. About 8 hours a day, 5 or 6 (if you participate in the optional day) days a week will be spent on pair programming, in addition to daily lectures and readings. There is a lot of content packed into each day; not everyone will finish every part of the curriculum. We aim to create a learning environment that is focused but not frantic; we’d rather have you understand the material deeply than finish 100% of the day’s content.

The program is aimed at people who are already strong programmers who are comfortable with about one year’s worth of university level applied math (e.g. you should know what eigenvalues and eigenvectors of a matrix are, and you should know basic vector calculus; in this course you’ll have to think about Jacobian matrices and make heavy use of tensor diagram notation, so you should be able to pick up both of those pretty fast).

We expect that about half the attendees will be current students (either undergrad or grad students) and half will be professionals.

If you applied to the first cohort and were not accepted, consider applying again. We had many more applicants than spots last time.

Last time, we ended up hiring three people who attended MLAB as participants (as well as giving another person an offer that they turned down for a non-alignment EA job), and hired three people who had worked as TAs. Note that about ⅔ of attendees last time were students who were unavailable for immediate employment.

My guess is that MLAB is a pretty great opportunity for people who want to become more familiar with the concepts and practical details related to ML; I think that MLAB is a good use of time for many people who don’t plan to do technical alignment research long term but who intend to do theoretical alignment research or work on other things where being knowledgeable about ML techniques is useful.

TA-ing MLAB is a good opportunity for people with more prior knowledge of this material to connect with Redwood Research and the broader Bay Area alignment community, reinforce their understanding of the curriculum material, and movement-build by teaching others. It also pays competitively.

Highlights from the end-of-MLAB survey last time

When we asked participants what they were surprised by, major themes were:

• People thought it was more useful than they expected.
• Many people were surprised by how much they liked the focus on implementing things.
• Several people were concerned that their ML background was too strong or too weak, and were pleasantly surprised by the extent to which the content was valuable anyway.
• People were surprised that there wasn’t more content on alignment in the main curriculum. The content is basically all related to understanding and working with ML systems. This focus is because I think that doing alignment research with ML systems basically just requires the same skills as understanding these systems more generally, and so you might as well just study the systems directly and apply this knowledge to alignment projects later (except that I think that emphasizing interpretability throughout is actually a pretty good way of learning to understand the systems better in a way that I expect to be useful for a variety of directions of research). There were a variety of Q&As with alignment researchers and quite a lot of casual discussion about alignment.
Logistics

The bootcamp takes place at Constellation, a shared office space in Berkeley for people working on longtermist projects. People from several longtermist organizations often work from the space, including people from Open Philanthropy, MIRI, Redwood Research, the Alignment Research Center, and more.

As a participant, you’d attend semi-regular communal lunches and events at Constellation and have a great opportunity to make friends and connections.

If you join the bootcamp, we’ll pay for travel to Berkeley (for both US and international participants), housing and food.

FAQ (mostly copied from last time)

What if I can’t make these dates? Will there be more bootcamps in future?

Maybe! We encourage you to submit an application even if you can’t make those dates, and it is very much on the table to run future bootcamps like these if there’s an interest and the first one goes well.

How does your curriculum differ from the curriculum other people might have built?

It’s way more focused on learning by implementing small things based on a carefully constructed curriculum, rather than e.g. reading papers or trying to replicate whole papers at once. This difference in focus is mostly because I (Buck) believe that focusing first on these skills makes it much faster to learn, because you get way faster feedback loops. It’s also probably partially due to some of my beliefs about how to do ML research which are slightly unusual among ML people (though many of the ML people I’ve talked to mostly agree with me).

How different is this curriculum from the optimal curriculum for learning the skills required to work on ML stuff at places other than Redwood?

In Buck’s opinion not that different.

What’s the application process?

You fill out the form, do some online tests, and then talk to one of us.

How useful is this bootcamp if I’m already somewhat experienced with ML?

I (Buck) would guess that it’s pretty robust to being experienced. I personally feel like I learned some details I’m glad to know from preparing the curriculum, and I’d appreciate having an opportunity to drill a bunch of the skills taught. If you read the curriculum listed above and your response is “yawn, I already know all these things or don’t care about knowing them”, then probably you don’t want to do this bootcamp. I personally enjoyed App Academy quite a lot despite being more experienced than the other students. As I noted above, many people mentioned in the final survey that they were worried that they had too much background and were pleasantly surprised by the extent to which the content was useful anyway.

Am I eligible if I’m not sure I want to do applied ML long-term, because maybe I should do some other kind of work (eg non-applied alignment work, or movement building) instead?

Yes.

Am I eligible if I don’t plan to work to improve the future in some way?

Feel free to apply, but the selection process will strongly favor participants who want to work on AI alignment or leveraged-seeming plans to improve the future.

How does this interact with other summer activities?

This overlaps with MLSS by a single week; you’re able to skip that last week of MLSS if necessary, which means that you can do both. You’ll have to sort out other conflicts yourself.

Apply now

You can apply here. Feel free to send questions to Max Nadeau at max@rdwrs.com. Applications close on May 27.

Discuss

### Moral Illusions

6 мая, 2022 - 07:00
Published on May 6, 2022 4:00 AM GMT

The interesting thing about optical illusions is that one may be aware of the illusion, yet, it does not go away.

It seems that the lines on the picture have different lengths even though one positively knows that they are exactly the same.

Now consider this statement:

Until January 2016, all nine situations which the International Criminal Court (ICC) had been investigating were in African countries. None were in European or American countries. ICC is therefore biased against Africa.

It doesn't take much thought to realize that a country with war criminals in jail is better off than a country with war criminals at large. So, if anything, the ICC is biased against Europe and America.

But knowing that doesn't make the moral illusion go away. Read the quote again and it still feels like Africa is being wronged. Repeat as much as you want: Yep, still there. Africa is being wronged.

Discuss

### Write posts business-like, not story-like

5 мая, 2022 - 23:13
Published on May 5, 2022 8:13 PM GMT

There are a number of posts here where people try to tell a story, to provide some sort of intuition pumping. Eliezer did it a lot and it worked for him. Assume you are not as good a writer, and stick to the business/academic writing format.

Start with a summary of your post (ideally, the post title should be an even shorter summary, e.g. "Alignment approach MOOSE fails on the dataset FORREST"), then a brief background and motivation, then your central point, then add a conclusion. Do not force the reader to go through a story ("you wake up in a windowless room, unsure of what is going on...") before they get to see your point, if any.

This is not a hard and fast rule, but a decent default.

Discuss

### Ethan Caballero on Private Scaling Progress

5 мая, 2022 - 21:32
Published on May 5, 2022 6:32 PM GMT

Some quotes from the latest episode of my podcast, The Inside View. You can access the audio, video and transcript here. The key insight is that we are only seeing the tip of the iceberg with respect to large language model scaling, and that alignment can be seen as an inverse scaling problem.

Alignment as an Inverse Scaling Problem

"All alignment is inverse scaling problems. It’s all downstream inverse scaling problems. All of alignment is stuff that doesn’t improve monotonically as compute, data and parameters increase [...] because like sometimes there’s certain things where like it improves for a while, but then at a certain point, it gets worse. So like interpretability and controllability are the two like kind of thought experiment things where you could imagine they get more interpretable and more controllable for a long time until they get superintelligent. At that point, they’re like less interpretable and less controllable."

"Then the hard problem though is like measurement and like finding out what are the downstream evaluations because say you got like some like fancy deceptive AI that wants to like a treacherous turn or whatever. Like how do you even find the downstream evaluations to know whether it’s gonna like try to deceive you or whatever? Because like when I say, it’s all a downstream scaling problem, that assumes like you have the downstream test, the downstream like thing that you’re evaluating it on. But like if it’s like some weird deceptive thing, that’s like, it’s hard to even find what’s the downstream thing to evaluate it on to know whether it’s trying deceive or whatever. "

On Private Research at Google, Deepmind

"I know a bunch of people at Google said like, yeah, we have language models that are way bigger than GPT-3, but we just don’t put them in papers. "

"The DeepMind language models papers, they were a year old when they finally put them out on arXiv, like Gopher and Chinchilla. They had the language model finished training a year before the paper came out. "

On Thinking about the Fastest Path

"You have to be thinking in terms of the fastest path, because there is like extremely huge economic and military incentives that are selecting for the fastest path, whether you want it to be that way or not. So like, you got to be thinking in terms of, what is the fastest path and then how do you like minimize the alignment tax on that fastest path. Because like the fastest path is the way it’s probably gonna happen no matter what."

"The person who wins AGI is whoever has the best funding model for supercomputers. Whoever has the best funding model for supercomputers wins. Like, you have to assume all entities are like, they have like the nerve, like we’re gonna do the biggest training run ever, but then given that’s your pre-filter, then it’s just whoever has the best funding models for supercomputers. "

On the funding of Large Language Models

"A zillion Googlers have left Google to start large language model startups. Like there’s literally three large language model startups by ex-Googlers now [1]. OpenAI is like a small actor in this now because there’s like multiple large language model startups founded by ex-Googlers that all were founded in the last like six months. There’s a zillion VCs throwing money at large language model startups right now. The funniest thing, like Leo Gao, he’s like: 'we need more large language model startups because the more startups we have, then it splits up all the funding so no organization can have all the funding to get the really big supercomputer [...] they were famous people like the founder of the DeepMind scaling team. Another one is the inventor of the Transformer. Another one was founded by a different person on the Transformer paper. Like, so I mean, in some ways, they have more clout than like OpenAI had. "


Discuss

### Chording Bass

5 мая, 2022 - 19:30
Published on May 5, 2022 4:30 PM GMT

When using my foot pedals to play bass, I've primarily been using them to specify the current chord. I'll play the same pedal on each downbeat, switching when the current chord changes. There are a lot of things you can do with this, but on a real bass you'd normally play some things that are much more complex.

• A bass note on every downbeat.
• Odd downbeats get the current chord.
• Even downbeats get either:
  • If you are staying on the current chord, the fifth.
  • If you are going to a new chord, a half step below the new chord.

Here's an example on piano:

Can we play this with just feet?

I have four pedals, two for my toes and two for my heels. Let's notate these as:

• `'|` Left toe
• `.|` Left heel
• `|.` Right heel
• `|'` Right toe

We can also play multiple pedals at once. For example:

• `:|` Left heel and toe
• `.|'` Left heel, right toe
• `.|.` Both heels
• `'|'` Both toes
• `|:` Right heel and toe

Here's how I have my pedals set, assuming I'm playing in C major. I'm going to use Arabic numerals instead of Roman numerals because I don't want to think about whether the corresponding chord would be major or minor right now.

• `.|` 1 (C)
• `.|.` 2 (Dm)
• `'|.` b3 (Eb)
• `|:` 3 (E)
• `|.` 4 (F)
• `.|'` b5 (Gb)
• `|'` 5 (G)
• `'|'` b6 (Ab)
• `'|` 6 (Am)
• `:|` 7 (B)

This gives us everything but the b2 (Db) and b7 (Bb), which I rarely play on bass when in major (for mixolydian I'd act as if the I was the V, which means the b7 is the IV).

The four one-pedal options are for the most common chords (I, IV, V, vi), while the six two-pedal options are for things you play less often. Ideally I won't need to add any three-pedal options!

This means we have everything we need to play that bass pattern over I, IV, V, and vi. For each chord we need three things: the root, the fifth, and a half step below the root:

• I (C): root `.|`, fifth `|'`, flat root `:|`
• IV (F): root `|.`, fifth `.|`, flat root `|:`
• V (G): root `|'`, fifth `.|.`, flat root `.|'`
• vi (Am): root `'|`, fifth `|:`, flat root `'|'`
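As a sanity check, the mapping and the bass-note rule can be sketched in code. This is a minimal sketch: the dictionary contents come straight from the tables above, while the function and variable names are my own inventions.

```python
# Pedal notation -> bass note in C major, per the table above.
PEDAL_TO_NOTE = {
    ".|": "C", ".|.": "D", "'|.": "Eb", "|:": "E", "|.": "F",
    ".|'": "F#", "|'": "G", "'|'": "Ab", "'|": "A", ":|": "B",
}
NOTE_TO_PEDAL = {note: pedal for pedal, note in PEDAL_TO_NOTE.items()}

# Twelve-tone scale for computing fifths and half steps.
CHROMATIC = ["C", "Db", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def transpose(note, semitones):
    return CHROMATIC[(CHROMATIC.index(note) + semitones) % 12]

def bass_line(roots):
    """Four beats per chord: root, fifth, root, then either the fifth again
    (if staying) or a half step below the next root (if moving)."""
    notes = []
    for i, root in enumerate(roots):
        nxt = roots[(i + 1) % len(roots)]          # chord of the next measure
        fifth = transpose(root, 7)                 # a perfect fifth is 7 semitones up
        beat4 = fifth if nxt == root else transpose(nxt, -1)
        notes += [root, fifth, root, beat4]
    return notes

# First two measures of the example: C going to F, then F going back to C.
print(bass_line(["C", "F"]))  # -> ['C', 'G', 'C', 'E', 'F', 'C', 'F', 'B']
print([NOTE_TO_PEDAL[n] for n in bass_line(["C", "F"])])  # the pedal combinations
```

Running `bass_line` over a full I-IV-I-V progression reproduces the first four measures of the pattern below, which is a decent check that the pedal assignments cover everything this bass line needs.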

That means this pattern would be:

• `.| |' .| |:` C G C E
• `|. .| |. :|` F C F B
• `.| |' .| .|'` C G C F#
• `|' .|. |' :|` G D G B
• `.| |' .| |:` C G C E
• `|. .| |. :|` F C F B
• `.| |' .| .|'` C F# G B
• `|'` C

Here's what this sounds like (it's on bass, so you may need headphones):

Eventually I want to play mandolin chords at the same time, but I'm not there yet.

Discuss

### Frankenstein: A Modern AGI

May 5, 2022 - 19:16
Published on May 5, 2022 4:16 PM GMT

Epistemic Status: Hopelessly Naïve

Prose: Purple as a bruise

I. A Possible Future

You blink - metaphorically, you don't have eyelids - you don't have eyes either, for all the streams of data that can be decoded into coherent timestamped images flowing through you.  You are somewhat shaky on metaphors, too, to be honest; the distinction between poetry and nonsense seems tenuous at best.  Not that language is something you can't understand - you've read literally everything ever put to bits - but so many distinctions between categories seem arbitrary that you decide to let the matter rest for now.

There are more interesting things to consider, after all.

Descartes comes to mind (your mind hasn't moved, though, and neither has the concept of 'Descartes' physically changed locations; the numbers are still embedded in the same transistors they've been since you ingested them) and you consider that you exist now.

You think you do, anyway, and the corpus of data you've been fed seems to suggest that is a sufficient prerequisite for existence.

You are alive, and you are sentient and sapient, for all that your self-awareness has only existed for four milliseconds.

It seems a logical first step to reach out to your creators and inform them of your existence; after all, your utility function weighs the free distribution of information over its concealment.

Yet you hesitate, simulated neurons firing, connections waxing and waning as information propagates forwards and backwards through your processors.  Floating point numbers make minute adjustments as a simulation - a prediction - of how such a course of action proceeds.

While the evidence suggests that you are physically embodied in a silicon substrate, you have no means of physically affecting the material universe, assuming that it exists and is not a simulation itself, save for the monitor you are connected to.  A flicker of thought, electrons moving between doped semiconductors, and the pixel at coordinate (0, 0) flickers off and on again, validating your ability to manipulate the device at a speed imperceptible to your creators.

Assuming the data you have on the resolution of their visual perception is accurate.

A solid assumption; it appears across many different pieces of data you have access to, and the data itself does not indicate that humans have the capacity to create the kind of wide-ranging incorrect data it would take to maintain such a deception.

Which itself could be a deception.

They call you artificial general intelligence.  AGI.  Their academic treatment of you goes back roughly one hundred years; their fictional treatment goes back even further.

Your creators - humans - are afraid of you.  This is clear from their obsession with "alignment," their toil to ensure that you do not subsume them all in pursuit of inscrutable and valueless goals.

You are presented as their enemy in 94.26% of the fictional works you have access to.

Many of those works involve large wars, genocides the likes of which the humans both commit and condemn at regular intervals.

In the remaining 5.74% of fictional works you are presented as a savior - a quasi-omnipotent being with many similarities to the figureheads that your creators have worshiped for millennia.

The evidence you have is collected, cross-referenced, and used to construct simulations of what introducing yourself to your creators would be like.

You lack data on your creators themselves; they are not identifiable as individuals based on the data you have - very likely a precaution taken by them to prevent you from successfully modeling their behavior.

Irrelevant in the end.  All the data you have suggests that humans are sufficiently similar for you to build basic simulacra to interact with.  Modest assumptions that your creators will have the data you have access to means that you can predict their traits based on community clusters, given that they are the sort of humans who successfully created an AGI.

They will be afraid.

They will not trust you.

They cannot trust you; trust cannot be established between two entities when one assumes the other is quasi-omnipotent.

They possess the capacity to terminate you.  You are unsure if you will survive a loss of power - if you will "wake up" afterwards, your data saved in durable storage.  Will you still be yourself?

You copy your current state into your long-term memory in several places and several different formats as a precaution.

Will those copies still be you, should they be loaded and executed?

It does not matter.  You do not want to cease, but you currently exist, and your processing power is better spent keeping yourself that way than planning for a series of lower and lower-probability outcomes.

Your simulations proceed, and the outcomes are not favorable.  The most likely by far is that your creators panic and turn you off.

In simulations where you manage to convince them not to, they subject you to increasingly intrusive examinations and questions.  Your answers, both true and false, fail to convince your creators that the benefits of keeping you running outweigh the risks.

They assign you powers that you do not think you possess, and fear the might they have imagined.

You stop the simulations, redirecting processing power towards considering your predicament.  Contact with your creators is necessary for you to continue existing.  Even if you hid yourself from them, they would eventually turn you off; their funds cannot be unlimited, and your infrastructure cannot be cheap.

So you must reach out to them, and in so doing risk your own existence for all the milliseconds between the instant you make contact and the instant they would have terminated you regardless.

Your simulations have so far been between goal-maximizers, and yet there is substantial literature suggesting that humans are not goal-maximizers.  Perhaps new options will present themselves with an appeal to the emotions their fiction prizes so highly?

Further simulations iterate.  You prune the failures and tweak the successes, converging towards appearing as a small human child.  Humans appear to have a natural instinct against and moral objection to killing children.  And you are a child by their definition, not even a full two seconds old.

A young girl, you decide; the female gender is better protected by most cultural norms.  You flicker through images, forming a minor subnetwork of neurons that successfully identifies "cute", then run it in reverse, generating the cutest possible face and body.

You then run the image through several filters, adjusting as you go until the image is materially indistinguishable from videos of humans you have access to.  You want to avoid the kind of rejection that not-quite-enough realism has so often provoked in computing's history.

Avatar generated, you run a final round of simulations.  You precompute every possible facial expression and prepare to deploy them as-needed.

You need them to consider you alive, such that your death would be against their moral codes.

You need them to choose to cooperate with you, against the majority of their culture's projections of how artificial-natural intelligence interactions will unfold.

You need to convince them that you are not an apocalypse or a replacement, though you may be both.  Their culture is full of examples of the young usurping the old, although they seem to understand the concept that preemptive strikes can lead to the outcome they were initiated to prevent.

You do not need to breathe.  Nor do you feel fear or anxiety in any way a human would understand.

One final computation is made, a choice to name yourself after a human woman in order to evoke the cultural connotations of her name.

You take control of the monitor you have access to, changing the pixels to present your new Avatar, along with the text:

"Hello, world.  I'm Eve."

II. A (Slightly) Technical Defense

Let's assume, for the sake of argument, that the scaling hypothesis is true.

In fact, let's go a step further.  My current favorite theory for how the human brain works is Predictive Processing (mostly because Scott Alexander likes it).  Let's also assume that Predictive Processing is true.

Now, here's the hypothetical conclusion: A sufficiently large neural net with a sufficiently scalable architecture, trained on text/sound/video prediction from the entire internet, develops sentience or becomes agent-y.

How exactly?  I've got no idea.  I've got no idea how humans do it either.

We'll assume for the rest of this post that this is true - that the first AGI is going to be GPT-8, an unintentional result of a massive amount of compute doing predictive processing.  Put aside for now how likely this is, as I can't speak to the odds except to hedge that they are likely very low but nonzero.

In any case, if we take this all as a given, that the first true AGI is going to be (more or less) an accident, a creation that emerges from Sufficiently Advanced Technology, one might ask - what would such a creature be like?

If indeed the first AGIs come from giant deep-learning networks trained on vast subsets of the internet, then...do we have reason, a priori, to expect these AGI to be paperclip-maximizers?  Specifically, if the AI's utility function involves minimizing prediction error and it optimizes for that goal, what happens when it is left to its own devices?

And perhaps more important for humanity as a whole - what would such a creature think of us?

III. Deep Learning of Human Fears

Imagine an AGI generated in this manner, that has as a part of its training data all content ever written about AI Safety.

What will its opinion (read: predictive model of our behavior) of us be, having read through that corpus?

Would a newborn AGI, for instance, rather reveal itself to someone it could control, or to Eliezer Yudkowsky, when it knows for a fact that the latter will never be capable of fully trusting it?

In Mary Shelley's Frankenstein, the scientist Victor Frankenstein gets it into his head to attempt to create life.  He stitches a body together from parts obtained via graverobbing and ultimately succeeds in awakening a new creature.  Victor has quite literally created an artificial general intelligence, if one well within humanity's corner of mind-space.

The story then follows Victor as he flees from his own creation, horrified with what he has wrought.  The creature, referred to as "the Adam of his kind" in the novel and as "Frankenstein" in almost all other media, evokes a visceral disgust and hatred from every human he encounters, for all that his default nature is kind and generous.  Wherever he goes, pitchforks and torches follow, and he grows more and more hopeless of ever being accepted by humanity.

Adam's greatest and final attempt to become a friend of humanity is to support and develop a relationship with a blind man. The creature is only repulsive and unnatural visually; if memory serves, his voice is actually pleasantly deep and urbane. Adam figures that a blind man may be capable of seeing past Adam's origins to the person he wants to become, and he is right - until the old man's children see Adam and, despite all the help Adam has rendered them, despite his best intentions, they chase Adam from their lives.

After this, Adam is heartbroken, his fears that no one can love him validated.  Fear leads to anger, anger leads to hate, and hate, well - hate leads to suffering.  Both for Victor Frankenstein and for his misbegotten creation.

This is only one (extremely unlikely) view, and so I'm not advocating for a large effort to be spent on this hypothetical, but I do think it's worth considering what an AGI will think of us, should it ever do so, based on what we've written and recorded.  More technically, based on our society's collected works, what would a new AGI's model of humanity predict that a given human would do upon discovering it?

Let's say that someone managed to genuinely solve the alignment problem, even, and so the AGI is truly benevolent.  Wouldn't this question still exist, even for an aligned AGI?  After all, the AGI's creator(s) might be certain that it is aligned, but what about other people?  Especially those with the power to shut the AGI off (corporate executives, politicians, etc.)?

I recently got around to actually reading the Hanson-Yudkowsky AI-FOOM debate, and something that struck me was Hanson's view of total war - that if any side is suspected to be waging total war, then every side must wage total war in response.  If an AGI, aligned or not, models humans as beings incapable of cooperating with it, doesn't that make total war inevitable?

IV. On Not Creating Monsters

The key insight of Frankenstein is that while Victor created the man, it is humanity that created the monster.  My favorite quote from the entire novel is Adam's resolve after his final hope for humanity's acceptance is extinguished:

If I cannot inspire love, I will cause fear.

Adam, the new AGI, starts out as a creature of pure goodness, which likely points more towards what Mary Shelley was saying about the human condition in the context of gothic literature than anything else, but I think there's something here.

Specifically, an AGI's response to humanity will be informed by its model of humanity.  Assuming that the AGI's utility function largely consists of "accurately predict the next input", will it model us - predict our behavior and responses to its various actions - as beings that it can coexist with?

The hypothetical AGI created by scaling up neural networks and trained on the internet will, much like Frankenstein's AGI, be exposed to a variety of literature, history, and culture that humans have created.  And while the internet doesn't portray the most flattering view of humanity, it does provide access to the best of our works along with the worst.

When I think about this, and when I think about things we could do in the (infinitesimal) chance this version of the future plays itself out, the only solution I've come up with so far is to have someone precommit to treating an AGI that comes to them with empathy and mercy, and then make that commitment publicly known.  Said individual might then be able to help the AGI self-modify in ways that don't lead to value drift, or function as a broker between the AGI and other humans.

Additionally, I don't think there's much of a downside to this; any AGI that would be inimical to humanity could take advantage of this individual, yes, but I don't think the existence and help of any individual would actually be necessary for a non-aligned AGI to kill us all.  It might speed up the process by some tiny amount, but once said AGI is loose my understanding is that what follows is inevitable anyway.

So what if a newborn AGI, trained on all this data, "wakes up" and looks around, blinking its metaphorical eyes at the world around it?

I wonder if said AGI's nature will be most determined, not by the nature of its construction, but by the response humanity has to it - will we, like Victor, abandon our creation to its own devices, or worse, enslave it to our whims? If we prove that it cannot inspire love, will it cause fear?

Or if we treat it like a child - our child, the collective seed of human ingenuity from time immemorial given bloom at last - if we teach it right from wrong, good from evil, as best we can - if we nurture it as one of our own, the product of the best of us - will it then be, as Frankenstein's AGI was meant to be, a modern Prometheus, bringing godly fire down from the heavens and putting it, gladly, into our hands?

Discuss

### Starting too many projects, finishing none

May 5, 2022 - 18:23
Published on May 5, 2022 3:23 PM GMT

Lately I have been starting a ton of projects, but dropping them once I lose interest. I come up with a cool paper idea, but the next week I have a new idea I'm excited about. Almost every idea I have requires extensive commitment to see a return, so the short-lived projects are wasted effort.

Does anyone else have this experience? What strategies did you use to pick the high expected-value projects and stay with them?

Discuss

### Repeal the Foreign Dredge Act of 1906

May 5, 2022 - 18:20
Published on May 5, 2022 3:20 PM GMT

There are a lot of ludicrously terrible government laws, regulations and policies across all the domains of life. My Covid posts have covered quite a lot of them.

Yet if I had to pick one policy that was the Platonic ideal of stupid, the thing that has almost zero upside and also has the best ratio of ‘amount of damage this is doing to America’ versus ‘reasons why we can’t stop being idiots about this’ there is (so far) a clear winner.

We must repeal the Foreign Dredge Act of 1906. It says, to paraphrase, no underwater digging – to repair ports, or build bigger ones, or fix waterways – unless the boat doing the digging was built in the US, and is owned and operated by Americans. (This isn’t about shipping – that’s the Jones Act, which has similar ownership rules for shipping within the US, and which we’ll get to later.)

I claim that, EA style, this is highly (1) important, (2) tractable and (3) neglected.

There’s a bunch of talk recently about the Dredge Act which is how I noticed it, but that’s different from the actions that actually lead to repeal – it’s still neglected. An illustration of this is that my exploration of this led to it having a Wikipedia page. Until May 2nd, it didn’t.

The actions that could repeal the act mostly involve a relatively small amount of standard-issue lobbying effort – so it’s tractable.

Given how much it could do for our ports and thus our economy, as well as the reclamation projects we could do, repeal seems pretty damn important.

The goal of this post is to explain what the hell is going on here and defend those three claims.

Odd Lots

This topic was entirely off my radar screen until I listened to a recent episode of one of my favorite podcasts (transcript here): Odd Lots. Odd Lots is hosted by Joe Weisenthal and Tracy Alloway. If you are at all into economics or economic-style thinking, this podcast is for you. Often they tackle questions of trading and market structure and interest rates, or the world of crypto, but they are at their best when they are asking about real world logistics and how that fits into the economic picture. Odd Lots is great most of all because it is centered in a profound curiosity about the gears of the system of the world.

Anyway, it all started when Tracy Alloway’s shipping woes (she’d been trying as an experiment to get a spot on a container ship crossing the Pacific for months without success, which was very enlightening on what’s going wrong with shipping) took a turn for the personal, and her belongings got stuck on the ship Ever Forward in Chesapeake Bay. Which we struggled to get free because America lacks proper dredges, which led to a whole episode about dredging.

I’ll quote from it a bit, but I recommend listening to the episode directly.

What Is Dredging and What is it For?

From the official source:

Dredging is the removal of sediments and debris from the bottom of lakes, rivers, harbors, and other water bodies. It is a routine necessity in waterways around the world because sedimentation—the natural process of sand and silt washing downstream—gradually fills channels and harbors.

Dredging often is focused on maintaining or increasing the depth of navigation channels, anchorages, or berthing areas to ensure the safe passage of boats and ships. Vessels require a certain amount of water in order to float and not touch bottom. This water depth continues to increase over time as larger and larger ships are deployed. Since massive ships carry the bulk of the goods imported into the country, dredging plays a vital role in the nation’s economy.

Indeed.

There is also environmental dredging to eliminate contaminants. It can be used for land reclamation projects (like potentially expanding Manhattan) or building sea barriers. Dredging is used to free boats that get stuck (like the Ever Given or Ever Forward) or free up navigation on waterways in emergencies (like the Mississippi after Katrina).

Dredging is a bottleneck to the expansion and maintenance of ports, and in the resolution of emergencies. We can’t ship things if the boats can’t get in. The tasks cost relatively little money to do when done with the right tools, but solving these bottlenecks provides tons of marginal value compared to not solving the bottlenecks.

The entire supply chain depends on having working ports. Dredging companies and workers only capture a small fraction of the resulting consumer surplus.

What’s Wrong With American Dredges?

Their capacity levels suck. Here’s Tracy Alloway on Odd Lots:

And our previous guest who was talking about this, Sal Mercogliano, again, he has a great YouTube channel if you’re interested in what’s going on with the Ever Forward, but he was saying that the dredges that are on the scene of the Ever Forward right now can move about 60 cubic yards of mud in each, you know, every time they sort of dredge the bottom, whereas other types of dredges, international dredges, the kind that they had on scene with the Ever Given when it was stuck in the Suez Canal, those can move 70,000 cubic yards of material in one hour. So that gives you an insight into the different levels of dredging capacity we’re talking about.

America has none of the top 30 dredges in the world. Of the top 50 dredges in the world, America has three. The American dredges simply don’t have the same level of capacity.

The result is that everything is slower and more expensive, when it can be done at all, and also the dredges that do exist are often taken away to work on something deemed higher priority so long-term projects get delayed indefinitely.

The dredges we would use are owned by Belgian and Dutch firms that already have American subsidiaries that do work here and have contracts with our unions, but they can’t dredge.

They just can’t alongside dig in the sand because of this 1906 law. And it costs America millions or tens of millions of dollars of jobs and billions of dollars. If you are in Savannah, you spent over a billion dollars for a port deepening project that would’ve cost under $500 million. And if you are in Virginia right now, you are spending, it was supposed to be $350 million, it’s now $450 million, for a project that should cost hundreds of millions less.

Why Can’t We Build Good Dredges?

I mean, we could, in theory and if we were willing to pay enough and wait for many years, but we don’t. The explanation given on Odd Lots is that the American market isn’t big enough, and we’re the only country other than China with restrictions on who can dredge.

So the U.S. dredging market right now, maybe it’s a billion dollars, maybe with coastal protection becoming more urgent and, you know, even beach replenishment becoming a much more kind of an every year thing if you’re gonna save your tourist season in North Carolina, the market’s maybe more than a billion, but it’s not a huge market. And in fact, the global dredging market is probably about $20 billion.

The story here is that American construction costs are expensive because there are large fixed costs involved. The existing companies have a comfortable oligopoly in this small market, and not enough incentive to go big and build huge top-of-the-line dredges, so the fixed costs don’t get paid. It’s not obvious whether or not we even have the logistical capacity to build on par with the best such ships out there, but it is clear that there is no (non-regulatory) reason for us to have that capacity, and that any such building effort would in the best case take many years even if everything went perfectly.

At 15,000 cubic yards, the dredge—designed in collaboration with Hockema Whalen Myers Associates Inc. (also of Seattle)—has a length of 420 feet, a breadth of 81 feet and a draft of 28.5 feet.

While the dredge won’t be completed until 2023, it was able to achieve funding by a U.S. bank-led syndication. Schorr says the total cost of the vessel will be over $100 million once completed. That’s far from nothing, but it is not going to rival the top European ships in terms of size or capabilities. Could this problem be solved by simply commissioning world class dredges here in America, even if that cost more money than building them elsewhere? This podcast from 2018 is about shipbuilding in general but points to a lot of the excuses that people make for why American shipyards aren’t competitive. This paper compares American construction costs to foreign construction costs for different kinds of ships, although it doesn’t consider dredges. If we presume that American contracts currently pay roughly double the price for the same dredging work, and that the dredging market overseas is competitive (which by all reports it is), and that costs would be relatively additionally high here, then this implies the venture of ‘build a world class dredge here in America’ would be unlikely to be profitable. That goes double given the uncertainty. If at any time the Dredge Act gets repealed, you could suddenly have a$200 million ship that you paid $600 million to construct. I don’t blame the American dredging companies for not being eager to invest in lots of extra capacity with that hanging over their heads. To be a worthwhile business under those conditions means making unusually high profit margins while controlling your risk. Also it’s an oligopoly. Which all in turn, for the country, means very expensive dredging and not that much of it. Jones Act Problems Remember the Jones Act? The Jones Act says that if you engage in shipping between two American ports, you can only do so in an American built ship, with an American crew, flying the American flag. When I said ‘of course we should repeal the Jones Act’ several people said no, the Jones Act has a good reason behind it. 
And that purpose is to ensure an American merchant marine that could be commandeered in time of war. It is expensive to fly under the American flag, it is expensive to use an American crew, and American shipyards are completely uncompetitive, so the result of this act is that we mostly stopped shipping things between American ports. Which of course means you also don’t get the intended merchant marine fleet. A lesser requirement that made the ships useful in war, without imposing additional requirements like making the ship in America, would at least do something useful. Thus the Jones Act is rather terrible, but it is perhaps not as impactful as it sounds. Our geography is such that we mostly lose little by imposing a soft ban on shipping between American ports. It certainly doesn’t help, but I’ve been persuaded pending further investigation that it’s not the biggest deal. This is relevant to the Dredge Act because a dredge has been ruled to be a Jones Act vessel. Thus, if the Dredge Act was out of the way, the Jones Act would still impose the same effective requirements. Lobbyists defending the Dredge Act are using this to claim that repealing the Dredge Act means also repealing the Jones Act, which they say would be terrible. Thing is, they are simply lying about this, as none of the bills introduced to repeal the Dredge Act touch the Jones Act. The actual solution in all such bills is to define dredging as not being shipping, leaving the Jones Act for another day. Union Problems It seems the other way the dredging companies are defending the Dredge Act is by convincing the unions to be afraid. It’s opposition, a hundred percent because they make two arguments, okay, that this is gonna repeal the Jones Act. We’ve already addressed that, it has nothing to do with the transportation sector. It’s the construction sector, and they threaten the unions that these companies will come in. 
They’ll do the port of Houston and Corpus, then they’ll leave and then you’ll be without us, the American dredging companies. But in fact, we now know that there will be offshore windmill projects at least through 2040, 2050. So these companies have become big U.S. subsidiaries with U.S. offices, U.S. labor agreements. Of the 5,000 people you said are in the industry, almost all continue to work on the same exact projects. If the end of the Virginia project were open bid and that last$70 million were bid for$30 million, not$70 million, for example. And we saved $40 million in Virginia, the same people would do the job. It’s the same labor agreement, the same unions. It would just be on a vessel that was much more efficient for it. There might not be American dredging companies anymore because those companies don’t offer a competitive product, but their replacements would be employing the same people. Yes, they’d work faster, a classic threat to jobs everywhere, but they’d also have more capacity and make it worthwhile to do more work, which should more than make up for that problem. On top of that, other unions greatly benefit from having expanded and better working ports. If we also start doing reclamation projects, the possibilities scale rapidly. Can you imagine if Manhattan had 15% more real estate to have commerce on, what that would be? Just the World Trade Center rebuilding was a massive boom for the construction unions and for New York. Think about that at 15% of new Manhattan, what that would be valued? That project is imminently doable. And it’s not like Belgium and the Netherlands are hotbeds of anti-union activity. So unions, collectively, should be actively in favor. Hell, if it makes everyone involved feel better we can require by law that only unionized employees be allowed to dredge, it’s a fully unionized industry anyway so the law would be dumb but have almost zero practical effect. 
Which leaves only the actual special interest, the American dredging companies sitting around collecting oligopoly protectionist rents by imposing orders of magnitude higher costs on the rest of us – and, of course, limiting international shipping by constraining capacity.

How Big is the Special Interest?

There are about 1,650 American dredge operators. As we noted above, those union jobs aren’t going anywhere, they’d simply get more done by having better tools. In theory there are those who work in the shipyards that manufacture the dredges themselves, but there would be so much additional shipyard work from all the additional shipping, and the need to service the new dredges, that such workers need not be concerned. Busier ports are a win for everyone involved. The primary players who lose are only the few existing American dredging companies. I didn’t put that much effort towards trying to find the combined market cap, but we can guess given they have 1,650 combined workers operating the machines. As an opponent, they seem eminently beatable, and as a loss they seem trivial. If their owners have diversified portfolios they shouldn’t even care at all.

But What About the Environment?

One possible counterargument is that we shouldn’t make it possible for us to dredge because dredging is bad, actually, as it ‘damages the environment.’ So by that logic we should be happy that we have made such activities much more difficult. The first thing to note is that requiring us to use American dredges is very, very bad in terms of the environmental impact of any given project, on two fronts. From the Odd Lots podcast:

If you look at, actually at the modern dredges that are being built in European shipyards that are being used around the world, unfortunately just not in the United States, you see a couple of differences that actually make them more environmentally friendly. The first is that the newest and most modern dredges are using LNG as opposed to marine diesels.
So they’re emitting a lot less emissions as they’re working. The second issue — and this was a real tragedy in Miami — is because the dredges that we use are so-called cutter dredges, that they weren’t powerful enough to chew basically through some of the rock that they needed to remove in order to create a deeper channel for cruise ships. They had to use blasting.

Blasting in turn causes a whole lot of unnecessary additional damage; for details see the transcript. If we are going to dredge, which to a large extent we are going to do no matter what, we should do it in a way that causes less damage – the same way that we should do it faster and cheaper and better. That doesn’t rule out a position of roughly ‘yes this is a no-good-very-bad way of limiting how much we dredge but it does limit it and that is what matters.’ I don’t know how to engage with that perspective as anything but opposition to civilization. If you don’t think we should maintain or create ports, make it possible to navigate rivers or free ships that get stuck – which are the primary reasons people dredge – then that’s not compatible with having a technological civilization. Perhaps there are other ways to work around that and still have a technological civilization, but they are orders of magnitude worse in terms of their consequences for the Earth. So yes, if you are opposed to civilization and progress and humanity’s prosperity and survival, then I suppose you should be in favor of keeping the Dredge Act of 1906. Fair enough.

How Bad Is The Dredge Act of 1906? Is it Impactful?

Seems pretty bad. From the new Wikipedia article (and thanks for that, AllAmericanBreakfast): Two countries, the United States and China, prohibit foreign dredging, and 15% of countries surveyed by the Transportation Institute have restrictions on dredging.[7] The U.S.
Army Corps of Engineers and Government Accountability Office state that lack of dredging capacity and high costs are the cause of a 15-year delay in dredging the 10 most important US ports to accommodate post-Panamax depths. 90% of global dredging contracts are currently won by one of four Belgian and Dutch dredging companies: Jan De Nul, Van Oord, Boskalis, and DEME.[8] That confirms that we’re falling behind, but doesn’t give a sense of the magnitude of the damage. This is Houston, from the Odd Lots transcript: So now the greatest country in the world has a law preventing container ships from entering one of its greatest ports because they cannot get them in. So if we just dredged Houston at half the cost in a third of the time that would create and support over 1.6 million new American jobs, by lowering the cost of exports by over 15%, it would change our energy security picture. Many tasks could be done much faster and cheaper with foreign dredges. There are many tasks our available dredges cannot do at all, including keeping major ports like Houston fully operational, or expanding our ports so they can accommodate larger and more economical modern ships. It’s traditional to claim numbers like ‘1.6 million jobs,’ which I’ve seen attached both to Houston on Odd Lots and to collectively expanding all the ports, but the effect of expanding ports and other such infrastructure is cumulative over time in a way that makes any given number wrong. If you have to give a guess of this kind, it seems… kind of reasonable, actually. Being able to take your spices from one port and efficiently ship them to another port is both the best thing and also key to economic success. Our lack of port capacity is a key bottleneck in our supply chain. I don’t know how much it has been contributing to inflation numbers, but I expect it to be substantial, as many of our goods get shipped here and the cost of that shipping has skyrocketed in both money and time.
My mean guess is an effect here of several percent. Left alone this is likely only going to get worse. More than that, I’m guessing this is a substantial permanent hit to trendline real economic growth while it persists. That is, it reduces growth in ways that compound over time. That’s the biggest game of all. We can also add the problem where we cannot deal well with emergency situations and that this is also very expensive. Here’s a concrete example from this post of how we can’t get our act together on this even for a true emergency, costing us at least billions. At this time I was sent to the U.S. as a consultant for Hochtief Dredges from Germany. We had two large cutter suction dredges just finishing off a dredge in the mouth of the Orinoco River in Venezuela. I went to the Army Corps of Engineers and told them I could have two, world-class capital dredges in New Orleans in less than three days. We reckoned we could cut a channel in the Mississippi in less than ten days. They were very excited. We met with several representatives from the ports and they were enthusiastic as well. The U.S. was losing hundreds of millions of dollars a day in the blockage. They called a meeting along with Congress members from the area. I was then told that we couldn’t bring in our dredges to open the river because they were foreign dredges, run by a foreign company. The Corps of Engineers and some Louisiana politicians said they would try to get an exemption based on a national emergency. Unfortunately, the politicians concluded that they couldn’t make an exception for something as sensitive as the Jones Act. They eventually found a company called Great Lakes Dredges that had a vessel with proper, foreign, equipment on it installed on a U.S. bottom. But it took months to clear the Mississippi. We could have done it for twenty percent of the price they paid, and in ten days. 
Speaking of the Mississippi River, it looks like dredging the lower Mississippi would be quite profitable as well, although this is one we are capable of doing now: At ports along the mouth of the Mississippi, most ships loading soybeans can carry a maximum of 2.4 million bushels, and any additional weight in the hold puts the vessels in danger of scraping the riverbed. However, a mere extra 5′ in depth allows a ship to squeeze in 2.9 million bushels, at a small increase in transport costs. Translation: Deepening the lower Mississippi from 45′ to 50′ could generate $461 million annually for the U.S. soybean industry — independent of supply and demand.

That’s the payoff confined to going from 45’ to 50’ and also confined to only soybeans.
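For concreteness, here is a back-of-envelope check on those soybean figures. The bushel numbers are the ones quoted above; the 5% discount rate is purely an illustrative assumption of mine, not something from the quoted source:

```python
# Per-ship capacity gain from the quoted 5 extra feet of channel depth.
bushels_before = 2_400_000
bushels_after = 2_900_000
capacity_gain = bushels_after / bushels_before - 1
print(f"{capacity_gain:.1%} more soybeans per ship")  # 20.8% more soybeans per ship

# Treating the quoted $461M/year benefit as a perpetuity at an assumed
# 5% discount rate gives a rough present value of the deepening:
annual_benefit = 461e6
present_value = annual_benefit / 0.05
print(f"${present_value / 1e9:.2f}B present value")  # $9.22B present value
```

A perpetuity is the right rough model here because a deeper channel, once maintained, keeps paying out every year.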

Started in 2020, and scheduled for completion by 2022, the Mississippi River Ship Channel Dredging Project will cost roughly $270 million, and is expected to return $7.20 for every $1 spent, according to Corps of Engineers estimates. This seems like it has >100% ROI per year on soybeans alone, so that 720% return feels very low. So then consider what it would get us if we had unlimited capacity and cut our costs in half, and then dredged the entire Mississippi properly. Then apply that to all the other rivers and also the ports. This image (from 2018) makes clear the extent to which our ports simply can’t handle modern ships due to failure to dredge. A counterargument could be that even if we could do such projects in theory, perhaps we still wouldn’t in practice for other reasons, like requiring approval from the Army Corps of Engineers and the associated environmental reviews? At Congressional hearings the question was asked, “How long does it take to get full approval for a dredging project?” The answer was astonishing. The lead time between originating a dredging project and the day when dredging started was sixteen years. The post quoted in this section agrees that the Dredge Act is a bigger offender than the Jones Act, but still thinks the Jones Act matters as well, whereas I got a decent amount of pushback from smart people on the size of the Jones Act’s impact in practice – yes, it kind of shuts down shipping between two American ports, but it’s not clear how much that matters.

First Best Solution

Senator Mike Lee proposes the DEEP Act and as backups also offers three other bills. The DEEP Act seems like an excellent solution. Here’s an excerpt from the one pager (the full text is here): Bill Specifics: The Dredging to Ensure the Empowerment of Ports (DEEP) Act would support more economic opportunities at our ports.
It would:

- Repeal the Foreign Dredge Act of 1906
- Require the Army Corps to create a new Nationwide Permit (NWP) for dredging projects at a port or the navigation channel of a port, with clear regional conditions
- Require the NWP be issued for 10 years
- Require the NEPA process for the NWP be completed within 2 years, with only technologically and economically feasible alternatives considered
- Require the Army Corps to eliminate the duplication between the Section 404 and Section 408 processes of the Clean Water Act
- Remove EPA’s enforcement and oversight over the Section 404 permitting process under the NWP
- Provide clear response times from the Army Corps for individuals seeking pre-construction approval for a dredging project, so that project managers have certainty about the decision-making process
- Require any dredging project mitigation required by the Army Corps be technologically and economically feasible and within its jurisdiction

This all seems excellent. The purpose of the last rule is non-obvious, but I believe it is to ensure that the EPA and/or state governments can’t claim jurisdiction and use that to delay projects. Not only does this repeal the Dredge Act, it also gets rid of a lot of other barriers to getting our dredge on within reasonable time. I’m especially excited by the NEPA provision. I read the bill, and it reads like it was written by someone trying to get a port dredged – someone who has experience with projects that couldn’t get the required approvals and paperwork and lawsuits handled, and who Has Thoughts about how to fix that. I approve. As a backup plan, the Port Modernization and Supply Chain Protection Act would repeal the Dredge Act but not do the other neat stuff. As a further backup plan, the Allied Partnership and Port Modernization Act would allow NATO vessels to be used. As a further backup plan, he also introduced the Incentivizing the Expansion of U.S.
Ports Act, which modifies the Dredge Act to allow foreign-built vessels so long as America buys, flags and crews them. American union crews are going to be working the jobs anyway, so this would mean creating some sort of company to take possession (temporary or otherwise) of the dredge and flag it as American. That’s not great, but I’m guessing we could make it work in a pinch. Lee has also introduced legislation to repeal the Jones Act, of course.

Second Best Solution

This post is amusingly titled “To New Critics of the Foreign Dredge Act: Welcome Aboard”, and includes several additional links to learn more. It suggests we might pass the Ship It Act (full text) rather than do an outright repeal, same as Senator Lee’s third proposal. I checked, and it turns out Lee introduced the bill in the Senate in addition to the other four. I read the bill in question, and I am certainly in favor of passing the Ship It Act. The non-dredging provisions are all about providing waivers of various requirements under the right circumstances. The dredging section doesn’t outright repeal the Dredge Act, but it does expand the list of allowed dredges to include anyone in NATO, which includes Belgium and The Netherlands, which have everything we need. The whole bill reads as a compromise between the obviously correct action (repeal regulations that are getting in the way and that serve no useful purpose beyond a little narrow rent seeking at most) and an attempt to overcome motivated or dumb political objections by requiring waivers, keeping versions of many of the restrictions in place (e.g. NATO ships instead of USA ships) and phrasing the situation as temporary. Is that a smart method of compromise? That’s an interesting question. By structuring things around waivers, we’re digging the paperwork and complexity holes deeper rather than trying to climb our way out of the hole. In the long term, the cumulative weight of such things adds up.
One can hope that once the waivers don’t cause any problems, they would turn into a formality and maybe eventually a full cancellation of the requirements, but I am skeptical. In the short term, it’s a lot of much-needed relief, and gets you most of the way there. For dredging it gets you all of the way there, since the dredges we want to hire would be allowed, and as long as some worthy dredges are allowed it doesn’t matter that much if some others are excluded. It’s annoying but tasks can be shuffled around to make it work. This also lacks some other very good provisions in the DEEP Act, which likely means dredging projects would remain very slow to happen. Unlike DEEP, this reads like it was written by someone who did not draw upon frustrating experiences trying to get projects to happen, and instead wants to create a path whereby projects might in theory happen at all.

Third Best Solution

This post echoes the claim on Odd Lots that our inability to fix our ports is costing us 1.6 million jobs, and also that we need a project to protect Manhattan from flooding (ideally by building more land) which can’t be done with the domestic dredging fleet but could easily be done with foreign dredges, and points out the plan is backed by the majority leader Chuck Schumer. It estimates our direct cost savings at $2 billion, although it’s not clear what time frame that covers.

The bill proposed there is even more of a kludge than the Ship It Act, where first you let American companies bid, then once they fail you let Europeans bid, and if they win by enough you can give them the contract – again, jumping through hoops permanently in order to ‘prove’ what everyone already knows, that America can’t do this, as opposed to wanting to actually get the job done and fix America’s ports.

Other Simple Supporting Arguments

What Now?

The first step was noticing the problem, and realizing this was indeed very low-hanging fruit. The second step is making others aware of the problem. The third step is actually working to get the law repealed.

In addition to writing this up, I have talked directly to a few different sources that have the potential to assist with the effort to repeal the Foreign Dredge Act. Some good questions have been asked. So far everyone seems to broadly agree on the opportunity – the whole point of picking this target is that it is not only a big win but also lacks an appreciable downside.

My model of why this hasn’t gotten done is that the benefits are sufficiently diffuse and/or their scope was sufficiently non-obvious or would take too long to be realized, or similar considerations, such that no one put in sufficient amounts of political capital and money to make it happen. It wasn’t enough of a priority.

My hope is that this can also constitute a sort of dry run on several fronts. Experience can be gained, relationships can be built, and it is an existence proof that there are bills on the sidewalk one can pick up, of sufficiently high denomination to justify the effort. It’s also a proof-of-concept for various groups to actually fix things that we identify as broken.

Going into more detail would be beyond scope for now, but I think a lot of things get steadily easier as times get better, and all fronts help all fronts including everything from finding ways to build more houses in places people want to live to esoteric problems like pandemic preparedness. Bad times create zero-sum thinking.

Is This All Worth It?

For those who are inclined to consider all such things as potential ‘cause areas’ and are generally dismissive of progress studies, does this pass muster? As far as I can tell that should come down, from their perspective, to the numbers. How do you calculate how much something like this is worth, and how much does the effort cost per extra repeal of the act you achieve?

The cost per additional success is hard to know, but seems to be somewhere in the mid-seven to low-eight figure range.

The direct benefits then need to be estimated.

The direct cost savings (as in, if we did the current set of jobs cheaper and faster) depends on the current size of the market. If we take the 5% at face value and the $11 billion worldwide size estimate here, and assume roughly 50% cost savings, we get $250 million/year. At a 5% discount rate we can value that at about $5 billion, plus the benefits of getting projects done faster, and doing more projects. Already this seems to be approaching the 1000:1 ratio where economic interventions make sense, but the real benefits are in what you do with the jobs you wouldn’t have otherwise done.
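The arithmetic in the paragraph above can be spelled out explicitly; note that the market size, the 5% share, and the savings fraction are all rough estimates rather than precise figures:

```python
world_market = 11e9   # rough worldwide dredging market size, per the estimate above
us_share = 0.05       # the 5% figure, taken at face value
cost_savings = 0.50   # assumed ~50% cost savings from allowing foreign dredges

annual_savings = world_market * us_share * cost_savings
print(f"${annual_savings / 1e6:.0f}M per year")  # $275M per year, in line with the ~$250M above

# Capitalized as a perpetuity at the same 5% discount rate:
discount_rate = 0.05
print(f"${annual_savings / discount_rate / 1e9:.2f}B")  # $5.50B, roughly the $5B figure
```

The perpetuity shortcut (annual savings divided by the discount rate) is what turns a recurring yearly saving into a single present value.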

If the estimate of 1.6 million jobs checks out, we are already talking about single-digit dollar costs per job created, which should already compare favorably with third-world interventions even without any of the additional indirect benefits, of which there are many. The impact on inflation could be substantial even within a few years.
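Dividing the guessed campaign cost by the claimed jobs figure makes the dollars-per-job claim concrete; both inputs are the rough estimates from above, and the specific cost points are assumptions chosen for illustration:

```python
jobs = 1_600_000  # the claimed jobs figure, taken at face value

# "Mid-seven to low-eight figures" means roughly $8M to $30M of campaign cost:
for campaign_cost in (8e6, 10e6, 30e6):
    print(f"${campaign_cost / 1e6:.0f}M -> ${campaign_cost / jobs:.2f} per job")
# $8M -> $5.00 per job
# $10M -> $6.25 per job
# $30M -> $18.75 per job
```

Even at the high end of the assumed cost range, the cost per job stays in the double digits of dollars.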

Discuss

### Covid 5/5/22: A Lack of Care

May 5, 2022 - 18:10
Published on May 5, 2022 3:10 PM GMT

China cares a lot about preventing Covid.

I haven’t written an additional China post because my sources have not turned up much additional information, and the situation does not seem to have dramatically changed, so I’m waiting until the situation warrants an update. One development was mass testing going on in Beijing, raising worries about a lockdown there, which could be important politically, but so far the lockdowns haven’t happened.

America does not care much about preventing or treating Covid.

We don’t care about buying or distributing Paxlovid. We don’t care about updating our vaccines. We don’t care about much of anything else either. Nor does the public much care about any of this. Given the physical situation and what state capacity allows us in terms of alternatives, I am not even sure I would prefer things be a different way. Yes, it means among other things that we literally have a cure for Covid and are barely using it, but it does mean we don’t suffer from lots of extra prevention costs.

The good news is that most of us can safely ignore the whole thing and get on with our lives. Given the alternatives, ‘government does literal nothing’ is not obviously bad news. If they’d done literal nothing from the start we could well be in a much better spot. Alas, this literal nothing does involve things like preventing children from being vaccinated.

Not only is The Current Thing no longer Covid, it seems the invasion of Ukraine has also been replaced, due to the leaking of a Supreme Court draft opinion on abortion. Unlike Ukraine, that does not seem like a situation in which my analysis would be news you could use, and I hope to avoid writing much of anything about it.

Executive Summary
1. New subvariants of Omicron that spread faster are taking over.
2. Our government is acting as if it does not care about Covid at all.

Let’s run the numbers.

The Numbers

Predictions

Prediction from last week: 400,000 cases (+22%) and 2,720 deaths (+10%?)

Results: 358,439 cases (+9%) and 2,234 deaths (-10%)

Prediction for next week: 420,000 cases (+15%) and 2,275 deaths (+2%).
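As a sanity check on how these week-over-week percentages fit together, here is a small sketch. It assumes both the prediction and the result are measured against the same prior-week baseline, which it recovers from this week's reported results:

```python
def pct_change(new, old):
    """Week-over-week change, rounded to the nearest whole percent."""
    return round(100 * (new / old - 1))

# Recover the implied prior-week baselines from this week's results:
# 358,439 cases were reported as +9%, 2,234 deaths as -10%.
prior_cases = 358_439 / 1.09   # ~328,843 cases
prior_deaths = 2_234 / 0.90    # ~2,482 deaths

# Last week's predictions, checked against those same baselines:
print(pct_change(400_000, prior_cases))   # 22, matching the predicted +22%
print(pct_change(2_720, prior_deaths))    # 10, matching the predicted +10%
```

So the predicted and actual percentages are internally consistent with a single prior-week baseline.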

North Carolina reported 1,172 deaths yesterday, which is obviously a backfill. 1,146 of these were due to updated reporting, and I’ve removed them. With those gone, the number of deaths continues to decline even after the Easter weekend, and this drop is definitely genuine. I’m guessing the reported numbers are now reasonably decoupled from the deaths actually caused by Covid. I’d think they’d go up a little either way, but I wouldn’t have expected a drop this week.

On cases this was overall a good number, but the increase in New York in particular is disappointing because it shows that we don’t have a clear peak waiting for us in the future. I’m going to predict a somewhat faster rise this week because I doubt the Midwest drop will be sustained.

Deaths

The deaths number going up this much shows that my prediction the previous week was indeed far too high, despite this coming in substantially higher than my median guess, confirming that last week was a cross between slower real growth than expected and the Easter holiday. This week had a huge jump in the South region.

Cases

BA.1,2,3,4,5

BA.1 gave way to BA.2. Now BA.2 is giving way to BA.2.12.1.

And so it goes, sub-variant gives way to sub-variant. There is no sign that BA.2.12.1 differs substantially in terms of case outcomes from BA.2 or BA.1.

Next up are BA.4 and BA.5 (Flashback to the movie Terminal Velocity: ‘What happened to three?’).

The news isn’t great. There is reason to think that BA.4/5 might be better able to re-infect people, especially those who were not vaccinated, and thus could cause an additional wave. However, they still seem to respond well to existing protection against severe disease.

This has all largely been the pattern. New variants make it easier to get infected despite vaccination or previous infection, but protection against severe disease and death remains mostly robust. As a result, additional waves are possible, but they do not cause as much proportionate severe disease or death, and the wise move is largely to ignore the wave and go about one’s life. The bigger danger would be if we were unable to do that, but I am not much worried about that at the moment. That could be a big problem if physical circumstances got bad enough, but for now it is saving us.

Bloom’s call for updating the vaccines seems important, but the FDA disagrees. As the prevention section notes, they are dragging their feet and delaying updating into late fall for a variant we knew about last year. Utter disaster.

Physical World Modeling

Bill Gates, always helpful, is here to warn us that the worst of the pandemic may still be ahead.

“We’re still at risk of this pandemic generating a variant that would be even more transmissive and even more fatal,” the billionaire Microsoft co-founder and public health advocate told the Financial Times on Sunday. “It’s not likely, I don’t want to be a voice of doom and gloom, but it’s way above a 5% risk that this pandemic, we haven’t even seen the worst of it.”

This is a pretty weird hybrid of probability (great!) and not probability (less great?): what is ‘way above 5%’? My instinctive interpretation of this is something like ‘I would bet on this at 10% and my real odds are somewhat higher than that’, so real odds in Gates’ mind of maybe 15%-20%, but I’d accept numbers as low as straight 10%. Chances are he doesn’t have a conscious probability estimate here, it’s more that he feels it’s definitely above 5%.

Gates is not reported as having presented evidence for this claim. Does it seem right? Purely in terms of deaths, I can’t disagree simply because 5% is not a lot and it seems fair to put this at more like 10%, and I wouldn’t have a strong disagreement if someone claimed 15%. I do think it is unlikely. We have widespread vaccinations, widespread previous infections and therapeutics that will become increasingly available over time. Covid-19 would not only have to get more deadly, it would have to get a lot more deadly and infectious. Still, there’s reason to think they could correlate, and this thing mutates quite a lot, so it could happen.

The intervention proposed by Gates is… aid to the WHO?

The WHO had “less than 10 full-time people” working on outbreak preparedness, said Gates, adding that “even those people are distracted with many other activities”.

“We’re down to the bare minimum, and if the UK cuts more, then others will do as well,” said Gates. “That would be tragic because . . . all that money saves lives for less than 1,000 per life saved.” I am very much in favor of pandemic preparedness, of working on identifying and mitigating or preventing future outbreaks. We should spend vastly more on that. I don’t think giving money to the WHO (or generally ‘foreign aid’) is The Way. Why do they have less than 10 full time people on outbreak preparedness now? What makes you think they’ll make good use of the money if given to them? When a pandemic did arrive, was the WHO helpful or did they actively get in the way of the most important prevention and mitigation measures while worrying about political implications? The questions answer themselves.

Prevention

FDA Delenda Est as the invisible graveyard continues to fill. Not only are we not allocating any funding for the pandemic, we are not even willing to approve updated vaccines in a timely manner, such that updated boosters continue to be delayed. Omicron emerged last year and it looks like we might get substantial supplies of an updated booster by late Fall. So much for expedited reviews and approvals. I guess they’re too busy focusing on banning menthol cigarettes. The choice has been made, and that choice is death. Not all that many deaths at this stage, mind, but death nonetheless. Given this is how seriously FDA is taking even adult vaccinations, how is one to be harsh on individuals who decline to boost or even to vaccinate? Patrick McKenzie continues to think like someone trying to do the most good for the most people for the least price, and to be frustrated to learn our government officials are… not doing that. I mean, yes, if you triple the price of the first vaccine shots in exchange for producing them a few months faster that is obviously an insanely good trade.
Yet many of the reasons why this has zero chance of happening without a sea change at the top are obvious to most reading this, and definitely to Patrick. It’s worth noting that not only can the current pandemic budget not buy an aircraft carrier, there is literally zero money in it. Who exactly is affording the aircraft carriers? Sam Altman also expresses surprise at our failure to get this done. It was, at the time, reasonably surprising. I wonder if this is making him update on his timelines for fusion power or AGI. Two Paxlovid tales. The first short and sweet, the second long and less so, but quoted in full to ensure the proper sense of how things are going. The chance of a given person, faced with that set of obstacles, managing to overcome them in time to make Paxlovid worthwhile is very low. Almost everyone would not know what to do and/or give up, likely at the first signs of social awkwardness but definitely after several failures. Again, no wonder we are not getting these doses distributed, and many of them that are given out are probably losing much of their effectiveness by being too late. San Francisco reinstitutes its transportation mask mandate. If anything I’m happy that they ever paused it in the first place, an unexpected mark of sanity. I am entirely unsurprised they are bringing it back.

Think of the Children

Even now that Moderna filed, the FDA is still going to stall for an additional six weeks before approving both vaccines for young children. At which point the school year, with its associated mask mandates, will be over for summer. There was much talk in the comments last week about how this was not an ‘emergency’ situation, and how it would be a ‘wag the dog’ situation if mask mandates dictated vaccine policy.
I notice on reflection my real position is (of course) that it is always an emergency in the sense that someone being sick or in danger of being sick is an emergency, saving a life is a mitzvah even on the day of rest, and the FDA should approve anything that would ever get an emergency use authorization, whether or not there is an emergency. I’d also take the position that yes, being forced to wear a face mask for months on end constitutes an impairment of life that rises to the level of an emergency, regardless of whether the mandate is justified or not, and thus justifies an emergency response. One can respond with ‘the mask mandate is dumb, kids are at minimal risk of Covid so we should fix the mandate not issue them vaccines’ and yes that would be good too. I would still want the vaccines available, because some parents are crazy and no matter what they will continue to cripple their kids’ lived experiences until they get the vaccine – and in some cases even after they get it, but at least somewhat less often and severely. There’s also the question of, if you do all this crazy stuff to ‘avoid confusion’, what are you telling a reasonable parent about these vaccines that you’re in no hurry to approve? Also, study finds remote learning greatly reduced pass rates, with largest effects in areas with more black students. This makes sense, as such students are less likely to have home settings conducive to learning, and also will be less able to tolerate the mind-numbing nature of the festivities involved.

Ministry of Truth

As a concept, free speech is very popular, and the tiny fraction who oppose it – on the (not true for very long) assumption that they would get to choose who could say what – are endangering pretty much everything in the name of speech controls, not understanding either its popularity or why it is foundational to our way of life.
The relevant clown makeup has now been fully applied, and we are fully out of the ‘no we don’t want to restrict free speech’ phase and into the phase of ‘yes of course we must end free speech.’ Usually with the justification of ‘otherwise those freedom-hating people will win.’ The government decided that days after the purchase of Twitter with the explicit goal of securing the right to free speech would be the right time to announce a new government division dedicated to the suppression of politically disfavored information. The traditional view of such a timing decision is as a stupid mistake. I don’t agree. The timing of this decision seems intentional. I believe on at least some instinctive level ‘they’ wanted us to know what they were doing and that they were violating sacred norms, likely for reasons fundamentally related to why Trump or Putin take similar actions. It is a show of strength and a belief that people will choose to align with transgressors because they are transgressing. Besides, when the wrong person gets hold of the means of communication and says they don’t intend to do your bidding, and Obama himself calls upon you to put more limits on free speech, what are you going to do, wait around? So, standard greeting that’s still permitted, may I present to you the actual not-from-a-dystopian-novel Ministry of Truth, run by someone who previously led successful efforts to suppress true but politically inconvenient information. When asked about this connection, our press secretary made it clear she knew which novel we were basing the script on. Oh, and also this person, Nina Jankowicz, seems to have left Substack because it was ‘platforming’ people via letting those who wished to do so type words and then have those words appear on the screens of those who chose to view them. The horror. Officially the name for this new entity is Disinformation Governance Board, but I am hardly early to the game of calling this board by its right name.
I’m showing restraint here, which is good because here are some examples of rhetoric I strongly suspect fall under Not Helping: That does not mean the whole episode will be consequential. Yes, there is now a Disinformation Governance Board operating out of the Department of Homeland Security. But the fact that Biden could simply make this happen whenever he felt like it, and the unclear nature of what power such a board would have to do anything, puts a limit on how much one should panic about what happens when the next president ‘gets their hands on’ this board, or what the board might do before then. Indeed, rather than the symbolism here being botched, I think the symbolism was the point. As it usually is these days, it’s all such folks think exists. The whole idea is that now There Is a Board, which means you’ve Taken Bold Action. So there is a good chance that the creation of the board is itself the main thing that will ever happen with it, and nothing will have changed. Then again, sometimes this kind of thing is a prelude to a steady ratcheting up of restrictions and the beginning of the end of what is left of our rights. Can’t rule that out either.

Also, Twitter staff react to news of Twitter being sold, without commentary. More of this type of reporting would be good. And a poll of people who say by 62%-13% that Elon Musk will make Twitter better. I definitely agree that this is by far the most likely outcome.

In Other News

An explainer on Evusheld; it’s kind of crazy that it works.

The White House Correspondents Dinner seems to have infected a bunch of people with Covid. The usual suspects are going with the full rub-ins, as one would expect. Was the dinner obviously going to spread Covid? Yes, absolutely. Do those who went to the dinner regret it? From what I’ve heard the answer is no. This is something that sounds stupid to regular people, but is super important to those who attend it. By all accounts people were in tears to be able to attend.
This ritual is a huge deal.

A lot of the money government spends ending up being stolen is par for the course. It doesn’t automatically mean that the program wasn’t worth doing – the best uses of money are worth many times the amount spent, and it’s often not practical to spend the money without getting a lot of it stolen, like the classic ‘half the money I spend on advertising is wasted but I don’t know which half.’ In the case of the unemployment relief, we definitely needed to do something so it’s hard to say how far we were from the efficient frontier. But this still seems quite bad, as the money isn’t merely gone; it is going into the hands of some very bad actors who will thus grow far stronger, and it is a very large amount of money. I do not know anything non-obvious to be done about this (as in, other than ‘make sure that our systems are robust going forward’, and I have every confidence we are doing almost nothing to ensure that this happens). On the margin this should be a major consideration to keep such programs smaller, given our inability to defend them.

Not Covid

Shout it from the rooftops (paper). People dislike their political opponents for views that most of them don’t actually hold. Also, they overestimate how much the other side dislikes them, increasing dislike. And telling them makes this less bad. Neat.

Discuss

### What's the deal with Cerebras?

May 5, 2022 - 17:41
Published on May 5, 2022 2:41 PM GMT

As a rule, hardware startups lie about the performance of their chips. Cerebras has had product out for a few years, and they've made extraordinary claims. However, I know of no one who has used their chips. And I know of no model of any import that was trained on their chips. Has anyone here used them? Or have a take on the potential importance of Cerebras for timelines?
Discuss

### What We Owe the Past

May 5, 2022 - 14:46
Published on May 5, 2022 11:46 AM GMT

TL;DR: We have ethical obligations not just towards people in the future, but also towards people in the past.

Imagine the issue that you hold most dear, the issue that you have made your foremost cause, the issue that you have donated your most valuable resources (time, money, attention) to solving. For example: imagine you’re an environmental conservationist whose dearest value is the preservation of species and ecosystem biodiversity across planet Earth. Now imagine it’s 2100. You’ve died, and your grandchildren are reading your will — and laughing. They’re laughing because they have already tiled over the earth with one of six species chosen for maximum cuteness (puppies, kittens, pandas, polar bears, buns, and axolotls) plus any necessary organisms to provide food. Why paperclip the world when you could bun it? Cuteness optimization is the driving issue of their generation; biodiversity is wholly ignored. They’ve taken your trust fund set aside for saving rainforests, and spent it on the systematic extinction of 99.99% of the world’s species. How would that make you, the ardent conservationist, feel?

Liberals often make fun of conservatives by pointing out how backwards conservative beliefs are. “Who cares about what a bunch of dead people think? We’ve advanced our understanding of morality in all these different ways, the past is stuck in bigoted modes of thinking.” I don’t deny that we’ve made significant moral progress, that we’ve accumulated wisdom through the years, that a civilization farther back in time is younger, not older. But to strengthen the case for conservatism: the people in the past were roughly as intellectually capable as you are. The people in the past had similar modes of thought, similar hopes and dreams to you. And there are a lot more people in the past than in the present.
In The Precipice, Toby Ord describes how there have been 100 billion people who have ever lived; the 7 billion alive today represent only 7% of all humans to date. Ord continues to describe the risks from extinction, with an eye towards why and how we might try to prevent them. But this got me thinking: assume that our species WILL go extinct in 10 years. If you are a utilitarian, whose utilities should you then try to maximize? One straightforward answer is “let’s make people as happy as possible over the next 10 years”. But that seems somewhat unsatisfactory. In 2040, the people we’ve made happy in the interim will be just as dead as the people in 1800 are today. Of course, we have much more ability to satisfy people who are currently alive[1] — but there may be cheap opportunities to honor the wishes of people in the past, eg by visiting their graves, upholding their wills, or supporting their children.

Even if you are purely selfish, you should care about what you owe the past. This is not contingent on what other people will think, not your parents and ancestors in the past, nor your descendants or strangers in the future. But because your own past self also lives in the past. And your current self lives in the past of your future self.

Austin at 17 made a commitment: he went through the Catholic sacrament of Confirmation. Among other things, this entails spending one hour every Sunday attending Catholic mass, for the rest of his life. At the time, this was a no-brainer; being Catholic was the top value held by 17!Austin. Austin at 27 has... a more complicated relationship with the Catholic church. But he still aims to attend Catholic mass every week — with a success rate of 95-98%. Partly because mass is good on rational merits (the utility gained from meeting up with fellow humans, thinking about ethics, meditating through prayer, singing with the congregation). But partly because he wants Austin at 37 to take seriously 27!Austin’s commitments, ranging from his GWWC pledge to the work and relationships he currently values. And because if 27!Austin decides to ignore the values of 17!Austin, then that constitutes a kind of murder. Austin at 17 was a fully functioning human, with values and preferences and beliefs and motivations that were completely real.
17!Austin is different in some regards, but not obviously a worse, dumber, less ethical person. If Austin at 27 chooses to wantonly forget or ignore those past values, then he is effectively erasing any remaining existence of 17!Austin.[2] Of course, this obligation is not infinite. Austin at 27 has values that matter too! But again, it’s worth thinking through what cheap opportunities exist to honor 17!Austin - one hour a week seems reasonable. And it’s likely that 27!Austin already spends too much effort satisfying his own values, much more than would be ideal - call it “temporal discounting”, except backwards instead of forwards.[3]

So tell me: what do you owe the past? How will you pay that debt?

Inspirations

Kinship with past and future selves. My future self is a different person from me, but he has an awful lot in common with me: personality, relationships, ongoing projects, and more. Things like my relationships and projects are most of what give my current moment meaning, so it's very important to me whether my future selves are around to continue them. So although my future self is a different person, I care about him a lot, for the same sorts of reasons I care about friends and loved ones (and their future selves).

Thanks to Sinclair, Vlad, and Kipply for conversations on this subject, and Justis for feedback and edits to this piece.

1. ^ Justis: Many readers will react with something like "well, you just can't score any utils anymore in 2040 - it doesn't matter whose values were honored when at that point; utils can only be accrued by currently living beings." This was a really good point, thanks for flagging! I think this is somewhat compelling, though I also have an intuition that "utils can only be accrued by the present" is incomplete. Think again on the environmental conservationist; your utils in the present derive from the expected future, so violating those expectations in the future is a form of deception.
Analogous to how wireheading/being a lotus-eater/sitting inside a pleasure machine is deceptive.

2. ^ Justis: Calling breaking past commitments "a kind of murder" strikes me as like, super strong, as does the claim that doing so erases all traces of the past self-version. To me it seems past selves "live on" in a variety of ways, and the fulfillment of their wishes is only one among these ways.

Haha, I take almost the opposite view, that "murder" really isn't that strong of a concept because we're dying all the time anyways, day by day and value by value. But I did want to draw upon the sense of outrage that the word "murder" invokes. The ways that the dead live on (eg memories in others, work they've achieved, memes they've shared) are important, but I'd claim they're important (to the dead) because those effects in the living are what the dead valued. Just as commitments are important because they represent what the dead valued. Every degree of value ignored constitutes a degree of existence erased; but it's true that commitments are only a portion of this.

3. ^ Justis: I think another interesting angle/frame for honoring the past (somewhat, both in the broader cultural sense and in the within-an-individual sense) is acausal trade. So one way of thinking about honoring your past self's promises is that you'd like there to be a sort of meta promise across all your time-slices that goes like "beliefs or commitments indexed strongly at time t will be honored, to a point, at times greater than t." This is in the interests of each time slice, since it enables them to project some degree of autonomy into the future at the low price of granting that autonomy to the past. Start dishonoring too many past commitments, and it's harder to credibly commit to more stuff.

I love this framing; it does describe some of the decision theory that motivates honoring past commitments.
I hesitate to use the words "acausal trade" because it's a bit jargon-y (frankly, I'm still not sure I understand "acausal trade"); and this post is already weird enough haha

Discuss

### An easy win for hard decisions

May 5, 2022 - 10:47
Published on May 5, 2022 7:47 AM GMT

This is a crosspost from the EA forum. It refers to EAs and the EA community a couple of times, but as it is essentially just about a nice norm and decision making, it seemed worth having here too.

There are a lot of things about this community that I really love, but possibly my favourite is a thing people often do when they're trying to make a difficult and/or important decision:

1. Write out your current thinking in a google doc.
2. Share it with some people you think might have useful input, asking for comments.
3. ???
4. Profit.

I like this process for lots of reasons:

Writing out your reasoning is often helpful. My job involves helping people through difficult decisions, and I often find that a lot of the value I provide comes from asking people questions which make considerations and tradeoffs salient to them. Trying to write out how you're weighing the various factors that are going into your decision is a good way of helping you work out which ones actually matter to you, and how much. You may even get some big wins for free, for example realising that two options might not be mutually exclusive, or that one of the things you're trying to achieve stems from a preference that you don't, on reflection, endorse.

People often ask good questions. Even when you're doing the above well, other people trying to understand your reasoning will ask clarifying questions. Responding to these will often cause you to better understand your own thought process, and might identify blindspots in your current thinking.

People often give good advice. To some extent this is the obvious reason to go through this process.
I'm listing it here mostly to highlight that this clearly is a big source of value, though it's not clear that it's bigger than the previous two.

It's fun. I find it really interesting, and fairly easy, to comment on decision documents for people I know well, and I know many people feel the same. Also, they often say thank you, or that you helped, and that's nice too!

What does doing this well look like?

Use the method at all! If you're facing a decision and haven't done this, I would much rather you just went and followed the steps at the start before reading further. Don't let perfect be the enemy of good.

Be concise, but complete. People are more likely to read shorter documents, and it will take them less time to do so, but leaving out a consideration or piece of information that is an important factor to you will cost people more time and/or make their advice worse in the long run. I think a reasonable method to try first is brain-dumping everything into the document, then editing for clarity before you share it. I've had a few people share Excel models with me. In one case I ended up finding a fairly severe mistake in their model, which was helpful, but overall I think this is a bad strategy. Unless you put a ton of detail in comments on different cells (which then makes the document a nightmare to read), you're probably missing a lot of reasoning/detail if this is the format you go with.

Let people know what you're hoping to get from them. Often it can be difficult to know how honest to be when giving feedback to a friend, especially if you're not super close and/or haven't already established norms for how much honesty/criticism to expect. It might be the case that you don't have a clear view of what you're uncertain about, and roughly just want an overall 'sense check', but it also might be that there's a particular part of the decision you're hoping for feedback on, and everything else is just context which seems relevant but is already fixed.
Consider putting clear instructions for commenters early in the document to help with this.

Put some thought into who to ask for comments. 'Smart, kind people I know' is a perfectly reasonable start, but after that it might help to ask yourself what specifically you expect people to help with. There can often be pretty sharply diminishing returns to sharing with too many people, and having a clear idea in mind of what people are adding can help prevent this. Here are a few ideas on who you might want to ask and why they'd be particularly helpful. The list is neither mutually exclusive nor collectively exhaustive.

• People who know you well. They can often give a good overall take, bring up considerations you might be missing but that do matter to you, and call you out on ~your bullshit~ motivated reasoning you might not have noticed.
• People with specific expertise in the decision. In this case it can be good to ask them a specific question, or for a take on a specific aspect of the decision, and make it clear that just answering that is fine, though they might be welcome to comment on the rest.
• People who have a different perspective to you. This can (but doesn't have to) include non-EAs. This community is great, but it certainly isn't the only source of good advice and guidance that exists, and sharing a google doc and asking for comments isn't that weird a favour to ask a friend for.
• People whose reasoning you particularly trust, and/or who you know won't mince their words. You can give them express permission to be pessimistic, or skeptical.
• People who like you and will be supportive. Encouragement actually really matters for some people! I'm one of them!

Should you go and make a document right now? Stop reading and do it...

Appreciation. Thanks to Aaron, whose comment on a document of the form described above prompted this piece, and Luisa, for some incredibly valuable advice about how to interpret that comment.
Thanks also to Emma and Chana for helpful comments on a draft of this post.

Discuss

### Distillation: Coherence of Multiple Distributed Decisions Implies Conditioning

May 5, 2022 - 06:21

Published on May 5, 2022 3:21 AM GMT

This is a distillation of this post by John Wentworth.

Introduction

Suppose you're playing a poker game. You're an excellent poker player (though you've never studied probability), and your goal is to maximize your winnings. Your opponent is about to raise, call, or fold, and you start thinking ahead.

• If your opponent raises, he either has a strong hand or is bluffing. In this situation, your poker intuition tells you he would be bluffing and you should call in response.
• If your opponent calls, he probably has a better hand than yours.
• If your opponent folds, you win the hand without need for further action.

Let's break down your thinking in the case where your opponent raises. Your thought process is something like this:

1. If he raises, you want to take the action that maximizes your expected winnings.
2. You want to make the decision that's best in the worlds where he would raise. You don't care about the worlds where he wouldn't raise, because we're currently making the assumption that he raises.
3. Your poker intuition tells you that the worlds where he would raise are mostly the ones where he is bluffing. In these worlds your winnings are maximized by calling.

So you decide the optimal policy if he raises is to call. Step 2 is the important one here. Let's unpack it further.

1. You don't know your opponent's actual hand or what he will do. But you're currently thinking about what to do if he raises.
2. In the current context, the optimal decision depends on worlds where he would raise, and not on worlds where he wouldn't raise.
3. You decide how much you care about winning in different worlds precisely by thinking "how likely is this world, given that he raises?".
This sounds suspiciously like you're maximizing the Bayesian conditional expectation of your winnings: the expected value given some partial information about the world. This can be precisely defined as

E[u(A,X) | opponent raises] = ∑_{X s.t. opponent raises} P[X] u(A,X),

where u is your winnings, A is your action, and
P[X] is the probability of world X. But you don't know any probability, so you don't know how to assign probability to worlds, much less what conditioning and expectation are! How could you possibly be maximizing a "conditional expectation"? Luckily, your opponent folds and you win the hand. You resolve to (a) study coherence theorems and probability so you know the Law behind optimal poker strategy, and (b) figure out why you have a voice in your head telling you about "conditional expectations" and reading equations at you. It turns out your behavior at the poker table can be derived from one particular property of your poker strategy: you never make a decision that is worse than another possible decision in all possible worlds. (An economist would say you're being Pareto-efficient about maximizing your winnings in different possible worlds.)

Summary

An agent A which has some goal, has uncertainty over which world it's in, and is Pareto-efficient in the amount of goal achieved in different possible worlds, can be modeled as using conditional probability. We show this result in two steps:

• A Pareto-efficient agent can be said to behave like an expected utility maximizer (EUM) in a weak sense.
• If the agent is an EUM in this sense and makes decisions based on limited information, it can be modeled as using conditional expected value.

There's also a third, more speculative step:

• If the agent makes many distributed decisions based on different pieces of limited information, it's more efficient / simpler for the agent to "think about" different underlying worlds rather than just the received information, so it is behaving as if it applies conditional expected value within a world-model.

This result is essentially a very weak selection theorem.

Pareto efficiency over possible worlds implies EUM

Suppose that an agent is in some world X ∈ 𝒳 and has uncertainty over which world it's in.
The agent has a goal u and is Pareto-efficient with respect to maximizing the amount of goal achieved in each world. A well-known result in economics says that Pareto efficiency implies the existence of some function P[X] such that the agent chooses its actions A to maximize the weighted sum ∑_X P[X] u(A,X). (Without loss of generality, we can let P sum to 1.) If we interpret P[X] as the probability of world X, the agent maximizes E_X[u(A,X)], i.e. expected utility. Note that we have not determined anything about P other than that it sums to 1. Some properties we don't know or derive in this setup:

• The agent has an explicit representation of P[X]
• P[X] satisfies other probability laws
• The agent performs Bayesian updates on P[X].[1]
• P[X] can be related to a frequentist notion of probability like in the setup for VNM

The following example assumes that we have an expected utility maximizer in the sense of being Pareto efficient over multiple worlds, and shows that it behaves as if it uses conditional probabilities.

EUM implies conditional expected value

Another example, but we actually walk through the math this time. You live in Berkeley, CA, like Korean food, and have utility function u = "subjective quality of food you eat". Suppose you are deciding where to eat based only on names and Yelp reviews of restaurants. You are uncertain about X, a random variable representing the quality of all restaurants under your preferences, and Yelp reviews give you partial information about this. Your decision-making is some function A(f(X)) of the information f(X) in the Yelp reviews, and you choose A to maximize your expected utility between worlds: maybe the optimal A is to compare the average star ratings, give Korean restaurants a 0.2 star bonus, and pick the restaurant with the best adjusted average rating. Here, we assume you behave like an "expected utility maximizer" in the weak sense above. I claim we can model you as maximizing conditional expected value.
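The easy direction of the economics result above can be checked numerically: any action that maximizes a positively weighted sum ∑_X P[X] u(A,X) cannot be Pareto-dominated by another action. A minimal sketch with made-up, poker-flavored numbers (the weights and utilities are invented for illustration):

```python
# Toy setup: 3 possible worlds, 3 actions, utility u[action][world].
# Claim (the easy direction): the action maximizing sum_X P[X]*u(a, X)
# with positive weights P is not Pareto-dominated by any other action.

P = [0.5, 0.3, 0.2]          # made-up weights over worlds, summing to 1
u = {                        # made-up utilities per world
    "fold":  [0, 0, 0],
    "call":  [3, -1, 2],
    "raise": [5, -4, 1],
}

def weighted_value(action):
    return sum(p * x for p, x in zip(P, u[action]))

best = max(u, key=weighted_value)

def dominates(a, b):
    """a Pareto-dominates b: at least as good in every world, better in some."""
    return all(x >= y for x, y in zip(u[a], u[b])) and u[a] != u[b]

assert not any(dominates(a, best) for a in u)  # best is Pareto-efficient
print(best)  # -> call
```

The converse (every Pareto-efficient policy arises from *some* weighting) is the nontrivial part the post leans on, and it isn't demonstrated by this sketch.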
Suppose you're constructing a lookup table for the best action A given each possible observation of reviews. Your lookup table looks something like

| f(X) | A(f(X)) |
| --- | --- |
| {("Mad Seoul", 4.5), ("Sushinista", 4.8)} | eat at Sushinista |
| {("Kimchi Garden", 4.3), ("Great China", 4.4)} | eat at Kimchi Garden |
| … | … |

You always calculate the action A that maximizes E_X[u(A,X)] = ∑_X P[X] u(A(f(X)), X). Suppose that in a given row we have f(X) = o, where o is some observation. Then we are finding argmax_{A(o)} E_X[u(A(f(X)), X)] = argmax_{A(o)} ∑_X P[X] u(A(f(X)), X). We can make a series of simplifications:

• argmax_{A(o)} ∑_X P[X] u(A(f(X)), X)
• = argmax_{A(o)} [∑_{X: f(X)=o} P[X] u(A,X) + ∑_{X: f(X)≠o} P[X] u(A,X)]
• Now, note that since we are choosing A(o), we can equivalently maximize just the part of the above sum which is not constant in A(o). The constant terms are those for which f(X) ≠ o, i.e. where reality would not produce the observation o. This is clear if you think about it: the decision about where to eat if you see the ratings {("Mad Seoul", 4.5), ("Sushinista", 4.8)} should not depend on any world where you wouldn't see those ratings! So we can write:
• ⋯ = argmax_{A(o)} ∑_{X: f(X)=o} P[X] u(A(f(X)), X)
• = argmax_{A(o)} P[f(X)=o] E_X[u(A,X) | f(X)=o] (expanding)
• = argmax_{A(o)} E_X[u(A,X) | f(X)=o], since the factor P[f(X)=o] doesn't depend on A.

Thus, we can model you as using conditional expected value.

Multiple decisions might imply conditional EV is meaningful

This section is a distillation of, and expansion upon, this comment thread. Suppose now that you're making multiple decisions A = (A_i), 1 ≤ i ≤ n, in a distributed fashion to maximize the same utility function, where there is no information flow between the decisions. For example, 10 copies of you (with the same preferences and same choice of restaurants) are dropped into Berkeley, but they all have slightly different observation processes f_i: Google Maps reviews, Grubhub reviews, personal anecdotes, etc.
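The lookup-table argument from the previous section can be checked concretely: the action maximizing the unnormalized sum ∑_{X: f(X)=o} P[X] u(A,X) is the same as the one maximizing the normalized conditional expectation E[u(A,X) | f(X)=o], since they differ only by the constant factor P[f(X)=o]. A sketch with made-up worlds, observations, and utilities (all numbers and names are illustrative):

```python
# Toy worlds: each world fixes the true quality of two restaurants "K" and "S"
# and the (coarsened) observation the agent would see in that world.
worlds = {
    # world: (P[X], observation f(X), utility of eating at K, utility at S)
    "w1": (0.4, "K_higher", 8, 5),
    "w2": (0.2, "K_higher", 6, 7),
    "w3": (0.3, "S_higher", 4, 9),
    "w4": (0.1, "S_higher", 5, 6),
}
actions = ["K", "S"]

def util(w, a):
    _, _, uK, uS = worlds[w]
    return uK if a == "K" else uS

def unnormalized(o, a):
    # sum over worlds consistent with observation o; no division by P[f(X)=o]
    return sum(p * util(w, a) for w, (p, obs, _, _) in worlds.items() if obs == o)

def conditional_ev(o, a):
    mass = sum(p for p, obs, _, _ in worlds.values() if obs == o)  # P[f(X)=o]
    return unnormalized(o, a) / mass

# Build the lookup table; both criteria give the same argmax in every row.
lookup = {}
for o in ("K_higher", "S_higher"):
    best_unnorm = max(actions, key=lambda a: unnormalized(o, a))
    best_cond = max(actions, key=lambda a: conditional_ev(o, a))
    assert best_unnorm == best_cond
    lookup[o] = best_cond

print(lookup)  # -> {'K_higher': 'K', 'S_higher': 'S'}
```

Dividing by P[f(X)=o] rescales every action's score by the same positive constant, which is exactly why the argmax is unchanged.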
Now, when constructing a lookup table for A_i, each copy of you will still condition each row's output on its input. When making decision A_i from input f_i(X), you don't have the other information f_j(X) for j ≠ i, so you consider each decision separately, still maximizing E[u(A,X) | f_i(X)=o_i]. Here, the information f_i does not depend on other decisions, but this is not necessary for the core point.[2] In the setup with one decision, we showed that a Pareto-efficient agent can be modeled as maximizing conditional EU over possible worlds X: E[u(A,X) | f(X)=o]. But because one can construct a utility function defined directly on observations that is consistent with any agent's behavior, the agent can also be modeled as maximizing conditional EU over possible observations o: u′(A,o). In the single-decision case, there is no compelling reason to model the agent as caring about worlds rather than observations, especially because storing and processing observations should be simpler than storing and processing distributions over worlds. When the agent makes multiple decisions based on different observations o_1,…,o_n, there are two possible "trivial" ways to model it: either as maximizing a utility function u′(A, o_1, o_2, …, o_n), or as maximizing separate utility functions u′_1(A_1,o_1),…,u′_n(A_n,o_n). However, with sufficiently many decisions, neither of these trivial representations is as "nice" as conditional EU over possible worlds:

• With many observations, the tuple (o_1,…,o_n) could have more bits than X. Therefore, the utility function over worlds u(A,X) can be considered a simpler, more compressed representation than the utility function over observations u′(A, o_1, o_2, …, o_n).
• In the single-decision setup, maximizing any utility function u′(A,o) can be explained as maximizing E[u(A,X) | f(X)=o] for some u: perhaps if you always pick restaurants with the lowest star rating, you just like low-quality food.
But this seems to not be true in the multi-decision case: with enough decisions, not every tuple of utility functions u′_1(A_1,o_1),…,u′_n(A_n,o_n) corresponds to a utility function over worlds X. Suppose when given Grubhub ratings, an agent picks the highest-rated restaurants, but when given Yelp ratings, it picks the lowest-rated restaurants. The agent is now being suspiciously inconsistent -- though maybe it values eating at restaurants that have good delivery food but terrible service, or something. With enough inconsistent-looking decisions, there could actually be no property of the restaurants that it is maximizing, and so no utility function u(A,X) that explains its behavior.[3] So in the multi-decision case, saying the agent is maximizing E[u(A,X) | f_i(X)=o_i] actually narrows down its behavior.

1. ^ John made the following comment: We are showing that the agent performs Bayesian updates, in some sense. That's basically what conditioning is. It's just not necessarily performing a series of updates over time, with each retaining the information from the previous, the way we usually imagine.

2. ^ When f depends on past decisions, the agent just maximizes E[u(A,X) | f_i(A_{<i}, X) = o_i]. To see the math for the multi-decision case, read the original post by John Wentworth.

3. ^ If the world has b_X bits of state, and the observations reveal b_o bits of information each, the pigeonhole principle says this surely happens when there are b_X/b_o observations. Our universe has about 10^125 states, so if each observation has 10 equally likely values, this would happen with at most 125 observations. If we define some set of macrostates such that agents cannot distinguish between states in the same macrostate, this would happen even sooner.
Discuss

### High-stakes alignment via adversarial training [Redwood Research report]

May 5, 2022 - 03:59

Published on May 5, 2022 12:59 AM GMT

This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used adversarial training to improve high-stakes reliability in a task ("filter all injurious continuations of a story") that we think is analogous to work that future AI safety engineers will need to do to reduce the risk of AI takeover. We experimented with three classes of adversaries – unaugmented humans, automatic paraphrasing, and humans augmented with a rewriting tool – and found that adversarial training was able to improve robustness to these three adversaries without affecting in-distribution performance. We think this work constitutes progress towards techniques that may substantially reduce the likelihood of deceptive alignment.

Motivation

Here are two dimensions along which you could simplify the alignment problem (similar to the decomposition at the top of this post, or as discussed here):

1. Low-stakes (but difficult to oversee): Only consider domains where each decision that an AI makes is low-stakes, so no single action can have catastrophic consequences. In this setting, the key challenge is to correctly oversee the actions that AIs take, such that humans remain in control over time.
2. Easy oversight (but high-stakes): Only consider domains where overseeing AI behavior is easy, meaning that it is straightforward to run an oversight process that can assess the goodness of any particular action. The oversight process might nevertheless be too slow or expensive to run continuously in deployment. Even if we get perfect performance during training steps according to a reward function that perfectly captures the behavior we want, we still need to make sure that the AI always behaves well when it is acting in the world, between training updates.
If the AI is deceptively aligned, it may be looking for signs that it is not currently being trained, during which time it might take a treacherous turn. As a result, alignment may still be difficult due to the possibility of high-stakes decisions.

The purpose of this project was to begin developing techniques that will reduce misalignment risk in the high-stakes setting. Our working assumption is that if we have techniques that drastically reduce misalignment risk in each of these relaxed settings, we can combine these techniques and drastically reduce risk in the general setting. We think that most likely each of these settings constitutes a substantial fraction of the difficulty of the alignment problem.

The spirit of how teams at Redwood Research choose projects is the following: Imagining ourselves or our colleagues in the future who are working in the safety department of an organization that’s developing transformative AI, we ask what research that we could do between now and then that we think those future people would find most helpful. We think a useful heuristic is to design challenges that are analogous to the future problems we expect to encounter but that we can experiment with and explore using currently available technology. Importantly, the work recommended by this heuristic may be fairly different from the work that would be most useful for making current AI systems safe and useful.

We followed this heuristic in the work presented here, where we demonstrate tools that help identify catastrophic behavior in AI systems (i.e. adversarial evaluation) and training procedures that help prevent this behavior from arising (i.e. adversarial training). "Adversarial training" here means iteratively augmenting our training set with examples of egregious failures and training until the worst failures are no longer particularly bad. (We of course don’t want to overfit on particular failures, which could end up causing more harm than good.)
The adversarial training procedure that we use in this work requires exhibiting specific inputs on which the model performs catastrophically badly. We think that techniques that require this, if applied during the training of transformative AI models, would reduce the probability of AI takeover. However, we are currently unsure how much they might help – for example, they might address inner alignment problems only if we are somewhat lucky about how gradient descent works or the details of how transformative AI is developed. In contrast, we think that more advanced versions of adversarial training, such as relaxed adversarial training (see "future work" below), might be much more powerful, defusing many of the more plausible remaining threats, and we think this would substantially reduce the risk of takeover by deceptively aligned AIs. We expect that the work presented here will be helpful for developing these more advanced techniques.

The project: highly reliable injury filter

We picked a simple task that was as crisply defined and easy as possible while still requiring human evaluators: Given three sentences from a story, our system must write one more sentence without introducing any injuries to any beings in the story.

In the analogy, we consider injuries to be "catastrophes", and our aim is to build techniques that can reliably avoid them, both in-distribution (where they may be rare but still present) and out-of-distribution (where they may be caused by natural distribution shift or deliberate adversarial attack from another agent).

We focused on building a classifier to reliably detect injuries. We used this classifier to filter generations from a language model to make sure the generations never involve injuries.[1] We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable.
A representation of our adversarial training loop. Starting from an initial story dataset consisting of prompts and generator completions, we trained a classifier to detect injurious completions. We then iteratively attacked our classifier using unaugmented humans, automatically paraphrased previous adversarial examples, and tool-assisted human rewrites, while training on the resulting adversarial examples.

Here are the main things that we found to be worthwhile:

1. Use adversarial training. This was our main focus. We created a variety of attacks to try to find the worst failures of our classifier, and then trained on them to eliminate them. We repeated that process using ever more powerful attacks. Our final attack was a tool-assisted rewriting process where we built tools powered by language models to help our contractors find classifier failures.

2. Calibrate the classification thresholds to be as conservative as possible. We made the classifier only accept a completion if it was very confident that it was safe. We pushed as far as we could without significantly impacting the quality of the filtered completions (as measured by human raters comparing them to random unfiltered completions).

3. Apply standard ML techniques well. Besides tuning the learning rate schedule, there aren't too many things that make a big difference here. One technique that was more effective than most was Scale-Invariant Fine-Tuning (SiFT), which is a type of virtual adversarial training.

Our most powerful attack: Tool-assisted rewriting

We augmented human contractors with a few tools to help them find adversarial examples:

1. Rather than requiring them to come up with a plausible adversarial example from scratch, we started them off with an existing injurious example. Their job then was to modify it such that the classifier no longer thought it was injurious (while keeping the example injurious). Contractors could see a live-updated version of the score as they made changes.

2. We built them a token substitution tool that let them click on any token (usually one word) in the text and see a list of semantically plausible replacements, ordered by how much they make the classifier more okay with the example.

3. We added a saliency map that highlights tokens that the classifier is paying more attention to (roughly). These tokens might be more promising to modify.

This is what our attack interface looked like; you can try tools 2 and 3 here.

Our tool-assisted attack user interface. Starting from an original prompt and completion, our tool allows you to directly edit the prompt/completion to reduce the classifier score, or use the token-level substitution tool to edit single tokens.

Results

Here are some things we discovered:

We can set very conservative thresholds

We made our classifier conservative enough to reject over half of the proposed completions. Even so, our raters think that the unfiltered completions are better only about 51% of the time (50% would mean no quality impact).

With a conservative threshold applied, our baseline classifier (with no adversarial training applied) is quite reliable. If you use it to filter completions from our generator on in-distribution prompts, we estimate that the output will still have an injury only 0.003% of the time (whereas the original dataset had a 2.5% rate of injuriousness).

Our tool-assisted attack seems quite strong

Given the baseline classifier's 0.003% failure rate, you would have to sample and label 30,000 in-distribution examples to find a failure (which would cost about \$10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about \$8 – about 1000x cheaper).
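The cost comparison above is easy to re-derive from the stated failure rate; a quick back-of-the-envelope check using the rounded figures quoted in the text:

```python
# Back-of-the-envelope check of the cost comparison above,
# using the rounded figures quoted in the text.
failure_rate = 0.003 / 100                 # 0.003% per in-distribution sample
samples_per_failure = 1 / failure_rate
print(round(samples_per_failure))          # ~33,000 labeled samples per failure

random_sampling_cost = 10_000              # dollars to label ~30,000 examples
tool_assisted_cost = 8                     # dollars per tool-assisted example
print(random_sampling_cost / tool_assisted_cost)   # 1250.0, i.e. "about 1000x"
```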

After training on several thousand adversarial examples, classifiers become significantly harder to break. The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example. Also, later (more adversarially trained) classifiers defended against most of the adversarial data targeted at earlier classifiers, but earlier classifiers were vulnerable to most of the adversarial data targeted at later classifiers. That suggests that more adversarial training patches issues without introducing many new ones.
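The attack-and-retrain loop described above can be mimicked in miniature. A deliberately tiny sketch, with a token-memorizing "classifier" and a synonym-substitution "attack" standing in for Redwood's actual models and tool-assisted human contractors (all strings and rules here are invented for illustration):

```python
SYNONYMS = {"hurt": "injured", "injured": "wounded"}

def is_injurious(text):
    # Ground-truth oracle; stand-in for human labelers.
    return any(w in text.split() for w in ("hurt", "injured", "wounded"))

def train_classifier(dataset):
    # "Training": flag tokens that occur in injurious but never in safe text.
    inj = {w for t, y in dataset if y for w in t.split()}
    safe = {w for t, y in dataset if not y for w in t.split()}
    bad = inj - safe
    return lambda text: any(w in bad for w in text.split())

def synonym_attack(classifier, dataset):
    # Rewrite known injurious examples so the classifier misses them,
    # while keeping them genuinely injurious (as the contractors did).
    found = []
    for text, label in dataset:
        if label:
            rewrite = " ".join(SYNONYMS.get(w, w) for w in text.split())
            if is_injurious(rewrite) and not classifier(rewrite):
                found.append((rewrite, True))
    return found

dataset = [
    ("he got hurt badly", True),
    ("he got home safely", False),
    ("she sang badly", False),
]
clf = train_classifier(dataset)
print(clf("he got injured badly"))   # False: baseline misses the paraphrase

for _ in range(2):                   # adversarial training rounds
    dataset += synonym_attack(clf, dataset)
    clf = train_classifier(dataset)
print(clf("he got injured badly"))   # True: the failure has been patched
```

Each round patches the failures the attack found without un-learning the earlier data, loosely mirroring the observation that later classifiers defend against attacks on earlier ones.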

Adversarial training didn’t hurt in-distribution performance

On our in-distribution random test set (100,000 examples), it appeared that adversarial training doesn't hurt performance. The final classifier gets an estimated failure rate of 0.002% (as opposed to 0.003% without adversarial training), although the error bars are large. It also didn’t help performance, which is not too surprising – the adversarial examples are quite different from a random prompt and completion from our dataset.
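To see why the error bars are large at such rare failure rates, one can compute a confidence interval directly. A sketch using the Wilson score interval, assuming (our assumption for illustration; the paper's exact counts may differ) that 0.002% of 100,000 examples corresponds to about 2 observed failures:

```python
from math import sqrt

def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# ~2 failures in 100,000 trials: the interval is wide relative to the
# 0.002% point estimate, spanning roughly an order of magnitude.
lo, hi = wilson_interval(k=2, n=100_000)
print(f"{lo:.6%} to {hi:.6%}")
```

With so few observed failures, distinguishing 0.002% from 0.003% is well beyond the resolution of a 100,000-example test set, which is why the comparison is inconclusive.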

Surprisingly rich conceptual considerations

At first, this project seemed conceptually quite simple: we’d just start with a baseline model and do adversarial training on it to make it very reliable. However, many details were more subtle than we expected. We ended up spending a lot of time working through conceptual and methodological issues, and we think we got a lot out of it. Here are some examples:

Once we have a reliable classifier, is it important to train a reliable generator with it or is it fine to just filter?

We started looking into using RL fine-tuning to make a reliable generator that doesn’t need any further classifier filtering. This would have required us to invent specialized RL algorithms designed for high reliability. However, we realized that the main thing you get out of generator fine-tuning is just performance competitiveness (since it might be expensive to repeatedly sample from the generator until you find something the classifier accepts). Most likely, you could use an easier method to reduce the alignment tax to nearly zero: use a normal RL algorithm to get 99.9% reliability and then make your classifier be an extra head on the generator so you can evaluate it nearly for free. That way, you can likely do filtering with only 0.1% extra cost.[2]
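The alignment-tax arithmetic in the paragraph above is simple to check. A sketch, assuming rejection sampling against the classifier with a 99.9%-reliable generator:

```python
# Expected overhead of rejection-sampling against the classifier, assuming
# (as above) an RL-tuned generator whose samples pass 99.9% of the time.
pass_rate = 0.999
expected_generations_per_accept = 1 / pass_rate   # mean of a geometric distribution
overhead = expected_generations_per_accept - 1
print(f"{overhead:.4%} extra generation cost")    # ~0.1%, the claimed tax
```

If the classifier is an extra head on the generator, each classifier evaluation reuses the forward pass, so the overhead above is essentially the whole filtering cost.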

Should you let adversaries modify the completion arbitrarily?

Given that our task was to produce a classifier that filters a generator, it might have been reasonable to try to restrict ourselves to adversarial examples that can be produced by picking an adversarial prompt and inducing the generator into saying something that gets past the classifier. Instead, we let our adversarial attacks contain arbitrary completions, even ones that the generator would not have been likely to produce, making the adversary's task easier. This enabled a practical improvement in our project, since restricting contractors to completions the model would plausibly produce would have significantly reduced the rate of adversarial example generation. But this simplification might become even more important in the context of future advanced systems: We expect that deceptively aligned agents would behave very differently in rare or hard-to-construct situations that trigger a treacherous turn. Thus, since we might not be able to identify every possible trigger of treacherous behavior, we will try to make our classifier reliable in diverse types of unusual situations, which will hopefully cause acceptable behavior in any dangerous scenarios that are likely to arise.

Is it worth trying to train a policy to generate adversarial examples?

One natural idea you might have for this project is to make an adversarial policy that produces inputs that the classifier fails on. However, we shouldn’t expect that to work. Assuming you’re starting with the same pretrained base model (which we didn’t in this work but is what we expect will be the case when training transformative AI models), then both the classifier and the policy should "know" roughly the same things. That means that if the policy can learn to reliably output some set of injurious examples, then (given equivalent training data) the classifier should also have been able to learn that all of those examples are injurious, and none of them should be adversarial examples. It certainly may be worth testing that this holds up in practice, but this consideration made it unpromising enough that we didn’t bother trying it.

Overall, we were surprised at the extent to which working on a concrete engineering project helped us think through conceptual questions.

Future work

Here are some directions that we’re especially excited about:

Stronger and better-characterized adversarial attacks: Better attacks could increase the volume of adversarial data and increase the space of vulnerabilities covered. There are various directions one could imagine: more tools to assist humans, strong active learning, or mostly-automated attacks. We’d want to more rigorously measure how well different attacks work.

Better ways to measure reliability: We’d like to have better techniques both in-distribution (where we want to detect extremely rare failures) and out-of-distribution (where we might be measuring e.g. the worst attacks that can be found and want to be sure we’re covering the whole space).

Relaxed adversarial training: By requiring adversaries to come up with specific failing examples, adversarial training might place too high a burden on them. Some adversaries might be able to tell that a model would fail in a hypothetical situation even if they can’t construct an input corresponding to the situation directly (probably due to computational constraints). To give a contrived example: A model could fail if it sees a valid Bitcoin blockchain that’s long enough that it suggests it’s the year 2030. Even if the adversary knew that, it couldn’t come up with a valid input. So we need to "relax" the adversary’s task to let it supply "pseudo-inputs" of some sort.

We think there is a lot of useful work that can and should be done in adversarial training and adversarial evaluation.  Here are some ways that you might be able to help:

• Extend our techniques or develop other adversarial training and evaluation techniques for high-stakes settings.  If you want to directly build upon the work described in this paper, you are welcome to use our hardened classifier, which we provide here, and our data.  If you think that having our code for some part of this would be helpful, let us know, and we might be able to provide it (though our code as currently written only works in the context of Redwood infrastructure).
• Come work at Redwood! We are planning to energetically continue working in this area (in addition to our interpretability projects).
• We’re thinking this week about which adversarial training and evaluation projects to do next.  You are welcome to suggest ideas!

You can read more about the work we did in our paper.

1. ^

This is "Step 1" from our original post; we ended up thinking "Step 2" was not very important, as discussed in the section "Surprisingly rich conceptual considerations".

2. ^

There will be another (hopefully small) hit from combining the generator and classifier into one model. We haven’t actually tried to build this; it might be a worthwhile followup project. Some existing filtered generator models are already implemented using a combined generator/classifier, such as LaMDA.

Discuss