
The AI is the model

LessWrong.com News - October 4, 2019 - 11:11
Published on October 4, 2019 8:11 AM UTC

A Friendly AI is not a selfish AI constrained by a special extra conscience module that overrides the AI's natural impulses and tells it what to do.  You just build the conscience, and that is the AI.

Eliezer Yudkowsky, Ghosts in the Machine

When I started thinking about value learning, I thought the goal was to extract simple objects that described the essence of morality. Not so simple as a verbal definition, but something like a utility function. Something separate from planning or reasoning, that was purely about preferences, which you could plug into an AI which would then do some totally separate work to turn preferences into choices.

Turns out that runs into some serious obstacles.


The difficulty of value learning is that there is no One True Utility Function to be assigned the globs of atoms we call humans. To think about them as having desires at all requires viewing them at a suitable level of abstraction - though of course, there's no One True Level Of Abstraction, either. (I promise this is my last post that's basically just consequences of needing the intentional stance for a while.)

Call the world-model the AI uses to best predict the world its "native ontology." If I want to go to the gym, we want the AI to look at the atoms and see "Charlie wants to go to the gym." The thing that I want is not some specific state of the AI's native ontology. Instead, I can only "want" something in an abstracted ontology that not only contains the AI's intentional-stance model of "Charlie," but also intentional-stance-compatible abstractions for "go" and "gym." In short, abstraction is contagious.

This is like the idea of an umwelt (oom-velt), introduced by early philosopher of biology Jakob Johann von Uexküll. In nature, different organisms can have different effective models of the world even though they live in the same environment. They only evolve to model what is necessary for them to survive and reproduce. The umwelt is a term for this modeled world. The umwelt of a bloodsucking tick consists largely of things to climb on and warm-blooded mammals, which are perceived not by sight but by a certain smell and body temperature.

I think of the AI's intentional stance as not just being a specially abstract model of me, but also being a model of my entire umwelt. It needs an abstraction of the gym because the gym is a part of my inner world, an abstract concept that gets referenced in my plans and desires.


Back to value learning. The bare minimum for success is that we build an AI that can predict which actions will do a good job satisfying human values. But how minimalist do we really have to be? Can we get it to output an abstract object corresponding to human values, like a utility function or some compression thereof?

Well, maybe. If it had a complete understanding of humans, maybe it could take that abstract, intentional stance description of humans and cash it out into a utility function over world-histories. Note that this is over world-histories, not world-states, because humans' abstractions often involve things like duration and change. So one problem is that this object is impractically massive, both to construct and to use. In order to actually do anything with human values, what we want is the compressed, abstracted version, and this turns out to more or less consist of the entire AI.
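To get a feel for why a raw utility function over world-histories is impractical, here is a back-of-the-envelope count (the numbers below are made up for illustration, not taken from the post):

```python
# Back-of-the-envelope: a lookup table assigning a utility to every
# world-history needs one entry per history, and the number of histories
# is exponential in their length. (Both numbers are illustrative.)
states = 100   # distinguishable world-states in a toy world
steps = 50     # timesteps in one history
histories = states ** steps          # one history = one sequence of states
print(f"{histories:.0e} entries")    # 1e+100 -- astronomically large
```

Even this toy world needs more table entries than there are atoms in the observable universe, which is why only the compressed, abstracted version can ever be used.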

It's theoretically convenient to think about separating values and planning, only passing a utility function from one to the other, but in practice the utility function is too big to construct, which means that the planning step must repeatedly talk to the abstract model, and is no longer so cleanly separate from it, especially if we imagine optimizing end-to-end, causing every part to be optimized to fit every other part, like two trees growing intertwined.

The other factor blurring any neat lines is meta-ethics. We might want to use meta-ethical data - information learned by observing and talking to humans - to change how the AI treats its information about human values, or even change which decision theory it's using. You can frame this as preferences over the AI's own code, but this is still a case of supposedly simpler preferences actually containing the specification of the whole AI.

These violations of clean separability tell us that our goal shouldn't be to find a separate "human values" object. Except in special cases that we really shouldn't count on, the entire FAI is the "human values" object, and all of its parts might make sense only in the context of its other parts. The AI doesn't have a model of what it should do, the AI is the model.


Solving the forgetting: spaced repetition beyond the rationality community

LessWrong.com News - October 4, 2019 - 09:00
Published on October 3, 2019 3:03 PM UTC

Many of you have heard about spaced repetition. It's a learning technique that lets you remember almost anything for as long as you want. It works by repeatedly answering test questions at increasing intervals. The problem is that it's not widely used (just like the art of rationality). Existing solutions are either limited to memorizing terms or foreign-language words, or require you to create all the flashcards (test questions) yourself. But those who go through the struggle of creating flashcards for complex topics themselves show that spaced repetition can help in learning any topic.
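For readers who haven't seen it, the scheduling idea can be sketched in a few lines. This is a deliberately simplified rule; real schedulers such as SM-2 also track a per-card ease factor:

```python
def next_interval(prev_interval_days: int, recalled: bool) -> int:
    """Next review gap for one flashcard.

    Simplified rule: double the gap after each successful recall,
    reset to one day after a lapse."""
    if not recalled:
        return 1
    return prev_interval_days * 2

# A card recalled successfully four times in a row:
interval, schedule = 1, []
for _ in range(4):
    interval = next_interval(interval, recalled=True)
    schedule.append(interval)
print(schedule)  # [2, 4, 8, 16]
```

The point is only the shape of the schedule: gaps grow after each successful recall and collapse after a lapse, so total review effort per card stays small.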

The main hurdle to sharing flashcards is that you can't understand a question written by someone else unless you already know the topic quite well. Therefore you have to start repetition only once you understand the underlying concept. It seems that the best timing is right after you read about the relevant concept in a textbook or watch a lecture. The test question has to be integrated with the educational content.

The linked article describes the approach we take to implement this idea. We've already made a basic implementation. Now we're looking for those who want to try using it in their personal learning, and, even more importantly, those who are willing to experiment with creating courses for others. We'd appreciate any feedback you have!



Debate on Instrumental Convergence between Yann LeCun, Stuart Russell, and Others

LessWrong.com News - October 4, 2019 - 07:08
Published on October 4, 2019 4:08 AM UTC

An actual freaking public debate about instrumental convergence, in a public space! Major respect to all involved, especially Yoshua Bengio for great facilitation.

For posterity (i.e. having a good historical archive) and further discussion, I've reproduced the conversation here. I'm happy to make edits at the request of anyone in the discussion who is quoted below. I've improved formatting for clarity and fixed some typos. For people who are not AI Alignment Researchers who wish to comment, see the public version of this post here. For people who do work in the relevant fields, please sign up in the top right. It will take a day or so to confirm membership.

Original Post

Yann LeCun: "don't fear the Terminator", a short opinion piece by Tony Zador and me that was just published in Scientific American.

"We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do."


Comment Thread #1

Elliot Olds: Yann, the smart people who are very worried about AI seeking power and ensuring its own survival believe it's a big risk because power and survival are instrumental goals for almost any ultimate goal.

If you give a generally intelligent AI the goal to make as much money in the stock market as possible, it will resist being shut down because that would interfere with its goal. It would try to become more powerful because then it could make money more effectively. This is the natural consequence of giving a smart agent a goal, unless we do something special to counteract this.

You've often written about how we shouldn't be so worried about AI, but I've never seen you address this point directly.

Stuart Russell: It is trivial to construct a toy MDP in which the agent's only reward comes from fetching the coffee. If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be disputed, because it's a simple mathematical observation.
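Russell's observation can be checked numerically with a minimal sketch. The states, probability, and reward below are illustrative choices, not his exact construction:

```python
def expected_reward(press_button: bool, p_switch_off: float) -> float:
    """Expected reward in a toy coffee MDP: reward 1 for fetching the
    coffee, 0 otherwise. If the agent skips the button, the 'human'
    switches it off first with probability p_switch_off."""
    if press_button:
        return 1.0                 # off-switch disabled: coffee for certain
    return 1.0 - p_switch_off      # fails whenever it gets switched off

# For any nonzero chance of being switched off, pressing the button is
# the strictly better policy -- no survival instinct was programmed in.
for p in (0.01, 0.1, 0.5):
    assert expected_reward(True, p) > expected_reward(False, p)
```

Self-preservation falls out of reward maximization alone, which is exactly the instrumental-convergence point under debate.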

Comment Thread #2

Yoshua Bengio: Yann, I'd be curious about your response to Stuart Russell's point.

Yann LeCun: You mean, the so-called "instrumental convergence" argument by which "a robot can't fetch you coffee if it's dead. Hence it will develop self-preservation as an instrumental sub-goal."

It might even kill you if you get in the way.

1. Once the robot has brought you coffee, its self-preservation instinct disappears. You can turn it off.

2. One would have to be unbelievably stupid to build open-ended objectives in a super-intelligent (and super-powerful) machine without some safeguard terms in the objective.

3. One would have to be rather incompetent not to have a mechanism by which new terms in the objective could be added to prevent previously-unforeseen bad behavior. For humans, we have education and laws to shape our objective functions and complement the hardwired terms built into us by evolution.

4. The power of even the most super-intelligent machine is limited by physics, and its size and needs make it vulnerable to physical attacks. No need for much intelligence here. A virus is infinitely less intelligent than you, but it can still kill you.

5. A second machine, designed solely to neutralize an evil super-intelligent machine will win every time, if given similar amounts of computing resources (because specialized machines always beat general ones).

Bottom line: there are lots and lots of ways to protect against badly-designed intelligent machines turned evil.
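Points 2 and 3 above amount to an objective made of a task term plus penalty terms whose weights and membership can be revised; a minimal sketch, with every name and number invented for illustration:

```python
def objective(task_reward: float, penalties: list, weights: list) -> float:
    """Task reward minus weighted safeguard penalties (point 2);
    new (penalty, weight) pairs can be appended later (point 3)."""
    return task_reward - sum(w * p for w, p in zip(weights, penalties))

# Fetching coffee (reward 1) while harming a human (penalty 1, weight 100)
# scores far worse than fetching it harmlessly:
assert objective(1.0, [1.0], [100.0]) < objective(1.0, [0.0], [100.0])
```

Whether such penalty terms can be made complete enough in practice is precisely what the rest of the thread disputes.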

Stuart has called me stupid in the Vanity Fair interview linked below for allegedly not understanding the whole idea of instrumental convergence.

It's not that I don't understand it. I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards.

Here is the juicy bit from the article where Stuart calls me stupid:

Russell took exception to the views of Yann LeCun, who developed the forerunner of the convolutional neural nets used by AlphaGo and is Facebook’s director of A.I. research. LeCun told the BBC that there would be no Ex Machina or Terminator scenarios, because robots would not be built with human drives—hunger, power, reproduction, self-preservation. “Yann LeCun keeps saying that there’s no reason why machines would have any self-preservation instinct,” Russell said. “And it’s simply and mathematically false. I mean, it’s so obvious that a machine will have self-preservation even if you don’t program it in because if you say, ‘Fetch the coffee,’ it can’t fetch the coffee if it’s dead. So if you give it any goal whatsoever, it has a reason to preserve its own existence to achieve that goal. And if you threaten it on your way to getting coffee, it’s going to kill you because any risk to the coffee has to be countered. People have explained this to LeCun in very simple terms.”


Tony Zador: I agree with most of what Yann wrote about Stuart Russell's concern.

Specifically, I think the flaw in Stuart's argument is the assertion that "switching off the human is the optimal solution" - who says that's an optimal solution?

I guess if you posit an omnipotent robot, destroying humanity might be a possible solution. But if the robot is not omnipotent, then killing humans comes at considerable risk, i.e., that they will retaliate. Or humans might build special "protector robots" whose value function is solely focused on preventing the killing of humans by other robots. Presumably these robots would be at least as well armed as the coffee robots. So this really increases the risk to the coffee robots of pursuing the genocide strategy.

And if the robot is omnipotent, then there are an infinite number of alternative strategies to ensure survival (like putting up an impenetrable forcefield around the off switch) that work just as well.

So I would say that not only is killing all humans unlikely to be an optimal strategy under most scenarios; the set of scenarios under which it is optimal is probably close to a set of measure 0.

Stuart Russell: Thanks for clearing that up - so 2+2 is not equal to 4, because if the 2 were a 3, the answer wouldn't be 4? I simply pointed out that in the MDP as I defined it, switching off the human is the optimal solution, despite the fact that we didn't put in any emotions of power, domination, hate, testosterone, etc etc. And your solution seems, well, frankly terrifying, although I suppose the NRA would approve. Your last suggestion, that the robot could prevent anyone from ever switching it off, is also one of the things we are trying to avoid. The point is that the behaviors we are concerned about have nothing to do with putting in emotions of survival, power, domination, etc. So arguing that there's no need to put those emotions in is completely missing the point.

Yann LeCun: Not clear whether you are referring to my comment or Tony's.

The point is that behaviors you are concerned about are easily avoidable by simple terms in the objective. In the unlikely event that these safeguards somehow fail, my partial list of escalating solutions (which you seem to find terrifying) is there to prevent a catastrophe. So arguing that emotions of survival etc will inevitably lead to dangerous behavior is completely missing the point.

It's a bit like saying that building cars without brakes will lead to fatalities.

Yes, but why would we be so stupid as to not include brakes?

That said, instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans?

Francesca Rossi: @Yann Indeed it would be odd to design an AI system with a specific goal, like fetching coffee, and capabilities that include killing humans or disallowing being turned off, without equipping it also with guidelines and priorities to constrain its freedom, so it can understand for example that fetching coffee is not so important that it is worth killing a human being to do it. Value alignment is fundamental to achieve this. Why would we build machines that are not aligned to our values? Stuart, I agree that it would be easy to build a coffee-fetching machine that is not aligned to our values, but why would we do this? Of course value alignment is not easy, and still a research challenge, but I would make it part of the picture when we envision future intelligent machines.

Richard Mallah: Francesca, of course Stuart believes we should create value-aligned AI. The point is that there are too many caveats to explicitly add each to an objective function, and there are strong socioeconomic drives for humans to monetize AI prior to getting it sufficiently right, sufficiently safe.

Stuart Russell: "Why would we build machines that are not aligned to our values?" That's what we are doing, all the time. The standard model of AI assumes that the objective is fixed and known (check the textbook!), and we build machines on that basis - whether it's clickthrough maximization in social media content selection or total error minimization in photo labeling (Google Jacky Alciné) or, per Danny Hillis, profit maximization in fossil fuel companies. This is going to become even more untenable as machines become more powerful. There is no hope of "solving the value alignment problem" in the sense of figuring out the right value function offline and putting it into the machine. We need to change the way we do AI.

Yoshua Bengio: All right, we're making some progress towards a healthy debate. Let me try to summarize my understanding of the arguments. Yann LeCun and Tony Zador argue that humans would be stupid to put explicit dominance instincts in our AIs. Stuart Russell responds that it need not be explicit; dangerous or immoral behavior may simply arise out of imperfect value alignment and instrumental subgoals set by the machine to achieve its official goals. Yann LeCun and Tony Zador respond that we would be stupid not to program the proper 'laws of robotics' to protect humans. Stuart Russell is concerned that value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could 'exploit' this gap, just like very powerful corporations currently often act legally but immorally). Yann LeCun and Tony Zador argue that we could also build defensive military robots designed to only kill regular AIs gone rogue by lack of value alignment. Stuart Russell did not explicitly respond to this, but I infer from his NRA reference that we could be worse off with these defensive robots because now they have explicit weapons and can also suffer from the value misalignment problem.

Yoshua Bengio: So at the end of the day, it boils down to whether we can handle the value misalignment problem, and I'm afraid that it's not clear we can for sure, but it also seems reasonable to think we will be able to in the future. Maybe part of the problem is that Yann LeCun and Tony Zador are satisfied with a 99.9% probability that we can fix the value alignment problem while Stuart Russell is not satisfied with taking such an existential risk.

Yoshua Bengio: And there is another issue which was not much discussed (although the article does talk about the short-term risks of military uses of AI etc), and which concerns me: humans can easily do stupid things. So even if there are ways to mitigate the possibility of rogue AIs due to value misalignment, how can we guarantee that no single human will act stupidly (more likely, greedily for their own power) and unleash dangerous AIs in the world? And for this, we don't even need superintelligent AIs, to feel very concerned. The value alignment problem also applies to humans (or companies) who have a lot of power: the misalignment between their interests and the common good can lead to catastrophic outcomes, as we already know (e.g. tragedy of the commons, corruption, companies lying to have you buy their cigarettes or their oil, etc). It just gets worse when more power can be concentrated in the hands of a single person or organization, and AI advances can provide that power.

Francesca Rossi: I am more optimistic than Stuart about the value alignment problem. I think that a suitable combination of symbolic reasoning and various forms of machine learning can help us to both advance AI’s capabilities and get closer to solving the value alignment problem.

Tony Zador: @Stuart Russell "Thanks for clearing that up - so 2+2 is not equal to 4, because if the 2 were a 3, the answer wouldn't be 4? "

Hmm, not quite what I'm saying.

If we're going for the math analogies, then I would say that a better analogy is:

Find X, Y such that X+Y=4.

The "killer coffee robot" solution is {X=642, Y = -638}. In other words: Yes, it is a solution, but not a particularly natural or likely or good solution.

But we humans are blinded by our own warped perspective. We focus on the solution that involves killing other creatures because that appears to be one of the main solutions that we humans default to. But it is not a particularly common solution in the natural world, nor do I think it's a particularly effective solution in the long run.

Yann LeCun: Humanity has been very familiar with the problem of fixing value misalignments for millennia.

We fix our children's hardwired values by teaching them how to behave.

We fix human value misalignment by laws. Laws create extrinsic terms in our objective functions and cause the appearance of instrumental subgoals ("don't steal") in order to avoid punishment. The desire for social acceptance also creates such instrumental subgoals driving good behavior.

We even fix value misalignment for super-human and super-intelligent entities, such as corporations and governments.

This last one occasionally fails, which is a considerably more immediate existential threat than AI.

Tony Zador: @Yoshua Bengio I agree with much of your summary. I agree value alignment is important, and that it is not a solved problem.

I also agree that new technologies often have unintended and profound consequences. The invention of books has led to a decline in our memories (people used to recite the entire Odyssey). Improvements in food production technology (and other factors) have led to a surprising obesity epidemic. The invention of social media is disrupting our political systems in ways that, to me anyway, have been quite surprising. So improvements in AI will undoubtedly have profound consequences for society, some of which will be negative.

But in my view, focusing on "killer robots that dominate or step on humans" is a distraction from much more serious issues.

That said, perhaps "killer robots" can be thought of as a metaphor (or metonym) for the set of all scary scenarios that result from this powerful new technology.

Yann LeCun: @Stuart Russell you write "we need to change the way we do AI". The problems you describe have nothing to do with AI per se.

They have to do with designing (not avoiding) explicit instrumental objectives for entities (e.g. corporations) so that their overall behavior works for the common good. This is a problem of law, economics, policies, ethics, and the problem of controlling complex dynamical systems composed of many agents in interaction.

What is required is a mechanism through which objectives can be changed quickly when issues surface. For example, Facebook stopped maximizing clickthroughs several years ago and stopped using the time spent in the app as a criterion about 2 years ago. It put in place measures to limit the dissemination of clickbait, and it favored content shared by friends rather than directly disseminating content from publishers.

We certainly agree that designing good objectives is hard. Humanity has struggled with designing objectives for itself for millennia. So this is not a new problem. If anything, designing objectives for machines, and forcing them to abide by them will be a lot easier than for humans, since we can physically modify their firmware.

There will be mistakes, no doubt, as with any new technology (early jetliners lost wings, early cars didn't have seat belts, roads didn't have speed limits...).

But I disagree that there is a high risk of accidentally building existential threats to humanity.

Existential threats to humanity have to be explicitly designed as such.

Yann LeCun: It will be much, much easier to control the behavior of autonomous AI systems than it has been for humans and human organizations, because we will be able to directly modify their intrinsic objective function.

This is very much unlike humans, whose objective can only be shaped through extrinsic objective functions (through education and laws), that indirectly create instrumental sub-objectives ("be nice, don't steal, don't kill, or you will be punished").

As I have pointed out in several talks in the last several years, autonomous AI systems will need to have a trainable part in their objective, which would allow their handlers to train them to behave properly, without having to directly hack their objective function by programmatic means.

Yoshua Bengio: Yann, these are good points, we indeed have much more control over machines than humans since we can design (and train) their objective function. I actually have some hopes that by using an objective-based mechanism relying on learning (to inculcate values) rather than a set of hard rules (like in much of our legal system), we could achieve more robustness to unforeseen value alignment mishaps. In fact, I surmise we should do that with human entities too, i.e., penalize companies, e.g. fiscally, when they behave in a way which hurts the common good, even if they are not directly violating an explicit law. This also suggests to me that we should try to avoid that any entity (person, company, AI) have too much power, to avoid such problems. On the other hand, although probably not in the near future, there could be AI systems which surpass human intellectual power in ways that could foil our attempts at setting objective functions which avoid harm to us. It seems hard to me to completely deny that possibility, which thus would beg for more research in (machine-) learning moral values, value alignment, and maybe even in public policies about AI (to minimize the events in which a stupid human brings about AI systems without the proper failsafes) etc.

Yann LeCun: @Yoshua Bengio if we can build "AI systems which surpass human intellectual power in ways that could foil our attempts at setting objective functions", we can also build similarly-powerful AI systems to set those objective functions.

Sort of like the discriminator in GANs....

Yann LeCun: @Yoshua Bengio a couple direct comments on your summary:

  • designing objectives for super-human entities is not a new problem. Human societies have been doing this through laws (concerning corporations and governments) for millennia.
  • the defensive AI systems designed to protect against rogue AI systems are not akin to the military, they are akin to the police, to law enforcement. Their "jurisdiction" would be strictly AI systems, not humans.

But until we have a hint of a beginning of a design, with some visible path towards autonomous AI systems with non-trivial intelligence, we are arguing about the sex of angels.

Yuri Barzov: Aren't we overestimating the ability of imperfect humans to build a perfect machine? If it is much more powerful than humans, its imperfections will also be magnified. Cute human kids grow up into criminals if they get spoiled by reinforcement, i.e. addiction to rewards. We use reinforcement and backpropagation (a kind of reinforcement) in modern gold-standard AI systems. Do we know enough about humans to be able to build a fault-proof, human-friendly superintelligent machine?

Yoshua Bengio: @Yann LeCun, about discriminators in GANs, and critics in Actor-Critic RL, one thing we know is that they tend to be biased. That is why the critic in Actor-Critic is not used as an objective function but instead as a baseline to reduce the variance. Similarly, optimizing the generator wrt a fixed discriminator does not work (you would converge to a single mode - unless you balance that with entropy maximization). Anyways, just to say, there is much more research to do, lots of unknown unknowns about learning moral objective functions for AIs. I'm not afraid of research challenges, but I can understand that some people would be concerned about the safety of gradually more powerful AIs with misaligned objectives. I actually like the way that Stuart Russell is attacking this problem by thinking about it not just in terms of an objective function but also about uncertainty: the AI should avoid actions which might hurt us (according to a self-estimate of the uncertain consequences of actions), and stay the conservative course with high confidence of accomplishing the mission while not creating collateral damage. I think that what you and I are trying to say is that all this is quite different from the terminator scenarios which some people in the media are brandishing. I also agree with you that there are lots of unknown unknowns about the strengths and weaknesses of future AIs, but I think that it is not too early to start thinking about these issues.

Yoshua Bengio: @Yuri Barzov the answer to your question: no. But we don't know that it is not feasible either, and we have reasons to believe that (a) it is not for tomorrow such machines will exist and (b) we have intellectual tools which may lead to solutions. Or maybe not!

Stuart Russell: Yann's comment "Facebook stopped maximizing clickthroughs several years ago and stopped using the time spent in the app as a criterion about 2 years ago" makes my point for me. Why did they stop doing it? Because it was the wrong objective function. Yann says we'd have to be "extremely stupid" to put the wrong objective into a super-powerful machine. Facebook's platform is not super-smart but it is super-powerful, because it connects with billions of people for hours every day. And yet they put the wrong objective function into it. QED. Fortunately they were able to reset it, but unfortunately one has to assume it's still optimizing a fixed objective. And the fact that it's operating within a large corporation that's designed to maximize another fixed objective - profit - means we cannot switch it off.

Stuart Russell: Regarding "externalities" - when talking about externalities, economists are making essentially the same point I'm making: externalities are the things not stated in the given objective function that get damaged when the system optimizes that objective function. In the case of the atmosphere, it's relatively easy to measure the amount of pollution and charge for it via taxes or fines, so correcting the problem is possible (unless the offender is too powerful). In the case of manipulation of human preferences and information states, it's very hard to assess costs and impose taxes or fines. The theory of uncertain objectives suggests instead that systems be designed to be "minimally invasive", i.e., don't mess with parts of the world state whose value is unclear. In particular, as a general rule it's probably best to avoid using fixed-objective reinforcement learning in human-facing systems, because the reinforcement learner will learn how to manipulate the human to maximize its objective.

Stuart Russell: @Yann LeCun Let's talk about climate change for a change. Many argue that it's an existential or near-existential threat to humanity. Was it "explicitly designed" as such? We created the corporation, which is a fixed-objective maximizer. The purpose was not to create an existential risk to humanity. Fossil-fuel corporations became super-powerful and, in certain relevant senses, super-intelligent: they anticipated and began planning for global warming five decades ago, executing a campaign that outwitted the rest of the human race. They didn't win the academic argument but they won in the real world, and the human race lost. I just attended an NAS meeting on climate control systems, where the consensus was that it was too dangerous to develop, say, solar radiation management systems - not because they might produce unexpected disastrous effects but because the fossil fuel corporations would use their existence as a further form of leverage in their so-far successful campaign to keep burning more carbon.

Stuart Russell: @Yann LeCun This seems to be a very weak argument. The objection raised by Omohundro and others who discuss instrumental goals is aimed at any system that operates by optimizing a fixed, known objective; which covers pretty much all present-day AI systems. So the issue is: what happens if we keep to that general plan - let's call it the "standard model" - and improve the capabilities for the system to achieve the objective? We don't need to know today *how* a future system achieves objectives more successfully, to see that it would be problematic. So the proposal is, don't build systems according to the standard model.

Yann LeCun: @Stuart Russell the problem is that essentially no AI system today is autonomous.

They are all trained *in advance* to optimize an objective, and subsequently execute the task with no regard to the objective, hence with no way to spontaneously deviate from the original behavior.

As of today, as far as I can tell, we do *not* have a good design for an autonomous machine, driven by an objective, capable of coming up with new strategies to optimize this objective in the real world.

We have plenty of those in games and simple simulations. But the learning paradigms are way too inefficient to be practical in the real world.

Yuri Barzov: @Yoshua Bengio yes. If we frame the problem correctly we will be able to resolve it. AI puts natural intelligence into focus like a magnifying mirror

Yann LeCun: @Stuart Russell in pretty much everything that society does (business, government, or whatever), behaviors are shaped through incentives, penalties via contracts, regulations and laws (let's call them collectively the objective function), which are proxies for the metric that needs to be optimized.

Because societies are complex systems, because humans are complex agents, and because conditions evolve, it is a requirement that the objective function be modifiable to correct unforeseen negative effects, loopholes, inefficiencies, etc.

The Facebook story is unremarkable in that respect: when bad side effects emerge, measures are taken to correct them. Often, these measures eliminate bad actors by directly changing their economic incentive (e.g. removing the economic incentive for clickbait).

Perhaps we agree on the following:

(0) not all consequences of a fixed set of incentives can be predicted.

(1) because of that, objective functions must be updatable.

(2) they must be updated to correct bad effects whenever they emerge.

(3) there should be an easy way to train minor aspects of objective functions through simple interaction (similar to the process of educating children), as opposed to programmatic means.

Perhaps where we disagree is the risk of inadvertently producing systems with badly-designed and (somehow) un-modifiable objectives that would be powerful enough to constitute existential threats.

Yoshua Bengio: @Yann LeCun this is true, but one aspect which concerns me (and others) is the gradual increase in power of some agents (now mostly large companies and some governments, potentially some AI systems in the future). When it was just weak humans the cost of mistakes or value misalignment (improper laws, misaligned objective function) was always very limited and local. As we build more and more powerful and intelligent tools and organizations, (1) it becomes easier to cheat for 'smarter' agents (exploit the misalignment) and (2) the cost of these misalignments becomes greater, potentially threatening the whole of society. This then does not leave much time and warning to react to value misalignment.


AI Alignment Open Thread November 2019

LessWrong.com News - October 4, 2019 - 04:28
Published on October 4, 2019 1:28 AM UTC

Continuing the experiment from August, let's try another open thread for AI Alignment discussion. The goal is to be a place where researchers and upcoming researchers can ask small questions they are confused about, share early-stage ideas, and have lower-key discussions.


[Link] Tools for thought (Matuschak & Nielsen)

LessWrong.com News - October 4, 2019 - 03:42
Published on October 4, 2019 12:42 AM UTC

https://numinous.productions/ttft (archive)

An excerpt:

We're often asked: why don't you work on AGI or [brain-computer interfaces (BCI)] instead of tools for thought? Aren't those more important and more exciting? And for AGI, in particular, many of the skills required seem related.

They certainly are important and exciting subjects. What's more, at present AGI and BCI are far more fashionable (and better funded). As a reader, you may be rolling your eyes, supposing our thinking here is pre-determined: we wouldn't be writing this essay if we didn't favor work on tools for thought. But these are questions we've wrestled hard with in deciding how to spend our own lives. One of us wrote a book about artificial intelligence before deciding to focus primarily on tools for thought; it was not a decision made lightly, and it's one he revisits from time to time. Indeed, given the ongoing excitement about AGI and BCI, it would be surprising if people working on tools for thought didn't regularly have a little voice inside their head saying “hey, shouldn't you be over there instead?” Fashion is seductive.

One striking difference is that AGI and BCI are based on relatively specific, well-defined goals. By contrast, work on tools for thought is much less clearly defined. For the most part we can't point to well-defined, long-range goals; rather, we have long-range visions and aspirations, almost evocations. The work is really about exploration of an open-ended question: how can we develop tools that change and expand the range of thoughts human beings can think?

Culturally, tech is dominated by an engineering, goal-driven mindset. It's much easier to set KPIs, evaluate OKRs, and manage deliverables, when you have a very specific end-goal in mind. And so it's perhaps not surprising that tech culture is much more sympathetic to AGI and BCI as overall programs of work.

But historically it's not the case that humanity's biggest breakthroughs have come about in this goal-driven way. The creation of language – the ur tool for thought – is perhaps the most important occurrence of humanity's existence. And although the origin of language is hotly debated and uncertain, it seems extremely unlikely to have been the result of a goal-driven process. It's amusing to try imagining some prehistoric quarterly OKRs leading to the development of language. What sort of goals could one possibly set? Perhaps a quota of new irregular verbs? It's inconceivable!

Similarly, the invention of other tools for thought – writing, the printing press, and so on – are among our greatest ever breakthroughs. And, as far as we know, all emerged primarily out of open-ended exploration, not in a primarily goal-driven way. Even the computer itself came out of an exploration that would be regarded as ridiculously speculative and poorly-defined in tech today. Someone didn't sit down and think “I need to invent the computer”; that's not a thought they had any frame of reference for. Rather, pioneers such as Alan Turing and Alonzo Church were exploring extremely basic and fundamental (and seemingly esoteric) questions about logic, mathematics, and the nature of what is provable. Out of those explorations the idea of a computer emerged, after many years; it was a discovered concept, not a goal.

Fundamental, open-ended questions seem to be at least as good a source of breakthroughs as goals, no matter how ambitious. This is difficult to imagine or convince others of in Silicon Valley's goal-driven culture. Indeed, we ourselves feel the attraction of a goal-driven culture. But empirically open-ended exploration can be just as, or more, successful.


To Be Decided #1

LessWrong.com News - October 4, 2019 - 02:22
Published on October 3, 2019 7:30 PM UTC

(Preface: This is the first edition of a quarterly email newsletter I started earlier this year called To Be Decided. I'm posting this as an experiment; if response here is positive, I'll post the two issues that have gone out since then as well as future issues as they come out. Feedback welcome!)

Welcome to the inaugural edition of To Be Decided, a quarterly newsletter about smarter decisions for a better world! TBD is all about deploying knowledge for impact, learning at scale, and making more thoughtful choices for ourselves and our organizations. Each edition will feature short and sweet reviews of important publications you don't want to miss but don't have time to read, along with a brief roundup of major developments in the world of learning and decision-making since last time.

Why Your Hard Work Sits on the Shelf—and What to Do About It

We've all been there. The time when the client seemed to forget the project ever happened as soon as the final check was cut. The time when your report stuffed full of creative recommendations got buried by risk-averse leadership. The time when stakeholders really did seem engaged by the findings, had lots of conversations, and then...nothing changed.

If you suspect these stories are more the rule than the exception, the evidence suggests you're right. And if the trend continues, chances are it's eventually going to catch up to those of us who generate and spread knowledge in the social sector. If we really want our work to be useful, we have to continue supporting decision-makers after the final report is delivered, working hand-in-hand with them to ensure whatever choices they make take into account not only the best information available but also other factors that matter to them, including their values, goals, and perceived obligations. For this reason, knowledge providers who want to see their work have greater impact might find value in partnering with a decision consultant in the form of a "wrap-around" service for knowledge initiatives.

(Keep reading)

What I've Been Reading

Rethinking the Purpose of Measurement
Measurement is not a simple act of observation disconnected from any larger plan. Instead, it’s an optimization strategy for reducing uncertainty about decisions we need to make. That’s the central argument of Douglas Hubbard’s How to Measure Anything: Finding the Value of “Intangibles” in Business, which remains one of the most important books on decision-making I’ve read since first encountering it more than seven years ago. This revolutionary reframing argues that measurement can only have value if it can reduce uncertainty about a decision that matters. It points toward an ultra-applied approach to evaluation and research that would represent a radical departure from the way these functions operate at most organizations today.
(Full review | Twitter thread)

Funders Learn Mostly from Each Other. Is that Dangerous?
"Peer to Peer: At the Heart of Influencing More Effective Philanthropy," commissioned by the Hewlett Foundation with the goal of understanding how foundations access and use knowledge, raises the question of whether there are enough intellectually curious foundation leaders who both keep tabs on new studies and reports as they come out and proactively share that knowledge with their peers. (Twitter thread)

Stuff You Should Know About
  • The very same day last December that negotiations to avoid the longest government shutdown in US history fell apart, President Trump signed into law one of the most important pieces of government performance legislation in 25 years. Among other reforms, it directs federal agencies to develop public learning agendas and hire senior evaluation officers. As improbable as it may seem, the Foundations for Evidence-Based Policymaking Act was passed with broad bipartisan support by a Republican Congress following recommendations from an Obama-era presidential commission. (Side note: props to the Bipartisan Policy Center's Nick Hart for braving Reddit to host a rowdy Ask Me Anything on this topic.)
  • In a bid to accelerate the open science movement, the University of California system has declined to renew its $10 million annual contract with Elsevier, the world's largest publisher of scholarly research.
  • The Open Philanthropy Project, one of the most interesting funders in the world right now, has placed its biggest bet to date: a $55 million grant to help establish the new Center for Security and Emerging Technology at Georgetown University. The center, which will focus extensively on heading off threats from advanced artificial intelligence, will be headed by Jason Matheny, former director of the Intelligence Advanced Research Projects Activity (IARPA) program at the US Office of the Director of National Intelligence. Fun fact: while at IARPA, Matheny managed the prediction tournament that helped establish the empirical basis for the advanced techniques described in Philip Tetlock's popular book Superforecasting. (More about forecasting in a future TBD.)

That's all for now!

If you enjoyed this edition of TBD, please consider forwarding it to a friend. It's easy to sign up here. See you next time!


Long-Term Future Fund: August 2019 grant recommendations

LessWrong.com News - October 3, 2019 - 23:41
Published on October 3, 2019 8:41 PM UTC

Note: The Q4 deadline for applications to the Long-Term Future Fund is Friday 11th October. Apply here.

We opened up an application for grant requests earlier this year, and it was open for about one month. This post contains the list of grant recipients for Q3 2019, as well as some of the reasoning behind the grants. Most of the funding for these grants has already been distributed to the recipients.

In the writeups below, we explain the purpose for each grant and summarize our reasoning behind their recommendation. Each summary is written by the fund manager who was most excited about recommending the relevant grant (with a few exceptions that we've noted below). These differ a lot in length, based on how much available time the different fund members had to explain their reasoning.

When we’ve shared excerpts from an application, those excerpts may have been lightly edited for context or clarity.

Grant Recipients

Grants Made By the Long-Term Future Fund

Each grant recipient is followed by the size of the grant and their one-sentence description of their project. All of these grants have been made.

  • Samuel Hilton, on behalf of the HIPE team ($60,000): Placing a staff member within the government, to support civil servants to do the most good they can.
  • Stag Lynn ($23,000): To spend the next year leveling up various technical skills with the goal of becoming more impactful in AI safety.
  • Roam Research ($10,000): Workflowy, but with much more power to organize your thoughts and collaborate with others.
  • Alexander Gietelink Oldenziel ($30,000): Independent AI Safety thinking, doing research on aspects of self-reference using techniques from type theory, topos theory, and category theory more generally.
  • Alexander Siegenfeld ($20,000): Characterizing the properties and constraints of complex systems and their external interactions.
  • Sören Mindermann ($36,982): Additional funding for an AI strategy PhD at Oxford / FHI to improve my research productivity.
  • AI Safety Camp ($41,000): A research experience program for prospective AI safety researchers.
  • Miranda Dixon-Luinenburg ($13,500): Writing EA-themed fiction that addresses X-risk topics.
  • David Manheim ($30,000): Multi-model approach to corporate and state actors relevant to existential risk mitigation.
  • Joar Skalse ($10,000): Upskilling in ML in order to be able to do productive AI safety research sooner than otherwise.
  • Chris Chambers ($36,635): Combat publication bias in science by promoting and supporting the Registered Reports journal format.
  • Jess Whittlestone ($75,080): Research on the links between short- and long-term AI policy while skilling up in technical ML.
  • Lynette Bye ($23,000): Productivity coaching for effective altruists to increase their impact.

Total distributed: $439,197

Other Recommendations

Sometimes, applicants get alternative sources of funding, or decide to work on a different project.

The following people and organizations were applicants of this kind. The Long-Term Future Fund recommended grants to them, but did not end up funding them. We sometimes create write-ups for these applicants and include them in our reports in order to provide readers with better information on the types of grants we like to recommend.

  • Center for Applied Rationality ($150,000): Help promising people to reason more effectively and find high-impact work, such as reducing x-risk.

Two grants we recommended but did not write up:

  • Jake Coble, who requested $10,000 to do some work with Simon Beard of CSER. This grant request came with an early deadline, so we made the recommendation earlier in the grant cycle. However, after our recommendation went out, Jake found a different project he preferred, and no longer required funding.
  • We recommended another individual for a grant, but they wound up accepting funding from another source. (They requested that we not share their name; we would have shared this information had they received funding from us.)
Writeups by Helen Toner

Samuel Hilton, on behalf of the HIPE team ($60,000)

Placing a staff member within the government, to support civil servants to do the most good they can.

This grant supports HIPE (https://hipe.org.uk), a UK-based organization that helps civil servants to have high-impact careers. HIPE’s primary activities are researching how to have a positive impact in the UK government; disseminating their findings via workshops, blog posts, etc.; and providing one-on-one support to interested individuals.

HIPE has so far been entirely volunteer-run. This grant funds part of the cost of a full-time staff member for two years, plus some office and travel costs.

Our reasoning for making this grant is based on our impression that HIPE has already been able to gain some traction as a volunteer organization, and on the fact that they now have the opportunity to place a full-time staff member within the Cabinet Office. We see this both as a promising opportunity in its own right, and also as a positive signal about the engagement HIPE has been able to create so far. The fact that the Cabinet Office is willing to provide desk space and cover part of the overhead cost for the staff member suggests that HIPE is engaging successfully with its core audiences.

HIPE does not yet have robust ways of tracking its impact, but they expressed strong interest in improving their impact tracking over time. We would hope to see a more fleshed-out impact evaluation if we were asked to renew this grant in the future.

I’ll add that I (Helen) personally see promise in the idea of services that offer career discussion, coaching, and mentoring in more specialized settings. (Other fund members may agree with this, but it was not part of our discussion when deciding whether to make this grant, so I’m not sure.)

Writeups by Alex Zhu

Stag Lynn ($23,000)

To spend the next year leveling up various technical skills with the goal of becoming more impactful in AI safety.

Stag’s current intention is to spend the next year improving his skills in a variety of areas (e.g. programming, theoretical neuroscience, and game theory) with the goal of contributing to AI safety research, meeting relevant people in the x-risk community, and helping out in EA/rationality-related contexts wherever he can (e.g., at rationality summer camps like SPARC and ESPR).

Two projects he may pursue during the year:

  • Working to implement certificates of impact in the EA/X-risk community, in the hope of encouraging coordination between funders with different values and increasing transparency around the contributions of different people to impactful projects.
  • Working as an unpaid personal assistant to someone in EA who is sufficiently busy for this form of assistance to be useful, and sufficiently productive for the assistance to be valuable.

I recommended funding Stag because I think he is smart, productive, and altruistic, has a track record of doing useful work, and will contribute more usefully to reducing existential risk by directly developing his capabilities and embedding himself in the EA community than he would by finishing his undergraduate degree or working a full-time job. While I’m not yet clear on what projects he will pursue, I think it’s likely that the end result will be very valuable — projects like impact certificates require substantial work from someone with technical and executional skills, and Stag seems to me to fit the bill.

More on Stag’s background: In high school, Stag had top finishes in various Latvian and European Olympiads, including a gold medal in the 2015 Latvian Olympiad in Mathematics. Stag has also previously taken the initiative to work on EA causes -- for example, he joined two other people in Latvia in attempting to create the Latvian chapter of Effective Altruism (which reached the point of creating a Latvian-language website), and he has volunteered to take on major responsibilities in future iterations of the European Summer Program in Rationality (which introduces promising high-school students to effective altruism).

Potential conflict of interest: at the time of making the grant, Stag was living with me and helping me with various odd jobs, as part of his plan to meet people in the EA community and help out where he could. This arrangement lasted for about 1.5 months. To compensate for this potential issue, I’ve included notes on Stag from Oliver Habryka, another fund manager.

Oliver Habryka’s comments on Stag Lynn

I’ve interacted with Stag in the past and have broadly positive impressions of him, in particular his capacity for independent strategic thinking.

Stag has achieved a high level of success in Latvian and Galois Mathematical Olympiads. I generally think that success in these competitions is one of the best predictors we have of a person’s future performance on making intellectual progress on core issues in AI safety. See also my comments and discussion on the grant to Misha Yagudin last round.

Stag has also contributed significantly to improving both ESPR and SPARC, both of which introduce talented pre-college students to core ideas in EA and AI safety. In particular, he’s helped the programs find and select strong participants, while suggesting curriculum changes that gave them more opportunities to think independently about important issues. This gives me a positive impression of Stag’s ability to contribute to other projects in the space. (I also consider ESPR and SPARC to be among the most cost-effective ways to get more excellent people interested in working on topics of relevance to the long-term future, and take this as another signal of Stag’s talent at selecting and/or improving projects.)

Roam Research ($10,000)

Workflowy, but with much more power to organize your thoughts and collaborate with others.

Roam is a web application which automates the Zettelkasten method, a note-taking / document-drafting process based on physical index cards. While it is difficult to start using the system, those who do often find it extremely helpful, including a researcher at MIRI who claims that the method doubled his research productivity.

On my inside view, if Roam succeeds, an experienced user of the note-taking app Workflowy will get at least as much value switching to Roam as they got from using Workflowy in the first place. (Many EAs, myself included, see Workflowy as an integral part of our intellectual process, and I think Roam might become even more integral than Workflowy. See also Sarah Constantin’s review of Roam, which describes Roam as being potentially as “profound a mental prosthetic as hypertext”, and her more recent endorsement of Roam.)

Over the course of the last year, I’ve had intermittent conversations with Conor White-Sullivan, Roam’s CEO, about the app. I started out in a position of skepticism: I doubted that Roam would ever have active users, let alone succeed at its stated mission. After a recent update call with Conor about his LTF Fund application, I was encouraged enough by Roam’s most recent progress, and sufficiently convinced of the possible upsides of its possible success, that I decided to recommend a grant to Roam.

Since then, Roam has developed enough as a product that I’ve personally switched from Workflowy to Roam and now recommend Roam to my friends. Roam’s progress on its product, combined with its growing base of active users, has led me to feel significantly more optimistic about Roam succeeding at its mission.

(This funding will support Roam’s general operating costs, including expenses for Conor, one employee, and several contractors.)

Potential conflict of interest: Conor is a friend of mine, and I was once his housemate for a few months.

Alexander Gietelink Oldenziel ($30,000)

Independent AI Safety thinking, doing research on aspects of self-reference using techniques from type theory, topos theory, and category theory more generally.

In our previous round of grants, we funded MIRI as an organization: see our April report for a detailed explanation of why we chose to support their work. I think Alexander’s research directions could lead to significant progress on MIRI’s research agenda — in fact, MIRI was sufficiently impressed by his work that they offered him an internship. I have also spoken to him in some depth, and was impressed both by his research taste and clarity of thought.

After the internship ends, I think it will be valuable for Alexander to have additional funding to dig deeper into these topics; I expect this grant to support roughly 1.5 years of research. During this time, he will have regular contact with researchers at MIRI, reporting on his research progress and receiving feedback.

Alexander Siegenfeld ($20,000)

Characterizing the properties and constraints of complex systems and their external interactions.

Alexander is a 5th-year graduate student in physics at MIT, and he wants to conduct independent deconfusion research for AI safety. His goal is to get a better conceptual understanding of multi-level world models by coming up with better formalisms for analyzing complex systems at differing levels of scale, building off of the work of Yaneer Bar-Yam. (Yaneer is Alexander’s advisor, and the president of the New England Complex Systems Institute.)

I decided to recommend funding to Alexander because I think his research directions are promising, and because I was personally impressed by his technical abilities and his clarity of thought. Tsvi Benson-Tilsen, a MIRI researcher, was also impressed enough by Alexander to recommend that the Fund support him. Alexander plans to publish a paper on his research; it will be evaluated by researchers at MIRI, helping him decide how best to pursue further work in this area.

Potential conflict of interest: Alexander and I have been friends since our undergraduate years at MIT.

Writeups by Oliver Habryka

I have a sense that funders in EA, usually due to time constraints, tend to give little feedback to organizations they fund (or decide not to fund). In my writeups below, I tried to be as transparent as possible in explaining why I came to believe that each grant was a good idea, my greatest uncertainties and/or concerns with each grant, and some background models I use to evaluate grants. (I hope this last item will help others better understand my future decisions in this space.)

I think that there exist more publicly defensible (or easier to understand) arguments for some of the grants that I recommended. However, I tried to explain the actual models that drove my decisions for these grants, which are often hard to summarize in a few paragraphs. I apologize in advance that some of the explanations below are probably difficult to understand.

Thoughts on grant selection and grant incentives

Some higher-level points on many of the grants below, as well as many grants from last round:

For almost every grant we make, I have a lot of opinions and thoughts about how the applicant(s) could achieve their aims better. I also have a lot of ideas for projects that I would prefer to fund over the grants we are actually making.

However, in the current structure of the LTFF, I primarily have the ability to select potential grantees from an established pool, rather than encouraging the creation of new projects. Alongside my time constraints, this means that I have a very limited ability to contribute to the projects with my own thoughts and models.

Additionally, I spend a lot of time thinking independently about these areas, and have a broad view of “ideal projects that could be made to exist.” This means that for many of the grants I am recommending, it is not usually the case that I think the projects are very good on all the relevant dimensions; I can see how they fall short of my “ideal” projects. More frequently, the projects I fund are among the only available projects in a reference class I believe to be important, and I recommend them because I want projects of that type to receive more resources (and because they pass a moderate bar for quality).

Some examples:

  • Our grant to the Kocherga community space last round. I see Kocherga as the only promising project trying to build infrastructure that helps people pursue projects related to x-risk and rationality in Russia.
  • I recommended this round’s grant to Miranda partly because I think Miranda's plans are good and I think her past work in this domain and others is of high quality, but also because she is the only person who applied with a project in a domain that seems promising and neglected (using fiction to communicate otherwise hard-to-explain ideas relating to x-risk and how to work on difficult problems).
  • In the November 2018 grant round, I recommended a grant to Orpheus Lummis to run an AI safety unconference in Montreal. This is because I think he had a great idea, and would create a lot of value even if he ran the events only moderately well. This isn’t the same as believing Orpheus has excellent skills in the relevant domain; I can imagine other applicants who I’d have been more excited to fund, had they applied.

I am, overall, still very excited about the grants below, and I think they are a much better use of resources than the most common counterfactuals to donating to the LTFF (e.g. donating to the largest organizations in the space, or donating based on time-limited personal research).

However, related to the points I made above, I will have many criticisms of almost all the projects that receive funding from us. I think that my criticisms are valid, but readers shouldn't interpret them to mean that I have a negative impression of the grants we are making — which are strong despite their flaws. Aggregating my individual (and frequently critical) recommendations will not give readers an accurate impression of my overall (highly positive) view of the grant round.

(If I ever come to think that the pool of valuable grants has dried up, I will say so in a high-level note like this one.)

I can imagine that in the future I might want to invest more resources into writing up lists of potential projects that I would be excited about. That said, it is not clear to me that I want people to optimize too heavily for what I am excited about; the current balance of projects that I think are exciting, and that people feel internally motivated to do and have generated their own plans for, seems pretty decent.

To follow up the above with a high-level assessment, I am slightly less excited about this round’s grants than I am about last round’s, and I’d estimate (very roughly) that this round is about 25% less cost-effective than the previous round.


For both this round and the last round, I wrote the writeups in collaboration with Ben Pace, who works with me on LessWrong and the Alignment Forum. After an extensive discussion about the grants and the Fund's reasoning for them, we split the grants between us and independently wrote initial drafts. We then iterated on those drafts until they accurately described my thinking about them and the relevant domains.

I am also grateful for Aaron Gertler’s help with editing and refining these writeups, which has substantially increased their clarity.

Sören Mindermann ($36,982)

Additional funding for an AI strategy PhD at Oxford / FHI to improve my research productivity.

From the application:

"I'm looking for additional funding to supplement my 15k pound/y PhD stipend for 3-4 years from September 2019. I am hoping to roughly double this. My PhD is at Oxford in machine learning, but co-supervised by Allan Dafoe from FHI so that I can focus on AI strategy. We will have multiple joint meetings each month, and I will have a desk at FHI.

The purpose is to increase my productivity and happiness. Given my expected financial situation, I currently have to make compromises on e.g. Ubers, Soylent, eating out with colleagues, accommodation, quality and waiting times for health care, spending time comparing prices, travel durations and stress, and eating less healthily.

I expect that more financial security would increase my own productivity and the effectiveness of the time invested by my supervisors."

I think that when FHI or other organizations in that reference class have trouble doing certain things due to logistical obstacles, we should usually step in and fill those gaps (e.g. see Jacob Lagerros’ grant from last round). My sense is that FHI has trouble with providing funding in situations like this (due to budgetary constraints imposed by Oxford University).

I’ve interacted with Sören in the past (during my work at CEA), and generally have positive impressions of him in a variety of domains, like his basic thinking about AI Alignment, and his general competence from running projects like the EA Newsletter.

I have a lot of trust in the judgment of Nick Bostrom and several other researchers at FHI. I am not currently very excited about the work at GovAI (the team that Allan Dafoe leads), but still have enough trust in many of the relevant decision makers to think that it is very likely that Sören should be supported in his work.

In general, I think many of the salaries for people working on existential risk are low enough that they have to make major tradeoffs in order to deal with the resulting financial constraints. I think that increasing salaries in situations like this is a good idea (though I am hesitant about increasing salaries for other types of jobs, for a variety of reasons I won’t go into here, but am happy to expand on).

This funding should last for about 2 years of Sören’s time at Oxford.

AI Safety Camp ($41,000)

A research experience program for prospective AI safety researchers.

We want to organize the 4th AI Safety Camp (AISC) - a research retreat and program for prospective AI safety researchers. Compared to past iterations, we plan to change the format to include a 3 to 4-day project generation period and team formation workshop, followed by a several-week period of online team collaboration on concrete research questions, a 6 to 7-day intensive research retreat, and ongoing mentoring after the camp. The target capacity is 25-30 participants, with projects that range from technical AI safety (majority) to policy and strategy research. More information about past camps is at https://aisafetycamp.com/

[...]

Early-career entry stage seems to be a less well-covered part of the talent pipeline, especially in Europe. Individual mentoring is costly from the standpoint of expert advisors (esp. compared to guided team work), while internships and e.g. MSFP have limited capacity and are US-centric. After the camp, we advise and encourage participants on future career steps and help connect them to other organizations, or direct them to further individual work and learning if they are pursuing an academic track.

Overviews of previous research projects from the first 2 camps can be found here:

1 - http://bit.ly/2FFFcK1
2 - http://bit.ly/2KKjPLB

Projects from AISC3 are still in progress and there is no public summary.

To evaluate the camp, we send out an evaluation form directly after the camp has concluded and then informally follow the career decisions, publications, and other AI safety/EA involvement of the participants. We plan to conduct a larger survey of past AISC participants later in 2019 to evaluate our mid-term impact. We expect to get a more comprehensive picture of the impact, but it is difficult to evaluate counterfactuals and indirect effects (e.g. networking effects).

The (anecdotal) positive examples we attribute to past camps include the accelerated entrance of several people into the field, and research outputs that include 2 conference papers, several software projects, and about 10 blog posts.

The main direct costs of the camp are the opportunity costs of participants, organizers and advisors. There are also downside risks associated with personal conflicts at multi-day retreats, and with discouraging capable people from the field if the camp is run poorly. We actively work to prevent this by providing both on-site and external anonymous contact points, as well as actively attending to participant well-being, including during the online phases.

This grant is for the AI Safety Camp, to which we made a grant in the last round. Of the grants I recommended this round, I am most uncertain about this one. The primary reason is that I have not received much evidence about the performance of either of the last two camps[1], and I assign at least some probability that the camps are not facilitating very much good work. (This is mostly because I have low expectations for the quality of most work of this kind and haven’t looked closely enough at the camp to override these — not because I have positive evidence that they produce low-quality work.)

My biggest concern is that the camps do not provide a sufficient level of feedback and mentorship for the attendees. When I try to predict how well I’d expect a research retreat like the AI Safety Camp to go, much of the impact hinges on putting attendees into contact with more experienced researchers and having a good mentoring setup. Some of the problems I have with the output from the AI Safety Camp seem like they could be explained by a lack of mentorship.

From the evidence I observe on their website, I see that the attendees of the second camp all produced an artifact of their research (e.g. an academic writeup or code repository). I think this is a very positive sign. That said, it doesn't look like any alignment researchers have commented on any of this work (perhaps in part because most of it was presented in formats that take a lot of time to engage with, such as GitHub repositories), so I'm not sure the output actually led to the participants getting any feedback on their research directions, which is one of the most important things for people new to the field.

After some followup discussion with the organizers, I heard about changes to the upcoming camp (the target of this grant) that address some of the above concerns (independent of my feedback). In particular, the camp is being renamed to “AI Safety Research Program”, and is now split into two parts — a topic selection workshop and a research retreat, with experienced AI Alignment researchers attending the workshop. The format change seems likely to be a good idea, and makes me more optimistic about this grant.

I generally think hackathons and retreats for researchers can be very valuable, allowing for focused thinking in a new environment. I think the AI Safety Camp is held at a relatively low cost, in a part of the world (Europe) where there exist few other opportunities for potential new researchers to spend time thinking about these topics, and some promising people have attended. I hope that the camps are going well, but I will not fund another one without spending significantly more time investigating the program.


[1] After signing off on this grant, I found out that, due to overlap between the organizers of the events, some feedback I got about this camp was actually feedback about the Human Aligned AI Summer School, which means that I had even less information than I thought. In April I said I wanted to talk with the organizers before renewing this grant, and I expected to have at least six months between applications from them, but we received another application this round and I ended up not having time for that conversation.

Miranda Dixon-Luinenburg ($13,500)

Writing EA-themed fiction that addresses X-risk topics.

I want to spend three months evaluating my ability to produce an original work that explores existential risk, rationality, EA, and related themes such as coordination between people with different beliefs and backgrounds, handling burnout, planning on long timescales, growth mindset, etc. I predict that completing a high-quality novel of this type would take ~12 months, so 3 months is just an initial test. In 3 months, I would hope to produce a detailed outline of an original work plus several completed chapters. Simultaneously, I would be evaluating whether writing full-time is a good fit for me in terms of motivation and personal wellbeing.

[...]

I have spent the last 2 years writing an EA-themed fanfiction of The Last Herald-Mage trilogy by Mercedes Lackey (online at https://archiveofourown.org/series/936480). In this period I have completed 9 "books" of the series, totalling 1.2M words (an average of 60K words/month), mostly while I was also working full-time. (I am currently writing the final arc, and when I finish, I hope to create a shorter abridged/edited version with a more solid beginning and better pacing overall.)

In the writing process, I researched key background topics, in particular AI safety work (I read a number of Arbital articles and most of this MIRI paper on decision theory: https://arxiv.org/pdf/1710.05060v1.pdf), as well as ethics, mental health, organizational best practices, medieval history and economics, etc. I have accumulated a very dedicated group of around 10 beta readers, all EAs, who read early drafts of each section and give feedback on how well it addresses various topics, which gives me more confidence that I am portraying these concepts accurately.

One natural decomposition of whether this grant is a good idea is to first ask whether writing fiction of this type is valuable, then whether Miranda is capable of actually creating that type of fiction, and last whether funding Miranda will make a significant difference in the amount/quality of her fiction.

I think that many people reading this will be surprised or confused about this grant. I feel fairly confident that grants of this type are well worth considering, and I am interested in funding more projects like this in the future, so I’ve tried my best to summarize my reasoning. I do think there are some good arguments for why we should be hesitant to do so (partly summarized by the section below that lists things that I think fiction doesn’t do as well as non-fiction), so while I think that grants like this are quite important, and have the potential to do a significant amount of good, I can imagine changing my mind about this in the future.

The track record of fiction

In a general sense, I think that fiction has a pretty strong track record of both being successful at conveying important ideas, and being a good attractor of talent and other resources. I also think that good fiction is often necessary to establish shared norms and shared language.

Here are some examples of communities and institutions that I think used fiction very centrally in their function. Note that after the first example, I am making no claim that the effect was good, I’m just establishing the magnitude of the potential effect size.

  • Harry Potter and the Methods of Rationality (HPMOR) was instrumental in the growth and development of both the EA and Rationality communities. It is very likely the single most important recruitment mechanism for productive AI alignment researchers, and has also drawn many other people to work on the broader aims of the EA and Rationality communities.
  • Fiction was a core part of the strategy of the neoliberal movement; fiction writers were among the groups referred to by Hayek as "secondhand dealers in ideas.” An example of someone whose fiction played both a large role in the rise of neoliberalism and in its eventual spread would be Ayn Rand.
  • Almost every major religion, culture and nation-state is built on shared myths and stories, usually fictional (though the stories are often held to be true by the groups in question, making this data point a bit more confusing).
  • Francis Bacon’s (unfinished) utopian novel “The New Atlantis” is often cited as the primary inspiration for the founding of the Royal Society, which may have been the single institution with the greatest influence on the progress of the scientific revolution.

On a more conceptual level, I think fiction tends to be particularly good at achieving the following aims (compared to non-fiction writing):

  • Teaching low-level cognitive patterns by displaying characters that follow those patterns, allowing the reader to learn from very concrete examples set in a fictional world. (Compare Aesop’s Fables to some nonfiction book of moral precepts — it can be much easier to remember good habits when we attach them to characters.)
  • Establishing norms, by having stories that display the consequences of not following certain norms, and the rewards of following them in the right way
  • Establishing a common language, by not only explaining concepts, but also showing concepts as they are used, and how they are brought up in conversational context
  • Establishing common goals, by creating concrete utopian visions of possible futures that motivate people to work towards them together
  • Reaching a broader audience, since we naturally find stories more exciting than abstract descriptions of concepts

(I wrote in more detail about how this works for HPMOR in the last grant round.)

In contrast, here are some things that fiction is generally worse at (though a lot of these depend on context; since fiction often contains embedded non-fiction explanations, some of these can be overcome):

  • Carefully evaluating ideas, in particular when evaluating them requires empirical data. There is a norm against showing graphs or tables in fiction books, making any explanation that rests on that kind of data difficult to access in fiction.
  • Conveying precise technical definitions
  • Engaging in dialogue with other writers and researchers
  • Dealing with topics in which readers tend to come to better conclusions by mentally distancing themselves from the problem at hand, instead of engaging with concrete visceral examples (I think some ethical topics like the trolley problem qualify here, as well as problems that require mathematical concepts that don’t neatly correspond to easy real-world examples)

Overall, I think current writing about existential risk, rationality, and effective altruism skews too much towards non-fiction, so I'm excited about experimenting with funding fiction writing.

Miranda’s writing

The second question is whether I trust Miranda to actually be able to write fiction that leverages these opportunities and provides value. This is why I think Miranda can do a good job:

  • Her current fiction project is read by a few people whose taste I trust, and many of them describe having developed valuable skills or insights as a result (for example, better skills for crisis management, a better conception of moral philosophy, an improved moral compass, and some insights about decision theory)
  • She wrote frequently on LessWrong and her blog for a few years, producing content of consistently high quality that, while not fictional, often displayed some of the same useful properties as fiction writing.
  • I’ve seen her execute a large variety of difficult projects outside of her writing, which means I am a lot more optimistic about things like her ability to motivate herself on this project, and excelling in the non-writing aspects of the work (e.g. promoting her fiction to audiences beyond the EA and rationality communities)
    • She worked in operations at CEA and received strong reviews from her coworkers
    • She helped CFAR run the operations for SPARC in two consecutive years and performed well as a logistics volunteer for 11 of their other workshops
    • I’ve seen her organize various events and provide useful help with logistics and general problem-solving on a large number of occasions

My two biggest concerns are:

  • Miranda losing motivation to work on this project, because writing fiction with a specific goal requires a significantly different motivation than doing it for personal enjoyment
  • The fiction being well-written and engaging, but failing to actually help people better understand the important issues it tries to cover.

I like the fact that this grant is for an exploratory 3 months rather than a longer period of time; this allows Miranda to pivot if it doesn’t work out, rather than being tied to a project that isn’t going well.

The counterfactual value of funding

It would be reasonable to ask whether a grant is really necessary, given that Miranda has produced a huge amount of fiction in the last two years without receiving funding explicitly dedicated to that. I have two thoughts here:

  1. I generally think that we should avoid declining to pay people just because they’d be willing to do valuable work for free. It seems good to reward people for work even if this doesn’t make much of a difference in the quality/consistency of the work, because I expect this promise of reward to help people build long-term motivation and encourage exploration.
    1. To explain this a bit more, I think this grant will help other people build motivation towards pursuing similar projects in the future, by setting a precedent for potential funding in this space. For example, I think the possibility of funding (and recognition) was also a motivator for Miranda in starting to work on this project.
  2. I expect this grant to have a significant effect on Miranda’s productivity, because I think that there is often a qualitative difference between work someone produces in their spare time and work that someone can focus full-time on. In particular, I expect this grant to cause Miranda’s work to improve in the dimensions that she doesn’t naturally find very stimulating, which I expect will include editing, restructuring, and other forms of “polish”.
David Manheim ($30,000)

Multi-model approach to corporate and state actors relevant to existential risk mitigation.

Work for 2-3 months on continuing to build out a multi-model approach to understanding international relations and multi-stakeholder dynamics as it relates to risks of strong(er) AI systems development, based on and extending similar work on biological weapons risks done on behalf of FHI's Biorisk group and supporting Open Philanthropy Project planning.

This work is likely to help policy and decision analysis for effective altruism related to the deeply uncertain and complex issues in international relations and long term planning that need to be considered for many existential risk mitigation activities. While the project is focused on understanding actors and motivations in the short term, the decisions being supported are exactly those that are critical for existential risk mitigation, with long term implications for the future.

I feel a lot of skepticism toward much of the work done in the academic study of international relations. Judging from my models of political influence and its effects on the quality of intellectual contributions, and my models of research fields with little ability to perform experiments, I have high priors that work in international relations is of significantly lower quality than in most scientific fields. However, I have engaged relatively little with actual research on the topic of international relations (outside of unusual scholars like Nick Bostrom) and so am hesitant in my judgement here.

I also have a fair bit of worry around biorisk. I haven’t really had the opportunity to engage with a good case for it, and neither have many of the people I would trust most in this space, in large part due to secrecy concerns from people who work on it (more on that below). Due to this, I am worried about information cascades. (An information cascade is a situation where people primarily share what they believe but not why, and because people update on each others' beliefs you end up with a lot of people all believing the same thing precisely because everyone else does.)

I think it is valuable to work on biorisk, but this view is mostly based on individual conversations that are hard to summarize, and I feel uncomfortable with my level of understanding of possible interventions, or even just of the conceptual frameworks I could use to approach the problem. I don’t know how most people who work in this space came to decide it was important, and those I’ve spoken to have usually been reluctant to share details in conversation (e.g. about specific discoveries they think created risk, or types of arguments that convinced them to focus on biorisk over other threats).

I’m broadly supportive of work done at places like FHI and by the people at OpenPhil who care about x-risks, so I am in favor of funding their work (e.g. Soren’s grant above). But I don’t feel as though I can defer to the people working in this domain on the object level when there is so much secrecy around their epistemic process, because I and others cannot evaluate their reasoning.

However, I am excited about this grant, because I have a good amount of trust in David’s judgment. To be more specific, he has a track record of identifying important ideas and institutions and then working on/with them. Some concrete examples include:

  • Wrote up a paper on Goodhart’s Law with Scott Garrabrant (after seeing Scott’s very terse post on it)
  • Works with the biorisk teams at FHI and OpenPhil
  • Completed his PhD in public policy and decision theory at the RAND Corporation, which is an unusually innovative institution (e.g. this study)
  • Writes interesting comments and blog posts on the internet (e.g. LessWrong)
  • Has offered mentoring in his fields of expertise to other people working on, or preparing to work on, projects in the x-risk space; I’ve heard positive feedback from his mentees

Another major factor for me is the degree to which David shares his thinking openly and transparently on the internet and participates in public discourse, so that other people interested in these topics can engage with his ideas. (He’s also a superforecaster, which I think is predictive of broadly good judgment.) If David didn’t have this track record of public discourse, I likely wouldn’t be recommending this grant, and if he suddenly stopped participating, I’d be fairly hesitant to recommend such a grant in the future.

As I said, I’m not excited about the specific project he is proposing, but have trust in his sense of which projects might be good to work on, and I have emphasized to him that I think he should feel comfortable working on the projects he thinks are best. I strongly prefer a world where David has the freedom to work on the projects he judges to be most valuable, compared to the world where he has to take unrelated jobs (e.g. teaching at university).

Joar Skalse ($10,000)

Upskilling in ML in order to be able to do productive AI safety research sooner than otherwise.

I am requesting grant money to upskill in machine learning (ML).

Background: I am an undergraduate student in Computer Science and Philosophy at Oxford University, about to start the 4th year of a 4-year degree. I plan to do research in AI safety after I graduate, as I deem this to be the most promising way of having a significant positive impact on the long-term future.

[...]

What I’d like to do: I would like to improve my skills in ML by reading literature and research, replicating research papers, building ML-based systems, and so on. To do this effectively, I need access to the compute that is required to train large models and run lengthy reinforcement learning experiments and similar. It would also likely be very beneficial if I could live in Oxford during the vacations, as I would then be in an environment in which it is easier to be productive. It would also make it easier for me to speak with the researchers there, and give me access to the facilities of the university (including libraries, etc.). It would also be useful to be able to attend conferences and similar events.

Joar was one of the co-authors on the Mesa-Optimisers paper, which I found surprisingly useful and clearly written, especially considering that its authors had relatively little background in alignment research or research in general. I think it is probably the second most important piece of writing on AI alignment that came out in the last 12 months, after the Embedded Agency sequence. My current best guess is that this type of conceptual clarification / deconfusion is the most important type of research in AI alignment, and the type of work I’m most interested in funding. While I don’t know exactly how Joar contributed to the paper, my sense is that all the authors put in a significant effort (bar Scott Garrabrant, who played a supervising role).

This grant is for projects during and in between terms at Oxford. I want to support Joar producing more of this kind of research, which I expect this grant to help with. He’s also been writing further thoughts online (example), which I think has many positive effects (personally and as externalities).

My brief thoughts on the paper (nontechnical):

  • The paper introduced me to a lot of terminology that I’ve continued to use over the past few months (which is not true for most terminology introduced in this space)
  • It helped me deconfuse my thinking on a bunch of concrete problems (in particular on the question of whether things like Alpha Go can be dangerous when “scaled up”)
  • I’ve seen multiple other researchers and thinkers I respect refer to it positively
  • In addition to being published as a paper, it was written up as a series of blogposts in a way that made it a lot more accessible

More of my thoughts on the paper (technical):

Note: If you haven’t read the paper, or you don’t have other background in the subject, this section will likely be unclear. It’s not essential to the case for the grant, but I wanted to share it in case people with the requisite background are interested in more details about the research.

I was surprised by how helpful the conceptual work in the paper was - helping me think about where the optimization was happening in a system like AlphaGo Zero improved my understanding of that system and how to connect it to other systems that do optimization in the world. The primary formalism in the paper was clarifying rather than obscuring (and the ratio of insight to formalism was very high - see my addendum below for more thoughts on that).

Once the basic concepts were in place, clarifying different basic tools that would encourage optimization to happen in either the base optimizer or the mesa optimizer (e.g. constraining and expanding space/time offered to the base or mesa optimizers has interesting effects), plus clarifying the types of alignment / pseudo-alignment / internalizing of the base objective, all helped me think about this issue very clearly. It largely used basic technical language I already knew, and put it together in ways that would’ve taken me many months to achieve on my own - a very helpful conceptual piece of work.

Further Writeups by Oliver Habryka

The following three grants were more exciting to one or more of the other fund managers than they were to me. For all three, if it had just been me on the grant committee, we might not have actually made them. However, I had more resources available to invest in these writeups, so I ended up summarizing my views on them instead of someone else on the fund doing so. As such, they are probably less representative of the reasons why we made these grants than the writeups above.

In the course of thinking through these grants, I formed (and wrote out below) more detailed, explicit models of the topics. Although these models were not counterfactual in the Fund’s making the grants, I think they are fairly predictive of my future grant recommendations.

Chris Chambers ($36,635)

Note: Application sent in by Jacob Hilton.

Combat publication bias in science by promoting and supporting the Registered Reports journal format.

I'm suggesting a grant to fund a teaching buyout for Professor Chris Chambers, an academic at the University of Cardiff working to promote and support Registered Reports. This funding opportunity was originally identified and researched by Hauke Hillebrandt, who published a full analysis here. In brief, a Registered Report is a format for journal articles where peer review and acceptance decisions happen before data is collected, so that the results are much less susceptible to publication bias. The grant would free Chris of teaching duties so that he can work full-time on trying to get Registered Reports to become part of mainstream science, which includes outreach to journal editors and supporting them through the process of adopting the format for their journal. More details of Chris's plans can be found here.

I think the main reason for funding this is from a worldview diversification perspective: I would expect it to broadly improve the efficiency of scientific research by improving the communication of negative results, and to enable people to make better-informed use of scientific research by reducing publication bias. I would expect these effects to be primarily within fields where empirical tests tend to be useful but not always definitive, such as clinical trials (one of Chris's focus areas), which would have knock-on effects on health.

From an X-risk perspective, the key question to answer seems to be which technologies differentially benefit from this grant. I do not have a strong opinion on this, but to quote Brian Wang from a Facebook thread: "In terms of [...] bio-risk, my initial thoughts are that reproducibility concerns in biology are strongest when it comes to biomedicine, a field that can be broadly viewed as defense-enabling. By contrast, I'm not sure that reproducibility concerns hinder the more fundamental, offense-enabling developments in biology all that much (e.g., the falling costs of gene synthesis, the discovery of CRISPR)."

As for why this particular intervention strikes me as a cost-effective way to improve science: it is shovel-ready, it may be the sort of thing that traditional funding sources would miss, it has been carefully vetted by Hauke, and I thought that Chris seemed thoughtful and intelligent from his videoed talk.

The Let’s Fund report linked in the application played a major role in my assessment of the grant, and I probably would not have been comfortable recommending this grant without access to that report.

Thoughts on Registered Reports

The replication crisis in psychology, and the broad spread of “career science,” have made it quite clear (to me) that the methodological foundations of at least psychology, and possibly also the broader life sciences, are producing a very large volume of false and likely unreproducible claims.

This is in large part caused by problematic incentives for individual scientists to engage in highly biased reporting and statistically dubious practices.

I think preregistration has the opportunity to fix a small but significant part of this problem, primarily by reducing file-drawer effects. To borrow an explanation from the Let’s Fund report (lightly edited for clarity):

[Pre-registration] was introduced to address two problems: publication bias and analytical flexibility (in particular outcome switching in the case of clinical medicine).

Publication bias, also known as the file drawer problem, refers to the fact that many more studies are conducted than published. Studies that obtain positive and novel results are more likely to be published than studies that obtain negative results or report replications of prior results. The consequence is that the published literature indicates stronger evidence for findings than exists in reality.

Outcome switching refers to the possibility of changing the outcomes of interest in the study depending on the observed results. A researcher may include ten variables that could be considered outcomes of the research, and — once the results are known — intentionally or unintentionally select the subset of outcomes that show statistically significant results as the outcomes of interest. The consequence is an increase in the likelihood that reported results are spurious by leveraging chance, while negative evidence gets ignored.

This is one of several related research practices that can inflate spurious findings when analysis decisions are made with knowledge of the observed data, such as selection of models, exclusion rules and covariates. Such data-contingent analysis decisions constitute what has become known as P-hacking, and pre-registration can protect against all of these.

[...]

It also effectively blinds the researcher to the outcome, because the data are not collected yet and the outcomes are not yet known. This way the researcher’s unconscious biases cannot influence the analysis strategy.
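The size of the outcome-switching effect described in the quote is easy to check with a quick simulation (my own illustration, not from the Let’s Fund report): under the null hypothesis, p-values are uniform on [0, 1], so a researcher who measures ten independent outcomes and reports whichever one comes out significant will “find” an effect far more often than the nominal 5% rate.

```python
# Toy simulation of outcome switching: a "study" with no real effect is
# declared significant if ANY of its measured outcomes has p < alpha.
# Under the null, each p-value is uniform on [0, 1].
import random

random.seed(0)

def study_finds_effect(n_outcomes: int, alpha: float = 0.05) -> bool:
    """One null study: significant iff the best of its outcomes has p < alpha."""
    return min(random.random() for _ in range(n_outcomes)) < alpha

n_studies = 100_000
honest = sum(study_finds_effect(1) for _ in range(n_studies)) / n_studies
switched = sum(study_finds_effect(10) for _ in range(n_studies)) / n_studies

print(f"1 preregistered outcome: {honest:.3f}")   # ~0.05
print(f"best of 10 outcomes:     {switched:.3f}")  # ~1 - 0.95**10, i.e. ~0.40
```

In other words, freely picking among ten outcomes inflates the false-positive rate roughly eightfold, which is exactly the failure mode that committing to an analysis plan in advance rules out.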

“Registered reports” refers to a specific protocol that journals are encouraged to adopt, which integrates preregistration into the journal acceptance process. Illustrated by this picture (borrowed from the Let’s Fund report):

Of the many ways to implement preregistration practices, the one Chambers proposes doesn't seem ideal to me, and I can see some flaws with it. Still, I think the quality of clinical science (and potentially other fields) will significantly improve if more journals adopt the registered reports protocol. (Please keep this in mind as you read my concerns in the next section.)

The importance of bandwidth constraints for journals

Chambers has the explicit goal of making all clinical trials require the use of registered reports. That outcome seems potentially quite harmful, and possibly worse than the current state of clinical science. (However, since that current state is very far from “universal registered reports,” I am not very worried about this grant contributing to that scenario.)

The Let’s Fund report covers the benefits of preregistration pretty well, so I won’t go into much detail here. Instead, I will mention some of my specific concerns with the protocol that Chambers is trying to promote.

From the registered reports website:

Manuscripts that pass peer review will be issued an in principle acceptance (IPA), indicating that the article will be published pending successful completion of the study according to the exact methods and analytic procedures outlined, as well as a defensible and evidence-bound interpretation of the results.

This seems unlikely to be the best course of action. I don't think that the most widely-read journals should commit to publishing studies before their results are known. The key reason is that many scientific journals are solving a bandwidth constraint - sharing papers that are worth reading, not merely papers that say true things, to help researchers keep up to date with new findings in their field. A math journal could publish papers for every true mathematical statement, including trivial ones, but they instead need to focus on true statements that are useful to signal-boost to the mathematics community. (Related concepts are the tradeoff between bias and variance in machine learning, or accuracy and calibration in forecasting.)

Ultimately, from a value of information perspective, it is totally possible for a study to only be interesting if it finds a positive result, and to be uninteresting when analyzed pre-publication from the perspective of the editor. It seems better to encourage pre-publication, but still take into account the information value of a paper’s experimental results, even if this doesn’t fully prevent publication bias.

To give a concrete (and highly simplified) example, imagine a world where you are trying to find an effective treatment for a disease. You don’t have great theory in this space, so you basically have to test 100 plausible treatments. On their own, none of these have a high likelihood of being effective, but you expect that at least one of them will work reasonably well.

Currently, you would preregister those trials (as is required for clinical trials), and then start performing the studies one by one. Each failure provides relatively little information (since the prior probability was low anyways), so you are unlikely to be able to publish it in a prestigious journal, but you can probably still publish it somewhere. Not many people would hear about it, but it would be findable if someone is looking specifically for evidence about the specific disease you are trying to treat, or the treatment that you tried out. However, finding a successful treatment is highly valuable information which will likely get published in a journal with a lot of readers, causing lots of people to hear about the potential new treatment.
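The value-of-information asymmetry in this example can be sketched with a quick Bayes update. The numbers below are hypothetical assumptions of mine (a 2% prior that any given treatment works, 80% statistical power, a 5% false-positive rate), not figures from the writeup:

```python
def posterior(prior, power=0.8, alpha=0.05, significant=True):
    """Posterior probability that a treatment truly works,
    given whether its trial found a significant result (simple Bayes)."""
    if significant:
        likelihood_works, likelihood_null = power, alpha
    else:
        likelihood_works, likelihood_null = 1 - power, 1 - alpha
    numerator = prior * likelihood_works
    return numerator / (numerator + (1 - prior) * likelihood_null)

prior = 0.02  # hypothetical: each of the 100 treatments is ~2% likely to work
print(f"prior:                  {prior:.1%}")
print(f"after a positive trial: {posterior(prior, significant=True):.1%}")
print(f"after a negative trial: {posterior(prior, significant=False):.1%}")
```

Under these assumptions a negative trial barely moves the estimate (2% to roughly 0.4%), while a positive trial raises it more than tenfold (2% to roughly 25%). That asymmetry is why the positive result is the one worth broadcasting through a bandwidth-limited journal.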

In a world with mandatory registered reports, none of these studies will be published in a high-readership journal, since journals will be forced to make a decision before they know the outcome of a treatment. Because all 100 studies are equally unpromising, none are likely to pass the high bar of such a journal, and they’ll wind up in obscure publications (if they are published at all) [1]. Thus, even if one of them finds a successful result, few people will hear about it. High-readership journals exist in large part to spread news about valuable results in a limited bandwidth environment; this no longer happens in scenarios of this kind.

Because of dynamics like this, I think it is very unlikely that any major journal will ever switch to publishing only registered report-based studies, even within clinical trials, since no journal would want to pass up the chance to publish a study that could revolutionize the field.

Importance of selecting for clarity

Here is the full set of criteria that papers are being evaluated by for stage 2 of the registered reports process:

1. Whether the data are able to test the authors' proposed hypotheses by satisfying the approved outcome-neutral conditions (such as quality checks or positive controls)
2. Whether the Introduction, rationale and stated hypotheses are the same as the approved Stage 1 submission (required)
3. Whether the authors adhered precisely to the registered experimental procedures
4. Whether any unregistered post hoc analyses added by the authors are justified, methodologically sound, and informative
5. Whether the authors' conclusions are justified given the data

This list is exhaustive, and it does not include any mention of the clarity of the authors' writing, the quality/rigor of the explanation provided by the paper's methodology, or the implications of the paper's findings for underlying theory. (All of these are very important to how journals currently evaluate papers.) This means that journals can only filter for those characteristics in the first stage of the registered reports process, when large parts of the paper haven't yet been written. As a result, large parts of the paper have essentially no selection applied to them for conceptual clarity or for thoughtful analysis of implications for future theory, likely causing those qualities to get worse.

I think the goal of registered reports is to split research into two halves, publishing two separate papers: one that is empirical, and another that is purely theoretical, which takes the results of the first paper as given and explores their consequences. We already see this split a good amount in physics, where there is a pretty significant divide between experimental and theoretical physics, the latter of which rarely performs experiments. I don't know whether encouraging this split in a given field is a net improvement. I generally think that a lot of good science comes from combining the gathering of good empirical data with careful analysis and explanations, and I am particularly worried that the analysis of results in papers published via registered reports will be of particularly low quality, encouraging the spread of bad explanations and misconceptions that can cause a lot of damage (though some of that is definitely offset by reducing the degree to which scientists can fit hypotheses post hoc, thanks to preregistration). The costs here seem related to Chris Olah's article on research debt.

Again, I think both of these problems are unlikely to become serious issues, because at most I can imagine getting to a world where something between 10% and 30% of top journal publications in a given field have gone through registered reports-based preregistration. I would be deeply surprised if there weren’t alternative outlets for papers that do try to combine the gathering of empirical data with high-quality explanations and analysis.

Failures due to bureaucracy

I should also note, while clinical science is not something I have spent large amounts of time thinking about, that I am quite concerned about adding more red tape and logistical hurdles to the process of registering clinical trials. I have high uncertainty about the effect of registered reports on the costs of running small-scale clinical experiments, but it seems more likely than not that they will lengthen the review process and add additional methodological constraints.

(There is also a chance that it will reduce these burdens by giving scientists feedback earlier in the process and letting them be more certain of the value of running a particular study. However, this effect seems slightly weaker to me than the additional costs, though I am very uncertain about this.)

In the current scientific environment, running even a simple clinical study may require millions of dollars of overhead (a related example is detailed in Scott Alexander's "My IRB nightmare"). I believe this barrier is a substantial drag on progress in medical science. In this context, requiring even more mandatory documentation and adding even more upfront costs seems very costly. (Though again, it seems highly unlikely for the registered reports format to ever become mandatory on a large scale, and giving more researchers the option to publish a study via the registered reports protocol, depending on their local tradeoffs, seems likely net-positive.)

To summarize these three points:

  • If journals have to commit to publishing studies, it’s not obvious to me that this is good, given that they would have to do so without access to important information (e.g. whether a surprising result was found) and only a limited number of slots for publishing papers.
  • It seems quite important for journals to be able to select papers based on the clarity of their explanations, both for ease of communication and for conceptual refinement.
  • Excessive red tape in clinical research seems like one of the main problems with medical science today, so adding more is worrying, though the sign of the registered reports protocol's effect on this is a bit ambiguous.

Differential technological progress

Let’s Fund covers differential technological progress concerns in their writeup. Key quote:

One might worry that funding meta-research indiscriminately speeds up all research, including research which carries a lot of risks. However, for the above reasons, we believe that meta-research improves predominantly social science and applied clinical science ("p-value science") and so has a strong differential technological development element, that hopefully makes society wiser before more risks from technology emerge through innovation. However, there are some reproducibility concerns in harder sciences such as basic biological research and high energy physics that might be sped up by meta-research and thus carry risks from emerging technologies[110].

My sense is that further progress in sociology and psychology seems net positive from a global catastrophic risk reduction perspective. The case for clinical science seems a bit weaker, but still positive.

In general, I am more excited about this grant in worlds in which global catastrophes are less immediate and less likely than my usual models suggest, and I’m thinking of this grant in some sense as a hedging bet, in case we live in one of those worlds.

Overall, a reasonable summary of my position on this grant would be "I think preregistration helps, but is probably not really attacking the core issues in science. I think this grant is good, because I think it actually makes preregistration a possibility in a large number of journals, though I disagree with Chris Chambers on whether it would be good for all clinical trials to require preregistration, which I think would be quite bad. On the margin, I support his efforts, but if I ever come to change my mind about this, it's likely for one or more of the above reasons."


[1]: The journal could also publish a random subset, though at scale that gives rise to the same dynamics, so I’ll ignore that case. It could also batch a large number of the experiments until the expected value of information is above the relevant threshold, though that significantly increases costs.

Jess Whittlestone ($75,080)

Note: Funding from this grant will go to the Leverhulme Centre for the Future of Intelligence, which will fund Jess in turn. The LTF Fund is not replacing funding that CFI would have supplied instead; without this grant, Jess would need to pursue grants from sources outside CFI.

Research on the links between short- and long-term AI policy while skilling up in technical ML.

I'm applying for funding to cover my salary for a year as a postdoc at the Leverhulme CFI, enabling me to do two things:

-- Research the links between short- and long-term AI policy. My plan is to start broad: thinking about how to approach, frame and prioritise work on 'short-term' issues from a long-term perspective, and then focusing in on a more specific issue. I envision two main outputs (papers/reports): (1) reframing various aspects of 'short-term' AI policy from a long-term perspective (e.g. highlighting ways that 'short-term' issues could have long-term consequences, and ways of working on AI policy today most likely to have a long-run impact); (2) tackling a specific issue in 'short-term' AI policy with possible long-term consequences (tbd, but an example might be the possible impact of microtargeting on democracy and epistemic security as AI capabilities advance).

-- Skill up in technical ML by taking courses from the Cambridge ML masters.

Most work on long-term impacts of AI focuses on issues arising in the future from AGI. But issues arising in the short term may have long-term consequences: either by directly leading to extreme scenarios (e.g. automated surveillance leading to authoritarianism), or by undermining our capability to deal with other threats (e.g. disinformation undermining collective decision-making). Policy work today will also shape how AI gets developed, deployed and governed, and what issues will arise in the future. We're at a particularly good time to influence the focus of AI policy, with many countries developing AI strategies and new research centres emerging.

There's very little rigorous thinking about the best way to do short-term AI policy from a long-term perspective. My aim is to change that, and in doing so improve the quality of discourse in current AI policy.

I would start with a focus on influencing UK AI policy, as I have experience and a strong network here (e.g. the CDEI and Office for AI). Since DeepMind is in the UK, I think it is worth at least some people focusing on UK institutions. I would also ensure this research was broadly relevant, by collaborating with groups working on US AI policy (e.g. FHI, CSET and OpenAI).

I'm also asking for a time buyout to skill up in ML (~30%). This would improve my own ability to do high-quality research, by helping me to think clearly about how issues might evolve as capabilities advance, and how technical and policy approaches can best combine to influence the future impacts of AI.

The main work I know of Jess's is her early involvement in 80,000 Hours. In the first 1-2 years of their existence, she wrote dozens of articles for them, and contributed to their culture and development. Since then I've seen her make positive contributions to a number of projects over the years - she has helped in some form with every EA Global conference I've organized (two in 2015 and one in 2016), and she's continued to write publicly in places like the EA Forum, the EA Handbook, and news sites like Quartz and Vox. This background means that members of the fund have had many opportunities to judge Jess's output. My sense is that this is the main reason the other members of the fund were excited about this grant — they generally trust Jess's judgment and value her experience (while being more hesitant about CFI's work).

There are three things I looked into for this grant writeup: Jess’s policy research output, Jess’s blog, and the institutional quality of Leverhulme CFI. The section on Leverhulme CFI became longer than the section on Jess and was mostly unrelated to her work, so I’ve taken it out and included it as an addendum.

Impressions of Policy Papers

First is her policy research; the papers I read were those linked on her blog.

On the first paper, about focusing on tensions: the paper said that many “principles of AI ethics” that people publicly talk about in industry, non-profit, government and academia are substantively meaningless, because they don’t come with the sort of concrete advice that actually tells you how to apply them - and specifically, how to trade them off against each other. The part of the paper I found most interesting were four paragraphs pointing to specific tensions between principles of AI ethics. They were:

  • Using data to improve the quality and efficiency of services vs. respecting the privacy and autonomy of individuals
  • Using algorithms to make decisions and predictions more accurate vs. ensuring fair and equal treatment
  • Reaping the benefits of increased personalization in the digital sphere vs. enhancing solidarity and citizenship
  • Using automation to make people’s lives more convenient and empowered vs. promoting self-actualization and dignity

My sense is that while there is some good public discussion about AI and policy (e.g. OpenAI’s work on release practices seems quite positive to me), much conversation that brands itself as ‘ethics’ is often not motivated by the desire to ensure this novel technology improves society in accordance with our deepest values, but instead by factors like reputation, PR and politics.

There are many notions, like Peter Thiel’s “At its core, artificial intelligence is a military technology” or the common question “Who should control the AI?” which don’t fully account for the details of how machine learning and artificial intelligence systems work, or the ways in which we need to think about them in very different ways from other technologies; in particular, that we will need to build new concepts and abstractions to talk about them. I think this is also true of most conversations around making AI fair, inclusive, democratic, safe, beneficial, respectful of privacy, etc.; they seldom consider how these values can be grounded in modern ML systems or future AGI systems. My sense is that much of the best conversation around AI is about how to correctly conceptualize it. This is something that (I was surprised to find) Henry Kissinger’s article on AI did well; he spends most of the essay trying to figure out which abstractions to use, as opposed to using already existing ones.

The reason I liked that bit of Jess’s paper is that I felt the paper used mainstream language around AI ethics (in a way that could appeal to a broad audience), but then:

  • Correctly pointed out that AI is a sufficiently novel technology that we’re going to have to rethink what these values actually mean, because the technology causes a host of fundamentally novel ways for them to come into tension
  • Provided concrete examples of these tensions

In the context of a public conversation that I feel is often substantially motivated by politics and PR rather than truth, seeing someone point clearly at important conceptual problems felt like a breath of fresh air.

That said, given all of the political incentives around public discussion of AI and ethics, I don’t know how papers like this can improve the conversation. For example, companies are worried about losing in the court of Twitter’s public opinion, and also are worried about things like governmental regulation, which are strong forces pushing them to primarily take popular but ineffectual steps to be more "ethical". I’m not saying papers like this can’t improve this situation in principle, only that I don’t personally feel like I have much of a clue about how to do it or how to evaluate whether someone else is doing it well, in advance of their having successfully done it.

Personally, I feel much more able to evaluate the conceptual work of figuring out how to think about AI and its strategic implications (two standout examples are this paper by Bostrom and this LessWrong post by Christiano), rather than work on revising popular views about AI. I’d be excited to see Jess continue with the conceptual side of her work, but if she instead primarily aims to influence public conversation (the other goal of that paper), I personally don’t think I’ll be able to evaluate and recommend grants on that basis.

From the second paper I read sections 3 and 4, which list many safety and security practices in the fields of biosafety, computer information security, and institutional review boards (IRBs), then outline variables for analysing release practices in ML. I found it useful, even if it was shallow (i.e. it did not go into much depth in the fields it covered). Overall, the paper felt like a fine first step in thinking about this space.

In both papers, I was concerned with the level of inspiration drawn from bioethics, which seems to me to be a terribly broken field (cf. Scott Alexander talking about his IRB nightmare or medicine’s ‘culture of life’). My understanding is that bioethics coordinated a successful power grab (cf. OpenPhil’s writeup) from the field of medicine, creating hundreds of dysfunctional and impractical ethics boards that have formed a highly adversarial relationship with doctors (whose practical involvement with patients often makes them better than ethicists at making tradeoffs between treatment, pain/suffering, and dignity). The formation of an “AI ethics” community that has this sort of adversarial, unhealthy relationship with machine learning researchers would be an incredible catastrophe.

Overall, it seems like Jess is still at the beginning of her research career (she’s only been in this field for ~1.5 years). And while she’s spent a lot of effort on areas that don’t personally excite me, both of her papers include interesting ideas, and I’m curious to see her future work.

Impressions of Other Writing

Jess also writes a blog, and this is one of the main things that makes me excited about this grant. On the topic of AI, she wrote three posts (1, 2, 3), all of which made good points on at least one important issue. I also thought the post on confirmation bias and her PhD was quite thoughtful. It correctly identified a lot of problems with discussions of confirmation bias in psychology, and came to a much more nuanced view of the trade-off between being open-minded versus committing to your plans and beliefs. Overall, the posts show independent thinking written with an intent to actually convey understanding to the reader, and doing a good job of it. They share the vibe I associate with much of Julia Galef’s work - they’re noticing true observations / conceptual clarifications, successfully moving the conversation forward one or two steps, and avoiding political conflict.

I do have some significant concerns with the work above, including the positive portrayal of bioethics and the absence of any criticism toward the AAAI safety conference talks, many of which seem to me to have major flaws.

While I’m not excited about Leverhulme CFI’s work (see the addendum for details), I think it will be good for Jess to have free rein to follow her own research initiatives within CFI. And while she might be able to obtain funding elsewhere, this alternative seems considerably worse, as I expect other funding options would substantially constrain the types of research she’d be able to conduct.

Lynette Bye ($23,000)

Productivity coaching for effective altruists to increase their impact.

I plan to continue coaching high-impact EAs on productivity. I expect to have 600+ sessions with about 100 clients over the next year, focusing on people working in AI safety and EA orgs. I've worked with people at FHI, Open Phil, CEA, MIRI, CHAI, DeepMind, the Forethought Foundation, and ACE, and will probably continue to do so. Half of my current clients (and a third of all clients I've worked with) are people at these orgs. I aim to increase my clients' output by improving prioritization and increasing focused work time.

I would use the funding to: offer a subsidized rate to people at EA orgs (e.g. between $10 and $50 instead of $125 per call), offer free coaching for select coachees referred by 80,000 Hours, and hire contractors to help me create materials to scale coaching.

You can view my impact evaluation (linked below) for how I'm measuring my impact so far.

(Lynette’s public self-evaluation is here.)

I generally think it's pretty hard to do "productivity coaching" as your primary activity, especially when you are young, due to a lack of work experience. This means I have a high bar for it being a good idea that someone should go full-time into the "help other people be more productive” business.

My sense is that Lynette meets that bar, but only barely (to be clear, I consider it to be a high bar). The main thing that she seems to be doing well is being very organized about everything that she is doing, in a way that makes me confident that her work has had a real impact — if not, I think she’d have noticed and moved on to something else.

However, as I say in the CFAR writeup, I have a lot of concerns with primarily optimising for legibility, and Lynette's work shows some signs of this. She has shared around 60 testimonials on her website (linked here). Not one of them mentions anything negative, which clearly indicates that I can't straightforwardly interpret them as positive evidence (since any unbiased sampling process would have produced at least some negative datapoints). I much prefer what another applicant did here: they asked people to send us information anonymously, which increased the chance of our hearing opinions that weren't selected to create a positive impression. As is, I think I actually shouldn't update much on the testimonials, particularly given that none of them go into much detail on how Lynette has helped them, and almost all of them share a similar structure.

Reflecting on the broader picture, I think that Lynette’s mindset reflects how I think many of the best operations staff I’ve seen operate: aim to be productive by using simple output metrics, and by doing things in a mindful, structured way (as opposed to, for example, trying to aim for deep transformative insights more traditionally associated with psychotherapy). There is a deep grounded-ness and practical nature to it. I have a lot of respect for that mindset, and I feel as though it's underrepresented in the current EA/rationality landscape. My inside-view models suggest that you can achieve a bunch of good things by helping people become more productive in this way.

I also think that this mindset comes with a type of pragmatism that I am more concerned about, and often gives rise to what I consider unhealthy adversarial dynamics. As I discussed above, it’s difficult to get information from Lynette’s positive testimonials. My sense is that she might have produced them by directly optimising for “getting a grant” and trying to give me lots of positive information, leading to substantial bias in the selection process. The technique of ‘just optimize for the target’ is valuable in lots of domains, but in this case was quite negative.

That said, framing her coaching as achieving a series of similar results generally moves me closer to thinking about this grant as "coaching as a commodity". Importantly, few people reported very large gains in their productivity; the testimonials instead show a solid stream of small improvements. I think that very few people have access to good coaching, and the high variance in coach quality means that experimenting is often quite expensive and time-consuming. Lynette seems to be able to consistently produce positive effects in the people she is working with, making her services a lot more valuable due to greater certainty around the outcome. (However, I also assign significant probability that the way the evaluation questions were asked reduced the rate at which clients reported either negative or highly positive experiences.)

I think that many productivity coaches fail to achieve Lynette’s level of reliability, which is one of the key things that makes me hopeful about her work here. My guess is that the value-add of coaching is often straightforwardly positive unless you impose significant costs on your clients, and Lynette seems quite good at avoiding that by primarily optimizing for professionalism and reliability.

Further Recommendations (not funded by the LTF Fund)

Center for Applied Rationality ($150,000)

This grant was recommended by the Fund, but ultimately was funded by a private donor, who (prior to CEA finalizing its standard due diligence checks) had personally offered to make this donation instead. As such, the grant recommendation was withdrawn.

Oliver Habryka had created a full writeup by that point, so it is included below.

Help promising people to reason more effectively and find high-impact work, such as reducing x-risk.

The Center for Applied Rationality runs workshops that promote particular epistemic norms—broadly, that beliefs should be true, bugs should be solved, and that intuitions/aversions often contain useful data. These workshops are designed to cause potentially impactful people to reason more effectively, and to find people who may be interested in pursuing high-impact careers (especially AI safety).

Many of the people currently working on AI safety have been through a CFAR workshop, such as 27% of the attendees at the 2019 FLI conference on Beneficial AI in Puerto Rico, and for some of those people it appears that CFAR played a causal role in their decision to switch careers. In the confidential section, we list some graduates from CFAR programs who subsequently decided to work on AI safety, along with our estimates of the counterfactual impact of CFAR on their decision [16 at MIRI, 3 on the OpenAI safety team, 2 at CHAI, and one each at Ought, Open Phil and the DeepMind safety team].

Recruitment is the most legible form of impact CFAR has, and is probably its most important—the top reported bottleneck in the last two years among EA leaders at Leaders Forum, for example, was finding talented employees.

[...]

In 2019, we expect to run or co-run over 100 days of workshops, including our mainline workshop (designed to grow/improve the rationality community), workshops designed specifically to recruit programmers (AIRCS) and mathematicians (MSFP) to AI safety orgs, a 4-weekend instructor training program (to increase our capacity to run workshops), and alumni reunions in both the United States and Europe (to grow the EA/rationality community and cause impactful people to meet/talk with one another). Broadly speaking, we intend to continue doing the sort of work we have been doing so far.

In our last grant round, I took an outside view on CFAR and said that, in terms of output, I felt satisfied with CFAR's achievements in recruitment, training and the establishment of communal epistemic norms. I still feel this way about those areas, and my writeup last round still seems like an accurate summary of my reasons for wanting to grant to CFAR. I also said that most of my uncertainty about CFAR lies in its long-term strategic plans, and I continue to feel relatively confused about my thoughts on that.

I find it difficult to explain my thoughts on CFAR, and I think that a large fraction of this difficulty comes from CFAR being an organization that is intentionally not optimizing towards being easy to understand from the outside, having simple metrics, or more broadly being legible[1]. CFAR is intentionally avoiding being legible to the outside world in many ways. This decision is not obviously wrong, as I think it brings many positives, but I think it is the cause of me feeling particularly confused about how to talk coherently about CFAR.

Considerations around legibility

Summary: CFAR’s work is varied and difficult to evaluate. This has some good features — it can avoid focusing too closely on metrics that don’t measure impact well — but also forces evaluators to rely on factors that aren’t easy to measure, like the quality of its internal culture. On the whole, while I wish CFAR were somewhat more legible, I appreciate the benefits to CFAR’s work of not maximizing “legibility” at the cost of impact or flexibility.

To help me explain my point, let's contrast CFAR with an organization like AMF, which I think of as exceptionally legible. AMF’s work, compared to many other organizations with tens of millions of dollars on hand, is easy to understand: they buy bednets and give them to poor people in developing countries. As long as AMF continues to carry out this plan, and provides basic data showing its success in bednet distribution, I feel like I can easily model what the organization will do. If I found out that AMF was spending 10% of its money funding religious leaders in developing countries to preach good ethical principles for society, or funding the campaigns of government officials favorable to their work, I would be very surprised and feel like some basic agreement or contract had been violated — regardless of whether I thought those decisions, in the abstract, were good or bad for their mission. AMF claims to distribute anti-malaria bednets, and it is on this basis that I would choose whether to support them.

AMF could have been a very different organization, and still could be if it wanted to. For example, it could conduct research on various ways to effect change, and give its core staff the freedom to do whatever they thought was best. This new AMF (“AMF 2.0”) might not be able to tell you exactly what they’ll do next year, because they haven’t figured it out yet, but they can tell you that they’ll do whatever their staff determine is best. This could be distributing deworming pills, pursuing speculative medical research, engaging in political activism, funding religious organizations, etc.

If GiveWell wanted to evaluate AMF 2.0, they would need to use a radically different style of reasoning. There wouldn’t be a straightforward intervention with RCTs to look into. There wouldn’t be a straightforward track record of impact from which to extrapolate. Judging AMF 2.0 would require GiveWell to form much more nuanced judgments about the quality of thinking and execution of AMF’s staff, to evaluate the quality of its internal culture, and to consider a host of other factors that weren’t previously relevant.

I think that evaluating CFAR requires a lot of that kind of analysis, which seems inherently harder to communicate to other people without summarizing one’s views as: "I trust the people in that organization to make good decisions."

The more general idea here is that organizations are subject to bandwidth constraints - they often want to do lots of different things, but their funders need to be able to understand and predict their behavior with limited resources for evaluation. As I've written about recently, a key variable for any organization is the people and organizations by which they are trying to be understood and held accountable. For charities that receive most of their funding in small donations from a large population of people who don’t know much about them, this is a very strong constraint; they must communicate their work so that people can understand it very quickly with little background information. If a charity instead receives most of its funding in large donations from a small set of people who follow it closely, it can communicate much more freely, because the funders will be able to spend a lot of their time talking to the org, exchanging models, and generally coming to an understanding of why the org is doing what it’s doing.

This idea partly explains why most organizations tend to focus on legibility, in how they talk about their work and even in the work they choose to pursue. It can be difficult to attract resources and support from external parties if one’s work isn’t legible.

I think that CFAR is still likely optimizing too little towards legibility, compared to what I think would be ideal for it. Being legible allows an organization to be more confident that its work is having real effects, because it acquires evidence that holds up to a variety of different viewpoints. However, I think that far too many organizations (nonprofit and otherwise) are trying too hard to make their work legible, in a way that reduces innovation and also introduces a variety of adversarial dynamics. When you make systems that can be gamed, and which carry rewards for success (e.g. job stability, prestige, etc), people will reliably turn up to game them[2].

(As Jacob Lagerros has written in his post on Unconscious Economics, this doesn’t mean people are consciously gaming your system, merely that this behavior will eventually emerge. The many causes of this include selection effects, reinforcement learning, and memetic evolution.)

In my view, CFAR, by not trying to optimize for a single, easy-to-explain metric, avoids playing the “game” many nonprofits play of focusing on work that will look obviously good to donors, even if it isn’t what the nonprofit believes would be most impactful. They also avoid a variety of other games that come from legibility, such as job applicants getting very good at faking the signals that they are a good fit for an organization, making it harder for them to find good applicants.

Optimizing for communication with the goal of being given resources introduces adversarial dynamics; someone asking for money may provide limited/biased information that raises the chance they’ll be given a grant but reduces the accuracy of the grantmaker’s understanding. (See my comment in Lynette’s writeup below for an example of how this can arise.) This optimization can also tie down your resources, forcing you to carry out commitments you made for the sake of legibility, rather than doing what you think would be most impactful[3].

So I think that it's important that we don't force all organizations towards maximal legibility. (That said, we should ensure that organizations are encouraged to pursue at least some degree of legibility, since the lack of legibility also gives rise to various problems.)

Do I trust CFAR to make good decisions?

As I mentioned in my initial comments on CFAR, I generally think that the current projects CFAR is working on are quite valuable and worth the resources they are consuming. But I have a lot of trouble modeling CFAR’s long-term planning, and I feel like I have to rely instead on my models of how much I trust CFAR to make good decisions in general, instead of being able to evaluate the merits of their actual plans.

That said, I do generally trust CFAR's decision-making. It’s hard to explain the evidence that causes me to believe this, but I’ll give a brief overview anyway. (This evidence probably won’t be compelling to others, but I still want to give an accurate summary of where my beliefs come from):

  • I expect that a large fraction of CFAR's future strategic plans will continue to be made by Anna Salamon, from whom I have learned a lot of valuable long-term thinking skills, and who seems to me to have made good decisions for CFAR in the past.
  • I think CFAR's culture, while imperfect, is still based on strong foundations of good reasoning with deep roots in the philosophy of science and the writings of Eliezer Yudkowsky (which I think serve as a good basis for learning how to think clearly).
  • I have made a lot of what I consider my best and most important strategic decisions in the context of, and aided by, events organized by CFAR. This suggests to me that at least some of that generalizes to CFAR's internal ability to think strategically.
  • I am excited about a number of individuals who intend to complete CFAR's latest round of instructor training, which gives me some optimism about CFAR's future access to good talent and its ability to establish and sustain a good internal culture.


[1] I take the focus on ‘legibility’ in this context from James C. Scott’s book “Seeing Like a State.” It was introduced to me by Elizabeth Van Nostrand in this blogpost discussing it in the context of GiveWell and good giving; Scott Alexander also discussed it in his review of the book. Here’s an example from Scott regarding centralized planning and governance:

the centralized state wanted the world to be “legible”, ie arranged in a way that made it easy to monitor and control. An intact forest might be more productive than an evenly-spaced rectangular grid of Norway spruce, but it was harder to legislate rules for, or assess taxes on.

[2] The errors that follow are all forms of Goodhart’s Law, which states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

[3] The benefits of (and forces that encourage) stability and reliability can maybe be most transparently understood in the context of menu costs and the prevalence of highly sticky wages.

Addenda

Addendum: Thoughts on a Strategy Article by the Leadership of Leverhulme CFI and CSER

I wrote the following in the course of thinking about the grant to Jess Whittlestone. While the grant is to support Jess’s work, the grant money will go to Leverhulme CFI, which will maintain discretion about whether to continue employing her, and will likely influence what type of work she will pursue.

As such, it seems important to not only look into Jess’s work, but also look into Leverhulme CFI and its sister organization, the Centre for the Study of Existential Risk (CSER). While my evaluation of the organization that will support Jess during her postdoc is relevant to my evaluation of the grant, it is quite long and does not directly discuss Jess or her work, so I’ve moved it into a separate section.

I’ve read a few papers from CFI and CSER over the years, and heard many impressions of their work from other people. For this writeup, I wanted to engage more concretely with their output. I reread and reviewed an article published in Nature earlier this year called Bridging near- and long-term concerns about AI, written by the Executive Directors at Leverhulme CFI and CSER respectively, Stephen Cave and Seán ÓhÉigeartaigh.

Summary and aims of the article

The article’s summary:

Debate about the impacts of AI is often split into two camps, one associated with the near term and the other with the long term. This divide is a mistake — the connections between the two perspectives deserve more attention, say Stephen Cave and Seán S. ÓhÉigeartaigh.

This is not a position I hold, and I’m going to engage with the content below in more detail.

Overall, I found the claims of the essay hard to parse and often ambiguous, but I’ve attempted to summarize what I view as its three main points:

  1. If ML is a primary technology used in AGI, then there are likely some design decisions today that will create lock-in in the long-term and have increasingly important implications for AGI safety.
  2. If we can predict changes in society from ML that matter in the long-term (such as automation of jobs), then we can prepare policy for them in the short term (like preparing retraining programs for lorry drivers whose jobs will be automated).
  3. Norms and institutions built today will have long-term effects, and so people who care about the long term should especially care about near-term norms and institutions.

They say “These three points relate to ways in which addressing near-term issues could contribute to solving potential long-term problems.”

If I ask myself what Leverhulme/CSER’s goals are for this document, it feels to me like it is intended as a statement of diplomacy. It’s saying that near-term and long-term AI risk work are split into two camps, but that we should be looking for common ground (“the connections between the two perspectives deserve more attention”, “Learning from the long term”). It tries to emphasize shared values (“Connected research priorities”) and the importance of cooperation amongst many entities (“The challenges we will face are likely to require deep interdisciplinary and intersectoral collaboration between industries, academia and policymakers, alongside new international agreements”). The goal that I think it is trying to achieve is to negotiate trade and peace between the near-term and long-term camps by arguing that “This divide is a mistake”.

Drawing the definitions does a lot of work

The authors define “long-term concerns” with the following three examples:

wide-scale loss of jobs, risks of AI developing broad superhuman capabilities that could put it beyond our control, and fundamental questions about humanity’s place in a world with intelligent machines

Despite this broad definition, they only use concrete examples from the first category, which I would classify as something like “mid-term issues.” I think the possibility of even wide-scale loss of jobs, unless interpreted extremely broadly, is something that does not make sense to put into the same category as the other two, which are primarily concerned with stakes that are orders of magnitude higher (such as the future of the human species). I think this conflation of very different concerns causes the rest of the article to make an argument that is more likely to mislead than to inform.

After this definition, the article failed to mention any issue that I would classify as representative of the long-term concerns of Nick Bostrom or Max Tegmark, both of whom are cited by the article to define “long-term issues.” (In Tegmark’s book Life 3.0, he explicitly categorizes unemployment as a short-term concern, to be distinguished from long-term concerns.)

Conceptual confusions in short- and mid-term policy suggestions

The article has the following policy idea:

Take explainability (the extent to which the decisions of autonomous systems can be understood by relevant humans): if regulatory measures make this a requirement, more funding will go to developing transparent systems, while techniques that are powerful but opaque may be deprioritized.

(Let me be clear that this is not explicitly listed as a policy recommendation.)

My naive prior is that there is no good AI regulation a government could establish today, and I continue to feel this way after looking into this case (and the next example below). Let me explain why, in this case, the idea that regulation requiring explainability would encourage transparent and explainable systems is false.

Modern ML systems are not doing a type of reasoning that is amenable to explanation in the way human decisions often are. There is no principled explanation of their reasoning when deciding whether to offer you a bank loan; there is merely a mass of correlations between spending history and later reliability, which may or may not factorise into a small number of well-defined chunks like “how regularly someone pays their rent.” The main problem with the quoted paragraph is that it does not attempt to specify how to define explainability in an ML system to the point where it can be regulated, meaning that any regulation would either be meaningless and ignored or, worse, highly damaging. Policies formed in this manner will either be of no consequence or deeply antagonise the ML community. We currently don’t know how to think about the explainability of ML systems, and ignoring that problem and regulating that they should be ‘explainable’ will not work.

The article also contains the following policy idea about autonomous weapons.

The decisions we make now, for example, on international regulation of autonomous weapons, could have an outsized impact on how this field develops. A firm precedent that only a human can make a ‘kill’ decision could significantly shape how AI is used — for example, putting the focus on enhancing instead of replacing human capacities.

Here and throughout the article, repeated uses of the conditional ‘could’ make it unclear to me whether this is being endorsed or merely suggested. I can’t quite tell if they think that drone swarms are a long-term issue - they contrast it with a short-term issue but don’t explicitly say that it is long-term. Nonetheless, I think the suggestion here is also somewhat misguided.

Let me contrast this with Nick Bostrom, who on a recent episode of the Joe Rogan Experience explained that he thinks the specific rule has ambiguous value. Here’s a quote from a discussion of the campaign to ban lethal autonomous weapons:

Nick Bostrom: I’ve kind of stood a little bit on the sidelines on that particular campaign, being a little unsure exactly what it is that… certainly I think it’d be better if we refrained from having some arms race to develop these than not. But if you start to look in more detail: What precisely is the thing that you’re hoping to ban? So if the idea is the autonomous bit, that the robot should not be able to make its own firing decision, well, if the alternative to that is there is some 19-year old guy sitting in some office building and his job is whenever the screen flashes ‘fire now’ he has to press a red button. And exactly the same thing happens. I’m not sure how much is gained by having that extra step.

Interviewer: But it feels better for us for some reason. If someone is pushing the button.

Nick Bostrom: But what exactly does that mean. In every particular firing decision? Well, you gotta attack this group of surface ships here, and here are the general parameters, and you’re not allowed to fire outside these coordinates? I don’t know. Another is the question of: it would be better if we had no wars, but if there is gonna be a war, maybe it is better if it’s robots v robots. Or if there’s gonna be bombing, maybe you want the bombs to have high precision rather than low precision - get fewer civilian casualties.

[...]

On the other hand you could imagine it reduces the threshold for going to war, if you think that you wouldn’t fear any casualties you would be more eager to do it. Or if it proliferates and you have these mosquito-sized killer-bots that terrorists have. It doesn’t seem like a good thing to have a society where you have a facial-recognition thing, and then the bot flies out and you just have a kind of dystopia.

Overall, it seems that in both situations, the key open questions are in understanding the systems and how they’ll interface with areas of industry, government and personal life, and that regulation based on inaccurate conceptualizations of the technology would either be meaningless or harmful.

Polarizing approach to policy coordination

I have two main concerns with what I see as the intent of the paper.

The first one can be summarized by Robin Hanson’s article To Oppose Polarization, Tug Sideways:

The policy world can [be] thought of as consisting of a few Tug-O-War "ropes" set up in this high dimensional policy space. If you want to find a comfortable place in this world, where the people around you are reassured that you are "one of them," you need to continually and clearly telegraph your loyalty by treating each policy issue as another opportunity to find more supporting arguments for your side of the key dimensions. That is, pick a rope and pull on it.

If, however, you actually want to improve policy, if you have a secure enough position to say what you like, and if you can find a relevant audience, then [you should] prefer to pull policy ropes sideways. Few will bother to resist such pulls, and since few will have considered such moves, you have a much better chance of identifying a move that improves policy. On the few main dimensions, not only will you find it very hard to move the rope much, but you should have little confidence that you actually have superior information about which way the rope should be pulled.

I feel like the article above is not pulling policy ropes sideways, but is instead connecting long-term issues to specific sides of existing policy debates, around which there is already a lot of tension. The issue of technological unemployment seems to me to be a highly polarizing topic, where taking a position seems ill-advised, and I have very low confidence about the correct direction in which to pull policy. Entangling long-term issues with these highly tense short-term issues seems like it will likely reduce our future ability to broadly coordinate on these issues (by having them associated with highly polarized existing debates).

Distinction between long- and short-term thinking

My second concern is that on a deeper level, I think that the type of thinking that generates a lot of the arguments around concerns for long-term technological risks is very different from that which suggests policies around technological unemployment and racial bias. I think there is some value in having these separate ways of thinking engage in “conversation,” but I think the linked paper is confusing in that it seems to try to down-play the differences between them. An analogy might be the differences between physics and architecture; both fields nominally work with many similar objects, but the distinction between the two is very important, and the fields clearly require different types of thinking and problem-solving.

Some of my concerns are summarized by Eliezer in his writing on Pivotal Acts:

...compared to the much more difficult problems involved with making something actually smarter than you be safe, it may be tempting to try to write papers that you know you can finish, like a paper on robotic cars causing unemployment in the trucking industry, or a paper on who holds legal liability when a factory machine crushes a worker. But while it's true that crushed factory workers and unemployed truckers are both, ceteris paribus, bad, they are not astronomical catastrophes that transform all galaxies inside our future light cone into paperclips, and the latter category seems worth distinguishing...

...there will [...] be a temptation for the grantseeker to argue, "Well, if AI causes unemployment, that could slow world economic growth, which will make countries more hostile to each other, which would make it harder to prevent an AI arms race." But the possibility of something ending up having a non-zero impact on astronomical stakes is not the same concept as events that have a game-changing impact on astronomical stakes. The question is what are the largest lowest-hanging fruit in astronomical stakes, not whether something can be argued as defensible by pointing to a non-zero astronomical impact.

I currently don’t think that someone who is trying to understand how to deal with technological long-term risk should spend much time thinking about technological unemployment or related issues, but it feels like the paper is trying to advocate for the opposite position.

Concluding thoughts on the article

Many people in the AI policy space have to spend a lot of effort to gain respect and influence, and it’s genuinely hard to figure out a way to do this while acting with integrity. One common difficulty in this area is navigating the incentives to connect one’s arguments to issues that already get a lot of attention (e.g. ongoing political debates). My read is that this essay makes these connections even when they aren’t justified; it implies that many short- and medium-term concerns are a natural extension of current long-term thought, while failing to accurately portray what I consider to be the core arguments around long-term risks and benefits from AI. It seems like the effect of this essay will be to reduce perceived differences between long-term, mid-term and short-term work on risks from AI, to cause confusion about the actual concerns of Bostrom et al., and to make future communications work in this space harder and more polarized.

Broader thoughts on CSER and CFI

I only had the time and space to critique one specific article from CFI and CSER. However, from talking to others working in the global catastrophic risk space, and from engagement with significant fractions of the rest of CSER and CFI’s work, I've come to think that the problems I see in this article are mostly representative of the problems I see in CSER’s and CFI’s broader strategy and work. I don’t think what I’ve written sufficiently justifies that claim; however, it seems useful to share this broader assessment to allow others to make better predictions about my future grant recommendations, and maybe also to open a dialogue that might cause me to change my mind.

Overall, based on the concerns I’ve expressed in this essay, and that I’ve had with other parts of CFI and CSER’s work, I worry that their efforts to shape the conversation around AI policy, and to mend disputes between those focused on long-term and short-term problems, do not address important underlying issues and may have net-negative consequences.

That said, it’s good that these organizations give some researchers a way to get PhDs/postdocs at Cambridge with relatively little institutional oversight and an opportunity to explore a large variety of different topics (e.g. Jess, and Shahar Avin, a previous grantee whose work I’m excited about).

Addendum: Thoughts on incentives in technical fields in academia

I wrote the following in the course of writing about the AI Safety Camp. This is a model I use commonly when thinking about funding for AI alignment work, but it ended up not being very relevant to that writeup, so I’m leaving it here as a note of interest.

My understanding of many parts of technical academia is that there is a strong incentive to make your writing hard to understand while appearing more impressive by using a lot of math. Eliezer Yudkowsky describes his understanding of it as follows (and expands on this further in the rocket alignment problem):

The point of current AI safety work is to cross, e.g., the gap between [...] saying “Ha ha, I want AIs to have an off switch, but it might be dangerous to be the one holding the off switch!” to, e.g., realizing that utility indifference is an open problem. After this, we cross the gap to solving utility indifference in unbounded form. Much later, we cross the gap to a form of utility indifference that actually works in practice with whatever machine learning techniques are used, come the day.

Progress in modern AI safety mainly looks like progress in conceptual clarity — getting past the stage of “Ha ha it might be dangerous to be holding the off switch.” Even though Stuart Armstrong’s original proposal for utility indifference completely failed to work (as observed at MIRI by myself and Benya), it was still a lot of conceptual progress compared to the “Ha ha that might be dangerous” stage of thinking.

Simple ideas like these would be where I expect the battle for the hearts of future grad students to take place; somebody with exposure to Armstrong’s first simple idea knows better than to walk directly into the whirling razor blades without having solved the corresponding problem of fixing Armstrong’s solution. A lot of the actual increment of benefit to the world comes from getting more minds past the “walk directly into the whirling razor blades” stage of thinking, which is not complex-math-dependent.

Later, there’s a need to have real deployable solutions, which may or may not look like impressive math per se. But actual increments of safety there may be a long time coming. [...]

Any problem whose current MIRI-solution looks hard (the kind of proof produced by people competing in an inexploitable market to look impressive, who gravitate to problems where they can produce proofs that look like costly signals of intelligence) is a place where we’re flailing around and grasping at complicated results in order to marginally improve our understanding of a confusing subject matter. Techniques you can actually adapt in a safe AI, come the day, will probably have very simple cores — the sort of core concept that takes up three paragraphs, where any reviewer who didn’t spend five years struggling on the problem themselves will think, “Oh I could have thought of that.” Someday there may be a book full of clever and difficult things to say about the simple core — contrast the simplicity of the core concept of causal models, versus the complexity of proving all the clever things Judea Pearl had to say about causal models. But the planetary benefit is mainly from posing understandable problems crisply enough so that people can see they are open, and then from the simpler abstract properties of a found solution — complicated aspects will not carry over to real AIs later.

And gives a concrete example here:

The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-"hobbyists" (peer commentary by fellow "hobbyists" still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.

(Here is a public example of Stuart’s work on utility indifference, though I had difficulty finding the most relevant examples of his work on this subject.)

Some examples that seem to me to use an appropriate level of formalism include: the Embedded Agency sequence, the Mesa-Optimisation paper, some posts by DeepMind researchers (thoughts on human models, classifying specification problems as variants of Goodhart’s law), and many other blog posts by these authors and others on the AI Alignment Forum.

There’s a sense in which it’s fine to play around with the few formalisms you have a grasp of when you’re getting to grips with ideas in this field. For example, MIRI recently held a retreat for new researchers, which led to a number of blog posts that followed this pattern (1, 2, 3, 4). But aiming for lots of technical formalism is not helpful - any conception of useful work that focuses primarily on molding the idea to the format rather than molding the format to the idea, especially for (nominally) impressive technical formats, is likely optimizing for the wrong metric and falling prey to Goodhart’s law.


New Petrov Game Brainstorm

Published on October 3, 2019 7:48 PM UTC

Big thanks to the LW team for putting together the Petrov Day experience! (Setup. Follow up.)

I looked over the comments and it seems like there were a number of suggestions for how to do this better. Instead of waiting for the next year, let's do it right now.

My proposed setup:

1. Take the original 125 LW users. Take a prize pool of $1,250 (or more if people are willing to donate). The prize pool is split evenly between each player, but you have to survive the game to get paid. Everyone is anonymized in the game.

2. The game will last a minimum of 4 days (to give everyone enough time to act, strategize, and think). After 4 days, there will be an increasing probability that the game will end at any minute. (This is to prevent anyone trying to attack right when the game ends to avoid retaliation. In expectation, the game should last about a week.)

3. Each player will have a number of missiles equal to the number of players. They can launch any number of them.

4. When a missile is launched: a) the attacked player is notified that they are being attacked by a specific player (and therefore has an option to retaliate), b) 48 hours after the launch, the attacked player is declared dead: they can no longer perform any actions and will not receive a payout, c) 48 hours after the launch, the attacking player gets the target player's entire prize pool.

5. During the game there will be at least 125 fake alerts. They will be generated randomly (so some players might receive no fake alerts, and others more than one). A fake alert will look the same as if some specific player had launched a missile against you. 48 hours after the notification, you'll find out whether it was real by whether or not you're still alive.

Additional details:

  • You can see who has been killed.
  • You only know about missile launches that you have made yourself or that target you.
  • If you take money from a player who already took money from someone else, you get those too. So in theory, we could end up with one winner with the entire original prize pool.
  • When people create their accounts, they'll have the option either to receive their winnings directly (let's say via PayPal) or to donate them to LW. (Personally, I hope most people will choose the second option, which will make the payout much less of a hassle.)
  • I'm not sure how to easily structure this so that the players are completely anonymous. (For example, if I'm sending the payouts, I'll know.) If this seems like an important feature, I'm willing to work through this. (E.g. each player gets a random invite code and creates a new account. There's no record of who received what invite code. Payouts are done to anonymous BTC addresses.)
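To make the elimination and payout mechanics concrete, here is a minimal toy sketch in Python (the class and variable names are my own illustration, not part of the proposal; the 48-hour resolution delay and fake alerts are simplified away):

```python
class PetrovGame:
    """Toy sketch of the proposed payout/elimination rules (illustrative only)."""

    def __init__(self, n_players=125, pool=1250):
        # Rule 1: the pool is split evenly; you must survive to get paid.
        self.share = {p: pool / n_players for p in range(n_players)}
        self.alive = set(range(n_players))

    def launch(self, attacker, target):
        # Rule 4 (time delay elided): after a real launch resolves, the
        # target dies and the attacker absorbs the target's entire
        # accumulated pool, including anything the target captured earlier.
        if attacker in self.alive and target in self.alive:
            self.share[attacker] += self.share.pop(target)
            self.alive.discard(target)

    def fake_alert(self, target):
        # Rule 5: a fake alert looks like a real launch but has no effect.
        return f"Player {target}: incoming missile detected (may be false)."

game = PetrovGame(n_players=4, pool=40)
game.launch(0, 1)   # player 0 takes player 1's $10 share
game.launch(2, 0)   # player 2 absorbs player 0's accumulated $20
print(sorted(game.share.items()))  # [(2, 30.0), (3, 10.0)]
```

Note how captured pools chain through successive attacks, so in principle a single survivor could end up with the entire original prize pool.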

What do you think?

  • Does it look like the rules are facilitating the kind of experiment we want to run?
  • If you were part of the originally selected group, are you willing to participate?
  • Can anyone add to the prize pool?
  • Any clever way to structure this game on top of some existing platform to avoid writing too much code?


Nonviolent Communication: practice session

Kocherga events - October 3, 2019 - 19:30
How can you have fewer conflicts without giving up your own interests? Nonviolent communication is a set of skills for reaching mutual understanding with other people. Come to our practical sessions to develop these skills and communicate more sensitively and effectively.

[Link] (EA Podcast) Global Optimum: How to Learn Better

Новости LessWrong.com - 3 октября, 2019 - 18:51
Published on October 3, 2019 12:29 AM UTC

I host the podcast Global Optimum. The goal of the podcast is to make altruists more effective. The most recent episode is about how to learn more effectively. I review the psychological literature on learning techniques and discuss the interplay between scholarship and rationality.

This episode features:

-What are the best and worst studying techniques?

-Do “learning styles” exist?

-How to squeeze more learning into your day

-How to start learning a new field

-How to cultivate viewpoint diversity

-How to avoid getting parasitized by bad ideas

-Should you study in the morning or at night?

-Can napping enhance learning?

Full transcript: http://danielgambacorta.com/podcast/how-to-learn-better/

The podcast is available on all podcast apps.

Listen here: http://globaloptimum.libsyn.com/how-to-learn-better


Can we make peace with moral indeterminacy?

Новости LessWrong.com - 3 октября, 2019 - 15:56
Published on October 3, 2019 12:56 PM UTC

The problem:

Put humans in the ancestral environment, and they'll behave as if they like nutrition and reproducing. Put them in the modern environment, and they'll behave as if they like tasty food and good feelings. Pump heroin into their brains, and they'll behave as if they want high dopamine levels.

None of these are the One True Values of the humans, they're just what humans seem to value in context, at different levels of abstraction. And this is all there is - there is no One True Context in which we find One True Values, there are just regular contexts. Thus we're in a bit of a pickle when it comes to teaching an AI how we want the world to be rearranged, because there's no One True Best State Of The World.

This underdetermination gets even worse when we consider that there's no One True Generalization Procedure, either. At least for everyday sorts of questions (do I want nutrition, or do I want tasty food?), we're doing interpolation, not extrapolation. But when we ask about contexts or options totally outside the training set (how should we arrange the atoms of the Milky Way?), we're back to the problem illustrated with train tracks in The Tails Coming Apart As Metaphor For Life.

Sometimes it feels like for every value alignment proposal, the arbitrariness of certain decisions sticks out like a missing finger on a hand. And we just have to hope that it all works out fine and that this arbitrary decision turns out to be a good one, because there's no way to make a non-arbitrary decision for some choices.

Is it possible for us to make peace with this upsetting fact of moral indeterminacy? If two slightly different methods of value learning give two very different plans for the galaxy, should we regard both plans as equally good, and be fine with either? I don't think this acceptance of arbitrariness is crazy, and some amount is absolutely necessary. But this pill might be less bitter to swallow if we clarify our picture of what "value learning" is supposed to be doing in the first place.

AIs aren't driving towards their One Best State anyhow:

For example, what kind of "human values" object do we want a value learning scheme to learn? Because it ain't a utility function over microphysical states of the world.

After all, we don't want a FAI to be in the business of finding the best position for all the atoms, and then moving the atoms there and freezing them. We want the "best state" to contain people growing, exploring, changing the environment, and so on. This is only a "state" at all when viewed at some very high level of abstraction that incorporates history and time evolution.

So when two Friendly AIs generalize differently, this might look less like totally different end-states for the galaxy, but like subtly different opinions on which dynamics make for a satisfying galactic society... which eventually lead to totally different end-states for the galaxy. Look, I never said this would make the problem go away - we're still talking about generalizing from our training set to the entire universe, here. If I'm making any comforting point here, it's that the arbitrariness doesn't have to be tense or alien or too big to comprehend, it can be between reasonable things that all sound like good ideas.


And jumping Jehoshaphat, we haven't even talked about meta-ethics yet. An AI that takes meta-ethics into account wouldn't only learn what we appear to value according to whatever definition it started with; it would try to take into account what we think it means to value things, what it means to make good decisions, what we think we value, and what we want to value.

This can get a lot trickier than just inferring a utility function from a human's actions, and we don't have a very good understanding of it right now. But our concern about the arbitrariness of values is precisely a meta-ethical concern, so you can see why it might be a big deal to build an AI that cares about meta-ethics. I'd want a superhuman meta-ethical reasoner to learn that there was something weird and scary about this problem of formalizing and generalizing values, and take superhumanly reasonable steps to address this. The only problem is I have no idea how to build such a thing.

But in lieu of superintelligent solutions, we can still try to research appealing metaethical schemes for controlling generalization.

One such scheme is incrementalism. Rather than immediately striking out for the optimal utopia your model predicts, maybe it's safer to follow something like an iterative process - humans learning, thinking, growing, changing the world, and eventually ending up at a utopia that might not be what you had in mind at the start. (More technically, we might simulate this process as flow between environments, where we start with our current environment and values, and flow to nearby environments based on our rating of them, at each step updating our values not according to what they would actually be in that environment, but based on an idealized meta-ethical update rule set by our current selves.)

This was inspired by Scott Garrabrant's question about gradient descent vs. Goodhart's law. If we think of utopias as optimized points in a landscape of possibilities, we might want to find ones that lie near to home - via hill-climbing or other local dynamics - rather than trusting our model to safely teleport us to some far-off point in configuration space.
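As a toy illustration of that contrast (entirely my own construction, with made-up value and proxy functions): suppose a learned proxy agrees with the true values near our current situation but diverges far away. Hill-climbing on the proxy stays near the shared local optimum, while jumping to the point the proxy scores highest would be a Goodhart disaster:

```python
def true_value(x):
    # The "real" values: optimum at x = 1.
    return -(x - 1.0) ** 2

def proxy(x):
    # A learned model: agrees with true_value near x = 0,
    # but the cubic term dominates (and misleads) for large x.
    return -(x - 1.0) ** 2 + 0.05 * x ** 3

def hill_climb(f, x=0.0, step=0.01, iters=2000):
    # Incrementalism: take small local steps that improve f.
    for _ in range(iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
    return x

x_local = hill_climb(proxy)   # stops near the shared local optimum (~1.1)
x_far = 20.0                  # a distant point the proxy scores much higher
print(true_value(x_local) > -0.05)          # local climbing stays safe
print(proxy(x_far) > proxy(x_local))        # the proxy prefers teleporting...
print(true_value(x_far))                    # ...which is catastrophic
```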

It also bears resemblance to Eliezer_2004's meta-ethical wish list: "if we knew more, [...] were the people we wished we were, had grown up farther together..." There just seems to be something meta-ethically trustworthy about "growing up more."

This also illustrates how the project of incorporating meta-ethics into value learning really has its work cut out for it. Of course there are arbitrary choices in meta-ethics too, but somehow they seem more palatable than arbitrary choices at the lower meta-level. Whether we do it with artificial help or not, I think it's possible to gradually tease out what sort of things we want from value learning, which might not reduce the number of arbitrary choices, but hopefully can reduce their danger and mystery.


Comments are back!

Новости LessWrong.com - 3 октября, 2019 - 13:50
Published on October 3, 2019 10:50 AM UTC

Instead of hosting comments on jefftk.com directly, I copy over publicly-accessible comments from discussion elsewhere:

By default it's first-name only when copying from social media (Facebook, Google Plus) and full name when copying from forums (LessWrong, EA Forum), though I have it always use my full name for clarity. While everything I copy over is already world-readable without an account, some people don't want their comments copied (let me know if that includes you!), in which case I show something like:

It's a little fragile, though, and about a year after I last fixed Facebook comment inclusion it broke again. And Google Plus was turned off, so I couldn't pull comments from there either. And the EA Forum and LessWrong migrated to new software that didn't support the old (and not very good) rss-based system. And my adapters for Hacker News and Reddit broke too.

I've now gotten the main four working again, though it's not quite the same as before:

  • Google Plus, being shut down, just serves a frozen archive as of my last backup.

  • The EA Forum and LessWrong use the GraphQL API and pull comments as needed, so crossposting is very fast.

  • I have new code for Facebook that runs selenium while logged out to build an archive, and I'll figure out some system for updating it at some point. Crossposting is very slow, like ~weeks.

At some point I may get Reddit and HN fixed, but I don't crosspost to them very often so it's not much of a priority.

Comment via: facebook


[Link] What do conservatives know that liberals don't (and vice versa)?

Новости LessWrong.com - 2 октября, 2019 - 19:16
Published on October 2, 2019 4:14 PM UTC

I am a PhD student currently conducting research on political polarization and persuasion. I am running an experiment that requires a database of trivia questions which conservatives are likely to get correct, and liberals are likely to get wrong (and vice versa). Our pilot testing has shown, for example, that Democrats (but not Republicans) tend to overestimate the percentage of gun deaths that involve assault-style rifles, while Republicans (but not Democrats) tend to overestimate the proportion of illegal immigrants who commit violent crimes. Similarly, Democrats (but not Republicans) tend to overestimate the risks associated with nuclear power, while Republicans (but not Democrats) underestimate the impact of race-based discrimination on hiring outcomes.

Actually designing these questions is challenging, however, because it’s difficult to know which of one’s political beliefs are most likely to be ill-informed.  As such, I am running a crowdsourcing contest in which we will pay $100 for any high-quality trivia question submitted (see contest details here: https://redbrainbluebrain.org/).  The only requirements are that participants submit a question text, four multiple choice answers, and a credible source.  The deadline for submissions is October 15th, 2019 at 11:59 p.m.

My intuition is that the LessWrong community will be particularly good at generating these kinds of questions given their commitment to belief updating and rationality. If you don't have the time to participate in the contest, I welcome any ideas about potential topics that might be a fruitful source of these kinds of questions.


What are we assuming about utility functions?

Новости LessWrong.com - 2 октября, 2019 - 18:11
Published on October 2, 2019 3:11 PM UTC

I often notice that in many (not all) discussions about utility functions, one side is "for" their relevance, while others tend to be "against" their usefulness, without explicitly saying what they mean. I don't think this is causing any deep confusions among researchers here, but I'd still like to take a stab at disambiguating some of this, if nothing else for my own sake. Here are some distinct (albeit related) ways that utility functions can come up in AI safety, in terms of what assumptions/hypotheses they give rise to:

AGI utility hypothesis: The first AGI will behave as if it is maximizing some utility function

ASI utility hypothesis: As AI capabilities improve well beyond human-level, it will behave more and more as if it is maximizing some utility function (or will have already reached that ideal earlier and stayed there)

Human utility hypothesis: Even though in some experimental contexts humans seem to not even be particularly goal-directed, utility functions are often a useful model of human preferences to use in AI safety research

Coherent Extrapolated Volition (CEV) hypothesis: For a given human H, there exists some utility function V such that if H is given the appropriate time/resources for reflection, H's values would converge to V

Some points to be made:

  • The "Goals vs Utility Functions" chapter of Rohin's Value Learning sequence, and the resulting discussion focused on differing intuitions about the AGI and ASI utility hypotheses (more accurately, as the title implies, the discussion was whether those agents will be broadly goal-directed at all, a weaker condition than being a utility maximizer).
  • AGI utility doesn't logically imply ASI utility, but I'd be surprised if anyone thinks it's very plausible for the former to be true while the latter fails. In particular, the coherence arguments and other pressures that move agents toward VNM seem to roughly scale with capabilities. A plausible stance could be that we should expect most ASIs to hew close to the VNM ideal, but these pressures aren't quite so overwhelming at the AGI level; in particular, humans are fairly goal-directed but only "partially" VNM, so the goal-directedness pressures on an AGI will likely be at this order of magnitude. Depending on takeoff speeds, we might get many years to try aligning AGIs at this level of goal-directedness, which seems less dangerous than playing sorcerer's apprentice with VNM-based AGIs at the same level of capability. (Note: I might be reifying VNM here too much, in thinking of things having a measure of "goal-directedness" with "very goal-directed" approximating VNM. But this basic picture could be wrong in all sorts of ways.)
  • The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it. On the other hand, behavioral economists have formulated models like prospect theory for when greater precision is required than the simplistic VNM model gives, not to mention the cases where it breaks down more drastically. I haven't seen prospect theory used in AI safety research; I'm not sure if this reflects more a) the size of the field and the fact that few researchers have had much need to explicitly model human preferences, or b) that we don't need to model humans more than superficially, since this kind of research is still at a very early theoretical stage with all sorts of real-world error terms abounding.
  • The CEV hypothesis can be strengthened, consistent with Yudkowsky's original vision, to say that every human will converge to about the same values. But the extra "values converge" assumption seems orthogonal to one's opinions about the relevance of utility functions, so I'm not including it in the above list.
  • In practice a given researcher's opinions on these tend to be correlated, so it makes sense to talk of "pro-utility" and "anti-utility" viewpoints. But I'd guess the correlation is far from perfect, and at any rate, the arguments connecting these hypotheses seem somewhat tenuous.
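For readers unfamiliar with the coherence arguments mentioned above, the classic money-pump illustration behind them can be sketched in a few lines (a standard textbook example, not something from this post):

```python
# An agent with cyclic preferences A > B > C > A will pay a small fee
# for each "upgrade", so a bookie can pump it for money indefinitely.
# Transitive (VNM-compatible) preferences are immune to this.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # cyclic, hence not VNM

def trade(holding, offer, money, fee=1):
    # The agent accepts any strictly preferred item for a small fee.
    if (offer, holding) in prefers:
        return offer, money - fee
    return holding, money

holding, money = "A", 100
for offer in ["C", "B", "A"] * 5:   # the bookie cycles the offers
    holding, money = trade(holding, offer, money)

print(holding, money)   # back to holding "A", but 15 dollars poorer
```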


Human instincts, symbol grounding, and the blank-slate neocortex

Новости LessWrong.com - 2 октября, 2019 - 15:06
Published on October 2, 2019 12:06 PM UTC

Intro: What is Common Cortical Algorithm (CCA) theory, and why does it matter for AGI?

As I discussed at Jeff Hawkins on neuromorphic AGI within 20 years, and was earlier discussed on LessWrong at The brain as a universal learning machine, there is a theory, due originally to Vernon Mountcastle in the 1970s, that the neocortex (75% of the human brain) consists of ~150,000 interconnected copies of a little module, the "cortical column", each of which implements the same algorithm. Following Jeff Hawkins, I'll call this the "common cortical algorithm" (CCA) theory. (I don't think that terminology is standard.)

So instead of saying that the human brain has a vision processing algorithm, motor control algorithm, language algorithm, planning algorithm, and so on, in CCA theory we say that (to a first approximation) we have a massive amount of "general-purpose neocortical tissue", and if you dump visual information into that tissue, it does visual processing, and if you connect that tissue to motor control pathways, it does motor control, etc.

Whether and to what extent CCA theory is true is, I think, very important for AGI forecasting, strategy, and both technical and non-technical safety research directions; see my answer here for more details.

Should we believe CCA theory?

CCA theory, as I'm using the term, is a simplified model. There are almost definitely a couple caveats to it:

  1. There are sorta "hyperparameters" on the generic learning algorithm which seem to be set differently in different parts of the neocortex. For example, some areas of the cortex have higher or lower density of particular neuron types. I don't think this significantly undermines the usefulness or correctness of CCA theory, as long as these changes really are akin to hyperparameters, as opposed to specifying fundamentally different algorithms. So my reading of the evidence is that if you put, say, motor nerves coming out of visual cortex tissue, the tissue could do motor control, but it wouldn't do it quite as well as the motor cortex does.[1]
  2. There is almost definitely a gross wiring diagram hardcoded in the genome—i.e., a set of connections between different neocortical regions, and between the neocortex and other parts of the brain. These connections later get refined and edited during learning. Again, we can ask how much the existence of this innate gross wiring diagram undermines CCA theory. How complicated is the wiring diagram? Is it millions of connections among thousands of tiny regions, or just tens of connections among a few regions? Would the brain work at all if you started with a random wiring diagram? I don't know for sure, but for various reasons, my current belief is that this initial gross wiring diagram is not carrying much of the weight of human intelligence, and thus that this point is not a significant problem for the usefulness of CCA theory.

Going beyond these caveats, I found pretty helpful literature reviews on both sides of the issue:

  • The experimental evidence for CCA theory: see chapter 5 of Rethinking Innateness (1996)
  • The experimental evidence against CCA theory: see chapter 5 of The Blank Slate by Steven Pinker (2002).

I won't go through the debate here, but after reading both of those I wound up feeling that CCA theory (with the caveats above) is probably right, though not 100% proven. Please comment if you've seen any other good references on this topic, especially more up-to-date ones.

CCA theory vs human-universal traits and instincts

The main topic for this post is:

If Common Cortical Algorithm theory is true, then how do we account for all the human-universal instincts and behaviors that evolutionary psychologists talk about?

Indeed, we know that there are a diverse set of remarkably specific human instincts and mental behaviors evolved by natural selection. Again, Steven Pinker's The Blank Slate is a popularization of this argument; it ends with Donald E. Brown's giant list of "human universals", i.e. behaviors that are observed in every human culture.

Now, 75% of the human brain is the neocortex, but the other 25% consists of various subcortical ("old-brain") structures like the amygdala, and these structures are perfectly capable of implementing specific instincts. But these structures do not have access to an intelligent world-model—only the neocortex does! So how can the brain implement instincts that require intelligent understanding? For example, maybe the fact that "Alice got two cookies and I only got one!" is represented in the neocortex as the activation of neural firing pattern 7482943. There's no obvious mechanism to connect this arbitrary, learned pattern to the "That's so unfair!!!" section of the amygdala. The neocortex doesn't know about unfairness, and the amygdala doesn't know about cookies. Quite a conundrum!

This is really a symbol grounding problem, which is the other reason this post is relevant to AI alignment. When the human genome builds a human, it faces the same problem as a human programmer building an AI: how can one point a goal system at things in the world, when the internal representation of the world is a complicated, idiosyncratic, learned data structure? As we wrestle with the AI goal alignment problem, it's worth studying what human evolution did here.

List of ways that human-universal instincts and behaviors can exist despite CCA theory

Finally, the main part of this post. I don't know a complete answer, but here are some of the categories I've read about or thought of, and please comment on things I've left out or gotten wrong!

Mechanism 1: Simple hardcoded connections, not implemented in the neocortex

Example: Enjoying the taste of sweet things. This one is easy. I believe the nerve signals coming out of taste buds branch, with one branch going to the cortex to be integrated into the world model, and another branch going to subcortical regions. So the genes merely have to wire up the sweetness taste buds to the good-feelings subcortical regions.

Mechanism 2: Subcortex-supervised learning.

Example: Wanting to eat chocolate. This is different than the previous item because "sweet taste" refers to a specific innate physiological thing, whereas "chocolate" is a learned concept in the neocortex's world-model. So how do we learn to like chocolate? Because when we eat chocolate, we enjoy it (Mechanism 1 above). The neocortex learns to predict a sweet taste upon eating chocolate, and thus paints the world-model concept of chocolate with a "sweet taste" property. The supervisory signal is multidimensional, such that the neocortex can learn to paint concepts with various labels like "painful", "disgusting", "comfortable", etc., and generate appropriate behaviors in response. (See the DeepMind paper Prefrontal cortex as a meta-reinforcement learning system for a more specific discussion along these lines.)
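A minimal sketch of this mechanism (my own toy illustration; the concept and stimulus names are made-up examples): a hardwired response table plays the role of the subcortex, and the "neocortex" learns to paint co-occurring learned concepts with the corresponding innate labels:

```python
# Hardwired "subcortical" responses to specific physiological stimuli
# (Mechanism 1). These are fixed by the genes.
innate_response = {"sugar": "sweet", "capsaicin": "painful"}

# The "neocortex": a learned mapping from arbitrary concepts to labels.
learned_labels = {}

def experience(concept, physical_stimulus):
    # When a learned concept co-occurs with a hardwired signal, the
    # model attaches that signal's label to the concept, so it can
    # later predict the response (and generate behavior) from the
    # concept alone.
    label = innate_response.get(physical_stimulus)
    if label:
        learned_labels.setdefault(concept, set()).add(label)

experience("chocolate", "sugar")
experience("chili", "capsaicin")
print(learned_labels["chocolate"])   # {'sweet'}
```

The real supervisory signal is of course multidimensional and continuous, but the structure is the same: an innate ground truth supervising an open-ended learned ontology.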

Mechanism 3: Same learning algorithm + same world = same internal model

Possible example: Intuitive biology. In The Blank Slate you can find a discussion of intuitive biology / essentialism, which "begins with the concept of an invisible essence residing in living things, which gives them their form and powers." Thus preschoolers will say that a dog altered to look like a cat is still a dog, yet a wooden toy boat cut into the shape of a toy car has in fact become a toy car. I think we can account for this very well by saying that everyone's neocortex has the same learning algorithm, and when they look at plants and animals they observe the same kinds of things, so we shouldn't be surprised that they wind up forming similar internal models and representations. I found a paper that tries to spell out how this works in more detail; I don't know if it's right, but it's interesting: free link, official link.

Mechanism 4: Human-universal memes

Example: Fire. I think this is pretty self-explanatory. People learn about fire from each other. No need to talk about neurons, beyond the more general issues of language and social learning discussed below.

Mechanism 5: "Two-process theory"

Possible example: Innate interest in human faces.[2] The meta-reinforcement learning mechanism above (Mechanism 2) can be thought of more broadly as an interaction between a hardwired subcortical system that creates a "ground truth", and a cortical learning algorithm that then learns to relate that ground truth to its complex internal representations. Here, Johnson's "two-process theory" for faces fits this same mold, but with a more complicated subcortical system for ground truth. In this theory, a subcortical system gets direct access to a low-resolution version of the visual field, and looks for a pattern with three blobs in locations corresponding to the eyes and mouth of a blurry face. When it finds such a pattern, it passes information to the cortex that this is a very important thing to attend to, and over time the cortex learns what faces actually look like (and suppresses the original subcortical template circuitry). Anyway, Johnson came up with this theory partly based on the observation that newborns are equally entranced by pictures of three blobs versus actual faces (each of which were much more interesting than other patterns), but after a few months the babies were more interested in actual face pictures than the three-blob pictures. (Not sure what Johnson would make of this twitter account.)

(Other possible examples of instincts formed by two-process theory: fear of snakes, interest in human speech sounds, sexual attraction.)
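The three-blob detector in Johnson's theory can be caricatured in a few lines (a deliberately crude sketch of my own, not his actual model): a subcortical circuit scans a low-resolution patch for two "eye" blobs above a "mouth" blob, and flags any match as worth the cortex's attention:

```python
def looks_facelike(patch):
    """patch: a 3x3 grid of 0/1 blob activations from low-res vision.

    Returns True when two top-corner "eye" blobs and a bottom-center
    "mouth" blob are present -- the crude template that (in this
    theory) bootstraps cortical face learning.
    """
    eyes = patch[0][0] and patch[0][2]
    mouth = patch[2][1]
    return bool(eyes and mouth)

three_blobs = [[1, 0, 1],
               [0, 0, 0],
               [0, 1, 0]]
scrambled = [[0, 1, 0],
             [1, 0, 0],
             [0, 0, 1]]

print(looks_facelike(three_blobs), looks_facelike(scrambled))  # True False
```

The point of the two-process story is that something this cheap suffices as ground truth: the cortex then learns what faces really look like, and the crude template is eventually suppressed.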

Mechanism 6: Time-windows

Examples: Filial imprinting in animals, incest repulsion (Westermarck effect) in humans. Filial imprinting is a famous result where newborn chicks (and many other species) form a permanent attachment to the most conspicuous moving object that they see in a certain period shortly after hatching. In nature, they always imprint on their mother, but in lab experiments, chicks can be made to imprint on a person, or even a box. As with other mechanisms here, time-windows provide a nice solution to the symbol grounding problem, in that the genes don't need to know what precise collection of neurons corresponds to "mother", they only need to set up a time window and a way to point to "conspicuous moving objects", which is presumably easier. The brain mechanism of filial imprinting has been studied in detail for chicks, and consists of the combination of time-windows plus the two-process model (mechanism 5 above). In fact, I think the two-process model was proven in chick brains before it was postulated in human brains.

There likewise seem to be various time-window effects in people, such as the Westermarck effect, a sexual repulsion between two people raised together as young children (an instinct which presumably evolved to reduce incest).

Mechanism 7 (speculative): empathetic grounding of intuitive psychology.

Possible example: Social emotions (gratitude, sympathy, guilt,...) Again, the problem is that the neocortex is the only place with enough information to, say, decide when someone slighted you, so there's no "ground truth" to use for meta-reinforcement learning. At first I was thinking that the two-process model for human faces and speech could be playing a role, but as far as I know, deaf-blind people have the normal suite of social emotions, so that's not it either. I looked in the literature a bit and couldn't find anything helpful. So, I made up this possible mechanism (warning: wild speculation).

Step 1 is that a baby's neocortex builds a "predicting my own emotions" model using normal subcortex-supervised learning (Mechanism 2 above). Then a normal Hebbian learning mechanism makes two-way connections between the relevant subcortical structures (amygdala) and the cortical neurons involved in this predictive model.

Step 2 is that the neocortex's universal learning algorithm will, in the normal course of development, naturally discover that this same "predicting my own emotions" model from step 1 can be reused to predict other people's emotions (cf. Mechanism 3 above), forming the basis for intuitive psychology. Now, because of those connections-to-the-amygdala mentioned in step 1, the amygdala is incidentally getting signals from the neocortex when the latter predicts that someone else is angry, for example.

Step 3 is that the amygdala (and/or neocortex) somehow learns the difference between the intuitive psychology model running in first-person mode versus empathetic mode, and can thus generate appropriate reactions, with one pathway for "being angry" and a different pathway for "knowing that someone else is angry".

So let's now return to my cookie puzzle above. Alice gets two cookies and I only get one. How can I feel it's unfair, given that the neocortex doesn't have a built-in notion of unfairness, and the amygdala doesn't know what cookies are? The answer would be: thanks to subcortex-supervised learning, the amygdala gets a message that one yummy cookie is coming, but the neocortex also thinks "Alice is even happier", and that thought also recruits the amygdala, since intuitive psychology is built on empathetic modeling. Now the amygdala knows that I'm gonna get something good, but that Alice is gonna get something even better, and that combination (in the current emotional context) triggers the amygdala to send out waves of jealousy and indignation. This is then a new supervisory signal for the neocortex, which allows the neocortex to gradually develop a model of fairness, which in turn feeds back into the intuitive psychology module, and thereby back to the amygdala, allowing the amygdala to execute more complicated innate emotional responses in the future, and so on.

The special case of language.

It's tempting to put language in the category of memes (mechanism 4 above)—we do generally learn language from each other—but it's not really a meme, because apparently groups of kids can invent grammatical languages from scratch (e.g. Nicaraguan Sign Language). My current guess is that it combines three things: (1) a two-process mechanism (Mechanism 5 above) that makes people highly attentive to human speech sounds; (2) possibly "hyperparameter tuning" in the language-learning areas of the cortex, e.g. to support taller compositional hierarchies than would be required elsewhere in the cortex; and (3) the fact that language can sculpt itself to the common cortical algorithm rather than the other way around—i.e., maybe "grammatical language" is just another word for "a language that conforms to the types of representations and data structures that are natively supported by the common cortical algorithm".

By the way, lots of people (including Steven Pinker) seem to argue that language processing is a fundamentally different and harder task than, say, visual processing, because language requires symbolic representations, composition, recursion, etc. I don't understand this argument; I think vision processing needs the exact same things! I don't see a fundamental difference between the visual-processing system knowing that "this sheet of paper is part of my notebook", and the grammatical "this prepositional phrase is part of this noun phrase". Likewise, I don't see a difference between recognizing a background object interrupted by a foreground occlusion, versus recognizing a noun phrase interrupted by an interjection. It seems to me like a similar set of problems and solutions, which again strengthens my belief in CCA theory.


When I initially read about CCA theory, I didn't take it too seriously because I didn't see how instincts could be compatible with it. But I now find it pretty likely that there's no fundamental incompatibility. So having removed that obstacle, and also read the literature a bit more, I'm much more inclined to believe that CCA theory is fundamentally correct.

Again, I'm learning as I go, and in some cases making things up as I go along. Please share any thoughts and pointers!

  1. The visual cortex actually does do a bit of motor control: it moves the eyeballs. ↩︎

  2. See Rethinking Innateness p116, or better yet Johnson's article ↩︎


Toy model #6: Rationality and partial preferences

LessWrong.com News - October 2, 2019 - 15:04
Published on October 2, 2019 12:04 PM UTC

In my research agenda on synthesising human preferences, I didn't mention explicitly using human rationality to sort through conflicting partial preferences.

This was, in practice, deferred to the "meta-preferences about synthesis". In this view, rationality is just one way of resolving contradictory lower-level preferences, and we wouldn't need to talk about rationality, just observe that it existed - often - within the meta-preferences.

Nevertheless, I think we might gain by making rationality - and its issues - an explicit part of the process.

Defining rationality in preference resolution

Explicit choice

We can define rationality in this area by using the one-step hypotheticals. If there is a contradiction between lower-level preferences, then that contradiction is explained to the human subject, and they can render a verdict.

This process can, of course, result in different outcomes depending on how the question is phrased - especially if we allow the one-step hypothetical to escalate to a "hypothetical conversation" where more arguments and evidence are considered.

So the distribution of outcomes would be interesting. If, in cases where most of the relevant argument/evidence is mentioned, the human tends to come down on one side, then that is a strong contender for being their "true" rational resolution of the issues.

However, if instead the human answers in many different ways, especially if the answer changes because of small changes in how the evidence is ordered, how long the human has to think, whether they get all the counter-evidence or not, and so on - then their preference seems to be much weaker.

For example, I expect that most people similar to me would converge on one answer on questions like "do expected lives saved dominate most other considerations in medical interventions?", while having wildly divergent views on "what's the proper population ethics?".
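This convergence test can be sketched concretely. The answer data below is invented purely for illustration; the idea is just to pose the same contradiction under many phrasings and treat the resolution as strong only if the answers cluster on one side:

```python
from collections import Counter

def resolution_strength(answers):
    """Modal resolution and the fraction of phrasings that agree with it."""
    side, count = Counter(answers).most_common(1)[0]
    return side, count / len(answers)

# Answers converge across phrasings: a strong contender for the "true"
# rational resolution.
side, frac = resolution_strength(["A"] * 9 + ["B"])   # ('A', 0.9)

# A near-even split suggests the preference is weak and phrasing-dependent.
_, frac_split = resolution_strength(["A", "B"] * 5)   # fraction 0.5
```

In practice the hard part is deciding where to draw the threshold between "converged" and "split", which is itself a meta-preference.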

This doesn't matter

Another use of rationality could be to ask the human explicitly whether certain aspects of their preferences should matter. Many humans seem to have implicit biases, whether racial or otherwise; many humans believe that it is wrong to have these biases, or at least wrong to let them affect their decisions[1].

Thus another approach for rationality is to query the subject as to whether some aspect should be affecting their decisions or not (because humans only consider a tiny space of options at once, it's better to ask "should X, Y, and Z be relevant", rather than "are A, B, and C the only things that should be relevant?").

Then these kinds of rational questions can also be treated in the same way as above.

Weighting rationality

Despite carving out a special place for "rationality", the central thrust of the research agenda remains: a human's rational preferences will dominate their other preferences only if they put great weight on their own rationality.

Real humans don't always change their views just because they can't currently figure out a flaw in an argument; nor would we want them to, especially if their own rationality skills are limited or underused.

  1. Having preferences that never affect any decisions at all is, in practice, the same as not having those preferences: they never affect the ordering of possible universes. ↩︎


Double Tongue Whistling

LessWrong.com News - October 2, 2019 - 14:10
Published on October 2, 2019 11:10 AM UTC

I can whistle about seven notes per second, which corresponds to a reel at 105bpm. [1] While this isn't a problem for whistling basslines, it's slightly too slow for melodies at contra dance speed (~110-122bpm). I want to figure out how to whistle faster, and I know there are people who whistle faster, but I don't know how it's usually done.
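The tempo figures here follow from a simple conversion, assuming a reel is counted with four notes per beat and a jig with three (matching the eight-vs-six notes per measure in the footnote):

```python
def whistle_bpm(notes_per_second: float, notes_per_beat: int) -> float:
    """Tempo in beats per minute for a given whistling speed."""
    return notes_per_second * 60 / notes_per_beat

print(whistle_bpm(7, 4))  # reel, 4 notes per beat -> 105.0 bpm
print(whistle_bpm(7, 3))  # jig, 3 notes per beat -> 140.0 bpm
```

By the same conversion, hitting 122bpm on a reel would require about 8.1 notes per second, which is why the goal below is "from seven notes per second to eight".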

I see two main routes:

  • Do what I currently do, but faster.

  • Figure out how to do something else.

The former doesn't seem very promising: I've been playing around with whistling for decades and I suspect I'm pretty close to a local maximum with my current approach. On the other hand, I'm just trying to get from seven notes per second to eight, which seems like it might be possible?

The latter is pretty open, and probably involves doing something that's slower at first but will eventually be faster. The main problem is, how do I know that after I put in all that effort I'll actually end up with something faster? Ideally there would be people demonstrating on youtube or something, with "here's how to whistle quickly" videos, but I'm not seeing that.

Still, it seems like some sort of double-tonguing should work. Normally when I whistle I mark the notes with my glottis, the same as the two glottal stops in "uh-oh". A different option, though, would be to mark the notes by making a velar closure with my tongue, the same as the two velar stops in "cook" ("k"). And then I could alternate between them, which seems like it should let me get up to twice the speed of the slower one. Here's what the three sound like:

I find velar stops harder, partly because it's not what I'm used to, and partly because I'm already using my tongue to form the whistle. Currently I can do them 4-5 times per second. When alternating I can do a little better, 5-6 times per second, but that's still less than the 8-10 you'd expect from doubling my velar-only speed. I can go te-ke-te-ke with closures 10-11 times per second, so I am optimistic.

Has anyone learned to double-tongue their whistling successfully? Does this work?

[1] Or a jig at 140bpm, since that's six notes per measure instead of eight.

Comment via: facebook


Cambridge LW/SSC Meetup

LessWrong.com News - October 2, 2019 - 06:37
Published on October 2, 2019 3:37 AM UTC

This is the monthly Cambridge, MA LessWrong / Slate Star Codex meetup.

Note: The meetup is in apartment 2 (the address box here won't let me include the apartment number).


Does the US nuclear policy still target cities?

LessWrong.com News - October 2, 2019 - 03:18
Published on October 2, 2019 12:18 AM UTC

The history of nuclear strategic bombing

Daniel Ellsberg’s The Doomsday Machine brought my attention to a horrifying fact about early US nuclear targeting policy. In 1961, the US had only one nuclear war plan, and it called for the destruction of every major Soviet city and military target. That is not surprising. However, the plan also called for the destruction of every major Chinese city and military target, even if China had not provoked the United States. In other words, the US nuclear war plan called for the destruction of the major population centers of the most populous country in the world, even in circumstances where that country had not attacked the United States or its allies. Ellsberg points out that at the time, people at RAND and presumably other parts of the US defense establishment understood that the Chinese and the Soviets were beginning to diverge in strategic interests and thus should not be treated as one bloc. Nevertheless, the top levels of the US command, including President Eisenhower, were committed to the utter destruction of both Chinese and Soviet targets in the event of a war with either country.

The policy of destroying cities is a legacy left over from strategic bombing in World War II. The atomic bombings of Hiroshima and Nagasaki are the most famous, but the fire bombings of Japanese and German cities destroyed far more infrastructure and killed far more people than the two atomic bombs. The given rationale for strategic bombing was to destroy the ability of the enemy states to continue to make war. If a state can no longer produce airplanes and tanks, either because the factories have been destroyed or because there are no longer people to work in the factories, then its ability to resist is diminished.

Given the level of technology and development in WWII, strategic bombing had a chance at achieving military objectives, because the conflict was to carry on for multiple years. On the timescale of years, a country’s capacity to build armaments and resupply armies in the field can be crucial to victory.

Nuclear war changes this calculus. In a modern nuclear war involving SLBMs (Submarine launched ballistic missiles), ICBMs (Intercontinental ballistic missiles), strategic bombers, and other weapon systems, the majority of an adversary’s military, industrial and population centers could be destroyed in a matter of days or hours. It is hard to imagine a nuclear war lasting years or even months. Without a prolonged war, the original rationale for strategic bombing disappears, or is at least much reduced. A state may still wish to reduce the capacity of its enemy to fight future wars, but it can no longer claim that the wholesale destruction of cities is necessary to achieve military objectives in the current war.

Why then, did early US nuclear policies call for the destruction of cities?

Nuclear game theory in the 1960s

The destruction of cities was primarily a threat of inflicting harm rather than an attempt to destroy the capacity of the enemy to wage war. The idea, formalized by RAND game theorist Thomas Schelling, was that both the United States and the Soviet Union would threaten massive retaliation against each other's civilian populations and industry to deter the other from starting a war.

Schelling developed a category of game theory involving what he termed "mixed motive games": games where both sides sought advantage, but where the payoff to one side did not strictly correlate with the loss to the other side. In these types of games, both players may wish to avoid outcomes that are mutually unfavorable (Strategy of Conflict, pg. 89). In the case of nuclear deterrence, both sides strongly preferred to avoid nuclear war, and thus both were deterred from taking actions that would directly lead to it.
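The non-zero-sum structure Schelling describes can be sketched as a tiny payoff table (the numbers are invented for illustration). A best-response check shows that mutual escalation, the mutually worst outcome, is not an equilibrium, while the two "one side backs down" outcomes are:

```python
# Illustrative payoffs for a two-player "mixed motive" game in the spirit of
# Schelling: each side chooses to Escalate or Back down, and one side's gain
# is not the other's loss (non-zero-sum). All numbers are invented.
ESCALATE, BACK_DOWN = 0, 1
# payoffs[(row_choice, col_choice)] = (row payoff, col payoff)
payoffs = {
    (ESCALATE, ESCALATE):  (-100, -100),  # mutual escalation: nuclear war
    (ESCALATE, BACK_DOWN): (10, -10),     # row wins the concession
    (BACK_DOWN, ESCALATE): (-10, 10),     # col wins the concession
    (BACK_DOWN, BACK_DOWN): (0, 0),       # status quo
}

def is_nash(r, c):
    """True if neither player gains by unilaterally switching moves."""
    row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in (0, 1))
    col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in (0, 1))
    return row_ok and col_ok

equilibria = [(r, c) for r in (0, 1) for c in (0, 1) if is_nash(r, c)]
# equilibria == [(ESCALATE, BACK_DOWN), (BACK_DOWN, ESCALATE)]
```

This is essentially the game of Chicken: the equilibria are the asymmetric ones where exactly one side concedes, which is why maneuvering the opponent into being the one who must concede is the heart of Schelling's analysis.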

Much of Schelling’s work concerns itself with how states in a nuclear stalemate can pursue their own advantage while avoiding escalation to nuclear war. In this type of game, states try to maneuver each other into positions where the only possible actions are 1) escalate and risk nuclear war or 2) de-escalate and concede something to the other side.

During the Cuban Missile Crisis, Kennedy ordered a blockade of Cuba, believing that such an action would not be sufficient for the Soviet Union to initiate a war. The United States believed that the Soviet Union would not try to break the blockade, because such an action would be recognized by both sides as starting a (nuclear) war. Kennedy proved correct: the Soviet Union neither went to war over the blockade nor risked initiating war by breaking it, and the United States thereby turned both countries' unwillingness to go to war to its advantage.

What does this have to do with the targeting cities? To answer this question it’s necessary to consider how a nuclear war might start. Although nuclear powers would almost always prefer to avoid a nuclear war, each has an incentive to strike first if they believe nuclear war to be inevitable. By striking first they may destroy their adversary’s nuclear forces before they can be used. At this point, it’s useful to define a couple of terms. Counterforce targeting refers to the targeting of enemy military installations, especially other nuclear forces. Countervalue targeting refers to the targeting of enemy infrastructure and population centers.

Consider the primary goal of a first strike. Under the most plausible nuclear war scenarios, it is to eliminate the nuclear forces of the rival state; its objective is primarily counterforce in nature. This is markedly different from the goal of a second strike. The primary goal of a second strike, under normal assumptions of deterrence, is actually to fulfil the pre-commitment made to retaliate if ever attacked. That is, it is necessary to actually be committed to attacking second so as to avoid being attacked in the first place.

Schelling effectively argued that the more punishing the second strike threatened to be, the more effective the deterrent would be. If true, then in the event of a nuclear war, a state following the optimal strategy of deterrence would target cities as well as nuclear targets to make its nuclear response as punishing as possible; that is, it would destroy both counterforce and countervalue targets. This would seem to lead states to a policy of attacking cities without a second thought. Indeed, this was the policy of the US and the Soviet Union in the 1950s and early 1960s. However, just because targeting cities promised to be a more effective deterrent did not mean it promised to be the best policy.

Given some probability of nuclear war, the effectiveness of a deterrent strategy ought to be weighed against the severity of the resulting war were that strategy employed. In other words, it might make sense for a state to commit to not targeting cities in a second strike if its own cities are not destroyed in a first strike. While this may reduce the effectiveness of its deterrent (and perhaps only marginally -- nuclear war is plenty damaging without cities being destroyed; the fallout alone will kill many millions), it may also greatly reduce the severity of a nuclear war.
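This trade-off can be made concrete with a toy expected-cost comparison. All probabilities and damage figures below are invented purely for illustration; the point is only the structure of the calculation:

```python
# Toy comparison of two deterrence postures. The probabilities and damage
# figures are invented for illustration only.
def expected_cost(p_war: float, deaths_if_war: float) -> float:
    """Expected deaths = probability of war times deaths if war occurs."""
    return p_war * deaths_if_war

# Harsher countervalue posture: deters slightly better, but a war is far worse.
countervalue = expected_cost(p_war=0.010, deaths_if_war=100e6)
# Counterforce-only posture: marginally weaker deterrent, much less severe war.
counterforce = expected_cost(p_war=0.015, deaths_if_war=20e6)

# With these made-up numbers, the marginally weaker deterrent still has the
# lower expected cost.
print(countervalue, counterforce)
```

Of course, the real difficulty is that neither the probabilities nor the damage estimates are knowable with any precision, which is why this remained a contested judgment call rather than a calculation.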

Herman Kahn, a prominent and controversial RAND researcher, argued that states would be rational to refrain from destroying cities in a first strike, to retain some bartering power that might allow them to save more of their own cities. The argument is that the defending force might refrain from destroying many enemy cities if doing so prevented their own cities from being destroyed. Kahn believed that the US should study and prepare for negotiating for the avoidance of US cities in a nuclear war and that in order to do this the country should:

  1. Develop the ability to have sufficiently protected or hidden nuclear forces to be able to both survive a first strike and carry out counterforce and countervalue attacks.
  2. Have “backup presidents”, or people with authority to both order attacks and negotiate with the Soviet Union in the midst of a war, and that the US should have multiple secure locations which are staffed 24/7 by these leaders.

Both Herman Kahn and Thomas Schelling agreed that negotiating the end of a nuclear war would be difficult, but both believed it was critical that nuclear states remain capable of negotiation. Schelling writes about this in his 1966 work, Arms and Influence:

The closing stage, furthermore, might have to begin quickly, possibly before the first volley had reached its targets; and even the most confident victor would need to induce his enemy to avoid a final, futile orgy of hopeless revenge. In earlier times, one could plan the opening moves of war in detail and hope to improvise plans for its closure; for thermonuclear war, any preparations for closure would have to be made before the war starts. ...A critical choice in the process of bringing a war to a successful close--or to the least disastrous close--is whether to destroy or to preserve the opposing government and its principal channels of command and communication. If we manage to destroy the opposing government’s control over its own armed forces, we may reduce their military effectiveness. At the same time, if we destroy the enemy government’s authority over its armed forces, we may preclude anyone’s ability to stop the war, to surrender, to negotiate an armistice, or to dismantle the enemy’s weapons.

Historical developments in US nuclear targeting policy

The United States’ nuclear targeting policy has evolved from one of indiscriminate destruction of military and civilian targets, including cities, to one that promises proportional retaliation. While the public documents, perhaps intentionally, do not make the US’s position clear, their implication is that the United States would only target cities in the event that its own cities were destroyed.

The first nuclear targeting plans existed in the form of the SIOP (Single Integrated Operational Plan). This classified document outlined our nuclear policy from 1961 until 2004, and now exists in the form of the Operations Plan (OPLAN). The first SIOP specified all-out targeting of both military targets and population centers, that is, both counterforce and countervalue targeting, in both first strike and second strike scenarios. Later SIOPs contained multiple options, including the option to hold the bombing of cities in reserve.

This paper: "The Trump Administration’s Nuclear Posture Review (NPR): In Historical Perspective" summarizes how the Kennedy administration began to advocate for a limited war scenario that spared cities:

President Kennedy went so far as to endorse Secretary of Defense McNamara’s effort to get the Soviets to agree to a “no cities” nuclear targeting rule, which McNamara and the President soon abandoned in the face of objections from NATO and the US Congress as well as the Kremlin that the idea was totally unrealistic. McNamara thereupon did a 180° turn to champion a MAD arms limitation (and retention) pact with the Soviets – to prevent nuclear war by guaranteeing it will be mutually suicidal. The Johnson administration’s effort to negotiate such a treaty with Kosygin was aborted in 1968 by the Soviet Union’s brutal repression of the reformist Dubcek regime in Czechoslovakia. McNamara continued to work secretly with the military, however, to enlarge the menu in the SIOP (Single Integrated Operational Plan) from which the president could select limited and controlled nuclear responses to a nuclear attack – preserving some possibility of a nuclear cease fire prior to Armageddon.

Even though McNamara’s efforts to change nuclear war plans to spare cities failed, his influence led to changes in the SIOP that for the first time specified a flexible response in nuclear war planning. Nixon would later make additional changes to the SIOP, giving the United States even more flexibility in nuclear targeting scenarios. It is not clear whether the United States ever developed a serious “no cities” strategy during the Cold War, but it did at least lay the foundations for one.

For the first time in US history, President Obama's administration stated the US would not target cities with nuclear weapons. However, this statement did not rule out escalation to countervalue targeting in the midst of a nuclear war, and is best interpreted to mean that the US would only target cities as a retaliatory measure. From the same Historical Perspective paper:

Yet Obama, while conceding to this presumed need to be prepared to actually use nuclear weapons in extreme situations, was not about to totally devolve the planning for such use onto the Pentagon…. he was adamant in his guidance to the military that if that crucial threshold ever had to be crossed, all operations had to be “consistent with the fundamental principles of the Law of Armed Conflict. Accordingly, plans will … apply the principles of distinction and proportionality and seek to minimize collateral damage to civilian populations and civilian objects. The United States will not intentionally target civilian populations or civilian objects” (US Department of Defense, 2013, Report on Nuclear Employment Strategy of the United States Specified in Section 491 of 10 U.S.C., June 12).

The restrictive rules of nuclear engagement were translated into the military’s doctrinal language: “The new guidance,” elaborated the Pentagon’s June 2013 Report on Nuclear Employment Strategy, “requires the United States to maintain significant counterforce capabilities [jargon for directed at strategic weapon systems] against potential adversaries. The new guidance does not rely on a ‘counter-value’ or ‘minimum deterrence’ strategy [jargon for directed at centers of population]” (US Department of Defense, 2013).

Did this mean that the United States was discarding its ultimate assured destruction threat for deterring nuclear war? Clearly not. The guidance was carefully drafted. Does not rely on is different from will not resort to. But more explicitly and openly than previously, the language indicates that assured massive destruction of the enemy country would be the very last resort in an already massively escalating nuclear war, in which all the lesser options had been exhausted and had failed to control the violence.

President Trump’s nuclear policy, as contained in the 2018 Nuclear Posture Review, differs in a number of ways from President Obama’s policies, but doesn’t substantially change the doctrine of holding the targeting of cities in reserve.

If deterrence fails, the initiation and conduct of nuclear operations would adhere to the law of armed conflict and the Uniform Code of Military Justice. The United States will strive to end any conflict and restore deterrence at the lowest level of damage possible for the United States, allies, and partners, and minimize civilian damage to the extent possible consistent with achieving objectives.

Every U.S. administration over the past six decades has called for flexible and limited U.S. nuclear response options, in part to support the goal of reestablishing deterrence following its possible failure. This is not because reestablishing deterrence is certain, but because it may be achievable in some cases and contribute to limiting damage, to the extent feasible, to the United States, allies, and partners.

Conclusion and takeaways

The US nuclear targeting policy, insofar as public statements and documents reveal it, has shifted substantially from a policy of targeting cities by default to a policy that leaves cities as reserve targets for full escalation scenarios. The US policy has never ruled out the possibility of escalation to full countervalue targeting and is unlikely to do so.

The maxim “no plan survives contact with the enemy” is especially worrying from the perspective of nuclear war planning. During the early cold war years described in Daniel Ellsberg’s book, the military culture promoted a dedication to nuclear readiness--so much so that officers violated their own protocols to ensure they could launch nuclear weapons in a crisis. Readiness for retaliation, especially full countervalue retaliation, naturally trades off against risk of full escalation. 

As both Herman Kahn and Thomas Schelling made clear, communication between legitimate authorities is essential to the ability to negotiate the end of a nuclear conflict. Yet the military value of disabling an enemy’s nuclear command, control, and communications (NC3) capabilities is large. This may be the biggest risk to cities; if Moscow and Washington are both destroyed in the early stages of nuclear conflict, then this could easily escalate to all out countervalue targeting. It is important not only that some command structure with the authority to negotiate remain intact on each side, but also that both parties can communicate with each other and trust that the adversary’s command structure is actually intact and capable of negotiation.

Finally, all of this means very little if either of two potential adversaries fails to make plans for a) refraining from initial targeting of cities, b) maintaining NC3 capabilities through an initial nuclear strike, and c) having the authority and intention to negotiate a peace and de-escalation. Indeed, public statements by Soviet leadership during the cold war suggested they had no intention of sparing cities in a retaliatory strike, making any possible US policy of gradual escalation potentially useless. Ultimately, it’s important to recognize here that not all nuclear war scenarios have equal outcomes, and that both sides in a nuclear conflict could benefit greatly from engaging in strategic restraint.


Survival and Flourishing Fund Applications closing in 3 days

LessWrong.com News - October 2, 2019 - 03:12
Published on October 2, 2019 12:12 AM UTC

The grant round we announced a month ago for the new Survival and Flourishing Fund is closing in 3 days. We haven't gotten that many applications, so I would recommend applying.


From the original announcement post:

The plan is to make a total of $1MM-$2MM in grants to organizations working on the long term flourishing and survival of humanity.

At this point in time SFF can only make grants to charities and not individuals, so if you have a project or organization that you want to get funding for, you will have to either already be part of an established charity, found one, or be sponsored by an existing charity.


