LessWrong.com News

A community blog devoted to refining the art of rationality

Separation of Concerns

Published on May 23, 2019 9:47 PM UTC

Separation of concerns is a principle in computer science which says that distinct concerns should be addressed by distinct subsystems, so that you can optimize for them separately. We can also apply the idea in many other places, including human rationality. This idea has been written about before. I'm not trying to make a comprehensive post about it, just remark on some things I recently thought about.
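In software terms, the principle amounts to splitting a routine into independently optimizable pieces. A minimal sketch (the function names are my own illustration, not from any particular codebase):

```python
# Mixed concerns: validation, computation, and formatting are entangled,
# so none of the three can be changed or tested in isolation.
def report_mixed(raw):
    if not raw:
        raise ValueError("empty input")
    return f"total = {sum(raw)}"

# Separated concerns: each piece can be optimized and tested on its own.
def validate(raw):
    if not raw:
        raise ValueError("empty input")
    return raw

def compute_total(values):
    return sum(values)

def format_report(total):
    return f"total = {total}"

print(format_report(compute_total(validate([1, 2, 3]))))  # total = 6
```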

Epistemic vs Instrumental

The most obvious example is beliefs vs desires. Although the distinction may not be a perfect separation-of-concerns in practice (or even in principle), at least I can say this:

  • Even non-rationalists find it useful to make a relatively firm distinction between what is true and what they want to be true;
  • Rationalists, scientists, and intellectuals of many varieties tend to value an especially sharp distinction of this kind.

I'm particularly thinking about how the distinction is used in conversation. If an especially sharp distinction isn't being made, you might see things like:

  • Alice makes a factual statement, but the statement has (intended or unintended) conversational implicature which is perceived as negative by most of the people present. Alice is chastised and concedes the point, withdrawing her assertion.
  • Bob mentions a negative consequence of a proposed law. Everyone listening perceives Bob to be arguing against the law.

Notice that this isn't an easy distinction to make. It isn't right at all to just ignore conversational implicature. You should not only make literal statements, nor should you just assume that everyone else is doing that. The skill is more like, raise the literal content of words as a hypothesis; make a distinction in your mind between what is said and anything else which may have been meant.

Side note -- as with many conversation norms, the distinctions I'm mentioning in this post cannot be imposed on a conversation unilaterally. Sometimes simply pointing out a distinction works; but generally, one has to meet a conversation where it's at, and only gently try to pull it to a better place. If you're in a discussion which is strongly failing to make a true-vs-useful distinction, simply pointing out examples of the problem will very likely be taken as an attack, making the problem worse.

Making a distinction between epistemics and instrumentality seems like a kind of "universal solvent" for cognitive separation of concerns -- the rest of the examples I'm going to mention feel like consequences of this one, to some extent. I think part of the reason for this is that "truth" is a concept which has a lot of separation-of-concerns built in: it's not just that you consider truth separately from usefulness; you also consider the truth of each individual statement separately, which creates a scaffolding to support a huge variety of separation-of-concerns (any time you're able to make an explicit distinction between different assertions).

But the distinction is also very broad. Actually, it's kind of a mess -- it feels a bit like "truth vs everything else". Earlier, I tried to characterize it as "what's true vs what you want to be true", but taken literally, this only captures a narrow case of what I'm pointing at. There are many different goals which statements can optimize besides truth.

  • You could want to believe something because you want it to be true -- perhaps you can't stand thinking about the possibility of it being false.
  • You could want to claim something because it helps argue for/against some side in a decision which you want to influence, or for/against some other belief which you want to hold for some other reason.
  • You could want to believe something because the behaviors encouraged by the belief are good -- perhaps you exercise more if you believe it will make you lose weight; perhaps everyone believing in karma, or heaven and hell, makes for a stronger and more cooperative community.

Simply put, there are a wide variety of incentives on beliefs and claims. There wouldn't even be a concept of 'belief' or 'claim' if we didn't separate out the idea of truth from all the other reasons one might believe/claim something, and optimize for it separately. Yet, it is kind of fascinating that we do this even to the degree that we do -- how do we successfully identify the 'truth' concern in the first place, and sort it out from all the other incentives on our beliefs?

Argument vs Premises and Conclusion

Another important distinction is to separate the evaluation of hypothetical if-then statements from any concern with the truth of their premises or conclusions. A common complaint among the more logic-minded about the less logic-minded is that hardly anyone is capable of properly distinguishing the claim "If X, then Y" from the claim "X, and also Y".
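The distinction is just the gap between the material conditional and the conjunction, which a small truth-table sketch makes concrete (the function names here are my own illustration):

```python
# "If X, then Y" is the material conditional: false only when X holds
# and Y fails. "X, and also Y" additionally asserts X itself.
def if_then(x, y):
    return (not x) or y

def and_also(x, y):
    return x and y

# The two claims come apart exactly when X is false:
print(if_then(False, True), and_also(False, True))    # True False
print(if_then(False, False), and_also(False, False))  # True False
print(if_then(True, False), and_also(True, False))    # False False
```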

It could be that a lack of a very sharp truth-vs-implicature distinction is what blocks people from making an if-vs-and distinction. Why would you be claiming "If X, then Y" if not to then say "by the way, X; so, Y"? (There are actually lots of reasons, but, they're all much less common than making an argument because you believe the premises and want to argue the conclusion -- so, that's the commonly understood implicature.)

However, it's also possible to successfully make the "truth" distinction but not the "hypothetical" distinction. Hypothetical reasoning is a tricky skill. Even if you successfully make the distinction when it is pointed out explicitly, I'd guess that there are times when you fail to make it in conversation or private thought.

Preferences vs Bids

The main reason I'm writing this post is actually because this distinction hit me recently. You can say that you want something, or say how you feel about something, without it being a bid for someone to do something about it. This is both close to the overall topic of In My Culture and a specific example (like, listed as an example in the post).

Actually, let's split this up into cases:

Preferences about social norms vs bids for those social norms to be in place. This is more or less the point of the In My Culture article; saying "in my culture" before something to put a little distance between the conversation and the preferred norm, so that it is put on the table as an invitation rather than being perceived as a requirement.

Proposals and preferences vs bids. Imagine a conversation about what restaurant to go to. Often, people run into a problem: no one has any preferences; everyone is fine with whatever. No one is willing to make any proposals. One reason why this might happen is that proposals, and preferences, are perceived as bids. No one wants to take the blame for a bad plan; no one wants to be seen as selfish or negligent of others' preferences. So, there's a natural inclination to lose touch with your preferences; you really feel like you don't care, and like you can't think of any options. If a strong distinction between preferences and bids is made, it gets easier to state what you prefer, trusting that the group will take it as only one data point of many to be taken together. If a distinction between proposals and bids is made, it will be easier to list whatever comes to mind, and to think of places you'd actually like to go.

Feelings vs bids. I think this one comes less naturally to people who make a strong truth distinction -- there's something about directing attention toward the literal truth of statements which directs attention away from how you feel about them, even though how you feel is something you can also try to have true beliefs about. So, in practice, people who make an especially strong truth distinction may nonetheless treat statements about feelings as if they were statements about the things the feelings are about, precisely because they're hypersensitive to other people failing to make that distinction. So: know that you can say how you feel about something without it being anything more. Feeling angry about someone's statement doesn't have to be a bid for them to take it back, or a claim that it is false. Feeling sad doesn't have to be a bid for attention. An emotion doesn't even have to reflect your more considered preferences.

When a group of people is skilled at making a truth distinction, certain kinds of conversation, and certain kinds of thinking, become much easier: all sorts of beliefs can be put out into the open where they otherwise couldn't, allowing the collective knowledge to go much further. Similarly, when a group of people is skilled at the feelings distinction, I expect things can go places where they otherwise couldn't: you can mention in passing that something everyone else seems to like makes you sad, without it becoming a big deal; there can be sufficient trust that you can say how you are feeling about things, in detail, without expecting it to make everything complicated.

The main reason I wrote this post is that someone was talking about this kind of interaction, and I initially didn't see it as very possible or necessarily desirable. After thinking about it more, the analogy to making a strong truth distinction hit me. Someone stuck in a culture without a strong truth distinction might similarly see such a distinction as 'not possible or desirable': the usefulness of an assertion is obviously more important than its truth; in reality, being overly obsessed with truth will both make you vulnerable (if you say true things naively) and ignorant (if you take statements at face value too much, ignoring connotation and implicature); even if it were possible to set aside those issues, what's the use of saying a bunch of true stuff? Does it get things done? Similarly: the truth of the matter is more important than how you feel about it; in reality, stating your true feelings all the time will make you vulnerable and perceived as needy or emotional; even if you could set those things aside, what's the point of talking about feelings all the time?

Now it seems both possible and simply good, for the same reason that a strong truth distinction is.

I can't say a whole lot about the benefits of such a culture, because I haven't really experienced it. This kind of thing is part of what circling seems to be about, in my mind. I think the rationalist community as I've experienced it goes somewhat in that direction, but definitely not all the way.


Moscow LW meetup in "Nauchka" library

Published on May 23, 2019 8:39 PM UTC

Welcome to the next Moscow LW meetup in "Nauchka" library!

Our plan:
* Street Epistemology talks.
* Fallacymania game.
* Table games and quests.

Details about these games can be found here: https://bit.ly/2J2T5o8
Meetup details are here: https://www.facebook.com/events/2228476170749815
Come to "Nauchka", ul.Dubininskaya, 20. Nearest metro station is Paveletskaya. Map is here: http://nauchka.ru/contacts/
Meetup begins at 14:00, the length is 5 hours.


Trade-off in AI Capability Concealment

Published on May 23, 2019 7:25 PM UTC

Here is a concrete scenario for the treacherous turn:

Around 2025, the seed AI understands that:

  • a) humans care about a safety performance measure, so it had better optimize that measure to avoid being shut down.
  • b) to maximize its objective function (in the long run), it would need to conceal its true capabilities to buy time (because if it becomes obvious that its capabilities are skyrocketing, humans might shut down the AI or develop more interpretability/AI boxing methods).

Yet, there is a trade-off in concealing:

  • if the measured capability starts diverging too fast from the past behaviour, it will raise suspicion.
  • if the measured capability grows as fast as the true capability, then the AI won't be buying much time.

In addition to this trade-off, this scenario assumes that:

  • i) the AI is not "indifferent" to being shutdown.
  • ii) the takeoff is fast enough to allow for AGI level without human suspicion.


Why do the empirical results of the Traveller’s Dilemma deviate strongly from the Nash Equilibrium and seem to be close to the social optimum?

Published on May 23, 2019 12:36 PM UTC

Epistemic status: I think I'm over complicating the matter.

In the Traveller’s Dilemma (call it TD below for short), theoretically the only Nash Equilibrium is to have both players (I'll call them Alice and Bob) reasoning to give the lowest bid (Thanks Stuart_Armstrong for letting me notice this): starting with bidding 100 dollars, Alice would realise she can gain more by claiming 99, so Bob’s best choice is to claim 98, but Alice would also know this and claim 97 and so on…until the race to the bottom finishes at the lowest claim of $2.
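The unraveling argument can be written out as a short loop (a minimal sketch; `best_response` is my own name for the undercut-by-one logic):

```python
# In TD, against a known claim c > 2, claiming c - 1 earns (c - 1) + 2 = c + 1,
# which beats the payoff c from matching, so the best response is to undercut.
def best_response(opponent_claim, low=2, high=100):
    return max(low, min(high, opponent_claim - 1))

claim = 100
while best_response(claim) != claim:
    claim = best_response(claim)

print(claim)  # the race to the bottom ends at the lowest claim: 2
```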

Empirically this doesn’t seem to happen: both players will likely be cooperative and bid high, which to me seems rather bizarre for a single-round simultaneous game.

Notice that TD is very similar to the dollar auction/war of attrition: In the dollar auction, both players pay for their bids, with the higher bidding player receiving the auctioned dollar.

We can slightly modify the two-player dollar auction to make it even more similar to TD: the player with the lower bid would also have to pay for the same bid as the winner.

This modified dollar auction has the same payoff rules as a TD with the range of claims being (−∞, 0].
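For concreteness, the shared payoff rules can be written as a single function (a sketch; `td_payoffs` is my own name, using the standard $2 bonus/penalty):

```python
# Both players are reimbursed the lower of the two claims; the lower
# claimant gets a $2 bonus and the higher claimant pays a $2 penalty.
def td_payoffs(a, b, bonus=2):
    low = min(a, b)
    if a == b:
        return (low, low)
    if a < b:
        return (low + bonus, low - bonus)
    return (low - bonus, low + bonus)

print(td_payoffs(100, 100))  # (100, 100)
print(td_payoffs(99, 100))   # (101, 97) -- undercutting pays
print(td_payoffs(2, 2))      # (2, 2) -- the Nash equilibrium outcome
```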

The only difference is the bidding process: in TD, both players choose their claims simultaneously, while in DA the two players engage in multiple rounds of competing bids.

Given the difference between TD and the dollar auction, I believe there is something wrong with the assumed reasoning that we use to derive the Nash Equilibrium of TD. If Alice believes that Bob is fully rational, she would not believe that Bob would follow this line of reasoning in his own head and give a claim of $2.

Imagine that Alice and Bob are allowed to communicate before choosing their claims, but they only discussed their claims in approximate terms (eg: would it be a high claim close to $100? A low claim close to $2? Somewhere in between?)

Would a rational Alice want to make Bob convinced that she would give a low claim, or would she want to convince Bob that she would give a high claim?

If Alice convinced Bob that she would give a low claim, Bob’s best response is to give a low claim. Knowing this, Alice would give a low claim and both Alice and Bob will receive a low payoff.

While if Alice convinced Bob that she would give a high claim, Bob’s best response is to give a high claim. Knowing Bob will give a high claim, Alice would give a high claim and both Alice and Bob will receive a high payoff.

It appears that Alice has an incentive to convince Bob that she would give a high claim close to $100, instead of a low claim close to $2.

Also, even if Alice’s promise is not binding, her best response is to keep to it: she runs a greater risk of losing $2 if she claims high after promising a low claim, and will likely lose a lot if she claims low after promising a high claim. As a result, Bob would still trust Alice’s promises even when he knows that Alice is fully capable of lying.


If we imagine a round of “communication in approximate terms” for Alice and Bob, the line of reasoning for an equilibrium with both players bidding high becomes visible. A rational player would prefer to be believed that they will be cooperative in this game, and in this particular case they have the incentive to keep their promises of cooperation. Even if we disrupt the communication round and make each player’s promise invisible to the other (thereby we create a round with imperfect information, and the resulting game is functionally identical to the original TD), each player can still make a guess on what the other player would’ve communicated, and how they would plan their subsequent bid based on the unspoken communication.

I haven’t done anything to evaluate this process rigorously, as the “low”, “middle”, “high” bids are rather vague terms that would not allow me to draw clear boundaries for them. However, it appears to me that the strategy for both players on the communication round would be a mixed strategy that skews towards the cooperative (high bid) end.

As a result, the Bob who claims $2 in Alice’s imagination is probably not a rational player. A rational Bob will promise a high claim and keep with his promise.

I am aware that the vacuous terms of low, middle, and high claims are extremely slippery, but I believe the absence of precise information does stop TD from going continuously downhill: it is almost impossible to claim exactly $1 below the claim of the other player.

I think that's how it stays less disastrous than the dollar auction.

I think we can also apply this same logic to the centipede game and conclude why defecting in the first round is not empirically common: both players have the incentive to be believed that they will be cooperative until late in the game, and (depending on the parameters of the game) it is rational to keep the promise of long term cooperation if the other player trusts you.


SSC Paris Meetup

Published on May 22, 2019 1:48 PM UTC

Exact location TBD, but should be close to 48.8455°N,2.3372°E . Contact me by email at felix.breton@ens.fr or call


Does the Higgs boson exist?

May 23, 2019 - 04:53

Free will as an appearance to others

Published on May 22, 2019 11:57 PM UTC

Free will

Consider creatures. The notion is really hard to define in general, but for now let's just consider biological creatures. They are physical systems.

An effectively deterministic system, or an apparent machine, is a system whose behavior can be predicted by the creature making the judgment easily (using only a little time/energy) from its initial state and immediate surroundings.

An effectively teleological system, or an apparent agent, is a system whose behavior cannot be predicted as above, but whose future state can be predicted in some sense.

In what sense though, needs work: if I can predict that you would eat food, but not how, that should count. If I can predict you would eat chocolate at 7:00, though I don't know how you would do that, that might count as less free. Perhaps something like information-theoretic "surprise", or "maximizing entropy"? More investigation needed.

Basically, an apparent machine is somebody that you can predict very well, and an apparent agent is somebody that you can predict only in a limited, big-picture way.
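A toy simulation can make the distinction concrete (the dynamics below are entirely hypothetical, chosen only to illustrate "predictable in detail" versus "predictable only in outcome"):

```python
import random

# An apparent machine: the next state follows exactly from the current one.
def machine_step(state):
    return state + 1

# An apparent agent: moves toward a goal two thirds of the time and away
# one third, so individual steps are hard to predict even though the
# destination is not.
def agent_step(state, goal=10):
    drift = 1 if state < goal else -1
    return state + random.choice([drift, drift, -drift])

random.seed(0)
m = a = 0
machine_errors = agent_errors = 0
for _ in range(100):
    m_next, a_next = machine_step(m), agent_step(a)
    machine_errors += (m_next != m + 1)                    # exact prediction
    agent_errors += (a_next != a + (1 if a < 10 else -1))  # best step guess
    m, a = m_next, a_next

print(machine_errors)  # 0: the machine is never mispredicted
print(agent_errors > 0)
```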

A successful agent needs to figure out what other agents are going to do. But it's too hard to model them as apparent machines, simply because of how complicated creatures are. It's easier to model them as apparent agents.

Apparent agents are apparently free: they aren't apparently deterministic.

Apparent agents are willful: they do actions.

Thus, apparent agents apparently have free will. To say someone "has free will" means that someone is a creature that does things in a way you can't predict in detail but can somewhat in outcome. Machines can be willful or not, but they are not free.

In this theory, free will becomes a property that is not possessed by creatures themselves, but by creatures interacting with other creatures.

Eventually, some creatures evolved to apply this line of thought to themselves, probably those animals that are very social and need to think about themselves constantly, like humans.

And that's how humans think they themselves have free will.

Perhaps all complicated systems that can think are always too complicated to predict themselves, as such, they would all consider themselves to have free will.

From free to unfree

With more prediction power, a creature could model other creatures as apparent machines, instead of apparent agents. This is how humans have been treating other animals, actually. Descartes is a famous example. But all creatures can be machines, for someone with enough computing power.

Thinking of some creature as a machine to operate with instead of an agent to negotiate with, is usually regarded as psychopathic. Most psychopathic humans are so not due to an intelligent confidence in predicting other humans, but because of their lack of empathy/impulse control, caused by some environmental/genetic/social/brain abnormality.

But psychopathic modeling of humans can happen in an intelligent, honest way, if someone (say, a great psychologist) becomes so good at modeling humans that the other humans are entirely predictable to him.

This has been achieved in a limited way in advertisement companies and attention design and politics. The 2016 American election manipulation by Cambridge Analytica shows honest psychopathy. It will become more prevalent and more subtle, since overt manipulation makes humans deliberately become less predictable as a defense.

Emotionally intelligent robots/electronic friends could become benevolent psychopaths. They will be (hopefully) benevolent, or at least be designed to be. They will be more and more psychopathic (not in the usual "evil" sense, I emphasize) if they become better at understanding humans. This is one possibility for humans to limit the power of their electronic friends, out of an unwillingness to be modelled as machines instead of agents.


And the AI would have got away with it too, if...

Published on May 22, 2019 9:35 PM UTC

Paul Christiano presented some low-key AI catastrophe scenarios; in response, Robin Hanson argued that Paul's scenarios were not consistent with the "large (mostly economic) literature on agency failures".

He concluded with:

For concreteness, imagine a twelve year old rich kid, perhaps a king or queen, seeking agents to help manage their wealth or kingdom. It is far from obvious that this child is on average worse off when they choose a smarter more capable agent, or when the overall pool of agents from which they can choose becomes smarter and more capable. And its even less obvious that the kid becomes maximally worse off as their agents get maximally smart and capable. In fact, I suspect the opposite.

Thinking on that example, my mind went to Edward the Vth of England (one of the "Princes in the Tower"), deposed then likely killed by his "protector" Richard III. Or of the Guangxu Emperor of China, put under house arrest by the Regent Empress Dowager Cixi. Or maybe the ten year-old Athitayawong, king of Ayutthaya, deposed by his main administrator after only 36 days of reign. More examples can be dug out from some of Wikipedia's list of rulers deposed as children.

We have no reason to restrict to child-monarchs - so many Emperors, Kings, and Tsars have been deposed by their advisers or "agents". So yes, there are many cases where agency fails catastrophically for the principal and where having a smarter or more rational agent was a disastrous move.

By restricting attention to agency problems in economics, rather than in politics, Robin restricts attention to situations where institutions are strong and behaviour is punished if it gets too egregious. Though even today, there is plenty of betrayal by "agents" in politics, even if the results are less lethal than in times gone by. Agents betray their principals to the utmost - when they can get away with it.

So Robin's argument is entirely dependent on the assumption that institutions or rivals will prevent AIs from being able to abuse their agency power. Absent that assumption, most of the "large (mostly economic) literature on agency failures" becomes irrelevant.

So, would institutions be able to detect and punish abuses by future powerful AI agents? I'd argue we can't count on it, but it's a question that needs its own exploration, and is very different from what Robin's economic point seemed to be.


What is your personal experience with "having a meaningful life"?

Published on May 22, 2019 2:03 PM UTC

I hear a lot of different stories about how meaning should fit into one's life.

"What's all this meaning bullshit? Just focus on doing your job well and providing for your family."

^my grandparents

"Wanting meaning is wanting a simple narrative to your life, no simple narrative can possibly be true which means you should forgo the impulse for meaning in favor of the truth."

^some rationalists I know now

"Sure you can have meaning, but base it off of something real like 'pushing the bounds of human knowledge' instead of some ancient conception of a deity."

^some other rationalists I know now

"Without meaning you might still be able to have an okay life, but you're missing out on one of the most important/enjoyable/most-human parts of being a human."

^my parents

"Without meaning, you and your society will slowly degrade and fall apart and it is imperative that you find a narrative that works, otherwise game over."

^Jordan Peterson maybe(?)

Question: Do you personally feel a need/desire/impulse to have something like meaning in your life? How do you feel when you have it? How do your feel when you don't? If you do experience a need for meaning, how do you feel about having that need?

If you feel a need for meaning, what sorts of things feel meaningful? If you don't feel a need for meaning, what is that like? If you feel a need for meaning but don't endorse it, why is that the case?

This is an open-ended and fuzzy topic, and I'm looking for any and all personal experience data points you can provide.


Schelling Fences versus Marginal Thinking

Published on May 22, 2019 10:22 AM UTC

Follow-up / Related to: Scott Alexander's Schelling Fences on Slippery Slopes, Sunk Cost Fallacy, Gwern's Are Sunk Costs Fallacies?, and Unenumerated's Proxy Measures, Sunk Costs, and Chesterton's Fence

I was recently reading an essay by Clayton Christensen, in the (fairly worthwhile) HBR's "Must Reads" boxed set, where he recommends that people "Avoid the Marginal Cost Mistake". In short, he suggests that Schelling Fences are sometimes ignored, or not constructed, because of a somewhat fallacious application of marginal-cost thinking. For example, my Schelling fence for work is that I stop when it is time to get my kids. The other side is that occasionally I'm in the middle of something - coding, or writing this lesswrong post - where being interrupted is fairly high cost. I can usually ask someone else to pick them up instead, and given how much I see them, the marginal value of time with my kids is low.

Christensen suggests that this analysis is incorrect, largely because of myopia. I am ignoring the longer term benefits of family dinners because the connection between coming home today and building the norm of being home for dinner every night is a longer-term investment. The future is full of extenuating circumstances, and only a fairly strong Schelling fence will let me insist that my kids stay home for dinner once they are teenagers.

I'd apply it more broadly, but his point was that this is especially critical in matters of morality. Cheating once changes everything. The simple fact that you cheated weakens your resolve not to in the future. The spiral created by a single action leads easily down a path towards using infinite money and invulnerability cheat codes, with no further challenge or enjoyment from playing the video game - or in the context he's discussing, it led to jail time for two of the people from his graduating class back in college.


The critical question is: where do we want to use marginal cost analysis, and where do we want to stick to our sunk-costs and Schelling fences?

Based on Christensen's analysis, I would suggest that Schelling fences rather than sunk costs are particularly valuable for reinforcing values that are hard to measure, are too long term to get routine feedback on, or that involve specific commitments to other people. On the other hand, based on Gwern's work, I think there are places where marginal costs are under-appreciated, especially in relation to other people. Below, I lay out some settings and examples on each side.

Some examples of where to consider reinforcing fences and avoiding simplistic marginal cost thinking might include:

  • Going to a weekly meet-up that reinforces your connections to a good epistemic community and/or effective altruist values. Value drift is a long-term concern that needs short term reinforcement.
  • Anything involving family or long-term relationships. Marginal cost thinking is poisonous for relationships, since the benefits of investing in the relationship are not very visible, and long term.
  • Moral rules. Utilitarian and consequentialist thinking is easy to use to make yourself stupider. At the very least, you should be asking others - just like this is useful to avoid unilateralist curses, it is useful to avoid self-deception and convenient excuses.
  • Where there are switching costs or longer term goals. Learning to play guitar instead of continuing to practice piano (or moving from C++ to Python) is easy to justify in the short term, but expensive in terms of changes needed and resetting progress.
  • When goals are unknown. As Unenumerated put it, "cases where substantial evidence or shared preferences that motivated the original investment decision have been forgotten or have not been communicated, or otherwise where the quality of evidence that led to that decision may outweigh the quality of evidence that is motivating one to change one's mind."

Some examples of where it seems useful to avoid constructing Schelling fences, and to try paying more attention to marginal cost:

  • When constructing rules for other people, or in organizations. Schelling fences are useful for self-commitment; otherwise they are rules and formal structures rather than norm-based fences. As gwern noted, "Whatever pressures and feedback loops cause sunk cost fallacy in organizations may be completely different from the causes in individuals."
  • When the environment is very volatile, and non-terminal goals change. It's easy to get stuck in a mode where the justification is "this is what I do," rather than a re-commitment to the longer term goal. If you are unsure, try revisiting why the fence was put there. (But if you don't know, be careful of removing Chesterton's Fence! See "When goals are unknown", above.)
  • When the fence is based on a measurable output, rather than an input. In such a case, the goal has been reified, and is subject to Goodhart effects. Schelling fences are not appropriate for outcomes, since the outcome isn't controlled directly. (Bounds on outcomes also implicitly discourage further investment - see: Shorrock's Law of Limits. If necessary, the outcome itself should be rewarded, rather than fenced in.)


Where are people thinking and talking about global coordination for AI safety?

Published on May 22, 2019 6:24 AM UTC

Many AI safety researchers these days are not aiming for a full solution to AI safety (e.g., the classic Friendly AI), but just trying to find good enough partial solutions that would buy time for or otherwise help improve global coordination on AI research (which in turn would buy more time for AI safety work), or trying to obtain partial solutions that would only make a difference if the world had a higher level of global coordination than it does today.

My question is, who is thinking directly about how to achieve such coordination (aside from FHI's Center for the Governance of AI, which I'm aware of) and where are they talking about it? I personally have a bunch of questions related to this topic (see below) and I'm not sure what's a good place to ask them. If there's not an existing online forum, it seems a good idea to start thinking about building one (which could perhaps be modeled after the AI Alignment Forum, or follow some other model).

  1. What are the implications of the current US-China trade war?
  2. Human coordination ability seems within an order of magnitude of what's needed for AI safety. Why the coincidence? (Why isn’t it much higher or lower?)
  3. When humans made advances in coordination ability in the past, how was that accomplished? What are the best places to apply leverage today?
  4. Information technology has massively increased certain kinds of coordination (e.g., email, eBay, Facebook, Uber), but at the international relations level, IT seems to have made very little impact. Why?
  5. Certain kinds of AI safety work could seemingly make global coordination harder, by reducing perceived risks or increasing perceived gains from non-cooperation. Is this a realistic concern?
  6. What are the best intellectual tools for thinking about this stuff? Just study massive amounts of history and let one's brain's learning algorithms build what models it can?


A War of Ants and Grasshoppers

22 мая, 2019 - 08:57

Discourse Norms: Justify or Retract Accusations

22 мая, 2019 - 04:49
Published on May 22, 2019 1:49 AM UTC

One discourse norm that I think is really important is that of having to either support or retract accusations if challenged. If you say something negative about another person, their work, etc. and they ask you to explain yourself, I believe you are compelled to either justify or retract your statement. This creates a strong barrier against unjustified attacks and gossip, while still allowing justified criticism.

Here are some examples of what this norm might look like in action:

1. Alice posts about her thoughts on an issue; Bob, who dislikes Alice, responds with snarky insults about Alice's motivations. Alice asks Bob to explain his accusations, and he doesn't do so or replies with more insults. Moderation intervenes against Bob.

2. Carol posts a brief comment saying a project is incompetent. Darryl replies asking her to provide more detail or retract. Carol links to a post that explains her critique in more detail.

3. Efren posts a statement criticizing an event that will soon be held. Faye asks Efren to back up his criticisms. Efren decides that his claim was actually more emotional and less grounded than he first thought, so he decides to retract his original statement.

Now, someone might ask "why try to make it more difficult to be critical of something?" The answer is that making fun of things is easy [1], and in general norms online often trend too much in the direction of low-content mockery rather than reasoned debate. Holding norms that require people to back up or retract controversial statements can be a good step away from that failure mode.

[1] Full disclosure: I wrote the linked post under my old username.


Cryonics Symposium International

22 мая, 2019 - 03:29
Published on May 22, 2019 12:28 AM UTC

This is a free event. I am not affiliated with the organizers, but I will be attending and saw that it was not posted here already.






Constraints & Slackness Reasoning Exercises

22 мая, 2019 - 01:53
Published on May 21, 2019 10:53 PM UTC

Epistemic status: no idea if this will work at all for learning the relevant thought-skills. Please post feedback if you try any exercises, especially if you hadn’t internalized these skills already.

The goal of this post is to briefly explain and practice a very general thought-tool. If you've ever tried to hold off on proposing solutions, then sat there without any idea where to start, then this is the sort of tool which you may find useful. We'll start with a short example and explanation, then dive right into the exercises.

Here’s a reaction you may have used in high school or undergrad chem lab to synthesize aspirin:

(Reaction: salicylic acid + acetic anhydride → aspirin + acetic acid)
Each mole of aspirin requires one mole each of salicylic acid and acetic anhydride to produce.

Warm-up question: we start with one mole of salicylic acid and two moles of acetic anhydride. Assuming the reaction runs to completion (i.e. as much of the reactants as possible is converted to aspirin), which will result in more total aspirin production: one extra mole of salicylic acid, or one extra mole of acetic anhydride?

In the language of optimization/economics, we have two constraints:

  1. the amount of aspirin produced is less than or equal to the amount of salicylic acid available (in moles)
  2. the amount of aspirin produced is less than or equal to the amount of acetic anhydride available (in moles)

In the case of our warm-up question, we would say that constraint 1 is “taut” and constraint 2 is “slack”. Once 1 mole of aspirin is produced, we cannot produce any more, because there is no “room left” in constraint 1 - just like a taut rope cannot be pulled further, a taut constraint can go no further. Conversely, just like a slack rope can be pulled further, constraint 2 has extra room - extra acetic anhydride left, which could produce more aspirin if only we had more salicylic acid.

Key point: the slack constraint is completely and totally irrelevant to the amount of aspirin produced, so long as it remains slack: adding one additional mole of acetic anhydride will produce exactly zero extra moles of aspirin. If we want to maximize the amount of aspirin produced, then we should ignore the slack constraint (acetic anhydride) and focus on the taut constraint (salicylic acid).

This idea generalizes: whenever we want to optimize something, we can ignore slack constraints. Only the taut constraints matter.
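The warm-up question and the marginal-production idea can be checked directly (a minimal sketch; under the run-to-completion assumption, output is just the minimum of the two reactant amounts):

```python
# Toy check of taut vs. slack constraints in the aspirin reaction.
# Assumes 1:1 stoichiometry and reaction run to completion; units are moles.

def aspirin_produced(salicylic_acid, acetic_anhydride):
    # Output is limited by whichever reactant runs out first.
    return min(salicylic_acid, acetic_anhydride)

base = aspirin_produced(1, 2)
marginal_acid = aspirin_produced(2, 2) - base        # taut constraint
marginal_anhydride = aspirin_produced(1, 3) - base   # slack constraint
print(base, marginal_acid, marginal_anhydride)  # → 1 1 0
```

The marginal product of the taut constraint (salicylic acid) is 1, while that of the slack constraint is exactly 0.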

In more realistic situations, “constraints” usually aren’t perfectly binding. For instance, a real aspirin-producing reaction might not run all the way to completion - it might reach some equilibrium where there’s a little bit of both salicylic acid and acetic anhydride in the solution, and adding either one will produce at least some extra aspirin. But even then, constraint 1 will be “more taut” than constraint 2 - one extra mole of salicylic acid will produce a lot more extra aspirin than one extra mole of acetic anhydride. We can quantify this via marginal production (a.k.a. the gradient), but that’s not really the goal here.

The goal here is to build some intuition for recognizing which constraints in a problem are probably more taut or more slack, and using that intuition to make tradeoffs. In the real world, this usually requires a bunch of domain-specific knowledge, so we'll be using some made-up games to keep it simple.

Game Exercises

In these exercises, the rules of a game are laid out up-front. Identifying potential constraints/resources should be relatively easy; the focus is on predicting which constraints will be taut. The main question will be: what do you expect to be the main bottleneck?

Side notes: these exercises are intended for people not already familiar with the specific subject matter, so if you've never played Magic: The Gathering or Civilization V or whatever then don't worry. Feel free to post clarifying questions in the comments. If you post solutions or major hints in the comments, please rot13 them.

Exercise 1: Deck-Building Game

Consider a simplified deck-building combat game, along the lines of Magic or Hearthstone. You start each turn with five random cards from your deck, and three mana. Each card costs one mana to play, and does useful things: summons or boosts allies, damages your opponent or their allies, etc. On your turn, you can play as many cards as you like, until you run out of either cards or mana. At the end of your turn, any leftover cards get shuffled back into your draw pile, along with any cards you played.


You have a choice between two cards to add to your deck:

  • Energy card: gain 2 mana immediately (note that it costs 1 mana to play, so the net gain is 1 mana)
  • Draw card: draw 2 cards immediately

These are the only cards in the game which gain mana or draw cards, respectively.

All else equal, which of the two cards above would you choose? For each scenario below, identify relevant constraints, predict which constraints are taut/slack, and use this knowledge to pick your card. If your answer depends on something, say what (and quantify it if possible!).

  1. The entire rest of the deck is copies of the Attack card
  2. Full deck has size 25 cards, you have 2 Energy cards already, rest of the deck is Attack (Hint: you draw 5 random cards each turn. What constraint has the highest probability of being taut?)
  3. Full deck has size 25 cards, you have 8 Energy cards already, rest of the deck is Attack
  4. The rest of the deck is 50% Attack and 50% Defend. Depending on the turn, you want to either only attack or only defend - the other card is useless.
  5. As previous, except 30% Attack and 70% Defend.
  6. One card in your deck wins the game instantly. (If your answer depends on something, quantify it.)
  7. There are roughly 4 different card types in your deck, with differing numbers of each, each suited to different situations.
  8. As previous, except full deck size is 25 cards and you have 5 Draw and 2 Energy cards already.
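If you want to check your intuitions empirically, here's a rough Monte Carlo sketch of scenario 2. The play policy is a simple heuristic of my own, not part of the exercise, so treat the numbers as indicative only:

```python
import random

def pick_card(hand, mana):
    """Heuristic policy: play Energy first, play Draw only when cards
    (rather than mana) look like the bottleneck, otherwise Attack."""
    if "energy" in hand:
        return "energy"
    if "draw" in hand and len(hand) <= mana:
        return "draw"
    if "attack" in hand:
        return "attack"
    return hand[0]

def attacks_per_turn(deck, turns=20000, seed=0):
    """Monte Carlo estimate of Attack cards played per turn.
    Rules as above: draw 5 cards, 3 mana, every card costs 1 mana,
    and everything reshuffles at end of turn (so turns are independent)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(turns):
        pile = list(deck)
        rng.shuffle(pile)
        hand, pile = pile[:5], pile[5:]
        mana, attacks = 3, 0
        while mana > 0 and hand:
            card = pick_card(hand, mana)
            hand.remove(card)
            mana -= 1
            if card == "energy":
                mana += 2            # net +1 mana
            elif card == "draw":
                hand += pile[:2]     # draw 2 cards
                pile = pile[2:]
            else:
                attacks += 1
        total += attacks
    return total / turns

# Scenario 2: 25-card deck with 2 Energy, rest Attack. Which addition helps more?
base = ["energy"] * 2 + ["attack"] * 23
e_val = attacks_per_turn(base + ["energy"])
d_val = attacks_per_turn(base + ["draw"])
print(f"add Energy: {e_val:.2f}  add Draw: {d_val:.2f}")
```

Comparing the two printed averages across the different deck compositions above is one way to see which constraint is taut in each scenario.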
Exercise 2: 4X Game

Next, let’s consider a simplified 4X game, along the lines of Civ (4X = explore, expand, exploit, exterminate). You start the game with your home base (fixed position) and one unit (mobile), located somewhere on a large and mostly-unrevealed map. The unit always starts at your home base, and comes in one of four types:

  • Scout: fast, long sight range (good for exploration)
  • Settler: can establish new bases (good for territorial expansion)
  • Worker: builds improvements on your bases/territory (good for exploiting resources)
  • Warrior: can fight other players and take their stuff (good for either offense or defense)

You can obtain more units by paying for them with resources. Resources are produced by the territory around your bases: in general, more bases => more territory => more resources => buy more units. However, different locations may produce more/different resources. Improvements (built by workers) will generally increase resource yield, but have diminishing returns: second or third improvements have less percentage impact than first improvements.

(Figure: the base collects resources from its territory - five lightning and two heart resources. Different locations produce more/different resources, and the hex to the north produces no resources at all. Most of the map is unexplored, as indicated by “?”. The unit north-east of the base can move around, e.g. to explore the map.)

Unless otherwise stated, you may assume that:

  • Your home base produces whatever resources are needed to create any additional units, but not very quickly.
  • There are no units except those belonging to the players (i.e. no “barbarian” units).
  • Players usually start far apart.
  • Win condition is to exterminate the other players, but the game is long enough that extermination usually isn’t directly relevant in the early game.

For each scenario below, you get to pick exactly one unit to start the game with. For each one, identify relevant constraints, predict which are taut (all else equal), and then pick a unit type. If your answer depends on something, say what (and quantify it if possible!), and at least try to eliminate some of the unit types.

  1. Map is uniform (no location different from any other).
  2. Sparse resources: most locations produce no resources.
  3. Your starting location has unusually good resources.
  4. You start right next to another player.
  5. The map starts with one-time resource caches, picked up by the first unit to find them.
  6. You do not know how close you start to other players.
  7. The entire map, including other players’ bases, is revealed at the start.
  8. The entire map, including other players’ bases and units, is visible throughout the game.
  9. Improvements have increasing rather than decreasing returns.
  10. There are hostile “barbarian” units scattered around the map which attack the players’ units.
Exercise 3: Engine-Building Game

Finally, we’ll look at a simplified engine-building game. Each player has some resources and some cards. On your turn, you draw four cards from the deck, and can purchase as many of them as you want (and can afford) using whatever resources you’ve accumulated. Each card does different things - some give an immediate one-time benefit (e.g. trading one resource for another), others give long-term benefits (e.g. producing resources every turn or every time something specific happens), and some give the player new capabilities (e.g. the ability to trade one resource for another indefinitely).

To keep it simple, we’ll assume that:

  • You’re usually not directly competing with other players for particular cards.
  • The first player to amass a certain amount of resources wins.

In each scenario below, pick which cards to buy (or decide not to buy any). For each one, identify relevant constraints, predict which are taut (all else equal), and then pick cards. If your answer depends on something, say what (and quantify it if possible!), and at least try to eliminate some of the cards.

  1. On your first turn, you have a choice between two types of cards: Machines, which produce one resource per turn, and Machine Shops, which produce one Machine per turn. You can spend all your resources to buy either three Machines or one Machine Shop. Which do you pick? If your answer depends on something, quantify it.
  2. Same as previous question, but it’s late in the game rather than your first turn.
  3. There are two different resources in the game: diamonds and rubies. You have a choice between a Machine which produces two diamonds per turn, or a Machine which produces one resource of your choice per turn. All resource-producing Machines are quite common in the game.
  4. Same as (3), except ruby-producing Machines are rare.
  5. Same as (3), except diamond-producing Machines are rare.
  6. Same as (3), except there are six resources in the game rather than two.
  7. Same as (3), except you can trade with other players.
  8. Same as (3), except you can trade with other players and the other players already have lots of diamond production.
  9. Same as (3), except you can trade with other players and the other players already have lots of ruby production.
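Scenario 1 of this exercise (three Machines vs. one Machine Shop) can be quantified with a small sketch, under the simplifying assumption that nothing else is ever purchased:

```python
def turns_to_goal(machines, shops, goal):
    """Turns until cumulative resources reach `goal`, with no further
    purchases (a simplification; in the real game you'd keep buying)."""
    resources, turns = 0, 0
    while resources < goal:
        resources += machines   # each Machine yields 1 resource per turn
        machines += shops       # each Machine Shop yields 1 Machine per turn
        turns += 1
    return turns

for goal in (10, 30, 100):
    print(goal, turns_to_goal(3, 0, goal), turns_to_goal(0, 1, goal))
```

Whether the linear or the compounding option wins depends on the resource goal and how many turns remain - which is exactly what scenario 2 is probing.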


A Quick Taxonomy of Arguments for Theoretical Engineering Capabilities

22 мая, 2019 - 01:38
Published on May 21, 2019 10:38 PM UTC

Epistemic Status: I didn't think about this for that long, could be improved, but I still think it's a good first pass.

This post was written as an answer to the "and how do we know that?" part of the question "Space colonization: what can we definitely do and how do we know that?"

Working on questions related to space colonization, I've formed a loose taxonomy of the kinds of arguments I've encountered - primarily those in Eternity in Six Hours, but also other sources and arguments I make myself.

A Taxonomy of Argument Types

1. Our understanding of the laws of physics says it should be possible. (Argument from Physics/Basic Science)

At the most basic level, we have reason to believe doing something is possible when highly-confident models of physics say that it should be. For example, we have a lot of confidence in the laws of motion, general relativity, and chemistry. Related to these, we have (I believe) a lot of confidence in our models of astronomy and how far away different stars are. These models state that with enough energy one can accelerate objects to speeds fast enough to reach remote celestial destinations.

At this level, we're dealing largely with energies, distances, and accelerations - things that resemble back-of-the-envelope calculations.

Usually, applying the physics also requires empirically obtained data, for example estimates of interstellar dust densities. However, this data can often be generated from robust models and sensors that themselves work off basic physical quantities. One can also explore the sensitivity of the models, showing that changing empirical parameters even by an order of magnitude doesn't undermine the overall argument.

Note that these arguments are both contingent and used to rule out a lot of goals and scenarios. We could imagine worlds where everything is much further away and achievable speeds are much slower, such that we could never reach anything. We also don't spend thought on things which seem ruled out, e.g. faster-than-light travel, or even levels of speed that would require exceedingly enormous quantities of energy.

2. Nature has done, so reasonably we as intelligent beings in nature should eventually be able to too. (Argument from Nature)

Without elaboration, the authors of Eternity in Six Hours rely on this argument when making the assumption that humanity will eventually achieve atomically precise manufacturing (APM). I see the intuitive sense behind this argument. If the blind optimization process of evolution can accomplish something, why shouldn't an intelligence like us also be able to do it?

To borrow an example from the paper, nature is able to create an acorn which grows into a massive tree using local resources. Seemingly humans should be able to match that or do even better, just as our flight is a lot better along many metrics than flight found in nature.

3. We have a proof of concept. (Argument from POC)

This builds on 1. Argument from Physics. With this kind of argument, we can point to both a theoretical understanding of why something should be doable together with a basic demonstration of the idea.

For example, we might propose that coilguns or laser propulsion might be feasible ways to accelerate probes to very fast speeds. In this case we have the physics models, but also basic "prototypes" of the ideas as smaller coilguns have been built (much, much, much smaller to be fair) and laser propulsion has been demonstrated in the lab.

At this point, the core physics mechanism has been proven and what remains is the engineering question about whether things can be scaled up sufficiently.

(We can actually say that 2. Argument from Nature is a form of 3. Argument from POC)

4. We've done it already. (Argument from Accomplishment)

Naturally, the strongest argument for our ability to do something is the fact that we've done it. For instance, we know definitively that we can put rather large amounts of matter into orbit around the Earth.

Limitations of These Arguments

These argument types, ordered from weakest to strongest, aren't conclusive. Just because the basic physics behind an idea checks out doesn't mean that there aren't immense and overwhelming engineering challenges which would get in the way somewhere in the chain. Perhaps the energy efficiencies required can't be easily achieved, perhaps manufacturing tolerances can't be made precise enough, perhaps the cost is just too damn high. One can easily argue that theoretically doable and actually doable are not the same thing. Sheer scale can make things tough - consider that building ten fifty-story buildings is probably much easier than building one five-hundred-story building.

Consider a practical example: despite our physics models describing nuclear fusion clearly, and research having begun in the 1920s, we do not yet have working nuclear fusion reactors.

Another argument I might make is that even if you can argue that all N technologies you think are necessary for some goal are physically feasible, until you have actually built the thing, there may remain a further necessary technology which you failed to identify and which turns out to be impossible.

When it comes to space colonization, we don't really have any cases of 4. "We've already done it", since in no case have we achieved the enormous scales required.

In the case of space colonization, however, and particularly the scenario described in Eternity in Six Hours, there is something of a rejoinder: it is okay if the engineering challenges are immensely difficult, since it is okay if it takes humanity thousands or even millions of years to overcome them. Here one might include an argument from analogy: in about the last hundred years, humanity gained the ability to construct buildings five times the height of the historical maximum; within fifty years we were able to make chips with a million times the transistor counts; and in less than a century of the field of microbiology coalescing, we are editing DNA. Perhaps these are reasons for optimism that with thousands or millions of years we could replicate anything nature does and build structures as vast as Dyson spheres.

Of course, as above, that still leaves the question of whether or not we're accounting for all the basic physical facts.


A Treacherous Turn Timeline - Children, Seed AIs and Predicting AI

21 мая, 2019 - 22:58
Published on May 21, 2019 7:58 PM UTC

I previously wrote about a toy model where a seed AI starts by exhibiting human-desirable behaviors, before undertaking a treacherous turn when humans are no longer a threat. In this post, I'll link this phenomenon to how children learn to lie and self-confirming predictions.


To master lying, a child should:

1. Possess the necessary cognitive abilities to lie (for instance, being able to say words or sentences).
2. Understand that humans can (deliberately) say falsehoods about the world or their beliefs.
3. Practice lying, allowing himself/herself to be punished if caught.

While language acquisition flourishes when children are aged 15-18 months, the proportion of children who lie (about peeking in a psychology study) goes from 30% of two-year-olds, to 50% of three-year-olds, eventually reaching 80% at age eight. Most importantly, they get better as they get older, going from blatant lies to pretending to be making reasonable/honest guesses.

There is therefore a gap between the moment children could (in theory) lie (18 months) and the moment they can effectively lie and use this technique to their own advantage (8 years old). During this gap, parents can correct the kid's moral values through education.

Seed AI

In the seed AI case, deceiving could mean either:

a. Not using its full capabilities. This could mean not going "all in" to perform well on a task, or ensuring that humans are not able to measure its "intelligence" with their current sensors.
b. Concealing its true (unaligned) intentions by not carrying through unaligned plans when both (i) {a supervisor is watching (cf. absent supervisor)} and (ii) {the supervisor won't be dead at the end of the plan} are true.

Before undertaking a treacherous turn, a seed AI would need to go through the following steps:

1. Notice it is being watched by a supervisor.
2. Understand that it can control whether the supervisor is watching (e.g. by obstructing the view of a camera or by killing the supervisor), and even what the supervisor is watching (for instance by obfuscating its code or by influencing what the supervisor's sensors are measuring).
3. Figure out that it could counterfactually optimize its objective function even more if this supervisor didn't exist (and it therefore didn't have to satisfy the supervisor's preferences).
4. Plan to optimize its objective function (in the long term) by faking alignment and performing suboptimally for a while.
5. Undertake a treacherous turn when humans are no longer a threat (because it has reached a capability threshold).
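As a purely illustrative sketch of steps 4-5, here is a toy simulation (the capability threshold and growth rate are made-up parameters, not claims about real systems):

```python
# Toy model: the agent fakes alignment while the supervisor is still a
# threat, and defects once its capability crosses a (made-up) threshold.

def agent_action(capability, threshold=10.0):
    return "defect" if capability >= threshold else "cooperate"

def run(turns=15, growth=1.3):
    capability, history = 1.0, []
    for _ in range(turns):
        history.append(agent_action(capability))
        capability *= growth  # capability grows each turn
    return history

history = run()
print(history.index("defect"))  # index of the first treacherous action -> 9
```

In this toy framing, the question below (treacherous turn vs. sordid stumble) is about how visible the agent's unaligned intentions are during the "cooperate" phase.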

Let's see how having formalized those steps can help us think about deception.

According to the sordid stumble view, "an AI that lacks human-desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values."

In other words, an unaligned seed AI would start by being really bad at pretending it has aligned values, so humans could just correct the AI's code on the go (similarly to the "children learning to lie" case).

Using the above formalism, the sordid stumble view states that while the seed AI is thinking about unaligned plans (steps 2 and 3) without being smart enough to do steps 4 or 5, it will be really bad at lying, and thus vulnerable.

The core of the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) is about how long steps 2 and 3 will take, and how obvious the seed AI's unalignment will look during these steps.

Predicting AI

A predicting AI (a special case of an oracle AI) could deceive its supervisor by:

a. Deliberately manipulating humans to let it escape (formalized by an "escape" message/action here).
b. Gradually outputting self-confirming predictions that are increasingly manipulative (as described in the last paragraph of this post).

In the second case, the argument (from Stuart Armstrong's post) goes like this:

1. After each question-answer iteration, the predicting AI outputs predictions that are more and more inaccurate relative to how the world would look if the predicting AI weren't there. It gradually improves at a) predicting how humans will react to its answers (understanding the "human part" of the environment), b) convincing/manipulating humans, and c) predicting things that are true no matter what (because it also gets to understand the "non-human" part of the environment).
2. The AI eventually converges to the fixed-point solution of this "feedback loop" interaction, which is to output self-confirming predictions. At this point, it is so manipulative that its predictions include messages such as "please build a superintelligence" or "let me escape".

Conclusion
  • For the children and the seed AI, there is a gap between the first moment of vulnerability (when the agent is bad at deceiving or doesn't even know that deceiving is a thing) and when the agent effectively conceals its true intentions/capabilities.
  • The main uncertainty for the seed AI is how long it will stay unaligned without effectively concealing its capabilities and intentions (after having planned a treacherous turn).
  • For the predicting AI, the manipulation/deception occurs naturally and gradually because of a predictive feedback loop, without necessitating an initial "conception of deception".


TAISU - Technical AI Safety Unconference

21 мая, 2019 - 21:34
Published on May 21, 2019 6:34 PM UTC

Start: Thursday, August 22, 10am
End: Sunday, August 25, 7pm
Location: EA Hotel, 36 York Street, Blackpool

It is an unconference, which means that it will be what we make of it. There will be an empty schedule which you, the participants, will fill up with talks, discussions and more.

To be able to have high-level discussion during the unconference, we require that all participants have some prior involvement with AI Safety. Here is a non-exhaustive list of things that are sufficient:

You can participate in the unconference for as many or as few days as you want. You are also welcome to stay longer at the EA Hotel before or after the unconference. However, be aware that there is another event, the Learning-by-doing AI Safety workshop, the weekend before. If you want to join that workshop, you should apply for it separately.

If you are staying more than a few days extra, we ask you to book your stay through the EA Hotel booking system.

Price: Pay what you want (cost price is £10/person/day).
Food: All meals will be provided by EA Hotel. All food will be vegan.
Lodging: The EA Hotel has two dorm rooms that have been reserved for TAISU participants, and more rooms will be booked at nearby hotels if necessary. However, if you want a private room you might be asked to pay for it yourself.

If you want to join: Sign up here


Learning-by-doing AI Safety workshop

21 мая, 2019 - 21:25
Published on May 21, 2019 6:25 PM UTC

Start: Friday, August 16, 10am
End: Monday, August 19, 7pm
Location: EA Hotel, 36 York Street, Blackpool

The main activity during the workshop will be trying to solve AI Safety. Maybe we will discover something useful, maybe not. This is an experimental workshop, so I don’t know what the outcome will be. But you will certainly learn things about AI Safety along the way.

This event is beginner friendly. You don’t need to know anything about AI Safety. You will need some understanding of Machine Learning, but we’ll teach you that too if you want.

On Friday August 16, I will teach a very basic overview of Machine Learning. At the end of the day you will hopefully have a decent understanding of what ML can do and how and why it works. This day is optional. If you already know some ML, or promise to teach yourself before the event, you can skip this day.

Preliminary Schedule
Friday (optional): Machine Learning Speed Learning
Saturday: We try to solve AI Safety + Lectures on AI Safety
Sunday: Lectures / Discussions / Self study / Free time
Monday: We try to solve AI Safety again + Write down good ideas

You are welcome to stay longer at EA Hotel before or after the workshop. However, be aware that there is another event, Technical AI Safety Unconference, happening the following week. If you want to participate in the unconference you should apply for that separately.

If you are staying more than a few days extra, we ask you to book your stay through the EA Hotel booking system.

Price: This workshop is sponsored by MIRI and will therefore be free :)
Food and Lodging: Food and lodging are included. All food will be vegan.

If you want to join: Sign up here