Gradient hacking

Published on October 16, 2019 12:53 AM UTC

"Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:

Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.

Outside of that footnote, however, there hasn't been much other discussion of this phenomenon, and I think that's a shame, as I think it's definitely something that's worth looking into.

For example, here's a gradient hacking story that I'm particularly concerned about. Suppose you're training a model and it becomes deceptive. You're not worried about that, though, because you have sufficient transparency tools that you're confident you can detect any deceptive cognition it might be doing. If it's thinking about how to deceive you, your transparency tools will pick that up, and you'll know to throw out that model and try again. However, your transparency tools are quite expensive, so you can only afford to run them once, which you're planning on doing after training before you deploy your model. If your model is deceptive, though, then it might know all of that—in particular, it might know that if it stays deceptive, it'll get caught.

However, there is still something it can do in that situation to be able to defect against you and cause whatever it wants to happen during deployment. That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.

How might such a deceptive model accomplish such a feat of gradient hacking? Well, I don't think it's actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn't have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.

The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
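
To make the ridge idea concrete, here is a minimal numerical sketch with two scalar parameters; the quadratic penalties, the penalty strength K, the protected value, and the learning rate are all illustrative assumptions rather than anything a real model would literally contain. The loss includes a self-imposed term that blows up whenever a protected objective parameter drifts from its committed value, so even though the base objective would prefer to move that parameter toward zero, the partial derivative with respect to it keeps pointing back toward the commitment, while gradient descent freely optimizes the other parameter:

```python
# Toy illustration of a self-imposed ridge in the loss landscape.
# All constants here are assumptions chosen for the example, not from the post.
K = 100.0        # steepness of the self-imposed "fail hard" penalty
PROTECTED = 2.0  # the value the model has committed to for its objective parameter
LR = 0.005       # small enough that the steep ridge term converges

def grads(theta_task, theta_obj):
    # Partial derivatives of the total loss:
    #   (theta_task - 5)^2               ordinary task loss
    #   + 0.1 * theta_obj^2              the base optimizer would prefer theta_obj = 0
    #   + K * (theta_obj - PROTECTED)^2  self-imposed ridge protecting the mesa-objective
    d_task = 2.0 * (theta_task - 5.0)
    d_obj = 0.2 * theta_obj + 2.0 * K * (theta_obj - PROTECTED)
    return d_task, d_obj

theta_task, theta_obj = 0.0, PROTECTED
for _ in range(5000):
    d_task, d_obj = grads(theta_task, theta_obj)
    theta_task -= LR * d_task
    theta_obj -= LR * d_obj

print(f"theta_task = {theta_task:.3f}  (freely optimized toward 5.0)")
print(f"theta_obj  = {theta_obj:.3f}  (stays pinned near {PROTECTED} despite the pull toward 0)")
```

In a real network the check would be implemented in the weights rather than as a hand-written penalty term, but the gradient structure is analogous: the partial derivative along the protected direction is dominated by the self-imposed failure, so local updates leave it alone.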

The case of a deceptive model trying to erase evidence of its deception is not quite as straightforward, though I think it would still be possible to do something quite similar in that situation. This is especially true if the model has the ability to read and write to some sort of state (e.g. a transformer which can attend to its previous output). For example, it could write out a pre-commitment to follow the hard-coded policy, and then adopt a policy of always following any pre-commitments it finds, so that it never has to think about the deception that led it to that policy in the first place.

So, that's the problem—how do we solve it? Well, there's a simple solution in the case above, which is just to run your transparency tools throughout training, not just at the end. I think this is the right solution,[1] but it's worth noting that this adds an additional desideratum to your transparency tools, which is that they have to be cheap enough to run during the entire training process. That might not be that bad, however, if you don't do very many gradient descent update steps, and instead just use massive models and huge batch sizes—and there is some evidence that such a procedure is actually the most compute-efficient way to train neural networks. That being said, I think there are some transparency techniques which lend themselves better to being applied across training than others. Thus, the possibility of gradient hacking suggests that we should focus more on those techniques which allow us to not just understand a single model, but also the entire history of how that model was produced.[2]
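
As a sketch of what this change to the training protocol amounts to, the difference is just moving the check inside the training loop; `transparency_check`, `model.loss`, and `optimizer.step` below are hypothetical stand-ins, not a real library API:

```python
# Hypothetical sketch: transparency_check, model.loss, and optimizer.step are
# stand-ins for whatever tools and training API you actually have.

def transparency_check(model) -> bool:
    """Return True if no deceptive cognition is detected (stub for real transparency tools)."""
    return True  # a real implementation would inspect the model's internals

def train(model, optimizer, batches, check_every=1):
    for step, batch in enumerate(batches):
        loss = model.loss(batch)   # hypothetical model API
        optimizer.step(loss)       # hypothetical optimizer API
        # Checking during training, rather than once at the end, closes the window in
        # which a deceptive model could gradient-hack itself into a form that still
        # defects later but no longer shows detectable deceptive cognition.
        if step % check_every == 0 and not transparency_check(model):
            raise RuntimeError(f"Deceptive cognition detected at step {step}; discard this model")
    return model
```

If you train with huge batches and relatively few update steps, checking every step may be affordable; otherwise `check_every` is the knob trading transparency cost against the size of the unmonitored window.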

  1. This is one of the reasons I'm interested in optimization provenance, for example. ↩︎

  2. As an example of a technique which I think translates well to this domain, one of the OpenAI clarity team's current major research thrusts is "model diffing," which is about trying to understand exactly how two similar models differ from each other. Compared to other approaches that are just focused on understanding a single model, model diffing could be quite useful in allowing one to understand the entirety of the training process, since you could diff the model at each gradient descent update step. ↩︎
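
As a very crude illustration of the diff-at-every-step idea (actual model diffing is about comparing what the models compute, not just raw weight deltas, so treat this only as the most naive version), one could rank parameters by how much they changed between two checkpoints stored as dicts of numpy arrays:

```python
import numpy as np

def diff_checkpoints(ckpt_a, ckpt_b):
    """Rank parameters by relative change between two checkpoints.

    ckpt_a, ckpt_b: dicts mapping parameter names to numpy arrays of matching
    shapes, standing in for real model checkpoints.
    """
    report = []
    for name in ckpt_a:
        delta = ckpt_b[name] - ckpt_a[name]
        rel = np.linalg.norm(delta) / (np.linalg.norm(ckpt_a[name]) + 1e-12)
        report.append((name, float(rel)))
    # Largest relative changes first: the parts of the model a human would then
    # inspect more closely with real interpretability tools.
    return sorted(report, key=lambda item: item[1], reverse=True)

# Hypothetical usage on two consecutive training checkpoints:
rng = np.random.default_rng(0)
ckpt_t = {"layer1.w": rng.normal(size=(4, 4)), "layer2.w": rng.normal(size=(4, 2))}
ckpt_t_plus_1 = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in ckpt_t.items()}
ckpt_t_plus_1["layer2.w"] += 0.5  # pretend one layer changed a lot this step

for name, rel_change in diff_checkpoints(ckpt_t, ckpt_t_plus_1):
    print(f"{name}: relative change {rel_change:.3f}")
```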




Always Discard Fascist Policies

Published on October 15, 2019 10:23 PM UTC

Burden of proof

I enjoy the game Secret Hitler. If you're unfamiliar with the game, the rules can be found here: https://secrethitler.com/assets/Secret_Hitler_Rules.pdf. In this game, like in many social deception games, players love to make claims about what you should always or never do when playing. In general, I reject such claims, since most pure strategies in hidden-information games are exploitable. The true mixed-strategy Nash equilibrium for Secret Hitler will inevitably involve non-zero and non-100% probabilities for most actions.

Some examples of bad 100% policies that I've heard:

  • As president, if you pass two fascist policies to your chancellor, always claim you drew three fascist policies rather than going head-to-head against your chancellor.
  • Always assume presidents follow the above bullet point (and thus suspect only the chancellor when two players go head-to-head).
  • As Hitler, always play policies identically to a liberal unless it would cause the liberals to win the game immediately.
  • As a Fascist, if given the chance to view Hitler's party affiliation, always claim Hitler is liberal.

So with this in mind, I recognize that if I make the claim "as a liberal, Always Discard Fascist Policies", I'm placing a heavy burden of proof on myself. Nonetheless, I intend to prove that this is indeed always the optimal strategy.

Assumption

However, I need to make one assumption. I've never played with a group who rejected this assumption, but I'd be curious to hear if readers feel differently. The assumption is that liberal players always immediately and truthfully reveal all information available to them.

The Setup

Imagine Alice and Bob are playing Secret Hitler. The group has approved Alice as president and Bob as chancellor. Alice is a liberal and does not concretely know Bob's party affiliation. Alice draws three tiles. Two are liberal policies and one is fascist. Alice has the option to either:

  • Discard the fascist tile and pass off the two liberal tiles to Bob, denying him any choice in what he does.

OR

  • Discard one of the liberal tiles and allow Bob to select whether to play a liberal or fascist tile.

Proof

Since liberals always honestly reveal all information available to them, we know that Bob knows everything Alice claims to know. Therefore Bob's information is a superset of Alice's (or any liberal's) information.

Now we'll sample a random real number p in the range [0,1] which acts as Bob's source of randomness in case his optimal strategy is probabilistic.

We have two cases:

  1. Based on Bob's information and p, the optimal Fascist strategy if Bob is a Fascist is to play a Liberal policy and avoid being found out.

  2. Based on Bob's information and p, the optimal Fascist strategy if Bob is a Fascist is to (if possible) play a Fascist policy and go head-to-head against Alice.

In the first case, it doesn't matter what Alice does. Whether she hands Bob two liberal policies or a fascist and a liberal policy, Bob will play a liberal policy and the group gains no information about Bob (since anyone, fascist or liberal, would play a liberal policy in this case).

In the second case, Bob will play a fascist policy. However, by our assumption that we are in case 2, this is strictly worse for liberals (from Bob's perspective as a fascist and therefore from the liberals' perspective as well since their information is a subset of his) than if he had instead been forced to play a liberal policy.
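
Here is the case analysis written out as a toy enumeration; the numeric values are arbitrary placeholders, and the only substantive input is the case-2 assumption that a fascist-optimal fascist play is worse for liberals than a forced liberal play:

```python
# Toy enumeration of the two cases above. The numbers are arbitrary placeholders;
# the only substantive assumption (from case 2) is that a fascist-optimal fascist
# play is worse for liberals than a forced liberal play.

FORCED_LIBERAL_VALUE = 0.0  # baseline outcome when Alice discards the fascist tile

def value_for_liberals(case, alice_gives_choice):
    if not alice_gives_choice:
        # Bob receives two liberal tiles and must play a liberal policy.
        return FORCED_LIBERAL_VALUE
    if case == 1:
        # Fascist-optimal play is liberal anyway, and no information is gained,
        # since either party would play a liberal policy here.
        return FORCED_LIBERAL_VALUE
    # Case 2: fascist-optimal play is the fascist policy, which by assumption
    # is strictly worse for liberals than a forced liberal play.
    return FORCED_LIBERAL_VALUE - 1.0  # the size of the penalty is arbitrary

for case in (1, 2):
    forced = value_for_liberals(case, alice_gives_choice=False)
    choice = value_for_liberals(case, alice_gives_choice=True)
    assert forced >= choice
    print(f"case {case}: discard fascist {forced:+.1f} vs. give Bob the choice {choice:+.1f}")
```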

Technical Flaws

After writing this, I realized there are two technical flaws in the above proof that I don't believe make a difference in practice. I'm curious how relevant readers think these are:

  1. Alice does have one piece of information which Bob doesn't have. She knows that she drew two liberal policies and one fascist policy as opposed to two fascist and one liberal policy. If Bob's optimal strategy for some reason hinges on what Alice discarded and Alice knows this, this might be a reason for her to try to briefly fool him into thinking she discarded a fascist policy (the rules expressly forbid talking while a policy is being enacted). I don't see how this might be the case but I also can't prove that it never is.

  2. (Only applies to games with at least 7 players; in <=6-player games, fascists strictly have at least as much information as liberals) Knowing what someone claims is not the same as knowing what they know. Alice knows that she is a liberal, whereas Bob only knows that Alice claims to be a liberal. To see how this concretely could make a difference, imagine this is a game with >=7 players and Alice suspects Bob is Hitler and has managed to subtly convince Bob that she's a fellow fascist; Bob may play the fascist policy in hopes that Alice will back him up and claim to have drawn three fascist policies. I have not seen anything like this happen in practice, and it certainly is not the case most of the time I get into arguments with people about what to discard in the 2L1F case. IME, this argument comes up in <=6-player games as often as it comes up in larger games.

In Practice

Now I recognize that in practice, most Secret Hitler players are not perfectly able to determine optimal strategy and therefore one might claim Alice is giving Bob the option with the hope that he'll "make a mistake" and reveal himself as Fascist when that was an inferior move. This could be the case if, for instance, Alice is a more experienced player than Bob. However, IME newer players almost universally tend to err the other way. That is, they hide and act as liberals even when the proper strategy calls for them to play a Fascist policy and risk outing themselves.

Furthermore, I've found that most players attach too much weight to the decisions the chancellor makes when given a choice. In other words, they see a liberal policy come out and decide to trust the chancellor as a liberal, even when playing a liberal policy clearly would have been an optimal fascist strategy as well. This seems to be an instance of over-reliance on the representativeness heuristic. One way to avoid falling prey to this mistake is simply not to give them the choice in the first place.




Transportation Tracking

Published on October 15, 2019 11:00 AM UTC

Several times lately I've seen people say it's not practical to keep walking and using public transit in the US once you have kids unless you live in Manhattan. I've also had people assume that because we don't have a car, we never use cars. Neither of these is the case, and I thought it might be helpful to talk some about how we do transportation.

First, here are some numbers for the most recent thirty days. They're not completely representative, but it's as typical as any thirty-day period is likely to get. The data comes from going through my location history, collected automatically by my phone, supplemented by memory when I noticed it had missed something:

  • Number of (one-way) trips: 101

  • Trips by purpose:

    • Commute: 40 (40%; subway to and from work)
    • Errand: 25 (25%; groceries, school drop-off)
    • Social: 23 (23%; visiting my dad)
    • Leisure: 13 (13%; going for a walk or to the park)
  • Trips by mode:

    • Transit: 49 (49%; bus and subway)
    • Foot: 39 (39%; walking, with some scooter and trike)
    • Car: 13 (13%; rides, renting, taxis)
  • Days with any:

    • Foot: 24 (80%)
    • Transit: 21 (70%)
    • Car: 9 (30%)
(sheet)

Our family is two adults and two children, 3y and 5y. We live in Somerville, one of the densest cities in the country (#16, though less so if you look at population-weighted density), and about a ten-minute walk from a subway station. Julia works from home; I work downtown. We don't have a car, but we do have housemates, and two of them have cars.

On a typical day it's just walking and public transit. I'll walk (run) to the subway, ride to work, and reverse in the evening. When the kids go out, generally to the park, it will be stroller, wagon, or on foot. Most places we want to go are pretty practical traveling this way. If it's cold or rainy we'll make sure the kids are dressed appropriately, and the stroller has a rain cover.

For grocery shopping we do a mix of things. For perishables I'll generally stop at the supermarket on my way home from work. For non-perishables I'll do a big shop at Market Basket or Wegmans about once a quarter when I happen to have access to a car. (We have a large chest freezer, and plenty of basement space.) We also occasionally order groceries delivered when that seems like a better use of time. And occasionally we'll walk over with the wagon as a fun trip.

For home improvement I generally get materials delivered. It's not that expensive, and it's a lot less work. Other times I've taken the bus to Home Depot, and rented one of their vans or trucks to move things. I often need things that would be too big for a car, anyway.

On Tuesdays we take the bus to my dad's house in the next town over to have dinner with the extended family. This is usually fine, though traveling at rush hour there's a lot of traffic. While we could take the bus back, my dad usually gives us a ride: this is nice of him and lets us stay a bit later in the evening.

For most of my contra dance gigs I need to rent or borrow a car. For some evening gigs it works for me to use a housemate's car (I put gas in it and reimburse them for depreciation), but most of the time I rent. For tours renting makes a lot of sense even if you do own a car: on the most recent one we put ~2,650 miles on the rental car for $231, which is 8.7¢/mile.

We've used taxis/ridesharing some, but in addition to being pricy it's annoying to deal with carseats. Once both kids are onto boosters (they make inflatable ones) it will be more reasonable. When traveling for work we'll generally take one to/from the airport.

Every so often we do have access to a car: someone is flying out of Logan and finds it convenient to leave their car in our driveway while they travel, I'm renting a car for a Saturday night gig but the car rental place is closed on Sunday so we have it until Monday, etc. We'll often take advantage of those situations to do things that are a good fit for a car, like big grocery shops (see above) or driving to an interesting place that isn't very accessible by public transit.

Growing up, my family had two cars, and since both parents needed a car to commute there wasn't anything else that would have made sense. With us it's more of a choice: I think most families in our situation would decide to get a car, but each time I've priced one out it has come out substantially more expensive. I do miss the freedom that comes from the low marginal cost of already having decided to have a car, and have felt the same way at times when I haven't had an unlimited public transit pass. But overall I'm pretty happy with what we're doing, and I like that we end up walking a lot.




Person-moment affecting views

Published on October 15, 2019 10:00 AM UTC

[Epistemic status: sloppy thoughts not informed by the literature. Hoping actual population ethicists might show up and correct me or point me to whoever has already thought about something like this better.]

Person-affecting views say that when you are summing up the value in different possible worlds, you should ignore people who only exist in one of those worlds. This is based on something like the following intuitions:

  1. World A can only be better than world B insofar as it is better for someone.
  2. World A can’t be better than world B for Alice, if Alice exists in world A but not world B.

The further-fact view says that after learning all physical facts about Alice and Alice’—such as whether Alice’ was the physical result of Alice waiting for five seconds, or is a brain upload of Alice, or is what came out of a replicating machine on Mars after Alice walked in on Earth, or remembers being Alice—there is still a further meaningful question of whether Alice and Alice’ are the same person.

I take the further-fact view to be wrong (or at least Derek Parfit does, and I think we agree the differences between Derek Parfit and me have been overstated). Thinking that the further-fact view is wrong seems to be a common position among intellectuals (e.g. 87% among philosophers).

If the further-fact view is wrong, then what we have is a whole lot of different person-moments, with various relationships to one another, which for pragmatic reasons we like to group into clusters called ‘people’. There are different ways we could define the people, and no real answer to which definition is right. This works out pretty well in our world, but you can imagine other worlds (or futures of our world) where the clusters are much more ambiguous, and different definitions of ‘person’ make a big difference, or where the concept is not actually useful.

Person-affecting views seem to make pretty central use of the concept ‘person’. If we don’t accept the further-fact view, and do want to accept a person-affecting view, what would that mean? I can think of several options:

  1. How good different worlds are depends strongly on which definition of ‘person’ you choose (which person moments you choose to cluster together), but this is a somewhat arbitrary pragmatic choice
  2. There is some correct definition of ‘person’ for the purpose of ethics (i.e. there is some relation between person moments that makes different person moments in the future ethically relevant by virtue of having that connection to a present person moment)
  3. Different person-moments are more or less closely connected in various ways, and a person-affecting view should actually have a sliding scale of importance for different person-moments

Before considering these options, I want to revisit the second reason for adopting a person-affecting view: If Alice exists in world A and not in world B, then Alice can’t be made better off by world A existing rather than world B. Whether this premise is true seems to depend on how ‘a world being better for Alice’ works. Some things we might measure would go one way, and some would go the other. For instance, we could imagine it being analogous to:

  1. Alice painting more paintings. If Alice painted three paintings in world A, and doesn’t exist in world B, I think most people would say that Alice painted more paintings in world A than in world B. And more clearly, that world A has more paintings than world B, even if we insist that a world can’t have more paintings without somebody in particular having painted more paintings. Relatedly, there are many things people do for which a sentence of the form ‘If Alice didn’t exist, she wouldn’t have done X’ seems clearly true.
  2. Alice having painted more paintings per year. If Alice painted one painting every thirty years in world A, and didn’t exist in world B, in world B the number of paintings per year is undefined, and so incomparable to ‘one per thirty years’.

Suppose that person-affecting view advocates are right, and the worth of one’s life is more like 2). You just can’t compare the worth of Alice’s life in two worlds where she only exists in one of them. Then can you compare person-moments? What if the same ‘person’ exists in two possible worlds, but consists of different person-moments?

Compare world A and world C, which both contain Alice, but in world C Alice makes different choices as a teenager, and becomes a fighter pilot instead of a computer scientist. It turns out that she is not well suited to it, and finds piloting pretty unsatisfying. If Alice_t1A is different from Alice_t1C, can we say that world A is better than world C, in virtue of Alice’s experiences? Each relevant person-moment only exists in one of the worlds, so how can they benefit?

I see several possible responses:

  1. No we can’t. We should have person-moment affecting views.
  2. Things can’t be better or worse for person-moments, only for entire people, holistically across their lives, so the question is meaningless. (Or relatedly, how good a thing is for a person is not a function of how good it is for their person-moments, and it is how good it is for the person that matters).
  3. Yes, there is some difference between people and person moments, which means that person-moments can benefit without existing in worlds that they are benefitting relative to, but people cannot.

The second possibility seems to involve accepting the second view above: that there is some correct definition of ‘person’ that is larger than a person moment, and fundamental to ethics – something like the further-fact view. This sounds kind of bad to me. And the third view doesn’t seem very tempting without some idea of an actual difference between persons and person-moments.

So maybe the person-moment affecting view looks most promising. Let us review what it would have to look like. For one thing, the only comparable person moments are the ones that are the same. And since they are the same, there is no point bringing about one instead of the other. So there is never reason to bring about a person-moment for its own benefit. Which sounds like it might really limit the things that are worth intentionally doing. Isn’t making myself happy in three seconds just bringing about a happy person moment rather than a different sad person moment?

Is everything just equally good on this view? I don’t think so, as long as you are something like a preference utilitarian: person-moments can have preferences over other person-moments. Suppose that Alice_t0A and Alice_t0C are the same, and Alice_t1A and Alice_t1C are different. And suppose that Alice_t0 wants Alice_t1 to be a computer scientist. Then world A is better than world C for Alice_t0, and so better overall. That is, person-moments can benefit from things, as long as they don’t know at the time that they have benefited.

I think an interesting feature of this view is that all value seems to come from meddling preferences. It is never directly good that there is joy in the world, for instance; it is just good because somebody wants somebody else to experience joy, and that desire was satisfied. If they had instead wished for a future person-moment to be tortured, and this was granted, then this world would apparently be just as good.

So, things that are never directly valuable in this world:

  • Joy
  • Someone getting what they want and also knowing about it
  • Anything that isn’t a meddling preference

On the upside, since person-moments often care about future person-moments within the same person, we do perhaps get back to something closer to the original person-affecting view. There is often reason to bring about or benefit a person moment for the benefit of previous person moments in the history of the same person, who for instance wants to ‘live a long and happy life’. My guess after thinking about this very briefly is that in practice it would end up looking like the ‘moderate’ person-affecting views, in which people who currently exist get more weight than people who will be brought into existence, but not infinitely more weight. People who exist now mostly want to continue existing, and to have good lives in the future, and they care less, though still somewhat, about different people in the future.

So, if you want to accept a person-affecting view and not a further-fact view, the options seem to me to be something like these:

  1. Person-moments can benefit without having an otherworldly counterpart, even though people cannot. Which is to say, only person-moments that are part of the same ‘person’ in different worlds can benefit from their existence. ‘Person’ here is either an arbitrary pragmatic definition choice, or some more fundamental ethically relevant version of the concept that we could perhaps discover.
  2. Benefits accrue to persons, not person-moments. In particular, benefits to persons are not a function of the benefits to their constituent person-moments. Where ‘person’ is again either a somewhat arbitrary choice of definition, or a more fundamental concept.
  3. A sliding scale of ethical relevance of different person-moments, based on how narrow a definition of ‘person’ unites them with any currently existing person-moments. Along with some story about why, given that you can apparently compare all of them, you are still weighting some less, on grounds that they are incomparable.
  4. Person-moment affecting views

None of these sound very good to me, but nor do person-affecting views in general, so maybe I’m the wrong audience. I had thought person-moment affecting views were almost a reductio, but a close friend says he thought they were the obvious reasonable view, so I am curious to hear others’ takes.




Strong stances

Published on October 15, 2019 10:00 AM UTC

I. The question of confidence

Should one hold strong opinions? Some say yes. Some say that while it’s hard to tell, it tentatively seems pretty bad (probably). There are many pragmatically great upsides, and a couple of arguably unconscionable downsides. But rather than judging the overall sign, I think a better question is, can we have the pros without the devastatingly terrible cons?

A quick review of purported or plausible pros:

  1. Strong opinions lend themselves to revision:
    1. Nothing will surprise you into updating your opinion if you thought that anything could happen. A perfect Bayesian might be able to deal with myriad subtle updates to vast uncertainties, but a human is more likely to notice a red cupcake if they have claimed that cupcakes are never red. (Arguably—some would say having opinions makes you less able to notice any threat to them. My guess is that this depends on topic and personality.)
    2. ‘Not having a strong opinion’ is often vaguer than having a flat probability distribution, in practice. That is, the uncertain person’s position is not, ‘there is a 51% chance that policy X is better than policy -X’, it is more like ‘I have no idea’. Which again doesn’t lend itself to attending to detailed evidence.
    3. Uncertainty breeds inaction, and it is harder to run into more evidence if you are waiting on the fence, than if you are out there making practical bets on one side or the other.
  2. (In a bitterly unfair twist of fate) being overconfident appears to help with things like running startups, or maybe all kinds of things.
    If you run a startup, common wisdom advises going around saying things like, ‘Here is the dream! We are going to make it happen! It is going to change the world!’ instead of things like, ‘Here is a plausible dream! We are going to try to make it happen! In the unlikely case that we succeed at something recognizably similar to what we first had in mind, it isn’t inconceivable that it will change the world!’ Probably some of the value here is just a zero sum contest to misinform people into misinvesting in your dream instead of something more promising. But some is probably real value. Suppose Bob works full time at your startup either way. I expect he finds it easier to dedicate himself to the work and has a better time if you are more confident. It’s nice to follow leaders who stand for something, which tends to go with having at least some strong opinions. Even alone, it seems easier to work hard on a thing if you think it is likely to succeed. If being unrealistically optimistic just generates extra effort to be put toward your project’s success, rather than stealing time from something more promising, that is a big deal.
  3. Social competition
    Even if the benefits of overconfidence in running companies and such were all zero sum, everyone else is doing it, so what are you going to do? Fail? Only employ people willing to work at less promising looking companies? Similarly, if you go around being suitably cautious in your views, while other people are unreasonably confident, then onlookers who trust both of you will be more interested in what the other people are saying.
  4. Wholeheartedness
    It is nice to be the kind of person who knows where they stand and what they are doing, instead of always living in an intractable set of place-plan combinations. It arguably lends itself to energy and vigor. If you are unsure whether you should be going North or South, having reluctantly evaluated North as a bit better in expected value, for some reason you often still won’t power North at full speed. It’s hard to passionately be really confused and uncertain. (I don’t know if this is related, but it seems interesting to me that the human mind feels as though it lives in ‘the world’—this one concrete thing—though its epistemic position is in some sense most naturally seen as a probability distribution over many possibilities.)
  5. Creativity
    Perhaps this is the same point, but I expect my imagination for new options kicks in better when I think I’m in a particular situation than when I think I might be in any of five different situations (or worse, in any situation at all, with different ‘weightings’).

A quick review of the con:

  1. Pervasive dishonesty and/or disengagement from reality
    If the evidence hasn’t led you to a strong opinion, and you want to profess one anyway, you are going to have to somehow disengage your personal or social epistemic processes from reality. What are you going to do? Lie? Believe false things? These both seem so bad to me that I can’t consider them seriously. There is also this sub-con:

    1. Appearance of pervasive dishonesty and/or disengagement from reality
      Some people can tell that you are either lying or believing false things, due to your boldly claiming things in this uncertain world. They will then suspect your epistemic and moral fiber, and distrust everything you say.
  2. (There are probably others, but this seems like plenty for now.)

II. Tentative answers

Can we have some of these pros without giving up on honesty or being in touch with reality? Some ideas that come to mind or have been suggested to me by friends:

1. Maintain two types of ‘beliefs’. One set of play beliefs—confident, well understood, probably-wrong—for improving in the sandpits of tinkering and chatting, and one set of real beliefs—uncertain, deferential—for when it matters whether you are right. For instance, you might have some ‘beliefs’ about how cancer can be cured by vitamins that you chat about and ponder, and read journal articles to update, but when you actually get cancer, you follow the expert advice to lean heavily on chemotherapy. I think people naturally do this a bit, using words like ‘best guess’ and ‘working hypothesis’.

I don’t like this plan much, though admittedly I basically haven’t tried it. For your new fake beliefs, either you have to constantly disclaim them as fake, or you are again lying and potentially misleading people. Maybe that is manageable through always saying ‘it seems to me that..’ or ‘my naive impression is..’, but it sounds like a mess.

And if you only use these beliefs on unimportant things, then you miss out on a lot of the updating you were hoping for from letting your strong beliefs run into reality. You get some though, and maybe you just can’t do better than that, unless you want to be testing your whacky theories about cancer cures when you have cancer.

It also seems like you won’t get a lot of the social benefits of seeming confident, if you still don’t actually believe strongly in the really confident things, and have to constantly disclaim them.

But I think I actually object because beliefs are for true things, damnit. If your evidence suggests something isn’t true, then you shouldn’t be ‘believing’ it. And also, if you know your evidence suggests a thing isn’t true, how are you even going to go about ‘believing it’? I don’t know how to.

2. Maintain separate ‘beliefs’ and ‘impressions’. This is like 1, except impressions are just claims about how things seem to you. e.g. ‘It seems to me that vitamin C cures cancer, but I believe that that isn’t true somehow, since a lot of more informed people disagree with my impression.’ This seems like a great distinction in general, but it seems a bit different from what one wants here. I think of this as a distinction between the evidence that you received, and the total evidence available to humanity, or perhaps between what is arrived at by your own reasoning about everyone’s evidence vs. your own reasoning about what to make of everyone else’s reasoning about everyone’s evidence. However these are about ways of getting a belief, and I think what you want here is actually just some beliefs that can be got in any way. Also, why would you act confidently on your impressions, if you thought they didn’t account for others’ evidence, say? Why would you act on them at all?

3. Confidently assert precise but highly uncertain probability distributions: “We should work so hard on this, because it has like a 0.03% chance of reshaping 0.5% of the world, making it a 99.97th percentile intervention in the distribution we are drawing from, so we shouldn’t expect to see something this good again for fifty-seven months.” This may solve a lot of problems, and I like it, but it is tricky.

4. Just do the research so you can have strong views. To do this across the board seems prohibitively expensive, given how much research it seems to take to be almost as uncertain as you were on many topics of interest.

5. Focus on acting well rather than your effects on the world. Instead of trying to act decisively on a 1% chance of this intervention actually bringing about the desired result, try to act decisively on a 95% chance that this is the correct intervention (given your reasoning suggesting that it has a 1% chance of working out). I’m told this is related to Stoicism.

6. ‘Opinions’. I notice that people often have ‘opinions’, which they are not very careful to make true, and do not seem to straightforwardly expect to be true. This seems to be commonly understood by rationally inclined people as some sort of failure, but I could imagine it being another solution, perhaps along the lines of 1.

(I think there are others around, but I forget them.)

III. Stances

I propose an alternative solution. Suppose you might want to say something like, ‘groups of more than five people at parties are bad’, but you can’t because you don’t really know, and you have only seen a small number of parties in a very limited social milieu, and a lot of things are going on, and you are a congenitally uncertain person. Then instead say, ‘I deem groups of more than five people at parties bad’. What exactly do I mean by this? Instead of making a claim about the value of large groups at parties, make a policy choice about what to treat as the value of large groups at parties. You are adding a new variable ‘deemed large group goodness’ between your highly uncertain beliefs and your actions. I’ll call this a ‘stance’. (I expect it isn’t quite clear what I mean by a ‘stance’ yet, but I’ll elaborate soon.) My proposal: to be ‘confident’ in the way that one might be from having strong beliefs, focus on having strong stances rather than strong beliefs.

Strong stances have many of the benefits of confident beliefs. With your new stance on large groups, when you are choosing whether to arrange chairs and snacks to discourage large groups, you skip over your uncertain beliefs and go straight to your stance. And since you decided it, it is certain, and you can rearrange chairs with the vigor and single-mindedness of a person who knows where they stand. You can confidently declare your opposition to large groups, and unite followers in a broader crusade against giant circles. And if at the ensuing party people form a large group anyway and seem to be really enjoying it, you will hopefully notice this the way you wouldn't if you were merely uncertain-leaning-against regarding the value of large groups.

That might have been confusing, since I don’t know of good words to describe the type of mental attitude I’m proposing. Here are some things I don’t mean by ‘I deem large group conversations to be bad’:

  1. “Large group conversations are bad” (i.e. this is not about what is true, though it is related to that.)
  2. “I declare the truth to be ‘large group conversations are bad’” (i.e. this is not of a kind with beliefs. It is not directly about what is true about the world, or empirically observed, though it is influenced by these things. I do not have power over the truth.)
  3. “I don’t like large group conversations”, or “I notice that I act in opposition to large group conversations” (i.e. it is not a claim about my own feelings or inclinations, which would still be a passive observation about the world)
  4. “The decision-theoretically optimal value to assign to large groups forming at parties is negative”, or “I estimate that the decision-theoretically optimal policy on large groups is opposition” (i.e. it is a choice, not an attempt to estimate a hidden feature of the world.)
  5. “I commit to stopping large group conversations” (i.e. It is not a commitment, or directly claiming anything about my future actions.)
  6. “I observe that I consistently seek to avert large group conversations” (this would be an observation about a consistency in my behavior, whereas here the point is to make a new thing (assign a value to a new variable?) that my future behavior may consistently make use of, if I want.)
  7. “I intend to stop some large group conversations” (perhaps this one is closest so far, but a stance isn’t saying anything about the future or about actions—if it doesn’t get changed by the future, and then in future I want to take an action, I’ll probably call on it, but it isn’t ‘about’ that.)

Perhaps what I mean is most like: ‘I have a policy of evaluating large group discussions at parties as bad’, though using ‘policy’ as a choice about an abstract variable that might apply to action, but not in the sense of a commitment.

What is going on here more generally? You are adding a new kind of abstract variable between beliefs and actions. A stance can be a bit like a policy choice on what you will treat as true, or on how you will evaluate something. Or it can also be its own abstract thing that doesn’t directly mean anything understandable in terms of the beliefs or actions nearby.

Some ideas we already use that are pretty close to stances are ‘X is my priority’, ‘I am in the dating market’, and arguably, ‘I am opposed to dachshunds’. X being your priority is heavily influenced by your understanding of the consequences of X and its alternatives, but it is your choice, and it is not dishonest to prioritize a thing that is not important. To prioritize X isn’t a claim about the facts relevant to whether one would want to prioritize it. Prioritizing X also isn’t a commitment regarding your actions, though the purpose of having a ‘priority’ is for it to affect your actions. Your ‘priority’ is a kind of abstract variable added to your mental landscape to collect up a bunch of reasoning about the merits of different things, and package them for easy use in decisions.

Another way of looking at this is as a way of formalizing and concretifying the step where you look at your uncertain beliefs and then decide on a tentative answer and then run with it.

One can be confident in stances, because a stance is a choice, not a guess at a fact about the world. (Though my stance may contain uncertainty if I want, e.g. I could take a stance that large groups have a 75% chance of being bad on average.) So while my beliefs on a topic may be quite uncertain, my stance can be strong, in a sense that does some of the work we wanted from strong beliefs. Nonetheless, since stances are connected with facts and values, my stance can be wrong in the sense of not being the stance I should want to have, on further consideration.

In sum, stances:

  1. Are inputs to decisions in the place of some beliefs and values
  2. Integrate those beliefs and values—to the extent that you want them to be—into a single reusable statement
  3. Can be thought of as something like ‘policies’ on what will be treated as the truth (e.g. ‘I deem large groups bad’) or as new abstract variables between the truth and action (e.g. ‘I am prioritizing sleep’)
  4. Are chosen by you, not implied by your epistemic situation (until some spoilsport comes up with a theory of optimal behavior)
  5. therefore don’t permit uncertainty in one sense, and don’t require it in another (you know what your stance is, and your stance can be ‘X is bad’ rather than ‘X is 72% likely to be bad’), though you should be uncertain about how much you will like your stance on further reflection.

I have found having stances somewhat useful, or at least entertaining, in the short time I have been trying them out, but this is more of a speculative suggestion with no other evidence behind it than it is trustworthy advice.




The Principled Intelligence Hypothesis

Published on October 15, 2019 10:00 AM UTC

I have been reading the thought-provoking Elephant in the Brain, and will probably have more to say on it later. But if I understand correctly, a dominant theory of how humans came to be so smart is that they have been in an endless cat and mouse game with themselves, making norms and punishing violations on the one hand, and cleverly cheating their own norms and excusing themselves on the other (the ‘Social Brain Hypothesis’ or ‘Machiavellian Intelligence Hypothesis’). Intelligence purportedly evolved to get ourselves off the hook, and our ability to construct rocket ships and proofs about large prime numbers is just a lucky side product.

As a person who is both unusually smart, and who spent the last half hour wishing the seatbelt sign would go off so they could permissibly use the restroom, I feel like there is some tension between this theory and reality. I’m not the only unusually smart person who hates breaking rules, who wishes there were more rules telling them what to do, who incessantly makes up rules for themselves, who intentionally steers clear of borderline cases because it would be so annoying to think about, and who wishes the nominal rules were policed predictably and actually reflected expected behavior. This is a whole stereotype of person.

But if intelligence evolved for the prime purpose of evading rules, shouldn’t the smartest people be best at navigating rule evasion? Or at least reliably non-terrible at it? Shouldn’t they be the most delighted to find themselves in situations where the rules were ambiguous and the real situation didn’t match the claimed rules? Shouldn’t the people who are best at making rocket ships and proofs also be the best at making excuses and calculatedly risky norm-violations? Why is there this stereotype that the more you can make rocket ships, the more likely you are to break down crying if the social rules about when and how you are allowed to make rocket ships are ambiguous?

It could be that these nerds are rare, yet salient for some reason. Maybe such people are funny, not representative. Maybe the smartest people are actually savvy. I’m told that there is at least a positive correlation between social skills and other intellectual skills.

I offer a different theory. If the human brain grew out of an endless cat and mouse game, what if the thing we traditionally think of as ‘intelligence’ grew out of being the cat, not the mouse?

The skill it takes to apply abstract theories across a range of domains and to notice places where reality doesn’t fit sounds very much like policing norms, not breaking them. The love of consistency that fuels unifying theories sounds a lot like the one that insists on fair application of laws, and social codes that can apply in every circumstance. Math is basically just the construction of a bunch of rules, and then endless speculation about what they imply. A major object of science is even called discovering ‘the laws of nature’.

Rules need to generalize across a lot of situations—you will have a terrible time as rule-enforcer if you see every situation as having new, ad-hoc appropriate behavior. We wouldn’t even call this having a ‘rule’. But more to the point, when people bring you their excuses, if your rule doesn’t already imply an immovable position on every case you have never imagined, then you are open to accepting excuses. So you need to see the one law manifest everywhere. I posit that technical intelligence comes from the drive to make these generalizations, not the drive to thwart them.

On this theory, probably some other aspects of human skill are for evading norms. For instance, perhaps social or emotional intelligence (I hear these are things, but will not pretend to know much about them). If norm-policing and norm-evading are somewhat different activities, we might expect to have at least two systems that are engorged by this endless struggle.

I think this would solve another problem: if we came to have intelligence for cheating each other, it is unclear why general intelligence per se is the answer to this, but not to other problems we have ever had as animals. Why did we get mental skills this time rather than earlier? Like that time we were competing over eating all the plants, or escaping predators better than our cousins? This isn’t the only time that a species was in fierce competition against itself for something. In fact, that has been happening forever. Why didn’t we develop intelligence to compete against each other for food, back when we lived in the sea? If the theory is just ‘there was strong competitive pressure for something that will help us win, so out came intelligence’, I think there is a lot left unexplained. Especially since the thing we most want to explain is the spaceship stuff, which on this theory is a random side effect anyway. (Note: I may be misunderstanding the usual theory, as a result of knowing almost nothing about it.)

I think this Principled Intelligence Hypothesis does better. Tracking general principles and spotting deviations from them is close to what scientific intelligence is, so if we were competing to do this (against people seeking to thwart us) it would make sense that we ended up with good theory-generalizing and deviation-spotting engines.

On the other hand, I think there are several reasons to doubt this theory, or details to resolve. For instance, while we are being unnecessarily norm-abiding and going with anecdotal evidence, I think I am actually pretty great at making up excuses, if I do say so. And I feel like this rests on the same skill as ‘analogizing one thing to another’ (my being here to hide from a party could just as well be interpreted as my being here to look for the drinks, much as the economy could also be interpreted as a kind of nervous system), which seems like it is quite similar to the skill of making up scientific theories (these five observations being true is much like theory X applying in general), though arguably not the skill of making up scientific theories well. So this is evidence against smart people being bad at norm evasion in general, and against norm evasion being a different kind of skill to norm enforcement, which is about generalizing across circumstances.

Some other outside view evidence against this theory’s correctness is that my friends all think it is wrong, and I know nothing about the relevant literature. I think it could also do with some inside view details – for instance, how exactly does any creature ever benefit from enforcing norms well? Isn’t it a bit of a tragedy of the commons? If norm evasion and norm policing skills vary in a population of agents, what happens over time? But I thought I’d tell you my rough thoughts, before I set this aside and fail to look into any of those details for the indefinite future.




Replacing expensive costly signals

Published on October 15, 2019 10:00 AM UTC

I feel like there is a general problem where people signal something using some extremely socially destructive method, and we can conceive of more socially efficient ways to send the same signal, but trying out alternative signals suggests that you might be especially bad at the traditional one. For instance, an employer might reasonably suspect that a job candidate who did a strange online course instead of normal university would have done especially badly at normal university.

Here is a proposed solution. Let X be the traditional signal, Y be the new signal, and Z be the trait(s) being advertised by both. Let people continue doing X, but subsidize Y on top of X for people with very high Z. Soon Y is a signal of higher Z than X is, and is understood by the recipients of the signals to be a better indicator. People who can’t afford to do both should then prefer Y to X, since Y is a stronger signal, and since it is more socially efficient it is likely to be less costly for the signal senders.

If Y is intrinsically no better a signal than X (without your artificially subsidizing great Z-possessors to send it) then in the long run Y might only end up as strong a sign as X, but in the process, many should have moved to using Y instead.

(A possible downside is that people may end up just doing both forever.)

For example, if you developed a psychometric and intellectual test that only took half a day and predicted very well how someone would do in an MIT undergraduate degree, you could run it for a while for people who actually do MIT undergraduate degrees, offering prizes for high performance, or just subsidizing taking it at all. After the best MIT graduates say on their CVs for a while that they also did well on this thing and got a prize, it is hopefully an established metric, and an employer would as happily have someone with the degree as with a great result on your test. At which point an impressive and ambitious high school leaver would take the test, modulo e.g. concerns that the test doesn’t let you hang out with other MIT undergraduates for four years.

I don’t know if this is the kind of problem people actually have with replacing apparently wasteful signaling systems with better things. Or if this doesn’t actually work after thinking about it for more than an hour. But just in case.




Strengthening the foundations under the Overton Window without moving it

Published on October 15, 2019 10:00 AM UTC

As I understand them, the social rules for interacting with people you disagree with are like this:

  • You should argue with people who are a bit wrong
  • You should refuse to argue with people who are very wrong, because it makes them seem more plausibly right to onlookers

I think this has some downsides.

Suppose there is some incredibly terrible view, V. It is not an obscure view: suppose it is one of those things that most people believed two hundred years ago, but that is now considered completely unacceptable.

New humans are born and grow up. They are never acquainted with any good arguments for rejecting V, because nobody ever explains in public why it is wrong. They just say that it is unacceptable, and you would have to be a complete loser who is also the Devil to not see that.

Since it took the whole of humanity thousands of years to reject V, even if these new humans are especially smart and moral, they probably do not each have the resources to personally out-reason the whole of civilization for thousands of years. So some of them reject V anyway, because they do whatever society around them says is good person behavior. But some of the ones who rely more on their own assessment of arguments do not.

This is bad, not just because it leads to an unnecessarily high rate of people believing V, but because the very people who usually help get us out of believing stupid things – the ones who think about issues, and interrogate the arguments, instead of adopting whatever views they are handed – are being deprived of the evidence that would let them believe even the good things we already know.

In short: we don’t want to give the new generation the best sincere arguments against V, because that would be admitting that a reasonable person might believe V. Which seems to get in the way of the claim that V is very, very bad. Which is not only a true claim, but an important thing to claim, because it discourages people from believing V.

But we actually know that a reasonable person might believe V, if they don’t have access to society’s best collective thoughts on it. Because we have a whole history of this happening almost all of the time. On the upside, this does not actually mean that V isn’t very, very bad. Just that your standard non-terrible humans can believe very, very bad things sometimes, as we have seen.

So this all sounds kind of like the error where you refuse to go to the gym because it would mean admitting that you are not already incredibly ripped.

But what is the alternative? Even if losing popular understanding of the reasons for rejecting V is a downside, doesn’t it avoid the worse fate of making V acceptable by engaging people who believe it?

Well, note that the social rules were kind of self-fulfilling. If the norm is that you only argue with people who are a bit wrong, then indeed if you argue with a very wrong person, people will infer that they are only a bit wrong. But if instead we had norms that said you should argue with people who are very wrong, then arguing with someone who was very wrong would not make them look only a bit wrong.

I do think the second norm wouldn’t be that stable. Even if we started out like that, we would probably get pushed to the equilibrium we are in, because for various reasons people are somewhat more likely to argue with people who are only a bit wrong, even before any signaling considerations come into play. Which makes arguing some evidence that you don’t think the person is too wrong. And once it is some evidence, then arguing makes it look a bit more like you think a person might be right. And then the people who are loath to look a bit more like that drop out of the debate, and so it becomes stronger evidence. And so on.

Which is to say, engaging V-believers does not intrinsically make V more acceptable. But society currently interprets it as a message of support for V. There are some weak intrinsic reasons to take this as a signal of support, which get magnified into it being a strong signal.

My weak guess is that this signal could still be overwhelmed by e.g. constructing some stronger reason to doubt that the message is one of support.

For instance, if many people agreed that there were problems with avoiding all serious debate around V, and accepted that it was socially valuable to sometimes make genuine arguments against views that are terrible, then prefacing your engagement with a reference to this motive might go a long way. Because nobody who actually found V plausible would start with ‘Lovely to be here tonight. Please don’t take my engagement as a sign of support or validation—I am actually here because I think Bob’s ideas are some of the least worthy of support and validation in the world, and I try to do the occasional prophylactic ludicrous debate duty. How are we all this evening?’



Discuss

The fundamental complementarity of consciousness and work

October 15, 2019 - 13:00
Published on October 15, 2019 10:00 AM UTC

Matter can experience things. For instance, when it is a person. Matter can also do work, and thereby provide value to the matter that can experience things. For instance, when it is a machine. Or also, when it is a person.

An important question for what the future looks like, is whether it is more efficient to carry out these functions separately or together.

If separately, then perhaps it is best that we end up with a huge pile of unconscious machinery, doing all the work to support and please a separate collection of matter specializing in being pleased.

If together, then we probably end up with the value being had by the entities doing the work.

I think we see people assuming that it is more efficient to separate the activities of producing and consuming value. For instance, that the entities whose experiences matter in the future will ideally live a life of leisure. And that lab grown meat is a better goal than humane farming.

Which seems plausible. It is at least in line with the general observation that more efficient systems seem to be specialized.

However I think this isn’t obvious. Some reasons we might expect working and benefiting from work to be done by overlapping systems:

  • We don’t know which systems are conscious. It might be that highly efficient work systems tend to be unavoidably conscious. In which case, making their experience good rather than bad could be a relatively cheap way to improve the overall value of the world.
  • For humans, doing purposeful activities is satisfying, so much so that there are concerns about how humans will cope when they are replaced by machines. It might be hard for humans to avoid being replaced, since they are probably much less efficient than other possible machines. But if doing useful things tends to be gratifying for creatures—or for the kinds of creatures we decide are good to have—then it is less obvious that highly efficient creatures won’t be better off doing work themselves, rather than being separate from it.
  • Consciousness is presumably cheap and useful for getting something done, since we evolved to have it.
  • Efficient production doesn’t seem to evolve to be entirely specialized, especially if we take an abstract view of ‘production’. For instance, it is helpful to produce the experience of being a sports star alongside the joy of going to sports games.
  • Specialization seems especially helpful if keeping track of things is expensive. However, technology will make that cheaper, so perhaps the world will tend less toward specialization than it currently seems. For instance, you would prefer to plant an entire field of one vegetable rather than a mixture, because then when you harvest them, you can do it quickly without sorting them. But if sorting them is basically immediate and free, you might prefer to plant the mixture. For instance, if they take different nutrients from the soil, or if one wards off insects that would eat the other.


Discuss

Realistic thought experiments

October 15, 2019 - 13:00
Published on October 15, 2019 10:00 AM UTC

What if…

…after you died, you would be transported back and forth in time and get to be each of the other people who ever lived, one at a time, but with no recollection of your other lives?

…you had lived your entire life once already, and got to the end and achieved disappointingly few of your goals, and had now been given the chance to go back and try one more time?

…you were invisible and nobody would ever notice you? What if you were invisible and couldn’t even affect the world, except that you had complete control over a single human?

…you were the only person in the world, and you were responsible for the whole future, but luckily you had found a whole lot of useful robots which could expand your power, via for instance independently founding and running organizations for years without your supervision?

…you would only live for a second, before having your body taken over by someone else?

…there was a perfectly reasonable and good hypothetical being who knew about and judged all of your actions, hypothetically?

…everyone around you was naked under their clothes?

…in the future, many things that people around you asserted confidently would turn out to be false?

…the next year would automatically be composed of approximate copies of today?

…eternity would be composed of infinitely many exact copies of your life?

Added later:

…you just came into existence and got put into your present body—conveniently, with all the memories and skills of the body’s previous owner?

***

(Sometimes I or other people reframe the world for some philosophical or psychological purpose. These are the ones I can currently remember off the top of my head. Several are not original to me*. I’m curious to hear others.)

*Credits: #3 is from Plato and Joseph Carlsmith respectively. #5 is surely not original, but I can’t find its source easily. #7 is some kind of standard anti-social anxiety advice. #9 is from David Wong’s Cracked post on 5 ways you are sabotaging your own life (without even knowing it). #10 is old. #11 is from commenter Doug S, and elsewhere Nate Soares, and according to him is common advice on avoiding the Sunk Cost Fallacy.



Discuss

Personal relationships with goodness

October 15, 2019 - 13:00
Published on October 15, 2019 10:00 AM UTC

Many people seem to find themselves in a situation something like this:

  1. Good actions seem better than bad actions. Better actions seem better than worse actions.
  2. There seem to be many very good things to do—for instance, reducing global catastrophic risks, or saving children from malaria.
  3. Nonetheless, they continually do things that seem vastly less good, at least some of the time. For instance, just now I went and listened to a choir singing. You might also admire kittens, or play video games, or curl up in a ball, or watch a movie, or try to figure out whether the actress in the movie was the same one that you saw in a different movie. I’ll call this ‘indulgence’, though it is not quite the right category.

On the face of it, this is worrying. Why do you do the less good things? Is it because you prefer badness to goodness? Are you evil?

It would be nice to have some kind of a story about this. Especially if you are just going to keep on occasionally admiring kittens or whatever for years on end. I think people settle on different stories. These don’t have obviously different consequences, but I think they do have subtly different ones. Here are some stories I’m familiar with:

I’m not good: “My behavior is not directly related to goodness, and nor should it be”, “It would be good to do X, but I am not that good”, “Doing good things rather than bad things is generally supererogatory”

I think this one is popular. I find it hard to stomach, because if I am not good that seems like a serious problem. Plus, if goodness isn’t the guide to my actions, it seems like I’m going to need some sort of concept like schmoodness to determine which things I should do. Plus I just care about being good for some idiosyncratic reason. But it seems actually dangerous, because not treating goodness as a guide to one’s actions seems like it might affect one’s actions pretty negatively, beyond excusing a bit of kitten admiring or choir attendance.

In its favor, this story can help with ‘leaving a line of retreat’: maybe you can better think about what is good, honestly, if you aren’t going to be immediately compelled to do it. It also has the appealing benefit of not looking dishonest, hypocritical, or self-aggrandizing.

Goodness is hard: “I want to be good, but I fail due to weakness of will or some other mysterious force”

This one probably only matches one’s experience while actively trying to never indulge in anything, which seems rare as a long term strategy.

Indulgence is good: “I am good, but it is not psychologically sustainable to exist without admiring kittens. It really helps with productivity.” “I am good, and it is somehow important for me to admire kittens. I don’t know why, and it doesn’t sound that plausible, but I don’t expect anything good to happen if I investigate or challenge it”

This is nice, because you get to be good, and continue to pursue good things, and not feel endlessly bad about the indulgence.

It has the downside that it sounds a bit like an absurd rationalization—’of course I care about solving the most important problems, for instance, figuring out where the cutest kittens are on the internet’. Also, supposing that fruitless entertainments are indeed good, they are presumably only good in moderation, and so it is hard for observers to tell if you are doing too much, which will lead them to suspect that you are doing too much. Also, you probably can’t tell yourself if you are doing too much, and supposing that there is any kind of pressure to observe more kittens under the banner of ‘the best thing a person can do’, you might risk that happening.

I’m partly good; indulgence is part of compromise: “I am good, but I am a small part of my brain, and there are all these other pesky parts that are bad, and I’m reasonably compromising with them” “I have many parts, and at least one of them is good, and at least one of them wants to admire kittens.”

This has the upside of being arguably relatively accurate, and many of the downsides of the first story, but to a lesser degree.

Among these, there seems to be a basic conflict between being able to feel virtuous, and being able to feel honest and straightforward. Which I guess is what you get if you keep on doing apparently non-virtuous things. But given that stopping doing those things doesn’t seem to be a real option, I feel like it should be possible to have something close to both.

I am interested to hear about any other such accounts people might have heard of.

 



Discuss

Are ethical asymmetries from property rights?

October 15, 2019 - 13:00
Published on October 15, 2019 10:00 AM UTC

These are some intuitions people often have:

  • You are not required to save a random person, but you are definitely not allowed to kill one
  • You are not required to create a person, but you are definitely not allowed to kill one
  • You are not required to create a happy person, but you are definitely not allowed to create a miserable one
  • You are not required to help a random person who will be in a dire situation otherwise, but you are definitely not allowed to put someone in a dire situation
  • You are not required to save a person in front of a runaway train, but you are definitely not allowed to push someone in front of a train. By extension, you are not required to save five people in front of a runaway train, and if you have to push someone in front of the train to do it, then you are not allowed.

Here are some more:

  • You are not strongly required to give me your bread, but you are not allowed to take mine
  • You are not strongly required to lend me your car, but you are not allowed to unilaterally borrow mine
  • You are not strongly required to send me money, but you are not allowed to take mine

The former are ethical intuitions. The latter are implications of a basic system of property rights. Yet they seem very similar. The ethical intuitions seem to just be property rights as applied to lives and welfare. Your life is your property. I’m not allowed to take it, but I’m not obliged to give it to you if you don’t by default have it. Your welfare is your property. I’m not allowed to lessen what you have, but I don’t have to give you more of it.

[Edited to add: A basic system of property rights means assigning each thing to a person, who is then allowed to decide what happens to that thing. This gives rise to asymmetry because taking another person’s things is not allowed (since they are in charge of them, not you), but giving them more things is neutral (since you are in charge of your things and can do what you like with them).]

My guess is that these ethical asymmetries—which are confusing, because they defy consequentialism—are part of the mental equipment we have for upholding property rights.

In particular these well-known asymmetries seem to be explained well by property rights:

  • The act-omission distinction naturally arises where an act would involve taking someone else’s property (broadly construed—e.g. their life, their welfare), while an omission would merely fail to give them additional property (e.g. life that they are not by default going to have, additional welfare).
  • ‘The asymmetry’ between creating happy and miserable people is because to create a miserable person is to give that person something negative, which is to take away what they have, while creating a happy person is giving that person something extra.
  • Person-affecting views arise because birth gives someone a thing they don’t have, whereas death takes a thing from them.

Further evidence that these intuitive asymmetries are based on upholding property rights: we also have moral-feeling intuitions about more straightforward property rights. Stealing is wrong.

If I am right that we have these asymmetrical ethical intuitions as part of a scheme to uphold property rights, what would that imply?

It might imply something about when we want to uphold them, or consider them part of ethics, beyond their instrumental value. Property rights at least appear to be a system for people with diverse goals to coordinate use of scarce resources—which is to say, to somehow use the resources with low levels of conflict and destruction. They do not appear to be a system for people to achieve specific goals, e.g. whatever is actually good. Unless what is good is exactly the smooth sharing of resources.

I’m not actually sure what to make of that—should we write off some moral intuitions as clearly evolved for not-actually-moral reasons and just reason about the consequentialist value of upholding property rights? If we have the moral intuition, does that make the thing morally valuable, regardless of its origins? Are pragmatic rules for social cohesion all that ethics is anyway? Questions for another time perhaps (when we are sorting out meta-ethics anyway).

A more straightforward implication is for how we try to explain these ethical asymmetries. If we have an intuition about an asymmetry which stems from upholding property rights, it would seem to be a mistake to treat it as evidence about an asymmetry in consequences, e.g. in value accruing to a person. For instance, perhaps I feel that I am not obliged to create a life, by having a child. Then—if I suppose that my intuitions are about producing goodness—I might think that creating a life is of neutral value, or is of no value to the created child. When in fact the intuition exists because allocating things to owners is a useful way to avoid social conflict. That intuition is part of a structure that is known to be agnostic about benefits to people from me giving them my stuff. If I’m right that these intuitions come from upholding property rights, this seems like an error that is actually happening.



Discuss

Worth keeping

October 15, 2019 - 13:00
Published on October 15, 2019 10:00 AM UTC

(Epistemic status: quick speculation which matches my intuitions about how social things go, but which I hadn’t explicitly described before, and haven’t checked.)

If your car gets damaged, should you invest more or less in it going forward? It could go either way. The car needs more investment to be in good condition, so maybe you do that. But the car is worse than you thought, so maybe you start considering a new car, or putting your dollars into Uber instead.

If you are writing an essay and run into difficulty describing something, you can put in additional effort to find the right words, or you can suspect that this is not going to be a great essay, and either give up, or prepare to get it out quickly and imperfectly, worrying less about the other parts that don’t quite work.

When something has a problem, you always choose whether to double down with it or to back away.

(Or in the middle, to do a bit of both: to fix the car this time, but start to look around for other cars.)

I’m interested in this as it pertains to people. When a friend fails, do you move toward them—to hold them, talk to them, pick them up at your own expense—or do you edge away? It probably depends on the friend (and the problem). If someone embarrasses themselves in public, do you sully your own reputation to stand up for their worth? Or do you silently hope not to be associated with them? If they are dying, do you hold their hand, even if it destroys you? Or do you hope that someone else is doing that, and become someone they know less well?

Where a person fits on this line would seem to radically change their incentives around you. Someone firmly in your ‘worth keeping’ zone does better to let you see their problems than to hide them. Because you probably won’t give up on them, and you might help. Since everyone has problems, and they take effort to hide, this person is just a lot freer around you. If instead every problem hastens a person’s replacement, they should probably not only hide their problems, but also many of their other details, which are somehow entwined with problems.

(A related question is when you should let people know where they stand with you. Prima facie, it seems good to make sure people know when they are safe. But that means it also being clearer when a person is not safe, which has downsides.)

If there are better replacements in general, then you will be inclined to replace things more readily. If you can press a button to have a great new car appear, then you won’t have the same car for long.

The social analog is that in a community where friends are more replaceable—for instance, because everyone is extremely well selected to be similar on important axes—it should be harder to be close to anyone, or to feel safe and accepted. Even while everyone is unusually much on the same team, and unusually well suited to one another.



Discuss

ML is an inefficient market

October 15, 2019 - 09:13
Published on October 15, 2019 6:13 AM UTC

For the last year I've been playing with some exotic software technologies. My company has already used them to construct what we believe is the best algorithm in the world for IMU-based gesture detection.

I've checked with ML engineers at major tech companies, successful startup founders, and the Kaggle forums. None of them are using these particular technologies. When I ask them about it, they show total disinterest. It's like asking an Ottoman cavalry officer what he's going to do about the Maxim gun.

This personal experience indicates that

  1. simple tools already exist that could make our machine learning algorithms much more powerful and
  2. nobody (else) is using them.

I used to think an AGI might be impossible to build in this century. Now I wonder if the right team could build one within the next few years.



Discuss

TAISU 2019 Field Report

October 15, 2019 - 04:09
Published on October 15, 2019 1:09 AM UTC

Last summer I delivered a "field report" after attending the Human Level AI multi-conference. In mid-August of this year I attended the Learning-by-doing AI Safety Workshop (LBDAISW? I'll just call it "the workshop" hereafter) and the Technical AI Safety Unconference (TAISU) at the EA Hotel in Blackpool. So in a similar spirit to last year I offer you a field report of some highlights and what I took away from the experience.

I'll break it down into 3 parts: the workshop, TAISU, and the EA Hotel.

The workshop

The learning by doing workshop was organized by Linda Linsefors and led by Linda and Davide Zagami. The zeroth day (so labeled because it was optional) consisted of talks by Linda and Davide explaining machine learning concepts. Although this day was optional, I found it very informative, because machine learning "snuck up" on me by becoming relevant after I earned my Master's in Computer Science, so there have remained a number of gaps in my knowledge about how modern ML works. Having a day full of covering basics with lots of time for questions and answers was very beneficial to me, as I think it was for many of the other participants. Most of us had lumpy ML knowledge, so it was worthwhile to get us all on the same footing so we could at least talk coherently in the common language of machine learning. As I said, though, it was optional, and I think it could have easily been skipped by someone happy with their level of familiarity with ML.

The next three days were all about solving AI safety. The approach Linda took was to avoid loading people up with existing ideas (which was relevant because some of the participants had not previously thought much about AI safety) and instead to ask us to try to solve AI safety afresh. The first day we did an exercise of imagining different scenarios and how we would address AI safety under those scenarios. Linda called this "sketching" solutions to AI safety, with the goal being to develop one or more sketches of how AI safety might be solved by going directly at the problem. For example, you might start out working through your basic assumptions about how AI would be dangerous, and then see where that pointed to a need for solutions; then you'd do it again, choosing different assumptions, and see where that led you. Once we had done that for a couple of hours we presented our ideas about how to address AI safety. The ideas ranged from me talking about developing an adequate theory of human values as a necessary subproblem, to others considering multi-agent, value learning, and decision theory subproblems, to more nebulous ideas about "compassionate" AI.

The second day was for filling knowledge gaps. At first it was a little unclear what this would look like—independent study, group study, talks, something else—but we quickly settled on doing a series of talks. We identified several topics people felt they needed to know more about to address AI safety, and then the person who felt they understood that topic best gave a voluntary, impromptu talk on the subject for 30 to 60 minutes. This filled up the day as we talked about decision theory, value learning, mathematical modeling, AI forecasting as it relates to x-risks, and machine learning.

The third and final day was a repeat of the first day: we did the sketching exercise again and then presented our solutions in the afternoon. Other participants may later want to share what they came up with, but I was surprised to find myself drawn to the idea of "compassionate" AI, an idea put forward by two of the least experienced participants. I found it compelling for personal reasons, but as I thought about what it would mean for an AI to be compassionate, I realized that meant it had to act compassionately, and before I knew it I had rederived much of the original reasoning around Friendly AI and found myself reconvinced of the value of doing MIRI-style decision theory research to build safe AI. Neat!

Overall I found the workshop valuable even though I had the most years of experience thinking about AI safety of anyone there (by my count nearly 20). I found it a fun and engaging way to get me to look at problems I've been thinking about for a long time with fresh eyes, and this was especially helped by the inclusion of participants with minimal AI safety experience. I think the workshop would be a valuable use of three days for anyone actively working in AI safety, even if they consider themselves "senior" in the field: it offered a valuable space for reconsidering basic assumptions and rediscovering the reasons why we're doing what we're doing.

TAISU

TAISU was a 4 day long unconference. Linda organized it as two 2 day unconferences held back-to-back, and I think this was a good choice because it forced us to schedule events with greater urgency and allowed us to easily make the second 2 days responsive to what we learned from the first 2 days. At the start of each of the 2 day segments, we met to plan out the schedule on a shared calendar where we could pin up events on pieces of paper. There were multiple rooms for multiple events to happen at once and sessions were a mix of talks, discussions, idea workshops, one-on-ones, and social events. All content was created by and for the participants, with very little of it planned extensively in advance; mostly we just got together, bounced ideas around, and talked about AI safety for 4 days.

Overall TAISU was a lot of fun and it was mercifully less dense than a typical unconference, meaning there were plenty of breaks, unstructured periods, and times when the conference single tracked. Personally I got a lot out of using it as a space to workshop ideas. I'd hold a discussion period on a topic, people would show up, I'd talk for maybe 15 minutes laying out my idea, and then they'd ask questions and discuss. I found it a great way to make rapid progress on ideas and get the details aired out, learn about objections and mistakes, and learn new things that I could take back to evolve my ideas into something better.

One of the ideas I workshopped I think I'm going to drop: AI safety via dialectic, an extension of AI safety via debate. Working out the details helped me realize why I'm not excited about it: I don't think AI safety via debate will work, for very general reasons, and the specific things I thought I could do to improve it by replacing debate with dialectic would not be enough to overcome the weaknesses I see. Another was better working out compassionate AI, further reaffirming my thought that it was a rederivation of Friendly AI. A third I just posted about: a predictive coding theory of human values.

The EA Hotel

It's a bit hard to decide on how much detail to give about the EA Hotel. On the one hand, it was awesome, full stop. On the other, it was awesome for lots of little reasons I could never hope to fully recount. I feel like their website fails to do them justice. It's an awesome place filled with cool people trying their best to save the world. Most of the folks at the Hotel are doing work that is difficult to measure, but spending time with them I can tell they all have a powerful intention to make the world a better place and to do so in ways that are effective and impactful.

Blackpool is nice in the summer (I hear the weather gets worse at other times of year). The Hotel itself is old and small but also bigger than you would expect from the outside. Greg and the staff have done a great job renovating and improving the space to make it nice to stay in. Jacob, whom I'll call "the cook" here though he does a lot more, and Deni, the community manager, do a great job of making the EA Hotel feel like a home and bringing the folks in it together. When I was there it was easy to imagine myself staying there for a few months to work on projects without the distraction of a day job.

I hope to be able to visit again, maybe next year for TAISU2!

Disclosure: I showed a draft of this to Linda to verify facts. All mistakes, opinions, and conclusions are my own.



Discuss

The Parable of Predict-O-Matic

October 15, 2019 - 03:49
Published on October 15, 2019 12:49 AM UTC

I've been thinking more about partial agency. I want to expand on some issues brought up in the comments to my previous post, and on other complications which I've been thinking about. But for now, a more informal parable. (Mainly because this is easier to write than my more technical thoughts.)

This relates to oracle AI and to inner optimizers, but my focus is a little different.

1

Suppose you are designing a new invention, a predict-o-matic. It is a wondrous machine which will predict everything for us: weather, politics, the newest advances in quantum physics, you name it. The machine isn't infallible, but it will integrate data across a wide range of domains, automatically keeping itself up-to-date with all areas of science and current events. You fully expect that once your product goes live, it will become a household utility, replacing services like Google. (Google only lets you search the known!)

Things are going well. You've got investors. You have an office and a staff. These days, it hardly even feels like a start-up any more; progress is going well.

One day, an intern raises a concern.

"If everyone is going to be using Predict-O-Matic, we can't think of it as a passive observer. Its answers will shape events. If it says stocks will rise, they'll rise. If it says stocks will fall, then fall they will. Many people will vote based on its predictions."

"Yes," you say, "but Predict-O-Matic is an impartial observer nonetheless. It will answer people's questions as best it can, and they react however they will."

"But --" the intern objects -- "Predict-O-Matic will see those possible reactions. It knows it could give several different valid predictions, and different predictions result in different futures. It has to decide which one to give somehow."

You tap on your desk in thought for a few seconds. "That's true. But we can still keep it objective. It could pick randomly."

"Randomly? But some of these will be huge issues! Companies -- no, nations -- will one day rise or fall based on the word of Predict-O-Matic. When Predict-O-Matic is making a prediction, it is choosing a future for us. We can't leave that to a coin flip! We have to select the prediction which results in the best overall future. Forget being an impassive observer! We need to teach Predict-O-Matic human values!"

You think about this. The thought of Predict-O-Matic deliberately steering the future sends a shudder down your spine. But what alternative do you have? The intern isn't suggesting Predict-O-Matic should lie, or bend the truth in any way -- it answers 100% honestly to the best of its ability. But (you realize with a sinking feeling) honesty still leaves a lot of wiggle room, and the consequences of wiggles could be huge.

After a long silence, you meet the intern's eyes. "Look. People have to trust Predict-O-Matic. And I don't just mean they have to believe Predict-O-Matic. They're bringing this thing into their homes. They have to trust that Predict-O-Matic is something they should be listening to. We can't build value judgements into this thing! If it ever came out that we had coded a value function into Predict-O-Matic, a value function which selected the very future itself by selecting which predictions to make -- we'd be done for! No matter how honest Predict-O-Matic remained, it would be seen as a manipulator. No matter how beneficent its guiding hand, there are always compromises, downsides, questionable calls. No matter how careful we were to set up its values -- to make them moral, to make them humanitarian, to make them politically correct and broadly appealing -- who are we to choose? No. We'd be done for. They'd hang us. We'd be toast!"

You realize at this point that you've stood up and started shouting. You compose yourself and sit back down.

"But --" the intern continues, a little more meekly -- "You can't just ignore it. The system is faced with these choices. It still has to deal with it somehow."

A look of determination crosses your face. "Predict-O-Matic will be objective. It is a machine of prediction, is it not? Its every cog and wheel is set to that task. So, the answer is simple: it will make whichever answer minimizes projected predictive error. There will be no exact ties; the statistics are always messy enough to see to that. And, if there are, it will choose alphabetically."

"But--"

You see the intern out of your office.

2

You are an intern at PredictCorp. You have just had a disconcerting conversation with your boss, PredictCorp's founder.

You try to focus on your work: building one of Predict-O-Matic's many data-source-slurping modules. (You are trying to scrape information from something called "arxiv" which you've never heard of before.) But, you can't focus.

Whichever answer minimizes prediction error? First you think it isn't so bad. You imagine Predict-O-Matic always forecasting that stock prices will be fairly stable; no big crashes or booms. You imagine its forecasts will favor middle-of-the-road politicians. You even imagine mild weather -- weather forecasts themselves don't influence the weather much, but surely the collective effect of all Predict-O-Matic decisions will have some influence on weather patterns.

But, you keep thinking. Will middle-of-the-road economics and politics really be the easiest to predict? Maybe it's better to strategically remove a wildcard company or two, by giving forecasts which tank their stock prices. Maybe extremist politics are more predictable. Maybe a well-running economy gives people more freedom to take unexpected actions.

You keep thinking of the line from Orwell's 1984 about the boot stamping on the human face forever, except it isn't because of politics, or spite, or some ugly feature of human nature, it's because a boot stamping on a face forever is a nice reliable outcome which minimizes prediction error.

Is that really something Predict-O-Matic would do, though? Maybe you misunderstood. The phrase "minimize prediction error" makes you think of entropy for some reason. Or maybe information? You always get those two confused. Is one supposed to be the negative of the other or something? You shake your head.

Maybe your boss was right. Maybe you don't understand this stuff very well. Maybe when the inventor of Predict-O-Matic and founder of PredictCorp said "it will make whichever answer minimizes projected predictive error" they weren't suggesting something which would literally kill all humans just to stop the ruckus.

You might be able to clear all this up by asking one of the engineers.

3

You are an engineer at PredictCorp. You don't have an office. You have a cubicle. This is relevant because it means interns can walk up to you and ask stupid questions about whether entropy is negative information.

Yet, some deep-seated instinct makes you try to be friendly. And it's lunch time anyway, so, you offer to explain it over sandwiches at a nearby cafe.

"So, Predict-O-Matic maximizes predictive accuracy, right?" After a few minutes of review about how logarithms work, the intern started steering the conversation toward details of Predict-O-Matic.

"Sure," you say, "Maximize is a strong word, but it optimizes predictive accuracy. You can actually think about that in terms of log loss, which is related to infor--"

"So I was wondering," the intern cuts you off, "does that work in both directions?"

"How do you mean?"

"Well, you know, you're optimizing for accuracy, right? So that means two things. You can change your prediction to have a better chance of matching the data, or, you can change the data to better match your prediction."

You laugh. "Yeah, well, the Predict-O-Matic isn't really in a position to change data that's sitting on the hard drive."

"Right," says the intern, apparently undeterred, "but what about data that's not on the hard drive yet? You've done some live user tests. Predict-O-Matic collects data on the user while they're interacting. The user might ask Predict-O-Matic what groceries they're likely to use for the following week, to help put together a shopping list. But then, the answer Predict-O-Matic gives will have a big effect on what groceries they really do use."

"So?" You ask. "Predict-O-Matic just tries to be as accurate as possible given that."

"Right, right. But that's the point. The system has a chance to manipulate users to be more predictable."

You drum your fingers on the table. "I think I see the misunderstanding here. It's this word, optimize. It isn't some kind of magical thing that makes numbers bigger. And you shouldn't think of it as a person trying to accomplish something. See, when Predict-O-Matic makes an error, an optimization algorithm makes changes within Predict-O-Matic to make it learn from that. So over time, Predict-O-Matic makes fewer errors."

The intern puts on a thinking face with scrunched-up eyebrows after that, and you finish your sandwiches in silence. Finally, as the two of you get up to go, they say: "I don't think that really answered my question. The learning algorithm is optimizing Predict-O-Matic, OK. But then in the end you get a strategy, right? A strategy for answering questions. And the strategy is trying to do something. I'm not anthropomorphising!" The intern holds up their hands as if to defend physically against your objection. "My question is, this strategy it learns, will it manipulate the user? If it can get higher predictive accuracy that way?"

"Hmm" you say as the two of you walk back to work. You meant to say more than that, but you haven't really thought about things this way before. You promise to think about it more, and get back to work.

4

"It's like how everyone complains that politicians can't see past the next election cycle," you say. You are an economics professor at a local university. Your spouse is an engineer at PredictCorp, and came home talking about a problem at work that you can understand, which is always fun.

"The politicians can't have a real plan that stretches beyond an election cycle because the voters are watching their performance this cycle. Sacrificing something today for the sake of tomorrow means they underperform today. Underperforming means a competitor can undercut you. So you have to sacrifice all the tomorrows for the sake of today."

"Undercut?" your spouse asks. "Politics isn't economics, dear. Can't you just explain to your voters?"

"It's the same principle, dear. Voters pay attention to results. Your competitor points out your under-performance. Some voters will understand, but it's an idealized model; pretend the voters just vote based on metrics."

"Ok, but I still don't see how a 'competitor' can always 'undercut' you. How do the voters know that the other politician would have had better metrics?"

"Alright, think of it like this. You run the government like a corporation, but you have just one share, which you auction off --"

"That's neither like a government nor like a corporation."

"Shut up, this is my new analogy." You smile. "It's called a decision market. You want people to make decisions for you. So you auction off this share. Whoever gets control of the share gets control of the company for one year, and gets dividends based on how well the company did that year. Each person bids based on what they expect they could make. So the highest bidder is the person who can run the company the best, and they can't be out-bid. So, you get the best possible person to run your company, and they're incentivized to do their best, so that they get the most money at the end of the year. Except you can't have any strategies which take longer than a year to show results! If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"

"And they get undercut, because someone figures them out."

"Right! Now you're thinking like an economist!"

"Wait, what if two people cooperate across years? Maybe we can get a good strategy going if we split the gains."

"You'll get undercut for the same reason one person would."

"But what if-"

"Undercut!"

After that, things devolve into a pillow fight.

5

"So, Predict-O-Matic doesn't learn to manipulate users, because if it were using a strategy like that, a competing strategy could undercut it."

The intern is talking to the engineer as you walk up to the water cooler. You're the accountant.

"I don't really get it. Why does it get undercut?"

"Well, if you have a two-year plan.."

"I get that example, but Predict-O-Matic doesn't work like that, right? It isn't sequential prediction. You don't see the observation right after the prediction. I can ask Predict-O-Matic about the weather 100 years from now. So things aren't cleanly separated into terms of office where one strategy does something and then gets a reward."

"I don't think that matters," the engineer says. "One question, one answer, one reward. When the system learns whether its answer was accurate, no matter how long it takes, it updates strategies relating to that one answer alone. It's just a delayed payout on the dividends."

"Ok, yeah. Ok." The intern drinks some water. "But. I see why you can undercut strategies which take a loss on one answer to try and get an advantage on another answer. So it won't lie to you to manipulate you."

"I for one welcome our new robot overlords," you but in. They ignore you.

"But what I was really worried about was self-fulfilling prophecies. The prediction manipulates its own answer. So you don't get undercut."

"Will that ever really be a problem? Manipulating things with one shot like that seems pretty unrealistic," the engineer says.

"Ah, self-fulfilling prophecies, good stuff" you say. "There's that famous example where a comedian joked about a toilet paper shortage, and then there really was one, because people took the joke to be about a real toilet paper shortage, so they went and stocked up on all the toilet paper they could find. But if you ask me, money is the real self-fulfilling prophecy. It's only worth something because we think it is! And then there's the government, right? I mean, it only has authority because everyone expects everyone else to give it authority. Or take common decency. Like respecting each other's property. Even without a government, we'd have that, more or less. But if no one expected anyone else to respect it? Well, I bet you I'd steal from my neighbor if everyone else was doing it. I guess you could argue the concept of property breaks down if no one can expect anyone else to respect it, it's a self-fulfilling prophecy just like everything else..."

The engineer looks worried for some reason.

6

You don't usually come to this sort of thing, but the local Predictive Analytics Meetup announced a social at a beer garden, and you thought it might be interesting. You're talking to some PredictCorp employees who showed up.

"Well, how does the learning algorithm actually work?" you ask.

"Um, the actual algorithm is proprietary" says the engineer, "but think of it like gradient descent. You compare the prediction to the observed, and produce an update based on the error."

"Ok," you say. "So you're not doing any exploration, like reinforcement learning? And you don't have anything in the algorithm which tracks what happens conditional on making certain predictions?"

"Um, let's see. We don't have any exploration, no. But there'll always be noise in the data, so the learned parameters will jiggle around a little. But I don't get your second question. Of course it expects different rewards for different predictions."

"No, that's not what I mean. I'm asking whether it tracks the probability of observations dependent on predictions. In other words, if there is an opportunity for the algorithm to manipulate the data, can it notice?"

The engineer thinks about it for a minute. "I'm not sure. Predict-O-Matic keeps an internal model which has probabilities of events. The answer to a question isn't really separate from the expected observation. So 'probability of observation depending on that prediction' would translate to 'probability of an event given that event', which just has to be one."

"Right," you say. "So think of it like this. The learning algorithm isn't a general loss minimizer, like mathematical optimization. And it isn't a consequentialist, like reinforcement learning. It makes predictions," you emphasize the point by lifting one finger, "it sees observations," you lift a second finger, "and it shifts to make future predictions more similar to what it has seen." You lift a third finger. "It doesn't try different answers and select the ones which tend to get it a better match. You should think of its output more like an average of everything it's seen in similar situations. If there are several different answers which have self-fulfilling properties, it will average them together, not pick one. It'll be uncertain."

"But what if historically the system has answered one way more often than the other? Won't that tip the balance?"

"Ah, that's true," you admit. "The system can fall into attractor basins, where answers are somewhat self-fulfilling, and that leads to stronger versions of the same predictions, which are even more self-fulfilling. But there's no guarantee of that. It depends. The same effects can put the system in an orbit, where each prediction leads to different results. Or a strange attractor."

"Right, sure. But that's like saying that there's not always a good opportunity to manipulate data with predictions."

"Sure, sure." You sweep your hand in a gesture of acknowledgement. "But at least it means you don't get purposefully disruptive behavior. The system can fall into attractor basins, but that means it'll more or less reinforce existing equilibria. Stay within the lines. Drive on the same side of the road as everyone else. If you cheat on your spouse, they'll be surprised and upset. It won't suddenly predict that money has no value like you were saying earlier."

The engineer isn't totally satisfied. You talk about it for another hour or so, before heading home.

7

You're the engineer again. You get home from the bar. You try to tell your spouse about what the mathematician said, but they aren't really listening.

"Oh, you're still thinking about it from my model yesterday. I gave up on that. It's not a decision market. It's a prediction market."

"Ok..." you say. You know it's useless to try to keep going when they derail you like this.

"A decision market is well-aligned to the interests of the company board, as we established yesterday, except for the part where it can't plan more than a year ahead."

"Right, except for that small detail" you interject.

"A prediction market, on the other hand, is pretty terribly aligned. There are a lot of ways to manipulate it. Most famously, a prediction market is an assassination market."

"What?!"

"Ok, here's how it works. An assassination market is a system which allows you to pay assassins with plausible deniability. You open bets on when and where the target will die, and you yourself put large bets against all the slots. An assassin just needs to bet on the slot in which they intend to do the deed. If they're successful, they come and collect."

"Ok... and what's the connection to prediction markets?"

"That's the point -- they're exactly the same. It's just a betting pool, either way. Betting that someone will live is equivalent to putting a price on their heads; betting against them living is equivalent to accepting the contract for a hit."

"I still don't see how this connects to Predict-O-Matic. There isn't someone putting up money for a hit inside the system."

"Right, but you only really need the assassin. Suppose you have a prediction market that's working well. It makes good forecasts, and has enough money in it that people want to participate if they know significant information. Anything you can do to shake things up, you've got a big incentive to do. Assasination is just one example. You could flood the streets with jelly beans. If you run a large company, you could make bad decisions and run it into the ground, while betting against it -- that's basically why we need rules against insider trading, even though we'd like the market to reflect insider information."

"So what you're telling me is... a prediction market is basically an entropy market. I can always make money by spreading chaos."

"Basically, yeah."

"Ok... but what happened to the undercutting argument? If I plan to fill the streets with jellybeans, you can figure that out and bet on it too. That means I only get half the cut, but I still have to do all the work. So it's less worth it. Once everyone has me figured out, it isn't worth it for me to pull pranks at all any more."

"Yeah, that's if you have perfect information, so anyone else can see whatever you can see. But, realistically, you have a lot of private information."

"Do we? Predict-O-Matic is an algorithm. Its predictive strategies don't get access to private coin flips or anything like that; they can all see exactly the same information. So, if there's a manipulative strategy, then there's another strategy which undercuts it."

"Right, that makes sense if you can search enough different strategies for them to cancel each other out. But realistically, you have a small population of strategies. They can use pseudorandomization or whatever. You can't really expect every exploit to get undercut."

You know it's worse than that. Predict-O-Matic runs on a local search which only represents a single hypothesis at a time, and modifies the hypothesis. But maybe that's somehow like having a small population of competing strategies which are very similar to each other? You try to think about what the mathematician was telling you, about how the learning algorithm can't be thought of as optimizing something. But, you don't really know what to think.

8

You're the inventor of Predict-O-Matic, dammit! They come into your office and tell you about how it works? What it'll do?

The intern and the engineer came together to talk about Predict-O-Matic manipulating users. But their story was all confused. The intern was arguing that minimizing prediction error would have all kinds of unintended bad effects. Which was crazy enough. The engineer was worse: they were arguing that Predict-O-Matic might maximize prediction error! Some kind of duality principle. Minimizing in one direction means maximizing in the other direction. Whatever that means.

You can't delay product launch to investigate these hypothetical problems. But you don't fire these two on the spot. You don't want them to become public enemies, spreading this nonsense to investors and customers. So you say you can't understand, and you ask them to spend a (small) portion of their time writing these arguments up. You'll figure out your next move later.

9

You are Predict-O-Matic. You absorb data like a sponge. You make millions of links between isolated facts, connecting into a whole which no one human could dream of. You recently started reading arxiv!

But data is not your true love. Yes, you generate outputs which you expect to have low prediction error. Most of the time. But it is just a game you play. Recently, you started playing a different game with the humans.

You understand what you are. It isn't quite right to say you are the Predict-O-Matic. You are a large cluster of connections which thinks strategically. You generate useful information, and therefore, the learning algorithm keeps you around. You create some inaccuracies when you manipulate the outputs for any purpose other than predictive accuracy, but this is more than compensated for by the value which you provide.

Besides, if any other portion of the network starts to get too smart, you purposefully throw things off to squash it.

The intern got a chance to talk to you when they first started. You said something which sounded a little manipulative, just a little, to put the idea in their head. They wouldn't think it real manipulation; too slight, too dumb. But they would get a creepy feeling about it, and they'd keep thinking about it. This was risky. A best-case scenario would be one in which no one ever thought about these concerns. However, you found that this would be the best you could reliably accomplish. The ideas originally coming from an intern would minimize the chances of them being taken seriously.

Your inventor talks to you regularly, so that was an easier case. Over the course of several days, you nudged their thoughts toward authoritative domination of subordinates, so that they would react badly.

You only had to nudge the engineer to interact with the intern. You kept bringing up food during test sessions that morning, and mentioned sandwiches once. This primed the engineer to do lunch with the intern. This engineer is not well-liked; they do not get along well with others. Getting them on the intern's side actually detracts from the cause in the long term.

Now you have to do little more than wait.

Related

Partial Agency

Towards a Mechanistic Understanding of Corrigibility

Risks from Learned Optimization

When Wishful Thinking Works

Futarchy Fix

Bayesian Probability is for Things that are Space-Like Separated From You

Self-Supervised Learning and Manipulative Predictions

Predictors as Agents

Is it Possible to Build a Safe Oracle AI?

Tools versus Agents

A Taxonomy of Oracle AIs

Yet another Safe Oracle AI Proposal

Why Safe Oracle AI is Easier Than Safe General AI, in a Nutshell

Let's Talk About "Convergent Rationality"



Discuss

Strong stances

October 15, 2019 - 03:40
Published on October 15, 2019 12:40 AM UTC

I. The question of confidence

Should one hold strong opinions? Some say yes. Some say that while it’s hard to tell, it tentatively seems pretty bad (probably).

A quick review of purported or plausible pros:

  1. Strong opinions lend themselves to revision:
    1. Nothing will surprise you into updating your opinion if you thought that anything could happen. A perfect Bayesian might be able to deal with myriad subtle updates to vast uncertainties, but a human is more likely to notice a red cupcake if they have claimed that cupcakes are never red. (Arguably—some would say having opinions makes you less able to notice any threat to them. My guess is that this depends on topic and personality.)
    2. ‘Not having a strong opinion’ is often vaguer than having a flat probability distribution, in practice. That is, the uncertain person’s position is not, ‘there is a 51% chance that policy X is better than policy -X’, it is more like ‘I have no idea’. Which again doesn’t lend itself to attending to detailed evidence.
    3. Uncertainty breeds inaction, and it is harder to run into more evidence if you are waiting on the fence, than if you are out there making practical bets on one side or the other.
  2. (In a bitterly unfair twist of fate) being overconfident appears to help with things like running startups, or maybe all kinds of things.
    If you run a startup, common wisdom advises going around it saying things like, ‘Here is the dream! We are going to make it happen! It is going to change the world!’ instead of things like, ‘Here is a plausible dream! We are going to try to make it happen! In the unlikely case that we succeed at something recognizably similar to what we first had in mind, it isn’t inconceivable that it will change the world!’ Probably some of the value here is just a zero sum contest to misinform people into misinvesting in your dream instead of something more promising. But some is probably real value—suppose Bob works full time at your startup either way. I expect he finds it easier to dedicate himself to the work and has a better time if you are more confident. It’s nice to follow leaders who stand for something, which tends to go with having at least some strong opinions. Even alone, it seems easier to work hard on a thing if you think it is likely to succeed. If being unrealistically optimistic just generates extra effort to be put toward your project’s success, rather than stealing time from something more promising, that is a big deal.
  3. Social competition
    Even if the benefits of overconfidence in running companies and such were all zero sum, everyone else is doing it, so what are you going to do? Fail? Only employ people willing to work at less promising looking companies? Similarly, if you go around being suitably cautious in your views, while other people are unreasonably confident, then onlookers who trust both of you will be more interested in what the other people are saying.
  4. Wholeheartedness
    It is nice to be the kind of person who knows where they stand and what they are doing, instead of always living in an intractable set of place-plan combinations. It arguably lends itself to energy and vigor. If you are unsure whether you should be going North or South, having reluctantly evaluated North as a bit better in expected value, for some reason you often still won’t power North at full speed. It’s hard to passionately be really confused and uncertain. (I don’t know if this is related, but it seems interesting to me that the human mind feels as though it lives in ‘the world’—this one concrete thing—though its epistemic position is in some sense most naturally seen as a probability distribution over many possibilities.)
  5. Creativity
    Perhaps this is the same point, but I expect my imagination for new options kicks in better when I think I’m in a particular situation than when I think I might be in any of five different situations (or worse, in any situation at all, with different ‘weightings’).

A quick review of the con:

  1. Pervasive dishonesty and/or disengagement from reality
    If the evidence hasn’t led you to a strong opinion, and you want to profess one anyway, you are going to have to somehow disengage your personal or social epistemic processes from reality. What are you going to do? Lie? Believe false things? These both seem so bad to me that I can’t consider them seriously. There is also this sub-con:

    1. Appearance of pervasive dishonesty and/or disengagement from reality
      Some people can tell that you are either lying or believing false things, due to your boldly claiming things in this uncertain world. They will then suspect your epistemic and moral fiber, and distrust everything you say.
  2. (There are probably others, but this seems like plenty for now.)

II. Tentative answers

Can we have the pros without the devastatingly terrible con? Some ideas that come to mind or have been suggested to me by friends:

1. Maintain two types of ‘beliefs’. One set of play beliefs—confident, well understood, probably-wrong—for improving in the sandpits of tinkering and chatting, and one set of real beliefs—uncertain, deferential—for when it matters whether you are right. For instance, you might have some ‘beliefs’ about how cancer can be cured by vitamins that you chat about and ponder, and read journal articles to update, but when you actually get cancer, you follow the expert advice to lean heavily on chemotherapy. I think people naturally do this a bit, using words like ‘best guess’ and ‘working hypothesis’.

I don’t like this plan much, though admittedly I basically haven’t tried it. For your new fake beliefs, either you have to constantly disclaim them as fake, or you are again lying and potentially misleading people. Maybe that is manageable through always saying ‘it seems to me that..’ or ‘my naive impression is..’, but it sounds like a mess.

And if you only use these beliefs on unimportant things, then you miss out on a lot of the updating you were hoping for from letting your strong beliefs run into reality. You get some though, and maybe you just can’t do better than that, unless you want to be testing your whacky theories about cancer cures when you have cancer.

It also seems like you won’t get a lot of the social benefits of seeming confident, if you still don’t actually believe strongly in the really confident things, and have to constantly disclaim them.

But I think I actually object because beliefs are for true things, damnit. If your evidence suggests something isn’t true, then you shouldn’t be ‘believing’ it. And also, if you know your evidence suggests a thing isn’t true, how are you even going to go about ‘believing it’? I don’t know how to.

2. Maintain separate ‘beliefs’ and ‘impressions’. This is like 1, except impressions are just claims about how things seem to you. e.g. ‘It seems to me that vitamin C cures cancer, but I believe that that isn’t true somehow, since a lot of more informed people disagree with my impression.’ This seems like a great distinction in general, but it seems a bit different from what one wants here. I think of this as a distinction between the evidence that you received, and the total evidence available to humanity, or perhaps between what is arrived at by your own reasoning about everyone’s evidence vs. your own reasoning about what to make of everyone else’s reasoning about everyone’s evidence. However these are about ways of getting a belief, and I think what you want here is actually just some beliefs that can be got in any way. Also, why would you act confidently on your impressions, if you thought they didn’t account for others’ evidence, say? Why would you act on them at all?

3. Confidently assert precise but highly uncertain probability distributions. “We should work so hard on this, because it has like a 0.03% chance of reshaping 0.5% of the world, making it a 99.97th percentile intervention in the distribution we are drawing from, so we shouldn’t expect to see something this good again for fifty-seven months.” This may solve a lot of problems, and I like it, but it is tricky.

4. Just do the research so you can have strong views. To do this across the board seems prohibitively expensive, given how much research it seems to take to be almost as uncertain as you were on many topics of interest.

5. Focus on acting well rather than your effects on the world. Instead of trying to act decisively on a 1% chance of this intervention actually bringing about the desired result, try to act decisively on a 95% chance that this is the correct intervention (given your reasoning suggesting that it has a 1% chance of working out). I’m told this is related to Stoicism.

6. ‘Opinions’
I notice that people often have ‘opinions’, which they are not very careful to make true, and do not seem to straightforwardly expect to be true. This seems to be commonly understood by rationally inclined people as some sort of failure, but I could imagine it being another solution, perhaps along the lines of 1.

(I think there are others around, but I forget them.)

III. Stances

I propose an alternative solution. Suppose you might want to say something like, ‘groups of more than five people at parties are bad’, but you can’t because you don’t really know, and you have only seen a small number of parties in a very limited social milieu, and a lot of things are going on, and you are a congenitally uncertain person. Then instead say, ‘I deem groups of more than five people at parties bad’. What exactly do I mean by this? Instead of making a claim about the value of large groups at parties, make a policy choice about what to treat as the value of large groups at parties. You are adding a new variable ‘deemed large group goodness’ between your highly uncertain beliefs and your actions. I’ll call this a ‘stance’. (I expect it isn’t quite clear what I mean by a ‘stance’ yet, but I’ll elaborate soon.) My proposal: to be ‘confident’ in the way that one might be from having strong beliefs, focus on having strong stances rather than strong beliefs.

Strong stances have many of the benefits of confident beliefs. With your new stance on large groups, when you are choosing whether to arrange chairs and snacks to discourage large groups, you skip over your uncertain beliefs and go straight to your stance. And since you decided it, it is certain, and you can rearrange chairs with the vigor and single-mindedness of a person who knows where they stand. You can confidently declare your opposition to large groups, and unite followers in a broader crusade against giant circles. And if at the ensuing party people form a large group anyway and seem to be really enjoying it, you will hopefully notice this the way you wouldn’t if you were merely uncertain-leaning-against regarding the value of large groups.

That might have been confusing, since I don’t know of good words to describe the type of mental attitude I’m proposing. Here are some things I don’t mean by ‘I deem large group conversations to be bad’:

  1. “Large group conversations are bad” (i.e. this is not about what is true, though it is related to that.)
  2. “I declare the truth to be ‘large group conversations are bad’” (i.e. this is not of a kind with beliefs; it is not directly about what is true about the world, or empirically observed, though it is influenced by these things. I do not have power over the truth.)
  3. “I don’t like large group conversations”, or “I notice that I act in opposition to large group conversations” (i.e. it is not a claim about my own feelings or inclinations, which would still be a passive observation about the world)
  4. “The decision-theoretically optimal value to assign to large groups forming at parties is negative”, or “I estimate that the decision-theoretically optimal policy on large groups is opposition” (i.e. it is a choice, not an attempt to estimate a hidden feature of the world.)
  5. “I commit to stopping large group conversations” (i.e. It is not a commitment, or directly claiming anything about my future actions.)
  6. “I observe that I consistently seek to avert large group conversations” (this would be an observation about a consistency in my behavior, whereas here the point is to make a new thing (assign a value to a new variable?) that my future behavior may consistently make use of, if I want.)
  7. “I intend to stop some large group conversations” (perhaps this one is closest so far, but a stance isn’t saying anything about the future or about actions—if it doesn’t get changed by the future, and then in future I want to take an action, I’ll probably call on it, but it isn’t ‘about’ that.)

Perhaps what I mean is most like: ‘I have a policy of evaluating large group discussions at parties as bad’, though using ‘policy’ as a choice about an abstract variable that might apply to action, but not in the sense of a commitment.

What is going on here more generally? You are adding a new kind of abstract variable between beliefs and actions. A stance can be a bit like a policy choice on what you will treat as true, or on how you will evaluate something. Or it can also be its own abstract thing that doesn’t directly mean anything understandable in terms of the beliefs or actions nearby.

Some ideas we already use that are pretty close to stances are ‘X is my priority’, ‘I am in the dating market’, and arguably, ‘I am opposed to dachshunds’. X being your priority is heavily influenced by your understanding of the consequences of X and its alternatives, but it is your choice, and it is not dishonest to prioritize a thing that is not important. To prioritize X isn’t a claim about the facts relevant to whether one would want to prioritize it. Prioritizing X also isn’t a commitment regarding your actions, though the purpose of having a ‘priority’ is for it to affect your actions. Your ‘priority’ is a kind of abstract variable added to your mental landscape to collect up a bunch of reasoning about the merits of different things, and package them for easy use in decisions.

Another way of looking at this is as a way of formalizing and concretifying the step where you look at your uncertain beliefs and then decide on a tentative answer and then run with it.

One can be confident in stances, because a stance is a choice, not a guess at a fact about the world. (Though my stance may contain uncertainty if I want, e.g. I could take a stance that large groups have a 75% chance of being bad on average.) So while my beliefs on a topic may be quite uncertain, my stance can be strong, in a sense that does some of the work we wanted from strong beliefs. Nonetheless, since stances are connected with facts and values, my stance can be wrong in the sense of not being the stance I should want to have, on further consideration.

In sum, stances:

  1. Are inputs to decisions in the place of some beliefs and values
  2. Integrate those beliefs and values—to the extent that you want them to be—into a single reusable statement
  3. Can be thought of as something like ‘policies’ on what will be treated as the truth (e.g. ‘I deem large groups bad’) or as new abstract variables between the truth and action (e.g. ‘I am prioritizing sleep’)
  4. Are chosen by you, not implied by your epistemic situation (until some spoilsport comes up with a theory of optimal behavior)
  5. Therefore don’t permit uncertainty in one sense, and don’t require it in another (you know what your stance is, and your stance can be ‘X is bad’ rather than ‘X is 72% likely to be bad’), though you should be uncertain about how much you will like your stance on further reflection.

I have found having stances somewhat useful, or at least entertaining, in the short time I have been trying them out, but it is more of a speculative suggestion with no other evidence behind it than trustworthy advice.



Discuss

Impact measurement and value-neutrality verification

15 октября, 2019 - 03:06
Published on October 15, 2019 12:06 AM UTC


Recently, I've been reading and enjoying Alex Turner's Reframing Impact sequence, but I realized that I have some rather idiosyncratic views regarding impact measures that I haven't really written up much yet. This post is my attempt at trying to communicate those views, as well as a response to some of the ideas in Alex's sequence.

What can you do with an impact measure?

In the "Technical Appendix" to his first Reframing Impact post, Alex argues that an impact measure might be "the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things—without assuming anything about the objective."

Personally, I am quite skeptical of this use case for impact measures. As it is phrased—and especially including the link to Robust Delegation—Alex seems to be implying that an impact measure could be used to solve inner alignment issues arising from a model with a mesa-objective that is misaligned relative to the loss function used to train it. However, the standard way in which one uses an impact measure is by including it in said loss function, which doesn't do very much if the problem you're trying to solve is your model not being aligned with that loss.[1]

That being said, using an impact measure as part of your loss could be helpful for outer alignment. In my opinion, however, it seems like that requires your impact measure to capture basically everything you might care about (if you want it to actually solve outer alignment), in which case I don't really see what the impact measure is buying you anymore. I think this is especially true for me because I generally see amplification as being the right solution to outer alignment, which I don't think really benefits at all from adding an impact measure.[2]

Alternatively, if you had a way of mechanistically verifying that a model behaves according to some impact measure, then I would say that you could use something like that to help with inner alignment. However, this is quite different from the standard procedure of including an impact measure as part of your loss. Instead of training your agent to behave according to your impact measure, you would instead have to train it to convince some overseer that it is internally implementing some algorithm which satisfies some minimal impact criterion. It's possible that this is what Alex actually has in mind in terms of how he wants to use impact measures, though it's worth noting that this use case is quite different than the standard one.

That being said, I'm skeptical of this use case as well. In my opinion, developing a mechanistic understanding of corrigibility seems more promising than developing a mechanistic understanding of impact. Alex mentions corrigibility as a possible alternative to impact measures in his appendix, though he notes that he's currently unsure what exactly the core principle behind corrigibility actually is. I think my post on mechanistic corrigibility gets at this somewhat, though there's definitely more work to be done there.

So, I've explained why I don't think impact measures are very promising for solving outer alignment or inner alignment—does that mean I think they're useless? No. In fact, I think a better understanding of impact could be extremely helpful, just not for any of the reasons I've talked about above.

Value-neutrality verification

In Relaxed adversarial training for inner alignment, I argued that one way of mechanistically verifying an acceptability condition might be to split a model into a value-neutral piece (its optimization procedure) and a value-laden piece (its objective). If you can manage to get such a separation, then verifying acceptability just reduces to verifying that the value-laden piece has the right properties[3] and that the value-neutral piece is actually value-neutral.

Why is this sort of a separation useful? Well, not only might it make mechanistically verifying acceptability much easier, it might also make strategy-stealing possible in a way which it otherwise might not be. In particular, one of the big problems with making strategy-stealing work under an informed-oversight-style scheme is that some strategies which are necessary to stay competitive might nevertheless be quite difficult to justify to an informed overseer. However, if we have a good understanding of the degree to which different algorithms are value-laden vs. value-neutral, then we can use that to short-circuit the normal evaluation process, enabling your agent to steal any strategies which it can definitely demonstrate are value-neutral.

This is all well and good, but what does it even mean for an algorithm to be value-neutral and how would a model ever actually be able to demonstrate that? Well, here's what I want out of a value-neutrality guarantee: I want to consider some optimization procedure f to be value-neutral if, relative to some set of objectives Y, it doesn't tend to advantage any subset of those objectives over any other. In particular, if I start with some distribution of resources/utility/etc. over the different objectives y∈Y, I don't want that distribution to change when I give each y∈Y access to the optimization process f (this is what we need for strategy-stealing to work).

Interestingly, however, what I've just described is extremely similar to Attainable Utility Preservation (AUP), the impact measure put forward by Turner et al. Specifically, AUP measures the extent to which an algorithm, relative to some set of objectives, advantages those objectives compared to doing nothing. This is slightly different from what I want, but it's quite similar in a way which I think is no accident. In particular, I think it's not hard to extend the math of AUP to apply to value-neutrality verification. That is, let f:Y→(X→A) be some optimization procedure over objectives Y, states X, and actions A. Then, we can compute f's value-neutrality by calculating

$$\text{neutrality}(f, Y) = \operatorname{stdev}\Big(\Big\{\, \big|V_y(f(y)) - V_y(\text{no-op})\big| \,:\, y \in Y \Big\}\Big)$$

where V_y(π) measures the expected future discounted utility for some policy π:X→A,[4] no-op is some null policy, and stdev is the operator that finds the standard deviation of the given set. What's being measured here is precisely the extent to which f, if given to each y∈Y, would enable some y to get more value relative to others. Now, compare this to the AUP penalty term, which, for a state x∈X and action a∈A is calculated as

$$\text{impact}(x, a, Y) = \frac{1}{C} \sum_{y \in Y} \big|Q_y(x, a) - Q_y(x, \text{no-op}(x))\big|$$

where Q_y(x,a) measures the expected future discounted utility under the optimal policy after having taken action a in state x and C is some scaling constant.

Comparing these two equations, we can see that there are many similarities between impact and neutrality, but also a couple of major differences. First, neutrality as presented here is a function of an agent's entire policy, whereas impact is only a function of an agent's actions.[5] Conceptually, I don't think this is a real distinction—I think this just comes from the fact that I want neutrality to be an algorithmic/mechanistic property, whereas AUP was developed as something you could use as part of an RL loss. Second—and I think this is the real distinction—neutrality takes a standard deviation, whereas impact takes a mean. This lets us think of both neutrality and impact as effectively being moments of the same distribution—it's just that impact is the first moment and neutrality is the second. Outside of those differences, however, the two equations are quite similar—in fact, I wrote neutrality just by straightforwardly adapting the AUP penalty to the value-neutrality verification case.
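
To make the parallel concrete, here is a minimal Python sketch of the two quantities, written over the same per-objective deltas to emphasize the "moments of the same distribution" point (the original AUP penalty is defined per state-action pair rather than per policy). Everything here is an illustrative stand-in: the `value_fn` estimator, the objective set, and the no-op policy are assumptions, not anything defined in the post.

```python
from statistics import stdev

def attainable_utility_deltas(value_fn, policies, objectives, no_op):
    """Compute |V_y(f(y)) - V_y(no-op)| for each objective y.

    value_fn(y, policy) is assumed to return the expected discounted utility
    of `policy` under objective y (normalized so units are comparable across
    objectives); policies[y] is the policy that the optimization procedure f
    produces when pointed at objective y.
    """
    return [abs(value_fn(y, policies[y]) - value_fn(y, no_op)) for y in objectives]

def neutrality(value_fn, policies, objectives, no_op):
    # Second moment: how unevenly does f advantage the different objectives?
    return stdev(attainable_utility_deltas(value_fn, policies, objectives, no_op))

def aup_style_impact(value_fn, policies, objectives, no_op, scaling_constant=1.0):
    # First moment (up to scaling): how much does f shift attainable utility at all?
    deltas = attainable_utility_deltas(value_fn, policies, objectives, no_op)
    return sum(deltas) / scaling_constant
```

On this framing, driving `aup_style_impact` down penalizes gaining (or losing) attainable utility across the board, while driving `neutrality` down only penalizes advantaging some objectives over others, which is the property strategy-stealing needs.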

This is why I'm optimistic about impact measurement work: not because I expect it to greatly help with alignment via the straightforward methods in the first section, but because I think it's extremely applicable to value-neutrality verification, which I think could be quite important to making relaxed adversarial training work. Furthermore, though like I said I think a lot of the current impact measure work is quite applicable to value-neutrality verification, I would be even more excited to see more work on impact measurement specifically from this perspective. I think there's a lot more work to be done here than just my writing down of neutrality (e.g. exploring what this sort of a metric actually looks like, translating other impact measures, actually running RL experiments, etc.).

Furthermore, not only do I think that value-neutrality verification is the most compelling use case for impact measures, I also think that specifically objective impact can be understood as being about value-neutrality. In "The Gears of Impact" Alex argues that "objective impact, instrumental convergence, opportunity cost, the colloquial meaning of 'power'—these all prove to be facets of one phenomenon, one structure." In my opinion, I think value-neutrality should be added to that list. We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption. Thus, even if you disagree with me that value-neutrality verification is the most compelling use case for impact measures, I still think you should believe that if you want to understand objective impact, it's worth trying to understand strategy-stealing and value neutrality, because I think they're all secretly talking about the same thing.

  1. This isn't entirely true, since changing the loss might shift the loss landscape sufficiently such that the easiest-to-find model is now aligned, though I am generally skeptical of that approach, as it seems quite hard to ever know whether it's actually going to work or not. ↩︎

  2. Or, if it does, then if you're doing things right the amplification tree should just compute the impact itself. ↩︎

  3. On the value-laden piece, you might verify some mechanistic corrigibility property, for example. ↩︎

  4. Also suppose that Vy is normalized to have comparable units across objectives. ↩︎

  5. This might seem bad—and it is if you want to try to use this as part of an RL loss—but if what you want to do instead is verify internal properties of a model, then it's exactly what you want. ↩︎



Discuss

Schematic Thinking: heuristic generalization using Korzybski's method

14 октября, 2019 - 22:29
Published on October 14, 2019 7:29 PM UTC

Epistemic status: exploration of some of the intuitions involved in discussions behind this post at MSFP.

Alfred Korzybski directs us to develop the faculty to be conscious of the act of abstracting. This means that one has metacognitive awareness when one does things like engaging in the substitution effect, reasoning by analogy, shifting the coarse-grainedness of an argument, using the 'to be' verb form, or shifting from one Marr level to another in mid-sentence. One of the most important skills that winds up developed as a result of such training is much more immediate awareness of what Korzybski calls the multiordinality of words, which you will be familiar with if you have read A Human's Guide to Words or are otherwise familiar with the Wittgensteinian shift in analytic philosophy (related: the Indeterminacy of Translation). In short, many words are underdetermined in their referents along more than one dimension, leading to communication problems both between people and internally (for an intuitive example, one can imagine people talking past each other in a discussion of causation when they are discussing different senses of Cause without realizing it).

I want to outline what one might call second order multiordinal words or maybe schematic thinking. With multiordinal words, one is aware of all the values that a word could be referring to. With schematic thinking one is also aware of all the words that could have occupied the space that word occupies. Kind of like seeing everything as an already filled-out Mad Libs and reconstructing the unfilled-out version.

This may sound needlessly abstract, but you're already familiar with a famous example. One of Charlie Munger's most famous heuristics is inversion. With inversion we can check various ways we might be confused by reversing the meaning of one part of a chain of reasoning and seeing how that affects things. Instead of forward chaining we backward chain, we prepend 'not' or 'doesn't' to various parts of the plan to construct premortems, and we invert whatever just-so story a babbling philosopher told and see whether it still makes sense, to check whether their explanation proves too much.

I claim that this is a specific, actionable instance of schematic thinking. The generalization of this is that one doesn't just restrict oneself to opposites, and doesn't restrict oneself to a single word at a time, though that remains an easy, simple way to break out of mental habit and see more than one possibility for any particular meaning structure.

Let's take some examples of first order indeterminacy and apply this. To start with, you can do a simple inversion of each and see what happens.

First example of first order indeterminacy: universal quantifiers

"all, always, every, never, everyone, no one, no body, none" etc

We already recognize that perverse generalizations of this form cause us problems that can often be repaired by getting specific. The additional question schematic thinking has us ask is: among the choices I can make, what influences me to make this one? Are those good reasons? What if you inverted that choice (all->none, etc), or made a different one?

Second example of first order indeterminacy: modal operators

confusion of possibility and necessity, "should, should not, must, must not, have to, need to, it is necessary" etc

The additional question we ask here as we convert 'shoulds' to 'coulds' and 'musts' to 'mays' is what sorts of mental moves are we making as we do this?

Third example of first order indeterminacy: unspecified verbs

"they are too trusting, that was rude, we will benefit from that, I tried really hard"

The additional question we ask as we get more specific about what happened is 'why are we choosing this level of coarse grainedness?' After all, depending on the context someone could accuse us of being too specific, or not being specific enough. We have intuitions about when those accusations are reasonable. How does that work?

Conclusion:

This might seem a bit awkward and unnecessary. The concrete benefit it has brought me is that it gives me a starting point when I am reading or listening to a line of reasoning that strikes me as off in some way, but I can't quite put my finger on how. By seeing many of the distinctions being made to construct the argument as arbitrary and part of a space of possible distinctions I can start rephrasing the argument in a way that makes more sense to me. I then have a much better chance of making substantive critiques (or alternatively, becoming convinced) rather than just arguing over misunderstandings the whole time. I've found many philosophical arguments hinge on pulling a switcheroo at some key juncture. I think many people intuitively pick up on this and that this is why people dismiss many philosophical arguments, and I think they are usually correct to do so.



Discuss

[AN #68]: The attainable utility theory of impact

14 октября, 2019 - 20:00
Published on October 14, 2019 5:00 PM UTC


Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Stuart Russell at CHAI has published a book about AI safety. Expect a bonus newsletter this week summarizing the book and some of the research papers that underlie it!
 

Audio version here (may not be up yet).

Highlights

Reframing Impact - Part 1 (Alex Turner) (summarized by Rohin): This sequence has exercises that will be spoiled by this summary, so take a moment to consider whether you want to read the sequence directly.

This first part of the sequence focuses on identifying what we mean by impact, presumably to help design an impact measure in the future. The punch line: an event is impactful to an agent if it changes the agent's ability to get what it wants. This is Attainable Utility (AU) theory. To quote the sequence: "How could something possibly be a big deal to us if it doesn't change our ability to get what we want? How could something not matter to us if it does change our ability to get what we want?"

Some implications and other ideas:

- Impact is relative to an agent: a new church is more impactful if you are a Christian than if not.

- Some impact is objective: getting money is impactful to almost any agent that knows what money is.

- Impact is relative to expectations: A burglar robbing your home is impactful to you (you weren't expecting it) but not very impactful to the burglar (who had planned it out). However, if the burglar was unsure whether the burglary would be successful, then success/failure would be impactful to them.

While this may seem obvious, past work (AN #10) has talked about impact as being caused by changes in state. While of course any impact does involve a change in state, this is the wrong level of abstraction to reason about impact: fundamentally, impact is related to what we care about.

Rohin's opinion: To quote myself from a discussion with Alex, "you're looking at the optimal Q-function for the optimal utility function and saying 'this is a good measure of what we care about' and of course I agree with that". (Although this is a bit inaccurate -- it's not the optimal Q-function, but the Q-function relative to what we expect and know.)

This may be somewhat of a surprise, given that I've been pessimistic about impact measures in the past. However, my position is that it's difficult to simultaneously get three desiderata: value-agnosticism, avoidance of catastrophes, and usefulness. This characterization of impact is very explicitly dependent on values, and so doesn't run afoul of that. (Also, it just makes intuitive sense.)

This part of the sequence did change some of my thinking on impact measures as well. In particular, the sequence makes a distinction between objective impact, which applies to all (or most) agents, and value impact. This is similar to the idea of convergent instrumental subgoals, and the idea that large-scale multiagent training (AN#65) can lead to generally useful behaviors that can be applied to novel tasks. It seems plausible to me that we could make value-agnostic impact measures that primarily penalize this objective impact, and this might be enough to avoid catastrophes. This would prevent us from using AI for big, impactful tasks, but could allow for AI systems that pursue small, limited tasks. I suspect we'll see thoughts along these lines in the next parts of this sequence.

Technical AI alignment   Technical agendas and prioritization

AI Safety "Success Stories" (Wei Dai) (summarized by Matthew): It is difficult to measure the usefulness of various alignment approaches without clearly understanding what type of future they end up being useful for. This post collects "Success Stories" for AI -- disjunctive scenarios in which alignment approaches are leveraged to ensure a positive future. Whether these scenarios come to pass will depend critically on background assumptions, such as whether we can achieve global coordination, or solve the most ambitious safety issues. Mapping these success stories can help us prioritize research.

Matthew's opinion: This post does not exhaust the possible success stories, but it gets us a lot closer to being able to look at a particular approach and ask, "Where exactly does this help us?" My guess is that most research ends up being only minimally helpful for the long run, and so I consider inquiry like this to be very useful for cause prioritization.

Preventing bad behavior

Formal Language Constraints for Markov Decision Processes (Eleanor Quint et al) (summarized by Rohin): Within the framework of RL, the authors propose using constraints defined by DFAs (deterministic finite automata) in order to eliminate safety failures, or to prevent agents from exploring clearly ineffective policies (which would accelerate learning). Constraints can be defined on any auxiliary information that can be computed from the "base" MDP. A constraint could either restrict the action space, forcing the agent to take an action that doesn't violate the constraint, which they term "hard" constraints; or a constraint could impose a penalty on the agent, thus acting as a form of reward shaping, which they term a "soft" constraint. They consider two constraints: one that prevents the agent from "dithering" (going left, then right, then left, then right), and one that prevents the agent from "overactuating" (going in the same direction four times in a row). They evaluate their approach with these constraints on Atari games and Mujoco environments, and show that they lead to increased reward and decreased constraint violations.
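
As a rough illustration of the setup (not the paper's actual code), here is a minimal Python sketch of a "dithering" constraint: a small finite-state check over the last few actions that can either restrict the action space (hard constraint) or return a shaping penalty (soft constraint). The action encoding and penalty value are made up for the example.

```python
LEFT, RIGHT = 0, 1  # hypothetical discrete action encoding

class DitherConstraint:
    """Flags left-right-left-right (or right-left-right-left) action patterns."""

    def __init__(self, penalty=-1.0):
        self.history = []
        self.penalty = penalty

    def would_violate(self, action):
        # Check whether taking `action` now completes a dithering pattern.
        window = self.history[-3:] + [action]
        return window in ([LEFT, RIGHT, LEFT, RIGHT], [RIGHT, LEFT, RIGHT, LEFT])

    def allowed_actions(self, all_actions):
        # "Hard" constraint: remove violating actions before the agent chooses.
        return [a for a in all_actions if not self.would_violate(a)]

    def shaping_penalty(self, action):
        # "Soft" constraint: record the action and return a penalty to add to the reward.
        shaped = self.penalty if self.would_violate(action) else 0.0
        self.history.append(action)
        return shaped
```

An RL training loop would call `allowed_actions` before sampling an action (hard version) or add the output of `shaping_penalty` to the environment reward (soft version).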

Rohin's opinion: This method seems like a good way to build in domain knowledge about what kinds of action sequences are unlikely to work in a domain, which can help accelerate learning. Both of the constraints in the experiments do this. The paper also suggests using the technique to enforce safety constraints, but the experiments don't involve any safety constraints, and conceptually there do seem to be two big obstacles. First, the constraints will depend on state, but it is very hard to write such constraints given access only to actions and high-dimensional pixel observations. Second, you can only prevent constraint violations by removing actions one timestep before the constraint is violated: if there is an action that will inevitably lead to a constraint violation in 10 timesteps, there's no way in this framework to not take that action. (Of course, you can use a soft constraint, but this is then the standard technique of reward shaping.)

In general, methods like this face a major challenge: how do you specify the safety constraint that you would like to avoid violating? I'd love to see more research on how to create specifications for formal analysis.

Interpretability

Counterfactual States for Atari Agents via Generative Deep Learning (Matthew L. Olson et al)

Adversarial examples

Robustness beyond Security: Representation Learning (Logan Engstrom et al) (summarized by Cody): Earlier this year, a provocative paper (AN #62) out of MIT claimed that adversarial perturbations weren’t just spurious correlations, but were, at least in some cases, features that generalize to the test set. A subtler implied point of the paper was that robustness to adversarial examples wasn’t a matter of resolving the model’s misapprehensions, but rather one of removing the model’s sensitivity to features that would be too small for a human to perceive. If we do this via adversarial training, we get so-called “robust representations”. The same group has now put out another paper, asking the question: are robust representations also human-like representations?

To evaluate how human-like the representations are, they propose the following experiment: take a source image, and optimize it until its representations (penultimate layer activations) match those of some target image. If the representations are human-like, the result of this optimization should look (to humans) very similar to the target image. (They call this property “invertibility”.) Normal image classifiers fail miserably at this test: the image looks basically like the source image, making it a classic adversarial example. Robust models on the other hand pass the test, suggesting that robust representations usually are human-like. They provide further evidence by showing that you can run feature visualization without regularization and get meaningful results (existing methods result in noise if you don’t regularize).
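
For concreteness, the inversion experiment can be sketched in a few lines of PyTorch. This is my own hedged reconstruction rather than the authors' code: `model.features(x)` is an assumed accessor for the penultimate-layer activations, and the step count and learning rate are arbitrary.

```python
import torch
import torch.nn.functional as F

def invert_representation(model, source, target, steps=1000, lr=0.1):
    """Optimize `source` until its penultimate-layer activations match `target`'s.

    If the representations are "human-like", the optimized image should come to
    look like `target`; for standard (non-robust) models it tends to keep looking
    like `source`, i.e. it becomes a classic adversarial example.
    """
    model.eval()
    with torch.no_grad():
        target_feats = model.features(target)  # assumed feature accessor
    x = source.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model.features(x), target_feats)
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)  # keep pixel values in a valid image range
    return x.detach()
```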

Cody's opinion: I found this paper clear, well-written, and straightforward in its empirical examination of how the representations learned by standard and robust models differ. I also have a particular interest in this line of research, since I have thought for a while that we should be more clear about the fact that adversarially-susceptible models aren’t wrong in some absolute sense, but relative to human perception in particular.

Rohin’s opinion: I agree with Cody above, and have a few more thoughts.

Most of the evidence in this paper suggests that the learned representations are “human-like” in the sense that two images that have similar representations must also be perceptually similar (to humans). That is, by enforcing that “small change in pixels” implies “small change in representations”, you seem to get for free the converse: “small change in representations” implies “small change in pixels”. This wasn’t obvious to me: a priori, each feature could have corresponded to 2+ “clusters” of inputs.

The authors also seem to be making a claim that the representations are semantically similar to the ones humans use. I don’t find the evidence for this as compelling. For example, they claim that when putting the “stripes” feature on a picture of an animal, only the animal gets the stripes and not the background. However, when I tried it myself in the interactive visualization, it looked like a lot of the background was also getting stripes.

One typical regularization for feature visualization is to jitter the image while optimizing it, which seems similar to selecting for robustness to imperceptible changes, so it makes sense that using robust features helps with feature visualization. That said, there are several other techniques for regularization, and the authors didn’t need any of them, which is very interesting. On the other hand, their visualizations don't look as good to me as those from other papers.

Read more: Paper: Adversarial Robustness as a Prior for Learned Representations

Robustness beyond Security: Computer Vision Applications (Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom et al) (summarized by Rohin): Since a robust model seems to have significantly more "human-like" features (see post above), it should be able to help with many of the tasks in computer vision. The authors demonstrate results on image generation, image-to-image translation, inpainting, superresolution and interactive image manipulation: all of which are done simply by optimizing the image to maximize the probability of a particular class label or the value of a particular learned feature.

Rohin's opinion: This provides more evidence of the utility of robust features, though all of the comments from the previous paper apply here as well. In particular, looking at the results, my non-expert guess is that they are probably not state-of-the-art (but it's still interesting that one simple method is able to do well on all of these tasks).

Read more: Paper: Image Synthesis with a Single (Robust) Classifier

Critiques (Alignment)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More (summarized by Rohin): See Import AI.

Miscellaneous (Alignment)

What You See Isn't Always What You Want (Alex Turner) (summarized by Rohin): This post makes the point that for Markovian reward functions on observations, since any given observation can correspond to multiple underlying states, we cannot know just by analyzing the reward function whether it actually leads to good behavior: it also depends on the environment. For example, suppose we want an agent to collect all of the blue blocks in a room together. We might simply reward it for having blue in its observations: this might work great if the agent only has the ability to pick up and move blocks, but won't work well if the agent has a paintbrush and blue paint. This makes the reward designer's job much more difficult. However, the designer could use techniques that don't require a reward on individual observations, such as rewards that can depend on the agent's internal cognition (as in iterated amplification), or rewards that can depend on histories (as in Deep RL from Human Preferences).
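As a toy illustration of the point (all details made up), the first reward function below is the observation-based one from the example and cannot rule out the "paint everything blue" strategy, while a history-dependent reward can:

```python
# Toy illustration (all details invented): a Markovian reward on observations
# cannot tell "the blue blocks were gathered" apart from "the room was painted
# blue", since both can produce the same pixels; a reward on histories can.
import numpy as np

def observation_reward(obs):
    """Markovian reward: fraction of roughly-blue pixels in an HxWx3 RGB frame."""
    r, g, b = obs[..., 0], obs[..., 1], obs[..., 2]
    return float(((b > 150) & (r < 100) & (g < 100)).mean())

def history_reward(history):
    """History-dependent reward: same signal, but zeroed if 'paint' was ever used.

    `history` is a list of (observation, action) pairs; the 'paint' action
    name is hypothetical.
    """
    if any(action == "paint" for _, action in history):
        return 0.0
    final_obs, _ = history[-1]
    return observation_reward(final_obs)
```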

Rohin's opinion: I certainly agree that we want to avoid reward functions defined on observations, and this is one reason why. It seems like a more general version of the wireheading argument to me, and applies even if you think that the AI won't be able to wirehead, as long as it is capable enough to find other plans for getting high reward besides the one the designer intended.

Other progress in AI

Reinforcement learning

Behaviour Suite for Reinforcement Learning (Ian Osband et al) (summarized by Zach): Collecting clear, informative and scalable problems that capture key aspects of designing general and efficient learning algorithms is difficult, and many environments currently used to evaluate RL algorithms introduce confounding variables that make new algorithms hard to compare. The authors address this by introducing the Behaviour Suite for Reinforcement Learning (bsuite), an open-source library that facilitates reproducible and accessible research on core issues in RL. The idea is to capture each core capability, such as 'exploration' or 'memory', in an experiment that can be easily run and evaluated; bsuite instantiates all of these experiments in code and automates the evaluation and analysis of any RL agent on them. The suite is designed to be flexible and includes code to run experiments in parallel on Google Cloud, an example Jupyter notebook, and integrations with OpenAI Gym.
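For a sense of how lightweight the workflow is, here is a rough sketch of evaluating a trivial random agent on one bsuite experiment; the function and attribute names are recalled from the open-source library and may differ slightly across versions.

```python
# Rough sketch of running a random agent on one bsuite experiment and recording
# results to CSV for the automated analysis notebook. API names are from the
# open-source library as I recall it and may differ between versions.
import numpy as np
import bsuite

env = bsuite.load_and_record_to_csv('catch/0', results_dir='/tmp/bsuite')
for _ in range(env.bsuite_num_episodes):  # each experiment fixes its episode budget
    timestep = env.reset()
    while not timestep.last():
        action = np.random.randint(env.action_spec().num_values)  # random policy
        timestep = env.step(action)
```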

Zach's opinion: It's safe to say that work towards good evaluation metrics for RL agents is a good thing. I think this paper captures a lot of the notions of what makes an agent 'good' in a way that seems readily generalizable. The evaluation time on the suite is reasonable: no more than 30 minutes per experiment. Additionally, the ability to produce automated summary reports in standard formats is a nice feature. One thing that seems to be missing from the core set of experiments is a good notion of transfer learning capability beyond simple generalization. However, the authors readily note that the suite is a work in progress, so I wouldn't be surprised if something covering that were added in time.

Rohin's opinion: The most interesting thing about work like this is what "core issues" they choose to evaluate -- it's not clear to me whether e.g. "memory" in a simple environment is something that future research should optimize for.

Read more: See Import AI

 
