
### Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3)

LessWrong.com news - July 28, 2019 - 12:32
Published on July 28, 2019 9:32 AM UTC

Clarifying Thoughts on Optimizing and Goodhart Effects - Part 3

Following the previous two posts, I'm going to try to first lay out the way Goodhart's Law applies in the earlier example of rockets, then try to explain why this differs between selection and control. (Note: Adversarial Goodhart isn't explored, because we want to keep the setting sufficiently simple.) This sets up the next post, which will discuss Mesa-Optimizers.

Revisiting Selection vs. Control Systems

Basically everything in the earlier post that used the example process of rocket design and launching is susceptible to some form of overoptimization, in different ways. Interestingly, there seem to be clear places where different types of overoptimization are important. Before looking at this, I want to revisit the selection-control dichotomy from a new angle.

In a (pure) control system, we cannot sample datapoints without navigating to them. If the agent is an embedded agent, and has sufficient span of control to cause changes in the environment, we cannot necessarily reset and try over. In a selection system, we only sample points in ways that do not affect the larger system. Even when designing a rocket, our very expensive testing has approximately no longer-term effects. (We'll leave space debris from failures aside, but will get back to it below.)

This explains why we potentially care about control systems more than selection systems. It also points to why Oracles are supposed to be safer than other AIs - they can't directly impact anything, so their output is produced in a pure selection framework. Of course, if they are sufficiently powerful, and are relied on, the changes made become irreversible, which is why Oracles are not a clear solution to AI safety.

Goodhart in Selection vs. Control Systems

Regressional and Extremal Goodhart are particularly pernicious for selection, and potentially less worrying for control. Regressional Goodhart is always present if we are insufficiently aware of our goals, but in general Causal Goodhart failures seem more critical in control, because it is often narrower. To keep this concrete, I'll go through the classes of failure, and note how they could occur at each stage of rocket design. To do so, we need to clarify goals at each stage. Our goal in stage 1 is to find a class of designs and paths to optimize. In stage 2, we build, test, and refine a system. In many ways, this stage is intended to circumvent Goodhart failures, but testing does not always address extremal cases, so our design may still fail.

Regressional Goodhart hits us if we have any divergence between our metric and our actual goal. For example, in stages 1 and 2, finding an ideal complex or chaotic path that is dependent on exact positions of planets in a multibody system would be bad, or a path involving excessive G-forces or other dangerous things might be more fuel efficient than a simpler path. For example, a gravitational slingshot around the sun might be cheap, but fry or crush the astronauts. Alternatively, a design with a shape that does not allow people to fit inside might be found when optimizing. Each of these impacts goals potentially not included in the model. Regressional Goodhart is less common in control for this case, since we kept the mesa-optimizer limited to optimizing a very narrow goal already chosen by the design-optimization.

Extremal Goodhart is always a model failure. It can be because the model is insufficiently accurate (Model Insufficiency), or because there is a regime change. Regime changes seem particularly challenging in systems that design mesa-optimizers, since I think the mesa-optimization is narrower in some way than the global optimizer (if not, it's more efficient to have an executing system rather than a mesa-optimizer).

Causal Goodhart is by default about an irreversible change. In selection systems, it means that our sampling accidentally broke the distribution. For example, we test many rockets, creating enough space debris to make further tests vulnerable to collisions. We wanted the tests to sample from the space, but we accidentally changed the regime while sampling.

In the current discussion, we care about metric-goal divergence because the cost of the divergence is high - typically, once we get there, some irreversible consequence happens, as explained above. This isn't exclusively true of control systems, as the causal Goodhart example shows, but it's clearly more common in such systems. Once we're actually navigating and controlling the system, we don't have any way to reset to base conditions, and causal changes create regime changes - and if these are unexpected, the control system is suddenly in a position of optimizing using an irrelevant model.

And this is a critical fact, because as I'll argue in the next post, mesa-optimizers are control systems of a specific type, and have some new overoptimization failure modes because of that.


### What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2)

LessWrong.com news - July 28, 2019 - 12:30
Published on July 28, 2019 9:30 AM UTC

Clarifying Thoughts on Optimizing and Goodhart Effects - Part 2

Previous Post: Re-introducing Selection vs Control for Optimization

In that post, I reviewed Abram's selection/control distinction, and suggested how it relates to actual design. I then argued that there is a bit of a continuum between the two cases, and that we should add an additional extreme case to the typology: direct solution.

Here, I will revisit the question of what optimization means.

NOTE: This is not completely new content, and is instead split off from the previous version and rewritten to include an (Added) discussion of Eliezer's definition for measuring optimization power, from 2008. Hopefully this will make the sequence clearer for future readers.

In the next post, Applying Overoptimization to Selection vs. Control, I apply these ideas, and concretize the discussion a bit more before moving on to discussing Mesa-Optimizers in Part 4.

What does Optimization Mean, Again?

This question has been discussed a bit, but I still don't think it's clear. So I want to start by revisiting a post Eliezer wrote in 2008, where he suggested that optimization power was the ability to select states from a preference ordering over different states, and could be measured with entropy. He notes that this is not computable, but gives us insight. I agree, except that I think that the notion of the state space is difficult, for some of the reasons Scott discussed when he mentioned that he was confused about the relationship between gradient descent and Goodhart's law. In doing so, Scott proposed a naive model that looks very similar to Eliezer's.
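To make the entropy framing concrete, here is a toy numerical sketch (my own illustration, not from Eliezer's post): over a finite state space, optimization power can be read as the rarity, in bits, of landing on a state at least as good as the one actually selected.

```python
import math
import random

def optimization_power_bits(achieved_value, state_values):
    """How rare is an outcome at least this good, if states were
    sampled uniformly at random? Expressed in bits (Eliezer's measure,
    applied to a finite toy state space)."""
    at_least_as_good = sum(1 for v in state_values if v >= achieved_value)
    return -math.log2(at_least_as_good / len(state_values))

# Toy state space: 1024 states with random utilities.
random.seed(0)
states = [random.random() for _ in range(1024)]

# Landing anywhere in the top half exerts ~1 bit of optimization;
# hitting the single best state exerts log2(1024) = 10 bits.
print(optimization_power_bits(max(states), states))  # 10.0
```

This already hints at the problem discussed below: the number comes out entirely relative to how the state space was carved up, which is exactly what becomes slippery in continuous settings.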

I want to start by noting that this is absolutely and completely a "selection" type of optimization, in Abram's terms. As Scott noted, however, it's not a good model for what most optimization looks like, and that's part of why I think Eliezer's model is less helpful than I did when I originally read it.

There's a much better model for gradient descent optimization, which is... gradient descent. It is a bit closer to control than direct optimization, since in some sense we're navigating through the space, but for almost all actual applications, it is still selection, not control. To review how it works: points are chosen iteratively, and the gradient is assessed at each point. The gradient is used to select the next point, at some (perhaps very cleverly, dynamically chosen) distance along it. A stopping criterion is checked, and if it is not met, the process iterates from the new point. This is almost always far more efficient than generating random points and examining them.
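That loop can be sketched minimally in one dimension (a toy illustration; the function names and the fixed learning rate are my own choices, not anything from the post):

```python
def gradient_descent(f, grad, x0, lr=0.1, tol=1e-8, max_iters=10_000):
    """Iteratively step against the local gradient, stopping once
    the step size falls below a tolerance."""
    x = x0
    for _ in range(max_iters):
        step = lr * grad(x)
        x = x - step
        if abs(step) < tol:  # stopping criterion
            break
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: (x - 3) ** 2,
                         lambda x: 2 * (x - 3),
                         x0=0.0)
print(round(x_min, 4))  # 3.0
```

Note that the loop never enumerates the state space; it only ever evaluates the gradient locally, which is the point made in the next paragraph.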

(Added) It's far better than a grid search, usually, for most landscapes, but it also makes clear why I think it's hard to discuss optimization power in Eliezer's terms on a practical level, at least when dealing with a continuous system. The problem I'm alluding to is that any list of preferences over states depends on the number of states. Gradient descent type optimization is really good at focusing on specific sections of the state space, especially compared to grid search. We might find a state where grid search would require a tremendously high resolution, but we don't ever compute a preference ordering over 2^n states. With gradient descent, we instead compute preferences for a local area and (hopefully) zoom in, potentially ignoring other parts of the space. An optimizer that focuses very narrowly can have high resolution but miss the non-adjacent region with far better outcomes, or can have fairly low resolution but perform far better - and the second optimizer is clearly more powerful, but I don't know how to capture this.

But to return to the main discussion, the process of gradient descent is also somewhere between selection and control - and that's what I want to explain.

In theory, the evaluation of each point in the test space could involve an actual check of the system. I build each rocket, and watch to see whether it fails or succeeds according to my metric. For search, I'd just pick the best performers, and for more clever approaches, I can do something like find a gradient by judging how performance responds to each parameter, to see whether increasing or decreasing those amenable to improvement would help. (I can be even more inefficient, but find something more like a gradient, by building many similar rockets, each an epsilon away in several dimensions, and estimating a gradient that way. Shudder.)
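The shudder-inducing "each an epsilon away" scheme is just finite-difference gradient estimation. A sketch (a toy illustration with a cheap stand-in function, not actual rocket evaluation):

```python
def finite_difference_gradient(f, x, eps=1e-6):
    """Estimate the gradient of f at x by evaluating one 'nearby rocket'
    per dimension, each nudged eps away, and comparing to the baseline."""
    baseline = f(x)
    grad = []
    for i in range(len(x)):
        nudged = list(x)
        nudged[i] += eps
        grad.append((f(nudged) - baseline) / eps)
    return grad

# Stand-in cost function f(x, y) = x^2 + 3y; true gradient at (1, 2) is (2, 3).
g = finite_difference_gradient(lambda p: p[0] ** 2 + 3 * p[1], [1.0, 2.0])
print([round(v, 3) for v in g])  # [2.0, 3.0]
```

Each gradient estimate costs one extra evaluation per dimension, which is exactly why doing this with physical rockets rather than a proxy model is so wasteful.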

In practice, we use a proxy model - and this is one place that allows for the types of overoptimization misalignment we are discussing. (But it's not the only one.) The reason this occurs is laid out clearly in the Categorizing Goodhart paper as one of the two classes of extremal failure - either model insufficiency, or regime change. This also allows for causal failures (undetectable during simulation), if the proxy model gets a causal effect wrong. Even without using a proxy model, we can be led astray by the results if we are not careful. Rockets might look great, even in practice, and only fail in untested scenarios because we optimized something too hard - extremal model insufficiency. (Lower weight is cheaper, and we didn't notice a specific structural weakness induced by ruthlessly eliminating weight on the structure.) For our purposes, we want to talk about things like "how much optimization pressure is being applied." This is difficult, and I think we're trying to fit incompatible conceptual models together rather than finding a good synthesis, but I have a few ideas on what selection pressure leading to extremal regions means here.

• Extreme proxy values (in comparison to most of the space) seem similar to having lots of selection pressure. If we have an insanely tall and narrow peak, we may be finding something strange rather than simply improving.
• Extreme input values (unboundedly large or small values) may indicate a worrying area vis-a-vis overoptimization failures.
• Lots of search time alone does NOT indicate extremal results - it indicates lots of things about your domain, and perhaps the inefficiency of your search, but not overoptimization. (This is in contrast to the naive grid-search model, where lots of points visited means more optimizing.)

As an aside, Causal Goodhart is different. It doesn't really seem to rely on extremes, but rather on manipulating new variables, ones that could have an impact on our goal. This can happen because we change the value to a point where it changes the system, similar to extremal Goodhart, but it does not need to. For instance, we might optimize filling a cup by getting the water level near the top. Extremal regime change failure might be overfilling the cup and having water spill everywhere. Causal failure might be moving the cup to a different point, say right next to a wall, in order to capture more water, but accidentally breaking the cup against the wall.

Notice that this doesn't require much optimization pressure - Causal Goodhart is about moving to a new region of the distribution of outcomes by (metaphorically or literally) breaking something in the causal structure, rather than by over-optimizing and pushing far from the points that have been explored.

This completes the discussion so far - and note that none of this is about control systems. That's because in a sense, most current examples don't optimize much, they simply execute an adaptive program.

One critical case of a control system optimizing is a mesa-optimizer, but that will be deferred until after the next post, which introduces some examples and intuitions around how Goodhart-failures occur in selection versus control systems.


### Keeping Beliefs Cruxy

LessWrong.com news - July 28, 2019 - 04:18
Published on July 28, 2019 1:18 AM UTC

You might want to doublecrux if either:

• You're building a product and disagree about how to go about it.
• You want to make your beliefs more accurate, and you think a particular person you disagree with is likely to have useful information for you.
• You just... enjoy resolving disagreements in a way that mutually pursues truth for whatever reason.

Regardless, you might find yourself with the problem:

Doublecruxing takes a lot of time.

For a reasonably 'serious' disagreement, I think it frequently takes at least an hour, and often longer. Habryka and I once took 12 hours over the course of 3 days to make any kind of progress on a particularly gnarly disagreement. And sometimes disagreements can persist for years despite significant effort.

Now, doublecruxing is faster than many other forms of truth-aligned-disagreement resolution. I actually think it's helpful to think of doublecrux as "the fastest way for two disagreeing-but-honest people to converge locally towards the truth", and if someone came up with a faster method, I'd recommend deprecating doublecrux in favor of it. (Meanwhile, doublecrux is not guaranteed to be faster for 3+ people to converge, but I still expect it to be faster for smallish groups with particularly confusing disagreements.)

Regardless, multiple hours is a long time. Can we do better?

I think the answer is yes, and it basically comes in the form of:

• Practice finding your own cruxes
• Practice helping other people find their cruxes
• Develop metacognitive skills that make cruxfinding natural and intuitive
• Caching the results into a clearer belief-network

I'd summarize all of that as "develop the skill and practice of keeping your beliefs cruxy."

By default, humans form beliefs for all kinds of reasons, without regard for how falsifiable they are. The result is a tangled, impenetrable web. Productive disagreement takes a long time because people are starting from the position of "impenetrable web."

First, your beliefs should (hopefully?) be more entangled with reality, period. You'll gain the skill of noticing how your beliefs should constrain your anticipations, and then if they fail to do so, you can maybe update your beliefs.

Second, if you've cultivated that skill, then during a doublecrux discussion, you'll have an easier time engaging with the core doublecrux loop. (So, a conversation that might have taken an hour takes 45 minutes – your conversation partner might still take a long time to figure out their cruxes, but maybe you can do your own much faster)

Third, once you've gotten into this habit, it will help your beliefs form in a cleaner, more reality-entangled fashion in the first place. Instead of building an impenetrable morass, you'll be building a clear, legible network. (So, you might have all your cruxes fully accessible from the beginning of the conversation, and then it's just a matter of stating them, and then helping your partner to do so)

[Note: I don't think you should optimize directly for your beliefs being legible. This is a recipe for burying illegible parts of your psyche and then missing important information. But rather, if you try to actually understand your beliefs and what causes them, the legibility will come naturally as a side benefit]

Finally, if everyone around you is doing this, it radically lowers the cost of productive disagreement. Instead of taking an hour (or three days), as soon as you bump into an important disagreement you can quickly navigate through your respective belief networks, find the cruxes, and skip to the part where you actually Do Empiricism.

I think keeping beliefs cruxy is a good example of a practice that is both a valuable "Rabbit" strategy, as well as something worth Stag Hunting Together on.

If you have an organization, community, or circle of friends where many people have practiced keeping-beliefs-cruxy, people will individually benefit, as well as creating a truthseeking culture more powerful than the sum of its parts.


### Is this info on zinc lozenges accurate?

LessWrong.com news - July 28, 2019 - 01:05
Published on July 27, 2019 10:05 PM UTC

This podcast claims that zinc lozenges are "probably almost essentially a cure for the common cold". But there are many caveats:

• must be zinc acetate or zinc gluconate (but gluconate is strictly worse)
• use immediately on getting a cold
• 18mg zinc per lozenge
• dissolve in mouth 20-30 minutes
• take every 2 hours
• must have metallic taste, astringency
• must be free of anything ending in -ate (except stearate) or -ic acid; free of magnesium except magnesium stearate (it's insoluble)
• only one product on the market satisfies these requirements

The guy sounds to me like he knows what he's talking about. But I don't have the technical expertise to really know. (I think I could detect a mediocre bullshitter, but not necessarily a high level one.) If true, it seems like the sort of information that would be good for more people to know; but also like the sort of information that would be more widely known if it were true. (But I can sketch an argument for why it might not.) The research section at the linked page cites three journal articles of which two are open access, but I haven't looked closely at them.

My own experience is that I got some of these lozenges about a year ago, after reading a transcript of the episode. I thought I'd gotten them too late, but my cold cleared up much faster than I would have expected otherwise. Since then I've been trying to collect more anecdotal data, but my body is stubbornly refusing to even start coming down with a cold. Twice I thought it might be, and I took lozenges and didn't; but I think I took one lozenge on one occasion and three (spaced out) on another, and he thinks they shouldn't be effective enough for that to have worked. I'm not sure what to make of this, except that it shouldn't be much given the sample size.

Unfortunately the transcript has now been removed, and I can't find it on archive.org. I've made notes of the first ~35 minutes (of ~70). If someone could take a look (or listen), and say whether it all seems basically accurate, that would be fantastic. Almost all of it seems consistent with what I think I know, with one surprise that I've bolded. Apologies for the poor formatting.

• zinc is important to the immune system in ways that are irrelevant to this. If you aren't getting enough, you'll probably benefit from getting more. Good sources are oysters, red meat and cheese.

• RDA for zinc is satisfied by eating oysters once a week or beef once a day

• it's more prevalent and also more bioavailable in animal foods than plant foods. So if you mainly eat a plant diet you may benefit from supplements or zinc-rich foods

• phytate (grains, nuts, seeds are good sources) is "storage house for minerals"; allows plants to germinate & grow when conditions are right. Phytates make zinc less bioavailable in both the meal and supplements

• but again this is separate from using zinc to cure colds

• George Eby's 3yo daughter with leukemia had many colds, given 50mg zinc gluconate, refused to swallow, cold disappeared

• Eby and colleagues published an RCT in 1984 showing zinc lozenges could reduce median cold duration by 5 days and mean duration by 7 days; basically cures the cold

• Almost every zinc lozenge on the market is useless for this purpose

• ionic zinc (+ve charge, free, not bound to anything) affects nasal tissue and adenoid tissue (lymph tissue in throat), i.e. the two major sites of infection during a cold: it inhibits activation of viral polypeptides that are used in replication of the cold virus, and inhibits production in our cells of ICAM-1 (intercellular adhesion molecule 1), which is the dock that allows the virus to grab hold of a cell and enter it

• zinc products are all salts, not ionic. So we need one that releases ionic zinc in the relevant tissues at the right time

• zinc interferes with replication of virus. So you need to take it almost immediately after being infected or at the first sign of symptoms

• cold incubation period ~1 day, no symptoms, contagious; 2-3 days where replication and symptoms are increasing; then it peaks and declines, after 5 days basically undetectable but your symptoms may continue. So if you start using them 2-3 days in, they probably won't do anything

• tablet or capsule releases zinc into stomach so that's no good

• nasal sprays can kill your sense of smell

• zinc released from a lozenge will reach your nasal tissues and throat tissues

• some say you want a salt that releases ionic zinc at the pH of saliva. But actually it needs to release at pH of your nasal and throat tissues

• saliva pH is 5; over 100 times more acidic than pH of cellular environment which is 7.4

• 7.4 is basic, right? "100 times more acidic than [something on the other side of neutral]" seems like a weird thing to say? It sounds to me like "-5°C is 100 times more freezing than +2°C". Also, if I google "pH of saliva" I see 6.2 to 7.6. (I wouldn't be at all surprised to discover I'm just wrong about the acidic/freezing analogy.)

• lots of zinc salts release ionic at pH 5, only a handful at 7.4

• of salts in lozenges, only acetate and gluconate release any meaningful amount

• at 7.4, gluconate is 50% ionic and acetate is 100% ionic. so zinc acetate should be twice as effective

• most lozenges are neither; only one is acetate

• zinc in your mouth has a metallic taste (astringent), dries it out. So people try to make zinc lozenges more palatable

• but the astringency comes from the ionic zinc in your mouth. So if it's not astringent, it's not gonna help.

• OTOH being astringent doesn't mean it will help, because that's in your saliva not your nose/throat tissues

• food acids e.g. citrate or tartrate will very tightly bind zinc

• studies with citrate or tartrate in lozenge seem to suggest it actually makes the cold last longer

• ionic magnesium delivered to nose/throat tissues will nullify zinc, increase replication of cold virus

• one product tested found harmful was produced with very high heat in presence of fats, maybe palm oil; high heat produced insoluble zinc waxes with the fatty acids

• lubricants used in supplements, like magnesium stearate or other stearates, are insoluble; so they don't yield an acid that could bind to the zinc and don't yield much ionic magnesium and don't cause problems
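On the "over 100 times more acidic" puzzle in the notes above: pH is the negative base-10 log of hydrogen-ion concentration, so the ratio between two pH values is well-defined regardless of which side of neutral either one falls on (a quick arithmetic check, not a claim about the podcast's pH figures themselves):

```python
# Hydrogen-ion concentration ratio between pH 5.0 (claimed saliva)
# and pH 7.4 (cellular environment): 10^(7.4 - 5.0) = 10^2.4.
ratio = 10 ** (7.4 - 5.0)
print(round(ratio))  # 251
```

So "over 100 times more acidic" is literally about H+ concentration and checks out as stated; it isn't analogous to "100 times more freezing", since the concentration scale doesn't change meaning at pH 7.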


### Shortform Beta Launch

LessWrong.com news - July 27, 2019 - 23:09
Published on July 27, 2019 8:09 PM UTC

We've had unofficial experiments with shortform for over a year. More and more people have been trying it out and finding it useful. Now, we're making shortform an officially supported feature.

My description of shortform, inspired by pattern's comment, is:

Writing that is short in length, or written in a short amount of time. Includes off-the-cuff thoughts and brainstorming.

Why shortform?

Sometimes shorter is better

I've noticed when I write a Facebook post... it ends up exactly as long as it's supposed to be. I write 3-5 paragraphs that nicely encapsulate my idea, and then it looks about the right length, and I click submit and have a nice, clear discussion.

When I start writing a LessWrong post, sometimes I look at the beautiful serif text on the nice blank white page and... I dunno, it feels like I'm supposed to write a 3 page essay, so I do. But my idea would have been better if I expressed it in 3 paragraphs.

Sometimes off-the-cuff is better

I also often want to brainstorm early stage ideas in a way that isn't (necessarily) optimized for others to read – figuring out how to explain something well might be hard, and I'm not even sure the idea is good yet. But, people who've been following along with my thought process and understand what I'm gesturing at can still chime in with ideas.

Sometimes I just wanna start writing without worrying about what sort of thing I'm writing yet.

There's also an important in-between case, where maybe I'm writing something off-the-cuff and brainstormy, and maybe I'm actually writing a full treatise on something important. And I just... don't want to spend cycles figuring that out. I want my editor to feel unopinionated, and I want to be able to click 'submit' at the end without stressing out about whether I'm submitting 'good enough content'.

Sometimes, this results in an initial shortform comment eventually getting revamped into a major post.

For all of these reasons, and more, it seems useful to have a part of lesswrong optimized for shorter writing.

New Features, focusing on Visibility

Shortform is created in the form of comments (attached to an automatically generated shortform post). The new features mostly aim to:

• Automate a lot of work. Instead of having to manually create a post called "So-and-so's shortform", you just start writing a comment, click submit, and then that post is automatically created for you
• Improve visibility. Part of the point of shortform (in some cases) is to be a bit less visible. But so far it's been extremely un-visible, appearing only in the Recent Discussion section of the frontpage, usually only for a couple hours.

Features include:

New /shortform page

If you're in the mood to engage specifically with shortform content (as an author or reader), you can go to lesswrong.com/shortform. There you can:

• Read the latest shortform content. Currently, these are sorted by "most recently replied to", with the 3 latest replies shown underneath.
• If a shortform item has more than 3 replies, there'll be an "N additional comments" button you can click to load more, and the reply will have a little "show parent" icon to indicate that there's conversation missing.
• Start writing a shortform comment. When you click the "submit" button, you'll automatically generate a shortform post, and your comment will be added to that post. The comment will appear below in the stream of content.

It looks like this:

All Posts page visibility

If you're using the All Posts daily view, the top 5 shortform comments from that day will be visible (and you can click "load more")

Clicking on a shortform item will expand it and load replies.

Upcoming Features

This is all just the minimum viable product to get things rolling. There are some obvious features to add, such as:

• Letting users subscribe to individual people's shortform
• Making it easier to permalink to shortform content
• Making it easier to convert a shortform comment into a full post

Let us know if you have other suggestions or feedback


### End-to-End Electronics Development

Kocherga events - July 27, 2019 - 18:00
Kocherga is launching a long-term course on end-to-end electronics development. The course aims to ensure that a person who completes it can independently design, flash, and debug an electronic device from a given specification - that is, work as a development engineer. The focus is on full-stack development of digital electronics and embedded systems using modern technologies, but other areas will also be covered.

### The Artificial Intentional Stance

Новости LessWrong.com - 27 июля, 2019 - 10:00
Published on July 27, 2019 7:00 AM UTC

Another post in the same incremental vein. Still hoarding the wild speculation for later.

I

The idea of the "intentional stance" comes from Dan Dennett, who wanted to explain why it makes sense that we should think humans have such things as "beliefs" or "desires." The intentional stance is just a fancy name for how humans usually think in order to predict the normal human-scale world - we reason in terms of people, desires, right and wrong, etc. Even if you could be prodded into admitting that subatomic particles are "what there really are," you don't try to predict people by thinking about subatomic particles, you think about their desires and abilities.

We want to design AIs that can copy this way of thinking. Thus, the problem of the artificial intentional stance. Value learning is the project of getting the AI to know what humans want, and the intentional stance is the framework that goes between raw sensory data and reasoning as if there are things called "humans" out there that "want" things.

II

Suppose you want to eat a strawberry, and you are trying to program an AI that can learn that you want to eat a strawberry. A naive approach would be to train the AI to model the world as best it can (I call this best-fit model the AI's "native ontology"), and then bolt on some rules telling it that you are a special object who should be modeled as an agent with desires.

The reason this doesn't work is that the intentional stance is sort of infectious. When I think about you wanting to eat a strawberry using my intentional stance, I don't think about "you" as a special case and then use my best understanding of physiology, biochemistry, and particle physics to model the strawberry. Instead, I think of the verb "to eat" in terms of human desires and abilities, and I think of the strawberry in terms of how humans might acquire or eat one.

This is related to the concept of "affordances" introduced by James J. Gibson. Affordances are the building blocks for how we make plans in the environment. If I see a door, I intuitively think of opening or locking it - it "affords opening." But maybe an experienced thief will intuitively think of how to bypass the door - they'll have a different intuitive model of reality, in which different affordances live.

When you say you want to eat a strawberry, you are using an approximate model of the world that not only helps you model "you" and "want" at a high level of abstraction, but also "eat" and "strawberry." The AI's artificial intentional stance can't just be a special model of the human, it has to be a model of the whole world in terms of what it "affords" the human.

III

If we want to play a CIRL-like cooperative game with real human goals, we'll need the artificial intentional stance.

CIRL assumes that the process generating its examples is an agent (modeled in a way that is fixed when implementing that specific CIRL-er), and tries to play a cooperative game with that agent. But the true "process that determines its inputs" is the entire universe - and if the CIRL agent can only model a small chunk of the universe, there's no guarantee that that chunk will be precisely human-shaped.

If we want an AI to cooperate with humans even if it's smart enough to model a larger-than-human chunk of the universe, this is an intentional stance problem. We want it to model the inputs to some channel in terms of specifically human-sized approximate agents, living in the universe. And then use this same intentional-stance model to play a cooperative game with the humans, because it's this very model in which "humans" are possible teammates.

This exposes one of the key difficulties in designing an artificial intentional stance: it needs to be connected to other parts of the AI. It's no good having a model of humans that has no impact on the AI's planning or motivation. You have to be able to access the abstraction "what the humans want" and use it elsewhere, either directly (if you know the format and where it's stored in memory), or indirectly (via questions, examples, etc.).

IV

The other basic difficulty is: how are you supposed to train or learn an artificial intentional stance?

If we think of it as a specialized model of the world, we might try to train it for predictive power, and tune the parameters so that it gets the right answers as often as possible. But that can't be right, because the artificial intentional stance is supposed to be different from the AI's best-predicting native ontology.

I'm even skeptical that you can improve the intentional stance by training it for efficiency, or predictive power under constraints. Humans might use the intentional stance because it's efficient, but it's not a unique solution - the psychological models that people use have changed over history, so there's that much wiggle room at the very, very least. We want the AI to copy what humans are doing, not strike out on its own and end up with an inhuman model of humans.

This means that the artificial intentional stance, however it's trained, is going to need information from humans, about humans. But of course humans are complicated, our stories about humans are complicated, and so the AI's stories about humans will be complicated. An intentional stance model has to walk a fine line: it must capture important detail, but not so much detail that the humans are no longer being understood in terms of beliefs, desires, etc.

V

I think this wraps up the basic points (does it, though?), but I might be missing something. I've certainly left out some "advanced points," of which I think the key one is the problem of generalization: if an AI has an intentional stance model of you, would you trust it to design a novel environment that it thinks you'll find extremely congenial?

Oh and of course I've said almost nothing about practical schemes for creating an artificial intentional stance. Dissecting the corpse of a scheme can help clarify issues, although sometimes fatal problems are specific to that scheme. By the end of summer I'll try to delve a little deeper, and take a look at whether you could solve AI safety by prompting GPT-2 with "Q: What does the human want? A:".

Discuss

### Just Imitate Humans?

LessWrong.com news - July 27, 2019 - 03:35
Published on July 27, 2019 12:35 AM UTC

Do people think we could make a singleton (or achieve global coordination and preventative policing) just by imitating human policies on computers? If so, this seems pretty safe to me.

Some reasons for optimism: 1) these could be run much faster than a human thinks, and 2) we could make very many of them.

What are people's intuitions here? Could enough human-imitating artificial agents (running much faster than people) prevent unfriendly AGI from being made?

If we think this would work, there would still be the (neither trivial nor hopeless) challenge of convincing all serious AGI labs that any attempt to run a superhuman AGI is unconscionably dangerous, and we should stick to imitating humans.

Discuss

### Results of LW Technical Background Survey

LessWrong.com news - July 26, 2019 - 20:33
Published on July 26, 2019 5:33 PM UTC

The main goal of the survey was to provide info for authors about their target audience, so here's a high-level overview toward that end:

• The average respondent is some kind of professional programmer, with an undergrad degree (or equivalent) in CS.
• Most people have seen at least some economics and probability, but not at the level of an undergrad degree.
• Almost everyone knows calculus, but linear algebra or differential equations will likely be lost on at least ~25% of respondents.
• There are substantial zero-knowledge and high-knowledge counts for most areas.

Here are charts of the responses to each question. I strongly recommend looking at them directly rather than just taking my summary at face value. As always, remember this is an opt-in survey without any sort of verification of responses, so take everything with a grain of salt.

One interesting note: we had a handful of respondents declaring very high skill levels (Nobel-level economists, Turing-level computer scientists, or primary developers of popular software). I'd personally be interested to hear what exactly those people work on, especially if they're willing to occasionally field questions on their area of expertise. All y'all should leave a comment or something.

Actually, I'm curious what everyone works on, especially specialties for all the researchers. Feel free to leave a quick comment, especially if you're able and willing to occasionally field questions in your area of expertise.

Discuss

### Rationality Dojo: Probability Estimates

Kocherga events - July 26, 2019 - 19:30
What is the probability of running into a dinosaur on the street tomorrow? If your answer is 50/50 (because you either run into one or you don't), you should definitely come to this dojo.

### Sequences Reading Club

Kocherga events - July 26, 2019 - 19:30
How do you take reality apart into its component pieces? And how do you live in the universe we have always lived in, without being disappointed that complex things are made of simple things? We continue our discussion of the sequence on reductionism.

### Old Man Jevons Can’t Save You Now (Part 2/2)

LessWrong.com news - July 26, 2019 - 06:51
https://thelimelike.files.wordpress.com/2019/07/daum_equation_1564086597773.png

### Can you summarize highlights from Vernon's Creativity?

LessWrong.com news - July 26, 2019 - 04:12
Published on July 26, 2019 1:12 AM UTC

A good anthology to read is Creativity, ed Vernon 1970 - it's old but it shows you what people were thinking back when Torrance was trying to come up with creativity tests, and the many psychometric criticisms back then which I'm not sure have been convincingly resolved.

I'm not sure whether Elizabeth has already read it, but I'd be interested in reading the highlights from that if anyone was up for distilling it down into something more manageable.

Discuss

### How often are new ideas discovered in old papers?

LessWrong.com news - July 26, 2019 - 04:00
Published on July 26, 2019 1:00 AM UTC

Suppose someone wrote a paper about X two decades ago. A modern reader realizes the X paper sheds light on an unrelated idea Y. Do we have any information on how often this happens? How often is this just "I figured out Y for a different reason, and while doing my lit review I realized that the X paper is also relevant for Y"?

Discuss

### Ought: why it matters and ways to help

LessWrong.com news - July 25, 2019 - 21:00
Published on July 25, 2019 6:00 PM UTC

I think that Ought is one of the most promising projects working on AI alignment. There are several ways that LW readers can potentially help, described below.

In this post I'll describe what Ought is currently doing, why I think it's promising, and give some detail on these asks.

(I am an Ought donor and board member.)

Factored evaluation

Ought's main project is currently designing and running "factored evaluation" experiments, and building relevant infrastructure. The goal of these experiments is to answer the following question:

Consider a principal who wants to incentivize experts to work on hard problems, but finds that the work is too complex for the principal to tell what is good or bad. Can they overcome this problem by dividing the evaluation task into smaller pieces, delegating each of them to other experts, and recursively incentivizing good work on those pieces using the same mechanism?

Here's what an experiment looks like:

• Recruit a pool of "judges," "honest experts," and "malicious experts."
• In initial experiments, the questions are difficult because the judge lacks relevant context. For example, the judge may want to answer a question about a long article, while only being allowed to look at a few hundred characters of text.
• Limiting the amount of text the judge can read is an artificial restriction. This is intended as a warm-up, an easier analogy for the situation where the experts have knowledge that cannot be explained to a single judge in a reasonable amount of time (or where we want to reduce the cost of judging).
• An "honest expert" provides a good answer to that question, a "malicious expert" provides a bad answer to that question, and a judge is tasked with telling which answer was better. The answers themselves may be too large and complex for the judge to look at, in which case they'll be handed a compact representation of each answer which they can inspect as needed.
• In order to tell which answer is better, the judge asks subquestions. For example, they might ask "What's the best reason to think that answer A is better than answer B?" or "What does the author of article X believe about question Y?"
• We answer these subquestions by delegating to the experts. The honest expert answers each subquestion as well as they can. The malicious expert can propose a bad subanswer, and we run the experiment recursively to see if the judge would pick the good or bad subanswer to the subquestion. If the malicious expert can propose subanswers that corrupt the top-level answer, and that are selected by the judge when we run this experiment recursively, then the malicious expert wins and the experiment fails.
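To make the recursion explicit, here is a toy model of the loop above (the parity domain and all function names are my invention, not Ought's actual protocol): the "question" is an integer, the contested "answer" is its parity, and a judge who can only check small numbers directly must adjudicate larger ones via a recursively contested subquestion about the next number down.

```python
# Toy factored evaluation. The judge can only evaluate "small" questions
# directly (standing in for the judge's limited context); for larger ones
# it relies on an adjudicated subanswer, produced by running the same
# contest recursively. Hypothetical sketch, not Ought's real experiments.

THRESHOLD = 4

def honest(n):
    return n % 2 == 0          # the correct parity claim

def malicious(n):
    return n % 2 != 0          # always the wrong parity claim

def judge(n, claim_a, claim_b, subanswer):
    """Pick between two contested claims about the parity of n."""
    if n < THRESHOLD:
        truth = n % 2 == 0     # small enough to check unaided
    else:
        truth = not subanswer  # parity(n) is the opposite of parity(n-1)
    return claim_a if claim_a == truth else claim_b

def factored_evaluation(n):
    """Return the answer the judge selects for question n."""
    subanswer = factored_evaluation(n - 1) if n >= THRESHOLD else None
    return judge(n, honest(n), malicious(n), subanswer)
```

In this toy the malicious expert loses at every level by construction; the real experiments ask whether that still holds when the experts are genuinely cleverer than the judge.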

This is not Ought's only project, but it's currently the largest single focus. Other projects include: exploring how well we can automate the judge's role on simple questions using existing ML, and thinking about possible decomposition strategies and challenges for factored evaluation.

Why this is important for AI alignment

ML systems are trained by gradient descent to optimize a measurable objective. In the best case (i.e. ignoring misaligned learned optimization) they behave like an expert incentivized to optimize that objective. Designing an objective that incentivizes experts to reveal what they know seems like a critical step in AI alignment. I think human experts are often a useful analogy for powerful ML systems, and that we should be using that analogy as much as we can.

Not coincidentally, factored evaluation is a major component of my current best-guess about how to address AI alignment, which could literally involve training AI systems to replace humans in Ought's current experiments. I'd like to be at the point where factored evaluation experiments are working well at scale before we have ML systems powerful enough to participate in them. And along the way I expect to learn enough to substantially revise the scheme (or totally reject it), reducing the need for trials in the future when there is less room for error.

Beyond AI alignment, it currently seems much easier to delegate work if we get immediate feedback about the quality of output. For example, it's easier to get someone to run a conference that will get a high approval rating, than to run a conference that will help participants figure out how to get what they actually want. I'm more confident that this is a real problem than that our current understanding of AI alignment is correct. Even if factored evaluation does not end up being critical for AI alignment I think it would likely improve the capability of AI systems that help humanity cope with long-term challenges, relative to AI systems that help design new technologies or manipulate humans. I think this kind of differential progress is important.

Beyond AI, I think that having a clearer understanding of how to delegate hard open-ended problems would be a good thing for society, and it seems worthwhile to have a modest group working on the relatively clean problem "can we find a scalable approach to delegation?" It wouldn't be my highest priority if not for the relevance to AI, but I would still think Ought is attacking a natural and important question.

Ways to help

Web developer

I think this is likely to be the most impactful way for someone with significant web development experience to contribute to AI alignment right now. Here is the description from their job posting:

The success of our factored evaluation experiments depends on Mosaic, the core web interface our experimenters use. We’re hiring a thoughtful full-stack engineer to architect a fundamental redesign of Mosaic that will accommodate flexible experiment setups and improve features like data capture. We want you to be the strategic thinker that can own Mosaic and its future, reasoning through design choices and launching the next versions quickly.

Our benefits and compensation package are at market with similar roles in the Bay Area. We think the person who will thrive in this role will demonstrate the following:

• 4-6+ years of experience building complex web apps from scratch in Javascript (React), HTML, and CSS
• Ability to reason about and choose between different front-end languages, cloud services, and API technologies
• Experience managing a small team, squad, or project with at least 3-5 other engineers in various roles
• Clear communication about engineering topics to a diverse audience
• Excitement around being an early member of a small, nimble research organization, and playing a key role in its success
• Passion for the mission and the importance of designing schemes that successfully delegate cognitive work to AI
• Experience with functional programming, compilers, interpreters, or “unusual” computing paradigms

Experiment participants

Ought is looking for contractors to act as judges, honest experts, and malicious experts in their factored evaluation experiments. I think that having competent people doing this work makes it significantly easier for Ought to scale up quickly and improves the probability that experiments go well - my rough guess is that a very competent and aligned contractor working for an hour does about as much good as someone donating $25-50 to Ought (in addition to the $25 wage).

Here is the description from their posting:

We’re looking to hire contractors ($25/hour) to participate in our experiments [...] This is a pretty unique way to help out with AI safety: (i) remote work with flexible hours - the experiment is turn-based, so you can participate at any time of day; (ii) we expect that skill with language will be more important than skill with math or engineering.

If things go well, you’d likely want to devote 5-20 hours/week to this for at least a few months. Participants will need to build up skill over time to play at their best, so we think it’s important that people stick around for a while.

The application takes about 20 minutes. If you pass this initial application stage, we’ll pay you the $25/hour rate for your training and work going forward.

Apply as Experiment Participant
Donate

I think Ought is probably the best current opportunity to turn marginal $ into more AI safety, and it's the main AI safety project I donate to. You can donate here. They are spending around $1M/year. Their past work has been some combination of: building tools and capacity, hiring, a sequence of exploratory projects, and charting the space of possible approaches to figure out what they should be working on. You can read their 2018H2 update here.

They have recently started to scale up experiments on factored evaluation (while continuing to think about prioritization, build capacity, etc.). I've been happy with their approach to exploratory stages, and I'm tentatively excited about their approach to execution.

Discuss

### Nonviolent Communication: Practice Session

Kocherga events - July 25, 2019 - 19:30
How can you have fewer conflicts without sacrificing your own interests? Nonviolent communication is a set of skills for reaching mutual understanding with other people. Come to our practice sessions to build these skills and communicate more sensitively and effectively.

### One Hot Research: Data Science Lecture Series

Kocherga events - July 25, 2019 - 19:00
Hundreds of papers on machine learning, neural networks, and computer vision are published every week. Our job is to dig into the hottest ones! This week we'll look at an improved version of arguably the best-known clustering method, k-means, recently presented by a group of researchers from Beijing's Tsinghua University and University College Dublin. To follow along, we'll first go over maximum likelihood estimation, classic k-means, and Gaussian mixture models. If you can compute the probability of getting at least one even number when rolling two six-sided dice, you'll most likely understand the rest.
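Incidentally, the warm-up puzzle about the two dice in the announcement above checks out by brute enumeration (a quick sanity check, not part of the original announcement):

```python
from itertools import product

# P(at least one even number on two six-sided dice)
# = 1 - P(both odd) = 1 - (1/2)**2 = 3/4, confirmed by enumeration.
outcomes = list(product(range(1, 7), repeat=2))
favorable = [roll for roll in outcomes if any(die % 2 == 0 for die in roll)]
p = len(favorable) / len(outcomes)
# p == 0.75
```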

### On the purposes of decision theory research

LessWrong.com news - July 25, 2019 - 10:18
Published on July 25, 2019 7:18 AM UTC

Following the examples of Rob Bensinger and Rohin Shah, this post will try to clarify the aims of part of my research interests, and disclaim some possible misunderstandings about it. (I'm obviously only speaking for myself and not for anyone else doing decision theory research.)

I think decision theory research is useful for:

1. Gaining information about the nature of rationality (e.g., is “realism about rationality” true?) and the nature of philosophy (e.g., is it possible to make real progress in decision theory, and if so what cognitive processes are we using to do that?), and helping to solve the problems of normativity, meta-ethics, and metaphilosophy.
2. Better understanding potential AI safety failure modes that are due to flawed decision procedures implemented in or by AI.
3. Making progress on various seemingly important intellectual puzzles that seem directly related to decision theory, such as free will, anthropic reasoning, logical uncertainty, Rob's examples of counterfactuals, updatelessness, and coordination, and more.
4. Firming up the foundations of human rationality.

To me, decision theory research is not meant to:

5. Provide a correct or normative decision theory that will be used as a specification or approximation target for programming or training a potentially superintelligent AI.
6. Help create "safety arguments" that aim to show that a proposed or already existing AI is free from decision theoretic flaws.

To help explain 5 and 6, here's what I wrote in a previous comment (slightly edited):

One meta level above what even UDT tries to be is decision theory (as a philosophical subject) and one level above that is metaphilosophy, and my current thinking is that it seems bad (potentially dangerous or regretful) to put any significant (i.e., superhuman) amount of computation into anything except doing philosophy.

To put it another way, any decision theory that we come up with might have some kind of flaw that other agents can exploit, or just a flaw in general, such as in how well it cooperates or negotiates with or exploits other agents (which might include how quickly/cleverly it can make the necessary commitments). Wouldn’t it be better to put computation into trying to find and fix such flaws (in other words, coming up with better decision theories) than into any particular object-level decision theory, at least until the superhuman philosophical computation itself decides to start doing the latter?

Comparing my current post to Rob's post on the same general topic, my mentions of 1, 2, and 4 above seem to be new, and he didn't seem to share (or didn't choose to emphasize) my concern that decision theory research (as done by humans in the foreseeable future) can't solve decision theory in a definitive enough way that would obviate the need to make sure that any potentially superintelligent AI can find and fix decision theoretic flaws in itself.

Discuss

### AnnaSalamon's Shortform

LessWrong.com news - July 25, 2019 - 08:24
Published on July 25, 2019 5:24 AM UTC

Discuss

### Dony's Shortform Feed

LessWrong.com news - July 25, 2019 - 02:48
Published on July 24, 2019 11:48 PM UTC

Discuss