# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 1 hour 3 minutes ago

### Open & Welcome Thread - August 2020

7 hours 36 minutes ago
Published on August 6, 2020 6:16 AM GMT

If it’s worth saying, but not worth its own post, here's a place to put it. (You can also make a shortform post.)

And, if you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are welcome.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the new Concepts section.

The Open Thread tag is here.


### Zen and Rationality: Don't Know Mind

9 hours 19 minutes ago
Published on August 6, 2020 4:33 AM GMT

This is post 1/? about the intersection of my decades of LW-style rationality practice and my several years of Zen practice.

In today's installment, I look at the Zen notion of "Don't Know Mind" in rationalist terms.

I'm a little unsure where "don't know mind" comes from. Sensei Google suggests it might be the Seon expression of the idea that Zen calls "shoshin", often translated as "beginner's mind" but also carrying notions conveyed by translating it as "original mind", "naive mind", "novice mind", and "inexperienced mind" (noting that the character rendered "mind" is better translated as "heart-mind"). There's also a beloved koan often called "not knowing is most intimate" (Book of Equanimity, Case 20), and "don't know" is a good name for a particular insight you might have if you meditate enough. Regardless, "don't know mind" is a thing Zen practitioners sometimes say. What does it mean?

Depends on how you parse it.

The standard parsing is as "don't-know mind", as in the mind that doesn't know. This fits with the notion of shoshin or beginner's mind, that is, the mind that has not yet made itself up. In standard rationalist terms, this is the heart-mind that is curious, relinquishing, light, even, simple, humble, and nameless. Saying "don't know" is tricky, though, because there's the looming trap of the "don't know" that stops curiosity. Instead, this is the "don't know" that extends an open invitation to learn more.

You can also parse it as a command: "do not know" (since Zen falls within the Buddhist tradition that claims all you know is mind, "mind" is redundant here). This isn't an imperative to never know anything. Rather, it's an encouragement to gaze beyond form into emptiness, since our minds are often caught up in form (the map) and fail to leave space for emptiness (the territory[1]). More specifically, to know is to make distinctions, to construct an abstraction, to prune, to have a prior over your observations, to give something a name, to say this, not that. Naturally, this means knowing is quite useful; it's literally what enables us to do things. Yet it necessarily means leaving something out. "Don't know mind" is then advice to let all in even as you keep some out.

Thus both interpretations converge on this idea that we can open ourselves to "knowing" more if we can hold fast to the realization that we don't already know it all.

1. If I'm being more careful, the duals of form and emptiness and map and territory don't perfectly align, map and territory being more akin to the phenomena/noumena or ontological/ontic split in Western philosophy. Nevertheless I think this is a good enough comparison to get the idea given the broader ways we sometimes talk about map and territory on LW. ↩︎


### The Isolation Assumption of Expected Utility Maximization

9 hours 47 minutes ago
Published on August 6, 2020 4:05 AM GMT

In this short essay I will highlight the importance of what I call the “isolation assumption” in expected utility theory. It may be that this has already been named in the relevant literature and I just don’t know it. I believe this isolation assumption is both important to decision-making about doing good and often ignored.

Expected utility theory is here taken as a normative theory of practical rationality. That is, a theory about what is rational to choose given one’s ends (Thoma 2019, 5). Expected utility theory is then the decision theory that says that the best way for an agent to pursue her goals is to choose so as to maximize expected utility.

By utility, I mean not some concept akin to happiness or wellbeing but a measure that represents how much an agent prefers an outcome. For example, for an altruist, having a child not die from drowning in a pond may have significantly higher utility than dining out at a fancy restaurant.

The term “expected” comes from probability theory. It refers to the sum of the products of the probability and value of each outcome. Expected utility is then a property of options in decisions. Say an agent has two options for lunch and the only thing this agent has preferences over is how her lunch goes today. Option A is to eat a veggie burger, which will bring this agent 10 “utils” for certain, so Option A has an expected utility of 10. Option B, however, is a raffle in which the agent either gets a really fancy clean-meat burger with probability 0.1 or nothing with probability 0.9. If the agent values the clean-meat burger at 20 utils and not eating lunch at 0, then Option B has an expected utility of 0.1*20 + 0.9*0 = 2.
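The lunch example can be checked in a few lines of Python. This is a minimal sketch, assuming an option is represented as a list of (probability, utility) pairs; the function name `expected_utility`, that representation, and the check that probabilities sum to 1 are illustrative choices, not something from the post.

```python
def expected_utility(option):
    """Sum of probability * utility over an option's outcomes.

    `option` is a list of (probability, utility) pairs; this representation
    is an illustrative assumption, not a standard API.
    """
    total_probability = sum(p for p, _ in option)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * u for p, u in option)


# Option A: veggie burger worth 10 utils, for certain.
option_a = [(1.0, 10)]
# Option B: raffle with a 0.1 chance of a 20-util burger, otherwise no lunch (0 utils).
option_b = [(0.1, 20), (0.9, 0)]

print(expected_utility(option_a))  # 10.0
print(expected_utility(option_b))  # 2.0
```

An expected-utility maximizer with these preferences therefore picks Option A.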

I currently think expected utility theory is reasonable as a theory of practical rationality. Brian Tomasik has proposed a compelling thought experiment for why that is. Consider the following scenario:

suppose we see a number of kittens stuck in trees, and we decide that saving some number n of kittens is n times as good as saving one kitten. Then, if we are faced with the choice of either saving a single kitten with certainty or having a 50-50 shot at saving three kittens (where, if we fail, we save no kittens), then we ought to try to save the three kittens, because doing so has expected value 1.5 (= 3*0.5 + 0*0.5), rather than the expected value of 1 (= 1*1) associated with saving the single kitten. (Tomasik 2016).

In this case, you may have an instinct that it makes more sense to save the single kitten, since this is the only way to guarantee that one life is saved. Yet Tomasik provides a nice line of reasoning for why you should instead maximize expected utility:

Suppose you're one of the kittens, and you're deciding whether you want your potential rescuer to save one of the three or take a 50-50 shot at saving all three. In the former case, the probability is 1/3 that you'll be saved. In the latter case, the probability is 1 that you'll be saved if the rescuer is successful and 0 if not. Since each of these is equally likely, your overall probability of being saved is (1/2)*1 + (1/2)*0 = 1/2, which is bigger than 1/3. (Tomasik 2016)
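Tomasik's two framings agree, and a short computation shows why: if, as assumed here, each of the three kittens is equally likely to be the one rescued in the certain option, then each kitten's chance of being saved is just the expected number of kittens saved divided by three. A sketch using exact fractions (the helper name `expected_saved` is illustrative):

```python
from fractions import Fraction


def expected_saved(option):
    """Expected number of kittens saved; `option` is a list of (probability, kittens_saved)."""
    return sum(p * n for p, n in option)


KITTENS = 3
save_one_for_sure = [(Fraction(1), 1)]
gamble_on_all_three = [(Fraction(1, 2), 3), (Fraction(1, 2), 0)]

for name, option in [("certain", save_one_for_sure), ("gamble", gamble_on_all_three)]:
    ev = expected_saved(option)
    # By symmetry (assumed: no kitten is favoured), each kitten's survival
    # probability is the expected number saved divided by the number of kittens.
    per_kitten = ev / KITTENS
    print(name, ev, per_kitten)
# certain 1 1/3
# gamble 3/2 1/2
```

Because the per-kitten probability is the expected value scaled by a constant, ranking options by one is the same as ranking them by the other.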

So, I’ve attempted to make the case for why expected utility theory makes sense. Now I will get to my point that we should be careful not to misuse it. I will thus try to make the case for the importance of what I call the “isolation assumption” and for how easy it is for it to be dangerously ignored.

First, let’s go a bit deeper into the point of expected utility theory. As I said above, this is a theory about how to best go about achieving one’s ends. Let’s suppose our ends are mostly about “making the most good”. Then, especially if we are aspiring Effective Altruists, we ideally want to maximize the expected utility of all of the relevant consequences of our actions. I say this in contrast to merely maximizing the expected utility of the immediate consequences of our actions. Notice, however, that scenarios that are commonly discussed when talking about decision theory, such as the one involving kittens above, are focused on the immediate consequences. What is important, then, is that we don’t forget that consequences which are not immediate often matter, and sometimes matter significantly more than the immediate consequences.

This then gives rise to the assertion I want to make clear: that we can only apply expected utility theory when we are justified in assuming that the values associated with the outcomes in our decision problem encompass all of the difference in value in our choice problem. Another way of putting this is to say that the future (beyond the outcomes we are currently considering) is isolated from the outcomes we are currently considering. Yet another way to put this is that the choice we currently face affects nothing but the prospects that we are taking into account.

Notice how this point is even more important if we are longtermists. Longtermism entails that consequences matter regardless of when they happen. This means we care about consequences extending as far into the future as the end of history. Then, if we are to choose by maximizing expected utility, we must be able to assume that whatever outcomes we are considering, choosing one way or another does not negatively affect what options are available in the rest of history.

Here is an example to illustrate my point, this time adapted and slightly modified from a thought experiment provided by Tomasik:

Suppose (if necessary) that you are an altruist. Now assume (because it will make it easier to make my point) that you are a longtermist. Suppose you are the kind of longtermist that thinks it is good if people who will lead great lives are added to the world, and bad if such great lives are prevented from existing. Suppose that there are 10,000 inhabitants on an island. This is a special island with a unique culture in which everyone is absurdly happy and productive. We can expect that if this culture continues into the future, many of the most important scientific discoveries of humanity will be made by them. However, all of the inhabitants recently caught a deadly disease. You, the decision maker, have two options. Drug A either saves all of the islanders with 50% chance or saves none of them with 50% chance. Drug B saves 4,999 of them with complete certainty.

If we consider only the survival of the inhabitants, the expected utility of Drug A is higher (10,000*0.5 + 0*0.5 = 5,000 > 4,999). However, saving these inhabitants right now is not the only thing you care about. As a longtermist, you care about all potential islanders that could exist in the future. This implies the appropriate expected utility calculation includes more than what we have just considered.

Suppose that combining the value of the scientific discoveries these islanders would make and the wellbeing their future descendants would experience if this civilization continues is worth 1,000,000 utils. Then, the complete expected utility of Drug B is the value of each life saved directly (4,999) plus the 1,000,000 of the continuation of this civilization (total = 1,004,999). The expected utility of Drug A, however, is merely (10,000 + 1,000,000)*0.5 + 0*0.5 = 505,000. So, Drug B is now the one with the highest expected value. I hope this makes clear that if you can’t assume that choices in the present do not affect the long-term consequences, you cannot use expected value on the immediate outcomes alone!
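The two calculations differ only in whether the continuation value is counted. A minimal sketch, assuming (as the example does) that Drug B's 4,999 survivors are enough for the civilization to continue while Drug A's failure branch ends it; the outcome tuples and the `include_future` flag are hypothetical names introduced only for illustration:

```python
CIVILIZATION_VALUE = 1_000_000  # stipulated value of the islanders' future (from the example)

# Each outcome: (probability, lives saved now, does the civilization continue?)
DRUG_A = [(0.5, 10_000, True), (0.5, 0, False)]
DRUG_B = [(1.0, 4_999, True)]


def expected_value(option, include_future):
    """Expected utility of a drug, optionally counting the long-term consequences."""
    total = 0.0
    for probability, lives, continues in option:
        value = lives
        if include_future and continues:
            value += CIVILIZATION_VALUE
        total += probability * value
    return total


for include_future in (False, True):
    a = expected_value(DRUG_A, include_future)
    b = expected_value(DRUG_B, include_future)
    best = "A" if a > b else "B"
    print(f"include_future={include_future}: A={a:,.0f} B={b:,.0f} -> choose {best}")
# include_future=False: A=5,000 B=4,999 -> choose A
# include_future=True: A=505,000 B=1,004,999 -> choose B
```

The ranking flips as soon as the consequences outside the immediate outcomes are included, which is exactly what the isolation assumption rules out.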

As I understand it, the upshot is the following. If you make a decision based on maximizing expected utility, you have two possibilities. You can incorporate all the relevant consequences (which may extend until the end of human civilization). Or you have to be able to assume that the value of the consequences that you are not considering does not change which of your current options is best. However, it seems to me now that the only way you can assume a set of consequences of your choice won’t affect which option is better is if you know this set is isolated from the current choice. Otherwise you would have incorporated these consequences in the decision problem.


### Diagramming "Replacing Guilt," Part 1

12 hours 24 minutes ago
Published on August 5, 2020 11:36 PM GMT

I am interested in communicating ideas visually. In general, I'm able to remember images for a longer time than I remember language, and the act of refining an idea into a drawing involves wrestling with the idea a little more deeply than I would with a simple summary in text. This is my first post here, and it's unconventional, so I'd love to get feedback about what works & doesn't work about these drawings.

I have been enjoying Nate Soares' sequence of blog posts titled *Replacing Guilt*, and I decided this would be good material to experiment on. This post consists of drawings I've made corresponding to the first seven posts: each post is listed here by title followed by the image I've drawn and an explanation of the image. These are not intended as a substitute for the original work, and I expect that browsing them without referencing the original posts would be a lackluster experience. If you've already read that sequence, you might find these drawings of interest.

Preliminaries

Half-assing it with everything you've got

Remember what you're fighting for: you may follow the paved conventional path for a time before diverging from the triers.

From the post:

If you're trying to pass the class, then pass it with minimum effort. Anything else is wasted motion.

If you're trying to ace the class, then ace it with minimum effort. Anything else is wasted motion.

If you're trying to learn the material to the fullest, then mine the assignment for all its knowledge, and don't fret about your grade. Anything else is wasted motion.

Failing with abandon

1. Fighting for something

Replacing guilt

Guilt-based motivation cannot be sustained long-term, but your current pursuits may not be sustainable with anything else. If you are not doing something you genuinely care about, you may not have any intrinsic motivation to continue doing it.

The stamp collector

When you think you care about the world, it's not just selfish motivated reasoning dressed up as altruism. You haven't deceived yourself into telling stories that make you look good: you can be genuine.

You're allowed to fight for something

There's a failure mode where you sneer at people trying to change the world, because what they're _really_ changing is their (accurate) perception of the world. You can work to improve the world, and you shouldn't feel bad that you're secretly just trying to allay your guilt.

Caring about something larger than yourself

The intention of this diagram is to give permission to care about privacy or democracy or liberty or whatever your internal sense of aesthetics is drawn to.

You don't get to know what you're fighting for

When you pursue your goals, you will discover things as you go. This may change your goal. You should _expect_ that. It's not a failure and it's not an argument to defer the pursuit of your goals. You will take the best shot you have today, and if another opportunity comes along that seems more likely to succeed, you'll take it.

I originally posted this content on my blog, then I thought people here might be interested in it. If this idea is well-received, I may create images for more posts in the series.


### Titan (the Wealthfront of active stock picking) - What's the catch?

12 hours 47 minutes ago
Published on August 6, 2020 1:06 AM GMT

Titan is a Y Combinator startup that launched in 2018 and aims to do for active investing what Wealthfront, Betterment and Vanguard have done for passive investing.

They pick a basket of 20 companies with 10B+ market cap which they believe are above-average long-term-focused investments relative to the whole S&P 500. Originally, their stock picking was done via a deterministic process of copying what a group of top hedge funds reported doing. I'm not sure if that's still the case. Their 2018-2020 performance has been 16.8%/yr (net of fees) compared to 10.0% for the S&P 500, with a higher Sharpe ratio (0.77 vs. 0.51).

My question is, what's the catch? Here's my guess: They're buying high-quality companies at high prices. That's how they can expect to have steady market-beating returns for a few years, until momentum reverses and/or once-in-a-few-years risks play out, at which point their P/E multiples will shrink and they'll plunge all the way down to cumulative market-matching returns, and worse after subtracting their fees.

Their "process" page claims they look for a Warren Buffett style "Margin of safety":

Valuation is important. We seek companies that are trading at a meaningful discount to our estimate of their long-term intrinsic value, with little to no risk of permanent capital impairment.

But I'm not convinced there's much substance to their use of this term. More (vague) info about how they pick stocks here.

### Measuring hardware overhang

August 5, 2020 - 22:59
Published on August 5, 2020 7:59 PM GMT

Summary

How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer "Deep Blue" to defeat world champion Garry Kasparov in 1997, today's Stockfish program achieves the same Elo level on a 100 MHz 486DX4 from 1994. In contrast, the scaling of neural network chess algorithms to slower hardware is worse (and more difficult to implement) compared to classical algorithms.
Similarly, future algorithms will likely be able to leverage today's hardware better by 2-3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality.

Introduction

Hardware overhang is a situation where sufficient compute is available, but the algorithms are suboptimal. It is relevant if building AGI has a large initial cost but cheaper running costs. Once built, the AGI might run on many comparably slow machines. That's a hardware overhang with a risk of exponential speed-up. This asymmetry exists for current neural networks: creating them requires orders of magnitude more compute than running them. On the other hand, Rich Sutton argues in The Bitter Lesson that the increase in computation is much more important (orders of magnitude) than clever algorithms (a factor of two or less). In the following, I will examine the current algorithmic state of the art, using chess as an example.

The example of chess

Chess is one of the most well-researched AI topics. It has a long history of algorithms going back to a 1956 program on the MANIAC. It is comparatively easy to measure the quality of a player by its Elo rating. As an instructive example, we examine the most symbolic event in computer chess. In 1997, the IBM supercomputer "Deep Blue" defeated the reigning world chess champion under tournament conditions. The win was taken as a sign that artificial intelligence was catching up to human intelligence.

By today's standards, Deep Blue used simple algorithms. Its strength came from computing power. It was an RS/6000-based system with 30 nodes, each with a 120 MHz CPU, plus 480 special-purpose VLSI chess chips in total. For comparison, a common computer at the time was the Intel Pentium II at 300 MHz.

Method: An experiment using a 2020 chess engine

We may wonder: How do modern (better) chess algorithms perform on slower hardware?
I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engines. I simulated 10k matches of SF8 against slower versions of itself and a series of older engines for calibration, using cutechess-cli. In these benchmarks, I varied the total number of nodes to be searched during each game. I kept the RAM constant (this may be unrealistic for very old machines, see below). Assuming a fixed thinking time per game, these node limits translate to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS, and finally MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter of up to 2 years. I estimate the error for the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results span orders of magnitude, so these errors are small in comparison (<10%).

Results

SF8 achieves Kasparov's 2850 Elo running on a 100 MHz 486, introduced in 1994, three years before the Kasparov-Deep Blue match. These Elo ratings refer to tournament conditions as in the 1997 IBM games. In other words, with today's algorithms, computers would have beaten the world chess champion already in 1994 on a contemporary desktop computer (not a supercomputer).

The full scaling relation is shown in the Figure. The gray line shows the Elo rating of Kasparov and Carlsen over time, hovering around 2800. The blue symbols indicate the common top engines at their times. The plot is linear in time, and logarithmic in compute. Consequently, Elo scales approximately with the square of compute. Finally, the red line shows the Elo rating of SF8 as a function of compute. Starting with the 2019 rating of ~3400 points, it falls below 3000 when reducing MIPS from 10^5 to a few times 10^3. This is a decrease of 2-3 orders of magnitude. It falls below 2850 Elo, Kasparov's level, at 68 MIPS.
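The Elo numbers above come from match results such as the 10k games described in the Method section. As a minimal sketch of how a match score maps to an Elo difference, assuming the standard logistic Elo model (the post does not spell out its formula, and tools such as cutechess-cli compute their own estimate with error bars), the following converts wins, draws and losses into a rating gap; the function names are illustrative:

```python
import math


def expected_score(elo_diff):
    """Expected score of player A vs. B under the logistic Elo model (elo_diff = R_A - R_B)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_diff_from_match(wins, draws, losses):
    """Point estimate of R_A - R_B obtained by inverting expected_score on the match score."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games  # a draw counts as half a point
    if not 0.0 < score < 1.0:
        raise ValueError("a 0% or 100% score has no finite Elo estimate")
    return -400.0 * math.log10(1.0 / score - 1.0)


# Hypothetical match: 600 wins, 300 draws, 100 losses -> score 0.75.
print(round(elo_diff_from_match(600, 300, 100), 1))  # 190.8
# A 150-point edge corresponds to an expected score of about 0.70.
print(round(expected_score(150), 2))  # 0.7
```

So a 150-point gap, like the SF8 to SF11 improvement mentioned in the Discussion, means the stronger engine is expected to score roughly 70% against the weaker one.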
For comparison with Kasparov's level at 68 MIPS, the 486 achieves 70 MIPS at 100 MHz. At its maximum, the hardware overhang amounts to slightly more than 10 years in time, or 2-3 orders of magnitude in compute. Going back very far (to the era of 386 PCs), the gap shrinks. This is understandable: on very slow hardware, you can't do very much, no matter what algorithm you have. The orange line shows the scaling relation of a neural-network-based chess engine, Leela Chess Zero (LC0), as discussed below.

Discussion

I originally ran these tests in 2019. Now (August 2020), SF8 has been superseded by SF11, with another ~150 Elo increase (at today's speed). It remains unclear how much improvement is left for future algorithms when scaled down to a 486-100. I strongly suspect, however, that we're running into diminishing returns here. There is only so much you can do on a "very slow" machine; improvements will never go to infinity. My guess is that the scaling will remain below three orders of magnitude.

Neural-network-based algorithms such as AlphaZero or Leela Chess Zero can outperform classical chess engines. However, they are less suited to this comparison. I find that their scaling is considerably worse, especially when not using GPUs. In other words, they do not perform well on the CPUs of slower machines. Depending on the size of the neural net, older machines may even be incapable of executing it. In principle, it would be very interesting to make this work: train a network on today's machines, and execute (run) it on a very old (slow) machine. But with current algorithms, the scaling is worse than SF8. As a reference point, LC0 achieves ~3000 Elo on a Pentium 200 under tournament conditions; SF8 reaches the same level with about half the compute.

Conclusion and future research proposals

Similarly, scaling of other NN algorithms to slower hardware (with less RAM etc.) should yield interesting insights.
While x86 CPUs have in principle been backwards-compatible since the 1980s, there are several breaking changes which make comparisons difficult. For example, the introduction of modern GPUs produces a speed gap when executing optimized algorithms on CPUs. Also, older 32-bit CPUs are capped at 4 GB of RAM, making execution of larger models impossible. Looking into the future, it appears likely that similar breaking changes will occur. One recent example is the introduction of TPUs and/or GPUs with large amounts of RAM. Without these, it may be impossible to execute certain algorithms. If AGI relies on similar (yet unknown) technologies, the hardware overhang is reduced until more such units are produced. In that case, the vast amount of old (existing) compute cannot be used.

I would be interested in researching this scaling relation for other problems outside of chess, such as voice and image recognition. Most problems are harder to measure and benchmark than chess. Will the scalings show a similar 2-3 orders of magnitude software overhang? Most certainly, many problems will show similar diminishing returns (or a cap) due to RAM restrictions and wait time. For example, you just can't run a self-driving car on an Atari, no matter how good the algorithms. I would be interested in researching the scaling for other AI and ML fields, possibly leading to an academic paper.

### Which COVID-19 serology (antibody) test is best?

August 5, 2020 - 21:54
Published on August 5, 2020 6:54 PM GMT

https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/eua-authorized-serology-test-performance

The FDA lists two dozen tests with different sensitivity and specificity characteristics, different sample sizes, and different 95% CIs. Some of these tests are clear losers (lower bound on sensitivity CI ~75%!), but are there any clear winners?
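The FDA table reports each test's sensitivity and specificity with a 95% confidence interval, and small validation samples are a main reason some of those intervals are so wide. A minimal sketch, assuming the Wilson score interval (the FDA page may use a different method, so these numbers will not necessarily match its table) and a hypothetical validation set; `wilson_interval` is an illustrative name:

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half_width, center + half_width


# Hypothetical: a test detects 29 of 30 known-positive samples (sensitivity ~96.7%) ...
print(wilson_interval(29, 30))    # roughly (0.833, 0.994)
# ... versus the same observed rate on ten times as many samples.
print(wilson_interval(290, 300))  # roughly (0.940, 0.982)
```

With the same observed sensitivity, the lower bound rises from about 83% to about 94% simply because the sample is larger, so comparing tests on point estimates alone can mislead.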
### The Golden Age of Data

August 5, 2020 - 20:51
Published on August 4, 2020 9:35 PM GMT

The more recent something is, the more data we have about it. This is true in all fields where collecting information over time is important, including astronomy, archaeology, geology, paleontology, the internet, and even our own memories. Why is this so universal? It's not as if 100-year-old rocks are any more complicated than 300-year-old rocks. There are three factors that create this effect.

The first is that information becomes less accessible over time. It degrades. In geology, old rocks are worn away by water or covered in deep sand. In astronomy, starlight shifts into undetectable wavelengths as it travels through space. In the brain, memories are reevaluated, modified, and deleted. The same volume of data is generated at all points in time, but older data has had more time to degrade, so there is less of it.

There are some fields where data generation isn't constant, and then the situation is more complicated. History is the most important example. Historical data is produced by humans, and thus the rate of historical data production follows an exponential curve that roughly correlates with the global population. As the population increases, there are more people to produce records, so more records are produced. This is the second factor.

The third factor is innovation. Humans have invented increasingly efficient methods of storing information throughout history. Papyrus is cheaper than clay tablets, and books are more compact than scrolls. These innovations are rare, but when they happen, they are revolutionary: historians today see a spike in information from the mid-15th century not because of population growth, but because of the printing press. The pace of data innovation was very slow before the Information Age, but it has since rocketed forward.
Over the past few decades, hard disk drives and SSDs have become more efficient at an astounding rate. We can fit all the English books ever published on a nice hard drive. In fact, we no longer even need to carry storage media; we can store information on servers thousands of miles away. These servers will also continue to become more efficient as bandwidth and upload speeds increase. The acceleration of the Information Age cannot be overstated: for most of history, humanity stored information in only a handful of ways. Today, a new method is created every month. It is likely that this trend has now overtaken degradation and population growth as the greatest contributor to the exponential curve of historical data. Not only is the population still rising, but now, for the first time in history, the average person produces more data year after year.

This innovation comes with a trade-off. As data storage becomes more compact, data becomes easier to destroy. Words etched in stone persist after exposure to fire and water, whereas paper disintegrates in either case. Hard drives stop working after five years... if they aren't destroyed by radiation first. All data degrades over time, but modern data degrades faster: it is less resilient. Ironically, while the Information Age produces more data than any other, it also loses more.

Modern information accessibility also contributes. The individual has better access to information now than ever before, but this freedom stands upon many fragile layers of technology. If any piece were to stop working (internet protocol, silicon chips, proprietary formats), all access could be lost, perhaps permanently.

In the 2020s and 2030s, the trend toward more digital and abstract forms of storage will probably continue. One might expect that future historians will find a superabundance of information from the 2020s and beyond, but this is not consistent with diminishing data resilience. The majority of our information is already digital.
Book publication may be stagnating, and the digital word may someday make the printed word obsolete. Since digital media degrade so quickly, what will be left to historians?

Imagine that physical books become mostly obsolete by, say, the 2070s, and a global cataclysm causes a collapse of civilization in 2080. How can survivors piece together pre-apocalyptic history? By the time they secure subsistence-level farming, most servers have succumbed to lack of climate control, and encryption might make their data inaccessible anyway. Solid state drives and flash drives, which stop working in only a few decades, are mostly useless. The more abstract the information, the less accessible it will be to future historians. Therefore, they might turn to books: those that aren't lost to fire will last for many decades, creating an abundant record of the past. This record tapers off during the mid-21st century and dries up completely afterwards, because the data created later was all stored in inaccessible electronic forms.

In such a scenario, future historians will have more information available from the late 20th and early 21st centuries than from any other era. These archivists might see the late 21st century as a second dark age, replete with data that is corrupted and inaccessible. The early 21st century, in stark contrast, would be a golden age of data. We are living in this golden age.

This also applies in less drastic scenarios. Today, something as simple as a poorly placed water leak in the right server room can render inaccessible or even destroy information crucial to a government or organization, to say nothing of a greater natural disaster. Much has been said about the possibility of a coronal mass ejection (CME) completely disabling the modern world; in fact, the US has made plans to defend against them. CMEs were previously so inconsequential to human affairs that they were hardly noticed until the 19th century.
All this is, of course, in addition to the better-known threats like cyberattacks or system-crippling bugs. A complex system has more ways to fail: why should we expect today's information to persist?

If this risk is real, how should the people of today respond? We could create redundant, low-tech records of the present for future generations. One method is to publish an encyclopedia of the advances and events of each decade and distribute copies throughout the world. Would this type of “data insurance” justify the cost? Just as we are fascinated with the past, future generations will surely be interested in our own era. Should they, too, be participants in the modern data economy?

### [AN #111]: The Circuits hypotheses for deep learning

August 5, 2020 - 20:40
Published on August 5, 2020 5:40 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

Thread: Circuits (Chris Olah et al) (summarized by Nicholas): The (currently incomplete) Circuits thread of articles builds a case around 3 main claims:

1. Neural network features - the activation values of hidden layers - are understandable.
2. Circuits - the weights connecting these features - are also understandable.
3. Universality - when training different models on different tasks, you will get analogous features.

Zoom In provides an overview of the argument.
The next two articles go into detail on particular sets of layers or neurons.

Claim 1: Neural network features - the activation values of hidden layers - are understandable.

They make seven arguments for this claim in Zoom In, which are expanded upon in subsequent articles.

1. Feature Visualization: By optimizing the input to maximize the activation of a particular neuron, they can obtain an image of what that neuron reacts most strongly to. They create and analyze these for all 1056 neurons in the first five layers of the InceptionV1 image classification model. While some of them were difficult to understand, they were able to classify and understand the purpose of most of the neurons. A simple example is that curve-detecting neurons produce feature visualizations of curves of a particular orientation. A more complex example is neurons detecting boundaries between high and low frequency, which are often helpful for separating foreground and background.

2. Dataset Examples: They also look at the examples in the dataset that maximize a particular neuron. These align with the feature visualizations: neurons with a particular curve in the feature visualization also fire strongest on dataset examples exhibiting that curve.

3. Synthetic Examples: They also create synthetic examples and find that neurons fire on the expected synthetically generated examples. For example, they generate synthetic curves with a wide range of orientations and curvatures. Curve detectors respond most strongly to the particular orientation and curvature that matches their feature visualizations and highest-activation dataset examples. Curve Detectors includes many more experiments and visualizations of curve detectors on the full distribution of curvature and orientation.

4. Joint Tuning: In the case of curve detectors, they rotate the maximal activation dataset examples and find that as the curves change in orientation, the corresponding curve detector neurons increase and decrease their activations in the expected pattern.

5. Feature Implementation: By looking at the circuit used to create a neuron, they can read off the algorithm for producing that feature. For example, curve detectors combine line detectors and earlier curve detectors in a way that makes them activate only on curves of a particular orientation and curvature.

6. Feature Use: In addition to looking at the inputs to the neuron, they also look at its outputs to see how the feature is used. For example, curves are frequently used in neurons that recognize circles and spirals.

7. Handwritten Circuits: After understanding existing curve detectors, they can implement their own curve detectors by hand-coding all the weights, and those reliably detect curves.

Claim 2: Circuits - the weights connecting the features - are also understandable.

They provide a number of examples of neurons, at both deep and shallow layers of the network, that are composed of earlier neurons via clear algorithms. As mentioned above, curve detectors are excited by earlier curve detectors in similar orientations and inhibited by ones of opposing orientations.

A large part of ImageNet is focused on distinguishing about a hundred breeds of dogs. A pose-invariant dog head and neck detector can be shown to be composed of two earlier detectors for dogs facing left and right. These in turn are constructed from earlier detectors of fur in a particular orientation.

They also describe circuits for dog head, car, boundary, fur, circle, and triangle detectors.

Claim 3: Universality - when training different models on different tasks, you will get analogous features.

This is the most speculative claim, and most of the articles so far have not addressed it directly.
However, the early layers of vision (edges, etc.) are believed to be common to many computer vision networks. They describe in detail the first five layers of InceptionV1 and categorize all of the neurons.

Layer 1 is the simplest: 85% of the neurons either detect simple edges or contrasts in colors. Layer 2 starts to be more varied and detects edges and color contrasts with some invariance to orientation, along with low-frequency patterns and multiple colors. In Layer 3, simple shapes and textures begin to emerge, such as lines, curves, and hatches, along with color contrasts that are more invariant to position and orientation than those in the earlier layers. Layer 4 has a much more diverse set of features: 25% are textures, but there are also detectors for curves, high-low frequency transitions, brightness gradients, black and white, fur, and eyes. Layer 5 continues the trend of having features with more variety and complexity. One example is boundary detectors, which combine a number of low-level features into something that can detect boundaries between objects.

They also highlight a few phenomena that are not yet fully understood:

Polysemantic neurons are neurons that respond to multiple unrelated inputs, such as parts of cars and parts of cats. What is particularly interesting is that these are often constructed from earlier features that are then spread out across multiple neurons in a later layer.

The combing phenomenon is that curve and line detectors across multiple models and datasets tend to be excited by small lines that are perpendicular to the curve. Potential hypotheses are that many curves in the data have them (e.g. spokes on a wheel), that they are helpful for fur detection, that they provide higher contrast between the orientation of the curve and the background, or that they are just a side effect rather than an intrinsically useful feature.
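Several of the Claim 1 arguments (feature visualization, synthetic examples, joint tuning, handwritten circuits) can be illustrated at toy scale. The sketch below is not the Circuits authors' code and uses no real network: it assumes a single hypothetical hand-coded linear "line detector" on an 11x11 patch in place of an InceptionV1 neuron. It runs projected gradient ascent on the input (the core of feature visualization) and measures the neuron's response to synthetic lines at several orientations (a tuning curve).

```python
import numpy as np

SIZE = 11  # hypothetical patch width in pixels


def line_image(angle: float, size: int = SIZE, width: float = 1.0) -> np.ndarray:
    """Synthetic stimulus: a soft bright line through the centre at `angle` radians,
    cut to a circular aperture so every orientation has roughly the same length."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    dist = np.abs(-x * np.sin(angle) + y * np.cos(angle))  # distance to the line
    aperture = (x ** 2 + y ** 2) <= half ** 2
    return np.exp(-dist ** 2 / (2 * width ** 2)) * aperture


def handwritten_detector(angle: float) -> np.ndarray:
    """Hand-coded weights for one toy 'line detector' neuron (zero-mean template)."""
    template = line_image(angle)
    return template - template.mean()


def activation(weights: np.ndarray, image: np.ndarray) -> float:
    """Pre-ReLU response of a single linear neuron."""
    return float(np.sum(weights * image))


def feature_visualization(weights, steps=200, lr=0.1, radius=1.0, seed=0):
    """Projected gradient ascent on the input image to maximize `activation`."""
    rng = np.random.default_rng(seed)
    image = rng.normal(size=weights.shape)
    image *= radius / np.linalg.norm(image)
    for _ in range(steps):
        grad = weights  # exact gradient for a linear neuron; real models need autodiff
        image = image + lr * grad
        image *= radius / np.linalg.norm(image)  # keep the input on a fixed-norm sphere
    return image


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))


preferred = np.pi / 4
w = handwritten_detector(preferred)

# Feature visualization: the optimized input should resemble the neuron's template.
viz = feature_visualization(w)

# Synthetic examples: responses to lines at 12 orientations form a tuning curve.
angles = np.linspace(0, np.pi, 12, endpoint=False)
responses = [activation(w, line_image(a)) for a in angles]
best_angle = float(angles[int(np.argmax(responses))])
```

The optimized input converges toward the detector's own template, and the tuning curve peaks at the hand-coded orientation. Real feature visualization adds regularizers and image parameterizations that this toy omits.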
Nicholas's opinion: Even from only the first three posts, I am largely convinced that most of a neural network can be understood in this way. The main open question to me is the scalability of this approach. As neural networks get more powerful, do they become more interpretable or less interpretable? Or does interpretability follow a more complex pattern, like the one suggested here (AN #72)? I’d love to see some quantitative metric of how interpretable a model is, and to see how that has changed for the vision state of the art each year. Another related topic I am very interested in is how these visualizations change over training. Do early layers develop first? Does finetuning affect some layers more than others? What happens to these features if the model is overfit?

The other thing I found very exciting about all of these posts is the visualization tools that were used (omitting these is a major shortcoming of this summary). For example, you can click on any of the neurons mentioned in the paper and it opens up a Microscope page that lets you see all the information on that feature and its circuits. I hope that as we get better and more generic tools for analyzing neural networks in this way, they will become very useful for debugging and improving neural network architectures.

TECHNICAL AI ALIGNMENT

MESA OPTIMIZATION

Inner Alignment: Explain like I'm 12 Edition (Rafael Harth) (summarized by Rohin): This post summarizes and makes accessible the paper Risks from Learned Optimization in Advanced Machine Learning Systems (AN #58).

LEARNING HUMAN INTENT

Online Bayesian Goal Inference for Boundedly-Rational Planning Agents (Tan Zhi-Xuan et al) (summarized by Rohin): Typical approaches to learning from demonstrations rely on assuming that the demonstrator is either optimal or noisily optimal. However, this is a pretty bad description of actual human reasoning: it is more accurate to say we are boundedly-rational planners.
In particular, it makes more sense to assume that our plans are computed by a noisy process. How might we capture this in an algorithm?

This paper models the demonstrator as using a bounded probabilistic A* search to find plans for achieving their goal. The planner is also randomized to account for the difficulty of planning: in particular, when choosing which state to “think about” next, it chooses randomly, with higher probability for more promising states (as opposed to vanilla A*, which always chooses the most promising state). The search may fail to find a plan that achieves the goal, in which case the demonstrator follows the actions of the most promising plan found by A* search until that is no longer possible (either an action leads to a state A* search hadn’t considered, or the demonstrator reaches the end of its partial plan). Thus, this algorithm can assign significant probability to plans that fail to reach the goal.

The experiments show that this feature allows their SIPS algorithm to infer goals even when the demonstrator fails to reach their goal. For example, if an agent needs to get two keys to unlock two doors to get a blue gem, but only manages to unlock the first door, the algorithm can still infer that the agent’s goal was to obtain the blue gem.

Rohin's opinion: I really like that this paper is engaging with the difficulty of dealing with systematically imperfect demonstrators, and it shows that it can do much better than Bayesian IRL for the domains they consider. It has previously been argued (AN #31) that in order to do better than the demonstrator, you need to have a model of how the demonstrator makes mistakes. In this work, that model is something like, “while running A* search, the demonstrator may fail to find all the states, or may find a suboptimal path before an optimal one”. This obviously isn’t exactly correct, but is hopefully moving in the right direction.
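The bounded, randomized A* planner from the SIPS summary can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the paper's SIPS implementation: it assumes an open grid world, a Manhattan-distance heuristic, softmax sampling over f-values for node selection, a fixed expansion budget, and a Monte Carlo estimate of how likely each candidate goal is to produce the observed first steps. All function names and parameters are hypothetical.

```python
import math
import random

Pos = tuple[int, int]


def neighbors(p: Pos, size: int) -> list[Pos]:
    x, y = p
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in steps if 0 <= a < size and 0 <= b < size]


def manhattan(a: Pos, b: Pos) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def bounded_random_astar(start: Pos, goal: Pos, size: int, budget: int,
                         temperature: float, rng: random.Random) -> list[Pos]:
    """A*-like search that samples the next node to expand (softmax over -f)
    and stops after `budget` expansions, returning a possibly partial plan."""
    g = {start: 0}
    parent = {start: None}
    frontier = [start]
    expanded = set()
    best = start  # most promising node reached so far (closest to the goal)
    for _ in range(budget):
        if not frontier:
            break
        f = [g[n] + manhattan(n, goal) for n in frontier]
        f_min = min(f)
        weights = [math.exp(-(fi - f_min) / temperature) for fi in f]
        node = frontier.pop(rng.choices(range(len(frontier)), weights=weights)[0])
        expanded.add(node)
        if manhattan(node, goal) < manhattan(best, goal):
            best = node
        if node == goal:
            break
        for nb in neighbors(node, size):
            if nb in expanded:
                continue  # no re-opening of expanded nodes, for simplicity
            if g[node] + 1 < g.get(nb, math.inf):
                g[nb] = g[node] + 1
                parent[nb] = node
                if nb not in frontier:
                    frontier.append(nb)
    path = []
    node = best
    while node is not None:  # walk parent links back to the start
        path.append(node)
        node = parent[node]
    return path[::-1]


def prefix_likelihood(observed, goal, size, budget, temperature, samples, seed):
    """Monte Carlo estimate of P(plan starts with `observed` | goal)."""
    rng = random.Random(seed)
    k = len(observed)
    hits = sum(
        bounded_random_astar(observed[0], goal, size, budget, temperature, rng)[:k] == observed
        for _ in range(samples)
    )
    return (hits + 1) / (samples + 2)  # add-one smoothing keeps every goal possible


def goal_posterior(observed, goals, size=6, budget=30, temperature=1.0, samples=200):
    """Uniform prior over goals times the estimated likelihood, normalized."""
    scores = {
        goal: prefix_likelihood(observed, goal, size, budget, temperature, samples, seed=i)
        for i, goal in enumerate(goals)
    }
    total = sum(scores.values())
    return {goal: s / total for goal, s in scores.items()}


# Three observed steps along the bottom row of a 6x6 grid; two candidate goals.
post = goal_posterior([(0, 0), (1, 0), (2, 0)], goals=[(5, 0), (0, 5)])
```

Because a small budget can end the search early, the planner sometimes returns a plan that stops short of its goal, which is the property the summary highlights. The paper's model additionally replans as the agent acts; this sketch only scores a single plan prefix.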
Note that in the domains that the paper evaluates on, the number of possible goals is fairly small (at most 20), presumably because of computational cost. However, even if we ignore computational cost, it’s not clear to me whether this would scale to a larger number of goals. Conceptually, this algorithm is looking for the most likely item out of the set of (optimal demonstrations and plausible suboptimal or failed demonstrations). When the number of goals is low, this set is relatively small, and the true answer will likely be the clear winner. However, once the number of goals is much larger, there may be multiple plausible answers. (This is similar to the fact that since neural nets encode many possible algorithms and there are multiple settings that optimize your objective, usually instead of getting the desired algorithm you get one that fails to transfer out of distribution.)

"Go west, young man!" - Preferences in (imperfect) maps (Stuart Armstrong) (summarized by Rohin): This post argues that by default, human preferences are strong views built upon poorly defined concepts, which may not have any coherent extrapolation in new situations. To put it another way, humans build mental maps of the world, and their preferences are defined on those maps, so in new situations where the map no longer reflects the world accurately, it is unclear how preferences should be extended. As a result, anyone interested in preference learning should find some incoherent moral intuition that other people hold and figure out how to make it coherent, as practice for the case we will face when our own values become incoherent in new situations.

Rohin's opinion: This seems right to me -- we can also see this by looking at the various paradoxes found in the philosophy of ethics, which involve taking everyday moral intuitions and finding extreme situations in which they conflict, where it is unclear which moral intuition should “win”.
FORECASTING

Amplified forecasting: What will Buck's informed prediction of compute used in the largest ML training run before 2030 be? (Ought) (summarized by Rohin): Ought has recently run experiments on how to amplify expert reasoning, to produce better answers than a time-limited expert could produce themselves. This experiment centers on the question of how much compute will be used in the largest ML training run before 2030. Rather than predict the actual answer, participants provided evidence and predicted what Buck’s posterior would be after reading through the comments and evidence.

Buck’s quick prior was an extrapolation of the trend identified in AI and Compute (AN #7), and suggested a median of around 10^13 petaflop/s-days. Commenters pointed out that the existing trend relied on a huge growth rate in the amount of money spent on compute, which would lead to implausible amounts of money by 2030 (a point previously made here (AN #15)). Buck’s updated posterior has a median of around 10^9 petaflop/s-days, with a mode of around 10^8 petaflop/s-days (estimated to be 3,600 times larger than AlphaStar).

Rohin's opinion: The updated posterior seems roughly right to me -- looking at the reasoning of the prize-winning comment, it seems like a $1 trillion training run in 2030 would be about 10^11 petaflop/s-days, which seems like the far end of the spectrum. The posterior assigns about 20% to it being even larger than this, which seems too high to me, but the numbers above do assume a “business-as-usual” world, and if you assign a significant probability to getting AGI before 2030, then you probably should have a non-trivial probability assigned to extreme outcomes.
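The compute figures above are easier to compare once the unit is unpacked: one petaflop/s-day is 10^15 floating-point operations per second sustained for 86,400 seconds. The sketch below converts the quoted numbers and derives two values that the quoted ratios imply: the AlphaStar compute implied by "3,600 times", and the price per petaflop/s-day implied by "$1 trillion buys about 10^11". These derived values are back-of-the-envelope arithmetic on the summary's figures, not independent estimates.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
# One petaflop/s-day: 1e15 floating-point operations per second, sustained for one day.
FLOP_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY


def pfs_days_to_flop(pfs_days: float) -> float:
    return pfs_days * FLOP_PER_PFS_DAY


# Figures quoted in the summary (petaflop/s-days unless noted).
prior_median = 1e13
posterior_median = 1e9
posterior_mode = 1e8
mode_over_alphastar = 3_600
trillion_dollar_run = 1e11  # compute bought by a $1 trillion (1e12 USD) run

# How far the median moved, in orders of magnitude.
median_shift_oom = math.log10(prior_median / posterior_median)

# Values implied by the quoted ratios, not independent estimates.
implied_alphastar_pfs_days = posterior_mode / mode_over_alphastar
implied_usd_per_pfs_day = 1e12 / trillion_dollar_run

for name, value in [
    ("FLOP per petaflop/s-day", FLOP_PER_PFS_DAY),
    ("posterior median in FLOP", pfs_days_to_flop(posterior_median)),
    ("median shift (orders of magnitude)", median_shift_oom),
    ("implied AlphaStar compute (pfs-days)", implied_alphastar_pfs_days),
    ("implied USD per pfs-day", implied_usd_per_pfs_day),
]:
    print(f"{name}: {value:.3g}")
```

The update moved the median down by four orders of magnitude, and the $1 trillion comparison implies roughly $10 per petaflop/s-day.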


Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns (Andreas Stuhlmüller) (summarized by Rohin): Ought ran a second competition to amplify my forecast on a question of my choosing. I ended up asking “When will a majority of top AGI researchers agree with safety concerns?”, specified in more detail in the post. Notably, I require the researchers to understand the concerns that I think the AI safety community has converged on, as opposed to simply saying that they are concerned about safety. I chose the question because it seems like any plan to mitigate AI risk probably requires consensus amongst at least AI researchers that AI risk is a real concern. (More details in this comment.)

My model is that this will be caused primarily by compelling demonstrations of risk (e.g. warning shots), and these will be easier to do as AI systems become more capable. So it depends a lot on models of progress; I used a median of 20 years until “human-level reasoning”. Given that we’ll probably get compelling demonstrations before then, but also it can take time for consensus to build, I also estimated a median of around 20 years for consensus on safety concerns, and then made a vaguely lognormal prior with that median. (I also estimated a 25% chance that it never happens, e.g. due to a global catastrophe that prevents more AI research, or because we build an AGI and see it isn’t risky, etc.)
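The prior described above combines a lognormal timing distribution with a separate chance that consensus never happens. The sketch below shows one way to write that down. It assumes the 20-year median applies conditional on consensus happening, and it uses a hypothetical log-scale spread of `SIGMA = 1.0`, because the text does not give one. Only the 20-year median and the 25% "never" figure come from the text.

```python
import math

MEDIAN_YEARS = 20.0  # from the text (assumed to be conditional on consensus happening)
P_NEVER = 0.25       # from the text
SIGMA = 1.0          # hypothetical: spread of log(years), not stated in the text


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_consensus_within(years: float, median: float = MEDIAN_YEARS,
                       sigma: float = SIGMA, p_never: float = P_NEVER) -> float:
    """P(consensus within `years`) under a lognormal timing prior mixed with a 'never' outcome."""
    if years <= 0:
        return 0.0
    z = (math.log(years) - math.log(median)) / sigma
    return (1.0 - p_never) * normal_cdf(z)


for horizon in (5, 10, 20, 40, 80):
    print(horizon, round(p_consensus_within(horizon), 3))
```

Under these assumptions the unconditional probability of consensus within 20 years is 0.75 x 0.5 = 0.375, and it never exceeds 0.75 however long the horizon.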

Most of the commenters were more optimistic than I was, thinking that we might already have consensus (given that I restricted it to AGI researchers), which led to several small updates towards optimism. One commenter pointed out that in practice, concern about AI risk tends to be concentrated amongst RL researchers, who are a tiny fraction of all AI researchers, and probably a tiny fraction of AGI researchers as well (given that natural language processing and representation learning seem likely to be relevant to AGI). This led to a single medium-sized update towards pessimism. Overall these washed out, and my posterior was a bit more optimistic than my prior, and was higher entropy (i.e. more uncertain).

AI STRATEGY AND POLICY

Overcoming Barriers to Cross-cultural Cooperation in AI Ethics and Governance (Seán S. ÓhÉigeartaigh et al) (summarized by Rohin): This paper argues that it is important that AI ethics and governance is cross-cultural, and provides a few recommendations towards this goal:

1. Develop AI ethics and governance research agendas requiring cross-cultural cooperation

2. Translate key papers and reports

3. Alternate continents for major AI research conferences and ethics and governance conferences

4. Establish joint and/or exchange programmes for PhD students and postdocs

Read more: Longer summary from MAIEI

How Will National Security Considerations Affect Antitrust Decisions in AI? An Examination of Historical Precedents (Cullen O'Keefe) (summarized by Rohin): This paper looks at whether the US has historically used antitrust law to advance unrelated national security objectives, and concludes that this is rare and that, especially recently, economic considerations tend to be given more weight than national security considerations.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
Subscribe here:


Discuss

### Can Social Dynamics Explain Conjunction Fallacy Experimental Results?

August 5, 2020 - 11:58
Published on August 5, 2020 8:50 AM GMT

Is there any conjunction fallacy research which addresses the alternative hypothesis that the observed results are mainly due to social dynamics?

Most people spend most of their time thinking in terms of gaining or losing social status, not in terms of reason. They care more about their place in social status hierarchies than about logic. They have strategies for dealing with communication that have more to do with getting along with people than with getting questions technically right. They look for the social meaning in communications. E.g. people normally try to give – and expect to receive – useful, relevant, reasonable info that is safe to make socially normal assumptions about.

Suppose you knew Linda in college. A decade later, you run into another college friend, John, who still knows Linda. You ask what she’s up to. John says Linda is a bank teller, doesn’t give additional info, and changes the subject. You take this to mean that there isn’t more positive info. You and John both see activism positively and know that activism was one of the main ways Linda stood out. This conversation suggests to you that she stopped doing activism. Omitting info isn’t neutral in real world conversations. People mentally model the people they speak with and consider why the person said and omitted things.

In Bayesian terms, you got two pieces of info from John’s statement. Roughly: 1) Linda is a bank teller. 2) John thinks that Linda being a bank teller is key info to provide and chose not to provide other info. That second piece of info can affect people’s answers in psychology research.
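The distinction between the literal statement and what it signals can be made concrete with a toy calculation. The numbers below are hypothetical, chosen only for illustration; they are not from any study. Read literally, "bank teller" can never be less probable than "bank teller and still an activist". If respondents instead hear the option as "just a bank teller", meaning the speaker would have mentioned activism if it were still true, they are comparing a different pair of events, and ranking the conjunction higher can be correct.

```python
from fractions import Fraction


def compare_readings(p_teller: Fraction, p_activist_given_teller: Fraction) -> dict:
    """Probabilities of the answer options under a literal and a pragmatic reading."""
    both = p_teller * p_activist_given_teller  # "bank teller and still an activist"
    literal_teller = p_teller                  # "bank teller", read literally
    # Pragmatic reading: "just a bank teller", i.e. bank teller and no longer an activist.
    pragmatic_teller = p_teller * (1 - p_activist_given_teller)
    return {"both": both, "literal": literal_teller, "pragmatic": pragmatic_teller}


# Hypothetical numbers: a 5% chance Linda is a teller, and given her history,
# an 80% chance a teller Linda is still an activist.
r = compare_readings(Fraction(5, 100), Fraction(80, 100))
print(r)
```

Under the literal reading the conjunction rule holds (1/20 versus 1/25); under the pragmatic reading the "teller only" option drops to 1/100, below the conjunction. Whether experimental subjects actually use the pragmatic reading is the empirical question the post asks about.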

So, is there any research which rules out social dynamics explanations for conjunction fallacy experimental results?

Discuss

### Tools for keeping focused

August 5, 2020 - 05:10
Published on August 5, 2020 2:10 AM GMT

Once I realized that my attention was even scarcer than my time, I became an anti-distraction fanatic. During my weekly reviews I methodically went through my past week, figured out what had been distracting me, and tried to eliminate it or replace it with something less distracting.

Over time, this has led me to find lots of tools (and ways of using my tools) that help me stay more focused. Here are some of the things I’ve started doing:

Anxious yet?
• I aggressively disable notifications and badges so that I don’t mindlessly open distracting apps. If you’re into eliminating distractions you’ve probably already done this. But if you haven’t, it’s by far the most important thing you can do to improve your focus, so I’m putting it first anyway.

I have a zero-tolerance notification policy: if an app interrupts me, I ask myself whether the interruption was valuable, and if not, the app doesn’t get to notify me anymore. This has weeded out pretty much everything except for inboxes (phone, texts, reminders) and apps with a human on the other end (ride sharing, delivery).

A special dishonorable mention goes to Slack, which can easily suck away a quarter of your time without you noticing. If you use Slack and you haven’t disabled the “unread messages” badge, stop reading this post and do it now. (Consider setting a two-minute timer to remind you to quit in case you get distracted by checking your unreads, like I did while taking the screenshot above.)

• On the subject of Slack, I try to keep it closed as much as possible and check it in batches a few times a day. Not all workplace Slack cultures allow this, but if yours does, I highly, highly recommend it. Doing this led to my biggest single quantifiable productivity improvement.1

• I only check my email once per day.

Gmail filters send important email to a label called “Temp Inbox” and the rest to a label called “Unimportant.” I check Temp Inbox each evening and Unimportant once a week. Since incoming email doesn’t go to the inbox, I can open Gmail to compose or search messages without getting distracted by unread messages.

(I use a piece of Google Apps Script for this, but I think the Gmail UI has improved recently so that you can now do something similar with filters and it’ll have decent ergonomics.)
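The routing-and-batching policy above can be modeled independently of any mail provider. The sketch below is not the author's Apps Script and does not call the Gmail API. It is a plain-Python model of the policy, with a hypothetical sender allow-list standing in for whatever "important" means (the post does not say), and Sunday assumed as the weekly check day.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical rule: the post does not say how "important" is decided.
IMPORTANT_SENDERS = {"boss@example.com", "family@example.com"}


@dataclass
class Message:
    sender: str
    subject: str


def route(msg: Message) -> str:
    """Label a filter would apply; nothing is ever left in the main inbox."""
    return "Temp Inbox" if msg.sender in IMPORTANT_SENDERS else "Unimportant"


def labels_to_check(day: date, weekly_check_weekday: int = 6) -> list[str]:
    """Temp Inbox every evening; Unimportant once a week (weekday 6 = Sunday, an assumption)."""
    labels = ["Temp Inbox"]
    if day.weekday() == weekly_check_weekday:
        labels.append("Unimportant")
    return labels


print(route(Message("boss@example.com", "Q3 plan")))
print(labels_to_check(date(2020, 8, 9)))
```

Because incoming mail is always labeled and never left in the inbox, opening the mail client to search or compose shows nothing unread.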

• I have a second monitor that always shows a Complice window with the task I’m currently working on. (Complice is my favorite app for making lists of what I want to do today.) This helps me recover quickly from unintentionally going down rabbit holes.

• I use Focus to break my habit of mindlessly checking sites.

For websites that are habit-forming but still feel useful on net, like Twitter or Hacker News, Focus lets me restrict my usage to certain times of day. It’s the only website blocker I’ve used that lets me fully automate blocking things in the exact way I want.

It’s especially important for me to use a website blocker that’s fully automatic, because the times when I need it most are exactly those times at which I have the least willpower to do any manual steps!
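A time-of-day blocking policy is fully automatic because the decision depends only on the clock, with no manual step. The sketch below models such a policy. It is not Focus's configuration format or API, and the sites and windows are hypothetical examples.

```python
from datetime import time

# Hypothetical schedule: host -> windows during which it is allowed.
# Hosts that are not listed are never blocked.
ALLOWED_WINDOWS = {
    "twitter.com": [(time(12, 0), time(12, 30)), (time(18, 0), time(19, 0))],
    "news.ycombinator.com": [(time(18, 0), time(19, 0))],
}


def is_blocked(host: str, now: time) -> bool:
    """Block a listed host unless `now` falls inside one of its allowed windows."""
    windows = ALLOWED_WINDOWS.get(host)
    if windows is None:
        return False
    # Each window includes its start time and excludes its end time.
    return not any(start <= now < end for start, end in windows)


print(is_blocked("twitter.com", time(9, 0)))
```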

• I also block most websites on my phone (using the iOS built-in content blocking in whitelist mode). Unfortunately, this works less well than Focus since I sometimes want to disable it and then forget to re-enable it.

• Using RSS means I’m in control of my own feed and don’t need to visit an adversarially distracting site like Facebook to get new reading material.

It’s also helpful that the Kindle delivery comes once per day at a predictable time, so I don’t have an urge to check over and over again for new content.

• I block distracting parts of websites with uBlock Origin’s amazing “element blocker,” which lets you select and remove any part of the page. As a fan of minimalism in web design, I really enjoy being able to adversarially enforce it on any rogue webpage.

For instance, I use it to block the clickbaity “hot network questions” sidebar on Stack Overflow, which otherwise frequently distracts me, as well as the useless notifications and left sidebar.

• I use Witch, an augmented window switcher, to avoid accidentally switching to the wrong window. Witch can display separate app windows separately, only display windows from the current workspace, and supports text search on window titles, so that you can search directly for the window you want without having to skip over other windows.

I found Witch slightly unintuitive to configure, so if you’re curious, here are screenshots of my configs: “actions” tab, “advanced” tab.

• I’ve hidden the apps on my phone. My homescreen looks like the figure on the right.

The two folders are “badges” (for the few apps that are allowed to notify me) and “everything else.” If I want an app I type its name in the search bar, which forces me to be intentional (and is often faster than hunting through pages of apps anyway). The app in the bottom bar is the Kindle app.

• I use native versions of web apps instead of keeping them open in a browser tab. This allows me to use the app without getting sucked into my browser. There are two ways of doing this:

• Some apps (e.g. Roam) are already “installable,” meaning that they tell your browser how to turn them into a native app. (Sadly, Chrome seems to be the only browser that can install installable apps on desktop right now.)

• For the rest (e.g., oddly, most Google web apps), I use a command-line tool called nativefier to turn them into desktop apps.

• In fact, right now I’m trying out a “no browser tabs at all” rule. I noticed that I’d sometimes get distracted by tabs that I’d opened a long time ago and should have closed, but forgot about. So I installed an extension to limit each window to a single tab and changed Firefox to open links in a new window by default. (This has lots of synergy with a powerful window switcher like Witch.)

Each of these is small on its own, but like many of the things I work on during weekly reviews, they’ve added up and compounded to make it much easier for me to spend my attention in ways I want.

1. If your workplace culture doesn’t allow you to keep Slack closed, because it requires quick Slack responses, this is a bad sign. ↩︎

Discuss

### My paper was signalling the whole time - Robin Hanson wins again

August 5, 2020 - 00:13
Published on August 4, 2020 9:13 PM GMT

robin hanson meme https://imgur.com/1RDpvEc

I recently wrote a paper on how the Jordanian monarchy decided who to give water to and who to take water away from. As I near completion, I am realizing that signalling theory from Robin Hanson gives a pretty compelling explanation of my results, probably better than my explanation. I will give a brief summary.

So in the 1990s and 2000s, poor neighbourhoods of East Amman periodically had water shortages. Whole neighbourhoods would go without pumped water for a month, and people would riot. The Government of Jordan (GoJ) does not like riots and was motivated to stop them.

Jordan has two relevant water sources it could use to make up the shortfall of 100-150 million cubic meters (MCM). The Northern Highlands are close to the capital, Amman, and have a few profitable farms and a lot of smaller, unprofitable "prestige" farms owned by Jordanians. The southern desert is a good 600 km away and has four large profitable farms. These farms are owned by rich, politically connected Jordanian families and operated almost entirely by Egyptian migrant laborers.

For twenty years, the World Bank suggested taxing the farms in the Northern Highlands to close the unprofitable ones, then redirecting that water to the capital. Since Amman sits on the Northern Highlands, the costs of transporting the water are trivial.

Instead, the Jordanians paid about a billion dollars to build a massive pipeline between the southern farms and Amman, then shut those farms down. They don't publish the data I would need to compare how much more expensive the Disi pipeline solution was, but capital costs were about a billion USD, and the energy costs are likely double those of other sources (on the order of 1 USD per cubic meter). The water sector's cost recovery ratio dropped by 30% the year they finished the project, and the gap was financed by public debt until a 2018 fee increase forced by the World Bank.

The Jordanians have justified their decision with two arguments. The first is that closing the farms in the north would have required negotiating with hundreds of farmers with diverse motives and finances, which was beyond the government's capacity. In the south, they had only to negotiate with a small number of elites. This argument is strong.

The second argument is that closing the northern farms would have created unemployment, which would have created instability. They worried that farmers would lose their jobs and head to Amman, burdening the social security system. The southern farms, by contrast, are worked almost entirely by Egyptian laborers.

This argument is really weak if you think about it. First, these unprofitable farms use an average of about 200,000 m^3 (0.2 MCM) each, so to make up the difference they would have had to close about 500 farms. The northern farms also mostly hire Egyptians, so closing them would cost something like 2-10 Jordanian jobs per farm. So they spent a billion dollars to save at most 5,000 jobs: assuming 10 Jordanian jobs lost per farm closed, they paid 200,000 USD per job saved, in a country with a GDP per capita of 5,000 USD. Assuming they protected those jobs for ten years (the aquifers will collapse eventually anyway), they could have just paid the farmers the money and saved 75%. This is a conservative estimate, since many of those farms have no Jordanians on them.
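The cost-per-job argument is simple enough to check, and to rerun with other assumptions. The sketch below reproduces it. It takes the per-farm water use as 0.2 MCM (200,000 m^3), which is what spreading the 100 MCM shortfall across 500 farms implies. Following the 75% figure, it assumes the alternative is a cash transfer equal to GDP per capita for each protected job, paid for ten years. All parameter names are illustrative.

```python
def cost_per_job_saved(shortfall_m3: int = 100_000_000,  # low end of the 100-150 MCM shortfall
                       water_per_farm_m3: int = 200_000,   # 0.2 MCM per unprofitable farm
                       jobs_per_farm: int = 10,            # upper end of 2-10 Jordanian jobs
                       project_cost_usd: int = 1_000_000_000,
                       gdp_per_capita_usd: int = 5_000,
                       years_protected: int = 10) -> dict:
    """Reproduce the post's back-of-the-envelope comparison."""
    farms = shortfall_m3 // water_per_farm_m3
    jobs = farms * jobs_per_farm
    per_job = project_cost_usd / jobs
    # Alternative: pay each protected worker GDP per capita for the same period.
    transfer = gdp_per_capita_usd * years_protected
    return {
        "farms": farms,
        "jobs": jobs,
        "cost_per_job_usd": per_job,
        "transfer_per_job_usd": transfer,
        "savings_fraction": 1 - transfer / per_job,
    }


print(cost_per_job_saved())
print(cost_per_job_saved(jobs_per_farm=2)["cost_per_job_usd"])
```

With the low end of the 2-10 range (2 Jordanian jobs per farm), the pipeline's cost rises to a million USD per job saved, which is why the post calls its estimate conservative.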

If you had a billion dollars to spend on Jordanian unemployment, paying for substitute water to keep unprofitable farms afloat is the last thing you would do. Honestly, you could have just cut the Egyptian farm worker visa program and killed two birds with one stone, increasing Jordanian employment and cutting the implicit subsidy to the farms while spending 0 dollars. It is possible the Jordanians just didn't think of this, although the World Bank never got tired of pointing it out.

I slightly prefer the explanation that the GoJ was signalling loyalty to these social groups. The farmers in the Northern Highlands are inside the ruling coalition, and the royal court has to signal that they get special privileges. And failing to supply East Amman would send the masses a clear signal that the King doesn't care about their lives, which the court cannot afford to do. If politics is really about loyalty signalling (not unemployment), the GoJ's actions are more instrumentally rational.

Jordanians also do perceive the water transfers as loyalty signals, based on interviews conducted by anthropologists in donor areas.

But the paper is almost accepted so no time to change it now.

Discuss

### Infinite Data/Compute Arguments in Alignment

August 4, 2020 - 23:21
Published on August 4, 2020 8:21 PM GMT

This is a reference post. It explains a fairly standard class of arguments, and is intended to be the opposite of novel; I just want a standard explanation to link to when invoking these arguments.

When planning or problem-solving, we focus on the hard subproblems. If I’m planning a road trip from New York City to Los Angeles, I’m mostly going to worry about which roads are fastest or prettiest, not about finding gas stations. Gas stations are abundant, so that subproblem is easy and I don’t worry about it until harder parts of the plan are worked out. On the other hand, if I were driving an electric car, then the locations of charging stations would be much more central to my trip-planning. In general, the hard subproblems have the most influence on the high-level shape of our solution, because solving them eats up the most degrees of freedom.

In the context of AI alignment, which subproblems are hard and which are easy?

Here’s one class of arguments: compute capacity and data capacity are both growing rapidly over time, so it makes sense to treat those as “cheap” - i.e. anything which can be solved by throwing more compute/data at it is easy. The hard subproblems, then, are those which are still hard even with arbitrarily large amounts of compute and data.

In particular, with arbitrary compute and data, we basically know how to get best-possible predictive power on a given data set: Bayesian updates on low-level physics models or, more generally, approximations of Solomonoff induction. So we’ll also assume predictive power is “cheap” - i.e. anything which can be solved by more predictive power is easy.
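To make "Bayesian updates, or approximations of Solomonoff induction" concrete, here is a tiny computable stand-in. Real Solomonoff induction mixes over all programs and is uncomputable. This sketch instead mixes over a hypothetical finite class: repeating bit patterns of period up to 4, with a prior that penalizes longer patterns, updated by Bayes' rule under a small noise rate. It shows the shape of the idea (prior favoring short descriptions, posterior concentrating on the simplest consistent hypothesis), not a practical predictor.

```python
from itertools import product


def hypotheses(max_period: int = 4) -> list[tuple[str, float]]:
    """Every repeating bit pattern up to `max_period`, with prior proportional to 2**(-2*period).
    The prior is a crude stand-in for 'shorter programs are more likely'."""
    raw = [("".join(bits), 2.0 ** (-2 * k))
           for k in range(1, max_period + 1)
           for bits in product("01", repeat=k)]
    total = sum(w for _, w in raw)
    return [(pattern, w / total) for pattern, w in raw]


def posterior(data: str, hyps, noise: float = 0.01) -> list[tuple[str, float]]:
    """Bayes' rule: each observed bit agrees with the pattern with probability 1 - noise."""
    weighted = []
    for pattern, prior in hyps:
        like = 1.0
        for i, bit in enumerate(data):
            like *= (1 - noise) if pattern[i % len(pattern)] == bit else noise
        weighted.append((pattern, prior * like))
    z = sum(w for _, w in weighted)
    return [(pattern, w / z) for pattern, w in weighted]


def predict_next(data: str, max_period: int = 4, noise: float = 0.01) -> float:
    """Mixture prediction: probability that the next bit is '1'."""
    i = len(data)
    return sum(w * ((1 - noise) if pattern[i % len(pattern)] == "1" else noise)
               for pattern, w in posterior(data, hypotheses(max_period), noise))


print(predict_next("0110110110"))
```

After ten bits of "011" repeated, nearly all posterior mass sits on the pattern "011", so the predicted probability of a 1 approaches the 0.99 ceiling set by the noise rate. Adding more data or compute only sharpens this, which is the sense in which prediction becomes "cheap".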

This is also reasonable in machine learning practice - once a problem is reduced to predictive power on some dataset, we can throw algorithms at it until it’s solved. The hard part - as many data scientists will attest - is reducing our real objective to a prediction problem and collecting the necessary data. It’s rare to find a client with a problem where all we need is predictive power and the necessary data is just sitting there.

(We could also view this as an interface argument: “predictive problems” are a standard interface, with libraries, tools, algorithms, theory and specialists all set up to handle them. As in many other areas, setting up our actual problem to fit that interface while still consistently doing what we want is the hard/expensive part.)

The upshot of all this: in order to identify alignment subproblems which are likely to be hard, it’s useful to ask what would go wrong if the world-modelling parts of our system just do Bayesian updates on low-level physics models or use approximations of Solomonoff induction. We don’t ask this because we actually expect to use such algorithms, but rather because we expect that the failure modes which still appear under such assumptions are the hard failure modes.


### Property as Coordination Minimization

August 4, 2020 - 22:24
Published on August 4, 2020 7:24 PM GMT

A friend recently noted that they were in favor of private property, but the best defense they had to link was instead a defense of finance. So I thought I’d give it a try. In light of a distinction people often draw between ‘private property’ and ‘personal property’, I’m going to work up to defending ‘impersonal private property’, starting with intuitions and examples grounded in personal property.

First, what do we even mean by property? Well, there are material things that are sometimes scarce or rivalrous. If I eat a sandwich, you can’t also eat it; if I sleep in a bed, you can’t also sleep in it at the same time; if an acre of land is rented out for agricultural use, only one of us can collect the rent check. While this is sometimes inherent to reality, we can also create it with rules; if I invent a cool new sandwich idea, society could decide that I have the right to decide who can and can’t make that sandwich for some amount of time. [The first patents were for restaurants, giving them exclusive rights for a year to new dishes they invented.]

Property, then, is the societally recognized right to decide who can or can’t do three different things: ‘use’, or deriving some personal benefit from the thing; ‘fruit’, or extracting some value from the thing without deeply changing it; and ‘abuse’, or making changes or transferring ownership of the thing. For example, consider an apple tree; climbing the tree is an example of use, picking the apples is an example of fruit, and chopping it down to make a chair out of it is an example of abuse.

In this view, the benefit of property is fundamentally preventative; if I own an apple tree, I can prevent other people from climbing the tree, or picking the apples, or chopping down the tree, even if they want to. Hence the slogan that ‘property is theft’; without it, you could do any of those things to my tree, and with property, you can’t.

Interestingly, this view also makes NIMBYism seem natural instead of unnatural. If I own a house, I can use that ownership to prevent things from happening to the house that I don’t want. But do I just own the dirt and wood, or do I also own the ambient level of noise? The fragrance of the air? The view? The price? We can make our conception of property too large or too small, and can start drawing overlapping property claims, where I think my ownership of my house means no loud music on my property at 6am, whereas my neighbor thinks that his ownership of his house means he can practice the drums whenever he likes. [This sort of coordination is best done at a higher level, through ownership of the neighborhood or zoning district or city or whatever.]

This brings up the idea of ‘stakeholders’ and ‘decision-makers’. Stakeholders are those impacted by the outcome, and decision-makers are those who choose the outcome. Often, we get more desirable or just results by aligning the decision-makers and stakeholders, but this comes at additional coordination costs.

Suppose I’m ordering dinner for a group of people; there’s both the coordination question of which restaurant to order from, and the coordination question of what dishes to order. Sometimes it works for me to just pick a restaurant and dishes; sometimes it works for me to pick a restaurant, and then pass around the order for everyone to add their preferred dish; sometimes it works to jointly come to a decision on what restaurant to order from, and then everyone selects a dish; sometimes it works for everyone to manage their own order, including whether or not they should join in on an order with anyone else. That list was ordered roughly in ‘decreasing coordination cost’ order, with a corresponding increase in taste-satisfaction, but perhaps not net satisfaction, as smaller orders are more expensive, or the additional taste benefits weren’t worth the additional costs of having to think about it. The size of the group has a huge impact on how much the coordination costs matter; coming to alignment on a restaurant for three people and thirty people are very different affairs.

Why have personal property, i.e. your own sandwich, toothbrush, clothes, house, or vehicle? A boring but essential reason is physical; a toothbrush used by Alice becomes much less valuable to everyone but Alice after that use, and this sometimes applies more broadly. The main reason, in my view, is that personal property is made much more useful by only having one decision-maker, and thus no coordination cost. Rather than having to petition the commune for a day’s use of a red shirt, I can simply decide to wear my red shirt today. I can make solid plans around decisions that I’m the only major input to. This will sometimes lead to socially suboptimal decisions if coordination were free--maybe I look really bad in red, and a wise commune would give me blue instead--but given that coordination is not free, this is often our best available option.

Why have impersonal property, i.e. a landlord who rents out houses, a company that owns factories, massive tracts of land owned by the same farm, a bank that chooses which loans to grant and which to deny? The same basic reason, I claim; the landlord can make decisions about the houses that they own without having to consult anyone else, and this means decisions can be made faster and more cheaply. Many different landlords can make many different decisions, whereas one Housing Bureau will either make one decision for everyone, or make unequal decisions in a corrupt way. Or if we had a property-less direct democracy, where all citizens voted on all decisions, there would be no time left over to do anything else but vote!

Many of the problems we have now, I claim, are not caused by too much property but by too many decision-makers, or in this view, too little property. For example, I live in Berkeley, which has a housing shortage, and also incomplete individual property rights. By that I mean if I buy a house and want to tear it down and build a larger one instead, I need the city’s permission to do so, and the city will require me to allow ‘public comment’ from my neighbors and passersby on the desirability of such a change, and generally require various other permits and restrictions. If it were solely for the safety of the inhabitants, this could be handled by the building code, but the public comment isn’t in case my neighbors happen to be structural engineers; it’s because housing in Berkeley does not come with the full right of ‘abuse’, that right is instead owned by the neighborhood and city, and only some stakeholders get a say; the people who would rent the additional floors I add to the house generally don't comment at the public meeting, whereas the retiree who would have to deal with more cars on the road or a blocked view of the Bay does.

Indeed, It’s Time to Build is, in many ways, a complaint about the Vetocracy of our times. Property, even impersonal property, even the existence of billionaires (even if you’ll never be one), is good because it lowers coordination costs, allowing things to happen more efficiently.


### Interpretability in ML: A Broad Overview

August 4, 2020 - 22:03
Published on August 4, 2020 7:03 PM GMT

(Reposting because I think a GreaterWrong bug on submission made this post invisible for a while last week so I'm trying again on LW.)

This blog post is an overview of ways to think about machine learning interpretability; it covers some recent research techniques as well as directions for future research. This is an updated version of this post from a few weeks ago. I've now added code, examples, and some pictures.

What Are Existing Overviews?

Many of these ideas are based heavily on Zach Lipton's Mythos of Model Interpretability, which I think is the best paper for understanding the different definitions of interpretability. For a deeper dive into specific techniques, I recommend A Survey Of Methods For Explaining Black Box Models, which covers a wide variety of approaches for many different ML models as well as model-agnostic approaches. For neural nets specifically, Explainable Deep Learning: A Field Guide for the Uninitiated provides an in-depth read. For other conceptual surveys of the field, see Definitions, methods, and applications in interpretable machine learning and Explainable Machine Learning for Scientific Insights and Discoveries. The Explainable Machine Learning paper in particular is quite nice because it gives a hierarchy of increasingly interpretable models across several domains and use cases.

(Shout-out to Connected Papers which made navigating the paper landscape for interpretability very bearable.)

As always, you can find code used to generate the images here on GitHub.

In the rest of this post, we'll go over many ways to formalize what "interpretability" means. Broadly, interpretability focuses on the how. It's focused on getting some notion of an explanation for the decisions made by our models. Below, each section is operationalized by a concrete question we can ask of our ML model using a specific definition of interpretability. Before that, though, if you're new to all this, I'll briefly explain why we might care about interpretability at all.

Firstly, interpretability in ML is useful because it can aid in trust. As humans, we may be reluctant to rely on ML models for certain critical tasks, e.g. medical diagnosis, unless we know "how they work". There's often a fear of unknown unknowns when trusting in something opaque, which we see when people confront new technology. Approaches to interpretability which focus on transparency could help mitigate some of these fears.

Secondly, safety. There is almost always some sort of shift in distributions between model training and deployment. Failures to generalize or Goodhart's Law issues like specification gaming are still open problems that could lead to issues in the near future. Approaches to interpretability which explain the model's representations or which features are most relevant could help diagnose these issues earlier and provide more opportunities to intervene.

Thirdly, and perhaps most interestingly, contestability. As we delegate more decision-making to ML models, it becomes important for people to be able to appeal the decisions those models make. Black-box models provide no such recourse because they don't decompose into anything that can be contested. This has already led to major criticism of proprietary recidivism predictors like COMPAS. Approaches to interpretability which decompose the model into sub-models or explicate a chain of reasoning could help with such appeals.

Defining Interpretability

Lipton's paper breaks interpretability down into two types, transparency and post-hoc.

Transparency Interpretability

These three questions are from Lipton's section on transparency as interpretability, where he focuses on properties of the model that are useful to understand and can be known before training begins.

Can a human walk through the model's steps? (Simulatability)

This property is about whether or not a human could go through each step of the algorithm and have it make sense to them at each step. Linear models and decision trees are often cited as interpretable models using such justifications; the computation they require is simple, no fancy matrix operations or nonlinear transformations.

Linear models are also nice because the parameters themselves have a very direct mapping–they represent how important different input features are. For example, I trained a linear classifier on MNIST, and here are some of the weights, each of which corresponds to a pixel value:

0.00000000e+00, 0.00000000e+00, 3.90594519e-05, 7.10306823e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -1.47542413e-03, -1.67811041e-04, -3.83280468e-02, -8.10846867e-02, -5.01943218e-02, -2.90314621e-02, -2.65494116e-02, -8.29385683e-03, 0.00000000e+00, 0.00000000e+00, 1.67390785e-04, 3.92789141e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00

By themselves, these weights are hard to interpret. Even if we knew which pixels they corresponded to, it's difficult to pin down what a single pixel even represents for large images. However, there is an easy trick to turn these weights into something interpretable. We simply reshape them into the same shape as the input image (28x28 for MNIST) and view the result as an image, with each pixel's color given by its weight value.

Here are the weights for the model that looks for 0:

And here are the weights for the model that looks for 3:

In both cases, we can see that the blue regions, which represent positive weight, correspond to a configuration of pixels that looks roughly like the digit being detected. In the case of 0, we can see a distinct blank spot in the center of the image and a curve-like shape around it, whereas the curves of the 3 are also apparent.

However, Lipton points out that this desideratum can be less about the specific choice of model and more about the size of the model. A decision tree with a billion nodes, for example, may still be difficult to understand. Understanding is also about being able to hold most of the model in your mind, which is often about how the model is parameterized.
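A minimal sketch of the reshaping trick, assuming the classifier's 784 weights for one digit are stored in row-major pixel order. The weights below are synthetic placeholders shaped like a ring, not the trained values listed above, and the ASCII rendering stands in for a plotted image.

```python
SIDE = 28  # MNIST images are 28x28 grayscale


def reshape_weights(weights, side=SIDE):
    """Turn a flat, row-major weight vector into `side` rows of `side` values."""
    if len(weights) != side * side:
        raise ValueError(f"expected {side * side} weights, got {len(weights)}")
    return [weights[r * side:(r + 1) * side] for r in range(side)]


def render_ascii(grid):
    """Crude text rendering: '+' positive weight, '-' negative, '.' zero."""
    def symbol(w):
        return "+" if w > 0 else "-" if w < 0 else "."
    return "\n".join("".join(symbol(w) for w in row) for row in grid)


# Synthetic placeholder weights (NOT the trained values from the post):
# positive on a ring and negative in the centre, loosely like a "0" detector.
flat = []
for r in range(SIDE):
    for c in range(SIDE):
        d2 = (r - 13.5) ** 2 + (c - 13.5) ** 2  # squared distance from the centre
        if 36 <= d2 <= 100:
            flat.append(0.05)
        elif d2 < 16:
            flat.append(-0.05)
        else:
            flat.append(0.0)

grid = reshape_weights(flat)
print(render_ascii(grid))
```

With matplotlib installed, `plt.imshow(grid, cmap="bwr_r")` would draw the same grid as a picture, with positive weights in blue and negative weights in red.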

One approach towards achieving this for neural nets is tree regularization which adds a regularization term that corresponds (roughly) to the size of the decision tree that can approximate the net being trained. The hope here is to eventually output a shallow decision tree that performs comparably to a neural net. Another approach is neural backed decision trees which use another type of regularization to learn a hierarchy over class labels, which then get used to form a decision tree.

Of course, parameterization is not the whole story. There are methods like K-Nearest Neighbors which are parameterized by your entire dataset; this could be billions of points. Yet, there is a sense in which KNN is still interpretable despite its massive size. We can cleanly describe what the algorithm does, and we can even see "why" it made such a choice because the algorithm is so simple.
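To illustrate why KNN stays explainable despite storing the whole dataset, here is a minimal sketch, assuming Euclidean distance, a simple majority vote, and made-up toy data. It returns the neighbours alongside the prediction; those neighbours are the explanation for the output.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority-vote k-nearest-neighbours classifier.

    `train` is a list of (feature_vector, label) pairs. Returns the predicted
    label and the k neighbours that voted for it, so every prediction comes
    with the training points that justify it.
    """
    # Sort all training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    neighbours = by_distance[:k]
    votes = Counter(label for _, label in neighbours)
    label, _count = votes.most_common(1)[0]
    return label, neighbours


# Toy 2-D data with two made-up classes, "a" and "b".
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
label, why = knn_predict(train, (0.95, 0.9), k=3)
print(label, why)
```

Nothing here is learned: the "model" is the stored data plus the distance function, which is why the algorithm can be described in one sentence.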

Is the model interpretable at every step, or with regards to its sub-components? (Decomposability)

Another desirable feature would be to understand what the model is doing at each step. For example, imagine a decision tree whose nodes correspond to easily identifiable factors like age or height. This can sometimes be difficult because model performance is very tightly coupled with the representations used. Raw features, e.g. RGB pixel values, are often not very interpretable by themselves, but interpretable features may not be the most informative for the model.

For example, I trained a decision tree for MNIST using the following interpretable features:

1. The average brightness of the image - avg_lumin
2. The average brightness of the image's outline (found using an edge detector) - edge_prop
3. The number of corners found in the image's outline - num_corners
4. The width of the image - max_width
5. The height of the image - max_height
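Three of the hand-picked features above are simple enough to sketch directly. This is a minimal version under my own assumptions, not the code used in the post: the image is a list of rows of grayscale intensities, a pixel counts as ink when it exceeds a hypothetical `threshold`, and max_width/max_height are the bounding-box extent of the ink. edge_prop and num_corners are left out because they depend on an edge detector and a corner detector that the post does not specify.

```python
def interpretable_features(image, threshold=0):
    """Compute avg_lumin, max_width and max_height for a grayscale image.

    `image` is a list of equal-length rows of pixel intensities. A pixel
    counts as "ink" if its value exceeds `threshold` (an assumed cut-off).
    """
    n_pixels = sum(len(row) for row in image)
    avg_lumin = sum(sum(row) for row in image) / n_pixels

    ink_rows = [r for r, row in enumerate(image) if any(v > threshold for v in row)]
    ink_cols = [c for c in range(len(image[0]))
                if any(row[c] > threshold for row in image)]
    if not ink_rows:  # blank image: there is no bounding box
        return {"avg_lumin": avg_lumin, "max_width": 0, "max_height": 0}

    # Bounding-box extent of the ink, in pixels.
    max_width = ink_cols[-1] - ink_cols[0] + 1
    max_height = ink_rows[-1] - ink_rows[0] + 1
    return {"avg_lumin": avg_lumin, "max_width": max_width, "max_height": max_height}


one = [[0] * 8 for _ in range(8)]
for r in range(1, 7):
    one[r][4] = 255  # a vertical stroke, like a "1"
print(interpretable_features(one))
```

Each feature is easy to read off a tree split ("width below N pixels"), which is the appeal; the cost, as the accuracy numbers below show, is that such summaries discard much of the information in the raw pixels.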

It seems like there would be at least some useful information in these features; ones tend to have less area (so avg_lumin would be lower), eights might have more corners, etc. Yet the resulting depth-3 decision tree, shown below, achieves only 33% training accuracy. Going all the way to depth 10 only bumps it to around 50%.

If we look at the nodes, we can perhaps see what's going on. At the top, we can see that our model will predict a 1 if the width is less than 7.5 pixels, which makes sense as 1 is likely going to be the thinnest digit. Near the bottom, we see that the number of corners is being used to differentiate between 7 and 4. And 4s do have more visual corners than 7s. But this is very rough, and the overall performance is still not very good.

To compare this with raw features, I also trained a depth 3 decision tree using direct pixel values, i.e. a vector of 784 grayscale values. The resulting model, shown below, gets 50% train and test accuracy.

Here, it's not clear at all why these pixel values were chosen to be the splitting points. And yet the resulting decision tree, for the same number of nodes, does much better. In this simple case, the performance vs interpretability trade-off in representation is quite apparent.

Does the algorithm itself confer any guarantees? (Algorithmic Transparency)

This asks if our learning algorithm has any properties which make it easy to understand. For example, we might know that the algorithm only outputs sparse models, or perhaps it always converges to a certain type of solution. In these cases, the resulting learned model can be more amenable to analysis. For example, the Hard Margin SVM is guaranteed to find a unique solution which maximizes the margin. In another vein, the perceptron is guaranteed to find parameters (not necessarily unique ones, though) that achieve a training loss of 0 if the data are linearly separable.
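To make the perceptron guarantee concrete, here is a minimal sketch of the classic perceptron update rule on a toy, linearly separable dataset (a made-up AND-style labelling). Training stops at the first epoch with zero mistakes; on non-separable data such as XOR that never happens, so the function gives up after `max_epochs` and reports `None`.

```python
def train_perceptron(data, max_epochs=100):
    """Classic perceptron. `data` is a list of (x, y) pairs with y in {-1, +1}.

    Returns (weights, bias, epoch), where epoch is the first pass with zero
    mistakes, or None if no such pass happened within max_epochs.
    """
    w = [0.0] * len(data[0][0])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # zero training loss reached
            return w, b, epoch
    return w, b, None


# AND-style labels: only (1, 1) is positive, so the data are linearly separable.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b, epochs = train_perceptron(and_data)
print(w, b, epochs)
```

Visiting the points in a different order can produce different final weights, which is the "not necessarily unique" caveat above.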

When it comes to deep learning, I'm less familiar with these kinds of results. My rough understanding is that the equivalence class of models which achieve comparable training error can be quite large, even with regularization, which makes uniqueness results hard to come by.

As I mentioned earlier with KNN, there seems to be another level of understanding, beyond mechanical transparency, about "what the algorithm actually does in simple terms". KNN is easy to describe as "it reports the labels of the points closest to the input". The part of this property doing the most work is how we actually describe the algorithm. Obviously most ML models can be abstracted as "it finds parameters which satisfy certain constraints", but this is very broad. It seems harder to find a description at the same level of granularity for neural nets beyond something like "it learns a high-dimensional manifold that maps onto the input data".

Post-Hoc Interpretability

These four questions are from Lipton's section on post-hoc interpretability, which focuses on things we learn from the model after training has occurred.

Can the model give an explanation for its decision, after the fact? (Text Explanation)

Similar to how humans often give post-hoc justifications for their actions, it could be informative to have models which can also give explanations, perhaps in text. Naive methods of pairing text with decisions, however, are likely going to optimize for something like "how credible the explanation sounds to a human" rather than "how accurate the explanation is at summarizing the internal steps taken".

While this seems clearly desirable, I think research in this area is hard to come by, and Lipton only offers one paper that is RL-focused. On ConnectedPapers, I found that said paper is part of a larger related field of reinforcement learning with human advice. This seems to focus on the converse problem–given human explanations, how can models incorporate them into their decision-making? Maybe insights here can eventually be used in the other direction.

Can the model identify what is/was important to its decision-making? (Visualization/Local Explanations)

This focuses on how the inputs and outputs change, relative to one another.

Saliency maps are a broad class of approaches that look at how changes to the input change the output. A simple way to do this is to take the derivative of the loss function with respect to the input. Past this, there are many modifications which involve averaging the gradient, perturbing the input, or local approximations. Understanding Deep Networks via Extremal Perturbations and Smooth Masks has a good overview of the work in this area.

For example, I trained a CNN on MNIST and did a simple gradient visualization on an image of this 3:

Using PyTorch, I took the derivative of the logit that corresponds to the class 3 with respect to the input image. This gave me the image below. Here, the white pixels correspond to parts of the image that would increase the logit value for 3, and the black pixels correspond to the reverse. We can see the rough curves of the three come through.
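The post computes this gradient with PyTorch. As a dependency-free illustration of the same idea, the sketch below swaps autograd for central finite differences and uses two hypothetical toy scoring functions in place of a CNN. It also previews the local-versus-global distinction: for a linear score the saliency is identical at every input, while for a nonlinear score it changes with the input.

```python
import math


def saliency(logit_fn, x, eps=1e-5):
    """Central-difference estimate of d(logit)/d(x_i) for every input feature.

    `logit_fn` is treated as a black box mapping a list of floats to one
    class score; this stands in for the autograd call used in the post.
    """
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((logit_fn(up) - logit_fn(down)) / (2 * eps))
    return grads


# Hypothetical toy models standing in for the CNN's class-3 logit.
W = [0.5, -1.0, 2.0]


def linear_logit(x):
    return sum(w * xi for w, xi in zip(W, x))


def tanh_logit(x):
    return math.tanh(linear_logit(x))


x_a = [0.0, 0.0, 0.0]
x_b = [1.0, 0.0, 1.0]
for name, fn in [("linear", linear_logit), ("tanh", tanh_logit)]:
    print(name, saliency(fn, x_a), saliency(fn, x_b))
```

In PyTorch, the equivalent quantity is what `x.grad` holds after creating the input tensor with `requires_grad=True` and calling `.backward()` on the class-3 logit.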

Note how this is different from the visualization we previously had with the linear classifier in red and blue in the first section. Those visuals represented the importance in aggregate for the entire input space. The visualization here is only for this specific input. For a different input, e.g. a different 3, the local gradient would look different, as shown below:

This 3:

Another group of approaches focuses on visualizing the model's internals themselves, rather than the input. A lot of the work has been done by Chris Olah, Shan Carter, Ludwig Schubert, and others on distill.pub. Their work in this area has gone from visualizing the activations of specific neurons and layers, to entire maps of activations for many networks, to decomposing models into interpretable building blocks. Another great visualization resource for this type of work is the OpenAI Microscope. Progress here has been very exciting, but it remains to be seen if similar approaches can be found for neural nets which focus on tasks other than image recognition.

Can the model show what else in the training data it thinks are related to this input/output? (Explanation by Example)

This asks which other training examples are similar to the current input. When the similarity metric is just distance in the original feature space, this is akin to KNN with K = 1. More sophisticated methods may look for examples which are similar in whatever representation or latent space the model is using. The human justification for this type of approach is that it is similar to reasoning by analogy, where we present a related scenario to support our actions.

While I think this is useful, it definitely doesn't seem like all we need for understanding, or even most of what we'd need.

What Else Might Be Important?

These are a mix of other questions I thought of before/after reading the above papers. Some of them are also from Lipton's paper, but from the earlier sections on interpretability desiderata. Because answering questions is harder than asking them, I've also taken the time to give some partial responses to these questions, but these are not well-researched and should be taken as my own thoughts only.

1. What are the relevant features for the model? What is superfluous?
• We've seen that linear models can easily identify relevant features. Regularization and other approaches to learn sparse models or encodings can also help with this. One interesting direction (that may already have been explored) is to evaluate the model on augmented training data that has structured noise or features that correlate with real features and see what happens.
2. How can you describe what the model does in simpler terms?
• The direct way to approach this question is to focus on approximating the model's performance using fewer parameters. A more interesting approach is to try and summarize what the model does in plain English or some other language. Having a simplified description could help with understanding, at least for our intuition.
3. What can the model tell you to allow you to approximate its performance in another setting or another model?
• Another way to think about models which are interpretable is that they are doing some sort of modeling of the world. If you asked a person, for example, why they made some decision, they might tell you relevant facts about the world which could help you come to the same decision. Maybe some sort of teacher-learner RL type scenario where we can formalize knowledge transfer? But ultimately it seems important for the insights to be useful for humans; the feedback loop seems too long to make it an objective to optimize for, but maybe there's a clever way to approximate it… There might be a way where we instead train a model to output some representation or distribution that, when added to some other interpretable model (which could be a human's reasoning), leads to improved performance.
4. How informative is this model, relative to another more interpretable model?
• Currently, deep learning outperforms other more interpretable models on a wide variety of tasks. Past just looking at loss, perhaps there is some way we can formalize how much more information the black box model is using. In the case of learned features versus hand-picked features, it could be useful to understand from an information theory perspective how much more informative the learned features are. Presumably interpretable features would tend to be more correlated with one another.
5. What guarantees does the model have to shifts in distribution?
• Regularization, data augmentation, and directly training with perturbed examples all help with this issue. But perhaps there are other algorithmic guarantees we could derive for our models.
6. What trips up the model (and also the human)?
• One interesting sign that our model is reasoning in interpretable ways is to see if examples which trip up humans also trip up the model. There was some work a little while back on adversarial examples which found that certain examples which fooled the network also fooled humans. Lack of divergence on these troubling examples could be a positive sign.
7. What trips up the model (but not the human)?
• Conversely, we might get better insight into our model by homing in on "easy" examples (for a human) that prove to be difficult for our model. This would likely indicate that the model is using features that we are not, and thus has learned a different manifold (or whatever) through the input space.
8. What does the model know about the data-generation process?
• In most cases, this is encoded by our prior, which is then reflected in the class of models we do empirical risk minimization over. Apart from that, it does seem like there are relevant facts about the world which could be helpful to encode. A lot of the symbolic AI approaches to this seem to have failed, and it's unclear to me what a hybrid process would look like. Perhaps some of the human-assisted RL stuff could provide a solution for how to weigh between human advice and learned patterns.
9. Does the model express uncertainty where it should?
• In cases where the input is something completely nonsensical, it seems perhaps desirable for the model to throw its hands up in the air and say "I don't know", rather than trying its best to give an answer. Humans do this, where we might object to a question on grounds of a type error. For a model, this might require understanding the space of possible inputs.
10. What relationships does the model use?
• The model could be using direct correlations found in the data. Or it could be modeling some sort of causal graph. Or it could be using latent factors to build an approximate version of what's going on. Understanding what relationships in the data are lending themselves to helping the model and what relationships are stored could be useful.
11. Are the model's results contestable?
• We touched on this at the very beginning of the post, but there are not many modern approaches which seem to have done this. The most contestable model might look something like an automated theorem prover which uses statements about the world to build an argument. Then we would simply check each line. Past that, one nice-to-have which could facilitate this is to use machine learning systems which build explicit models about the world. In any case, this pushes our models to make their assumptions about the world more explicit.
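Item 9 asks for a model that can say "I don't know". One common and admittedly crude heuristic is to abstain when the predicted class distribution is too uncertain. The sketch below assumes raw logits as input and uses normalized softmax entropy with an arbitrary cut-off; it is an illustration of the idea, not a recommended method.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_abstain(logits, max_entropy_frac=0.5):
    """Return the index of the most likely class, or None ("I don't know").

    The entropy of the softmax distribution is divided by log(K), so it lies
    in [0, 1]; if it exceeds `max_entropy_frac` (an arbitrary illustrative
    cut-off) the prediction is withheld.
    """
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if entropy / math.log(len(probs)) > max_entropy_frac:
        return None
    return max(range(len(probs)), key=probs.__getitem__)


print(predict_or_abstain([10.0, 0.0, 0.0]))  # confident
print(predict_or_abstain([1.0, 1.0, 1.0]))   # maximally unsure
```

Softmax classifiers can still assign low entropy to nonsensical inputs, so a check like this does not by itself give the model an understanding of the space of possible inputs.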

What's Next?

Broadly, I think there are two main directions that interpretability research should go, outside of the obvious direction of "find better ways to formalize what we mean by interpretability". These two areas are evaluation and utility.

Evaluation

The first area is to find better ways of evaluating these numerous interpretability methods. For many of these visualization-based approaches, a default method seems to be sanity-checking with our own eyes, making sure that interpretable features are being highlighted. Indeed, that's what we did for the MNIST examples above. However, Sanity Checks for Saliency Maps, a recent paper, makes a strong case for why this is definitely not enough.

As mentioned earlier, saliency maps represent a broad class of approaches that try to understand what parts of the input are important for the model's output, often through some sort of gradient. The outputs of several of these methods are shown below. Upon visual inspection, they might seem reasonable as they all seem to focus on the relevant parts of the image.

However, the very last column is the output, not for a saliency map, but for an edge detector applied to the input. This makes it not a function of the model, but merely the input. Yet, it is able to output "saliency maps" which are visually comparable to these other results. This might cause us to wonder if the other approaches are really telling us something about the model. The authors propose several tests to investigate.

The first test compares the saliency map of a trained model with that of a model that has randomly initialized weights. If the two saliency maps look similar, then the map depends more on the input than on the model's parameters.

The second test compares the saliency map of a trained model with a trained model that was given randomly permuted labels. Here, once again, if the saliency maps look similar, this is also a sign of input dependence because the same "salient" features have been used to justify two different labels.

Overall, the authors find that the basic gradient map shows desired sensitivity to the above tests, whereas certain other approaches like Guided BackProp do not.
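A toy version of the model-randomization test, under loudly stated assumptions: the "model" is a hypothetical linear scorer whose gradient saliency is simply its weight vector, the edge detector is a 1-D absolute-difference filter, and similarity is measured with Spearman rank correlation as one reasonable choice. No thresholds or numbers here come from the paper.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Assumes neither input is constant, otherwise the denominator is zero.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def gradient_saliency(weights, x):
    # For a linear logit w.x the input gradient is w itself, for any x.
    return list(weights)


def edge_saliency(weights, x):
    # Edge-detector "saliency": accepts the weights but never uses them.
    return [abs(x[i + 1] - x[i]) for i in range(len(x) - 1)]


rng = random.Random(0)
x = [rng.random() for _ in range(50)]
trained_w = [rng.gauss(0, 1) for _ in range(50)]
random_w = [rng.gauss(0, 1) for _ in range(50)]  # "re-initialised" model

grad_score = spearman(gradient_saliency(trained_w, x), gradient_saliency(random_w, x))
edge_score = spearman(edge_saliency(trained_w, x), edge_saliency(random_w, x))
print(f"gradient: {grad_score:.2f}, edge detector: {edge_score:.2f}")
```

A score near 1 means the map barely changed when the weights were scrambled, which is exactly the failure the test is designed to catch; the edge detector fails it by construction.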

I haven't looked too deeply into each of the saliency map approaches, but I think the evaluation methods here are very reasonable and yet somehow seem to be missed in previous (and later?) papers. For example, the paper on Grad-CAM goes in-depth over the ways in which their saliency map can help aid in providing explanations or identifying bias for the dataset. But they do not consider the sensitivity of their approach to model parameters.

In the above paper on sanity checks, they find that Grad-CAM actually is sensitive to changes in the model parameters, which is good, but I would definitely like to see these sanity checks applied more frequently. Outside of new approaches, I think additional benchmarks for interpretability that mimic real-world use cases could be of great value to the field.

Another approach in this direction is to back-chain from the explanations that people tend to use in everyday life to derive better benchmarks. Explanation in Artificial Intelligence: Insights from the Social Sciences provides an overview of where philosophy and social science can meet ML in the middle. Of course, the final arbiter for all this is how well people can actually use and interpret these interpretability results, which brings me to my second point.

Utility

The second area is to ensure that these interpretability approaches are actually providing value. Even if we find ways of explaining models that are actually sensitive to the learned parameters (and everything else), I think it still remains to be seen if these explanations are actually useful in practice. At least for current techniques, I think the answer is uncertain and possibly even negative.

Manipulating and Measuring Model Interpretability, a large pre-registered study from Microsoft Research, found that models presented with additional information, like their weights, were often not useful in helping users make more accurate judgments on their own or notice when the model was wrong. (Users were given either a black-box model or a more interpretable one.)

They found that:

"[o]n typical examples, we saw no significant difference between a transparent model with few features and a black-box model with many features in terms of how closely participants followed the model’s predictions. We also saw that people would have been better off simply following the models rather than adjusting their predictions. Even more surprisingly, we found that transparent models had the unwanted effect of impairing people’s ability to correct inaccurate predictions, seemingly due to people being overwhelmed by the additional information that the transparent model presented"

Another paper, Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning, found that even data scientists may not understand what interpretability visualizations tell them, and that these visualizations can inspire unwarranted confidence in the underlying model, even leading to ad-hoc rationalization of suspicious results.

Lastly, Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? is a recent study of five interpretability techniques and how much they empirically help humans. The authors found very few benefits from any of the techniques. Of particular note, explanations that participants rated as higher quality were not very useful in actually improving human performance.

All of this points to the difficult road ahead for interpretability research. These approaches and visuals are liable to be misused and misinterpreted. Even once we get improved notions of interpretability with intuitive properties, it remains to be seen whether we can use them to achieve the benefits I listed at the very beginning. While it certainly seems more difficult to formalize interpretability than to use it well, I'm glad that empirical tests are already being done; they can hopefully also guide where the research goes next.

Finally, lurking behind all this is the question of decreased performance and adoption. Black-box models clearly dominate in terms of results in many areas today. Any additional work to induce a more interpretable model or to derive a post-hoc explanation brings an additional cost. At this point in time, all the approaches to improving interpretability we've seen either increase training or processing time, reduce accuracy, or do some combination of both. For those especially worried about competition, arms races, and multipolar traps, the case for adopting these approaches (beyond whatever token compliance will satisfy the technology ethics boards of the future) seems weak.

Discuss

### How Beliefs Change What We See in Starlight

August 4, 2020 - 19:31
Published on August 4, 2020 8:31 AM GMT

I was advised that the reason the articles I've posted here were not getting a good reception was that they were too long and they discussed epistemological concepts in physics in ways that seemed unfamiliar to this audience.

To step back and reach the people for whom this was the case, I've put together this brief introduction.

.................

When you believe in a theory that predicts the existence of gigantic, gravitational black holes that eat light, will you see evidence of these black holes all over the night sky? Will you construct pretty, artistic representations of the things in which you believe?

When you believe in a theory that predicts black hole collisions that release waves which will be detected on Earth as tiny changes in very sensitive measurement devices, will you see evidence of these collisions everywhere you look? Will you construct simulations of what you think these collisions would look like?

When the data from a single measurement device is so noisy that it shows you nothing, is it possible to combine data from multiple measurement devices to see something, or is there the risk that by picking and choosing which measurements to combine and look at, you might see something that isn’t really there? For example, if you took photographs every midnight of a dark, creepy hallway and none of them showed anything unusual, but when you superimposed all of the images together, all of the dust particles accumulated to form the shape of a ghost, can you believe that measurement?
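The ghost-in-the-hallway worry is a real statistical trap, sometimes called "Einstein from noise": if frames are kept because they resemble an expected shape, averaging them produces that shape even when every frame is pure noise. The sketch below is a toy 1-D simulation under assumed parameters (64 pixels, 2000 frames, keeping the best-matching 10%). The "ghost" template and all function names are hypothetical.

```python
import random


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def average(frames):
    """Pixel-wise mean of a list of equal-length frames."""
    return [sum(pixel) / len(frames) for pixel in zip(*frames)]


def stack_pure_noise(template, n_frames=2000, keep_fraction=0.1, seed=0):
    """Return (r_all, r_selected): how much each stacked image resembles the template."""
    rng = random.Random(seed)
    # Every frame is pure Gaussian noise; none of them contains the template.
    frames = [[rng.gauss(0.0, 1.0) for _ in template] for _ in range(n_frames)]
    r_all = correlation(average(frames), template)
    # "Pick and choose": keep only the frames that happen to look most like the template.
    frames.sort(key=lambda f: correlation(f, template), reverse=True)
    kept = frames[: int(n_frames * keep_fraction)]
    r_selected = correlation(average(kept), template)
    return r_all, r_selected


# Hypothetical 1-D "ghost": a bright band in the middle of 64 pixels.
ghost = [1.0 if 20 <= i < 44 else 0.0 for i in range(64)]
r_all, r_selected = stack_pure_noise(ghost)
print(f"all frames: r = {r_all:.2f}; selected frames: r = {r_selected:.2f}")
```

The unselected stack stays close to zero correlation with the template, while the selected stack closely resembles it. The resemblance comes entirely from the selection step, not from anything in the hallway.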

There is always the risk that you might accidentally measure something other than what you intended. For example, instead of measuring a black hole shadow in the starry night sky, you might end up measuring light leaking around your earthly measurement device.

The purpose of the scientific method is to avoid fooling ourselves about what causes what we see. How can we avoid fooling ourselves about what we see in the stars? There are, after all, many ideas that are consistent with our theories of nature. Dragons and unicorns are perfectly consistent with the theory of evolution, yet they do not exist.

In a laboratory experiment, you have a closed system in which changing one variable should produce a predictable effect in another, but in astronomy, we can’t do these sorts of controlled experiments. At best, we can describe what we see. A cosmologist might see the results of a big bang explosion, a crystallographer might see a diffraction pattern, and a materials scientist might see a pattern of localized light in an inhomogeneous medium. When it comes to astronomy, there can never be one definitively true description, because we can never create a controlled experiment with the stars.

Some people build experiments to detect small changes in gravity and they attribute these changes to invisible things that are happening in the stars, but they can’t be sure that the stars are truly the cause of the changes. The changes could be caused by something much closer to Earth. They also can’t be sure that they haven’t made a mistake in how they have interpreted and filtered their data.

To avoid accidentally seeing something in noisy data that isn’t there, scientists will take multiple, independent measurements of something happening. If ten people all independently observe a dragon fighting a unicorn or a gravitational wave from colliding black holes in a distant galaxy, they were probably not all hallucinating. However, if ten people are all looking at a blurry, filtered, and enhanced image and they all agree that it is probably a picture of a dragon fighting a unicorn, they might be fooling themselves and looking at something much more mundane.

In the case of gravitational waves, back in the 1970s, hundreds of independent research groups constructed simple devices to measure them and they all compared their results. Each research group thought that it had measured gravitational waves, but when the results were combined, they all had to conclude that no one had been measuring gravitational waves. They had all been measuring different sources of noise.

Today, we have a new sort of gravitational wave detector that is very expensive, and there are only three of them in existence. They all believe that they are measuring gravitational waves, but it is possible that they are all measuring different sources of noise, because it is difficult to get enough results to compare with only three independent measurement devices.

When evaluating a scientific result, it is important to remember that raw measurements should always be believed, but the interpretation we give to those measurements should always be treated skeptically because you might be measuring something you hadn’t intended to measure. It is also important to be wary of those who construct a result by combining biased, filtered, or calibrated measurements.

In conclusion, it is a good idea to be wary of scientists’ conclusions but to trust in the raw data. The conclusions might be biased by beliefs, even if the result was ‘peer-reviewed’. Groups of scientists can be just as unaware of their blind-spots and biases as individual scientists.

The scientific method itself is the source of the power of science. This power is not contained within the peer-review system or within the community, especially when that community is motivated to see things that are not there.

Discuss

### Budapest Meetup on Margit Sziget

August 4, 2020 - 15:29
Published on August 4, 2020 12:29 PM GMT

We're meeting at Champs bar on Margit Sziget again. Last time 8 people showed up and it was a solid discussion. Even though the case count is creeping up, I'm pretty sure sitting outside will be safe enough, and the seating is under heavy umbrellas, so even if it rains it won't be a problem. I'll have a copy of a book by Richard Dawkins to put out on the table.

I hope to see any LW/SSC/EA people in the area show up :)

I'm planning for us to discuss this article for part of the meetup:

https://forum.effectivealtruism.org/posts/WKPd79PESRGZHQ5GY/in-defence-of-epistemic-modesty

Discuss

### Signs of the Times

August 4, 2020 - 13:54
Published on August 4, 2020 7:09 AM GMT

One morning I woke up to see the front page of every newspaper across the world covered with pictures of what looked like a big orange doughnut. What could this represent? I learned that the doughnut picture was rendered by combining images taken from telescopes all around the world.

Petabytes of data had been mailed to one location for analysis, and a few groups of PhD students wrote some code to combine and overlay the images. The student responsible for combining all of the work into one image got something that looked like a ‘black hole’, and newspapers and magazines from around the world put it on the front page, even though each individual image in the data set showed no doughnut, just blur.

The woman behind first black hole image

I thought: PhD student work is great, but it often contains mistakes, so I sure hope her advisor went through her code very carefully. If you overlay petabytes of images with individually tuned contrast ratios and construct your algorithm so that the overlay is centered around a region of interest, it is certainly possible to create a black hole in the middle of your region of interest. You’d think that sort of scrutiny is routinely applied to new results, but in the darkened conference rooms in which such data is presented, skepticism is expressed in hushed and veiled terms. Sometimes, nothing stands in the way of an idea marching ahead.

The first article I read had a bit of detail on how the algorithm was developed:

Bouman adopted a clever algebraic solution to this problem: If the measurements from three telescopes are multiplied, the extra delays caused by atmospheric noise cancel each other out. This does mean that each new measurement requires data from three telescopes, not just two, but the increase in precision makes up for the loss of information.
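The trick described in the quote is known in radio interferometry as the closure phase (the phase of the bispectrum, or triple product). The atmosphere above each telescope adds a phase error to every baseline that telescope is part of; going around a triangle, each station's error enters once with a plus sign and once with a minus sign, so it cancels. A minimal sketch, assuming phase-only errors, with hypothetical visibilities and random station errors:

```python
import cmath
import math
import random


def measured(true_vis: complex, err_i: float, err_j: float) -> complex:
    """Visibility on baseline (i, j) after station i's and station j's atmospheric phase errors."""
    return true_vis * cmath.exp(1j * (err_i - err_j))


def closure_phase(v12: complex, v23: complex, v31: complex) -> float:
    """Phase of the triple product around the triangle 1 -> 2 -> 3 -> 1."""
    return cmath.phase(v12 * v23 * v31)


# Hypothetical true visibilities (amplitude, phase in radians) for one triangle.
true12 = cmath.rect(0.8, 0.3)
true23 = cmath.rect(0.5, -1.1)
true31 = cmath.rect(0.6, 0.4)

rng = random.Random(1)
e1, e2, e3 = (rng.uniform(-math.pi, math.pi) for _ in range(3))  # per-station errors

v12 = measured(true12, e1, e2)
v23 = measured(true23, e2, e3)
v31 = measured(true31, e3, e1)

# Each single-baseline phase is scrambled by the atmosphere...
print(cmath.phase(v12), cmath.phase(true12))
# ...but the errors cancel in the closure phase: 0.3 - 1.1 + 0.4 = -0.4.
print(closure_phase(v12, v23, v31), closure_phase(true12, true23, true31))
```

Station amplitude gain errors do not cancel in this product, and the cancellation costs information: with N telescopes there are N(N-1)/2 baseline phases but only (N-1)(N-2)/2 independent closure phases.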

I wanted to see her presentation of her data, but what I found was a TED talk.

https://youtu.be/BIvezCVcsYs

The thing that concerned me the most in her TED talk was what she said at 6:40 (paraphrased):

Some images are less likely than others and it is my job to design an algorithm that gives more weight to the images which are more likely.

as in, she told her algorithm to look for pictures that look like black holes and, lo and behold, her algorithm found black holes by ignoring the data that didn’t look like black holes.
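How much a prior can steer a result is easiest to see in a one-pixel toy. This is not the algorithm from the talk; it is a textbook example, assuming Gaussian noise and a Gaussian prior, in which the maximum a posteriori (MAP) estimate is a precision-weighted average of the measurement and the prior's mean. The function name and numbers are hypothetical.

```python
def map_estimate(measurement: float, noise_var: float, prior_mean: float, prior_var: float) -> float:
    """MAP (equal to the posterior mean here) for one pixel with Gaussian prior and noise."""
    data_weight = 1.0 / noise_var    # precision of the measurement
    prior_weight = 1.0 / prior_var   # precision of the prior
    return (data_weight * measurement + prior_weight * prior_mean) / (data_weight + prior_weight)


# The prior "expects" brightness 1.0; the instrument reads 0.0.
for noise_var in (0.01, 1.0, 100.0):
    print(noise_var, round(map_estimate(0.0, noise_var, prior_mean=1.0, prior_var=1.0), 3))
```

When the noise variance is small, the measurement wins; when it is large, the estimate collapses toward whatever the prior expects. Whether the data or the prior drove a given reconstruction therefore depends on their relative precision.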

LIGO did something similar in their algorithms, so if they got away with it, why can’t she?

Finally, I found a real, academic presentation from shortly after the media blitz and front page news stories.

https://youtu.be/UGL_OL3OrCE

It is an hour long and the technical stuff starts about ten minutes in. At 14:40 she talks about the ‘challenge’ of dealing with data that had an ‘error’ of 100%. At 16:00 she talks about how the ‘CLEAN’ algorithm is guided a lot by the user – as in, the user makes the image look how they think it should look. At 19:30, she said, “Most people use this method to do calibration before imaging, but we set it up so that we could do calibration during imaging.” Gaaaah! At 31:40, she shows four images that look the same in the amplitude domain – showing the extent to which this measurement relies on information in the phase domain. An image with a hole or without a hole looks the same in the amplitude domain. At 39:30, she says that the phase data is unusable and the amplitude data is noisy. To me, this sounds like she just contradicted herself.
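The point at 31:40, that different images can share the same Fourier amplitudes, holds in general: amplitudes alone do not pin down an image, and the phases carry the rest. The sketch below uses a classic pair of 1-D point sets with identical multisets of pairwise distances (a "homometric" pair), zero-padded so that equal distances imply equal DFT amplitudes. These toy "images" are illustrative, not data from the talk.

```python
import cmath
import math

N = 36  # zero-padded length; >= 2 * 17 + 1, so circular and linear autocorrelation agree


def impulses(positions, n=N):
    """A 1-D 'image' with unit-brightness pixels at the given positions."""
    x = [0.0] * n
    for p in positions:
        x[p] = 1.0
    return x


def dft(x):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]


# Both sets have the same multiset of pairwise distances, so |DFT| is identical.
image_a = impulses([0, 1, 4, 10, 12, 17])
image_b = impulses([0, 1, 8, 11, 13, 17])

spec_a, spec_b = dft(image_a), dft(image_b)
max_amp_diff = max(abs(abs(ca) - abs(cb)) for ca, cb in zip(spec_a, spec_b))
# Largest phase disagreement, skipping frequencies where the amplitude is ~0.
phase_gap = max(abs(cmath.phase(ca / cb)) for ca, cb in zip(spec_a, spec_b) if abs(cb) > 1e-9)

print(image_a != image_b)  # different images
print(max_amp_diff)        # ~0: same amplitudes
print(phase_gap)           # clearly nonzero: the phases tell them apart
```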

Upon inspection of things she wrote about her data analysis methods, even more strange methods appear, for example, deciding on whether or not to include a data point based on its 'weirdness' - as in, based on whether or not it contributed to the result she wanted to see.

A clever commenter who read this material when I first posted it on quora.com wrote:

Take a look at this picture of my cat.

You don’t see a cat? You just need to apply the right cat-shaped filters.

When I see a picture of a black hole, I see that our academic system has succumbed to the overwhelming noise of our dark age.

I should mention pre-existing biases; my default setting is skeptical. I don’t really believe that black holes exist because I think that theorists got drunk on general relativity and invented them. Astronomers got drunk on interpreting the meanings of tiny dots of light and convinced themselves that they had seen these invisible, theoretical beasts of the night sky.

Seriously, any time you use a singularity to describe a physical phenomenon, it usually turns out to be wrong. Just think about the equations governing water swirling around in a basin. If you use one type of equation, you get a singularity in your basin and if you use another type of equation, you don’t.

That the whole universe is full of singularities floating around in space requires a suspension of disbelief with which I am not really comfortable. I am not alone in this, apparently. Believe it or not, there are still scientists out there who do not believe that black holes exist.

Some black hole researchers will display a picture of a star with a dark spot in it and then follow that up with a simulation showing the same picture. There is something important to know about simulations. If you have a picture of something, it is easy to make a simulation that copies the picture. What is hard is to make a simulation of something you have never seen before, predict how and where you will see it, and then record an observation of it. That is really the only way to do science. Any other route can lead to self-trickery.

Here is a precursor to the black hole image that made Katie Bouman’s PhD work famous.

Saturation effects must be rather difficult to deal with in such images, but, as we saw recently, that hasn’t stopped them from ending up on the front page of newspapers across the world.

Here is a study which claims to support the existence of black holes, but really just tracked some stars near the center of our galaxy. They write:

“The stars in the innermost region are in random orbits, like a swarm of bees,” says Gillessen. “However, further out, six of the 28 stars orbit the black hole in a disc. In this respect the new study has also confirmed explicitly earlier work in which the disc had been found, but only in a statistical sense. Ordered motion outside the central light-month, randomly oriented orbits inside – that’s how the dynamics of the young stars in the Galactic Centre are best described.” Unprecedented 16-Year Long Study Tracks Stars Orbiting Milky Way Black Hole

I believe their measurements, but I don’t always believe the interpretation which scientists give to their results.

A lot of money goes into making these sorts of simulations and studying black holes, so one should expect resistance to any change in belief.

Why is science loaded with hypotheses which are assumed to be facts?

Although the author of this Veritasium video does not come out and question the existence of black holes, I like how he describes the way in which astronomers pick unlikely scenarios out of thin air and use them to explain the qualities of blurry blobs of light. I find it amazing how ‘artist’s renditions’ and just-so stories pass as science in this day and age.

Science should produce progress, but when it swirls around in an eddy of self-citation, you end up with a black hole – in a figurative, not literal sense.

................

(If you would like to hear this post read aloud, try this video: https://youtu.be/z2kKDcbum4o)

I was advised that the articles I first posted here on Less Wrong were too long, wide-ranging, and technical, so with this article, I'm trying a more scaled-back, focused, 'reveal culture' style. Does it work better than the approach in the article below?

Discuss

### RT-LAMP is the right way to scale diagnostic testing for the coronavirus

August 4, 2020 - 13:05
Published on August 4, 2020 10:05 AM GMT

Compared to RT-PCR, RT-LAMP is a less versatile laboratory technique, so it is not as widely known. LAMP is better suited than PCR to actual diagnostic testing at scale, but it is not the industry standard. Labcorp, Quest, and One Medical use PCR for their coronavirus diagnostic tests, and they currently have delays as long as two weeks in issuing results, which is totally unacceptable for fighting this pandemic. Color in San Francisco has one of the few widely available RT-LAMP tests for the coronavirus; they generally return results in one to two business days and are not bottlenecked by laboratory procedure.

An open-access RT-LAMP coronavirus diagnostic toolkit was published at the end of July, and I believe it presents a solution for scaling diagnostic testing to where it needs to be: A rapid, highly sensitive and open-access SARS-CoV-2 detection assay for laboratory and home testing

What steps can we take to help national testing capacity switch to RT-LAMP instead of RT-PCR? There are parallel approaches here: bottom-up development suggests distributed citizen-science RT-LAMP labs to fill gaps in testing capacity, while top-down distribution suggests convincing the major industry players to devote resources to switching from RT-PCR to RT-LAMP. I'm going to take the bottom-up approach; can anyone else figure out how to help Quest and Labcorp switch over to the superior testing method?

Does anyone else want to set up their own RT-LAMP operation? I'm going to give it a try and will write a guide on what to buy once we're operational. One of my business partners was an innovator in LAMP primer design over ten years ago, and we still own the lab equipment needed for the simple procedure. Presumably, as non-FDA-approved citizen science, it will have to be "for research only" and not for diagnostic testing purposes, but it's possible the FDA is being cooperative and that actual FDA licenses could be issued on a reasonable timeframe.

Discuss

### A sketch of 'Simulacra Levels and their Interactions'

August 4, 2020 - 12:13
Published on August 4, 2020 9:13 AM GMT

Two sketches I made based on Simulacra Levels and their Interactions. These are just sketches right now; I intend to make something better looking in the future (this is especially true for the first image), but I'd love to hear ideas and get feedback on these early versions.

The above diagram would look much better if it were symmetric, but the post misses some combinations, some of which I also can't see being applicable (L1 & L2, for example).

The colors seem too happy for the topic to be honest :)

Here I also made a Venn version. It has my idea for the ideologue, and then the only ones missing are (1, 3 & 4 VS 2) and (1 & 4). For the former, I have a hard time thinking of something that would fit there, and for the latter, I'm pretty sure there isn't anything that fits.

Discuss