
LessWrong.com News

A community blog devoted to refining the art of rationality

Open problems in human rationality: guesses

August 2, 2019 - 21:16
Published on August 2, 2019 6:16 PM UTC

A couple of months back, Raemon posted this excellent question, to which Scott Alexander replied by sharing his ongoing list. I think it would be great to have people try to give their current guesses for a lot of these. My guesses are in a comment below. My intuition is that this creates value in four ways:

1. Discovery of things you didn't realize you believed in the process of writing the answer.

2. Generation of cruxes if people give you feedback/alternatives for answers.

3. Realization that your guess isn't even wrong, but fundamentally wasn't built from building blocks that can, in principle, be rearranged to form a correct answer.

4. Help in coordination as people get a sense of what others believe about navigating this space. Seeing cognitive diversity on fundamental questions has helped me in this area.




How to evaluate neglectedness and tractability of aging research

August 2, 2019 - 13:33
Published on August 2, 2019 10:33 AM UTC

This is the fourth post of a series in which I'm trying to build a framework to evaluate aging research. Previous posts:

1. A general framework for evaluating aging research. Part 1: reasoning with Longevity Escape Velocity

2. Aging research and population ethics

3. Impact of aging research besides LEV

Summary

In the first section, I propose a method to scout for impactful and neglected aging research. This method consists of two steps: identifying what is necessary to achieve LEV and, among what is necessary, identifying what is most neglected. This differs from the Open Philanthropy Project's approach of ignoring impact derived from Longevity Escape Velocity. A preliminary evaluation, made by including all research on the hallmarks and using lifespan.io's Rejuvenation Roadmap to identify neglected projects, leads us to identify genomic instability, telomere attrition, epigenetic alterations, deregulated nutrient sensing, loss of proteostasis, and mitochondrial dysfunction as neglected and important areas of research. Other neglected research must be scouted in other equally important areas that have already been identified by Open Philanthropy, such as improving delivery methods and developing better biomarkers.

In the second section, I adopt Open Philanthropy's and Aubrey de Grey's stance that interventions targeting "aging in general", such as caloric restriction or metformin, have low tractability and impact. What remains after this first skimming is translational research focusing on the hallmarks, basic non-translational research, and enabling research, such as developing new tools and delivery methods. When prioritizing inside these areas, tractability should be considered only after having first considered neglectedness while trying to maximize scope by looking at research that is necessary for reaching LEV. Otherwise, a relatively small gain in tractability would sacrifice an extreme amount of impact and neglectedness. The hardest problems in the field are often the most neglected; if they are necessary for achieving LEV, they will be the last to be solved, so accelerating them has the greatest effect on the expected date of LEV.

A method for evaluating neglectedness

In its medium investigation on the cause area of aging, the Open Philanthropy Project identified some scientific advances that it believes would provide years of healthy life extension but still within the range of natural lifespans:

  • Preventing the accumulation of epigenetic errors associated with aging or restoring more youthful epigenetic states in cells.
  • Solving the problem of senescent cell accumulation.
  • Reversing stem cell exhaustion.
  • Learning how to use induced pluripotent stem cells (IPSCs) to regenerate and/or replace tissues and organs damaged by aging and aging-related diseases.

It has also identified research avenues useful for the attainment of the goals stated above:

Common obstacles to achieving the goals stated above include lack of ability to selectively deliver agents to desired cell types, measure and control the epigenetic state of cells, and understand and control differentiation and functioning of stem cells (plausibly closely related to the previous item). Therefore, progress on these more general themes may assist with extending healthy lifespan. Other types of relatively general research may also be helpful or necessary for substantial extension of healthy lifespan, such as general progress in neuroscience, improved biomarkers for various aspects of aging, and/or improved model organisms for aging.

By exploring funding and projects in this cause area, it also found two specific neglected interventions:

We have a limited sense of the absolute and relative neglectedness of the various categories of research discussed in this report. However, our scientific advisors identified specific unfunded projects related to the following themes:

  • Understanding the mechanism(s) driving regeneration associated with heterochronic parabiosis: Experiments have indicated that the blood of older animals can have deleterious effects on younger ones, and that the blood and organ functioning of younger animals can improve the functioning of old ones, though to date the hypothetical increase in healthy lifespan has not been tested. Understanding the biology responsible for the observed effects might eventually lead to interventions that address health problems associated with aging.
  • Aging and epigenetics: Documenting correlations between tissue-specific epigenetic states and signs of aging with a longitudinal cohort study and/or systematic examination of cadavers of people dying at various ages could yield valuable information that could lead to treatments to address aging-related health issues.

Open Philanthropy's way of proceeding is a fruitful one: identifying general impactful avenues and then searching for specific unfunded and underfunded projects.

If impact is measured in DALYs saved at the end of life, or mild life extension, then this method of proceeding could be optimal. Nonetheless, I propose another method that makes more sense in light of the measure of impact that I propose in the first post of this series: QALYs saved by making Longevity Escape Velocity come closer.

Open Philanthropy dismisses the possibility of radical life extension in the next decades, but the timeline in which radical life extension will happen doesn't impact the measure of impact that I propose. More importantly, the organization expresses doubts about the desirability of indefinite life extension and its relevance for cost-effectiveness analyses:

We think the best case for this cause involves the prospect of healthy life extension within the range that some humans currently live. [...] Our default view is that death and impairment from “normal aging” are undesirable. However, we would have some concerns about indefinite life extension, mainly related to entrenchment of power and culture. We don’t have internal consensus on whether, and to what extent, such indefinite life extension would be desirable, and don’t consider it highly relevant to this write-up.

Unlike Open Philanthropy, I think that the majority of the impact of aging research comes from making the date of LEV come closer and from the considerations about LEV in general explained in my first post. The reasons for this should be clear by reading the first and third posts, which are respectively about the impact coming from LEV and the impact coming from preventing DALYs due to age-related diseases.

Regarding Open Philanthropy's concerns about the desirability of "indefinite" life extension: potential "entrenchment of power and culture" is a negligible inconvenience when compared to the 36,500,000,000 QALYs saved by making LEV come closer by just one year. This figure comes from estimating the length of a life after LEV at 1000 years, although the possibility of it being "indefinite" is undeniable. I also think it is not clear that Open Philanthropy's concern would be a new problem at all, given that power, in the form of wealth or otherwise, can already be passed from generation to generation. This concern is also offset by considerations about the accumulation of knowledge that would be useful for science and by the other positive sides of eliminating aging discussed in my previous posts. It could also turn out to be irrelevant due to other disruptive technologies coming along before and after LEV is attained.
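The arithmetic behind the 36,500,000,000 figure can be reproduced in a few lines. This is only a sketch: the 1000-year post-LEV lifespan is the estimate stated above, while the number of lives lost to aging per day (`LIVES_PER_DAY`) is not given in this post. The value used here is an assumption, chosen because it is what the stated total implies, and each year of life is counted as one QALY.

```python
# Sketch of the arithmetic behind "36,500,000,000 QALYs per year of LEV brought forward".
YEARS_OF_LIFE_AFTER_LEV = 1000  # estimate stated in the text
LIVES_PER_DAY = 100_000         # ASSUMPTION: implied by the stated total, not given in this post
DAYS_PER_YEAR = 365


def qalys_per_year_of_lev_acceleration(
    lives_per_day: int = LIVES_PER_DAY,
    years_after_lev: int = YEARS_OF_LIFE_AFTER_LEV,
) -> int:
    """QALYs gained if LEV arrives one year earlier, counting 1 QALY per life-year."""
    lives_per_year = lives_per_day * DAYS_PER_YEAR
    return lives_per_year * years_after_lev


print(f"{qalys_per_year_of_lev_acceleration():,}")  # prints 36,500,000,000
```

Changing `years_after_lev` shows how directly the total scales with the assumed post-LEV lifespan.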

In principle, ignoring a source of impact is going to produce wrong results, since optimizing for one also influences the others.

Keeping in mind the metric of "QALYs saved by making LEV come closer", what is a better method of evaluating neglected avenues? Here is the strategy I propose:

  • Identify what is necessary to achieve LEV.
  • Among what is necessary, identify what is most neglected.

This method would cause the most difficult problems, which would be solved last if we didn't inject more funding, to be solved sooner. In this way, we would probably impact the date of LEV the most. Unlike the strategy proposed by Open Philanthropy, here, the focus is on LEV, and this generates different answers.
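As a rough illustration, the two steps can be written as a filter followed by a sort. Everything in this sketch is hypothetical: the area names are placeholders, the `necessary_for_lev` flags stand in for expert judgment, and `research_stage` is an invented scale (0 = earliest stage) used as a proxy for neglectedness. None of the values are real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchArea:
    name: str
    necessary_for_lev: bool  # step 1: judged necessary to achieve LEV (placeholder judgment)
    research_stage: int      # hypothetical scale: 0 = earliest stage, higher = more advanced


def most_neglected_necessary(areas):
    """Step 1: keep only necessary areas. Step 2: order them from least to most advanced."""
    necessary = [area for area in areas if area.necessary_for_lev]
    return sorted(necessary, key=lambda area: area.research_stage)


# Placeholder inputs, NOT real assessments of any research area.
areas = [
    ResearchArea("area A", necessary_for_lev=True, research_stage=3),
    ResearchArea("area B", necessary_for_lev=True, research_stage=0),
    ResearchArea("area C", necessary_for_lev=False, research_stage=0),
    ResearchArea("area D", necessary_for_lev=True, research_stage=1),
]

ranked = most_neglected_necessary(areas)
print([area.name for area in ranked])  # prints ['area B', 'area D', 'area A']
```

Note that "area C" is dropped even though it is at the earliest stage: under this method, neglectedness only counts among the areas that are necessary for LEV.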

How do we implement the two steps above? Knowledge of the field is necessary. In a preliminary way, we could consider the general avenues identified by Open Philanthropy as necessary advancements. On their own, they are certainly incomplete, and we may actually want to include each one of the hallmarks of aging as described in the namesake paper. How the hallmarks cause and interfere with each other is not yet completely understood, but many of them seem to be individually important enough to require a distinct solution in order to achieve LEV, if not multiple distinct solutions per hallmark.

Among the hallmarks identified, how do we identify the most neglected? We could look at unfunded projects, as Open Philanthropy has done, or we could search the literature by keywords and look at which hallmark has the least research devoted to it. A good initial source for identifying the most neglected is also the Rejuvenation Roadmap, which is curated by the Life Extension Advocacy Foundation. The most neglected should be the hallmarks in the least advanced stages of research. In theory, this could tell us that they are just difficult instead of neglected, but most of the time, these two things go hand in hand. By looking at it, we immediately identify that two out of four of the general impactful areas identified by Open Philanthropy (cellular senescence and stem cell exhaustion) are actually in the most advanced stages of research. This shouldn't be a surprise, since they are, right now, the most fashionable and well funded in the field. Stem cells are probably the most researched topic among the things in the roadmap, and research on senescent cells has exploded in recent years, having seen a surge in private investment on the order of hundreds of millions of dollars (most notably Unity Biotechnology).

As mentioned, the most neglected problems seem to be the most difficult. This is usually true, because they are too risky to pursue for publicly funded research and too early stage for private investment. However, since they will be the ones that will be solved later in time, they are the ones that will constitute the last barriers for achieving LEV. Therefore, if accelerated, they will actually bring LEV closer in time.

If we look at the Rejuvenation Roadmap, we can identify these as the most neglected areas: Genomic Instability, Telomere Attrition, Epigenetic Alteration (rightly identified by Open Philanthropy), Deregulated Nutrient Sensing, Loss of Proteostasis, and Mitochondrial Dysfunction.

But how do we evaluate the most neglected topics not described in The Hallmarks of Aging, such as improving delivery methods or developing better biomarkers? Open Philanthropy rightly identified these as important topics, and they are probably necessary for reaching LEV, especially considering that generic therapies that, together, would plausibly address almost every hallmark involve somatic gene therapies that are currently unavailable. In this case, looking for important unfunded projects, coupled with an approach similar to the one we use for the hallmarks (searching how many papers there are about each single topic), could be a good strategy. Again, expertise in this field is needed.

What is more tractable in aging research, and why tractability starts to lose value after a certain point

There seems to be one key consideration about tractability in aging research: addressing aging in general is much more difficult than adopting a divide et impera approach. Quoting Open Philanthropy's piece:

While it is conceivable that there could be treatments addressing aging “in general” (e.g., addressing all or a large proportion of associated symptoms via a single mechanism), such treatments have not been conclusively demonstrated and may not be possible. There are approaches that have been hypothesized to fit in this category, such as caloric restriction. While some of these have been tested in model systems, they have not been tested in humans for the purpose of extending healthy lifespan, and we would guess that they would not have radical effects on healthy lifespan if they were tested (but plausibly could be substantially positive).

Such interventions target basic mechanisms of aging that are the cause of the damages of aging. Similar approaches are drugs that mimic caloric restriction, or that seem to affect multiple metabolic pathways related to aging, such as the drug metformin. These kinds of drugs have small effect sizes, and they delay the onset of age-related diseases. They are difficult to understand due to the intricacy of metabolic pathways and the many unknowns about metabolism in general. Addressing aging by "tinkering" with metabolism is currently not a tractable avenue. In fact, Open Philanthropy focuses on research avenues outside of this area. This is also the argument that biogerontologist Aubrey de Grey makes in favour of the "maintenance approach" to aging: addressing its basic damages (hallmarks) using periodic maintenance, such as by addressing stem cell exhaustion, using senolytics to get rid of senescent cells, etc. instead of focusing on the pathways that originally led to them (the seven damages of his SENS approach are very similar to the nine described in The Hallmarks of Aging). The maintenance approach also theoretically provides rejuvenation - removal of damage - instead of just slowing down its accumulation.

Starting from this common ground, we can focus on what remains:

  • Translational research focusing on the hallmarks.
  • Basic non-translational research.
  • Enabling research, such as developing new tools and delivery methods, even if it isn't strictly classified as aging research.

What is more tractable inside these three categories? This depends on the specific project evaluated. Trying to locate sub-subareas of tractability in advance is probably not that useful.

However, should we really be focusing on tractability beyond the first skimming? I believe that now these two metrics should have priority:

  • Neglectedness, which correlates with hardness.
  • How necessary the given research is for LEV, which drives impact.

Tractability should be considered after picking the most neglected and necessary projects. Here's why:

As observed earlier, in this area, neglectedness and tractability seem to correlate negatively with each other. There could be a few reasons for this: the more difficult a given research project is, the riskier it becomes and the further in the future its results will be. If a project is too risky and too slow, it will not attract public or private funding and will not be generally attractive to researchers. This is usually the case for hard and ambitious basic research, which is too risky for public funding and too early stage to be profitable for a company. This creates a landscape in which harder problems are neglected. The only way to finance this kind of harder and long term research seems to be through philanthropy.

Keeping in mind the metric of impact of "making LEV come closer", it makes more sense to finance hard and neglected problems if they are necessary for achieving LEV. These kinds of problems will be solved later than others, therefore accelerating their solution has the most impact on LEV's expected date, which is the largest source of impact of aging research by far.

Because of the nature of the field, focusing more on tractability would gain us little of it (visually, we would add some on top of what the skimming already performed has gained), and, in turn, we would avert some more short-term DALYs. However, this choice would cost an extreme amount of impact and a large sacrifice in neglectedness, so the tradeoff is not worth it.

In this area, we will rarely see neglected problems that are also very tractable, and while hunting for giving opportunities, we should be on the lookout for projects that are very hard, risky, long term, and underfunded but, at the same time, necessary. By going against the sub-optimal common practice caused by the risk-averse mindset of the field, we can optimize research.

-----------------------------------------------------------------------------------------------

Crossposted from the EA forum.




[Site Update] Weekly/Monthly/Yearly on All Posts

August 2, 2019 - 03:39
Published on August 2, 2019 12:39 AM UTC

Last week, our friends over at the EA Forum coded a new feature for the /allPosts page – in addition to the daily view, you can now view posts by week, month, and year. (Thanks JP!)

This is most exciting when you also set the sorting to "top karma", making it an easy way to catch up on the most important posts you missed.

For convenience, sorted by top-karma, here are links to:

Have fun perusing posts you haven't read yet, or may have forgotten about, and perhaps catching up on newer comments on some of the top discussions. :)




Why Subagents?

August 2, 2019 - 01:17
Published on August 1, 2019 10:17 PM UTC

The justification for modelling real-world systems as “agents” - i.e. choosing actions to maximize some utility function - usually rests on various coherence theorems. They say things like “either the system’s behavior maximizes some utility function, or it is throwing away resources” or “either the system’s behavior maximizes some utility function, or it can be exploited” or things like that. Different theorems use slightly different assumptions and prove slightly different things, e.g. deterministic vs probabilistic utility function, unique vs non-unique utility function, whether the agent can ignore a possible action, etc.

One theme in these theorems is how they handle “incomplete preferences”: situations where an agent does not prefer one world-state over another. For instance, imagine an agent which prefers pepperoni over mushroom pizza when it has pepperoni, but mushroom over pepperoni when it has mushroom; it’s simply never willing to trade in either direction. There’s nothing inherently “wrong” with this; the agent is not necessarily executing a dominated strategy, cannot necessarily be exploited, or any of the other bad things we associate with inconsistent preferences. But the preferences can’t be described by a utility function over pizza toppings.

In this post, we’ll see that these kinds of preferences are very naturally described using subagents. In particular, when preferences are allowed to be path-dependent, subagents are important for representing consistent preferences. This gives a theoretical grounding for multi-agent models of human cognition.

Preference Representation and Weak Utility

Let’s expand our pizza example. We’ll consider an agent who:

  • Prefers pepperoni, mushroom, or both over plain cheese pizza
  • Prefers both over pepperoni or mushroom alone
  • Does not have a stable preference between mushroom and pepperoni - they prefer whichever they currently have

We can represent this using a directed graph:

The arrows show preference: our agent prefers A to B if (and only if) there is a directed path from A to B along the arrows. There is no path from pepperoni to mushroom or from mushroom to pepperoni, so the agent has no preference between them. In this case, we’re interpreting “no preference” as “agent prefers to keep whatever they have already”. Note that this is NOT the same as “the agent is indifferent”, in which case the agent is willing to switch back and forth between the two options as long as the switch doesn’t cost anything.

Key point: there is no cycle in this graph. If the agent’s preferences are cyclic, that’s when they provably throw away resources, paying to go in circles. As long as the preferences are acyclic, we call them “consistent”.

Now, at this point we can still define a “weak” utility function by ignoring the “missing” preference between pepperoni and mushroom. Here’s the idea: a normal utility function says “the agent always prefers the option with higher utility”. A weak utility function says: “if the agent has a preference, then they always prefer the option with higher utility”. The missing preference means we can’t build a normal utility function, but we can still build a weak utility function. Here’s how: since our graph has no cycles, we can always order the nodes so that the arrows only go forward along the sorted nodes - a technique called topological sorting. Each node’s position in the topological sort order is its utility. A small tweak to this method also handles indifference.

(Note: I’m using the term “weak utility” here because it seems natural; I don’t know of any standard term for this in the literature. Most people don’t distinguish between these two interpretations of utility.)

When preferences are incomplete, there are multiple possible weak utility functions. For instance, in our example, the topological sort order shown above gives pepperoni utility 1 and mushroom utility 2. But we could just as easily swap them!
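A minimal sketch of the topological-sort construction, using Python's standard `graphlib` module. The edge list is my reading of the pizza example, and the convention that each edge points from the less-preferred option to the more-preferred one is an assumption, chosen so that utilities increase along the arrows. The helper names are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# ASSUMED convention: each edge points from the less-preferred option to the more-preferred one.
PREFERENCE_EDGES = [
    ("cheese", "pepperoni"),
    ("cheese", "mushroom"),
    ("pepperoni", "both"),
    ("mushroom", "both"),
]


def weak_utility(edges):
    """Assign each option its position in one topological order of the preference graph."""
    predecessors = {}
    for worse, better in edges:
        predecessors.setdefault(worse, set())
        predecessors.setdefault(better, set()).add(worse)
    # static_order() yields every node after all of its predecessors;
    # it raises graphlib.CycleError if the preferences are cyclic.
    order = TopologicalSorter(predecessors).static_order()
    return {option: rank for rank, option in enumerate(order)}


def is_weak_utility(utility, edges):
    """Check that every stated preference goes from lower to higher utility."""
    return all(utility[worse] < utility[better] for worse, better in edges)


utility = weak_utility(PREFERENCE_EDGES)
# Swapping pepperoni and mushroom gives a second, equally valid weak utility.
swapped = dict(utility, pepperoni=utility["mushroom"], mushroom=utility["pepperoni"])
print(is_weak_utility(utility, PREFERENCE_EDGES), is_weak_utility(swapped, PREFERENCE_EDGES))
```

Which of pepperoni and mushroom gets the higher number depends on the order the sorter happens to return. That arbitrariness is the non-uniqueness described above.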

Preference By Committee

The problem with the weak utility approach is that it treats the preference between pepperoni and mushroom as unknown - depending on which possible utility we pick, it could go either way. It’s pretending that there’s some hidden preference there which we simply don’t know. But there are real systems where the preference is not merely unknown, but a real preference to stay in the current state.

For example, maybe our pizza-agent is actually a committee which must unanimously agree to any proposed change. One member prefers pepperoni to no pepperoni, regardless of mushrooms; the other prefers mushrooms to no mushrooms, regardless of pepperoni. This committee is not exploitable and does not throw away resources, nor does it have any hidden preference between pepperoni and mushrooms. Viewed as a black box, its “true” preference between pepperoni and mushrooms is to keep whichever it currently has.

In fact, it turns out that we can represent any consistent preferences by a committee requiring unanimous agreement.

The key idea here is called order dimension. We want to take our directed acyclic graph of preferences, and stick it into a multidimensional space so that there is an arrow from A to B if-and-only-if B is higher along all dimensions. Each dimension represents the utility of one subagent on the committee; that subagent approves a change only if the change does not decrease the subagent’s utility. In order for the whole committee to approve a change, the trade must increase (or leave unchanged) the utilities of all subagents. The minimum number of agents required to make this work - the minimum number of dimensions required - is the order dimension of the graph.

For instance, our pizza example has order dimension 2. We can draw it in a 2-dimensional space like this:

Note that, if there are infinitely many possibilities, then the order dimension can be infinite - we may need infinitely many agents to represent some preferences. But as long as the possibilities are finite, the order dimension will be as well.
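The two-member pizza committee can be sketched directly. The 0/1 utilities are an assumption that matches the description above: one member only cares whether the pizza has pepperoni, the other only whether it has mushrooms. The function name is made up for illustration.

```python
# Each tuple is (utility of the pepperoni-lover, utility of the mushroom-lover).
# ASSUMED values: 1 if the pizza has that member's topping, 0 otherwise.
COMMITTEE_UTILITIES = {
    "cheese": (0, 0),
    "pepperoni": (1, 0),
    "mushroom": (0, 1),
    "both": (1, 1),
}


def committee_approves(current, proposed):
    """Unanimity rule: a change goes through only if no member's utility decreases."""
    before = COMMITTEE_UTILITIES[current]
    after = COMMITTEE_UTILITIES[proposed]
    return current != proposed and all(a >= b for b, a in zip(before, after))


for trade in [("cheese", "pepperoni"), ("pepperoni", "both"),
              ("pepperoni", "mushroom"), ("mushroom", "pepperoni")]:
    print(trade, committee_approves(*trade))
```

The last two trades are both rejected: each one lowers some member's utility, so the committee keeps whichever topping it currently has, with no hidden preference between the two.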

Path-Dependence

So far, we’ve interpreted “missing” preferences as “agent prefers to stay in current state”. One important reason for that interpretation is that it’s exactly what we need in order to handle path-dependent preferences.

In practice, path-dependent preferences mostly matter for systems with “hidden state”: internal variables which can change in response to the system’s choices. A great example of this is financial markets: they’re the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general (economists call this “nonexistence of a representative agent”). The reason is that the distribution of wealth across the market’s agents functions as an internal hidden variable. Depending on what path the market follows, different internal agents end up with different amounts of wealth, and the market as a whole will hold different portfolios as a result - even if the externally-visible variables, i.e. prices, end up the same.

Most path-dependence results from some hidden state directly, but even if we don’t know the hidden state, we can always add hidden state in order to model path-dependence. Whenever future preferences differ based on how the system reached the current state, we just split the state into two states - one for each possibility. Then we repeat, until we have a full set of states with path-independent preferences between them. These new states are “full” states of the system; from outside, some of them look the same.

An example: suppose I prefer New York to Boston if I just came from DC, but Boston to New York if I just came from Philadelphia.

We can represent that with hidden state:

We now have two separate hidden internal nodes, which both correspond to the same externally-visible state “New York”.
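In code, the split amounts to making the hidden variable part of the state. This is a small sketch of the New York example only. The type and function names are made up for illustration, and the preference table encodes nothing beyond the two cases stated above.

```python
from typing import NamedTuple


class FullState(NamedTuple):
    city: str       # externally visible part of the state
    came_from: str  # hidden part: how the agent got here


NY_FROM_DC = FullState("New York", "DC")
NY_FROM_PHILLY = FullState("New York", "Philadelphia")

# From the example: New York beats Boston after DC, Boston beats New York after Philadelphia.
PREFERS_BOSTON = {NY_FROM_DC: False, NY_FROM_PHILLY: True}


def visible(state: FullState) -> str:
    """What an outside observer sees."""
    return state.city


def would_move_to_boston(state: FullState) -> bool:
    """Preferences are path-independent once they are defined over full states."""
    return PREFERS_BOSTON[state]


print(visible(NY_FROM_DC) == visible(NY_FROM_PHILLY))                      # prints True
print(would_move_to_boston(NY_FROM_DC), would_move_to_boston(NY_FROM_PHILLY))  # prints False True
```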

Now the key piece: there is no way to get to the “New York (from Philly)” node directly from the “New York (from DC)” node. The agent does not, and cannot, have a preference between these two nodes. Analogously, a market cannot have a preference between two different wealth distributions - the subagents who comprise a market will never spontaneously decide to redistribute their wealth amongst themselves. They always “prefer” (or “decide”) to stay in whatever state they’re currently in.

This is why we need to understand incomplete preferences in order to handle path-dependent preferences: hidden state creates situations where the agent “prefers” to stay in whatever state they’re in.

Now we can easily model the system using subagents exactly as we did for incomplete preferences. We have a directed preference graph between full states (including hidden state), it needs to be acyclic to avoid throwing away resources, so we can find a set of subagents to represent the preferences. In the case of a market, this is just the subagents which comprise the market: they’ll take a trade if it does not decrease the utility of any subagent. (Note, however, that the same externally-visible trade can correspond to multiple possible internal state changes; the subagents will take the trade if any of the possible internal state changes are non-utility-decreasing for all of them. For a market, this means they can trade amongst themselves in response to the external trade in order to make everyone happy.)
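The acceptance rule in the parenthetical can be written as an "any over all" check. In this sketch, each candidate internal state change is represented by a tuple of utility changes, one per subagent. That representation and the numbers are hypothetical, chosen only to illustrate the rule.

```python
def committee_takes_trade(candidate_changes):
    """Accept an external trade if at least one possible internal state change
    leaves no subagent worse off.

    candidate_changes: iterable of tuples, each holding one utility change per subagent.
    """
    return any(all(delta >= 0 for delta in change) for change in candidate_changes)


# Hypothetical utility changes for a two-subagent market facing one external trade.
without_side_trades = [(-1, 2)]       # the only option hurts subagent 1
with_side_trades = [(-1, 2), (0, 1)]  # an internal side trade compensates subagent 1

print(committee_takes_trade(without_side_trades), committee_takes_trade(with_side_trades))
# prints False True
```

The second case is the market situation described above: the subagents trade amongst themselves so that the same external trade becomes acceptable to everyone.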

Applications & Speculations

We’ve just argued that a system with consistent preferences can be modelled as a committee of utility-maximizing agents. How does this change our interpretation and predictions of the world?

First and foremost: the subagents argument is a generalization of the standard acyclic preferences argument. Anytime we might want to use the acyclic preferences argument, but there’s no reason for the system to be path-independent, we can apply the subagents argument instead. In practice, we usually expect systems to be efficient/inexploitable because of some selection pressure (evolution, market competition, etc) - and that selection pressure usually doesn’t care about path dependence in and of itself.

Main takeaway: pretty much anywhere we’d use an agent with a utility function to model something, we can apply the subagents argument and use a committee of agents with utility functions instead. In particular, this is a good replacement for "weak" utility functions.

Humans are a particularly interesting example. We’d normally use the acyclic preferences argument (among other arguments) to argue that humans approximate utility-maximizers in most situations. But there’s no particular reason to assume path-independence; indeed, human behavior looks highly path-dependent. So, apply the subagents argument. Hypothesis: human behavior approximates the choices of a committee of utility-maximizing agents in most situations.

Sound familiar? The subagents argument offers a theoretical basis for the idea that humans have lots of internal subagents, with competing wants and needs, all constantly negotiating with each other to decide on externally-visible behavior.

In principle, we could test this hypothesis more rigorously. Lots of people think of AI “learning what humans want” by asking questions or offering choices or running simulations. Personally, I picture an AI taking in a scan of a full human connectome, then directly calculating the embedded preferences. Someday, this will be possible. When the AI solves those equations, do we expect it to find a single generic optimizer embedded in the system, approximately optimizing some “utility”? Or do we expect to find a bunch of separate generic optimizers, approximately optimizing several different “utilities”, and negotiating with each other? Probably neither picture is complete yet, but I’d bet the second is much closer to reality.

Conclusion

Let’s recap:

  • The acyclic preferences argument is the easiest entry point for efficiency/inexploitability-implies-utility-maximization theorems, but it doesn’t handle lots of important things, including path dependence.
  • Markets, for example, are efficient/inexploitable but can’t be represented by a utility function. They have hidden internal state - the distribution of wealth over agents - which makes their preferences path-dependent.
  • The subagents argument says that any system with deterministic, efficient/inexploitable preferences can be represented by a committee of utility-maximizing agents - even if the system has path-dependent or incomplete preferences.
  • That means we can substitute committees in many places where we currently use utilities. For instance, it offers a theoretical foundation for the idea that human behavior is described by many negotiating subagents.

One big piece which we haven’t touched at all is uncertainty. An obvious generalization of the subagents argument is that, once we add uncertainty (and a notion of efficiency/inexploitability which accounts for it), an efficient/inexploitable path-dependent system can be represented by a committee of Bayesian utility maximizers. I haven’t even started to tackle that conjecture yet; it’s a wide-open problem.




Sleeping Beauty: Is Disagreement While Sharing All Information Possible in Anthropic Problems?

August 1, 2019 - 23:10
Published on August 1, 2019 8:10 PM UTC

Preamble

This post discusses why any halfer position in the Sleeping Beauty Problem would lead to disagreements between two agents sharing all information. This issue has not been much discussed except by Katja Grace and John Pittard. Furthermore, I will explain why these seemingly absurd disagreements are actually valid. This post is another attempt to draw attention to the important difference between reasoning as the first person and reasoning as an impartial observer in anthropic problems.

The Disagreement

To show that any halfer position leads to disagreements between two communicating agents, consider this problem:

Bring a Friend: You and one of your friends are participating in a cloning experiment. After you fall asleep the experimenter tosses a fair coin. If it lands Heads nothing happens. If it lands Tails you are cloned and the clone is put into an identical room. The cloning process is highly accurate: it retains memory to a level of fidelity that is humanly indistinguishable. As a result, after waking up the next morning there is no way to tell whether you are physically the original or the clone. Your friend is not cloned in any case. The next morning she chooses one of the two rooms to enter. Suppose your friend enters your room. How should she reason about the probability of Heads? How should you reason?

For the friend this is not an anthropic problem, so her answer shouldn’t be controversial. If the coin landed Heads she has a 50% chance of seeing an occupied room, while if the coin landed Tails both rooms would be occupied. Therefore seeing me in the room is evidence favouring Tails. She would update the probability of Heads to 1/3.
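
The friend’s update can be checked with exact fractions. This is only a sketch of the arithmetic described above; the function name `friend_update` is introduced here for illustration.

```python
from fractions import Fraction


def friend_update():
    """Bayes' rule for the friend: condition on the chosen room being occupied."""
    prior_heads = Fraction(1, 2)
    prior_tails = Fraction(1, 2)
    occupied_given_heads = Fraction(1, 2)  # only one of the two rooms holds someone
    occupied_given_tails = Fraction(1, 1)  # both rooms hold someone

    p_occupied = prior_heads * occupied_given_heads + prior_tails * occupied_given_tails
    posterior_heads = prior_heads * occupied_given_heads / p_occupied
    return p_occupied, posterior_heads


p_occupied, posterior_heads = friend_update()
print(p_occupied, posterior_heads)  # 3/4 1/3
```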

From my (the participant’s) perspective this is a classical anthropic problem, just like Sleeping Beauty. There are two camps. Halfers would say the probability of Heads is 1/2, the reason being that I knew I would find myself in this situation. Therefore I haven’t gained any new information about the coin toss, and the probability must remain unchanged. The other camp says the probability of Heads should actually be 1/3. Most thirders argue that I have gained the new information that I exist, which is evidence favouring the existence of more copies of the participant. Therefore the probability of Tails should increase from the prior.

Both camps should agree that seeing the friend (or not) would not change their answer, because the friend is simply choosing one room out of the two. Regardless of the coin toss result there is always a 50% chance that she enters my room.

Now halfers are in a peculiar situation. My probability of Heads is 1/2 while the friend’s answer is 1/3. We can share our information and communicate any way we like. Nothing I say can change her answer, and nothing she says can change mine. To make the matter even more interesting, I would have to admit there is no mistake in my friend’s reasoning, and she would think I am correct too. Our difference is purely due to the difference in perspectives. This seems to contradict Aumann’s Agreement Theorem.

Thirders do not have any of these problems. The friend and I would be in perfect agreement that the probability is 1/3. As a result this issue is occasionally used as a counter to Halferism (as Katja Grace did; even though that post targets SSA specifically, her argument applies to all halfers). However, I would like to argue that these disagreements are indeed valid.

Repeating the Experiment as the Friend vs as the Participant

Re-experiencing the experiment as the friend is not the same as re-experiencing it as a participant. From the friend’s perspective repeating it is straightforward: let another coin toss and potential cloning happen to someone else, and then choose a random room again. It is easy to see that if the number of repetitions is large she would find the chosen room occupied about 3/4 of the time. Of those meetings, about 1/3 would be after Heads. The relative frequency agrees with her answer of 1/3.
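
The friend’s repeated experiment can be sketched as a small Monte Carlo simulation. The room layout (room 0 always occupied, room 1 occupied only after Tails), the function name `simulate_friend` and the fixed seed are modelling assumptions made for this sketch.

```python
import random


def simulate_friend(trials: int, seed: int = 0):
    """Return (fraction of trials with an occupied room, fraction of meetings after Heads)."""
    rng = random.Random(seed)
    occupied_count = 0
    heads_meetings = 0
    for _ in range(trials):
        heads = rng.random() < 0.5
        # Assumed layout: room 0 always holds a participant,
        # room 1 holds one only if the coin landed Tails.
        occupied = [True, not heads]
        chosen = rng.randrange(2)  # the friend picks a room uniformly at random
        if occupied[chosen]:
            occupied_count += 1
            if heads:
                heads_meetings += 1
    return occupied_count / trials, heads_meetings / occupied_count


occupied_freq, heads_given_meeting = simulate_friend(200_000)
print(round(occupied_freq, 2), round(heads_given_meeting, 2))
```

With a large number of trials the two frequencies should land close to 3/4 and 1/3 respectively.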

To repeat the experiment from my first-person perspective is a different story. After waking up from the first experiment (and potentially meeting my friend) I would simply participate in the same process again. I fall asleep and let another coin toss and potential cloning take place. I would wake up again not knowing whether I’m the same physical person as the day before. Suppose I’m told the coin toss result at the end of each day. If this is repeated a large number of times then I would count about 1/2 of the awakenings as following Heads. I would also meet my friend about 1/2 of the time, with about equal numbers after Heads and Tails. My relative frequency of Heads would be 1/2, agreeing with my answer.
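
A comparable sketch for the first-person repetition follows. Which room "I" wake up in after Tails is deliberately left to an arbitrary caller-supplied rule (`my_room_after_tails`, a hypothetical parameter), since the post assigns no probability to it; the counts below do not depend on that rule because the friend chooses her room uniformly at random.

```python
import random


def simulate_participant(trials: int, my_room_after_tails=lambda i: i % 2, seed: int = 0):
    """Return (Heads frequency, meeting frequency, Heads frequency among meetings)."""
    rng = random.Random(seed)
    heads_days = 0
    meetings = 0
    heads_meetings = 0
    for i in range(trials):
        heads = rng.random() < 0.5
        # After Heads only room 0 is occupied, so that is where I am.
        # After Tails my room is set by an arbitrary rule, not by a probability.
        my_room = 0 if heads else my_room_after_tails(i)
        friend_room = rng.randrange(2)
        heads_days += heads
        if friend_room == my_room:
            meetings += 1
            heads_meetings += heads
    return heads_days / trials, meetings / trials, heads_meetings / meetings


print([round(x, 2) for x in simulate_participant(200_000)])
print([round(x, 2) for x in simulate_participant(200_000, my_room_after_tails=lambda i: 1)])
```

Both calls should give values near 1/2 for all three frequencies, whichever rule is used for the room after Tails.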

In tosses involving both my friend and me, she may see the other copy of the participant instead of me specifically after Tails. This causes our difference in answers, which leads to our different interpretations of:

Who is in this Meeting

From the friend’s perspective choosing a random room has two possible outcomes: either the chosen room is empty or it is occupied. The new information she receives is simply that there is someone in the room. She interprets that person as an unspecific participant.

On the other hand, from my perspective the person in the room is specific, i.e. me. The possible outcomes of the room selection are that I either see the friend or do not see the friend. There may or may not exist another version of the participant who is highly similar to me, but that would not affect my observation.

Effectively we are answering two different questions. For my friend: “what is the probability of Heads given there is a version of the participant in the chosen room?” For me: “what is the probability of Heads given I specifically am in the chosen room?” In this sense the disagreement does not violate Aumann’s Agreement Theorem.

A point to note: the specification of I is limited to my first-person perspective. It is incommunicable. I can keep telling my friend “It’s me”, yet it would carry no information to her, because this specification has nothing to do with objective differences between me and other copies of the participant. I refer to this person as me only because I’m experiencing the world from his perspective, because this person is undeniably most immediate to the only subjective experience and consciousness accessible. Identifying me is primitive. Which raises the question:

Can Indexicals Be Used In Anthropics?

Indexicals (or pure indexicals by some definitions) such as I, here, now and by extension we, today or even this world are a point of contention between halfers and thirders. Typically thirders think indexicals can be used in anthropic reasoning while halfers disagree. In the Sleeping Beauty Problem this conflict manifests in the debate over new information. Most thirders think there is new information since I learned I am awake on one specific day, i.e. today. Halfers typically argue that the indexical today is not an objective specification and it can only be said that I am awake on an unknown day, i.e. there is at least one awakening. In the cloning example above thirders think the new information is that a specific participant, I, exists, whereas halfers often argue that objectively speaking it only shows at least one participant exists.

The disagreement between participant and friend suggests another explanation of indexicals. Indexicals are references to the perspective center. When someone says here he is talking about the location from which he is experiencing the world, the place most immediate to the subjective experience. Similarly I and now point to other aspects of the perspective center: they refer to the agent and the time most immediate to the subjective experience, respectively. Because this identification depends only on the perspective, a participant can intrinsically identify I and never confuse himself with others. He can do this without knowing any difference between himself and other highly similar participants. Also because of this dependency on the perspective, this identification is not meaningful to the friend, which leads to their disagreement.

By this logic the use of indexicals in anthropic problems is valid, as long as we are reasoning from the first-person perspective of a participant. The debate over their usage is a debate between perspectives. When thirders say today is a specific day, it requires us to imagine being in Beauty’s shoes, waking up in the experiment. Halfers oppose the use of today because they are reasoning as an outsider, in which case some objective measure is required to differentiate the two days. This shows the conflicting logics of the:

Impartial Observer vs the First Person

We often purposely formulate our logic so that it is not from any specific perspective, as if the perspective center is irrelevant to the problem at hand, i.e. we reason as (imaginary) impartial observers would. This uncentered reasoning does not treat any agent, time or location as inherently special. It is what we usually mean by thinking objectively. Compared to first-person reasoning it differs in several respects.

The obvious difference is the aforementioned use of indexicals. Impartial observers’ uncentered reasoning cannot use indexicals since they are references to the perspective center. For most problems this simply means replacing the indexical I with a third-person identity in logical expressions. Similarly now and here are replaced with some objectively identified time and location. For anthropic problems, however, this has further implications, because the ability to inherently identify oneself affects reasoning, as shown by the disagreement between the participant and the friend.

Another difference concerns one’s uniqueness. The indexicals, as references to the perspective center, are inherently special. From the first-person perspective I am one of a kind. Other agents, no matter how physically similar, are not my logical equals. This explains why, as the first person, I can be identified without knowing any difference between me and others. The differentiation is not needed because we are never in the same category to begin with. The same is true for now and here. On the other hand, for an impartial outsider no agent, time or location is inherently special, i.e. the outsider is indifferent among them.

The last difference is the probability of existence. The existence of I is a logical truth, because to use the indexicals one has to reason from the first-person perspective, yet reasoning from that perspective can only conclude in self-existence, i.e. “I think, therefore I am”. It is sometimes presented as “I can only find myself existing.” Furthermore, given reasoning from a consistent perspective, “I am, here, now” would always be true, because these indexicals refer to different aspects of the same perspective center. On the other hand we can also reason as impartial observers and specify a participant or time by some third-person identifiable measure. In this case it is entirely possible that the agent does not exist or is not conscious at the specified time. E.g. in the previous cloning example we can identify a participant as the one in the chosen room. It is possible that he does not exist, since the chosen room may be empty. In summary, it takes an outsider’s perspective to think about someone’s nonexistence or unconsciousness.

Given these differences, logics from the first person and impartial outsiders should not be mixed in anthropic problems. However, most arguments in this field pay no attention to these distinctions. It is my core argument that anthropic paradoxes are caused by arbitrarily switching perspectives in reasoning, mixing the conflicting logics.

Paradoxes and Mixed Reasoning

To recap: the first-person perspective is centered. It can use indexicals because I, here and now are inherently special compared to other agents, locations or times, and “I exist, here and now” is a logical truth. On the other hand the impartial observers’ perspective is uncentered. Indexicals cannot be used because impartial observers are indifferent to any agent, time or location, and the existence of any specific agent at any time or location is not guaranteed. Within a single logical framework we can employ either perspective, but not both. If an argument mixes the two, paradoxes ensue.

Take the Doomsday Argument as an example. It suggests we should take a more pessimistic outlook on humanity’s future than observed evidence suggests. The argument is simple. First, it recognizes a principle of indifference among all human beings (past, present or future alike). Then it specifically considers my birth rank among all humans (sometimes expressed as our birth rank or that of the current generation). As a result it concludes I am more likely to have my birth rank if there are fewer people in total, i.e. doom soon is more likely. This is a classic case of mixed perspectives. On one hand it treats all human beings indifferently, as an impartial outsider would. Yet at the same time it uses indexicals by employing a first-person perspective and takes a special interest in my, or by extension our, birth rank. Only by mixing the two does it enable the conditional update that shifts the probability toward doom soon. If we reason as the first person and identify the indexical I by treating it as inherently special, then the principle of indifference among all humans no longer applies. Similarly, if we reason as an impartial outsider and recognize the principle of indifference, then there is no reason to consider my birth rank specifically; in fact there is no way to identify I to begin with. Either way, the outlook of mankind can only be estimated by observed evidence. The probability shift is false.
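
For concreteness, the conditional update that the Doomsday Argument performs (and that this post argues is invalid) can be sketched as below. The population sizes, the prior and the birth rank are toy numbers chosen only for illustration, and `doomsday_posterior` is a hypothetical helper name.

```python
from fractions import Fraction


def doomsday_posterior(prior_small: Fraction, n_small: int, n_large: int, birth_rank: int) -> Fraction:
    """Posterior of the smaller total population, treating my birth rank
    as uniformly distributed over 1..N under each hypothesis."""

    def likelihood(total: int) -> Fraction:
        return Fraction(1, total) if birth_rank <= total else Fraction(0)

    joint_small = prior_small * likelihood(n_small)
    joint_large = (1 - prior_small) * likelihood(n_large)
    return joint_small / (joint_small + joint_large)


# Toy numbers: equal priors on 100 vs 1000 humans in total, birth rank 50.
print(doomsday_posterior(Fraction(1, 2), 100, 1000, 50))  # 10/11
```

The shift from 1/2 to 10/11 comes entirely from the uniform-rank likelihood, which is exactly the step that combines the principle of indifference with the indexical my birth rank.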

Interestingly, on some level we realize that the inherently apparent I and the principle of indifference among all humans are not logically consistent. To reconcile this conflict a conscious step is often added: anthropic assumptions, which suggest treating I as a randomly selected individual among indifferent agents. Even though there is no justification for such assumptions, accepting them feels natural, because they allow two highly intuitive ideas to coexist. However, those two ideas are based on different perspectives which should be kept separate to begin with.

An example is the Self-Sampling Assumption (SSA). It suggests we should reason as if I am randomly selected from all actually existent (past, present or future) observers. This leads to the infamous Doomsday Argument. An alternative to the SSA is the Self-Indication Assumption (SIA). It suggests we should reason as if I am randomly selected from all potentially existent observers. While it would refute the Doomsday Argument it has its own paradox: the Presumptuous Philosopher. (It concludes the number of intelligent life-forms in the universe should be higher than observed evidence suggests, because the fact that I exist is evidence favouring more observers.) The debate between SSA and SIA is about the correct reference class of I: whether it should be all existent or all possible observers. Yet if the perspectives are not mixed this problem would never exist in the first place. There is no default reference class for I since right from the start it is never in the same category as other observers, be they actual or potential.

No default reference class also means that any notion of a probability distribution of me being a member of the said reference class is false. Such probabilities do not exist. Consider the paradox related to Boltzmann brains. Some arguments suggest that under current theories of the universe Boltzmann brains would vastly outnumber human brains. Then the probability of me being a Boltzmann brain is almost 100%. Essential to this calculation is a principle of indifference among all brains, which is valid if reasoned as an impartial observer. Yet it also specifically considers the first-person center I, which contradicts the indifference. As a result the probability it is trying to calculate is logically inconsistent to begin with. There is no answer to it. Instead of using the indexical I, the brain in question should be specified from impartial observers’ perspective, e.g. a randomly selected brain among all brains would almost certainly be a Boltzmann brain. This calculation would be correct, but also far less interesting. The same principle also refutes Nick Bostrom’s Simulation Hypothesis.

The non-existence of such probabilities can also be shown by the frequentist interpretation. Recall that in the cloning example, I (the participant) can re-experience the experiment as the first person. From my perspective, after taking part in a large number of iterations the relative frequencies of Heads and of seeing my friend would both approach a certain value (1/2). However, there is no reason for the relative frequency of me being the clone or the original in each experiment to converge toward any particular value. This again suggests such probabilities do not exist. Instead of using indexicals, a participant must be specified from impartial observers’ perspective. Only then is it valid to ask the probability of this individual being the original or the clone. E.g. the probability that the participant in the chosen room (if he exists) is the original is valid. A relative frequency can be calculated by an outsider without having to take a participant’s first-person perspective.

Sleeping Beauty Paradox and Conclusion

The Sleeping Beauty Paradox is without a doubt the most debated problem in anthropic reasoning. Nonetheless the same principle applies. The answer can be derived either from Beauty’s first-person perspective or from impartial observers’ perspective. From the first-person perspective I have gained no new information. I did find myself awake today specifically, yet that is just a logical truth in first-person reasoning. Even before falling asleep on Sunday it is already known that I would wake up in the experiment and identify that day as today. The probability of Heads remains at 1/2. From impartial observers’ perspective there is no new information either. While Beauty being awake on a specific day is not guaranteed from this perspective, it cannot use Beauty’s perspective center to specify today. So all that is known is that there is an unspecific awakening, i.e. there is at least one awakening. The probability of Heads should remain at 1/2 as well.

More importantly, “the probability of today being Monday” and “the probability of this awakening being the first” do not exist, because they use indexicals in some default reference class (actual awakenings or potential awakenings), which is inconsistent. No Bayesian updating should be performed after learning “Today is Monday”. The probability of Heads is 1/2 at awakening and remains at 1/2 after Beauty finds out it is Monday.

In conclusion, perspectives play a significant role in anthropic problems. Different perspectives can potentially give completely different answers. Most notably, the special interest in the perspective center of the first person and the general indifference of impartial observers are not compatible. Reasoning from these two perspectives must be kept separate to avoid paradoxes.



Discuss

Understanding Batch Normalization

August 1, 2019 - 20:56
Published on August 1, 2019 5:56 PM UTC

Batch normalization is a technique which has been successfully applied to neural networks ever since it was introduced in 2015. Empirically, it decreases training time and helps maintain the stability of deep neural networks. For that reason practitioners have adopted the technique as part of the standard toolbox.

However, while the performance boosts produced by using the method are indisputable, the underlying reason why batch normalization works has generated some controversy.

In this post, I will explore batch normalization and will outline the steps of how to apply it to an artificial neural network. I will cover what researchers initially suspected were the reasons why the method works. Tomorrow's post will investigate new research which calls these old hypotheses into question.

To put it in just a few sentences, batch normalization is a transformation that we can apply at each layer of a neural network. It involves normalizing the input of a layer by subtracting the activation mean and then dividing by the activation standard deviation. After this normalization, an additional transformation with learned parameters is typically applied, which allows the neural network to still learn useful representations of the input. All of these steps are then incorporated into the backpropagation algorithm.
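
The forward pass of these steps can be sketched in plain Python. This is a minimal illustration for a single mini-batch, not a full implementation: the names `gamma` (learned scale), `beta` (learned shift) and `eps` (a small constant for numerical stability) are assumptions that follow common convention, and the backward pass is omitted.

```python
import math
from typing import List


def batch_norm(batch: List[List[float]], gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> List[List[float]]:
    """Normalize each feature over the batch, then apply a learned scale and shift."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        std = math.sqrt(var + eps)  # eps guards against division by zero
        for i in range(n):
            normalized = (column[i] - mean) / std  # subtract the mean, then divide
            out[i][j] = gamma[j] * normalized + beta[j]
    return out


# Two examples with two features on very different scales.
activations = [[1.0, 10.0], [3.0, 30.0]]
print(batch_norm(activations, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With `gamma` set to ones and `beta` to zeros, each feature comes out with roughly zero mean and unit variance across the batch; other values of the learned parameters rescale and shift that result.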

The mechanics of batch normalization can be better understood with an example. Here, I have illustrated a simple feed-forward neural network. Our goal is to apply batch normalization to the hidden layer, indicated in green.

Let the vector
MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} h stand for the input to the hidden layer. The input h is calculated by applying an activation function element-wise to the vector computed from the previous layer. Let H stand for a mini-batch of activations for the hidden layer, each row corresponding to one example in the mini-batch.

What batch normalization does is subtract the activation unit's mean value from each input to the hidden layer, and divide the result by the activation unit's standard deviation. For a single unit, we replace $h_i$ with

$$h'_i = \frac{h_i - \mu_i}{\sigma_i}$$

where $\mu_i$ is the mean input value for $h_i$ across the mini-batch. In symbolic form, in order to calculate $\mu_i$ we compute

$$\mu_i = \frac{1}{m} \sum_j H_{j,i}$$

where $m$ is the number of examples in the mini-batch. Similarly, we calculate $\sigma_i$ by computing

$$\sigma_i = \sqrt{\delta + \frac{1}{m} \sum_j (H_{j,i} - \mu_i)^2}$$

The above expression is the standard deviation for the input to the $i$th activation unit with an additional constant value $\delta$. This delta component is kept at a small positive value, like $10^{-8}$, and is added only to avoid the gradient becoming undefined where the true standard deviation is zero.

At test time, we can simply use the running averages for μ and σ discovered during training, as mini-batch samples will not always be available.
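To make the normalization step concrete, here is a minimal sketch in plain Python. The function names, the `RunningStats` class, and the `momentum` value are my own assumptions for illustration. The text only says "running averages", so the exponential moving average used here is one possible choice, not necessarily what any given library does.

```python
import math

DELTA = 1e-8  # small positive constant from the text; keeps sigma away from zero


def batch_stats(H):
    """Per-unit mean and standard deviation of a mini-batch H (rows = examples)."""
    m = len(H)
    n_units = len(H[0])
    mu = [sum(row[i] for row in H) / m for i in range(n_units)]
    sigma = [
        math.sqrt(DELTA + sum((row[i] - mu[i]) ** 2 for row in H) / m)
        for i in range(n_units)
    ]
    return mu, sigma


def normalize(H, mu, sigma):
    """Apply h'_i = (h_i - mu_i) / sigma_i to every row of H."""
    return [[(h - m_i) / s_i for h, m_i, s_i in zip(row, mu, sigma)] for row in H]


class RunningStats:
    """Running averages of mu and sigma (hypothetical helper, assumed EMA update)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mu = None
        self.sigma = None

    def update(self, mu, sigma):
        if self.mu is None:
            # First batch: start the running values at the batch statistics.
            self.mu, self.sigma = list(mu), list(sigma)
            return
        k = self.momentum
        self.mu = [k * old + (1 - k) * new for old, new in zip(self.mu, mu)]
        self.sigma = [k * old + (1 - k) * new for old, new in zip(self.sigma, sigma)]


# Training step: normalize with the statistics of the current mini-batch.
H = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mu, sigma = batch_stats(H)
H_norm = normalize(H, mu, sigma)

stats = RunningStats()
stats.update(mu, sigma)

# Test time: a single example is normalized with the running values instead.
test_row = normalize([[2.0, 20.0]], stats.mu, stats.sigma)[0]
```

Each column of `H_norm` ends up with mean zero and a standard deviation very close to one, regardless of the scale of the original column.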

The above computations put the distribution of the input to a layer into a regime where the gradients for each layer are all reasonably sized. This is useful for training because we don't want the gradient descent step to vanish or blow up. Batch normalization accomplishes this because the weights no longer have an incentive to grow to extremely large or small values. In the process, batch normalization therefore also increases our ability to train with activations like the sigmoid, which were previously known to fall victim to vanishing gradients.

I will borrow an example from the Deep Learning Book (section 8.7.1) to illustrate the central issue, and how we can use batch normalization to fix it. Consider a simple neural network consisting of one neuron per layer. We denote the length of this network by l.

Imagine that we designed this neural network such that it did not have an activation function at each step. If we chose this implementation then the output would be $\hat{y} = x w_1 w_2 w_3 \cdots w_l$. Suppose we were to subtract a gradient vector $g$ obtained in the course of learning, scaled by a learning rate $\epsilon$. The new value for $\hat{y}$ will now be $\hat{y} = x (w_1 - \epsilon g_1)(w_2 - \epsilon g_2) \cdots (w_l - \epsilon g_l)$. Due to the potential depth of this neural network, the gradient descent step could now have altered the function in a disastrous way. If we expand the product for the updated $\hat{y}$ expression, we find that there are n-order terms which could blow up if they are too large. One of these n-order terms is $\epsilon g_1 \prod_{i=2}^{l} w_i$. This expression is now subtracted from $\hat{y}$, which can cause an issue. In particular, if the terms $w_i$ from $i=2$ to $i=l$ are all greater than one, then this previous expression becomes exponentially large.

Since a small mistake in choosing the learning rate can result in an exponential blow up, we must choose the rate at which we propagate updates wisely. And since this network is so deep, the effects of an update to one layer may dramatically affect the other layers. For example, whereas an appropriate learning rate at one layer might be 0.001, this might simultaneously cause a vanishing gradient at some other layer!
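The blow up is easy to see numerically. The sketch below builds the one-neuron-per-layer network without activations and evaluates the term that multiplies the learning rate by the first gradient component and the product of the remaining weights. The depth, the weights, and the gradient values are placeholders I picked for illustration; they are not derived from a real loss.

```python
def forward(x, weights):
    """y_hat = x * w_1 * w_2 * ... * w_l for a chain of single-neuron layers."""
    y_hat = x
    for w in weights:
        y_hat *= w
    return y_hat


def first_weight_term(eps, grads, weights):
    """The term eps * g_1 * (w_2 * ... * w_l) from the expanded product."""
    tail = 1.0
    for w in weights[1:]:
        tail *= w
    return eps * grads[0] * tail


depth = 50             # l in the text
eps = 0.001            # learning rate
x = 1.0
weights = [1.5] * depth   # every w_i is greater than one
grads = [1.0] * depth     # placeholder gradient vector g (assumed values)

before = forward(x, weights)
after = forward(x, [w - eps * g for w, g in zip(weights, grads)])
term = first_weight_term(eps, grads, weights)
```

Even with a learning rate of 0.001, `term` is many orders of magnitude larger than the learning rate itself, because the product of 49 weights of size 1.5 grows exponentially with depth.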

Previous approaches to dealing with this problem focused on adjusting ϵ at each layer in order to ensure that the effect of the gradient was small enough to cancel out the large product, while remaining large enough to learn something useful. In practice, this is quite a difficult problem. The n-order terms which affect the output are too numerous for any reasonably quick model to take into account all of them. By using this technique, the only options we have left are to shrink the model so that there are few layers, or to slow down our gradient computation excessively.

The above difficulty of coordinating gradients between layers is really a specific case of a more general issue which arises in deep neural networks. In particular, the issue is termed an internal covariate shift by the original paper. In general a covariate shift refers to a scenario in which the input distribution for some machine learning model changes. Covariate shifts are extremely important to understand for machine learning because it is difficult to create a model which can generalize beyond the input distribution that it was trained on. Internal covariate shifts are covariate shifts that happen within the model itself.

Since neural networks can be described as function compositions, we can write a two layer feedforward neural network as $f_2(f_1(x, \theta_1), \theta_2)$ where $x$ is the input to the network, and $\theta$ defines the parameters at each layer. Writing this expression where $u = f_1(x, \theta_1)$ we obtain $f_2(u, \theta_2)$. We can see, therefore, that the final layer of the network has an input distribution defined by the output of the first layer, $f_1(x, \theta_1)$. Whenever the parameters $\theta_1$ and $\theta_2$ are modified simultaneously, $f_2$ experiences an internal covariate shift. This shift is due to the fact that $f_1$ now has a different output distribution.
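As a toy illustration of the shift, take both layers to be scalar multiplications (an assumption made purely to keep the example small; real layers are not this simple). Changing the first layer's parameter changes the distribution of the values that the second layer receives, even though the network's inputs are unchanged.

```python
import statistics


def f1(x, theta1):
    """First layer: a single multiplicative parameter (toy assumption)."""
    return theta1 * x


def f2(u, theta2):
    """Second layer, which only ever sees u = f1(x, theta1)."""
    return theta2 * u


xs = [0.5, 1.0, 1.5, 2.0]   # fixed inputs to the network
theta1, theta2 = 2.0, 0.1

u_before = [f1(x, theta1) for x in xs]

# Pretend a gradient step updated both parameters at once.
theta1_new, theta2_new = 3.0, 0.09
u_after = [f1(x, theta1_new) for x in xs]
y_after = [f2(u, theta2_new) for u in u_after]

mean_before = statistics.mean(u_before)
mean_after = statistics.mean(u_after)
```

The update to the second layer's parameter was computed while its inputs had mean `mean_before`, but after the step those inputs have mean `mean_after`.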

It's as if after being told how to change in response to its input distribution, the very ground under $f_2$'s feet has changed. This has the effect of partially canceling out the assumption we are making about the gradient, which is that each element of the gradient is defined as the rate of change of a parameter with everything else held constant. Gradients are only defined as measuring some slope over an infinitesimal region of space, and in our case, we are only estimating the gradient using stochastic mini-batch descent. This implies that we should automatically assume that this basic assumption for our gradient estimate will be false in practice. Even still, a difference in the way that we approach the gradient calculation can help alleviate this problem.

One way to alleviate the issue would be to encourage each layer to output similar distributions across training steps. For instance, we could try to add a penalty to the loss function to encourage the activations from each layer to more closely resemble a Gaussian distribution. This would have the intended effect of keeping the underlying distribution of each layer roughly similar, minimizing the downsides of internal covariate shift. However, this is an unnecessarily painful approach, since it is difficult to design a loss penalty which results in the exact desired change.

Another alternative is to modify the parameters of a layer after each gradient descent step in order to point them in a direction that will cause their output to be more Gaussian. Experiments attempting this technique resulted in neural networks that would waste time repeatedly proposing an internal covariate shift only to be reset by the intervention immediately thereafter (see section 2 in the paper).

The solution that the field of deep learning has settled on is roughly to use batch normalization as described above, and to take the gradients while carefully taking into account these equations. The batch normalization directly causes the input activations to resemble a Gaussian, and since we are using backpropagation through these equations, we don't need any expensive tug of war with the parameters.

One more step is however needed in order to keep the layers from losing their representation abilities after having their output distributions normalized.

Once we have obtained the batch of normalized activations H′, we in fact use γH′+β as the input for the layer, where γ and β are learned scalar parameters. It may seem paradoxical that after normalizing H we would now alter the matrix to make its standard deviation ≠ 1 and its mean ≠ 0. Didn't we want to minimize the effect of a covariate shift? However, this new step allows more freedom in the way the input activations can be represented. From the paper,

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

With the new learned parameters, the layers are more expressive. With this additional parameterization the model can figure out the appropriate mean and standard deviation for the input distribution, rather than having it set to a single value automatically, or worse, having it be some arbitrary value characterized by the layers that came before.
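A small sketch of the scale-and-shift step, assuming scalar γ and β as in the text. Here the two values are set by hand to show their effect; in a real network they would be learned by backpropagation along with the other parameters.

```python
import statistics


def scale_and_shift(H_norm, gamma, beta):
    """Return gamma * H' + beta, applied element-wise to a normalized batch."""
    return [[gamma * h + beta for h in row] for row in H_norm]


# One hidden unit, two examples, already normalized: mean 0, standard deviation 1.
H_norm = [[-1.0], [1.0]]

# gamma and beta are chosen by hand here (assumed values, not learned).
gamma, beta = 2.0, 3.0
H_out = scale_and_shift(H_norm, gamma, beta)

column = [row[0] for row in H_out]
new_mean = statistics.mean(column)    # the mean moves to beta
new_std = statistics.pstdev(column)   # the standard deviation becomes gamma
```

The point of the design is that the mean and standard deviation of the layer's input are now controlled directly by two parameters, rather than by the complicated interaction of all the layers that came before.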

With these properties batch normalization strikes a balance between minimizing the internal covariate shift in the neural network while keeping the representation power at each layer. All we need to do is incorporate the new learned parameters, compute the gradients via a new set of backpropagation equations and apply the normalization transformation at each layer. Now we hit run and our neural network trains faster, and better. It's that easy.

Did my explanation above make perfect sense? Is internal covariate shift really the big issue that batch normalization solves? Tomorrow I will investigate potential flaws with the reasons I gave above.

Tune in to find out what those issues might be. Or just read the paper I'll be summarizing.




July 2019 gwern.net newsletter

August 1, 2019 - 19:20
Published on August 1, 2019 4:19 PM UTC




The Internet: Burning Questions

August 1, 2019 - 17:46
Published on August 1, 2019 2:46 PM UTC

(cross-posted from my personal blog)

I'm about to start a learning project and I'm paying extra close attention to "What feels most interesting?" rather than merely "What am I supposed to know?" When I stopped to think about it, there are plenty of very specific (and vague) things that I'm curious about regarding how the Internet and the network stack operate. Below is my lightly edited brainstorm of "What's confusing and interesting about computer networking?"

Another way of framing this is that I find myself more easily bored when being told "Here's how a system works" compared to when I go, "How the hell would I build this? None of my pieces seem to fit.... maybe if I...."

If you know anything about networking and would enjoy giving me answers/hints/nudges on any of these questions, go ahead!

  1. How does anything find where it's going?
    1. What local knowledge does any given router have, and what search algorithm does it use to eventually end up in the right place?
      1. What (if any) are the guarantees of this (or other) algos?
        1. How often do you not find the node you're looking for?
        2. Is it deterministic and/or predictable? i.e., will a request always follow the same path?
        3. Is there an average number of "hops" till success?
    2. Is there any case where, after an initial request has found a path to the end node, that path is then cached for use in ongoing exchanges? Is the original search process efficient enough that you don't really get any gains from that? Does IP even allow for specifying a specific path?
  2. How the hell are IP addresses and DNS records regulated?
    1. My current understanding is there's some central committee that doles out IP addresses based on geographic considerations.
      1. So IP only needs to be unique. It doesn't matter what IP you have, so it seems like the main job here is just to avoid collisions.
      2. I'm guessing the record of "where does a given IP address live?" is stored in a distributed way across many routers across the whole internet.
    2. What do you have to do to be a DNS service provider? It seems like they all have to coordinate on not letting multiple people get the same domain name, and they also are in charge of maintaining the mapping between URLs and IPs.
      1. How does my computer access/interact with the DNS records?
        1. Is it a bootstrap thing, where a few DNS servers have a "forever" unchanging IP address, and I just ask them for a translation?
        2. I think I've read that my computer also caches DNS records, but how does it know when those records have been updated? If the old IP address is still available it seems like I could go to the wrong site without noticing, unless my computer is constantly checking DNS servers.
          1. That "sounds" expensive, but is it?
      2. How do people who sell domain names handle collisions? If two people try to buy the same domain name at once, from different domain name vendors? (if it was from the same vendor, I assume it would be a straightforward "we queue all requests and process them one at a time" thing, doesn't seem doable with multiple companies (unless they all forwarded to a third party, but wow that seems like a lot))
  3. What does the topology of the internet look like?
    1. Is there an N-degrees of separation thing?
    2. Are there any bottlenecks?
      1. Like, what's the minimum number of routers I'd have to turn off to disconnect a city/state/country from the rest of the Internet?
  4. What governs Internet speeds?
    1. Is speed different from bandwidth?
    2. K, so one can pay for faster or slower Internet. What are the resources that are shuffled around by ISPs that make that speed difference?
    3. What metrics should I use to think about Internet usage? (total GB/time?)
    4. What metrics should I use to think about Internet... provision? (number of parallel requests that can be handled? Total amount of GB? GB/time served?)
    5. ^ The above two questions are in the context of doing Fermi estimates on what sort of network infrastructure is needed to support X amount of consumption.
  5. General resistance to attacks/natural-disasters/acts of god?
    1. Those fiber optic cables that line the oceans, should I be worried about anything happening to them?
    2. What's the largest Internet outage in history?
    3. How resilient is Internet infrastructure compared to, say, the power grid?
  6. Wooooooooooooaaaah how the hell does satellite Internet work?
    1. I can imagine "big space computer" being strong enough to send signals to earth, but can a phone send back?
    2. Oh wait, I remember satellite Internet coming with a big box.
      1. Is the main cost of a satellite Internet receiver mostly "big transmitter and power system to operate it"?
    3. What are an Internet satellite's service capabilities? What's stopping them from taking over from cable driven ISP setups?
  7. What are big player application layer protocols besides HTTP and SMTP?
    1. What's really the difference between them? How weird would it be to send email using HTTP? How weird would it be to send webpages using SMTP?
    2. What (if any exist) does a suuuuper specific application protocol look like?
  8. Where on your computer does networking stuff live?
    1. Is the TCP/HTTP code in the kernel? .txt file on your desktop?
    2. What does the networking card do?
  9. What does network monitoring look like?
    1. Related to topology, does the NSA have like 6 hubs they can look at and see most traffic?
    2. How much "extra" does a router need to also store/monitor the traffic going through it?
  10. The Great Firewall of China?
    1. I heard something about DNS poisoning, what's that?

Expect an update in the future with what I learn!




Mistake Versus Conflict Theory of Against Billionaire Philanthropy

August 1, 2019 - 16:10
Published on August 1, 2019 1:10 PM UTC

Response To (SlateStarCodex): Against Against Billionaire Philanthropy

I agree with all the central points in Scott Alexander’s Against Against Billionaire Philanthropy. I find his statements accurate and his arguments convincing. I have quibbles with specific details and criticisms of particular actions.

He and I disagree on much regarding the right ways to be effective, whether or not it is as an altruist. None of that has any bearing on his central points.

We violently agree that it is highly praiseworthy and net good for the world to use one’s resources in attempts to improve the world. And that if we criticize rather than praise such actions, we will get less of them.

We also violently agree that one should direct those resources towards where one believes they would do the most good, to the best of one’s ability. One should not first give those resources to an outside organization one does not control and which mostly does not use resources wisely or aim to make the world better, in the hopes that it can be convinced to use those resources wisely and aim to make the world better.

We again violently agree that privately directed efforts of wealthy individuals often do massive amounts of obvious good, on average are much more effective, and have some of the most epic wins of history to their names. Scott cites only the altruistic wins and effectiveness here, which I’d normally object to, but which in context I’ll allow.

And so on.

Where we disagree is why anyone is opposing billionaire philanthropy. 

We disagree that Scott’s post is a useful thing to write. I agree with everything he says, but expect it to convince less than zero people to support his position.

Scott laid out our disagreement in his post Conflict vs. Mistake.

Scott is a mistake theorist. That’s not our disagreement here.

Our disagreement is that he’s failing to model that his opponents here are all pure conflict theorists.

Because, come on. Read their quotes. Consider their arguments.

Remember Scott’s test from Conflict vs. Mistake (the Jacobite piece in question is about how communists ignore problems of public choice):

What would the conflict theorist argument against the Jacobite piece look like? Take a second to actually think about this. Is it similar to what I’m writing right now – an explanation of conflict vs. mistake theory, and a defense of how conflict theory actually describes the world better than mistake theory does?

No. It’s the Baffler’s article saying that public choice theory is racist, and if you believe it you’re a white supremacist. If this wasn’t your guess, you still don’t understand that conflict theorists aren’t mistake theorists who just have a different theory about what the mistake is. They’re not going to respond to your criticism by politely explaining why you’re incorrect.

I read Scott’s recent post as having exactly this confusion. There is no disagreement about what the mistake is. There are people who are opposed to billionaires, or who support higher taxes. There are people opposed to nerds or to thinking. There are people opposed to all private actions not under ‘democratic control’. There are people who are opposed to action of any kind.

There are also people who enjoy mocking people, and in context don’t care about much else. All they know is that as long as they ‘punch up’ they get a free pass to mock to their heart’s content.

Then there are those who realize there is scapegoating of people that the in-group dislikes, that this is the politically wise side to be on, and so they get on the scapegoat train for self-advancement and/or self-protection.

Scott on the other hand thinks it would be a mistake to even mention or consider such concepts as motivations, for which he cites his post Caution on Bias Arguments.

Caution is one thing. Sticking one’s head in the sand and ignoring most of what is going on is another.

One can be a mistake theorist, in the sense that one thinks that the best way to improve the world is to figure out and debate what is going on, and what actions, rules or virtues would cause what results, then implement the best solutions.

One cannot be an effective mistake theorist, without acknowledging that there are a lot of conflict theorists out there. The models that don’t include this fact get reality very wrong. If you use one of those models, your model doesn’t work. You get your causes and effects wrong. Your solutions therefore won’t work.

There already were approximately zero mistake theorists against billionaire philanthropy in general, even if many of them oppose particular implementations.

Thus, I expect the main response to Scott’s post to be that people read it or hear about it or see a link to it, and notice that there are billionaires out there to criticize. That this is what we are doing next. That there is a developing consensus that it is politically wise and socially cool to be against billionaire philanthropy as a way of being against billionaires. They see an opportunity, and a new trend they must keep up with.

I expect a few people to notice the arguments and update in favor of billionaire philanthropy being better than they realized, but I expect those people to be few, and I expect that tacking on an extra zero in the positive impact estimation column will not change their behavior much.

There were some anti-government arguments in the post, in the hopes that people will update their general world models and then propagate that update onto billionaire philanthropy. They may convince a few people to shift political positions, but less than if those arguments were presented in another context, because the context here is in support of billionaires. Those who do will probably still mostly fail to propagate the changes to the post’s central points.

Thus, I expect the post to backfire.




Do you use twitter for intellectual engagement? Do you like it?

August 1, 2019 - 15:35
Published on August 1, 2019 12:35 PM UTC

By happenstance I never started using social media much, and when I stumbled onto the Deep Work mindset I began intentionally avoiding it.

One thing I'd like in my life is an intellectual community, where I can routinely go to find people's thoughts that I'm interested in, routinely produce thoughts people are interested in, and engage and be engaged with on those ideas. A few things I've seen recently have slightly updated me that twitter might be useful for this.

I've already seen Zvi and others talk against facebook, but I haven't heard many people talk about twitter.

What has your experience been with using twitter to find, engage with, and share interesting ideas?




The Importance of Those Who Aren't Here

August 1, 2019 - 02:33
Published on July 31, 2019 11:33 PM UTC

One concept that people often miss when structuring groups, organizations, or social spaces is the importance of those who aren't here. People who aren't actively present can be just as important for the medium- and long-term future of your project as those who happen to be around right now (sometimes more so!) -- and if you overfit your planning to the people who happen to be around right now, you may end up damaging your long-term prospects.

Here are some examples of what this can look like:

1) A king falls into decadence and surrounds himself with flattering courtiers. Whenever he wonders whether things are going in the right direction, the courtiers assure him everything is grand and maneuver to prevent him from learning about any negative developments.

2) An online community is meant to represent a wide range of viewpoints and perspectives, but biases in moderation begin to creep in that drive some away. Since the people who don't agree with the current approach have mostly left, polling active users no longer accurately samples the broader community that the space was meant to draw on, but rather the people who happen to agree with the current approach.

3) An employer begins to remove people who don't agree with his strategy and hire those who do. Now, everyone in the organization is on the same page with respect to the strategy -- but when that strategy has flaws there are few internal voices that are able to point that out.

4) A person who recently moved is having some social difficulties fitting in to a new community she's joined and asks her old friends for advice; because they are already her friends and enjoy her current manner, their advice is not very useful for helping her adapt to a new context!

5) A company developing new technology overfixates on their current market and falls prey to Galápagos syndrome; their products become overspecialized with features for their existing group and are no longer competitive in the broader market.

Focusing too much on those who are currently present can be insidious because it can be self-reinforcing; once one is focused on an inappropriately small group of users, confidants, or advisers, it can grow increasingly difficult to bring in information from outside this perspective. Further, one who relies too heavily on a certain perspective can ultimately alter their plans in a direction that further reinforces that small group's desires, driving more of those who don't agree out, and so on.

In order to avoid this cycle, it's important to "look to the river but think of the sea"; in other words, consider not just those that are currently in front of you but also the broader group you hope to reach in the longer term. Solicit outside perspectives or alternate takes and seriously engage with them when making your own plans. In some cases, doing this can be hard -- but that doesn't mean it isn't important.

After all, if you want to make a difference for more than just those you're currently interacting with, it's important to make sure you don't forget yourself in the current milieu!




Gathering thoughts on Distillation

July 31, 2019 - 22:48
Published on July 31, 2019 7:48 PM UTC

I think one of the biggest outstanding issues facing LessWrong is good infrastructure for distilling lengthy conversations.

Wei_Dai notes in the comments of Forum Participation as Research Strategy:

There's nothing that explicitly prevents people from distilling such discussions into subsequent posts or papers. If people aren't doing that, or are doing that less than they should, that could potentially be solved as a problem that's separate from "should more people be doing FP or traditional research?" Also, it's not clear to me that traditional research produces more clear distillations of how disagreements get resolved. It seems like most such discussions don't make for publishable papers and therefore most disagreements between "traditional researchers" just don't get resolved in a way that leaves a public record (or at all).

The LessWrong team has discussed this on and off for about 9 months. While we haven't done any development work on it yet (mostly working on Open Questions instead), habryka and I at least had some sense that this might be one of the most important things LessWrong is missing.

I have some vague thoughts on what would be needed here, but am interested in others' opinions.

The rough thing that seems missing to me is a pipeline that goes something like:

1. Somebody notices a post that was way longer than it needed to be (or a bit confusing, or something), or a meandering comment conversation that could stand to be distilled down into a post. They flag that collection of posts and comments into a database object that says "hey, this cluster of posts and/or comments could use distillation."

2. Someone (sometimes that same person, sometimes a totally different person) writes up a distillation of that cluster of posts and/or comments.

3. It might be that different people write up different distillations of the same content, because they had different takeaways. Or maybe two people both agreed that a post was long, but they both found different central metaphors useful.

4. The people who wrote the original post or comments can mark distillations as "good representation of my views" or not.

In theory there's nothing preventing this from happening with the current site technology. But, a year ago, there was nothing stopping people from making question posts (and getting answers). And yet introducing the question feature made it much more salient that "ask questions" is a thing you're encouraged to do, and then people did it more.

I think having some kind of official distillation pipeline could be useful in a similar way.

Interested in people's thoughts on this.




Another case of "common sense" not being common?

July 31, 2019 - 20:15
Published on July 31, 2019 5:15 PM UTC

Okay, that is probably not that good a characterization. However, I do like it when someone figures out a simple way of looking at a problem that has gone unsolved for so long that it was thought to be very difficult, and therefore assumed to be really complicated.

If you didn't see this:

https://www.quantamagazine.org/mathematician-solves-computer-science-conjecture-in-two-pages-20190725/





Walkthrough: The Transformer Architecture [Part 2/2]

July 31, 2019 - 16:54
Published on July 31, 2019 1:54 PM UTC

If you are already sort of familiar with the Transformer, this post can serve as a standalone technical explanation of the architecture. Otherwise, I recommend reading part one to get the gist of what the network is doing.

Yesterday, I left us with two images of the Transformer architecture. These images show us the general flow of data through the network. The first image shows the stack of encoders and decoders in their bubbles, which is the basic outline of the Transformer. The second image shows us the sublayers of the encoder and decoder.



Now, with the picture of how the data moves through the architecture, I will fully explain a forward pass with an example. Keep the general structure in mind as we go through the details, which can be a bit mind-numbing in parts.

The task

The Transformer is well suited for translating sentences between languages. Since I don't speak any language other than English, I have decided to translate between sarcastic sentences and their intended meaning. In the example, I am translating the phrase "Yeah right" to "That's not true." This way all those robots who only understand the literal interpretation of English can simply incorporate a Transformer into their brain and be good to go.

The embedding

Before the sentence "Yeah right" can be fed into the architecture, it needs to take a form that is understood by the network. In order to have the architecture read words, we therefore need to embed each word into a vector.

We could just take the size of the vocabulary of the document we are working with, and then one-hot encode each word into a fantastically sparse and long vector. But this approach has two problems:

1. The high dimension that these vectors are in makes them harder to work with.

2. There's no natural interpretation of proximity in this vector space.

Ideally, we want words that are similar to be related in some way when they are embedded into a vector format. For instance, we might want all the fruits {apple, orange, pear} to be close to each other in some cluster, and far enough away from the vehicle cluster {car, motorcycle, truck} to form distinct groups.
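To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with a dense embedding. The two-dimensional embedding values are hand-picked for illustration and are an assumption of mine, not the output of any trained model.

```python
import math

vocab = ["apple", "orange", "pear", "car", "motorcycle", "truck"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d embeddings, hand-picked for illustration (not learned).
toy_embedding = {
    "apple": [0.9, 0.1],
    "orange": [0.8, 0.2],
    "pear": [0.85, 0.15],
    "car": [0.1, 0.9],
    "motorcycle": [0.2, 0.8],
    "truck": [0.15, 0.85],
}

# One-hot: every pair of distinct words is equally dissimilar.
one_hot_fruit = cosine(one_hot("apple"), one_hot("orange"))
one_hot_mixed = cosine(one_hot("apple"), one_hot("car"))

# Dense: fruits sit closer to each other than to vehicles.
dense_fruit = cosine(toy_embedding["apple"], toy_embedding["orange"])
dense_mixed = cosine(toy_embedding["apple"], toy_embedding["car"])
```

With one-hot vectors both similarities are zero, so "apple" is no closer to "orange" than to "car". With the toy dense vectors, the fruit pair scores much higher than the mixed pair, which is the kind of clustering we want.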

To be brief, the way that we do this is to use some word embedding neural network that has already been trained on English sentences.

Positional encoding

After we have obtained a set of word-embedded vectors for each word in our sentence, we still must do some further pre-processing before our neural network is fully able to grasp English sentences. The Transformer doesn't include a default method for analyzing the order of words in a sentence that it is fed. That means that we must somehow add the relative order of words into our embedding.

Order is of course necessary because we don't want the model to mistake "Yeah right" for "Right yeah." A mistake like that could be so catastrophic that we would get the opposite result of what we want from the network.

Before I just show you how positions are encoded, let me first show you the naive way that someone might do it, and how I first thought about it. I thought the picture would somehow look like this.

Here I have displayed the word embedding as the left four entries in the vector, and concatenated it with the position of the word in the sentence. I want to be very clear: this is not the way that it works. I just wanted to show you first the bad first pass approach before telling you the correct one, just to save you the trouble of being confused about why it doesn't work the way you imagined it in your head.

The correct way that the positional encoding works is like this.

Rather than just being a number, the positional encoding is another vector whose dimension is equal to the length of the word embedding vector. Then, instead of concatenating, we perform an element-wise vector addition with the word embedding.

The way that we compute the positional encoding vector is a bit tricky, so bear with me as I describe the formula. In a moment, I will shed light on how this formula will allow the model to understand word order in a justified manner. We have

$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

Where PE stands for "positional encoding," pos refers to the position of the word in the sentence, and $i$ is the index which we use to construct this vector. $i$ runs from 0 to $d_{model}/2 - 1$, so each value of $i$ fills in two entries of the vector.
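The formula can be sketched in a few lines of plain Python. This is only an illustration: the function name `positional_encoding` and the word vector values are my own assumptions, and I assume that the dimension is even and that positions are counted from 0.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # entry 2i
        pe.append(math.cos(angle))  # entry 2i + 1
    return pe

# Encoding for a word at position 1 with a 4-dimensional model.
pe_right = positional_encoding(pos=1, d_model=4)

# Hypothetical word embedding for the same word (values made up).
word_vector = [0.2, -0.1, 0.4, 0.3]

# Element-wise addition keeps the dimension at d_model.
model_input = [w + p for w, p in zip(word_vector, pe_right)]
```

Note that the sum has the same length as the word embedding, which is the difference from the concatenation approach described earlier.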

Let me show you how this works by applying this formula to the example sentence. Let's say that $d_{model} = 4$, the same dimension as I have shown in the picture above. And further, let's encode the word "right" in the sentence "Yeah right," which is at position 1. In that case, we compute that the positional encoding is the following vector:

$$\left[\sin\left(pos/10000^{(2 \cdot 0)/4}\right), \cos\left(pos/10000^{(2 \cdot 0)/4}\right), \sin\left(pos/10000^{(2 \cdot 1)/4}\right), \cos\left(pos/10000^{(2 \cdot 1)/4}\right)\right]^\top$$
$$= \left[\sin(pos), \cos(pos), \sin(pos/100), \cos(pos/100)\right]^\top$$
$$= \left[\sin(1), \cos(1), \sin(1/100), \cos(1/100)\right]^\top$$

Why does the positional encoding use this formula? According to the paper,

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

In other words, if we take the positional encoding of any word, we can construct the positional encoding of a word at a fixed offset elsewhere in the sentence by applying a linear transformation (some combination of scaling, rotating and stretching) to the encoding vector.
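Here is a small numerical check of that claim, as a sketch under the same assumptions as the formula above. The helper `shift` is a name I made up: it applies the angle-addition identities to each sine/cosine pair, which amounts to multiplying the encoding by a matrix that depends only on the offset $k$ and not on the position.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) with a linear map that depends only on k."""
    shifted = []
    for i in range(d_model // 2):
        omega = 1 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        sin_k, cos_k = math.sin(omega * k), math.cos(omega * k)
        # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
        shifted.append(s * cos_k + c * sin_k)
        # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        shifted.append(c * cos_k - s * sin_k)
    return shifted

direct = positional_encoding(pos=8, d_model=4)
via_shift = shift(positional_encoding(pos=3, d_model=4), k=5, d_model=4)
max_error = max(abs(a - b) for a, b in zip(direct, via_shift))
```

The two vectors agree up to floating point error, and the coefficients used inside `shift` never look at `pos`, so the same linear map means "five words later" everywhere in the sentence.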

To see why this is useful, consider the fact that the word embedding allowed us to group words together which were close in meaning to each other. Now we have a way of transforming the vector in a linear fashion in order to obtain the position of another word in the sentence. This means that we have a sort of language where a linear function translates to "the last word" or "the word 10 words ago." If we add these two encodings, the hope is that the model should be able to incorporate both the relative meaning of the words, and the relative positions, all in a single compact vector. And empirically, the model seems to be able to do just that.

Now that these two ways of representing the vector are combined, the words are ready to interact by being fed into the self-attention module.

Self-attention

Recall that the whole point of the Transformer architecture is that it is built on attention modules. So far we have only focused on processing the word vectors in a way that will be understood by the model. Now, we must understand what is done to these embedded vectors.



As we can see in the image above, each vector in the sentence is fed through the self-attention module. In turn, the self-attention module takes this sequence and returns yet another sequence, ready to be handed off to the next part of the network (shown as the vectors $z_1$ and $z_2$). This new sequence can be seen as a modified version of the original sequence, with some new information baked into each word. This new information describes how the words relate to each other.

In the last post, I described how self attention allowed the Transformer to construct links between words, which are intended to represent the relationship between words. However, it still isn't clear exactly how we find the strength of these links.

In order to create the links, we first must understand the idea of keys, queries and values. I will borrow the analogy that nostalgebraist uses in their explanation of the Transformer. You can imagine keys as vectors which represent a dating profile for a word, which includes information that might be useful for connecting them to other words. Queries, by contrast, are vectors which include information about what each word is looking for in each other word. Values are vectors which contain some more information about the word that isn't already represented. Our job is to use these keys and queries, which are learned for each word, in order to construct a table of matches. Each link in the table represents the relationship status between the words. Once we have these links, the strength of these links allows us to weight the words by their values. By doing this, each word sort of becomes an infusion of all the words they are dating, which could include themselves.

That is quite a lot to take in. Luckily, I will go through the process of constructing keys, queries and values, and the appropriate weighting, step by step.

The first thing to understand is that the keys and queries are not explicitly represented in the Transformer. Instead keys, queries, and values are constructed by multiplying the embedded vectors by a matrix of weights, which is directly learned. This is how I visualize the process.

Here, I have repeated the input vector for the first word three times, once for each computation. It is multiplied by three separate matrices, in order to obtain the keys, queries and values. $W^K$ represents the weights for generating the keys. $W^Q$ is the matrix for queries, and $W^V$ is the matrix for generating values.
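As a sketch of this step, here is the projection of one input vector into a query, a key and a value. All of the numbers are made up for illustration; in a real model the three matrices are learned, and the dimensions are much larger.

```python
def vec_mat(x, W):
    """Multiply row vector x (length n) by matrix W (n rows, m columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

# Hypothetical input vector for the first word (embedding plus positional encoding).
x1 = [1.0, 0.0, 2.0, -1.0]

# Hypothetical 4 x 2 weight matrices; real ones are learned during training.
W_Q = [[1, 0], [0, 1], [1, 0], [0, 1]]
W_K = [[0, 1], [1, 0], [0, 1], [1, 0]]
W_V = [[1, 1], [0, 0], [0, 0], [1, -1]]

q1 = vec_mat(x1, W_Q)  # query for word 1
k1 = vec_mat(x1, W_K)  # key for word 1
v1 = vec_mat(x1, W_V)  # value for word 1
```

The same three matrices are reused for every word in the sentence; only the input vector changes.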

Once we have our keys, queries, and values for each word, we are ready to link each word up, or in the analogy, find their dates.

Since the dot product can be viewed as a way of measuring the similarity between vectors, the way that we check whether two words are compatible is by calculating the inner product between the query of one word and the key of the other. This dot product will represent the score, or compatibility, between two words. In particular, the score between the first word and the second word is the dot product $q_1 \cdot k_2$. If we were to consider the general case, then the score between word $n$ and word $m$ would be $q_n \cdot k_m$. Remember, these scores are not symmetric. This asymmetry is due to the fact that the score from word $n$ to word $m$ is calculated differently than the score from word $m$ to word $n$. By analogy, just because one word likes another word doesn't mean that they are liked back!
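A tiny sketch of the score table, with made-up queries and keys for a two-word sentence, shows the asymmetry directly.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical queries and keys for two words (values made up).
queries = [[1.0, 0.0], [0.0, 2.0]]
keys = [[0.5, 1.0], [3.0, 0.0]]

# scores[n][m] is the score from word n to word m: q_n . k_m
scores = [[dot(q, k) for k in keys] for q in queries]

score_1_to_2 = scores[0][1]
score_2_to_1 = scores[1][0]
```

Here the score from the first word to the second is 3.0, while the score in the other direction is 2.0, because each direction uses a different query and a different key.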

Once we have the scores from every word to every other word, we now must incorporate that information into the model. The way we do this is the following:

First, we divide the scores by the square root of the dimension of the key. Then we take a softmax over the scores for each word, so that they are all positive and add up to one. Then, for each word, we sum over the values of every word in the sentence (including the word itself), weighted by the softmax scores. Then we are done. Here is what that looks like for the first word in our sentence.

The intuitive justifications for each of these steps may not be apparent at first. The most confusing step, at least to me at first, was why we needed to divide the score by the square root of $d_k$. According to the paper, this makes gradients easier to work with. The reason is that for large values of $d_k$, the dot products can become so large that the softmax function ends up having extremely small gradients.
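The three steps (scale, softmax, weighted sum) can be sketched for the first word as follows. The query, keys and values are made-up numbers, and the value vectors are chosen as unit vectors so that the output simply shows the attention weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 2  # dimension of the keys

# Hypothetical query for word 1, plus keys and values for both words.
q1 = [1.0, 0.0]
keys = [[0.5, 1.0], [3.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# Step 0: raw scores q_1 . k_m for every word m.
raw_scores = [sum(a * b for a, b in zip(q1, k)) for k in keys]

# Step 1: divide by the square root of the key dimension.
scaled = [s / math.sqrt(d_k) for s in raw_scores]

# Step 2: softmax, so that the weights are positive and sum to one.
weights = softmax(scaled)

# Step 3: weighted sum of the value vectors.
z1 = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
```

Since the second key matches the query better, the second word gets the larger weight, and `z1` is pulled mostly toward the second value vector.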

Once we have these weighted values, we are ready to pass them off to the next part of the Transformer. But not so fast! I temporarily skipped over a key component to the Transformer, and what allows it to be so powerful.

The step I missed was that I didn't tell you that there are actually multiple keys, queries, and values for each word. This allows the model to have a whole bunch of different ways to represent the connections between words. Let me show you how it works.

Instead of thinking about finding the sum over values for each word, we should instead think about multiple, parallel processes each computing different sums over values. Each attention head computes a value (the weighted sum), using weight matrices which were all initialized differently. Then when all of the attention heads have calculated their values, we combine these into a single value which summarizes all the contextual information that each attention head was able to find.

In order to combine the values found by the attention heads, we concatenate each of the values, and multiply the concatenated matrix by another learned matrix. We call this new learned matrix $W^O$ for the output weights. Once completed, we now have the final output word in this attention mechanism. This process looks like this.
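As a sketch of the combination step for a single word, assume two attention heads that each produced a two-dimensional output. The head outputs and the output matrix below are made-up numbers; in a real model the output matrix is learned.

```python
def vec_mat(x, W):
    """Multiply row vector x (length n) by matrix W (n rows, m columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

# Hypothetical outputs of two attention heads for one word.
head_outputs = [[1.0, 2.0], [3.0, 4.0]]

# Concatenate the head outputs into one long vector.
concatenated = [x for head in head_outputs for x in head]

# Hypothetical output weights: (number of heads * head size) x d_model = 4 x 4.
W_O = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

# Final output of the attention sublayer for this word.
z_out = vec_mat(concatenated, W_O)
```

The output matrix lets the model mix information across heads, and it also brings the vector back to the model dimension so that the next sublayer receives the size it expects.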


Add and norm

Now, it looks like we're ready to get to the next part of the encoder. But before we do so, I must point out that I left something out of my initial picture. Before we can put $z_{out}$ into the feed forward neural network, we must apply a residual connection first. What is a residual connection? It looks a bit like this.

In other words, the inputs to the self-attention module are added to its outputs. The way that this works is by applying a layer normalization step to the sum of the outputs and inputs. If you don't know what layer normalization is, don't worry. Layer normalization is an operation based on batch normalization, intended to reduce some of batch normalization's side effects. Tomorrow's post will cover batch normalization in detail, so the exact specification is omitted here.
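For readers who want something concrete in the meantime, here is a minimal sketch of add and norm for one word. It assumes the usual definition of layer normalization (subtract the mean of the vector's entries and divide by their standard deviation) and leaves out the learned gain and bias that implementations normally include. The input and output vectors are made up.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance across its entries.

    Simplified: the learned gain and bias parameters are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical input to the self-attention module for one word.
x = [1.0, 2.0, 3.0, 4.0]

# Hypothetical output of the self-attention module for the same word.
z = [0.5, -0.5, 1.0, 0.0]

# Residual connection: add input and output, then normalize the sum.
residual_sum = [a + b for a, b in zip(x, z)]
out = layer_norm(residual_sum)
```

Unlike batch normalization, the statistics here are computed within a single word vector, so the result does not depend on the other examples in the batch.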

Once we have added and normed, we are finally ready to push these words onto the neural network. But first, a note on notation.

Notation

This will likely come as no surprise, but the operations above are really performed above the level of vectors. In order to speed up computation and simplify the way we write the algorithm, we put all the steps into higher order arrays. This notation can also indicate which operations are done in parallel.

In practice, this means that instead of writing the multi-head attention as a bunch of simultaneous steps, we simplify the whole routine by writing the following expression.

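$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O \quad \text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$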

Here, the matrices Q, K, and V stand for the matrices of queries, keys and values respectively. The rows in the matrices are the vectors of queries, keys and values, which we have already discussed above. In general, for any step which allows for some form of parallelization and simplification, a matrix representation is preferred to vectors.
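To make the matrix form concrete, here is a minimal NumPy sketch of the two expressions above. The sizes (4 words, model width 8, 2 heads) and the random weights are illustrative assumptions; in a trained model the projection matrices would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max keeps exp() stable and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_q, n_k), each row sums to 1
    return weights @ V                         # (n_q, d_v)

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V are lists holding one projection matrix per head.
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
n_words, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(n_words, d_model))  # one row per word
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)  # self-attention: Q = K = V = X
```

Every word is handled in the same handful of matrix products, which is the parallelism the notation is meant to expose.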

The feed forward neural network

We are almost done with one slice of an encoder. Even though most of the time was spent on the previous step (the attention and norm sublayer), this next one is still quite important. It is merely simpler to understand. Once the words generated by the self attention module have been handed through the layer-norm step, we must put each word through a feed forward neural network.

This is an important point that is easy to miss: the same neural network is applied independently to each of the words produced by the previous step. There are many neural networks at this slice of the encoder, but they all share weights. On the other hand, neural networks between layers do not share weights. This allows the words to solidify their meaning from some non-linearity in the neural network before being shipped off to the next step, which is an identical slice of the encoder.
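The weight sharing can be checked directly. The sketch below assumes the network is two linear maps with a ReLU in between (my assumption for illustration; the text above does not pin down the exact form) and shows that applying it to all words at once gives the same result as applying it to each word separately.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Assumed form: linear -> ReLU -> linear. Works on one word vector
    # or on a whole matrix with one word per row.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Illustrative sizes and random weights (hypothetical values).
rng = np.random.default_rng(0)
n_words, d_model, d_ff = 3, 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
words = rng.normal(size=(n_words, d_model))

all_at_once = feed_forward(words, W1, b1, W2, b2)
one_by_one = np.stack([feed_forward(w, W1, b1, W2, b2) for w in words])
```

Because no word's output depends on any other word here, all mixing between words has to happen in the attention sublayer.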

The stack

As mentioned before, the encoder and decoder are made up of a stack of attention mechanism/neural network modules. Words are passed through the bottom of a single slice of this stack, and words appear out the top. In other words, the vectors pass through the layers in the same shape that they started. In this sense, you can trace a single word through the stack, going up through the encoder or decoder.

Encoder-decoder crossover

At this point, the encoder is pretty much fully explained. And with that, the decoder is nearly explained, since the two share most properties. If you have been following this far, you should be able to trace how an embedded sentence is able to go through an encoder, by passing through many layers, each of which has a self attention component followed by a feed-forward neural network. The way that data travels through a decoder is similar, but we still must discuss how the encoder and decoder interact, and how the output is produced.

I said in the previous post that when the encoder produces its output, it somehow is able to give information to each of the layers of the decoder. This information is incorporated into the encoder-decoder attention module, which is the main difference between the encoder and decoder. The exact mechanism for how this happens is relatively simple. Take a look at the decoder structure one more time.

I have added the box on the left to clarify how the decoder takes in the input from the encoder. In the encoder-decoder attention sublayer, the keys and values are inherited from the output of the last encoder layer. These keys and values are computed the same way that we computed keys and values above: via a matrix of weights which act on the words. In this case, just as in the last, there is a different matrix acting on each layer, even as these matrices act on the same set of words: the final output of the encoder. This extra module allows the decoder to attend to all of the words in the input sequence.

In line with the previous description, the queries in the encoder-decoder attention sublayer are derived from the outputs of the self-attention layer below it, which works exactly the same as self-attention worked in the encoder.
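A small sketch of the encoder-decoder attention sublayer may help keep track of what comes from where: keys and values are computed from the final encoder output, while queries are computed from the decoder's own self-attention output. The function name, sizes, and random weights are illustrative assumptions, and only a single head is shown.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_states, enc_out, W_Q, W_K, W_V):
    Q = dec_states @ W_Q   # queries: from the decoder's self-attention output
    K = enc_out @ W_K      # keys: from the final encoder output
    V = enc_out @ W_V      # values: from the final encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

# Illustrative sizes and random weights (hypothetical values).
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
enc_out = rng.normal(size=(2, d_model))      # two input words
dec_states = rng.normal(size=(3, d_model))   # three output positions so far
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = encoder_decoder_attention(dec_states, enc_out, W_Q, W_K, W_V)
```

Here `weights` has one row per output position and one column per input word, so each output position gets its own distribution over the whole input sequence.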

In the first pass through the network, the words "Yeah right" are passed through the encoder, and two words appear at the top. Then we use the decoder to successively work from there, iteratively applying self attention to an input (which input? this will be elaborated in just a moment) and plugging its output back into its input. The decoder attends to the output from the encoder, along with the output from self-attention. At the top of the decoder, we feed the output one last time through a linear layer and a softmax layer, in order to produce a single word.

Then, after a single word is produced, we put that back into the input of the decoder and continue running the decoder. We stop whenever the decoder outputs a special symbol indicating that it has reached the end of its translation, the end token. This process would look like this.
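The loop described above can be sketched in a few lines. Everything here is a toy: `toy_next_word` is a hypothetical stand-in for the whole encoder, decoder, linear layer, and softmax, and picking the single most likely word at each step is just one simple way to choose an output.

```python
def greedy_decode(next_word, source, start="<s>", end="</s>", max_len=20):
    # next_word(source, produced_so_far) returns the next output word.
    produced = [start]
    while len(produced) < max_len:
        word = next_word(source, produced)
        if word == end:          # the end token stops the loop
            break
        produced.append(word)    # feed the new word back in as decoder input
    return produced[1:]          # drop the start symbol

def toy_next_word(source, produced):
    # Hypothetical stand-in for encoder + decoder + linear + softmax:
    # it "translates" by upper-casing the source words one at a time.
    i = len(produced) - 1
    return source[i].upper() if i < len(source) else "</s>"

result = greedy_decode(toy_next_word, ["yeah", "right"])
```

The `max_len` guard is there only so the sketch terminates even if the stand-in model never emits the end token.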

Before the decoder has produced an output, we don't want it to be able to attend to tokens which have not been seen yet. Therefore, in our first pass through the decoder, we set all the softmax inputs to −∞ . Recall that when the softmax is applied, we use the resulting output to weight the values in the self-attention module. If we set these inputs to the softmax to −∞ , the result is that the softmax outputs a zero, which correspondingly gives those values no weight. In future passes, we ensure that every input that corresponds to a future word is treated similarly, so that there cannot be connections between words that we have not outputted yet.1
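One common way to realize this masking, sketched here under my own assumptions rather than taken from the paper's code, is to overwrite the scores for future positions with −∞ before the softmax. Since exp(−∞) is exactly 0, those positions receive zero weight.

```python
import numpy as np

def masked_softmax(scores):
    n = scores.shape[0]
    # True above the diagonal: position i may not attend to positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)           # exp(-inf) evaluates to 0.0
    return e / e.sum(axis=-1, keepdims=True)

# Made-up raw scores between three output positions.
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])
weights = masked_softmax(scores)
```

The first row ends up putting all of its weight on the first position, while the last row is an ordinary softmax over all three.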

Now, with the mechanism for producing outputs fully elucidated, we are ready to stare at the full network, with all of its tiny details. I do not want to compete with the original paper for illustrations on this one. Therefore, I have incorporated figure 1 from the paper and placed it here.

As you might have noticed, I neglected to tell you about some extra residual connections. Regardless, they work the same way that I described above, so there should be no confusion there! Other than that, I have covered pretty much every part of the Transformer design. You can now gaze upon the above image and feel the full satisfaction of being able to understand every part (if you've been paying attention).

Of course, we haven't actually gone into how it is trained, or how we generate the single words from the final softmax layer and output probabilities. But this post is long enough, and such details are not specific to the Transformer anyway. Perhaps one day I shall return.

For now, I will be leaving the transformer architecture. In tomorrow's post I will cover batch normalization, another one of those relatively new ML concepts that you should know about. See you then.

1 If this discussion of exactly how the decoder incorporates its own output, and masks illegal connections confuses you, you're not alone. I am not confident that these paragraphs are correct, and will possibly correct them in the future when I have a better understanding of how it is implemented.



Discuss

How to Ignore Your Emotions (while also thinking you're awesome at emotions)

July 31, 2019 - 16:34
Published on July 31, 2019 1:34 PM UTC

(cross posted from my personal blog)

Since middle school I've generally thought that I'm pretty good at dealing with my emotions, and a handful of close friends and family have made similar comments. Now I can see that though I was particularly good at never flipping out, I was decidedly not good at "healthy emotional processing". I'll explain later what I think "healthy emotional processing" is; right now I'm using quotes to indicate "the thing that's good to do with emotions". Here it goes...

Relevant context

When I was a kid I adopted a strong, "Fix it or stop complaining about it" mentality. This applied to stress and worry as well. "Either address the problem you're worried about or quit worrying about it!" Also being a kid, I had a limited capacity to actually fix anything, and as such I was often exercising the "stop worrying about it" option.

Another thing about me: I was a massive bookworm and loved to collect "obvious mistakes" that heroes and villains would make. My theory was, "Know all the traps, and then just don't fall for them". That plus the sort of books I read meant that I "knew" it was a big no-no to ignore or repress your emotions. Luckily, since I knew you shouldn't repress your emotions, I "just didn't" and have lived happily ever after

...

...

yeah nopes.

Wiggling ears

It can be really hard to teach someone to move in a way that is completely new to them. I teach parkour, and sometimes I want to say,

Me: "Do the shock absorbing thing with your legs!"
Student: "What's the shock absorbing thing?"
Me: "... uh, you know... the thing where your legs... absorb shock?"

It's hard to know how to give cues that will lead to someone making the right mental/muscle connection. Learning new motor movements is somewhat of a process of flailing around in the dark, until some feedback mechanism tells you you did it right (a coach, it's visually obvious, the jump doesn't hurt anymore, etc). Wiggling your ears is a nice concrete example of a) a movement most people's bodies are capable of and b) one that most people feel is impossible.

Claim: learning mental and emotional skills has a similar "flailing around in the dark" aspect. There are the mental and emotional controls you've practiced, and those just feel like moving your arm. Natural, effortless, atomic. But there are other moves, which you are totally capable of, that seem impossible because you don't know how your "control panel" connects to that output. This feels like trying to wiggle your ears.

Why "ignore" and "deal with" looked the same

So young me is upset that the grub master for our camping trip forgot half the food on the menu, and all we have for breakfast is milk. I couldn't "fix it" given that we were in the woods, so my next option was "stop feeling upset about it." So I reached around in the dark of my mind, and Oops, the "healthily process feelings" lever is right next to the "stop listening to my emotions" lever.

The end result? "Wow, I decided to stop feeling upset, and then I stopped feeling upset. I'm so fucking good at emotional regulation!!!!!"

My model now is that I substituted "is there a monologue of upsetness in my conscious mental loop?" for "am I feeling upset?". So from my perspective, it just felt like I was very in control of my feelings. Whenever I wanted to stop feeling something, I could. When I thought of ignoring/repressing emotions, I imagined trying to cover up something that was there, maybe with a story. Or I thought if you poked around ignored emotions there would be a response of anger or annoyance. I at least expected that if I was ignoring my emotions, then when I got very calm and asked myself, "Is there anything that you're feeling?" I would get an answer.

Again, the assumption was, "If it's in my mind, I should be able to notice if I look." This ignored what was actually happening, which was that I was cutting the phone lines so my emotions couldn't talk to me in the first place. Actually, the phone lines metaphor is a bit off, here's a better one.

Parent-child model

My self-concept and conscious mind are the parent. Emotions are young children that run up to the parent to tell them something. Sometimes the child runs up to complain, "Heeeeeeeeeey I'm huuuuuuungry!" My emotional management was akin to the parenting style of slapping the child and saying, "Being hungry would suck, so you aren't hungry."

Yikes.

I know full well that you can't slap someone into having a full stomach, but you can slap someone into not bringing their complaints to you.

I've experienced this directly extend to my internal world. My emotions / sub-agents aren't stupid. They learned that telling me, "Hey, you're concerned about your relationship with your friend!", "Hey, we really don't like getting laughed at", "Hey, we're concerned that this bad thing is going to happen indefinitely" would result in getting slapped. So they learned to stay quiet.

This got to the point where I'd feel awesome and great during my busy week, and then "mysteriously" and "for no reason" feel an amorphous blob of gray badness on the weekends. I had various social and emotional needs that weren't being met, but I didn't realize that. I quite intensely tried to introspect to see if this gray blob was "about anything", but only heard quiet static. This was me being the angry parent with their kids having a dinner of half a slice of bread each, shouting, "Is anyone hungry?! Huh??! No? GREAT."

Owwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww

... and now?

When I was a kid, my desire to "not worry if it was useless" was mostly one of "people who worry seem to be in pain, I'd prefer to not be in pain." Over time, it turned into a judgmental worldview. How wasteful and useless to be embarrassed/worried/scared/etc. This was the transition from a naive parent telling their kid, "Hmmmm, have you tried not being hungry?" to the angry parent shouting, "You won't be hungry in my house!!" (One might wonder how exactly that transition from naive to judgmental happened. That's a whole other story for a different post.)

Over the past year I've haphazardly freestyled towards opening up emotional communication with myself, and I've made progress. I'm still not sure what "healthy emotional processing" looks like, but I've gotten HUGE gains from just being able to sit with the fact that I'm feeling something, and hug the child that brought that emotion instead of slapping them.

I guess the biggest thing I wanted to impart with this piece was 1. the parent child model, but also 2. that ignoring your emotions can start as a simple innocent mistake.

Related. A sentiment in a LW thread I read in the past few months was that the biggest barrier to rational discourse is creating environments where everyone feels safe thinking (not the same thing as a safe space). Extend that to the mind. The biggest barrier to rational thinking is organizing your mind such that it's safe to think. I still promote and admire "look towards the truth, even if it hurts", but I now see that if you don't spend enough resources on addressing that hurt, the hurt parts of yourself can and will take measures to protect themselves. Treat yourself well.




Discuss

Drive-By Low-Effort Criticism

July 31, 2019 - 14:30
Published on July 31, 2019 11:30 AM UTC

I'd like to point out a phenomenon that has predictable consequences, that people are seemingly unaware of:

The Drive-By Low-Effort Criticism.

To illustrate, we'll use:

The Relationship Between the Village and the Mission

That was a recent post by Raemon on here that, as far as I can tell, (1) took an incredible amount of work to make, (2) was incredibly pro-social, (3) was incredibly well-intentioned, and (4) is the class of thing that has massively high-upside if it works. It might not work, but it's the type of thing that should be encouraged.

To further break this down, the post was:

-> 5,444 words (!)

-> Deep-linked/cited 26 other posts (!)

-> Had analysis, context, tradeoffs, anticipation of likely questions or objections, actionable ideas, etc

-> Was nicely formatted for readability with headlines, appropriate bold and italics, etc etc.

-> Offered to do real-world things at author's time and expense to improve the real-world rationality community (!!!)

-> It even contained a motherfuckin' Venn diagram (!!!)

In any event, I think we can clearly say it was a high effort post. How many hours did it take? Jeez. More than 2 hours, for sure. More than 5 hours, very likely. More than 10 hours, probably. More than 20 hours? 30? Probably under a hundred hours, but heck, maybe not if you consider all the time thinking about the concepts.

Regardless — hours. Not minutes. High effort.

And this is clearly someone who cares immensely about what he's doing. And it's clearly well-intentioned. And it's humble. And it's... it's just great. Hats off to Raemon.

Now, it might not work. It might be unfeasible. But he's certainly putting in a big effort to see if something good is possible.

What are the top comments? Here's how the first one starts:

[ comment copied from Facebook / I didn't read the full article before making this comment ] i am somewhat anti-"Mission-centered Village."

Wait, you didn't read the full article before making a drive-by low-effort criticism of the concept?

Okay. Deep breath. What's the second top-level comment?

Here is my brain dump: I have mostly given up on the Berkeley rationality community as a possible village. I think the people who showed up here were mostly selected for being bad at villaging, and that the awful shit that's been happening around here lately is downstream of that.

*facepalm*

Raemon spent hours and hours making that post. He's working hard to do it.

The top two ranked replies are... well, that's what they are.

They elaborate a bit more — but even if both comments are correct, you know that has a huge adverse effect on people's willingness to post and contribute ideas, right? Right? You already know that, don't you? Don't you?

I mean, at the risk of overstating the case, I think this sort of behavior borders on criminal.

I know it doesn't feel malevolent, but the predictable end result is getting fewer contributions like that.

When a person takes a huge amount of time to attempt to make a contribution to a community, I think criticism should spend at least some time getting fully oriented around the idea, understanding and respecting the author's perspective, and looking to engage in dialog that'd be appreciated by the author.

That is, if you care about high-effort thoughtful pro-social contributions.

The Drive-By Low-Effort Criticism is easy to do.

It's easy to spot a potential hole in someone's argument.

It's easy to raise an objection.

It's easy to spout off whatever first comes to mind.

It's much harder to fully understand the author's point of view, their end goals, the amount of time they put in to do the work, to respect and acknowledge that effort, and to look to post something that rewards, encourages, and promotes that sort of behavior.

In fairness, both of those commenters had more to say, and are no doubt smart people. But, like, some basics of empathy, politeness, constructiveness, acknowledgement, and such are essential, if you care about getting high-effort contributions in the future.

It straight-up sucks for people to put in a ton of work and have the first thing they see in reply be negative and fast. Not fully edited. Not acknowledging the whole picture. Not a single word of respect or acknowledgement of all the work. Heck, not even written with proper punctuation. Worse yet, not even having read the full piece!

I know the counterarguments. Having weighed all of them, I still think this is essentially stupid and short-sighted behavior, and will result in a worse world.

Drive-By Low-Effort Criticism does damage. Think about it.



Discuss

Will autonomous cars be more economical/efficient as shared urban transit than busses or trains, and by how much? What's some good research on this?

July 31, 2019 - 03:16
Published on July 31, 2019 12:16 AM UTC

There's reason to think they would be. When people get into a shared public transit vehicle, usually, they're not all going to the same place. The vehicle has to take an indirect route, and it has to start and stop many times along the way to let people on and off. A person will often have to transfer between multiple lines along the way and that will often involve a bit of waiting. There are usually many routes that have very low utilisation rates: carriages will be mostly empty most of the time, in part due to the fact that you can't deploy half of a bus for low-use routes, in part due to the rigidity of the route scheduling.

Small self-driving vehicles, instead, can take a person straight where they need to go. The vehicle can then pick up someone else nearby and take them straight where they need to go, and so on, all day. They'll also benefit from economies of scale in production: the more units are produced, the cheaper each one becomes, so smaller units, produced in larger numbers, can be cheaper.

That's the vision I have in my head, anyway.

The main potential issue that I can see is commuting patterns. If a lot of people are going in the same direction at the same time, batching their trips together with larger densely packed shared vehicles might make sense.

But I'm not sure that's a legitimate way for cities to be. With affordable inner-city housing, mixed-use planning, this commuting pattern wouldn't pop up - people would be taking short trips in all different directions even during peak commuting times.

A lot of people seem to like living in suburbs (although I cannot personally understand why), and conventionally planned cities have land pricing issues that make living in them prohibitively expensive for most ordinary people who work in them (I will discuss a potential solution to this in a later post). These ideal mixed-use cities that I am mentally situated in might not exist, and might not come to exist before autonomous vehicles start to compete with existing public transit.

So I'd like to know what relationship suburb size has to the commuting pattern problem: how large can the suburbs get before fixed-route busses become more efficient? (And also, does any amount of suburbs make them more efficient?)

Visions of self-driving cars currently look like variants of the cars we have, improved by not needing a driver's seat or forward visibility. Often you'll see 8-seaters that are expected to be shared to some extent. I think that might be unrealistic; if it is, we need to see more analysis of single-occupant cars like the Toyota i-Road. Aside from using fewer materials and less energy to accelerate, they'd also have the advantage of being able to drive two to a lane, reducing congestion.

If we can answer these questions well, I think we will be much better informed about the future of urban transit than we are now.



Discuss

Forum participation as a research strategy

July 30, 2019 - 21:09
Published on July 30, 2019 6:09 PM UTC

Previously: Online discussion is better than pre-publication peer review, Disincentives for participating on LW/AF

Recently I've noticed a cognitive dissonance in myself, where I can see that my best ideas have come from participating on various mailing lists and forums (such as cypherpunks, extropians, SL4, everything-list, LessWrong and AI Alignment Forum), and I've received a certain amount of recognition as a result, but when someone asks me what I actually do as an "independent researcher", I'm embarrassed to say that I mostly comment on other people's posts, participate in online discussions, and occasionally a new idea pops into my head and I write it down as a blog/forum post of my own. I guess that's because I imagine it doesn't fit most people's image of what a researcher's work consists of.

Once I noticed this, the tension was easy to resolve: in this post I'm going to proclaim/endorse forum participation (aka commenting) as a productive research strategy that I've managed to stumble upon, and recommend it to others (at least to try). Note that this is different from saying that forum/blog posts are a good way for a research community to communicate. It's about individually doing better as researchers.

Benefits of Forum Participation (FP)

FP takes little effort / will power

In other words it feels more like play than work, which means I rarely have issues with not wanting to do something that I think is important to do (i.e., akrasia), the only exception being that writing posts seems to take more effort so occasionally I spend my time writing comments when I perhaps should write posts instead. (This is the part of this post that I think may be least likely to generalize to other people. It could be that I'm an extreme outlier in finding FP so low-effort. However it might also be the case that it becomes low effort for most people to write comments once they've had enough practice in it.)

FP is a good way to notice missing background knowledge and provides incentives to learn missing knowledge

If you read a post with an intention to question or comment on it, it's pretty easy to notice that it assumes some background knowledge that you lack. The desire to not ask a "stupid" question or make a "stupid" comment provides a powerful incentive to learn the missing knowledge.

FP is a good way to stay up to date on everyone else's latest research

It's often a good idea to stay up to date on other people's research, but sometimes one isn't highly motivated to do so. FP seems to make that easier. For example, I wasn't following Stuart's research on counterfactual oracles, until the recent contest drew my attention and desire to participate, and I ended up reading the latest posts on CO in order to understand the current state of the art on that topic, which turned out to be pretty interesting.

Arguments that are generated in reaction to some specific post or discussion can be of general value

It's not infrequent that I come up with an argument in response to some post or discussion thread, and later expand or follow up that argument into a post because it seems to apply more generally than to just that post/discussion. Here is one such example.

FP generates new ideas via cross-fertilization

FP incentivizes one to think deeply about many threads of research, and often (at least for me) an idea pops into my head that seems to combine various partial ideas floating in the ether into a coherent or semi-coherent whole (e.g., UDT), or is the result of applying or analogizing someone else's latest idea to a different topic (e.g., "human safety problem", "philosophy as high complexity class").

FP helps prepare for efficiently communicating new ideas

FP is a good way to build models of other people's epistemic states, and also a good way to practice communicating with fellow researchers, both of which are good preparation for efficiently communicating one's own new ideas.

My Recommendations

Comment more

To obtain the above benefits, one just has to write more comments. It may be necessary to first overcome disincentives to participate. If you can't, please speak up and maybe the forum admins will do something to help address whatever obstacle you're having trouble with.

Practice makes better

If it seems hard to write good comments, practice might make it easier eventually.

Think of FP as something to do for yourself

Some people might think of commenting as primarily providing a service to other researchers or to the research community. I suggest also thinking of it as providing a benefit to yourself (for the above reasons).

Encourage and support researchers who adopt FP as their primary research strategy

I'm not aware of any organizations that explicitly encourage and support researchers to spend most or much of their time commenting on forum posts. But perhaps they should, if it actually is (or has the potential to be) a productive research strategy? For example this could be done by providing financial support and/or status rewards for effective forum participation.



Discuss

Walkthrough: The Transformer Architecture [Part 1/2]

July 30, 2019 - 16:54
Published on July 30, 2019 1:54 PM UTC

This is the first post in a new sequence in which I walk through papers, concepts, and blog posts related to machine learning and AI alignment.

In the process of increasing my own understanding of these topics, I have decided to implement a well known piece of advice: the best way to learn is to teach. This sequence is therefore a culmination of what I have learned or reviewed very recently, put into a learning format.

My thoughts on various topics are not polished. I predict there will be mistakes, omissions, and general misunderstandings. I also believe that there are better places to learn the concepts which I walk through. I am not trying to compete with anyone for the best explanations. The audience I have in mind is a version of myself from a few weeks or days ago, and therefore some things which I take for granted may not be common knowledge to many readers. That said, I think this sequence may be useful for anyone who wants a deeper look at the topics which I present.

If you find something wrong, leave a comment. Just try to be respectful.

______________________________________________________________________

If you've been following machine learning or natural language processing recently, you will likely already know that the Transformer is currently all the rage. Systems which are based on the Transformer have set new records in natural language processing benchmarks. The GLUE benchmark leaderboard is currently filled with models which, as far as I can tell, are all based on the Transformer.

With the Transformer, we now have neural networks which can write coherent stories about unicorns in South America, and an essay about why recycling is bad for the world.

The paper describing the Transformer, Attention Is All You Need, is now the top paper on Arxiv Sanity Preserver, surpassing the popularity of GANs and residual nets. And according to Google Scholar, the paper has attracted 2588 citations (and counting).

So what makes the Transformer so powerful, and why is it such a break from previous approaches? Is it all just hype? Clearly not. But it is worth looking into the history behind the architecture first, in order to see where these ideas first came from. In this post, I give a rough sketch for what the transformer architecture looks like. In the following post, I will provide a detailed description of each step in the forward pass of the architecture. First, however, we have to know what we are looking at.

The heart of the Transformer is the attention mechanism, hence the name of the original paper. As I understand it, attention is a mechanism first designed to improve the way that recurrent neural networks understood text. I found this post helpful for providing an intuitive breakdown of how the attention mechanism works in pre-Transformer architectures.

The way I view attention in my head, and the way that many people illustrate it, is to imagine a table linking every word in a sentence to every other word in another sentence. If the two sentences are the same, then this is called self-attention. Attention allows us to see which parts of the sentence are relevant to the other parts. If there's a strong link between "it" and "car" then it's possible that this link is a way for the model to say that it is the car.

Consider the sentence "I am a human." Attention might look at this sentence and construct the following links:

The brighter the cell in the table, the more connected the two words are. Here, the exact meaning of the shades of grey isn't really important. All you need to notice is that there's some sort of relationship between words, and this relationship isn't identical between every word. It's not a symmetric relationship either. The exact way that attention is calculated will come later, which should shed light on why attention works the way it does.
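To give a feel for what such a table looks like as numbers, here is a toy sketch. The word vectors and the two projection matrices are random, made-up values, and scoring each pair of words with two different projections followed by a softmax is an assumption that anticipates the detailed calculation. The point is only that each row is a set of weights over the other words, and that the table need not be symmetric.

```python
import numpy as np

words = ["I", "am", "a", "human"]
rng = np.random.default_rng(0)
d = 4
E = rng.normal(size=(len(words), d))   # made-up word vectors, one row per word
W_q = rng.normal(size=(d, d))          # made-up projection for the row word
W_k = rng.normal(size=(d, d))          # made-up projection for the column word

scores = (E @ W_q) @ (E @ W_k).T       # one raw score per (row word, column word)
scores = scores - scores.max(axis=-1, keepdims=True)
table = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Print the table: each row shows how strongly that word links to every word.
for w, row in zip(words, table):
    print(f"{w:>6}", np.round(row, 2))
```

Because the row word and the column word go through different projections, the link from "I" to "human" can differ from the link from "human" to "I".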

If we were using a Transformer on the previous sentence, it might figure out that "I" was referring to "human." And in a sense, that's exactly what we want it to do. "I" and "human" are quite related, since they both point in the same direction. We will later see how we can use these linkages in a clever way to incorporate them into the Transformer architecture.

As a brief digression, I will note that the idea of coming up with a vague abstraction for how we want to interpret our neural networks is a concept that comes up repeatedly in deep learning. In the case of convolutional neural networks, for instance, we frequently describe neural networks as having some sort of structured model for the images we are classifying. This is why authors will sometimes talk about the model recognizing smaller pieces of the picture, like wheels, and then combining these smaller features into a coherent whole, such as a car.

The way that I think about attention is similar. The hope is that the neural network will be able to capture those parts of the text that are related to each other in some way. Perhaps the words that are strongly linked together are synonyms, or stand for each other in some way, such as because one of them is a pronoun of another word. No part of the links is hard-coded, of course. Finding out which words are related is where most of the learning happens.

Unlike previous approaches that use attention, the Transformer is unique in that it uses attention for virtually everything. Before, attention was something that was used in conjunction with another neural network, allowing some form of computation by the neural network on a pre-processed structured representation. The Transformer takes this concept further by repeatedly applying attention to an input, and relying very little on traditional feedforward neural networks to turn that input into something useful. Transformers still use regular neural networks; their importance is just diminished. And no RNNs or CNNs need be involved.

Other than attention, the other main thing to understand about the Transformer is its general structure, the encoder-decoder architecture. This architecture is not unique to the Transformer, and from what I understand, had been the dominant method of performing sequence-to-sequence modeling for a few years before the Transformer was published. Still, it is necessary to see how the encoder-decoder architecture works in order to get any idea how the Transformer does its magic.

Below, I have illustrated the basic idea.

Simple enough, but how does it work? That middle arrow is conveying something useful. This isn't just a deep architecture with a strange standard representation.

By thinking first about traditional neural networks, the justification for having this representation will become clear. In an RNN, we have the same neural network performing a computation several times on each part of a sequence, returning an output for each step. It is this sequential nature of the computation which limits the output of an RNN. If we want an RNN to translate a sentence in English with 5 words in it, to a sentence in French with 7 words in it, there seems to be no natural way to do it. This is because, intuitively, as we go along the input sequence, we can only translate each element in the input to one element in the output.

By contrast, the encoder-decoder mechanism gets around this by constructing two different networks which work together in a special way. The first network, the encoder, takes in an input and places that input into a very high dimensional space, the context. After the input has been translated into this high dimensional space, it is then put into a much lower dimension using the decoder. By having an initial ramp up to a high dimension, and then back down, we are free to translate between sequences of different lengths.

The exact way that it works is like this: in the decoder network, we first take in the context and an initial hidden state as inputs. We then repeatedly apply an RNN to these inputs, creating an output and passing along the hidden state information to the next step. We repeat this until the network has produced an end token, which indicates that it has reached the end of the translated text.
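The loop just described can be sketched in a few lines. This is only a toy under stated assumptions: the weights are random rather than trained, the decoder greedily picks the highest-scoring token, `END = 0` is a made-up id for the end token, and `max_len` is a safety cap I added so that an untrained network cannot loop forever. The point is only that the number of outputs is decided by the decoder, not by the number of inputs.

```python
import math
import random

END = 0  # hypothetical id for the end token

def rnn_step(weights, state, inputs):
    """One RNN step: mix the previous state with the new input, then squash."""
    w_state, w_input = weights
    size = len(state)
    return [
        math.tanh(
            sum(w_state[i][j] * state[j] for j in range(size))
            + sum(w_input[i][j] * inputs[j] for j in range(len(inputs)))
        )
        for i in range(size)
    ]

def encode(weights, input_vectors, size):
    """Run the encoder RNN over the whole input; the final state is the context."""
    state = [0.0] * size
    for vector in input_vectors:
        state = rnn_step(weights, state, vector)
    return state

def decode(weights, w_out, context, size, max_len=20):
    """Apply the decoder RNN until it emits END or max_len is reached."""
    state = [0.0] * size  # initial hidden state
    tokens = []
    for _ in range(max_len):
        state = rnn_step(weights, state, context)
        scores = [sum(w * s for w, s in zip(row, state)) for row in w_out]
        token = scores.index(max(scores))  # greedy pick
        if token == END:
            break
        tokens.append(token)
    return tokens

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
size, input_dim, vocab = 8, 3, 6
enc_weights = (random_matrix(size, size, rng), random_matrix(size, input_dim, rng))
dec_weights = (random_matrix(size, size, rng), random_matrix(size, size, rng))
w_out = random_matrix(vocab, size, rng)

# Five random vectors stand in for a five-word input sentence.
sentence = [[rng.gauss(0, 1) for _ in range(input_dim)] for _ in range(5)]
context = encode(enc_weights, sentence, size)
output = decode(dec_weights, w_out, context, size)
print(len(sentence), "inputs ->", len(output), "outputs")
```

Note that the context has the same size no matter how long the input is, which is what frees the decoder to produce a sequence of a different length.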

In the Transformer, we drop the RNN part, but keep the benefit of being able to map arbitrarily long sequences to other arbitrarily long sequences. Under the hood, the Transformer is really more of a stack of encoders and decoders, which are themselves composed of self-attention components and a feedforward neural network. The size of this stack is the main design choice involved in creating a Transformer, and contributes to the simplicity of tuning it.

Since there are so many small details in the Transformer, it's important to get a rough visual of how data is flowing through the network before you start to think about the exact computations that are being performed. We can look at the whole model like this.

The little stacked boxes I have put in encoder and decoder represent the layers in the network. The input is first fed up through the encoder, and then at the top step, we somehow provide some information to each layer of the decoder. Then, we start going through the decoder, at each step using the information we just got from the encoder.

Unlike an RNN, we do not share weights in each of these layers. The little boxes represent individual parts of the architecture that we are training separately. If we look under the hood at these little blocks, we find that encoder and decoder blocks are pretty similar.

These things are stacked inside the encoder and decoder, and feed upwards.

On the left, we have an encoder layer. This layer has two sublayers. The first layer is the aforementioned self-attention. Although I have not given the details yet as to how attention works, we can visualize this block as calculating the values of the table I showed above, followed by some computation which uses the numbers in the table to weight each of the words in the sequence. If this currently doesn't make sense, it should hopefully become more apparent once we get into the actual vector and matrix operations.
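As a small preview of that second half of the block, here is a sketch of what "using the numbers in the table to weight each of the words" could look like. The table and the word vectors are hand-picked, hypothetical numbers, and `apply_attention` is a name I made up for illustration; a real model would learn both the table and the vectors.

```python
def apply_attention(table, values):
    """Mix the word vectors: output[i] = sum over j of table[i][j] * values[j]."""
    dim = len(values[0])
    return [
        [sum(weight * value[d] for weight, value in zip(row, values)) for d in range(dim)]
        for row in table
    ]

# Hypothetical hand-picked numbers, standing in for learned ones.
words = ["I", "am", "a", "human"]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
table = [
    [0.5, 0.0, 0.0, 0.5],      # "I" attends equally to "I" and "human"
    [0.0, 1.0, 0.0, 0.0],      # "am" attends only to itself
    [0.0, 0.0, 1.0, 0.0],      # "a" attends only to itself
    [0.25, 0.25, 0.25, 0.25],  # "human" attends to every word equally
]

mixed = apply_attention(table, values)
for word, vector in zip(words, mixed):
    print(word, vector)
# The new vector for "I" is [1.0, 0.5]: half of "I" plus half of "human".
```

Each output vector is a blend of the input vectors, with the row of the table deciding how much of each word goes into the blend.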

On the right, we have a decoder layer. The decoder layer is almost identical to the encoder layer except that it adds one middle step. The middle step is the encoder-decoder attention. This sublayer is the one that's going to use the information carried over from the last step of the encoder layer.

Both the encoder and decoder layers feed into a neural network before shipping the values onto the next step.
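The data flow described above can be written down as a skeleton. Every sublayer here is a placeholder that only records its name and passes its input through unchanged, so this shows the order of operations and nothing about the actual computation. The function names, the `"<start>"` token, and the choice of two layers are all assumptions made for illustration.

```python
def self_attention(params, x, trace):
    trace.append(f"{params['name']}: self-attention")
    return x  # placeholder: a real sublayer would transform x

def encoder_decoder_attention(params, y, encoder_output, trace):
    trace.append(f"{params['name']}: encoder-decoder attention")
    return y  # placeholder: a real sublayer would mix in encoder_output

def feed_forward(params, x, trace):
    trace.append(f"{params['name']}: feed-forward")
    return x  # placeholder for the neural network at the end of each layer

def encoder_layer(params, x, trace):
    x = self_attention(params, x, trace)
    return feed_forward(params, x, trace)

def decoder_layer(params, y, encoder_output, trace):
    y = self_attention(params, y, trace)
    y = encoder_decoder_attention(params, y, encoder_output, trace)
    return feed_forward(params, y, trace)

def run(num_layers, source, target):
    trace = []
    # One separate parameter set per layer: nothing is shared, unlike an RNN.
    encoder_params = [{"name": f"encoder {i}"} for i in range(num_layers)]
    decoder_params = [{"name": f"decoder {i}"} for i in range(num_layers)]
    x = source
    for params in encoder_params:
        x = encoder_layer(params, x, trace)
    y = target
    for params in decoder_params:
        # Every decoder layer receives the output of the top encoder layer.
        y = decoder_layer(params, y, x, trace)
    return y, trace

_, trace = run(2, ["I", "am", "a", "human"], ["<start>"])
print("\n".join(trace))
```

Reading the printed trace from top to bottom gives the same path through the network as the picture: up through the encoder layers first, then through the decoder layers, each of which has the one extra middle step.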

If you're anything like me, looking at this very conceptual picture has probably left you eager to dive into a more concrete analysis in order to truly understand what's going on. For that, we'll need to go into each step, and look at the architecture like an algorithm, rather than a picture. Part 2 of this post will do just that.

Just keep in mind that as long as we have a rough idea of what's going on, it will make all the little matrix computations and notation a little less painful. In my opinion, learning benefits immensely from an abstract first pass through the material followed by a more detailed second pass. So with that, join me tomorrow as I unravel the deeper mysteries of the Transformer.




Conversation on forecasting with Vaniver and Ozzie Gooen

July 30, 2019 - 14:16
Published on July 30, 2019 11:16 AM UTC

[Cross-posted to the EA Forum]

This is a transcript of a conversation on forecasting between Vaniver and Ozzie Gooen, with an anonymous facilitator (inspired by the double crux technique). The transcript was transcribed using a professional service and edited by Jacob Lagerros.

I (Jacob) decided to record, transcribe, edit and post it as:

  • Despite an increase in interest and funding for forecasting work in recent years, there seems to be a disconnect between the mental models of the people working on it and the people who aren’t. I want to move the community’s frontier of insight closer to that of the forecasting subcommunity
  • I think this is true for many more topics than forecasting. It’s incredibly difficult to be exposed to the frontier of insight unless you happen to be in the right conversations, for no better reason than that people are busy, preparing transcripts takes time and effort, and there are no standards and unclear expected rewards for doing so. This is an inefficiency in the economic sense. So it seems good to experiment with ways of alleviating it
  • This was a high-effort activity where two people dedicated several hours to collaborative, truth-seeking dialogue. Such conversations usually look quite different from comment sections (even good ones!) or most ordinary conversations. Yet there are very few records of actual, mind-changing conversations online, despite their importance in the rationality community.
  • Posting things publicly online increases the surface area of ideas to the people who might use them, and can have very positive, hard-to-predict effects.

Introduction

Facilitator: One way to start would be to get a bit of both of your senses of the importance of forecasting, maybe Ozzie starting first. Why are you excited about it and what caused you to get involved?

Ozzie: Actually, would it be possible that you start first? Because there are just so many ...

Vaniver: Yeah. My sense is that predicting the future is great. Forecasting is one way to do this. The question for “will this connect to things being better” is the difficult part. In particular, Ozzie had this picture before of, on the one hand, data science-y repeated things that happen a lot, and then on the other hand judgement-style forecasting, a one off thing where people are relying on whatever models because they can't do the “predict the weather”-style things.

Vaniver: My sense is that most of the things that we care about are going to be closer to the right hand side and also most of the things that we can do now to try and build out forecasting infrastructures aren't addressing the core limitations in getting to these places.

Is infrastructure really what forecasting needs? (And clarifying the term “forecasting”)

Vaniver: My main example here is something like prediction markets are pretty easy to run but they aren't being adopted in many of the places that we'd like to have them for reasons that are not ... “we didn't get a software engineer to build them.” That feels like my core reason to be pessimistic about forecasting as intellectual infrastructure.

Ozzie: Yeah. I wanted to ask you about this. Forecasting is such a big type of thing. For one thing, we have maybe five to ten people doing timelines, direct forecasting, at OpenAI, OpenPhil and AI Impacts. My impression is that you're not talking about that kind of forecasting. You're talking about infrastructural forecasting where we have a formal platform and people making formalised things.

Vaniver: Yeah. When I think about infrastructure, I'm thinking about building tooling for people to do work in a shared space as opposed to individual people doing individual work. If we think about dentistry or something, what dentists' infrastructure would look like is very different from people actually modifying mouths. It feels to me like OpenAI and similar people are doing more of the direct style work than infrastructure.

Ozzie: Yeah okay. Another question I have is about something like a lot of trend extrapolation stuff, e.g. for particular organizations, “how much money do you think they will have in the future?” Or for LessWrong, “how many posts are going to be there in the future?” and things like that. There's a lot of that happening. Would you call that formal forecasting? Or would you say that's not really tied to existing infrastructure and they don't really need infrastructure support?

Vaniver: That's interesting. I noticed earlier I hadn't been including Guesstimate or similar things in this category because that felt to me more like model building tools or something. What do I think now ...

Vaniver: I'm thinking about two different things. One of them is the “does my view change if I count model building tooling as part of this category, or does that seem like an unnatural categorization?” The other thing that I'm thinking about is if we have stuff like the LessWrong team trying to forecast how many posts there will be… If we built tools to make that more effective, does that make good things happen?

Vaniver: I think on that second question the answer is mostly no because it's not clear that it gets them better counterfactual analysis or means they work on better projects or something. It feels closer to ... The thing that feels like it's missing there is something like how them being able to forecast how many posts there will be on LessWrong connects to whether LessWrong is any good.

Fragility of value and difficulty of capturing important uncertainties in forecasts

Vaniver: There was this big discussion that happened recently about what metric the team should be trying to optimize for the quarter. My impression is this operationalization step connected people pretty deeply to the fact that the things that we care about are actually just extremely hard to put numbers on. This difficulty will also be there for any forecasts we might make.

Ozzie: Do you think that there could be value in people in the EA community figuring out how to put numbers on such things? For instance, like groups evaluate these things in the future in formal ways. Maybe not for LessWrong but for other kinds of projects.

Vaniver: Yeah. Here I'm noticing this old LessWrong post…. Actually I don't know if this was one specific post, but this claim of the “fragility of value” where it's like “oh yeah, in fact the thing that you care about is this giant mess. If you drill it down to one consideration, you probably screwed it up somehow”. But it feels like even though I don't expect you to drill it down to one consideration, I do think having 12 is an improvement over having 50. That would be evidence of moral progress.

Ozzie: That’s interesting. Even so, the agenda that I've been talking about is quite broad. It's very much a lot of interesting things. A combination of forecasting and better evaluations. For forecasting itself, there are a lot of different ways to do it. That does probably mean that there is more work for us to do back and forth with specific types and their likelihood, which makes this a bit challenging. It'll give you a wide conversation.

Ben G: Is it worth going over the double cruxing steps, the general format? I'm sorry. I'm not the facilitator.

Vaniver: Yeah. What does our facilitator think?

Facilitator: I think you're doing pretty good and exploring each other's stuff. Pretty cool... I'm also sharing a sense that forecasting has been replaced with a vague “technology” or something.

Ozzie: I think in a more ideal world we'd have something like a list of every single application and for each one say what are the likelihoods that I think it's going to be interesting, what you think is going to be interesting, etc.

Ozzie: We don't have a super great list like that right now.

Vaniver: I'm tickled because this feels like a very forecasting way to approach the thing where it's like “we have all these questions, let's put numbers on all of them”.

Ozzie: Yeah of course. What I'd like to see, what I'm going for, is a way that you could formally ask forecasters these things.

Vaniver: Yeah.

Ozzie: That is a long shot. I'd say that's more on the experimental side. But if you could get that to work, that'd be amazing. More likely, that is something that is kind of infrequent.

Vaniver’s conceptual model of why forecasting works

Vaniver: When I think about these sorts of things, I try to have some sort of conceptual model of what's doing the work. It seems to me the story behind forecasting is there's a lot of, I'm going to say, intelligence for hire out there and that the thing that we need to build is this marketplace that connects the intelligence for hire and the people who need cognitive work done. The easiest sorts of work for us to use it for are these predictions about the future because it's easy to verify later and ....

Vaniver: I mean the credit allocation problem is easy because everyone who moved the prediction in a good direction gets money and everyone who moved it in the wrong direction loses money. Whereas if we're trying to develop a cancer drug and we do scientific prizes, it may be very difficult to do the credit allocation for “here's a billion dollars for this drug”. Now all the scientists who made some sort of progress along the way figure out who gets what of that money.

Vaniver: I'm curious how that connects with your conception of the thing. Does that seem basically right or you're like there's this part that you're missing or you would characterize differently or something?

Ozzie: Different aspects about it. One is I think that's one of the possible benefits. Hypothetically, it may be one of the main benefits. But even if it's not an actual benefit, even if it doesn't come out to be true, I think that there are other ways that this type of stuff would be quite useful.

Background on prediction markets and the Good Judgement Project

Ozzie: Also to stand back a little bit, I'm not that excited about prediction markets in a formal way. My impression is that A) they're not very legal in the US, and B), it's very hard to incentivize people to forecast the right questions. Then C), there are issues where, in a lot of these forecasting systems, you have people that want private information and stuff. There's a lot of nasty things with those kinds of systems. They could be used for some portion of this.

Ozzie: The primary area that I'm more interested in is forecasting applications similar to Metaculus and PredictionBook and one that I'm working on right now. More, they're working differently. Basically, people build up good reputations by having good track records. Then there's basically a variety of ways to pay people. The Good Judgement Project does it by basically paying people a stipend. There are around 125 super forecasters who work on specific questions for specific companies. I think you pay like $100,000 to get a group of them.

Ozzie: Just a quick question, are you guys familiar with how they do things in specific? Not many people are.

Ozzie: Maybe one of the most interesting examples of paid forecasters which was similar to this. For them, they basically have the GJP Open where they find the really good forecasters. Then those become the super forecasters. There are about 200 of these; 125 are the ones that they're charging other companies for.

Vaniver: Can you paint me more of a picture of who is buying the forecasting service and what they're doing it for?

Ozzie: Yeah. For one thing, I'll say that this area is pretty new. This is still on the cutting edge and small. OpenPhil bought some of their questions ... I think they basically bought one batch. The questions I know about them asking were things like “what are the chances of nuclear war between the US and Russia?” “What are the chances of nuclear war between different countries?” where one of the main ones was Pakistan and India. Also specific questions about outcomes of interventions that they were sponsoring. OpenPhil already internally does forecasting on most of its grant applications. When a grant is made internally they would have forecasts about how well it's going to do and they track that. That is a type of forecasting.

Ozzie: The other groups that use them are often businesses. There are two buckets in how that's useful. One of them is to drive actual answers. A second one is to get the reasoning behind those answers. A lot of times what happens -- although it may be less useful for EAs -- is that these companies maybe do not have optimal epistemologies, but instead have systematic biases. They basically purchase this team of people who do provably well at some of these types of questions. Those people would have discussions about their kinds of reasoning. Then they find their reasoning interesting.

Vaniver: Yeah. Should I be imagining an oil company deciding whether to build a bunch of wells in Ghana and has decided that they just want to outsource the question of what's the political environment in Ghana going to be for the next 10 years?

Ozzie: That may be a good interpretation. Or there'd be the specific question of what's the possibility that there'll be a violent outbreak.

Vaniver: Yeah. This is distinct from Coca Cola trying to figure out which of their new ad campaigns would work best.

Ozzie: This is typically different. They've been focused on political outcomes mostly. That comes in assuming that they were working with businesses. A lot of GJP stuff is covered by NDA so we can't actually talk about it. We don't have that much information.

Ozzie: My impression is that some groups have found it useful and a lot of businesses don't know what to do with those numbers. They get a number like 87% and they don't have ways to directly make that interact with the rest of their system.

Ozzie: That said, there are a lot of nice things about that hypothetically. Of course some of it does come down to the users. A lot of businesses do have pretty large biases. That is a known thing. It's hard to know if you have a bias or not. Having a team of people who has a track record of accuracy is quite nice if you want to get a third party check. Of course another thing for them is that it is just another way to outsource intellectual effort.

Positive cultural externalities of forecasting AI

Facilitator: Vaniver, is this changing your mind on anything essentially important?

Vaniver: The thing that I'm circling around now is a question closer to “in what contexts does this definitely work?” and then trying to build out from that to “in what ways would I expect it to work in the future?”. For example here, Ozzie didn't mention this, but a similar thing that you might do is have pundits just track their predictions or somehow encourage them to make predictions that then feed into some reputation score where it may matter in the future. The people who consistently get economic forecasts right actually get more mindshare or whatever. There's versions of this that rely on the users caring about it and then there are other versions that rely less on this.

Vaniver: The AI related thing that might seem interesting is something like 2.5 years ago Eliezer asked this question at the Asilomar AI conference which was “What's the least impressive thing that you're sure won't happen in two years?” Somebody came back with the response of “We're not going to hit 90% on the Winograd Schema.” [Editor’s note: the speaker was Oren Etzioni] This is relevant because a month ago somebody hit 90% on the Winograd Schema. This turned out to have been 2.5 years after the thing. This person did successfully predict the thing that would happen right after the deadline.

Vaniver: I think many people in the AI space would like there to be this sort of sense of “people are actually trying to forecast near progress”. Or sorry. Maybe I should say medium term progress. Predicting a few years of progress is actually hard. But it's categorically different from three months. You can imagine something where people building up the infrastructure to be good at this sort of forecasting actually makes the discourse healthier in various ways and gives us better predictions of the future.

Importance of software engineering vs. other kinds of infrastructure

Vaniver: Also in having some amount of question of how much of this is infrastructure and how much of this is other things. For example when we look at the Good Judgement Project I feel like the software engineering is a pretty small part of what they did as compared to the selection effects. It may still be the sort of thing where we're talking about infrastructure, though we're not talking about software engineering.

Vaniver: The fact that they ran this tournament at all is the infrastructure, not the code underneath the tournaments. Similarly, even if we think about a Good Judgment Project for research forecasting in general, this might be the sort of cool thing that we could do. I'm curious how that landed for you.

Ozzie: There's a lot of stuff in there. One thing is that on the question of “can we just ask pundits or experts”, I think my prior is that that would be a difficult thing, specifically in that in “Expert Political Judgment” Tetlock tried to get a lot of pundits to make falsifiable predictions and none of them wanted to ...

Vaniver: Oh yeah. It's bad for them.

Facilitator: Sorry. Can you tell me what you thought were the main points of what Vaniver was just saying then?

Ozzie: Totally. Some of them ...

Facilitator: Yeah. I had a sense you might go "I have a point about everything he might have said so I'll say all of them" as opposed to the key ones.

Ozzie: I also have to figure out what he said in that last bit as opposed to the previous bit. It's one of them. There's a question. Most recent when it comes to the Good Judgment Project, how much of it was technology versus other things that we did?

Ozzie: I have an impression that you're focused on the AI space. You do talk about the AI space a lot. It's funny because I think we're both talking a bit on points that help the other side, which is kind of nice. You mentioned one piece where prediction was useful in the AI space. My impression is that you're skeptical about whether we could get a lot more wins like that, especially if we tried to do it with a more systematic effort.

Vaniver: I think I actually might be excited about that instead of skeptical. We run into similar problems as we did with getting pundits to predict things. However, the things that are going on with professors and graduates and research scientists are very different from the thing that's going on with pundits and newspaper editors and newspaper readers.

Vaniver: Also it ties into the ongoing question of “is science real?” that the psychology replication stuff is connected to. Many people in computer science research in particular are worried about bits of how machine learning research is too close to engineering or too finicky in various ways. So I could imagine a "Hey, will this paper replicate?"-market catching on in computer science. I imagine getting from that to a “What State-of-the-Arts will fall when?”-thing. That also seems quite plausible that we could make that happen.

Ozzie: I have a few points now that connect to that. On pundits and experts, I think we probably agree that pundits often can be bad. Also experts often are pretty bad at forecasting it seems. That's something that's repeatable.

Ozzie: For instance in the AI expert surveys, a lot of the distributions don't really make sense with each other. But the people who do seem to be pretty good are the specific class of forecasters, specifically ones that we have evidence for, that's really nice. We only have so many of them right now but it is possible that we can get more of them.

Ozzie: It would be nice for more pundits to be more vocal about this stuff. I think Kelsey at Vox with their Future Perfect group is talking about making predictions. They've done some. I don't know how much we'll end up doing.

Privacy

Ozzie: When it comes to the AI space, there are questions about “what would interesting projects look like right now?” I've actually been dancing around AI in part because I could imagine a bad world or possibly a bad world where we really help make it obvious what research directions are exciting and then we help speed up AI progress by five years and that could be quite bad. Though, managing to do that in an interesting way could be important.

Ozzie: There are other questions about privacy. There's the question of “is this interesting?”, and the question of "conditional on it being kind of interesting. Should we be private about it?" We're right now playing for that first question.

Orgs using internal prediction tools, and the action-guidingness of quantitative forecasts

Ozzie: Another thing I'd like to bring into this discussion is that a lot of it right now is already being systemized. They say when you are an entrepreneur or something and try to build a tool it's nice to find that there are already internal tools. A lot of these groups are making internal systematic predictions at this point. They're just not doing it using formal methods.

Ozzie: Some examples: OpenPhil formally specifies a few predictions for grants. OpenAI also has a cluster of Slack spreadsheets, Google docs and stuff where they formalize questions, ask them to people, get answers and then track those answers. These are people at OpenAI who are ML experts basically. That's a decent sized thing.

Ozzie: There are several other organizations that are using internal forecasting for calibration. It's just a fun game that forces them to get a sense of what calibration is like. Then for that there are questions of “How useful is calibration?”, “Does it give you better calibration over time?”

Ozzie: Right now none of them are using PredictionBook. We could also talk a bit about ... I think that thing is nice and shows a bit of promise. It may be that there are some decent wins to be done by making better tools for those people which right now aren’t using any specific tools because they looked at them and found them to be inadequate. It's also possible that even if they did use those tools it'd be a small win and not a huge win. That's one area where there could be some nice value. But it's not super exciting so I don't know if you want to push back against that and say "there'll be no value in that."

Vaniver: There I'm sort of confused. What are the advantages to making software as a startup where you make companies' internal prediction tools better? This feels similar to Atlassian or something where it's like "yeah, we made their internal bug reporting or other things better". It's like yeah, sure, I can see how this is valuable. I can see how I’d make them pay for it. But I don't see how this is ...

Vaniver: ...a leap towards the utopian goals if we take something like Futarchy or ... in your initial talk you painted some pictures of this is how in the future if you had much more intelligence or much more sophisticated systems you could do lots of cool things. [Editor’s note: see Ozzie’s sequence “Prediction-Driven Collaborative Reasoning Systems” for background on this] The software as a service vision doesn't seem like it gets us all that much closer and also feels like it's not pushing at the hardest bit which is something like “getting companies to adopt it”-thing. Or maybe what I think there is something like that the organizations themselves have to be structured very differently. It feels like there's some social tech.

Ozzie: When you say very differently, do you mean very differently? Right now they're already doing some predictions. Do you mean very differently for like predictions would be a very important aspect of the company? Because right now it is kind of small.

Vaniver: My impression is something like going back to your point earlier about looking back at answers like 87% and they won't really know what to do with it. Similarly, I was in a conversation with Oli earlier about whether or not organizations had beliefs or world models. There's some extent to which the organization has a world model that doesn't live in a person's head. It's going to be something like its beliefs are these forecasts on all these different questions and also the actions that the organization takes are just driven by those forecasts without having a human in the loop, where it feels to me right now often the thing that will happen is some executive will be unsure about a decision. Maybe they'll go out to the forecasters. The forecasters will come back with 87%. Now the executive is still making the decision using their own mind. Whether or not that “87%” lands as “the actual real number 0.87” or something else is unclear, or not sensibly checked, or something. Does that make sense?

Ozzie: Yeah. Everything's there. Let's say that ... the 87% example is something that A) comes up if you're a bit naïve about what you want and B) comes up depending on how systematic your organization is about using numbers for things. If you happen to have a model of what the 87% is, that could be quite valuable. We see that different organizations are on different parts of the spectrum. Probably the one that's most intense about this is GiveWell. GiveWell has their multiple gigantic sheets of lots of forecasts essentially. It's possible that it'll be hard to make tooling that'll be super useful to them. I've been talking with them. There's experiments to be tried there. They're definitely in the situation where, as specific things change, they may change decisions and they'll definitely change recommendations.

Ozzie: Basically they have this huge model where people estimate a bunch of parameters about moral decision making and a lot of other parameters about how well the different interventions are going to do. Out of all of that comes recommendations for what the highest expected values are.
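Editor's sketch: the kind of model Ozzie describes can be illustrated with a toy calculation. This is not GiveWell's actual model; the function name, the outcome names, and every number below are hypothetical and chosen only to show how estimated parameters feed into a ranking, and how changing one parameter can change the recommendation.

```python
def rank_interventions(moral_weights, interventions):
    """Rank interventions by estimated value per dollar, highest first.

    moral_weights: value assigned to one unit of each outcome (hypothetical).
    interventions: for each intervention, the estimated amount of each
                   outcome produced per dollar (hypothetical).
    """
    scores = {}
    for name, outcomes_per_dollar in interventions.items():
        scores[name] = sum(
            moral_weights[outcome] * amount
            for outcome, amount in outcomes_per_dollar.items()
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# All parameter values are made up for illustration.
moral_weights = {"life_saved": 100.0, "income_doubling_year": 1.0}
interventions = {
    "intervention_a": {"life_saved": 0.0002},
    "intervention_b": {"income_doubling_year": 0.01, "life_saved": 0.00005},
}

ranking = rank_interventions(moral_weights, interventions)

# Changing a single moral parameter is enough to flip the top recommendation.
updated_weights = dict(moral_weights, income_doubling_year=2.0)
ranking_after_update = rank_interventions(updated_weights, interventions)
```

With these made-up numbers, `intervention_a` ranks first under the original weights and `intervention_b` ranks first after the update, which is the sense in which a change in a specific estimate can change a recommendation.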

Ozzie: That said, they are also in the domain that's probably the most certain of all the EA groups in some ways. They're able to do that more. I think the Open AI is probably a little bit... I haven't seen their internal models but my guess is that they do care a lot about the specifics of the numbers and also are more reasonable about what to do with them.

Ozzie: I think the 87% example is a case of most CEOs don't seem to know what a probability distribution is but I think the EA groups are quite a bit better.

Vaniver: When I think about civilization as a whole, there’s a disconnect between groups that think numbers are real and groups that don't think numbers are real. There's some amount of "ah, if we want our society to be based on numbers are real, somehow we need the numbers-are-real-orgs to eat everyone else. Or successfully infect everyone else.”

Vaniver’s steelman of Ozzie

Vaniver: What's up?

Facilitator: Vaniver, given what you can see from all the things you discussed and touched on in the forecasting space, I wonder if you had some sense of the thing Ozzie is working on. If you imagine yourself actually being Ozzie and doing the things that he's doing, I'm curious what are the main things that feel like you don't actually buy about what he's doing.

Vaniver: Yeah. One of the things ... maybe this is fair. Maybe this isn't. I've rounded it up to something like personality difference where I'm imagining someone who is excited about thinking about this sort of tool and so ends up with “here's this wide range of possibilities and it was fun to think about all of them, but of the wide range, here's the few that I think are actually good”.

Vaniver: When I imagine dropping myself into your shoes, there's much more of the ... for me, the “actually good” is the bit that's interesting (though I want to consider much of the possibility space for due diligence). I don't know if that's actually true. Maybe you're like, "No. I hated this thing but I came into it because it felt like the value is here."

Ozzie: I'm not certain. You're saying I wasn't focused on ... this was a creative ... it was enjoyable to do and then I was trying to rationalize it?

Vaniver: Not necessarily rationalize but I think closer to the exploration step was fun and creative. Then the exploitation step of now we're actually going to build a project for these two things was guided by the question of which of these will be useful or not useful.

How to explore the forecasting space

Vaniver: When I imagine trying to do that thing, my exploration step looks very different. But this seems connected to this because there's still some amount of everyone having different exploration steps that are driven by their interests. Then also you should expect many people to not have many well-developed possibilities outside of their interests.

Vaniver: This may end up being good to the extent that people do specialize in various ways. If we just randomly reassigned jobs to everyone, productivity would go way down. But this thing where the interests matter. You should actually only explore things that you find interesting makes sense. There's a different thing where I don't think I see the details of Ozzie's strategic map for something in the sense of “Here's the long term north star type things that are guiding us.” The one bit that I've seen that was medium term was the “yep, we could do the AI part testing stuff but it is actually unclear whether this is speeding up capabilities more than it's useful”. How many years is a “fire alarm for general intelligence” worth? [Editor’s note: Vaniver is referring to this post by Eliezer Yudkowsky] Maybe the answer to that is “0” because we won't do anything useful with the fire alarm even if we had it.

Facilitator: To make sure I followed, the first step was: you have a sense of Ozzie exploring a lot of the space initially and now it's exploiting some of the things you think may be more useful. But you wouldn't have explored it that way yourself potentially because you wouldn't really have felt differently that there would have been something especially useful to find if you continued exploring?

Facilitator: Secondly, you're also not yet sufficiently sold on the actual medium term things to think that the exploiting strategies are worth taking?

Vaniver: “Not yet sold” feels too strong. I think it's more that I don't see it. Not being sold implies something like ... I would normally say I'm not sold on x when I can see it but I don't see the justification for it yet where here I don't actually have a crisp picture of what seven year success looks like.

Facilitator: Ozzie which one of those feels more like "Argh, I just want to tell Vaniver what I'm thinking now"?

Ozzie: So on exploration and exploitation. On the one hand, not that much time or resource is going into this yet. Maybe a few full-time months to think about it and then several for making webapps. Maybe that was too much. I think it wasn't.

Ozzie: The amount of variety of types of proposals that are on the table right now compared to when I started I'm pretty happy with for like a few months of thinking. Especially since for me to get involved in AI would have taken quite a bit more time of education and stuff. It did seem like a few cheap wins at this point. I still kind of feel like that.

Importance and neglectedness of forecasting work

Ozzie: I also do get the sense that this area is still pretty neglected.

Vaniver: Yeah. I guess in my mind neglecting is both people aren't working on it and people should be working on it. Is that true for you also?

Ozzie: There are three aspects: importance, tractability, and neglectedness. It could be neglected but not important. I'm saying it's neglected.

Vaniver: Okay. You are just saying that people aren't working on it.

Ozzie: Yeah. You can talk about then the questions of importance and tractability.

Facilitator: I feel like there are a lot of things that one can do. One can, like, try to start a group house in Cambridge, one can try and teach rationality at the FHI. Forecasting ... something about "neglected" doesn't feel like it quite gets at the thing because the space is sufficiently vast.

Ozzie: Yeah. The next part would be importance. I obviously think that it's higher in importance than a lot of the other things that seem similarly neglected. Let's say basically the combination of importance, neglectedness and tractability was pretty good for forecasting. I'm happy to spend a while getting into that.

Tractability of forecasting work

Vaniver: I guess I actually don't care all that much about the importance because I buy if we could ... in my earlier framing, we move everyone to a "numbers-are-real" organization. That would be excellent. The thing that I feel most doomy about is something like the tractability where it feels like most of the wins that people were trying to get before turned out to be extremely difficult and not really worth it. I'm interested in seeing the avenues that you think are promising in this regard.

Ozzie: Yeah. It's an interesting question. I think a lot of people have the notion that we've had tons and tons of attempts at forecasting systems since Robin Hanson started talking about prediction markets. All of those have failed, therefore prediction markets have failed and it's not worth spending another person on it, and it's like a heap of dead bodies.

Ozzie: The viewpoint that I have where it definitely doesn't look that way, for one thing, the tooling. If you actually look at a lot of the tooling that's been done, a lot of it is still pretty basic. One piece of evidence for that is the fact that almost no EA organizations are using it themselves.

Ozzie: That could also be that it's really hard to make good tooling. If you look at it, basically if you look at non-prediction market systems, in terms of prediction markets there were also a few attempts. But the area is kind of illegal. Like I said, there are issues with prediction markets.

Ozzie: If you look at non prediction market tournament applications. Basically you have a few. The GJP doesn't make their own. They've used Cultivate Labs. Now they're starting to try and make their own. But the GJP people are political scientists and stuff, not developers.

Ozzie: A lot of experiments they've done are political. It's not like engineering questions about how there'd be an awesome engineering infrastructure. My take on that is if you put some really smart engineer/entrepreneur in that type of area, I'd expect them to have a very different approach.

Vaniver: There's a saying from Nintendo: "if your game is not fun with programmer art, it won't be fun in the final product" or something. Similarly, I can buy that there's some minimum level of tooling that we need for these sorts of forecasts to be sensible at all. But it feels to me that if I expected forecasting to be easy in the relevant ways, the shitty early versions would have succeeded without us having to build later good versions.

Ozzie: There's a question of what "enough" is. They definitely have succeeded to some extent. PredictionBook has been used by Gwern and a lot of other people. Some also use their own setups and Metaculus and stuff... So you can actually see a decent amount of activity. I don't see many other areas that have nearly that level of experimentation. There are very few other areas that are being used to the extent that predictions are used that we could imagine as future EA web apps.

Vaniver: The claim that I'm hearing there is something like “I should be comparing PredictionBook and Metaculus and similar things to reciprocity.io or something, as this is just a web app made in their spare time and if it actually sees use that's relevant”.

Ozzie: I think that there's a lot of truth to that, though it may not exactly be the case. Maybe we're a bit past reciprocity.

Vaniver: Beeminder also feels like it's in this camp to me, although less EA-specific.

Ozzie: Yeah. Or like Anki.

Ozzie: Right.

Technical tooling for Effective Altruism

Ozzie: There's one question which is A) do we think that there's room for technical tooling around Effective Altruism? B) if there is, what are the areas that seem exciting? I don't see many other exciting areas. Of course, that is another question. If you think ... that's not exactly depending on forecasting... but more like, if you don't like forecasting, what do you like? Because there's a conclusion that we just don't like EA tools and there's almost nothing in the space. Because there's not much more that seems obviously more exciting. But there's a very different side to the argument.

Vaniver: Yeah. It's interesting because on the one hand I do buy the frame of it might make sense to just try to make EA tools and then to figure out what the most promising EA tool is. Then also I can see the thing going in the reverse direction which is something like if none of the opportunities for EA tools are good then people shouldn't try it. Also if we do in fact come up with 12 great opportunities for EA tools this should be a wave of EA grants or whatever.

Vaniver: I would be excited about something double crux-shaped. But I worry this runs into the problem that argument mapping and mind mapping have all run into before. There's something that's nice about doing a double crux which makes it grounded out in the trace that one particular conversation takes as opposed to actually trying to represent minds. I feel like most of the other EA tools would be ... in my head it starts as silly one-offs. I'm thinking of things like for the 2016 election there was a vote-swapping thing to try to get third party voters in swing states to vote for whatever party in exchange for third party votes in safe states. I think Scott Aaronson promoted it but I don't think he made it. But it feels to me like that sort of thing. We may end up seeing lots of things like that where it's like “if we had software engineers ready to go, we would make these projects happen”. Currently I expect it's sufficient that people do that just for the glory of having done it. But the Beeminder style things are more like, “oh yeah, actually this is the sort of thing where if it's providing value then we should have people working for it and the people will be paid by the value they're providing”. Though that move is a bit weird because that doesn't quite capture how LessWrong is being paid for...

Ozzie: Yeah. Multiple questions on that. This could be a long winding conversation. One would be “should things like this be funded by the users or by other groups?”

Ozzie: One thing I'd say that ... I joined 80000 Hours about four years ago. I worked with them to help them with their application and decided at that point that it should be much less of an application and more of like a blog. I helped them scale it down.

Ozzie: I was looking for other opportunities to make big EA apps. At that point there was not much money. I kind of took a detour and I'm coming back to it in some ways. In a way I've experienced this with Guesstimate, which has been used a bit. Apps for Effective Altruism have advantages and disadvantages. One disadvantage is that writing software is an expensive thing. An advantage is that it's very tractable. By tractable I mean you could say “if I spent $200,000 and three engineer years I could expect to get this thing out”. Right now we are in a situation where we do have hypothetically a decent amount of money if it could beat a specific bar. The programmers don't even have to be these intense EAs (although it is definitely helpful).

5-MIN BREAK

Tractability of forecasting within vs outside EA

Ozzie: I feel like we both kind of agree, that, hypothetically, if a forecasting system was used and people decided it was quite useful, and we could get to the point that EA orgs were making decisions in big ways with it, that could be a nice thing to have. But there’s disagreement about whether that’s an existing possibility, and whether existing evidence shows us that won’t happen.

Vaniver: I’m now also more excited about the prospects of this for the EA space. Where I imagine a software engineer coming out of college saying “My startup idea is prediction markets”, and my response is “let’s do some market research!” But in the EA space the market research is quite different, because people are more interested in using the thing, and there’s more money for crazy long-shots… or not crazy long-shots, but rather, “if we can make this handful of people slightly more effective, there are many dollars on the line”.

Ozzie: Yeah.

Vaniver: It’s similar to a case where you have this obscure tool for Wall Street traders, and even if you only sell to one firm you may just pay for yourself.

Ozzie: I’m skeptical whenever I hear an entrepreneur saying “I’m doing a prediction market thing”. It’s usually crypto related. Interestingly most prediction platforms don’t predict their own success, and that kind of tells you something…

(Audience laughter)

Vaniver: Well this is just like the prediction market on “will the universe still exist”. It turns out it’s just asymmetric who gets paid out.

Medium-term goals and lean startup methodology

Facilitator: Vaniver, your earlier impression was you didn’t have a sense what medium term progress would look like?

Vaniver: It’s important to flag that I changed my mind. When I think about forecasting as a service for the EA space, I’m now more optimistic, compared to when I think of it as a service on the general market. It’s not surprising OpenPhil bought a bunch of Good Judgement forecasters. Whereas it would be a surprise if Exxon bought GJP questions.

Vaniver: Ozzie do you have detailed visions of what success looks like in several years?

Ozzie: I have multiple options. The way I see is that… when lots of YC startups come out they have a sense that “this is an area that seems kind of exciting”. We kind of have evidence that it may be interesting, and also that it may not be interesting. We don’t know what success looks like for an organisation in this space, though hopefully we’re competent and we could work quickly to figure it out. And it seems things are exciting enough for it to be worth that effort.

Ozzie: So AirBnB and the vast majority of companies don’t have a super clear idea of how it’s going to be useful when they start. But they do have good inputs, and a vague sense of what kind of cool outputs would be.

Ozzie: There’s evidence that statistically this seems to be what works in startup land.

Ozzie: Some of the evidence against. There was a question of “if you have a few small things that are working but are not super exciting, does that make it pretty unlikely you’ll see something in this space?”

Ozzie: It would be hard to make a strong argument that YC wouldn’t find any companies in such cases. They do fund things without any evidence of success.

Vaniver: But also if you’re looking for moonshots, mild success the first few times is evidence against “the first time it just works and everything goes great”.

Limitations of current forecasting tooling

Ozzie: Of course in that case your question is of exactly what it is that’s been tried. I think there are arguments that there are more exciting things on the horizon which haven’t been tried.

Ozzie: Now we have PredictionBook, Metaculus, and hypothetically Cultivate Labs and another similar site. Cultivate Labs does enterprise gigs, and are used by big companies like Exxon for ideation and similar things. They’re a YC company and have around 6 people. But they haven’t done amazingly well. They’re pretty expensive to use. At this point you’d have to spend around $400 for one instance per month. And even then you get a specific enterprise-y app that’s kind of messy.

Ozzie: Then if you actually look at the amount of work done on PredictionBook and Metaculus, it’s not that much. PredictionBook might have had 1-2 years of engineering effort, around 7 years ago. People think it’s cool, but not a serious site really. As for Metaculus, I have a lot of respect for their team. That project was probably around 3-5 engineering years.

Ozzie: They have a specific set of assumptions I kind of disagree with. For example, everyone has to post their questions in one main thread, and separate communities only exist by having subdomains. They’re mostly excited about setting up those subdomains for big projects.

Ozzie: So if a few of us wanted to experiment with “oh, let’s make a small community, have some privacy, and start messing around with questions” it’s hard to do that...

Vaniver: So what would this be for? Who wants their own instances? MMO guilds?

Jacob: Here’s one example of the simplest thing you currently cannot do. (Or could not do around January 1st 2019.) Four guys are hanging out, and they wonder “When will people next climb Mount Everest?” They then just want to note down their distributions for this and get some feedback, without having to specify everything in a Google doc or a spreadsheet which doesn’t have distributions.

Facilitator: Which bit breaks?

Jacob: You cannot set up small private channels for multiple people, which take 5 minutes to set up, where everyone records custom distributions.
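Editor's sketch: a minimal illustration of the missing feature Jacob describes, assuming each person's forecast is a normal distribution over "days until the event" and that feedback is given as a logarithmic score once the question resolves. The `PrivateChannel` class, the member names, and the numbers are hypothetical; none of this describes an existing tool.

```python
import math
from statistics import NormalDist


class PrivateChannel:
    """A small members-only group forecasting one question (hypothetical)."""

    def __init__(self, question, members):
        self.question = question
        self.members = set(members)
        self.forecasts = {}

    def record(self, member, mean, stdev):
        """Store a member's forecast as a normal distribution."""
        if member not in self.members:
            raise PermissionError(f"{member} is not in this channel")
        self.forecasts[member] = NormalDist(mean, stdev)

    def log_scores(self, outcome):
        """Log of the density each forecast assigned to the outcome.

        Higher is better. A forecast so far off that its density
        underflows to zero would raise ValueError in math.log.
        """
        return {
            member: math.log(dist.pdf(outcome))
            for member, dist in self.forecasts.items()
        }


channel = PrivateChannel(
    "How many days until people next climb Mount Everest?",
    members=["alice", "bob"],
)
channel.record("alice", mean=30, stdev=10)
channel.record("bob", mean=90, stdev=30)

# Suppose the question resolves at 35 days.
scores = channel.log_scores(outcome=35)
```

Here `alice` receives the higher score, because her distribution put more density near the resolved value. A real tool would need richer distribution types than a single normal.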

Vaniver: So I see what you can’t do. What I want is the group that wants to do it. For example, one of my housemates loves these sites, but also is the sort of nerd that loves these kinds of sites in general. So should I just imagine there’s some MIT fraternity where everyone is really into forecasting so they want a private domain?

Ozzie: I’d say there's a lot of uncertainty. A bunch of groups may be interested, and if a few are pretty good and happen to be excited, that would be nice. We don’t know who those are yet, but we have ideas. There are EA groups now. A lot of them are kind of already doing this; and we could enable them to do it without having to pay $400-$1000 per month; or in a way that could make stuff public knowledge between groups… For other smaller EA groups that just wanted to experiment the current tooling would create some awkwardness.

Ozzie: If we want to run experiments on interesting things to forecast, e.g. “how valuable is this thing?” or stuff around evaluation or LessWrong posts, we’d have to set up a new instance for each. Or maybe we could have one instance and use it for all experiments, but that would force a single privacy setting for all those experiments.

Ozzie: Besides that, at this point, I raised some money and spent like $11,000 to get someone to program. So a lot of this tooling work is already done and these things are starting to be experimented with.

Knowledge graphs and moving beyond questions-as-strings

Ozzie: In the medium-term there’s a lot of other interesting things. With the systems right now, a lot of them assume all questions are strings. So if you’re going to have 1000 questions, it’s impossible to understand them and for other people to get value from them. So if you wanted to organise something like, “every EA org, how much money and personnel would they have each year for the coming 10 years” it would be impossible with current methods.

Vaniver: Instead we’d want like a string prefix combined with a list of string postfixes?

Ozzie: There are many ways to do it. I’m experimenting with using a formal knowledge graph where you have formal entities.

Vaniver: So there would be a pointer to the MIRI object instead of a string?

Ozzie: Yeah, and that would include information about how to find information about it from Wikipedia, etc. So if someone wanted to set up an automated system to do some of this they could. Combining this with bot support would enable experiments with data scientists and ML people to basically augment human forecasts with AI bots.
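Editor's sketch: one way to picture the difference between questions-as-strings and questions that point at formal entities. This is an illustration only, not Ozzie's implementation; the `Entity` and `Question` classes, the identifiers, and the metric names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    """A formal entity that questions can point to (hypothetical schema)."""
    id: str
    name: str
    wikipedia_title: str = ""  # hint for where a bot could look things up


@dataclass(frozen=True)
class Question:
    """A structured question: a pointer to an entity plus a metric and year."""
    subject: Entity
    metric: str
    year: int

    def as_text(self):
        return f"What will {self.subject.name}'s {self.metric} be in {self.year}?"


def build_question_grid(entities, metrics, years):
    """Generate one question per (entity, metric, year) combination."""
    return [Question(e, m, y) for e, m, y in product(entities, metrics, years)]


orgs = [
    Entity("org:miri", "MIRI", "Machine Intelligence Research Institute"),
    Entity("org:givewell", "GiveWell", "GiveWell"),
]
grid = build_question_grid(orgs, ["budget", "headcount"], range(2020, 2030))

# Because questions are structured, they can be filtered without string parsing.
miri_questions = [q for q in grid if q.subject.id == "org:miri"]
```

Two organisations, two metrics, and ten years give 40 questions, and each one can still be rendered as a readable string with `as_text()` when a human needs to see it.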

Vaniver: So, bot support here is like participants in the market (I’ll just always call a “forecast-aggregator” a market)? Somehow we have an API where they can just ingest questions and respond with distributions?

Ozzie: Even without bots, just organising structured questions in this way makes it easier for both participants and observers to get value.

Summary of cruxes

Facilitator: Yeah, I don’t know… You chatted for a while, I’m curious what feels like some of the things you’ll likely think a bit more about, or things that seem especially surprising?

Ozzie: I got the sense that we agreed on more things than I was kind of expecting to. It seems lots of it now may be fleshing out what the mid-term would be, and seeing if there’s parts of it you agree are surprisingly useful, or if it does seem like all of them are long-shots?

Vaniver: When I try to summarise your cruxes, what would change your mind about forecasting, it feels like 1) if you thought there was a different app/EA tool to build, you would bet on that instead of this.

Ozzie: I’d agree with that.

Vaniver: And 2) if the track-record of attempts were more like… I don’t know what word to use, but maybe like “sophisticated” or “effortful”? If there were more people who were more competent than you and failed, then you’d decide to give up on it.

Ozzie: I agree.

Vaniver: I didn’t get the sense that there were conceptual things about forecasting that you expected to be surprised by. In my mind, getting data scientists to give useful forecasts, even if the questions are in some complicated knowledge graph or something, seems moderately implausible. Maybe I could transfer that intuition, but maybe the response is “they’ll just attempt to do base-rate forecasting, and it’s just an NLP problem to identify the right baserates”

Vaniver: Does it feel like it’s missing some of your cruxes?

Facilitator: Ozzie, can you repeat the ones he did say?

Audience: Good question.

Ozzie: I’m bad at this part. Now I’m a bit panicked because I feel like I’m getting cornered or something.

Vaniver: My sense was… 1) if there are better EA tools to build, you’d build them instead. 2) if better tries had failed, it would feel less tractable. And 3) Absence of conceptual uncertainties that we could resolve now. It feels it’s not like “Previous systems are bad because they got the questions wrong” or “Question/answer is not the right format”. It’s closer to “Previous systems are bad because their question data structure doesn’t give us the full flexibility that we want”.

Vaniver: Maybe that’s a bad characterization of the automation and knowledge graph stuff.

Ozzie: I’d definitely agree with the first two, although the first one is a bit more expansive than tools. If there was e.g. a programming tool that I’d be better suited for and that had higher EV, I’d do that instead. Number two, on tries, I agree: if there were one or two other top programming teams who tried a few of these ideas and were very creative about it, and failed, and especially if they had software we could use now! (I’d feel much better about not having to make software.) Then for three, the absence of conceptual uncertainties. I don’t know exactly how to pin this down.

Facilitator: I don’t know if we should follow this track.

Vaniver: I’m excited about hearing what Ozzie’s conceptual uncertainties are.

Facilitator: Yeah, I agree actually.

Ozzie’s conceptual uncertainties

Ozzie: I think the way I’m looking at this problem is one where there are many different types of approaches that could be useful. There are many kinds of people who could be doing the predicting. There are many kinds of privacy. Maybe there would be more EAs using it, or maybe we want non-EAs of specific types. And within EA vs non-EA, there are many different kinds of things we might want to forecast. There are many creative ways of organising questions such that forecasting leads to an improved amount of accuracy. And I have a lot of uncertainty about this entire space, and what areas will be useful and what won’t.

Ozzie: I think I find it unlikely that absolutely nothing will be useful. But I do find it very possible that it’ll just be too expensive to find out useful things.

Vaniver: If it turned out nothing was useful, would it be the same reason for different applications, or would it be “we just got tails on every different application?”

Ozzie: If it came out people just hate using the tooling, then no matter what application you use it for it will kind of suck.

Ozzie: For me a lot of this is a question of economics. Basically, it requires some cost to both build the system and then get people to do forecasts; and then to make the question and do the resolution. In some areas the cost will be higher than value, and in some the value will be higher than the cost. It kind of comes down to a question of efficiency. Though, it’s hard to know, because there’s always the question of maybe if I would have implemented this feature things would have been different?

Vaniver: That made me think of something specific. When we look at the success stories, they are things like weather and sports, whereas for sports you had to do some amount of difficult operationalisation, but you sort of only had to do it once. The step I expect to be hard across most application domains is the “I have a question, and now I need to turn it into a thing-that-can-be-quantitatively-forecasted” and then I became kind of curious if we could get relatively simple NLP systems that could figure out the probability that a question is well-operationalised or not. And have some sort of automatic suggestions of like “ah, consider these cases” or whatever, or “write the question this way rather than that way”.

Ozzie: From my angle, you could kind of call those “unique questions”, where the marginal cost per question is pretty high. I think that if we were in any ecosystem where things were tremendously useful, the majority of questions would not be like this.

Vaniver: Right so if I ask about the odds I will still be together with my partner a while from now, I’d be cloning the standard “will this relationship last?” question and substituting new pointers?

Ozzie: Yeah. And a lot of questions would be like “GDP for every country for every year” so there could be a large set of question templates in the ecosystem. So you don’t need any fancy NLP; you could get pretty far with trend analysis and stuff.

Ozzie: On the question of whether data scientists would be likely to use it, that comes down to funding and incentive structures.

Ozzie: If you go on upwork and pay $10k to a data scientist they could give you a decent extrapolation system, and you could then just build that into a bot and hypothetically just keep pumping out these forecasts as new data come in. Pipelines like that already exist. What this would be doing is to provide infrastructure to help support them basically.
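Editor's sketch: a very simple version of the "extrapolation system built into a bot" idea, assuming a yearly data series and an ordinary least-squares linear trend. The function names and the data are hypothetical, and a real forecasting bot would likely use a better model and a more careful estimate of uncertainty.

```python
from statistics import mean, pstdev


def fit_linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def forecast(xs, ys, target_x):
    """Return a crude forecast: trend value plus the spread of past residuals."""
    slope, intercept = fit_linear_trend(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return {"mean": slope * target_x + intercept, "stdev": pstdev(residuals)}


# Made-up yearly series for some metric.
years = [2015, 2016, 2017, 2018]
values = [10, 12, 14, 16]
first_forecast = forecast(years, values, target_x=2019)

# When a new data point arrives, the bot simply re-runs the same pipeline.
years.append(2019)
values.append(19)
updated_forecast = forecast(years, values, target_x=2020)
```

The same `forecast` call could be applied to every question generated from a template (for example, one per country per year), which is why the marginal cost per question can be low once the pipeline exists.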

END OF TRANSCRIPT


At this point the conversation opened up to questions from the audience.

While this conversation was inspired by the double-crux technique, there is a large variation in how such sessions might look. Even when both participants retain the spirit of seeking the truth and changing their minds in that direction, some disagreements dissipate after less than an hour, others take 10+ hours to resolve and some remain unsolved for years. It seems good to have more public examples of genuine truth-seeking dialogue, but at the same time it should be noted that such conversations might look very different from this one.



Discuss
