# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 1 hour 1 minute ago

### A developmentally-situated approach to teaching normative behavior to AI

1 hour 46 minutes ago

This was submitted to the EthicsNet Guardians' Challenge. I'll be honest here that I hadn't thought much about what EthicsNet is trying to do, but decided to write something and submit it anyway because it's the sort of approach that seems reasonable if you come from an ML background, and I think I differ enough in my thinking that I may provide an alternative perspective that could help shape the project in ways I view as beneficial to its success. For that reason I think this is somewhat less coherent than my usual writing (or at least my thinking is less coherent, whether or not that shows in my writing), but nonetheless I chose to share it here in the interest of furthering discussion and possibly drumming up additional interest for EthicsNet. Their challenge has a week left in it, so if you think I'm wrong and you have a better idea please submit it to them!

Introduction

Based on the usefulness of ImageNet, MovieLens, and other comprehensive datasets for machine learning, it seems reasonable that we might create an EthicsNet of ethical data we could use to train AI systems to behave ethically (Watson, 2018). Such a dataset would aid in addressing issues of AI safety, especially as they relate to AGI, since it appears learning human values will be a key component of aligning AI with human interests (Bostrom, 2014). Unfortunately, building a dataset for ethics is a bit more complicated than it is for images or movies because ethics is primarily learned by situated, embodied agents acting in the world and receiving feedback on those actions rather than by non-situated agents who learn about the world without understanding themselves to be part of it (Varela, 1999). Therefore we consider a way to fulfill the purpose of EthicsNet based on the idea that ethical knowledge is developmentally situated and so requires a generative procedure rather than a traditional dataset to train AI to adopt values and behave ethically.

Ethics is developmentally situated

In philosophy the study of ethics quickly turns to metaethics, because those are the sorts of questions that interest philosophers, so it’s tempting to think, based on the philosophical literature, that learning to behave ethically (i.e. learning behavioral norms) is primarily about resolving ethical dilemmas and developing ethical theories that allow us to make consistent choices based on values. However, this would ignore the psychology of how people actually learn what behaviors are normative and apply those norms in ethical reasoning (Peters, 1974). Rather than developing a coherent ethical framework from which to respond, humans learn ethics by first learning how to resolve particular ethical questions in particular ways, often without realizing they are engaged in ethical reasoning, and then generalizing until they come to ask questions about what is universally ethical (Kohlberg, Levine, & Hewer, 1983).

This is to say that ethics is both situated in general—ethics is always about some agent deciding what to do within some context it is itself a part of—and situated developmentally—the context includes the present psychological development and behavioral capabilities of the agent. Thus to talk about providing data for AI systems to learn ethics we must consider what data makes sense given their developmental situation. For this reason we will now briefly consider the work of Kohlberg and Commons.

Kohlberg proposed a developmental model of ethical and moral reasoning correlated with general psychological development (Kohlberg & Hersh, 1977). We will summarize it here as saying that the way an agent reasons about ethics changes as it develops a more complex ontology, so that young children, for example, reason about ethics in ways appropriate to their ability to understand the world, and this results in categorically different reasoning and often different actions than those of older children, adolescents, adults, and older adults. Although Kohlberg focused on humans, Commons has argued that we can generalize developmental theories to other beings, and there is no special reason to think AI will be exceptional with regard to the development of ontological and behavioral complexity. Thus we should expect AI systems to undergo psychological development (or something functionally analogous to it) and to develop in their moral reasoning as they learn and grow in complexity (Commons, 2006).

It’s within the context of developmentally situated ethics that we begin to reason about how AI systems can be taught to behave ethically. We might expect to train ethical behavior in AI systems the same way we teach them to recognize objects in images or extract features from text—viz. by providing a large data set with some predetermined solutions that we can train AI systems against—but this would be to believe that AI is exceptional and allows learning ethics in a way very different from the way both humans and non-human animals learn behavioral norms. Assuming AI systems are not exceptional in this regard, we consider a formulation of EthicsNet compatible with developmentally situated learning of ethical behavior.

Outline for a generative EthicsNet

From a young age, human children actively seek to learn behavioral norms, often to the point of overfitting, by aggressively deriving “ought” from “is” (Schmidt et al., 2016). They do this based on a general social-learning motivation to behave like conspecifics, seen in both primates and other non-human animals (van de Waal, Borgeaud, & Whiten, 2013; Dingemanse et al., 2010). This strongly suggests that, within their developmental context, agents learn norms based on a strong, self-generated motivation to do so. Foundational to our proposal for teaching AI systems ethical behavior, then, is a self-sustaining motivation to discover behavioral norms from examples. Other approaches may be possible, but the assumption of a situated, self-motivated agent agrees with how all agents known to learn normative behavior do so now, so we take up this assumption out of a desire to conserve uncertainty. We will therefore assume for the remainder of this work that such a motivation exists in the AI systems to be trained against the EthicsNet dataset, although note that implementing this motivation to learn norms lies strictly outside the scope of the EthicsNet proposal and will not be considered in detail here.

So given that we have a situated agentic AI that is self-motivated to learn normative behaviors, what data should be provided to it? Since the agent is situated it cannot, strictly speaking, be provided data of the sort that we normally think of when we think of datasets for AI systems. Instead, since the agent is to be engaged in actions that offer it the opportunity to observe, practice, and infer behavioral norms, it needs to be a dataset in the form of situations it can participate in. For humans and non-human animals this “dataset” is presented naturally through the course of living, but for AI systems the natural environment does not necessarily present such opportunities. Thus we propose that the goal of EthicsNet is to give a framework in which to generate such opportunities.

Specifically we suggest creating an environment where AI agents can interact with humans with the opportunity to observe and query humans about behavioral norms based on the agents’ behavior in the environment. We do not envision this as an environment like ReCAPTCHA where providing ethical information to AI systems via EthicsNet is the primary task in service of some secondary task (von Ahn, 2008). Instead, we expect EthicsNet to be secondary to some primary human-AI interaction that is inherently meaningful to the human since this is the same way normative behavior is learned in humans and non-human animals, viz. as a secondary activity to some primary activity.

By way of example, consider an AI system that serves as a personal assistant to humans that interacts with humans via a multi-modal interface (e.g. Siri, Google Assistant, Cortana, and Alexa). The primary purpose of the AI-human interaction is for the AI assistant to help the human with completing tasks and finding information they might otherwise have neglected. As the AI assistant and human interact, the human will demonstrate behaviors that will give the AI assistant an opportunity to observe and infer behavioral norms based on the way the human interacts with it. Further, the AI assistant will take actions, and about those actions the human may like what the AI assistant did or may prefer it did something else. We see the goal of EthicsNet as providing a way for the human to provide the AI assistant in this scenario feedback about those likes and preferences so the AI assistant can use the information to further its learning of behavioral norms.
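To make the scenario above concrete, here is a minimal sketch of what such a feedback channel could look like. Everything here is a hypothetical illustration of my own: none of these names or interfaces come from EthicsNet or any real assistant API. The idea is only that guardian feedback on the assistant's actions accumulates into examples the agent's norm-learning process can consume.

```python
# Hypothetical sketch of an EthicsNet-style feedback channel riding
# alongside a primary human-AI assistant interaction. All class and
# field names are illustrative assumptions, not a real specification.

from dataclasses import dataclass, field


@dataclass
class NormFeedback:
    action: str          # what the assistant did
    approved: bool       # did the human guardian like it?
    preferred: str = ""  # what the guardian would have preferred, if stated


@dataclass
class GuardianChannel:
    log: list = field(default_factory=list)

    def record(self, action, approved, preferred=""):
        # The human tags an observed assistant action with normative feedback.
        self.log.append(NormFeedback(action, approved, preferred))

    def training_examples(self):
        # Feedback becomes (action, signal, correction) examples that a
        # norm-learning motivation could train against.
        return [(fb.action, 1 if fb.approved else 0, fb.preferred)
                for fb in self.log]
```

In this toy framing, the primary task (scheduling, search, reminders) generates the situations, and the guardian's likes and preferences arrive as a secondary stream, mirroring how the essay argues norms are learned.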

Caveats of a generative EthicsNet

As mentioned, ethical learning is developmentally situated, which means that feedback from human guardians to learning AI systems should differ depending on how complexly an AI system models the world. By way of example, consider that young children are often given behavioral corrections that frame actions as simply right or wrong: always hold hands while crossing the street; never hit another child. Such an approach, of course, leaves out many nuances of normative behavior adults would consider, as in some cases a serious threat may mean a child should risk crossing the street unattended or hitting another child in self-defense. The analogous cases for AI systems will of course be different, but the general point holds: present developmentally appropriate information, eliding nuances the learner is not yet equipped to use.

In order to ensure developmentally appropriate feedback is given, it’s important to give humans contextual clues about the AI system’s degree of development. For example, we might want to cue the human to treat the AI as they would a child, or as an adult, whichever is developmentally appropriate. Experimentation will be necessary to find the cues that encourage humans to give developmentally appropriate feedback, so EthicsNet will need to provide a rapidly iterable interface that allows developers to find the best user experience for eliciting maximally useful responses from humans helping AI systems learn normative behaviors.

Since EthicsNet, as proposed here, is a secondary function attached to an AI system serving some other primary function, one implementation difficulty is that it must be integrated with a system providing that primary functionality. This will likely involve forming partnerships with leading AI companies to integrate EthicsNet into their products and services. Integrating with one or more existing systems owned by other organizations makes EthicsNet more complicated to develop than if it could be built in isolation, but we believe, for the reasons laid out above, that it cannot be. Anything short of this seems unlikely to succeed at teaching AI systems to behave ethically, given what we know about how normative behaviors are learned in humans and non-human animals, so the additional cost and complexity is worthwhile.

Given this context in which EthicsNet will be deployed, it will also be important to make sure to choose partners that enable AI systems being trained through EthicsNet to learn from humans from multiple cultures since different cultures have differing behavioral norms. Note, though, that this will also make it harder for the AI systems being trained to infer what behavior is normative because they will receive conflicting opinions from different guardians. How to resolve such normative uncertainty is an open question in ethics, so EthicsNet may prove vital in research to discover how, in an applied setting, to address conflicts over behavioral norms (MacAskill, 2014).

Conclusion

The view of EthicsNet we have presented here is not one of a typical dataset for machine learning like ImageNet but rather a framework in which AI systems can interact with humans who serve as guardians and provide feedback on behavioral norms. Based on the situated, and particularly the developmentally situated, nature of ethical learning, we believe this to be the best approach available, and that a more traditional dataset approach will fall short of the goal of enabling AI systems to learn to act ethically. Although this approach offers less opportunity for rapid training, since it requires interaction with humans on human timescales and integration with other systems (ethical learning being a secondary activity to some other primary activity), the outcome of producing AI systems that can conform to human interests via ethical behavior makes it worth the additional effort.

References

L. von Ahn, B. Maurer, C. McMillen, D. Abraham, M. Blum. (2008). reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science. 321 (5895): 1465–1468. DOI:10.1126/science.1160379

N. Bostrom. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

M. L. Commons. (2006). Measuring an Approximate g in Animals and People. Integral Review, 3, 82-99.

N. J. Dingemanse, A. J. N. Kazem, D. Réale, J. Wright. (2010). Behavioural reaction norms: animal personality meets individual plasticity. Trends in Ecology & Evolution, 25(2), 81-89. DOI: 10.1016/j.tree.2009.07.013.

L. Kohlberg & R. H. Hersh. (1977). Moral development: A review of the theory. Theory Into Practice, 16:2,53-59. DOI: 10.1080/00405847709542675.

L. Kohlberg, C. Levine, & A. Hewer. (1983). Moral stages: A current formulation and a response to critics. Contributions to Human Development, 10, 174.

W. MacAskill. (2014). Normative Uncertainty. Dissertation, University of Oxford.

R. S. Peters. (1974). Psychology and ethical development. A collection of articles on psychological theories, ethical development and human understanding. George Allen & Unwin, London.

M. F. H. Schmidt, L. P. Butler, J. Heinz, M. Tomasello. (2016). Young Children See a Single Action and Infer a Social Norm: Promiscuous Normativity in 3-Year-Olds. Psychological Science. DOI: 10.1177/0956797616661182.

F. J. Varela (1999). Ethical know-how: Action, wisdom, and cognition. Stanford University Press.

E. van de Waal, C. Borgeaud, A. Whiten. (2013). Potent Social Learning and Conformity Shape a Wild Primate’s Foraging Decisions. Science, 340 (6131): 483-485. DOI: 10.1126/science.1232769.

N. Watson. (2018). EthicsNet Overview. URL: https://www.herox.com/EthicsNet


### Subsidizing Prediction Markets

4 hours 51 minutes ago

Epistemic Status: Sadly Not Yet Subsidized

Robin Hanson linked to my previous post on prediction markets with the following note:

.@TheZvi reviews what one needs for a working prediction market FUNDED BY TRADERS. If someone else will sponsor/subsidize them, a lot more becomes possible. https://t.co/OgEMahI7ZS

— Robin Hanson (@robinhanson) July 26, 2018

I did briefly mention subsidization as an option for satisfying the fifth requirement: Sources of Disagreement and Interest, also known as Suckers At The Table. The ultimate sucker is an explicit, intentional one. It can serve that role quite well, and a sufficiently large subsidy can make up for a lot. Any sufficiently large sucker can do that – give people enough profit to chase, and suddenly not being so well-defined, or so quick or probable to resolve, or even not being safe from key insider information, starts to sound like it is worth the risk.

Suppose one wants to create a subsidized prediction market. Your goal presumably is to get a good estimate for the probability distribution of an event, and to do so without paying more than necessary. Secondary goals might include building up interest and a marketplace for this and future prediction markets, and getting a transparently robust result, so others or even the media are more likely to take the outcome seriously. What is the best way to go about doing this?

Before looking at implementation details, I’ll look at the five things a prediction market needs.

I Well Defined

The most cost-efficient subsidy for a market is to ensure that the market is well defined. Someone has to make sure everyone understands exactly what happens under every scenario, and that someone is you. Careful wording and consideration of corner cases is vital. Taking the time to do this right is a lot more efficient than throwing money at the problem, especially if you’re trying to build a system and brand over time.

If you’re going to subsidize a market, step one is to write good careful rules, make sure people understand them, and to commit to making it right for everyone if something goes wrong, if necessary by paying multiple sides as if they had won. This is potentially quite expensive even if it rarely happens, so it’s hard to budget for it, and it feels bad in the moment so often people don’t pull the trigger. Plus, if you do it sometimes, people will argue for it all the time.

But if you’re in it to win it, this is where you start.

II Quick Resolution

Once you’ve got your definitions settled, your next job is to pay the winners quickly once the event happens. People care about this more than you can possibly imagine. The difference between paying out five seconds after the final play, and five minutes after the final play, is a big deal to many. Make them wait an hour and they’ll be furiously complaining on forums. When the outcome is certain, even if it hasn’t actually happened yet, it’s often a great move to pay people in advance. People love it. Of course, occasionally someone does something like pay out on bets on Hillary Clinton two weeks early, in which case you end up paying both sides. But great publicity, and good subsidy!

Another key service is to make sure your system recognizes when a profit has been locked in, or risk has been hedged, and does not needlessly tie up capital.

This one is otherwise tough to work around. If you want to know what happens twenty years from now, nothing is going to make resolving the question happen quickly. You can help a lot by ensuring that the market is liquid. If I buy in at 50% now, then a year from now the market is at 75% and is liquid enough, I can take my profits in one year rather than twenty. That’s still a year, and it’s still unlikely the price will ‘catch up’ fully to my new opinion by that time. It helps a lot, though.

III Probable Resolution

It is a large feel bad, and a real expense, when capital is tied up and odds look good but then the event doesn’t happen, and funds are returned. It hurts most when you’ve pulled off an arbitrage, and you win money on any result.

If, when this happens, you subsidize people for their time and capital, they’d be much more excited to participate. I think this would have a stronger effect than a similar subsidy to the market itself, once you get enough liquidity to jump start trading. Make sure that if money gets tied up for months or years, that it won’t be for nothing.

IV Limited Hidden Information

If your goal is to buy the hidden information, you might be all right with others not being interested in your market, as long as the subsidy brings the insiders to the table to mop up the free money. That approach is quite expensive. If the regular traders are still driven away, you’ll end up paying a lot to get the insiders to show their hand, because they can’t make money off anyone else. Even insiders start to worry that others have more and better inside information than they do, which could put them at a disadvantage. So it’s still important to bring in the outsiders.

One approach is to make the inside information public. Do your own investigations, require disclosure from those participating in the events themselves, work to keep everyone informed. That helps but when what you want to get at is the inside information it only goes so far.

That means that when this is your problem, and you can’t fix it directly through action and disclosure, you’re going to have to spend a lot of money. The key is to give that money to the outsiders as much as possible. They are the ones you need at the table, to get yourself a good market. The insiders can then prey on the outsiders, but that’s much better than preying on you directly.

The counterargument, especially if you don’t need to show liquidity or volume, is that if you buy the information directly there’s less noise, so perhaps you want to design the system to get a small number of highly informed traders and let everyone else get driven away. In cases where the outsiders would be pure noise, where the insiders outright know the answer, and where getting outsiders to be suckers that take a loss isn’t practical, that can be best.

V Disagreement and Interest

This one’s easy. You are paying a subsidy, so you’re the sucker. Be loud about it so everyone knows you’re the sucker, and then they can fight to cash in. Excellent.

The other half, disagreement, is still important. Many people, whose analysis and participation you want, still benefit from a story that explains why they are being paid to express an opinion, rather than fighting to be slightly more efficient at capturing the subsidy. And of course, if no one disagrees about the answer, then your subsidy was wasted, since you already knew the answer!

In light of those issues, what are the best ways to subsidize the market?

Option 0: Cover Your Basics

Solve the issues noted above. Choose a market people want to participate in to begin with. Ensure there are carefully written rules with no ambiguity, that any problems there are covered. Make sure you’ll get things resolved and paid quickly, that capital won’t be tied up one minute longer than necessary. When possible, disclose all the relevant information, on all levels. If things don’t resolve, it would be great if you could compensate people for their time and capital.

And also, make sure everyone is confident the winners will be paid! Nothing kills a market like worrying you can’t collect if you win. That’s often as or more important even than providing strong, reliable liquidity.

If you can improve your interface, usability, accessibility, users’ tax liability or anything like that, definitely do that. If your market design is poor, such as having the wrong tick size, make sure to fix that. Tick sizes that are too small discourage the providing of liquidity to the market, and are in my experience a bigger and more common mistake than ticks that are too big.

Finally, waive the fees. All of them. Deposit fees, withdrawal fees, trading fees, you name it. At most, there should be a fee when taking liquidity that is paid entirely to the trader providing liquidity. People hate paying fees a lot more than they like getting subsidies. They won’t cancel out.

With that out of the way, what are your options for the main subsidy?

Option 1: Be a Market Maker and Provide Liquidity Directly

As the subsidizer of the market, commit to being the market maker with well-defined rules.

The standard principle is: let everyone know that there will always be $X of liquidity available on both sides, at a fixed cost of Y% price difference between your bid and your offer. So for example, you might agree to offer $1,000 on each side with a difference of 5% at all times, starting with a 48% bid and a 53% offer. You’d then adjust as you did trades.

A simple rule to protect yourself from unlimited downside is if you do a trade for some percent of your liquidity, you adjust your price that percentage of its width. So in this example, if someone took 40% of your offer, you’d adjust by 40% of 5%, which is 2%, and now have a 50% bid and a 55% offer. If you follow such a rule, your maximum loss is what it takes to move the odds to 0% or 100% (and if you let people keep trading until the event is done, you will take that loss). People trading against you in opposite directions can make you money, but can’t cost you money.

For convenience, you can post additional bids and offers so that if someone wants to move the odds a lot, they can see what liquidity they would get from you, and have the option to take it all at once. You’ll lose money every time the fair probability changes, but that’s why they call it a subsidy, and this encourages people to show their information quickly and efficiently.
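The fixed-width market-making rule above can be sketched in a few lines. This is a toy model of my own, not anything from the post; the class and parameter names are illustrative assumptions. It reproduces the worked example: a 48% bid and 53% offer, where a trade for 40% of the quoted size shifts the quote by 40% of its 5% width, i.e. 2%.

```python
# Toy sketch (hypothetical names) of the "dumb" market maker rule:
# quote a fixed size on each side at a fixed width, and after a trade
# for some fraction of that size, shift the quote by that fraction of
# the width. Maximum loss is bounded by walking the price to 0% or 100%.

class DumbMarketMaker:
    def __init__(self, mid=0.505, width=0.05, size=1000.0):
        self.mid = mid      # midpoint of the current quote
        self.width = width  # fixed bid/offer spread (5%)
        self.size = size    # liquidity quoted on each side ($1,000)

    @property
    def bid(self):
        return self.mid - self.width / 2

    @property
    def offer(self):
        return self.mid + self.width / 2

    def on_trade(self, amount, side):
        """Adjust the quote after a trade.

        side=+1: someone lifted the offer (bought), so move prices up.
        side=-1: someone hit the bid (sold), so move prices down.
        """
        fraction = amount / self.size
        self.mid += side * fraction * self.width
```

With `mid=0.505` and `width=0.05` this starts at a 48% bid and 53% offer; after a $400 buy it moves to a 50% bid and 55% offer, matching the example in the text.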

There are ways to make that smarter, so you can lose less (or make more!) money while offering better liquidity, which will be left as an exercise to the reader. Generally they sacrifice simplicity and transparency in order to make the subsidy ‘more efficient.’ The danger is that if the subsidy is attempting to ensure a sucker is at the table, it does not do that if it stops being the sucker, or it becomes too hard to tell if it is one or not.

Then again, the dream is to offer a subsidy that doesn’t cost you anything, or even makes you money! Market making can be highly profitable when done skillfully, while also building up a marketplace.

Option 2: Take Liquidity

If you provide liquidity, others will take advantage, but in some ways you make it harder to provide liquidity. If you take liquidity, you make it more profitable to provide it, at the risk of making the market look less liquid.

It also loses money. The more clear you are about what you are up to, the better.

There are a few fun variants of this, if you’re all right with the expense.

One strategy is to periodically take liquidity in both directions. At either fixed or random intervals, examine the order books in the market. If they meet required conditions (e.g. there is at least $X on the bid and offer within Y% of each other) then you hit the bid and lift the offer for $Z.

This costs you money, since your trades net out at a loss. If someone else was both the best bid and best offer, they made money.

That’s the idea. You’re directly subsidizing people to aggressively provide liquidity.

Traders compete to be on the bid and offer to trade with you, the virtual customer, which in turn gives those with an opinion a liquid market to trade against. Sometimes people get far too aggressive providing in such situations, and those trying to capture the subsidy end up losing money because they make bad trades against others, especially if they don’t then hedge.
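The liquidity-taking rule described above is simple enough to sketch directly. This is my own toy illustration; the function and parameter names are assumptions, not anything from the post. The net cost of each round trip is the spread paid on the clip size, which is exactly the subsidy handed to whoever was quoting both sides.

```python
# Toy sketch (hypothetical names) of Option 2: at intervals, if the book
# shows enough size on both sides within a maximum spread, hit the bid
# and lift the offer for a fixed clip, deliberately losing the spread.

def maybe_take_both_sides(best_bid, best_offer, bid_size, offer_size,
                          min_size=500.0, max_width=0.05, clip=100.0):
    """Return the orders to send this interval, or [] if conditions fail.

    Prices are probabilities (0-1); sizes and clip are in dollars.
    """
    enough_size = bid_size >= min_size and offer_size >= min_size
    tight_enough = (best_offer - best_bid) <= max_width
    if enough_size and tight_enough:
        # Sell into the bid and buy the offer; the loss per round trip
        # is clip * (best_offer - best_bid), paid to liquidity providers.
        return [("sell", best_bid, clip), ("buy", best_offer, clip)]
    return []
```

Run at random intervals, this pays providers without revealing any directional opinion of your own, though as noted below it can make results harder to interpret if done in an unbalanced way.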

You can also do this in a more random or unbalanced fashion. If you flip a coin each day and decide whether to be a buyer or a seller, that will cause the price to temporarily become ‘unfair’ to satisfy your demand – you’ll get a bad price. But that creates a trading opportunity for others. It can also make the results hard to interpret, which is a risk.

Option 3: Subsidize Trading / Give Free Money

Often you’ll see crypto exchanges do this as a promotion, offering a prize to whoever trades the most of some coin. By paying for trades, you’re encouraging exactly what you want.

Except that you’re probably not doing that. Remember Goodhart’s Law.

The problem is ‘wash’ trading, where people trade with each other or themselves without taking on positions. This is bad on every level. It misleads everyone about the volume and price, and doesn’t help at all with finding out the answer to the question the market is trying to answer. The last thing you want to do is encourage it!

For that reason, subsidizing trading itself is a dangerous game. But it can be done, if you’re careful with the design.

Many online sites have tried this in the form of the classic ‘deposit bonus’ or even the free play. Anyone can sign up and get Free Money in exchange for engaging in a minimum amount of trading activity. And of course, most of the time, a deposit to match, if the offer is more than a small ‘free play.’ In for-profit markets the goal is to have the required activity make up for the subsidy, then hopefully hook the customer to keep them trading. There are always those looking to game these offerings if you leave them vulnerable.

That can work for you. Getting those same people, who are often quite creative and clever, thinking about how to come out ahead in your system can be a big win if your end goal isn’t profit! So long as you make it sufficiently difficult to do wash trading or sign up for tons of copies of the bonuses, you can give them a puzzle worth maximizing (from their perspective) and effectively rent their labor to see what they think of the situation.

Option 4: Subsidize Market Making

You can also subsidize market making activity, as an alternative to doing the job yourself and butchering it. That’s activity you can’t fake, provided you set the rules carefully. Paying people who provide rather than take liquidity is good, and often paying for real two-sided market making activity is better. As always, make sure you’re not vulnerable to wash trading or other forms of collusion.

Putting It All Together

Which of these strategies is most efficient and what circumstances change that answer?

It’s expensive to change or clarify your rules and conditions once trading has begun, so invest in doing that first. Other quality of life improvements are great, but take a back seat to establishing good liquidity.

I list Option 0 first because it’s things you definitely should do if you’re taking the operation seriously, but that doesn’t mean you always do all of them first before the direct subsidy. It’s great if you can, but often you need to establish liquidity first.

If ‘no liquidity’ is the pain point and bad experience, there isn’t much that will overcome that. There’s no market. So if you don’t have liquidity yet, providing at least a reasonable amount, or paying someone else to do it, is the best thing you can do. Just throw something out there and see what happens. This makes intuitive sense all around – as an easy intuition pump, if you want to know if something is more likely than not, offering someone a 50/50 bet on it is a great way to get their real opinion.

Once liquidity isn’t a full deal breaker, it’s time to go with Option 0, then return to increasing the subsidy and spreading the word.

What form should the direct subsidy take?

I’d advise to continue to take away bad experiences and barriers first.

The best subsidy is paying to produce reliable, safe and easy to use software, getting ironclad rules in place, and being ready to handle deposits, withdrawals, evaluation of results and other hassles. Make sure people can find your markets and set up the markets people want to find.

Next best is to avoid fees. People hate fees more than they love subsidies. Yes, you can trick people with deposit bonuses and then charge them a lot on their trades, but the best way to get away with that is bake the fees into the trade prices, so it doesn’t look like a fee.

At a minimum, you shouldn’t be charging fees for deposits or withdrawals, or for providing liquidity in the market.

Next up, make trades cost net zero fees. Either charge nothing to provide or to take liquidity, or charge a fee to take liquidity but pay it to those who provide.

After that, my opinions are less confident, but here’s my best guess.

If that’s still not good enough, provide liquidity. Either pay someone else to be a market maker, or provide the service yourself. I like the idea of a ‘dumb’ market maker everyone knows is dumb, and that operates with known rules that hamstring it. If you’re looking to provide a subsidy, this is a great way to do that. A smarter market maker is cheaper, and can provide better liquidity, but is less obviously a target. As the market matures, you’ll want to transition to something smarter. Thin markets want obviously dumb providers.

Once you’ve done a healthy amount of that, then you’ll want to give away Free Money. Give people some cash in exchange for participating in the market at all, or trading a minimum amount. Or give people bonuses on deposited funds so long as they use them to trade, or similar.

You have to watch for abuse. If you can respond to abuse by changing the system, it’s fine to be vulnerable to abuse in theory, and even allow small amounts of it. If you’re going to release a cryptographic protocol you can’t alter, you’ll need to be game theoretically robust, so this won’t be an option, and you’ll have to retreat to taking liquidity.

Taking liquidity seems less likely to motivate the average potential participant, and costs you weirdness points, but it does provide a strong incentive for the right type of trader. The best reason I can think of to use such a strategy is that it is robust to abuse. That’s a big gain if you can’t respond dynamically to unfriendly players.

At the end of the day, your biggest barriers are that people’s attention is limited, complexity is bad, opportunity cost is high and people don’t do things. I keep meaning to get around to bothering with HyperMind and/or PredictIt, and keep not doing it, and I’m guessing I am far from alone in that. Subsidies can get people excited and make markets work that wouldn’t otherwise get off the ground. What I think they can’t do at reasonable cost is fix fundamental problems. If you don’t have a great product behind the subsidy, it’s going to be orders of magnitude more expensive to motivate participation.

Discuss


### How Roman Emperors Handled The Succession Problem

August 16, 2018 - 22:30
A relief from Trajan’s Column

Institutions built by one generation of founders must be successfully handed off to the next to keep them functional. In the absence of such succession, organizational sclerosis or constant internal conflict sets in.

The succession problem has two components: skill succession and power succession. In public discourse and political thought we have tried to solve either power succession or skill succession under different names, often not even perceiving that we seamlessly switch between two fragmented states of mind depending on the problem we face.

Our culture is pervaded with an ideology of proving worth through struggle. This almost Darwinian view is strongly present in our economic, political, and even academic values. We define merit by equating it with success in competition, not even realizing this was merely one of many possible choices.

These values then underwrite various legal and social barriers imposed on power succession, which are widely believed to solve skill succession.

When it comes to our private decisions, we are of a more nepotistic mindset. In it, we can think more cleanly about power succession. In this mode, however, we usually fudge skill evaluation.

What is missing from the Western understanding is that power succession and skill succession are not actually at odds with each other, as our meritocratic ideology would have you believe, but are two mutually necessary halves. If your goal is to keep institutions functional, solutions that solve one but not the other are not solutions at all.

The example of Roman adoption is used to explore and illustrate this. The Roman Empire is one of the greatest civilizations of all time. In this piece, we examine its surprising solution to the succession problem, the practice of adult adoption, for insight into what kind of social norms and institutional features would be necessary in a modern solution.

When in Rome…

Roman society is correctly noted for its production of highly skilled individuals. The society had no problem with skill succession — ambitious and greatly talented individuals abounded. They found a challenge in power succession, however. The early Republic was anomalous and impressive in how cooperative the elites were (among themselves at least).

Cincinnatus could be called upon by the senate to be dictator in an emergency, then earn the admiration of his peers by choosing to seamlessly retire back to his farm after the crisis passed, without fear of reprisals from former political rivals. They trusted that his retirement was genuine and that he would no longer be a towering figure in politics.

Contrast this with modern Libya. Muammar Gaddafi’s gruesome death at the hands of the National Transitional Council militia is infamous. Even absent the American and French interventions that toppled him, a peaceful retirement, had he handed power over to his political opposition, seems unlikely at best.

The Roman republican system eventually met its limits as it grew from managing a provincial Rome and its client states on the Italian Peninsula to managing a more complex urban economy and the political life of the entire Western Mediterranean.

Problems that could previously be solved by aligned political fundamentals or the social fabric of the patrician class mounted. They fell more and more on the religious and legal structures of the republic.

These structures, once the last recourse, could not bear the burden of regular use. What were once dire contingencies, resorted to only when the failure of coordination became apparent, came to be seen as normal political moves. Roman economic, military and political elites grew steadily less cooperative.

By the late republic, talented people still arose but were forced to fight bloody civil wars to resolve disputes. The career of Sulla for example is littered with political opponents defeated not just on the senate floor but on the field of battle.

Long after these civil wars changed the Roman state beyond recognition, Roman Emperors found an inventive solution for the newly apparent problem of power succession. In subsequent periods of stability, such as during the Nerva-Antonine dynasty, this was achieved with the use of adoption.

In Roman society adoption wasn’t solely a means to help orphaned or abandoned children, but a social and legal mechanism with which you could make an adult your son and heir, allowing him to inherit your position. In other words, your dynasty didn’t need to end with your bloodline.

This solution has many interesting features, the most notable of which is that the emperor can work out an agreement with a rising younger rival. Adoption legibly positions them as the natural successor.

Since the practice of adult adoption was well understood and respected throughout Roman society, it amounted to a credible guarantee. Credible guarantees changed incentives notably.

The adopted son, who might have previously been tempted to undermine the emperor, should now be in favor of expanding a power base that will one day be his. The current and future rulers, then, have reason to work together even before the transfer of power is effected. The result is not only a peaceful transfer of power, but a potent political alchemy that transmutes your most dangerous rival into your most potent ally.

A well-respected law backed by legal practice is what ensures that wealth and other legal rights are properly transferred. Importantly the legitimacy of the social practice of adoption together with the expectation of future power meant that intangible social connections, so vital to securing power, are transferred as well.

Even if the chosen successor and head of state are not in the closest political allegiance due to other factors, this adoption mechanism can still be used to formalize the capacity to carry out a coup to put that person in charge, or at least in the waiting line for formal governance, without a civil war. This solves one of the greatest difficulties with negotiated surrenders and peace negotiations in general, that of credible commitment.

The mechanism has benefits in terms of skill succession as well, since it allows a skilled pilot, in this case a skilled ruler, to recognize and pick another with comparable skill. This turned out to be the case, as the era of the Nerva-Antonine dynasty saw relatively peaceful and competent governance.

The term “Five Good Emperors” has been used to refer to a chain of five good rulers from the Nerva-Antonine dynasty (Nerva, Trajan, Hadrian, Antoninus Pius and Marcus Aurelius). The famous British historian Edward Gibbon went so far in his praise as to say mankind never had as happy a condition before or after as under their rule.

The relative harmony of this period provides an important contrast with the civil wars seen earlier in the late republic and later in imperial history. Adoption proved a viable method of solving power succession.

The more complex the solution, the more fragile it is

The most developed version of the system of adoptive succession was implemented centuries later under Diocletian. The practice of adoption was less prominent in Roman society by that point, so the stability of the guarantee was more questionable, since it was no longer a celebrated cultural practice. Diocletian revived it for use as a legal succession mechanism and developed it further by implementing a system of seniority and apprenticeship. The appointed successor was granted the title of Caesar (junior Emperor) and would be allowed to manage their own lands, under nominal supervision of the Augustus (senior Emperor).

This sweetens the deal; not only will I name you my son and by culture and law make you inheritor, I will grant or acknowledge your right to manage territories right now.

An advantage of this approach is that the senior position is directly analogous in terms of skills and responsibilities required to the junior one. The job of head of state is usually sufficiently unique that preparation, training, and directly relevant experience are infamously hard to come by. A disadvantage is that it favors the junior party, perhaps to the point of making premature conflict a viable route to power.

The Roman Empire was experiencing great difficulties in this era, having become sclerotic and bureaucratized. Military and administrative demands made the division of the Empire into a Western and an Eastern half politically advantageous. The heights of the landscape of power were then composed of a four-way alliance, a tetrarchy of the Eastern and Western senior emperors and their junior successors.

This more complicated arrangement proved more unstable than the Nerva-Antonine dynasty. The balance of power between four skilled individuals is a hard thing to maintain. Every now and then there do arise cooperative strategists that can make such a balance of power work, but the skill requirement for the job rises. The tetrarchy was stable only under Diocletian himself. He managed the feat of safely retiring. Unfortunately in his old age he also lived to see the system fall.

More complicated systems of succession and coordination are generally fragile, since in those cases successful power succession relies on successful skill succession. The more robust approach is to aim for skill succession, but enable power succession even when skill succession fails or only partially succeeds.

There are examples of seemingly very complex systems of succession that have endured for centuries. An example of complicated constitutional arrangements was the Republic of Venice, the longest-lived republic. Such arrangements are, however, best thought of as very complicated legal machinery that validates and renders legal any decision arrived at through some other means (the selection of the Doge of Venice was likely accomplished by direct negotiation between the patrician families of Venice, not through the nominal selection procedure).

Lessons

Successfully transferring not only the formal, but also informal, position that allows an individual to shape an organization is necessary for keeping an institution functional. On the scale of societies, employing solutions that prevent destructive conflicts between elites is vital.

Adoption of adults was a viable solution in the Roman Empire for as long as the social fabric underwriting it was there. As the underlying social norms changed, the legal norms that made it possible required backing by more complicated mechanisms and workarounds. This architecture proved less successful, in part because its complexity made it more difficult to maintain.

We cannot simply copy the Roman solution because our own social and legal norms are different. While adult adoption is legal in many Western countries, the Roman social practice would be considered an exploit, and would leave the companies and organizations that used it open to attack. The challenge, then, is finding a solution that would work as well and is as simple as possible.

It is important to note that in modern Japan, a technologically developed industrial economy, we actually observe a similar practice. It is termed Mukoyōshi, where a son-in-law is chosen primarily for the ability to run the family business. They marry into the family and take on the family name. The practice can be found in the history of companies such as Suzuki, Kikkoman, and Toyota.

It might be tempting to try to imitate Mukoyōshi in the Western context. The legal vehicle of marriage certainly seems more appropriate for the task than our adoption laws. The crucial problem, however, lies in how we choose marriage partners in the West. Our choice of spouse is a personal and romantic, rather than a business and family matter. This means that while we could use marriage for power succession, its appropriateness for solving the problem of skill succession is dubious.

Despite the obstacles to its direct application, the Roman solution displays features we can and should emulate in our own institutional thinking. When pursuing reform, setting cultural expectations, or building new organizations with the intent to solve the succession problem, we should aim for simplicity of mechanism (robustness), have the mechanism carry over informal as well as formal resources, and ensure that the incentives of the successor and the current pilot are as aligned as possible.

Discuss

### The "semiosis reply" to the Chinese Room Argument

August 16, 2018 - 22:23

Nobody has so far proposed the following reply to the Chinese Room Argument against the claim that a program can be constitutive of understanding (a human non-Chinese-speaker cannot understand Chinese just by having run a given program, even if this program enables the human to have input/output interactions in Chinese).

My reply goes as follows: a program, run by a human non-Chinese-speaker, may indeed teach the human Chinese. Humans learn Chinese all the time; it is merely uncommon for them to learn Chinese by running a program. Even if we are not aware of such a program (no existing program satisfies this requirement), we cannot a priori exclude its existence.

Before enunciating my reply, let me first steelman the Chinese Room Argument. If the human in the thought experiment of the Chinese Room is Searle, he may not know Chinese, but he may know a lot of things about Chinese: that it has ideograms and punctuation, which he may recognize; that it is a human language, which has a grammar; that it has the same expressive power as a language he knows, e.g. English; that it is very likely to have a symbol for “man” and a symbol for “earth”; and so on. Searle, unlike a computer processor, holds a lot of a priori knowledge about Chinese. He may be able to understand a lot of Chinese just from this a priori knowledge.

Let us require the human in the Chinese Room to be a primitive, e.g. an Aboriginal, with absolutely no experience of written languages. Let us suppose that Chinese appears so remote to the Aboriginal, that she would never link it to humans (to the way she communicates) and always regard it as something alien. She would never use knowledge of her world, even if somebody tells her to run a given program to manipulate Chinese symbols. In this respect, she would be exactly like the computer processor and have no prior linguistic knowledge. The Chinese Room Argument is then reformulated: can a program to be run by the Aboriginal teach her Chinese (or, as a matter of fact, any other language)?

I am going to reply that yes, a program to be run by the Aboriginal can teach her a language. I am going to call this reply the “semiosis reply”.

Semiosis is the process of interpreting signs. During semiosis, a sign gets interpreted and related to an object. Signs can be symbols of Chinese or English text that a human may recognize. An object is anything available in the environment which may be related to a sign. It has been suggested that artificial systems can also perform (simulated) semiosis [Gomes et al., 2003], and that objects can become available not only from sensory-motor experience but also from the symbolic experience of an artificial system [Wang, 2005]. A sign recognizable by a machine can be related to a position in an input stream as perceived by the machine. For example, the symbol "z" stands for something much less frequent in English text than what the symbol "e" stands for.

Semiosis is an iterative process in which an interpretant can itself become a sign to be interpreted (for example, the symbol "a" can be interpreted as a letter, as a vowel, as a word, as an article, etc.). At any given time the machine may select as potential signs anything available to it, including previous interpretants such as paradigms and any representations it has created. I suggest that the machine should also interpret its internal functions and structures through semiosis. These comprise "computation primitives", including conditionals, application, continuations and sequence formation, but also high-level functions such as read/write.

The meaning that the machine can give to the symbols it experiences as input then becomes increasingly complex. Such meaning is not given by a human interpreter (parasitic meaning); it is intrinsic to the machine. When a human executes the program on behalf of the machine, the human arrives at the same understanding, at the same meaning: simulating semiosis ultimately amounts to performing semiosis, and the Aboriginal can actually learn from the program.
(Note how existing artificial neural networks, including deep learning models for natural language processing, are ungrounded and devoid of meaning. A human, even one executing the training phase of an artificial neural network, cannot arrive at any understanding. This is because an artificial neural network, despite its evocative name, at no level simulates a human neural network. By contrast, semiotic artificial intelligence, despite having no representation of neurons, simulates the semiosis occurring in human brains.)

Let me tell you how an Aboriginal, called SCA, could learn English just by running a program. Let us suppose that SCA is given as an input the following text in English, the book "The adventures of Pinocchio" (represented as a sequence of characters with spaces and new lines replaced by special characters):

THE ADVENTURES OF PINOCCHIO§CHAPTER 1§How it happened that Mastro Cherry, carpenter, found a piece of wood that wept and laughed like a child.§Centuries ago there lived--§"A king!" my little readers will say immediately.§No, children, you are mistaken. Once upon a time there was a piece of wood...

This input contains 211,627 characters, which are all incomprehensible symbols for SCA. (This input represents a very small corpus compared to those used to train artificial neural networks)

Let me tell you how SCA learns, through only seven reflection actions and via only three iterations of a semiotic algorithm, to output something very similar to the following: "I write" (it is suggested that the first thing SCA could output is “I said”, while more processing would be needed to actually have her output “I write”)

SCA runs the following reflection actions (some of them at a level of meta-language) and iterations of the semiotic algorithm (characterized by a syntagmatic algorithm and by a paradigmatic algorithm).

Reflection action 1

SCA does not know anything about the text she reads, i.e. observes, but she knows that she can reproduce any of its symbols by writing with her stick on the sand. The actions of reading and writing can be put together such that the result of one action is input to the other action. Writing what is read is copying. Moreover, it is logical that when something is read, somebody wrote it.

Reflection action 2

SCA may of course know already about herself. However, let us make this explicit when she observes that an action of reading not corresponding to a previous action of writing indicates that there must be an agent in the world to which an “I” is opposed.

Iteration 1 semiotic algorithm

SCA discovers the paradigm of uppercase and lowercase letters, i.e. symbols which can be in a sequence such that when they follow certain other symbols the first one of them is capitalized and when they follow certain other symbols they are normally not capitalized (let us consider “.§Poor”, “.§”Poor”, “, poor” and “ poor”). Proper nouns, e.g. “Pinocchio”, are an exception as they obey their capitalization rules. This does not hinder SCA from learning that “P” and “p” belong to the same paradigm.
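The case-paradigm discovery above can be sketched as a toy program. This is my own illustrative construction, not the author's semiotic algorithm: a capitalized form and a lowercase form are unified when the capitalized one occurs right after a sentence-ending context while the lowercase one also appears in the text.

```python
# Toy sketch: unify "Poor"/"poor" into one paradigm from context,
# while "Pinocchio" (no lowercase counterpart) stays out, mirroring
# the proper-noun exception described in the text.

def unify_case_forms(tokens, sentence_enders=(".", "!", "?")):
    pairs = set()
    for prev, curr in zip(tokens, tokens[1:]):
        low = curr.lower()
        if (curr and curr[0].isupper()      # capitalized token...
                and prev in sentence_enders # ...right after a sentence end
                and low != curr             # has a distinct lowercase form
                and low in tokens):         # which also occurs in the text
            pairs.add((curr, low))
    return pairs
```

Running it on a token stream like `["the", "poor", "boy", ".", "Poor", "Pinocchio", "!"]` pairs `"Poor"` with `"poor"` but leaves `"Pinocchio"` alone.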

Reflection action 3

SCA looks for a way of applying the action of writing to itself. This amounts to a situation of “reported writing”, where someone writes that someone (else) wrote something. She identifies possible candidates for the content of this action in words which stand out from other words due to capitalization. One candidate could be the capitalized words that begin new sentences. Another candidate could be all the words in direct speech (such as "A king!" in the excerpt above). This reflection action takes advantage of the fact that “The Adventures of Pinocchio” contains direct discourse. It would be more complicated for SCA to identify reported speech in a text using only indirect discourse.

Iteration 2 semiotic algorithm

SCA considers every sequence of two words and discovers several paradigms of words (words are considered in both their uppercase and lowercase versions). One paradigm comprises “said”, “cried” and “asked”. Another comprises “Pinocchio” and “Geppetto”. A third comprises “boy”, “man”, “voice”, “Marionette” and “Fairy”. These paradigms are discovered only because of side effects in the text, in particular word-adjacency side effects.
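The word-adjacency side effects can be illustrated with a toy sketch (again my own construction; the author's actual paradigmatic algorithm is not specified here): words that share enough left-hand neighbours in two-word sequences are grouped into a candidate paradigm, which is how "said" and "cried" end up together.

```python
# Toy sketch of paradigm discovery from adjacency: pair up words whose
# sets of preceding words overlap in at least `min_shared` items.

from collections import defaultdict
from itertools import combinations

def find_paradigms(text, min_shared=2):
    words = text.lower().split()
    left_contexts = defaultdict(set)
    for prev, curr in zip(words, words[1:]):
        left_contexts[curr].add(prev)   # record each word's left neighbours
    paradigms = []
    for a, b in combinations(sorted(left_contexts), 2):
        if len(left_contexts[a] & left_contexts[b]) >= min_shared:
            paradigms.append((a, b))
    return paradigms

corpus = ("pinocchio said yes . geppetto said no . "
          "pinocchio cried loudly . geppetto cried softly . "
          "the boy asked why . the man asked how .")
```

On this tiny made-up corpus, "said" and "cried" are paired because both follow "pinocchio" and "geppetto"; a real run over the whole novel would rely on the same kind of side effect at scale.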

Iteration 3 semiotic algorithm

SCA considers syntagms made up of words and comprising paradigms of the previous iteration. A more refined paradigm is created to contain words which are in a similar relationship to quotations and direct discourse as “said”, “cried” and “asked”. Let us refer to this paradigm as p_saying_verbs.

Reflection action 4

SCA compares occurrences of words in p_saying_verbs with occurrences of other words. She makes the hypothesis that words in paradigm p_saying_verbs correspond to the action of writing.

Reflection action 5

SCA makes the hypothesis that, when in combination with words corresponding to the action of writing, other words such as “Pinocchio” and “man” correspond to the agent of writing.

Reflection action 6

SCA considers the fact that there is a proper-noun-like word which appears almost only in direct discourse, but does not seem to denote a writing-capable agent: the first-person pronoun “I”, which occurs more than 500 times in direct discourse. SCA makes the hypothesis that, when she reads “I” in direct discourse, someone is self-referring.

Reflection action 7

SCA makes the hypothesis that when she performs an action of writing, she may refer to this action using a word in the paradigm p_saying_verbs and the word “I”. Therefore, she may write: “I said”.

From “I said” to “I write”

SCA does not know about verb tenses. She can, however, run another iteration of the semiotic algorithm to find the morphemes of verb conjugation: “-s”, “-ed” and “-ing”. She then expands the paradigm p_saying_verbs to include “say” and “saying” as well. She uses Occam’s razor to select the simplest hypothesis for referring to an action of writing.
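The morpheme-discovery step can be sketched as follows (my own toy construction with an assumed suffix list, not the author's algorithm): strip candidate suffixes and keep only stems that unify at least two surface forms, so "say"/"says"/"saying" merge while the irregular "said" stays apart, just as "written" does in the text.

```python
# Toy sketch: group surface forms by stem after stripping the
# conjugation morphemes "-s", "-ed", "-ing" mentioned in the text.

from collections import defaultdict

SUFFIXES = ("ing", "ed", "s", "")   # "" keeps the bare form itself

def group_by_stem(words, min_stem_len=3):
    stems = defaultdict(set)
    for w in words:
        for suf in SUFFIXES:
            if suf and w.endswith(suf) and len(w) - len(suf) >= min_stem_len:
                stems[w[: len(w) - len(suf)]].add(w)
            elif not suf:
                stems[w].add(w)
    # keep only stems that unify at least two distinct surface forms
    return {s: forms for s, forms in stems.items() if len(forms) > 1}
```

Irregular forms fall out naturally: "said" matches no suffix, so it never joins the "say" group, which is exactly why SCA needs the later corpus search to connect "write" and "written".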

She considers:

• the 6 occurrences in direct discourse of the sequence "I said" (always in the presence of quotations inside quotations);
• the 6 occurrences in direct discourse of the sequence "I say" (mostly in the presence of exclamation marks);
• the 2 occurrences in direct discourse of the sequence "I ask" (in the presence of question marks).

She finds that quotations inside quotations are more special than exclamation marks, which occur more than 600 times. Therefore, “I say” is the simplest explanation for referring to her action of writing. SCA can output this sequence.

Finally, interacting with SCA could make her learn to use the words “I write” instead. Let us suppose that we can send a message to SCA of the form "No, you write". Based on this message, SCA searches the corpus again and makes the hypothesis that "write" and "written" belong to the same paradigm ("written" is the irregular past participle of "write"). SCA retrieves the following 4 passages about written signs, which all contain the word "written":

"Oh, really? Then I'll read it to you. Know, then, that written in  letters of fire I see the words: GREAT MARIONETTE THEATER.

on the walls of the  houses, written with charcoal, were words like these: HURRAH FOR THE  LAND OF TOYS! DOWN WITH ARITHMETIC! NO MORE SCHOOL!

The announcements, posted all around the  town, and written in large letters, read thus:§GREAT SPECTACLE TONIGHT

As soon as he was dressed, he put his hands in his pockets and pulled  out a little leather purse on which were written the following words:§The Fairy with Azure Hair returns§fifty pennies to her dear Pinocchio§with many thanks for his kind heart.

SCA then makes the hypothesis that the paradigm of "write" and the paradigm of "say" belong to the same higher-level paradigm, and that when reporting on her own act of writing the former should be used. Finally, SCA outputs the sequence "I write", found nowhere in the corpus.

We can give SCA instructions for each of the reflection actions and each of the iterations of the semiotic algorithm. We can write these instructions so that they can be executed by a computer processor, i.e. so that they constitute a computer program. I have suggested that the program should use composable high-level functions only (operations in Peano arithmetic instead of calls to a black-box arithmetic logic unit, see [Targon, 2016]) so that it operates only with cognitively grounded semiotic symbols, can automate reflective programming (see [Targon, 2018]), and can simulate semiosis. It follows that when the program is executed by a human, the human achieves semiosis (the "semiosis reply" to the Chinese Room Argument).

Bibliography

Gomes, A., Gudwin, R., Queiroz, J. (2003), "On a computational model of Peircean semiosis", In: Hexmoor, H. (ed.) KIMAS 2003: 703-708

Wang, P. (2005), "Experience-grounded semantics: a theory for intelligent systems", Cognitive Systems Research, 6: 282-302

Targon, V. (2016), "Learning the Semantics of Notational Systems with a Semiotic Cognitive Automaton", Cognitive Computation 8(4): 555-576.

Targon, V. (2018), "Towards Semiotic Artificial Intelligence", BICA 2018.

Discuss

### One night, without sleep

August 16, 2018 - 20:50

In memory of all those who fell and were not saved

(I normally have little regard for trigger warnings,

but on this occasion, imagine that my words are prefaced with every trigger warning ever)

A body, on a battlefield, maimed and bleeding;

terrible pain, staring at the sky,

left to die.

An experience that must have happened a million times in human history,

an experience that must be happening somewhere right now.

My situation is better - perhaps. I lie, not on a battlefield, but in a sickbed.

I just have some kind of flu;

and an eye haunted by a migraine that comes, and goes, and threatens to come again;

and I am not sleeping.

Also, I can write.

The smartphone battery was almost dead, I thought I was resigned to simply lying awake for long hours, without the catharsis of expression;

but enough time passed that I stirred, reached for the laptop, hauled it onto the bed, plugged in the phone.

.

The monotony of describing mundane acts

has removed me from the experiences that impelled me to write.

Those experiences said: no one should have to endure anything like this;

life should not be created.

But it was not just sensation that tortured me.

It was the defeat of my will, not just now, but many times.

It was all the lost opportunities to create in my own life,

the pointless obstruction, and being left to rust,

that denied everyone the benefits of what I might have made,

that negated my rare attempt to actually fix, and transform this malign existence.

.

The sun is still far from rising, but my mind has stirred to something like wakefulness.

Possible words now queue for attention and selection,

the hubbub of daylight communication,

rather than crystallizing in the dark,

a single phrase that repeats and repeats and repeats that it not be forgotten.

And I have remembered another thought:

that I am so tired of this. Of having to endure pain, whole days lost to waiting for pain to fade,

in order to keep carting my burdens uphill, alone.

Once, I was optimistic,

fruit of a happy childhood perhaps,

and I think I still carry the error implanted,

the expectation that everything works out for the best,

that in the end one will be noticed and saved.

There was a time, a long time, when I thought I would do the saving;

I thought I knew perspectives that would cheer everyone up,

and recipes that would materially change the world for the better.

this world as it is, needs transformation;

there is no groundswell to even try to make it better,

instead I find myself on a solitary march years in duration;

therefore, no one should create life.

That was 1996.

.

But both my defiant pursuit of a way to redeem existence,

and my sad insight that it is not now an existence in which life should be created,

remained hidden, buried. My life, and the world, twisted and turned, and I never produced a great work;

just fragments, actions, statements that echoed briefly in a handful of other lives.

.

Now I lose the momentum of this testimony too,

which began with a memoriam for all those lost who never return.

The sun will come, the illness subside, the migraine fade (though it returns weekly);

and during the long night I fashioned new tactics for escaping the circumstances that weigh on me,

that interfere with my imperatives, this time so much so that the effort to preserve my mind

left my body weak and sick with misery.

My experience tells me, a hundred times over:

shout for help, say you and all your possibilities are going to waste,

and no one will come for you.

So one is left to survive, on the charity of family and the cunning to make a little money,

left to attempt to do as much as possible on one's own,

and then, occasionally, make another big appeal for help.

But time grows short; the machines have advanced;

it could mean hope more concrete than ever;

but I want to do more than just cultivate private hope,

I want to attempt with all my strength, to do the specific things that I see to be done.

And this is why the illness of bodily defeat is so bitter;

one's struggle, conducted alone, but in the hope of one day dragging treasures into daylight,

is felled by the weakness of one's own physical vehicle.

And now grief sears through me, though it seems that I will live to fight another day;

and a voiceless cry forces itself out (but I keep it silent, for the others who try to sleep, wrapped in their own lives, in rooms nearby),

and my cheeks are wet with tears.

Whether it is the thought of all those who were not saved,

or just the thought of myself and those others I have known who, though they lived, were not saved;

and there was a flash of fierceness too, some ancient killer instinct,

sensing opportunity to finally vanquish a foe.

.

There is only anticlimax left. Words, that I will not further shape despite their flaws,

a message from one depth and one overcoming,

out of the myriads that have been endured.

Discuss

### Entropic Regret I: Deterministic MDPs

August 16, 2018 - 20:34

This is the first in a series of essays that aim to derive a regret bound for DRL that depends on finer attributes of the prior than just the number of hypotheses. Specifically, we consider the entropy of the prior and a certain learning-theoretic dimension parameter. As a "by-product", we derive a new regret bound for ordinary RL without resets and without traps. In this chapter, we open the series by demonstrating the latter, under the significant simplifying assumption that the MDPs are deterministic.

Background

The regret bound we previously derived for DRL grows as a power law with the number of hypotheses. In contrast, the RL literature is usually concerned with considering all transition kernels on a fixed state space satisfying some simple assumption (e.g. a bound on the diameter or bias span). In particular, the number of hypotheses is uncountable. While the former seems too restrictive (although it's relatively simple to generalize it to countable hypothesis classes), the latter seems too coarse. Indeed, we expect a universally intelligent agent to detect patterns in the data, i.e. follow some kind of simplicity prior rather than a uniform distribution over transition kernels.

The underlying technique of our proof was lower bounding the information gain in a single episode of posterior sampling by the expected regret incurred during this episode. Although we have not previously stated it in this form, the resulting regret bound depends on the entropy of the prior (we considered a uniform prior instead). This idea (unbeknownst to us) appeared earlier in Russo and Van Roy. Moreover, Russo and Van Roy later used it to formulate a generalization of posterior sampling they call "information-directed sampling", which can produce far better regret bounds in certain scenarios. However, to the best of our knowledge, this technique was not previously used to analyze reinforcement learning (as opposed to bandits). Therefore, it seems natural to derive such an "entropic" regret bound for ordinary RL before extending it to DRL.

Now, Osband and Van Roy derived a regret bound for priors supported on some space of transition kernels as a function of its Kolmogorov dimension and "eluder" dimension (the latter introduced previously by Russo and Van Roy). They also consider a continuous state space. This is a finer approach than considering nearly arbitrary transition kernels on a fixed state space, but it still doesn't distinguish between different priors with the same support. Our new results involve a parameter similar to eluder dimension, but instead of Kolmogorov dimension we use entropy (in the following chapters we will see that Kolmogorov dimension is, in some sense, an upper bound on entropy). As opposed to Osband and Van Roy, we currently limit ourselves to finite state spaces, but on the other hand we consider no resets (at the price of a "no traps" assumption).

In this chapter we derive the entropic regret bound for deterministic MDPs (in the following we will call them deterministic decision processes (DDPs), since they have little to do with Markov). This restriction significantly simplifies the analysis. In the following chapters we will extend the result to stochastic MDPs; however, the resulting regret bound will be somewhat weaker.

Results

We start by introducing a new learning-theoretic concept of dimension. It is similar to eluder dimension, but is adapted to the discrete deterministic setting and also somewhat stronger (i.e. smaller: more environment classes are low dimensional w.r.t. this concept than w.r.t. eluder dimension).

Definition 1

Consider sets A, B and F⊆{A→B} non-empty. Given C⊆A×B and a∗∈A, we say that a∗ is F-dependent on C when for any f,g∈F s.t. for any (a,b)∈C it holds f(a)=g(a)=b, we have f(a∗)=g(a∗). Otherwise, we say that a∗ is F-independent of C.

The prediction dimension of F (denoted dimpF) is the supremum of the set of n∈N for which there is a sequence {(ak∈A,bk∈B)}k∈[n] s.t. for every k∈[n], ak is F-independent of {(aj,bj)}j∈[k].
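To make the definition concrete, here is a brute-force sketch (not from the post) that computes dimpF for a tiny finite function class, representing each f∈F as a Python dict; the toy class F below is hypothetical.

```python
from itertools import combinations

def is_independent(F, C, a_star):
    """a_star is F-independent of the constraint set C iff some pair f, g in F
    both satisfy every constraint in C yet disagree at a_star."""
    consistent = [f for f in F if all(f[a] == b for a, b in C)]
    return any(f[a_star] != g[a_star] for f, g in combinations(consistent, 2))

def prediction_dimension(F, A):
    """Brute-force dim_p(F) for a small finite class F of dicts A -> B:
    the length of the longest sequence (a_k, b_k) in which each a_k is
    F-independent of the pairs before it."""
    best = 0
    def extend(C):
        nonlocal best
        best = max(best, len(C))
        for a in A:
            if is_independent(F, C, a):
                # try every value some hypothesis realizes at a
                for b in {f[a] for f in F}:
                    extend(C + [(a, b)])
    extend([])
    return best

# hypothetical toy class: three functions on A = {0, 1}
F = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 1, 1: 0}]
print(prediction_dimension(F, [0, 1]))  # → 2
```

The output is consistent with Propositions 1 and 2 below: dimpF < |F| = 3 and dimpF ≤ |A| = 2.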

Fix non-empty finite sets S (the set of states) and A (the set of actions). Denote X:=S×[0,1], where the second factor is regarded as the space of reward values. Given e:S×A→X, Te:S×A→S and Re:S×A→[0,1] s.t. e(s,a)=(Te(s,a),Re(s,a)), we regard Te as the (deterministic) transition kernel and Re as the reward function associated with "DDP hypothesis" e. This allows us to speak of the dimension of a DDP hypothesis class (i.e. some H⊆{S×A→X}). We now give some examples for dimensions of particular function/hypothesis classes.

Proposition 1

Given any A, B and F⊆{A→B}, we have dimpF<|F|.

Proposition 2

Given any A, B and F⊆{A→B}, we have dimpF≤|A|. In particular, given H as above, dimpH≤|S|⋅|A|.

Proposition 3

We now consider deterministic Markov decision processes that are cellular automata. Consider finite sets X (the set of cells), M (the set of neighbor types) and a mapping ν:X×M→X (which cell is the neighbor of which cell). For example, M might be a subset of a group acting on X. More specifically, X can be (Z/nZ)d acting on itself, corresponding to a d-dimensional toroidal cellular automaton of size n.

Consider another set C (the set of cell states) and suppose that S={X→C}. Given any s∈S and x∈X, define sx:M→C by sx(m):=s(ν(x,m)). Given any T:{M→C}×A→C, define Tglob:S×A→S by Tglob(s,a)(x):=T(sx,a). Given any R:{M→C}→[0,1], define Rglob:S→[0,1] by Rglob(s):=(1/|X|)⋅∑_{x∈X} R(sx). Define H by

H := { (Tglob, Rglob) ∣ T:{M→C}×A→C, R:{M→C}→[0,1] }

That is, H is the set of transition kernels and reward functions that are local in the sense defined by ν. Then, dimpH ≤ |C|^|M|⋅(|A|+1).

In Proposition 3, it might seem like, although the rules of the automaton are local, the influence of the agent is necessarily global, because the dependence on the action appears in all cells. However, this is not really a restriction: the state of the cells can encode a particular location for the agent and the rules might be such that the agent's influence is local around this location. More unrealistic is the full observability. Dealing with partially observable cellular automata is outside the scope of this essay.
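As an illustration of the construction in Proposition 3 (a sketch under assumptions: a 1-dimensional cyclic automaton with X=Z/5Z acting on itself, M={−1,0,+1}, and hypothetical local rules T and R), lifting local rules to Tglob and Rglob looks like:

```python
n = 5                      # number of cells
M = (-1, 0, +1)            # neighbor offsets (the set of neighbor types)

def nu(x, m):              # which cell is the m-neighbor of x
    return (x + m) % n

def pattern(s, x):         # s_x : M -> C, the local view around cell x
    return tuple(s[nu(x, m)] for m in M)

def T(local, a):           # hypothetical local rule: XOR of outer neighbors and action
    return (local[0] ^ local[2] ^ a) % 2

def R(local):              # hypothetical local reward: 1 if the centre cell is 1
    return float(local[1])

def T_glob(s, a):          # Tglob(s, a)(x) := T(s_x, a)
    return tuple(T(pattern(s, x), a) for x in range(n))

def R_glob(s):             # Rglob(s) := (1/|X|) * sum_x R(s_x)
    return sum(R(pattern(s, x)) for x in range(n)) / n

s = (1, 0, 0, 1, 0)
print(T_glob(s, 0), R_glob(s))  # → (0, 1, 1, 0, 0) 0.4
```

Note that the action enters every cell's update, matching the remark above that the agent's influence looks global even when the rules are local.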

The central lemma in the proof of the regret bound for RL is a regret bound in its own right, in the setting of (deterministic) contextual bandits. Since this lemma might be of independent interest, we state it already here.

Let S (contexts), A (arms) and O (outcomes) be non-empty sets. Fix a function R:O→[0,1] (the reward function). For any c∈Sω (a fixed sequence of contexts), e:S×A→O (outcome rule) and π:(S×O)∗×S→A (policy), we define ecπ∈ΔOω to be the resulting distribution over outcome histories. Given γ∈[0,1), we define Uγ:Oω→[0,1] (the utility function) by

Uγ(o) := (1−γ) ∑_{n=0}^∞ γ^n R(o_n)

Lemma 1

Consider a countable non-empty set of hypotheses H⊆{S×A→O} and some ζ∈ΔH (the prior). For each s∈S, define Hs⊆{A→O} by

Hs := { e:A→O ∣ ∃e′∈H ∀a∈A: e(a)=e′(s,a) }

Let D:=max_{s∈S} dimpHs and suppose that A is countable [this assumption is to simplify the proof and is not really necessary]. Then, there exists π†:(S×O)∗×S→A s.t. for any c∈Sω and γ∈(0,1)

E_{e∼ζ}[ (1−γ) ∑_{n=0}^∞ γ^n max_{a∈A} R(e(c_n, a)) − E_{ecπ†}[U_γ] ] ≤ √( (16 D H(ζ) / ln 2) ⋅ (1−γ) )

Note that the expression on the left hand side is the Bayesian regret. On the right hand side, H(ζ) stands for the Shannon entropy of ζ. In particular we have

H(ζ) ≤ ln|H| ≤ |S|⋅|A|⋅ln|O|

Also, it's not hard to see that |H| ≤ |O|^dimpH ≤ |O|^(D|S|) and therefore

D⋅H(ζ) ≤ (dimpH)²⋅ln|O|

D⋅H(ζ) ≤ D²⋅|S|⋅ln|O|

Finally, the policy π† we actually consider in the proof is Thompson sampling.
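A minimal sketch of this policy in the deterministic contextual-bandit setting (the two-element hypothesis class, context sequence and reward map below are illustrative, not the post's construction): sample a hypothesis from the posterior, play the greedy arm for it, then condition the posterior on exact consistency with the observed outcome.

```python
import random

A = [0, 1]
def make_e(bit):           # hypothetical outcome rules e : S x A -> O, indexed by a bit
    return lambda s, a: (s + a + bit) % 2

hypotheses = [make_e(0), make_e(1)]
posterior = [0.5, 0.5]     # the prior zeta, updated in place
R = lambda o: float(o)     # reward function R : O -> [0, 1]
true_e = hypotheses[1]     # the environment K, unknown to the agent

total = 0.0
for context in [0, 1, 0, 1, 0]:
    # Thompson sampling: draw a hypothesis from the posterior, act greedily for it
    j = random.choices(range(len(hypotheses)), weights=posterior)[0]
    a = max(A, key=lambda arm: R(hypotheses[j](context, arm)))
    o = true_e(context, a)
    total += R(o)
    # deterministic setting: condition on exact consistency with the outcome
    posterior = [w if e(context, a) == o else 0.0
                 for w, e in zip(posterior, hypotheses)]
    z = sum(posterior)
    posterior = [w / z for w in posterior]
print(total, posterior)
```

In this toy run the posterior collapses onto the true hypothesis after the first round, so at most one round of reward is lost.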

Now we proceed to studying reinforcement learning. First, we state a regret bound for RL with resets. Now S stands for the set of states and A for the set of actions. We fix a sequence of initial states c∈Sω. For any e:S×A→X (environment), π:X∗×X→A (policy) and T∈N+, we define ec[T]π∈ΔXω to be the resulting distribution over histories, assuming that the state is reset to cn and the reward to 0 every time period of length T+1. In particular, we have

Pr_{x∼ec[T]π}[ ∀n∈N: x_{n(T+1)} = (c_n, 0) ] = 1

Given γ∈[0,1) we define Uγ:Xω→[0,1]

Uγ(sr) := (1−γ) ∑_{n=0}^∞ γ^n r_n
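As a numerical sanity check (a sketch, not part of the post), the discounted utility can be approximated by truncating the series once the remaining geometric tail is negligible:

```python
def U_gamma(rewards, gamma, tol=1e-12):
    """(1 - gamma) * sum_n gamma^n * r_n for an infinite reward stream;
    `rewards` is a function n -> r_n with values in [0, 1], and the sum is
    truncated once the discount weight drops below tol."""
    total, weight, n = 0.0, 1.0, 0
    while weight > tol:
        total += weight * rewards(n)
        weight *= gamma
        n += 1
    return (1 - gamma) * total

# a constant reward stream of 1 has utility 1 for any gamma in [0, 1)
print(round(U_gamma(lambda n: 1.0, 0.9), 6))  # → 1.0
```

The (1−γ) normalization is what keeps Uγ in [0,1] regardless of the discount factor.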

Theorem 1

Consider a countable non-empty set of hypotheses H⊆{S×A→X} and some ζ∈ΔH. Let D:=dimpH. Then, for any T∈N+ and γ∈[0,1), there exists π†_{T,γ}:X∗×X→A s.t. for any c∈Sω

E_{e∼ζ}[ max_{π:X∗×X→A} E_{ec[T]π}[U_γ] − E_{ec[T]π†_{T,γ}}[U_γ] ] ≤ √( (16 D H(ζ) / ln 2) ⋅ (T+1)(1−γ) )

Note that ec[T]π is actually a probability measure concentrated on a single history, since π is deterministic: we didn't make this explicit only to avoid introducing new notation.

Finally, we give the regret bound without resets. For any e:S×A→X and π:X∗×X→A, we define eπ∈ΔXω to be the resulting distribution over histories, given initial state s0 and no resets.

Theorem 2

Consider a countable non-empty set of hypotheses H⊆{S×A→X} and some ζ∈ΔH. Let D:=dimpH. Assume that for any e∈H and s∈S, A0e(s)=A [A0 was defined here in "Definition 1"; so was the value function V(s,x) used below] (i.e. there are no traps). For any γ∈[0,1) we define τ(γ) by

τ(γ) := E_{e∼ζ}[ max_{s∈S} sup_{x∈(γ,1)} |dV_e(s,x)/dx| ]

Then, for any γ∈[0,1) s.t. 1−γ≪1 there exists π†_γ:X∗×X→A s.t.

E_{e∼ζ}[ max_{π:X∗×X→A} E_{eπ}[U_γ] − E_{eπ†_γ}[U_γ] ] = O( (D H(ζ) ⋅ (τ(γ)+1)(1−γ))^(1/3) )

Note that τ(γ) decreases with γ so this factor doesn't make the qualitative dependence on γ any worse.

Both Theorem 1 and Theorem 2 have anytime variants in which the policy doesn't depend on γ at the price of a slightly (within a constant factor) worse regret bound, but for the sake of brevity we don't state them (our ultimate aim is DRL which is not anytime anyway). In Theorem 2 we didn't specify the constant, so it is actually true verbatim without the γ dependence in π†γ (but we still leave the dependence to simplify the proof a little). It is also possible to spell out the assumption on γ in Theorem 2.

Proofs

Definition A.1

Consider sets A, B and F⊆{A→B} non-empty. A sequence {(ak∈A,bk∈B)}k∈[n] is said to be F-independent, when for every k∈[n], ak is F-independent of {(aj,bj)}j∈[k].

Definition A.2

Consider sets A, B. Given C⊆A×B and a∗∈A, suppose f,g:A→B are s.t.:

• For any (a,b)∈C, f(a)=g(a)=b
• f(a∗)=g(a∗)

Then, f,g are said to shatter (C,a∗).

Given sequences {(ak∈A,bk∈B)}k∈[n] and {(fk,gk:A→B)}k∈[n], (f,g) is said to shatter (a,b) when for any k∈[n], (fk,gk) shatters ({(aj∈A,bj∈B)}j∈[k],ak).

Proof of Proposition 1

Consider {(ak∈A,bk∈B)}k∈[n] an F-independent sequence. We will now construct G⊆F s.t. |G|=n+1, which is sufficient.

By the definition of F-independence, there is a sequence {(fk,gk∈F)}k∈[n] that shatters (a,b). We have fk(ak)≠gk(ak) and therefore either fk(ak)≠bk or gk(ak)≠bk. Without loss of generality, assume that for each k∈[n], fk(ak)≠bk. It follows that for any k∈[n] and j∈[k], fj(aj)≠bj=fk(aj) and therefore fj≠fk.

If n=0 then there is nothing to prove since F is non-empty, hence we can assume n>0. For any k∈[n−1], fk(ak)≠bk=gn−1(ak) and therefore fk≠gn−1. Also, fn−1(an−1)≠gn−1(an−1) and therefore fn−1≠gn−1. We now take G={f0,f1,…,fn−1,gn−1}, completing the proof. ■

Proof of Proposition 2

Consider {(ak∈A,bk∈B)}k∈[n] an F-independent sequence and {(fk,gk)}k∈[n] that shatters it. For any k∈[n] and j∈[k], we have fk(aj)=gk(aj) but fk(ak)≠gk(ak), implying that aj≠ak. It follows that |A|≥n. ■

Proof of Proposition 3

Consider {(sk∈S,ak∈A,tk∈S,rk∈[0,1])}k∈[n] an H-independent sequence and {(Tglobk,Rglobk,~Tglobk,~Rglobk)}k∈[n] that shatters it. Define A,B⊆[n] by

A := { k∈[n] ∣ Tglobk(sk,ak) ≠ ~Tglobk(sk,ak) }

B := { k∈[n] ∣ Rglobk(sk) ≠ ~Rglobk(sk) }

Since (Tglob,Rglob,~Tglob,~Rglob) shatters (s,a,t,r), we have A∪B=[n].

Consider any k∈A. Obviously, there is xk∈X s.t.

Tglobk(sk,ak)(xk)≠~Tglobk(sk,ak)(xk)

Denote σk:=(sk)xk∈{M→C}. We have Tk(σk,ak)≠~Tk(σk,ak). On the other hand, for any j∈A∩[k], the shattering implies Tglobk(sj,aj)=~Tglobk(sj,aj) and in particular Tk(σj,aj)=~Tk(σj,aj). Therefore, (σk,ak)≠(σj,aj). We conclude that |A| ≤ |C|^|M|⋅|A| (on the left A is the index set defined above, on the right it is the action set).

Now consider any k∈B. Define fk∈R^{M→C} by

(fk)σ := |{ x∈X ∣ (sk)x = σ }| / |X|

We have

fk⋅Rk = Rglobk(sk) ≠ ~Rglobk(sk) = fk⋅~Rk

fk⋅(Rk−~Rk) ≠ 0

(These are dot products in R^{M→C}.) On the other hand, for any j∈B∩[k], the shattering implies fj⋅Rk=fj⋅~Rk and therefore fj⋅(Rk−~Rk)=0. Therefore, fk is not in the linear span of {fj}j∈B∩[k] and hence the set {fk}k∈B is linearly independent. We conclude that |B| ≤ dim R^{M→C} = |C|^|M|.

Putting everything together, we get n ≤ |A|+|B| ≤ |C|^|M|⋅(|A|+1). ■

Proposition A.1

Consider a countable set A, a set O, a non-empty countable set H⊆{A→O}, some f∗:A→O, some ζ∈ΔH and some ξ∈ΔA. Denote

p := Pr_{(a,f)∼ξ×ζ}[f(a)≠f∗(a)]

Then, for any q∈(0,1), we can choose A∘⊆A and F∘⊆H s.t. ξ(A∘)≥1−q, ζ(F∘) ≥ 1−(p/q)⋅dimpH and for any f,g∈F∘ and a∈A∘, f(a)=g(a).

Proof of Proposition A.1

We define A∘ by

A∘ := { a∈A ∣ Pr_{f∼ζ}[f(a)≠f∗(a)] ≤ p/q }

The fact that ξ(A∘)≥1−q follows from the definition of p.

Denote n:=|A∘|∈N∪{ω} and enumerate A∘ as A∘={ak}k∈[n]. For each k∈[n], define Fk⊆H recursively by

F0 := H

Fk+1 := Fk if ∀f,g∈Fk: f(ak)=g(ak), and Fk+1 := {f∈Fk ∣ f(ak)=f∗(ak)} otherwise

Set F∘:=⋂k∈[n]Fk. Define I⊆[n] by

I:={k∈[n]|Fk+1≠Fk}

Denote m:=|I|. For any i∈[m], we denote by ki the i-th number in I in increasing order. Denote a∗i:=aki and b∗i:=f∗(a∗i). By the definition of Fk, for each i∈[m] we can choose fi,gi∈Fki s.t. fi(a∗i)≠gi(a∗i). Moreover, it also follows from the definition that for every i∈[m] and j∈[i], fi(a∗j)=gi(a∗j)=b∗j (using only the fact that fi,gi∈Fki and ki>kj). Therefore, (f,g) shatters (a∗,b∗) and hence m≤dimpH.

By the definition of A∘, for any k∈I we have ζ(Fk+1)≥ζ(Fk)−p/q. On the other hand, for any k∈[n]∖I, Fk+1=Fk and in particular ζ(Fk+1)=ζ(Fk). We conclude that

ζ(F∘) ≥ ζ(F0) − (p/q)⋅m ≥ 1 − (p/q)⋅dimpH ■
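The recursive construction of Fk above can be sketched procedurally: scan the inputs of A∘ in order and, whenever the surviving class still disagrees at ak, cut down to the functions agreeing with f∗ there. The toy class and reference function below are hypothetical.

```python
def restrict(F, good_inputs, f_star):
    """Run the F_k elimination from the proof of Proposition A.1.
    Returns the surviving class F∘ and |I|, the number of disagreement
    steps, which the proof bounds by dim_p(F)."""
    Fk = list(F)
    disagreement_steps = 0
    for a in good_inputs:
        if len({f[a] for f in Fk}) > 1:      # F_k still disagrees at a_k
            Fk = [f for f in Fk if f[a] == f_star[a]]
            disagreement_steps += 1
    return Fk, disagreement_steps

# hypothetical toy class on A = {0, 1} with f_star as reference
F = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 1, 1: 0}]
f_star = {0: 0, 1: 0}
F_final, m = restrict(F, [0, 1], f_star)
print(F_final, m)  # → [{0: 0, 1: 0}] 2
```

On this toy class m = 2, matching dimpF for the same class, in line with the bound m ≤ dimpF used in the proof.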

Proposition A.2

Consider a countable set A, a set O, a non-empty countable set H⊆{A→O}, some ζ∈ΔH and some ξ∈ΔA. Consider f∗:A→O s.t.

f∗(a) ∈ argmax_{b∈O} Pr_{f∼ζ}[f(a)=b]

Then,

Pr_{(a,f)∼ξ×ζ}[f(a)≠f∗(a)] ≤ (1/ln 2)⋅I_{(a,f)∼ξ×ζ}[f; a, f(a)]

[I[f;a,f(a)] stands for the mutual information between f and the joint random variables a,f(a).]

Proof of Proposition A.2

By the chain rule for mutual information

I_{(a,f)∼ξ×ζ}[f; a, f(a)] = I_{(a,f)∼ξ×ζ}[f; a] + I_{(a,f)∼ξ×ζ}[f; f(a) | a]

The first term on the right hand side obviously vanishes, and the second term can be expressed in terms of KL-divergence. For any a∈A, define eva:H→O by eva(f):=f(a). We get

I_{(a,f)∼ξ×ζ}[f; a, f(a)] = E_{(a,f)∼ξ×ζ}[ DKL( δ_{f(a)} ∣∣ eva∗ζ ) ] = E_{(a,f)∼ξ×ζ}[ ln( 1 / ζ(eva^{−1}(f(a))) ) ]

I_{(a,f)∼ξ×ζ}[f; a, f(a)] ≥ E_{(a,f)∼ξ×ζ}[ ln( 1 / ζ(eva^{−1}(f(a))) ) ; f(a)≠f∗(a) ]

For any a∈A and f∈H, ζ(ev−1a(f(a)))≤ζ(ev−1a(f∗(a))) by definition of f∗. If f(a)≠f∗(a), then

ζ(ev−1a(f(a)))+ζ(ev−1a(f∗(a)))=ζ(ev−1a({f(a),f∗(a)}))≤1

It follows that, in this case, ζ(ev−1a(f(a)))≤1/2. We conclude

I(a,f)∼ξ×ζ[f;a,f(a)]≥(ln2)Pr(a,f)∼ξ×ζ[f(a)≠f∗(a)]■

Proposition A.3

Consider a countable set A, a set O, a non-empty countable set H⊆{A→O}, some ζ∈ΔH and some R:O→[0,1]. Let Π:H→A be s.t.

Π(e)∈argmaxa∈AR(e(a))

Then,

Ee∼ζ[maxa∈A R(e(a)) − Ea∼Π∗ζ[R(e(a))]] ≤ 4√((dimpH/ln 2)⋅I(a,e)∼Π∗ζ×ζ[e;a,e(a)])

Proof of Proposition A.3

Denote ξ:=Π∗ζ, D:=dimpH and Γ:=I(a,e)∼ξ×ζ[e;a,e(a)]. By Proposition A.2, there is e∗:A→O s.t.

Pr(a,e)∼ξ×ζ[e(a)≠e∗(a)] ≤ Γ/ln 2

By Proposition A.1 (setting q:=√(ΓD/ln 2)), there are A∘⊆A and H∘⊆H s.t. ξ(A∘)≥1−√(ΓD/ln 2), ζ(H∘)≥1−√(ΓD/ln 2) and for any e1,e2∈H∘ and a∈A∘, e1(a)=e2(a). Define H∗:=H∘∩Π−1(A∘). We have ζ(H∗)≥1−2√(ΓD/ln 2).

Denote L:=Ee∼ζ[maxa∈AR(e(a))−Ea∼ξ[R(e(a))]]. For brevity, we will also use the notation Ee∗[X]:=Ee∼ζ[X;e∈H∗]. We get, using the bound on ζ(H∗)

L ≤ Ee∗[maxa∈A R(e(a)) − Ea∼ξ[R(e(a))]] + 2√(ΓD/ln 2)

L ≤ Ee∗[Ee′∗[R(e(Π(e))) − R(e(Π(e′)))]] + 4√(ΓD/ln 2)

Using the properties of H∘ and A∘ we get

L ≤ Ee∗[Ee′∗[R(e′(Π(e))) − R(e′(Π(e′)))]] + 4√(ΓD/ln 2)

The expression inside the expected values in the first term on the right hand side is non-positive, by the defining property of Π. We conclude

L ≤ 4√(ΓD/ln 2) ■

Proof of Lemma 1

For each s∈S we choose some Πs:H→A s.t.

Πs(e)∈argmaxa∈AR(e(s,a))

We now take π† to be Thompson sampling. More formally, we construct a probability space (Ω,P) with random variables K:Ω→H (the true environment) and {Jn:Ω→{Sn→H}}n∈N (the hypothesis sampled on round n). We define {An:Ω→{Sn+1→A}}n∈N (the action taken on round n) and {Θn:Ω→{Sn+1→O}}n∈N (the observation made on round n) by

An(s):=Πsn(Jn(s:n))

Θn(s):=K(sn,An(s))

We also define {Hn:Ω→{Sn→H}}n∈N by

Hn(s):={e∈H | ∀m∈[n]: e(sm,Am(s:m))=Θm(s:m)}

Hn is thus the set of hypotheses consistent with the previous outcomes K(sm,Πsm(Jm)). The random variables K,J have to satisfy K∗P=ζ (the distribution of the true environment is the prior) and

Pr[Jn(s)=e∣∣K,{Jm(s:m)}m∈[n]]=Prζ[e|Hn(s)]

That is, the distribution of Jn conditional on K and Jm for m∈[n] is given by the prior ζ updated by the previous outcomes. π† is then defined s.t. for any so∈(S×O)n, t∈S and a∈A:

π†(a|so,t):=Pr[An(st)=a|∀m∈[n]:om=Θm(s:m)]
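For intuition, here is a minimal Python sketch of this Thompson-sampling construction for a finite class of deterministic hypotheses (all names are hypothetical; states and discounting are omitted for brevity):

```python
import random

def thompson_sampling(hypotheses, prior, true_env, actions, reward, rounds, rng):
    """Thompson sampling over a finite class of deterministic hypotheses:
    sample a hypothesis from the posterior, act optimally for it, then
    discard hypotheses inconsistent with the observed outcome."""
    posterior = list(prior)
    total = 0.0
    for _ in range(rounds):
        # Sample a hypothesis according to the current posterior.
        e = rng.choices(hypotheses, weights=posterior)[0]
        # Act optimally for the sampled hypothesis.
        a = max(actions, key=lambda act: reward(e(act)))
        o = true_env(a)
        total += reward(o)
        # Condition the posterior on the observation: for deterministic
        # hypotheses this just zeroes out the inconsistent ones.
        posterior = [p if h(a) == o else 0.0
                     for p, h in zip(posterior, hypotheses)]
        z = sum(posterior)
        posterior = [p / z for p in posterior]
    return total / rounds
```

Since the true environment is always in the support of the posterior, the normalization never divides by zero, and mis-sampled hypotheses are eliminated as soon as they mispredict.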

Now we need to prove the regret bound. We denote the Bayesian regret by R:

R(c,γ):=Ee∼ζ[(1−γ)∞∑n=0γnmaxa∈AR(e(cn,a))−Eecπ†[Uγ]]

The construction of π† implies that

R(c,γ)=(1−γ)∞∑n=0γnE[maxa∈AR(K(cn,a))−R(K(cn,An(c:n+1)))]

Define {Zn:Ω→{Sn→ΔH}}n∈N (the belief state of the agent on round n) and {Ξn:Ω→{Sn+1→ΔA}}n∈N (the distribution over actions on round n) by

Zn(s):=ζ|Hn(s)

Ξn(s):=Πsn∗Zn(s)

Using expectation conditional on Zn we can rewrite the equation for R(c,γ) as

R(c,γ)=(1−γ)∞∑n=0γnE[Ee∼Zn(c:n)[maxa∈AR(e(cn,a))−Ea∼Ξn(c:n+1)[R(e(cn,a))]]]

It follows that (by Jensen's inequality, since the weights (1−γ)γn sum to 1)

R(c,γ)≤  ⎷(1−γ)∞∑n=0γnE⎡⎣Ee∼Zn(c:n)[maxa∈AR(e(cn,a))−Ea∼Ξn(c:n+1)[R(e(cn,a))]]2⎤⎦

We now apply Proposition A.3 and get

R(c,γ)≤ ⎷16Dln2⋅(1−γ)∞∑n=0γnE[I(a,e)∼Ξn(c:n+1)×Zn(c:n)[e;a,e(a)]]R(c,γ)≤ ⎷16Dln2⋅(1−γ)∞∑n=0γnE[H(Zn(c:n))−H(Zn+1(c:n+1))]R(c,γ)≤√16DH(ζ)ln2⋅(1−γ)■

Given any x∈X, we will use the notation x=(stx∈S,rwx∈[0,1]).

Proposition A.4

Fix T∈N+ and consider a non-empty set of DDP hypotheses H⊆{S×A→X}. For each e:S×A→X we define e♯:S×A^T→X^T recursively by

e♯(s,π)0:=e(s,π0)

e♯(s,π)n+1:=e(st(e♯(s,π)n),πn+1)

(That is, e♯(s,π) is the history resulting from DDP e interacting with action sequence π starting from initial state s.) Define

H♯:={e♯ | e∈H}

Then, for any s∈S, dimpH♯s≤dimpH.

Here, the subscript s has the same meaning as in Lemma 1.

Proof of Proposition A.4

Fix s∈S. Let {(πk∈A♯,hk∈XT)}k∈[n] be an H♯s-independent sequence. We will now construct a sequence

{(s∗k∈S,a∗k∈A,x∗k∈X,ek∈H,~ek∈H)}k∈[n]

s.t. (e,~e) shatters (s∗,a∗,x∗). The latter implies that (s∗,a∗,x∗) is an H-independent sequence, establishing the claim.

For brevity, we will use the notation fπ:=f♯(s,π) (here, f:S×A→X and π:S×N→A). For each k∈[n] define mk∈[T] by

mk:=min{l | ∃f,~f∈H: (fπk):l+1≠(~fπk):l+1 ∧ ∀j∈[k]: fπj=~fπj=hj}

Note that the set above is indeed non-empty because (π,h) is H♯s-independent.

Given g∈XT, we will use the notational convention stg−1:=s.

Choose ek,~ek s.t. (ekπk):mk+1≠(~ekπk):mk+1 and ∀j∈[k]:ekπj=~ekπj=hj. Obviously (ekπk):mk=(~ekπk):mk. We set s∗k:=st(ekπk)mk−1 and a∗k:=(πk)mk. Clearly ek(s∗k,a∗k)≠~ek(s∗k,a∗k).

If k<n−1 then there is f∈H s.t. ∀j∈[k+1]:fπj=hj (by independence). Therefore, st(hk)mk−1=s∗k: otherwise st(fπk)mk−1≠st(ekπk)mk−1 and we get a contradiction with the minimality of mk. We now set x∗k:=(hk)mk. If k=n−1, we choose x∗k∈X arbitrarily. For any k∈[n] and j∈[k], ekπj=~ekπj=hj and therefore ek(s∗j,a∗j)=~ek(s∗j,a∗j)=x∗j. ■

Proof of Theorem 1

Let A♯:=A^T, O♯:=X^T and H♯ be defined as in Proposition A.4. By Proposition A.4, we have maxs∈S dimpH♯s≤D. Define ζ♯∈ΔH♯ as the pushforward of ζ by the ♯ operator. Define R♯γ:O♯→[0,1] by

R♯γ(h):=(γ(1−γ)/(1−γ^(T+1)))⋅T−1∑n=0γn⋅rwhn

Denote also γ♯:=γ^(T+1). It is easy to see that given any η∈(O♯)^ω

Uγ(∞∏n=0(cn,0)ηn) = (1−γ♯)∞∑n=0(γ♯)^n⋅R♯γ(ηn)

The product is in the string concatenation sense.

Applying Lemma 1 to all the "♯" objects (with S and c unaffected), we get π†T,γ s.t.

Ee∼ζ[maxπ:X∗×X→A Eec[T]π[Uγ] − Eec[T]π†T,γ[Uγ]] ≤ √((16D⋅H(ζ♯)/ln 2)⋅(1−γ♯))

Here, we used the fact that the optimal policy for a DDP with fixed initial state (and any time discount function) can be taken to be a fixed sequence of actions.

Observing that H(ζ♯)≤H(ζ) and 1−γ^(T+1)≤(T+1)(1−γ), we get the desired result. ■

The following is a minor variant of what was called "Proposition B.2" in a previous post; we state it here without proof, since the proof is essentially the same.

Proposition A.5

Consider a DDP e:S×A→X, policies π0,π∗:X∗×X→A and T∈N+. Suppose that π∗ is optimal for e with time discount γ and any initial state. Denote

τe(γ):=maxs∈S supx∈(γ,1) |dVe(s,x)/dx|

Given any s∈S and policy π, denote esπ∈ΔXω the distribution over histories resulting from π interacting with e starting from initial state s. Finally, define UT,γ:Xω→[0,1] by

UT,γ(h):=((1−γ)/(1−γ^T))⋅T−1∑n=0γn⋅rwhn

Then

Eeπ∗[Uγ] ≤ (1−γ^T)∞∑n=0γ^(nT)⋅Eh∼eπ0[Ee,st(hnT),π∗[UT,γ]] + 2τe(γ)γ^T(1−γ)/(1−γ^T)

Proof of Theorem 2

It is not hard to see that Theorem 1 can be extended to the setting where the initial state sequence c∈Sω is chosen in a way that depends on the history. This is because if c is chosen adversarially (i.e. in order to maximize regret), the history doesn't matter (in other words, we get a repeated zero-sum game in which, on each round, the initial state is chosen by the adversary and the policy is chosen by the agent after seeing the initial state; this game clearly has a pure Nash equilibrium). In particular, we can let c be just the states naturally resulting from the interaction of the DDP with the policy.

Let π∗eγ be the optimal policy for DDP e and time discount γ. We get

Ee∼ζ[(1−γ^T)∞∑n=0γ^(nT)⋅Eh∼eπ†T,γ[Ee,st(hnT),π∗eγ[UT,γ]] − Eeπ†T−1,γ[Uγ]] ≤ √((16D⋅H(ζ)/ln 2)⋅T(1−γ)) + (1−γ)/(1−γ^T)

Here, the second term on the right hand side comes from the rewards at time moments divisible by T, which were set to 0 in Theorem 1. Denote

R(T,γ):=Ee∼ζ[maxπ:X∗×X→A Eeπ[Uγ] − Eeπ†T−1,γ[Uγ]]

Applying Proposition A.5, we get

R(T,γ) ≤ √((16D⋅H(ζ)/ln 2)⋅T(1−γ)) + (1−γ)/(1−γ^T) + 2τ(γ)γ^T(1−γ)/(1−γ^T)

Assume that γ^(T−1)≥1/2. Then

(1−γ)/(1−γ^T) = 1/(T−1∑n=0γn) ≤ 2/T

R(T,γ) ≤ √((16D⋅H(ζ)/ln 2)⋅T(1−γ)) + (4τ(γ)+2)/T

Now set

T:=⌈((τ(γ)+1)^2/(D⋅H(ζ)⋅(1−γ)))^(1/3)⌉

It is easy to see that the assumption γ^(T−1)≥1/2 is now justified for 1−γ≪1. We get

R(T,γ)=O((D⋅H(ζ)⋅(τ(γ)+1)⋅(1−γ))^(1/3)) ■


### Tidying One’s Room

August 16, 2018 - 16:50

Previously (Compass Rose): Culture, Interpretive Labor, and Tidying One’s Room

Epistemic Status: A bit messy

“She’s tidied up and I can’t find anything! All my tubes and wires, careful notes!” – Thomas Dolby, She Blinded Me With Science

From Compass Rose:

Why would tidying my room involve interpretive labor?

It turns out, every item in my room is a sort of crystallized intention, generally from past-me. (We’ve all heard the stories of researchers with messy rooms who somehow knew where everything was, and lost track of everything when someone else committed the violent act of reorganizing the room, thus deindexing it from its owner’s mind.) As I decided what to do with an item, I wanted to make sure I didn’t lose that information. So, I tried to Aumann with my past self – the true way, the way that filters back into deep models, so that I could pass my past self’s ideological Turing test. And that’s cognitively expensive.

It’s generally too aggressive to tidy someone’s room without their permission, unless they’re in physical danger because of it. But to be unwilling to tidy my own room without getting very clear explicit permission from my past self for every action – or at least checking in – is pathologically nonaggressive.

From my wife, upon seeing the draft up to this point:

You know, in the time it takes you to write this, you could actually tidy your room.

Proof that the subject of cleaning, and cleaning that which does not belong to you, can escalate quickly in aggressiveness!

There are a few dynamics I’d like to talk about here. I won’t (today) be relating them back to Ben’s larger questions of how generally to deal with the intentions of the environment, instead choosing a more narrow scope.

I

Intention

Your past self left you an ideological Turing test, of a sort, by leaving items in seemingly random locations.

Good news! I have the cheat sheet.

Close, but wrong: “I’ll remember that it’s there and I’m too lazy to optimize its location further.”

Usually correct: “I’m done with this, I should put it somewhere. This is somewhere.”

Don’t give your past self too much credit.

Most things are (hopefully) where they are because you put them there on purpose. That’s where they ‘live’. If they’re not in a permanent location, they’re probably in an arbitrary location.

One should think about intention behind the current location of a thing if and only if the location was clearly chosen intentionally.

If the location doesn’t seem random, this is probably why: “I predicted I’d look here for this item in the future. This is where I seemed to have indexed it.”

Ben worried he needed to pass an ITT against his past self before he could alter his past self’s wishes.

I think that’s backwards. Past you’s work is done. The key ITT is against your future self!

II

Indexing

Whether or not a location was chosen carefully, it has the great advantage of memory. If you put something somewhere, there’s a good chance that’s where you will look for it. If you put it there regularly, that chance is better still.

This is why ‘tidying’ someone’s room for them is an act of aggression.

If I’m the one who put a thing somewhere, I could figure out where it is by remembering where I put it, or asking where I had it last (which my family called ‘The Papa Josh method’ as if it wasn’t universal, but specific names are still useful, and Papa Josh was apparently kind of an annoying jerk about it). I could also pass the ideological Turing test of my past self and figure out where I would have chosen to put it.  Since, philosophical objections aside, I am me, my chances are often very good.

If I have a strong indexing of an item to a location, I’ll instinctively put it back in the same location, confident I can find it in the future. My ability to automatically look in the right place, and find it now, is good evidence of that. If it was hard to find, I should probably move it. Over time, this improves indexing.

If someone else puts the object somewhere, I now have to figure out where someone else would place the object. Over time, if they keep doing this, I’ll figure out where they put it, but when a new person starts cleaning a location, chaos reigns. What is logical to them is not what is logical to you.

An especially nasty trap is when you’re not sure you know where an object is, so you check, find it right where you looked, and then put it back in a different location. Oh good, you think, I have it, I’ll now put it over here. Classic mistake. If an object is in the first place you look, and you need to find it soon, put it back exactly where you found it! If an object isn’t in the first place you look, put it in the first place you looked! You’ll look there again.

Otherwise, what you are doing is systematically taking things you can find, and moving them to locations where you might not find them. Whereas if you fail to find them, you won’t move them, and they’ll stay not found. This is why you can’t find the remote – it keeps moving randomly until it finds a place where you can’t find it, then stays there until you figure that one out. Repeat.

It took way too many times when the only thing I needed was reliably in the wrong pocket for me to figure out how this works.

III

Illusion

As a child in the days before the internet, I would keep stacks of sports and gaming magazines in my room. In order to quickly locate the one I wanted, I’d spread them out so part of the cover was visible on each copy, allowing a quick visual scan.

Then someone would, against my will, come in and ‘clean’ the room, stacking them all into one pile with no way to tell which one was which.

So the moment I came back, I’d undo the pile and spread them back out again, since the pile was almost optimizing for lack of legibility.

I’d complain about this all the time, and make my wishes clear, and the stack would reassemble twice a week anyway.

Space, especially visual space, is a resource. Using it draws things to your attention. That’s good if you want to find them! It also threatens to distract. It gives the appearance of clutter, and threatens to clutter the mind.

IV

Indebted

It is tempting to ‘tidy’ one’s room, to give appearance of tidiness, or to clear necessary space, by accumulating debt. You shove things aside or into closets, rather than putting them in a place that is helpful. Even sorting things into seemingly organized piles is still debt, if you don’t know the indexing and won’t be able to find them. At some point you’ll be paying search costs.

If you are not careful, this debt will accumulate, and interest on it will add up. It is hard to get motivated to pay down such debts, even when returns are good.

It is also tempting to ‘tidy’ that which does not need to be fixed, or to let this task distract you as a way to procrastinate other tasks.

My solution is simple. Any time you look for something, you give yourself a reasonable amount of time to find it. If after that time you cannot find it, but you are confident it is there to be found, you stop looking for the item and instead clean the room (at least) until you find the item. This inevitably finds the item and creates equilibrium – the more you need to clean, the more likely you are to do so. If you can always find everything, then everything is fine.


### Request for input on multiverse-wide superrationality (MSR)

August 14, 2018 - 20:29

I am currently working on a research project as part of CEA’s summer research fellowship. I am building a simple model of so-called “multiverse-wide cooperation via superrationality” (MSR). The model should incorporate the most relevant uncertainties for determining possible gains from trade. To be able to make this model maximally useful, I would like to ask others for their opinions on the idea of MSR. For instance, what are the main reasons you think MSR might be irrelevant or might not work as it is supposed to work? Which questions are unanswered and need to be addressed before being able to assess the merit of the idea? I would be happy about any input in the comments to this post or via mail to johannes@foundational-research.org.


### Building Safer AGI by introducing Artificial Stupidity

August 14, 2018 - 18:54

Authors: Michaël Trazzi, Roman V. Yampolskiy

Abstract: Artificial Intelligence (AI) achieved super-human performance in a broad variety of domains. We say that an AI is made Artificially Stupid on a task when some limitations are deliberately introduced to match a human's ability to do the task. An Artificial General Intelligence (AGI) can be made safer by limiting its computing power and memory, or by introducing Artificial Stupidity on certain tasks. We survey human intellectual limits and give recommendations for which limits to implement in order to build a safe AGI.


### Logical Counterfactuals & the Co-operation Game

August 14, 2018 - 17:00

Logical counterfactuals (as in Functional Decision Theory) are more about your state of knowledge than the actual physical state of the universe. I will illustrate this with a relatively simple example.

Suppose there are two players in a game where each can choose A or B with the payoffs as follows for the given combinations:

AA: 10, 10

BB: 0, 0

AB or BA: -10, -10

Situation 1:

Suppose you are told that you will make the same decision as the other player. You can quickly conclude that A provides the highest utility.

Situation 2:

Suppose you are told that the other player chooses A. You then reason that A provides the highest utility.

Generalised Situation: Player 1 is an agent that will choose A, although this is not known by Player 2 unless option b) in the next sentence is true. Player 2 is either told a) that they will inevitably make the same decision as Player 1 or b) that Player 1 definitely will choose A. If Player 2 is a rational timeless agent, then both agents will inevitably choose A.
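The arithmetic behind both situations fits in a few lines; a sketch using the payoff table above:

```python
# Payoff table from above, from your perspective.
payoff = {("A", "A"): 10, ("B", "B"): 0, ("A", "B"): -10, ("B", "A"): -10}

def expected_utility(my_choice, other_dist):
    """Expected payoff given a distribution over the other player's choice."""
    return sum(p * payoff[(my_choice, other)] for other, p in other_dist.items())

# Situation 1: you provably make the same decision as the other player,
# so only the diagonal entries are attainable.
situation1 = {c: payoff[(c, c)] for c in "AB"}

# Situation 2: the other player is known to choose A.
situation2 = {c: expected_utility(c, {"A": 1.0}) for c in "AB"}
```

In both situations A maximizes utility, even though the counterfactual structure of the two situations differs.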

Analysis:

Consider the Generalised Situation, where you are Player 2. Then both a) and b) are true statements. In both cases, the physical situation was identical, apart from the information you (Player 2) were told. Even the information Player 1 is told is identical. But in one situation Player 1's decision should counterfactually vary with yours, while in the other, Player 1 is treated as a fixed element of the universe.

On the other hand, if you were told that the other player would choose A and that they would make the same choice as you, then the only choice compatible with that would be to choose A. We could easily end up in all kinds of tangles trying to figure out the logical counterfactuals for this situation. However, the decision problem is really just trivial in this case and the only (non-strict) counterfactual is what actually happened. There is simply no need to attempt to figure out logical counterfactuals given perfect knowledge of a situation.

It is a mistake to focus too much on the world itself, because given precisely what happened, all (strict) counterfactuals are impossible. The only thing that is possible is what actually happened. This is why we need to focus on your state of knowledge instead.

(I won't be surprised if these ideas are already well-known. If they are, please link me to a relevant article or post.)


August 14, 2018 - 05:10

Highlights

OpenAI Five Benchmark: Results (OpenAI's Dota Team): The OpenAI Five benchmark happened last Sunday, where OpenAI Five won two matches against the human team, and lost the last one when their draft was adversarially selected. They are now planning to play at The International in a couple of weeks (dates to be finalized). That will be a harder challenge, since they will be playing against teams that play and train professionally, and so will be better at communication and coordination than the human team here.

Blitz (one of the human players) said: "The only noticeable difference in the mechanical skill aspect was the hex from the Lion, but even that was sorta irrelevant to the overall game flow. Got outdrafted and outmaneuvered pretty heavily, and from a strategy perspective it was just better then us. Even with the limitations in place it still 'felt' like a dota game, against a very good team. It made all the right plays I'd expect most top tier teams to make."

On the technical side, OpenAI implemented a brute-force draft system. With a pool of 18 heroes, you get some combinatorial explosion, but there are still only ~11 million possible matchups. You can then do a simple tree search over which hero to draft, where at the leaves (when you have a full draft) you choose which leaf you want based on the win probability (which OpenAI Five already outputs). Seeing this in action, it seems to me like it's a vanilla minimax algorithm, probably with alpha-beta pruning so that they don't have to evaluate all ~159 billion nodes in the tree. (Or they could have done the full search once, hardcoded the action it comes up with for the first decision, and run the full search for every subsequent action, which would have under 10 billion nodes in the tree.)
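A toy version of such a draft search, with a hypothetical win_prob function standing in for the win probability the model outputs (real drafts have larger pools and different pick orders, so treat this purely as a sketch of the tree search):

```python
def best_pick(my_team, their_team, pool, picks_left, win_prob, my_turn=True):
    """Exhaustive minimax over alternating draft picks. Returns the value
    of the draft position and the best hero to pick (None at a leaf)."""
    if picks_left == 0:
        return win_prob(my_team, their_team), None
    best = None
    for hero in pool:
        if my_turn:
            # My pick: maximize the resulting value.
            value, _ = best_pick(my_team | {hero}, their_team, pool - {hero},
                                 picks_left - 1, win_prob, my_turn=False)
            if best is None or value > best[0]:
                best = (value, hero)
        else:
            # Opponent's pick: minimize the resulting value.
            value, _ = best_pick(my_team, their_team | {hero}, pool - {hero},
                                 picks_left - 1, win_prob, my_turn=True)
            if best is None or value < best[0]:
                best = (value, hero)
    return best
```

With an 18-hero pool this is where alpha-beta pruning (or caching repeated positions) would earn its keep; the structure of the search itself is unchanged.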

Besides the win probabilities, there are other ways to get insight into what the model is "thinking" -- for example, by asking the model to predict where the hero will be in 6 seconds, or by predicting how many last hits / denies / kills / deaths it will have.

The model that played the benchmark has been training since June 9th. Of course, in that time they've changed many things about the system (if for no other reason than to remove many of the restrictions in the original post). This is not a thing that you can easily do -- typically you would change your model architecture, which means your old parameters don't map over to the new architecture. I've been pretty curious about how they handle this, but unfortunately the blog post doesn't go into much detail, beyond saying that they can in fact handle these kinds of "surgery" issues.

They estimate that this particular model has used 190 petaflop/s-days of compute, putting it just below AlphaZero.

My opinion: I think this finally fell within my expectations, after two instances where I underestimated OpenAI Five. I expected that they would let the human team choose heroes in some limited way (~80%), that OpenAI Five would not be able to draft using just gradients via PPO (~60%), and (after having seen the first two games) that the human team would win after an adversarial draft (~70%). Of course, a draft did happen, but it was done by a tree search algorithm, not an algorithm learned using PPO.

The games themselves were pretty interesting (though I have not played Dota so take this with a grain of salt). It seemed to me like OpenAI Five had learned a particularly good strategy that plays to the advantages of computers, but hadn't learned some of the strategies and ideas that human players use to think about Dota. Since it uses the same amount of computation for each decision, it makes good decisions on all timescales, including ones where something surprising has occurred where humans would need some time to react, and also to coordinate. For example, as soon as a human hero entered within range of the bots (just to look and retreat), all of the bots would immediately unleash a barrage of attacks, killing the hero -- a move that humans could not execute, because of slower reaction times and worse communication and teamwork. Similarly, one common tactic in human gameplay is to teleport into a group of heroes and unleash an area-of-effect ability, but when they tried this against OpenAI Five, one of the bots hexed the hero as soon as he teleported in, rendering him unable to cast the spell. (That felt like the decisive moment in the first game.) On the other hand, there were some clear issues with the bots. At one point, two OpenAI bots were chasing Blitz, and Blitz used an ability that made him invisible while standing still. Any human player would have spammed area attacks, but the bots simply became confused and eventually left. Similarly, I believe (if I understood the commentary correctly) that a bot once used an ability multiple times, wasting mana, even though all uses after the first had no additional effect.

Other articles would have you believe that the games weren't even close, and if you look at the kill counts, that would seem accurate. I don't think that's actually right -- from what I understand, kills aren't as important as experience and gold, and you could see this in the human gameplay. OpenAI Five would often group most of its heroes together to push forward, which means they get less experience and gold. The human team continued to keep their heroes spread out over the map to collect resources -- and even though OpenAI Five got way more kills, the overall net worth of the two teams' heroes remained about equal for most of the early game. The big difference seemed to be that when the inevitable big confrontation between the two teams happened, OpenAI Five always came out on top. I'm not sure how, my Dota knowledge isn't good enough for that. Based on Blitz's comment, my guess is that OpenAI Five is particularly good at fights between heroes, and the draft reflects that. But I'd still guess that if you had pro human players who ceded control to OpenAI Five whenever a fight was about to happen, they would beat OpenAI Five (~70%). I used to put 80% on that prediction, but Blitz's comment updated me away from that.

One interesting thing was that the win probability seemed to be very strongly influenced by the draft, which in hindsight seems obvious. Dota is a really complicated game that is constantly tweaked to keep it balanced for humans, and even then the draft is very important. When you now introduce a new player (OpenAI Five) with very different capabilities (such as very good decision making under time pressure) and change the game conditions (such as a different pool of heroes), you should expect the game to become very imbalanced, with some teams far outshining others. And in fact we did see that Lion (the hero with the hexing ability) was remarkably useful (against humans, at least).

Certified Defenses against Adversarial Examples (Aditi Raghunathan et al) and A Dual Approach to Scalable Verification of Deep Networks (Krishnamurthy (Dj) Dvijotham et al): Even when defenses are developed to make neural nets robust against adversarial examples, they are usually broken soon after by stronger attacks. Perhaps we could prove once and for all that the neural net is robust to adversarial examples?

The abstract from the Raghunathan paper summarizes their approach well: "[W]e study this problem for neural networks with one hidden layer. We first propose a method based on a semidefinite relaxation that outputs a certificate that for a given network and test input, no attack can force the error to exceed a certain value. Second, as this certificate is differentiable, we jointly optimize it with the network parameters, providing an adaptive regularizer that encourages robustness against all attacks. On MNIST, our approach produces a network and a certificate that no attack that perturbs each pixel by at most \epsilon = 0.1 can cause more than 35% test error."

To compute the certificate, they consider the optimal attack A. Given a particular input x, the optimal attack A is the one that changes f(A(x)) to a different class, where f is the ML model, and A(x) is restricted to not change x too much. They leverage the structure of f (linear models and neural nets with one hidden layer) and the restrictions on A to compute a bound on f(A(x)) in terms of x. So, for each data point in the training set, the bound either says “guaranteed that it can’t be adversarially attacked” or “might be possible to adversarially attack it”. Averaging this over the training set or test set gives you an estimate of an upper bound on the optimal adversarial attack success rate.

The Dvijotham paper can work on general feedforward and recurrent neural nets, though they show the math specifically for nets with layers with componentwise activations. They start by defining an optimization problem, where the property to be verified is encoded as the optimization objective, and the mechanics of the neural net are encoded as equality constraints. If the optimal value is negative, then the property has been verified. The key idea to solving this problem is to break down the hard problem of understanding a sequence of linear layers followed by nonlinearities into multiple independent problems each involving a single layer and a nonlinearity. They do this by computing bounds on the values coming out of each layer (both before and after activations), and allowing the constraints to be satisfied with some slack, with the slack variables going into the objective with Lagrange multipliers. This dual problem satisfies weak duality -- the solution to the dual problem for any setting of the Lagrange multipliers constitutes an upper bound on the solution to the original problem. If that upper bound is negative, then we have verified the property. They show how to solve the dual problem -- this is easy now that the slack variables allow us to decouple the layers from each other. They can then compute a tighter upper bound by optimizing over the Lagrange multipliers (which is a convex optimization problem, and can be done using standard techniques). In experiments, they show that the computed bounds on MNIST are reasonably good for very small perturbations, even on networks with 2-3 layers.
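Neither paper's machinery (semidefinite relaxations, Lagrangian duals) fits in a few lines, but the simplest member of the same family — interval arithmetic pushed layer by layer through a ReLU network — conveys the basic move of bounding the values coming out of each layer. A sketch under that simplification (all names hypothetical):

```python
def interval_bounds(weights, biases, lower, upper):
    """Propagate elementwise input bounds through a ReLU network using
    interval arithmetic. Cruder than the certificates discussed above,
    but sound: the true outputs always lie inside the returned bounds.

    weights: list of layer matrices (lists of rows); biases: list of
    layer bias vectors; lower/upper: elementwise bounds on the input."""
    for layer, (W, b) in enumerate(zip(weights, biases)):
        new_lower, new_upper = [], []
        for row, bias in zip(W, b):
            # A positive weight attains its extreme at the matching bound,
            # a negative weight at the opposite one.
            lo = bias + sum(w * (lower[j] if w >= 0 else upper[j])
                            for j, w in enumerate(row))
            hi = bias + sum(w * (upper[j] if w >= 0 else lower[j])
                            for j, w in enumerate(row))
            new_lower.append(lo)
            new_upper.append(hi)
        # ReLU on every layer except the last (the logits).
        if layer < len(weights) - 1:
            new_lower = [max(0.0, x) for x in new_lower]
            new_upper = [max(0.0, x) for x in new_upper]
        lower, upper = new_lower, new_upper
    return lower, upper
```

If the lower bound of the true class's logit exceeds the upper bounds of all other logits over the perturbation box, no attack within the box can change the classification — the same kind of statement the papers' much tighter certificates make.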

My opinion: Lots of AI alignment researchers talk about provable guarantees from our AI system, that are quite broad and comprehensive, even if not a proof of "the AI is aligned and will not cause catastrophe". Both of these papers seem like an advance in our ability to prove things about neural nets, and so could help with that goal. My probably-controversial opinion is that in the long term the harder problem is actually figuring out what you want to prove, and writing down a formal specification of it in a form that is amenable to formal verification that will generalize to the real world, if you want to go down that path. To be clear, I'm excited about this research, both because it can be used both to solve problems that affect current AI systems (eg. to verify that a neural net on a plane will never crash under a mostly-realistic model of the world) and because it can be used as a tool for developing very capable, safer AI systems in the future -- I just don't expect it to be the main ingredient that gives us confidence that our AI systems are aligned with us.

On the methods themselves, it looks like the Raghunathan paper can achieve much tighter bounds if you use their training procedure, which can optimize the neural net weights in tandem with the certificate of robustness -- they compute a bound of 35% on MNIST with perturbations of up to size 26 (where the maximum is 256). However, there are many restrictions on the applicability of the method. The Dvijotham paper lifts many of these restrictions (multilayer neural nets instead of just one hidden layer, any training procedure allowed) but gets much looser bounds as a result -- the bounds are quite tight at perturbations of size 1 or 2, but by perturbations of size 10 the bounds are trivial (i.e. a bound of 100%). The training procedure that Raghunathan et al use is crucial -- without it, their algorithm finds non-trivial bounds on only a single small neural net, for perturbations of size at most 1.

Technical AI alignment

Problems

When Bots Teach Themselves How To Cheat (Tom Simonite): A media article about specification gaming in AI that I actually just agree with, and it doesn't even have a Terminator picture!

Agent foundations

Probabilistic Tiling (Preliminary Attempt) (Diffractor)

Handling groups of agents

Learning to Share and Hide Intentions using Information Regularization (DJ Strouse et al)

Interpretability

Techniques for Interpretable Machine Learning (Mengnan Du et al): This paper summarizes work on interpretability, providing a classification of different ways of achieving interpretability. There are two main axes -- first, whether you are trying to gain insight into the entire model, or its classification of a particular example; and second, whether you try to create a new model that is inherently interpretable, or whether you are post-hoc explaining the decision made by an uninterpretable model. The whole paper is a summary of techniques, so I'm not going to summarize it even further.

My opinion: This seems like a useful taxonomy that hits the kinds of interpretability research I know about, though the citation list is relatively low for a summary paper, and there are a few papers I expected to see that weren't present. On the other hand, I'm not actively a part of this field, so take it with a grain of salt.

Verification

Certified Defenses against Adversarial Examples (Aditi Raghunathan et al): Summarized in the highlights!

A Dual Approach to Scalable Verification of Deep Networks (Krishnamurthy (Dj) Dvijotham et al): Summarized in the highlights!

Adversarial Vision Challenge (Wieland Brendel et al): There will be a competition on adversarial examples for vision at NIPS 2018.

Motivating the Rules of the Game for Adversarial Example Research (Justin Gilmer, George E. Dahl et al) (H/T Daniel Filan)

Privacy and security

Security and Privacy Issues in Deep Learning (Ho Bae, Jaehee Jang et al)

AI capabilities

Reinforcement learning

OpenAI Five Benchmark: Results (OpenAI's Dota Team): Summarized in the highlights!

Learning Actionable Representations from Visual Observations (Debidatta Dwibedi et al): Prior work on Time Contrastive Networks (TCN)s showed that you can use time as an unsupervised learning signal, in order to learn good embeddings of states that you can then use in other tasks. This paper extends TCNs to work with multiple frames, so that it can understand motion as well. Consider any two short videos of a task demonstration. If they were taken at different times, then they should be mapped to different embedding vectors (since they correspond to different "parts" of the task). On the other hand, if they were taken at the same time (even if from different viewpoints), they should be mapped to the same embedding vector. The loss function based on this encourages the network to learn an embedding for these short videos that is invariant to changes in perspective (which are very large changes in pixel-space), but is different for changes in time (which may be very small changes in pixel-space). They evaluate with a bunch of different experiments.

My opinion: Unsupervised learning seems like the way forward to learn rich models of the world, because of the sheer volume of data that you can use.

ICML 2018 Notes (David Abel)

Deep learning

When Recurrent Models Don't Need to be Recurrent (John Miller): Recurrent neural networks (RNNs) are able to use and update a hidden state over an entire sequence, which means that in theory they can learn very long-term dependencies that a feedforward model cannot capture. For example, it would be easy to assign weights to an RNN so that on input x_n it outputs n (the length of the sequence so far), whereas a feedforward model with a fixed input window could not learn this function. Despite this, in practice feedforward methods match and exceed the performance of RNNs on sequence modeling tasks. This post argues that this is because of gradient descent -- any stable gradient descent on RNNs can be well approximated by gradient descent on a feedforward model (both at training and inference time).
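
The sequence-length example can be made concrete. Below is a sketch with hand-picked weights (my own illustration, not from the post): an RNN cell whose hidden state simply counts the inputs seen so far.

```python
# A minimal RNN cell with hand-chosen weights (W_h = 1, W_x = 0, b = 1,
# identity activation), so the hidden state satisfies h_t = h_{t-1} + 1 = t.
def rnn_counter(sequence):
    h = 0.0  # hidden state, initialized to zero
    for x in sequence:
        h = 1.0 * h + 0.0 * x + 1.0  # h_t = W_h*h + W_x*x + b
    return h

# A feedforward net with a fixed input window cannot compute this for
# arbitrarily long sequences, since the count n is unbounded.
```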

My opinion: The post doesn't really explain why this is the case, instead referencing the theory in their paper (which I haven't read). It does sound like a cool result explaining a phenomenon that I do find confusing, since RNNs should be more expressive than feedforward models. It does suggest that gradient descent is not actually good at finding the optimum of a function, if that optimum involves lots of long-term dependencies.

Objects that Sound (Relja Arandjelović, Andrew Zisserman et al): The key idea behind this blog post is that there is a rich source of information in videos -- the alignment between the video frames and audio frames. We can leverage this by creating a proxy task that will force the neural net to learn good representations of the video, which we can then use for other tasks. In particular, we can consider the proxy task of deciding whether a short (~1 second) video clip and audio clip are aligned or not. We don't care about this particular task, but by designing our neural net in the right way, we can ensure that the net will learn good representations of video and audio. We pass the video clip through a convolutional net, the audio clip through another convolutional net, and take the resulting vectors and use the distance between them as a measure of how dissimilar they are. There is no way for video to affect the audio or vice versa before the distance -- so the net is forced to learn to map each of them to a shared space where the distance is meaningful. Intuitively, we would expect that this shared space would have to encode the cause of both the audio and video. Once we have these embeddings (and the neural nets that generate them), we can use them for other purposes. For example, their audio encoder sets the new state-of-the-art on two audio classification benchmarks. In addition, by modifying the video encoder to output embeddings for different regions in the image, we can compute the distance between the audio embedding and the video embedding at each region, and the regions where this is highest correspond to the object that is making the sound.
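
A minimal sketch of the setup described above, with toy linear encoders standing in for the conv nets (the encoder weights and function names here are illustrative assumptions, not from the paper): each modality is encoded separately, and only the embedding distance connects them, which is what forces the shared space.

```python
import math

# Toy stand-in for the two conv-net encoders: weighted sums squashed by tanh.
def encode(clip, weights):
    return [math.tanh(sum(w * x for w, x in zip(row, clip))) for row in weights]

# Neither encoder ever sees the other modality's input; only the distance
# between the two embeddings feeds the aligned/unaligned decision.
def embedding_distance(video_clip, audio_clip, w_video, w_audio):
    v = encode(video_clip, w_video)
    a = encode(audio_clip, w_audio)
    return math.sqrt(sum((vi - ai) ** 2 for vi, ai in zip(v, a)))
    # small distance => predict "aligned"; the encoders are trained so that
    # aligned pairs map close together and misaligned pairs map far apart
```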

My opinion: Another great example of using unsupervised learning to learn good embeddings. Also, a note -- you might wonder why I'm calling this unsupervised learning even though there's a task, with a yes/no answer, a loss function, and an iid dataset, which are hallmarks of supervised learning. The difference is that the labels for the data did not require any human annotation, and we don't care about the actual task that we're learning -- we're after the underlying embeddings that it uses to solve the task. In the previous paper on learning actionable representations, time was used to define an unsupervised learning signal in a similar way.

MnasNet: Towards Automating the Design of Mobile Machine Learning Models (Mingxing Tan): Mobile phones have strong resource constraints (memory, power usage, available compute), which makes it hard to put neural nets on them. Previously, for image classification, researchers hand designed MobileNetV2 to be fast while still achieving good accuracy. Now, using neural architecture search, researchers have found a new architecture, MnasNet, which is 1.5x faster with the same accuracy. Using the squeeze-and-excitation optimization improves it even further.

My opinion: Neural architecture search is diversifying, focusing on computation time in addition to accuracy now. It seems possible that we'll run into the same problems with architecture search soon, where the reward functions are complex enough that we don't get them right on the first try. What would it look like to learn from human preferences here? Perhaps we could present two models from the search to humans, along with statistics about each, and see which ones the researchers prefer? Perhaps we could run tests on the model, and then have humans provide feedback on the result? Maybe we could use feature visualization to provide feedback on whether the network is learning the "right" concepts?

Neural Arithmetic Logic Units (Andrew Trask et al)

Generalization Error in Deep Learning (Daniel Jakubovitz et al)

Applications

The Machine Learning Behind Android Smart Linkify (Lukas Zilka): Android now has Smart Linkify technology, which allows it to automatically find pieces of text that should link to another app (for example, addresses should link to Maps, dates and times to Calendar, etc). There are a lot of interesting details on what had to be done to get this to actually work in the real world. The system has two separate nets -- one which generates candidate entities, and another which says what kind of entity each one is. In between these two nets, we have a regular program that takes the set of proposed entities, and prunes it so that no two entities overlap, and then sends it off to the entity classification net. There are a few tricks to get the memory requirements down, and many dataset augmentation tricks to get the nets to learn particular rules that it would not otherwise have learned.
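
The pruning step between the two nets can be sketched as a greedy overlap filter (the function name and scoring scheme are my assumptions; the post does not give the exact algorithm):

```python
# Keep the highest-scoring candidate spans, dropping any span that overlaps
# one already kept, so no two entities sent to the classifier overlap.
def prune_overlaps(candidates):
    """candidates: list of (start, end, score) half-open character spans."""
    kept = []
    for start, end, score in sorted(candidates, key=lambda c: -c[2]):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)  # restore left-to-right order for the classifier
```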

My opinion: I take this as an example of what advanced AI systems will look like -- a system of different modules, each with its own job, passing around information appropriately in order to perform some broad task. Some of the modules could be neural nets (which can learn hard-to-program functions), while others could be classic programs (which generalize much better and are more efficient). OpenAI Five also has elements of this -- the drafting system is a classic program operating on the win probabilities from the neural net. It's also interesting how many tricks are required to get Smart Linkify to work -- I don't know whether to think that this means generally intelligent AI is further away, or that the generally intelligent AI that we build will rely on these sorts of tricks.

News

Human-Aligned AI Summer School: A Summary (Michaël Trazzi): A summary of the talks at the summer school that just happened, from one of the attendees, that covers value learning, agent foundations, bounded rationality, and side effects. Most of the cited papers have been covered in this newsletter, with the notable exceptions of Bayesian IRL and information-theoretic bounded rationality.

Discuss

### Preliminary thoughts on moral weight

August 14, 2018 - 02:45

This post adapts some internal notes I wrote for the Open Philanthropy Project, but they are merely at a "brainstorming" stage, and do not express my "endorsed" views nor the views of the Open Philanthropy Project. This post is also written quickly and not polished or well-explained.

My 2017 Report on Consciousness and Moral Patienthood tried to address the question of "Which creatures are moral patients?" but it did little to address the question of "moral weight," i.e. how to weigh the interests of different kinds of moral patients against each other:

For example: suppose we conclude that fishes, pigs, and humans are all moral patients, and we estimate that, for a fixed amount of money, we can (in expectation) dramatically improve the welfare of (a) 10,000 rainbow trout, (b) 1,000 pigs, or (c) 100 adult humans. In that situation, how should we compare the different options? This depends (among other things) on how much “moral weight” we give to the well-being of different kinds of moral patients.

Thus far, philosophers have said very little about moral weight (see below). In this post I lay out one approach to thinking about the question, in the hope that others might build on it or show it to be misguided.

Proposed setup

For the simplicity of a first-pass analysis of moral weight, let's assume a variation on classical utilitarianism according to which the only thing that morally matters is the moment-by-moment character of a being's conscious experience. So e.g. it doesn't matter whether a being's rights are respected/violated or its preferences are realized/thwarted, except insofar as those factors affect the moment-by-moment character of the being's conscious experience, by causing pain/pleasure, happiness/sadness, etc.

Next, and again for simplicity's sake, let's talk only about the "typical" conscious experience of "typical" members of different species when undergoing various "canonical" positive and negative experiences, e.g. consuming species-appropriate food or having a nociceptor-dense section of skin damaged.

Given those assumptions, when we talk about the relative "moral weight" of different species, we mean to ask something like "How morally important is 10 seconds of a typical human's experience of [some injury], compared to 10 seconds of a typical rainbow trout's experience of [that same injury]?"

For this exercise, I'll separate "moral weight" from "probability of moral patienthood." Naively, you could then multiply your best estimate of a species' moral weight (using humans as the baseline of 1) by P(moral patienthood) to get the species' "expected moral weight" (or whatever you want to call it). Then, to estimate an intervention's potential benefit for a given species, you could multiply [expected moral weight of species] × [individuals of species affected] × [average # of minutes of conscious experience affected across those individuals] × [average magnitude of positive impact on those minutes of conscious experience].

However, I say "naively" because this doesn't actually work, due to two-envelope effects.
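
For concreteness, the naive product can be spelled out with made-up numbers (illustrative only; as the post notes, two-envelope effects mean this naive calculation shouldn't be taken literally):

```python
# The naive "expected moral weight" product from the post, with entirely
# made-up illustrative numbers. Two-envelope effects make this unreliable.
def naive_expected_benefit(moral_weight, p_patienthood, n_individuals,
                           minutes_affected, magnitude):
    expected_weight = moral_weight * p_patienthood
    return expected_weight * n_individuals * minutes_affected * magnitude

# Hypothetical trout intervention: weight 0.01, P(patienthood) 0.6,
# 10,000 trout, 60 minutes each, improvement magnitude 0.5.
benefit = naive_expected_benefit(0.01, 0.6, 10_000, 60, 0.5)
```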

Potential dimensions of moral weight

What features of a creature's conscious experience might be relevant to the moral weight of its experiences? Below, I describe some possibilities that I previously mentioned in Appendix Z7 of my moral patienthood report.

Note that any of the features below could be (and in some cases, very likely are) hugely multidimensional. For simplicity, I'm going to assume a unidimensional characterization of them, e.g. what we'd get if we looked only at the principal component in a principal component analysis of a hugely multidimensional phenomenon.

Clock speed of consciousness

Perhaps animals vary in their "clock speed." E.g. a hummingbird reacts to some things much faster than I ever could. If any of that is under conscious control, its "clock speed" of conscious experience seems like it should be faster than mine, meaning that, intuitively, it should have a greater number of subjective "moments of consciousness" per objective minute than I do.

In general, smaller animals probably have faster clock speeds than larger ones, for mechanical reasons:

The natural oscillation periods of most consciously controllable human body parts are greater than a tenth of a second. Because of this, the human brain has been designed with a matching reaction time of roughly a tenth of a second. As it costs more to have faster reaction times, there is little point in paying to react much faster than body parts can change position. … The first resonant period of a bending cantilever, that is, a stick fixed at one end, is proportional to its length, at least if the stick’s thickness scales with its length. For example, sticks twice as long take twice as much time to complete each oscillation. Body size and reaction time are predictably related for animals today… (Hanson 2016, ch. 6)

My impression is that it's a common intuition to value experience by its "subjective" duration rather than its "objective" duration, with no discount. So if a hummingbird's clock speed is 3x as fast as mine, then all else equal, an objective minute of its conscious pleasure would be worth 3x an objective minute of my conscious pleasure.
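
A minimal sketch of this no-discount intuition (my own formalization, not from the post): value scales linearly with the subjective-to-objective clock-speed ratio.

```python
# Value experience by subjective rather than objective duration, with no
# discount: subjective value = objective duration x clock-speed ratio.
def subjective_value(objective_minutes, clock_speed_ratio,
                     value_per_subjective_minute=1.0):
    return objective_minutes * clock_speed_ratio * value_per_subjective_minute

# A hummingbird at 3x human clock speed: one objective minute of its
# pleasure counts the same as three objective minutes of mine.
```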

Unities of consciousness

Philosophers and cognitive scientists debate how "unified" consciousness is, in various ways. Our normal conscious experience seems to many people to be pretty "unified" in various ways, though sometimes it feels less unified, for example when one goes "in and out of consciousness" during a restless night's sleep, or when one engages in certain kinds of meditative practices.

Daniel Dennett suggests that animal conscious experience is radically less unified than human consciousness is, and cites this as a major reason he doesn't give most animals much moral weight.

For convenience, I'll use Bayne (2010)'s taxonomy of types of unity. He talks about subject unity, representational unity, and phenomenal unity — each of which has a "synchronic" (momentary) and "diachronic" (across time) aspect of unity.

Subject unity

Bayne explains:

My conscious states possess a certain kind of unity insofar as they are all mine; likewise, your conscious states possess that same kind of unity insofar as they are all yours. We can describe conscious states that are had by or belong to the same subject of experience as subject unified. Within subject unity we need to distinguish the unity provided by the subject of experience across time (diachronic unity) from that provided by the subject at a time (synchronic unity).

Representational unity

Bayne explains:

Let us say that conscious states are representationally unified to the degree that their contents are integrated with each other. Representational unity comes in a variety of forms. A particularly important form of representational unity concerns the integration of the contents of consciousness around perceptual objects—what we might call ‘object unity’. Perceptual features are not normally represented by isolated states of consciousness but are bound together in the form of integrated perceptual objects. This process is known as feature-binding. Feature-binding occurs not only within modalities but also between them, for we enjoy multimodal representations of perceptual objects.

I suspect many people wouldn't treat representational unity as all that relevant to moral weight. E.g. there are humans with low representational unity of a sort (e.g. visual agnosics); are their sensory experiences less morally relevant as a result?

Phenomenal unity

Bayne explains:

Subject unity and representational unity capture important aspects of the unity of consciousness, but they don’t get to the heart of the matter. Consider again what it’s like to hear a rumba playing on the stereo whilst seeing a bartender mix a mojito. These two experiences might be subject unified insofar as they are both yours. They might also be representationally unified, for one might hear the rumba as coming from behind the bartender. But over and above these unities is a deeper and more primitive unity: the fact that these two experiences possess a conjoint experiential character. There is something it is like to hear the rumba, there is something it is like to see the bartender work, and there is something it is like to hear the rumba while seeing the bartender work. Any description of one’s overall state of consciousness that omitted the fact that these experiences are had together as components, parts, or elements of a single conscious state would be incomplete. Let us call this kind of unity — sometimes dubbed ‘co-consciousness’ — phenomenal unity.

Phenomenal unity is often in the background in discussions of the ‘stream’ or ‘field’ of consciousness. The stream metaphor is perhaps most naturally associated with the flow of consciousness — its unity through time — whereas the field metaphor more accurately captures the structure of consciousness at a time. We can say that what it is for a pair of experiences to occur within a single phenomenal field just is for them to enjoy a conjoint phenomenality — for there to be something it is like for the subject in question not only to have both experiences but to have them together. By contrast, simultaneous experiences that occur within distinct phenomenal fields do not share a conjoint phenomenal character.

Unity-independent intensity of valenced aspects of consciousness

A common report of those who take psychedelics is that, while "tripping," their conscious experiences are "more intense" than they normally are. Similarly, different pains feel similar but have different intensities, e.g. when my stomach is upset, the intensity of my stomach pain waxes and wanes a fair bit, until it gradually fades to not being noticeable anymore. Same goes for conscious pleasures.

It's possible such variations in intensity are entirely accounted for by their degrees of different kinds of unity, or by some other plausible feature(s) of moral weight, but maybe not. If there is some additional "intensity" variable for valenced aspects of conscious experience, it would seem a good candidate for affecting moral weight.

From my own experience, my guess is that I would endure ~10 seconds of the most intense pain I've ever experienced to avoid experiencing ~2 months of the lowest level of discomfort that I'd bother to call "discomfort." That very low level of discomfort might suggest a lower bound on "intensity of valenced aspects of experience" that I intuitively morally care about, but "the most intense pain I've ever experienced" probably is not the highest intensity of valenced aspects of experience it is possible to experience — probably not even close. You could consider similar trades to get a sense for how much you intuitively value "intensity of experience," at least in your own case.

Moral weights of various species

If we thought about all this more carefully and collected as much relevant empirical data as possible, what moral weights might we assign to different species?

Whereas my probabilities of moral patienthood for any animal as complex as a crab only range from 0.2 - 1, the plausible ranges of moral weight seem like they could be much larger. I don't feel like I'd be surprised if an omniscient being told me that my extrapolated values would assign pigs more moral weight than humans, and I don't feel like I'd be surprised if an omniscient being told me my extrapolated values would assign pigs .0001 moral weight (assuming they were moral patients at all).

To illustrate how this might work, below are some guesses at some "plausible ranges of moral weight" for a variety of species that someone might come to, if they had intuitions like those explained below.

• Humans: 1 (baseline)
• Chimpanzees: 0.001 - 3
• Pigs: 0.0005 - 3.5
• Cows: 0.0001 - 5
• Chickens: 0.0005 - 7
• Rainbow trout: 0.00005 - 10
• Fruit fly: 0.000001 - 20

(But whenever you're tempted to multiply such numbers by something, remember two-envelope effects!)

What intuitions might lead to something like these ranges?

• An intuition to not place much value on "complex/higher-order" dimensions of moral weight — such as "fullness of self-awareness" or "capacity for reflecting on one's holistic life satisfaction" — above and beyond the subjective duration and "intensity" of relatively "brute" pleasure/pain/happiness/sadness that (in humans) tends to accompany reflection, self-awareness, etc.
• An intuition to care more about subject unity and phenomenal unity than about such higher-order dimensions of moral weight.
• An intuition to care most of all about clock speed and experience intensity (if intensity is distinct from unity).
• Intuitions that if the animal species listed above are conscious, they:
• have very little of the higher-order dimensions of conscious experience,
• have faster clock speeds than humans (the smaller the faster),
• probably have lower "intensity" of experience, but might actually have somewhat greater intensity of experience (e.g. because they aren't distracted by linguistic thought),
• have moderately less subject unity and phenomenal unity, especially of the diachronic sort.

Under these intuitions, the low end of the ranges above could be explained by the possibility that intensity of conscious experience diminishes dramatically as brain complexity and flexibility decrease, while the high end could be explained by the possibility of faster clock speeds for smaller animals, the possibility of lesser unity in non-human animals (which one might value at >1x for the same reason one might value a dually-conscious split-brain patient at ~2x), and the possibility of greater intensity of experience in simpler animals.

Other writings on moral weight

Discuss

### Rationality Retreat in Europe: Gauging Interest

August 13, 2018 - 23:33

If you're interested in attending a rationality retreat in Europe, I would appreciate it if you participated in my survey (Link). Just answering the first three questions would already be helpful.

I would like to organize a smallish (up to 20 people) rationality retreat in Europe. With the survey I want to estimate how much interest there is, and what participants would like such an event to look like. Feel free to share the survey with anyone who might be interested.

Discuss

### Cause Awareness as a Factor against Cause Neutrality

August 13, 2018 - 23:00

[Epistemic status: Sudden insight + some reflection]

[Novelty status: I Googled some stuff on cause neutrality, and didn’t see this category of concerns mentioned]

Trying to live an optimized, impactful life can be quite a burden. There is no time for fun when there is world-saving to do. You can't learn math unless it helps your career. And you must sacrifice the charitable causes you most care about for the ones that have the most impact.

What a relief when, last week, I realized how it may be possible to support personal pet causes while living an optimized life.

Cause-neutrality is one of the cornerstones of EA thought. It goes like this: Bob took a trip to a village in Indonesia and was really struck by the poverty there. He wants to dedicate his life to helping people in this village because it can do a lot of good. Boo Bob!

For you see, this is just one of many impoverished villages around the world, and probably neither the one suffering most nor the one where he can have the biggest impact cheaply. He should instead go to the List of Suffering Villages™ and pick the one where he can do the most good. And that’s assuming it’s not better for him to just become an investment banker and donate all his money.

But to make this decision, first the village must be on the list.

We live in a world where every village has a Wikipedia entry. But in other areas, there are a large number of local problems where the resources needed to raise awareness about the problem are a substantial fraction of the resources needed to solve it. In these cases, working on it immediately can be more effective than trying to get it properly prioritized and allocated to the most efficient person.

For comparison, consider personal productivity. The philosophy of Getting Things Done is all about moving tasks into a centralized system so you can do them at an optimal time. But, there’s an exception: if something takes less than two minutes, you should do it now.

In altruism, this looks like: If you see someone bleeding on a deserted street, you call 911 instead of weighing its impact against all the phone calls to Congress you should be making. But it also applies in cases where the cost/benefit ratio is less extreme.

This is most evident in obscure political causes. Suppose you’re a skilled professional whose time is worth $100/hour. One day, you realize that your local town council can produce 10 million utilons by adopting policy X. You think you can convince them to adopt this policy by spending 500 hours running a campaign. But, an experienced campaign manager whose time is worth $80/hour could do it in 250 hours. If you do it: $50,000 for 10 million utilons. If he does it: $20,000 for 10 million utilons. Profit!

But what if you need to spend 50 hours finding and interviewing candidates for the campaign manager, and another 100 hours explaining the problem to him and introducing him to important people in your town? Now you’re up to $35,000 for the campaign manager. And what if there’s a 100% chance of success if you do it yourself, versus only a 70% chance if you try to hire someone else (who, of course, may not be as good as his resume claims)? Now you’re looking at 200 utilons/$ in expectation whether you do it yourself or hire someone. If you’re at all risk-averse, then… this is how it can make sense for a professional engineer to wind up running a political campaign.
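
Checking the arithmetic in the example above (all numbers taken from the text):

```python
# Numbers from the example: you at $100/hour, campaign manager at $80/hour.
your_rate, mgr_rate = 100, 80
utilons = 10_000_000

diy_cost = 500 * your_rate                              # $50,000, success certain
delegate_cost = 250 * mgr_rate + (50 + 100) * your_rate  # $35,000 all-in

diy_eff = utilons / diy_cost                # 200 utilons per dollar
delegate_eff = 0.7 * utilons / delegate_cost  # also 200 utilons per dollar
```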

So, the problems of cause-neutrality are the same as the problems of delegating any other task. Risk, transaction costs, and management can eat up all the efficiency gains.

This idea means you should weaken the recommendations of cause-neutrality to: spend resources on your pet causes in an amount inversely proportional to how well-known the cause is. If you have a burning hatred of cancer because it killed your parents, you should probably still give your money to malaria organizations rather than cancer researchers. But if there’s an epidemic of salmonella in a small town you’re visiting and you happen to be a famous doctor, it can still be effective to spend time helping salmonella patients rather than trying to bring in nurses so you can go do famous-doctor-y things.

Discuss

### The ever expanding moral circle

August 13, 2018 - 21:41

(DISCLAIMER: not a new concept, and there are probably better descriptions out there. But I wanted to try explaining this as an exercise)

I propose to you an exercise: picture in your mind all beings you care about. After reading this prompt, you are probably conjuring images of your family, your friends, and the other wonderful beings that inhabit this planet and make life worthwhile.

The moral circle is the term we use to refer to what you care about and fight for. But the key insight is that it is not fixed: it can be expanded.

Many of us have a deep appreciation for those around us, and can empathize with the suffering we see: the homeless people next to our homes, the sick we tend and the crying children in the park.

But upon introspection, most come to the realization that there are many others we do not interact with, but suffer nonetheless, and are people worth fighting for, the same as those in our immediate surroundings.

This may start with the inclusion in our circle of all the people in our country, with whom we share so much, even if they have different beliefs or preferences.

Thankfully, we live in a world where democracies and first-world social norms are pushing for the acceptance into our circles of concern of social groups that not so long ago were regarded as inferiors or outcasts, from women to homosexuals.

But the expansion does not have to end there. We may start including people from other countries and very different backgrounds. We may start caring about the immense suffering of those who were not lucky enough to be born in a first world country, and whose human rights are under threat.

We do not have to limit ourselves to humans either. Many of us include in our circle of concern non-human animals, who are bred and slaughtered by the billions every year.

And the moral circle is not restricted to the present: we can imagine caring and fighting for the interests of those which are not yet born, but may be.

Thanks to our unstoppable imagination, we can even include in our circle of concerns beings that are not yet part of our little world, but may come in the future: artificial intelligence, alien lifeforms, self-modified humans, to name some possibilities.

Coming back to the beginning, I propose an exercise: picture in your mind all beings you care about. Think about those you left outside, and reexamine the frontiers of your moral circle. Do you notice its expansion?

Discuss

### Pirate c: Vacuum Energy and the Joules Hertz Constant

August 13, 2018 - 12:00

Just a Theory, Pirate c: (3110mhz)^3/[e/(m/v)]=cpu, s^3/ED=p, or

The processing limit to a single core circuit processor (dictated by electromigration and the speed of light) cubed, divided by the available energy density is equal to processing capacity in mhz (cpu speed in mega hertz).

(3110mhz^3/3100mhz)/c^2=1.07616626e-10

Assuming density relative to mhz:

Correction*

The processing limit to a single core circuit processor (dictated by electromigration and the speed of light) cubed, divided by the available energy density is equal to processing capacity in mhz (cpu speed in giga hertz).

1.07616626e-10/amu=6.48082551e+16amu

Into Unified atomic mass units (including electron masses):

(6.48082551e+16amu)+(3.555250784e+13amu) which is plus electrons.

6.48438076e+16/c^2,

What I was trying to show is that around a certain energy density, a simulated reality bubble can emerge, where the processing capacity of an environment can in fact simulate a given perspective of reality to be physically real: producing simulated events, which result in the un-illusion of the Mandela Effect, where probable events become physical reality locally, but not universally.

Random nonsense, or stuff no one is looking into..?

3110^3/c^2/2242.58697665=1.49241796e-10

1.49241796e-10Kg*c^2=13413183.7J

s^3/13413183.7/ED=2242.58697665mhz, ed=13413183.7/(1/1),

Properties of gold, DUH! Energy density, m/v not=1/1

3110^3/[(13413183.7/(d*1.076756663056355872e-10)]=amu/mhz

The Joules Hertz Constant: 3110^3/(13413183.7/1.076756663056355872e-10)=2.41472046e-7

1.076756663056355872e-10*c^2=9677406.27

3110^3/[(9677406.27/(d*1.49241796e-10)]

The Joules Hertz Constant 2.0: 3110^3/[(9677406.27/(1*1.49241796e-10)]=4.63887489e-7

Vacuum Energy: 3110^3/[(9677406.27/[(10e9/c^2)*1.49241796e-10]]=5.1614444e-32

3110^3/[(9677406.27/(1.11265006e-25*1.49241796e-10)]=5.1614444e-32

The Joules Hertz Constant divided by Vacuum energy hertz over c^3:

(4.63887489e-7/5.1614444e-32)/c^3=0.333564096

3110^3/(13413183.7/0.0000208029)=46.65231249khz for gold

3110^3/(13413183.7/8.47407494e-7)=1.900385khz for iron

The Earth: 13.3179077mhz

The Sun: 3110^3/(13413183.7/19.8591607459)=44535.895136mhz

Venus: 3110^3/[(13413183.7/(31725133.0181*1.076756663056355872e-10)]=7.66073278929mhz/amu

Assume: 3.11ghz=c^2

3110^3/[(13413183.7/(d*1.076756663056355872e-10)]=amu/mhz,

3110^3/3110/c^2/amu=0.721089009*c^2

[3110*(3110^3/3110/c^2/amu/c^2)]]=2242.58697665

[3110^3/c^2/[3110*(3110^3/3110/c^2/amu/c^2)]]*c^2=13413183.7J,

3110^3/[[3110^3/c^2/[3110*(3110^3/3110/c^2/amu/c^2)]]*c^2]/[d*(3110^3/3110)/c^2]=mhz/amu

s^3/[[s^3/c^2/[s*(s^3/s/c^2/amu/c^2)]]*c^2]/(d*s^3/s)/c^2=mhz/amu

3110^3/c^2/[3110*(3110^3/3110/c^2/amu/c^2)]]=c^2/amu

(3110^3/3110)/c^2=1.076756663056355872e-10

Discuss

### Is there a practitioner's guide for rationality?

13 августа, 2018 - 08:39

Hi everyone, I'm new to the community, and am currently working my way through the sequences — yes, all of them.

In the introduction to the first book of Rationality A-Z, Eliezer says:

I didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory. I didn’t realize that part was the priority; and regarding this I can only say “Oops” and “Duh.” Yes, sometimes those big issues really are big and really are important; but that doesn’t change the basic truth that to master skills you need to practice them and it’s harder to practice on things that are further away. (Today the Center for Applied Rationality is working on repairing this huge mistake of mine in a more systematic fashion.)

Just wanted to ask if CFAR has got any of those reorganised materials up, and if they're linked to from anywhere on this site? Any links to other rationality-as-practice blog posts or books or sequences would also be incredibly appreciated!

Discuss

### Trust Me I'm Lying: A Summary and Review

13 августа, 2018 - 05:55

Cross-posted from my blog

Trust Me I'm Lying, by Ryan Holiday, is probably the most influential book I've read in the past two years. This book is a guidebook to the twenty-first century media ecosystem, and shows how the structures and incentives of online media serve to polarize us by stoking fear and anger. First published in 2012, the book is eerily prophetic in many places, as it talks about the toxic influence of online publishing on politics and society.

Ryan Holiday is a marketer and publicist who specializes in manipulating blogs in service of his clients. So what is a blog? A blog is any online publishing platform which derives its revenue from advertising. Blogs range in size from small, fly-by-night local publications all the way up to multi-million-dollar properties like Gawker and the Huffington Post. Blogs may be independent, such as the Huffington Post or Politico, or they may be associated with an existing media franchise. For example, the Monkey Cage politics blog is hosted by the Washington Post, and MoneyWatch is a financial blog hosted by CBS.

The key observation that Ryan makes is that all blogs, no matter how large or small, and no matter whether they're associated with an existing media franchise, are driven by the same economic incentives. The revenue a blog makes can be expressed as (price per page-view) × (number of page-views). Since blogs have little control over how much they are paid per reader, they universally work to maximize the number of readers they get. Moreover, the sheer number of blogs and the extremely low barrier to entry create a brutally Darwinian marketplace in which any content that doesn't maximize page-views is rapidly superseded by content that does.
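The revenue identity above is simple enough to state in code. This is a minimal sketch; the price-per-view figure is an illustrative assumption, not a number from the book:

```python
def blog_revenue(price_per_view: float, page_views: int) -> float:
    """Revenue = (price per page-view) x (number of page-views)."""
    return price_per_view * page_views

# Price per view is roughly fixed by the ad market, so the only lever
# a blog really controls is page-views; hence the pressure to maximize them.
low_traffic = blog_revenue(0.002, 1_000_000)   # hypothetical $0.002 per view
high_traffic = blog_revenue(0.002, 2_000_000)  # doubling views doubles revenue
```

Since the per-view price is held fixed, the only way `high_traffic` can exceed `low_traffic` is through more page-views, which is the book's point in one line.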

In order to maximize the number of people clicking on their stories (and thus viewing the ads on those stories), blogs exploit every flaw in human psychology they can find. Chief among these is provocation. Ryan cites a study by Berger and Milkman from 2012 which shows that content with high emotional valence spreads much faster than content which is emotionally neutral. The study compared stories on the New York Times website, and found that articles which induced anger were 34% more likely than the median article to make the top-10 most e-mailed list. Articles that induced awe also did well, being 30% more likely than the median article to make the most e-mailed list. Both anger and awe are high-arousal emotions (in a negative and positive direction, respectively). On the flip side, articles that induce low-arousal emotions, like sadness, suffer a penalty. Sad articles were 16% less likely than the median article to end up on the most e-mailed list. These facts about human psychology act as constraints on the kinds of stories blogs will write. Every story has to make people feel a "high-energy" emotion, like anger, or awe. Stories that are thoughtful, practical, useful or beautiful but melancholy fall by the wayside.

Another constraint on blogs is their very structure. Marshall McLuhan's adage, "the medium is the message", applies just as much to blogs as it does to television. For blogs, the medium is the stack. Most blogs are arranged in a reverse-chronological list, with new stories coming in at the top of the page and percolating down towards the bottom as further stories come in. A blog which can always produce something fresh at the top of the stack can draw in more readers by having more novel content for readers to click on and share. Ryan points out that, unlike newspapers, which have a finite number of column-inches per day, and unlike cable news, which has a finite number of hours per day, a blog's appetite for content is functionally infinite. Blogs whose writers produce the fastest win, regardless of the quality of their writing.

These three constraints (virality, structure, and disaggregation) serve to create a set of entry points that allows a media manipulator such as Ryan Holiday to influence the content and framing of stories that blogs cover. Ryan noticed, contrary to prevailing wisdom, that most original reporting in online media was done by smaller blogs, whose stories were picked up and summarized by larger publications, until they reached mainstream media outlets and entered the "national conversation". Therefore, by influencing small blogs today, one could alter what was in the Washington Post tomorrow.

Ryan created a process, which he termed "trading up the chain" to do just that. First he would observe which large blogs the national media outlets he was targeting drew from. Then he would observe which small blogs those large blogs pulled stories out of. Finally, he would craft a media campaign targeting those smaller blogs, seeding the same provocative story in enough places to ensure that it would get picked up and passed up the chain until it received national coverage.
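The three-step process can be sketched as a small program. The sourcing graph and function names below are my own illustrative assumptions, not anything from the book:

```python
# Toy sourcing graph: publication -> the blogs it draws stories from.
SOURCES = {
    "Washington Post": ["Gawker"],
    "Gawker": ["MediaBistro", "Curbed LA"],
}

def trade_up_the_chain(target_outlet: str, story: str) -> list:
    """Find the small blogs to seed so `story` can climb to the target outlet."""
    # Step 1: the large blogs the target outlet draws from.
    large_blogs = SOURCES.get(target_outlet, [])
    # Step 2: the smaller blogs those large blogs pull stories out of.
    small_blogs = [s for blog in large_blogs for s in SOURCES.get(blog, [])]
    # Step 3: seed the same provocative story at every entry point.
    return ["pitch %r to %s" % (story, blog) for blog in small_blogs]
```

The pitches land on the small city blogs at the bottom of the chain, which is exactly the entry point the campaign targets; the larger outlets are never contacted directly.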

A concrete example of this is work he did for the movie I Hope They Serve Beer In Hell, starring Tucker Max. He wished to get coverage for the movie in the Washington Post. By observing the sources of the Post's stories carefully, he found out that much of the Post's media coverage originated from stories on Gawker. Going one level further he noticed that Gawker, in turn, pulled its stories from smaller city-focused blogs like MediaBistro and Curbed LA. He then targeted those blogs with a series of provocative actions, such as buying billboards promoting the movie and then vandalizing them himself, calling feminist groups to protest showings of the movie, arranging for provocative advertisements on buses, and other acts designed to go viral on social media. When those blogs inevitably began covering I Hope They Serve Beer In Hell, he ensured that there was enough "chatter" to quickly drive the story up the chain to the Washington Post.

The end result of these tactics was that an otherwise no-name D-List "internet celebrity" like Tucker Max was being interviewed by The Washington Post and late-night TV hosts like Conan O'Brien. Ryan makes the point that much of what we consider to be "organic" content, spreading by "word of mouth" is actually carefully engineered and seeded by professionals like himself to generate coverage for particular celebrities, products, and events.

The important thing to note is that all of these manipulations make the stories that we read less accurate. Moreover, the publishers of these stories bear no cost for misleading us and wasting our time and emotional energy. The toll that they exact on each and every one of us is a pure externality, just as much as smokestack fumes or toxic chemicals going into the water supply.

If the effects of this media manipulation were merely to drive customers to products they wouldn't otherwise buy, Ryan would still probably be out there plying his trade. What caused him to reconsider his profession (and write this book) was the increasing use of these manipulation techniques to spread political ideas, and, in the process, hurt individuals. In the second half of the book, he talks about how sites like Jezebel and Breitbart News use the techniques he pioneered pushing product for American Apparel to maximize their own page-views by stoking outrage both among their supporters and their opponents. In his view, much of the responsibility for the coarsening and polarization of politics and culture can be laid at the feet of professional manipulators like himself.

Ryan's hope is that by writing this book and exposing the actual techniques that manipulators use, we can inoculate ourselves and make ourselves less susceptible to the sort of media manipulation that he used to carry out. Though things look bleak at the moment, Ryan looks to history to show that our online media ecosystem today is very similar to the "yellow-press" era at the turn of the 20th century. Given that, he has hope that the current state of affairs is not sustainable, and we will eventually craft a stronger, more trustworthy online media, just as the provocative tabloids like the New York Herald and The World eventually gave way to more trustworthy publications like The New York Times and the Wall Street Journal.

Personally, I found this book important because it explained and crystallized many of the troubling trends I had personally observed in online media, and put them in a framework that allowed me to see clearly how they worked and how they were manipulating me. Though I knew that online media was growing ever more provocative and ever less accurate, my prior was that it was the result of blind evolutionary forces, as described in Scott Alexander's post on the same topic, The Toxoplasma of Rage.

While Scott's piece is important and insightful, it's still written from an outsider's perspective. It frames the ever-escalating spiral of provocation as the result of groups competing to stoke the most outrage among their own members. Trust Me I'm Lying, by contrast, says that it is the result of deliberate manipulation by people who set groups against one another in order to bring attention to the issues they want attention brought to. When Ryan vandalized his own billboards and organized protests by feminist groups against I Hope They Serve Beer In Hell, he wasn't attempting to send a message about the content of the film. He was merely operating under the adage, "Any publicity is good publicity." Similarly, Ryan's claim is that much of the outrage we find in today's politics and culture is the result of deliberate manipulation by publications that want to drive traffic to their own sites, without necessarily caring about which side "wins".

Where the book is weakest is in its advice about where to go from here. The book ends with a note that people's time and attention are limited, and eventually people will catch on to the fact that they're being manipulated, and will start to demand higher quality reporting, rather than merely quantity. In support of this contention, he cites the evolution of print journalism, which evolved from "yellow press" tabloids to newspapers that are widely considered to be accurate and relatively unbiased. However, he gives few predictions about how this will come to pass, other than noting the current media ecosystem is unsustainable and that unsustainable things cannot be sustained over the long term.

Nevertheless, I found the book insightful, entertaining, and more than slightly horrifying. As a result of the book, I can look at stories like this and better look past the manipulative elements to see how little substance there actually is to the article. As a result, I find that I'm more efficient at extracting the factual content from news articles, and better at identifying and avoiding so-called "fake news". For this reason, I consider Trust Me I'm Lying to be a strong recommendation.

If you're interested in a more detailed outline of the book, I have one on my wiki.

Discuss

### Tactical vs. Strategic Cooperation

12 августа, 2018 - 19:41

As I've matured, one of the (101-level?) social skills I've come to appreciate is asking directly for the narrow, specific thing you want, instead of debating around it.

What do I mean by "debating around" an issue?

Things like:

"If we don't do what I want, horrible things A, B, and C will happen!"

(This tends to degenerate into a miserable argument over how likely A, B, and C are, or a referendum on how neurotic or pessimistic I am.)

"You're such an awful person for not having done [thing I want]!"

(This tends to degenerate into a miserable argument about each other's general worth.)

"Authority Figure Bob will disapprove if we don't do [thing I want]!"

(This tends to degenerate into a miserable argument about whether we should respect Bob's authority.)

It's been astonishing to me how much better people respond if instead I just say, "I really want to do [thing I want]. Can we do that?"

No, it doesn't guarantee that you'll get your way, but it makes it a whole lot more likely. More than that, it means that when you do get into negotiation or debate, that debate stays targeted to the actual decision you're disagreeing about, instead of a global fight about anything and everything, and thus is more likely to be resolved.

Real-life example:

Back at MetaMed, I had a coworker who believed in alternative medicine. I didn't. This caused a lot of spoken and unspoken conflict. There were global values issues at play: reason vs. emotion, logic vs. social charisma, whether her perspective on life was good or bad. I'm embarrassed to say I was rude and inappropriate. But it was coming from a well-meaning place; I didn't want any harm to come to patients from misinformation, and I was very frustrated, because I didn't see how I could prevent that outcome.

Finally, at my wit's end, I blurted out what I wanted: I wanted to have veto power over any information we sent to patients, to make sure it didn't contain any factual inaccuracies.

Guess what? She agreed instantly.

This probably should have been obvious (and I'm sure it was obvious to her.) My job was producing the research reports, while her jobs included marketing and operations. The whole point of division of labor is that we can each stick to our own tasks and not have to critique each other's entire philosophy of life, since it's not relevant to getting the company's work done as well as possible. But I was extremely inexperienced at working with people at that time.

It's not fair to your coworkers to try to alter their private beliefs. (Would you try to change their religion?) A company is an association of people who cooperate on a local task. They don't have to see eye-to-eye about everything in the world, so long as they can work out their disagreements about the task at hand.

This is a skill that "practical" people have, and "idealistic" and "theoretical" people are often weak at -- the ability to declare some issues off topic. We're trying to decide what to do in the here and now; we don't always have to turn things into a debate about underlying ethical or epistemological principles. It's not that principles don't exist (though some self-identified "pragmatic" or "practical" people are against principles per se, I don't agree with them.) It's that it can be unproductive to get into debates about general principles, when it takes up too much time and generates too much ill will, and when it isn't necessary to come to agreement about the tactical plan of what to do next.

Well, what about longer-term, more intimate partnerships? Maybe in a strictly professional relationship you can avoid talking about politics and religion altogether, but in a closer relationship, like a marriage, you actually want to get alignment on underlying values, worldviews, and principles. My husband and I spend a ton of time talking about the diffs between our opinions, and reconciling them, until we do basically have the same worldview, seen through the lens of two different temperaments. Isn't that a counterexample to this "just debate the practical issue at hand" thing? Isn't intellectual discussion really valuable to intellectually intimate people?

Well, it's complicated. Because I've found the same trick of narrowing the scope of the argument and just asking for what I want resolves debates with my husband too.

When I find myself "debating around" a request, it's often debating in bad faith. I'm not actually trying to find out what the risks of [not what I want] are in real life, I'm trying to use talking about danger as a way to scare him into doing [what I want]. If I'm quoting an expert nutritionist to argue that we should have home-cooked family dinners, my motivation is not actually curiosity about the long-term health dangers of not eating as a family, but simply that I want family dinners and I'm throwing spaghetti at a wall hoping some pro-dinner argument will work on him. The "empirical" or "intellectual" debate is just so much rhetorical window dressing for an underlying request. And when that's going on, it's better to notice and redirect to the actual underlying desire.

Then you can get to the actual negotiation, like: what makes family dinners undesirable to you? How could we mitigate those harms? What alternatives would work for both of us?

Debating a far-mode abstraction (like "how do home eating habits affect children's long-term health?") is often an inefficient way of debating what's really a near-mode practical issue only weakly related to the abstraction (like "what kind of schedule should our household have around food?") The far-mode abstract question still exists and might be worth getting into as well, but it also may recede dramatically in importance once you've resolved the practical issue.

One of my long-running (and interesting and mutually respectful) disagreements with my friend Michael Vassar is about the importance of local/tactical vs. global/strategic cooperation. Compared to me, he's much more likely to value getting to alignment with people on fundamental values, epistemology, and world-models. He would rather cooperate with people who share his principles but have opposite positions on object-level, near-term decisions, than people who oppose his principles but are willing to cooperate tactically with him on one-off decisions.

The reasoning for this, he told me, is simply that the long-term is long, and the short-term is short. There's a lot more value to be gained from someone who keeps actively pursuing goals aligned with yours, even when they're far away and you haven't spoken in a long time, than from someone you can persuade or incentivize to do a specific thing you want right now, but who won't be any help in the long run (or might actually oppose your long-run aims.)

This seems like fine reasoning to me, as far as it goes. I think my point of departure is that I estimate different numbers for probabilities and expected values than him. I expect to get a lot of mileage out of relatively transactional or local cooperation (e.g. donors to my organization who don't buy into all of my ideals, synagogue members who aren't intellectually rigorous but are good people to cooperate with on charity, mutual aid, or childcare). I expect getting to alignment on principles to be really hard, expensive, and unlikely to work, most of the time, for me.

Now, I think compared to most people in the world, we're both pretty far on the "long-term cooperation" side of the spectrum.

It's pretty standard advice in business books about company culture, for instance, to note that the most successful teams are more likely to have shared idealistic visions and to get along with each other as friends outside of work. Purely transactional, working-for-a-paycheck arrangements don't really inspire excellence. You can trust strangers in competitive market systems that effectively penalize fraud, but large areas of life aren't like that, and you actually have to have pretty broad value-alignment with people to get any benefit from cooperating.

I think we'd both agree that it's unwise (and immoral, which is kind of the same thing) to try to benefit in the short term from allying with terrible people. The question is, who counts as terrible? What sorts of lapses in rigorous thinking are just normal human fallibility and which make a person seriously untrustworthy?

I'd be interested to read some discussion about when and how much it makes sense to prioritize strategic vs. tactical alliance.

Discuss