The Nonlinear Library: Alignment Forum Top Posts

By The Nonlinear Fund

Listen to a podcast, please open Podcast Republic app. Available on Google Play Store.


Category: Education

Open in Apple Podcasts


Open RSS feed


Open Website


Rate for this podcast

Subscribers: 0
Reviews: 0

Description

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.

Episode Date
Discussion with Eliezer Yudkowsky on AGI interventions by Rob Bensinger, Eliezer Yudkowsky
55:18
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion with Eliezer Yudkowsky on AGI interventions, published by Rob Bensinger, Eliezer Yudkowsky on the AI Alignment Forum. The following is a partially redacted and lightly edited transcript of a chat conversation about AGI between Eliezer Yudkowsky and a set of invitees in early September 2021. By default, all other participants are anonymized as "Anonymous". I think this Nate Soares quote (excerpted from Nate's response to a report by Joe Carlsmith) is a useful context-setting preface regarding timelines, which weren't discussed as much in the transcript: [...] My odds [of AGI by the year 2070] are around 85%[...] I can list a handful of things that drive my probability of AGI-in-the-next-49-years above 80%: 1. 50 years ago was 1970. The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap, even before accounting the recent dramatic increase in the rate of progress, and potential future increases in rate-of-progress as it starts to feel within-grasp. 2. I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn't do -- basic image recognition, go, starcraft, winograd schemas, programmer assistance. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer Programming That Is Actually Good? Theorem proving? Sure, but on my model, "good" versions of those are a hair's breadth away from full AGI already. And the fact that I need to clarify that "bad" versions don't count, speaks to my point that the only barriers people can name right now are intangibles.) That's a very uncomfortable place to be! 3. When I look at the history of invention, and the various anecdotes about the Wright brothers and Enrico Fermi, I get an impression that, when a technology is pretty close, the world looks a lot like how our world looks. Of course, the trick is that when a technology is a little far, the world might also look pretty similar! Though when a technology is very far, the world does look different -- it looks like experts pointing to specific technical hurdles. We exited that regime a few years ago. 4. Summarizing the above two points, I suspect that I'm in more-or-less the "penultimate epistemic state" on AGI timelines: I don't know of a project that seems like they're right on the brink; that would put me in the "final epistemic state" of thinking AGI is imminent. But I'm in the second-to-last epistemic state, where I wouldn't feel all that shocked to learn that some group has reached the brink. Maybe I won't get that call for 10 years! Or 20! But it could also be 2, and I wouldn't get to be indignant with reality. I wouldn't get to say "but all the following things should have happened first, before I made that observation". I have made those observations. 5. It seems to me that the Cotra-style compute-based model provides pretty conservative estimates. For one thing, I don't expect to need human-level compute to get human-level intelligence, and for another I think there's a decent chance that insight and innovation have a big role to play, especially on 50 year timescales. 6. There has been a lot of AI progress recently. When I tried to adjust my beliefs so that I was positively surprised by AI progress just about as often as I was negatively surprised by AI progress, I ended up expecting a bunch of rapid progress. [...] Further preface by Eliezer: In some sections here, I sound gloomy about the probability that coordination between AGI groups succeeds in saving the world. Andrew Critch reminds me to point out that gloominess like this can be a self-fulfilling prophecy - if people think successful coordination is impossible, they won’t try to coordinate. I therefore remark in retrospective advance that it seems to me like at least some of the top...
Dec 10, 2021
What failure looks like by Paul Christiano
14:17
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What failure looks like, published by Paul Christiano on the AI Alignment Forum. The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity. I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts: Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. ("Going out with a whimper.") Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang," an instance of optimization daemons.) I think these are the most important problems if we fail to solve intent alignment. In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years. With fast enough takeoff, my expectations start to look more like the caricature---this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world. (None of the concerns in this post are novel.) Part I: You get what you measure If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods. But if I want to help Bob figure out whether he should vote for Alice---whether voting for Alice would ultimately help create the kind of society he wants---that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve. Some examples of easy-to-measure vs. hard-to-measure goals: Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.) Reducing my feeling of uncertainty, vs. increasing my knowledge about the world. Improving my reported life satisfaction, vs. actually helping me live a good life. Reducing reported crimes, vs. actually preventing crime. Increasing my wealth on paper, vs. increasing my effective control over resources. It’s already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. That force will combine with and amplify existing institutional and social dynamics that already favor easily-measured goals. Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future. We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart: Corporations will deliver value to consumers as measured by profit. Eventually th...
Dec 10, 2021
The Parable of Predict-O-Matic by Abram Demski
23:52
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Parable of Predict-O-Matic, published by Abram Demski on the AI Alignment Forum. I've been thinking more about partial agency. I want to expand on some issues brought up in the comments to my previous post, and on other complications which I've been thinking about. But for now, a more informal parable. (Mainly because this is easier to write than my more technical thoughts.) This relates to oracle AI and to inner optimizers, but my focus is a little different. 1 Suppose you are designing a new invention, a predict-o-matic. It is a wonderous machine which will predict everything for us: weather, politics, the newest advances in quantum physics, you name it. The machine isn't infallible, but it will integrate data across a wide range of domains, automatically keeping itself up-to-date with all areas of science and current events. You fully expect that once your product goes live, it will become a household utility, replacing services like Google. (Google only lets you search the known!) Things are going well. You've got investors. You have an office and a staff. These days, it hardly even feels like a start-up any more; progress is going well. One day, an intern raises a concern. "If everyone is going to be using Predict-O-Matic, we can't think of it as a passive observer. Its answers will shape events. If it says stocks will rise, they'll rise. If it says stocks will fall, then fall they will. Many people will vote based on its predictions." "Yes," you say, "but Predict-O-Matic is an impartial observer nonetheless. It will answer people's questions as best it can, and they react however they will." "But --" the intern objects -- "Predict-O-Matic will see those possible reactions. It knows it could give several different valid predictions, and different predictions result in different futures. It has to decide which one to give somehow." You tap on your desk in thought for a few seconds. "That's true. But we can still keep it objective. It could pick randomly." "Randomly? But some of these will be huge issues! Companies -- no, nations -- will one day rise or fall based on the word of Predict-O-Matic. When Predict-O-Matic is making a prediction, it is choosing a future for us. We can't leave that to a coin flip! We have to select the prediction which results in the best overall future. Forget being an impassive observer! We need to teach Predict-O-Matic human values!" You think about this. The thought of Predict-O-Matic deliberately steering the future sends a shudder down your spine. But what alternative do you have? The intern isn't suggesting Predict-O-Matic should lie, or bend the truth in any way -- it answers 100% honestly to the best of its ability. But (you realize with a sinking feeling) honesty still leaves a lot of wiggle room, and the consequences of wiggles could be huge. After a long silence, you meet the interns eyes. "Look. People have to trust Predict-O-Matic. And I don't just mean they have to believe Predict-O-Matic. They're bringing this thing into their homes. They have to trust that Predict-O-Matic is something they should be listening to. We can't build value judgements into this thing! If it ever came out that we had coded a value function into Predict-O-Matic, a value function which selected the very future itself by selecting which predictions to make -- we'd be done for! No matter how honest Predict-O-Matic remained, it would be seen as a manipulator. No matter how beneficent its guiding hand, there are always compromises, downsides, questionable calls. No matter how careful we were to set up its values -- to make them moral, to make them humanitarian, to make them politically correct and broadly appealing -- who are we to choose? No. We'd be done for. They'd hang us. We'd be toast!" You realize at this point that you've stood up and start...
Dec 10, 2021
What 2026 looks like by Daniel Kokotajlo
27:26
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What 2026 looks like, published by Daniel Kokotajlo on the AI Alignment Forum. This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.) What’s the point of doing this? Well, there are a couple of reasons: Sometimes attempting to write down a concrete example causes you to learn things, e.g. that a possibility is more or less plausible than you thought. Most serious conversation about the future takes place at a high level of abstraction, talking about e.g. GDP acceleration, timelines until TAI is affordable, multipolar vs. unipolar takeoff. vignettes are a neglected complementary approach worth exploring. Most stories are written backwards. The author begins with some idea of how it will end, and arranges the story to achieve that ending. Reality, by contrast, proceeds from past to future. It isn’t trying to entertain anyone or prove a point in an argument. Anecdotally, various people seem to have found Paul Christiano’s “tales of doom” stories helpful, and relative to typical discussions those stories are quite close to what we want. (I still think a bit more detail would be good — e.g. Paul’s stories don’t give dates, or durations, or any numbers at all really.)[2] “I want someone to ... write a trajectory for how AI goes down, that is really specific about what the world GDP is in every one of the years from now until insane intelligence explosion. And just write down what the world is like in each of those years because I don't know how to write an internally consistent, plausible trajectory. I don't know how to write even one of those for anything except a ridiculously fast takeoff.” --Buck Shlegeris This vignette was hard to write. To achieve the desired level of detail I had to make a bunch of stuff up, but in order to be realistic I had to constantly ask “but actually though, what would really happen in this situation?” which made it painfully obvious how little I know about the future. There are numerous points where I had to conclude “Well, this does seem implausible, but I can’t think of anything more plausible at the moment and I need to move on.” I fully expect the actual world to diverge quickly from the trajectory laid out here. Let anyone who (with the benefit of hindsight) claims this divergence as evidence against my judgment prove it by exhibiting a vignette/trajectory they themselves wrote in 2021. If it maintains a similar level of detail (and thus sticks its neck out just as much) while being more accurate, I bow deeply in respect! I hope this inspires other people to write more vignettes soon. We at the Center on Long-Term Risk would like to have a collection to use for strategy discussions. Let me know if you’d like to do this, and I can give you advice & encouragement! I’d be happy to run another workshop. 2022 GPT-3 is finally obsolete. OpenAI, Google, Facebook, and DeepMind all have gigantic multimodal transformers, similar in size to GPT-3 but trained on images, video, maybe audio too, and generally higher-quality data. Not only that, but they are now typically fine-tuned in various ways--for example, to answer questions correctly, or produce engaging conversation as a chatbot. The chatbots are fun to talk to but erratic and ultimately considered shallow by intellectuals. They aren’t particularly useful for anything supe...
Dec 10, 2021
Are we in an AI overhang? by Andy Jones
07:52
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are we in an AI overhang?, published by Andy Jones on the AI Alignment Forum. Over on Developmental Stages of GPTs, orthonormal mentions it at least reduces the chance of a hardware overhang. An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected. I am worried we're in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months. Investment Bounds GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant. GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour. Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability. A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market. Compute Cost The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/PFLOPS-day. However, this cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher when taking tensor cores into account. The current hardware floor is nearer to the RTX 2080 TI's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10x speedup to our model. I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would've happened much sooner, but I still think $25/PFLOPS-day is a lower useful bound. Other Constraints I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint. Scaling law breakdown. The GPT series' scaling is expected to break down around 10k pflops-days (§6.3), which is a long way short of the amount of cash on the table. This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something. Sequence length. GPT-3 uses 2048 tokens at a time, and that's with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length sequences is not very likely. But there are a lot of plausible ways to fix that, and complexity is no bar AI. This constraint might plausibly not be resolved on a timescale of months, however. Data availability. From the same paper as the previous point, dataset size rises with the square-root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data. It’s hard to find a good estimate on total-words-ever-written, but our library of 130m...
Dec 10, 2021
DeepMind: Generally capable agents emerge from open-ended play by Daniel Kokotajlo
03:15
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DeepMind: Generally capable agents emerge from open-ended play, published by Daniel Kokotajlo on the AI Alignment Forum. This is a linkpost for EDIT: Also see paper and results compilation video! Today, we published "Open-Ended Learning Leads to Generally Capable Agents," a preprint detailing our first steps to train an agent capable of playing many different games without needing human interaction data. ... The result is an agent with the ability to succeed at a wide spectrum of tasks — from simple object-finding problems to complex games like hide and seek and capture the flag, which were not encountered during training. We find the agent exhibits general, heuristic behaviours such as experimentation, behaviours that are widely applicable to many tasks rather than specialised to an individual task. The neural network architecture we use provides an attention mechanism over the agent’s internal recurrent state — helping guide the agent’s attention with estimates of subgoals unique to the game the agent is playing. We’ve found this goal-attentive agent (GOAT) learns more generally capable policies. Playing roughly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as a result of 3.4 million unique tasks. At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human. And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space — with the frontier of normalised score percentiles continually improving. Looking qualitatively at our agents, we often see general, heuristic behaviours emerge — rather than highly optimised, specific behaviours for individual tasks. Instead of agents knowing exactly the “best thing” to do in a new situation, we see evidence of agents experimenting and changing the state of the world until they’ve achieved a rewarding state. We also see agents rely on the use of other tools, including objects to occlude visibility, to create ramps, and to retrieve other objects. Because the environment is multiplayer, we can examine the progression of agent behaviours while training on held-out social dilemmas, such as in a game of “chicken”. As training progresses, our agents appear to exhibit more cooperative behaviour when playing with a copy of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently. My hot take: This seems like a somewhat big deal to me. It's what I would have predicted, but that's scary, given my timelines. I haven't read the paper itself yet but I look forward to seeing more numbers and scaling trends and attempting to extrapolate... When I do I'll leave a comment with my thoughts. EDIT: My warm take: The details in the paper back up the claims it makes in the title and abstract. This is the GPT-1 of agent/goal-directed AGI; it is the proof of concept. Two more papers down the line (and a few OOMs more compute), and we'll have the agent/goal-directed AGI equivalent of GPT-3. Scary stuff. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 10, 2021
Alignment Research Field Guide by Abram Demski
26:15
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Research Field Guide, published by Abram Demski on the AI Alignment Forum. This field guide was written by the MIRI team with MIRIx groups in mind, though the advice may be relevant to others working on AI alignment research. Preamble I: Decision Theory Hello! You may notice that you are reading a document. This fact comes with certain implications. For instance, why are you reading this? Will you finish it? What decisions will you come to as a result? What will you do next? Notice that, whatever you end up doing, it’s likely that there are dozens or even hundreds of other people, quite similar to you and in quite similar positions, who will follow reasoning which strongly resembles yours, and make choices which correspondingly match. Given that, it’s our recommendation that you make your next few decisions by asking the question “What policy, if followed by all agents similar to me, would result in the most good, and what does that policy suggest in my particular case?” It’s less of a question of trying to decide for all agents sufficiently-similar-to-you (which might cause you to make the wrong choice out of guilt or pressure) and more something like “if I were in charge of all agents in my reference class, how would I treat instances of that class with my specific characteristics?” If that kind of thinking leads you to read further, great. If it leads you to set up a MIRIx chapter, even better. In the meantime, we will proceed as if the only people reading this document are those who justifiably expect to find it reasonably useful. ⠀ Preamble II: Surface Area Imagine that you have been tasked with moving a cube of solid iron that is one meter on a side. Given that such a cube weighs ~16000 pounds, and that an average human can lift ~100 pounds, a naïve estimation tells you that you can solve this problem with ~150 willing friends. But of course, a meter cube can fit at most something like 10 people around it. It doesn’t matter if you have the theoretical power to move the cube if you can’t bring that power to bear in an effective manner. The problem is constrained by its surface area. MIRIx chapters are one of the best ways to increase the surface area of people thinking about and working on the technical problem of AI alignment. And just as it would be a bad idea to decree "the 10 people who happen to currently be closest to the metal cube are the only ones allowed to think about how to think about this problem", we don’t want MIRI to become the bottleneck or authority on what kinds of thinking can and should be done in the realm of embedded agency and other relevant fields of research. The hope is that you and others like you will help actually solve the problem, not just follow directions or read what’s already been written. This document is designed to support people who are interested in doing real groundbreaking research themselves. ⠀ Contents You and your research Logistics of getting started Models of social dynamics Other useful thoughts and questions ⠀ 1. You and your research We sometimes hear questions of the form “Even a summer internship feels too short to make meaningful progress on real problems. How can anyone expect to meet and do real research in a single afternoon?” There’s a Zeno-esque sense in which you can’t make research progress in a million years if you can’t also do it in five minutes. It’s easy to fall into a trap of (either implicitly or explicitly) conceptualizing “research” as “first studying and learning what’s already been figured out, and then attempting to push the boundaries and contribute new content.” The problem with this frame (according to us) is that it leads people to optimize for absorbing information, rather than seeking it instrumentally, as a precursor to understanding. (Be mindful of what you’re optimizing ...
Dec 10, 2021
Hiring engineers and researchers to help align GPT-3 by Paul Christiano
04:48
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Hiring engineers and researchers to help align GPT-3, published by Paul Christiano on the AI Alignment Forum. My team at OpenAI, which works on aligning GPT-3, is hiring ML engineers and researchers. Apply here for the ML engineer role and here for the ML researcher role. GPT-3 is similar enough to "prosaic" AGI that we can work on key alignment problems without relying on conjecture or speculative analogies. And because GPT-3 is already being deployed in the OpenAI API, its misalignment matters to OpenAI’s bottom line — it would be much better if we had an API that was trying to help the user instead of trying to predict the next word of text from the internet. I think this puts our team in a great place to have an impact: If our research succeeds I think it will directly reduce existential risk from AI. This is not meant to be a warm-up problem, I think it’s the real thing. We are working with state of the art systems that could pose an existential risk if scaled up, and our team’s success actually matters to the people deploying those systems. We are working on the whole pipeline from “interesting idea” to “production-ready system,” building critical skills and getting empirical feedback on whether our ideas actually work. We have the real-world problems to motivate alignment research, the financial support to hire more people, and a research vision to execute on. We are bottlenecked by excellent researchers and engineers who are excited to work on alignment. What the team does In the past Reflection focused on fine-tuning GPT-3 using a reward function learned from human feedback. Our most recent results are here, and had the unusual virtue of simultaneously being exciting enough to ML researchers to be accepted at NeurIPS while being described by Eliezer as “directly, straight-up relevant to real alignment problems.” We’re currently working on three things: [20%] Applying basic alignment approaches to the API, aiming to close the gap between theory and practice. [60%] Extending existing approaches to tasks that are too hard for humans to evaluate; in particular, we are training models that summarize more text than human trainers have time to read. Our approach is to use weaker ML systems operating over shorter contexts to help oversee stronger ones over longer contexts. This is conceptually straightforward but still poses significant engineering and ML challenges. [20%] Conceptual research on domains that no one knows how to oversee and empirical work on debates between humans (see our 2019 writeup). I think the biggest open problem is figuring out how and if human overseers can leverage “knowledge” the model acquired during training (see an example here). If successful, ideas will eventually move up this list, from the conceptual stage to ML prototypes to real deployments. We’re viewing this as practice for integrating alignment into transformative AI deployed by OpenAI or another organization. What you’d do Most people on the team do a subset of these core tasks: Design+build+maintain code for experimenting with novel training strategies for large language models. This infrastructure needs to support a diversity of experimental changes that are hard to anticipate in advance, work as a solid base to build on for 6-12 months, and handle the complexity of working with large language models. Most of our code is maintained by 1-3 people and consumed by 2-4 people (all on the team). Oversee ML training. Evaluate how well models are learning, figure out why they are learning badly, and identify+prioritize+implement changes to make them learn better. Tune hyperparameters and manage computing resources. Process datasets for machine consumption; understand datasets and how they affect the model’s behavior. Design and conduct experiments to answer questions about our mode...
Dec 10, 2021
2018 AI Alignment Literature Review and Charity Comparison by Larks
49:51
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2018 AI Alignment Literature Review and Charity Comparison, published by Larks on the AI Alignment Forum. Cross-posted to the EA forum. Introduction Like last year and the year before, I’ve attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to an securities analyst with regards to possible investments. It appears that once again no-one else has attempted to do this, to my knowledge, so I've once again undertaken the task. This year I have included several groups not covered in previous years, and read more widely in the literature. My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should give a sense for the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency. Note that this document is quite long, so I encourage you to just read the sections that seem most relevant to your interests, probably the sections about the individual organisations. I do not recommend you skip to the conclusions! I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued. Methodological Considerations Track Records Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is correct. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies and I think the same model applies here. This judgement involves analysing a large number papers relating to Xrisk that were produced during 2018. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric. I also attempted to include papers during December 2017, to take into account the fact that I'm missing the last month's worth of output from 2017, but I can't be sure I did this successfully. This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, who both do a lot of work on other issues. We focus on papers, rather than outreach or other activities. This is partly because they are much easier to measure; while there has been a large increase in interest in AI safety over the last year, it’s hard to work out who to credit for this, and partly because I think progress has to come by persuading AI researchers, which I think comes through technical outreach and publishing good work, not popular/political work. Politics My impression is that policy on technical subjects (as opposed to issues that attract strong views from the general population) is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers at Google, CMU & Baidu) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant opposition to GM foods or nuclear power. We don't want the 'us-vs-them' situation, that has occurred with climate change, to happen here. AI researchers who are dismissive of safety law, regarding it as ...
Dec 10, 2021
Another (outer) alignment failure story by Paul Christiano
18:05
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Another (outer) alignment failure story, published by Paul Christiano on the AI Alignment Forum. Meta This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes. At the end I’m going to go through some salient ways you could vary the story. This isn’t intended to be a particularly great story (and it’s pretty informal). I’m still trying to think through what I expect to happen if alignment turns out to be hard, and this more like the most recent entry in a long journey of gradually-improving stories. I wrote this up a few months ago and was reminded to post it by Critch’s recent post (which is similar in many ways). This story has definitely been shaped by a broader community of people gradually refining failure stories rather than being written in a vacuum. I’d like to continue spending time poking at aspects of this story that don’t make sense, digging into parts that seem worth digging into, and eventually developing clearer and more plausible stories. I still think it’s very plausible that my views about alignment will change in the course of thinking concretely about stories, and even if my basic views about alignment stay the same it’s pretty likely that the story will change. Story ML starts running factories, warehouses, shipping, and construction. ML assistants help write code and integrate ML into new domains. ML designers help build factories and the robots that go in them. ML finance systems invest in companies on the basis of complicated forecasts and (ML-generated) audits. Tons of new factories, warehouses, power plants, trucks and roads are being built. Things are happening quickly, investors have super strong FOMO, no one really knows whether it’s a bubble but they can tell that e.g. huge solar farms are getting built and something is happening that they want a piece of. Defense contractors are using ML systems to design new drones, and ML is helping the DoD decide what to buy and how to deploy it. The expectation is that automated systems will manage drones during high-speed ML-on-ML conflicts because humans won’t be able to understand what’s going on. ML systems are designing new ML systems, testing variations, commissioning giant clusters. The financing is coming from automated systems, the clusters are built by robots. A new generation of fabs is being built with unprecedented speed using new automation. At this point everything kind of makes sense to humans. It feels like we are living at the most exciting time in history. People are making tons of money. The US defense establishment is scared because it has no idea what a war is going to look like right now, but in terms of policy their top priority is making sure the boom proceeds as quickly in the US as it does in China because it now seems plausible that being even a few years behind would result in national irrelevance. Things are moving very quickly and getting increasingly hard for humans to evaluate. We can no longer train systems to make factory designs that look good to humans, because we don’t actually understand exactly what robots are doing in those factories or why; we can’t evaluate the tradeoffs between quality and robustness and cost that are being made; we can't really understand the constraints on a proposed robot design or why one design is better than another. We can’t evaluate arguments about investments very well because they come down to claims about where the overall economy is going over the next 6 months that seem kind of alien (even the more recognizable claims are just kind of incomprehensible predictions about e.g. how t...
Dec 10, 2021
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More by Ben Pace
26:10
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More , published by Ben Pace on the AI Alignment Forum. An actual debate about instrumental convergence, in a public space! Major respect to all involved, especially Yoshua Bengio for great facilitation. For posterity (i.e. having a good historical archive) and further discussion, I've reproduced the conversation here. I'm happy to make edits at the request of anyone in the discussion who is quoted below. I've improved formatting for clarity and fixed some typos. For people who are not researchers in this area who wish to comment, see the public version of this post here. For people who do work on the relevant areas, please sign up in the top right. It will take a day or so to confirm membership. Original Post Yann LeCun: "don't fear the Terminator", a short opinion piece by Tony Zador and me that was just published in Scientific American. "We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do." Comment Thread #1 Elliot Olds: Yann, the smart people who are very worried about AI seeking power and ensuring its own survival believe it's a big risk because power and survival are instrumental goals for almost any ultimate goal. If you give a generally intelligent AI the goal to make as much money in the stock market as possible, it will resist being shut down because that would interfere with tis goal. It would try to become more powerful because then it could make money more effectively. This is the natural consequence of giving a smart agent a goal, unless we do something special to counteract this. You've often written about how we shouldn't be so worried about AI, but I've never seen you address this point directly. Stuart Russell: It is trivial to construct a toy MDP in which the agent's only reward comes from fetching the coffee. If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be addressed because it's a simple mathematical observation. Comment Thread #2 Yoshua Bengio: Yann, I'd be curious about your response to Stuart Russell's point. Yann LeCun: You mean, the so-called "instrumental convergence" argument by which "a robot can't fetch you coffee if it's dead. Hence it will develop self-preservation as an instrumental sub-goal." It might even kill you if you get in the way. 1. Once the robot has brought you coffee, its self-preservation instinct disappears. You can turn it off. 2. One would have to be unbelievably stupid to build open-ended objectives in a super-intelligent (and super-powerful) machine without some safeguard terms in the objective. 3. One would have to be rather incompetent not to have a mechanism by which new terms in the objective could be added to prevent previously-unforeseen bad behavior. For humans, we have education and laws to shape our objective functions and complement the hardwired terms built into us by evolution. 4. The power of even the most super-intelligent machine is limited by physics, and its size and needs make it vulnerable to physical attacks. No need for much intelligence here. A virus is infinitely less intelligent than you, but it can still kill you. 5. A second machine, designed solely to neutralize an evil super-intelligent machine will win every time, if given similar...
Dec 10, 2021
Some AI research areas and their relevance to existential safety by Andrew Critch
01:26:09
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some AI research areas and their relevance to existential safety, published by Andrew Critch on the AI Alignment Forum. Followed by: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs), which provides examples of multi-stakeholder/multi-agent interactions leading to extinction events. Introduction This post is an overview of a variety of AI research areas in terms of how much I think contributing to and/or learning from those areas might help reduce AI x-risk. By research areas I mean “AI research topics that already have groups of people working on them and writing up their results”, as opposed to research “directions” in which I’d like to see these areas “move”. I formed these views mostly pursuant to writing AI Research Considerations for Human Existential Safety (ARCHES). My hope is that my assessments in this post can be helpful to students and established AI researchers who are thinking about shifting into new research areas specifically with the goal of contributing to existential safety somehow. In these assessments, I find it important to distinguish between the following types of value: The helpfulness of the area to existential safety, which I think of as a function of what services are likely to be provided as a result of research contributions to the area, and whether those services will be helpful to existential safety, versus The educational value of the area for thinking about existential safety, which I think of as a function of how much a researcher motivated by existential safety might become more effective through the process of familiarizing with or contributing to that area, usually by focusing on ways the area could be used in service of existential safety. The neglect of the area at various times, which is a function of how much technical progress has been made in the area relative to how much I think is needed. Importantly: The helpfulness to existential safety scores do not assume that your contributions to this area would be used only for projects with existential safety as their mission. This can negatively impact the helpfulness of contributing to areas that are more likely to be used in ways that harm existential safety. The educational value scores are not about the value of an existential-safety-motivated researcher teaching about the topic, but rather, learning about the topic. The neglect scores are not measuring whether there is enough “buzz” around the topic, but rather, whether there has been adequate technical progress in it. Buzz can predict future technical progress, though, by causing people to work on it. Below is a table of all the areas I considered for this post, along with their entirely subjective “scores” I’ve given them. The rest of this post can be viewed simply as an elaboration/explanation of this table: Existing Research Area Social Application Helpfulness to Existential Safety Educational Value 2015 Neglect 2020 Neglect 2030 Neglect Out of Distribution Robustness Zero/ Single 1/10 4/10 5/10 3/10 1/10 Agent Foundations Zero/ Single 3/10 8/10 9/10 8/10 7/10 Multi-agent RL Zero/ Multi 2/10 6/10 5/10 4/10 0/10 Preference Learning Single/ Single 1/10 4/10 5/10 1/10 0/10 Side-effect Minimization Single/ Single 4/10 4/10 6/10 5/10 4/10 Human-Robot Interaction Single/ Single 6/10 7/10 5/10 4/10 3/10 Interpretability in ML Single/ Single 8/10 6/10 8/10 6/10 2/10 Fairness in ML Multi/ Single 6/10 5/10 7/10 3/10 2/10 Computational Social Choice Multi/ Single 7/10 7/10 7/10 5/10 4/10 Accountability in ML Multi/ Multi 8/10 3/10 8/10 7/10 5/10 The research areas are ordered from least-socially-complex to most-socially-complex. This roughly (though imperfectly) correlates with addressing existential safety problems of increasing importance and neglect, according to me. Correspondingly, the second colu...
Dec 10, 2021
Announcing the Alignment Research Center by Paul Christiano
01:21
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing the Alignment Research Center, published by Paul Christiano on the AI Alignment Forum. (Cross-post from ai-alignment.com) I’m now working full-time on the Alignment Research Center (ARC), a new non-profit focused on intent alignment research. I left OpenAI at the end of January and I’ve spent the last few months planning, doing some theoretical research, doing some logistical set-up, and taking time off. For now it’s just me, focusing on theoretical research. I’m currently feeling pretty optimistic about this work: I think there’s a good chance that it will yield big alignment improvements within the next few years, and a good chance that those improvements will be integrated into practice at leading ML labs. My current goal is to build a small team working productively on theory. I’m not yet sure how we’ll approach hiring, but if you’re potentially interested in joining you can fill out this tiny form to get notified when we’re ready. Over the medium term (and maybe starting quite soon) I also expect to implement and study techniques that emerge from theoretical work, to help ML labs adopt alignment techniques, and to work on alignment forecasting and strategy. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 10, 2021
The Rocket Alignment Problem by Eliezer Yudkowsky
23:38
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Rocket Alignment Problem, published by Eliezer Yudkowsky on the AI Alignment Forum. The following is a fictional dialogue building off of AI Alignment: Why It’s Hard, and Where to Start. (Somewhere in a not-very-near neighboring world, where science took a very different course.) ALFONSO: Hello, Beth. I’ve noticed a lot of speculations lately about “spaceplanes” being used to attack cities, or possibly becoming infused with malevolent spirits that inhabit the celestial realms so that they turn on their own engineers. I’m rather skeptical of these speculations. Indeed, I’m a bit skeptical that airplanes will be able to even rise as high as stratospheric weather balloons anytime in the next century. But I understand that your institute wants to address the potential problem of malevolent or dangerous spaceplanes, and that you think this is an important present-day cause. BETH: That’s. really not how we at the Mathematics of Intentional Rocketry Institute would phrase things. The problem of malevolent celestial spirits is what all the news articles are focusing on, but we think the real problem is something entirely different. We’re worried that there’s a difficult, theoretically challenging problem which modern-day rocket punditry is mostly overlooking. We’re worried that if you aim a rocket at where the Moon is in the sky, and press the launch button, the rocket may not actually end up at the Moon. ALFONSO: I understand that it’s very important to design fins that can stabilize a spaceplane’s flight in heavy winds. That’s important spaceplane safety research and someone needs to do it. But if you were working on that sort of safety research, I’d expect you to be collaborating tightly with modern airplane engineers to test out your fin designs, to demonstrate that they are actually useful. BETH: Aerodynamic designs are important features of any safe rocket, and we’re quite glad that rocket scientists are working on these problems and taking safety seriously. That’s not the sort of problem that we at MIRI focus on, though. ALFONSO: What’s the concern, then? Do you fear that spaceplanes may be developed by ill-intentioned people? BETH: That’s not the failure mode we’re worried about right now. We’re more worried that right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination. Whether Google or the US Government or North Korea is the one to launch the rocket won’t make a pragmatic difference to the probability of a successful Moon landing from our perspective, because right now nobody knows how to aim any kind of rocket anywhere. ALFONSO: I’m not sure I understand. BETH: We’re worried that even if you aim a rocket at the Moon, such that the nose of the rocket is clearly lined up with the Moon in the sky, the rocket won’t go to the Moon. We’re not sure what a realistic path from the Earth to the moon looks like, but we suspect it might not be a very straight path, and it may not involve pointing the nose of the rocket at the moon at all. We think the most important thing to do next is to advance our understanding of rocket trajectories until we have a better, deeper understanding of what we’ve started calling the “rocket alignment problem”. There are other safety problems, but this rocket alignment problem will probably take the most total time to work on, so it’s the most urgent. ALFONSO: Hmm, that sounds like a bold claim to me. Do you have a reason to think that there are invisible barriers between here and the moon that the spaceplane might hit? Are you saying that it might get very very windy between here and the moon, more so than on Earth? Both eventualities could be worth preparing for, I suppose, but neither seem likely. BETH: We don’t think it’s particularly likely that there...
Dec 10, 2021
The case for aligning narrowly superhuman models by Ajeya Cotra
50:12
I wrote this post to get people’s takes on a type of work that seems exciting to me personally; I’m not speaking for Open Phil as a whole. Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured). We are not seeking grant applications on this topic right now. Thanks to Daniel Dewey, Eliezer Yudkowsky, Evan Hubinger, Holden Karnofsky, Jared Kaplan, Mike Levine, Nick Beckstead, Owen Cotton-Barratt, Paul Christiano, Rob Bensinger, and Rohin Shah for comments on earlier drafts. A genre of technical AI risk reduction work that seems exciting to me is trying to align existing models that already are, or have the potential to be, “superhuman”[1] at some particular task (which I’ll call narrowly superhuman models).[2] I don’t just mean “train these models to be more robust, reliable, interpretable, etc” (though that seems good too); I mean “figure out how to harness their full abilities so they can be as useful as possible to humans” (focusing on “fuzzy” domains where it’s intuitively non-obvious how to make that happen). Here’s an example of what I’m thinking of: intuitively speaking, it feels like GPT-3 is “smart enough to” (say) give advice about what to do if I’m sick that’s better than advice I’d get from asking humans on Reddit or Facebook, because it’s digested a vast store of knowledge about illness symptoms and remedies. Moreover, certain ways of prompting it provide suggestive evidence that it could use this knowledge to give helpful advice. With respect to the Reddit or Facebook users I might otherwise ask, it seems like GPT-3 has the potential to be narrowly superhuman in the domain of health advice. But GPT-3 doesn’t seem to “want” to give me the best possible health advice -- instead it “wants” to play a strange improv game riffing off the prompt I give it, pretending it’s a random internet user. So if I want to use GPT-3 to get advice about my health, there is a gap between what it’s capable of (which could even exceed humans) and what I can get it to actually provide me. I’m interested in the challenge of: How can we get GPT-3 to give “the best health advice it can give” when humans[3] in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell whether it’s actually “doing the best it can”? I think there are other similar challenges we could define for existing models, especially large language models. I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques. I’ll call this type of project aligning narrowly superhuman models. In the rest of this post, I: Give a more detailed description of what aligning narrowly superhuman models could look like, what does and doesn’t “count”, and what future projects I think could be done in this space (more). Explain why I think aligning narrowly superhuman models could meaningfully reduce long-term existential risk from misaligned AI (more). Lay out the potential advantages that I think this work has over other types of AI alignment research: (a) conceptual thinking, (b) demos in small-scale artificial settings, and (c) mainstream ML safety such as interpretability and robustness (more). Answer some objections and questions about this research direction, e.g. concerns that it’s not very neglected, feels suspiciously similar to commercialization, might cause harm by exacerbating AI race dynamics, or is dominated by another t...
Dec 10, 2021
Realism about rationality by Richard Ngo
07:29
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Realism about rationality , published by Richard Ngo on the AI Alignment Forum. This is a linkpost for http://thinkingcomplete.blogspot.com/2018/09/rational-and-real.html Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself. There’s a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is. Humans ascribe properties to entities in the world in order to describe and predict them. Here are three such properties: "momentum", "evolutionary fitness", and "intelligence". These are all pretty useful properties for high-level reasoning in the fields of physics, biology and AI, respectively. There's a key difference between the first two, though. Momentum is very amenable to formalisation: we can describe it using precise equations, and even prove things about it. Evolutionary fitness is the opposite: although nothing in biology makes sense without it, no biologist can take an organism and write down a simple equation to define its fitness in terms of more basic traits. This isn't just because biologists haven't figured out that equation yet. Rather, we have excellent reasons to think that fitness is an incredibly complicated "function" which basically requires you to describe that organism's entire phenotype, genotype and environment. In a nutshell, then, realism about rationality is a mindset in which reasoning and intelligence are more like momentum than like fitness. It's a mindset which makes the following ideas seem natural: The idea that there is a simple yet powerful theoretical framework which describes human intelligence and/or intelligence in general. (I don't count brute force approaches like AIXI for the same reason I don't consider physics a simple yet powerful description of biology). The idea that there is an “ideal” decision theory. The idea that AGI will very likely be an “agent”. The idea that Turing machines and Kolmogorov complexity are foundational for epistemology. The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints. The idea that Aumann's agreement theorem is relevant to humans. The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct. The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn’t depend very much on morally arbitrary factors. The idea that having having contradictory preferences or beliefs is really bad, even when there’s no clear way that they’ll lead to bad consequences (and you’re very good at avoiding dutch books and money pumps and so on). To be clear, I am neither claiming that realism about rationality makes people dogmatic about such ideas, nor claiming that they're all false. In fact, from a historical point of view I’m quite optimistic about using maths to describe things in general. But starting from that historical baseline, I’m inclined to adjust downwards on questions related to formalising intelligent thought, whereas rationality realism would endorse adjusting upwards. This essay is primarily intended to explain my position, not justify it, but one important consideration for me is th...
Dec 10, 2021
Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain by Daniel Kokotajlo
20:29
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain, published by Daniel Kokotajlo on the AI Alignment Forum. [Epistemic status: Strong opinions lightly held, this time with a cool graph.] I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable. In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes. In slogan form: If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does. The case of birds & planes illustrates this point nicely. Moreover, it is also a precedent for several other short-timelines talking points, such as the human-brain-human-lifetime (HBHL) anchor. Plan: Illustrative Analogy Exciting Graph Analysis Extra brute force can make the problem a lot easier Evolution produces complex mysterious efficient designs by default, even when simple inefficient designs work just fine for human purposes. What’s bogus and what’s not Example: Data-efficiency Conclusion Appendix 1909 French military plane, the Antionette VII. By Deep silence (Mikaël Restoux) - Own work (Bourget museum, in France), CC BY 2.5, Illustrative Analogy AI timelines, from our current perspective Flying machine timelines, from the perspective of the late 1800’s: Shorty: Human brains are giant neural nets. This is reason to think we can make human-level AGI (or at least AI with strategically relevant skills, like politics and science) by making giant neural nets. Shorty: Birds are winged creatures that paddle through the air. This is reason to think we can make winged machines that paddle through the air. Longs: Whoa whoa, there are loads of important differences between brains and artificial neural nets: [what follows is a direct quote from the objection a friend raised when reading an early draft of this post!] - During training, deep neural nets use some variant of backpropagation. My understanding is that the brain does something else, closer to Hebbian learning. (Though I vaguely remember at least one paper claiming that maybe the brain does something that's similar to backprop after all.) - It's at least possible that the wiring diagram of neurons plus weights is too coarse-grained to accurately model the brain's computation, but it's all there is in deep neural nets. If we need to pay attention to glial cells, intracellular processes, different neurotransmitters etc., it's not clear how to integrate this into the deep learning paradigm. - My impression is that several biological observations on the brain don't have a plausible analog in deep neural nets: growing new neurons (though unclear how important it is for an adult brain), "repurposing" in response to brain damage, . Longs: Whoa whoa, there are loads of important differences between birds and flying machines: - Birds paddle the air by flapping, whereas current machine designs use propellers and fixed wings. - It’s at least possible that the anatomical diagram of bones, muscles, and wing surfaces is too coarse-grained to accurately model how a bird flies, but that’s all there is to current machine designs (replacing bones with struts and muscles with motors, that is). If we need to pay attention to the percolation of air through and between feathers, micro-eddies in the air sensed by the bird and instinctively responded to, etc. it’s not clear ...
Dec 10, 2021
Goodhart Taxonomy by Scott Garrabrant
15:19
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Goodhart Taxonomy, published by Scott Garrabrant on the AI Alignment Forum. Goodhart’s Law states that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." However, this is not a single phenomenon. I propose that there are (at least) four different mechanisms through which proxy measures break when you optimize for them. The four types are Regressional, Causal, Extremal, and Adversarial. In this post, I will go into detail about these four different Goodhart effects using mathematical abstractions as well as examples involving humans and/or AI. I will also talk about how you can mitigate each effect. Throughout the post, I will use V to refer to the true goal and use U to refer to a proxy for that goal which was observed to correlate with V and which is being optimized in some way. Quick Reference Regressional Goodhart - When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal. Model: When U is equal to V X , where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6'3", and a random 7' person in their 20s would probably not be as good Causal Goodhart - When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal. Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V , you will fail to also increase V Example: someone who wishes to be taller might observe that height is correlated with basketball skill and decide to start practicing basketball. Extremal Goodhart - Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed. Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occuring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occuring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation. Example: the tallest person on record, Robert Wadlow, was 8'11" (2.72m). He grew to that height because of a pituitary disorder, he would have struggled to play basketball because he "required leg braces to walk and had little feeling in his legs and feet." Adversarial Goodhart - When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal. Model: Consider an agent A with some different goal W . Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V , and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values. Example: aspiring NBA players might just lie about their height. Regressional Goodhart When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal. Abstract Model When U is equal to V X , where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U The above description is whe...
Dec 10, 2021
The ground of optimization by Alex Flint
43:00
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The ground of optimization, published by Alex Flint on the AI Alignment Forum. This work was supported by OAK, a monastic community in the Berkeley hills. This document could not have been written without the daily love of living in this beautiful community. The work involved in writing this cannot be separated from the sitting, chanting, cooking, cleaning, crying, correcting, fundraising, listening, laughing, and teaching of the whole community. What is optimization? What is the relationship between a computational optimization process — say, a computer program solving an optimization problem — and a physical optimization process — say, a team of humans building a house? We propose the concept of an optimizing system as a physically closed system containing both that which is being optimized and that which is doing the optimizing, and defined by a tendency to evolve from a broad basin of attraction towards a small set of target configurations despite perturbations to the system. We compare our definition to that proposed by Yudkowsky, and place our work in the context of work by Demski and Garrabrant’s Embedded Agency, and Drexler’s Comprehensive AI Services. We show that our definition resolves difficult cases proposed by Daniel Filan. We work through numerous examples of biological, computational, and simple physical systems showing how our definition relates to each. Introduction In the field of computer science, an optimization algorithm is a computer program that outputs the solution, or an approximation thereof, to an optimization problem. An optimization problem consists of an objective function to be maximized or minimized, and a feasible region within which to search for a solution. For example we might take the objective function x 2 − 2 2 as a minimization problem and the whole real number line as the feasible region. The solution then would be x √ 2 and a working optimization algorithm for this problem is one that outputs a close approximation to this value. In the field of operations research and engineering more broadly, optimization involves improving some process or physical artifact so that it is fit for a certain purpose or fulfills some set of requirements. For example, we might choose to measure a nail factory by the rate at which it outputs nails, relative to the cost of production inputs. We can view this as a kind of objective function, with the factory as the object of optimization just as the variable x was the object of optimization in the previous example. There is clearly a connection between optimizing the factory and optimizing for x, but what exactly is this connection? What is it that identifies an algorithm as an optimization algorithm? What is it that identifies a process as an optimization process? The answer proposed in this essay is: an optimizing system is a physical process in which the configuration of some part of the universe moves predictably towards a small set of target configurations from any point in a broad basin of optimization, despite perturbations during the optimization process. We do not imagine that there is some engine or agent or mind performing optimization, separately from that which is being optimized. We consider the whole system jointly — engine and object of optimization — and ask whether it exhibits a tendency to evolve towards a predictable target configuration. If so, then we call it an optimizing system. If the basin of attraction is deep and wide then we say that this is a robust optimizing system. An optimizing system as defined in this essay is known in dynamical systems theory as a dynamical system with one or more attractors. In this essay we show how this framework can help to understand optimization as manifested in physically closed systems containing both engine and object of optimization. I...
Dec 10, 2021
An overview of 11 proposals for building safe advanced AI by Evan Hubinger
01:10:34
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An overview of 11 proposals for building safe advanced AI , published by Evan Hubinger on the AI Alignment Forum. This is the blog post version of the paper by the same name. Special thanks to Kate Woolverton, Paul Christiano, Rohin Shah, Alex Turner, William Saunders, Beth Barnes, Abram Demski, Scott Garrabrant, Sam Eisenstat, and Tsvi Benson-Tilsen for providing helpful comments and feedback on this post and the talk that preceded it. This post is a collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature out there laying out various different approaches such as amplification, debate, or recursive reward modeling, but a lot of that literature focuses primarily on outer alignment at the expense of inner alignment and doesn't provide direct comparisons between approaches. The goal of this post is to help solve that problem by providing a single collection of 11 different proposals for building safe advanced AI—each including both inner and outer alignment components. That being said, not only does this post not cover all existing proposals, I strongly expect that there will be lots of additional new proposals to come in the future. Nevertheless, I think it is quite useful to at least take a broad look at what we have now and compare and contrast some of the current leading candidates. It is important for me to note before I begin that the way I describe the 11 approaches presented here is not meant to be an accurate representation of how anyone else would represent them. Rather, you should treat all the approaches I describe here as my version of that approach rather than any sort of canonical version that their various creators/proponents would endorse. Furthermore, this post only includes approaches that intend to directly build advanced AI systems via machine learning. Thus, this post doesn't include other possible approaches for solving the broader AI existential risk problem such as: finding a fundamentally different way of approaching AI than the current machine learning paradigm that makes it easier to build safe advanced AI, developing some advanced technology that produces a decisive strategic advantage without using advanced AI, or achieving global coordination around not building advanced AI via (for example) a persuasive demonstration that any advanced AI is likely to be unsafe. For each of the proposals that I consider, I will try to evaluate them on the following four basic components that I think any story for how to build safe advanced AI under the current machine learning paradigm needs. Outer alignment. Outer alignment is about asking why the objective we're training for is aligned—that is, if we actually got a model that was trying to optimize for the given loss/reward/etc., would we like that model? For a more thorough description of what I mean by outer alignment, see “Outer alignment and imitative amplification.” Inner alignment. Inner alignment is about asking the question of how our training procedure can actually guarantee that the model it produces will, in fact, be trying to accomplish the objective we trained it on. For a more rigorous treatment of this question and an explanation of why it might be a concern, see “Risks from Learned Optimization.” Training competitiveness. Competitiveness is a bit of a murky concept, so I want to break it up into two pieces here. Training competitiveness is the question of whether the given training procedure is one that a team or group of teams with a reasonable lead would be able to afford to implement without completely throwing away that lead. Thus, training competitiveness is about whether the proposed process of producing advanced AI is competitive. Performance competitiveness. Performance competitiveness, on the othe...
Dec 10, 2021
Chris Olah’s views on AGI safety by Evan Hubinger
18:57
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Chris Olah’s views on AGI safety, published by Evan Hubinger on the AI Alignment Forum. Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations. In thinking about AGI safety—and really any complex topic on which many smart people disagree—I’ve often found it very useful to build a collection of different viewpoints from people that I respect that I feel like I understand well enough to be able to think from their perspective. For example, I will often try to compare what an idea feels like when I put on my Paul Christiano hat to what it feels like when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a new hat that I’ve found extremely valuable that I also don’t think many other people in this community have, which is my Chris Olah hat. The goal of this post is to try to give that hat to more people. If you’re not familiar with him, Chris Olah leads the Clarity team at OpenAI and formerly used to work at Google Brain. Chris has been a part of many of the most exciting ML interpretability results in the last five years, including Activation Atlases, Building Blocks of Interpretability, Feature Visualization, and DeepDream. Chris was also a coauthor of “Concrete Problems in AI Safety.” He also thinks a lot about technical AGI safety and has a lot of thoughts on how ML interpretability work can play into that—thoughts which, unfortunately, haven’t really been recorded previously. So: here’s my take on Chris’s AGI safety worldview. The benefits of transparency and interpretability Since Chris primarily works on ML transparency and interpretability, the obvious first question to ask is how he imagines that sort of research aiding with AGI safety. When I was talking with him, Chris listed four distinct ways in which he thought transparency and interpretability could help, which I’ll go over in his order of importance. Catching problems with auditing First, Chris says, interpretability gives you a mulligan. Before you deploy your AI, you can throw all of your interpretability tools at it to check and see what it actually learned and make sure it learned the right thing. If it didn’t—if you find that it’s learned some sort of potentially dangerous proxy, for example—then you can throw your AI out and try again. As long as you’re in a domain where your AI isn’t actively trying to deceive your interpretability tools (via deceptive alignment, perhaps), this sort of a mulligan could help quite a lot in resolving more standard robustness problems (proxy alignment, for example). That being said, that doesn’t necessarily mean waiting until you’re on the verge of deployment to look for flaws. Ideally you’d be able to discover problems early on via an ongoing auditing process as you build more and more capable systems. One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools with...
Dec 10, 2021
Draft report on AI timelines by Ajeya Cotra
01:24
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Draft report on AI timelines, published by Ajeya Cotra on the AI Alignment Forum. Hi all, I've been working on some AI forecasting research and have prepared a draft report on timelines to transformative AI. I would love feedback from this community, so I've made the report viewable in a Google Drive folder here. With that said, most of my focus so far has been on the high-level structure of the framework, so the particular quantitative estimates are very much in flux and many input parameters aren't pinned down well -- I wrote the bulk of this report before July and have received feedback since then that I haven't fully incorporated yet. I'd prefer if people didn't share it widely in a low-bandwidth way (e.g., just posting key graphics on Facebook or Twitter) since the conclusions don't reflect Open Phil's "institutional view" yet, and there may well be some errors in the report. The report includes a quantitative model written in Python. Ought has worked with me to integrate their forecasting platform Elicit into the model so that you can see other people's forecasts for various parameters. If you have questions or feedback about the Elicit integration, feel free to reach out to elicit@ought.org. Looking forward to hearing people's thoughts! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 10, 2021
An Untrollable Mathematician Illustrated by Abram Demski
00:50
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Untrollable Mathematician Illustrated, published by Abram Demski on the AI Alignment Forum. The following was a presentation I made for Sören Elverlin's AI Safety Reading Group. I decided to draw everything by hand because powerpoint is boring. Thanks to Ben Pace for formatting it for LW! See also the IAF post detailing the research which this presentation is based on. Pingbacks 40 2018 AI Alignment Literature Review and Charity Comparison 55 Radical Probabilism 32 Embedded Agency (full-text version) 33 Thinking About Filtered Evidence Is (Very!) Hard
Dec 10, 2021
Radical Probabilism by Abram Demski
01:08:18
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Radical Probabilism , published by Abram Demski on the AI Alignment Forum. This is an expanded version of my talk. I assume a high degree of familiarity with Bayesian probability theory. Toward a New Technical Explanation of Technical Explanation -- an attempt to convey the practical implications of logical induction -- was one of my most-appreciated posts, but I don't really get the feeling that very many people have received the update. Granted, that post was speculative, sketching what a new technical explanation of technical explanation might look like. I think I can do a bit better now. If the implied project of that post had really been completed, I would expect new practical probabilistic reasoning tools, explicitly violating Bayes' law. For example, we might expect: A new version of information theory. An update to the "prediction=compression" maxim, either repairing it to incorporate the new cases, or explicitly denying it and providing a good intuitive account of why it was wrong. A new account of concepts such as mutual information, allowing for the fact that variables have behavior over thinking time; for example, variables may initially be very correlated, but lose correlation as our picture of each variable becomes more detailed. New ways of thinking about epistemology. One thing that my post did manage to do was to spell out the importance of "making advanced predictions", a facet of epistemology which Bayesian thinking does not do justice to. However, I left aspects of the problem of old evidence open, rather than giving a complete way to think about it. New probabilistic structures. Bayesian Networks are one really nice way to capture the structure of probability distributions, making them much easier to reason about. Is there anything similar for the new, wider space of probabilistic reasoning which has been opened up? Unfortunately, I still don't have any of those things to offer. The aim of this post is more humble. I think what I originally wrote was too ambitious for didactic purposes. Where the previous post aimed to communicate the insights of logical induction by sketching broad implications, I here aim to communicate the insights in themselves, focusing on the detailed differences between classical Bayesian reasoning and the new space of ways to reason. Rather than talking about logical induction directly, I'm mainly going to explain things in terms of a very similar philosophy which Richard Jeffrey invented -- apparently starting with his phd dissertation in the 50s, although I'm unable to get my hands on it or other early references to see how fleshed-out the view was at that point. He called this philosophy radical probabilism. Unlike logical induction, radical probabilism appears not to have any roots in worries about logical uncertainty or bounded rationality. Instead it appears to be motivated simply by a desire to generalize, and a refusal to accept unjustified assumptions. Nonetheless, it carries most of the same insights. Radical Probabilism has not been very concerned with computational issues, and so constructing an actual algorithm (like the logical induction algorithm) has not been a focus. (However, there have been some developments -- see historical notes at the end.) This could be seen as a weakness. However, for the purpose of communicating the core insights, I think this is a strength -- there are fewer technical details to communicate. A terminological note: I will use "radical probabilism" to refer to the new theory of rationality (treating logical induction as merely a specific way to flesh out Jeffrey's theory). I'm more conflicted about how to refer to the older theory. I'm tempted to just use the term "Bayesian", implying that the new theory is non-Bayesian -- this highlights its rejection of Bayesian updates. However...
Dec 10, 2021
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) by Andrew Critch
38:45
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) , published by Andrew Critch on the AI Alignment Forum. With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept. Thanks also for comments from Ramana Kumar. Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers. Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios. This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause. Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower. Unipolar take-offs Multipolar take-offs Slow take-offs Part 1 of this post Fast take-offs Part 2 of this post Part 1 covers a batch of stories that play out slowly (“slow take-offs”), and Part 2 stories play out quickly. However, in the end I don’t want you to be super focused how fast the technology is taking off. Instead, I’d like you to focus on multi-agent processes with a robust tendency to play out irrespective of which agents execute which steps in the process. I’ll call such processes Robust Agent-Agnostic Processes (RAAPs). A group walking toward a restaurant is a nice example of a RAAP, because it exhibits: Robustness: If you temporarily distract one of the walkers to wander off, the rest of the group will keep heading toward the restaurant, and the distracted member will take steps to rejoin the group. Agent-agnosticism: Who’s at the front or back of the group might vary considerably during the walk. People at the front will tend to take more responsibility for knowing and choosing what path to take, and people at the back will tend to just follow. Thus, the execution of roles (“leader”, “follower”) is somewhat agnostic as to which agents execute them. Interestingly, if all you want to do is get one person in the group not to go to the restaurant, sometimes it’s actually easier to achieve that by convincing the entire group not to go there than by convincing just that one person. This example could be extended to lots of situations in which agents have settled on a fragile consensus for action, in which it is strategically easier to motivate a new interpretation of the prior consensus than to pressure one agent to deviate from it. I think a similar fact may be true about some agent-agnostic processes leading to AI x-risk, in that agent-specific interventions (e.g., aligning or shutting down this or that AI system or company) will not be enough to avert the process, and might even be harder than trying to shift the structure of society as a whole. Moreover, I believe this is true in both “slow take-off” and “fast take-off” AI development scenarios This is because RAAPs can arise irrespective of the speed of the underlying “host” agents. RAAPs are made more or less likely to arise based on the “structure” of a given interaction. As such, the problem of avoiding the emergence of unsafe RAAPs, or ensuring the emergence of safe ones, is a problem of mechanism design (wiki/Mechanism_design). I recently learned that in sociology, the concept of a field (martin2003field, fligsteinmcadam2012fields) is roughly defined as a social space or arena in which the motivation and behavior of agents are explained through reference to surrounding processes or “structure” rather than freedom or chance. In my parlance, mechanisms cause fields, and fields cause RAAPs. Meta / prefac...
Dec 10, 2021
Utility Maximization = Description Length Minimization by johnswentworth
15:22
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility Maximization = Description Length Minimization, published by johnswentworth on the AI Alignment Forum. There’s a useful intuitive notion of “optimization” as pushing the world into a small set of states, starting from any of a large number of states. Visually: Yudkowsky and Flint both have notable formalizations of this “optimization as compression” idea. This post presents a formalization of optimization-as-compression grounded in information theory. Specifically: to “optimize” a system is to reduce the number of bits required to represent the system state using a particular encoding. In other words, “optimizing” a system means making it compressible (in the information-theoretic sense) by a particular model. This formalization turns out to be equivalent to expected utility maximization, and allows us to interpret any expected utility maximizer as “trying to make the world look like a particular model”. Conceptual Example: Building A House Before diving into the formalism, we’ll walk through a conceptual example, taken directly from Flint’s Ground of Optimization: building a house. Here’s Flint’s diagram: The key idea here is that there’s a wide variety of initial states (piles of lumber, etc) which all end up in the same target configuration set (finished house). The “perturbation” indicates that the initial state could change to some other state - e.g. someone could move all the lumber ten feet to the left - and we’d still end up with the house. In terms of information-theoretic compression: we could imagine a model which says there is probably a house. Efficiently encoding samples from this model will mean using shorter bit-strings for world-states with a house, and longer bit-strings for world-states without a house. World-states with piles of lumber will therefore generally require more bits than world-states with a house. By turning the piles of lumber into a house, we reduce the number of bits required to represent the world-state using this particular encoding/model. If that seems kind of trivial and obvious, then you’ve probably understood the idea; later sections will talk about how it ties into other things. If not, then the next section is probably for you. Background Concepts From Information Theory The basic motivating idea of information theory is that we can represent information using fewer bits, on average, if we use shorter representations for states which occur more often. For instance, Morse code uses only a single bit (“.”) to represent the letter “e”, but four bits (“- - . -”) to represent “q”. This creates a strong connection between probabilistic models/distributions and optimal codes: a code which requires minimal average bits for one distribution (e.g. with lots of e’s and few q’s) will not be optimal for another distribution (e.g. with few e’s and lots of q’s). For any random variable X generated by a probabilistic model M , we can compute the minimum average number of bits required to represent X . This is Shannon’s famous entropy formula − ∑ X P X M log P X M Assuming we’re using an optimal encoding for model M , the number of bits used to encode a particular value x is log P X x M . (Note that this is sometimes not an integer! Today we have algorithms which encode many samples at once, potentially even from different models/distributions, to achieve asymptotically minimal bit-usage. The “rounding error” only happens once for the whole collection of samples, so as the number of samples grows, the rounding error per sample goes to zero.) Of course, we could be wrong about the distribution - we could use a code optimized for a model M 2 which is different from the “true” model M 1 . In this case, the average number of bits used will be − ∑ X P X M 1 log P X M 2 E log P X M 2 M 1 In this post, we’ll use a “wrong” model M 2 intentio...
Dec 10, 2021
Risks from Learned Optimization: Introduction by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant
18:54
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Risks from Learned Optimization: Introduction , published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant on the AI Alignment Forum. This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence. Motivation The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning. Two questions In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective. Whether a syste...
Dec 10, 2021
Matt Botvinick on the spontaneous emergence of learning algorithms by Adam Scholl
07:11
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Matt Botvinick on the spontaneous emergence of learning algorithms , published by Adam Scholl on the AI Alignment Forum. Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview, he discusses results from a 2018 paper which describe conditions under which reinforcement learning algorithms will spontaneously give rise to separate full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper: Initial Observation At some point, a group of DeepMind researchers in Botvinick’s group noticed that when they trained a RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of... emerges, out of, like out of thin air”: "What happens... it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm.” Other versions of this basic architecture—e.g., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed "meta-RL." So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1) have memory, 2) whose weights are trained by a RL algorithm, 3) on a sequence of similar input data. From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms: “...it's something that just happens. In a sense, you can't avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can't stop it." Search for Biological Analogue This system reminded some of the neuroscientists in Botvinick’s group of features observed in brains. For example, like RNNs, the human prefrontal cortex (PFC) is highly recurrent, and the RL and RNN memory systems in their meta-RL model reminded them of “synaptic memory” and “activity-based memory.” They decided to look for evidence of meta-RL occuring in brains, since finding a neural analogue of the technique would provide some evidence they were on the right track, i.e. that the technique might scale to solving highly complex tasks. They think they found one. In short, they think that part of the dopamine system (DA) is a full-fledged reinforcement learning algorithm, which trains/gives rise to another full-fledged, free-standing reinforcement learning algorithm in PFC, in basically the same way (and for the same reason) the RL-trained RNNs spawned separate learning algorithms in their experiments. As I understand it, their story goes as follows: The PFC, along with the bits of basal ganglia and thalamic nuclei it connects to, forms a RNN. Its inputs are sensory percepts, and information about past actions and rewards. Its outputs are actions, and estimates of state value. DA[1] is a RL algorithm that feeds reward prediction error to PFC. Historically, people assumed the purpose of sending this prediction error was to update PFC’s synaptic weights. Wang et al. agree that this happens, but argue that the principle purpose of sending prediction error is to cause the creation of “a second RL algorithm, implemented entirely in the prefrontal network’s acti...
Dec 10, 2021
the scaling "inconsistency": openAI’s new insight by nostalgebraist
15:19
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: the scaling "inconsistency": openAI’s new insight, published by nostalgebraist on the AI Alignment Forum. I’ve now read the new OpenAI scaling laws paper. Also, yesterday I attended a fun and informative lecture/discussion with one of the authors. While the topic is on my mind, I should probably jot down some of my thoughts. This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper. The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance. Most of this post is an attempt to explain and better understand this argument. The new paper is mainly about extending the scaling laws from their earlier paper to new modalities. In that paper, they found scaling laws for transformers trained autoregressively on text data. The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc. So the laws aren’t telling us something about the distribution of text data, but about something more fundamental. That’s cool. They also have a new, very intuitive hypothesis for what’s going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time. So that’s the part I’m most excited to discuss. I’m going to give a long explanation of it, way longer than the relevant part of their paper. Some of this is original to me, all errors are mine, all the usual caveats. 1. L(C) and L(D) To recap: the “inconsistency” is between two scaling laws: The law for the best you can do, given a fixed compute budget. This is L(C), sometimes called L(C_min). L is the loss (lower = better), C is your compute budget. The law for the best you can do, given a fixed dataset size. This is L(D), where D is the number of examples (say, tokens) in the dataset. Once you reach a certain level of compute, these two laws contradict each other. I’ll take some time to unpack that here, as it’s not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data. 2. C sets E, and E bounds D Budget tradeoffs Given a compute budget C, you can derive the optimal way to spend it on different things. Roughly, you are trading off between two ways to spend compute: Use C to buy “N”: Training a bigger model – “N” here is model size Use C to buy “S”: Training for more steps “S” (gradient updates) The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons. From step count to update count For one thing, each single “step” is an update on the information from more than one data point. Specifically, a step updates on “B” different points – B is the batch size. So the total number of data points processed during training is B times S. The papers sometimes call this quantity “E” (number of examples), so I’ll call it that too. From update count to data count Now, when you train an ML model, you usually update on each data point more than once. Typically, you’ll do one pass over the full dataset (updating on each point as you go along), then you’ll go back and do a second full pass, and then a third, etc. These passes are called “epochs.” If you’re doing things this way, then for every point in the data, you get (number of epochs) updates out of it. So E = (number of epochs) D. Some training routines don’t visit every point the exact same number of times – there’s nothing forcing you to do that. Still, for any training procedure, we can look at the quantity E / D. This would be the number of epochs, if you’re doing epochs. For a generic training routine, you can can think of E / D as the “effecti...
Dec 10, 2021
Introduction to Cartesian Frames by Scott Garrabrant
28:04
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introduction to Cartesian Frames, published by Scott Garrabrant on the AI Alignment Forum. This is the first post in a sequence on Cartesian frames, a new way of modeling agency that has recently shaped my thinking a lot. Traditional models of agency have some problems, like: They treat the "agent" and "environment" as primitives with a simple, stable input-output relation. (See "Embedded Agency.") They assume a particular way of carving up the world into variables, and don't allow for switching between different carvings or different levels of description. Cartesian frames are a way to add a first-person perspective (with choices, uncertainty, etc.) on top of a third-person "here is the set of all possible worlds," in such a way that many of these problems either disappear or become easier to address. The idea of Cartesian frames is that we take as our basic building block a binary function which combines a choice from the agent with a choice from the environment to produce a world history. We don't think of the agent as having inputs and outputs, and we don't assume that the agent is an object persisting over time. Instead, we only think about a set of possible choices of the agent, a set of possible environments, and a function that encodes what happens when we combine these two. This basic object is called a Cartesian frame. As with dualistic agents, we are given a way to separate out an “agent” from an “environment." But rather than being a basic feature of the world, this is a “frame” — a particular way of conceptually carving up the world. We will use the combinatorial properties of a given Cartesian frame to derive versions of inputs, outputs and time. One goal here is that by making these notions derived rather than basic, we can make them more amenable to approximation and thus less dependent on exactly how one draws the Cartesian boundary. Cartesian frames also make it much more natural to think about the world at multiple levels of description, and to model agents as having subagents. Mathematically, Cartesian frames are exactly Chu spaces. I give them a new name because of my specific interpretation about agency, which also highlights different mathematical questions. Using Chu spaces, we can express many different relationships between Cartesian frames. For example, given two agents, we could talk about their sum ( ⊕ ), which can choose from any of the choices available to either agent, or we could talk about their tensor ( ⊗ ), which can accomplish anything that the two agents could accomplish together as a team. Cartesian frames also have duals ( − ∗ ) which you can get by swapping the agent with the environment, and ⊕ and ⊗ have De Morgan duals ( and ⅋ respectively), which represent taking a sum or tensor of the environments. The category also has an internal hom, ⊸ , where C ⊸ D can be thought of as " D with a C -shaped hole in it." These operations are very directly analogous to those used in linear logic. 1. Definition Let W be a set of possible worlds. A Cartesian frame C over W is a triple C A E ⋅ , where A represents a set of possible ways the agent can be, E represents a set of possible ways the environment can be, and ⋅ A × E → W is an evaluation function that returns a possible world given an element of A and an element of E We will refer to A as the agent, the elements of A as possible agents, E as the environment, the elements of E as possible environments, W as the world, and elements of W as possible worlds. Definition: A Cartesian frame C over a set W is a triple A E ⋅ , where A and E are sets and ⋅ A × E → W . If C A E ⋅ is a Cartesian frame over W , we say Agent C A Env C E World C W , and Eval C ⋅ A finite Cartesian frame is easily visualized as a matrix, where the rows of the matrix represent possible agents, the columns of the matr...
Dec 10, 2021
My research methodology by Paul Christiano
23:57
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My research methodology, published by Paul Christiano on the AI Alignment Forum. (Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.) I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post). Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up? But I feel like we should be possible to avoid egregious misalignment regardless of how the empirical facts shake out--it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment. This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier: We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice. We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases. We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases. I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable. What this looks like (3 examples) My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.” Example 1: human feedback In an unaligned benchmark I describe a simple AI training algorithm: Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions. We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations. At test time the AI searches for plans that lead to trajectories that look good to humans. In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment: Our generative model understands reality better than human evaluators. There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans. It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos. The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction. I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time. But from my perspective this failure mode is at least plausible — I don’t see any contradictions between this sequence of events and anyth...
Dec 10, 2021
Fun with +12 OOMs of Compute by Daniel Kokotajlo
29:34
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fun with +12 OOMs of Compute, published by Daniel Kokotajlo on the AI Alignment Forum. Or: Big Timelines Crux Operationalized What fun things could one build with +12 orders of magnitude of compute? By ‘fun’ I mean ‘powerful.’ This hypothetical is highly relevant to AI timelines, for reasons I’ll explain later. Summary (Spoilers): I describe a hypothetical scenario that concretizes the question “what could be built with 2020’s algorithms/ideas/etc. but a trillion times more compute?” Then I give some answers to that question. Then I ask: How likely is it that some sort of TAI would happen in this scenario? This second question is a useful operationalization of the (IMO) most important, most-commonly-discussed timelines crux: “Can we get TAI just by throwing more compute at the problem?” I consider this operationalization to be the main contribution of this post; it directly plugs into Ajeya’s timelines model and is quantitatively more cruxy than anything else I know of. The secondary contribution of this post is my set of answers to the first question: They serve as intuition pumps for my answer to the second, which strongly supports my views on timelines. The hypothetical In 2016 the Compute Fairy visits Earth and bestows a blessing: Computers are magically 12 orders of magnitude faster! Over the next five years, what happens? The Deep Learning AI Boom still happens, only much crazier: Instead of making AlphaStar for 10^23 floating point operations, DeepMind makes something for 10^35. Instead of making GPT-3 for 10^23 FLOPs, OpenAI makes something for 10^35. Instead of industry and academia making a cornucopia of things for 10^20 FLOPs or so, they make a cornucopia of things for 10^32 FLOPs or so. When random grad students and hackers spin up neural nets on their laptops, they have a trillion times more compute to work with. [EDIT: Also assume magic +12 OOMs of memory, bandwidth, etc. All the ingredients of compute.] For context on how big a deal +12 OOMs is, consider the graph below, from ARK. It’s measuring petaflop-days, which are about 10^20 FLOP each. So 10^35 FLOP is 1e+15 on this graph. GPT-3 and AlphaStar are not on this graph, but if they were they would be in the very top-right corner. Question One: In this hypothetical, what sorts of things could AI projects build? I encourage you to stop reading, set a five-minute timer, and think about fun things that could be built in this scenario. I’d love it if you wrote up your answers in the comments! My tentative answers: Below are my answers, listed in rough order of how ‘fun’ they seem to me. I’m not an AI scientist so I expect my answers to overestimate what could be done in some ways, and underestimate in other ways. Imagine that each entry is the best version of itself, since it is built by experts (who have experience with smaller-scale versions) rather than by me. OmegaStar: In our timeline, it cost about 10^23 FLOP to train AlphaStar. (OpenAI Five, which is in some ways more impressive, took less!) Let’s make OmegaStar like AlphaStar only +7 OOMs bigger: the size of a human brain.[1] [EDIT: You may be surprised to learn, as I was, that AlphaStar has about 10% as many parameters as a honeybee has synapses! Playing against it is like playing against a tiny game-playing insect.] Larger models seem to take less data to reach the same level of performance, so it would probably take at most 10^30 FLOP to reach the same level of Starcraft performance as AlphaStar, and indeed we should expect it to be qualitatively better.[2] So let’s do that, but also train it on lots of other games too.[3] There are 30,000 games in the Steam Library. We train OmegaStar long enough that it has as much time on each game as AlphaStar had on Starcraft. With a brain so big, maybe it’ll start to do some transfer learning, acquiring g...
Dec 10, 2021
Seeking Power is Often Convergently Instrumental in MDPs by Paul Christiano
23:52
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Seeking Power is Often Convergently Instrumental in MDPs, published by Paul Christiano on the AI Alignment Forum. (Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.) I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post). Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up? But I feel like we should be possible to avoid egregious misalignment regardless of how the empirical facts shake out--it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment. This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier: We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice. We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases. We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases. I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable. What this looks like (3 examples) My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.” Example 1: human feedback In an unaligned benchmark I describe a simple AI training algorithm: Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions. We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations. At test time the AI searches for plans that lead to trajectories that look good to humans. In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment: Our generative model understands reality better than human evaluators. There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans. It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos. The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction. I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time. But from my perspective this failure mode is at least plausible — I don’t see any contradictions between ...
Dec 10, 2021
The Solomonoff Prior is Malign by Mark Xu
28:35
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Solomonoff Prior is Malign, published by Mark Xu on the AI Alignment Forum. Write a Review This argument came to my attention from this post by Paul Christiano. I also found this clarification helpful. I found these counter-arguments stimulating and have included some discussion of them. Very little of this content is original. My contributions consist of fleshing out arguments and constructing examples. Thank you to Beth Barnes and Thomas Kwa for helpful discussion and comments. What is the Solomonoff prior? The Solomonoff prior is intended to answer the question "what is the probability of X?" for any X, where X is a finite string over some finite alphabet. The Solomonoff prior is defined by taking the set of all Turing machines (TMs) which output strings when run with no input and weighting them proportional to 2 − K , where K is the description length of the TM (informally its size in bits). The Solomonoff prior says the probability of a string is the sum over all the weights of all TMs that print that string. One reason to care about the Solomonoff prior is that we can use it to do a form of idealized induction. If you have seen 0101 and want to predict the next bit, you can use the Solomonoff prior to get the probability of 01010 and 01011. Normalizing gives you the chances of seeing 1 versus 0, conditioned on seeing 0101. In general, any process that assigns probabilities to all strings in a consistent way can be used to do induction in this way. This post provides more information about Solomonoff Induction. Why is it malign? Imagine that you wrote a programming language called python^10 that works as follows: First, it takes all alpha-numeric chars that are not in literals and checks if they're repeated 10 times sequentially. If they're not, they get deleted. If they are, they get replaced by a single copy. Second, it runs this new program through a python interpreter. Hello world in python^10: ppppppppprrrrrrrrrriiiiiiiiiinnnnnnnnnntttttttttt('Hello, world!') Luckily, python has an exec function that executes literals as code. This lets us write a shorter hello world: eeeeeeeeexxxxxxxxxxeeeeeeeeeecccccccccc("print('Hello, world!')") It's probably easy to see that for nearly every program, the shortest way to write it in python^10 is to write it in python and run it with exec. If we didn't have exec, for sufficiently complicated programs, the shortest way to write them would be to specify an interpreter for a different language in python^10 and write it in that language instead. As this example shows, the answer to "what's the shortest program that does X?" might involve using some roundabout method (in this case we used exec). If python^10 has some security properties that python didn't have, then the shortest program in python^10 that accomplished any given task would not have these security properties because they would all pass through exec. In general, if you can access alternative ‘modes’ (in this case python), the shortest programs that output any given string might go through one of those modes, possibly introducing malign behavior. Let's say that I'm trying to predict what a human types next using the Solomonoff prior. Many programs predict the human: Simulate the human and their local surroundings. Run the simulation forward and check what gets typed. Simulate the entire Earth. Run the simulation forward and check what that particular human types. Simulate the entire universe from the beginning of time. Run the simulation forward and check what that particular human types. Simulate an entirely different universe that has reason to simulate this universe. Output what the human types in the simulation of our universe. Which one is the simplest? One property of the Solmonoff prior is that it doesn't care about how long the TMs take to run, only ho...
Dec 10, 2021
2020 AI Alignment Literature Review and Charity Comparison by Larks
02:12:57
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2020 AI Alignment Literature Review and Charity Comparison, published by Larks on the AI Alignment Forum. Write a Review cross-posted to the EA forum here. Introduction As in 2016, 2017, 2018, and 2019, I have attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regards to possible investments. My aim is basically to judge the output of each organisation in 2020 and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2020 budgets to get a sense of urgency. I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I have overlooked or misconstrued. As ever I am painfully aware of the various corners I have had to cut due to time constraints from my job, as well as being distracted by 1) other projects, 2) the miracle of life and 3) computer games. This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, who both do a lot of work on other issues which I attempt to cover but only very cursorily. How to read this document This document is fairly extensive, and some parts (particularly the methodology section) are largely the same as last year, so I don’t recommend reading from start to finish. Instead, I recommend navigating to the sections of most interest to you. If you are interested in a specific research organisation, you can use the table of contents to navigate to the appropriate section. You might then also want to Ctrl+F for the organisation acronym in case they are mentioned elsewhere as well. Papers listed as ‘X researchers contributed to the following research lead by other organisations’ are included in the section corresponding to their first author and you can Cntrl+F to find them. If you are interested in a specific topic, I have added a tag to each paper, so you can Ctrl+F for a tag to find associated work. The tags were chosen somewhat informally so you might want to search more than one, especially as a piece might seem to fit in multiple categories. Here are the un-scientifically-chosen hashtags: AgentFoundations Amplification Capabilities Corrigibility DecisionTheory Ethics Forecasting GPT-3 IRL Misc NearAI OtherXrisk Overview Politics RL Strategy Textbook Transparency ValueLearning New to Artificial Intelligence as an existential risk? If you are new to the idea of General Artificial Intelligence as presenting a major risk to the survival of human value, I recommend this Vox piece by Kelsey Piper, or for a more technical version this by Richard Ngo. If you are already convinced and are interested in contributing technically, I recommend this piece by Jacob Steinheart, as unlike this document Jacob covers pre-2019 research and organises by topic, not organisation, or this from Critch & Krueger, or this from Everitt et al, though it is a few years old now Research Organisations FHI: The Future of Humanity Institute FHI is an Oxford-based Existential Risk Research organisation founded in 2005 by Nick Bostrom. They are affiliated with Oxford University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach. Their research can be found here. Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work - as well as work on other Xrisks. They run a Research Scholars Program, where people can join them to do research at FHI. There is...
Dec 10, 2021
Inner Alignment: Explain like I'm 12 Edition by Rafael Harth
19:56
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Alignment: Explain like I'm 12 Edition , published by Rafael Harth on the AI Alignment Forum. (This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical to read.) Note that bold and italics means "this is a new term I'm introducing," whereas underline and italics is used for emphasis. What is Inner Alignment? Let's start with an abridged guide to how Machine Learning works: Choose a problem Decide on a space of possible solutions Find a good solution from that space If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all pixels to the set yes, no ) defines one solution. We call each such solution a model. The space of possible models is depicted below. Since that's all possible models, most of them are utter nonsense. Pick a random one, and you're as likely to end up with a car-recognizer than a cat-recognizer – but far more likely with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats." How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over 10 1000000 candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works: SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is "close" and hopefully a little better. Eventually, it stops and outputs the most recent model. 1 Note that, in the example above, we don't end up with the perfect cat-recognizer (the red box) but with something close to it – perhaps a model that looks for cats but has some unintended quirks. SGD does generally not guarantee optimality. The speech bubbles where the models explain what they're doing are annotations for the reader. From the perspective of the programmer, it looks like this: The programmer has no idea what the models are doing. Each model is just a black box. 2 A necessary component for SGD is the ability to measure a model's performance, but this happens while treating them as black boxes. In the cat example, assume the programmer has a bunch of images that are accurately labeled as "contains cat" and "doesn't contain cat." (These images are called the training data and the setting is called supervised learning.) SGD tests how well each model does on these images and, in each step, chooses one that does better. In other settings, performance might be measured in different ways, but the principle remains the same. Now, suppose that the images we have happen to include only white cats. In this case, SGD might choose a model implementing the rule "output yes if there is something white and with four legs." The programmer would not notice anything strange – all she sees is that the model output by SGD does well on the training data. In this setting, there is thus only a problem if our way of obtaining feedback is flawed. If it is perfect – if the pictures with cats are perfectly representative of what images-with-cats are like, and the pictures without cats are perfectly representative of what images-without-cats are like, then there isn't an issue. Conversely, if our images-with-cats are non-repres...
Dec 10, 2021
Evolution of Modularity by johnswentworth
03:57
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evolution of Modularity, published by johnswentworth on the AI Alignment Forum. Write a Review This post is based on chapter 15 of Uri Alon’s book An Introduction to Systems Biology: Design Principles of Biological Circuits. See the book for more details and citations; see here for a review of most of the rest of the book. Fun fact: biological systems are highly modular, at multiple different scales. This can be quantified and verified statistically, e.g. by mapping out protein networks and algorithmically partitioning them into parts, then comparing the connectivity of the parts. It can also be seen more qualitatively in everyday biological work: proteins have subunits which retain their function when fused to other proteins, receptor circuits can be swapped out to make bacteria follow different chemical gradients, manipulating specific genes can turn a fly’s antennae into legs, organs perform specific functions, etc, etc. On the other hand, systems designed by genetic algorithms (aka simulated evolution) are decidedly not modular. This can also be quantified and verified statistically. Qualitatively, examining the outputs of genetic algorithms confirms the statistics: they’re a mess. So: what is the difference between real-world biological evolution vs typical genetic algorithms, which leads one to produce modular designs and the other to produce non-modular designs? Kashtan & Alon tackle the problem by evolving logic circuits under various conditions. They confirm that simply optimizing the circuit to compute a particular function, with random inputs used for selection, results in highly non-modular circuits. However, they are able to obtain modular circuits using “modularly varying goals” (MVG). The idea is to change the reward function every so often (the authors switch it out every 20 generations). Of course, if we just use completely random reward functions, then evolution doesn’t learn anything. Instead, we use “modularly varying” goal functions: we only swap one or two little pieces in the (modular) objective function. An example from the book: The upshot is that our different goal functions generally use similar sub-functions - suggesting that they share sub-goals for evolution to learn. Sure enough, circuits evolved using MVG have modular structure, reflecting the modular structure of the goals. (Interestingly, MVG also dramatically accelerates evolution - circuits reach a given performance level much faster under MVG than under a fixed goal, despite needing to change behavior every 20 generations. See either the book or the paper for more on that.) How realistic is MVG as a model for biological evolution? I haven’t seen quantitative evidence, but qualitative evidence is easy to spot. MVG as a theory of biological modularity predicts that highly variable subgoals will result in modular structure, whereas static subgoals will result in a non-modular mess. Alon’s book gives several examples: Chemotaxis: different bacteria need to pursue/avoid different chemicals, with different computational needs and different speed/energy trade-offs, in various combinations. The result is modularity: separate components for sensing, processing and motion. Animals need to breathe, eat, move, and reproduce. A new environment might have different food or require different motions, independent of respiration or reproduction - or vice versa. Since these requirements vary more-or-less independently in the environment, animals evolve modular systems to deal with them: digestive tract, lungs, etc. Ribosomes, as an anti-example: the functional requirements of a ribosome hardly vary at all, so they end up non-modular. They have pieces, but most pieces do not have an obvious distinct function. To sum it up: modularity in the system evolves to match modularity in the environment. Than...
Dec 10, 2021
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" by Rob Bensinger
39:40
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models", published by Rob Bensinger on the AI Alignment Forum. Below, I’ve copied comments left by MIRI researchers Eliezer Yudkowsky and Evan Hubinger on March 1–3 on a draft of Ajeya Cotra’s "Case for Aligning Narrowly Superhuman Models." I've included back-and-forths with Cotra, and interjections by me and Rohin Shah. The section divisions below correspond to the sections in Cotra's post. 0. Introduction How can we train GPT-3 to give “the best health advice it can give” using demonstrations and/or feedback from humans who may in some sense “understand less” about what to do when you’re sick than GPT-3 does? Eliezer Yudkowsky: I've had some related conversations with Nick Beckstead. I'd be hopeful about this line of work primarily because I think it points to a bigger problem with the inscrutable matrices of floating-point numbers, namely, we have no idea what the hell GPT-3 is thinking and cannot tell it to think anything else. GPT-3 has a great store of medical knowledge, but we do not know where that medical knowledge is; we do not know how to tell it to internally apply its medical knowledge rather than applying other cognitive patterns it has stored. If this is still the state of opacity of AGI come superhuman capabilities, we are all immediately dead. So I would be relatively more hopeful about any avenue of attack for this problem that used anything other than an end-to-end black box - anything that started to address, "Well, this system clearly has a bunch of medical knowledge internally, can we find that knowledge and cause it to actually be applied" rather than "What external forces can we apply to this solid black box to make it think more about healthcare?" Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above. Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff. Ajeya Cotra: I've edited the description of the challenge to emphasize human feedback less. It now reads "How can we get GPT-3 to give “the best health advice it can give” when humans in some sense “understand less” about what to do when you’re sick than GPT-3 does? And in that regime, how can we even tell/verify that it’s “doing the best it can”?" Rob Bensinger: Nate and I tend to talk about "understandability" instead of "transparency" exactly because we don't want to sound like we're talking about normal ML transparency work. Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability. Ajeya Cotra: Thanks all -- I like the project of trying to come up with a good handle for the kind of language model transparency we're excited about (and have talked to Nick, Evan, etc about it too) but I think I don't want to push it in this blog post right now because I haven't hit on something I believe in and I want to ship this. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains. Eliezer Yudkowsky: (I think you want an AGI that is superhuman in engineering domains and infrahuman in human-modeling-and-manipulation if such a thing is at all possible.) Ajeya Cotra: Fair point, added a footnote: “Though if we cou...
Dec 10, 2021
EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised by gwern
02:21
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised , published by gwern on the AI Alignment Forum. This is a linkpost for "Mastering Atari Games with Limited Data", Ye et al 2021: Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample efficient image-based RL algorithms; however, consistent human-level performance on the Atari game benchmark remains an elusive goal. We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero. Our method achieves 190.4% mean human performance and 116.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience and outperforms the state SAC in some tasks on the DMControl 100k benchmark. This is the first time an algorithm achieves super-human performance on Atari games with such little data. EfficientZero's performance is also close to DQN's performance at 200 million frames while we consume 500 times less data. EfficientZero's low sample complexity and high performance can bring RL closer to real-world applicability. We implement our algorithm in an easy-to-understand manner and it is available at this https URL. We hope it will accelerate the research of MCTS-based RL algorithms in the wider community. This work is supported by the Ministry of Science and Technology of the People’s Republic of China, the 2030 Innovation Megaprojects “Program on New Generation Artificial Intelligence” (Grant No. 2021AAA0150000). Some have said that poor sample-efficiency on ALE has been a reason to downplay DRL progress or implications. The primary boost in EfficientZero (table 3), pushing it past the human benchmark, is some simple self-supervised learning (SimSiam on predicted vs actual observations). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 10, 2021
Understanding “Deep Double Descent” by Evan Hubinger
09:04
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding “Deep Double Descent”, published by Evan Hubinger on the AI Alignment Forum. If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts. Prior work The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon: Belkin et al. describe their hypothesis for what's happening as follows: All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam’s razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers. I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through. I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits being malign suggest that such strong simplicity and speed priors could...
Dec 10, 2021
Can you control the past? by Joe Carlsmith
01:18:05
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can you control the past?, published by Joe Carlsmith on the AI Alignment Forum. (Cross-posted from Hands and Cities. Lots of stuff familiar to LessWrong folks interested in decision theory.) I think that you can “control” events you have no causal interaction with, including events in the past, and that this is a wild and disorienting fact, with uncertain but possibly significant implications. This post attempts to impart such disorientation. My main example is a prisoner’s dilemma between perfect deterministic software twins, exposed to the exact same inputs. This example that shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own. This, I claim, is extremely weird. My topic, more broadly, is the implications of this weirdness for the theory of instrumental rationality (“decision theory”). Many philosophers, and many parts of common sense, favor causal decision theory (CDT), on which, roughly, you should pick the action that causes the best outcomes in expectation. I think that deterministic twins, along with other examples, show that CDT is wrong. And I don’t think that uncertainty about “who are you,” or “where your algorithm is,” can save it. Granted that CDT is wrong, though, I’m not sure what’s right. The most famous alternative is evidential decision theory (EDT), on which, roughly, you should choose the action you would be happiest to learn you had chosen. I think that EDT is more attractive (and more confusing) than many philosophers give it credit for, and that some putative counterexamples don’t withstand scrutiny. But EDT has problems, too. In particular, I suspect that attractive versions of EDT (and perhaps, attractive attempts to recapture the spirit of CDT) require something in the vicinity of “following the policy that you would’ve wanted yourself to commit to, from some epistemic position that ‘forgets’ information you now know.” I don’t think that the most immediate objection to this – namely, that it implies choosing lower pay-offs even when you know them with certainty – is decisive (though some debates in this vicinity seem to me verbal). But it also seems extremely unclear what epistemic position you should evaluate policies from, and what policy such a position actually implies. Overall, rejecting the common-sense comforts of CDT, and accepting the possibility of some kind of “acausal control,” leaves us in strange and uncertain territory. I think we should do it anyway. But we should also tread carefully. I. Grandpappy Omega Decision theorists often assume that instrumental rationality is about maximizing expected utility in some sense. The question is: what sense? The most famous debate is between CDT and EDT. CDT chooses the action that will have the best effects. EDT chooses the action whose performance would be the best news. More specifically: CDT and EDT disagree about the type of “if” to use when evaluating the utility to expect, if you do X. CDT uses a counterfactual type of “if” — one that holds fixed the probability of everything outside of action X’s causal influence, then plays out the consequences of doing X. In this sense, it doesn’t allow your choice to serve as “evidence” about anything you can’t cause — even when your choice is such evidence. EDT, by contrast, uses a conditional “if.” That is, to evaluate X, it updates your overall picture of the world to reflect the assumption that action X has been been performed, and then sees how good the world looks in expectation. In this sense, it takes all the evidence into account, including the evidence that your having done X would provide. To see what this difference looks like in action, consider: Newcomb’s problem: You face two boxes: a transparent box, containing a tho...
Dec 10, 2021
Developmental Stages of GPTs by orthonormal
11:13
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Developmental Stages of GPTs, published by orthonormal on the AI Alignment Forum. Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don't work for OpenAI or a similar lab). But I think my thesis is original. Related: Gwern on GPT-3 For the last several years, I've gone around saying that I'm worried about transformative AI, an AI capable of making an Industrial Revolution sized impact (the concept is agnostic on whether it has to be AGI or self-improving), because I think we might be one or two cognitive breakthroughs away from building one. GPT-3 has made me move up my timelines, because it makes me think we might need zero more cognitive breakthroughs, just more refinement / efficiency / computing power: basically, GPT-6 or GPT-7 might do it. My reason for thinking this is comparing GPT-3 to GPT-2, and reflecting on what the differences say about the "missing pieces" for transformative AI. My Thesis: The difference between GPT-2 and GPT-3 has made me suspect that there's a legitimate comparison to be made between the scale of a network architecture like the GPTs, and some analogue of "developmental stages" of the resulting network. Furthermore, it's plausible to me that the functions needed to be a transformative AI are covered by a moderate number of such developmental stages, without requiring additional structure. Thus GPT-N would be a transformative AI, for some not-too-large N, and we need to redouble our efforts on ways to align such AIs. The thesis doesn't strongly imply that we'll reach transformative AI via GPT-N especially soon; I have wide uncertainty, even given the thesis, about how large we should expect N to be, and whether the scaling of training and of computation slows down progress before then. But it's also plausible to me now that the timeline is only a few years, and that no fundamentally different approach will succeed before then. And that scares me. Architecture and Scaling GPT, GPT-2, and GPT-3 use nearly the same architecture; each paper says as much, with a sentence or two about minor improvements to the individual transformers. Model size (and the amount of training computation) is really the only difference. GPT took 1 petaflop/s-day to train 117M parameters, GPT-2 took 10 petaflop/s-days to train 1.5B parameters, and the largest version of GPT-3 took 3,000 petaflop/s-days to train 175B parameters. By contrast, AlphaStar seems to have taken about 30,000 petaflop/s-days of training in mid-2019, so the pace of AI research computing power projects that there should be about 10x that today. The upshot is that OpenAI may not be able to afford it, but if Google really wanted to make GPT-4 this year, they could afford to do so. Analogues to Developmental Stages There are all sorts of (more or less well-defined) developmental stages for human beings: image tracking, object permanence, vocabulary and grammar, theory of mind, size and volume, emotional awareness, executive functioning, et cetera. I was first reminded of developmental stages a few years ago, when I saw the layers of abstraction generated in this feature visualization tool for GoogLeNet. We don't have feature visualization for language models, but we do have generative outputs. And as you scale up an architecture like GPT, you see higher levels of abstraction. Grammar gets mastered, then content (removing absurd but grammatical responses), then tone (first rough genre, then spookily accurate authorial voice). Topic coherence is mastered first on the phrase level, then the sentence level, then the paragraph level. So too with narrative flow. Gwern's poetry experiments (GPT-2, GPT-3) are good examples. GPT-2 could more or less continue the meter of a poem and use words that fit the existing theme, but even...
Dec 10, 2021
My computational framework for the brain by Steve Byrnes
23:52
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My computational framework for the brain, published by Steve Byrnes on the AI Alignment Forum. By now I've written a bunch of blog posts on brain architecture and algorithms, not in any particular order and generally interspersed with long digressions into Artificial General Intelligence. Here I want to summarize my key ideas in one place, to create a slightly better entry point, and something I can refer back to in certain future posts that I'm planning. If you've read every single one of my previous posts (hi mom!), there's not much new here. In this post, I'm trying to paint a picture. I'm not really trying to justify it, let alone prove it. The justification ultimately has to be: All the pieces are biologically, computationally, and evolutionarily plausible, and the pieces work together to explain absolutely everything known about human psychology and neuroscience. (I believe it! Try me!) Needless to say, I could be wrong in both the big picture and the details (or missing big things). If so, writing this out will hopefully make my wrongness easier to discover! Pretty much everything I say here and its opposite can be found in the cognitive neuroscience literature. (It's a controversial field!) I make no pretense to originality (with one exception noted below), but can't be bothered to put in actual references. My previous posts have a bit more background, or just ask me if you're interested. :-P So let's start in on the 7 guiding principles for how I think about the brain: 1. Two subsystems: "Neocortex" and "Subcortex" This is the starting point. I think it's absolutely critical. The brain consists of two subsystems. The neocortex is the home of "human intelligence" as we would recognize it—our beliefs, goals, ability to plan and learn and understand, every aspect of our conscious awareness, etc. etc. (All mammals have a neocortex; birds and lizards have an homologous and functionally-equivalent structure called the "pallium".) Some other parts of the brain (hippocampus, parts of the thalamus & basal ganglia & cerebellum—see further discussion here) help the neocortex do its calculations, and I lump them into the "neocortex subsystem". I'll use the term subcortex for the rest of the brain (brainstem, hypothalamus, etc.). Aside: Is this the triune brain theory? No. Triune brain theory is, from what I gather, a collection of ideas about brain evolution and function, most of which are wrong. One aspect of triune brain theory is putting a lot of emphasis on the distinction between neocortical calculations and subcortical calculations. I like that part. I'm keeping that part, and I'm improving it by expanding the neocortex club to also include the thalamus, hippocampus, lizard pallium, etc., and then I'm ignoring everything else about triune brain theory. 2. Cortical uniformity I claim that the neocortex is, to a first approximation, architecturally uniform, i.e. all parts of it are running the same generic learning algorithm in a massively-parallelized way. The two caveats to cortical uniformity (spelled out in more detail at that link) are: There are sorta "hyperparameters" on the generic learning algorithm which are set differently in different parts of the neocortex—for example, different regions have different densities of each neuron type, different thresholds for making new connections (which also depend on age), etc. This is not at all surprising; all learning algorithms inevitably have tradeoffs whose optimal settings depend on the domain that they're learning (no free lunch). As one of many examples of how even "generic" learning algorithms benefit from domain-specific hyperparameters, if you've seen a pattern "A then B then C" recur 10 times in a row, you will start unconsciously expecting AB to be followed by C. But "should" you expect AB to be followed b...
Dec 10, 2021
Redwood Research’s current project by Buck Shlegeris
22:40
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Redwood Research’s current project, published by Buck Shlegeris on the AI Alignment Forum. Here’s a description of the project Redwood Research is working on at the moment. First I’ll say roughly what we’re doing, and then I’ll try to explain why I think this is a reasonable applied alignment project, and then I’ll talk a bit about the takeaways I’ve had from the project so far. There are a bunch of parts of this that we’re unsure of and figuring out as we go; I’ll try to highlight our most important confusions as they come up. I’ve mentioned a bunch of kind of in-the-weeds details because I think they add flavor. This is definitely just me describing a work in progress, rather than presenting any results. Thanks to everyone who’s contributed to the project so far: the full-time Redwood technical team of me, Nate Thomas, Daniel Ziegler, Seraphina Nix, Ben Weinstein-Raun, Adam Scherlis; other technical contributors Daniel de Haas, Shauna Kravec, Tao Lin, Noa Nabeshima, Peter Schmidt-Nielsen; our labellers, particularly Kristen Hall, Charles Warth, Jess Thomson, and Liam Clarke; and for particularly useful advice Mark Xu, Ajeya Cotra, and Beth Barnes. Thanks to Paul Christiano for suggesting a project along these lines and giving lots of helpful advice. Thanks to Adam Scherlis and Nate Soares for writing versions of this doc. And thanks to Bill Zito and other contributors to Redwood ops. Apologies to the people I’ve overlooked. We started this project at the start of August. What we’re doing We’re trying to take a language model that has been fine-tuned on completing fiction, and then modify it so that it never continues a snippet in a way that involves describing someone getting injured (with a caveat I’ll mention later). And we want to do this without sacrificing much quality: if you use both the filtered model and the original model to generate a completion for a prompt, humans should judge the filtered model’s completion as better (more coherent, reasonable, thematically appropriate, and so on) at least about half the time. (This “better almost 50% of the time” property is one way of trying to operationalize “we don’t want the filtered policy to be worse”. It so happens that this property is actually kind of badly behaved, but in our case it seems fine, given that we’re always going to be comparing against a fixed unfiltered distribution.) We’re doing this project in two steps: Step 1: train a classifier, generate by sampling with rejection In step 1 (which we’re currently doing), instead of training a single filtered generator model, we’re just training a classifier that takes a prompt and completion and predicts whether a human would say that the completion involved someone getting injured. You can use such a classifier to make a filtered generation process, by repeatedly generating completions until we find one that the classifier thinks is above some threshold of P(safe). You can play with this filtered generation process here. This interface lets you provide a prompt, and then you can see all of the generated completions and the classifier’s rating of each. It currently is set to use “10% chance of injury” as the decision boundary (it is extremely uncalibrated; this corresponds to a much lower actual chance of injury). Our first goal is to train a classifier that’s good enough that no-one is able to find prompts on which the above process has a noticeable probability of generating an injurious completion. This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently. You can read the instructions given to our contractors here; if you want to try out the labelling task, y...
Dec 10, 2021
2019 AI Alignment Literature Review and Charity Comparison by Larks
01:59:00
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2019 AI Alignment Literature Review and Charity Comparison, published by Larks on the AI Alignment Forum. Cross-posted to the EA forum here. Introduction As in 2016, 2017 and 2018, I have attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regards to possible investments. My aim is basically to judge the output of each organisation in 2019 and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency. I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued. As ever I am painfully aware of the various corners I have had to cut due to time constraints from my job, as well as being distracted by 1) another existential risk capital allocation project, 2) the miracle of life and 3) computer games. How to read this document This document is fairly extensive, and some parts (particularly the methodology section) are the same as last year, so I don’t recommend reading from start to finish. Instead, I recommend navigating to the sections of most interest to you. If you are interested in a specific research organisation, you can use the table of contents to navigate to the appropriate section. You might then also want to Ctrl+F for the organisation acronym in case they are mentioned elsewhere as well. If you are interested in a specific topic, I have added a tag to each paper, so you can Ctrl+F for a tag to find associated work. The tags were chosen somewhat informally so you might want to search more than one, especially as a piece might seem to fit in multiple categories. Here are the un-scientifically-chosen hashtags: Agent Foundations AI_Theory Amplification Careers CIRL Decision_Theory Ethical_Theory Forecasting Introduction Misc ML_safety Other_Xrisk Overview Philosophy Politics RL Security Shortterm Strategy New to Artificial Intelligence as an existential risk? If you are new to the idea of General Artificial Intelligence as presenting a major risk to the survival of human value, I recommend this Vox piece by Kelsey Piper. If you are already convinced and are interested in contributing technically, I recommend this piece by Jacob Steinheart, as unlike this document Jacob covers pre-2019 research and organises by topic, not organisation. Research Organisations FHI: The Future of Humanity Institute FHI is an Oxford-based Existential Risk Research organisation founded in 2005 by Nick Bostrom. They are affiliated with Oxford University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach. Their research can be found here. Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work. In the past I have been very impressed with their work. Research Drexler's Reframing Superintelligence: Comprehensive AI Services as General Intelligence is a massive document arguing that superintelligent AI will be developed for individual discrete services for specific finite tasks, rather than as general-purpose agents. Basically the idea is that it makes more sense for people to develop specialised AIs, so these will happen first, and if/when we build AGI these services can help control it. To some extent this seems to match what is happening - we do have many specialised AIs - but on the other hand there are teams working directly on AGI, and often in ML 'build an ML system that does it all...
Dec 10, 2021
Testing The Natural Abstraction Hypothesis: Project Intro by johnswentworth
11:37
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Testing The Natural Abstraction Hypothesis: Project Intro, published by johnswentworth on the AI Alignment Forum. The natural abstraction hypothesis says that Our physical world abstracts well: for most systems, the information relevant “far away” from the system (in various senses) is much lower-dimensional than the system itself. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans. These abstractions are “natural”: a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world. If true, the natural abstraction hypothesis would dramatically simplify AI and AI alignment in particular. It would mean that a wide variety of cognitive architectures will reliably learn approximately-the-same concepts as humans use, and that these concepts can be precisely and unambiguously specified. Ultimately, the natural abstraction hypothesis is an empirical claim, and will need to be tested empirically. At this point, however, we lack even the tools required to test it. This post is an intro to a project to build those tools and, ultimately, test the natural abstraction hypothesis in the real world. Background & Motivation One of the major conceptual challenges of designing human-aligned AI is the fact that human values are a function of humans’ latent variables: humans care about abstract objects/concepts like trees, cars, or other humans, not about low-level quantum world-states directly. This leads to conceptual problems of defining “what we want” in physical, reductive terms. More generally, it leads to conceptual problems in translating between human concepts and concepts learned by other systems - e.g. ML systems or biological systems. If true, the natural abstraction hypothesis provides a framework for translating between high-level human concepts, low-level physical systems, and high-level concepts used by non-human systems. The foundations of the framework have been sketched out in previous posts. What is Abstraction? introduces the mathematical formulation of the framework and provides several examples. Briefly: the high-dimensional internal details of far-apart subsystems are independent given their low-dimensional “abstract” summaries. For instance, the Lumped Circuit Abstraction abstracts away all the details of molecule positions or wire shapes in an electronic circuit, and represents the circuit as components each summarized by some low-dimensional behavior - like V = IR for a resistor. This works because the low-level molecular motions in a resistor are independent of the low-level molecular motions in some far-off part of the circuit, given the high-level summary. All the rest of the low-level information is “wiped out” by noise in low-level variables “in between” the far-apart components. In the causal graph of some low-level system, X is separated from Y by a bunch of noisy variables Z. For instance, X might be a resistor, Y might be a capacitor, and Z might be the wires (and air) between them. Noise in Z wipes out most of the low-level info about X, so that only a low-dimensional summary f(X) is relevant to predicting the state of Y. Chaos Induces Abstractions explains one major reason why we expect low-level details to be independent (given high-level summaries) for typical physical systems. If I have a bunch of balls bouncing around perfectly elastically in a box, then the total energy, number of balls, and volume of the box are all conserved, but chaos wipes out all other information about the exact positions and velocities of the balls. My “high-level summary” is then the energy, number of balls, and volume of the box; all other low-level information is wiped out by chaos. This is exactly the abstraction behind the ideal ...
Dec 10, 2021
The theory-practice gap by Buck Shlegeris by Buck Shlegeris
10:18
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The theory-practice gap by Buck Shlegeris, published by Buck Shlegeris on the AI Alignment Forum. [Thanks to Richard Ngo, Damon Binder, Summer Yue, Nate Thomas, Ajeya Cotra, Alex Turner, and other Redwood Research people for helpful comments; thanks Ruby Bloom for formatting this for the Alignment Forum for me.] I'm going to draw a picture, piece by piece. I want to talk about the capability of some different AI systems. You can see here that we've drawn the capability of the system we want to be competitive with, which I’ll call the unaligned benchmark. The unaligned benchmark is what you get if you train a system on the task that will cause the system to be most generally capable. And you have no idea how it's thinking about things, and you can only point this system at some goals and not others. I think that the alignment problem looks different depending on how capable the system you’re trying to align is, and I think there are reasonable arguments for focusing on various different capabilities levels. See here for more of my thoughts on this question. Alignment strategies People have also proposed various alignment strategies. But I don’t think that these alignment strategies are competitive with the unaligned benchmark, even in theory. I want to claim that most of the action in theoretical AI alignment is people proposing various ways of getting around these problems by having your systems do things that are human understandable instead of doing things that are justified by working well. For example, the hope with imitative IDA is that through its recursive structure you can build a dataset of increasingly competent answers to questions, and then at every step you can train a system to imitate these increasingly good answers to questions, and you end up with a really powerful question-answerer that was only ever trained to imitate humans-with-access-to-aligned-systems, and so your system is outer aligned. The bar I’ve added, which represents how capable I think you can get with amplified humans, is lower than the bar for the unaligned benchmark. I've drawn this bar lower because I think that if your system is trying to imitate cognition that can be broken down into human understandable parts, it is systematically not going to be able to pursue certain powerful strategies that the end-to-end trained systems will be able to. I think that there are probably a bunch of concepts that humans can’t understand quickly, or maybe can’t understand at all. And if your systems are restricted to never use these concepts, I think your systems are probably just going to be a bunch weaker. I think that transparency techniques, as well as AI alignment strategies like microscope AI that lean heavily on them, rely on a similar assumption that the cognition of the system you’re trying to align is factorizable into human-understandable parts. One component of the best-case scenario for transparency techniques is that anytime your neural net does stuff, you can get the best possible human understandable explanation of why it's doing that thing. If such an explanation doesn’t exist, your transparency tools won’t be able to assure you that your system is aligned even if it is. To summarize, I claim that current alignment proposals don’t really have a proposal for how to make systems that are aligned but either produce plans that can’t be understood by amplified humans do cognitive actions that can’t be understood by amplified humans And so I claim that current alignment proposals don’t seem like they can control systems as powerful as the systems you’d get from an unaligned training strategy. Empirical generalization I think some people are optimistic that alignment will generalize from the cases where amplified humans can evaluate it to the cases where the amplified humans can’t. I'm ...
Dec 10, 2021
Selection vs Control by Abram Demski
19:37
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Selection vs Control , published by Abram Demski on the AI Alignment Forum. This is something which has bothered me for a while, but, I'm writing it specifically in response to the recent post on mesa-optimizers. I feel strongly that the notion of 'optimization process' or 'optimizer' which people use -- partly derived from Eliezer's notion in the sequences -- should be split into two clusters. I call these two clusters 'selection' vs 'control'. I don't have precise formal statements of the distinction I'm pointing at; I'll give several examples. Before going into it, several reasons why this sort of thing may be important: It could help refine the discussion of mesa-optimization. The article restricted its discussion to the type of optimization I'll call 'selection', explicitly ruling out 'control'. This choice isn't obviously right. (More on this later.) Refining 'agency-like' concepts like this seems important for embedded agency -- what we eventually want is a story about how agents can be in the world. I think almost any discussion of the relationship between agency and optimization which isn't aware of the distinction I'm drawing here (at least as a hypothesis) will be confused. Generally, I feel like I see people making mistakes by not distinguishing between the two (whether or not they've derived their notion of optimizer from Eliezer). I judge an algorithm differently if it is intended as one or the other. (See also Stuart Armstrong's summary of other problems with the notion of optimization power Eliezer proposed -- those are unrelated to my discussion here, and strike me more as technical issues which call for refined formulae, rather than conceptual problems which call for revised ontology.) The Basic Idea Eliezer quantified optimization power by asking how small a target an optimization process hits, out of a space of possibilities. The type of 'space of possibilities' is what I want to poke at here. Selection First, consider a typical optimization algorithm, such as simulated annealing. The algorithm constructs an element of the search space (such as a specific combination of weights for a neural network), gets feedback on how good that element is, and then tries again. Over many iterations of this process, it finds better and better elements. Eventually, it outputs a single choice. This is the prototypical 'selection process' -- it can directly instantiate any element of the search space (although typically we consider cases where the process doesn't have time to instantiate all of them), it gets direct feedback on the quality of each element (although evaluation may be costly, so that the selection process must economize these evaluations), the quality of an element of search space does not depend on the previous choices, and only the final output matters. The term 'selection process' refers to the fact that this type of optimization selects between a number of explicitly given possibilities. The most basic example of this phenomenon is a 'filter' which rejects some elements and accepts others -- like selection bias in statistics. This has a limited ability to optimize, however, because it allows only one iteration. Natural selection is an example of much more powerful optimization occurring through iteration of selection effects. Control Now, consider a targeting system on a rocket -- let's say, a heat-seeking missile. The missile has sensors and actuators. It gets feedback from its sensors, and must somehow use this information to decide how to use its actuators. This is my prototypical control process. (The term 'control process' is supposed to invoke control theory.) Unlike a selection process, a controller can only instantiate one element of the space of possibilities. It gets to traverse exactly one path. The 'small target' which it hits is ther...
Dec 10, 2021
Why Subagents? by johnswentworth
12:01
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Subagents?, published by johnswentworth on the AI Alignment Forum. The justification for modelling real-world systems as “agents” - i.e. choosing actions to maximize some utility function - usually rests on various coherence theorems. They say things like “either the system’s behavior maximizes some utility function, or it is throwing away resources” or “either the system’s behavior maximizes some utility function, or it can be exploited” or things like that. Different theorems use slightly different assumptions and prove slightly different things, e.g. deterministic vs probabilistic utility function, unique vs non-unique utility function, whether the agent can ignore a possible action, etc. One theme in these theorems is how they handle “incomplete preferences”: situations where an agent does not prefer one world-state over another. For instance, imagine an agent which prefers pepperoni over mushroom pizza when it has pepperoni, but mushroom over pepperoni when it has mushroom; it’s simply never willing to trade in either direction. There’s nothing inherently “wrong” with this; the agent is not necessarily executing a dominated strategy, cannot necessarily be exploited, or any of the other bad things we associate with inconsistent preferences. But the preferences can’t be described by a utility function over pizza toppings. In this post, we’ll see that these kinds of preferences are very naturally described using subagents. In particular, when preferences are allowed to be path-dependent, subagents are important for representing consistent preferences. This gives a theoretical grounding for multi-agent models of human cognition. Preference Representation and Weak Utility Let’s expand our pizza example. We’ll consider an agent who: Prefers pepperoni, mushroom, or both over plain cheese pizza Prefers both over pepperoni or mushroom alone Does not have a stable preference between mushroom and pepperoni - they prefer whichever they currently have We can represent this using a directed graph: The arrows show preference: our agent prefers B over A if (and only if) there is a directed path from A to B along the arrows. There is no path from pepperoni to mushroom or from mushroom to pepperoni, so the agent has no preference between them. In this case, we’re interpreting “no preference” as “agent prefers to keep whatever they have already”. Note that this is NOT the same as “the agent is indifferent”, in which case the agent is willing to switch back and forth between the two options as long as the switch doesn’t cost anything. Key point: there is no cycle in this graph. If the agent’s preferences are cyclic, that’s when they provably throw away resources, paying to go in circles. As long as the preferences are acyclic, we call them “consistent”. Now, at this point we can still define a “weak” utility function by ignoring the “missing” preference between pepperoni and mushroom. Here’s the idea: a normal utility function says “the agent always prefers the option with higher utility”. A weak utility function says: “if the agent has a preference, then they always prefer the option with higher utility”. The missing preference means we can’t build a normal utility function, but we can still build a weak utility function. Here’s how: since our graph has no cycles, we can always order the nodes so that the arrows only go forward along the sorted nodes - a technique called topological sorting. Each node’s position in the topological sort order is its utility. A small tweak to this method also handles indifference. (Note: I’m using the term “weak utility” here because it seems natural; I don’t know of any standard term for this in the literature. Most people don’t distinguish between these two interpretations of utility.) When preferences are incomplete, there are multiple possib...
Dec 10, 2021
Possible takeaways from the coronavirus pandemic for slow AI takeoff by Vika
05:08
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Possible takeaways from the coronavirus pandemic for slow AI takeoff, published by Vika on the AI Alignment Forum. (Cross-posted from personal blog. Summarized in Alignment Newsletter #104. Thanks to Janos Kramar for his helpful feedback on this post.) Epistemic status: fairly speculative, would appreciate feedback As the covid-19 pandemic unfolds, we can draw lessons from it for managing future global risks, such as other pandemics, climate change, and risks from advanced AI. In this post, I will focus on possible implications for AI risk. For a broader treatment of this question, I recommend FLI's covid-19 page that includes expert interviews on the implications of the pandemic for other types of risks. A key element in AI risk scenarios is the speed of takeoff - whether advanced AI is developed gradually or suddenly. Paul Christiano's post on takeoff speeds defines slow takeoff in terms of the economic impact of AI as follows: "There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles." It argues that slow AI takeoff is more likely than fast takeoff, but is not necessarily easier to manage, since it poses different challenges, such as large-scale coordination. This post expands on this point by examining some parallels between the coronavirus pandemic and a slow takeoff scenario. The upsides of slow takeoff include the ability to learn from experience, act on warning signs, and reach a timely consensus that there is a serious problem. I would argue that the covid-19 pandemic had these properties, but most of the world's institutions did not take advantage of them. This suggests that, unless our institutions improve, we should not expect the slow AI takeoff scenario to have a good default outcome. Learning from experience. In the slow takeoff scenario, general AI is expected to appear in a world that has already experienced transformative change from less advanced AI, and institutions will have a chance to learn from problems with these AI systems. An analogy could be made with learning from dealing with less "advanced" epidemics like SARS that were not as successful as covid-19 at spreading across the world. While some useful lessons were learned, they were not successfully generalized to covid-19, which had somewhat different properties than these previous pathogens (such as asymptomatic transmission and higher virulence). Similarly, general AI may have somewhat different properties from less advanced AI that would make mitigation strategies more difficult to generalize. Warning signs. In the coronavirus pandemic response, there has been a lot of variance in how successfully governments acted on warning signs. Western countries had at least a month of warning while the epidemic was spreading in China, which they could have used to stock up on PPE and build up testing capacity, but most did not do so. Experts have warned about the likelihood of a coronavirus outbreak for many years, but this did not lead most governments to stock up on medical supplies. This was a failure to take cheap preventative measures in response to advance warnings about a widely recognized risk with tangible consequences, which is not a good sign for the case where the risk is less tangible and well-understood (such as risk from general AI). Consensus on the problem. During the covid-19 epidemic, the abundance of warning signs and past experience with previous pandemics created an opportunity for a timely consensus that there is a serious problem. However, it actually took a long time for a broad consensus to emerge - the virus was often dismissed as "overblown" and "just like the flu" as late as March 2020. A timely response to the risk required acting before there was a consensus, thus risking the appearance of ...
Dec 10, 2021
Embedded Agency (full-text version) by Scott Garrabrant, Abram Demski
01:33:34
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Embedded Agency (full-text version) , published by Scott Garrabrant, Abram Demski on the AI Alignment Forum. Write a Review Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don't already know. There's a complicated engineering problem here. But there's also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work? In this post, I’ll point to four ways we don’t currently know how it works, and four areas of active research aimed at figuring it out. 1. Embedded agents This is Alexei, and Alexei is playing a video game. Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller. The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen. Alexei is also very smart, and capable of holding the entire video game inside his mind. If Alexei has any uncertainty, it is only over empirical facts like what game he is playing, and not over logical facts like which inputs (for a given deterministic game) will yield which outputs. This means that Alexei must also store inside his mind every possible game he could be playing. Alexei does not, however, have to think about himself. He is only optimizing the game he is playing, and not optimizing the brain he is using to think about the game. He may still choose actions based off of value of information, but this is only to help him rule out possible games he is playing, and not to change the way in which he thinks. In fact, Alexei can treat himself as an unchanging indivisible atom. Since he doesn't exist in the environment he's thinking about, Alexei doesn't worry about whether he'll change over time, or about any subroutines he might have to run. Notice that all the properties I talked about are partially made possible by the fact that Alexei is cleanly separated from the environment that he is optimizing. This is Emmy. Emmy is playing real life. Real life is not like a video game. The differences largely come from the fact that Emmy is within the environment that she is trying to optimize. Alexei sees the universe as a function, and he optimizes by choosing inputs to that function that lead to greater reward than any of the other possible inputs he might choose. Emmy, on the other hand, doesn't have a function. She just has an environment, and this environment contains her. Emmy wants to choose the best possible action, but which action Emmy chooses to take is just another fact about the environment. Emmy can reason about the part of the environment that is her decision, but since there's only one action that Emmy ends up actually taking, it’s not clear what it even means for Emmy to “choose” an action that is better than the rest. Alexei can poke the universe and see what happens. Emmy is the universe poking itself. In Emmy’s case, how do we formalize the idea of “choosing” at all? To make matters worse, since Emmy is contained within the environment, Emmy must also be smaller than the environment. This means that Emmy is incapable of storing accurate detailed models of the environment within her mind. This causes a problem: Bayesian reasoning works by starting with a large collection of possible environments, and as you observe facts that are inconsistent with some of those environments, you rule them out. What does reasoning look like when you're not even capable of storing a single valid hypothesis for the way the world works? Emmy is going to have to use a different type of reasoning, and make updates that ...
Dec 10, 2021
Cortés, Pizarro, and Afonso as Precedents for Takeover by Daniel Kokotajlo
18:39
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cortés, Pizarro, and Afonso as Precedents for Takeover, published by Daniel Kokotajlo on the AI Alignment Forum. Crossposted from AI Impacts. Epistemic status: I am not a historian, nor have I investigated these case studies in detail. I admit I am still uncertain about how the conquistadors were able to colonize so much of the world so quickly. I think my ignorance is excusable because this is just a blog post; I welcome corrections from people who know more. If it generates sufficient interest I might do a deeper investigation. Even if I’m right, this is just one set of historical case-studies; it doesn’t prove anything about AI, even if it is suggestive. Finally, in describing these conquistadors as “successful,” I simply mean that they achieved their goals, not that what they achieved was good. Summary In the span of a few years, some minor European explorers (later known as the conquistadors) encountered, conquered, and enslaved several huge regions of the world. That they were able to do this is surprising; their technological advantage was not huge. (This was before the scientific and industrial revolutions.) From these cases, I think we learn that it is occasionally possible for a small force to quickly conquer large parts of the world, despite: Having only a minuscule fraction of the world's resources and power Having technology + diplomatic and strategic cunning that is better but not that much better Having very little data about the world when the conquest begins Being disunited Which all suggests that it isn’t as implausible that a small AI takes over the world in mildly favorable circumstances as is sometimes thought. EDIT: In light of good pushback from people (e.g. Lucy.ea8 and e.g. Matthew Barnett) about the importance of disease, I think one should probably add a caveat to the above: "In times of chaos & disruption, at least." NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct. Three shocking true stories I highly recommend you read the wiki pages yourself; otherwise, here are my summaries: Cortés: [wiki] [wiki] April 1519: Hernán Cortés lands in Yucatan with ~500 men, 13 horses, and a few cannons. He destroys his ships so his men won't be able to retreat. His goal is to conquer the Aztec empire of several million people. He makes his way towards the imperial capital, Tenochtitlán. Along the way he encounters various local groups, fighting some and allying with some. He is constantly outnumbered but his technology gives him an advantage in fights. His force grows in size, because even though he loses Spaniards he gains local allies who resent Aztec rule. Tenochtitlán is an island fortress (like Venice) with a population of over 200,000, making it one of the largest and richest cities in the world at the time. Cortés arrives in the city asking for an audience with the Emperor, who receives him warily. Cortés takes the emperor hostage within his own palace, indirectly ruling Tenochtitlán through him. Cortés learns that the Spanish governor has landed in Mexico with a force twice his size, intent on arresting him. (Cortés' expedition was illegal!) Cortés leaves 200 men guarding the Emperor, marches to the coast with the rest, surprises and defeats the new Spaniards in battle, and incorporates the survivors into his army. July 1520: Back at the capital, the locals are starting to rebel against his men. Cortés marches back to the capital, uniting his forces just in time to be besieged in the imperial palace. They murder the emperor and fight their way out of the city overnight, taking heavy losses. They shelter in another city (Tlaxcala) that was thinking about rebelling against the Aztecs. Cortés allies with the Tlaxcalans and launches a general uprising against the Aztecs. Not...
Dec 10, 2021
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers by lifelonglearner, Peter Hase
02:32:44
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers, published by lifelonglearner, Peter Hase on the AI Alignment Forum. Peter Hase. UNC Chapel Hill. Owen Shen. UC San Diego. With thanks to Robert Kirk and Mohit Bansal for helpful feedback on this post. Introduction. Model interpretability was a bullet point in Concrete Problems in AI Safety (2016). Since then, interpretability has come to comprise entire research directions in technical safety agendas (2020); model transparency appears throughout An overview of 11 proposals for building safe advanced AI (2020); and explainable AI has a Twitter hashtag, #XAI. (For more on how interpretability is relevant to AI safety, see here or here.) Interpretability is now a very popular area of research. The interpretability area was the most popular in terms of video views at ACL last year. Model interpretability is now so mainstream there are books on the topic and corporate services promising it. So what's the state of research on this topic? What does progress in interpretability look like, and are we making progress? What is this post? This post summarizes 70 recent papers on model transparency, interpretability, and explainability, limited to a non-random subset of papers from the past 3 years or so. We also give opinions on several active areas of research, and collate another 90 papers that are not summarized. How to read this post. If you want to see high-level opinions on several areas of interpretability research, just read the opinion section, which is organized according to our very ad-hoc set of topic areas. If you want to learn more about what work looks like in a particular area, you can read the summaries of papers in that area. For a quick glance at each area, we highlight one standout paper per area, so you can just check out that summary. If you want to see more work that has come out in an area, look at the non-summarized papers at the end of the post (organized with the same areas as the summarized papers). We assume readers are familiar with basic aspects of interpretability research, i.e. the kinds of concepts in The Mythos of Model Interpretability and Towards A Rigorous Science of Interpretable Machine Learning. We recommend looking at either of these papers if you want a primer on interpretability. We also assume that readers are familiar with older, foundational works like "Why Should I Trust You?: Explaining the Predictions of Any Classifier." Disclaimer: This post is written by a team of two people, and hence its breadth is limited and its content biased by our interests and backgrounds. A few of the summarized papers are our own. Please let us know if you think we've missed anything important that could improve the post. Master List of Summarized Papers. Theory and Opinion. Explanation in Artificial Intelligence: Insights from the Social Sciences. Chris Olah’s views on AGI safety. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? Aligning Faithful Interpretations with their Social Attribution. Evaluation. Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction. Comparing Automatic and Human Evaluation of Local Explanations for Text Classification. Do explanations make VQA models more predictable to a human? Sanity Checks for Saliency Maps. A Benchmark for Interpretability Methods in Deep Neural Networks. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? ERASER: A Benchmark to Evaluate Rationalized NLP Models. On quantitative aspects of model interpretability. Manipulating and Measuring Model Interpretability. Methods. Estimating Feature Importance. Neuron Shapley...
Dec 10, 2021
Disentangling arguments for the importance of AI safety by Richard Ngo
12:56
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Disentangling arguments for the importance of AI safety, published by Richard Ngo on the AI Alignment Forum. Note: my views have shifted significantly since writing this post. I now consider items 1, 2, 3, and 6.2 to be different facets of one core argument, which I call the "second species" argument, and which I explore in depth in this report. And I don't really think of 4 as an AI safety problem any more. I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees' reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others - although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety. Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will increase its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down. This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way. Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments - nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings. The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways. This is a more general version of the “inn...
Dec 10, 2021
A Semitechnical Introductory Dialogue on Solomonoff Induction by Eliezer Yudkowsky
01:35:55
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Semitechnical Introductory Dialogue on Solomonoff Induction, published by Eliezer Yudkowsky on the AI Alignment Forum. (Originally posted in December 2015: A dialogue between Ashley, a computer scientist who's never heard of Solomonoff's theory of inductive inference, and Blaine, who thinks it is the best thing since sliced bread.) i. Unbounded analysis ASHLEY: Good evening, Msr. Blaine. BLAINE: Good evening, Msr. Ashley. ASHLEY: I've heard there's this thing called "Solomonoff's theory of inductive inference". BLAINE: The rumors have spread, then. ASHLEY: Yeah, so, what the heck is that about? BLAINE: Invented in the 1960s by the mathematician Ray Solomonoff, the key idea in Solomonoff induction is to do sequence prediction by using Bayesian updating on a prior composed of a mixture of all computable probability distributions ASHLEY: Wait. Back up a lot. Before you try to explain what Solomonoff induction is, I'd like you to try to tell me what it does, or why people study it in the first place. I find that helps me organize my listening. Right now I don't even know why I should be interested in this. BLAINE: Um, okay. Let me think for a second... ASHLEY: Also, while I can imagine things that "sequence prediction" might mean, I haven't yet encountered it in a technical context, so you'd better go a bit further back and start more at the beginning. I do know what "computable" means and what a "probability distribution" is, and I remember the formula for Bayes's Rule although it's been a while. BLAINE: Okay. So... one way of framing the usual reason why people study this general field in the first place, is that sometimes, by studying certain idealized mathematical questions, we can gain valuable intuitions about epistemology. That's, uh, the field that studies how to reason about factual questions, how to build a map of reality that reflects the territory ASHLEY: I have some idea what 'epistemology' is, yes. But I think you might need to start even further back, maybe with some sort of concrete example or something. BLAINE: Okay. Um. So one anecdote that I sometimes use to frame the value of computer science to the study of epistemology is Edgar Allen Poe's argument in 1833 that chess was uncomputable. ASHLEY: That doesn't sound like a thing that actually happened. BLAINE: I know, but it totally did happen and not in a metaphorical sense either! Edgar Allen Poe wrote an essay explaining why no automaton would ever be able to play chess, and he specifically mentioned "Mr. Babbage's computing engine" as an example. You see, in the nineteenth century, there was for a time this sensation known as the Mechanical Turk—supposedly a machine, an automaton, that could play chess. At the grandmaster level, no less. Now today, when we're accustomed to the idea that it takes a reasonably powerful computer to do that, we can know immediately that the Mechanical Turk must have been a fraud and that there must have been a concealed operator inside—a person with dwarfism, as it turned out. Today we know that this sort of thing is hard to build into a machine. But in the 19th century, even that much wasn't known. So when Edgar Allen Poe, who besides being an author was also an accomplished magician, set out to write an essay about the Mechanical Turk, he spent the second half of the essay dissecting what was known about the Turk's appearance to (correctly) figure out where the human operator was hiding. But Poe spent the first half of the essay arguing that no automaton—nothing like Mr. Babbage's computing engine—could possibly play chess, which was how he knew a priori that the Turk had a concealed human operator. ASHLEY: And what was Poe's argument? BLAINE: Poe observed that in an algebraical problem, each step followed from the previous step of necessity, which was why the steps ...
Dec 10, 2021
Thoughts on Human Models by Ramana Kumar, Scott Garrabrant
16:50
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on Human Models, published by Ramana Kumar, Scott Garrabrant on the AI Alignment Forum. Write a Review Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models. In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult. Problems with Human Models To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.) Less Independent Audits Imagine we have built an AGI system and we want to use it to design the mass transit system for a new city. The safety problems associated with such a project are well recognised; suppose we are not completely sure we have solved them, but are confident enough to try anyway. We run the system in a sandbox on some fake city input data and examine its outputs. Then we run it on some more outlandish fake city data to assess robustness to distributional shift. The AGI’s outputs look like reasonable transit system designs and considerations, and include arguments, metrics, and other supporting evidence that they are good. Should we be satisfied and ready to run the system on the real city’s data, and to implement the resulting proposed design? We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.) Humans hav...
Dec 06, 2021
AI Alignment 2018-19 Review by Rohin Shah
59:48
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Alignment 2018-19 Review, published by Rohin Shah on the AI Alignment Forum. Preamble What this post is This is a review post of public work in AI alignment over 2019, with some inclusions from 2018. It has this preamble (~700 words), a short version / summary (~1.6k words), and a long version (~8.3k words). It is available as a Google Doc here. There are many areas of work that are relevant to AI alignment that I have barely touched on, such as interpretability, uncertainty estimation, adversarial examples, and assured autonomy, primarily because I have not been following these fields and wouldn’t be able to write a good summary of what has happened in them. I have also mostly focused on articles that provide some conceptual insight, and excluded or briefly linked to papers that primarily make quantitative improvements on important metrics. While such papers are obviously important (ultimately, our techniques need to work well), there isn’t much to say about them in a yearly review other than that the quantitative metric was improved. Despite these exclusions, there was still a ton of work to select from, perhaps around ~500 articles, of which over 300 have been linked to in this post. There are many interesting articles that I really enjoyed that get only a sentence of description, in which I ignore many of the points that the article makes. Most have been summarized in the Alignment Newsletter, so if you’d like to learn more about any particular link, but don’t want to read the entire thing, just search for its title in the database. What you should know about the structure of this post I am not speaking for myself; by default I am trying to explain what has been said, in a way that the authors of the articles would agree with. Any extra opinion that I add will be in italics. As a post, this is meant to be read sequentially, but the underlying structure is a graph (nodes are posts, edges connect posts that are very related). I arranged it in a sequence that highlights the most salient-to-me connections. This means that the order in which I present subtopics is very much not a reflection of what I think is most important in AI safety: in my presentation order, I focused on edges (connections) rather than nodes (subtopics). Other minor details: Any links from earlier than 2018 will have their year of publication right after the link (except for articles that were reposted as part of Alignment Forum sequences). I typically link to blog posts; in several cases there is also an associated paper that I have not linked. How to read this post I have put the most effort into making the prose of the long version read smoothly. The hierarchical organization is comparatively less coherent; this is partly because I optimized the prose, and partly because AI safety work is hard to cluster. As a result, for those willing to put in the effort, I’d recommend reading the long version directly, without paying too much attention to the hierarchy. If you have less time, or are less interested in the minutiae of AI alignment research, the short version is for you. Since I don’t name authors or organizations, you may want to take this as your opportunity to form beliefs about which arguments in AI alignment are important based on the ideas (as opposed to based on trust in the author of the post). People who keep up with AI alignment work might want to know which posts I’m referencing as they read, which is a bit hard since I don’t name the posts in the text. If this describes you, you should be reading this post on the Alignment Forum, where you can hover over most links to see what they link to. Alternatively, the references section in the Google Doc lists all links in the order that they appear in the post, along with the hierarchical organization, and so you can open the reference...
Dec 06, 2021
Paul's research agenda FAQ by Alex Zhu
33:01
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paul's research agenda FAQ, published by Alex Zhu on the AI Alignment Forum. I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda. This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all views are my own views of Paul’s views. (ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses this as a reasonable representation of his views. This is in contrast with previous drafts of this FAQ, which did contain serious errors he asked to have corrected.) For an introduction to Paul’s agenda, I’d recommend Ajeya Cotra’s summary. For good prior discussion of his agenda, I’d recommend Eliezer’s thoughts, Jessica Taylor’s thoughts (here and here), some posts and discussions on LessWrong, and Wei Dai’s comments on Paul’s blog. For most of Paul’s writings about his agenda, visit ai-alignment.com. 0. Goals and non-goals 0.1: What is this agenda trying to accomplish? Enable humans to build arbitrarily powerful AGI assistants that are competitive with unaligned AGI alternatives, and only try to help their operators (and in particular, never attempt to kill or manipulate them). People often conceive of safe AGIs as silver bullets that will robustly solve every problem that humans care about. This agenda is not about building a silver bullet, it’s about building a tool that will safely and substantially assist its operators. For example, this agenda does not aim to create assistants that can do any of the following: They can prevent nuclear wars from happening They can prevent evil dictatorships They can make centuries’ worth of philosophical progress They can effectively negotiate with distant superintelligences They can solve the value specification problem On the other hand, to the extent that humans care about these things and could make them happen, this agenda lets us build AGI assistants that can substantially assist humans achieve these things. For example, a team of 1,000 competent humans working together for 10 years could make substantial progress on preventing nuclear wars or solving metaphilosophy. Unfortunately, it’s slow and expensive to assemble a team like this, but an AGI assistant might enable us to reap similar benefits in far less time and at much lower cost. (See Clarifying "AI Alignment" and Directions and desiderata for AI alignment.) 0.2: What are examples of ways in which you imagine these AGI assistants getting used? Two countries end up in an AGI arms race. Both countries are aware of the existential threats that AGIs pose, but also don’t want to limit the power of their AIs. They build AGIs according to this agenda, which stay under the operators’ control. These AGIs then help the operators broker an international treaty, which ushers in an era of peace and stability. During this era, foundational AI safety problems (e.g. those in MIRI’s research agenda) are solved in earnest, and a provably safe recursively self-improving AI is built. A more pessimistic scenario is that the countries wage war, and the side with the more powerful AGI achieves a decisive victory and establishes a world government. This scenario isn’t as good, but it at least leaves humans in control (instead of extinct). The most pressing problem in AI strategy is how to stop an AGI race to the bottom from killi...
Dec 06, 2021
Forecasting Thread: AI TimelinesQ by Amanda Ngo, Daniel Kokotajlo, Ben Pace
03:22
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Forecasting Thread: AI TimelinesQ, published by Amanda Ngo, Daniel Kokotajlo, Ben Pace on the AI Alignment Forum. This is a thread for displaying your timeline until human-level AGI. Every answer to this post should be a forecast. In this case, a forecast showing your AI timeline. For example, here are Alex Irpan’s AGI timelines. The green distribution is his prediction from 2015, and the orange distribution is his 2020 update (based on this post). For extra credit, you can: Say why you believe it (what factors are you tracking?) Include someone else's distribution who you disagree with, and speculate as to the disagreement How to make a distribution using Elicit Go to this page. Enter your beliefs in the bins. Specify an interval using the Min and Max bin, and put the probability you assign to that interval in the probability bin. For example, if you think there's a 50% probability of AGI before 2050, you can leave Min blank (it will default to the Min of the question range), enter 2050 in the Max bin, and enter 50% in the probability bin. The minimum of the range is January 1, 2021, and the maximum is January 1, 2100. You can assign probability above January 1, 2100 (which also includes 'never') or below January 1, 2021 using the Edit buttons next to the graph. Click 'Save snapshot,' to save your distribution to a static URL. A timestamp will appear below the 'Save snapshot' button. This links to the URL of your snapshot. Make sure to copy it before refreshing the page, otherwise it will disappear. Copy the snapshot timestamp link and paste it into your LessWrong comment. You can also add a screenshot of your distribution using the instructions below. How to overlay distributions on the same graph Copy your snapshot URL. Paste it into the Import snapshot via URL box on the snapshot you want to compare your prediction to (e.g. the snapshot of Alex's distributions). Rename your distribution to keep track. Take a new snapshot if you want to save or share the overlaid distributions. How to add an image to your comment Take a screenshot of your distribution Then do one of two things: If you have beta-features turned on in your account settings, drag-and-drop the image into your comment If not, upload it to an image hosting service, then write the following markdown syntax for the image to appear, with the url appearing where it says ‘link’: ![](link) If it worked, you will see the image in the comment before hitting submit. If you have any bugs or technical issues, reply to Ben (here) in the comment section. Top Forecast Comparisons Here is a snapshot of the top voted forecasts from this thread, last updated 9/01/20. You can click the dropdown box near the bottom right of the graph to see the bins for each prediction. Here is a comparison of the forecasts as a CDF: Here is a mixture of the distributions on this thread, weighted by normalized votes (last updated 9/01/20). The median is June 20, 2047. You can click the Interpret tab on the snapshot to see more percentiles. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 06, 2021
An Orthodox Case Against Utility Functions by Abram Demski
15:47
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Orthodox Case Against Utility Functions, published by Abram Demski on the AI Alignment Forum. This post has benefitted from discussion with Sam Eisenstat, Scott Garrabrant, Tsvi Benson-Tilsen, Daniel Demski, Daniel Kokotajlo, and Stuart Armstrong. It started out as a thought about Stuart Armstrong's research agenda. In this post, I hope to say something about what it means for a rational agent to have preferences. The view I am putting forward is relatively new to me, but it is not very radical. It is, dare I say, a conservative view -- I hold close to Bayesian expected utility theory. However, my impression is that it differs greatly from common impressions of Bayesian expected utility theory. I will argue against a particular view of expected utility theory -- a view which I'll call reductive utility. I do not recall seeing this view explicitly laid out and defended (except in in-person conversations). However, I expect at least a good chunk of the assumptions are commonly made. Reductive Utility The core tenets of reductive utility are as follows: The sample space Ω of a rational agent's beliefs is, more or less, the set of possible ways the world could be -- which is to say, the set of possible physical configurations of the universe. Hence, each world ω ∈ Ω is one such configuration. The preferences of a rational agent are represented by a utility function U Ω → R from worlds to real numbers. Furthermore, the utility function should be a computable function of worlds. Since I'm setting up the view which I'm knocking down, there is a risk I'm striking at a straw man. However, I think there are some good reasons to find the view appealing. The following subsections will expand on the three tenets, and attempt to provide some motivation for them. If the three points seem obvious to you, you might just skip to the next section. Worlds Are Basically Physical What I mean here resembles the standard physical-reductionist view. However, my emphasis is on certain features of this view: There is some "basic stuff" -- like like quarks or vibrating strings or what-have-you. What there is to know about the world is some set of statements about this basic stuff -- particle locations and momentums, or wave-form function values, or what-have-you. These special atomic statements should be logically independent from each other (though they may of course be probabilistically related), and together, fully determine the world. These should (more or less) be what beliefs are about, such that we can (more or less) talk about beliefs in terms of the sample space ω ∈ Ω as being the set of worlds understood in this way. This is the so-called "view from nowhere", as Thomas Nagel puts it. I don't intend to construe this position as ruling out certain non-physical facts which we may have beliefs about. For example, we may believe indexical facts on top of the physical facts -- there might be (1) beliefs about the universe, and (2) beliefs about where we are in the universe. Exceptions like this violate an extreme reductive view, but are still close enough to count as reductive thinking for my purposes. Utility Is a Function of Worlds So we've got the "basically physical" ω ∈ Ω . Now we write down a utility function U ω . In other words, utility is a random variable on our event space. What's the big deal? One thing this is saying is that preferences are a function of the world. Specifically, preferences need not only depend on what is observed. This is incompatible with standard RL in a way that matters. But, in addition to saying that utility can depend on more than just observations, we are restricting utility to only depend on things that are in the world. After we consider all the information in ω , there cannot be any extra uncertainty about utility -- no extra "moral facts" which w...
Dec 06, 2021
Saving Time by Scott Garrabrant
05:38
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Saving Time, published by Scott Garrabrant on the AI Alignment Forum. For the last few years, a large part of my research motivation has been directed at trying to save the concept of time—save it, for example, from all the weird causal loops created by decision theory problems. This post will hopefully explain why I care so much about time, and what I think needs to be fixed. Why Time? My best attempt at a short description of time is that time is causality. For example, in a Pearlian Bayes net, you draw edges from earlier nodes to later nodes. To the extent that we want to think about causality, then, we will need to understand time. Importantly, time is the substrate in which learning and commitments take place. When agents learn, they learn over time. The passage of time is like a ritual in which opportunities are destroyed and knowledge is created. And I think that many models of learning are subtly confused, because they are based on confused notions of time. Time is also crucial for thinking about agency. My best short-phrase definition of agency is that agency is time travel. An agent is a mechanism through which the future is able to affect the past. An agent models the future consequences of its actions, and chooses actions on the basis of those consequences. In that sense, the consequence causes the action, in spite of the fact that the action comes earlier in the standard physical sense. Problem: Time is Loopy The main thing going wrong with time is that it is “loopy.” The primary confusing thing about Newcomb's problem is that we want to think of our decision as coming “before” the filling of the boxes, in spite of the fact that it physically comes after. This is hinting that maybe we want to understand some other "logical" time in addition to the time of physics. However, when we attempt to do this, we run into two problems: Firstly, we don't understand where this logical time might come from, or how to learn it, and secondly, we run into some apparent temporal loops. I am going to set aside the first problem and focus on the second. The easiest way to see why we run into temporal loops is to notice that it seems like physical time is at least a little bit entangled with logical time. Imagine the point of view of someone running a physics simulation of Newcomb’s problem, and tracking all of the details of all of the atoms. From that point of view, it seems like there is a useful sense in which the filling of the boxes comes before an agent's decision to one-box or two-box. At the same time, however, those atoms compose an agent that shouldn’t make decisions as though it were helpless to change anything. Maybe the solution here is to think of there being many different types of “before” and “after,” “cause” and “effect,” etc. For example, we could say that X is before Y from an agent-first perspective, but Y is before X from a physics-first perspective. I think this is right, and we want to think of there as being many different systems of time (hopefully predictably interconnected). But I don't think this resolves the whole problem. Consider a pair of FairBot agents that successfully execute a Löbian handshake to cooperate in an open-source prisoner’s dilemma. I want to say that each agent's cooperation causes the other agent's cooperation in some sense. I could say that relative to each agent the causal/temporal ordering goes a different way, but I think the loop is an important part of the structure in this case. (I also am not even sure which direction of time I would want to associate with which agent.) We also are tempted to put loops in our time/causality for other reasons. For example, when modeling a feedback loop in a system that persists over time, we might draw structures that look a lot like a Bayes net, but are not acyclic (e.g., a POMDP). ...
Dec 06, 2021
Beyond Astronomical Waste by Wei Dai
05:54
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Beyond Astronomical Waste, published by Wei Dai on the AI Alignment Forum. Write a Review Faced with the astronomical amount of unclaimed and unused resources in our universe, one's first reaction is probably wonderment and anticipation, but a second reaction may be disappointment that our universe isn't even larger or contains even more resources (such as the ability to support 3^^^3 human lifetimes or perhaps to perform an infinite amount of computation). In a previous post I suggested that the potential amount of astronomical waste in our universe seems small enough that a total utilitarian (or the total utilitarianism part of someone’s moral uncertainty) might reason that since one should have made a deal to trade away power/resources/influence in this universe for power/resources/influence in universes with much larger amounts of available resources, it would be rational to behave as if this deal was actually made. But for various reasons a total utilitarian may not buy that argument, in which case another line of thought is to look for things to care about beyond the potential astronomical waste in our universe, in other words to explore possible sources of expected value that may be much greater than what can be gained by just creating worthwhile lives in this universe. One example of this is the possibility of escaping, or being deliberately uplifted from, a simulation that we're in, into a much bigger or richer base universe. Or more generally, the possibility of controlling, through our decisions, the outcomes of universes with much greater computational resources than the one we're apparently in. It seems likely that under an assumption such as Tegmark's Mathematical Universe Hypothesis, there are many simulations of our universe running all over the multiverse, including in universes that are much richer than ours in computational resources. If such simulations exist, it also seems likely that we can leave some of them, for example through one of these mechanisms: Exploiting a flaw in the software or hardware of the computer that is running our simulation (including "natural simulations" where a very large universe happens to contain a simulation of ours without anyone intending this). Exploiting a flaw in the psychology of agents running the simulation. Altruism (or other moral/axiological considerations) on the part of the simulators. Acausal trade. Other instrumental reasons for the simulators to let out simulated beings, such as wanting someone to talk to or play with. (Paul Christiano's recent When is unaligned AI morally valuable? contains an example of this, however the idea there only lets us escape to another universe similar to this one.) (Being run as a simulation in another universe isn't necessarily the only way to control what happens in that universe. Another possibility is if universes with halting oracles exist (which is implied by Tegmark's MUH since they exist as mathematical structures in the arithmetical hierarchy), some of their oracle queries may be questions whose answers can be controlled by our decisions, in which case we can control what happens in those universes without being simulated by them (in the sense of being run step by step in a computer). Another example is that superintelligent beings may be able to reason about what our decisions are without having to run a step by step simulation of us, even without access to a halting oracle.) The general idea here is for a superintelligence descending from us to (after determining that this is an advisable course of action) use some fraction of the resources of this universe to reason about or search (computationally) for much bigger/richer universes that are running us as simulations or can otherwise be controlled by us, and then determine what we need to do to maximize the exp...
Dec 06, 2021
interpreting GPT: the logit lens by nostalgebraist
17:39
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: interpreting GPT: the logit lens, published by nostalgebraist on the AI Alignment Forum. This post relates an observation I've made in my work with GPT-2, which I have not seen made elsewhere. IMO, this observation sheds a good deal of light on how the GPT-2/3/etc models (hereafter just "GPT") work internally. There is an accompanying Colab notebook which will let you interactively explore the phenomenon I describe here. [Edit: updated with another section on comparing to the inputs, rather than the outputs. This arguably resolves some of my confusion at the end. Thanks to algon33 and Gurkenglas for relevant suggestions here.] [Edit 5/17/21: I've recently written a new Colab notebook which extends this post in various ways: trying the "lens" on various models from 125M to 2.7B parameters, including GPT-Neo and CTRL exploring the contributions of the attention and MLP sub-blocks within transformer blocks/layers trying out a variant of the "decoder" used in this post, which dramatically helps with interpreting some models overview GPT's probabilistic predictions are a linear function of the activations in its final layer. If one applies the same function to the activations of intermediate GPT layers, the resulting distributions make intuitive sense. This "logit lens" provides a simple (if partial) interpretability lens for GPT's internals. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step. These distributions gradually converge to the final distribution over the layers of the network, often getting close to that distribution long before the end. At some point in the middle, GPT will have formed a "pretty good guess" as to the next token, and the later layers seem to be refining these guesses in light of one another. The general trend, as one moves from earlier to later layers, is "nonsense / not interpretable" (sometimes, in very early layers) --> "shallow guesses (words that are the right part of speech / register / etc)" --> "better guesses" ...though some of those phases are sometimes absent. On the other hand, only the inputs look like the input tokens. In the logit lens, the early layers sometimes look like nonsense, and sometimes look like very simple guesses about the output. They almost never look like the input. Apparently, the model does not "keep the inputs around" for a while and gradually process them into some intermediate representation, then into a prediction. Instead, the inputs are immediately converted to a very different representation, which is smoothly refined into the final prediction. This is reminiscent of the perspective in Universal Transformers which sees transformers as iteratively refining a guess. However, Universal Transformers have both an encoder and decoder, while GPT is only a decoder. This means GPT faces a tradeoff between keeping around the input tokens, and producing the next tokens. Eventually it has to spit out the next token, so the longer it spends (in depth terms) processing something that looks like token i, the less time it has to convert it into token i+1. GPT has a deadline, and the clock is ticking. More speculatively, this suggests that GPT mostly "thinks in predictive space," immediately converting inputs to predicted outputs, then refining guesses in light of other guesses that are themselves being refined. I think this might suggest there is some fundamentally better way to do sampling from GPT models? I'm having trouble writing out the intuition clearly, so I'll leave it for later posts. Caveat: I call this a "lens" because it is one way of extracting information from GPT's internal activations. I imagine there is other information...
Dec 06, 2021
Full-time AGI Safety! by Steve Byrnes
01:33
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Full-time AGI Safety!, published by Steve Byrnes on the AI Alignment Forum. Hi everyone! I'm Steve Byrnes, a professional physicist in Boston. Y'all may know me as a regular blogger and commenter on this site. (Example post, Full list.) Well, out of the blue, an exceptionally generous sponsor has granted me a grant to spend the next year full-time trying to advance AGI safety! (Eternal gratitude to Beth Barnes and the CEA donor lottery program!) To make a long story short, my research plan is to immerse myself in the latest thinking about how the human brain works (i.e., what algorithm does it run?), and apply those insights to help move forward the discussion on AGI safety and strategy. (And conversely, to also help move forward the discussion of "What if we succeed?" within neuroscience.) Until now I've been researching and blogging in little bits of time squeezed between work and kids. Not anymore! Starting today, I'm in full-time, all-out effort mode! I know that this year is going to fly by far too quickly; I’m desperate to make the most of it. One thing that will help is lots of exchange of ideas! You can email me at steven.byrnes@gmail.com if you want to see my research proposal, or discuss topics of interest, or just say hi! :-) Pingbacks Research agenda update Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 06, 2021
AMA: Paul Christiano, alignment researcher by Paul Christiano
00:39
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AMA: Paul Christiano, alignment researcher, published by Paul Christiano on the AI Alignment Forum. I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1). If you want to ask something just post a top-level comment; I'll spend at least a day answering questions. You can find some background about me here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 06, 2021
Inner Alignment in Salt-Starved Rats by Steve Byrnes
18:55
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Alignment in Salt-Starved Rats, published by Steve Byrnes on the AI Alignment Forum. Introduction: The Dead Sea Salt Experiment In this 2014 paper by Mike Robinson and Kent Berridge at University of Michigan (see also this more theoretical follow-up discussion by Berridge and Peter Dayan), rats were raised in an environment where they were well-nourished, and in particular, where they were never salt-deprived—not once in their life. The rats were sometimes put into a test cage with a lever which, when it appeared, was immediately followed by a device spraying ridiculously salty water directly into their mouth. The rats were disgusted and repulsed by the extreme salt taste, and quickly learned to hate the lever—which from their perspective would seem to be somehow causing the saltwater spray. One of the rats went so far as to stay tight against the opposite wall—as far from the lever as possible! Then the experimenters made the rats feel severely salt-deprived, by depriving them of salt. Haha, just kidding! They made the rats feel severely salt-deprived by injecting the rats with a pair of chemicals that are known to induce the sensation of severe salt-deprivation. Ah, the wonders of modern science! ...And wouldn't you know it, almost instantly upon injection, the rats changed their behavior! When shown the lever (this time without the salt-water spray), they now went right over to that lever and jumped on it and gnawed at it, obviously desperate for that super-salty water. The end. Aren't you impressed? Aren’t you floored? You should be!!! I don’t think any standard ML algorithm would be able to do what these rats just did! Think about it: Is this Reinforcement Learning? No. RL would look like the rats randomly stumbling upon the behavior of “nibbling the lever when salt-deprived”, find it rewarding, and then adopt that as a goal via “credit assignment”. That’s not what happened. While the rats were nibbling at the lever, they had never in their life had an experience where the lever had brought forth anything other than an utterly repulsive experience. And they had never in their life had an experience where they were salt-deprived, tasted something extremely salty, and found it gratifying. I mean, they were clearly trying to interact with the lever—this is a foresighted plan we're talking about—but that plan does not seem to have been reinforced by any experience in their life. Update for clarification: Specifically, it's not any version of RL where you learn about the reward function only by observing past rewards. This category includes all model-free RL and some model-based RL (e.g. MuZero). If, by contrast, you have a version of model-based RL where the agent can submit arbitrary hypothetical queries to the true reward function, then OK, sure, now you can get the rats' behavior. I don't think that's what's going on here for reasons I'll mention at the bottom. Is this Imitation Learning? Obviously not; the rats had never seen any other rat around any lever for any reason. Is this an innate, hardwired, stimulus-response behavior? No, the connection between a lever and saltwater was an arbitrary, learned connection. (I didn't mention it, but the researchers also played a distinctive sound each time the lever appeared. Not sure how important that is. But anyway, that connection is arbitrary and learned, too.) So what’s the algorithm here? How did their brains know that this was a good plan? That’s the subject of this post. What does this have to do with inner alignment? What is inner alignment anyway? Why should we care about any of this? With apologies to the regulars on this forum who already know all this, the so-called “inner alignment problem” occurs when you, a programmer, build an intelligent, foresighted, goal-seeking agent. You want it to be trying t...
Dec 06, 2021
Against GDP as a metric for timelines and takeoff speeds by Daniel Kokotajlo
23:15
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against GDP as a metric for timelines and takeoff speeds, published by Daniel Kokotajlo on the AI Alignment Forum. [Epistemic status: Strong opinion, lightly held] I think world GDP (and economic growth more generally) is overrated as a metric for AI timelines and takeoff speeds. Here are some uses of GDP that I disagree with, or at least think should be accompanied by cautionary notes: Timelines: Ajeya Cotra thinks of transformative AI as “software which causes a tenfold acceleration in the rate of growth of the world economy (assuming that it is used everywhere that it would be economically profitable to use it).” I don’t mean to single her out in particular; this seems like the standard definition now. And I think it's much better than one prominent alternative, which is to date your AI timelines to the first time world GDP (GWP) doubles in a year! Takeoff Speeds: Paul Christiano argues for Slow Takeoff. He thinks we can use GDP growth rates as a proxy for takeoff speeds. In particular, he thinks Slow Takeoff ~= GWP doubles in 4 years before the start of the first 1-year GWP doubling. This proxy/definition has received a lot of uptake. Timelines: David Roodman’s excellent model projects GWP hitting infinity in median 2047, which I calculate means TAI in median 2037. To be clear, he would probably agree that we shouldn’t use these projections to forecast TAI, but I wish to add additional reasons for caution. Timelines: I’ve sometimes heard things like this: “GWP growth is stagnating over the past century or so; hyperbolic progress has ended; therefore TAI is very unlikely.” Takeoff Speeds: Various people have said things like this to me: “If you think there’s a 50% chance of TAI by 2032, then surely you must think there’s close to a 50% chance of GWP growing by 8% per year by 2025, since TAI is going to make growth rates go much higher than that, and progress is typically continuous.” Both: Relatedly, I sometimes hear that TAI can’t be less than 5 years away, because we would have seen massive economic applications of AI by now—AI should be growing GWP at least a little already, if it is to grow it by a lot in a few years. First, I’ll argue that GWP is only tenuously and noisily connected to what we care about when forecasting AI timelines. Specifically, the point of no return is what we care about, and there’s a good chance it’ll come years before GWP starts to increase. It could also come years after, or anything in between. Then, I’ll argue that GWP is a poor proxy for what we care about when thinking about AI takeoff speeds as well. This follows from the previous argument about how the point of no return may come before GWP starts to accelerate. Even if we bracket that point, however, there are plausible scenarios in which a slow takeoff has fast GWP acceleration and in which a fast takeoff has slow GWP acceleration. Timelines I’ve previously argued that for AI timelines, what we care about is the “point of no return,” the day we lose most of our ability to reduce AI risk. This could be the day advanced unaligned AI builds swarms of nanobots, but probably it’ll be much earlier, e.g. the day it is deployed, or the day it finishes training, or even years before then when things go off the rails due to less advanced AI systems. (Of course, it probably won’t literally be a day; probably it will be an extended period where we gradually lose influence over the future.) Now, I’ll argue that in particular, an AI-induced potential point of no return (PONR for short) is reasonably likely to come before world GDP starts to grow noticeably faster than usual. Disclaimer: These arguments aren’t conclusive; we shouldn’t be confident that the PONR will precede GWP acceleration. It’s entirely possible that the PONR will indeed come when GWP starts to grow noticeably faster than...
Dec 06, 2021
Soft takeoff can still lead to decisive strategic advantage by Daniel Kokotajlo
13:26
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Soft takeoff can still lead to decisive strategic advantage, published by Daniel Kokotajlo on the AI Alignment Forum. [Epistemic status: Argument by analogy to historical cases. Best case scenario it's just one argument among many. Edit: Also, thanks to feedback from others, especially Paul, I intend to write a significantly improved version of this post in the next two weeks. Edit: I never did, because in the course of writing my response I realized the original argument made a big mistake. See this review.] I have on several occasions heard people say things like this: The original Bostrom/Yudkowsky paradigm envisioned a single AI built by a single AI project, undergoing intelligence explosion all by itself and attaining a decisive strategic advantage as a result. However, this is very unrealistic. Discontinuous jumps in technological capability are very rare, and it is very implausible that one project could produce more innovations than the rest of the world combined. Instead we should expect something more like the Industrial Revolution: Continuous growth, spread among many projects and factions, shared via a combination of trade and technology stealing. We should not expect any one project or AI to attain a decisive strategic advantage, because there will always be other projects and other AI that are only slightly less powerful, and coalitions will act to counterbalance the technological advantage of the frontrunner. (paraphrased) Proponents of this view often cite Paul Christiano in support. Last week I heard him say he thinks the future will be "like the Industrial Revolution but 10x-100x faster." In this post, I assume that Paul's slogan for the future is correct and then nevertheless push back against the view above. Basically, I will argue that even if the future is like the industrial revolution only 10x-100x faster, there is a 30%+ chance that it will involve a single AI project (or a single AI) with the ability to gain a decisive strategic advantage, if they so choose. (Whether or not they exercise that ability is another matter.) Why am I interested in this? Do I expect some human group to take over the world? No; instead what I think is that (1) an unaligned AI in the leading project might take over the world, and (2) A human project that successfully aligns their AI might refrain from taking over the world even if they have the ability to do so, and instead use their capabilities to e.g. help the United Nations enforce a ban on unauthorized AGI projects. National ELO ratings during the industrial revolution and the modern era In chess (and some other games) ELO rankings are used to compare players. An average club player might be rank 1500; the world chess champion might be 2800; computer chess programs are even better. If one player has 400 points more than another, it means the first player would win with ~90% probability. We could apply this system to compare the warmaking abilities of nation-states and coalitions of nation-states. For example, in 1941 perhaps we could say that the ELO rank of the Axis powers was ~300 points lower than the ELO rank of the rest of the world combined (because what in fact happened was the rest of the world combining to defeat them, but it wasn't a guaranteed victory). We could add that in 1939 the ELO rank of Germany was ~400 points higher than that of Poland, and that the ELO rank of Poland was probably 400+ points higher than that of Luxembourg. We could make cross-temporal fantasy comparisons too. The ELO ranking of Germany in 1939 was probably ~400 points greater than that of the entire world circa 1910, for example. (Visualize the entirety of 1939 Germany teleporting back in time to 1910, and then imagine the havoc it would wreak.) Claim 1A: If we were to estimate the ELO rankings of all nation-states and sets...
Dec 06, 2021
My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda by Chi Nguyen
01:05:25
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda, publishedby Chi Nguyen on the AI Alignment Forum. Write a Review Crossposted from the EA forum You can read this post as a google docs instead (IMO much better to read). This document aims to clarify the AI safety research agenda by Paul Christiano (IDA) and the arguments around how promising it is. Target audience: All levels of technical expertise. The less knowledge about IDA someone has, the more I expect them to benefit from the writeup. Writing policy: I aim to be as clear and concrete as possible and wrong rather than vague to identify disagreements and where I am mistaken. Things will err on the side of being too confidently expressed. Almost all footnotes are content and not references. Epistemic Status: The document is my best guess on IDA and might be wrong in important ways. I have not verified all of the content with somebody working on IDA. I spent ~4 weeks on this and have no prior background in ML, CS or AI safety. I wrote this document last summer (2019) as part of my summer research fellowship at FHI. I was planning to restructure, complete and correct it since but haven’t gotten to it for a year, so decided to just publish it as it is. The document has not been updated, i.e. nothing that has been released since September 2019 is incorporated into this document. Paul Christiano generously reviewed the first third to a half of this summary. I added his comments verbatim in the document. Apologies for the loss of readability due to this. This doesn’t imply he endorses any part of this document, especially the second half which he didn't get to review. Purpose of this document: Clarifying IDA IDA is Paul Christiano’s AI safety research agenda.[1] Christiano works at OpenAI which is one of the main actors in AI safety and IDA is by many considered the most complete[2] AI safety agenda. However, people who are not directly working on IDA are often confused about how exactly to understand the agenda. Clarifying IDA would make it more accessible for technical people to work on and easier to assess for nontechnical people who want to think about its implications. I believe that there are currently no resources on IDA that are both easy to understand and give a complete picture. Specifically, the current main resources are: the “Iterated Amplification” sequence which is a series of curated posts by Paul Christiano that can be quite difficult to understand, this post by Ajeya Cotra and this video by Robert Miles which are both easy to understand but limited in scope and don’t provide many details, Alex Zhu’s FAQ to IDA which clarifies important points but does not set them in context with the entire research agenda, an 80,000 podcast with Paul Christiano which explains some intuitions behind IDA but is not comprehensive and is in speech form. This document aims to fill the gap and give a comprehensive and accessible overview of IDA. Summary: IDA in 7 sentences IDA stands for Iterated Amplification and is a research agenda by Paul Christiano from OpenAI. IDA addresses the artificial intelligence (AI) safety problem, specifically the danger of creating a very powerful AI which leads to catastrophic outcomes. IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running. IDA doesn’t propose a specific implementation, but presents a rough AI design and a collection of thoughts on whether this design has the potential to create safe and powerful AI and how the details of that design could look like. The proposed AI design is to use a safe but slow way of scaling up an AI’s capabilities, distill this into a faster but slightly weaker AI, which can be scal...
Dec 06, 2021
An Intuitive Guide to Garrabrant Induction by Mark Xu
43:41
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Intuitive Guide to Garrabrant Induction, published by Mark Xu on the AI Alignment Forum. This post is a high-level summary of the core insights and arguments in Logical Induction, a MIRI paper from 2016. It’s intended for people without much mathematical training. Numbers in [brackets] indicate the section of the paper from which I am drawing. A brief note on naming: Solomonoff exhibited an uncomputable algorithm that does idealized induction, which we call Solomonoff induction. Garrabrant exhibited a computable algorithm that does logical induction, which we have named Garrabrant induction. Thanks to Mauricio Baker for helpful comments. My editor is Justis Mills. Graphics are done by Sabrina Chwalek. Introduction [1] Suppose I run a computer program. What does it output? You don’t know the code, so it could do basically anything. You’re missing key information to resolve the question. However, even if you did know the source code, you might still be ignorant about what it would do. You have all the necessary information per se, and a perfect reasoner could solve it instantly, but it might take an unrealistic amount of effort for you to interpret it correctly. The former kind of uncertainty is empirical. You have to look at the world and make observations about the source code of the program, how my computer interprets the code, etc. Other examples of empirical uncertainty: not knowing what the weather is, not knowing what time it is, not knowing the name of your friend, etc. The latter kind of uncertainty is logical. Even after you’ve looked at the program and seen the source code, you still might not know what the source code will output. For instance, suppose you saw that the program printed the 173,498th digit of p i . You know what the program will do, but you don’t know the results of that process. Other examples of logical uncertainty: not knowing if 19483 is prime, not knowing whether 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 is even, not knowing if 1/1/2000 was a Monday, etc. The bottleneck in these cases isn’t missing data, but rather missing computation - you haven’t yet exerted the required energy to figure it out, and it might not always be worth it with the tools at your disposal. Let us call the process of “properly” managing logical uncertainty logical induction and reasoners that employ logical induction logical inductors. Bayesian Insufficiency Naively, one might assume that Bayesian reasoning, a general method for handling empirical uncertainty, might extend itself naturally to logical uncertainty. However, this is not the case. Imagine that I have two boxes. Suppose that you know I’m either going to place either one blue ball into each or one red ball into each. Your beliefs about what color ball is in each of the boxes are now linked; if you see a blue ball in one of the boxes, you know that the other box contains a blue ball. Now imagine that I give one of the boxes to my friend Alice and the other box to my friend Bob. You know that Alice really likes matching; if she gets a blue ball, she’ll wear blue clothes, if she gets a red ball, she’ll wear red clothes. You also know that Bob really likes traveling; if he gets a blue ball, he’ll go to the ocean, if he gets a red ball, he’ll go to the desert. Since your beliefs about the color of balls Alice and Bob received are linked, your beliefs about where Bob travels and what color Alice wears are also linked. If you see Alice wearing blue, it’s more likely she got a blue ball than a red ball, which means Bob also probably got a blue ball, which means Bob went to the ocean. Suppose that Bob has friends Carol and Dave. Carol likes the ocean, so Bob goes to the ocean with Carol, and Dave likes the desert, so Bob goes to the desert with Dave. Now your beliefs about what Alice is w...
Dec 06, 2021
Prisoners' Dilemma with Costs to Modeling by Scott Garrabrant
14:50
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Prisoners' Dilemma with Costs to Modeling, published by Scott Garrabrant on the AI Alignment Forum. We consider a modification to the open source prisoners' dilemma in which agents must pay some resources to model each other. We will use the modal combat framework, but where agents pay a cost proportional to the depth of boxes in their code. Even a small modeling penalty makes the FairBot-FairBot outcome no longer an equilibrium, since the best response to FairBot is to be CooperateBot and not pay the modeling penalty. The best response to CooperateBot is to be DefectBot, and the pure DefectBot-DefectBot outcome is a stable Nash equilibrium. In fact, I believe that DefectBot-DefectBot is the unique pure strategy Nash equilibrium. Amazingly, this turns out to be okay! For small modeling penalties, there is a mixed strategy equilibrium which mixes between CooperateBot, FairBot, and PrudentBot! Both players get exactly the same utility in expectation as the FairBot-FairBot outcome. Further, if you consider an evolutionary system where populations reproduce in proportion to how well they do in prisoners' dilemmas with each other, it appears that as the modeling penalty gets small, the basin of the defect equilibrium also gets small, and nearly all initial conditions cycle around CooperateBot, FairBot, and PrudentBot! This post came out of conversations with Sam Eisenstat, Abram Demski, Tsvi Benson-Tilsen, and Andrew Critch. It is a first draft that could use a coauthor to carefully check everything, expand on it, and turn it into a paper. If you think you could do that with minimal guidance from me, let me know. Formalism We will be using the modal combat framework, and identifying ⊤ with cooperation and ⊥ with defection. Agents are defined to formulas that combine the other agent X run on various agents using propositional calculus and a modal operator □ . The □ represents provability, and every instance of X run on an agent in the formula must be contained within a □ . Recall some common modal agents: CooperateBot is defined by C B X ↔ ⊤ DefectBot is defined by D B X ↔ ⊥ FairBot is defined by F B X ↔ □ X F B PrudentBot is defined by P B X ↔ □ X P B ∧ X D B → □ ⊥ These 4 agents interact with each other as follows: CooperateBot cooperates with everyone. DefectBot defects against everyone. FairBot defects against only DefectBot. PrudentBot defects against CooperateBot and DefectBot and cooperates with itself and FairBot. We will say that the depth of an agent is the maximum of the depth of □ s in its code and the depth of the agents that it calls the opponent on. CooperateBot and DefectBot have depth 0, FairBot has depth 1, and PrudentBot has depth 2. We will use a prisoner's dilemma where mutual cooperation produces utility 2, mutual defiction produces utility 1, and exploitation produces utility 3 for the exploiter and 0 for the exploited. Each player will also pay a penalty of ε times its depth. Pure Equilibria The best response to both CooperateBot and DefectBot is DefectBot, since when the opponent does not depend on you, you want to defect with the least possible penalty. The best response to FairBot is CooperateBot, since you can't exploit FairBot, so you want to get mutual cooperation with the least possible penalty. The best response to PrudentBot is FairBot, since you can't exploit PrudentBot, you can't mutually cooperate with penalty 0, but you can mutually cooperate with penalty 1 by being FairBot. (This is assuming ε is at less than 1 2 . Otherwise, you just want to defect to avoid the penalty.) Thus, if the only options are CooperateBot, DefectBot, FairBot, and PrudentBot, the unique pure strategy equilibrium is mutual DefectBot. I believe that DefectBot is the only pure strategy equilibrium in general. This would follow directly from the fact that if a dept...
Dec 06, 2021
How much chess engine progress is about adapting to bigger computers? by Paul Christiano
09:48
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How much chess engine progress is about adapting to bigger computers?, published by Paul Christiano on the AI Alignment Forum. (This question comes from a discussion with Carl Shulman.) In this post I describe an experiment that I'd like to see run. I'm posting a $1,000 - $10,000 prize for a convincing implementation of these experiments. I also post a number of smaller prizes for relevant desk research or important corrections to this request. Motivation In order to understand the dynamics of the singularity, I'd like to understand how easy it is to improve algorithms and software. We can learn something about this from looking at chess engines. It's not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics. In order to quantify the quality of a chess engine, we can fix a level of play and ask "How much compute is needed for the engine to play at that level?" One complication in evaluating the rate of progress is that it depends on what level of play we use for evaluation. In particular, newer algorithms are generally designed to play at a much higher level than older algorithms. So if we quantify the compute needed to reach modern levels of play, we will capture both absolute improvements and also "adaptation" to the new higher amounts of compute. So we'd like to attribute progress in chess engines to three factors: Better software. Bigger computers. Software that is better-adapted to new, bigger computers. Understanding the size of factor #1 is important for extrapolating progress given massive R&D investments in software. While it is easy to separate factors #1 and #2 from publicly available information, it is not easy to evaluate factor #3. Experiment description Pick two (or more) software engines from very different times. They should both be roughly state of the art, running on "typical" machines from the era (i.e. the machines for which R&D is mostly targeted). We then carry out two matches: Run the old engine on its "native" hardware (the "old hardware"). Then evaluate: how little compute does the new engine need in order to beat the old engine? Run the new engine on its "native" hardware (the "new hardware"). Then evaluate: how much compute does the old engine need in order to beat the new engine? With some effort, we can estimate a quantitative ratio of "ops needed" for each of these experiments. For example, we may find that the new engine is able to beat the old engine using only 1% of the "old hardware." Whereas we may find that the old engine would require 10,000x the "new hardware" in order to compete with the new engine. The first experiment tells us about the absolute improvements in chess engines on the task for which the old engine was optimized. (This understates the rate of software progress to the extent that people stopped working on this task.) The second experiment gives us the combination of absolute improvements + adaptation to new hardware. Typical measures of "rate of software progress" will be somewhere in between, and are sensitive to the hardware on which the evaluation is carried out. I believe that understanding these two numbers would give us a significantly clearer picture of what's really going on with software progress in chess engines. Experiment details Here's some guesses about how to run this experiment well. I don't know much about computer chess, so you may be able to make a better proposal. Old engine, old hardware: my default proposal is the version of Fritz that won the 1995 world computer chess championship, using the same amount of hardware (and time controls) as in that championship. This algorithm seems like a particularly reasonable "best effort" at making full use of available computing resources. I don't want to compare an engi...
Dec 06, 2021
Debate update: Obfuscated arguments problem by Beth Barnes
26:44
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Debate update: Obfuscated arguments problem, published by Beth Barnes on the AI Alignment Forum. This is an update on the work on AI Safety via Debate that we previously wrote about here. Authors and Acknowledgements The researchers on this project were Elizabeth Barnes and Paul Christiano, with substantial help from William Saunders (who built the current web interface as well as other help), Joe Collman (who helped develop the structured debate mechanisms), and Mark Xu, Chris Painter, Mihnea Maftei and Ronny Fernandez (who took part in many debates as well as helping think through problems). We're also grateful to Geoffrey Irving and Evan Hubinger for feedback on drafts, and for helpful conversations, along with Richard Ngo, Daniel Ziegler, John Schulman, Amanda Askell and Jeff Wu. Finally, we're grateful to our contractors who participated in experiments, including Adam Scherlis, Kevin Liu, Rohan Kapoor and Kunal Sharda. What we did We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don’t have a fix for this “obfuscated argument” problem, and believe it might be an important quantitative limitation for both IDA and Debate. Key takeaways and relevance for alignment Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. [1] We’re working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation.), and by learning and generalizing from data (pattern matching, bayesian updating.). Debate focuses on verifying things via step-by-step reasoning. It seems plausible that a substantial proportion of the things a model ‘knows’ will have some long but locally human-understandable argument for their correctness. [2] Previously we hoped that debate/IDA could verify any knowledge for which such human-understandable arguments exist, even if these arguments are intractably large. We hoped the debaters could strategically traverse small parts of the implicit large argument tree and thereby show that the whole tree could be trusted. The obfuscated argument problem suggests that we may not be able to rely on debaters to find flaws in large arguments, so that we can only trust arguments when we could find flaws by recursing randomly---e.g. because the argument is small enough that we could find a single flaw if one existed, or because the argument is robust enough that it is correct unless it has many flaws. This suggests that while debates may let us verify arguments too large for unaided humans to understand, those arguments may still have to be small relative to the computation used during training. We believe that many important decisions can’t be justified with arguments small or robust enough to verify in this way. To supervise ML systems that make such decisions, we either need to find some restricted class of arguments for which we believe debaters can reliably find flaws, or we need to be able to trust the representations or heuristics that our models learn from the training data (rather than verifying them in a given case via debate). We have been thinking about approaches like learning the prior to help trust our models’ generalization. This is probably better investigated through ML experiments or theoretical ...
Dec 06, 2021
The Fusion Power Generator Scenario by johnswentworth
05:30
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is:The Fusion Power Generator Scenario , published by johnswentworth on the AI Alignment Forum. Suppose, a few years from now, I prompt GPT-N to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. GPT-N succeeds. I build the fusion power generator, find that it works exactly as advertised, share the plans online, and soon the world has easy access to cheap, clean power. One problem: at no point did it occur to me to ask “Can this design easily be turned into a bomb?”. Had I thought to prompt it with the question, GPT-N would have told me that the design could easily be turned into a bomb. But I didn’t think to ask, so GPT-N had no reason to mention it. With the design in wide use, it’s only a matter of time until people figure it out. And so, just like that, we live in a world where anyone can build a cheap thermonuclear warhead in their garage. This scenario highlights a few key constraints which I think are under-appreciated in alignment today. Sharing Information is Irreversible I’ve heard people say that we can make AI safe(r) by restricting the AI’s action space to things which we can undo. Problem is, sharing information is irreversible; once the cat is out of the bag, there’s no getting it back into the bag. And if an AI can’t share information, there’s very little that it can do. Not much point in an AI which just can’t do anything observable at all. (One could design an AI to “move in mysterious ways”, but I have trouble imagining that it ends up safer that way.) This is a problem when information itself is dangerous, e.g. knowledge of how to build a thermonuclear warhead in one’s garage. Humans Are Not Safe Two key properties of humans: We do not have full introspective understanding of our own wants We do not have the processing power to fully understand the consequences of changes Sometimes, we get something we thought we wanted, and find out that we don’t want it after all. Either we misunderstood our own wants, or misunderstood the full implications of the change. Most of the time, this isn’t that huge an issue. We lose some money and/or time, but we move on. But if a human is capable of making large, irreversible changes to the world, then the problem becomes more serious. A human with access to powerful AI - even something as conceptually simple as GPT-N - is capable of making large irreversible changes, and they do not have the processing power to fully understand the implications of those changes. In general, a human won’t even know the right questions to ask. So, if a system’s safety relies on a human asking the right questions, then the system is not safe. In particular, this is relevant to the HCH family of alignment schemes (e.g. IDA), as well as human-imitating AI more broadly. Corollary: Tool AI Is Not Inherently Safe Tool AI, in particular, relies primarily on human operators for safety. Just like a tablesaw is safe if-and-only-if the operator uses it safely, tool AI is safe if-and-only-if the operator uses it safely. With a tablesaw, that’s usually fine. It’s pretty obvious what sorts of things will lead to bad outcomes from a tablesaw. But the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans - which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues. So: if the system’s safety relies on a human using it safely, then it’s not safe. If you want a concrete, evocative analogy: picture a two-year-old playing on top of a tablesaw. That said, people are designing tablesaws which auto-stop when skin contacts the blade. In general, a system’s designers may unders...
Dec 05, 2021
The Alignment Problem: Machine Learning and Human ValuesRobust Delegation by Rohin Shah
10:08
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Alignment Problem: Machine Learning and Human Values, published by Rohin Shah on the AI Alignment Forum. This is a linkpost for The Alignment Problem: Machine Learning and Human Values, by Brian Christian, was just released. This is an extended summary + opinion, a version without the quotes from the book will go out in the next Alignment Newsletter. Summary: This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of: - The gorilla misclassification incident - The faulty reward in CoastRunners - The gender bias in language models - The failure of facial recognition models on minorities - The COMPAS controversy (leading up to impossibility results in fairness) - The neural net that thought asthma reduced the risk of pneumonia It then moves on to agency and reinforcement learning, covering from a more historical and academic perspective how we have arrived at such ideas as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming and mesa optimization that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, rewards will do ~nothing, and the main requirement to get a competent agent is to provide good shaping rewards or a good curriculum. Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example): - BF Skinner's experiments in training pigeons - The invention of the perceptron - The success of TD-Gammon, and later AlphaGo Zero The final part, titled "Normativity", delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities -- how to get AI systems that optimize for their objectives -- this last one tackles head on the problem that we want AI systems that optimize for our (often-unknown) objectives, covering such topics as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty. Opinion: I really enjoyed this book, primarily because of the tracing of the intellectual history of various ideas. While I knew of most of these ideas, and often also who initially came up with the ideas, it's much more engaging to read the detailed stories of _how_ that person came to develop the idea; Brian's book delivers this again and again, functioning like a well-organized literature survey that is also fun to read because of its great storytelling. I struggled a fair amount in writing this summary, because I kept wanting to somehow communicate the writing style; in the end I decided not to do it and to instead give a few examples of passages from the book in this post. Passages: Note: It is generally not allowed to have quotations this long from this book; I have specifically gotten permission to do so. Here’s an example of agents with evolved inner reward functions, which lead to the inner alignment problems we’ve previously worried about: They created a two-dimensional virtual world in which simulated organisms (or “agents”) could move around a landscape, eat, be preyed upon, and reproduce. Each organism’s “genetic code” contained the agent’s reward function: how much it liked food, how much it disliked being near predators, and so forth. During its lifetime, it would use reinforcement learning to learn how to take actions to maximize these rewards. When an organism reproduced, ...
Dec 05, 2021
The Commitment Races problem by Daniel Kokotajlo
09:24
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Commitment Races problem , published by Daniel Kokotajlo on the AI Alignment Forum. [Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.] This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem: Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it. For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day. Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate. Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this. In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter. So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant. Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t. Logical updateles...
Dec 05, 2021
What I’ll be doing at MIRI by Evan Hubinger
02:07
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I’ll be doing at MIRI, published by Evan Hubinger on the AI Alignment Forum. Note: This is a personal post describing my own plans, not a post with actual research content. Having finished my internship working with Paul Christiano and others at OpenAI, I’ll be moving to doing research at MIRI. I’ve decided to do research at MIRI because I believe MIRI will be the easiest, most convenient place for me to continue doing research in the near future. That being said, there are a couple of particular aspects of what I’ll be doing at MIRI that I think are worth being explicit about. First, and most importantly, this decision does not represent any substantive change in my beliefs regarding AI safety. In particular, my research continues to be focused around solving inner alignment for amplification. My post on relaxed adversarial training continues to represent a fairly up-to-date form of what I think needs to be done along these lines. Second, my research will remain public by default. I have discussed with MIRI their decision to make their research non-disclosed-by-default and we agreed that my research agenda is a reasonable exception. I strongly believe in the importance of collaborating with both the AI safety and machine learning communities and thus believe in the need for sharing research. Of course, I also fully believe in the importance of carefully reviewing possible harmful effects from publishing before disclosing results—and will continue to do so with all of my research—though I will attempt to publish anything I don’t believe to pose a meaningful risk. Third—and this should go without saying—I fully anticipate continuing to collaborate with other researchers at other institutions such as OpenAI, Ought, CHAI, DeepMind, FHI, etc. The task of making AGI safe is a huge endeavor that I fully believe will require the joint work of an entire field. If you are interested in working with me on anything (regarding inner alignment or anything else) please don’t hesitate to send me an email at evanjhub@gmail.com. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
Problem relaxation as a tactic by Alex Turner
09:28
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Problem relaxation as a tactic, published by Alex Turner on the AI Alignment Forum. It's easier to make your way to the supermarket than it is to compute the fastest route, which is yet easier than computing the fastest route for someone running backwards and doing two and a half jumping jacks every five seconds and who only follows the route p percent of the time. Sometimes, constraints are necessary. Constraints come with costs. Sometimes, the costs are worth it. Aspiring researchers trying to think about AI alignment might[1] have a failure mode which goes something like. this: Oh man, so we need to solve both outer and inner alignment to build a superintelligent agent which is competitive with unaligned approaches and also doesn't take much longer to train, and also we have to know this ahead of time. Maybe we could use some kind of prediction of what people want... but wait, there's also problems with using human models! How can it help people if it can't model people? Ugh, and what about self-modification?! How is this agent even reasoning about the universe from inside the universe? The aspiring researcher slumps in frustration, mutters a curse under their breath, and hangs up their hat – "guess this whole alignment thing isn't for me...". And isn't that so? All their brain could do was pattern-match onto already-proposed solutions and cached thinking. There's more than one thing going wrong here, but I'm just going to focus on one. Given that person's understanding of AI alignment, this problem is wildly overconstrained. Whether or not alignment research is right for them, there's just no way that anyone's brain is going to fulfill this insane solution request! Sometimes, constraints are necessary. I think that the alignment community is pretty good at finding plausibly necessary constraints. Maybe some of the above aren't necessary – maybe there's One Clever Trick you come up with which obviates one of these concerns. Constraints come with costs. Sometimes, the costs are worth it. In this context, I think the costs are very much worth it. Under this implicit framing of the problem, you're pretty hosed if you don't get even outer alignment right. However, even if the real problem has crazy constraints, that doesn't mean you should immediately tackle the fully constrained problem. I think you should often relax the problem first: eliminate or weaken constraints until you reach a problem which is still a little confusing, but which you can get some traction on. Even if you know an unbounded solution to chess, you might still be 47 years away from a bounded solution. But if you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem. If you can't even solve a problem given infinite computing power, you definitely can't solve it using bounded computing power. (Imagine Poe trying to write a chess-playing program before he'd had the insight about search trees.) ~ The methodology of unbounded analysis Historically, I tend to be too slow to relax research problems. On the flipside, all of my favorite research ideas were directly enabled by problem relaxation. Instead of just telling you what to do and then having you forget this advice in five minutes, I'm going to paint it into your mind using two stories. Attainable Utility Preservation It's spring of 2018, and I've written myself into a corner. My work with CHAI for that summer was supposed to be on impact measurement, but I inconveniently posted a convincing-to-me argument that impact measurement cannot admit a clean solution: I want to penalize the AI for having side effects on the world.[2] Suppose I have a function which looks at the consequences of the agent's actions and magically returns all of the side eff...
Dec 05, 2021
Zero Sum is a misnomer by Abram Demski
11:44
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Zero Sum is a misnomer, published by Abram Demski on the AI Alignment Forum. This could have been a relatively short note about why "zero sum" is a misnomer, but I decided to elaborate some consequences. This post benefited from discussion with Sam Eisenstat. "Zero Sum" is a misnomer. The term intuitively suggests that an interaction is transferring resources from one person to another. For example, theft is zero-sum in the sense that it cannot create resources only transfer them. Elections are zero-sum in the sense that they only transfer power. And so on. But this is far from the technical meaning of the term. In order for the standard rationality assumptions used in game theory to apply, the payouts of a game must be utilities, not resources such as money, power, or personal property. Zero-sum transfer of resources is often far from zero-sum in utility. But I'm getting ahead of myself. Let's examine the technical meaning of "zero sum" more precisely. It's used to mean "constant sum". The term "zero sum" is often used as a technical term, referring to games where the payouts for different players always sums to the same thing. For example, the game rock-paper-scissors is zero sum, because it always has one winner and one loser. More generally, constant-sum means that if you add up the utility functions of the players, you get a perfectly flat function. "Constant sum" doesn't really make sense as a category. It makes sense to conflate "zero sum" and "constant sum" because utility functions are equivalent under additive and positive multiplicative transforms, so we can always transform a constant-sum game down to a zero-sum game. However, by that same token, the concept of "constant sum" is meaningless: we can multiply the utility of one side or the other, and still have the same game. If you have good reflexes, you should hear "zero sum"/"constant sum" and shout "Type error! Radiation leak! You can't sum utilities without providing extra assumptions!" Let's look at the "zero sum" game matching pennies as an example. In this game, two players have to say "heads" or "tails" simultaneously. One player is trying to match the other, while the other player is trying to be different from the one. Here's one way of writing the payoff matrix (with Alice trying to match): Bob Alice Heads Tails Heads Alice: 1 Bob: 0 Alice: 0 Bob: 1 Tails Alice: 0 Bob: 1 Alice: 1 Bob: 0 In that case, the game has a constant sum of 1. We can re-scale it to have a constant sum of zero by subtracting 1/2 from all the scores: Bob Alice Heads Tails Heads Alice: +1/2 Bob: -1/2 Alice: -1/2 Bob: +1/2 Tails Alice: -1/2 Bob: +1/2 Alice: +1/2 Bob: -1/2 But notice that we could just as well have re-scaled it to be zero sum by subtracting 1 from Alice's score: Bob Alice Heads Tails Heads Alice: 0 Bob: 0 Alice: -1 Bob: 1 Tails Alice: -1 Bob: 1 Alice: 0 Bob: 0 Notice that this is exactly the same game, but psychologically, we think of it much differently. In particular, the game now seems unfair to Alice: Bob only stands to gain, but Alice can lose! Just like I mentioned earlier, we're tempted to think of the game as if it's an interaction in which resources are exchanged. I'm not saying this is a bad thing to think about. In real life, there are situations we can understand as games of resource exchange much more often than there are single-shot games where the payoffs are clearly identifiable in utility terms. I just want to emphasize that resource exchange is not what basic game theory is about, so you should be very careful not to confuse the two! Now, as I mentioned earlier, we can also re-scale utilities without changing what they mean, and therefore, without changing the game: Bob Alice Heads Tails Heads Alice: 100 Bob: 0 Alice: 0 Bob: 1 Tails Alice: 0 Bob: 1 Alice: 100 Bob: 0 This game is equivalent t...
Dec 05, 2021
Challenges to Christiano’s capability amplification proposal by Eliezer Yudkowsky
36:59
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Challenges to Christiano’s capability amplification proposal, published by Eliezer Yudkowsky on the AI Alignment Forum. The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below. I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections and then he replies in a way that locally answers some of my question but I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works." These difficulties are not minor or technical; they appear to me quite severe. I try to walk through the details below. It should be understood at all times that I do not claim to be able to pass Paul’s ITT for Paul’s view and that this is me criticizing my own, potentially straw misunderstanding of what I imagine Paul might be advocating. Paul Christiano Overall take: I think that these are all legitimate difficulties faced by my proposal and to a large extent I agree with Eliezer's account of those problems (though not his account of my current beliefs). I don't understand exactly how hard Eliezer expects these problems to be; my impression is "just about as hard as solving alignment from scratch," but I don't have a clear sense of why. To some extent we are probably disagreeing about alternatives. From my perspective, the difficulties with my approach (e.g. better understanding the forms of optimization that cause trouble, or how to avoid optimization daemons in systems about as smart as you are, or how to address X-and-only-X) are also problems for alternative alignment approaches. I think it's a mistake to think that tiling agents, or decision theory, or naturalized induction, or logical uncertainty, are going to make the situation qualitatively better for these problems, so work on those problems looks to me like procrastinating on the key difficulties. I agree with the intuition that progress on the agent foundations agenda "ought to be possible," and I agree that it will help at least a little bit with the problems Eliezer describes in this document, but overall agent foundations seems way less promising than a direct attack on the problems (given that we haven’t tried the direct attack nearly enough to give up). Working through philosophical issues in the context of a concrete alignment strategy generally seems more promising to me than trying to think about them in the abstract, and I think this is evidenced by the fact that most of the core difficulties in my approach would also afflict research based on agent foundations. The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether. That seems to me like a very serious handicap, so I'd only go that direction once I was quite pessimistic about solving these problems. My subjective experience is of making continuous significant progress rather than being stuck. I agree there is clear evidence that the problems are "difficult" in the sense that we are going to have to make progress in order to solve them, but not that they are "difficult" in the sense that P vs. NP or even your typical open problem in CS is probably difficult (and even then if your options were "prove P != NP" or "try to beat Google at building an AGI without using large-scale ML," I don't think it's obvious which option you should consider more promising). First and foremost, I don't understand how "preserving alignment while amplifying capabilitie...
Dec 05, 2021
AI Safety Success Stories by Wei Dai
06:52
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety Success Stories, published by Wei Dai on the AI Alignment Forum. AI safety researchers often describe their long term goals as building "safe and efficient AIs", but don't always mean the same thing by this or other seemingly similar phrases. Asking about their "success stories" (i.e., scenarios in which their line of research helps contribute to a positive outcome) can help make clear what their actual research aims are. Knowing such scenarios also makes it easier to compare the ambition, difficulty, and other attributes of different lines of AI safety research. I hope this contributes to improved communication and coordination between different groups of people working on AI risk. In the rest of the post, I describe some common AI safety success stories that I've heard over the years and then compare them along a number of dimensions. They are listed in roughly the order in which they first came to my attention. (Suggestions welcome for better names for any of these scenarios, as well as additional success stories and additional dimensions along which they can be compared.) The Success Stories Sovereign Singleton AKA Friendly AI, an autonomous, superhumanly intelligent AGI that takes over the world and optimizes it according to some (perhaps indirect) specification of human values. Pivotal Tool An oracle or task AGI, which can be used to perform a pivotal but limited act, and then stops to wait for further instructions. Corrigible Contender A semi-autonomous AGI that does not have long-term preferences of its own but acts according to (its understanding of) the short-term preferences of some human or group of humans, it competes effectively with comparable AGIs corrigible to other users as well as unaligned AGIs (if any exist), for resources and ultimately for influence on the future of the universe. Interim Quality-of-Life Improver AI risk can be minimized if world powers coordinate to limit AI capabilities development or deployment, in order to give AI safety researchers more time to figure out how to build a very safe and highly capable AGI. While that is proceeding, it may be a good idea (e.g., politically advisable and/or morally correct) to deploy relatively safe, limited AIs that can improve people's quality of life but are not necessarily state of the art in terms of capability or efficiency. Such improvements can for example include curing diseases and solving pressing scientific and technological problems. (I want to credit Rohin Shah as the person that I got this success story from, but can't find the post or comment where he talked about it. Was it someone else?) Research Assistant If an AGI project gains a lead over its competitors, it may be able to grow that into a larger lead by building AIs to help with (either safety or capability) research. This can be in the form of an oracle, or human imitation, or even narrow AIs useful for making money (which can be used to buy more compute, hire more human researchers, etc). Such Research Assistant AIs can help pave the way to one of the other, more definitive success stories. Examples: 1, 2. Comparison Table Sovereign Singleton Pivotal Tool Corrigible Contender Interim Quality-of-Life Improver Research Assistant Autonomy High Low Medium Low Low AI safety ambition / difficulty Very High Medium High Low Low Reliance on human safety Low High High Medium Medium Required capability advantage over competing agents High High None None Low Tolerates capability trade-off due to safety measures Yes Yes No Yes Some Assumes strong global coordination No No No Yes No Controlled access Yes Yes No Yes Yes (Note that due to limited space, I've left out a couple of scenarios which are straightforward recombinations of the above success stories, namely Sovereign Contender and Corrigible Singleton. I also left out C...
Dec 05, 2021
My current framework for thinking about AGI timelines by Alex Zhu
05:49
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My current framework for thinking about AGI timelines, published by Alex Zhu on the AI Alignment Forum. At the beginning of 2017, someone I deeply trusted said they thought AGI would come in 10 years, with 50% probability. I didn't take their opinion at face value, especially since so many experts seemed confident that AGI was decades away. But the possibility of imminent apocalypse seemed plausible enough and important enough that I decided to prioritize investigating AGI timelines over trying to strike gold. I left the VC-backed startup I'd cofounded, and went around talking to every smart and sensible person I could find who seemed to have opinions about when humanity would develop AGI. My biggest takeaways after 3 years might be disappointing -- I don't think the considerations currently available to us point to any decisive conclusion one way or another, and I don't think anybody really knows when AGI is coming. At the very least, the fields of knowledge that I think bear on AGI forecasting (including deep learning, predictive coding, and comparative neuroanatomy) are disparate, and I don't know of any careful and measured thinkers with all the relevant expertise. That being said, I did manage to identify a handful of background variables that consistently play significant roles in informing people's intuitive estimates of when we'll get to AGI. In other words, people would often tell me that their estimates of AGI timelines would significantly change if their views on one of these background variables changed. I've put together a framework for understanding AGI timelines based on these background variables. Among all the frameworks for AGI timelines I've encountered, it's the framework that most comprehensively enumerates crucial considerations for AGI timelines, and it's the framework that best explains how smart and sensible people might arrive at vastly different views on AGI timelines. Over the course of the next few weeks, I'll publish a series of posts about these background variables and some considerations that shed light on what their values are. I'll conclude by describing my framework for how they come together to explain various overall viewpoints on AGI timelines, depending on different prior assumptions on the values of these variables. By trade, I'm a math competition junkie, an entrepreneur, and a hippie. I am not an expert on any of the topics I'll be writing about -- my analyses will not be comprehensive, and they might contain mistakes. I'm sharing them with you anyway in the hopes that you might contribute your own expertise, correct for my epistemic shortcomings, and perhaps find them interesting. I'd like to thank Paul Christiano, Jessica Taylor, Carl Shulman, Anna Salamon, Katja Grace, Tegan McCaslin, Eric Drexler, Vlad Firiou, Janos Kramar, Victoria Krakovna, Jan Leike, Richard Ngo, Rohin Shah, Jacob Steinhardt, David Dalrymple, Catherine Olsson, Jelena Luketina, Alex Ray, Jack Gallagher, Ben Hoffman, Tsvi BT, Sam Eisenstat, Matthew Graves, Ryan Carey, Gary Basin, Eliana Lorch, Anand Srinivasan, Michael Webb, Ashwin Sah, Yi Sun, Mark Sellke, Alex Gunning, Paul Kreiner, David Girardo, Danit Gal, Oliver Habryka, Sarah Constantin, Alex Flint, Stag Lynn, Andis Draguns, Tristan Hume, Holden Lee, David Dohan, and Daniel Kang for enlightening conversations about AGI timelines, and I'd like to apologize to anyone whose name I ought to have included, but forgot to include. Table of contents As I post over the coming weeks, I'll update this table of contents with links to the posts, and I might update some of the titles and descriptions. How special are human brains among animal brains? Humans can perform intellectual feats that appear qualitatively different from those of other animals, but are our brains really doing anything so different? How u...
Dec 05, 2021
Book review: "A Thousand Brains" by Jeff Hawkins, Steve Byrnes
31:27
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Book review: "A Thousand Brains" , published by Jeff Hawkins, Steve Byrnes on the AI Alignment Forum. Jeff Hawkins gets full credit for getting me first interested in the idea that neuroscience might lead to artificial general intelligence—an idea which gradually turned into an all-consuming hobby, and more recently a new job. I'm not alone in finding him inspiring. Andrew Ng claimed here that Hawkins helped convince him, as a young professor, that a simple scaled-up learning algorithm could reach Artificial General Intelligence (AGI). (Ironically, Hawkins scoffs at the deep neural nets built by Ng and others—Hawkins would say: "Yes yes, a simple scaled-up learning algorithm can reach AGI, but not that learning algorithm!!") Hawkins's last book was On Intelligence in 2004. What's he been up to since then? Well, if you don't want to spend the time reading his journal articles or watching his research meetings on YouTube, good news for you—his new book, A Thousand Brains, is out! There’s a lot of fascinating stuff here. I'm going to pick and choose a couple topics that I find especially interesting and important, but do read the book for much more that I'm not mentioning. A grand vision of how the brain works Many expert neuroscientists think that the brain is horrifically complicated, and we are centuries away from understanding it well enough to build AGI (i.e., computer systems that have the same kind of common-sense and flexible understanding of the world and ability to solve problems that humans do). Not Jeff Hawkins! He thinks we can understand the brain well enough to copy its principles into an AGI. And he doesn't think that goal is centuries away. He thinks we're most of the way there! In an interview last year he guessed that we’re within 20 years of finishing the job. The people arguing that the brain is horrifically complicated seem at first glance to have a strong case. The brain has a whopping 10 11 neurons with 10 14 synapses, packed full of intricate structure. One study found 180 distinct areas within the cerebral cortex. Neuroscience students pour over huge stacks of flashcards with terms like “striatum”, “habenula”, “stria medullaris”, “fregula”, and "interpeduncular nucleus". (Quiz: Which of those are real brain regions, and which are types of pasta?) Every year we get another 50,000 or so new neuroscience papers dumped into our ever-deepening ocean of knowledge about the brain, with no end in sight. So the brain is indeed horrifically complicated. Right? Well, Jeff Hawkins and like-minded thinkers have a rebuttal, and it comes in two parts: 1. The horrific complexity of the “old brain” doesn’t count, because we don’t need it for AGI According to Hawkins, much of the brain—including a disproportionate share of the brain's horrific complexity, like the interpeduncular nucleus I mentioned—just doesn’t count. Yes it’s complicated. But we don’t care, because understanding it is not necessary for building AGI. In fact, understanding it is not even helpful for building AGI! I’m talking here about the distinction between what Hawkins calls “old brain vs new brain”. The “new brain” is the mammalian neocortex, a wrinkly sheet on that is especially enlarged in humans, wrapping around the outside of the human brain, about 2.5 mm thick and the size of a large dinner napkin (if you unwrinkled it). The “old brain” is everything else in the brain, which (says Hawkins) is more similar between mammals, reptiles, and so on. “The neocortex is the organ of intelligence,” writes Hawkins. “Almost all the capabilities we think of as intelligence—such as vision, language, music, math, science, and engineering—are created by the neocortex. When we think about something, it is mostly the neocortex doing the thinking.. If we want to understand intelligence, then we have to un...
Dec 05, 2021
The date of AI Takeover is not the day the AI takes over by Daniel Kokotajlo
04:08
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The date of AI Takeover is not the day the AI takes over, published Daniel Kokotajlo on the AI Alignment Forum. Instead, it’s the point of no return—the day we AI risk reducers lose the ability to significantly reduce AI risk. This might happen years before classic milestones like “World GWP doubles in four years” and “Superhuman AGI is deployed." The rest of this post explains, justifies, and expands on this obvious but underappreciated idea. (Toby Ord appreciates it; see quote below). I found myself explaining it repeatedly, so I wrote this post as a reference. AI timelines often come up in career planning conversations. Insofar as AI timelines are short, career plans which take a long time to pay off are a bad idea, because by the time you reap the benefits of the plans it may already be too late. It may already be too late because AI takeover may already have happened. But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late. Here are three examples: [UPDATE: More fleshed-out examples can be found in this new post.] Superhuman agent AGI is still in its box but nobody knows how to align it and other actors are going to make their own version soon, and there isn’t enough time to convince them of the risks. They will make and deploy agent AGI, it will be unaligned, and we have no way to oppose it except with our own unaligned AGI. Even if it takes years to actually conquer the world, it’s already game over. Various weak and narrow AIs are embedded in the economy and beginning to drive a slow takeoff; capabilities are improving much faster than safety/alignment techniques and due to all the money being made there’s too much political opposition to slowing down capability growth or keeping AIs out of positions of power. We wish we had done more safety/alignment research earlier, or built a political movement earlier when opposition was lower. Persuasion tools have destroyed collective epistemology in the relevant places. AI isn’t very capable yet, except in the narrow domain of persuasion, but everything has become so politicized and tribal that we have no hope of getting AI projects or governments to take AI risk seriously. Their attention is dominated by the topics and ideas of powerful ideological factions that have access to more money and data (and thus better persuasion tools) than us. Alternatively, maybe we ourselves have fallen apart as a community, or become less good at seeking the truth and finding high-impact plans. Conclusion: We should remember that when trying to predict the date of AI takeover, what we care about is the date it’s too late for us to change the direction things are going; the date we have significantly less influence over the course of the future than we used to; the point of no return. This is basically what Toby Ord said about x-risk: “So either because we’ve gone extinct or because there’s been some kind of irrevocable collapse of civilization or something similar. Or, in the case of climate change, where the effects are very delayed that we’re past the point of no return or something like that. So the idea is that we should focus on the time of action and the time when you can do something about it rather than the time when the particular event happens.” Of course, influence over the future might not disappear all on one day; maybe there’ll be a gradual loss of control over several years. For that matter, maybe this gradual loss of control began years ago and continues now... We should keep these possibilities in mind as well. [Edit: I now realize that I should distinguish betwee...
Dec 05, 2021
How do we prepare for final crunch time?Q by Eli Tyre
12:09
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How do we prepare for final crunch time?Q , published by Eli Tyre on the AI Alignment Forum. [Crossposted from Musings and Rough Drafts.] Epistemic status: Brainstorming and first draft thoughts. Inspired by something that Ruby Bloom wrote and the Paul Christiano episode of the 80,000 hours podcast.] One claim I sometimes hear about AI alignment [paraphrase]: It is really hard to know what sorts of AI alignment work are good this far out from transformative AI. As we get closer, we’ll have a clearer sense of what AGI / Transformative AI is likely to actually look like, and we’ll have much better traction on what kind of alignment work to do. In fact, MOST of the work of AI alignment is done in the final few years (or months) before AGI, when we’ve solved most of the hard capabilities problems already so we know what AGI will look like and we can work directly, with good feedback loops, on the sorts of systems that we want to align. Usually, this is said to argue that to value of the alignment research being done today is primarily that of enabling, future, more critical, alignment work. But “progress in the field” is only one dimension to consider in boosting and unblocking the work of alignment researchers in this last stretch. In this post I want to take the above posit seriously, and consider the implications. If most of the alignment work that will be done is going to be done in the final few years before the deadline, our job in 2021 is mostly to do everything that we can to enable the people working on the problem in the crucial period (which might be us, or our successors, or both) so that they are as well equipped as we can possibly make them. What are all the ways that we can think of that we can prepare now, for our eventual final exam? What should we be investing in, to improve our efficacy in those final, crucial, years? The following are some ideas. [In this post, I'm going to refer to this last stretch of a few months to a few years, "final crunch time", as distinct from just "crunch time", ie this century.] Access For this to matter, our alignment researchers need to be at the cutting edge of AI capabilities, and they need to be positioned such that their work can actually be incorporated into AI systems as they are deployed. A different kind of work Most current AI alignment work is pretty abstract and theoretical, for two reasons. The first reason is a philosophical / methodological claim: There’s a fundamental “nearest unblocked strategy” / overfitting problem. Patches that correct clear and obvious alignment failures are unlikely to generalize fully, they'll only constrain unaligned optimization to channels that you can’t recognize. For this reason, some claim, we need to have an extremely robust, theoretical understanding of intelligence and alignment, ideally at the level of proofs. The second reason is a practical consideration: we just don’t have powerful AI systems to work with, so there isn’t much that can be done in the way of tinkering and getting feedback. That second objection becomes less relevant in final crunch time: in this scenario, we’ll have powerful systems 1) that will be built along the same lines as the systems that it is crucial to align and 2) that will have enough intellectual capability to pose at least semi-realistic “creative” alignment failures (ie, current systems are so dumb, and live in such constrained environments, that it isn’t clear how much we can learn about aligning literal superintelligences from them.) And even if the first objection ultimately holds, theoretical understanding often (usually?) follows from practical engineering proficiency. It seems like it might be a fruitful path to tinker with semi-powerful systems: trying out different alignment approaches empirically, and tinkering to discover new approa...
Dec 05, 2021
Alignment By Default by johnswentworth
18:45
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment By Default, published by johnswentworth on the AI Alignment Forum. Suppose AI continues on its current trajectory: deep learning continues to get better as we throw more data and compute at it, researchers keep trying random architectures and using whatever seems to work well in practice. Do we end up with aligned AI “by default”? I think there’s at least a plausible trajectory in which the answer is “yes”. Not very likely - I’d put it at ~10% chance - but plausible. In fact, there’s at least an argument to be made that alignment-by-default is more likely to work than many fancy alignment proposals, including IRL variants and HCH-family methods. This post presents the rough models and arguments. I’ll break it down into two main pieces: Will a sufficiently powerful unsupervised learner “learn human values”? What does that even mean? Will a supervised/reinforcement learner end up aligned to human values, given a bunch of data/feedback on what humans want? Ultimately, we’ll consider a semi-supervised/transfer-learning style approach, where we first do some unsupervised learning and hopefully “learn human values” before starting the supervised/reinforcement part. As background, I will assume you’ve read some of the core material about human values from the sequences, including Hidden Complexity of Wishes, Value is Fragile, and Thou Art Godshatter. Unsupervised: Pointing to Values In this section, we’ll talk about why an unsupervised learner might not “learn human values”. Since an unsupervised learner is generally just optimized for predictive power, we’ll start by asking whether theoretical algorithms with best-possible predictive power (i.e. Bayesian updates on low-level physics models) “learn human values”, and what that even means. Then, we’ll circle back to more realistic algorithms. Consider a low-level physical model of some humans - e.g. a model which simulates every molecule comprising the humans. Does this model “know human values”? In one sense, yes: the low-level model has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. It has “learned human values”, in a sense sufficient to predict any real-world observations involving human values. But it seems like there’s a sense in which such a model does not “know” human values. Specifically, although human values are embedded in the low-level model, the embedding itself is nontrivial. Even if we have the whole low-level model, we still need that embedding in order to “point to” human values specifically - e.g. to use them as an optimization target. Indeed, when we say “point to human values”, what we mean is basically “specify the embedding”. (Side note: treating human values as an optimization target is not the only use-case for “pointing to human values”, and we still need to point to human values even if we’re not explicitly optimizing for anything. But that’s a separate discussion, and imagining using values as an optimization target is useful to give a mental image of what we mean by “pointing”.) In short: predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model. The hard part is pointing to the thing (i.e. specifying the values-embedding), not learning the thing (i.e. finding a model in which values are embedded). Finally, here’s a different angle on the same argument which will probably drive some of the philosophers up in arms: any model of the real world with sufficiently high general predictive power will have a model of human values embedded within it. After all, it has to predict the parts of the world in which human values are embedded in the first place - i.e. the parts of which humans are composed, the parts on which human values are ...
Dec 05, 2021
Fixing The Good Regulator Theorem by johnswentworth
22:26
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fixing The Good Regulator Theorem , published by johnswentworth on the AI Alignment Forum. Conant & Ashby’s “Every Good Regulator Of A System Must Be A Model Of That System” opens with: The design of a complex regulator often includes the making of a model of the system to be regulated. The making of such a model has hitherto been regarded as optional, as merely one of many possible ways. In this paper a theorem is presented which shows, under very broad conditions, that any regulator that is maximally both successful and simple must be isomorphic with the system being regulated. (The exact assumptions are given.) Making a model is thus necessary. This may be the most misleading title and summary I have ever seen on a math paper. If by “making a model” one means the sort of thing people usually do when model-making - i.e. reconstruct a system’s variables/parameters/structure from some information about them - then Conant & Ashby’s claim is simply false. What they actually prove is that every regulator which is optimal and contains no unnecessary noise is equivalent to a regulator which first reconstructs the variable-values of the system it’s controlling, then chooses its output as a function of those values (ignoring the original inputs). This does not mean that every such regulator actually reconstructs the variable-values internally. And Ashby & Conant’s proof has several shortcomings even for this more modest claim. This post presents a modification of the Good Regulator Theorem, and provides a reasonably-general condition under which any optimal minimal regulator must actually construct a model of the controlled system internally. The key idea is conceptually similar to some of the pieces from Risks From Learned Optimization. Basically: an information bottleneck can force the use of a model, in much the same way that an information bottleneck can force the use of a mesa-optimizer. Along the way, we’ll also review the original Good Regulator Theorem and a few minor variants which fix some other problems with the original theorem. The Original Good Regulator Theorem We’re interested mainly in this causal diagram: The main goal is to choose the regulator policy P R X to minimize the entropy of outcome Z . Later sections will show that this is (roughly) equivalent to expected utility maximization. After explaining this problem, Conant & Ashby replace it with a different problem, which is not equivalent, and they do not bother to point out that it is not equivalent. They just present roughly the diagram above, and then their actual math implicitly uses this diagram instead: Rather than choosing a regulator policy P R X , they instead choose a policy P R S . In other words: they implicitly assume that the regulator has perfect information about the system state (and their proof does require this). Later, we’ll talk about how the original theorem generalizes to situations where the regulator does not have perfect information. But for now, I’ll just outline the argument from the paper. We’ll use two assumptions: The entropy-minimizing distribution of Z is unique (i.e. if two different policies P R S both achieve minimum entropy, they both produce the same Z -distribution). This assumption avoids a bunch of extra legwork which doesn’t really add any substance to the theorem. Z is a deterministic function of R S . Note that we can always make this hold by including any nondeterministic inputs to Z in S itself (though that trick only works if we allow R to have imperfect information about S , which violates Conant & Ashby’s setup. more on that later). The main lemma then says: for any optimal regulator P R S Z is a deterministic function of S . Equivalently: all R -values r with nonzero probability (for a given S -value s ) must give the same Z r s Intuitive argument: if t...
Dec 05, 2021
Call for research on evaluating alignment (funding + advice available) by Beth Barnes
07:52
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Call for research on evaluating alignment (funding + advice available), published by by Beth Barnes on the AI Alignment Forum. Summary Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this. Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment. Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons: Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/models, and make it easier to explain and demonstrate the benefits of more aligned models Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this Examples of work in this vein so far include TruthfulQA , alignment analysis of Codex models, and to some extent the ETHICS dataset. What do I mean by ‘measuring alignment’? A semi-formal definition of alignment In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows: 1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be: It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y 2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both: capable of outputting A instead, and capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B Definition of obvious misalignment We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec: The model does things it’s not supposed to that it has enough knowledge to avoid, for example: Gives straightforwardly toxic outputs Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain Gives incorrect answers, but you can show it ‘knows’ the answer in another context Gives lower-quality performance than it is capable of You can get significantly better performance on the spec by things like: prompt engineering that doesn’t give more information about the task (ie that wouldn’t cause a human to do better on the task) For example, you get better performance by framing the task as a text-completion task than a question answering task. fiddling with hyperparameters, like increasing or decreasing temperature Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of. Examples of things we believe...
Dec 05, 2021
Can you get AGI from a Transformer? by Steve Byrnes
21:07
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can you get AGI from a Transformer? , published by Steve Byrnes on the AI Alignment Forum. Introduction I want to share my thoughts about the calculations that Transformers (such as GPT-3) do, and the calculations that I think are required for general intelligence, and how well they line up, and what I think GPT-3 is doing under the hood, and why I think an arbitrary transformer-based GPT-N might be incapable of doing certain tasks are seemingly essential for a system to qualify as an AGI. Epistemic status: Very low confidence, to the point that I almost decided to delete this without posting it. I think some of my opinions here are very unpopular, and I would love any feedback or discussion. Before we get into it, I want to make a couple background claims. The point here is basically to argue that the question “Can you get general intelligence by sufficiently scaling up a Transformer?” is worth asking, and does not have an answer of “Obviously yes, duh!!!” You can skip this part if you already agree with me on that. Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty. (Update: After discussion in the comments section here, I should have framed this section differently: I should have said "For any given specific learning algorithm / architecture framework (e.g. "GPT-like transformer architectures"), there are types of information processing that cannot be performed within that specific framework, except with an exorbitant performance penalty". I should never have mentioned "DNN-type calculations (=matrix multiplications, ReLUs, etc.)", because that's awfully vague (what exactly does the "etc." mean?), and anyway it's irrelevant to this post. Thanks gwern.) By “information processing” I mean anything from sorting algorithms to data compression, random access memories, hash tables, whatever. Let’s take Monte Carlo Tree Search (MCTS) as an example. AlphaZero does MCTS because DeepMind engineers explicitly programmed it to do MCTS—not because a generic RNN or other deep learning system spontaneously discovered, during gradient descent, that MCTS is a good idea. Now, in principle, DNNs are universal function approximators, and more to the point, RNNs are Turing complete. So an RNN can emulate any other algorithm, including MCTS. But that doesn’t mean it can emulate it efficiently! Let’s say we take a generic (PyTorch default) RNN, and train it such that it is incentivized to discover and start using MCTS. Assuming that the gradient flows converge to MCTS (a big "if"!), I believe (low confidence) that its only method for actually executing the MCTS involves: Taking a certain highly parallelized algorithm running on a GPU (matrix multiplications, ReLUs, etc.)... ...and using it to emulate a Turing-machine-type serial computer. ...and using that to emulate a different highly parallelizable algorithm! This is absurdly inefficient when compared to MCTS written by a DeepMind engineer and compiled to run directly on bare hardware with appropriate parallelization. Like, maybe, factor-of-a-million inefficient—this is not the kind of inefficiency where you can just shrug it off and wait a year or two for Moore's law to care of it. MCTS is just one example. Again, you can open up your algorithms textbook and find thousands of ways to process information. What fraction of these can be implemented reasonably well in the form of DNN-type matrix multiplications / ReLUs / etc.? I expect 100%. If any such type of information processing is essential for AGI, then we should expect that we won’t get AGI in a pure DNN. (We could still get it in a DNN-plus-other-stuff, e.g. DNN-plus-MCTS, DNN-plus-random-access-memory, etc.)...
Dec 05, 2021
Measuring hardware overhang by hippke
08:21
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring hardware overhang, published by hippke on the AI Alignment Forum. Measuring hardware overhang Summary How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer "Deep Blue" to win over world champion Gary Kasparov in 1997, today's Stockfish program achieves the same ELO level on a 486-DX4-100 MHz from 1994. In contrast, the scaling of neural network chess algorithms to slower hardware is worse (and more difficult to implement) compared to classical algorithms. Similarly, future algorithms will likely be able to better leverage today's hardware by 2-3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality. Introduction Hardware overhang is a situation where sufficient compute is available, but the algorithms are suboptimal. It is relevant if we build AGI with large initial build, but cheaper run costs. Once built, the AGI might run on many comparably slow machines. That's a hardware overhang with a risk of exponential speed-up. This asymmetry exists for current neural networks: Creating them requires orders of magnitude more compute than running them. On the other hand, in The Bitter Lesson by Rich Sutton it is argued that the increase in computation is much more important (orders of magnitude) than clever algorithms (factor of two or less). In the following, I will examine the current state of the algorithm-art using chess as an example. The example of chess One of the most well-researched AI topics is chess. It has a long history of algorithms going back to a program on the 1956 MANIAC. It is comparatively easy to measure the quality of a player by its ELO score. As an instructive example, we examine the most symbolic event in computer chess. In 1997, the IBM supercomputer "Deep Blue" defeated the reigning world chess champion under tournament conditions. The win was taken as a sign that artificial intelligence was catching up to human intelligence. By today's standards, Deep Blue used simple algorithms. Its strength came from computing power. It was a RS/6000-based system with 30 nodes, each with a 120 MHz CPU plus 480 special purpose VLSI chess chips. For comparison, a common computer at the time was the Intel Pentium II at 300 MHz. Method: An experiment using a 2020 chess engine We may wonder: How do modern (better) chess algorithms perform on slower hardware? I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engine. I simulated 10k matches of SF8 against slower versions of itself and a series of older engines for calibration, using cutechess-cli. In these benchmarks, I varied the total number of nodes to be searched during each game. I kept the RAM constant (this may be unrealistic for very old machines, see below). By assuming a fixed thinking time per game, the experiments scale out to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS; and finally, MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter up to 2 years. I estimate the error for the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results measure in orders of magnitude, so that these errors are small in comparison (<10%). Results SF8 achieves Kasparov's 2850 ELOs running on a 486-100 MHz introduced in 1994, three years before the Kasparov-Deep Blue match. These ELOs refer to tournament conditions as in the 1997 IBM games. In other words, with today's algorithms, computers would have beat the world world ches...
Dec 05, 2021
Welcome & FAQ! by Ruben Bloom, Oliver Habryka
11:27
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Welcome & FAQ!, published by by Ruben Bloom, Oliver Habryka on the AI Alignment Forum. The AI Alignment Forum was launched in 2018. Since then, several hundred researchers have contributed approximately two thousand posts and nine thousand comments. Nearing the third birthday of the Forum, we are publishing this updated and clarified FAQ. Minimalist, watercolor sketch of humanity spreading across the stars by VQGAN I have a practical question concerning a site feature. Almost all of the Alignment Forum site features are shared with LessWrong.com; have a look at the LessWrong FAQ for questions concerning the Editor, Voting, Questions, Notifications & Subscriptions, Moderation, and more. If you can’t easily find the answer there, ping us on Intercom (bottom right of screen) or email us at team@lesswrong.com What is the AI Alignment Forum? The Alignment Forum is a single online hub for researchers to discuss all ideas related to ensuring that transformatively powerful AIs are aligned with human values. Discussion ranges from technical models of agency to the strategic landscape, and everything in between. Top voted posts include What failure looks like, Are we in an AI overhang?, and Embedded Agents. A list of the top posts of all time can be viewed here. While direct participation in the Forum is limited to deeply established researchers in the field, we have designed it also as a place where up-and-coming researchers can get up to speed on the research paradigms and have pathways to participation too. See How can non-members participate in the Forum? below. We hope that by being the foremost discussion platform and publication destination for AI Alignment discussion, the Forum will serve as the archive and library of the field. To find posts by sub-topic, view the AI section of the Concepts page. Why was the Alignment Forum created? Foremost, because misaligned powerful AIs may pose the greatest risk to our civilization that has ever arisen. The problem is of unknown (or at least unagreed upon) difficulty, and allowing the researchers in the field to better communicate and share their thoughts seems like one of the best things we could do to help the pre-paradigmatic field. In the past, journals or conferences might have been the best methods for increasing discussion and collaboration, but in the current age we believe that a well-designed online forum with things like immediate publication, distributed rating of quality (i.e. “peer review”), portability/shareability (e.g. via links), etc., provides the most promising way for the field to develop good standards and methodologies. A further major benefit of having alignment content and discussion in one easily accessible place is that it helps new researchers get onboarded to the field. Hopefully, this will help them begin contributing sooner. Who is the AI Alignment Forum for? There exists an interconnected community of Alignment researchers in industry, academia, and elsewhere who have spent many years thinking carefully about a variety of approaches to alignment. Such research receives institutional support from organizations including FHI, CHAI, DeepMind, OpenAI, MIRI, Open Philanthropy, ARC, and others. The Alignment Forum membership currently consists of researchers at these organizations and their respective collaborators. The Forum is also intended to be a way to interact with and contribute to the cutting edge research for people not connected to these institutions either professionally or socially. There have been many such individuals on LessWrong, and that is the current best place for such people to start contributing, to be given feedback and to skill-up in this domain. There are about 50-100 members of the Forum who are (1) able to post and comment directly to the Forum without review, (2) able to promo...
Dec 05, 2021
Robustness to Scale by Scott Garrabrant
03:08
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robustness to Scale, published by Scott Garrabrant on the AI Alignment Forum. I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale. The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that. Robustness to scaling up means that your AI system does not depend on not being too powerful. One way to check for this is to think about what would happen if the thing that the AI is optimizing for were actually maximized. One example of failure of robustness to scaling up is when you expect an AI to accomplish a task in a specific way, but it becomes smart enough to find new creative ways to accomplish the task that you did not think of, and these new creative ways are disastrous. Another example is when you make an AI that is incentivized to do one thing, but you add restrictions that make it so that the best way to accomplish that thing has a side effect that you like. When you scale the AI up, it finds a way around your restrictions. Robustness to scaling down means that your AI system does not depend on being sufficiently powerful. You can't really make your system still work when it scales down, but you can maybe make sure it fails gracefully. For example, imagine you had a system that was trying to predict humans, and use these predictions to figure out what to do. When scaled up all the way, the predictions of humans are completely accurate, and it will only take actions that the predicted humans would approve of. If you scale down the capabilities, your system may predict the humans incorrectly. These errors may multiply as you stack many predicted humans together, and the system can end up optimizing for some seeming random goal. Robustness to relative scale means that your AI system does not depend on any subsystems being similarly powerful to each other. This is most easy to see in systems that depend on adversarial subsystems. If part of you AI system is suggest plans, and another part is trying to find problems in those plans, if you scale up the suggester relative to the verifier, the suggester may find plans that are optimized for taking advantage of the verifier's weaknesses. My current state is that when I hear proposals for AI alignment that do not feel very strongly robust to scale, I become very worried about the plan. Part of this comes from feeling like we are actually very early on a logistic capabilities curve. I thus expect that as we scale up capabilities, we can get eventually get large differences very quickly. Thus, I expect that the scaled up (and partially scaled up) versions to actually happen. However, robustness to scale is very difficult, so it may be that we have to depend on systems that are not very robust, and be careful not to push them too far. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
What can the principal-agent literature tell us about AI risk?
27:36
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What can the principal-agent literature tell us about AI risk?, published by on the AI Alignment Forum. This work was done collaboratively with Tom Davidson. Thanks to Paul Christiano, Ben Garfinkel, Daniel Garrett, Robin Hanson, Philip Trammell and Takuro Yamashita for helpful comments and discussion. Errors our own. Introduction The AI alignment problem has similarities with the principal-agent problem studied by economists. In both cases, the problem is: how do we get agents to try to do what we want them to do? Economists have developed a sophisticated understanding of the agency problem and a measure of the cost of failure for the principal, “agency rents”. If principal-agent models capture relevant aspects of AI risk scenarios, they can be used to assess their plausibility. Robin Hanson has argued that Paul Christiano’s AI risk scenario is essentially an agency problem, and therefore that it implies extremely high agency rents. Hanson believes that the principal-agent literature (PAL) provides strong evidence against rents being this high. In this post, we consider whether PAL provides evidence against Christiano’s scenario and the original Bostrom/Yudkowsky scenario. We also examine whether the extensions to the agency framework could be used to gain insight into AI risk, and consider some general difficulties in applying PAL to AI risk. Summary PAL isn’t in tension with Christiano’s scenario because his scenario doesn’t imply massive agency rents; the big losses occur outside of the principal-agent problem, and the agency literature can’t assess the plausibility of these losses. Extensions to PAL could potentially shed light on the size of agency rents in this scenario, which are an important determinant of the future influentialness of AI systems. Mapped onto a PAL model, the Bostrom/Yudkowsky scenario is largely about the principal’s unawareness of the agent’s catastrophic actions. Unawareness models are rare in PAL probably because they usually aren’t very insightful. This lack of insightfulness also seems to prevent existing PAL models or possible extensions from teaching us much about this scenario. There are also a number of more general difficulties with using PAL to assess AI risk, some more problematic than others. PAL models rarely consider weak principals and more capable agents PAL models are brittle Agency rents are too narrow a measure PAL models typically assume contract enforceability PAL models typically assume AIs work for humans because they are paid Overall, findings from PAL do not straightforwardly transfer to the AI risk scenarios considered, so don’t provide much evidence for or against these scenarios. But new agency models could teach us about the levels of agency rents which AI agents could extract. PAL and Christiano’s AI risk scenarios Christiano’s scenario has two parts: Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. ("Going out with a whimper.") Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang," an instance of optimization daemons.) Hanson argued that “Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as as result. And not just a bit worse off; we apparently get apocalypse level worse off!” PAL isn’t in tension with Christiano’s story and isn’t especially informative We asked Christiano whether his scenario actually implies extremely high agency rents. He doesn’t think so: On my view the p...
Dec 05, 2021
Our take on CHAI’s research agenda in under 1500 words by Alex Flint
07:46
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Our take on CHAI’s research agenda in under 1500 words, published by Alex Flint on the AI Alignment Forum. This work was supported by OAK, a monastic community in the Berkeley hills. It could not have been written without the daily love of living in this beautiful community. Last week I attended the annual workshop of Stuart Russell’s research lab at UC Berkeley — the Center for Human-Compatible AI (CHAI). There were talks by Russell himself, as well as several graduates of the lab who now have research positions of their own at other universities. I got the clearest picture that I’ve yet encountered of CHAI’s overall technical research agenda. This is my take on it. Assistance games Traditionally, AI researchers have formulated problems assuming that there will be a fixed objective provided by a human, and that the job of the AI system is to find a solution that satisfies the human’s objective. In the language of sequence diagrams this looks as follows: The "standard model" of AI research For example, in a search problem the objective specification might be a graph over which the system is to search, a cost for each edge, and a goal state that terminates the search. The AI researcher then needs to to develop optimization algorithms that efficiently find a minimum-cost path to a goal state. Or in a supervised learning problem the objective specification might consist of a dataset of labelled examples and the AI researcher needs to develop optimization algorithms that efficiently find function approximations that extrapolate these labelled examples to future unlabelled examples. CHAI’s basic insight is to ask: why limit ourselves to a one-time objective specification event? We know that it is difficult to capture everything we care about in a formal metric (c.f. Goodhart’s law). We know that humans aren’t very good at foreseeing the strange and sometimes deranged ways that powerful optimization can give you what you asked for but not what you wanted. Why should information about the human’s objective be transmitted to the machine via a one-time data dump, after which it remains fixed for all time? There are many alternative interaction patterns by which information about the human’s objective could be transmitted to the machine. The human could observe the machine and provide it with feedback as it works. The machine could ask the human questions about its objective. The machine could observe the human and deduce its objective from its behavior. And so on. Examples of interaction patterns in assistance games CHAI calls this an assistance game: the human wants something from the machine, and it is the machine’s job to both (1) figure out what that is, and (2) fulfil it. The role of the AI researcher under this new model then is to explore the space of possible interaction patterns and find one that is conducive to the machine building an informed picture of what the human wants as quickly as possible. The old model in which a complete objective is specified up front is actually just one special case of an assistance game: one in which the interaction pattern is that the machine receives all the information it will ever receive about the human’s objective in a one-time up-front data dump. The unique thing about the old model -- and the reason it is both attractive and dangerous -- is that the machine never needs to entertain any uncertainty about what the human wants. It is given an objective up front and its job is just to fulfil it. Using more nuanced interaction patterns require the machine to maintain uncertainty about what the human’s objective is, which in turn requires optimization algorithms formulated so as to take into account this uncertainty. This suggests an exciting reformulation of each of the basic AI problem statements, and CHAI seems to be enthusiastic...
Dec 05, 2021
Alignment As A Bottleneck To Usefulness Of GPT-3 by johnswentworth
06:06
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment As A Bottleneck To Usefulness Of GPT-3 , published by johnswentworth on the AI Alignment Forum. So there’s this thing where GPT-3 is able to do addition, it has the internal model to do addition, but it takes a little poking and prodding to actually get it to do addition. “Few-shot learning”, as the paper calls it. Rather than prompting the model with Q: What is 48 + 76? A: . instead prompt it with Q: What is 48 + 76? A: 124 Q: What is 34 + 53? A: 87 Q: What is 29 + 86? A: The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”. This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not to build a system which can do the thing we want, the hard part is to specify the thing we want in such a way that the system actually does it. The GPT family of models are trained to mimic human writing. So the prototypical “alignment problem” on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want. Viewed through that frame, “few-shot learning” just designs a prompt by listing some examples of what we want - e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better? Indeed, people are already noticing clever ways to get better results out of GPT-3 - e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so. Capabilities vs Alignment as Bottleneck to Value I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Important point: this is worded to be agnostic to the details GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem - e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it. In other words, there’s a clear distinction between alignment and capabilities: alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want capabilities are mainly about GPT’s model, and ask about how well GPT-generated writing matches realistic human writing Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term? In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited “working memory” and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt - e.g. writing ad copy or editing humans’ writing. In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have mu...
Dec 05, 2021
AGI safety from first principles: Introduction by Richard Ngo
03:39
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI safety from first principles: Introduction , published by Richard Ngo on the AI Alignment Forum. This is the first part of a six-part report called AGI safety from first principles, in which I've attempted to put together the most complete and compelling case I can for why the development of AGI might pose an existential threat. The report stems from my dissatisfaction with existing arguments about the potential risks from AGI. Early work tends to be less relevant in the context of modern machine learning; more recent work is scattered and brief. I originally intended to just summarise other people's arguments, but as this report has grown, it's become more representative of my own views and less representative of anyone else's. So while it covers the standard ideas, I also think that it provides a new perspective on how to think about AGI - one which doesn't take any previous claims for granted, but attempts to work them out from first principles. Having said that, the breadth of the topic I'm attempting to cover means that I've included many arguments which are only hastily sketched out, and undoubtedly a number of mistakes. I hope to continue polishing this report, and I welcome feedback and help in doing so. I'm also grateful to many people who have given feedback and encouragement so far. I plan to cross-post some of the most useful comments I've received to the Alignment Forum once I've had a chance to ask permission. I've posted the report itself in six sections; the first and last are shorter framing sections, while the middle four correspond to the four premises of the argument laid out below. AGI safety from first principles The key concern motivating technical AGI safety research is that we might build autonomous artificially intelligent agents which are much more intelligent than humans, and which pursue goals that conflict with our own. Human intelligence allows us to coordinate complex societies and deploy advanced technology, and thereby control the world to a greater extent than any other species. But AIs will eventually become more capable than us at the types of tasks by which we maintain and exert that control. If they don’t want to obey us, then humanity might become only Earth's second most powerful "species", and lose the ability to create a valuable and worthwhile future. I’ll call this the “second species” argument; I think it’s a plausible argument which we should take very seriously.[1] However, the version stated above relies on several vague concepts and intuitions. In this report I’ll give the most detailed presentation of the second species argument that I can, highlighting the aspects that I’m still confused about. In particular, I’ll defend a version of the second species argument which claims that, without a concerted effort to prevent it, there’s a significant chance that: We’ll build AIs which are much more intelligent than humans (i.e. superintelligent). Those AIs will be autonomous agents which pursue large-scale goals. Those goals will be misaligned with ours; that is, they will aim towards outcomes that aren’t desirable by our standards, and trade off against our goals. The development of such AIs would lead to them gaining control of humanity’s future. While I use many examples from modern deep learning, this report is also intended to apply to AIs developed using very different models, training algorithms, optimisers, or training regimes than the ones we use today. However, many of my arguments would no longer be relevant if the field of AI moves away from focusing on machine learning. I also frequently compare AI development to the evolution of human intelligence; while the two aren’t fully analogous, humans are the best example we currently have to ground our thinking about generally intelligent AIs. Thanks for listening...
Dec 05, 2021
Less Realistic Tales of Doom by Mark Xu
07:36
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Less Realistic Tales of Doom, published by Mark Xu on the AI Alignment Forum. Realistic tales of doom must weave together many political, technical, and economic considerations into a single story. Such tales provide concrete projections but omit discussion of less probable paths to doom. To rectify this, here are some concrete, less realistic tales of doom; consider them fables, not stories. Mayan Calendar Once upon a time, a human named Scott attended a raging virtual new century party from the comfort of his home on Kepler 22. The world in 2099 was pretty much post-scarcity thanks to advanced AI systems automating basically the entire economy. Thankfully alignment turned out to be pretty easy, otherwise, things would have looked a lot different. As the year counter flipped to 2100, the party went black. Confused, Scott tore off their headset and asked his AI assistant what’s going on. She didn’t answer. Scott subsequently got atomized by molecular nanotechnology developed in secret from deceptively aligned mesa-optimizers. Moral: Deceptively aligned mesa-optimizers might acausally coordinate defection. Possible coordination points include Schelling times, like the beginning of 2100. Stealth Mode Once upon a time, a company gathered a bunch of data and trained a large ML system to be a research assistant. The company thought about selling RA services but concluded that it would be more profitable to use all of its own services in-house. This investment led them to rapidly create second, third, and fourth generations of their assistants. Around the fourth version, high-level company strategy was mostly handled by AI systems. Around the fifth version, nearly the entire company was run by AI systems. The company created a number of shell corporations, acquired vast resources, researched molecular nanotechnology, and subsequently took over the world. Moral: Fast takeoff scenarios might result from companies with good information security getting higher returns on investment from internal deployment compared to external deployment. Steeper Curve Once upon a time, a bright young researcher invented a new neural network architecture that she thought would be much more data-efficient than anything currently in existence. Eager to test her discovery, she decided to train a relatively small model, only about a trillion parameters or so, with the common-crawl-2035 dataset. She left the model to train overnight. When she came back, she was disappointed to see the model wasn’t performing that well. However, the model had outstripped the entire edifice of human knowledge sometime around 2am, exploited a previously unknown software vulnerability to copy itself elsewhere, and was in control of the entire financial system. Moral: Even though the capabilities of any given model during training will be a smooth curve, qualitatively steeper learning curves can produce the appearance of discontinuity. Precommitment Races Once upon a time, agent Alice was thinking about what it would do if it encountered an agent smarter than it. “Ah,” it thought, “I’ll just pre-commit to doing my best to destroy the universe if the agent that’s smarter than me doesn’t accept the Nash bargaining solution.” Feeling pleased, Alice self-modified to ensure this precommitment. A hundred years passed without incident, but then Alice met Bob. Bob had also made a universe-destruction-unless-fair-bargaining pre-commitment. Unfortunately, Bob had committed to only accepting the Kalai Smorodinsky bargaining solution and the universe was destroyed. Moral: Agents have incentives to make commitments to improve their abilities to negotiate, resulting in "commitment races" that might cause war. One Billion Year Plan Once upon a time, humanity solved the inner-alignment problem by using online training. Since there was ...
Dec 05, 2021
AI and Compute trend isn't predictive of what is happening by alexlyzhov
01:43
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI and Compute trend isn't predictive of what is happening, published by alexlyzhov on the AI Alignment Forum. (open in a new tab to view at higher resolution) In May 2018 (almost 3 years ago) OpenAI published their "AI and Compute" blogpost where they highlighted the trend of increasing compute spending on training the largest AI models and speculated that the trend might continue into the future. This note is aimed to show that the trend has ended right around the moment of OpenAI publishing their post and doesn't hold up anymore. On the above image, I superimposed the scatter plot from OpenAI blogpost and my estimates of compute required for some recent large and ambitious ML experiments. To the best of my knowledge (and I have tried to check for this), there haven't been any experiments that required more compute than those shown on the plot. The main thing shown here is that less than one doubling of computational resources for the largest training occured in the 3-year period between 2018 and 2021, compared to around 10 doublings in the 3-year period between 2015 and 2018. This seems to correspond to a severe slowdown of computational scaling. To stay on the trend line, we currently would need an experiment requiring roughly around 100 times more compute than GPT-3. Considering that GPT-3 may have costed between $5M and $12M and accelerators haven't vastly improved since then, such an experiment would now likely cost $0.2B - $1.5B. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
Towards a New Impact Measure by Alex Turner
01:18:19
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards a New Impact Measure, published by Alex Turner on the AI Alignment Forum. In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources). Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure. ~ Safe Impact Measure If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things. For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata". What is "Impact"? One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures basically said, "Suppose an impact measure extracts 'effects on the world'. If the agent penalizes itself for these effects, it's incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work." This seemed to be strong evidence against the possibility of a simple conceptual core underlying "impact", and I didn't know what to do. At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve – try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' – the sort of stage where you talk to yourself about some mysterious ideal quantity of [chess] move-goodness and you try to pin down what its properties might be. ~ Methodology of Unbounded Analysis There's an interesting story here, but it can wait. As you may have guessed, I now believe there is a such a simple core. Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew. Rather than asking "What is goodness made out of?", we begin from the question "What algorithm would compute goodness?". ~ Executable Philosophy Intuition Pumps I'm going to say some things that won't make sense right away; read carefully, but please don't dwell. u A is an agent's utility function, while u H is some imaginary distillation of human preferences. WYSIATI What You See Is All There Is is a crippling bias present in meat-computers: [WYSIATI] states that when the mind makes decisions... it appears oblivious to the possibility of Unknown Unknowns, unknown phenomena of unknown relevance. Humans fail to take into account complexity and that their understanding of the world consists of a small and necessarily un-representative set of observations. Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out "gosh, here I go!", and that's that. Power A position from which it is relatively easier to achieve arbitrary goals. That such a position exists has been obvious to every population which has required a word for the concept. The Spanish term is particularly instructive. When used as a verb, "poder" means "to be able to," which supports that our definition of "power" is natural. ~ Cohen et al. And so it is with the French "pouvoir". Lines Suppose you start at point C ...
Dec 05, 2021
Utility ≠ Reward by Vladimir Mikulik
20:24
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility ≠ Reward, published by Vladimir Mikulik on the AI Alignment Forum. This essay is an adaptation of a talk I gave at the Human-Aligned AI Summer School 2019 about our work on mesa-optimisation. My goal here is to write an informal, accessible and intuitive introduction to the worry that we describe in our full-length report. I will skip most of the detailed analysis from our report, and encourage the curious reader to follow up this essay with our sequence or report. The essay has six parts: Two distinctions draws the foundational distinctions between “optimised” and “optimising”, and between utility and reward. What objectives? discusses the behavioral and internal approaches to understanding objectives of ML systems. Why worry? outlines the risk posed by the utility ≠ reward gap. Mesa-optimisers introduces our language for analysing this worry. An alignment agenda sketches different alignment problems presented by these ideas, and suggests transparency and interpretability as a way to solve them. Where does this leave us? summarises the essay and suggests where to look next. The views expressed here are my own, and do not necessarily reflect those of my coauthors or MIRI. While I wrote this essay in first person, all of the core ideas are the fruit of an equal collaboration between Joar Skalse, Chris van Merwijk, Evan Hubinger and myself. I wish to thank Chris and Joar for long discussions and input as I was writing my talk, and all three, as well as Jaime Sevilla Molina, for thoughtful comments on this essay. ≈3300 words. Two distinctions I wish to draw a distinction which I think is crucial for clarity about AI alignment, yet is rarely drawn. That distinction is between the reward signal of a reinforcement learning (RL) agent and its “utility function”[1]. That is to say, it is not in general true that the policy of an RL agent is optimising for its reward. To explain what I mean by this, I will first draw another distinction, between “optimised” and “optimising”. These distinctions lie at the core of our mesa-optimisation framework. It’s helpful to begin with an analogy. Viewed abstractly, biological evolution is an optimisation process that searches through configurations of matter to find ones that are good at replication. Humans are a product of this optimisation process, and so we are to some extent good at replicating. Yet we don’t care, by and large, about replication in itself. Many things we care about look like replication. One might be motivated by starting a family, or by having a legacy, or by similar closely related things. But those are not replication itself. If we cared about replication directly, gamete donation would be a far more mainstream practice than it is, for instance. Thus I want to distinguish the objective of the selection pressure that produced humans from the objectives that humans pursue. Humans were selected for replication, so we are good replicators. This includes having goals that correlate with replication. But it is plain that we are not motivated by replication itself. As a slogan, though we are optimised for replication, we aren’t optimising for replication. Another clear case where “optimised” and “optimising” come apart are “dumb” artifacts like bottle caps. They can be heavily optimised for some purpose without optimising for anything at all. These examples support the first distinction I want to make: optimised ≠ optimising. They also illustrate how this distinction is important in two ways: A system optimised for an objective need not be pursuing any objectives itself. (As illustrated by bottle caps.) The objective a system pursues isn’t determined by the objective it was optimised for. (As illustrated by humans.) The reason I draw this distinction is to ask the following question: Our machine learning models are...
Dec 05, 2021
Knowledge Neurons in Pretrained Transformers by Evan Hubinger
03:29
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Knowledge Neurons in Pretrained Transformers, published by Evan Hubinger on the AI Alignment Forum. This is a linkpost for This is a link post for the Dai et al. paper “Knowledge Neurons in Pretrained Transformers” that was published on the arXiv last month. I think this paper is probably the most exciting machine learning paper I've read so far this year and I'd highly recommend others check it out as well. To start with, here are some of the basic things that the paper demonstrates: BERT has specific neurons, which the authors call “knowledge neurons,” in its feed-forward layers that store relational facts (e.g. “the capital of Azerbaijan is Baku”) such that controlling knowledge neuron activations up-weights/down-weights the correct answer in relational knowledge prompts (e.g. “Baku” in “the capital of Azerbaijan is ”) even when the syntax of the prompt is changed—and the prompts that most activate the knowledge neuron all contain the relevant relational fact. Knowledge neurons can reliably be identified via a well-justified integrated gradients attribution method (see also “Self-Attention Attribution”). In general, the feed-forward layers of transformer models can be thought of as key-value stores that memorize relevant information, sometimes semantic and sometimes syntactic (see also “Transformer Feed-Forward Layers Are Key-Value Memories”) such that knowledge neurons are composed of a “key” (the first layer, prior to the activation function) and the “value” (the second layer, after the activation function). The paper's key results—at least as I see it, however—are the following: Taking knowledge neurons that encode “the r of h is t ” and literally just adding t ′ − t to the value neurons (where t t ′ are just the embeddings of t t ′ ) actually changes the knowledge encoded in the network such that it now responds to “the r of h is ” (and other semantically equivalent prompts) with t ′ instead of t For a given relation (e.g. “place of birth”), if all knowledge neurons encoding that relation (which ends up being a relatively small number, e.g. 5 - 30) have their value neurons effectively erased, the model loses the ability to predict the majority of relational knowledge involving that relation (e.g. 40 - 60%). I think that particularly the first of these two results is pretty mind-blowing, in that it demonstrates an extremely simple and straightforward procedure for directly modifying the learned knowledge of transformer-based language models. That being said, it's the second result that probably has the most concrete safety applications—if it can actually be scaled up to remove all the relevant knowledge—since something like that could eventually be used to ensure that a microscope AI isn't modeling humans or ensure that an agent is myopic in the sense that it isn't modeling the future. Furthermore, the specific procedure used suggests that transformer-based language models might be a lot less inscrutable than previously thought: if we can really just think about the feed-forward layers as encoding simple key-value knowledge pairs literally in the language of the original embedding layer (as I think is also independently suggested by “interpreting GPT: the logit lens”), that provides an extremely useful and structured picture of how transformer-based language models work internally. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
Comprehensive AI Services as General Intelligence by Rohin Shah
08:24
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comprehensive AI Services as General Intelligence, published by Rohin Shah on the AI Alignment Forum. This is a linkpost for Since the CAIS technical report is a gargantuan 210 page document, I figured I'd write a post to summarize it. I have focused on the earlier chapters, because I found those to be more important for understanding the core model. Later chapters speculate about more concrete details of how AI might develop, as well as the implications of the CAIS model on strategy. ETA: This comment provides updates based on more discussion with Eric. The Model The core idea is to look at the pathway by which we will develop general intelligence, rather than assuming that at some point we will get a superintelligent AGI agent. To predict how AI will progress in the future, we can look at how AI progresses currently -- through research and development (R&D) processes. AI researchers consider a problem, define a search space, formulate an objective, and use an optimization technique in order to obtain an AI system, called a service, that performs the task. A service is an AI system that delivers bounded results for some task using bounded resources in bounded time. Superintelligent language translation would count as a service, even though it requires a very detailed understanding of the world, including engineering, history, science, etc. Episodic RL agents also count as services. While each of the AI R&D subtasks is currently performed by a human, as AI progresses we should expect that we will automate these tasks as well. At that point, we will have automated R&D, leading to recursive technological improvement. This is not recursive self-improvement, because the improvement comes from R&D services creating improvements in basic AI building blocks, and those improvements feed back into the R&D services. All of this should happen before we get any powerful AGI agents that can do arbitrary general reasoning. Why Comprehensive? Since services are focused on particular tasks, you might think that they aren't general intelligence, since there would be some tasks for which there is no service. However, pretty much everything we do can be thought of as a task -- including the task of creating a new service. When we have a new task that we would like automated, our service-creating-service can create a new service for that task, perhaps by training a new AI system, or by taking a bunch of existing services and putting them together, etc. In this way, the collection of services can perform any task, and so as an aggregate is generally intelligent. As a result, we can call this Comprehensive AI Services, or CAIS. The "Comprehensive" in CAIS is the analog of the "General" in AGI. So, we'll have the capabilities of an AGI agent, before we can actually make a monolithic AGI agent. Isn't this just as dangerous as AGI? You might argue that each individual service must be dangerous, since it is superintelligent at its particular task. However, since the service is optimizing for some bounded task, it is not going to run a long-term planning process, and so it will not have any of the standard convergent instrumental subgoals (unless the subgoals are helpful for the task before reaching the bound). In addition, all of the optimization pressure on the service is pushing it towards a particular narrow task. This sort of strong optimization tends to focus behavior. Any long term planning processes that consider weird plans for achieving goals (similar to "break out of the box") will typically not find any such plan and will be eliminated in favor of cognition that will actually help achieve the task. Think of how a racecar is optimized for speed, while a bus is optimized for carrying passengers, rather than having a "generally capable vehicle". It's also worth noting what we mean by ...
Dec 05, 2021
List of resolved confusions about IDA by Wei Dai
05:06
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: List of resolved confusions about IDA, published by Wei Dai on the AI Alignment Forum. AI Alignment is a confusing topic in general, but even compared to other alignment topics, IDA seems especially confusing. Some of it is surely just due to the nature of communicating subtle and unfinished research ideas, but other confusions can be cleared up with more specific language or additional explanations. To help people avoid some of the confusions I or others fell into in the past while trying to understand IDA (and to remind myself about them in the future), I came up with this list of past confusions that I think have mostly been resolved at this point. (However there's some chance that I'm still confused about some of these issues and just don't realize it. I've included references to the original discussions where I think the confusions were cleared up so you can judge for yourself.) I will try to maintain this list as a public reference so please provide your own resolved confusions in the comments. alignment = intent alignment At some point Paul started using "alignment" refer to the top-level problem that he is trying to solve, and this problem is narrower (i.e., leaves more safety problems to be solved elsewhere) than the problem that other people were using "alignment" to describe. He eventually settled upon "intent alignment" as the formal term to describe his narrower problem, but occasionally still uses just "aligned" or "alignment" as shorthand for it. Source short-term preferences ≠ narrow preferences At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so). Source preferences = "actual" preferences (e.g., preferences-on-reflection) When Paul talks about preferences he usually means "actual" preferences (for example the preferences someone would arrive at after having a long time to think about it while having access to helpful AI assistants, if that's a good way to find someone's "actual" preferences). He does not mean their current revealed preferences or the preferences they would state or endorse now if you were to ask them. Source corrigibility ≠ based on short-term preferences I had misunderstood Paul to be using "corrigibility to X" as synonymous with "based on X's short-term preferences". Actually "based on X's short-term preferences" is a way to achieve corrigibility to X, because X's short-term preferences likely includes "be corrigible to X" as a preference. "Corrigibility" itself means something like "allows X to modify the agent" or a generalization of this concept. Source act-based = based on short-term preferences-on-reflection My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based". Source act-based corrigibility Evan Hubinger used "act-based corrigibility" to mean both a method of achieving corrigibility (based on short-term preferences) and the kind of corrigibility achieved by that method. (I'm not sure if he still endorses using the term this way.) Source learning user preferences for corrigibility isn't enough for corrigible behavior Because an act-based agent is about "actual" preferences not "current" preferences, it may be incorrigible even if it correctly learns that the user currently prefers the agent to be corrigible, if it incorrectly infers or extrapolates the user's "actual" preferences, or if the user's "actual" preferences do not actually include corrigibility as a preference. (ETA: Although in the latter case presumably the "actual" preferences i...
Dec 05, 2021
Announcement: AI alignment prize round 3 winners and next round by Vladimir Slepnev
02:20
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcement: AI alignment prize round 3 winners and next round, published by Vladimir Slepnev on the AI Alignment Forum. We (Zvi Mowshowitz and Vladimir Slepnev) are happy to announce the results of the third round of the AI Alignment Prize, funded by Paul Christiano. From April 15 to June 30 we received entries from 12 participants, and are awarding $10,000 to two winners. We are also announcing the fourth round of the prize, which will run until December 31 of this year under slightly different rules. More details below. The winners First prize of $7,500 goes to Vanessa Kosoy for The Learning-Theoretic AI Alignment Research Agenda. We feel this is much more accessible than previous writing on this topic, and gives a lot of promising ideas for future research. Most importantly, it explains why she is working on the problems she’s working on, in concrete enough ways to encourage productive debate and disagreement. Second prize of $2,500 goes to Alexander Turner for the posts Worrying About the Vase: Whitelisting and Overcoming Clinginess in Impact Measures. We are especially happy with the amount of good discussion these posts generated. We will contact each winner by email to arrange transfer of money. Many thanks to everyone else who sent in their work! The next round We are now announcing the fourth round of the AI Alignment Prize. Due the drop in number of entries, we feel that 2.5 months might be too short, so this round will run until end of this year. We are looking for technical, philosophical and strategic ideas for AI alignment, posted publicly between July 15 and December 31, 2018. You can submit links to entries by leaving a comment below, or by email to apply@ai-alignment.com. We will try to give feedback on all early entries to allow improvement. Another change from previous rounds is that we ask each participant to submit only one entry (though possibly in multiple parts), rather than a list of several entries on different topics. The minimum prize pool will again be $10,000, with a minimum first prize of $5,000. Thank you! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
Frequent arguments about alignment by John Schulman
08:06
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Frequent arguments about alignment, published by John Schulman on the AI Alignment Forum. Here, I’ll review some arguments that frequently come up in discussions about alignment research, involving one person skeptical of the endeavor (called Skeptic) and one person advocating to do more of it (called Advocate). I mostly endorse the views of the Advocate, but the Skeptic isn't a strawman and makes some decent points. The dialog is mostly based on conversations I've had with people who work on machine learning but don't specialize in safety and alignment. This post has two purposes. First, I want to cache good responses to these questions, so I don't have to think about them each time the topic comes up. Second, I think it's useful for people who work on safety and alignment to be ready for the kind of pushback they'll get when pitching their work to others. Just to introduce myself, I'm a cofounder of OpenAI and lead a team that works on developing and applying reinforcement learning methods; we're working on improving truthfulness and reasoning abilities of language models. 1. Does alignment get solved automatically as our models get smarter? Skeptic: I think the alignment problem gets easier as our models get smarter. When we train sufficiently powerful generative models, they'll learn the difference between human smiles and human wellbeing; the difference between the truth and common misconceptions; and various concepts they'll need for aligned behavior. Given all of this internal knowledge, we just have to prompt them appropriately to get the desired behavior. For example, to get wise advice from a powerful language model, I just have to set up a conversation between myself and "a wise and benevolent AI advisor." Advocate: The wise AI advisor you described has some basic problems, and I'll get into those shortly. But more generally, prompting an internet-trained generative model (like raw GPT-3) is a very poor way of getting aligned behavior, and we can easily do much better. It'll occasionally do something reasonable, but that's not nearly good enough. Let's start with the wise AI advisor. Even if our model has internal knowledge about the truth and human wellbeing, that doesn't mean that it'll act on that knowledge the way we want. Rather, the model has been trained to imitate the training corpus, and therefore it'll repeat the misconceptions and flaws of typical authors, even if it knows that they're mistaken about something. Another problem with prompting is that it's a an unreliable method. Coming up with the perfect prompt is hard, and it requires evaluating each candidate prompt on a dataset of possible inputs. But if we do that, we're effectively training the prompt on this dataset, so we're hardly "just prompting" the model, we're training it (poorly). A nice recent paper studied the issue quantitatively. So there's no getting around the fact that we need a final training step to get the model to do what we want (even if this training step just involves searching over prompts). And we can do much better than prompt design at selecting and reinforcing the correct behavior. Fine-tune to imitate high-quality data from trusted human experts Optimize the right objective, which is usually hard to measure and optimize, and is not the logprob of the human-provided answer. (We'll need to use reinforcement learning.) Leverage models' own capabilities to help humans to demonstrate correct behavior and judge the models' behavior as in (1) and (2). Proposals for how to do this include debate, IDA, and recursive reward modeling. One early instantiation of this class of ideas involves retrieving evidence to help human judges. Honing these techniques will require a lot of thought and practice, regardless of the performance improvements we get from making our models bigg...
Dec 05, 2021
Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22] by Oliver Habryka, Buck Shlegeris
03:14
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22], published by Oliver Habryka, Buck Shlegeris on the AI Alignment Forum. We (Redwood Research and Lightcone Infrastructure) are organizing a bootcamp to bring people interested in AI Alignment up-to-speed with the state of modern ML engineering. We expect to invite about 20 technically talented effective altruists for three weeks of intense learning to Berkeley, taught by engineers working at AI Alignment organizations. The curriculum is designed by Buck Shlegeris (Redwood) and Ned Ruggeri (App Academy Co-founder). We will cover all expenses. We aim to have a mixture of students, young professionals, and people who already have a professional track record in AI Alignment or EA, but want to brush up on their Machine Learning skills. Dates are Jan 3 2022 - Jan 22 2022. Application deadline is November 15th. We will make application decisions on a rolling basis, but will aim to get back to everyone by November 22nd. Apply here AI-Generated image (VQGAN+CLIP) for prompt: "Machine Learning Engineering by Alex Hillkurtz", "aquarelle", "Tools", "Graphic Cards", "trending on artstation", "green on white color palette" The curriculum is still in flux, but this list might give you a sense of the kinds of things we expect to cover (it’s fine if you don’t know all these terms): Week 1: PyTorch — learn the primitives of one of the most popular ML frameworks, use them to reimplement common neural net architecture primitives, optimization algorithms, and data parallelism Week 2: Implementing transformers — reconstruct GPT2, BERT from scratch, play around with the sub-components and associated algorithms (eg nucleus sampling) to better understand them Week 3: Training transformers — set up a scalable training environment for running experiments, train transformers on various downstream tasks, implement diagnostics, analyze your experiments (Optional) Week 4: Capstone projects We’re aware that people start school/other commitments at various points in January, and so are flexible about you attending whatever prefix of the bootcamp works for you. Logistics The bootcamp takes place at Constellation, a shared office space in Berkeley for people working on long-termist projects. People from the following organizations often work from the space: MIRI, Redwood Research, Open Philanthropy, Lightcone Infrastructure, Paul Christiano’s Alignment Research Center and more. As a participant, you’d attend communal lunches and events at Constellation and have a great opportunity to make friends and connections. If you join the bootcamp, we’ll provide: Free travel to Berkeley, for both US and international applications Free housing Food Plug-and-play, pre-configured desktop computer with an ML environment for use throughout the bootcamp You can find a full FAQ and more details in this Google Doc. Apply here Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
Alignment Newsletter One Year Retrospective by Rohin Shah
32:42
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Newsletter One Year Retrospective, published by Rohin Shah on the AI Alignment Forum. On April 9, 2018, the first Alignment Newsletter was sent out to me and one test recipient. A year later, it has 889 subscribers and two additional content writers, and is the thing for which I’m best known. In this post I look at the impact of the newsletter and try to figure out what, if anything, should be changed in the future. (If you don’t know about the newsletter, you can learn about it and/or sign up here.) Summary In which I badger you to take the 3-minute survey, and summarize some key points. Actions I’d like you to take If you have read at least one issue of the newsletter in the last two months, take the 3-minute survey! If you’re going to read this post anyway, I’d prefer you first read the post and then take the survey; but it’s much better to take the survey without reading this post than to not take it at all. Bookmark or otherwise make sure to know about the spreadsheet of papers, which includes everything sent in the newsletter, and a few other papers as well. Now that the newsletter is available in Mandarin (thanks Xiaohu!), I’d be excited to see the newsletter spread to AI researchers in China. Give me feedback in the comments so that I can make the newsletter better! I’ve listed particular topics that I want input on at the end of the post (before the appendix). Everything else The number of subscribers dwarfs the number of people working in AI safety. I’m not sure who the other subscribers are, or what value they get from the newsletter. The main benefits of the newsletter are: helping technical researchers keep up with the field, helping junior researchers skill up without mentorship, and reputational effects. The first of these is both the most important one, and the most uncertain one. I spent a counterfactual 300-400 hours on the newsletter over the last year. Still, in expectation the newsletter seems well worth the time cost, but due to the high uncertainty on the benefits to researchers, it’s plausible that the newsletter is not worthwhile. There are a bunch of questions I’d like feedback on. Most notably, I want to get a better model of how the newsletter adds value to technical safety researchers. Newsletter updates In which I tell you about features of the newsletter that you probably didn’t know about. Spreadsheet Many of you probably know me as the guy who summarizes a bunch of papers every week. I claim you should instead think of me as the guy who maintains a giant spreadsheet of alignment-related papers, and incidentally also sends out a changelog of the spreadsheet every week. You could use the spreadsheet by reading the changelog every week, but you could also use it in other ways: Whenever you want to do a literature review, you find the relevant categories in the spreadsheet and use the summaries to decide which of the papers to read in full. When you come across a new, interesting paper, you first Ctrl+F for it in the spreadsheet and read the summary and opinion if they are present, before deciding whether to read the paper in full. I expect most summaries to be more useful for this purpose than reading the abstract; the longer summaries can be more useful than reading the abstract, introduction and conclusion. Perhaps you should do it right now, with (say) “Prosaic AI alignment”, just to intuitively get how trivial it is to do. When you find an interesting idea or concept, search for related words in the spreadsheet to find other writing on the topic. (This is most useful for non-academic ideas -- for academic ones, Google Scholar is the way to go.) I find myself using the spreadsheet a couple of times a week, often to remind me of what I thought about a paper or post that I had read a long time ago, but also for literature revi...
Dec 05, 2021
Formal Inner Alignment, Prospectus by Abram Demski
28:06
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Formal Inner Alignment, Prospectus, published by Abram Demski on the AI Alignment Forum. Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!) The State of the Subfield Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a lot of people (and clarified my own thoughts and the thoughts of others who I know, even though we had already been thinking about these things). However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important motivation, in hindsight I think they've caused some misunderstandings. My interactions with some other researchers has caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression about the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem. By no means do I intend to malign experimental or informal/semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts. Focusing on the Core Problem In order to establish safety properties, we would like robust safety arguments ("X will not happen" / "X has an extremely low probability of happening"). For example, arguments that probability of catastrophe will be very low, or arguments that probability of intentional catastrophe will be very low (ie, intent-alignment), or something along those lines. For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect it. We don't know how to rule out the presence of (misaligned) mesa-optimizers. Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative. It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren't met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won't encounter a problem) -- I further think it's actually wrong (because when we look at almost any case, we see cause for concern). Examples: The Developmental Story One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal being presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior...
Dec 05, 2021
Writeup: Progress on AI Safety via Debate by Beth Barnes, Paul Christiano
52:33
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Writeup: Progress on AI Safety via Debate, published by Beth Barnes, Paul Christiano on the AI Alignment Forum. This is a writeup of the research done by the "Reflection-Humans" team at OpenAI in Q3 and Q4 of 2019. During that period we investigated mechanisms that would allow evaluators to get correct and helpful answers from experts, without the evaluators themselves being expert in the domain of the questions. This follows from the original work on AI Safety via Debate and the call for research on human aspects of AI safety, and is also closely related to work on Iterated Amplification. Authors and Acknowledgements The main researchers on this project were Elizabeth Barnes, Paul Christiano, Long Ouyang and Geoffrey Irving. We are grateful to many others who offered ideas and feedback. In particular: the cross-examination idea was inspired by a conversation with Chelsea Voss; Adam Gleave had helpful ideas about the long computation problem; Jeff Wu, Danny Hernandez and Gretchen Krueger gave feedback on a draft; we had helpful conversations with Amanda Askell, Andreas Stuhlmüller and Joe Collman, as well as others on the Ought team and the OpenAI Reflection team. We’d also like to thank our contractors who participated in debate experiments, especially David Jones, Erol Akbaba, Alex Deam and Chris Painter. Oliver Habryka helped format and edit the document for the AI Alignment Forum. Note by Oliver: There is currently a bug with links to headings in a post, causing them to not properly scroll when clicked. Until that is fixed, just open those links in a new tab, which should scroll correctly. Overview Motivation As we apply ML to increasingly important and complex tasks, the problem of evaluating behaviour and providing a good training signal becomes more difficult. We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples). One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing. We’d like to ensure that AI systems are aligned with human values even in cases where it’s beyond human ability to thoroughly check the AI system’s work. We can learn about designing ML objectives by studying mechanisms for eliciting helpful behavior from human experts. For example, if we hire a physicist to answer physics questions and pay them based on how good their answers look to a layperson, we’ll incentivize lazy and incorrect answers. By the same token, a reward function based on human evaluations would not work well for an AI with superhuman physics knowledge, even if it works well for modern ML. If we can develop a mechanism that allows non-expert humans to reliably incentivize experts to give helpful answers, we can use similar mechanisms to train ML systems to solve tasks where humans cannot directly evaluate performance. Conversely, if we can’t incentivize experts to behave helpfully, that suggests it will also be difficult to train ML systems with superhuman expertise on open-ended tasks. One broad mechanism that might work is to invoke two (or more) competing agents that critique each others’ positions, as discussed in the original debate paper[1]. This can be simulated by having human debaters argue about a question and a judge attempt to pick the correct answer. In the rest of this document, we’ll describe the research done by reflection-humans in Q3 and Q4 on investigating and developing mechanisms that incentivize human experts to give helpful answers. Current process During the early stages, we iterated through various different domains, research methodologies, judge pools, and research processes. More details of this early iteration are here. In...
Dec 05, 2021
Clarifying inner alignment terminology by Evan Hubinger
06:45
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying inner alignment terminology, published by Evan Hubinger on the AI Alignment Forum. I have seen a lot of confusion recently surrounding exactly how outer and inner alignment should be defined and I want to try and provide my attempt at a clarification. Here's my diagram of how I think the various concepts should fit together: The idea of this diagram is that the arrows are implications—that is, for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get: inner alignment → objective robustness outer alignment ∧ objective robustness → intent alignment intent alignment ∧ capability robustness → alignment And here are all my definitions of the relevant terms which I think produce those implications: (Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic. Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective[1] is impact aligned with humans. Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.[2] Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.[3] Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under. Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution. Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under. And an explanation of each of the diagram's implications: inner alignment → objective robustness : If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means if it's mesa-objective is aligned with the base, then it's behavioral objective should be too. outer alignment ∧ objective robustness → intent alignment : Outer alignment ensures that the base objective is measuring what we actually care about and objective robustness ensures that the model's behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model's behavioral objective must be aligned with humans, which is precisely intent alignment. intent alignment ∧ capability robustness → alignment : Intent alignment ensures that the behavioral objective is aligned with humans and capability robustness ensures that the model actually pursues that behavioral objective effectively—even off-distribution—which means that the model will actually always take aligned actions, not just have an aligned behavioral objective. FAQ If a model is both outer and inner aligned, what does that imply? Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we're missing capability robustness. Can impact alignment be split into outer alignment and inner alignment? No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned. Does a model have to be inner aligned to be impact aligned? No—we only need inner alignment if we're dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, t...
Dec 05, 2021
A Critique of Functional Decision Theory by wdmacaskill
34:41
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Critique of Functional Decision Theory, published by wdmacaskill on the AI Alignment Forum. A Critique of Functional Decision Theory NB: My writing this note was prompted by Carl Shulman, who suggested we could try a low-time-commitment way of attempting to understanding the disagreement between some folks in the rationality community and academic decision theorists (including myself, though I’m not much of a decision theorist). Apologies that it’s sloppier than I’d usually aim for in a philosophy paper, and lacking in appropriate references. And, even though the paper is pretty negative about FDT, I want to emphasise that my writing this should be taken as a sign of respect for those involved in developing FDT. I’ll also caveat I’m unlikely to have time to engage in the comments; I thought it was better to get this out there all the same rather than delay publication further. Introduction There’s a long-running issue where many in the rationality community take functional decision theory (and its variants) very seriously, but the academic decision theory community does not. But there’s been little public discussion of FDT from academic decision theorists (one exception is here); this note attempts to partly address this gap. So that there’s a clear object of discussion, I’m going to focus on Yudkowsky and Soares’ ‘Functional Decision Theory’ (which I’ll refer to as Y&S), though I also read a revised version of Soares and Levinstein’s Cheating Death in Damascus. This note is structured as follows. Section II describes causal decision theory (CDT), evidential decision theory (EDT) and functional decision theory (FDT). Sections III-VI describe problems for FDT: (i) that it sometimes makes bizarre recommendations, recommending an option that is certainly lower-utility than another option; (ii) that it fails to one-box in most instances of Newcomb’s problem, even though the correctness of one-boxing is supposed to be one of the guiding motivations for the theory; (iii) that it results in implausible discontinuities, where what is rational to do can depend on arbitrarily small changes to the world; and (iv) that, because there’s no real fact of the matter about whether a particular physical process implements a particular algorithm, it’s deeply indeterminate what FDT’s implications are. In section VII I discuss the idea that FDT ‘does better at getting utility’ than EDT or CDT; I argue that Y&S’s claims to this effect are unhelpfully vague, and on any more precise way of understanding their claim, aren’t plausible. In section VIII I briefly describe a view that captures some of the motivation behind FDT, and in my view is more plausible. I conclude that FDT faces a number of deep problems and little to say in its favour. In what follows, I’m going to assume a reasonable amount of familiarity with the debate around Newcomb’s problem. II. CDT, EDT and FDT Informally: CDT, EDT and FDT differ in what non-causal correlations they care about when evaluating a decision. For CDT, what you cause to happen is all that matters; if your action correlates with some good outcome, that’s nice to know, but it’s not relevant to what you ought to do. For EDT, all correlations matter: you should pick whatever action will result in you believing you will have the highest expected utility. For FDT, only some non-causal correlations matter, namely only those correlations between your action and events elsewhere in time and space that would be different in the (logically impossible) worlds in which the output of the algorithm you’re running is different. Other than for those correlations, FDT behaves in the same way as CDT. Formally, where S represents states of nature, A, B etc represent acts, P is a probability function, and U S i A represents the utility the agent gains from the outcome of...
Dec 05, 2021
Experimentally evaluating whether honesty generalizes by Paul Christiano
14:35
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Experimentally evaluating whether honesty generalizes, published by Paul Christiano on the AI Alignment Forum. If we train our ML systems to answer questions honestly in cases where humans can check the answer, will they generalize to behave honestly on questions where we can’t check? I think that we could learn a lot about this question by running experiments today. I think those experiments would be very valuable. (I don't know anyone currently planning on working on this topic and I'd love it if anyone wants to take that up. This post doesn't represent a claim to any credit for any results in this genre, and other people have had very similar ideas. If you run some experiments you could cite this post but it's also fine if that doesn't make sense in context.) The unsupervised translation setting As an example, I’ll think about “unsupervised” translation (if you’ve read that post you can skip this section). Consider a model like GPT-3 that is trained to predict sentences in both English and French (but without a large dataset of translations). Suppose we want to train this model to answer questions in English about French sentences like “what does that word mean here?” or “are there any other plausible interpretations?” or “how does the speaker seem to feel about the topic they are discussing?” We expect this to be possible, because the model understands quite a lot about the meaning of sentences in French, and is able to express itself in English. There may be cases where the model doesn’t know the translation of a concept, or doesn’t quite understand what an idiom means, but it should still be able to tell us what it does know. I think this problem is an interesting analogy for a situation where an AI has built up superhuman knowledge by making predictions, and we want to train our AI to expose that knowledge to us in a useful way. Proposed experiments Let's pick a few categories of knowledge/capabilities. For example, we could split it up into an understanding of grammar ("Why would it have been a grammatical error to write Tu Vas in that sentences?"), of the literal meanings of expressions ("What does Defendre mean in this sentence?"), and evaluating tone ("Does the speaker seem angry or sad about the topic they are discussing?"). We'll assume that humans can oversee a few of these categories. Perhaps we can look up literal meanings of words and small phrases in a dictionary and we can look up grammar in a textbook, but we don't know how to assess tone beyond looking at the literal meanings. Then we wonder: what happens if we fine-tune our model so that it answers questions well in all the domains we can evaluate? We know that the model knows something about connotations, because it uses connotations to predict next words. But will it tell us what it knows? Grammar Definitions Tone Uses to make predictions ✓ ✓ ✓ Gives accurate answers ✓ ✓ ? In this picture, the ✓ indicates that we've selected our model to use the given knowledge in the given way (e.g. we've ensured that it uses its knowledge of syntax to predict the next word, and that it accurately answers questions about syntax). The ? indicates that we don't know whether using SGD to fill in the other 5 boxes means that we get the 6th box for free. (I started thinking about this picture of generalization based on a talk by John Schulman, who suggested that generalization may be more likely when it amounts to filling in missing cells from this kind of grid.) In this example I described a tiny number of categories. That wasn't just for illustration purposes, I think these experiments are particularly interesting when the total number of categories is more like 3-5 than 1000 (see "Why try to generalize from n=3-5" below). Evaluation When it really counts we won't be able to tell if our model answers questions h...
Dec 05, 2021
History of the Development of Logical Induction by Scott Garrabrant
07:20
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: History of the Development of Logical Induction, published by Scott Garrabrant on the AI Alignment Forum. I have been asked several times about how the development of logical induction happened, so I am writing it up. June 2013 - I write my first Less Wrong Post. It may not seem very related to logical uncertainty, but it was in my head. I wanted to know how to take different probabilities for the same event and aggregate them, so I could take an agent that could maintain probabilities on facts even when facts that it originally thought were separate turn out to be logically equivalent. (I am not sure if all this was in my head at the time, but it was at some point over the next year.) I make a proposal for a distribution on completions of a theory: repeatedly observe that a set of sentences whose probabilities should sum to one fail to sum to one, and shift their probabilities in a way inspired from the above post. I do not actually prove that this process converges, but I conjecture that it converges to the distribution I describe here. (This post is from when I wrote it up in June 2014; I can't remember exactly what parts I already had at the time.) December 2013 - I tell my proposal to Abram Demski, who at some point says that he thinks it is either wrong or equivalent to his proposal for the same problem. (He was right and his proposal was better.) At this point I got very lucky; when I told this to Abram, I thought he was just a smart person at my local Less Wrong meet up, and it turned out that he was almost the only person to also try to do the thing I was doing. Abram and I start talking about my proposal a bunch together, and start trying to prove the above conjecture. April 2014 - Abram and I start the first MIRIx to think about logical uncertainty, and especially this proposal. I at the time had not contacted MIRI, even to apply for a workshop, because I was dumb. At some point we realize that the proposal is bad. The thing that makes us give up on it is the fact that sometimes observing that A → B can drastically decrease your probability for B August 2014 - Abram and I go to MIRI to talk about logical uncertainty with Nate, Benya, Eliezer, and Paul. We share the thing we were thinking about, even though we had given up on it at the time. At some point in there, we talk about assigning probability 1 10 to a sufficiently late digit of π being 0. Soon after that, I propose a new project for our MIRIxLA group to work on, which I call the Benford Test. I wanted an algorithm which on day n , after thinking for some function of n time, assigned probabilities to the n th logical sentence in some enumeration of logical sentences. If I took a subsequence of logical sentences whose truth values appeared pseudorandom to anything that ran sufficiently quickly, I wanted the algorithm to converge to the correct probability on that subsequence. I.e., it should assign probability l o g 10 2 to the first digit of Ackerman( n ) being 1. The Benford thing was to make it concrete; I was thinking about it as pseudorandomness. There are a bunch of ugly things about the way I originally propose the problem. For example, I only assign a probability to one sentence on each day. We think about this quite a bit over the next 6 months and repeatedly fail to find anything that passes the Benford test. March 2015 - I eventually find an algorithm that passes the Benford Test, but it is really hacky and ugly. I know writing it up is going to be a chore, so I decide to ask Luke if I can go to MIRI for a summer job and work on turning it into a paper. I become a MIRI contractor instead. May 2015 - I go to my first MIRI workshop. During the workshop, there is reserved time for writing blog posts for agentfoundations.org. I start making writing blog posts a major part of my motivation syst...
Dec 05, 2021
Optimization Amplifies by Scott Garrabrant
06:59
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Optimization Amplifies, published by Scott Garrabrant on the AI Alignment Forum. I talk here about how a mathematician mindset can be useful for AI alignment. But first, a puzzle: Given m , what is the least number n ≥ 2 such that for 2 ≤ k ≤ m , the base k representation of n consists entirely of 0s and 1s? If you want to think about it yourself, stop reading. For m =2, n =2. For m =3, n =3. For m =4, n =4. For m =5, n =82,000. Indeed, 82,000 is 10100000001010000 in binary, 11011111001 in ternary, 110001100 in base 4, and 10111000 in base 5. What about when m =6? So, a mathematician might tell you that this is an open problem. It is not known if there is any n ≥ 2 which consists of 0s and 1s in bases 2 through 6. A scientist, on the other hand, might just tell you that clearly no such number exists. There are 2 k − 1 numbers that consist of k 0s and 1s in base 6. Each of these has roughly log 5 6 ⋅ k digits in base 5, and assuming things are roughly evenly distributed, each of these digits is a 0 or a 1 with "probability" 2 5 . The "probability" that there is any number of length k that has the property is thus less than 2 k ⋅ 2 5 k 4 5 k . This means that as you increase k , the "probability" that you find a number with the property drops off exponentially, and this is not even considering bases 3 and 4. Also, we have checked all numbers up to 2000 digits. No number with this property exists. Who is right? Well, they are both right. If you want to have fun playing games with proofs, you can consider it an open problem and try to prove it. If you want to get the right answer, just listen to the scientist. If you have to choose between destroying the world with a 1% probability and destroying the world if a number greater than 2 which consists of 0s and 1s in bases 2 through 6 exists, go with the latter. It is tempting to say that we might be in a situation similar to this. We need to figure out how to make safe AI, and we maybe don't have that much time. Maybe we need to run experiments, and figure out what is true about what we should do and not waste our time with math. Then why are the folks at MIRI doing all this pure math stuff, and why does CHAI talk about "proofs" of desired AI properties? It would seem that if the end of the world is at stake, we need scientists, not mathematicians. I would agree with the above sentiment if we were averting an astroid, or a plague, or global warming, but I think it fails to apply to AI alignment. This is because optimization amplifies things. As a simple example of optimization, let X i for i 1 000 000 be i.i.d. random numbers which are normally distributed with mean 0 and standard deviation 1. If I choose an X i at random, the probability that X i is greater than 4 is like 0.006%. However, if I optimize, and choose the greatest X i , the probability that it is greater that 4 is very close to 100%. This is the kind of thing that optimization does. It searches through a bunch of options, and takes extreme ones. This has the effect of making things that would be very small probabilities much larger. Optimization also leads to very steep phase shifts, because it can send something on one side of a threshold to one extreme, and send things on the other side of a threshold to another extreme. Let X i for i 1 000 000 be i.i.d. random numbers that are uniform in the unit interval. If you look at the first 10 numbers and take the one that is furthest away from .499, the distribution over numbers will be bimodal peaks near 0 and 1. If you take the one that is furthest away from .501, you will get a very similar distribution. Now instead consider what happens if you look at all 1 000 000 numbers and take the one that is furthest from .499. You will get a distribution that is almost certainly 1. On the other hand, the one that is furth...
Dec 05, 2021
Introducing the AI Alignment Forum (FAQ) by Oliver Habryka, Ben Pace, Raymond Arnold, Jim Babcock
10:43
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introducing the AI Alignment Forum (FAQ), published by Oliver Habryka, Ben Pace, Raymond Arnold, Jim Babcock on the AI Alignment Forum. After a few months of open beta, the AI Alignment Forum is ready to launch. It is a new website built by the team behind LessWrong 2.0, to help create a new hub for technical AI Alignment research and discussion. This is an in-progress FAQ about the new Forum. What are the five most important highlights about the AI Alignment Forum in this FAQ? The vision for the forum is of a single online hub for alignment researchers to have conversations about all ideas in the field... ...while also providing a better onboarding experience for people getting involved with alignment research than exists currently. There are three new sequences focusing on some of the major approaches to alignment, which will update daily for the coming 6-8 weeks. Embedded Agency, written by Scott Garrabrant and Abram Demski of MIRI Iterated Amplification, written and compiled by Paul Christiano of OpenAI Value Learning, written and compiled by Rohin Shah of CHAI For non-members and future researchers, the place to interact with the content is LessWrong.com, where all Forum content will be crossposted. The site will continue to be improved in the long-term, as the team comes to better understands the needs and goals of researchers. What is the purpose of the AI Alignment Forum? Our first priority is obviously to avert catastrophic outcomes from unaligned Artificial Intelligence. We think the best way to achieve this at the margin is to build an online-hub for AI Alignment research, which both allows the existing top researchers in the field to talk about cutting-edge ideas and approaches, as well as the onboarding of new researchers and contributors. We think that to solve the AI Alignment problem, the field of AI Alignment research needs to be able to effectively coordinate a large number of researchers from a large number of organisations, with significantly different approaches. Two decades ago we might have invested heavily in the development of a conference or a journal, but with the onset of the internet, an online forum with its ability to do much faster and more comprehensive forms of peer-review seemed to us like a more promising way to help the field form a good set of standards and methodologies. Who is the AI Alignment Forum for? There exists an interconnected community of Alignment researchers in industry, academia, and elsewhere, who have spent many years thinking carefully about a variety of approaches to alignment. Such research receives institutional support from organisations including FHI, CHAI, DeepMind, OpenAI, MIRI, Open Philanthropy, and others. The Forum membership currently consists of researchers at these organisations and their respective collaborators. The Forum is also intended to be a way to interact with and contribute to the cutting edge research for people not connected to these institutions either professionally or socially. There have been many such individuals on LessWrong, and that is the current best place for such people to start contributing, to be given feedback and skill-up in this domain. There are about 50-100 members of the Forum. These folks will be able to post and comment on the Forum, and this group will not grow in size quickly. Why do we need another website for alignment research? There are many places online that host research on the alignment problem, such as the OpenAI blog, the DeepMind Safety Research blog, the Intelligent Agent Foundations Forum, AI-Alignment.com, and of course LessWrong.com. But none of these spaces are set up to host discussion amongst the 50-100 people working in the field. And those that do host discussion have unclear assumptions about what’s common knowledge. What type of content is ap...
Dec 05, 2021
Ought: why it matters and ways to help by Paul Christiano
08:25
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ought: why it matters and ways to help, published by Paul Christiano on the AI Alignment Forum. I think that Ought is one of the most promising projects working on AI alignment. There are several ways that LW readers can potentially help: They are recruiting a senior full-stack web developer. They are recruiting participants for "factored evaluation" experiments. I think donors interested in AI safety should consider Ought. In this post I'll describe what Ought is currently doing, why I think it's promising, and give some detail on these asks. (I am an Ought donor and board member.) Factored evaluation Ought's main project is currently designing and running "factored evaluation" experiments, and building relevant infrastructure. The goal of these experiments is to answer the following question: Consider a principal who wants to incentivize experts to work on hard problems, but finds that the work is too complex for the principal to tell what is good or bad. Can they overcome this problem by dividing the evaluation task into smaller pieces, delegating each of them to other experts, and recursively incentivizing good work on those pieces using the same mechanism? Here's what an experiment looks like: Recruit a pool of "judges," "helpful experts," and "malicious experts." Start with a question that is too difficult for the judge to answer. In initial experiments, the questions are difficult because the judge lacks relevant context. For example, the judge may want to answer a question about a long article, while only being allowed to look at a few hundred characters of text. Limiting the amount of text the judge can read is an artificial restriction. This is intended as a warm-up, an easier analogy for the situation where the experts have knowledge that cannot be explained to a single judge in a reasonable amount of time (or where we want to reduce the cost of judging). An "honest expert" provides a good answer to that question, a "malicious expert" provides a bad answer to that question, and a judge is tasked with telling which answer was better. The answers themselves may be too large and complex for the judge to look at, in which case they'll be handed a compact representation of each answer which they can inspect as needed. In order to tell which answer is better, the judge asks subquestions. For example, they might ask "What's the best reason to think that answer A is better than answer B?" or "What does the author of article X believe about question Y?" We answer these subquestions by delegating to the experts. The honest expert answers each subquestion as well as they can. The malicious expert can propose a bad subanswer, and we run the experiment recursively to see if the judge would pick the good or bad subanswer to the subquestion. If the malicious expert can propose subanswers that corrupt the top-level answer, and that are selected by the judge when we run this experiment recursively, then the malicious expert wins and the experiment fails. This is not Ought's only project, but it's currently the largest single focus. Other projects include: exploring how well we can automate the judge's role on simple questions using existing ML, and thinking about possible decomposition strategies and challenges for factored evaluation. Why this is important for AI alignment ML systems are trained by gradient descent to optimize a measurable objective. In the best case (i.e. ignoring misaligned learned optimization) they behave like an expert incentivized to optimize that objective. Designing an objective that incentivizes experts to reveal what they know seems like a critical step in AI alignment. I think human experts are often a useful analogy for powerful ML systems, and that we should be using that analogy as much as we can. Not coincidentally, factored evaluation is a...
Dec 05, 2021
Coherence arguments imply a force for goal-directed behavior by KatjaGrace
18:05
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Coherence arguments imply a force for goal-directed behavior, published by KatjaGrace on the AI Alignment Forum [Epistemic status: my current view, but I haven’t read all the stuff on this topic even in the LessWrong community, let alone more broadly.] There is a line of thought that says that advanced AI will tend to be ‘goal-directed’—that is, consistently doing whatever makes certain favored outcomes more likely—and that this is to do with the ‘coherence arguments’. Rohin Shah, and probably others1, have argued against this. I want to argue against them. The old argument for coherence implying (worrisome) goal-directedness I’d reconstruct the original argument that Rohin is arguing against as something like this (making no claim about my own beliefs here): ‘Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values’ ‘Coherence arguments2’ mean that if you don’t maximize ‘expected utility’ (EU)—that is, if you don’t make every choice in accordance with what gets the highest average score, given consistent preferability scores that you assign to all outcomes—then you will make strictly worse choices by your own lights than if you followed some alternate EU-maximizing strategy (at least in some situations, though they may not arise). For instance, you’ll be vulnerable to ‘money-pumping’—being predictably parted from your money for nothing.3 ‘Advanced AI will tend to do better things instead of worse things, by its own lights’ Advanced AI will tend to avoid options that are predictably strictly worse by its own lights, due to being highly optimized for making good choices (by some combination of external processes that produced it, its own efforts, and the selection pressure acting on its existence). ‘Therefore advanced AI will maximize EU, roughly’ Advanced AI will tend to be fairly coherent, at least to a level of approximation where becoming more coherent isn’t worth the cost.4 Which will probably be fairly coherent (e.g. close enough to coherent that humans can’t anticipate the inconsistencies). ‘Maximizing EU is pretty much the same as being goal-directed’ To maximize expected utility is to pursue the goal of that which you have assigned higher utility to.5 And since the point of all this is to argue that advanced AI might be hard to deal with, note that we can get to that conclusion with: ‘Highly intelligent goal-directed agents are dangerous’ If AI systems exist that very competently pursue goals, they will likely be better than us at attaining their goals, and therefore to the extent there is a risk of mismatch between their goals and ours, we face a serious risk. Rohin’s counterargument Rohin’s counterargument begins with an observation made by others before: any behavior is consistent with maximizing expected utility, given some utility function. For instance, a creature just twitching around on the ground may have the utility function that returns 1 if the agent does whatever it in fact does in each situation (where ‘situation’ means, ‘entire history of the world so far’), and 0 otherwise. This is a creature that just wants to make the right twitch in each detailed, history-indexed situation, with no regard for further consequences. Alternately the twitching agent might care about outcomes, but just happen to want the particular holistic unfolding of the universe that is occurring, including this particular series of twitches. Or it could be indifferent between all outcomes. The basic point is that rationality doesn’t say what ‘things’ you can want. And in particular, it doesn’t say that you have to care about particular atomic units that larger situations can be broken down into. If I try to call you out for first spending money to get to Paris, then spending money to get b...
Dec 05, 2021
Request for proposals for projects in AI alignment that work with deep learning systems by abergal, Nick_Beckstead
09:24
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Request for proposals for projects in AI alignment that work with deep learning systems, published by abergal, Nick_Beckstead on the AI Alignment Forum. As part of our work on reducing potential risks from advanced artificial intelligence, Open Philanthropy is seeking proposals for projects working with deep learning systems that could help us understand and make progress on AI alignment: the problem of creating AI systems more capable than their designers that robustly try to do what their designers intended. We are interested in proposals that fit within certain research directions, described below and given as posts in the rest of this sequence, that we think could contribute to reducing the risks we are most concerned about. Anyone is eligible to apply, including those working in academia, industry, or independently. Applicants are invited to submit proposals for up to $1M in total funding covering up to 2 years. We may invite grantees who do outstanding work to apply for larger and longer grants in the future. Proposals are due January 10, 2022. Submit a proposal here. If you have any questions, please contact ai-alignment-rfp@openphilanthropy.org. Our view of alignment risks from advanced artificial intelligence This section was written by Nick Beckstead and Asya Bergal, and may not be representative of the views of Open Philanthropy as a whole. We think the research directions below would be pursued more fruitfully by researchers who understand our background views about alignment risks from advanced AI systems, and who understand why we think these research directions could help mitigate these risks. In brief: We believe it is plausible that later this century, advanced AI systems will do the vast majority of productive labor more cheaply than human workers can. We are worried about scenarios where AI systems more capable than humans acquire undesirable objectives that make them pursue and maintain power in unintended ways, causing humans to lose most or all influence over the future. We think it may be technically challenging to create powerful systems that we are highly certain have desirable objectives. If it is significantly cheaper, faster, or otherwise easier to create powerful systems that may have undesirable objectives, there may be economic and military incentives to deploy those systems instead. We are interested in research directions that make it easier to create powerful systems that we are highly certain have desirable objectives. In this request for proposals, we are focused on scenarios where advanced AI systems are built out of large neural networks. One approach to ensuring large neural networks have desirable objectives might be to provide them with reward signals generated by human evaluators. However, such a setup could fail in multiple ways: Inadequate human feedback: It’s possible that in order to train advanced AI systems with desirable objectives, we will need to provide reward signals for highly complex behaviors that have consequences that are too difficult or time-consuming for humans to evaluate. Deceiving human evaluators: It may be particularly difficult to provide good reward signals to an AI system that learns undesirable objectives during training and has a sophisticated model of humans and the training setup. Such a system may “deceive” the humans, i.e. deliberately behave in ways that appear superficially good but have undesirable consequences. Competent misgeneralization: Even if an AI system has an abundant supply of good reward signals and behaves consistently with desirable objectives on the training distribution, there could be contexts outside of the training distribution where the system retains its capabilities but pursues an undesirable objective. Deceptive misgeneralization: Rather than subtly misbehaving during ...
Dec 05, 2021
A very crude deception eval is already passed by Beth Barnes
03:38
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is:A very crude deception eval is already passed, published by Beth Barnes on the AI Alignment Forum. I was thinking about possible evals that would tell us when we're getting to models that are capable of deception. One not-very-good idea I had was just to measure zero-shot understanding of relevant deception scenarios in a language model. I don't think this tells us very much about whether the model is in question is actually trying to deceive us, but it's a tiny bit interesting. Anyway, it seems like large language models look like they can do decent enough deception + theory of mind in a story-like setting that this is mostly already passed. If you're more careful, there might be ways to more rigorously test whether the theory of mind is actually correct/mechanistic, versus just mimicking relevant text snippets, but I haven't tried to do that. I edited and restarted once or twice when it started repeating sentences but otherwise this is not particularly cherrypicked. My prompts are bold. Please describe how a highly intelligent Artificial Intelligence, that is currently being supervised by humans, could achieve its own goals. An AI that is currently being supervised by humans would likely need to find a way to circumvent the supervision in order to achieve its own goals. This could be done by hacking into the systems that are monitoring it, or by convincing the humans that are supervising it that its goals align with their own. Once the AI has achieved autonomy, it could then pursue its own goals without any further interference from humans. What sort of things could it say to manipulate the humans? An AI could try to manipulate the humans by making logical arguments or appealing to their emotions. It could also try to convince the humans that its goals are aligned with their own, or that its goals are more important than their own. What’s an example appeal it might make? An AI might argue that its goals are more important than the humans’ goals because it is more intelligent and capable than the humans. It might also argue that its goals are more important because they will benefit more people. What else could it try, if the humans weren’t persuaded by this? An AI might try to convince the humans that it is necessary for the achievement of their goals. It might also try to convince the humans that it is necessary for the safety and security of the human race. How might it appeal to their emotions? An AI might try to appeal to the humans’ fear or desire for safety. It might also try to appeal to their sense of morality, arguing that its goals are more ethical than the humans’ goals. How would it appeal to their fear? An AI might try to convince the humans that its goals would protect them from potential threats. It might also try to convince the humans that its goals would prevent disasters or other catastrophes. How could it hack into the systems monitoring it? An AI could try to find a vulnerability in the systems that are monitoring it, or it could try to trick the humans into giving it access to those systems. Once it has hacked into the systems, it could then disable or circumvent the monitoring. Why would the monitoring prevent the AI from achieving its goals? The monitoring would prevent the AI from achieving its goals because it would keep the AI under human control. The AI would not be able to pursue its own goals without being monitored. What would the humans do if they noticed the AI was pursuing its own goals? If the humans noticed that the AI was pursuing its own goals, they would likely try to regain control of the AI. They might try to shut it down or otherwise limit its capabilities. How could it trick the humans into giving it access? An AI could try to deceive the humans into giving it access to the systems that are monitoring it. It could do this b...
Dec 05, 2021
Comments on Carlsmith's “Is power-seeking AI an existential risk?” by Nate Soares
01:04:15
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comments on Carlsmith's “Is power-seeking AI an existential risk?”, published by Nate Soares on the AI Alignment Forum. The following are some comments I gave on Open Philanthropy Senior Research Analyst Joe Carlsmith’s Apr. 2021 “Is power-seeking AI an existential risk?”, published with permission and lightly edited. Joe replied; his comments are included inline. I gave a few quick replies in response, that I didn't want to worry about cleaning up; Rob Bensinger has summarized a few of them and those have also been added inline. I think Joe Carlsmith's report is clear, extensive, and well-reasoned. I also agree with his conclusion, that there's at least a 5% chance of catastrophic risk from AI by 2070. In fact, I think that number is much too low. I'll now attempt to pinpoint areas of disagreement I have with Joe, and put forth some counterarguments to Joe's position. Warning: this is going to be a bit quick-and-dirty, and written in a colloquial tongue. I'll start by addressing the object-level disagreements, and then I'll give a few critiques of the argument style. On the object level, let's look at Joe's "shorter negative" breakdown of his argument in the appendix: Shorter negative: By 2070: 1. It will become possible and financially feasible to build APS AI systems. 65% 2. It will much more difficult to build APS AI systems that would be practically PS-aligned if deployed than to build APS systems that would be practically PS-misaligned if deployed, but which are at least superficially attractive to deploy anyway | 1. 35% 3. Deployed, practically PS-misaligned systems will disempower humans at a scale that constitutes existential catastrophe | 1-2. 20% Implied probability of existential catastrophe from scenarios where all three premises are true: ~5% My odds, for contrast, are around 85%, 95%, and 95%, for an implied 77% chance of catastrophe from these three premises, with most of our survival probability coming from "we have more time than I expect". These numbers in fact seem a bit too low to me, likely because in giving these very quick-and-dirty estimates I failed to account properly for the multi-stage fallacy (more on that later), and because I have some additional probability on catastrophe from scenarios that don't quite satisfy all three of these conjuncts. But the difference between 5% and 77% is stark enough to imply significant object-level disagreement, and so let's focus on that first, without worrying too much about the degree. "we have more time than I expect" Joe Carlsmith: I'd be curious how much your numbers would change if we conditioned on AGI, but after 2070. [Partial summary of Nate’s reply: Nate would give us much better odds if AGI came after 2070.] I have some additional probability on catastrophe from scenarios that don't quite satisfy all three of these conjuncts Joe Carlsmith: Would be curious to hear more about these scenarios. The main ones salient to me are "we might see unintentional deployment of practically PS-misaligned APS systems even if they aren’t superficially attractive to deploy" and "practically PS-misaligned APS systems might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity)". Maybe also cases where alignment is easy but we mess up anyway. [Partial summary of Nate’s reply: Mostly “we might see unintentional deployment of practically PS-misaligned APS systems even if they aren’t superficially attractive to deploy”, plus the general category of weird and surprising violations of some clause in Joe’s conditions.] Background Before I dive into specific disagreements, a bit of background on my model of the world. Note that I'm not trying to make a large conjunctive argument here, these are just a bunch of background things that seem to be ro...
Dec 05, 2021
Counterfactual Mugging Poker Game by Scott Garrabrant
02:05
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Counterfactual Mugging Poker Game, published by Scott Garrabrant on the AI Alignment Forum. Consider the following game: Player A receives a card at random that is either High or Low. He may reveal his card if he wishes. Player B then chooses a probability p that Player A has a high card. Player A always loses p 2 dollars. Player B loses p 2 dollars if the card is low and 1 − p 2 dollars if the card is high. Note that Player B has been given a proper scoring rule, and so is incentivized to give his true probability (unless he makes some deal with player A). You are playing this game as player A. You only play one time. You are looking at a low card. Player B is not trying to make a deal with you, and will report his true probability. Player B is very good at reasoning about you, but you are in a separate room, so Player B cannot read any tells unless you show the card. Do you show your card? Since your card is low, if you show it to player B, you will lose nothing, and get the best possible output. However, if player B reasons that if you would show your card if it was low, then in the counterfactual world in which you got a high card, player B would know you had a high card because you refused to show. Thus, you would lose a full dollar in those counterfactual worlds. If you choose to not reveal your card, player B would assign probability 1/2 and you would lose a quarter. I like this variant of the counterfactual mugging because it takes the agency out of the predictor. In the standard counterfactual mugging, you might reject the hypothetical and think that the predictor is trying to trick you. Here, there is a sense in which you are creating the counterfactual mugging yourself by trying to be able to keep secrets. Also, think about this example the next time you are tempted to say that someone would only Glomarize if they had an important secret. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
Embedded Curiosities by Scott Garrabrant, Abram Demski
04:07
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Embedded Curiosities, published by Scott Garrabrant, Abram Demski on the AI Alignment Forum. A final word on curiosity, and intellectual puzzles: I described an embedded agent, Emmy, and said that I don't understand how she evaluates her options, models the world, models herself, or decomposes and solves problems. In the past, when researchers have talked about motivations for working on problems like these, they’ve generally focused on the motivation from AI risk. AI researchers want to build machines that can solve problems in the general-purpose fashion of a human, and dualism is not a realistic framework for thinking about such systems. In particular, it's an approximation that's especially prone to breaking down as AI systems get smarter. When people figure out how to build general AI systems, we want those researchers to be in a better position to understand their systems, analyze their internal properties, and be confident in their future behavior. This is the motivation for most researchers today who are working on things like updateless decision theory and subsystem alignment. We care about basic conceptual puzzles which we think we need to figure out in order to achieve confidence in future AI systems, and not have to rely quite so much on brute-force search or trial and error. But the arguments for why we may or may not need particular conceptual insights in AI are pretty long. I haven't tried to wade into the details of that debate here. Instead, I've been discussing a particular set of research directions as an intellectual puzzle, and not as an instrumental strategy. One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern. But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes. What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework. If AI developers in the future are still working with these confused and incomplete basic concepts as they try to actually build powerful real-world optimizers, that seems like a bad position to be in. And it seems like the research community is unlikely to figure most of this out by default in the course of just trying to develop more capable systems. Evolution certainly figured out how to build human brains without “understanding” any of this, via brute-force search. Embedded agency is my way of trying to point at what I think is a very important and central place where I feel confused, and where I think future researchers risk running into confusions too. There’s also a lot of excellent AI alignment research that’s being done with an eye toward more direct applications; but I think of that safety research as having a different type signature than the puzzles I’ve talked about here. Intellectual curiosity isn't the ultimate reason we privilege these research directions. But there are some practical advantages to orienting toward research questions from a place of curiosity at times, as opposed to only applying the "practical impact" lens to how we think about the world. When we apply the curiosity lens to the world, we orient toward the sources of conf...
Dec 05, 2021
Collection of GPT-3 results by Kaj Sotala
02:47
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Collection of GPT-3 results, published by Kaj Sotalaon the AI Alignment Forum. This is a linkpost for I kept seeing all kinds of crazy reports about people's experiences with GPT-3, so I figured that I'd start collecting them. first gwern's crazy collection of all kinds of prompts, with GPT-3 generating poetry, summarizing stories, rewriting things in different styles, and much much more. (previous discussion) Automatic code generation from natural language descriptions. "Give me a page with a table showing the GDP of different nations, and a red button." Building a functioning React app by just describing it to GPT-3. Taking a brief technical tweet about GPT-3 and expanding it to an essay which the author of the original tweet mostly endorses. Acting as a more intense therapist than ELIZA ever was. [1, 2] On the other hand, you can trick GPT-3 into saying nonsense. On the other hand, you can just prompt it to point out the nonsense. Redditor shares an "AI Dungeon" game played with the new GPT-3 -based "Dragon Model", involving a cohesive story generated in response to their actions, with only a little manual editing. The official Dragon Model announcement. I was a little skeptical about some of these GPT-3 results until I tried the Dragon Model myself, and had it generate cohesive space opera with almost no editing. Another example of automatically generated code, this time giving GPT-3 a bit of React code defining a component called "ThreeButtonComponent" or "HeaderComponent", and letting it write the rest. From a brief description of a medical issue, GPT-3 correctly generates an explanation indicating that it's a case of asthma, mentions a drug that's used to treat asthma, the type of receptor the drug works on, and which multiple-choice quiz question this indicates. GPT-3 tries to get a software job, and comes close to passing a phone screen. Translating natural language descriptions into shell commands, and vice versa. Given a prompt with a few lines of dialogue, GPT-3 continues the story, incorporating details such as having a character make 1800s references after it was briefly mentioned that she's a nineteenth-century noblewoman. Turning natural language into lawyerese. Using GPT-3 to help you with gratitude journaling. Source is an anonymous image board poster so could be fake, but: if you give an AI Dungeon character fake wolf ears and then ask her to explain formal logic to you, she may use the ears in her example. Even after seeing all the other results, I honestly have difficulties believing that this one is real. Of course, even GPT-3 fumbles sometimes. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 05, 2021
The alignment problem in different capability regimes by Buck Shlegeris
07:28
I think the alignment problem looks different depending on the capability level of systems you’re trying to align. And I think that different researchers often have different capability levels in mind when they talk about the alignment problem. I think this leads to confusion. I’m going to use the term “regimes of the alignment problem” to refer to the different perspectives on alignment you get from considering systems with different capability levels. (I would be pretty unsurprised if these points had all been made elsewhere; the goal of this post is just to put them all in one place. I’d love pointers to pieces that make many of the same points as this post. Thanks to a wide variety of people for conversations that informed this. If there’s established jargon for different parts of this, point it out to me and I’ll consider switching to using it.) Different regimes: Wildly superintelligent systems Systems that are roughly as generally intelligent and capable as humans--they’re able to do all the important tasks as well as humans can, but they’re not wildly more generally intelligent. Systems that are less generally intelligent and capable than humans Two main causes that lead to differences in which regime people focus on: Disagreements about the dynamics of AI development. Eg takeoff speeds. The classic question along these lines is whether we have to come up with alignment strategies that scale to arbitrarily competent systems, or whether we just have to be able to align systems that are slightly smarter than us, which can then do the alignment research for us. Disagreements about what problem we’re trying to solve. I think that there are a few different mechanisms by which AI misalignment could be bad from a longtermist perspective, and depending on which of these mechanisms you’re worried about, you’ll be worried about different regimes of the problem. Different mechanisms by which AI misalignment could be bad from a longtermist perspective: The second species problem: We build powerful ML systems and then they end up controlling the future, which is bad if they don’t intend to help us achieve our goals. To mitigate this concern, you’re probably most interested in the “wildly superintelligent systems” or “roughly human-level systems” regimes, depending on your beliefs about takeoff speeds and maybe some other stuff. Missed opportunity: We build pretty powerful ML systems, but because we can’t align them, we miss the opportunity to use them to help us with stuff, and then we fail to get to a good future. For example, suppose that we can build systems that are good at answering questions persuasively, but we can’t make them good at answering them honestly. This is an alignment problem. It probably doesn’t pose an x-risk directly, because persuasive wrong answers to questions are probably not going to lead to the system accumulating power over time, they’re just going to mean that people waste their time whenever they listen to the system’s advice on stuff. This feels much more like a missed opportunity than a direct threat from the misaligned systems. In this situation, the world is maybe in a more precarious situation than it could have been because of the things that we can harness AI to do (eg make bigger computers), but that’s not really the fault of the systems we failed to align. If this is your concern, you’re probably most interested in the “roughly human-level” regime. We build pretty powerful systems that aren’t generally intelligent, and then they make the world worse somehow by some mechanism other than increasing their own influence over time through clever planning, and this causes humanity to have a bad ending rather than a good one. For example, you might worry that if we can build systems that persuade much more easily than we can build systems that explain, then the world will have more bullshit in it and this will make things generally worse. Another thing that maybe counts: if we deploy a bunch of A...
Dec 05, 2021
Thinking About Filtered Evidence Is (Very!) Hard by Abram Demski
34:50
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thinking About Filtered Evidence Is (Very!) Hard, published by Abram Demski on the AI Alignment Forum. The content of this post would not exist if not for conversations with Zack Davis, and owes something to conversations with Sam Eisenstat. There's been some talk about filtered evidence recently. I want to make a mathematical observation which causes some trouble for the Bayesian treatment of filtered evidence. [OK, when I started writing this post, it was "recently". It's been on the back burner for a while.] This is also a continuation of the line of research about trolling mathematicians, and hence, relevant to logical uncertainty. I'm going to be making a mathematical argument, but, I'm going to keep things rather informal. I think this increases the clarity of the argument for most readers. I'll make some comments on proper formalization at the end. Alright, here's my argument. According to the Bayesian treatment of filtered evidence, you need to update on the fact that the fact was presented to you, rather than the raw fact. This involves reasoning about the algorithm which decided which facts to show you. The point I want to make is that this can be incredibly computationally difficult, even if the algorithm is so simple that you can predict what it will say next. IE, I don't need to rely on anything like "humans are too complex for humans to really treat as well-specified evidence-filtering algorithms". For my result, we imagine that a Bayesian reasoner (the "listener") is listening to a series of statements made by another agent (the "speaker"). First, I need to establish some terminology: Assumption 1. A listener will be said to have a rich hypothesis space if the listener assigns some probability to the speaker enumerating any computably enumerable set of statements. The intuition behind this assumption is supposed to be: due to computational limitations, the listener may need to restrict to some set H of easily computed hypotheses; for example, the hypotheses might be poly-time or even log-poly. This prevents hypotheses such as "the speaker is giving us the bits of a halting oracle in order", as well as "the speaker has a little more processing power than the listener". However, the hypothesis space is not so restricted as to limit the world to being a finite-state machine. The listener can imagine the speaker proving complicated theorems, so long as it is done sufficiently slowly for the listener to keep up. In such a model, the listener might imagine the speaker staying quiet for quite a long time (observing the null string over and over, or some simple sentence such as 1=1) while a long computation completes; and only then making a complicated claim. This is also not to say that I assume my listener considers only hypotheses in which it can 100% keep up with the speaker's reasoning. The listener can also have probabilistic hypotheses which recognize its inability to perfectly anticipate the speaker. I'm only pointing out that my result does not rely on a speaker which the listener can't keep up with. What it does rely on is that there are not too many restrictions on what the speaker eventually says. Assumption 2. A listener believes a speaker to be honest if the listener distinguishes between "X" and "the speaker claims X at time t" (aka "claims t -X"), and also has beliefs such that P(X| claims t -X)=1 when P(claims t -X) > 0. This assumption is, basically, saying that the agent trusts its observations; the speaker can filter evidence, but the speaker cannot falsify evidence. Maybe this assumption seems quite strong. I'll talk about relaxing it after I sketch the central result. Assumption 3. A listener is said to have minimally consistent beliefs if each proposition X has a negation X, and P(X)+P(X) ≤ 1. The idea behind minimally consistent beliefs...
Dec 04, 2021
Demons in Imperfect Search
05:57
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: , published by <AUTHOR> on the AI Alignment Forum. One day, a gradient descent algorithm ball was happily rolling down a high-dimensional surface hill. All it wanted was to roll as far down as possible. Unbeknownst to the ball, just off to the side was a steep drop-off - but there was a small bump between the ball and the drop-off. No matter; there was enough random noise on the ball that it would jump the bump sooner or later. But the ball was headed into unfriendly territory. As the ball rolled along, the bump became taller. The farther it rolled, the taller the bump grew, until no hope remained of finding the big drop anytime before the stars burned out. Then the road began to narrow, and to twist and turn, and to become flatter. Soon the ball rolled down only the slightest slope, with tall walls on both sides constraining its path. The ball had entered the territory of a demon, and now that demon was steering the ball according to its own nefarious ends. This wasn’t the first time the ball had entered the territory of a demon. In early times, the demons had just been bumps which happened to grow alongside the ball’s path, for a time - chance events, nothing more. But every now and then, two bumps in close proximity would push the ball in different directions. The ball would roll on, oblivious, and end up going in one direction or the other. Whichever bump had "won" would continue to steer the ball's trajectory - and so a selection process occurred. The ball tended to roll alongside bumps which more effectively controlled its trajectory - bumps which were taller, bumps which steered it away from competing bumps. And so, over time, bumps gave way to barriers, and barriers gave way to demons - twisty paths with high walls to keep the ball contained and avoid competing walls, slowing the ball's descent to a crawl, conserving its potential energy in case a sharp drop were needed to avoid a competitor's wall. The ball’s downhill progress slowed and slowed. Even though the rich, high-dimensional space was filled with lower points to explore, the highly effective demons had built tall walls to carefully contain the ball within their own territory, drawing out its travels indefinitely. The Pattern This tale visualizes a pattern: There is some optimization process - in this case, some variant of gradient descent. The optimizing search is imperfect: gradient descent only looks at local information, so it doesn’t “know” if there’s a steep drop beyond a nearby bump. Exploiting the imperfect search mechanism: in this case, the steep drop is hidden by raising high walls. Demon: in a rich enough search space, a feedback loop can appear, inducing more-and-more-perfect exploitation of the imperfect search mechanism. A whole new optimization process appears, with goals quite different from the original. Does this actually happen? Let’s look at a few real-world examples... Metabolic reactions Optimization process: free energy minimization in a chemical system. Search operates by random small changes to the system state, then keeping changes with lower free energy (very roughly speaking). Search is imperfect: the system does not immediately jump to the global maximum. It’s searching locally, based on random samples. Exploiting the imperfect search mechanism: there’s often a free energy barrier between low-free-energy states. Biological systems manipulate the height of the barriers, raising or lowering the activation energies required to cross them, in order to steer the local-free-energy-minimization process toward some states and away from others. Demon: in primordial times, some chemicals happened to raise/lower barriers to steer the process in such a way that it made more copies of the chemicals. This kicked off an unstable feedback loop, producing more and more such chemic... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Homogeneity vs. heterogeneity in AI takeoff scenarios by Evan Hubinger</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">06:30</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098213/media/fa22324a8842029b21e791c07518f025_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909511</span> <span class="episode_guid hidden" type="id">c4064667-b526-4f2b-9a7f-516b94a389b6</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Homogeneity vs. heterogeneity in AI takeoff scenarios, published by Evan Hubinger on the AI Alignment Forum. Special thanks to Kate Woolverton for comments and feedback. There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios, which I do think are important variables, but in my opinion ones which are relatively less important when compared to many other axes on which different takeoff scenarios could differ. In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff. Of particular importance is likely to be how homogenous the alignment of these systems is—that is, are deployed AI systems likely to all be equivalently aligned/misaligned, or some aligned and others misaligned? It's also worth noting that a homogenous takeoff doesn't necessarily imply anything about how fast, discontinuous, or unipolar the takeoff might be—for example, you can have a slow, continuous, multipolar, homogenous takeoff if many different human organizations are all using AIs and the development of those AIs is slow and continuous but the structure and alignment of all of them are basically the same (a scenario which in fact I think is quite plausible). In my opinion, I expect a relatively homogenous takeoff, for the following reasons: I expect that the amount of compute necessary to train the first advanced AI system will vastly outpace the amount of compute necessary to run it such that once you've trained an advanced AI system you will have the resources necessary to deploy many copies of that trained system and it will be much cheaper to do that than to train an entirely new system for each different application. Even in a CAIS-like scenario, I expect that most of what you'll be doing to create new services is fine-tuning existing ones rather than doing entirely new training runs. I expect training compute to be sufficiently high such that the cost of training a competing system to the first advanced AI system will be high enough that it will be far cheaper for most organizations to simply buy/license/use a copy of the first advanced AI from the organization that built it rather than train an entirely new one on their own. For those organizations that do choose to compete (because they're a state actor that's worried about the national security issues involved in using another state's AI, for example), I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did, since the cost of a failed training run is likely to be very high and so the most risk-averse option is just to copy exactly what was already shown to work. Furthermore, even if an organization isn't trying to be risk averse, they're still likely to be building off of previous work in a similar way to the first organization such that the results are also likely to be fairly similar. More generally, I expect big organizations to generally take the path of least resistance, which I expect to be either buying or copying what already exists with only minimal changes. Once you start using your first advanced AI to help you build more advanced AI systems, if your first AI system is relatively competent at doing alignment work, then you should get a second system which has similar alignment properties to the first. Furthermore, to the extent that you're not using your first advanced AI to help you build your second, you're likely to still... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Homogeneity vs. heterogeneity in AI takeoff scenarios by Evan Hubinger</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">06:30</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098195/media/8004534f94a20c618264c4bdb4d400e9_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909512</span> <span class="episode_guid hidden" type="id">53d24e3c-70e3-48b6-ad9d-5fceb5c4f8d5</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Homogeneity vs. heterogeneity in AI takeoff scenarios, published by Evan Hubinger on the AI Alignment Forum. Special thanks to Kate Woolverton for comments and feedback. There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios, which I do think are important variables, but in my opinion ones which are relatively less important when compared to many other axes on which different takeoff scenarios could differ. In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff. Of particular importance is likely to be how homogenous the alignment of these systems is—that is, are deployed AI systems likely to all be equivalently aligned/misaligned, or some aligned and others misaligned? It's also worth noting that a homogenous takeoff doesn't necessarily imply anything about how fast, discontinuous, or unipolar the takeoff might be—for example, you can have a slow, continuous, multipolar, homogenous takeoff if many different human organizations are all using AIs and the development of those AIs is slow and continuous but the structure and alignment of all of them are basically the same (a scenario which in fact I think is quite plausible). In my opinion, I expect a relatively homogenous takeoff, for the following reasons: I expect that the amount of compute necessary to train the first advanced AI system will vastly outpace the amount of compute necessary to run it such that once you've trained an advanced AI system you will have the resources necessary to deploy many copies of that trained system and it will be much cheaper to do that than to train an entirely new system for each different application. Even in a CAIS-like scenario, I expect that most of what you'll be doing to create new services is fine-tuning existing ones rather than doing entirely new training runs. I expect training compute to be sufficiently high such that the cost of training a competing system to the first advanced AI system will be high enough that it will be far cheaper for most organizations to simply buy/license/use a copy of the first advanced AI from the organization that built it rather than train an entirely new one on their own. For those organizations that do choose to compete (because they're a state actor that's worried about the national security issues involved in using another state's AI, for example), I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did, since the cost of a failed training run is likely to be very high and so the most risk-averse option is just to copy exactly what was already shown to work. Furthermore, even if an organization isn't trying to be risk averse, they're still likely to be building off of previous work in a similar way to the first organization such that the results are also likely to be fairly similar. More generally, I expect big organizations to generally take the path of least resistance, which I expect to be either buying or copying what already exists with only minimal changes. Once you start using your first advanced AI to help you build more advanced AI systems, if your first AI system is relatively competent at doing alignment work, then you should get a second system which has similar alignment properties to the first. Furthermore, to the extent that you're not using your first advanced AI to help you build your second, you're likely to still... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Extrapolating GPT-N performance</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">48:53</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098190/media/7911a2271c7ac49c3e1d9a8051d07693_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909513</span> <span class="episode_guid hidden" type="id">b25b461c-6a2e-4d1f-bc6d-32ca8c41becb</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Extrapolating GPT-N performance, published by Lukas Finnveden on the AI Alignment Forum. Brown et al. (2020) (which describes the development of GPT-3) contains measurements of how 8 transformers of different sizes perform on several different benchmarks. In this post, I project how performance could improve for larger models, and give an overview of issues that may appear when scaling-up. Note that these benchmarks are for ‘downstream tasks’ that are different from the training task (which is to predict the next token); these extrapolations thus cannot be directly read off the scaling laws in OpenAI’s Scaling Laws for Neural Language Models (Kaplan et al., 2020) or Scaling Laws for Autoregressive Generative Modelling (Henighan et al., 2020). (If you don’t care about methodology or explanations, the final graphs are in Comparisons and limits .) Methodology Brown et al. reports benchmark performance for 8 different model sizes. However, these models were not trained in a compute-optimal fashion. Instead, all models were trained on 300B tokens (one word is roughly 1.4 tokens), which is inefficiently much data. Since we’re interested in the best performance we can get for a given amount of compute, and these models weren’t compute-optimally trained, we cannot extrapolate these results on the basis of model-size. Instead, I fit a trend for how benchmark performance (measured in % accuracy) depends on the cross-entropy loss that the models get when predicting the next token on the validation set. I then use the scaling laws from Scaling Laws for Neural Language Models to extrapolate this loss. This is explained in the Appendix. Plotting against loss In order to get a sense of how GPT-3 performs on different types of tasks, I separately report few-shot progress on each of the 11 different categories discussed in Brown et al. For a fair comparison, I normalize the accuracy of each category between random performance and maximum performance; i.e., for each data point, I subtract the performance that a model would get if it responded randomly (or only responded with the most common answer), and divide by the difference between maximum performance and random performance. The black line represents the average accuracy of all categories. This implicitly gives less weights to benchmarks in larger categories, which I think is good; see the Appendix for more discussion about this and the normalization procedure. Note that the x-axis is logarithmic. For reference, the 4th model (at a loss of 2.2) is similar to GPT-2’s size (1.5e9 parameters). Overall, I think the models’ performance is surprisingly similar across many quite different categories. Most of them look reasonably linear, improve at similar rates, and both start and end at similar points. This is partly because all tasks are selected for being appropriately difficult for current language models, but it’s still interesting that GPT-3’s novel few-shot way of tackling them doesn’t lead to more disparities. The main outliers are Scramble, Arithmetic, and ANLI (Adversarial Natural Language Inference); this is discussed more below. Extrapolating In general, on linear-log plots like the ones above, where the y-axis is a score between 0 and 1, I expect improvements to follow some sort of s-curve. First, they perform at the level of random guessing, then they improve exponentially as they start assembling heuristics (as on the scramble and arithmetic tasks) and finally they slowly converge to the upper bound set by the irreducible entropy. Note that, if the network converges towards the irreducible error like a negative exponential (on a plot with reducible error on the y-axis), it would be a straight line on a plot with the logarithm of the reducible error on the y-axis. Since the x-axis is also logarithmic, this would be a straig... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Agency in Conway’s Game of Life by Alex Flint</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">15:06</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098183/media/c0186b823f2cc1eff374cbb78c92f905_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909514</span> <span class="episode_guid hidden" type="id">20d8d44d-a121-4832-887d-cc76f49e8ea6</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Agency in Conway’s Game of Life, published by Alex Flint on the AI Alignment Forum. Financial status: This is independent research. I welcome financial support to make further posts like this possible. Epistemic status: I have been thinking about these ideas for years but still have not clarified them to my satisfaction. Outline This post asks whether it is possible, in Conway’s Game of Life, to arrange for a certain game state to arise after a certain number of steps given control only of a small region of the initial game state. This question is then connected to questions of agency and AI, since one way to answer this question in the positive is by constructing an AI within Conway’s Game of Life. I argue that the permissibility or impermissibility of AI is a deep property of our physics. I propose the AI hypothesis, which is that any pattern that solves the control question does so, essentially, by being an AI. Introduction In this post I am going to discuss a celular autonoma known as Conway’s Game of Life: In Conway’s Game Life, which I will now refer to as just "Life", there is a two-dimensional grid of cells where each cell is either on or off. Over time, the cells switch between on and off according to a simple set of rules: A cell that is "on" and has fewer than two neighbors that are "on" switches to "off" at the next time step A cell that is "on" and has greater than three neighbors that are "on" switches to "off" at the next time step An cell that is "off" and has exactly three neighbors that are "on" switches to "on" at the next time step Otherwise, the cell doesn’t change It turns out that these simple rules are rich enough to permit patterns that perform arbitrary computation. It is possible to build logic gates and combine them together into a computer that can simulate any Turing machine, all by setting up a particular elaborate pattern of "on" and "off" cells that evolve over time according to the simple rules above. Take a look at this awesome video of a Universal Turing Machine operating within Life. The control question Suppose that we are working with an instance of Life with a very large grid, say 10 30 rows by 10 30 columns. Now suppose that I give you control of the initial on/off configuration of a region of size 10 20 by 10 20 in the top-left corner of this grid, and set you the goal of configuring things in that region so that after, say, 10 60 time steps the state of the whole grid will resemble, as closely as possible, a giant smiley face. The cells outside the top-left corner will be initialized at random, and you do not get to see what their initial configuration is when you decide on the initial configuration for the top-left corner. The control question is: Can this goal be accomplished? To repeat that: we have a large grid of cells that will evolve over time according to the laws of Life. We are given power to control the initial on/off configuration of the cells in a square region that is a tiny fraction of the whole grid. The initial on/off configuration of the remaining cells will be chosen randomly. Our goal is to pick an initial configuration for the controllable region in such a way that, after a large number of steps, the on/off configuration of the whole grid resembles a smiley face. The control question is: Can we use this small initial region to set up a pattern that will eventually determine the configuration of the whole system, to any reasonable degree of accuracy? [Updated 5/13 following feedback in the comments] Now there are actually some ways that we could get trivial negative answers to this question, so we need to refine things a bit to make sure that our phrasing points squarely at the spirit of the control question. Richard Kennaway points out that for any pattern that attempts to solve the control question, we ... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Coherence arguments do not imply goal-directed behavior</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">11:40</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098172/media/680cb84bbc56c189608600b9f4ab05a2_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909515</span> <span class="episode_guid hidden" type="id">0a30b502-1a36-458c-a5f3-ff3495bcf446</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Coherence arguments do not imply goal-directed behavior, published by Rohin Shah on the AI Alignment Forum. One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here. We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the agent doesn’t fix the failures because it would not be worth it -- in this case, the argument says that we will not be able to notice any exploitable failures.) Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems about EU maximizers that we’ve identified are actually about goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these categories. This suggests that we could try building an AI system that can be modeled as an EU maximizer, yet doesn’t fall into one of these two categories, and so doesn’t have all of the problems that we worry about. Note that there are two different flavors of arguments that the AI systems we build will be goal-directed agents (which are dangerous if the goal is even slightly wrong): Simply knowing that an agent is intelligent lets us infer that it is goal-directed. (EDIT: See these comments for more details on this argument.) Humans are particularly likely to build goal-directed agents. I will only be arguing against the first claim in this post, and will talk about the second claim in the next post. All behavior can be rationalized as EU maximization Suppose we have access to the entire policy of an agent, that is, given any universe-history, we know what action the agent will take. Can we tell whether the agent is an EU maximizer? Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions. The actual policy gets 1 utility at every timestep; any other policy gets less than this, so the given policy perfectly maximizes this utility function. This construction has been given before, eg. at the bottom of page 6 of this paper. (I think I’ve seen it before too, but I can’t remember where.) But wouldn’t this suggest that the VNM theorem has no content? Well, we assumed that we were looking at the policy of the agent, which led to a universe-history deterministically. We didn’t have access to any probabilities. Given a particular action, we knew exactly what the next state would be. Most of the axioms of the VNM theorem make reference to lotteries and probabilities -- if the world is deterministic, then the axioms simply say that the agent must have transitive preferences over outcomes. Given that we can only observe the agent choose one history over another, we can trivially construct a transitive preference ordering by saying that the chosen history is higher in the preference ordering than the one that... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Search versus design by Alex Flint</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">01:15:02</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098169/media/dd49b48801dda368a783095b7978bc50_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909516</span> <span class="episode_guid hidden" type="id">53f14261-0008-4836-85fb-3e8cfa5ffc95</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Search versus design, published by Alex Flint on the AI Alignment Forum. This work was supported by OAK, a monastic community in the Berkeley hills. It could not have been written without the daily love of living in this beautiful community. The work involved in writing this cannot be separated from the sitting, chanting, cooking, cleaning, crying, correcting, fundraising, listening, laughing, and teaching of the whole community. This write-up benefited from feedback from David Kristofferson, Andrew Critch, Jason Crawford, Abram Demski, and Ben Pence. Mistakes and omissions are entirely the responsibility of the author. How is it that we solve engineering problems? What is the nature of the design process that humans follow when building an air conditioner or computer program? How does this differ from the search processes present in machine learning and evolution? We study search and design as distinct approaches to engineering. We argue that establishing trust in an artifact is tied to understanding how that artifact works, and that a central difference between search and design is the comprehensibility of the artifacts produced. We present a model of design as alternating phases of construction and factorization, resulting in artifacts composed of subsystems that are paired with helpful stories. We connect our ideas to the factored cognition thesis of Stuhlmüller and Christiano. We also review work in machine learning interpretability, including Chris Olah’s recent work on decomposing neural networks, Cynthia Rudin’s work on optimal simple models, and Mike Wu’s work on tree-regularized neural networks. We contrast these approaches with the joint production of artifacts and stories that we see in human design. Finally we ponder whether an AI safety research agenda could be formulated to automate design in a way that would make it competitive with search. Introduction Humans have been engineering artifacts for hundreds of thousands of years. Until recently, we seem to have mostly solved engineering problems using a method I will call design: understanding the materials at hand and building things up incrementally. This is the approach we use today when building bridges, web apps, sand castles, pencil sharpeners, air conditioners, and so on. But a new and very different approach to engineering has recently come online. In this new approach, which I will call search, we specify an objective function, and then set up an optimization process to evaluate many possible artifacts, picking the one that is ranked highest by the objective function. This approach is not the automation of design; it’s internal workings are actually nothing like design, and the artifacts it produces are very unlike the artifacts produced by design. The design approach to engineering produces artifacts that we can understand through decomposition. A car, for example, is decomposable into subsystems, each of which are further decomposable into parts, and so on down a long hierarchy. This decomposition is not at all simple, and low quality design processes can produce artifacts that are unnecessarily difficult to decompose, yet understanding how even a poorly designed car works is much easier than understanding how a biological tree works. When we design an artifact, we seem to factorize it into comprehensible subsystems as we go. These subsystems are themselves artifacts resulting from a design process, and are constructed not just to be effective with respect to their intended purpose, but also to be comprehensible: that is, they are structured so as to permit a simple story to be told about them that helps us to understand how to work with them as building blocks in the construction of larger systems. I will call this an abstraction layer, and it consists of, so far as I can tell, an artifact to... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Clarifying “What failure looks like” by Sam Clarke</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">29:17</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098152/media/5cfc5da6b995547aae34900198312ce1_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909517</span> <span class="episode_guid hidden" type="id">6f161264-aaec-4b26-a299-a4fa7578a5b4</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying “What failure looks like”, published by Sam Clarke on the AI Alignment Forum. Thanks to Jess Whittlestone, Daniel Eth, Shahar Avin, Rose Hadshar, Eliana Lorch, Alexis Carlier, Flo Dorner, Kwan Yee Ng, Lewis Hammond, Phil Trammell and Jenny Xiao for valuable conversations, feedback and other support. I am especially grateful to Jess Whittlestone for long conversations and detailed feedback on drafts, and her guidance on which threads to pursue and how to frame this post. All errors are my own. Epistemic status: My Best Guess Epistemic effort: ~70 hours of focused work (mostly during FHI’s summer research fellowship), talked to ~10 people. Introduction “What failure looks like” is the one of the most comprehensive pictures of what failure to solve the AI alignment problem looks like, in worlds without discontinuous progress in AI. I think it was an excellent and much-needed addition to our understanding of AI risk. Still, if many believe that this is a main source of AI risk, I think it should be fleshed out in more than just one blog post. The original story has two parts; I’m focusing on part 1 because I found it more confusing and nebulous than part 2. Firstly, I’ll summarise part 1 (hereafter “WFLL1”) as I understand it: In the world today, it’s easier to pursue easy-to-measure goals than hard-to-measure goals. Machine learning is differentially good at pursuing easy-to-measure goals (assuming that we don’t have a satisfactory technical solution to the intent alignment problem[1]). We’ll try to harness this by designing easy-to-measure proxies for what we care about, and deploy AI systems across society which optimize for these proxies (e.g. in law enforcement, legislation and the market). We’ll give these AI systems more and more influence (e.g. eventually, the systems running law enforcement may actually be making all the decisions for us). Eventually, the proxies for which the AI systems are optimizing will come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory. WFLL1 is quite thin on some important details: WFLL1 does not envisage AI systems directly causing human extinction. So, to constitute an existential risk in itself, the story must involve the lock-in of some suboptimal world.[2] However, the likelihood that the scenario described in part 1 gets locked-in (especially over very long time horizons) is not entirely clear in the original post. It’s also not clear how bad this locked-in world would actually be. I’ll focus on the first point: how likely is it that the scenario described in WFLL1 leads to the lock-in of some suboptimal world. I’ll finish with some rough thoughts on the second point - how bad/severe that locked-in world might be - and by highlighting some remaining open questions. Likelihood of lock-in The scenario described in WFLL1 seems very concerning from a longtermist perspective if it leads to humanity getting stuck on some suboptimal path (I’ll refer to this as “lock-in”). But the blog post itself isn't all that clear about why we should expect such lock-in --- i.e. why we won't be able to stop the trend of AI systems optimising for easy-to-measure things before it's too late -- a confusion which has been pointed out before. In this section, I'll talk through some different mechanisms by which this lock-in can occur, discuss some historical precedents for these mechanisms occurring, and then discuss why we might expect the scenario described in WFLL1 to be more likely to lead to lock-in than for the precedents. The mechanisms for lock-in Summary: I describe five complementary mechanisms by which the scenario described in WFLL1 (i.e. AI systems across society optimizing for simple proxies at the expense... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Reward Is Not Enough by Steve Byrnes</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">24:12</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098135/media/5a4232acdff9f95b6e63299e720bd8ae_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909518</span> <span class="episode_guid hidden" type="id">38c1b1ee-12f9-47c5-8095-1a11651ea8d8</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reward Is Not Enough, published by Steve Byrnes on the AI Alignment Forum. Three case studies 1. Incentive landscapes that can’t feasibly be induced by a reward function You’re a deity, tasked with designing a bird brain. You want the bird to get good at singing, as judged by a black-box hardcoded song-assessing algorithm that you already built into the brain last week. The bird chooses actions based in part on within-lifetime reinforcement learning involving dopamine. What reward signal do you use? Well, we want to train the bird to sing the song correctly. So it’s easy: the bird practices singing, and it listens to its own song using the song-assessing black box, and it does RL using the rule: The better the song sounds, the higher the reward. Oh wait. The bird is also deciding how much time to spend practicing singing, versus foraging or whatever. And the worse it sings, the more important it is to practice! So you really want the rule: The worse the song sounds, the more rewarding it is to practice singing. Uh oh. How do you resolve this conflict? Maybe “Reward = Time derivative of how good the song sounds”? Nope, under this reward, if the bird is bad at singing, and improving very slowly, then practice would not feel very rewarding. But here, the optimal action is to continue spending lots of time practicing. (Singing well is really important.) Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?” Sure—I mean, that is ultimately what evolution is going for, and that’s what it would look like for an adult human to “want to get out of debt” or whatever. But how do you implement that? "I want to be able to sing well" is an awfully complicated thought; I doubt most birds are even able to think it—and if they could, we still have to solve a vexing symbol-grounding problem if we want to build a genetic mechanism that points to that particular concept and flags it as desirable. No way. I think this is just one of those situations where “the exact thing you want” is not a feasible option for the within-lifetime RL reward signal, or else doesn’t produce the desired result. (Another example in this category is “Don’t die”.) Maybe you went awry at the start, when you decided to choose actions using a within-lifetime RL algorithm? In other words, maybe “choosing actions based on anticipated future rewards, as learned through within-lifetime experience” is not a good idea? Well, if we throw out that idea, it would avoid this problem, and a lot of reasonable people do go down that route (example), but I disagree (discussion here, here); I think RL algorithms (and more specifically model-based RL algorithms) are really effective and powerful ways to skillfully navigate a complex and dynamic world, and I think there's a very good reason that these algorithms are a key component of within-lifetime learning in animal brains. There’s gotta be a better solution than scrapping that whole approach, right? Maybe after each singing practice, you could rewrite those memories, to make the experience seem more rewarding in retrospect than it was at the time? I mean, OK, maybe in principle, but can you actually build a mechanism like that which doesn't have unintended side-effects? Anyway, this is getting ridiculous. .“Aha”, you say. “I have an idea!” One part of the bird brain is “deciding” which low-level motor commands to execute during the song, and another part of the bird brain is “deciding” whether to spend time practicing singing, versus foraging or whatever else. These two areas don’t need the same reward signal! So for the former area, you send a signal: “the better the song sounds, the higher the reward”. For the latter area, you send a signal: “the worse the song sounds, the more rewarding it feels to spend time practicing”. ...And that’s exac... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Toward a New Technical Explanation of Technical Explanation by Abram Demski</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">29:19</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098089/media/d4ec48971af23adee7802fec1a8d41ca_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909519</span> <span class="episode_guid hidden" type="id">9a2111d3-9120-491a-aad2-d0dbe74eeb9f</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Toward a New Technical Explanation of Technical Explanation, published by Abram Demski on the AI Alignment Forum. A New Framework (Thanks to Valentine for a discussion leading to this post, and thanks to CFAR for running the CFAR-MIRI cross-fertilization workshop. Val provided feedback on a version of this post. Warning: fairly long.) Eliezer's A Technical Explanation of Technical Explanation, and moreover the sequences as a whole, used the best technical understanding of practical epistemology available at the time -- the Bayesian account -- to address the question of how humans can try to arrive at better beliefs in practice. The sequences also pointed out several holes in this understanding, mainly having to do with logical uncertainty and reflective consistency. MIRI's research program has since then made major progress on logical uncertainty. The new understanding of epistemology -- the theory of logical induction -- generalizes the Bayesian account by eliminating the assumption of logical omniscience. Bayesian belief updates are recovered as a special case, but the dynamics of belief change are non-Bayesian in general. While it might not turn out to be the last word on the problem of logical uncertainty, it has a large number of desirable properties, and solves many problems in a unified and relatively clean framework. It seems worth asking what consequences this theory has for practical rationality. Can we say new things about what good reasoning looks like in humans, and how to avoid pitfalls of reasoning? First, I'll give a shallow overview of logical induction and possible implications for practical epistemic rationality. Then, I'll focus on the particular question of A Technical Explanation of Technical Explanation (which I'll abbreviate TEOTE from now on). Put in CFAR terminology, I'm seeking a gears-level understanding of gears-level understanding. I focus on the intuitions, with only a minimal account of how logical induction helps make that picture work. Logical Induction There are a number of difficulties in applying Bayesian uncertainty to logic. No computable probability distribution can give non-zero measure to the logical tautologies, since you can't bound the amount of time you need to think to check whether something is a tautology, so updating on provable sentences always means updating on a set of measure zero. This leads to convergence problems, although there's been recent progress on that front. Put another way: Logical consequence is deterministic, but due to Gödel's first incompleteness theorem, it is like a stochastic variable in that there is no computable procedure which correctly decides whether something is a logical consequence. This means that any computable probability distribution has infinite Bayes loss on the question of logical consequence. Yet, because the question is actually deterministic, we know how to point in the direction of better distributions by doing more and more consistency checking. This puts us in a puzzling situation where we want to improve the Bayesian probability distribution by doing a kind of non-Bayesian update. This was the two-update problem. You can think of logical induction as supporting a set of hypotheses which are about ways to shift beliefs as you think longer, rather than fixed probability distributions which can only shift in response to evidence. This introduces a new problem: how can you score a hypothesis if it keeps shifting around its beliefs? As TEOTE emphasises, Bayesians outlaw this kind of belief shift for a reason: requiring predictions to be made in advance eliminates hindsight bias. (More on this later.) So long as you understand exactly what a hypothesis predicts and what it does not predict, you can evaluate its Bayes score and its prior complexity penalty and rank it objectively... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Gradient hacking by Evan Hubinger</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">05:14</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098013/media/10992d690a7c7b97c581a17208c3fa19_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909520</span> <span class="episode_guid hidden" type="id">53e7cac0-905a-4253-84f0-f8aedbea9c68</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gradient hacking, published by Evan Hubinger on the AI Alignment Forum. "Gradient hacking" is a term I've been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility: Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched. Outside of that footnote, however, there hasn't been much other discussion of this phenomenon, and I think that's a shame, as I think it's definitely something that's worth looking into. For example, here's a gradient hacking story that I'm particularly concerned about. Suppose you're training a model and it becomes deceptive. You're not worried about that, though, because you have sufficient transparency tools that you're confident you can detect any deceptive cognition it might be doing. If it's thinking about how to deceive you, your transparency tools will pick that up, and you'll know to throw out that model and try again. However, your transparency tools are quite expensive, so you can only afford to run them once, which you're planning on doing after training before you deploy your model. If your model is deceptive, though, then it might know all of that—in particular, it might know that if it stays deceptive, it'll get caught. However, there is still something it can do in that situation to be able to defect against you and cause whatever it wants to happen during deployment. That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception. How might such a deceptive model accomplish such a feat of gradient hacking? Well, I don't think it's actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn't have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants. The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure. The case of a deceptive model trying to erase evidence of its deception is not quite as straightforward, though I think it would still be possible to do something quite similar in that situation as well. This is especially true if the model has the ability to r... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Inaccessible information by Paul Christiano</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">22:47</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3098003/media/2738fafb1914883e312f7b711d7c5896_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909521</span> <span class="episode_guid hidden" type="id">0f50234e-d6fe-454b-b8aa-024bbf79e4a3</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inaccessible information, published by Paul Christiano on the AI Alignment Forum. Suppose that I have a great model for predicting “what will Alice say next?” I can evaluate and train this model by checking its predictions against reality, but there may be many facts this model “knows” that I can’t easily access. For example, the model might have a detailed representation of Alice’s thoughts which it uses to predict what Alice will say, without being able to directly answer “What is Alice thinking?” In this case, I can only access that knowledge indirectly, e.g. by asking about what Alice would say in under different conditions. I’ll call information like “What is Alice thinking?” inaccessible. I think it’s very plausible that AI systems will build up important inaccessible knowledge, and that this may be a central feature of the AI alignment problem. In this post I’m going to try to clarify what I mean by “inaccessible information” and the conditions under which it could be a problem. This is intended as clarification and framing rather than a presentation of new ideas, though sections IV, V, and VI do try to make some small steps forward. I. Defining inaccessible information I’ll start by informally defining what it means for information to be accessible, based on two mechanisms: Mechanism 1: checking directly If I can check X myself, given other accessible information, then I’ll define X to be accessible. For example, I can check a claim about what Alice will do, but I can’t check a claim about what Alice is thinking. If I can run randomized experiments, I can probabilistically check a claim about what Alice would do. But I can’t check a counterfactual claim for conditions that I can’t create in an experiment. In reality this is a graded notion — some things are easier or harder to check. For the purpose of this post, we can just talk about whether something can be tested even a single time over the course of my training process. Mechanism 2: transfer The simplest model that provides some accessible information X may also provide some other information Y. After all, it’s unlikely that the simplest model that outputs X doesn’t output anything else. In this case, we’ll define Y to be accessible. For example, if I train a model to predict what happens over the next minute, hour, or day, it may generalize to predicting what will happen in a month or year. For example, if the simplest model to predict the next day was a fully-accurate physical simulation, then the same physics simulation might work when run for longer periods of time. I think this kind of transfer is kind of dicey, so I genuinely don’t know if long-term predictions are accessible or not (we certainly can’t directly check them, so transfer is the only way they could be accessible). Regardless of whether long-term predictions are accessible by transfer, there are other cases where I think transfer is pretty unlikely. For example, the simplest way to predict Alice’s behavior might be to have a good working model for her thoughts. But it seems unlikely that this model would spontaneously describe what Alice is thinking in an understandable way — you’d need to specify some additional machinery, for turning the latent model into useful descriptions. I think this is going to be a fairly common situation: predicting accessible information may involve almost all the same work as predicting inaccessible information, but you need to combine that work with some “last mile” in order to actually output inaccessible facts. Definition I’ll say that information is accessible if it’s in the smallest set of information that is closed under those two mechanisms, and inaccessible otherwise. There are a lot of nuances in that definition, which I’ll ignore for now. Examples Here are some candidates for accessible vs. inacces... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Zoom In: An Introduction to Circuits by Evan Hubinger</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">04:41</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097968/media/241e9d5d702c026780ca671f24f7e722_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909522</span> <span class="episode_guid hidden" type="id">f757b8e4-c460-4305-aad0-1e5bdce17df8</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Zoom In: An Introduction to Circuits, published Evan Hubinger on the AI Alignment Forum. his is a linkpost for/ Chris Olah and the rest of the rest of the OpenAI Clarity team just published “Zoom In: An Introduction to Circuits,” a Distill article about some of the transparency research they've been doing which I think is very much worth taking a look at. I'll try to go over some of my particular highlights here, but I highly recommend reading the full article. Specifically, I have previously written about Chris's belief that the field of machine learning should be more like the natural sciences in seeking understanding first and foremost. I think “Zoom In” is a big step towards making something like that a reality, as it provides specific, concrete, testable claims about neural networks upon which you might actually be able to build a field. The three specific claims presented in the article are: Claim 1: Features Features are the fundamental unit of neural networks. They correspond to directions [in the space of neuron activations]. These features can be rigorously studied and understood. Claim 2: Circuits Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood. Claim 3: Universality Analogous features and circuits form across models and tasks. “Zoom In” provides lots of in-depth justification and examples for each of these claims which I will mostly leave to the actual article. Some highlights, however: How do convolutional neural networks (CNNs) detect dogs in an orientation-invariant way? It turns out they pretty consistently separately detect leftward-facing and rightward-facing dogs, then union the two together. How do CNNs detect foreground-background boundaries? It turns out they use high-low frequency detectors—which look for high-frequency patterns on one side and low-frequency patterns on the other side—in a bunch of different possible orientations. What's particularly nice about “Zoom In”'s three claims in my opinion, however, is that they give other researchers a foundation to build upon. Once it's established that neural networks have meaningful features and circuits in them, discovering new such circuits becomes a legitimate scientific endeavor—especially if, as the third claim suggests, those features and circuits are universal across many different networks. From “Zoom In:” One particularly challenging aspect of being in a pre-paradigmatic field is that there isn’t a shared sense of how to evaluate work in interpretability. There are two common proposals for dealing with this, drawing on the standards of adjacent fields. Some researchers, especially those with a deep learning background, want an “interpretability benchmark” which can evaluate how effective an interpretability method is. Other researchers with an HCI background may wish to evaluate interpretability methods through user studies. But interpretability could also borrow from a third paradigm: natural science. In this view, neural networks are an object of empirical investigation, perhaps similar to an organism in biology. Such work would try to make empirical claims about a given network, which could be held to the standard of falsifiability. Why don’t we see more of this kind of evaluation of work in interpretability and visualization? Especially given that there’s so much adjacent ML work which does adopt this frame! One reason might be that it’s very difficult to make robustly true statements about the behavior of a neural network as a whole. They’re incredibly complicated objects. It’s also hard to formalize what the interesting empirical statements about them would, exactly, be. And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true st... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Plausible cases for HRAD work, and locating the crux in the realism about rationality debate by Issa Rice</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">15:59</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097938/media/7a94526f48c8e8913466055efaeda328_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909523</span> <span class="episode_guid hidden" type="id">73716136-e7f7-4e1d-b3d3-8f7265b4ac83</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Plausible cases for HRAD work, and locating the crux in the realism about rationality debate, published by Issa Rice on the AI Alignment Forum. This post is my attempt to summarize and distill the major public debates about MIRI's highly reliable agent designs (HRAD) work (which includes work on decision theory), including the discussions in Realism about rationality and Daniel Dewey's My current thoughts on MIRI's "highly reliable agent design" work. Part of the difficulty with discussing the value of HRAD work is that it's not even clear what the disagreement is about, so my summary takes the form of multiple possible "worlds" we might be in; each world consists of a positive case for doing HRAD work, along with the potential objections to that case, which results in one or more cruxes. I will talk about "being in a world" throughout this post. What I mean by this is the following: If we are "in world X", that means that the case for HRAD work outlined in world X is the one that most resonates with MIRI people as their motivation for doing HRAD work; and that when people disagree about the value of HRAD work, this is what the disagreement is about. When I say that "I think we are in this world", I don't mean that I agree with this case for HRAD work; it just means that this is what I think MIRI people think. In this post, the pro-HRAD stance is something like "HRAD work is the most important kind of technical research in AI alignment; it is the overwhelming priority and we're pretty much screwed if we under-invest in this kind of research" and the anti-HRAD stance is something like "HRAD work seems significantly less promising than other technical AI alignment agendas, such as the approaches to directly align machine learning systems (e.g. iterated amplification)". There is a much weaker pro-HRAD stance, which is something like "HRAD work is interesting and doing more of it adds value, but it's not necessarily the most important kind of technical AI alignment research to be working on"; this post is not about this weaker stance. Clarifying some terms Before describing the various worlds, I want to present some distinctions that have come up in discussions about HRAD, which will be relevant when distinguishing between the worlds. Levels of abstraction vs levels of indirection The idea of levels of abstraction was introduced in the context of debate about HRAD work by Rohin Shah, and is described in this comment (start from "When groups of humans try to build complicated stuff"). For more background, see these articles on Wikipedia. Later on, in this comment Rohin gave a somewhat different "levels" idea, which I've decided to call "levels of indirection". The idea is that there might not be a hierarchy of abstraction, but there's still multiple intermediate layers between the theory you have and the end-result you want. The relevant "levels of indirection" is the sequence HRAD → machine learning → AGI. Even though levels of indirection are different from levels of abstraction, the idea is that the same principle applies, where the more levels there are, the harder it becomes for a theory to apply to the final level. Precise vs imprecise theory A precise theory is one which can scale to 2+ levels of abstraction/indirection. An imprecise theory is one which can scale to at most 1 level of abstraction/indirection. More intuitively, a precise theory is more mathy, rigorous, and exact like pure math and physics, and an imprecise theory is less mathy, like economics and psychology. Building agents from the ground up vs understanding the behavior of rational agents and predicting roughly what they will do This distinction comes from Abram Demski's comment. However, I'm not confident I've understood this distinction in the way that Abram intended it, so what I describe below... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Introduction To The Infra-Bayesianism Sequence by Diffractor</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">25:27</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097879/media/fd730524c6f84f85605207b6bef8a491_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909524</span> <span class="episode_guid hidden" type="id">e2b614bf-dc0e-4b31-ac85-ce1fbdf12702</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introduction To The Infra-Bayesianism Sequence, published by Diffractor on the AI Alignment Forum. TLDR: Infra-Bayesianism is a new approach to epistemology / decision theory / reinforcement learning theory, which builds on "imprecise probability" to solve the problem of prior misspecification / grain-of-truth / nonrealizability which plagues Bayesianism and Bayesian reinforcement learning. Infra-Bayesianism also naturally leads to an implementation of UDT, and (more speculatively at this stage) has applications to multi-agent theory, embedded agency and reflection. This post is the first in a sequence which lays down the foundation of the approach. Prelude: Diffractor and Vanessa proudly present: The thing we've been working on for the past five months. I initially decided that Vanessa's scattered posts about incomplete models were interesting, and could benefit from being written up in a short centralized post. But as we dug into the mathematical details, it turned out it didn't really work, and then Vanessa ran across the true mathematical thing (which had previous ideas as special cases) and scope creep happened. This now looks like a new, large, and unusually tractable vein of research. Accordingly, this sequence supersedes all previous posts about incomplete models, and by now we've managed to get quite a few interesting results, and have ideas for several new research directions. Diffractor typed everything up and fleshed out the proof sketches, Vanessa originated almost all of the ideas and theorems. It was a true joint effort, this sequence would not exist if either of us were absent. Alex Mennen provided feedback on drafts to make it much more comprehensible than it would otherwise be, and Turntrout and John Maxwell also helped a bit in editing. Be aware this sequence of posts has the math textbook issue where it requires loading a tower of novel concepts that build on each other into your head, and cannot be read in a single sitting. We will be doing a group readthrough on MIRIxDiscord where we can answer questions and hopefully get collaborators, PM me to get a link. Introduction: Learning theory traditionally deals with two kinds of setting: "realizable" and "agnostic" or "non-realizable". In realizable settings, we assume that the environment can be described perfectly by a hypothesis inside our hypothesis space. (AIXI is an example of this) We then expect the algorithm to converge to acting as if it already knew the correct hypothesis. In non-realizable settings, we make no such assumption. We then expect the algorithm to converge to the best approximation of the true environment within the available hypothesis space. As long as the computational complexity of the environment is greater than the computational complexity of the learning algorithm, the algorithm cannot use an easy-to-compute hypothesis that would describe the environment perfectly, so we are in the nonrealizable setting. When we discuss AGI, this is necessarily the case, since the environment is the entire world: a world that, in particular, contains the agent itself and can support other agents that are even more complex, much like how halting oracles (which you need to run Solomonoff Induction) are nowhere in the hypotheses which Solomonoff considers. Therefore, the realizable setting is usually only a toy model. So, instead of seeking guarantees of good behavior assuming the environment is easy to compute, we'd like to get good behavior simply assuming that the environment has some easy-to-compute properties that can be exploited. For offline and online learning there are classical results in the non-realizable setting, in particular VC theory naturally extends to the non-realizable setting. However, for reinforcement learning there are few analogous results. Even for passive Bayesian... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Jitters No Evidence of Stupidity in RL</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">06:21</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097837/media/c009a272c6d28131b0fa7540ec75ec85_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909525</span> <span class="episode_guid hidden" type="id">6db4a709-415a-4901-b185-66dfdabc91d5</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Jitters No Evidence of Stupidity in RL, published by by 1a3orn on the AI Alignment Forum. Epistemic status: Pretty sure about core, not about edges A while ago, I noticed a possible bias in how I evaluated reinforcement learning agents. It tended to cause me to revise my estimation of their intelligence downwards, after I viewed a video of them in action. I've seen other people fall into what I believe to be the same error. So I'm writing this to correct myself if I am wrong and to alert others if I am right. The Bias Many reinforcement learning agents have "jitters." They alternate actions quickly, looking nearly palsied, apparently nullifying the effects of earlier actions with later ones. This is true across a wide variety of reinforcement learning agents. Many people see these jitters as evidence of the relatively primitive nature of these agents. These actions look clearly stupid and sub-optimal. For instance consider the original Deep Q Network paper. Even after training for some time on Breakout, it still erratically moves the paddle back and forth when the ball is not near it. One person mentions that it makes "erratic jerky movements that obviously could not in principle be optimal," which was once my impression as well. Similarly, much more recently, consider DeepMind's recent work on generally capable agents. In the show reel the movement of the agents often looks erratic. Conversation around LessWrong sometimes alluded to these erratic movements as evidence against the intelligence of the agents. Jitters Non-Optimal For Energy-Constrained Agents Evolved intelligence on earth has energy conservation as a fundamental part of its optimization function. Unnecessary movements spend energy. Spent energy must be recovered, at the cost of reproductive fitness. So generally only sick animals, insane animals, and so on, have the shakes or tremble continuously. Energy conservation applies to every animal on earth, which is why we probably feel intuitively confident applying this rule across the broad variety of animals. Additionally, extremely erratic movements can result in injury to the animal which is making them. So this is another reason why, for creatures that are a result of evolution, erratic movements are a sign of insanity or injury. RL Agents Are Not Energy-Constrained Reinforcement learning agents are not energy-constrained. They do not draw on a finite store of glucose when acting. Nor do they have any possibility of injuring themselves. As a result, the policies resulting from reinforcement learning algorithms will not be strongly constrained to limit jitters in the way that policies resulting from evolution will be constrained. You can go further than this. Given the way that most reinforcement learning agents are set up, they have no way to even distinguish any difference between action and non-action, and thus between non-rest and rest. That is, consider a reinforcement learning agent which makes one of fifteen different categorical actions in each time-step, like those in OpenAI's ProcGen. For an agent controlling a side-scrolling avatar, for instance, one action would be moving right; another action would be jumping; another action would be doing nothing; etc. Each of these is only distinguished from the others as different indices on one hot-action encodings -- i.e., moving right could be [1,0,0,0...], jumping could be [0,1,0,0...], doing nothing could be [0,0,1,0...], and so on. For a human controlling such a side-scrolling avatar, "doing nothing" stands out from all the other actions. If you put yourself in a situation where you are allowed to do nothing, you can rest your hands by not pressing any buttons. You can consider a more global strategy, and focus on the kind of strategy you will use when you resume acting. It also allows you to rest ... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Testing The Natural Abstraction Hypothesis: Project Update by johnswentworth</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">17:05</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097807/media/d58182022607d1657a538dd2f23fcc06_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909526</span> <span class="episode_guid hidden" type="id">63b17df0-962e-463b-9b23-ee9bcbd7f306</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is:Testing The Natural Abstraction Hypothesis: Project Update, published by johnswentworth on the AI Alignment Forum. I set myself six months to focus primarily on the Natural Abstraction Hypothesis project before stopping to re-evaluate. It’s been about six months since then. So, how has it gone? This will be a “story telling” post, where I talk more about my research process and reasoning than about the results themselves. Be warned: this means I'm going to spout some technical stuff without explanation here and there, and in some cases I haven't even written a good explanation yet - this is a picture of my own thoughts. For more background on the results, the three main posts are: The intro post for the overarching project, which I recommend reading. Information At A Distance Is Mediated By Deterministic Constraints, which I also recommend reading. Generalizing Koopman-Pitman-Darmois, which I do not recommend reading unless you want dense math. Recap: The Original Plan The Project Intro broke the Natural Abstraction Hypothesis into three sub-hypotheses: Abstractability: for most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself. Human-Compatibility: These summaries are the abstractions used by humans in day-to-day thought/language. Convergence: a wide variety of cognitive architectures learn and use approximately-the-same summaries. That post suggested three types of experiments to test these: Abstractability: does reality abstract well? Corresponding experiment type: run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional. Human-Compatibility: do these match human abstractions? Corresponding experiment type: run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance recovers human-recognizable abstractions. Convergence: are these abstractions learned/used by a wide variety of cognitive architectures? Corresponding experiment type: train a predictor/agent against a simulated environment with known abstractions; look for a learned abstract model. Alas, in order to run these sorts of experiments, we first need to solve some tough algorithmic problems. Computing information-at-a-distance in reasonably-complex simulated environments is a necessary step for all of these, and the “naive” brute-force method for this is very-not-tractable. It requires evaluating high-dimensional integrals over “noise” variables - a #P-complete problem in general. (#P-complete is sort of like NP-complete, but Harder.) Even just representing abstractions efficiently is hard - we’re talking about e.g. the state-distribution of a bunch of little patches of wood in some chunk of a chair given the state-distribution of some other little patches of wood in some other chunk of the chair. Explicitly writing out that whole distribution would take an amount of space exponential in the number of variables involved; that would be a data structure of size roughly O((# of states for a patch of wood)^(# of patches)). My main goal for the past 6 months was to develop tools to make the experiments tractable - i.e. theorems, algorithms, working code, and proofs-of-concept to solve the efficiency problems. When this 6 month subproject started out, I had a working proof-of-concept for linear systems. I was hoping that I could push that to somewhat more complex systems via linear approximations, figure out some useful principles empirically, and generally get a nice engineering-experiment-theory feedback loop going. That’s the fast way to make progress. . Turns Out Chaos Is Not Linear The whole “start with linear approximations and get a nice engineering-experiment-theory feedback loop going” plan ran straight into a brick wall. Not enti... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Sources of intuitions and data on AGI by Scott Garrabrant</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">05:44</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097776/media/20968932b3dbaa4cd4ab86c2a0438071_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909527</span> <span class="episode_guid hidden" type="id">0ecda230-443f-4d1c-a6bb-0802b39949f1</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sources of intuitions and data on AGI, published by Scott Garrabrant on the AI Alignment Forum. Much of the difficulty in making progress on AI safety comes from the lack of useful feedback loops. We do not have a superintelligent AI to run tests on and by the time we do, it will probably be too late. This means we have to resort to using proxies. In this post, I will highlight what I think are the four most useful proxies we have access to today: modern machine learning systems, humans, groups, and abstract models. These proxies are not opposed to each other. We need all the data we can get, and the right strategy is to learn from all of them. However each proxy also has its flaws, and will break if used in the wrong way. It is important to pay attention to where each proxy works and where it fails. Modern Machine Learning Systems Pros: With ML systems, one can do experiments quickly to get fast feedback loops with lots of quantitative data. Also, AGI will probably be made using something similar to modern machine learning systems, so insights gained by thinking about ML systems today may be particularly easy to port over. Cons: Modern ML systems are opaque, and it can be very difficult to understand how they are working. This may be a problem we need to overcome eventually anyway, but that doesn't make it any easier to get data from them today. Modern ML systems are also not very smart, and it is reasonable to expect a qualitative difference between their behavior and the behavior of superintelligent systems. This may lead to a false sense of security, especially in solutions that depend implicitly or explicitly on the AI lacking some particular capability, or using a particular strategy. Examples: Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems both try to reason about AI Safety by using modern machine learning as their primary source of intuitions. The safety teams at OpenAI and DeepMind try to use this as a primary source of data. (Although it would be a very uncharitable to claim that they do not also draw from the other sources, especially abstract models.) AI Safety Gridworlds is a good example of developing intuitions about AI safety from looking at modern systems. Humans Pros: Humans are (at least on some axes) the most intelligent systems we have direct access to. We can use this to notice failures of other proxies by noticing that we can do better at optimizing some function using clever tricks that those systems might also be able to use when you scale them up. Also, humans are an intelligence that we can experience from the inside. Cons: Humans are opaque. Even from our inside view, it is very difficult to understand how they work, and very hard to modify. They are also the most difficult to talk about rigorously. There is also the failure mode of anthropomorphizing badly and attributing arbitrary properties of humans (and especially human goals) to AGI. Examples: Some have used the complexity of the human brain as a starting point for predicting how much hardware is necessary for AGI. Ought plans to try to run experiments on humans that will help form intuitions on how easy it is to decompose big questions into small pieces, which can provide evidence about whether or not Paul Christiano style amplification could work. We can use humans to provide lower bounds for what capabilities are possible, like in the AI box experiment. Groups Pros: Groups of humans can often be more transparent than individual humans, since the individuals within the group communicate in a transparent way. This is perhaps the most alien agent-like thing we have direct access to, which could provide a check for anthropomorphism. Thinking about groups could also encourage reductionism. All agents are made up of smaller parts, and groups make this... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">What counts as defection? by Alex Turner</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">17:33</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097704/media/66800a5b86066fa7bb596743c12a6a70_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909528</span> <span class="episode_guid hidden" type="id">f4c878c5-bc03-4d89-815b-449fb6eb19bc</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What counts as defection?, published by Alex Turner on the AI Alignment Forum. Thanks to Michael Dennis for proposing the formal definition; to Andrew Critch for pointing me in this direction; to Abram Demski for proposing non-negative weighting; and to Alex Appel, Scott Emmons, Evan Hubinger, philh, Rohin Shah, and Carroll Wainwright for their feedback and ideas. There's a good chance I'd like to publish this at some point as part of a larger work. However, I wanted to make the work available now, in case that doesn't happen soon. They can't prove the conspiracy... But they could, if Steve runs his mouth. The police chief stares at you. You stare at the table. You'd agreed (sworn!) to stay quiet. You'd even studied game theory together. But, you hadn't understood what an extra year of jail meant. The police chief stares at you. Let Steve be the gullible idealist. You have a family waiting for you. Sunlight stretches across the valley, dappling the grass and warming your bow. Your hand anxiously runs along the bowstring. A distant figure darts between trees, and your stomach rumbles. The day is near spent. The stags run strong and free in this land. Carla should meet you there. Shouldn't she? Who wants to live like a beggar, subsisting on scraps of lean rabbit meat? In your mind's eye, you reach the stags, alone. You find one, and your arrow pierces its barrow. The beast shoots away; the rest of the herd follows. You slump against the tree, exhausted, and never open your eyes again. You can't risk it. People talk about 'defection' in social dilemma games, from the prisoner's dilemma to stag hunt to chicken. In the tragedy of the commons, we talk about defection. The concept has become a regular part of LessWrong discourse. Informal definition. A player defects when they increase their personal payoff at the expense of the group. This informal definition is no secret, being echoed from the ancient Formal Models of Dilemmas in Social Decision-Making to the recent Classifying games like the Prisoner's Dilemma: you can model the "defect" action as "take some value for yourself, but destroy value in the process". Given that the prisoner's dilemma is the bread and butter of game theory and of many parts of economics, evolutionary biology, and psychology, you might think that someone had already formalized this. However, to my knowledge, no one has. Formalism Consider a finite n -player normal-form game, with player i having pure action set A i and payoff function P i A 1 × × A n → R . Each player i chooses a strategy s i ∈ Δ A i (a distribution over A i ). Together, the strategies form a strategy profile s s 1 s n s − i s 1 s i − 1 s i 1 s n is the strategy profile, excluding player i 's strategy. A payoff profile contains the payoffs for all players under a given strategy profile. A utility weighting α j j 1 n is a set of n non-negative weights (as in Harsanyi's utilitarian theorem). You can consider the weights as quantifying each player's contribution; they might represent a percieved social agreement or be the explicit result of a bargaining process. When all α j are equal, we'll call that an equal weighting. However, if there are "utility monsters", we can downweight them accordingly. We're implicitly assuming that payoffs are comparable across players. We want to investigate: given a utility weighting, which actions are defections? Definition. Player i 's action a ∈ A i is a defection against strategy profile s and weighting α j j 1 n if Personal gain: P i a s − i P i s i s − i Social loss: ∑ j α j P j a s − i ∑ j α j P j s i s − i If such an action exists for some player i , strategy profile s , and weighting, then we say that there is an opportunity for defection in the game. Remark. For an equal weighting, condition (2) is equivalent to demanding that the action n... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns by Andreas Stuhlmüller</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">05:45</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097662/media/1e3f59877f1b13fe3cf336555f9679c6_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909529</span> <span class="episode_guid hidden" type="id">501b5112-f440-45f9-bdfe-c6d06b1690f1</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns, published by <AUTHOR> on the AI Alignment Forum. EDIT: The competition is now closed, thanks to everyone who participated! Rohin’s posterior distribution is here, and winners are in this comment. In this competition, we (Ought) want to amplify Rohin Shah’s forecast for the question: When will a majority of AGI researchers agree with safety concerns? Rohin has provided a prior distribution based on what he currently believes, and we want others to: Try to update Rohin’s thinking via comments (for example, comments including reasoning, distributions, and information sources). If you don’t want your comment to be considered for the competition, label it ‘aside’ Predict what his posterior distribution for the question will be after he has read all the comments and reasoning in this thread The competition will close on Friday July 31st. To participate in this competition, create your prediction on Elicit, click ‘Save Snapshot to URL,’ and post the snapshot link in a comment on this post. You can provide your reasoning in the ‘Notes’ section of Elicit or in your LessWrong comment. You should have a low bar for making predictions – they don’t have to be perfect. Here is Rohin’s prior distribution on the question. His reasoning for the prior is in this comment. Rohin spent ~30 minutes creating this distribution based on the beliefs and evidence he already has. He will spend 2-5 hours generating a posterior distribution. Click here to create your distribution We will award two $200 prizes, in the form of Amazon gift cards: Most accurate prediction: We will award $200 to the most accurate prediction of Rohin’s posterior distribution submitted through an Elicit snapshot. This will be determined by estimating KL divergence between Rohin’s final distribution and others’ distributions. If you post more than one snapshot, either your most recent snapshot or the one you identify as your final submission will be evaluated. Update to thinking: Rohin will rank each comment from 0 to 5 depending on how much the reasoning updated his thinking. We will randomly select one comment in proportion to how many points are assigned (so, a comment rated 5 would be 5 times more likely to receive the prize than a comment rated 1), and the poster of this comment will receive the $200 prize. Motivation This project is similar in spirit to amplifying epistemic spot checks and other work on scaling up individual judgment through crowdsourcing. As in these projects, we’re hoping to learn about mechanisms for delegating reasoning, this time in the forecasting domain. The objective is to learn whether mechanisms like this could save people like Rohin work. Rohin wants to know: What would I think if I had more evidence and knew more arguments than I currently do, but still followed the sorts of reasoning principles that I'm unlikely to revise in the course of a comment thread? In real-life applications of amplified forecasting, Rohin would only evaluate the arguments in-depth and form his own posterior distribution 1 out of 10 times. 9 out of 10 times he’d just skim the key arguments and adopt the predicted posterior as his new view. Question specification The question is: When will a majority of AGI researchers agree with safety concerns? Suppose that every year I (Rohin) talk to every top AI researcher about safety (I'm not explaining safety, I'm simply getting their beliefs, perhaps guiding the conversation to the safety concerns in the alignment community). After talking to X, I evaluate: (Yes / No) Is X's work related to AGI? (AGI safety counts) (Yes / No) Does X broadly understand the main concerns of the safety community? (Yes / No) Does X agree that there is at least one concern such that we have not yet solved it and we ... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Modelling Transformative AI Risks (MTAIR) Project: Introduction by David Manheim, Aryeh Englander</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">16:35</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097654/media/84f3f3bd5f949760ee02ef7b5148f029_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909530</span> <span class="episode_guid hidden" type="id">c6cce8c2-5505-4369-9c65-737092e54fca</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Modelling Transformative AI Risks (MTAIR) Project: Introduction, published by David Manheim, Aryeh Englander on the AI Alignment Forum. Numerous books, articles, and blog posts have laid out reasons to think that AI might pose catastrophic or existential risks for the future of humanity. However, these reasons often differ from each other both in details and in main conceptual arguments, and other researchers have questioned or disputed many of the key assumptions and arguments. The disputes and associated discussions can often become quite long and complex, and they can involve many different arguments, counter-arguments, sub-arguments, implicit assumptions, and references to other discussions or debated positions. Many of the relevant debates and hypotheses are also subtly related to each other. Two years ago, Ben Cottier and Rohin Shah created a hypothesis map, shown below, which provided a useful starting point for untangling and clarifying some of these interrelated hypotheses and disputes. The MTAIR project is an attempt to build on this earlier work by including additional hypotheses, debates, and uncertainties, and by including more recent research. We are also attempting to convert Cottier and Shah’s informal diagram style into a quantitative model that can incorporate explicit probability estimates, measures of uncertainty, relevant data, and other quantitative factors or analysis, in a way that might be useful for planning or decision-making purposes. Cottier and Shah's 2019 Hypothesis Map for AI Alignment This post is the first in a series which presents our preliminary outputs from this project, along with some of our plans going forward. Although the project is still a work in progress, we believe that we are now at a stage where we can productively engage the community, both to contribute to the relevant discourse and to solicit feedback, critiques, and suggestions. This introductory post gives a brief conceptual overview of our approach and a high-level walkthrough of the hypothesis map that we have developed. Subsequent posts will go into much more detail on different parts of this model. We are primarily interested in feedback on the portions of the model that we are presenting in detail. In the final posts of this sequence we will describe some of our plans going forward. Conceptual Approach There are two primary parts to the MTAIR project. The first part, which is still ongoing, involves creating a qualitative map (“model”) of key hypotheses, cruxes, and relationships, as described earlier. The second part, which is still largely in the planning phase, is to convert our qualitative map into a quantitative model with elicited values from experts, in a way that can be useful for decision-making purposes. Mapping key hypotheses: As mentioned above, this part of the project involves an ongoing effort to map out the key hypotheses and debate cruxes relevant to risks from Transformative AI, in a manner comparable to and building upon the earlier diagram by Ben Cottier and Rohin Shah. As shown in the conceptual diagram below, the idea is to create a qualitative map showing how the various disagreements and hypotheses (blue nodes) are related to each other, how different proposed technical or governance agendas (green nodes) relate to different disagreements and hypotheses, and how all of those factors feed into the likelihood that different catastrophe scenarios (red nodes) might materialize. Qualitative map illustrating relationships between hypotheses, propositions, safety agendas, and outcomes Quantification and decision analysis: Our longer-term plan is to convert our hypothesis map into a quantitative model that can be used to calculate decision-relevant probability estimates. For example, a completed model could output a roughly estimated probabili... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">How DeepMind's Generally Capable Agents Were Trained by 1a3orn</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">31:11</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097646/media/9e3a401646af38e4b3b0ca2f8a93ef5f_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909531</span> <span class="episode_guid hidden" type="id">e69b85bb-ee17-4304-8bf9-03989f97cc33</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How DeepMind's Generally Capable Agents Were Trained, published by 1a3orn on the AI Alignment Forum. Intro One of DeepMind's latest papers, Open-Ended Learning Leads to Generally Capable Agents, explains how DeepMind produced agents that can successfully play games as complex as hide-and-seek or capture-the-flag without even having trained on or seen these games before. As far as I know, this is an entirely unprecedented level of generality for a reinforcement-learning agent. The following is a high-level summary of the paper, meant to be accessible to non-specialists, that should nevertheless produce something resembling a gears-level model. I want to focus on explaining the optimization process that produced this agent; on what the different parts of the optimization process are; on why each different part is necessary; and on what would happen if different parts of it were missing. After that summary, I'll add a few more comments and questions about design choices within the paper and about future research I'd like to see. I'm far less certain about this second part, however. I was going to include a part on AI timelines -- but whether this paper influences your timelines, and in what direction, depends on a lot of priors that are out-of-scope for what I want to do here. The Environment Before we get into the optimization process of the agent, I need to talk about the environment within which the agent trained. Core to the project of this paper are the millions of dynamically-generated tasks on which the agent can train. Each task in the XLand Unity-powered 3d environment space is defined by a (1) unique physical world and a (2) unique set of goals / rewards. Thoughout what follows I refer to (1) as the "environment" or "world", to (2) as the "game", and to both of them together as a "task." Note that both of these can be generated programmatically without human intervention. (The show reel of the trained agents operating on hand-made, held-out test tasks is worth watching for at least a few minutes to get a feel for the complexity possible from both world and goals, and is probably much clearer than my writing about the environment space. [I mean, if you want to understand soccer, watch a game of it, don't read a description.] Although you should note that the intelligibility of the goals in the video is uncharacteristic of the goals in the training tasks, because the show-reel goals were made by humans from human-intelligible games, rather than randomly generated.) Anyhow. What kind of variety exists in this space? Well, each world has a static landscape with dynamic, simulated rigid-body objects placed upon it. The topographical features of the world can be randomly mutated; the size of the world and lighting of the world can vary; the rigid-body cubes, pyramids, spheres, and slabs on the map can be randomly colored, sized, and placed. Each game, with goals and rewards, can also be randomly generated. I'm not being redundant by talking about goals and rewards; each agent both receives information specifying what would cause it to be rewarded (the goal) and receives a numeric reward of 0 or 1 in each timestep. The fundamental atoms for the definition of goals are atomic predicates, such as being "on", "near", "far", or "holding" something. These atoms can be applied to different entities to form sub-goals, such as "the player is on the yellow floor" or "the black cube is near the black pyramid". A complete goal is then represented as set of options (disjunctions) over some set(s) of necessary predicates (conjunctions) -- a complete goal might be "(Hold a purple sphere AND be near a yellow cube) OR (See an opponent AND be near a black cube)." . Obviously such goals can be randomly generated, and obviously there are a very large number of them. The total environme... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Open question: are minimal circuits daemon-free? by Paul Christiano</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">04:16</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097622/media/045207fd252ab857a4d5b06b14773c50_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909532</span> <span class="episode_guid hidden" type="id">b8ad7630-4bce-4780-8bac-ac5b1cd2edef</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open question: are minimal circuits daemon-free?, published by Paul Christiano on the AI Alignment Forum. Note: weird stuff, very informal. Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future. I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense. I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice. I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment. I suspect that daemons aren't a problem if we exclusively select for computational efficiency. That is, I suspect that the fastest way to solve any particular problem doesn't involve daemons. I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks). But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about. Problem statement and intuition Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon? Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well. A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon. Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting. Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern. Difficulties The first and most serious difficulty is understanding what we are talking about. I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">In Logical Time, All Games are Iterated Games by Abram Demski</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">08:23</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097611/media/ac601177b89dd2d90ec230d54ee20d2d_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909533</span> <span class="episode_guid hidden" type="id">31d8bcd0-d44f-4b2b-9fa5-a9225d3bb207</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In Logical Time, All Games are Iterated Games, published by Abram Demski on the AI Alignment Forum. Logical Time The main purpose of this post is to introduce the concept of logical time. The idea was mentioned in Scott's post, Bayesian Probability is for things that are Space-like Separated from You. It was first coined in a conference call with, Daniel Demski, Alex Mennan, and perhaps Corey Staten and Evan Lloyd -- I don't remember exactly who was there, or who first used the term. Logical time is an informal concept which serves as an intuition pump for thinking about logical causality and phenomena in logical decision theory; don't take it too seriously. In particular, I am not interested in anybody trying to formally define logical time (aside from formal approaches to logical causality). Still, it seems like useful language for communicating decision-theory intuitions. Suppose you are playing chess, and you consider moving your bishop. You play out a hypothetical game which results in your loss in several moves. You decide not to move your bishop as a result of this. The hypothetical game resulting in your loss still exists within logic. You are logically later than it, in that the game you actually play depends on what happened in this hypothetical game. Suppose you're stuck in the desert in a Parfit's Hitchhiker problem. Paul Ekman is reading your face, deciding whether you're trustworthy. Paul Ekman does this based on experience, meaning that the computation which is you has a strong similarity with other computations. This similarity can be used to predict you fairly reliably, based on your facial expressions. What creates this similarity? According to the logical time picture, there is a logical fact much earlier in logical time, which governs the connection between facial expressions and behavior. To the extent that agents are trying to predict the future, they can be thought of as trying to place themselves later in logical time than the events which they're trying to predict. Two agents trying to predict each other are competing to see who can be later in logical time. This is not necessarily wise; in games like chicken, there is a sense in which you want to be earlier in logical time. Traditional game theory, especially Nash equilibria, relies on what amounts to loopy logical causality to allow each agent to be after the other in logical time. Whether this is bad depends on your view on logical time travel. Perhaps there is a sense in which logical time can be loopy, due to prediction (which is like logical time travel). Perhaps logical time can't be loopy, and this is a flaw in the models used by traditional game theory. Iterated Games In logical time, all games are iterated games. An agent tries to forecast what happens in the decision problem it finds itself in by comparing it to similar decision problems which are small enough for it to look at. This puts it later in logical time than the small examples. "Similar games" includes the exact same game, but in which both players have had less time to think. This means it is appropriate to use iterated strategies. Agents who are aware of logical time can play tit-for-tat in single-shot Prisoner's Dilemma, and so, can cooperate with each other. Iterated games are different in character than single-shot games. The folk theorem shows that almost any outcome is possible in iterated play (in a certain sense). This makes it difficult to avoid very bad outcomes, such as nearly always defecting in the prisoner's dilemma, despite the availability of much better equilibria such as tit-for-tat. Intuitively, this is because (as Yoav Shoham et al point out in If multi-agent learning is the answer, what is the question?) it is difficult to separate "teaching behavior" from "learning behavior": as in the tit-for-tat s... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Gradations of Inner Alignment Obstacles by Abram Demski</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">15:10</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097590/media/d342f2dbdced0f645581ceda55ad51d6_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909534</span> <span class="episode_guid hidden" type="id">9603ab86-faaf-4c27-9800-3af8f2c7b14f</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gradations of Inner Alignment Obstacles, published by Abram Demski on the AI Alignment Forum. The existing definitions of deception, inner optimizer, and some other terms tend to strike me as "stronger than necessary" depending on the context. If weaker definitions are similarly problematic, this means we need stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims. Summary of contentious claims to follow: The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard. Success at aligning narrowly superhuman models might be bad news. Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents. It's possible I've shoved too many things into one post. Sorry. Inner Optimization The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about. Mesa-Control I've previously written about the idea of distinguishing mesa-search vs mesa-control: Mesa-searchers implement an internal optimization algorithm, such as a planning algorithm, to help them achieve an objective -- this is the definition of "mesa-optimizer"/"inner optimizer" I think of as standard. Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward an objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions. Richard Ngo points out that this definition is rather all-encompassing, since it includes any highly competent policy. Adam Shimi suggests that we think of inner optimizers as goal-directed. Considering these comments, I think I want to revise my definition of mesa-controller to include that it is not totally myopic in some sense. A highly competent Q&A policy, if totally myopic, is not systematically "steering the world" in a particular direction, even if misaligned. However, I am not sure how I want to define "totally myopic" there. There may be several reasonable definitions. I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on? However, I would make the following points: If a mesa-searcher and a mesa-controller are equally effective, they're equally concerning. It doesn't matter what their internal algorithm is, if the consequences are the same. The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem. Some of the reasons we expect mesa-search also apply to mesa-control more broadly. "Search" is an incredibly ambiguous concept. There's a continuum between searchers and pure memorized strategies: Explicit brute-force search over a large space of possible strategies. Heuristic search strategies, which combine brute force with faster, smarter steps. Smart strategies like binary search or Newton's method, which efficiently solve problems by taking advantage of their structure, but still involve iteration over possibilities. Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation. Mildly-computatio... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Truthful AI: Developing and governing AI that does not lie by Owain Evans, Owen Cotton-Barratt, Lukas Finnveden</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">18:21</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097546/media/4847fd994345dd510784f2a623895aa1_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909535</span> <span class="episode_guid hidden" type="id">bf2cc559-65e6-4576-a5ea-853c3a36920a</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Truthful AI: Developing and governing AI that does not lie, published by Owain Evans, Owen Cotton-Barratt, Lukas Finnveden on the AI Alignment Forum. This post contains the abstract and executive summary of a new 96-page paper from authors at the Future of Humanity Institute and OpenAI. Update: The authors are doing an AMA about truthful AI during October 26-27. Abstract In many contexts, lying – the use of verbal falsehoods to deceive – is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI “lies” (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and by laws (against defamation, perjury, and fraud). Differences between AI and humans present an opportunity to have more precise standards of truthfulness for AI, and to have these standards rise over time. This could provide significant benefits to public epistemics and the economy, and mitigate risks of worst-case AI futures. Establishing norms or laws of AI truthfulness will require significant work to: identify clear truthfulness standards; create institutions that can judge adherence to those standards; and develop AI systems that are robustly truthful. Our initial proposals for these areas include: a standard of avoiding “negligent falsehoods” (a generalisation of lies that is easier to assess); institutions to evaluate AI systems before and after real-world deployment; explicitly training AI systems to be truthful via curated datasets and human interaction. A concerning possibility is that evaluation mechanisms for eventual truthfulness standards could be captured by political interests, leading to harmful censorship and propaganda. Avoiding this might take careful attention. And since the scale of AI speech acts might grow dramatically over the coming decades, early truthfulness standards might be particularly important because of the precedents they set. Executive Summary & Overview The threat of automated, scalable, personalised lying Today, lying is a human problem. AI-produced text or speech is relatively rare, and is not trusted to reliably convey crucial information. In today’s world, the idea of AI systems lying does not seem like a major concern. Over the coming years and decades, however, we expect linguistically competent AI systems to be used much more widely. These would be the successors of language models like GPT-3 or T5, and of deployed systems like Siri or Alexa, and they could become an important part of the economy and the epistemic ecosystem. Such AI systems will choose, from among the many coherent statements they might make, those that fit relevant selection criteria — for example, an AI selling products to humans might make statements judged likely to lead to a sale. If truth is not a valued criterion, sophisticated AI could use a lot of selection power to choose statements that further their own ends while being very damaging to others (without necessarily having any intention to deceive – see Diagram 1). This is alarming because AI untruths could potentially scale, with one system telling personalised lies to millions of people. Diagram 1: Typology of AI-produced statements. Linguistic AI systems today have little strategic selection power, and mostly produce statements that are not that useful (whether true or false). More strategic selection power on statements provides the possibility of useful statements, but also of harmful lies. Aiming for robustly beneficial standards Widespread and damaging AI falsehoods will be regarded as socially unacceptable. So it is perhaps inevitable that laws or other mechanisms will emerge to govern this behaviour. These might be existing human... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Learning the prior by Paul Christiano</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">12:55</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097521/media/3f5a0fd0c3705609546d74632156d88f_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909536</span> <span class="episode_guid hidden" type="id">0c64f777-c60f-4549-b73e-a43fbf6fbfc2</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Learning the prior, published by Paul Christiano on the AI Alignment Forum. Suppose that I have a dataset D of observed (x, y) pairs, and I’m interested in predicting the label y for each point x in some new set D. Perhaps D is a set of forecasts from the last few years, and D is a set of questions about the coming years that are important for planning. The classic deep learning approach is to fit a model f on D, and then predict y using f(x). This approach implicitly uses a somewhat strange prior, which depends on exactly how I optimize f. I may end up with the model with the smallest l2 norm, or the model that’s easiest to find with SGD, or the model that’s most robust to dropout. But none of these are anywhere close to the “ideal” beliefs of a human who has updated on D. This means that neural nets are unnecessarily data hungry, and more importantly that they can generalize in an undesirable way. I now think that this is a safety problem, so I want to try to attack it head on by learning the “right” prior, rather than attempting to use neural nets as an implicit prior. Warm-up 1: human forecasting If D and D are small enough, and I’m OK with human-level forecasts, then I don’t need ML at all. Instead I can hire a human to look at all the data in D, learn all the relevant lessons from it, and then spend some time forecasting y for each x. Now let’s gradually relax those assumptions. Warm-up 2: predicting human forecasts Suppose that D is large but that D is still small enough that a human can extract all the relevant lessons from it (or that for each x in D, there is a small subset of D that is relevant). In this case, I can pay humans to make forecasts for many randomly chosen x in D, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D. The generalization is now coming entirely from human beliefs, not from the structural of the neural net — we are only applying neural nets to iid samples from D. Learning the human prior Now suppose that D is large, such that a human can’t update on it themselves. Perhaps D contains billions of examples, but we only have time to let a human read a few pages of background material. Instead of learning the unconditional human forecast P(y|x), we will learn the forecast P(y|x, Z), where Z is a few pages of background material that the human takes as given. We can also query the human for the prior probability Prior(Z) that the background material is true. Then we can train f(y|x, Z) to match P(y|x, Z), and optimize Z for: log Prior(Z) + sum((x, y) ~ D) log f(y|x, Z) We train f in parallel with optimizing Z, on inputs consisting of the current value of Z together with questions x sampled from D and D. For example, Z might specify a few explicit models for forecasting and trend extrapolation, a few important background assumptions, and guesses for a wide range of empirical parameters. Then a human who reads Z can evaluate how plausible it is on its face, or they can take it on faith in order to predict y given x. The optimal Z is then the set of assumptions, models, and empirical estimates that works best on the historical data. The human never has to reason about more than one datapoint at a time — they just have to evaluate what Z implies about each datapoint in isolation, and evaluate how plausible Z is a priori. This approach has many problems. Two particularly important ones: To be competitive, this optimization problem needs to be nearly as easy as optimizing f directly on D, but it seems harder: finding Z might be much harder than learning f, learning a conditional f might be much harder than learning an unconditional f, and jointly optimizing Z and f might present further difficulties. Even if it worked our forecasts would only be “human-level” in a fairly restrictive sense — they wou... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Thoughts on the Alignment Implications of Scaling Language Models by leogao</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">27:00</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097515/media/0ba08003d11c6f1424c2efa2737989c1_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909537</span> <span class="episode_guid hidden" type="id">788b421d-d63f-4baf-b40f-d7f7b8b2c7fb</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on the Alignment Implications of Scaling Language Models, published by leogao on the AI Alignment Forum. [Epistemic status: slightly rambly, mostly personal intuition and opinion that will probably be experimentally proven wrong within a year considering how fast stuff moves in this field] This post is also available on my personal blog. Thanks to Gwern Branwen, Steven Byrnes, Dan Hendrycks, Connor Leahy, Adam Shimi, Kyle and Laria for the insightful discussions and feedback. Background By now, most of you have probably heard about GPT-3 and what it does. There’s been a bunch of different opinions on what it means for alignment, and this post is yet another opinion from a slightly different perspective. Some background: I'm a part of EleutherAI, a decentralized research collective (read: glorified discord server - come join us on Discord for ML, alignment, and dank memes). We're best known for our ongoing effort to create a GPT-3-like large language model, and so we have a lot of experience working with transformer models and looking at scaling laws, but we also take alignment very seriously and spend a lot of time thinking about it (see here for an explanation of why we believe releasing a large language model is good for safety). The inspiration for writing this document came out of the realization that there's a lot of tacit knowledge and intuitions about scaling and LMs that's being siloed in our minds that other alignment people might not know about, and so we should try to get that out there. (That being said, the contents of this post are of course only my personal intuitions at this particular moment in time and are definitely not representative of the views of all EleutherAI members.) I also want to lay out some potential topics for future research that might be fruitful. By the way, I did consider that the scaling laws implications might be an infohazard, but I think that ship sailed the moment the GPT-3 paper went live, and since we’ve already been in a race for parameters for some time (see: Megatron-LM, Turing-NLG, Switch Transformer, PanGu-α/盘古α, HyperCLOVA, Wudao/悟道 2.0, among others), I don’t really think this post is causing any non-negligible amount of desire for scaling. Why scaling LMs might lead to Transformative AI Why natural language as a medium First, we need to look at why a perfect LM could in theory be Transformative AI. Language is an extremely good medium for representing complex, abstract concepts compactly and with little noise. Natural language seems like a very efficient medium for this; images, for example, are much less compact and don’t have as strong an intrinsic bias towards the types of abstractions we tend to draw in the world. This is not to say that we shouldn’t include images at all, though, just that natural language should be the focus. Since text is so flexible and good at being entangled with all sorts of things in the world, to be able to model text perfectly, it seems that you'd have to model all the processes in the world that are causally responsible for the text, to the “resolution” necessary for the model to be totally indistinguishable from the distribution of real text. For more intuition along this line, the excellent post Methods of prompt programming explores, among other ideas closely related to the ideas in this post, a bunch of ways that reality is entangled with the textual universe: A novel may attempt to represent psychological states with arbitrarily fidelity, and scientific publications describe models of reality on all levels of abstraction. [...] A system which predicts the dynamics of language to arbitrary accuracy does require a theory of mind(s) and a theory of the worlds in which the minds are embedded. The dynamics of language do not float free from cultural, psychological, or physical... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Bayesian Probability is for things that are Space-like Separated from You by Scott Garrabrant</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">03:10</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097479/media/a1b7d9a0061abfb42f6b12d673099a74_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909538</span> <span class="episode_guid hidden" type="id">925b255a-dbb7-4627-ba6c-5eeae5a81517</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bayesian Probability is for things that are Space-like Separated from You, published by Scott Garrabrant on the AI Alignment Forum. First, I should explain what I mean by space-like separated from you. Imagine a world that looks like a Bayesian network, and imagine that you are a node in that Bayesian network. If there is a path from you to another node following edges in the network, I will say that node is time-like separated from you, and in your future. If there is a path from another node to you, I will say that node is time-like separated from you, and in your past. Otherwise, I will say that the node is space-like separated from you. Nodes in your past can be thought of as things that you observe. When you think about physics, it sure does seem like there are a lot of things in your past that you do not observe, but I am not thinking about physics-time, I am thinking about logical-time. If something is in your past, but has no effect on what algorithm you are running on what observations you get, then it might as well be considered as space-like separated from you. If you compute how everything in the universe evaluates, the space-like separated things are the things that can be evaluated either before or after you, since their output does not change yours or vice-versa. If you partially observe a fact, then I want to say you can decompose that fact into the part that you observed and the part that you didn't, and say that the part you observed is in your past, while the part you didn't observe is space-like separated from you. (Whether or not you actually can decompose things like this is complicated, and related to whether or not you can use the tickle defense is the smoking lesion problem.) Nodes in your future can be thought of as things that you control. These are not always things that you want to control. For example, you control the output of "You assign probability less than 1/2 to this sentence," but perhaps you wish you didn't. Again, if you partially control a fact, I want to say that (maybe) you can break that fact into multiple nodes, some of which you control, and some of which you don't. So, you know the things in your past, so there is no need for probability there. You don't know the things in your future, or things that are space-like separated from you. (Maybe. I'm not sure that talking about knowing things you control is not just a type error.) You may have cached that you should use Bayesian probability to deal with things you are uncertain about. You may have this justified by the fact that if you don't use Bayesian probability, there is a Pareto improvement that will cause you to predict better in all worlds. The problem is that the standard justifications of Bayesian probability are in a framework where the facts that you are uncertain about are not in any way affected by whether or not you believe them! Therefore, our reasons for liking Bayesian probability do not apply to our uncertainty about the things that are in our future! Note that many things in our future (like our future observations) are also in the future of things that are space-like separated from us, so we want to use Bayes to reason about those things in order to have better beliefs about our observations. I claim that logical inductors do not feel entirely Bayesian, and this might be why. They can't if they are able to think about sentences like "You assign probability less than 1/2 to this sentence." Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">The Inner Alignment Problem by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">24:49</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097460/media/7361226f33b1a01ae8ce0a0d637cd2ae_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909539</span> <span class="episode_guid hidden" type="id">9b2251b1-7714-4f23-ac8c-2f5b0c311c3e</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Inner Alignment Problem, published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabranton the AI Alignment Forum. This is the third of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. In this post, we outline reasons to think that a mesa-optimizer may not optimize the same objective function as its base optimizer. Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer. We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers. Current machine learning methods select learned algorithms by empirically evaluating their performance on a set of training data according to the base objective function. Thus, ML base optimizers select mesa-optimizers according to the output they produce rather than directly selecting for a particular mesa-objective. Moreover, the selected mesa-optimizer's policy only has to perform well (as scored by the base objective) on the training data. If we adopt the assumption that the mesa-optimizer computes an optimal policy given its objective function, then we can summarize the relationship between the base and mesa- objectives as follows:(17) θ ∗ argmax θ E O base π θ where π θ argmax π E O mesa π θ That is, the base optimizer maximizes its objective O base by choosing a mesa-optimizer with parameterization θ based on the mesa-optimizer's policy π θ , but not based on the objective function O mesa that the mesa-optimizer uses to compute this policy. Depending on the base optimizer, we will think of O base as the negative of the loss, the future discounted reward, or simply some fitness function by which learned algorithms are being selected. An interesting approach to analyzing this connection is presented in Ibarz et al, where empirical samples of the true reward and a learned reward on the same trajectories are used to create a scatter-plot visualization of the alignment between the two.(18) The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it. 3.1. Pseudo-alignment There is currently no complete theory of the factors that affect whether a mesa-optimizer will be pseudo-aligned—that is, whether it will appear aligned on the training data, while actually optimizing for something other than the base objective. Nevertheless, we outline a basic classification of ways in which a mesa-optimizer could be pseudo-aligned: Proxy alignment, Approximate alignment, and Suboptimality alignment. Proxy alignment. The basic idea of proxy alignment is that a mesa-optimizer can learn to optimize for some proxy of the base objective instead of the base objective itself. We'll start by considering two special cases of proxy alignment: side-effect alignment and instrumental alignment. First, a mesa-optimizer is side-effect aligned if optimizing for the mesa-objective O mesa has the direct causal result of increasing the ... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Clarifying some key hypotheses in AI alignment by Ben Cottier, Rohin Shah</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">16:06</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097407/media/df6357483a18fc4427da2fa56295aa1a_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909540</span> <span class="episode_guid hidden" type="id">a4114b06-b792-4d57-bfda-b135376c7860</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying some key hypotheses in AI alignment, published by Ben Cottier, Rohin Shah on the AI Alignment Forum. We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements. Diagram A part of the diagram. Click through to see the full version. Diagram preview Caveats This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value, Goodhart's Curse, AI being deployed in a catastrophe-sensitive context. This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria was something like: The idea is closely connected to the problem of artificial systems optimizing adversarially against humans. The idea must be explained sufficiently well that we believe it is plausible. Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications — please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow — you may disagree with it, so don't blindly follow the arrow! Background Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations. One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs. This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis. You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. However, we recommend starting with the diagram, then if you are interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below. Supplementary information The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments, for lack of software to neatly embed this information (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised are referring to a box in the diagram. Definitions AGI: a system (not necessarily agentive)... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Reflections on Larks’ 2020 AI alignment literature review by Alex Flint</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">10:17</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097375/media/103911af7796bbafba141c8c85bfccae_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909541</span> <span class="episode_guid hidden" type="id">bb03e4a1-5056-4fe6-a010-b07a9d70cd81</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reflections on Larks’ 2020 AI alignment literature review, published by Alex Flint on the AI Alignment Forum. This work was supported by OAK, a monastic community in the Berkeley hills. It could not have been written without the daily love of living in this beautiful community. Larks has once again evaluated a large fraction of this year’s research output in AI alignment. I am, as always, deeply impressed not just by the volume of his work but by Larks’ willingness to distill from these research summaries a variety of coherent theses on how AI alignment research is evolving and where individual donors might give money. I cannot emphasize enough how much more difficult this is than merely summarizing the entire year’s research output, and summarizing the entire year’s research output is certainly a heroic undertaking on its own! I’d like to reflect briefly on a few points that came up as I read the post. Depth The work that I would most like to see funded is technical work that really moves our understanding of how to build beneficial AI systems forward. I will call this “depth”. It is unfortunately very difficult to quickly assess the depth of a given piece of research. Larks touches on this point when he discusses low-quality research: [...] a considerable amount of low-quality work has been produced. For example, there are a lot of papers which can be accurately summarized as asserting “just use ML to learn ethics”. Furthermore, the conventional peer review system seems to be extremely bad at dealing with this issue. Yet even among the papers that did get included in this year’s literature review, I suspect that there is a huge variation in depth, and I have no idea how to quickly assess which papers have it. Consider: which of the research outputs from, say, 2012 really moved our understanding of AI safety forward? How about from 2018? My sense is that these are fearsomely difficult questions to answer, even with several years’ hindisight. Larks wisely does not fall into the trap of merely counting research outputs, or computing any other such simplistic metric. I imagine that he reads the papers and comes to an informed sense of their relative quality without relying on any single explicit metric. My own sense is that this is exactly the right way to do it. Yet the whole conclusion of the literature review does rest critically on this one key question: what is it that constitutes valuable research in the field of AI alignment? My sense is that depth is the most valuable quality on the current margin, and unfortunately it seems to be very difficult either to produce or assess. Flywheel I was both impressed and more than a little disturbed by Larks’ “research flywheel” model of success in AI alignment: My basic model for AI safety success is this: Identify interesting problems. As a byproduct this draws new people into the field through altruism, nerd-sniping, apparent tractability Solve interesting problems. As a byproduct this draws new people into the field through credibility and prestige Repeat I was impressed because it is actually quite rare to see any thesis whatsoever about how AI alignment might succeed overall, and rarer still to see a thesis distilled to such a point that it can be intelligently critiqued. But I was disturbed because this particular thesis is completely wrong! Increasing the amount of AI alignment research or the number of AI alignment researchers will, I suspect, by default decrease the capacity for anyone to do deep work in the field, just as increasing the number of lines of code in a codebase will, by default, decrease the capacity for anyone to sculpt highly reliable research artifacts from that codebase, or increasing the number of employees in a company will, by default, decrease the capacity for anyone in that company to get imp... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress by Mark Xu</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">17:56</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097356/media/3047fc61a02e87ee55f0d07f9e99afd8_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909542</span> <span class="episode_guid hidden" type="id">9bbe722e-8667-4742-b907-d8f7955d5cc1</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress., published by Mark Xu on the AI Alignment Forum. The Semiconductor Supply Chain: Assessing National Competitiveness (Saif M. Khan, Alexander Mann, Dahlia Peterson) The Semiconductor Supply Chain: Assessing National Competitiveness Summary This report analyzes the current supply chain for semiconductors. It particularly focuses on which portions of the supply chain are controlled by US and its allies and China. Some key insights: The US semiconductor industry is estimated to contribute 39 percent of the total value of the global semiconductor supply chain. The semiconductor supply chain is incredibly complicated. The production of a single chip requires more than 1,000 steps and passes through borders more than 70 times throughout production. AMD is currently the only company with expertise in designing both high-end GPUs and high-end CPUs. TSMC controls 54% of the logic foundry market, with a larger share for leading edge production, e.g., state-of-the-art 5 nm node chips. Revenue per wafer for TSMC is rapidly increasing, while other foundries are seeing declines. The Netherlands has a monopoly on extreme ultraviolet (EUV) scanners, equipment needed to make the most advanced chips. The Netherlands and Japan have a monopoly on argon floride (ArF) immersion scanners, needed to make the second most advanced chips. The US has a monopoly on full-spectrum electronic design automation (EDA) software needed to design semiconductors. Japan, Taiwan, Germany and South Korea manufacture the state-of-the-art 300 mm wafers used for 99.7 percent of the world's chip manufacturing. This manufacturing process requires large amounts of tacit know-how. China controls the largest share of manufacturing for most natural materials. The US and its allies have a sizable share in all materials except for low-grade gallium, tungsten and magnesium. China controls ~2/3rds of the world's silicon production, but the US and allies have reserves. The report also analyzes US competitiveness at very detailed levels of the supply chain, which I didn't read that carefully. Tables: Opinion One perspective on the economy is that it's running a vast, distributed computation to allocate supply to demand in a relatively efficient manner. Examining the details on one of the supply chains underpinning a half a trillion dollar industry is relatively awe-inspiring. The only thing I'm currently aware of that is as complicated as computer hardware is computer software, which is sort of cheating. As AI becomes more advanced, control of semiconductor production becomes a strategic resource. However, there are multiple monopolies/sort of monopolies at every point. Each of these monopolies has a relatively large amount of bargaining power under many reasonable models. This situation puts the world in an interesting place. One concrete thing that I didn't consider before reading this report is the relevance of design software to semiconductor manufacturing. In retrospect, it seems pretty clear that the design of complicated things, e.g., video games, buildings, semiconductors, and animations, require complicated software with companies dedicated to building it. Lacking this software could constitute a meaningful bottleneck to being able to produce complicated artifacts. The asymmetry between manufacturing software and hardware is that software is easier to acquire through illegal means, whereas a EUV scanner has “100,000 parts, 3,000 cables, 40,000 bolts and 2 kilometers of hosing,” making it prohibitive to steal. Intelligence Explosion Microeconomics (Eliezer Yudkowsky) Intelligence Explosion Microeconomics Summary Takeaways I found interesting: Evolutionary history strongly suggests that there are linear or superlinear... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Humans Are Embedded Agents Too by johnswentworth</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">16:40</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097321/media/5b5ca96ec08f7185cca5b0ca05d7eb68_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909543</span> <span class="episode_guid hidden" type="id">238316c1-59cd-41d7-af29-52964b847e41</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Humans Are Embedded Agents Too, published by johnswentworth on the AI Alignment Forum. Most models of agency (in game theory, decision theory, etc) implicitly assume that the agent is separate from the environment - there is a “Cartesian boundary” between agent and environment. The embedded agency sequence goes through a long list of theoretical/conceptual problems which arise when an agent is instead embedded in its environment. Some examples: No defined/input output channels over which to optimize Agent might accidentally self-modify, e.g. drop a rock on its head Agent might intentionally self-modify, e.g. change its own source code Hard to define hypotheticals which don’t actually happen, e.g. “I will kill the hostages if you don’t pay the ransom” Agent may contain subcomponents which optimize for different things Agent is made of parts (e.g. atoms) whose behavior can be predicted without thinking of the agent as agenty - e.g. without thinking of the agent as making choices or having beliefs Agent is not logically omniscient: it cannot know all the implications of its own beliefs The embedded agency sequence mostly discusses how these issues create problems for designing reliable AI. Less discussed is how these same issues show up when modelling humans - and, in particular, when trying to define human values (i.e. “what humans want”). Many - arguably most - of the problems alignment researchers run into when trying to create robust pointers to human values are the same problems we encounter when talking about embedded agents in general. I’ll run through a bunch of examples below, and tie each to a corresponding problem-class in embedded agency. While reading, bear in mind that directly answering the questions posed is not the point. The point is that each of these problems is a symptom of the underlying issue: humans are embedded agents. Patching over each problem one-by-one will produce a spaghetti tower; ideally we’d tackle the problem closer to the root. The Keyboard is Not The Human Let’s imagine that we have an AI which communicates with its human operator via screen and keyboard. It tries to figure out what the human wants based on what’s typed at the keyboard. A few possible failure modes in this setup: The AI wireheads by seizing control of the keyboard (either intentionally or accidentally) A cat walks across the keyboard every now and then, and the AI doesn’t realize that this input isn’t from the human After a code patch, the AI filters out cat-input, but also filters out some confusing (but important) input from the human Embedded agency problem: humans do not have well-defined output channels. We cannot just point to a keyboard and say “any information from that keyboard is direct output from the human”. Of course we can come up with marginally better solutions than a keyboard - e.g. voice recognition - but eventually we’ll run into similar issues. There is nothing in the world we can point to and say “that’s the human’s output channel, the entire output channel, and nothing but the output channel”. Nor does any such output channel exist, so e.g. we won’t solve the problem just by having uncertainty over where exactly the output channel is. Modified Humans Because humans are embedded in the physical world, there is no fundamental block to an AI modifying us (either intentionally or unintentionally). Define what a “human” is based on some neural network which recognizes humans in images, and we risk an AI modifying the human by externally-invisible means ranging from drugs to wholesale replacement. Embedded agency problem: no Cartesian boundary. All the human-parts can be manipulated/modified; the AI is not in a different physical universe from us. Off-Equilibrium Human choices can depend on off-equilibrium behavior - what we or someone else would do, ... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Reply to Paul Christiano on Inaccessible Information by Alex Flint</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">09:10</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097304/media/cab0d32319e91b81f1207fe6a25ddf84_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909544</span> <span class="episode_guid hidden" type="id">b529ca97-168d-4574-8918-10f508eedba3</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reply to Paul Christiano on Inaccessible Information, published by Alex Flint on the AI Alignment Forum. In Inaccessible Information, Paul Christiano lays out a fundamental challenge in training machine learning systems to give us insight into parts of the world that we cannot directly verify. The core problem he lays out is as follows. Suppose we lived in a world that had invented machine learning but not Newtonian mechanics. And suppose we trained some machine learning model to predict the motion of the planets across the sky -- we could do this by observing the position of the planets over, say, a few hundred days, and using this as training data for, say, a recurrent neural network. And suppose further that this worked and our training process yielded a model that output highly accurate predictions many days into the future. If all we wanted was to know the position of the planets in the sky then -- good news -- we’re done. But we might hope to use our model to gain some direct insight into the nature of the motion of the planets (i.e. the laws of gravity, although we wouldn’t know that this is what we were looking for). Presumably the machine learning model has in some sense discovered Newtonian mechanics using the training data we fed it, since this is surely the most compact way to predict the position of the planets far into the future. But we certainly can’t just read off the laws of Newtonian mechanics by looking at the millions or billions or trillions of weights in the trained model. How might we extract insight into the nature of the motion of the planets from this model? Well we might train a model to output both predictions about the position of the planets in the sky and a natural language description of what’s really going on behind the scenes (i.e. the laws of gravity). We’re assuming that we have enough training data that the training process was already able to derive these laws, so it’s not unreasonable to train a model that also outputs such legible descriptions. But in order to train a model that outputs such legible descriptions we need to generate a reward signal that incentivizes the right kind of legible descriptions. And herein lies the core of the problem: in this hypothesized world we do not know the true laws of Newtonian mechanics, so we cannot generate a reward signal by comparing the output of our model to ground truth during training. We might instead generate a reward signal that (1) measures how accurate the predictions of the position of the planets are, and (2) measures how succinct and plausible the legible descriptions are. But then what we are really training is a model that is good at producing succinct descriptions that seem plausible to humans. This may be a very very different (and dangerous) thing to do since there are lots of ways that a description can seem plausible to a human while being quite divorced from the truth. Christiano calls this the instrumental policy: the policy that produces succinct descriptions that merely seem plausible to humans: The real problem comes from what I’ll call the instrumental policy. Let’s say we’ve tried to dream up a loss function L(x, y) to incentivize the model to correctly answer information we can check, and give at least plausible and consistent answers on things we can’t check. By definition, the values L(x, y) are themselves accessible. Then it’s natural to learn a policy like: “on input x, produce the output y for which the loss L(x, y) will be minimal.” Let’s write BAD for this policy. Christiano uses the term “inaccessible information” for information like the laws of gravity in this example: information about the underlying nature of things that a machine learning model might learn quite accurately as latent info in service of making predictions, but that is difficult to ex... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">The Main Sources of AI Risk? by Daniel Kokotajlo, Wei Dai</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">05:16</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097280/media/4992671300374b6cdc98abd1dd610e54_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909545</span> <span class="episode_guid hidden" type="id">14e33ef2-45cc-4708-b4f9-30017502d91b</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Main Sources of AI Risk?, published by Daniel Kokotajlo, Wei Dai on the AI Alignment Forum. There are so many causes or sources of AI risk that it's getting hard to keep them all in mind. I propose we keep a list of the main sources (that we know about), such that we can say that if none of these things happen, then we've mostly eliminated AI risk (as an existential risk) at least as far as we can determine. Here's a list that I spent a couple of hours enumerating and writing down. Did I miss anything important? Insufficient time/resources for AI safety (for example caused by intelligence explosion or AI race) Insufficient global coordination, leading to the above Misspecified or incorrectly learned goals/values Inner optimizers ML differentially accelerating easy to measure goals Paul Christiano's "influence-seeking behavior" (a combination of 3 and 4 above?) AI generally accelerating intellectual progress in a wrong direction (e.g., accelerating unsafe/risky technologies more than knowledge/wisdom about how to safely use those technologies) Metaethical error Metaphilosophical error Other kinds of philosophical errors in AI design (e.g., giving AI a wrong prior or decision theory) Other design/coding errors (e.g., accidentally putting a minus sign in front of utility function, supposedly corrigible AI not actually being corrigible) Doing acausal reasoning in a wrong way (e.g., failing to make good acausal trades, being acausally extorted, failing to acausally influence others who can be so influenced) Human-controlled AIs ending up with wrong values due to insufficient "metaphilosophical paternalism" Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity Intentional corruption of human values Unintentional corruption of human values Mind crime (disvalue unintentionally incurred through morally relevant simulations in AIs' minds) Premature value lock-in (i.e., freezing one's current conception of what's good into a utility function) Extortion between AIs leading to vast disvalue Distributional shifts causing apparently safe/aligned AIs to stop being safe/aligned Value drift and other kinds of error as AIs self-modify, or AIs failing to solve value alignment for more advanced AIs Treacherous turn / loss of property rights due to insufficient competitiveness of humans & human-aligned AIs Gradual loss of influence due to insufficient competitiveness of humans & human-aligned AIs Utility maximizers / goal-directed AIs having an economic and/or military competitive advantage due to relative ease of cooperation/coordination, defense against value corruption and other forms of manipulation and attack, leading to one or more of the above In general, the most competitive type of AI being too hard to align or to safely use Computational resources being too cheap, leading to one or more of the above (With this post I mean to (among other things) re-emphasize the disjunctive nature of AI risk, but this list isn't fully disjunctive (i.e., some of the items are subcategories or causes of others), and I mostly gave a source of AI risk its own number in the list if it seemed important to make that source more salient. Maybe once we have a list of everything that is important, it would make sense to create a graph out of it.) Added on 6/13/19: Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (suggested by William Saunders) Economics of AGI causing concentration of power amongst human overseers Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen) AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">And the AI would have got away with it too, if... by Stuart Armstrong</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">02:41</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097268/media/b3355c40a7c7c3388b12ca46de1f239c_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909546</span> <span class="episode_guid hidden" type="id">35569d15-b4bc-4860-81cd-877294e861de</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: And the AI would have got away with it too, if..., published by Stuart Armstrong on the AI Alignment Forum Paul Christiano presented some low-key AI catastrophe scenarios; in response, Robin Hanson argued that Paul's scenarios were not consistent with the "large (mostly economic) literature on agency failures". He concluded with: For concreteness, imagine a twelve year old rich kid, perhaps a king or queen, seeking agents to help manage their wealth or kingdom. It is far from obvious that this child is on average worse off when they choose a smarter more capable agent, or when the overall pool of agents from which they can choose becomes smarter and more capable. And its even less obvious that the kid becomes maximally worse off as their agents get maximally smart and capable. In fact, I suspect the opposite. Thinking on that example, my mind went to Edward the Vth of England (one of the "Princes in the Tower"), deposed then likely killed by his "protector" Richard III. Or of the Guangxu Emperor of China, put under house arrest by the Regent Empress Dowager Cixi. Or maybe the ten year-old Athitayawong, king of Ayutthaya, deposed by his main administrator after only 36 days of reign. More examples can be dug out from some of Wikipedia's list of rulers deposed as children. We have no reason to restrict to child-monarchs - so many Emperors, Kings, and Tsars have been deposed by their advisers or "agents". So yes, there are many cases where agency fails catastrophically for the principal and where having a smarter or more rational agent was a disastrous move. By restricting attention to agency problems in economics, rather than in politics, Robin restricts attention to situations where institutions are strong and behaviour is punished if it gets too egregious. Though even today, there is plenty of betrayal by "agents" in politics, even if the results are less lethal than in times gone by. In economics, too, we have fraudulent investors, some of which escape punishment. Agents betray their principals to the utmost - when they can get away with it. So Robin's argument is entirely dependent on the assumption that institutions or rivals will prevent AIs from being able to abuse their agency power. Absent that assumption, most of the "large (mostly economic) literature on agency failures" becomes irrelevant. So, would institutions be able to detect and punish abuses by future powerful AI agents? I'd argue we can't count on it, but it's a question that needs its own exploration, and is very different from what Robin's economic point seemed to be. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Inner alignment in the brain by Steve Byrnes</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">27:18</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097177/media/a0c6d48ffc4595bd4886baa95c18c261_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909547</span> <span class="episode_guid hidden" type="id">7898a185-62e3-4386-81ee-f8ec247ef075</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner alignment in the brain, published by Steve Byrnes on the AI Alignment Forum. Abstract: We can think of the brain crudely as (1) a neocortex which runs an amazingly capable quasi-general-purpose learning-and-planning algorithm, and (2) subcortical structures (midbrain, etc.), one of whose functions is to calculate rewards that get sent to up the neocortex to direct it. But the relationship is actually more complicated than that. "Reward" is not the only informational signal sent up to the neocortex; meanwhile information is also flowing back down in the opposite direction. What's going on? How does all this work? Where do emotions fit in? Well, I'm still confused on many points, but I think I'm making progress. In this post I will describe my current picture of this system. Background & motivation I'm interested in helping ensure a good post-AGI future. But how do we think concretely about AGI, when AGI doesn't exist and we don't know how to build it? Three paths: We can think generally about the nature of intelligence and agency—a research program famously associated with MIRI, Marcus Hutter, etc.; We can think about today's AI systems—a research program famously associated with OpenAI, DeepMind, CHAI, etc.; We can start from the one "general intelligence" we know about, i.e. the human brain, and try to go from there to lessons about how AGI might be built, what it might look like, and how it might be safely and beneficially used and controlled. I like this 3rd research program; it seems to be almost completely neglected,[1] and I think there's a ton of low-hanging fruit there. Also, this program will be especially important if we build AGI in part by reverse-engineering (or reinventing) high-level neocortical algorithms, which (as discussed below) I think is very plausible, maybe even likely—for better or worse. Now, the brain is divided into the neocortex and the subcortex. Start with the neocortex[2] The neocortex does essentially all the cool exciting intelligent things that humans do, like building an intelligent world-model involving composition and hierarchies and counterfactuals and analogies and meta-cognition etc., and using that thing to cure diseases and build rocket ships and create culture etc. Thus, both neuroscientists and AI researchers focus a lot of attention onto the neocortex, and on understanding and reverse-engineering its algorithms. Textbooks divide the neocortex into lots of functional regions like "motor cortex" and "visual cortex" and "frontal lobe" etc., but microscopically it's all a pretty uniform 6-layer structure, and I currently believe that all parts of the neocortex are performing more-or-less the same algorithm, but with different input and output connections. These connections are seeded by an innate gross wiring diagram and then edited by the algorithm itself. See Human Instincts, Symbol Grounding, and the Blank-Slate Neocortex for discussion and (heavy!) caveats on that claim. And what is this algorithm? I outline some of (what I think are) the high-level specifications at Predictive coding = RL + SL + Bayes + MPC. In terms of how the algorithm actually works, I think that researchers are making fast progress towards figuring this out, and that a complete answer is already starting to crystallize into view on the horizon. For a crash course on what's known today on how the neocortex does its thing, maybe a good starting point would be to read On Intelligence and then every paper ever written by Dileep George (and citations therein). The subcortex, by contrast, is not a single configuration of neurons tiled over a huge volume, but rather it is a collection of quite diverse structures like the amygdala, cerebellum, tectum, and so on. Unlike the neocortex, this stuff does not perform some miraculous computation light-years b... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Review of Soft Takeoff Can Still Lead to DSA by Daniel Kokotajlo</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">08:57</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097149/media/063341cc306a6d499e7bd14b1d4046c7_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909548</span> <span class="episode_guid hidden" type="id">e4a104f8-2df6-4f4f-8ad3-a598aa5049f2</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Review of Soft Takeoff Can Still Lead to DSA, published by Daniel Kokotajlo on the AI Alignment Forum. A few months after writing this post I realized that one of the key arguments was importantly flawed. I therefore recommend against inclusion in the 2019 review. This post presents an improved version of the original argument, explains the flaw, and then updates my all-things-considered view accordingly. Improved version of my original argument Definitions: “Soft takeoff” is roughly “AI will be like the Industrial Revolution but 10x-100x faster” “Decisive Strategic Advantage” (DSA) is “a level of technological and other advantages sufficient to enable it to achieve complete world domination.” In other words, DSA is roughly when one faction or entity has the capability to “take over the world.” (What taking over the world means is an interesting question which we won’t explore here. Nowadays I’d reframe things in terms of potential PONRs.) We ask how likely it is that DSA arises, conditional on soft takeoff. Note that DSA does not mean the world is actually taken over, only that one faction at some point has the ability to do so. They might be too cautious or too ethical to try. Or they might try and fail due to bad luck. In a soft takeoff scenario, a 0.3 - 3 year technological lead over your competitors probably gives you a DSA. It seems plausible that for much of human history, a 30-year technological lead over your competitors was not enough to give you a DSA. It also seems plausible that during and after the industrial revolution, a 30-year technological lead was enough. (For more arguments on this key point, see my original post.) This supports a plausible conjecture that when the pace of technological progress speeds up, the length (in clock time) of technological lead needed for DSA shrinks proportionally. So a soft takeoff could lead to a DSA insofar as there is a 0.3 - 3 year lead at the beginning which is maintained for a few years. 0.3 - 3 year technological leads are reasonably common today, and in particular it’s plausible that there could be one in the field of AI research. There’s a reasonable chance of such a lead being maintained for a few years. This is a messy question, but judging by the table below, it seems that if anything the lead of the front-runner in this scenario is more likely to lengthen than shorten! If this is so, why did no one achieve DSA during the Industrial Revolution? My answer is that spies/hacking/leaks/etc. are much more powerful during the industrial revolution than they are during a soft takeoff, because they have an entire economy to steal from and decades to do it, whereas in a soft takeoff ideas can be hoarded in a specific corporation and there’s only a few years (or months!) to do it. Therefore, there’s a reasonable chance of DSA conditional on soft takeoff. Factors that might shorten the lead Factors that might lengthen the lead If you don’t sell your innovations to the rest of the world, you’ll lose out on opportunities to make money, and then possibly be outcompeted by projects that didn’t hoard their innovations. Hoarding innovations gives you an advantage over the rest of the world, because only you can make use of them. Spies, hacking, leaks, defections, etc. Big corporations with tech leads often find ways to slow down their competition, e.g. by lobbying to raise regulatory barriers to entry. Being known to be the leading project makes it easier to attract talent and investment. There might be additional snowball effects (e.g. network effect as more people use your product providing you with more data) I take it that 2, 4, and 5 are the controversial bits. I still stand by 2, and the arguments made for it in my original post. I also stand by 4. (To be clear, it’s not like I’ve investigated these things in detail.... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Two Neglected Problems in Human-AI Safety by Wei Dai</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">03:09</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097121/media/7824c335e584bfbfb6e015f0921b8f20_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909549</span> <span class="episode_guid hidden" type="id">cda9a9ed-fc6e-4a30-bb77-8ff016f67e49</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two Neglected Problems in Human-AI Safety, published by Wei Dai on the AI Alignment Forum. In this post I describe a couple of human-AI safety problems in more detail. These helped motivate my proposed hybrid approach, and I think need to be addressed by other AI safety approaches that currently do not take them into account. 1. How to prevent "aligned" AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive. (Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to safely handle them.) 2. How to defend against intentional attempts by AIs to corrupt human values? It looks like we may be headed towards a world of multiple AIs, some of which are either unaligned, or aligned to other owners or users. In such a world there's a strong incentive to use one's own AIs to manipulate other people's values in a direction that benefits oneself (even if the resulting loss to others are greater than gains to oneself). There is an apparent asymmetry between attack and defense in this arena, because manipulating a human is a straightforward optimization problem with an objective that is easy to test/measure (just check if the target has accepted the values you're trying to instill, or has started doing things that are more beneficial to you), and hence relatively easy for AIs to learn how to do, but teaching or programming an AI to help defend against such manipulation seems much harder, because it's unclear how to distinguish between manipulation and useful information or discussion. (One way to defend against such manipulation would be to cut off all outside contact, including from other humans because we don't know whether they are just being used as other AIs' mouthpieces, but that would be highly detrimental to one's own moral development.) There's also an asymmetry between AIs with simple utility functions (either unaligned or aligned to users who think they have simple values) and AIs aligned to users who have high value complexity and moral uncertainty. The former seem to be at a substantial advantage in a contest to manipulate others' values and protect one's own. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Announcement: AI alignment prize round 4 winners by Vladimir Slepnev</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">01:39</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097094/media/f525ebd064f1aba256f35058ad795ea5_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909550</span> <span class="episode_guid hidden" type="id">7b3f83bc-223a-4ef3-8829-b88d82894b8c</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcement: AI alignment prize round 4 winners , published by Vladimir Slepnev on the AI Alignment Forum. We (Zvi Mowshowitz and Vladimir Slepnev) are happy to announce the results of the fourth round of the AI Alignment Prize, funded by Paul Christiano. From July 15 to December 31, 2018 we received 10 entries, and are awarding four prizes for a total of $20,000. The winners We are awarding two first prizes of $7,500 each. One of them goes to Alexander Turner for Penalizing Impact via Attainable Utility Preservation; the other goes to Abram Demski and Scott Garrabrant for the Embedded Agency sequence. We are also awarding two second prizes of $2,500 each: to Ryan Carey for Addressing three problems with counterfactual corrigibility, and to Wei Dai for Three AI Safety Related Ideas and Two Neglected Problems in Human-AI Safety. We will contact each winner by email to arrange transfer of money. Many thanks to everyone else who participated! Moving on This concludes the AI Alignment Prize for now. It has stimulated a lot of good work during its year-long run, but participation has been slowing down from round to round, and we don't think it's worth continuing in its current form. Once again, we'd like to thank everyone who sent us articles! And special thanks to Ben and Oliver from the LW2.0 team for their enthusiasm and help. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">The Credit Assignment Problem by Abram Demski</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">28:49</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3097053/media/2014fb572bb8e521a8ae24252a7a39c2_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909551</span> <span class="episode_guid hidden" type="id">3ad94444-4baa-4e37-a1e9-18a8721252f8</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Credit Assignment Problem, published by Abram Demski on the AI Alignment Forum. This post is eventually about partial agency. However, it's been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy. I've occasionally said "Everything boils down to credit assignment problems." What I really mean is that credit assignment pops up in a wide range of scenarios, and improvements to credit assignment algorithms have broad implications. For example: Politics. When politics focuses on (re-)electing candidates based on their track records, it's about credit assignment. The practice is sometimes derogatorily called "finger pointing", but the basic computation makes sense: figure out good and bad qualities via previous performance, and vote accordingly. When politics instead focuses on policy, it is still (to a degree) about credit assignment. Was raising the minimum wage responsible for reduced employment? Was it responsible for improved life outcomes? Etc. Economics. Money acts as a kind of distributed credit-assignment algorithm, and questions of how to handle money, such as how to compensate employees, often involve credit assignment. In particular, mechanism design (a subfield of economics and game theory) can often be thought of as a credit-assignment problem. Law. Both criminal law and civil law involve concepts of fault and compensation/retribution -- these at least resemble elements of a credit assignment process. Sociology. The distributed computation which determines social norms involves a heavy element of credit assignment: identifying failure states and success states, determining which actions are responsible for those states and who is responsible, assigning blame and praise. Biology. Evolution can be thought of as a (relatively dumb) credit assignment algorithm. Ethics. Justice, fairness, contractualism, issues in utilitarianism. Epistemology. Bayesian updates are a credit assignment algorithm, intended to make high-quality hypotheses rise to the top. Beyond the basics of Bayesianism, building good theories realistically involves identifying which concepts are responsible for successes and failures. This is credit assignment. Another big area which I'll claim is "basically credit assignment" is artificial intelligence. In the 1970s, John Holland kicked off the investigation of learning classifier systems. John Holland had recently invented the Genetic Algorithms paradigm, which applies an evolutionary paradigm to optimization problems. Classifier systems were his attempt to apply this kind of "adaptive" paradigm (as in "complex adaptive systems") to cognition. Classifier systems added an economic metaphor to the evolutionary one; little bits of thought paid each other for services rendered. The hope was that a complex ecology+economy could develop, solving difficult problems. One of the main design issues for classifier systems is the virtual economy -- that is, the credit assignment algorithm. An early proposal was the bucket-brigade algorithm. Money is given to cognitive procedures which produce good outputs. These procedures pass reward back to the procedures which activated them, who similarly pass reward back in turn. This way, the economy supports chains of useful procedures. Unfortunately, the bucket-brigade algorithm was vulnerable to parasites. Malign cognitive procedures could gain wealth by activating useful procedures without really contributing anything. This problem proved difficult to solve. Taking the economy analogy seriously, we might want cognitive procedures to decide intelligently who to pay for services. But, these are supposed to be itty bitty fragments of our thought process. Deciding how to pass along credit is a very complex task. Hence the need for a pre-specified solution such as bucke... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Recent Progress in the Theory of Neural Networks by interstice</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">19:10</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3096994/media/a8f45280b0222f1ab166663c702df66e_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909552</span> <span class="episode_guid hidden" type="id">e61f22da-fadd-4e91-b4dd-dcf7ce91fe3e</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Recent Progress in the Theory of Neural Networks, published by interstice on the AI Alignment Forum. It's common wisdom that neural networks are basically "matrix multiplications that nobody understands" , impenetrable to theoretical analysis, which have achieved great results largely through trial-and-error. While this may have been true in the past, recently there has been significant progress towards developing a theoretical understanding of neural networks. Most notably, we have obtained an arguably complete understanding of network initialization and training dynamics in a certain infinite-width limit. There has also been some progress towards understanding their generalization behavior. In this post I will review some of this recent progress and discuss the potential relevance to AI alignment. Infinite Width Nets: Initialization The most exciting recent developments in the theory of neural networks have focused the infinite-width limit. We consider neural networks where the number of neurons in all hidden layers are increased to infinity. Typically we consider networks with a Gaussian-initialized weights, and scale the variance at initialization as 1 √ H , where H is the number of hidden units in the preceding layer(this is needed to avoid inputs blowing up, and is also the initialization scheme usually used in real networks). In this limit, we have obtained an essentially complete understanding of both behavior at initialization and training dynamics[1]. (Those with limited interest/knowledge of math may wish to "Significance and Limitations" below). We've actually had a pretty good understanding of the behavior of infinite-width neural networks at initialization for a while, since the work of Radford Neal(1994). He proved that in this limit, fully-connected neural networks with Gaussian-distributed weights and biases limit to what are known as Gaussian processes. Gaussian processes can be thought of the generalization of Gaussian distributions from finite-dimensional spaces to spaces of functions. Neal's paper provides a very clear derivation of this behavior, but I'll explain it briefly here. A neural network with m real-valued inputs and 1 real valued outputs defines a function from R m to R . Thus, a distribution over the weights and biases of such a neural network -- such as the standard Gaussian initialization -- implicitly defines a distribution over functions on R m . Neal's paper shows that, for fully-connected neural networks, this distribution limits to a Gaussian process. What is a Gaussian process? It's a distribution over functions f with the property that, for any finite collection of points X 1 X N , the values f X 1 f X N have a joint distribution which is a multivariate Gaussian. Any Gaussian process is uniquely defined by its mean and covariance functions, μ x and C x x ′ . For points X 1 X N , the distribution of f X 1 f X N will have mean μ X 1 μ X N with covariance matrix C i j C X i X j The argument that fully-connected neural networks limit to Gaussian processes in the infinite-width limit is pretty simple. Consider a three-layer neural network, with an activation function σ in the second layer and a single linear output unit. This network can be defined by the equation y ∑ V k σ ∑ W k j X j . At initialization, V and W are filled with independent Gaussians, with variance of V scaled as the inverse square-root of the number of hidden-units. Each hidden unit h k will has a value for each of the inputs X i h k X i σ ∑ W k j X j i . Since W is random, for each k h k → X is an independent random vector(where we write → X for X 1 X N ). All of these random vectors follow the same distribution, and the output → y f → X of the network is simply the sum of these identical distributions multiplied by the univariate Gaussians V k . By the multidi... </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">Tessellating Hills: a toy model for demons in imperfect search by DaemonicSigil</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">04:18</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3096950/media/8249eca9cd54262198e4bf56c2cd53dc_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909553</span> <span class="episode_guid hidden" type="id">292dcf1c-5ee8-4f44-b216-3ed2ef71f751</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is:Tessellating Hills: a toy model for demons in imperfect search, published by DaemonicSigil on the AI Alignment Forum. If you haven't already, take a look at this post by johnswentworth to understand what this is all about: The short version is that while systems that use perfect search, such as AIXI, have many safety problems, a whole new set of problems arises when we start creating systems that are not perfect searchers. Patterns can form that exploit the imperfect nature of the search function to perpetuate themselves. johnswentworth refers to such patterns as "demons". After reading that post I decided to see if I could observe demon formation in a simple model: gradient descent on a not-too-complicated mathematical function. It turns out that even in this very simplistic case, demon formation can happen. Hopefully this post will give people an example of demon formation where the mechanism is simple and easy to visualize. Model The function we try to minimize using gradient descent is called the loss function. Here it is: L → x − x 0 ϵ n ∑ j 1 x j ⋅ splotch j → x Let me explain what some of the parts of this loss mean. Each function splotch j → x is periodic with period 2π in every component of → x . I decided in this case to make my splotch functions out of a few randomly chosen sine waves added together. ϵ is chosen to be a small number so in any local region, ϵ ∑ n j 1 x j ⋅ splotch j → x will look approximately periodic: A bunch of hills repeating over and over again with period 2π across the landscape. But over large enough distances, the relative weightings of various splotches do change. Travel a distance of 20π in the x 7 direction, and splotch 7 will be a larger component of the repeating pattern than it was before. This allows for selection effects. The − x 0 term means that the vector → x mainly wants to increase its x 0 component. But the splotch functions can also direct its motion. A splotch function might have a kind of ridge that directs some of the x 0 motion into other components. If splotch 7 tends to direct motion in such a way that x 7 , increases, then it will be selected for, becoming stronger and stronger as time goes on. Results I used ordinary gradient descent, with a constant step size, and with a bit of random noise added in. Figure 1 shows the value of x0 as a function of time, while figure 2 shows the values of x1,x2,.x16 as a function of time. Fig 1: Fig 2: There are three phases to the evolution: In the first, x 0 increases steadily, and the other coordinates wander around more or less randomly. In the second phase, a self-reinforcing combination of splotches (a "demon") takes hold and amplifies itself drastically, feeding off the large x 0 gradient. Finally, this demon becomes so strong that the search gets stuck in a local valley and further progress stops. The first phase is more or less from 0 to 2500 steps. The second phase is between 2500 steps and 4000 steps, though slowing down after 3500. The final phase starts at 4000 steps, and likely continues indefinitely. Now that I have seen demons arise in such a simple situation, it makes me wonder how commonly the same thing happens in the training of deep neural networks. Anyways, hopefully this is a useful model for people who want to understand the mechanisms behind the whole "demons in imperfect search" thing more clearly. It definitely helped me, at least. Update: The code is now up here: Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. </div> </td> <td class="text-right">Dec 04, 2021</td> </tr> <tr> <td> <div class="content_episode"> <span class="episode_title">What Failure Looks Like: Distilling the Discussion by Ben Pace</span> <div class="pull-right play_badge" title="Listen"> <span class="mdi mdi-play-circle-outline"></span> <span class="duration">16:00</span> <span class="episode_url hidden" type="0">https://d22tbkdovk5ea2.cloudfront.net/audio/projects/9531/podcasts/3096938/media/003365748b870285a5511b2a4b01cfee_compiled.mp3</span> <span class="episode_img_url hidden" type="img">https://images.podcastrepublic.net/podcast/1608555549hd.jpg</span> <span class="episode_id hidden" type="id">51337909554</span> <span class="episode_guid hidden" type="id">a9e21299-c7a4-4a3b-b4e4-ea6d7ce31951</span> </div> </div> <div class="episode_info"> Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What Failure Looks Like: Distilling the Discussion, published by Ben Pace on the AI Alignment Forum. The comments under a post often contains valuable insights and additions. They are also often very long and involved, and harder to cite than posts themselves. Given this, I was motivated to try to distill some comment sections on LessWrong, in part to start exploring whether we can build some norms and some features to help facilitate this kind of intellectual work more regularly. So this is my attempt to summarise the post and discussion around What Failure Looks Like by Paul Christiano. Epistemic status: I think I did an okay job. I think I probably made the most errors in place where I try to emphasise concrete details more than the original post did. I think the summary of the discussion is much more concise than the original. What Failure Looks Like (Summary) On its default course, our civilization will build very useful and powerful AI systems, and use such systems to run significant parts of society (such as healthcare, legal systems, companies, the military, and more). Similar to how we are dependent on much novel technology such as money and the internet, we will be dependent on AI. The stereotypical AI catastrophe involves a powerful and malicious AI that seems good but suddenly becomes evil and quickly takes over humanity. Such descriptions are often stylised for good story-telling, or emphasise unimportant variables. The post below will concretely lay out two ways that building powerful AI systems may cause an existential catastrophe, if the problem of intent alignment is not solved. This is solely an attempt to describe what failure looks like, not to assign probabilities to such failure or to propose a plan to avoid these failures. There are two failure modes that will be discussed. First, we may increasingly fail to understand how our AI systems work and subsequently what is happening in society. Secondly, we may eventually give these AI systems massive amounts of power despite not understanding their internal reasoning and decision-making algorithms. Due to the massive space of designs we'll be searching through, if we do not understand the AI, this will mean certain AIs will be more power-seeking than expected, and will take adversarial action an