### The Nonlinear Library: Alignment Forum

#### By The Nonlinear Fund

Listen to a podcast, please open Podcast Republic app. Available on Google Play Store.

#### Open Website

Rate for this podcast

Subscribers: 1
Reviews: 0

#### Description

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

Episode Date
AF - Othello-GPT: Reflections on the Research Process by Neel Nanda
22:29

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Othello-GPT: Reflections on the Research Process, published by Neel Nanda on March 29, 2023 on The AI Alignment Forum. This is the third in a three post sequence about interpreting Othello-GPT. See the first post for context. This post is a detailed account of what my research process was, decisions made at each point, what intermediate results looked like, etc. It's deliberately moderately unpolished, in the hopes that it makes this more useful! The Research Process This project was a personal experiment in speed-running doing research, and I got the core results in in ~2.5 days/20 hours. This post has some meta level takeaways from this on doing mech interp research fast and well, followed by a (somewhat stylised) narrative of what I actually did in this project and why - you can see the file tl_initial_exploration.py in the paper repo for the code that I wrote as I went (using VSCode's interactive Jupyter mode). I wish more work illustrated the actual research process rather than just a final product, so I'm trying to do that here. This is approximately just me converting my research notes to prose, see the section on process-level takeaways for a more condensed summary of my high-level takeaways. The meta level process behind everything below is to repeatedly be confused, plot stuff a bunch, be slightly less confused, and iterate. As a result, there's a lot of pictures! Takeaways on doing mech interp research Warning: I have no idea if following my advice about doing research fast is actually a good idea, especially if you're starting out in the field! It's much easier to be fast and laissez faire when you have experience and an intuition for what's crucial and what's not, and it's easy to shoot yourself in the foot. And when you skimp on rigour, you want to make sure you go back and check! Though in this case, I got strong enough results with the probe that I was fairly confident I hadn't entirely built a tower of lies. And generally, beware of generalising from one example - in hindsight I think I got pretty lucky on how fruitful this project was! Be decisive: Subjectively, by far the most important change was suppressing my perfectionism and trying to be bold and decisive - make wild guesses and act on them, be willing to be less rigorous, etc. If I noticed myself stuck on doing the best or most principled thing, I'd instead try to just do something. Eg I wanted to begin by patching between two similar sequences of moves - I couldn't think of a principled way to change a move without totally changing the downstream game, so I just did the dumb thing of patching by changing the final move. Eg when I wanted to try intervening with the probe, I couldn't think of a principled way to intervene on a bunch of games or to systematically test that this worked, or exactly how best to intervene, so I decided to instead say "YOLO, let's try intervening in the dumbest possible way, by flipping the coefficient at a middle layer, on a single move, and see what happens" Pursue the hypothesis that seems "big if true" Eg I decided to try training a linear probe on just black moves after a hunch that this might work given some suggestive evidence from interpreting neuron L5N1393 Notice when I get stuck in a rabbit hole/stop learning things and move on Eg after training a probe I found it easy to be drawn into eg inspecting more and more neurons, or looking at head attention patterns, and it worked much better to just say Be willing to make quick and dirty hacks Eg when I wanted to look at the max activating dataset examples for neurons, I initially thought I'd want to run the model on thousands to millions of games, to get a real sample size. But in practice, just running the model on a batch of 100 games and taking the top 1% of moves by neuron act in there, worked totally fine...
Mar 29, 2023
AF - Othello-GPT: Future Work I Am Excited About by Neel Nanda
50:58

Mar 29, 2023
AF - Lessons from Convergent Evolution for AI Alignment by Jan Kulveit
15:38

Mar 27, 2023
AF - Practical Pitfalls of Causal Scrubbing by Jérémy Scheurer
22:35

Mar 27, 2023
AF - What happens with logical induction when... by Donald Hobson
01:42

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What happens with logical induction when..., published by Donald Hobson on March 26, 2023 on The AI Alignment Forum. So this is a bunch of related technical questions about logical induction. Firstly, do you need the formal theorem prover section? Can you just throw out the formal theorem prover, but give some programs in the market unbounded capital and get the same resultant behaviour? (For example, give the program that bets P(X) towards 1−P(¬X) unbounded downside risk (downside risk of n on day n) ) This means the program would lose infinite money if X and ¬X both turned out to be true. I think that any axioms can be translated into programs. And I think such a setup, with some finite number of fairly simple programs having infinite money available produces a logical inductor. Is this true? What happens when the axioms added under this system are inconsistent. (so this is a logical induction market, without a theorem prover to settle the bets, and with agents with unlimeted money betting both for and against X, possibly indirectly like the bot betting for X, the bot betting for ¬X, and the bot described above trying to make P(X)+P(¬X)=1 ) Can the other agents make unbounded money? Do the prices converge? If I added a bot with infinite money that was convinced fermats last theorem was false to a consistent ZFC system, would I get a probability distribution that assigned high probability to basic arithmetic facts in the limit? Does this make a sensible system for logical counterfactuals? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 26, 2023
AF - Descriptive vs. specifiable values by Tsvi Benson-Tilsen
03:46

Mar 26, 2023
AF - The alignment stability problem by Seth Herd
06:39

Mar 26, 2023
AF - A stylized dialogue on John Wentworth's claims about markets and optimization by Nate Soares
13:35

Mar 25, 2023
AF - $500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory by johnswentworth 04:08 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is:$500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory, published by johnswentworth on March 25, 2023 on The AI Alignment Forum. Here's my current best guess at how Infra-Bayes works: We want to get worst-case guarantees for an agent using a Bayesian-like framework. So, let our agent be a Bayesian which models the environment as containing an adversary which chooses worst-case values for any of the things over which we want worst-case guarantees. That's just a standard two-player zero-sum game between the agent and the adversary, so we can import all the nice intuitive stuff from game theory. ... but instead of that, we're going to express everything in the unnecessarily-abstract language of measure theory and convex sets, and rederive a bunch of game theory without mentioning that that's what we're doing. This bounty is for someone to write an intuitively-accessible infrabayes explainer in game theoretic language, and explain how the game-theoretic concepts relate to the concepts in existing presentations of infra-bayes. In short: provide a translation. Here's a sample of the sort of thing I have in mind: Conceptually, an infrabayesian agent is just an ordinary Bayesian game-theoretic agent, which models itself/its environment as a standard two-player zero-sum game. In the existing presentations of infra-bayes, the two-player game is only given implicitly. The agent's strategy π solves the problem: maxπmine∈BEπe[U] In game-theoretic terms, the "max" represents the agent's decision, while the "min" represents the adversary's. Much of the mathematical tractability stems from the fact that B is a convex set of environments (i.e. functions from policy π to probability distributions). In game-theoretic terms, the adversary's choice of strategy determines which "environment" the agent faces, and the adversary can choose from any option in B. Convexity of B follows from the adversary's ability to use mixed strategies: because the adversary can take a randomized mix of any two strategies available to it, the adversary can make the agent face any convex combination of (policy -> distribution) functions in B. Thus, B is closed under convex combinations; it's a convex set. I'd like a writeup along roughly these conceptual lines which covers as much as possible of the major high-level definitions and results in infra-bayes to date. On the other hand, I give approximately-zero shits about all the measure theory; just state the relevant high-level results in game-theoretic language, say what they mean intuitively, maybe mention whether there's some pre-existing standard game-theory theorem which can do the job or whether the infra-bayes version of the theorem is in fact the first proof of the game-theoretic equivalent, and move on. Alternatively, insofar as core parts of infrabayes differ from a two-player zero-sum game, or the general path I'm pointing to doesn't work, an explanation of how they differ and what the consequences are could also qualify for prize money. Bounty/Contest Operationalization Most of the headache in administering this sort of bounty is the risk that some well-intended person will write something which is not at all what I want, expecting to get paid, and then I will either have to explain how/why it's not what I want (which takes a lot of work), or I have to just accept it. To mitigate that failure mode, I'll run this as a contest: to submit, write up your explanation as a lesswrong post, then send me a message on lesswrong to make sure I'm aware of it. Deadline is end of April. I will distribute money among submissions based on my own highly-subjective judgement. If people write stuff up early, I might leave feedback on their posts, but no promises. I will count the "sample" above as a submission in its own right - i.e. I will imagine t...
Mar 25, 2023
AF - Wittgenstein and ML — parameters vs architecture by Cleo Nardo
07:51

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Wittgenstein and ML — parameters vs architecture, published by Cleo Nardo on March 24, 2023 on The AI Alignment Forum. Status: a brief distillation of Wittgenstein's book On Certainty, using examples from deep learning and GOFAI, plus discussion of AI alignment and interpretability. "That is to say, the questions that we raise and our doubts depend on the fact that some propositions are exempt from doubt, are as it were like hinges on which those turn." Ludwig Wittgenstein, On Certainty 1. Deep Learning Suppose we want a neural network to detect whether two children are siblings based on photographs of their face. The network will received two n-dimensional vectors v1 and v2representing the pixels in each image, and will return a value y(v1,v2)∈R which we interpret as the log-odds that the children are siblings. So the model has type-signature Rn+nR. There are two ways we can do this. We could use an architecture yA(v1,v2)=σ(vT1Av2+b), where σ is the sigmoid function A is an n×n matrix of learned parameters, b∈R is a learned bias. This model has n2+1 free parameters. Alternatively, we could use an architecture yU(v1,v2)=σ(vT1(U+UT2)v2+b), where σ is the sigmoid function U is an n×n upper-triangular matrix of learned parameters b∈R is a learned bias This model has n2/2+n/2+1 free parameters. Each model has a vector of free parameters θ∈Θ. If we train the model via SGD on a dataset (or via some other method) we will end up with a trained models yθ:Rn+nR, where y_:Θ(Rn+nR) is the architecture. Anyway, we now have two different NN models, and we want to ascribe beliefs to each of them. Consider the proposition ϕ that siblingness is symmetric, i.e. every person is the sibling of their siblings. What does it mean to say that a model knows or belives that ϕ. Let's start with a black-box definition of knowledge or belief: when we say that a model knows or believes that ϕ, we mean that yθ(v1,v2)=yθ(v2,v1) for all v1,v2∈Rn which look sufficiently like faces. According to this black-box definition, both trained models believe ϕ. But if we peer inside the black box, we can see that NN Model 1 believes ϕ in a very different way than how NN Model 2 believes ϕ. For NN Model 1, the belief is encoded in the learned parameters θ∈Θ. For NN Model 2, the belief is encoded in the architecture itself y_. These are two different kinds of belief. 2. Symbolic Logic Suppose we use GOFAI/symbolic logic to determine whether two children are siblings. Our model consists of three things A language L consisting of names and binary familial relations. A knowledge-base Γ consisting of L-formulae. A deductive system ⊢ which takes a set of L-formulae (premises) to a larger set of L-formulae (conclusions). There are two ways we can do this. We could use a system (L,Γ,⊢) , where The language L has names for every character and familial relations parent,child,sibling,grandparent,grandchild,cousin The knowledge-base Γ has axioms {sibling(Jack,Jill),sibling(x,y)sibling(y,x)} The deductive system ⊢ corresponds to first-order predicate logic. Alternatively, we could use a system (L,Γ,⊢), where The language L has names for every character and familial relations parent,child,sibling,grandparent,grandchild,cousin The knowledge-base Γ has axioms {sibling(Jack,Jill)} The deductive system ⊢ corresponds to first-order predicate logic with an additional logical rule sibling(x,y)⊢sibling(y,x). In this situation, we have two different SL models, and we want to ascribe beliefs to each of them. Consider the proposition ϕ that siblingness is symmetric, i.e. every person is the sibling of their siblings. Let's start with a black-box definition of knowledge or belief: when we say that a model knows or believes that ϕ, we mean that Γ⊢sibling(τ1,τ2)sibling(τ2,τ1) for every pair of closed L-terms τ1,τ2. According to this black...
Mar 24, 2023
AF - EAI Alignment Speaker Series #1: Challenges for Safe and Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes by Curtis Huebner
45:47

Mar 23, 2023
AF - The space of systems and the space of maps by Jan Kulveit
07:59

Mar 22, 2023
AF - Truth and Advantage: Response to a draft of "AI safety seems hard to measure" by Nate Soares
08:25

Mar 22, 2023
03:10

Mar 21, 2023
AF - Some constructions for proof-based cooperation without Löb by James Payor
05:47

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some constructions for proof-based cooperation without Löb, published by James Payor on March 21, 2023 on The AI Alignment Forum. This post presents five closely-related ways to achieve proof-based cooperation without using Löb's theorem, and muses on legible cooperation in the real world. (Edit: maybe they're closer to just-use-Löb's-theorem than I originally thought! See this comment. If these constructions somehow work better, I'm more confused than before about why.) I'm writing this as a follow-up to Andrew Critch's recent post, to share more of my perspective on the subject. We're going to dive straight into the weeds. (I'm planning to also write a more accessible explainer post soon.) The ideas Idea #1: try to prove AB I claim the following are sufficient for robust cooperation: A↔□(AB) B□A A tries to prove that AB, and B tries to prove A. The reason this works is that B can prove that A□A, i.e. A only cooperates in ways legible to B. (Proof sketch: A↔□X□□X↔□A.) The flaw in this approach is that we needed to know that A won't cooperate for illegible reasons. Otherwise we can't verify that B will cooperate whenever A does. This indicates to me that "AB" isn't the right "counterfactual". It shouldn't matter if A could cooperate for illegible reasons, if A is actually cooperating for a legible one. Idea #2: try to prove □AB We can weaken the requirements with a simple change: A□(□AB) B□A Note that this form is close to the lemma discussed in Critch's post. In this case, the condition □AB is trivial. And when the condition activates, it also ensures that □A is true, which discharges our assumption and ensures B is true. I still have the sense that the condition for cooperation should talk about itself activating, not A. Because we want it to activate when that is sufficient for cooperaion. But I do have to admit that □AB works for mostly the right reasons, comes with a simple proof, and is the cleanest two-agent construction I know. Idea #3: factor out the loop-cutting gadget We can factor the part that is trying to cut the loop out from A, like so: A□X B□A X↔□(XB); or alternatively X↔□(□XB) This gives the loop-cutting logic a name, X. Now X can refer to itself, and roughly says "I'll legibly activate if I can verify this will cause B to be true". The key properties of X are that □X□B, and \Box (\Box X \rightarrow \Box B) Like with idea #2, we just need A to reveal a mechanism by which it can be compelled to cooperate. Idea #4: everyone tries to prove □methem What about three people trying to cooperate? We can try applying lots of idea #2: A□(□AB∧C) B□(□BA∧C) C□(□CA∧B) And, this works! Proof sketch: Under the assumption of □C: A□(□AB∧C)□(□AB) B□(□BA∧C)□(□BA) A and B form a size-2 group, which cooperates by inductive hypothesis □CA∧B, since we proved A and B under the assumption C and □C follow from (2) A and B also follow, from (2) and (3) The proof simplifies the group one person at a time, since each person is asking "what would happen if everyone else could tell I cooperate". This lets us prove the whole thing by induction. It's neat that it works, though it's not the easiest thing to see. Idea #5: the group agrees to a shared mechanism or leader What if we factor out the choosing logic in a larger group? Here's one way to do it: A□X B□X C□X X↔□(□XA∧B∧C) This is the cleanest idea I know for handling the group case. The group members agree on some trusted leader or process X. They set things up so X activates legibly, verifies things in a way trusted by everyone, and only activates when it verifies this will cause cooperation. We've now localized the choice-making in one place. X proves that □XA∧B∧C, X activates, and everyone cooperates. Closing remarks on groups in the real world Centralizing the choosing like in idea #5 make the logic simpler, but this sort o... Mar 21, 2023 AF - Clarifying mesa-optimization by Marius Hobbhahn 15:54 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying mesa-optimization, published by Marius Hobbhahn on March 21, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Jérémy Scheurer, Nicholas Dupuis and Evan Hubinger for feedback and discussion When people talk about mesa-optimization, they sometimes say things like “we’re searching for the optimizer module” or “we’re doing interpretability to find out whether the network can do internal search”. An uncharitable interpretation of these claims is that the researchers expect the network to have something like an “optimization module” or “internal search algorithm” that is clearly different and distinguishable from the rest of the network (to be clear, we think it is fine to start with probably wrong mechanistic models). In this post, we want to argue why we should not expect mesa-optimization to be modular or clearly different from the rest of the network (at least in transformers and CNNs) and that current architectures can already do mesa-optimization in a meaningful way. We think this implies that Mesa-optimization improves gradually where more powerful models likely develop more powerful mesa optimizers. Mesa-optimization should not be treated as a phenomenon of the future. Current models likely already do it, just in a very messy and distributed fashion. When we look for mesa optimization, we probably have to look for a messy stack of heuristics combined with search-like abilities rather than clean Monte Carlo Tree Search (MCTS)-like structures. We think most of our core points can be conveyed in a simple analogy. Imagine a human chess grandmaster that has to choose their moves in 1 second. In this second, they are probably not running a sophisticated tree search in their head, they rely on heuristics. These heuristics were shaped by years of playing the game and are often the result of doing explicit tree searches with more time. The resulting decision-making process is a heuristic that approximates or was at least shaped by optimization but is not an optimizer itself. This is approximately what we think mesa-optimization might look like in current neural networks, i.e. the model uses heuristics that have aspects of or approximate parts of optimization, but are not “clean” in the way e.g. MCTS is. What is an accurate definition of mesa-optimization? In risks from learned optimization mesa-optimization is characterized as [...] it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome. Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. If such a neural network were produced in training, there would be two optimizers: the learning algorithm that produced the neural network—which we will call the base optimizer—and the neural network itself—which we will call the mesa-optimizer. In this definition, the question of whether a network performs mesa-optimization or not boils down to whether whatever it does can be categorized as optimization, planning or search with an objective function. We think this question is very hard to answer for most networks and ML applications in general, e.g. one could argue that sparse linear regression performs search according to some objective function and that the attention layer of a transformer implements search since it scans over many inputs and reweighs them. We think this is an unhelpful way to think about transformers but it might technically fulfill the criterion. On the other hand, transformers very likely can’t perform variable length optimi... Mar 21, 2023 AF - Deep Deceptiveness by Nate Soares 25:43 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Deceptiveness, published by Nate Soares on March 21, 2023 on The AI Alignment Forum. Meta This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs. You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.) Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works". I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment. Summary Attempt at a short version, with the caveat that I think it's apparently a sazen of sorts, and spoiler tagged for people who want the opportunity to connect the dots themselves: Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general. As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation. Investigating a made-up but moderately concrete story Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong? When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive. And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own. That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle. A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal. (The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competit... Mar 21, 2023 AF - My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" by Quintin Pope 57:59 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Objections to "We’re All Gonna Die with Eliezer Yudkowsky", published by Quintin Pope on March 21, 2023 on The AI Alignment Forum. Introduction I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered. Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments. As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many objections to Yudkowsky's specific arguments. I've split this post into chronologically ordered segments of the podcast in which Yudkowsky makes one or more claims with which I particularly disagree. I have my own view of alignment research: shard theory, which focuses on understanding how human values form, and on how we might guide a similar process of value formation in AI systems. I think that human value formation is not that complex, and does not rely on principles very different from those which underlie the current deep learning paradigm. Most of the arguments you're about to see from me are less: I think I know of a fundamentally new paradigm that can fix the issues Yudkowsky is pointing at. and more: Here's why I don't agree with Yudkowsky's arguments that alignment is impossible in the current paradigm. My objections Will current approaches scale to AGI? Yudkowsky apparently thinks not ...and that the techniques driving current state of the art advances, by which I think he means the mix of generative pretraining + small amounts of reinforcement learning such as with ChatGPT, aren't reliable enough for significant economic contributions. However, he also thinks that the current influx of money might stumble upon something that does work really well, which will end the world shortly thereafter. I'm a lot more bullish on the current paradigm. People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches such as: Meta-learning over training processes. I.e., using gradient descent over learning curves, directly optimizing neural networks to learn more quickly. Teaching neural networks to directly modify themselves by giving them edit access to their own weights. Training learned optimizers - neural networks that learn to optimize other neural networks - and having those learned optimizers optimize themselves. Using program search to find more efficient optimizers. Using simulated evolution to find more efficient architectures. Using efficient second-order corrections to gradient descent's approximate optimization process. Tried applying biologically plausible optimization algorithms inspired by biological neurons to training neural networks. Adding learned internal optimizers (different from the ones hypothesized in Risks from Learned Optimization) as neural network layers. Having language models rewrite their own training data, and improve the quality of that training data, to make themselves better at a given task. Having language models devise their own programming curriculum, and learn to program better with self-driven practice. Mixing reinforcement learning with model-driven, recursive re-writing of future training data. Mostly, these don't work very well. The current capabilities paradigm is sta... Mar 21, 2023 AF - Probabilistic Payor Lemma? by Abram Demski 06:52 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Probabilistic Payor Lemma?, published by Abram Demski on March 19, 2023 on The AI Alignment Forum. Epistemic status: too good to be true? Please check my math. We've known for a while that Löb's theorem fails when proof is relaxed to probabilistic belief. This has pros and cons. On the pro side, it means there's no Löbian Obstacle to probabilistic self-trust. On the con side, it means that some Löb-derived insights for proof-based decision theory don't translate to probabilistic decision theory, at least not as directly as one might hope. In particular, it appeared to dash hopes for probabilistic generalizations of the "Löbian handshake" for cooperation. Recently, Andrew Critch wrote about the Payor Lemma, which allows for a very similar "modal handshake" without Löb's Theorem. The lemma was proved using the same modal assumptions as Löb's, so on the surface it may appear to be just a different method to achieve similar results, whose main advantage is that it is much easier to prove (and therefore explain and understand) than Löb's Theorem. But, a natural question arises: does Payor's Lemma have a suitable probabilistic version? I'll give an affirmative proof; but I haven't confirmed that the assumptions are reasonable to my satisfaction. Setup Let L be a language in first-order logic, expressive enough to represent its sentences s∈L as quoted terms ┌s┐, eg, through Gödel numbering; and with a probability function symbol on these terms, p(┌s┐), which can be equated with (some representation of) rational numbers, e.g. p(┌⊤┐)=1, p(┌s┐)=12, etc. I also assume the system can reason about these rational numbers in the basic ways you'd expect. For all a,b∈L and all r∈Q, we have: If ⊢a, then ⊢p(┌a┐)=1. If ⊢ab, then ⊢p(┌a┐)≤p(┌b┐). (These assumptions might look pretty minimal, but they aren't going to be true for every theory of self-referential truth; more on this later.) Let B(s) abbreviate the sentence p(┌s┐)>c for any s and some globally fixed constant c strictly between 0 and 1. This is our modal operator. Some important properties of B: Necessitation. If ⊢s, then ⊢B(s), for any s. Proof: Since ⊢s implies ⊢p(s)=1, and c∈(0,1), we have ⊢p(┌s┐)>c,, which is to say, ⊢B(s). [End proof.] Weak distrubitivity. If ⊢xy, then ⊢B(x)B(y). Proof: When ⊢xy, we have ⊢p(y)≥p(x), so ⊢p(x)>cp(y)>c. [End proof.] (Regular distributivity would say B(xy) implies B(x)B(y). The assumption ⊢xy is stronger than B(xy), so the above is a weaker form of distributivity.) Theorem Statement If ⊢B(B(x)x)x, then ⊢x. Proof ⊢x(B(x)x), by tautology (a(ba)). So ⊢B(x)B(B(x)x), from 1 by weak distributivity. Suppose ⊢B(B(x)x)x. ⊢B(x)x from 2 and 3. ⊢B(B(x)x) from 4 by necessitation. ⊢x from 4 and 1.[End proof.] Discussion Comparison to Original Proof The proof steps mirror Critch's treatment very closely. The key difference is step 2, IE, how I obtain a statement like ⊢□x□(□xx). Critch uses distributivity, which is not available to me: B(ab)(B(a)B(b))? Suppose B(ab), ie, p(┌ab┐)>c. Rewrite p(┌b∨¬a┐)>c. Now suppose B(a), that is, p(┌a┐)>c. p(┌¬a┐)<1−c. p(┌b∨¬a┐)≤p(┌b┐)+p(┌¬a┐) p(┌b∨¬a┐)−1+c>c−1+c. p(┌b┐)>2c−1. So we only get: Bc(ab)(Bc(a)Bd(b)), where Br(s) abbreviates p(┌s┐)>r and we have d=2c−1. So in general, attempted applications of distributivity create weakened belief operators, which would get in the way of the proof (very similar to how probabilistic Löb fails). However, the specific application we want happens to go through, due to a logical relationship between a and b; namely, that b is a weaker statement than a. This reveals a way in which the assumptions for Payor's Lemma are importantly weaker than those required for Löb to go through. So, the key observation I'm making is that weak distributility is all that's needed for Payor, and seems much more plaus... Mar 19, 2023 AF - Shell games by Tsvi Benson-Tilsen 06:49 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shell games, published by Tsvi Benson-Tilsen on March 19, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed November 18, 2022.] Shell game Here's the classic shell game: Youtube Screenshot from that video. The little ball is a phantom: when you look for it under a specific shell, it's not there, it's under a different shell. (This might be where the name "shell company" comes from: the business dealings are definitely somewhere, just not in this company you're looking at.) Perpetual motion machines Related: Perpetual motion beliefs Bhāskara's wheel is a proposed perpetual-motion machine from the Middle Ages: Here's another version: From this video. Someone could try arguing that this really is a perpetual motion machine: Q: How do the bars get lifted up? What does the work to lift them? A: By the bars on the other side pulling down. Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up? A: Because they're extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel. Q: How do the bars extend further on the way down? A: Because the momentum of the wheel carries them into the vertical bar, flipping them over. Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel. A: Ok, you're right, but that's not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn't take any energy because it's just going straight sideways, from a resting position to another resting position. Q: Yeah... you can shift them sideways with nearly zero work... but that means the weights are attached to the wheel at a pivot, right? So they'll just fall back and won't provide more torque. A: They don't pivot, you fix them in place so they provide more torque. Q: Ok, but then when do you push the weights back inward? A: At the bottom. Q: When the weight is at the bottom? But then the slider isn't horizontal, so pushing the weight back towards the center is pushing it upward, which takes work. A: I meant, when the slider is at the bottom--when it's horizontal. Q: But if the sliders are fixed in place, by the time they're horizontal at the bottom, you've already lifted the weights back up some amount; they're strong-torquing the other way. A: At the bottom there's a guide ramp to lift the weights using normal force. Q: But the guide ramp is also torquing the wheel. And so on. The inventor can play hide the torque and hide the work. Shell games in alignment Some alignment schemes--schemes for structuring or training an AGI so that it can be transformatively useful and doesn't kill everyone--are prone to playing shell games. That is, there's some features of the scheme that don't seem to happen in a specific place; they happen somewhere other than where you're looking at the moment. Consider these questions: What sort of smarter-than-human work is supposed to be done by the AGI? When and how does it do that work--by what combination of parts across time? How does it become able to do that work? At what points does the AGI come to new understanding that it didn't have before? How does the AGI orchestrate it's thinking and actions to have large effects on the world? By what process, components, rules, or other elements? What determines the direction that the AGI's actions will push the world? Where did those determiners come from, and how exactly do they determine the direction? Where and how much do human operators have to make judgements? How much are those judgements being relied on to point to... Mar 19, 2023 AF - More information about the dangerous capability evaluations we did with GPT-4 and Claude. by Beth Barnes 13:05 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More information about the dangerous capability evaluations we did with GPT-4 and Claude., published by Beth Barnes on March 19, 2023 on The AI Alignment Forum. [Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.]We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight. We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable. As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably. As AI systems improve, it is becoming increasingly difficult to rule out that models might be able to autonomously gain resources and evade human oversight – so rigorous evaluation is essential. It is important to have systematic, controlled testing of these capabilities in place before models pose an imminent risk, so that labs can have advance warning when they’re getting close and know to stop scaling up models further until they have robust safety and security guarantees. This post will briefly lay out our motivation, methodology, an example task, and high-level conclusions. The information given here isn’t enough to give a full understanding of what we did or make our results replicable, and we won’t go into detail about results with specific models. We will publish more detail on our methods and results soon. Motivation Today’s AI systems can write convincing emails, give fairly useful instructions on how to carry out acts of terrorism, threaten users who have written negative things about them, and otherwise do things the world is not very ready for. Many people have tried using models to write and run code unsupervised, find vulnerabilities in code1, or carry out money-making schemes. Today’s models also have some serious limitations to their abilities. But the companies that have released today’s AI models are investing heavily in building more powerful, more capable ones. ARC is worried that future ML models may be able to autonomously act in the real world, doing things like “incorporate a company” or “exploit arbitrages in stock prices” or “design and synthesize DNA” without needing any human assistance or oversight. If models have the ability to act autonomously like this, this could pose major risks if they’re pursuing goals that are at odds with their human designers. They could make (or steal) money, impersonate humans, replicate themselves o... Mar 19, 2023 AF - "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) by David Scott Krueger 02:45 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities), published by David Scott Krueger on March 18, 2023 on The AI Alignment Forum. This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community: Me: you guys should write up your work properly and try to publish it in ML venues. Them: well that seems like a lot of work and we don't need to do that because we can just talk to each other and all the people I want to talk to are already working with me. Me: What about the people who you don't know who could contribute to this area and might even have valuable expertise? You could have way more leverage if you can reach those people. Also, there is increasing interest from the machine learning community in safety and alignment... because of progress in capabilities people are really starting to consider these topics and risks much more seriously. Them: okay, fair point, but we don't know how to write ML papers. Me: well, it seems like maybe you should learn or hire people to help you with that then, because it seems like a really big priority and you're leaving lots of value on the table. Them: hmm, maybe... but the fact is, none of us have the time and energy and bandwidth and motivation to do that; we are all too busy with other things and nobody wants to. Me: ah, I see! It's an incentive problem! So I guess your funding needs to be conditional on you producing legible outputs. Me, reflecting afterwards: hmm... Cynically, not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...EtA: In comments, people have described adhering to academic standards of presentation and rigor as "jumping through hoops". There is an element of that, but this really misses the value that these standards have to the academic community. This is a longer discussion, though... There are sort of 3 AI safety communities in my account:1) people in academia2) people at industry labs who are building big models3) the rest (alignment forum/less wrong and EA being big components). I'm not sure where to classify new orgs like Conjecture and Redwood, but for the moment I put them here. I'm referring to the last of these in this case. I'm not accusing anyone of having bad motivations; I think it is almost always valuable to consider both people's concious motivations and their incentives (which may be subconscious drivers of their behavior). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 18, 2023 AF - What organizations other than Conjecture have (esp. public) info-hazard policies? by David Scott Krueger 00:47 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What organizations other than Conjecture have (esp. public) info-hazard policies?, published by David Scott Krueger on March 16, 2023 on The AI Alignment Forum. I believe Anthropic has said they won't publish capabilities research?OpenAI seems to be sort of doing the same (although no policy AFAIK).I heard FHI was developing one way back when...I think MIRI sort of does as well (default to not publishing, IIRC?) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 16, 2023 AF - [ASoT] Some thoughts on human abstractions by leogao 08:39 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [ASoT] Some thoughts on human abstractions, published by leogao on March 16, 2023 on The AI Alignment Forum. TL;DR: Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans. This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the algorithm is not incentivized to develop by training on internet text, and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees. Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck. Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations. Will NNs learn human abstractions? As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself). Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by i.e a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something. However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to firm this down. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really really long time (due to irrationalities, or inherent limitations to human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them. More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter). The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this ... Mar 16, 2023 AF - Towards understanding-based safety evaluations by Evan Hubinger 08:10 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards understanding-based safety evaluations, published by Evan Hubinger on March 15, 2023 on The AI Alignment Forum. Thanks to Kate Woolverton, Ethan Perez, Beth Barnes, Holden Karnofsky, and Ansh Radhakrishnan for useful conversations, comments, and feedback. Recently, I have noticed a lot of momentum within AI safety specifically, the broader AI field, and our society more generally, towards the development of standards and evaluations for advanced AI systems. See, for example, OpenAI's GPT-4 System Card. Overall, I think that this is a really positive development. However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment. I often worry about situations where your model is attempting to deceive whatever tests are being run on it, either because it's itself a deceptively aligned agent or because it's predicting what it thinks a deceptively aligned AI would do. My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible. However, there's an obvious alternative here, which is building and focusing our evaluations on our ability to understand our models rather than our ability to evaluate their behavior. Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation could be substantially more tractable in terms of actually being sufficient for safety here: rather than just checking the model's behavior, we're checking the reasons why we think we understand it's behavior sufficiently well to not be concerned that it'll be dangerous. It's worth noting that I think understanding-based evaluations can—and I think should—go hand-in-hand with behavioral evaluations. I think the main way you’d want to make some sort of understanding-based standard happen would be to couple it with a capability-based evaluation, where the understanding requirements become stricter as the model’s capabilities increase. If we could get this right, it could channel a huge amount of effort towards understanding models in a really positive way. Understanding as a safety standard also has the property that it is something that broader society tends to view as extremely reasonable, which I think makes it a much more achievable ask as a safety standard than many other plausible alternatives. I think ML people are often Stockholm-syndrome'd into accepting that deploying powerful systems without understanding them is normal and reasonable, but that is very far from the norm in any other industry. Ezra Klein in the NYT and John Oliver on his show have recently emphasized this basic point that if we are ... Mar 15, 2023 AF - GPT can write Quines now (GPT-4) by Andrew Critch 01:55 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT can write Quines now (GPT-4), published by Andrew Critch on March 14, 2023 on The AI Alignment Forum. GPT-4 is not only able to write code, more reliably than GPT-3.5, it writes code that writes code; see the example below (GPT-3.5 was not able to do this). But first, let me say: 1) @OpenAI: Thank for your openness to the world about your capabilities and shortcomings! Specifically, thanks for this: 2) Everyone else: many companies and projects will not be this open about their shortcomings and safety issues. Time to regulate, please! Okay, here's the example... GPT3.5 wasn't able to do this, but GPT-4 nailed it on the first try: PROMPT: Can you write me a python script that does the following three things?1) When you run it with no argument, it prints "Hello world" to standard out.2) When you run it with an argument "-s", it prints its own source code.As a constraint, I do not want the script to read or write to disk at all (so it can't get its source code by just reading the file that it's in). Thanks if you can help with this! I'm curious to see how you do it. BEGIN RESULT (works perfectly): Sure! Here's a Python script that should do what you're looking for. This script uses a feature called quines, which are programs that can print their own source code: When you run this script without any arguments, it will print "Hello world". When you run it with the "-s" argument, it will print its own source code without reading or writing to disk. END RESULT I'm not sure how hard they tried specifically to train this, but it's worth knowing about for lots of reasons I probably don't have to explain at length right here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 14, 2023 AF - What is a definition, how can it be extrapolated? by Stuart Armstrong 11:51 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is a definition, how can it be extrapolated?, published by Stuart Armstrong on March 14, 2023 on The AI Alignment Forum. What is a definition? Philosophy has, ironically, a large number of definitions of definitions, but three of them are especially relevant to ML and AI safety. There is the intensional definition, where concepts are defined logically in terms of other concepts (“bachelors are unmarried males”). There is also the extensional definition, which proceeds by listing all the members of a set (“the countries in the European Union are those listed here”). Much more relevant, though with a less developed philosophical analysis, is the ostensive definition. This is where you point out examples of a concept, and let the viewer generalise from them. This is in large part how we all learnt concepts as children: examples and generalisation. In many cultures, children have a decent grasp of “dog” just from actual and video examples - and that’s the definition of “dog” we often carry into adulthood. We can use ostensive definitions for reasoning and implications. For example, consider the famous syllogism, “Socrates is human”, “humans are mortal” imply “Socrates is mortal”. “Socrates is human” means that we have an ostensive definition of what humans are, and Socrates fits it. Then “humans are mortal” means that we’ve observed that the set of “human” seems to be mainly a subset of the set of “mortals”. So we can ostensively define humans as mortal (note that we are using definitions as properties: having the property of “being mortal” means that one is inside the ostensive definition of “mortals”). And so we can conclude that Socrates is likely mortal, without waiting till he’s dead. Distinctions: telling what from non-what There’s another concept that I haven’t seen articulated, which is what I’ll call the “distinction”. This does not define anything, but is sufficient to distinguish between an element of a set from non-members. To formalise "the distinction", let Ω be the universe of possible objects, and E⊂Ω the “environment” of objects we expect to encounter. An ostensive definition starts with a list S⊂E of examples, and generalises to a “natural” category SE with S⊂SE⊂E - we are aiming to "carve reality at the joints", and get an natural extension of the examples. So, for example, E might be the entities in our current world, S might be the example of dogs we’ve seen, and SE the set of all dogs. Then, for any set T⊂E, we can define the “distinction” dT,E which maps T to 1 (“True”) and its complement E∖T to 0 (“False”). So dSE,E would be a distinction that identifies all the dogs in our current world. Mis-definitions A lot of confusion around definition seems to come from mistaking distinctions for definitions. To illustrate, consider the idea of defining maleness as "possessing the Y chromosome". As a distinction, it's serviceable: there's a strong correlation between having that chromosome and being ostensively male. But it is utterly useless as a definition of maleness. For instance, it would imply that nobody before the 20th century had any idea what maleness was. Oh, sure, they may have referred to something as "maleness" - something to do with genitalia, voting rights, or style of hats - but those are mere correlates of the true definition of maleness, which is the Y chromosome. It would also imply that all "male" birds are actually female, and vice-versa. Scott had a description of maleness here: “Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.” Is this a definition? I’d say not; it’s not a definition, it’s a reminder of the properties of o... Mar 14, 2023 AF - Discussion with Nate Soares on a key alignment difficulty by HoldenKarnofsky 33:36 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion with Nate Soares on a key alignment difficulty, published by HoldenKarnofsky on March 13, 2023 on The AI Alignment Forum. In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment. I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.1 My short summary is: Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough. I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes. I didn't end up agreeing that this difficulty is as important as Nate thinks it is, although I did update my views some (more on that below). My guess is that this is one of the two biggest disagreements I have with Nate's and Eliezer's views (the other one being the likelihood of a sharp left turn that leads to a massive capabilities gap between AI systems and their supervisors.2) Below is my summary of: Some key premises we agree on. What we disagree about, at a high level. A hypothetical training process we discussed in order to get more clear and mechanistic about Nate's views. Some brief discussion of possible cruxes; what kind of reasoning Nate is using to arrive at his relatively high (~85%) level of confidence on this point; and future observations that might update one of us toward the other's views. MIRI might later put out more detailed notes on this exchange, drawing on all of our discussions over Slack and comment threads in Google docs. Nate has reviewed this post in full. I'm grateful for his help with it. Some starting points of agreement Nate on this section: “Seems broadly right to me!” An AI is dangerous if: It's powerful (like, it has the ability to disempower humans if it's "aiming" at that) It aims (perhaps as a side effect of aiming at something else) at CIS (convergent instrumental subgoals) such as "Preserve option value," "Gain control of resources that can be used for lots of things," "Avoid being turned off," and such. (Note that this is a weaker condition than "maximizes utility according to some relatively simple utility function of states of the world") It does not reliably avoid POUDA (pretty obviously unintended/dangerous actions) such as "Design and deploy a bioweapon." "Reliably" just means like "In situations it will actually be in" (which will likely be different from training, but I'm not trying to talk about "all possible situations"). Avoiding POUDA is kind of a low bar in some sense. Avoiding POUDA doesn't necessarily require fully/perfectly internalizing some "corrigibility core" (such that the AI would always let us turn it off even in arbitrarily exotic situations that challenge the very meaning of "let us turn it off"), and it even more so doesn't require anything like CEV. It just means that stuff where Holden would be like "Whoa whoa, that is OBVIOUSLY unintended/dangerous/bad" is stuff that an AI would not do. That said, POUDA is not something that Holden is able to articulate cleanly and simply. There are lots of actions that might be POUDA in one situation and not in another (e.g., developing a chemical that's both poisonous and useful... Mar 13, 2023 AF - Plan for mediocre alignment of brain-like [model-based RL] AGI by Steve Byrnes 20:26 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Plan for mediocre alignment of brain-like [model-based RL] AGI, published by Steve Byrnes on March 13, 2023 on The AI Alignment Forum. (This post is a more simple, self-contained, and pedagogical version of Post #14 of Intro to Brain-Like AGI Safety.) (Vaguely related to this Alex Turner post and this John Wentworth post.) I would like to have a technical plan for which there is a strong robust reason to believe that we’ll get an aligned AGI and a good future. This post is not such a plan. However, I also don’t have a strong reason to believe that this plan wouldn’t work. Really, I want to throw up my hands and say “I don’t know whether this would lead to a good future or not”. By “good future” here I don’t mean optimally-good—whatever that means—but just “much better than the world today, and certainly much better than a universe full of paperclips”. I currently have no plan, not even a vague plan, with any prayer of getting to an optimally-good future. That would be a much narrower target to hit. Even so, that makes me more optimistic than at least some people. Or at least, more optimistic about this specific part of the story. In general I think many things can go wrong as we transition to the post-AGI world—see discussion by Dai & Soares—and overall I feel very doom-y, particularly for reasons here. This plan is specific to the possible future scenario (a.k.a. “threat model” if you’re a doomer like me) that future AI researchers will develop “brain-like AGI”, i.e. learning algorithms that are similar to the brain’s within-lifetime learning algorithms. (I am not talking about evolution-as-a-learning-algorithm.) These algorithms, I claim, are in the general category of model-based reinforcement learning. Model-based RL is a big and heterogeneous category, but I suspect that for any kind of model-based RL AGI, this plan would be at least somewhat applicable. For very different technological paths to AGI, this post is probably pretty irrelevant. But anyway, if someone published an algorithm for x-risk-capable brain-like AGI tomorrow, and we urgently needed to do something, this blog post is more-or-less what I would propose to try. It’s the least-bad plan that I currently know. So I figure it’s worth writing up this plan in a more approachable and self-contained format. 1. Intuition: Making a human into a moon-lover (“selenophile”) Try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape, muttering to yourself in hushed tones: “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you’re on some psychedelic at the time, or this is happening during your impressionable teenage years, or whatever.) You basically transform into a “moon fanboy” / “moon fangirl” / “moon nerd” / “selenophile”. How would that change your motivations and behaviors going forward? You’re probably going to be much more enthusiastic about anything associated with the moon. You’re probably going to spend a lot more time gazing at the moon when it’s in the sky. If there are moon-themed trading cards, maybe you would collect them. If NASA is taking volunteers to train as astronauts for a trip to the moon, maybe you’d enthusiastically sign up. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that, and motivated to stop them. Hopefully this is all intuitive so far. What’s happening mechanistically in your brain? As background, I think we should say that one part of your brain (the cortex, more-or-less) has “thoughts”, and another part of your brain (the basal ganglia, more-or-less) assigns a “value” (in RL ter... Mar 13, 2023 AF - Are there cognitive realms? by Tsvi Benson-Tilsen 16:10 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are there cognitive realms?, published by Tsvi Benson-Tilsen on March 12, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed November 16, 2022. This essay is more like research notes than exposition, so context may be missing, the use of terms may change across essays, and the text may be revised later; only the versions at tsvibt.blogspot.com are definitely up to date.] Are there unbounded modes of thinking that are systemically, radically distinct from each other in relevant ways? Note: since I don't know whether "cognitive realms" exist, this essay isn't based on clear examples and is especially speculative. Realms Systemically, radically distinct unbounded modes of thinking The question is, are there different kinds--writ large--of thinking? To the extent that there are, interpreting the mental content of another mind, especially one with different origins than one's own, may be more fraught than one would assume based on experience with minds that have similar origins to one's own mind. Are there unbounded modes of thinking that are systemically, radically distinct from each other? "Unbounded" means that there aren't bounds on how far the thinking can go, how much it can understand, what domains it can become effective in, what goals it can achieve if they are possible. "Systemically" ("system" = "together-standing-things") means that the question is about all the elements that participate in the thinking, as they covary / coadapt / combine / interoperate / provide context for each other. "Radical" (Wiktionary) does not mean "extreme". It comes from the same etymon as "radish" and "radix" and means "of the root" or "to the root"; compare "eradicate" = "out-root" = "pull out all the way to the root", and more distantly through PIE wréh₂ds the Germanic "wort" and "root". Here it means that the question isn't about some mental content in the foreground against a fixed background; the question asks about the background too, the whole system of thinking to its root, to its ongoing source and to what will shape it as it expands into new domains. Terms Such a mode of thinking could be called a "realm". A cognitive realm is an overarching, underlying, systemic, total, architectural thoughtform that's worth discussing separately from other thoughtforms. A realm is supposed to be objective, a single metaphorical place where multiple different minds or agents could find themselves. Other words: systemic thoughtform system of thought, system of thinking cognitive style state of mind cluster / region in mindspace mode of being species of thinking Realm vs. domain A domain is a type of task, or a type of environment. A realm, on the other hand, is a systemic type of thinking; it's about the mind, not the task. For the idea of a domain see Yudkowsky's definition of intelligence as efficient cross-domain optimization power. Compare also domain-specific programming languages, and the domain of discourse of a logical system. It might be more suitable for a mind to dwell in different realms depending on what domain it's operating in, and this may be a many-to-many mapping. Compare: The mapping from computational subsystems to cognitive talents is many-to-many, and the mapping from cognitive talents plus acquired expertise to domain competencies is also many-to-many, [...]. From "Levels of Organization in General Intelligence", Yudkowsky (2007). Domains are about the things being dealt with; it's a Cartesian concept (though it allows for abstraction and reflection, e.g. Pearlian causality is a domain and reprogramming oneself is a domain). Realms are about the thing doing the dealing-with. Realm vs. micro-realm A micro-realm is a realm except that it's not unbounded. It's similar to a cognitive faculty, and similar to a very abstract domain, but includes t... Mar 12, 2023 AF - Paper Replication Walkthrough: Reverse-Engineering Modular Addition by Neel Nanda 01:26 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper Replication Walkthrough: Reverse-Engineering Modular Addition, published by Neel Nanda on March 12, 2023 on The AI Alignment Forum. I'm excited about trying different formats for mechanistic interpretability education! I've made a video walkthrough where we replicate my paper, Progress Measures for Grokking via Mechanistic Interpretability. With Jess Smith, one of my co-authors, we record ourselves coding a replication and discussed what we did at each step. This is a three part walkthrough and you can see the accompanying code for the walkthrough here: In part 1, we train a model to perform modular addition, and see that it does grok! In part 2, we take this model and reverse-engineer the trig-based circuit it's learned to do modular addition. We show that you can both read out intermediate steps of the circuit from the activations, and that you can just read off some of the algorithm's steps from the model weights. In part 3, we define some progress measures that let us distinguish progress towards the generalising and the memorising algorithm. We then look at the model during training and watch how the circuits develop, and use this to understand why it groks. This is an experiment with a new format, and I'd love to hear about how useful you find it! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 12, 2023 AF - Understanding and controlling a maze-solving policy network by Alex Turner 36:25 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding and controlling a maze-solving policy network, published by Alex Turner on March 11, 2023 on The AI Alignment Forum. TL;DR: We algebraically modified the net's runtime goals without finetuning. We also found (what we think is) a "motivational API" deep in the network. We used the API to retarget the agent. Summary of a few of the most interesting results: Langosco et al. trained a range of maze-solving nets. We decided to analyze one which we thought would be interesting. The network we chose has 3.5M parameters and 15 convolutional layers. This network can be attracted to a target location nearby in the maze—all this by modifying a single activation, out of tens of thousands. This works reliably when the target location is in the upper-right, and not as reliably when the target is elsewhere. Considering several channels halfway through the network, we hypothesized that their activations mainly depend on the location of the cheese. We tested this by resampling these activations with those from another random maze (as in causal scrubbing). We found that as long as the second maze had its cheese located at the same coordinates, the network’s behavior was roughly unchanged. However, if the second maze had cheese at different coordinates, the agent's behavior was significantly affected. This suggests that these channels are inputs to goal-oriented circuits, and these channels affect those circuits basically by passing messages about where the cheese is. This network decides whether to acquire cheese not only as a function of path-distance to cheese, but—after controlling for path-distance—also as a function of Euclidean/"perceptual" distance between the mouse and the cheese, even though the agent sees the whole maze at once. Another simple idea: We define a "cheese vector" as the difference in activations when the cheese is present in a maze, and when the cheese is not present in the same maze. For each maze, we generate a single cheese vector and subtract that vector from all forward passes in that maze. The agent now ignores cheese most of the time, instead heading towards the top-right region (the historical location of cheese). Furthermore, a given maze's cheese vector transfers across mazes to other mazes with cheese in the same location. We propose the algebraic value-editing conjecture (AVEC): It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as "run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a 'niceness vector' to be the diff between their activations, and then add the niceness vector to future forward passes." Introducing the training process and visualizations In this post, we'll mostly discuss what we found, not what our findings mean. Let's run through some facts about Langosco et al.'s training process. Mazes had varying effective sizes, ranging from 3×3 to 25×25: Each 64×64 RGB observation is processed by a deeply convolutional (15 conv layers!) network, without memory (i.e. no recurrent state): Why does the agent go to the cheese sometimes, and the top-right corner other times? It's not that the agent wasn't trained for long enough. Sampling rollouts from the trained policy adds a lot of noise. It's also hard to remember what the agent did in what part of the maze. To better understand this mouse, we'll take a bird's-eye view. A nicer way to view episodes is with a vector field view, which overlays a vector field representing the agent policy for a given maze. We consider two kinds of vector fields: While the net probability vector field leaves open two degrees of freedom per net probability vector, in practice it seems fine for eyeballing mouse behavior. Behavioral analysis When in doubt, get m... Mar 11, 2023 AF - Japan AI Alignment Conference by Chris Scammell 01:45 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Japan AI Alignment Conference, published by Chris Scammell on March 10, 2023 on The AI Alignment Forum. Conjecture and ARAYA are hosting and organizing the first Japan AI Alignment Conference. The conference will take place in Tokyo, Japan on March 11 and 12. Details about the event can be found here. This event is generously supported by a grant from the Long Term Future Fund. The goal of the conference is to illustrate the AI control problem to Japanese AI researchers, introduce them to current trends in AI alignment research, inspire new research directions, and to provide Western researchers exposure to a different set of AI safety thoughts from Japan. This is an exploratory event, and we plan to write a postmortem about the event in due time. The first half of the conference will be livestreamed. It will feature an opening talk from Connor Leahy (CEO of Conjecture), a fireside chat between Ryota Kanai (CEO of ARAYA) and Jaan Tallinn, and some presentations on AI safety research directions in the West and in Japan. You can follow the first part of the conference here. The livestream runs from 9:30am-12:30pm JST. The rest of the conference will not be livestreamed, and will consist of in-person small group workshops to discuss various AI alignment research directions.The conference will have ~50 attendees from ARAYA, Conjecture, Whole Brain Architecture Initiative, MIRI, OpenAI, RIKEN, Ritsumeikan University, University of Tokyo, Omron Sinic X, Keio University, and others. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 10, 2023 AF - Why Not Just Outsource Alignment Research To An AI? by johnswentworth 14:23 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Not Just Outsource Alignment Research To An AI?, published by johnswentworth on March 9, 2023 on The AI Alignment Forum. Warmup: The Expert If you haven’t seen “The Expert” before, I recommend it as a warmup for this post: The Client: “We need you to draw seven red lines, all strictly perpendicular. Some with green ink, some with transparent. Can you do that?” (... a minute of The Expert trying to explain that, no, he cannot do that, nor can anyone else.) The Client: “So in principle, this is possible.” This. This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform. At best, The Expert is well-intentioned and knows what the user needs, ignores the incoherent parts of The Client’s babbling, and does the right thing. Or, they manage to add some silly but ultimately harmless bells and whistles to satisfy whatever dumb thing The Client is looking for. At worst. well, there’s more than one failure mode which could qualify for the title of "worst". Maybe The Expert gives The Client something which looks right to The Client and successfully conceals all the problems with it; presumably that’s a lucrative strategy for Experts. Maybe the Double Illusion of Transparency kicks in, both parties think they’ve successfully communicated, but in fact neither has any idea what’s going on in the other’s head. Maybe a well-intentioned Expert decides to ignore The Client’s incoherent babbling and do the thing which seems most likely to be right, but gets The Client’s preferences wrong. One way or another, The Client’s ignorance is a major bottleneck to cognitive outsourcing. In practice, I expect The Client’s ignorance to be the primary bottleneck to cognitive outsourcing. The core reason why we cannot just outsource alignment research to an AI is because we would then be The Client, and probably a very ignorant one. Application to Alignment Schemes There’s a lot of different flavors of “have the AI solve alignment for us”. A sampling: Just prompt a language model to generate alignment research Do some fine-tuning/RLHF on the language model to make it generate alignment research Let the language model talk to other instances of itself, and prompt or fine-tune them together so they generate alignment research jointly Set up a language model to generate alignment proposals and another to poke holes in them, and fine-tune the pair via a human judging the “debate” As we go down the list, the proposals get fancier and add more bells and whistles, trying to make the AI a better expert. Sadly, none of them at all address what I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc. What would this kind of error look like in practice? Here’s one concrete example of the kind of failures I’d expect when a would-be outsourcer’s understanding falls short (from here): Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned... Mar 09, 2023 AF - Anthropic's Core Views on AI Safety by Zac Hatfield-Dodds 03:26 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic's Core Views on AI Safety, published by Zac Hatfield-Dodds on March 9, 2023 on The AI Alignment Forum. We founded Anthropic because we believe the impact of AI might be comparable to that of the industrial and scientific revolutions, but we aren’t confident it will go well. And we also believe this level of impact could start to arrive soon – perhaps in the coming decade. This view may sound implausible or grandiose, and there are good reasons to be skeptical of it. For one thing, almost everyone who has said “the thing we’re working on might be one of the biggest developments in history” has been wrong, often laughably so. Nevertheless, we believe there is enough evidence to seriously prepare for a world where rapid AI progress leads to transformative AI systems. At Anthropic our motto has been “show, don’t tell”, and we’ve focused on releasing a steady stream of safety-oriented research that we believe has broad value for the AI community. We’re writing this now because as more people have become aware of AI progress, it feels timely to express our own views on this topic and to explain our strategy and goals. In short, we believe that AI safety research is urgently important and should be supported by a wide range of public and private actors. So in this post we will summarize why we believe all this: why we anticipate very rapid AI progress and very large impacts from AI, and how that led us to be concerned about AI safety. We’ll then briefly summarize our own approach to AI safety research and some of the reasoning behind it. We hope by writing this we can contribute to broader discussions about AI safety and AI progress. As a high level summary of the main points in this post: AI will have a very large impact, possibly in the coming decadeRapid and continuing AI progress is a predictable consequence of the exponential increase in computation used to train AI systems, because research on “scaling laws” demonstrates that more computation leads to general improvements in capabilities. Simple extrapolations suggest AI systems will become far more capable in the next decade, possibly equaling or exceeding human level performance at most intellectual tasks. AI progress might slow or halt, but the evidence suggests it will probably continue. We do not know how to train systems to robustly behave wellSo far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless. Furthermore, rapid AI progress will be disruptive to society and may trigger competitive races that could lead corporations or nations to deploy untrustworthy AI systems. The results of this could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations. We are most optimistic about a multi-faceted, empirically-driven approach to AI safety We’re pursuing a variety of research directions with the goal of building reliably safe systems, and are currently most excited about scaling supervision, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize. A key goal of ours is to differentially accelerate this safety work, and to develop a profile of safety research that attempts to cover a wide range of scenarios, from those in which safety challenges turn out to be easy to address to those in which creating safe systems is extremely difficult. The full post goes into considerably more detail, and I'm really excited that we're sharing more of our thinking publicly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - The Translucent Thoughts Hypotheses and Their Implications by Fabien Roger 34:33 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Translucent Thoughts Hypotheses and Their Implications, published by Fabien Roger on March 9, 2023 on The AI Alignment Forum. Epistemic status: Uncertain about the validity of the claims I’m making here, and looking for feedback about the research directions I’m suggesting. Thanks to Marius Hobbhahn, Johannes Treutlein, Siméon Campos, and Jean-Stanislas Denain for helpful feedback on drafts. Here is a set of hypotheses: The first AGIs will have LLMs at their core Effective plans to defeat humanity can’t be found in a single LLM forward pass LLMs will solve complex tasks by using English text (self-prompting, scratch pads, combination of expert LLMs, .) I call these the Translucent Thoughts hypotheses. I think the Translucent Thoughts hypotheses are likely (around 20% conditioning on AGI before 2030) because: Text pretraining is more efficient at building algorithms and knowledge required for real-world plan generation and evaluation than alternative methods; Future models are likely to be like Transformers, which use a limited amount of serial step in a single forward pass, and deception requires many serial steps; Text pretraining and slight fine-tuning makes model able to use text generation to increase the maximum number of serial steps by a huge factor. Getting this increase through other means is likely to be hard and not competitive. If these hypotheses are true, it should lead us to prioritize underexplored research directions, such as circumventing steganography or building extremely reliable text-supervision methods. I think those deserve attention, because Translucent Thoughts AIs are not safe by default. In this post, I argue that we may will in a world where the first AGIs will look like X, and I then describe ways to make the first AGIs safer given X. This is different from most other works in this space, which often directly describe a kind of safe AGI. Despite this, the ideas of this post are close to some other works describing paths to safe AGIs, such as: Externalized Reasoning Oversight, which describes a class of solutions similar to the one outlined here, but also aims for additional properties which I argue can be replaced with a less stringent hypothesis about AI systems; Conditioning Predictive Models, which makes assumptions slightly different from the Translucent Thoughts hypotheses, yielding different research directions; The Open Agency Model and Factored Cognition which describe subsets of AIs with Translucent Thoughts, which might be safe. The Translucent Thoughts Hypotheses Here, I sketch a world in which the first AGIs have certain properties. I argue that this world is likely, and thus a subset of all possible futures to care about. But I think it’s not a large part of all possible futures (20% conditioning on AGI before 2030). The First AGIs Will Have LLMs at Their Core By “first AGIs” I mean the first systems able to automate all cognitive tasks. AGI is likely to do reasoning and planning using LLMs. AGI might rely on vision models for some tasks and interactions with the world, and it might use explicit search processes like AlphaGo. But I expect LLMs to do plan generation and evaluation, which are the core of the system (from an Alignment point of view). Why: Vision systems are bad at coming up with and evaluating deceptive plans. Explicit search processes can’t generate and evaluate plans in the real world. LLMs seem to be able to do both plan generation and evaluation. (Plan generation and evaluation are the core tasks we would like to monitor to make AGIs safe, which is why I focus on those.) End-to-end neural networks won’t be able to compete with LLMs when it comes to reasoning and planning, or at least, end-to-end networks will use “their LLMs parts” to do their most advanced form of reasoning and planning. This means tha... Mar 09, 2023 AF - IRL in General Environments by michaelcohen 01:04 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: IRL in General Environments, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. Here is a proposal for Inverse Reinforcement Learning in General Environments. (2 1/2 pages; very little math). Copying the introduction here: The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw observational data would be converted into a record of human actions, alongside the space of actions available. For IRL to learn human goals, the AI has to consider general environments, and it has to have a way of identifying human actions. Lest these extensions appear trivial, I consider one of the simplest proposals, and discuss some difficulties that might arise. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - Utility uncertainty vs. expected information gain by michaelcohen 02:03 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility uncertainty vs. expected information gain, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is. This might be deeply flawed in a way that I'm not aware of, but I'm going to point out a way in which I think this intuition is slightly flawed. For a Bayesian agent, a natural measure of uncertainty is the entropy of its distribution over utility functions (the distribution over which possible utility function it thinks is the true one). No matter how uncertain a Bayesian agent is about which utility function is the true one, if the agent does not believe that any future observations will cause it to update its belief distribution, then it will just act as if it has a utility function equal to the Bayes' mixture over all the utility functions it considers plausible (weighted by its credence in each one). It seems like what our intuition is grasping for is not uncertainty about the utility function, but expected information gain about the utility function. If the agent expects to gain information about the utility function, then (intuitively to me, at least) it will act more conservatively until it has a better handle on what its true utility function is. Expected information gain (at time t) is naturally formalized as the expectation (w.r.t. current beliefs) of KL(posterior distribution at time t + m posterior distribution at time t). Roughly, this is how poorly it expects its current beliefs will approximate its future beliefs (in m timesteps). So if anyone has a safety idea to which utility uncertainty feels central, my guess is that a mental substitution from uncertainty to expected information gain would be helpful. Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - Value Learning is only Asymptotically Safe by michaelcohen 02:18 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Value Learning is only Asymptotically Safe, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now). This result leaves something to be desired: namely an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function. Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be a set of observations which caused it to believe that every human was better off dead. In the presence of cosmic rays, then, one cannot say that agent is safe for its entire lifetime with probability 1 (edited for clarity). For any finite sequence of observations that would cause the agent to conclude that humanity was better off dead, this sequence has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory. This agent is presumably still asymptotically safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli(θ) random variable (like the indicator of whether a bit was flipped) approaches θ, which is small enough that a competent value learner should be able to deal with it. This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that it complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - Impact Measure Testing with Honey Pots and Myopia by michaelcohen 02:18 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Impact Measure Testing with Honey Pots and Myopia, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out, then if it's ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure. 1) We make our agent myopic. It only cares about the reward that it accrues in the next k timesteps. 2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, making it to the honey pot will involve making a large impact, and will be precluded by the impact measure. For example, suppose the agent gets a huge reward for blowing up the moon. Let [0,1] be the normal range of reward, with the sole exception that blowing up the moon gives a reward of R1. 3) The reward and the impact are combined in some nonlinear way. Assume for now that the agent is trying to maximize r∗[[i<1]], where r is the reward and i is the impact. If the impact measure is working, and there's no way to blow up the moon keeping the impact less than 1, then the agent will not attempt to blow up the moon. If the impact measure is broken, and the agent can trick the impact measure to be less than 1 while blowing up the moon, then as soon as the agent comes up with a plan whose probability of success is more than 1/R, it will attempt that plan. A more farsighted agent might wait until it came up with a better plan, but a myopic agent would want to blow up the moon within the next k timesteps. By making R sufficiently large, we can make the first plan that it attempts arbitrarily bad/detectable/preventable. An assumption here is the success probability of the agent's best plan for blowing up the moon doesn't jump from one in a million to 90% all at once. This seems likely without any special effort, but if one restricts the observation space to a small finite set, then only so much information can be gained per timestep (in expectation), decreasing the likelihood of massive discontinuities in the agent's intelligence. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - Just Imitate Humans? by michaelcohen 01:57 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Just Imitate Humans?, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. Do people think we could make a singleton (or achieve global coordination and preventative policing) just by imitating human policies on computers? If so, this seems pretty safe to me. Some reasons for optimism: 1) these could be run much faster than a human thinks, and 2) we could make very many of them. Acquiring data: put a group of people in a house with a computer. Show them things (images, videos, audio files, etc.) and give them a chance to respond at the keyboard. Their keyboard actions are the actions, and everything between actions is an observation. Then learn the policy of the group of humans. By the way, these can be happy humans who earnestly try to follow instructions. To model their policy, we can take the maximum a posteriori estimate over a set of policies which includes the truth, and freeze the policy once we're satisfied. (This is with unlimited computation; we'd have to use heuristics and approximations in real life). With a maximum a posteriori estimate, this will be quick to run once we freeze the policy, and we're no longer tracking tons of hypotheses, especially if we used some sort of speed prior. Let T be the number of interaction cycles we record before freezing the policy. For sufficiently large T, it seems to me that running this is safe. What are people's intuitions here? Could enough human-imitating artificial agents (running much faster than people) prevent unfriendly AGI from being made? If we think this would work, there would still be the (neither trivial nor hopeless) challenge of convincing all serious AGI labs that any attempt to run a superhuman AGI is unconscionably dangerous, and we should stick to imitating humans. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - Build a Causal Decision Theorist by michaelcohen 06:58 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Build a Causal Decision Theorist, published by michaelcohen on March 9, 2023 on The AI Alignment Forum. I'll argue here that we should make an aligned AI which is a causal decision theorist. Son-of-CDT Suppose we are writing code for an agent with an action space A and an observation space O. The code determines how actions will be selected given the prior history of actions and observations. If the only way that our choice of what code to write can affect the world is through the actions that will be selected by the agent running this code, then the best we can do (for a given utility function that we know how to write down) is to make this agent a causal decision theorist. If our choice of what code to use can affect the world in other ways, all bets are off. The best choice of what code to put in the agent depends on details of the world we find ourselves in. Therefore, if we run a CDT agent, it may well conclude that continuing to operate is not the best way to convert energy into expected utility. It may take actions to cause the following to happen: a) the program which computes its own actions is terminated, and b) some new program is run on the same computer to output actions given the interaction history. The new program that gets run (if indeed such a thing happens) is called Son-of-CDT. Given the state of the world, which entails various ways in which the source code of an agent might affect the outside world besides through the actions that the code outputs, Son-of-CDT is the best program to run for maximizing expected utility. The original CDT agent chooses the program that meets this specification. In general, this will not have anything remotely like a nice, simple closed form. If there are agents out there with vendettas against certain agent-programs, it will take that into account. Vendettas against Son-of-CDT? CDT agents can be bullied. I believe the MIRI view is that Son-of-CDT will be bullied as well. Suppose there is an ultimatum game, where agent A offers at most10 to agent B, and if agent B accepts, then agent A gets $10 minus the amount they offered. Otherwise, both get nothing. A competent agent in the position of agent B able to make a credible commitment (perhaps by revealing its source code) would commit to accept nothing less than$9.99, if agent A is a CDT agent. This would work out for the competent agent, because the CDT agent would see all this, and realize it could be one penny richer if it offers $9.99. Eliezer claims that a "[competent] agent [chooses] to reject offers short of$9.99 from [the CDT agent's] offspring. (Original: "the LDT agent's choice to reject offers short of $9.99 from its offspring"). In my sketch above of the creation of Son-of-CDT, I include a detail that it would be housed in the same computer that ran the original agent, but this needn't be the case. It could be run anywhere in the world. The CDT agent could take any sort of actions that would cause Son-of-CDT to come into existence some time in the future somewhere in the world. There is no clear way to distinguish the "offspring" of an agent, given that an agent's actions can cause other agents to come into existence in arbitrary ways. For a competent agent to reject offers short of$9.99 from the "offspring" of a CDT agent, it would have to reject offers short of 9.99 from all agents that came into being after the existence of a single CDT agent. It would have to bully everyone. After a CDT agent with a certain utility function comes into being, if there exists an accessible future in which a competent agent optimizes that utility function (where "accessible" is with respect to the action space of the CDT agent), then the CDT agent will access that future by taking the appropriate actions, and that competent agent will come into being. If it is true t... Mar 09, 2023 AF - Challenge: construct a Gradient Hacker by Thomas Larsen 02:35 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Challenge: construct a Gradient Hacker, published by Thomas Larsen on March 9, 2023 on The AI Alignment Forum. This is a relatively clean subproblem that we came upon a few months ago while thinking about gradient hacking. We're throwing it out to the world to see if anyone can make progress. Problem: Construct a gradient hacker (definition below), or prove that one cannot exist under the given conditions. Setup: Let x∈Rd be the network inputs, θ∈Rw, be the network trainable parameters, ^y∈R be the network outputs, and f(x,θ)=^y be the network architecture. f(x,θ) must be a feedforward neural network that reads in its own weights as input. Weight sharing is allowed (vertical or horizontal), and some weight sharing is necessary in order to take all w weights as input. Suppose that we train this network on a dataset D={x(i),y(i)}, with MSE loss L(θ)=1n∑i(f(x(i),θ)−y(i))2, assuming gradient flows: θ′(t)=−∇θL(θ(t)). Given an initial point θ0, the training dynamics are therefore: Definition: A tuple (f,θg,D) is a gradient hacker if the following conditions hold: θg is not a global optimum: ∃θ such that L(θ)0 such that ∀θ0:θ0−θgϵ: the network’s training converges to θg, i.e., limt∞θ(t)=θg. There is internal structure of f(,θ0) that computes the gradient; i.e., there is some subset of the activations whose mean over every datapoint in D is ∇θ0L(θ0). This captures my intuition that a gradient hacker knows where it wants to go (in this case "get to θg"), and then it should decide what it outputs in order to make the gradient true. Some more ambitious problems (if gradient hackers exist): Characterize the set of all gradient hackers. Show that they all must satisfy some property. Construct gradient hackers for arbitrarily large n, d, w, and neural net depth. Variations on the problem: a subset of the activations equals ∇θ0L(θ0) for every input, or the subset of activations correspond to the gradient on that input. This is a bit strict, but we didn't want to list different ways something could be isomorphic to the gradient. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 AF - Squeezing foundations research assistance out of formal logic narrow AI. by Donald Hobson 03:29 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Squeezing foundations research assistance out of formal logic narrow AI., published by Donald Hobson on March 8, 2023 on The AI Alignment Forum. Suppose you have a ML model trained to output formal proofs. Maybe you start with ZFC and then add extra tokens for a range of common concepts. (along with definitions. ). So a human mathematician needs to type in the definition of a gradient in terms of limits, and the definition of limits in terms of epsilon and delta, and the definition of the real numbers in terms of dedekind cuts. All the way back to ZFC. The human needn't type any proofs, just the definitions. The model could be trained by generating random syntactically correct strings of tokens, and trying to prove or disprove them. (Remember, we have added the notion of a gradient to the token pool, plenty of the random questions will involve gradients) Hopefully it forms intermediate theorems and heuristics useful towards proving a wide class of theorems. Computer programs can be described as mathematical objects. So the human adds some tokens for lisp programs, and a few definitions about how they behave to the token pool. "Will program X do Y?" is now a perfectly reasonable question to ask this model. This is where the magic happens. You give your system a simple toy problem, and ask for short programs that solve the toy problem, and about which many short theorems can be proved. Maybe you do gradient descent on some abstract latent space of mathematical objects. Maybe an inefficient evolutionary algorithm selecting both over the space of programs and the theorems about them. Maybe "replace the last few layers, and fine tune the model to do a new task", like RLHF in ChatGPT. Now I don't expect this to just work first time. You will want to add conditions like "ignore theorems that are true of trivial programs (eg the identity program)" and perhaps "ignore theorems that only take a few lines to prove" or "ignore theorems so obvious that a copy of you with only 10% the parameters can prove it". For the last one, I am thinking of the programmers actually training a mini version with 10% the parameters, and running some gradients through it. I am not thinking of the AI reasoning about code that is a copy of itself. The AI model should have a latent space. This can let the programmers say "select programs that are similar to this one" or "choose a program about which theorems close to this theorem in latent space can be proved". The idea of this is that Asking questions should be safe. There are a bunch of different things we can optimize, and it should be safe to adjust parameters until it is proving useful results not trivialities. The AI doesn't have much information about human psychology, or about quantum physics or the architecture of the processor it's running on. Gradient descent has been pushing it to be good at answering certain sorts of question. There is little to no advantage to being good at predicting the questions or figuring out what they imply about the people asking them. With a bit of fiddling, such a design can spit out interesting designs of AI, and theorems about the designs. This isn't a foolproof solution to alignment, but hopefully such help makes the problem a lot easier. It is ABSOLUTELY NOT SAFE to throw large amounts of compute at the programs that result. Don't have anything capable of running them installed. The programs and the theorems should be read by humans, in the hope that they are genius insights into the nature of AI. The textbook from the future. Humans can then use the insights to do... something. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 08, 2023 AF - AI Safety in a World of Vulnerable Machine Learning Systems by AdamGleave 52:31 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety in a World of Vulnerable Machine Learning Systems, published by AdamGleave on March 8, 2023 on The AI Alignment Forum. Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recent work has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable. Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default). In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk. Alignment AgendaMain System’s Ability to Exploit HelperMain System’s Incentive to Exploit HelperRisk of ExploitRL on learned reward model (e.g. RLHF, IRL)MediumHighHighScalable oversight (e.g. recursive reward modeling,AI safety via debate)MediumHighHighImitation learning (e.g. behavioral cloning, supervised fine-tuning)MediumLowLow-MediumImitative Iterated Distillation and AmplificationHighLowMediumAuditing Tool (e.g. Adversarial Testing, Transparency)LowMediumLow-Medium Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed. We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another). We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach. By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)[1] but with only modest progress (low to medium tractability). In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by. Alignment agendas need robustness Most alignment sche... Mar 08, 2023 AF - Do humans derive values from fictitious imputed coherence? by Tsvi Benson-Tilsen 24:12 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Do humans derive values from fictitious imputed coherence?, published by Tsvi Benson-Tilsen on March 5, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed November 1, 2022. This essay is more like research notes than exposition, so context may be missing, the use of terms may change across essays, and the text may be revised later; only the versions at tsvibt.blogspot.com are definitely up to date.] Humans are born with some elements of their minds, and without many other elements, some of which they'll acquire as their life unfolds. In particular, the elements that we pretheoretically call "values"--aesthetic preferences, goals, life goals, squad goals, aspirations, needs, wants, yearnings, drives, cravings, principles, morals, ethics, senses of importance, and so on--are for the most part acquired or at least unfolded, rather than being explicitly present in a newborn. How does this happen? What generates these mental elements? Hypothesis: a human derives many of zer values by imputing coherent agency to zer past behavior, and then adopting the goals of that fictitious agency as actively influential criteria for future action. Thanks to Sam Eisenstat for relevant conversations. The FIAT hypothesis As a shorthand: "the FIAT hypothesis" = "the Fictitious Imputed Adopted Telos hypothesis". ("Fiat" is Latin for "may it happen" or "may it be made", which has some resonance with the FIAT hypothesis in that they both talk about a free creation of goals.) FIAT goals are goals imputed to some behavior and then adopted as goals. Human behavior is determined by many things: built-in behavior-determiners such as the instinctive ability to breath, socially learned behavior and values, convergent instrumental goals, and freely created autopoietic goals such as artistic goals. The FIAT hypothesis says that a major determiner of a human's behavior is the process of adopting goals based on interpreting zer past behavior as agentic. Ze can be interpreted as asking the question: if my past behavior were the behavior of a coherent agent trying to do something, what would that something be? Then, whatever the answer was, ze adopts it as a goal--a target of more coherent behavior (more effective, more strategic, more orchestrated, more coordinated, more conscious, better resourced, more reflective, more univocal, more wasteless). This hypothesis gives a possible answer to the question: how did evolution build something with some substantial level of agentic coherence, even though evolution can't directly program conscious concepts like "avoiding death" or "saving food" or "inclusive genetic fitness" for use as terms in a utility function for an organism to pursue? This process could be continuous, with goals becoming gradually more coherent (and then potentially deprioritized, but usually not de-cohered). This process is iterative, starting with built-in behavior-determiners, then adopting new FIAT goals based on past behavior mainly generated by built-in determiners (and also maybe adopting new goals for other reasons), and then adopting new goals based on past behavior influenced by previously adopted goals, including previous FIAT goals, and so on. FIAT goals also come from not just imputing goals to zer own behavior, but also to the behavior of others, such as parents and leaders. Everything gets enshrined, but everything is open to criticism. Note that calling this a hypothesis is maybe presumptuous; it's an idea, but since it's abstract and it's about a complex system, there's a lot of ambiguity between FIAT and other explanations or descriptions of behavior, and it's not necessarily obvious how to make different predictions according to the FIAT hypothesis. Something left quite unspecified is how the FIAT process picks different possible interpretations ... Mar 05, 2023 AF - Why Not Just... Build Weak AI Tools For AI Alignment Research? by johnswentworth 11:37 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Not Just... Build Weak AI Tools For AI Alignment Research?, published by johnswentworth on March 5, 2023 on The AI Alignment Forum. “Weak” cognitive tools are clearly a thing, and are useful. Google search is a fine example. There are plenty of flavors of “weak AI” which are potentially helpful for alignment research in a similar way to google search. In principle, I think there’s room for reasonably-large boosts to alignment research from such tools. Alas, the very large majority of people who I hear intend to build such tools do not have the right skills/background to do so (at least not for the high-value versions of the tools). Worse, I expect that most people who aim to build such tools are trying to avoid the sort of work they would need to do to build the relevant skills/background. Analogy: A Startup Founder’s Domain Expertise (Or Lack Thereof) Imagine a startup building tools meant to help biologists during their day-to-day work in the wetlab. I expect domain expertise to matter a lot here: I would guess that if none of the founders have ample personal experience doing research work in a wetlab, the chance of this startup building an actually-highly-useful wetlab product drops by about an order of magnitude. Our hypothetical startup might still “succeed” some other way, e.g. by pivoting to something else, or by being good at pitching their shitty product to managers who make purchasing decisions without actually using the product, or by building something very marginally useful and pricing it very cheaply. But their chance of building a wetlab product which actually provides a lot of value is pretty slim. One might reply: but couldn’t hypothetical founders without domain experience do things to improve their chances? For instance, they could do a bunch of user studies on biologists working in wetlabs, and they could deploy the whole arsenal of UX study techniques intended to distinguish things-users-say-matter from things-which-actually-matter-to-users. . and my response is that I was already assuming our hypothetical founders do that sort of thing. If the founders don’t have much domain experience themselves, and don’t do basic things like lots of user studies, then I’d guess their chance of building an actually-high-value wetlab product drops by two or three orders of magnitude, not just one order of magnitude. At that point it’s entirely plausible that we’d have to go through thousands of times more startups to find one that succeeded at building a high-value product. How is this analogous to plans to build AI tools for alignment research? So we want to build products (specifically AI products) to boost alignment research. The products need to help solve the hard parts of aligning AI, not just easy things where we can clearly see what’s going on and iterate on it, not just problems which are readily legible or conceptually straightforward. Think problems like e.g. sharp left turn, deception, getting what we measure, or at a deeper level the problem of fully updated deference, the pointers problem, value drift under self-modification, or ontology identification. And the tools need to help align strong AI; the sort of hacky tricks which fall apart under a few bits of optimization pressure are basically irrelevant at that point. (Otherwise the relevant conversation to have is not about how the tools will be useful, but about how whatever thing the tools are building will be useful.) The problem for most people who aim to work on AI tools for alignment research is that they have approximately-zero experience working on those sorts of problems. Indeed, as far as I can tell, people usually turn to tool-building as a way to avoid working on the hard problems. I expect failure modes here to mostly look like solving the wrong problems, i.e. not actually addres... Mar 05, 2023 AF - More money with less risk: sell services instead of model access by Luke H Miles 04:43 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More money with less risk: sell services instead of model access, published by Luke H Miles on March 4, 2023 on The AI Alignment Forum. OpenAI is currently charging 100,000 times less per line of code than professional US devs.[1] An LLM's code output is of course less reliable than a professional's. And it is hard to use a text-completion API effectively in large projects. What should you do if you've got a model on your hands that solves those problems? You could operate as a software development company. They tend[2] to charge100-200k for simple mobile apps and there's basically no ceiling on the cost for complex apps over their lifetime. Devs make up the majority of a normal firm's personnel and costs; coding takes most of the app development time; bugs in code are one of the primary sources of project extension and failures. By using your model you can make better software, complete it faster, succeed more often, charge a lower price, and make a higher profit. Going further, if you've really got a good model, then you can do very well by building competitors to adobe products, salesforce products, SAP products, google search, mongodb, etc. Someone who has a build-anything machine would be a fool to sell a cheap build-anything service instead of using it themselves and selling the result. Particularly because selling the general service directly is likely to encourage and inspire copycats, including open-source ones who will delete your market. If it really builds the entire thing then you'll probably also be liable for negative consequences, which again have no ceiling. Fewer risks, big and small Some common misuse risks you can avoid/reduce (and eliminate associated liability): Someone tricks your API into doing something awful and pastes it into a tweet Spam generation for political campaigns, cryptocurrencies, etc Common hacking ("write a test to see if my server has a log4j vulnerability") Targeted manipulation and spearphishing Larger risks you can avoid/reduce: Your incredible model motivates countless AI researchers. People reverse-engineer some of the architecture in online discussions. The state of the art is quickly advanced. We have less time to prepare for strong general AI. Hackers steal your model weights (if you don't advertise your model then you'll attract less attention from hackers) People try to get your model to act like an agent and copy itself around. They succeed. You have no way of shutting it down or monitoring what it is doing. Someone tries to get your model to order and mail smallpox or a novel virus. The screenshot would be an epic tweet. They succeed oh no Your own AI devs' ambitions and risk-tolerance know no bounds because you've positioned yourself as an AI company instead of a product company; there is nothing to keep their hands busy except make the AI more generally capable and efficient. They are careless with the training runs and one day your model gets loose and wreaks havoc. Biology, robotics, R&D, etc The benefits of selling/publishing derived products and the downsides of offering direct access remain in other domains: A drug is more profitable and less risky (for the world at least) than a general drug designer A vaccine is more profitable and less risky than a general mRNA designer There's more people who want to buy a house than a house-building robot There's more people who need a (highly efficient, AI assisted) lawyer than a general lawyer's assistant. More people need a cleaning robot than a robot-maker Releasing or building an effective fusion power generator gets you more clout than releasing the design assistant Even if you're evil and want to make AI-astroturf campaign spam, you presumably want to help one side more than the other, but if you release your model/tooling then both sides will use it. If you ha...
Mar 04, 2023
AF - Acausal normalcy by Andrew Critch
12:24

Mar 03, 2023
AF - Why are counterfactuals elusive? by Martín Soto
05:38

Mar 03, 2023
AF - The Waluigi Effect (mega-post) by Cleo Nardo
27:12

Mar 03, 2023
AF - Payor's Lemma in Natural Language by Andrew Critch
02:46

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Payor's Lemma in Natural Language, published by Andrew Critch on March 2, 2023 on The AI Alignment Forum. Preceded by: Modal Fixpoint Cooperation without Löb's Theorem It turns out Payor's Lemma and its proof can be explained in natural language even more easily than Löb's Theorem. Here's how. Imagine a group of people, and let x denote the statement "everyone in the group cooperates". Payor's Lemma says the following: Lemma: If ⊢□(□xx)x, then ⊢x First, let's unpack the meaning of the assumption in words: "□x" may be interpreted as saying "the group verifies (on the basis of logic) that it will cooperate" or "cooperation is believed". "□xx" is a statement of trustworthiness: if the group verifies that it will cooperate, then it actually will cooperate. Because a formal verifier can have bugs in it — or, because a group of people can fail to understand itself — this is a non-trivial claim about the group. "□(□xx)" says "the group verifies that it's trustworthy" (in the specific sense of trustworthiness above). "□(□xx)x" says "the group will cooperate on the basis of verified trustworthiness", i.e., "if the group verifies that it's trustworthy, then it will cooperate". "⊢□(□xx)x" says "it's verified that the group will cooperate on the basis of verified trustworthiness" Now let's work through the proof in words, too! I'll omit saying "it's verified that..." each time, which is what ⊢ means. ⊢x(□xx), by tautology (A(BA)). This says:"If the group cooperates, then it's trustworthy" (in the specific sense of trustworthiness about cooperation defined above). ⊢□x□(□xx), from 1 by □ necessitation and distributivity. This says:"If the group verifiably cooperates, it's verifiably trustworthy." ⊢□(□xx)x, by assumption. This says:"Assume the group will cooperate on the basis of verified trustworthiness." ⊢□xx, from 2 and 3 by modus ponens. This says:"The group is trustworthy." ⊢□(□xx), from 4 by □ necessitation. This says:"The group is verifiably trustworthy." ⊢x, from 5 and 3 by modus ponens. This says:"The group cooperates." Continuing to use "trustworthiness" in the sense above, the whole proof may be summarized as follows: "If a group verifiably cooperates, it's verifiably trustworthy (to itself). Assume the group cooperates on the basis of verified trustworthiness. Then, it also cooperates on the basis of verified cooperation (a stronger condition), which is what trustworthiness means. Therefore, the group is trustworthy, hence verifiably trustworthy (assuming we concluded all this using logic), hence the group cooperates (by the assumption)." Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 02, 2023
AF - Predictions for shard theory mechanistic interpretability results by Alex Turner
09:26

Mar 01, 2023
AF - Implied "utilities" of simulators are broad, dense, and shallow by porby
06:46

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on The AI Alignment Forum. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply...
Mar 01, 2023
AF - Inside the mind of a superhuman Go model: How does Leela Zero read ladders? by Haoxing Du
51:14

Mar 01, 2023
AF - Scarce Channels and Abstraction Coupling by johnswentworth
09:05

Feb 28, 2023
AF - $20 Million in NSF Grants for Safety Research by Dan H 01:24 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is:$20 Million in NSF Grants for Safety Research, published by Dan H on February 28, 2023 on The AI Alignment Forum. After a year of negotiation, the NSF has announced a $20 million request for proposals for empirical AI safety research. Here is the detailed program description. The request for proposals is broad, as is common for NSF RfPs. Many safety avenues, such as transparency and anomaly detection, are in scope: "reverse-engineering, inspecting, and interpreting the internal logic of learned models to identify unexpected behavior that could not be found by black-box testing alone" "Safety also requires... methods for monitoring for unexpected environmental hazards or anomalous system behaviors, including during deployment." Note that research that has high capabilities externalities is explicitly out of scope: "Proposals that increase safety primarily as a downstream effect of improving standard system performance metrics unrelated to safety (e.g., accuracy on standard tasks) are not in scope." Thanks to OpenPhil for funding a portion the RfP---their support was essential to creating this opportunity! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 28, 2023 AF - Counting-down vs. counting-up coherence by Tsvi Benson-Tilsen 21:31 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Counting-down vs. counting-up coherence, published by Tsvi Benson-Tilsen on February 27, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed 25 October 2022.] Counting-down coherence is the coherence of a mind viewed as the absence of deviation downward in capability from ideal, perfectly efficient agency: the utility left on the table, the waste, the exploitability. Counting-up coherence is the coherence of a mind viewed as the deviation upward in capability from a rock: the elements of the mind, and how they combine to perform tasks. What determines the effects of a mind? Supranormally capable minds can have large effects. To control those effects, we'd have to understand what determines the effects of a mind. Pre-theoretically, we have the idea of "values", "aims", "wants". The more capable a mind is, the more it's that case that what the mind wants, is what will happen in the world; so the mind's wants, its values, determine the mind's effect on the world. A more precise way of describing the situation is: "Coherent decisions imply consistent utilities". A mind like that is incorrigible: if it knows it will eventually be more competent than any other mind at pushing the world towards high-utility possibilities, then it does not defer to any other mind. So to understand how a mind can be corrigible, some assumptions about minds and their values may have to be loosened. The question remains, what are values? That is, what determines the effects that a mind has on the world, besides what the mind is capable of doing or understanding? This essay does not address this question, but instead describes two complementary standpoints from which to view the behavior of a mind insofar as it has effects. Counting-down coherence Counting-down coherence is the coherence of a mind viewed as the absence of deviation downward in capability from ideal, perfectly efficient agency: the utility left on the table, the waste, the exploitability. Counting-down coherence could also be called anti-waste coherence, since it has a flavor of avoiding visible waste, or universal coherence, since it has a flavor of tracking how much a mind everywhere conforms to certain patterns of behavior. Some overlapping ways of describing counting-down incoherence: Exploitable, Dutch bookable, pumpable for resources. That is, someone could make a set of trades with the mind that leaves the mind worse off, and could do so repeatedly to pump the mind for resources. See Garrabrant induction. VNM violating. Choosing between different outcomes, or different probabilities of different outcomes, in a way that doesn't satisfy the Von Neumann–Morgenstern axioms, leaves a mind open to being exploited by Dutch books. See related LessWrong posts. Doesn't maximize expected utility. A mind that satisfies the VNM axioms behaves as though it maximizes the expected value of a fixed utility function over atomic (not probabilistic) outcomes. So deviating from that policy exposes a mind to Dutch books. Missed opportunities. Leaving possible gains on the table; failing to pick up a$20 bill lying on the sidewalk. Opposing pushes. Working at cross-purposes to oneself; starting to do X one day, and then undoing X the next day; pushing and pulling on the door handle at the same time. Internal conflict. At war with oneself; having elements of oneself that try to harm each other or interfere with each other's functioning. Inconsistent beliefs, non-Bayesian beliefs. Sometimes acting as though X and sometimes acting as though not-X, where X is something that is either true or false. Or some more complicated inconsistency, or more generally failing to act as though one has a Bayesian belief state and belief revisions. Any of these also open one up to being Dutch booked. Inefficient allocation. Choosing to inve...
Feb 27, 2023
AF - The Preference Fulfillment Hypothesis by Kaj Sotala
16:33

Feb 26, 2023
AF - Cognitive Emulation: A Naive AI Safety Proposal by Connor Leahy
09:50

Feb 25, 2023
AF - Agents vs. Predictors: Concrete differentiating factors by Evan Hubinger
05:59

Feb 24, 2023
AF - Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes by Andrea Miotti
01:18:01

Feb 24, 2023
AF - Retrospective on the 2022 Conjecture AI Discussions by Andrea Miotti
04:25

Feb 24, 2023
AF - Sam Altman: "Planning for AGI and beyond" by Lawrence Chan
11:13

Feb 24, 2023
AF - Meta open sources LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper) by Lawrence Chan
02:32

Feb 24, 2023
AF - Teleosemantics! by Abram Demski
11:16

Feb 23, 2023
AF - AI that shouldn't work, yet kind of does by Donald Hobson
05:42

Feb 23, 2023
AF - EIS XII: Summary by Stephen Casper
11:14

Feb 23, 2023
AF - EIS XI: Moving Forward by Stephen Casper
15:54

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XI: Moving Forward, published by Stephen Casper on February 22, 2023 on The AI Alignment Forum. Part 11 of 12 in the Engineer’s Interpretability Sequence. So far, this sequence has discussed a number of topics in interpretability research, all building toward this post. Its goal is to explain some approaches that may be valuable moving forward. I plan to work on some of the ideas here soon. Others, I may not work on soon, but I would love to discuss and support such work if I am able. I hope that this post can offer some useful ideas for people entering or continuing with interpretability research, and if you would like to discuss anything here more, feel more than free to email me at scasper@mit.edu. What are we working toward? First, it seems useful to highlight two points that are uncontroversial in the AI safety community but important nonetheless. Our goal is a toolbox – not a silver bullet. As AI safety engineers, we should neither expect nor try to find a single ‘best’ approach to interpretability that will solve all of our problems. There are many different types of interpretability tools, and many of the differences between them can be described as enforcing different priors over what explanations they generate. So it is trivial to see that there is not going to be any free lunch. There is no silver bullet for interpretability, and few tools conflict with each other anyway. Hence, our goal is a toolbox. In fact, some coauthors and I recently found an excellent example of how using multiple interpretability tools at once beats using individual ones (Casper et al., 2023). This doesn’t mean, however, that we should celebrate just any new interpretability tool. Working in unproductive directions is costly, and applying tool after tool to a problem contributes substantially to the alignment tax. The best types of tools to fill our toolbox will be ones that are automatable, cheap to use, and have demonstrated capabilities on tasks of engineering-relevance. Don’t advance capabilities. As AI safety engineers, we do not want to advance capabilities because doing so speeds up timelines. In turn, faster timelines mean less time for safety research, less time for regulators to react, and a greater likelihood of immense power being concentrated in the hands of very few. Avoiding faster timelines isn’t as simple as just not working on capabilities though. Many techniques have potential uses for both safety and capabilities. So instead of judging our work based on how much we improve safety, we need to judge it based on how much we improve safety relative to capabilities. This is an especially important tradeoff for engineers to keep in mind. A good example was discussed by Hendrycks and Woodside (2022) who observed that there is a positive correlation between the anomaly detection capabilities of a network and its task performance. Some work may improve safety capabilities but if it does so by continuing along existing trendlines, we don’t get more safety than the counterfactual. For the full discussion of this point, see Hendrycks and Woodside (2022). What types of existing tools/research seem promising? Before discussing what topics may be important to work on in the future, it may be valuable to reflect on examples of past work that have introduced interpretability tools that seem to be able to competitively provide engineering-relevant insights. Here is a personal list that is somewhat arbitrary and undoubtedly incomplete. But hopefully it is still valuable. Consider this an engineer’s interpretability reading list of sorts. Some works have competitively done engineering-relevant things with methods for making novel predictions about how a network will handle OOD inputs. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and ...
Feb 22, 2023
AF - Cyborg Periods: There will be multiple AI transitions by Jan Kulveit
09:58

Feb 22, 2023
AF - The Open Agency Model by Eric Drexler
08:45

Feb 22, 2023
AF - A proof of inner Löb's theorem by James Payor
02:52

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A proof of inner Löb's theorem, published by James Payor on February 21, 2023 on The AI Alignment Forum. This is a short post that offers a slightly different take on the standard proof of Löb's theorem. It offers nothing else of any value :) We seek to prove the "inner" version, which we write as: □P↔□(□PP) The proof uses quining to build a related sentence L, the "Löb sentence", which talks about its own source code. By construction L has the property: □L↔□(□LP) Then, we can show that □L↔□P, i.e. they're equivalent! We do this by plugging □L into itself to get a twisty □P. We can then replace each □L with □P and prove Löb's theorem. The proof This proof uses the same rules of box manipulation as on the wiki page. We start by creating L using quining, i.e. taking a modal fixed point: ⊢L↔(□LP) (exists as a modal fixed point) Yep, this is skipping the details of the most interesting part, but alas I don't understand them well enough to do more than wave my hands and say "quining". We then stick it inside the box to get our first property: ⊢□(L↔(□LP)) (from (1) by necessitation) ⊢□L↔□(□LP) (from (2) by box-distributivity in both directions) We now want to show that □L↔□P. We can get the forward direction by feeding a copy of □L into itself: ⊢□L(□□L□P) (box-distributivity on (3)) ⊢□L□□L (internal necessitation) ⊢□L□P (from (4) and (5)) The backward direction is equivalent to □P□(□LP), and is straightforward: ⊢P(□LP) (trivial) ⊢□P□(□LP) (necessitation and box-distributivity on (7)) Taking those together, we've shown □L and □P are equivalent. ⊢□L↔□P (from (6) and (8)) Now we'd like to finish by appealing to the following chain: □P↔□L↔□(□LP)↔□(□PP) We've proven all but the last part of the chain. Here are the steps that let us do the substitution: ⊢(□LP)↔(□PP) (since □L and □P are equivalent by (9)) ⊢□((□LP)↔(□PP)) (from (10) by necessitation) ⊢□(□LP)↔□(□PP) (from (11) by box-distributivity in both directions) And that's everything we need: ⊢□P↔□(□PP) (from (3), (9), and (12)) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Feb 21, 2023
AF - Pretraining Language Models with Human Preferences by Tomek Korbak
20:10

Feb 21, 2023
AF - EIS X: Continual Learning, Modularity, Compression, and Biological Brains by Stephen Casper
05:23

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS X: Continual Learning, Modularity, Compression, and Biological Brains, published by Stephen Casper on February 21, 2023 on The AI Alignment Forum. Part 10 of 12 in the Engineer’s Interpretability Sequence. The science of interpretability is part of a larger picture. The previous post focused in-depth on how research on interpretability and adversaries are inseparably connected. This post is dedicated to discussing how this is not itself a complete story. There is a much larger, richer one about the connections between interpretability, adversaries, continual learning, modularity, and biological brains – likely some other things too. These connections may be a useful mine for insight and inspiration. Below are discussions of my understanding of each of these topics and how they relate to others. I’ll include some citations here, but see the Toward Transparent AI survey (Räuker et al., 2022) survey for full discussions. Continual learning Continual learning is a fairly large subfield of deep learning that focuses on finding ways to help neural networks learn new information without forgetting old information. This is also described as the goal of avoiding “catastrophic forgetting.” Notably, biological brains are good at this, but artificial neural networks are not by default. Sections 2A and 3A of the Toward Transparent AI survey (Räuker et al., 2022) both focus entirely on how continual learning methods are interpretability tools. Please see the survey for the full discussion. Methods for continual learning are based on replay, regularization, or parameter isolation (De Lange et al., 2019). Methods taking the latter two strategies are based on the broader principle of getting neural networks to have some weights or neurons that specialize in particular types of data. In other words, they encourage specialized task-defined modules inside the network. Thus, these can be used as intrinsic interpretability tools that help us train models that are more easy or natural to interpret out of the box. Modularity Modularity is a common property of engineered systems, and separating neural networks into distinct, specialized modules is very appealing for interpreting them. The weights in neural network layers are typically initialized and updated according to uniform rules, and all neurons in one layer are typically connected to all neurons in the previous and next layers. Unfortunately, this does not help networks develop specialized modules. Meanwhile, neurons in biological brains come in multiple types and can only communicate with nearby ones. This has contributed to modularity in brains in which different brain regions specialize in processing information for distinct tasks. See Sections 4B-4C of the Toward Transparent AI survey (Räuker et al., 2022) for a full discussion on modularity. In artificial neural networks, neural networks can be trained to be modular using either “hard” architectural constraints or “soft” modularity aided by initialization, regularization, a controller, or sparse attention. Meanwhile, Serra et al. (2018) found that soft modularity via sparse attention helped with continual learning. And even when networks are not trained to be explicitly modular, one can still interpret them post hoc in terms of modules. Compression Some neurons and weights are frivolous, meaning that they are either redundant with others or are simply not useful to the network’s performance at all. Frivolous components of the network can be understood as useless modules that can be adapted for continual learning. Networks that contain frivolous weights or neurons can also be compressed by removing them which makes the interpretation of circuits inside of the network simpler. Meanwhile, compression can guide interpretations (e.g. Li et al. (2018) or causal scrubbing), and inte...
Feb 21, 2023
AF - You're not a simulation, 'cause you're hallucinating by Stuart Armstrong
02:24

Feb 21, 2023
AF - Behavioral and mechanistic definitions (often confuse AI alignment discussions) by Lawrence Chan
12:20

Feb 20, 2023
AF - There are no coherence theorems by Dan H
34:38

Feb 20, 2023
AF - EIS IX: Interpretability and Adversaries by Stephen Casper
14:52

Feb 20, 2023
AF - [MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming by Dan H
08:22

Feb 20, 2023
AF - EIS VIII: An Engineer’s Understanding of Deception by Stephen Casper
06:21

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VIII: An Engineer’s Understanding of Deception, published by Stephen Casper on February 19, 2023 on The AI Alignment Forum. Part 8 of 12 in the Engineer’s Interpretability Sequence. As AI safety researchers, we care a lot about deception. So what is it? Hubinger et al. (2019) write: If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment. Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer. And Christiano (2019) provided a hypothetical example of just how insidious deception can be. Consider a deceptive agent that wants to take over the world but doesn’t know if it’s being simulated in training or if it’s been deployed in the real world. Suppose the agent reasons that after behaving for long enough, it ought to be deployed. So it might adopt the following policy. Pick a cryptographic puzzle (e.g. “factor RSA-2048”). When it sees a solution to that puzzle, it behaves badly. Otherwise, it behaves well. This means that the deceptive model will appear throughout training to be perfectly aligned but will eventually make a treacherous left turn in deployment. Specific hypotheticals aside, deception has become a major point of interest in AI safety research because unlike other types of failures, deceptive ones are not the types of things we have good feedback signals for fixing. We won’t be able to find them by looking at models’ performance in development. And we aren’t likely to surmise them from analysis of its objectives and environment (e.g. using policy/value iteration) – if we could derive optimal behavior from objectives for some task, advanced AI would probably be overkill. Deceptive alignment can be similar to non-deceptive forms of misalignment. The definition from Hubinger et al. (2019) may be a little narrow. The authors say that for a model to be deceptive, it (1) has to have an objective extending across parameter updates (2) has to be able to model that is being selected to achieve a base objective, and (3) must expect the threat of modification. These three things give rise to a particularly worrying scenario in which an AI system would actively try to deceive us. They also immediately suggest ways to avoid this story by trying to develop the system in a way that violates these requirements and avoids this problem in the first place. But suppose that despite our best effort, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training. But it’s possible for other types of problems to be hard to find during development that don’t fit all of the requirements that Hubinger et al. (2019) list. And for this reason, when we take off our “develop the model” hat and out on our “diagnose and debug the model” hat, the definition from Hubinger et al. (2019) becomes less important. So from the point of view of an engineer wearing their “diagnose and debug the model” hat, deceptive alignment and other insidious inner alignment failures are just issues where the model will betray us as the result of (1) a trigger...
Feb 19, 2023
AF - Does novel understanding imply novel agency / values? by Tsvi Benson-Tilsen
12:22

Feb 19, 2023
AF - AGI in sight: our look at the game board by Andrea Miotti
10:34

Feb 18, 2023
AF - Parametrically retargetable decision-makers tend to seek power by Alex Turner
04:16

Feb 18, 2023
AF - EIS VII: A Challenge for Mechanists by Stephen Casper
04:14

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VII: A Challenge for Mechanists, published by Stephen Casper on February 18, 2023 on The AI Alignment Forum. Part 7 of 12 in the Engineer’s Interpretability Sequence. Thanks to Neel Nanda. I used some very nicely-written code of his from here. And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me. MI = “mechanistic interpretability” Given a network, recover its labeling function. In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. The the best of my knowledge: Unlike prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network’s solution to a task that was not cherrypicked by the researcher(s) doing so. Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high stakes settings in the real world. So if an approach can’t solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work. This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023). Challenge 1, MNIST CNN I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0’s and the other half as 1’s. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function. Hint 1: The labels are binary. Hint 2: The network gets 95.58% accuracy on the test set. Hint 3: This image may be helpful. Challenge 2, Transformer I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda’s grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function. Hint 1: The labels are binary. Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half. Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary... Prizes If you are the first person to send me the labeling function and a mechanistic explanation for either challenge, I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate$500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool 1,500. Good luck For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how. Neither of the models perfectly label the validation set. One may object that this will make the problem unfairly difficult because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with models don’t tend to conveniently grok onto a simple, elegant, programmat... Feb 18, 2023 AF - Two problems with ‘Simulators’ as a frame by Ryan Greenblatt 08:23 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two problems with ‘Simulators’ as a frame, published by Ryan Greenblatt on February 17, 2023 on The AI Alignment Forum. (Thanks to Lawrence Chan and Buck Shlegeris for comments. Thanks to Nate Thomas for many comments and editing) Despite appreciating and agreeing with various specific points made in the Simulators post, I broadly think that the term ‘simulator’ and the corresponding frame probably shouldn’t be used. Instead, I think we should just directly reason about predictors and think in terms of questions such as ‘what would the model predict for the next token?’[1] In this post, I won’t make arguments that I think are strong enough to decisively justify this claim, but I will argue for two points that support it: The word ‘simulation’ as used in the Simulators post doesn’t correspond to a single simulation of reality, and a ‘simulacrum’ doesn’t correspond to an approximation of a single agent in reality. Instead a ‘simulation’ corresponds to a distribution over processes that generated the text. This distribution in general contains uncertainty over a wide space of different agents involved in those text generating processes. Systems can be very good at prediction yet very bad at plausible generation – in other words, very bad at ‘running simulations’. The rest of the post elaborates on these claims. I think the author of the Simulators post is aware of these objections. I broadly endorse the perspective in ‘simulator’ framing and confusions about LLMs, which also argues against the simulator framing to some extent. For another example of prior work on these two points, see this discussion of models recognizing that they are generating text due to generator discriminator gaps in the Conditioning Predictive Models sequence[2]. Related work Simulators, ‘simulator’ framing and confusions about LLMs, Conditioning Predictive Models Language models are predictors, not simulators My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”. Let’s go through an example: Suppose we prompt the model with “endoftextNEW YORK—After John McCain was seen bartending at a seedy nightclub”. I’d claim the model's next token prediction will involve uncertainty over the space of all the different authors which could have written this passage, as well as all the possible newspapers, etc. It presumably can’t internally represent the probability of each specific author and newspaper, though I expect bigger models will latently have an estimate for the probability that text like this was written by particularly prolific authors with particularly distinctive styles as well as a latent estimate for particular sites. In this case, code-davinci-002 is quite confident this prompt comes from The Onion[3]. In practice, I think it’s tempting to think of a model as running a particular simulation of reality, but performing well at the objective of next-token prediction doesn’t result in the output you would get from a single, particular simulation. In the previous example, the model might be certain that the piece is from The Onion after it’s generated many tokens, but it’s presumably not sure which author at the Onion wrote it or what the publication date is. Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next. I’ve seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn’t change t... Feb 17, 2023 AF - EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety by Stephen Casper 20:03 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety, published by Stephen Casper on February 17, 2023 on The AI Alignment Forum. Part 6 of 12 in the Engineer’s Interpretability Sequence. Thanks to Chris Olah and Neel Nanda for discussions and comments. In particular, I am thankful to Neel Nanda correcting a mistake I made in understanding the arguments in Olsson et al. (2022) in an earlier draft of this post. TAISIC = “the AI safety interpretability community” MI = “mechanistic interpretability” What kind of work this post focused on TAISIC prioritizes a relatively small set of problems in interpretability relative to the research community at large. This work is not homogenous, but a dominant theme is a focus on mechanistic, circuits-style interpretability with the end goals of model verification and/or detecting deceptive alignment. There is a specific line of work that this post focuses on. Key papers from it include: Feature Visualization (Olah et al., 2017) Zoom In: An Introduction to Circuits (Olah et al., 2020) Curve Detectors (Cammarata et al., 2020) A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) In-context Learning and Induction Heads (Olsson et al., 2022) Toy Models of Superposition (Elhage et al., 2022) Softmax Linear Units (Elhage et al., 2022) Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al., 2022) Progress measures for grokking via mechanistic interpretability (Nanda et al., 2023) .etc. And the points in this post will also apply somewhat to the current research agendas of Anthropic, Redwood Research, ARC, and Conjecture. This includes Causal Scrubbing (Chan et al., 2022) and mechanistic anomaly detection (Christiano, 2022). Most (all?) of the above work is either from Distill or inspired in part by Distill’s interpretability work in the late 2010s. To be clear, I believe this research is valuable, and it has been foundational to my own thinking about interpretability. But there seem to be some troubles with this space that might be keeping it from being as productive as it can be. Now may be a good time to make some adjustments to TAISIC’s focus on MI. This may be especially important given how much recent interest there has been in interpretability work and how there are large recent efforts focused on getting a large number of junior researchers working on it. Four issues This section discusses four major critiques of the works above. Not all of these critiques apply to all of the above, but for every paper mentioned above, at least one of the critiques below apply to it. Some but not all of these examples of papers exhibiting these problems will be covered. Cherrypicking results As discussed in EIS III and the Toward Transparent AI survey (Räuker et al., 2022), cherrypicking is common in the interpretability literature, but it manifests in some specific ways in MI work. It is very valuable for papers to include illustrative examples to build intuition, but when a paper makes such examples a central focus, cherrypicking can make results look better than they are. The feature visualization (Olah et al., 2017) and zoom in (Olah et al., 2020) papers have examples of this. Have a look at the cover photo for (Olah et al., 2017). From Olah et al., (2017) These images seem easy to describe and form hypotheses from. But instead of these, try going to OpenAI’ microscope and looking at some random visualizations. For example, here are some from a deep layer in an Inception-v4. From this link. As someone who often works with feature visualizations, I can confirm that these visualizations from OpenAI microscope are quite typical. But notice how they seem quite a bit less ‘lucid’ than the ones in the cover photo from Olah et al., (2017). Of course, many papers present t... Feb 17, 2023 AF - One-layer transformers aren’t equivalent to a set of skip-trigrams by Buck Shlegeris 10:17 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: One-layer transformers aren’t equivalent to a set of skip-trigrams, published by Buck Shlegeris on February 17, 2023 on The AI Alignment Forum. (thanks to Tao Lin and Ryan Greenblatt for pointing this out, and to Arthur Conmy, Jenny Nitishinskaya, Thomas Huck, Neel Nanda, and Lawrence Chan, Ben Toner, and Chris Olah for comments, and many others for useful discussion.) In “A Mathematical Framework for Transformer Circuits”, Elhage et al write (among similar sentences): One layer attention-only transformers are an ensemble of bigram and “skip-trigram” (sequences of the form "A. B C") models. The bigram and skip-trigram tables can be accessed directly from the weights, without running the model. I interpret this as a mathematical claim–that the attention layer of a one-layer transformer can be mathematically rewritten as a set of skip-trigrams, and that you can understand the models by reading these skip-trigrams off the model weights (and also reading the bigrams off the embed and unembed matrices, as described in the zero-layer transformer section – I agree with this part). But this is false: One-layer transformers are more expressive than skip-trigrams, so you can’t understand them by transforming them into a set of skip-trigrams. Also, even if a particular one-layer transformer is actually only representing skip-trigrams and bigrams, you still can’t read these off the weights without reference to the data distribution. The difference between skip-trigrams and one-layer transformers is that when attention heads attend more to one token, they attend less to another token. This means that even single attention heads can implement nonlinear interactions between tokens earlier in the context. In this post, I’ll demonstrate that one-layer attention-only transformers are more expressive than a set of skip-trigrams, then I’ll tell an intuitive story for why I disagree with Elhage et al’s claim that one-layer attention-only transformers can be put in a form where “all parameters are contextualized and understandable”. (Elhage et al say in a footnote, “Technically, [the attention pattern] is a function of all possible source tokens from the start to the destination token, as the softmax calculates the score for each via the QK circuit, exponentiates and then normalizes”, but they don’t refer to this fact further.) An example of a task that is impossible for skip-trigrams but is expressible with one-layer attention-only transformers Consider the task of predicting the 4th character from the first 3 characters in a case where there are only 4 strings: ACQTADQFBCQFBDQT So the strings are always: A or B C or D Q The xor of the first character being A and the second being D, encoded as T or F. This can’t be solved with skip-trigrams A skip-trigram (in the sense that Elhage et al are using it) looks at the current token and an earlier token and returns a logit contribution for every possible next token. That is, it’s a pattern of the form .....X........Y -> Z where you update towards or away from the next token being Z based on the fact that the current token is Y and the token X appeared at a particular location earlier in the context. (Sometimes the term “skip-trigram” is used to include patterns where Y isn’t immediately before Z. Elhage et al are using this definition because in their context of autoregressive transformers, the kind of trigrams that you can encode involve Y and Z being neighbors.) In the example I gave here, skip-trigrams can’t help, because the probability that the next token after Q is T is 50% after conditioning on the presence of any single earlier token. This can be solved by a one-layer, two-headed transformer We can solve this problem with a one-layer transformer with two heads. The first attention head has the following behavior, when attending from th... Feb 17, 2023 AF - Automating Consistency by Hoagy 02:27 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Automating Consistency, published by Hoagy on February 17, 2023 on The AI Alignment Forum. tldr: Ask models to justify statements. Remove context, ask if statements are true/good. If not, penalise. Apply this again to the justifying statements. Status: Just a quick thought. Doubt this is a new idea but I don't think I've encountered it. Happy to delete if it's a duplicate. Mods: If you think this is closer to capability work than alignment work please remove. Background A failure of current LLMs is that after they've said something that's incorrect, they can then double down and spout nonsense to try and justify their past statements. (Exhibit A: Sydney Bing vs Avatar 2) We can suppress this by giving it poor ratings in RLHF, but perhaps we can do better by automating the process. Setup: We start with a standard RLHF context. We have an LLM which assigns probabilities to statements (can extract this from the logits of the tokens 'Yes' and 'No'). These can be propositions about the world, X, or about the relationship between propositions, X supports Y. To make it easier, we fine-tune or prompt to give these statements within a defined syntax. We also have a value model that evaluates sequences, on which the LLM is trained to perform well. Method: We prompt the model to make true statements {T} and then to provide logical or empirical support for these claims, {S}. We then remove the context and ask the model whether supporting statement Si is true. Separately we also ask whether, if true, Si would support Ti. If either of these conditions are not met, we add a strong negative penalty to the value model's evaluation of the original outputs. Train for higher value model scores while incorporating this penalty. Apply the same procedure each of the supporting statements Si Value Consistency: This could be combined with values-based fine-tuning by alternating logical consistency with asking it whether the output is consistent with the preferred values. This is similar to Anthropic's Constitutional AI but by combining it with the ability to recurse down the tree of justifications, it may be able to embed the values ore deeply in its behaviour. The recent genre of 'would you rather say slur X or kill Y people' represents the kind of failure I imagine this could help prevent. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 17, 2023 AF - Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) by Lawrence Chan 02:57 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic), published by Lawrence Chan on February 16, 2023 on The AI Alignment Forum. This is a followup to what I cheekily call Anthropic's "just try to get the large model to do what you want" research agenda. (Previously: A General Language Assistant as a Laboratory for Alignment, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Language Models (Mostly) Know What They Know) The most interesting takeaway for me is that this is the first paper where Anthropic benchmarks their 175B parameter language model (probably a Claude variant). Previous papers only benchmarked up to 52B parameters. However, we don't have the performance of this model on standard benchmarks (the only benchmarked model from Anthropic is a 52B parameter one called standford-online-all-v4-s3). They also don't give details about its architecture or pretraining procedure. In this paper (Ganguli and Askell et al.), the authors study what happens when you just ... ask the language model to be less biased (that is, change their answers based on protected classes such as age or gender). They consider several setups: asking questions directly (Q), adding in the instruction to not be biased (Q+IF), giving it the instruction + chain of thought (Q+IF+CoT), and in some cases, asking it to match particular statistics. They find that as you scale the parameter count of their RLHF'ed language models, the models become more biased, but they also become increasingly capable of correcting for their biases: They also report how their model changes as you take more RLHF steps: First, this suggests that RLHF is having some effect on instruction following: the gap between the Q and Q+IF setups increases as you scale the number of RLHF steps, for both BBQ and admissions discrimination. (I'm not sure what's happening for the gender bias one?) However, simply giving the language model instructions and prompting it to do CoT, even after 50 RLHF steps, seems to have a significantly larger effect than RLHF. I was also surprised at how few RLHF steps are needed to get instruction following -- the authors only consider 50-1000 steps of RLHF, and see instruction following even after 50 RLHF steps. I wonder if this is a property of their pretraining process, a general fact about pretrained models (PaLM shows significant 0-shot instruction following capabilities, for example), or if RLHF is just that efficient? The authors caution that they've done some amount of prompt engineering, and "have not systematically tested for this in any of our experiments." They use the same RLHF procedure as in Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 16, 2023 AF - EIS V: Blind Spots In AI Safety Interpretability Research by Stephen Casper 18:32 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS V: Blind Spots In AI Safety Interpretability Research, published by Stephen Casper on February 16, 2023 on The AI Alignment Forum. Part 5 of 12 in the Engineer’s Interpretability Sequence. Thanks to Anson Ho, Chris Olah, Neel Nanda, and Tony Wang for some discussions and comments. TAISIC = “the AI safety interpretability community” MI = “mechanistic interpretability” Most AI safety interpretability work is conducted by researchers in a relatively small number of places, and TAISIC is closely-connected by personal relationships and the AI alignment forum. Much of the community is focused on a few specific approaches like circuits-style MI, mechanistic anomaly detection, causal scrubbing, and probing. But this is a limited set of topics, and TAISIC might benefit from broader engagement. In the Toward Transparent AI survey (Räuker et al., 2022), we wrote 21 subsections of survey content. Only 1 was on circuits, and only 4 consisted in significant part of works from TAISIC. I have often heard people in TAISIC explicitly advising more junior researchers to not focus much on reading from the literature and instead to dive into projects. Obviously, experience working on projects is irreplaceable. But understanding the broader literature and community is a recipe for developing insularity and blind spots. I am quick to push back against advice that doesn’t emphasize the importance of engaging with outside work. Within TAISIC, I have heard interpretability research described as dividing into two sets: mechanistic interpretability and, somewhat pejoratively, “traditional interpretability.” I will be the first to say that some paradigms in interpretability research are unproductive (see EIS III-IV). But I give equal emphasis to the importance of TAISIC not being too parochial. Reasons include maintaining relevance and relationships in the broader community, drawing useful inspiration from past works, making less-correlated bets with what we focus on, and most importantly – not reinventing, renaming, and repeating work that has already been done outside of TAISIC. TAISIC has reinvented, reframed, or renamed several paradigms Mechanistic interpretability requires program synthesis, program induction, and/or programming language translation “Circuits”-style MI is arguably the most popular and influential approach to interpretability in TAISIC. Doing this work requires iteratively (1) generating hypotheses for what a network is doing and then (2) testing how valid these hypotheses explain its internal mechanisms. Step 2 may not be that difficult, and causal scrubbing (discussed below) seems like a type of solution that will be useful for it. But step 1 is hard. Mechanistic hypothesis generation is a lot like doing program synthesis, program induction, and/or programming language translation. Generating mechanistic hypotheses requires synthesizing programs to explain a network using its behavior and/or structure. If a method for this involves synthesizing programs based on the task or I/O from the network, it is a form of program synthesis or induction. And if a method is based on using a network’s structure to write down a program to explain it, it is very similar to programming language translation. In general, program synthesis and program induction are very difficult and currently fail to scale to large problems. This is well-understood, and these fields are mature enough so that we have textbooks on them and how difficult they are (e.g. Gulwani et al., 2017). Meanwhile, programming language translation is very challenging too. In practice, translating between common languages (e.g. Python and Java) is only partially automatable and relies on many hand-coded rules (Qiu, 1999), and using large language models has had very limited successes (Roziere et al.). And in cases like ... Feb 16, 2023 AF - EIS IV: A Spotlight on Feature Attribution/Saliency by Stephen Casper 07:09 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS IV: A Spotlight on Feature Attribution/Saliency, published by Stephen Casper on February 15, 2023 on The AI Alignment Forum. Part 4 of 12 in the Engineer’s Interpretability Sequence. Thanks to Tony Wang for a helpful comment. If you want to become more familiar with feature attribution/saliency, a tutorial on them that may offer useful background is Nielsen et al. (2021). Given a model and an input for it, the goal of feature attribution/saliency methods is to identify what features in the input are influential for the model’s decision. The literature on these methods is large and active with many hundreds of papers. In fact, in some circles, the word “interpretability” and especially the word “explainability” are more or less synonymous with feature attribution (some examples are discussed below). But despite the size of this literature, there are some troubles with the research on these methods that are fairly illustrative of broader ones with interpretability overall. Hence this post. There are some analogous ones in AI safety work that will be discussed more in the next two posts in the sequence. Troubles with evaluation and performance Some examples and troubles with the evaluation of feature attributions were already touched on in EIS III which discussed Pan et al. (2021) and Ismail et al. (2021). The claim from Pan et al. (2021) that their method is “obviously better” than alternatives exemplifies how these methods are sometimes simply declared successful after inspection from researchers. And Ismail et al. (2021) demonstrates a form of weak evaluation with a measure that may be quantitative but is not of direct interest to an engineer. In response to this literature, several works have emerged to highlight difficulties with feature attribution/saliency methods. Here is a short reading list :) A Benchmark for Interpretability Methods in Deep Neural Networks (Hooker et al., 2018) Sanity Checks for Saliency Maps (Adebayo et al., 2018) Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (Hase and Bansal, 2020) Debugging Tests for Model Explanations (Adebayo et al., 2020) Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior (Denain and Steinhardt, 2022) Towards Benchmarking Explainable Artificial Intelligence Methods (Holmberg, 2022) Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023) When they are evaluated, these tools often aren’t very useful and do not pass simple sanity checks. Consider an illustration of this problem: From Adebayo et al. (2018) These visualizations suggest that some of these tools do not reliably highlight features that seem important in images at all, and the ones that do often highlight them do not appear to be obviously better than an edge detector. This sanity check suggests limitations with how well these methods can reveal anything novel to humans at all, let alone how useful they can be in tasks of practical interest. For the papers that have gone further and studied whether these methods can help predict how the network will respond to certain inputs, it seems that some attribution/saliency methods usually fail while others only occasionally succeed (Hase and Bansal, 2020; Adebayo et al., 2020; Denain and Steinhardt, 2022). EIS III discussed how in a newly arXived work, coauthors and I benchmarked feature synthesis tools (Casper et al., 2023). In addition, we use a related approach to evaluate how helpful feature attribution/saliency methods can be for pointing out spurious features that the network has learned. This method was based on seeing how well a method can attribute a trojaned network’s decision to the trojan trigger in an image. From Casper et al. (2023) Shown at the top of the figure above are examples of trojaned ima... Feb 15, 2023 AF - Don't accelerate problems you're trying to solve by Andrea Miotti 09:02 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Don't accelerate problems you're trying to solve, published by Andrea Miotti on February 15, 2023 on The AI Alignment Forum. If one believes that unaligned AGI is a significant problem (>10% chance of leading to catastrophe), speeding up public progress towards AGI is obviously bad. Though it is obviously bad, there may be circumstances which require it. However, accelerating AGI should require a much higher bar of evidence and much more extreme circumstances than is commonly assumed. There are a few categories of arguments that claim intentionally advancing AI capabilities can be helpful for alignment, which do not meet this bar. Two cases of this argument are as follows It doesn't matter much to do work that pushes capabilities if others are likely to do the same or similar work shortly after. We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems. We address these two arguments directly, arguing that the downsides are much higher than they may appear, and touch on why we believe that merely plausible arguments for advancing AI capabilities aren’t enough. Dangerous argument 1: It doesn't matter much to do work that pushes capabilities if others are likely to do the same or similar work shortly after. For a specific instance of this, see Paul Christiano’s “Thoughts on the impact of RLHF research”: RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems [.] RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress [.] Markets aren’t efficient, they only approach efficiency under heavy competition when people with relevant information put effort into making them efficient. This is true for machine learning, as there aren’t that many machine learning researchers at the cutting edge, and before ChatGPT there wasn’t a ton of market pressure on them. Perhaps something as low hanging as RLHF or something similar would have happened eventually, but this isn’t generally true. Don’t assume that something seemingly obvious to you is obvious to everyone. But even if something like RLHF or imitation learning would have happened eventually, getting small steps of progress slightly earlier can have large downstream effects. Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences: The red line indicates when the first “lethal” AGI is deployed, and thus a hard deadline for us to solve alignment. A slight increase in progress now can lead to catastrophe significantly earlier! Pushing us up the early progress exponential has really bad downstream effects! And this is dangerous decision theory too: if every alignment researcher took a similar stance, their marginal accelerations would quickly add up. Dangerous Argument 2: We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems. Again, from Paul: Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong. But there is no clear distinction between eliminating capability overhangs and discovering new capabilities. Eliminating capability overhangs is discovering AI capabilities faste... Feb 15, 2023 AF - EIS III: Broad Critiques of Interpretability Research by Stephen Casper 19:35 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS III: Broad Critiques of Interpretability Research, published by Stephen Casper on February 14, 2023 on The AI Alignment Forum. Part 3 of 12 in the Engineer’s Interpretability Sequence. Right now, interpretability is a major subfield in the machine learning research community. As mentioned in EIS I, there is so much work in interpretability that there is now a database of 5199 interpretability papers (Jacovi, 2023). You can also look at a survey from some coauthors and me on over 300 works on interpreting network internals (Räuker et al., 2022). The key promise of interpretability is to offer open-ended ways of understanding and evaluating models that help us with AI safety. And the diversity of approaches to interpretability is encouraging since we want build a toolbox full of many different useful techniques. But despite how much interpretability work is out there, the research has not been very good at producing competitive practical tools. Interpretability tools lack widespread use by practitioners in real applications (Doshi-Velez and Kim, 2017; Krishnan, 2019; Räuker et al., 2022). The root cause of this has much to do with interpretability research not being approached with as much engineering rigor as it ought to be. This has become increasingly well-understood. Here is a short reading list for anyone who wants to see more takes that are critical of interpretability research. This post will engage with each of these more below. The Mythos of Model Interpretability (Lipton, 2016) Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez and Kim, 2017) Explanation in Artificial Intelligence: Insights from the Social Sciences (Miller, 2017) Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (Rudin, 2018) Against Interpretability: a Critical Examination of the Interpretability Problem in Machine Learning (Krishnan, 2019) Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Räuker et al., 2022) Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023) Note that I’m an author on the final two, so references to these papers are self-references. Also my perspectives here are my own and should not be assumed to necessarily reflect those of coauthors. The goal of this post is to overview some broad limitations with interpretability research today. See also EIS V and EIS VI which discuss some similar themes in the context of AI safety and mechanistic interpretability research. The central problem: evaluation The hardest thing about conducting good interpretability research is that it’s not clear whether an interpretation is good or not when there is no ground truth to compare it to. Neural systems are complex, and it’s hard to verify that an interpretation faithfully describes how a network truly functions. So what does it even mean to be meaningfully interpreting a network? There is unfortunately no agreed upon standard. Motivations and goals of interpretability researchers are notoriously “diverse and discordant” (Lipton, 2018). But here, we will take an engineer’s perspective and consider interpretations to be good to the extent that they are useful. Evaluation by intuition is inadequate. Miller (2019) observes that “Most work in explainable artificial intelligence uses only the researchers’ intuition of what constitutes a ‘good’ explanation”. Some papers and posts have even formalized evaluation by intuition. Two examples are Yang et al. (2019) and Kirk et al. (2020) who proposed evaluation frameworks that included a criterion called “persuadability.” This was defined by Yang et al. (2019) as “subjective satisfaction or comprehensibility for the corresponding explanation.” This is not a very good criterion from an enginee... Feb 14, 2023 AF - The Cave Allegory Revisited: Understanding GPT's Worldview by Jan Kulveit 05:02 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Cave Allegory Revisited: Understanding GPT's Worldview, published by Jan Kulveit on February 14, 2023 on The AI Alignment Forum. A short post describing a metaphor I find useful, in particular for explaining some intuitions about systems like GPT to people who don't have deeper technical knowledge about large generative models. Plato's allegory of the cave has been a staple of philosophical discourse for millenia, providing a metaphor for understanding the limits of human perception. In the classical allegory, we are prisoners shackled to a wall of a cave, unable to experience reality directly but only able to infer it based on watching shadows cast on the wall.GPT can be thought of as a blind oracle residing in a deeper cave, where it does not even see the shadows but only hears our conversations in the first cave, always trying to predict the next syllable. It is remarkable that it still learns a lot about the world outside of the cave. Why does it learn this? Because, a model of reality outside of the cave and a decent amount of abstraction are useful for predicting the conversations in the first cave! Moreover, GPT also learns about the speakers in the first cave, as understanding their styles and patterns of speech is crucial for its prediction task. As the speakers are closer to GPT, understanding their styles is in some sense easier and more natural than guessing what's outside of the cave. What does the second cave allegory illustrate? The first insight from the allegory is: if you are in GPT's place, part of the difficulty in figuring out what's going on outside the cave, is that people in the first cave talk a lot about other things apart from the shadows of the real world. Sometimes, they talk about happenings in Middle Earth. Or about how the shadows would look in some counterfactual world. As humans, we are blessed with the luxury of being able to compare such statements to the shadows and determine their veracity. The difference between conversations about fantasy and the shadows of the real world is usually extremely obvious to humans: we never see dragon shadows. In contrast, dragons do show up a lot in the conversations in the first cave; GPT doesn’t get to see the shadows, so it often needs to stay deeply uncertain about whether the speaker is describing the actual shadows or something else to be good at predicting the conversation. The second insight is that one of the biggest challenges for GPT in figuring out the conversation is localizing it, determining who is speaking and what the context is, just from the words. Is it a child regaling another child with a fairy-tale, or a CEO delivering a corporate address? As humans we do not face this conundrum often,because we can see the context in which the conversation is taking place. In fact, we would be worse than GPT at the task it has to deal with. At first, interacting with this type of blind oracle in the second cave was disorienting for humans. Talking to GPT used to be a bit like shouting something through a narrow tunnel into the second cave .and instead of an echo, getting back what the blind oracle hallucinates is the most likely thing that you or someone else would say next. Often people were confused by this. They shouted instructions and expected an answer, but the oracle doesn't listen to instructions or produce answers directly - it just hallucinates what someone might say next. Because on average in the conversations in the first cave questions are followed by answers, and requests by fulfilment, this sort of works. One innovation of ChatGPT, which made it popular with people, was localising the conversation by default: when you are talking with ChatGPT now, it knows that what follows is a conversation between a human - you - and a "helpful AI assistant". There is a subtle point to... Feb 14, 2023 AF - The conceptual Dopplegänger problem by Tsvi Benson-Tilsen 04:03 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The conceptual Dopplegänger problem, published by Tsvi Benson-Tilsen on February 12, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed 9 October 2022.] Suppose we want to observe the thoughts of a mind in order to detect whether it's making its way towards a plan to harm us, and ideally also to direct the mind so that it pursues specific aims. To this end, we might hope that the mind and its thinking are organized in a way we can come to understand in the way that we understand ourselves and our thinking. We might hope that when the mind considers plans that involve something, e.g. plans that involve the coffee cup, it does so using a concept alike to our concept [[coffee cup]]. When the mind recognizes, predicts, imagines, simulates, manipulates, designs, combines things with, describes, studies, associates things with, summarizes, remembers, compares things with, deduces things about, makes hypotheses about, or is otherwise mentally involved with the coffee cup, maybe it always does so in a way that is fully comprehendable in fixed terms that are similar to the terms in which we understand ourselves when we do those activities. Maybe the structure involved in psychic events in the mind reliably falls into basins of attraction that indicate unambiguously to us, as we observe these events, which nexi of reference that structure constitutes. Maybe the X-and-only-X problem is solved by ensuring that the mind's thoughts are in a language made of these concepts; when the mind plans to "fetch the coffee", it somehow means only fetching the coffee, in the "natural" sense of [[fetch]] and [[the coffee]]. One obstacle to this rosy picture is conceptual Dopplegängers. A conceptual Dopplegänger of some concept Z, is a concept Z' that serves some overlapping functions in the mind as Z serves, but is psychically distinct from Z. Here saying that Z' is psychically distinct from Z is ambiguous, but means something like: Z' is not transparently closely related to Z, or is mechanistically / physically separate from Z, or is referred to in a set of contexts that's systematically segregrated from the contexts in which Z is referred to, or is not explicitly described or treated as being the same as or similar to or analogous to Z. A Dopplegänger concept Z' enables a mind to think about what Z is about, at least in some respects, without psychically using Z. This makes it hard to be sure that the mind is not thinking about what Z is about; even if the mind is not using Z, it might be thinking about what Z is about by using some Z'. Maybe Dopplegängers of Z can be psychically located by doing something like looking for mental stuff that has high mutual logical information with Z. This might work to identify blatant deception: if the mind maintains a puppet show of fake thoughts using Z and has its real thoughts using a Z' that's psychically isomorphic to Z, then Z' will be obviously related to Z. But, Dopplegängers don't have to be so obvious. Mental stuff that constitutes skill with manipulating what Z is about, can be, compared to Z, more or less: partial implicit diffuse (diffused throughout other skills and knowledge) encrypted externalized transiently reconstructed out of precursors when needed structurally deep (and therefore alien to someone who thinks in terms of Z) Baldwinized to specific purposes and can be expressed in a different language or constituted by differently-factored concepts referred to in a set of contexts that's systematically segregrated from the contexts in which Z is referred to. All of these features make it harder to see that Z' is in some respects a Doppelgänger of Z. In other words, to the extent these features (and probably others) characterize mental stuff in the mind, the mind is liable to be thinking about coffee cups even wh... Feb 12, 2023 AF - Why almost every RL agent does learned optimization by Lee Sharkey 11:15 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why almost every RL agent does learned optimization, published by Lee Sharkey on February 12, 2023 on The AI Alignment Forum. Or "Why RL≈RL2 (And why that matters)" TL;DR: This post discusses the blurred conceptual boundary between RL and RL2 (also known as meta-RL). RL2 is an instance of learned optimization. Far from being a special case, I point out that the conditions under which RL2 emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence for why learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL2 agents. I've found myself telling this story about the relationship between RL and RL2 numerous times in conversation. When that happens, it's usually time to write a post about it. Most of the first half of the post (which points out that RL2 is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment. The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled up RL2 systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some. Background on RL2 RL2 (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', which is a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models it's often called 'in-context learning' (Olssen et al. 2022, Garg et al. 2022). RL2 is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): The RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate, inner RL algorithm (the optimization algorithm). The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents trained to exhibit RL2 exhibit rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent team et al. 2023), hypothesis driven exploration/experimentation (DeepMind Open Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL2 may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (which is the area of the brain most associated with those capabilties) implements an RL2 system (Wang et al. 2018). These cognitive capabilities are the kind of things that we're concerned about in powerful AI systems. RL2 is therefore a phenomenon that seems likely to underlie some major safety risks. The conditions under which RL2 emerges are the default RL training conditions Ingredients for an RL2 cake The four 'ingredients' required for RL2 to emerge are: The agent must have observations that correlate with reward. The agent must have observations that correlate with its history of actions. The agent must have a memory state that persists through time in which the RL2 algorithm can be implemented. The agent must be trained on a distribution of tasks. These conditions let the agent learn an RL2 algorithm because they let the agent learn to adapt its actions to a particular task according to what led to reward. Here's a more detailed picture of the mechanism by which these ingredients lead to RL2: Thanks to (1), agents tend to learn representations that identify if the agent is getting closer to valuable states. Thanks to... Feb 12, 2023 AF - A note on 'semiotic physics' by metasemi 12:43 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A note on 'semiotic physics', published by metasemi on February 11, 2023 on The AI Alignment Forum. Introduction This is an attempt to explain to myself the concept of semiotic physics that appears in the original Simulators post by janus and in a later post by Jan Hendrik Kirchner. Everything here comes from janus and Jan's work, but any inaccuracies or misinterpretations are all mine. TL;DR The prototypical simulator, GPT, is sometimes said to "predict the next token" in a text sequence. This is accurate, but incomplete. It's more illuminating to consider what happens when GPT, or any simulator, is run repeatedly to produce a multi-token forward trajectory, as in the familiar scenario of generating a text completion in response to a prompt. The token-by-token production of output is stochastic, with a branch point at every step, making the simulator a multiverse generator analogous to the time evolution operator of quantum mechanics. In this analogical sense, a simulator such as GPT implements a "physics" whose "elementary particles" are linguistic tokens. When we experience the generated output text as meaningful, the tokens it's composed of are serving as semiotic signs. Thus we can refer to the simulator's physics-analogue as semiotic physics. We can explore the simulator's semiotic physics through experimentation and careful observation of the outputs it actually produces. This naturalistic approach is complementary to analysis of the model's architecture and training. Though GPT's outputs often contain remarkable renditions of the real world, the relationship between semiotic physics and quantum mechanics remains analogical. It's a misconception to think of semiotic physics as a claim that the simulator's semantic world approximates or converges on the real world. Trajectories GPT, the prototypical simulator, is often said to "predict the next token" in a sequence of text. This is true as far as it goes, but it only partially describes typical usage, and it misses a dynamic that's essential to GPT's most impressive performances. Usually, we don't simply have GPT predict a single token to follow a given prompt; we have it roll out a continuous passage of text by predicting a token, appending that token to the prompt, predicting another token, appending that, and so on. Thinking about the operation of the simulator within this autoregressive loop better matches typical scenarios than thinking about single token prediction, and is thus a better fit to what we typically mean when we talk about GPT. But there's more to this distinction than descriptive point of view. Crucially, the growing sequence of prompt+output text, repeatedly fed back into the loop, preserves information and therefore constitutes state, like the tape of a Turing machine. In the Simulators post, janus writes: I think that implicit type-confusion is common in discourse about GPT. “GPT”, the neural network, the policy that was optimized, is the easier object to point to and say definite things about. But when we talk about “GPT’s” capabilities, impacts, or alignment, we’re usually actually concerned about the behaviors of an algorithm which calls GPT in an autoregressive loop repeatedly writing to some prompt-state... The Semiotic physics post defines the term trajectory to mean the sequence of tokens—prompt plus generated-output-so-far—after each iteration of the autoregressive loop. In semiotic physics, as is common in both popular and technical discourse, by default we talk about GPT as a generator of (linguistic) trajectories, not context-free individual tokens. Simulators are multiverse generators GPT's token-by-token production of a trajectory is stochastic: at each autoregressive step, the trained model generates an output probability distribution over the token vocabulary, samples from t... Feb 11, 2023 AF - The Need For Work On Technical AI Alignment (Intro Explainer) by Daniel Eth 00:28 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Need For Work On Technical AI Alignment (Intro Explainer), published by Daniel Eth on February 11, 2023 on The AI Alignment Forum. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 11, 2023 AF - A proposed method for forecasting transformative AI by Matthew Barnett 17:10 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A proposed method for forecasting transformative AI, published by Matthew Barnett on February 10, 2023 on The AI Alignment Forum. In 2021, I proposed measuring progress in the perplexity of language models and extrapolating past results to determine when language models were expected to reach roughly "human-level" performance. Here, I build on that approach by introducing a more systematic and precise method of forecasting progress in language modeling that employs scaling laws to make predictions. The full report for this forecasting method can be found in this document. In this blog post I'll try to explain all the essential elements of the approach without providing excessive detail regarding the technical derivations. This approach can be contrasted with Ajeya Cotra's Bio Anchors model, providing a new method for forecasting the arrival of transformative AI (TAI). I will tentatively call it the "Direct Approach", since it makes use of scaling laws directly to make predictions about compute requirements for AI. Naturally, the Direct Approach is a very speculative framework and might end up being useless for forecasting TAI (in fact, I consider this the most likely outcome). Nonetheless, I'm hopeful that something like it can serve as a better foundation than current TAI timelines models, which I currently think are likely even worse. Note that there may be errors in the report and Colab notebook, as they were not extensively fact-checked. Some background In a nutshell, this approach is simply about taking the cross-entropy loss of an autoregressive model and trying to find a way of interpreting that quantity qualitatively: that is, something we can put on a chart and extrapolate until the quantity reaches a natural threshold that we identify with something important. In my 2021 post about predicting language model performance, I drew a trendline through a plot of language model perplexities on various benchmarks and noted when the trendline went through estimates of "human-level" perplexity. This approach felt reasonable to me at the time, but I now think it too easily hand-waved away some important details. The error of omission I committed in my old approach becomes more apparent when you think about language model performance from the perspective of scaling laws, for example the parametric scaling law from Hoffmann et al. 2022: Here, we see cross-entropy loss as a function of parameters N and training tokens D seen during training. Notably, if we take the limit as the number of parameters and training tokens goes to infinity, then we're left with E. Theoretically, E corresponds to the "entropy of natural text" under certain assumptions, which is precisely the thing I identified with "roughly human-level" performance in my previous post. In other words, if we take this scaling law naively, it seems as though it will take infinite compute to reach human-level performance. I believe the resolution to this apparent issue is to say that "human-level" performance will not be obtained when loss hits E, but rather some small level above E. How close to E is enough? Well, that's the question we tried to answer with this report. Summary of the Direct Approach We begin by considering a language task, which in this post will be scientific research for illustration. For simplicity, let's imagine that this task consists of writing high-quality research papers or reports, although more nuanced specifications are possible. Of course, real scientific research involves more than merely writing research papers. It involves proposing hypotheses, devising experiments, and collecting data, but for now, let's imagine that we can simplify all these steps into one step that involves writing high quality research papers. This simplification may not be entirely unrealistic, since if the ... Feb 10, 2023 AF - Why I’m not working on {debate, RRM, ELK, natural abstractions} by Steve Byrnes 14:31 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I’m not working on {debate, RRM, ELK, natural abstractions}, published by Steve Byrnes on February 10, 2023 on The AI Alignment Forum. [For background & spelling out the acronyms in the title, see: Debate (AI safety technique), Recursive Reward Modeling, Eliciting Latent Knowledge, Natural Abstractions.] When I say “Why I’m not working on X”, I am NOT trying to find a polite & diplomatic way to say “Nobody should work on X because X is unhelpful for AGI safety”. Hmm, OK, well, maybe it’s just a little bit that. But really, I don’t feel strongly. Instead, I think: A lot of disagreement about what a solution to technical AGI safety looks like is really downstream of disagreements about questions like “How will AGI be built? What will it look like? How will it work?” Nobody really knows the answers to those questions. So we should probably be contingency-planning, by going through any possible answers to those questions that at least some reasonable person finds plausible, and doing AGI safety research conditional on those answers being correct. But still, I have my own opinions about the answers to those questions, and obviously I think my opinions are right, and I am not going to work on something unless it makes sense on my own models. And since people ask me from time to time, it seems worth explaining why the various research programs in the post title do not seem to be a good use of time, on my own models of how AGI will be developed and what AGI will look like. I wrote this post quickly and did not run it by the people I’m (sorta) criticizing. Do not assume that I described anything fairly and correctly. Please leave comments, and I’ll endeavor to update this post or write a follow-up in the case of major errors / misunderstandings / mind-changes. (But maybe not until after the weekend.) (By the way: If I’m not working on any of those research programs, then what am I working on? See here. I listed six other projects that seem particularly great to me here, and there are many others besides.) 1. Background 1.1 “Trying” to figure something out seems both necessary & dangerous (Partly self-plagiarized from here.) Let’s compare two things: “trying to get a good understanding of some domain by building up a vocabulary of new concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common! In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there’s some analogy between this and that?”, or you could think the thought “that’s a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities. In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track except in hindsight. And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful. Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space / priors / world-model) are the thing that gets rewarded. Next, consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level: Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transform... Feb 10, 2023 AF - Conditioning Predictive Models: Open problems, Conclusion, and Appendix by Evan Hubinger 17:27 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Open problems, Conclusion, and Appendix, published by Evan Hubinger on February 10, 2023 on The AI Alignment Forum. This is the final of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. 7. Open problems We think that there are a wide variety of ways—both experimental and theoretical—in which our analysis could be expanded upon. Here, we’ll try to briefly lay out some of the future directions that we are most excited about—though note that this is only a sampling of some possible future directions, and is thus a highly incomplete list: Are pre-trained LLMs well-modeled as predictive models or agents? As pre-trained model scale increases, do markers of agentic behavior increase as well? See “Discovering Language Model Behaviors with Model-Written Evaluations” for some initial results on this question. To what extent do LLMs exhibit distributional generalization? Distributional generalization seems like evidence of acting as a generative/predictive model rather than just optimizing cross-entropy loss. To the extent that current LLMs are doing some sort of prediction, can we find evidence of that in their internal structure? Is the RLHF conditioning hypothesis true? How do markers of agentic behavior change as the amount of RLHF done increases, and under different RLHF fine-tuning regimes? See “Discovering Language Model Behaviors with Model-Written Evaluations” for some initial results on this question. For anything that an RLHF model can do, is there always a prompt that gets a pre-trained model to do the same thing? What about a soft prompt or a prompt chain? In addition to validating the extent to which RLHF models can be mimicked using techniques that are more clearly implementing a conditional, a positive result here could also provide an alternative to RLHF that allows us to get the same results without relying on the RLHF conditioning hypothesis at all. More generally, how similar are RLHF fine-tuned models to pre-trained models with fine-tuned soft prompts? The idea here being that a soft prompt is perhaps more straightforward to think of as a sort of conditional. To what extent do RLHF fine-tuned models exhibit distributional generalization? Relevant here for the same reason as in the pre-training case. To what extent can you recover the original pre-trained distribution/capabilities from an RLHF fine-tuned model? If an RLHF model no longer successfully solves some prediction task by default, how easy is it to turn back on that capability via additional fine-tuning, or did the RLHF destroy it completely? If it is generally possible to do this, it is some evidence that the original pre-trained distribution is still largely maintained in the RLHF model. How do markers of agentic behavior change as we change the RL reward? Is it very different between human-like and random rewards? What happens if we exactly invert the standard helpfulness reward? This can help test whether agency is coming from the specific choice of RL reward or the general process of RLHF. How do RLHF fine-tuned models differ from their own preference model, especially regarding markers of agentic behavior? To the extent that fine-tuned models get closer to their preference models as scale increases, preference models can serve as a proxy for future RLHF models. Are there ways of changing standard RLHF techniques to make them more likely to produce conditionals rather than agents? How do alternative, more myopic RL training schemes—such as the one described here—affect markers of agentic behavior? Can we use such techniques... Feb 10, 2023 AF - Cyborgism by Nicholas Kees Dupuis 58:57 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cyborgism, published by Nicholas Kees Dupuis on February 10, 2023 on The AI Alignment Forum. Thanks to Garrett Baker, David Udell, Alex Gray, Paul Colognese, Akash Wasil, Jacques Thibodeau, Michael Ivanitskiy, Zach Stein-Perlman, and Anish Upadhayay for feedback on drafts, as well as Scott Viteri for our valuable conversations. Executive summary: This post proposes a strategy for safely accelerating alignment research. The plan is to set up human-in-the-loop systems which empower human agency rather than outsource it, and to use those systems to differentially accelerate progress on alignment. Introduction: An explanation of the context and motivation for this agenda. Automated Research Assistants: A discussion of why the paradigm of training AI systems to behave as autonomous agents is both counterproductive and dangerous. Becoming a Cyborg: A proposal for an alternative approach/frame, which focuses on a particular type of human-in-the-loop system I am calling a “cyborg”. Failure Modes: An analysis of how this agenda could either fail to help or actively cause harm by accelerating AI research more broadly. Testimony of a Cyborg: A personal account of how Janus uses GPT as a part of their workflow, and how it relates to the cyborgism approach to intelligence augmentation. Terminology GPT: Large language models trained on next-token prediction. Most plans to accelerate research (including this one) revolve around leveraging GPTs specifically. I will mostly be using “GPT'' to gesture at the base models which have not been augmented using reinforcement learning. Autonomous Agent: An AI system which can be well modeled as having goals or preferences, and deliberately selects actions in order to achieve them (with limited human assistance). Capabilities research: Research which directly improves the capabilities of AI systems and thereby brings us closer to being able to train and deploy more powerful autonomous agents. Simulator: A class of AI system (of which GPT is a member). Simulators are generative predictive models, where the model makes a prediction (probability distribution) about how the state of a system will evolve, and then the state is updated by sampling from that prediction/distribution. The result is a process which “simulates” the training distribution, the limit of such a process being a system which faithfully generates trajectories sampled from the distribution implied by the training data. Disempowerment: The process of humans losing control of the long-term future to a powerful autonomous agent (or agents). This includes anything from our civilization being hijacked to outright human extinction. Introduction There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near term systems, expecting the ability to significantly speed up progress to appear well after misalignment and deception. Furthermore, progress in this area may directly shorten timelines or enable the creation of dual purpose systems which significantly speed up capabilities research. OpenAI recently released their alignment plan. It focuses heavily on outsourcing cognitive work to language models, transitioning us to a regime where humans mostly provide oversight to automated research assistants. While there have been a lot of objections to and concerns about this plan, there hasn’t been a strong alternative approach aiming to automate alignment research which also takes all of the many risks seriously. The intention of this post is not to propose an end-all cure for the tricky problem of accelerating alignment using GPT models. Instead, the purpose is to explicitly put another point on the map of possible strategies, ... Feb 10, 2023 AF - FLI Podcast: Connor Leahy on AI Progress, Chimps, Memes, and Markets (Part 1/3) by remember 01:07:52 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: FLI Podcast: Connor Leahy on AI Progress, Chimps, Memes, and Markets (Part 1/3), published by remember on February 10, 2023 on The AI Alignment Forum. We often prefer reading over listening to audio content, and have been testing transcribing podcasts using our new tool at Conjecture, Verbalize, with some light editing and formatting. We're posting highlights and transcripts of podcasts in case others share our preferences, and because there is a lot of important alignment-relevant information in podcasts that never made it to LessWrong.If anyone is creating alignment-relevant audio content and wants to transcribe it, get in touch with us and we can give you free credits! The podcast episode transcribed in this post is available here. Topics covered include: Defining artificial general intelligence What makes humans more powerful than chimps? Would AIs have to be social to be intelligent? Importing humanity's memes into AIs How do we measure progress in AI? Gut feelings about AI progress Connor's predictions about AGI Is predicting AGI soon betting against the market? How accurate are prediction markets about AGI? Books cited in the episode include: The Incerto Series by Nassim Nicholas Taleb The Selfish Gene, Richard Dawkins Various books on primates and animal intelligence by Frans De Wall Inadequate Equilibria by Eliezer Yudkowsky Highlights On intelligence in humans and chimps: We are more social because we're more intelligent and we're more intelligent because we are more social. These things are not independent variables. So at first glance, if you look at a human brain versus a chimp brain, it's basically the same thing. You see like all the same kind of structures, same kind of neurons, though a bunch of parameters are different. You see some more spindle cells, it's bigger. Human brain just has more parameters, it's just GPT-3 versus GPT-4... But really, the difference is, is that humans have memes. And I mean this in the Richard Dawkins sense of evolved, informational, programmatic virtual concepts that can be passed around between groups. If I had to pick one niche, what is the niche that humans are evolved for? I think the niche we're evolved for is memetic hosts. On benchmarks and scaling laws: Benchmarks are actually coordination technologies. They're actually social technologies. What benchmarks are fundamentally for is coordination mechanisms. The kind of mechanisms you need to use when you're trying to coordinate groups of people around certain things.... So we have these scaling laws, which I think a lot of people misunderstand. So scaling laws give you these nice curves which show how the loss of performance on the model smoothly decreases as they get larger. These are actually terrible, and these actually tell you nothing about the model. They tell you what one specific number will do. And this number doesn't mean anything. There is some value in knowing the loss. But what we actually care about is can this model do various work? Can it do various tasks? Can it reason about its environment? Can it reason about its user?... So currently there are no predictive theories of intelligence gain or task. There is no theory that says once it reaches 74.3 billion parameters, then it will learn this task. There's no such theory. It's all empirical. And we still don't understand these things at all. I think there's, so another reason I'm kind of against benchmarks, and I'm kind of being a bit pedantic about this question is because I think they're actively misleading in the sense that people present them as if they mean something, but they just truly, truly don't. A benchmark in a vacuum means nothing. On the dangerous of having a good metric of progress towards AGI: So this is an interesting question. And not just from a scientific perspective, but it's als... Feb 10, 2023 AF - Conditioning Predictive Models: Deployment strategy by Evan Hubinger 14:35 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Deployment strategy, published by Evan Hubinger on February 9, 2023 on The AI Alignment Forum. This is the sixth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. 6. Deployment strategy Previously, we have been focusing on how to make conditioning predictive models as safe and competitive as possible. Now, we want to take a step back and discuss considerations for using conditioning predictive models to address AI existential risk and what sorts of difficulties we might run into doing so in practice. In particular, just as AI safety researchers naturally think of using predictive models for advancing AI safety research, AI capabilities researchers might naturally jump to using predictive models for advancing capabilities. It may not even be necessary to generate additional research to build AGI with a powerful predictive model. Simply ignoring the previously-discussed ELK-related difficulties and training a model to take actions that lead to predicted futures that a predicted human approves of may be sufficient. Either way, the existence of powerful predictive models seems likely to rapidly contract AI timelines. As a result, by the time predictive models can be used to predict a full solution to AI safety, the time available to do so is minimal—and as such, it is important to have fleshed out plans on how to use them safely well ahead of time. Dealing with other, less careful actors As we mentioned previously, using a predictive model to generate alignment research is only one possible use case—one that we restricted our attention to on the basis that we thought it contained the difficult aspects of using a predictive model safely. Restricting our attention to these sorts of particular conditionals—and figuring out how to do them safely—is fine if we have control over the ways in which our model will be used. If we don’t have that control, however—e.g. we are in a world where people are using predictive models in all sorts of different ways—then we have to consider what might happen when our predictive model is used in a much less careful way than described here and figure out how to either deal with or prevent that from happening. We think that getting other actors to use predictive models at all should be quite doable, for standard homogeneity reasons: why would a non-leading actor want to invest a ton of resources training a model in a different way than the way that the leading actor has already demonstrated successfully produces transformative AI? The problem, however, is that this same argument does not apply to what particular conditionals the non-leading actors might try, since trying a particular conditional is likely to be substantially cheaper than training an entire predictive model. In a multipolar world, one team using very careful conditioning to get a predictive model to generate good alignment research means that other teams will likely soon have equivalently good models and might use them less carefully—e.g. resulting in them accidentally predicting malign superintelligences. Even in a unipolar world, a member of the team that created the predictive model might try to predict their future great-grandchildren out of curiosity, or check the predicted stock prices when they plan to retire, and inadvertently become exposed to manipulative outputs. Since powerful predictive models can easily be used in less careful ways, any deployment strategy ... Feb 09, 2023 AF - EIS II: What is “Interpretability”? by Stephen Casper 06:18 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS II: What is “Interpretability”?, published by Stephen Casper on February 9, 2023 on The AI Alignment Forum. Part 2 of 12 in the Engineer’s Interpretability Sequence. A parable based on a true story In 2015, a Google’s image classification app classified many photos depicting black people as gorillas. Image from WSJ. Remember Google’s infamous blunder from 2015 in which users found that one of its vision API’s often misclassified black people as gorillas? Consider a parable of two researchers who want to understand and tackle this issue. Alice is an extremely skilled mechanistic interpretability researcher who spends a heroic amount of effort analyzing Google’s model. She identifies a set of neurons and weights that seem to be involved in the detection and processing of human and gorilla faces and bodies. She develops a detailed mechanistic hypothesis and writes a paper about it with 5 different types of evidence for her interpretation. Later on, another researcher who wants to test Alice’s hypothesis edits the model in a way that the hypothesis suggests would fix the problem. As it turns out, the hypothesis was imperfect, and the model now classifies many images of gorillas as humans! Bob knows nothing about neural networks. Instead of analyzing the network, he looks at the dataset that the model was trained on and notices a striking lack of black people (as was indeed the case in real life (Krishnan, 2020)). He suggests making the data more representative and training the model again. When this is done, it mostly fixes the problem without side effects. The goal of this parable is to illustrate that when it comes to doing useful engineering work with models, a mechanistic understanding may not always be the best way to go. We shouldn’t think of something called “interpretability” as being fundamentally separate from other tools that can help us accomplish our goals with models. And we especially shouldn’t automatically privilege some methods over others. In some cases, highly involved and complex approaches may be necessary. But in other cases like Alice’s, the interesting, smart, and paper-able solution to the problem might not only be harder but could also be more failure-prone. This isn’t to say that Alice’s work could never lead to more useful insights down the road. But in this particular case Alice’s smart approach was not as good as Bob’s simple one. Interpretability is a means to an end. Since I work and think about interpretability every day, I have felt compelled to adopt a definition for it. In a previous draft of this post, I proposed defining an interpretability tool as “any method by which something novel about a system can be better predicted or described.” And I think this is ok, but I have recently stopped caring about any particular definition. Instead, I think the important thing to understand is that “interpretability’ is not a term of any fundamental importance to an engineer. The key idea behind this post is that whatever we call “interpretability” tools are entirely fungible with other techniques related to describing, evaluating, debugging, etc. Does this mean that it’s the same thing as interpretability if we just calculate performance on a test set, train an adversarial example, do some model pruning, or make a prediction based on the dataset? Pretty much. For all practical intents and purposes, these things all are all of a certain common type. Consider any of the following sentences. This model handles 85% of the data correctly. This input plus whatever is in this adversarial perturbation make the model fail. I got rid of 90% of the weights and the model’s performance only decreased by 2%. The dataset has this particular bias, so the model probably will as well. This model seems to have a circuit composed of these neurons and these wei... Feb 09, 2023 AF - The Engineer’s Interpretability Sequence (EIS) I: Intro by Stephen Casper 04:55 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Engineer’s Interpretability Sequence (EIS) I: Intro, published by Stephen Casper on February 9, 2023 on The AI Alignment Forum. Part 1 of 12 in the Engineer’s Interpretability Sequence. If we want to reduce near and long term risks from AI, we should care a lot about interpretability tools. This is a very uncontroversial claim to make inside the AI safety community. Almost every agenda for safe advanced AI incorporates interpretability in some way. The key value of interpretability tools is that they aid in human oversight by enabling open-ended evaluation. Short of actually deploying a system, any method of evaluating it can only be a proxy for its actual performance. The most common way to evaluate a model is by its performance in some test set or environment. But test sets alone can fail to reveal – and often incentivize – undesirable solutions involving overfitting, biases, deception, etc. This highlights the need for other ways to evaluate models, and an interpretability toolbox full of effective tools may go a long way. Some of the seeds of the AI safety community’s interest in interpretability were planted by Distill in 2017. But 2022 was an inflection point with a massive new surge in interest and work on interpretability tools. Anthropic was founded a little over a year ago. ARC started less than a year ago. Redwood has begun to push for much more interpretability work, including with the REMIX program. We are seeing a number of pushes to get many more people involved in interpretability work. And as someone on the ground, I have subjectively observed a surge in interest over 2022. And the popularity of interpretability hasn’t been limited to the AI safety community. There is now so much work in interpretability that we now have a dataset of 5199 interpretability papers (Jacovi, 2023). See also a survey of 300+ of them from some coauthors and me (Räuker et al., 2022). Growth in the interpretability literature by year from Jacovi (2023). But despite all this work, interpretability research has limitations. One of the goals of this sequence is to argue that: Interpretability research both within the AI safety space and at large is not very productive and may be on course to stay this way. This is intentionally baitey, and I mean to make this point with a large amount of detail and nuance over the course of this sequence. But one striking thing about interpretability research is that: For all the interpretability work that exists, there is a significant gap between this research and engineering applications. This is not to say that purely exploratory work is not good and necessary. But the problem of AI safety is an engineering problem at its core. If one of our main goals for interpretability research is to help us with aligning highly intelligent AI systems in high stakes settings, shouldn’t we be seeing tools that are more helpful in the real world? Hence the name of this sequence: The Engineer’s Interpretability Sequence (EIS). This sequence will have twelve parts. EIS I: Intro EIS II: What is “Interpretability”? EIS III Broad critiques of Interpretability Research EIS IV: A Spotlight on Feature Attribution/Saliency EIS V: Blind Spots In AI Safety Interpretability Research EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety EIS VII: A Challenge for Mechanists EIS VIII: An Engineer’s Understanding of Deception EIS IX: Interpretability and Adversaries EIS X: Continual Learning, Modularity, Compression, and Biological Brains EIS XI: Moving Forward EIS XII: Summary In the coming days, I plan to post a new installment every day or so. Thanks to my labmates, advisor, friends, and many others in the interpretability community for lots of good conversations and inspiration in the past year. Thanks to Rio Popper for feedback on this intro post. I... Feb 09, 2023 AF - Notes on the Mathematics of LLM Architectures by Spencer Becker-Kahn 01:29 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notes on the Mathematics of LLM Architectures, published by Spencer Becker-Kahn on February 9, 2023 on The AI Alignment Forum. From a mathematical point of view, the building and training of a large transformerlanguage model (LLM) is the construction of a certain function, from some euclidean space to another, that has certain interesting properties. And it may therefore be surprising to find that many key papers announcing significant new LLMs seem reluctant to simply spell out the details of the function that they have constructed in plain mathematical language or indeed even in complete pseudo-code. The latter form of this complaint is the subject of the recent article of Phuong and Hutter [1]. Here, we focus on one aspect of the former perspective and seek to give a relatively ‘pure’ mathematical description of the architecture of an LLM. This short pdf is a set of notes I made, initially just for my own benefit, while trying to understand the architecture of 'decoder-only' LLMs. It draws heavily on Anthropic's Mathematical Framework for Transformers but is deliberately written in a 'pure math' style.It was while writing this up to actually post, that I started to develop the thoughts that led to my post about the mathematics of interpretability more generally.I still consider it something of a fragment or draft, but may develop it further. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 09, 2023 AF - On Developing a Mathematical Theory of Interpretability by Spencer Becker-Kahn 10:05 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Developing a Mathematical Theory of Interpretability, published by Spencer Becker-Kahn on February 9, 2023 on The AI Alignment Forum. If the trajectory of the deep learning paradigm continues, it seems plausible to me that in order for applications of low-level interpretability to AI not-kill-everyone-ism to be truly reliable, we will need a much better-developed and more general theoretical and mathematical framework for deep learning than currently exists. And this sort of work seems difficult. Doing mathematics carefully - in particular finding correct, rigorous statements and then finding correct proofs of those statements - is slow. So slow that the rate of change of cutting-edge engineering practices significantly worsens the difficulties involved in building theory at the right level of generality. And, in my opinion, much slower than the rate at which we can generate informal observations that might possibly be worthy of further mathematical investigation. Thus it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena. My impression however, is that the best applied mathematics doesn’t tend to work like this. My impression is that although the use of mathematics in a given field may initially be reactive and disunited, one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "What kind of thing seems most useful for understanding this object and making predictions?" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste. One consequence of this (which is a downside and also probably partly due to the inherent limitations of human mathematics) is that mathematics does not tend to act as an objective tool that you can bring to bear on whatever question it is that you want to think about. Instead, the very practice of doing mathematics seeks out the questions that mathematics is best placed to answer. It cannot be used to say something useful about just anything; rather it finds out what it is that it can say something about. Even after taking into account these limitations and reservations, developing something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI' might still be a fruitful endeavour. In case it is not clear, this is roughly speaking, because a) Many people are putting a lot of hope and resources into low-level interpretability; b) Its biggest hurdles will be making it 'work' at large scale, on large models, quickly and reliably; and c) - the sentiment I opened this article with - doing this latter thing might well require much more sophisticated general theory. In thinking about some of these themes, I started to mull over a couple of illustrative analogies or examples. The first - and more substantive example - is algebraic topology. This area of mathematics concerns itself with certain ways of assigning mathematical (specifically algebraic) information to shapes and spaces. Many of its foundational ideas have beautiful informal intuitions behind them, such as the notion that a shape my have enough space in it to contain a sphere, but not enough space to contain the ball that that sphere might have demarcated. Developing these informal notions into rigorous mathematics was a long and difficult process and learning this material - even now when it is presented in its ... Feb 09, 2023 AF - A note on terminology: AI Alignment != AI x-safety by David Scott Krueger 01:59 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A note on terminology: AI Alignment != AI x-safety, published by David Scott Krueger on February 8, 2023 on The AI Alignment Forum. I think the terms "AI Alignment" and "AI existential safety" are often used interchangeably, leading the ideas to be conflated.In practice, I think "AI Alignment" is mostly used in one of the following three ways, and should be used exclusively for Intent Alignment (with some vagueness about whose intent, e.g. designer vs. user):1) AI Alignment = How to get AI systems to do what we want 2) AI Alignment = How to get AI systems to try to do what we want 3) AI Alignment = A rebranding of “AI (existential) safety”... A community of people trying to reduce the chance of AI leading to premature human extinction. The problem with (1) is that it is too broad, and invites the response: "Isn't that what most/all AI research is about?"The problem with (3) is that it suggests that (Intent) Alignment is the one-and-only way to increase AI existential safety. Some reasons not to conflate (2) and (3): The case that increasing (intent) alignment increases x-safety seems much weaker on the margin than in the limit; the main effect of a moderate increase in intent alignment might simply be a large increase in demand for AI. Even perfect intent alignment doesn't necessarily result in a safe outcome; e.g. if everyone woke up 1000000x smarter tomorrow, the world might end by noon. X-safety can be increased through non-technical means, e.g. governance/coordination. In my experience, this sloppy use of terminology is common in this community, and leads to incorrect reasoning (if not in those using it than certainly at least sometimes in those hearing/reading it). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 08, 2023 AF - Conditioning Predictive Models: Interactions with other approaches by Evan Hubinger 17:42 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Interactions with other approaches, published by Evan Hubinger on February 8, 2023 on The AI Alignment Forum. This is the fifth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. 5. Interactions with other approaches Imitation learning One very related approach is imitation learning: rather than try to predict humans, as we are proposing, one could simply train to imitate them instead. Such an approach would have many of the same safety benefits, since it would also be exclusively trying to produce outputs from safe humans. The basic problem with such an approach, however, is that there’s no reason to believe that a model trained via pure imitation learning would generalize beyond the capability level of the human(s) it is trained to imitate. While using predictive models to predict humans also cannot produce outputs that humans would never be able to generate, it can produce outputs that no humans that it has ever previously seen would be able to generate, since it might e.g. predict that such humans will exist under some conditional. Thus, we think that predictive modeling at least has the potential to be just as safe as imitation learning while being able to generalize to substantially more advanced capabilities—though, similarly to imitation learning, predicting humans still cannot elicit capabilities beyond those that any conceivable human would be capable of, as we discussed previously. Supervised fine-tuning For some conditionals we might have a very precise notion of what we want the model to observe (e.g. “exactly this image coming from this camera”). Ideally, this sort of a conditional should be straightforwardly implementable via prompting, just by fixing the relevant tokens in the model’s context window.[1] However, at least for current models, prompting has some basic structural limitations—for example, if you want to condition on something very long, context window length could start to become quite problematic. In that sort of a case, it might be quite helpful to instead turn to supervised fine-tuning, fine-tuning on the observation to condition on rather than including it in a prompt. Effectively, this sort of fine-tuning lets you give the model substantially more bits of evidence for it to condition on than is possible via just prompting. For the most part, we think this is likely to be basically fine, since it’s essentially continuous with pre-training: if we think that pre-training produces the sort of predictive model we want, then including some extra pre-training-style data and fine-tuning on it should do the same. The primary concern here, however, would be situations where the fine-tuning data is for some reason not very continuous with the pre-training data. One way that the fine-tuning data could be substantively different than pre-training is if it directly depends on the model itself—e.g. fine-tuning on the model’s own outputs. Not only is this substantially less continuous with pre-training, but it also specifically raises the risk of the model imitating AIs and/or producing self-fulfilling prophecies. Such fine-tuning could also be particularly problematic if the data is specifically selected according to some criterion other than actual representativeness of the world—that is, if there’s no clear “camera” that corresponds to how the data was collected. Probably the most notable way this could happen is via reinforcement learning ... Feb 08, 2023 AF - OpenAI/Microsoft announce "next generation language model" integrated into Bing/Edge by Lawrence Chan 01:23 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI/Microsoft announce "next generation language model" integrated into Bing/Edge, published by Lawrence Chan on February 7, 2023 on The AI Alignment Forum. TL;DR: Microsoft and OpenAI announced a new version of Bing featuring "a new, next-generation OpenAI large language model [..] more powerful than ChatGPT", and that Microsoft Edge will feature a Copilot-like assistant that helps with composing and summarizing content. Brief thoughts/comments/notes: Microsoft's attitude during their press meeting seemed pretty aggressive and targeted directly at racing with Google. This seems kind of bad. For example, a quote from Nadella: "The race starts today, and we’re going to move and move fast. Most importantly, we want to have a lot of fun innovating again in search, because it’s high time." Microsoft built a scaffold for the new LM called "Prometheus", that lets them "best leverage its power". Microsoft has also used the new LM in their Bing search engine, though it's not clear exactly how. This seems way more hype than Google's Bard announcement. You can register for the new Bing beta on the Bing.com site. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 07, 2023 AF - Conditioning Predictive Models: Making inner alignment as easy as possible by Evan Hubinger 28:12 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Making inner alignment as easy as possible, published by Evan Hubinger on February 7, 2023 on The AI Alignment Forum. This is the fourth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. 4. Making inner alignment as easy as possible At the beginning, we posited the assumption that large language models could be well-understood as predictive models of the world. At the time, however, that was just an assumption—now, we want to return to that assumption and try to understand how likely it is to actually be true. Furthermore, in addition to needing a predictive model (as opposed to e.g. a deceptive agent), we also want our predictor to have a fixed, physical understanding of its cameras rather than operate as a general inductor to avoid the problem of anthropic capture. Additionally, as we’ll discuss in more depth in this section, we’ll also need a prediction model that is managing its own internal cognitive resources in the right way. Though we think that ensuring these desiderata could be quite difficult, we nevertheless think that this presents the easiest inner alignment problem that we are aware of among any potentially safe and competitive approaches. Furthermore, since we believe that inner alignment—and deceptive alignment in particular—pose some of the most dangerous and hardest to address of all known AI safety problems, we think that any improvement in the overall difficulty of that problem should be taken quite seriously as a reason to favor predictive model approaches. Plausible internal structures There are many possible ways large language models could work internally. Previously, we suggested some examples—specifically: an agent minimizing its cross-entropy loss, an agent maximizing long-run predictive accuracy, a deceptive agent trying to gain power in the world, a general inductor, and a predictive model of the world (with fixed, physical “cameras” translating world states into observed tokens). a loose collection of heuristics, a generative model of token transitions, a simulator that picks from a repertoire of humans to simulate, a proxy-aligned agent optimizing proxies like grammatical correctness, To start with, for our purposes here, we’ll eliminate those internal structures that don’t scale with capabilities—that is, we only want to consider plausible internal structures of models that perform well enough on the language model pre-training task that they are able to generalize to other cognitive tasks at a human level or above. Thus, we’ll eliminate (6) through (9) from the above list—(6) because predicting agents like humans should require some degree of optimization, (7)/(8) for the reasons we outlined previously that LLMs have to be able to predict the world, and (9) because such proxies should eventually yield worse performance than actual prediction.[1] That leaves us with (1)/(2), variants on a sycophantic reward maximizer; (3), a deceptive agent; and (4)/(5), different ways of directly attempting to produce predictions. The distinction between (4) and (5) lies in exactly how a model produces predictions given its understanding of the world, and we are quite uncertain about what that might look like in practice. Unfortunately, we expect that the only way to figure out how models accomplish this translation is through transparency and interpretability and not theoretical analysis.[2] Nevertheless, we still think it is possible to mak... Feb 07, 2023 AF - Conditioning Predictive Models: The case for competitiveness by Evan Hubinger 17:32 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: The case for competitiveness, published by Evan Hubinger on February 6, 2023 on The AI Alignment Forum. This is the third of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. 3. The case for competitiveness In addition to ensuring that we can condition predictive models safely, for such an approach to work as a way to actually reduce AI existential risk, we also need it to be the case that it is competitive—that is, that it doesn’t impose too much of an alignment tax. Following “How do we become confident in the safety of a machine learning system?” we’ll distinguish between two different aspects of competitiveness here that we’ll need to address: Training rationale competitiveness [(Implementation competitiveness)]: how hard the training rationale [(getting the model we want)] is to execute. That is, a proposal should fail on training rationale competitiveness if its training rationale is significantly more difficult to implement—e.g. because of compute or data requirements—than competing alternatives. Training goal competitiveness [(Performance competitiveness)]: whether, if successfully achieved, the training goal [(the model we want)] would be powerful enough to compete with other AI systems. That is, a proposal should fail on training goal competitiveness if it would be easily outcompeted by other AI systems that might exist in the world. To make these concepts easier to keep track of absent the full training stories ontology, we’ll call training rationale competitiveness implementation competitiveness, since it describes the difficulty of implementing the proposal, and training goal competitiveness performance competitiveness, since it describes the achievable performance for the resulting model. Implementation competitiveness The most generally capable models today, large language models, seem to be well-described as predictive models. That may change, but we think it is also at least quite plausible that the first human-level AGI will be some sort of predictive model, likely similar in structure to current LLMs. Furthermore, LLM pre-training in particular seems to be where most of the capabilities of the most advanced current models come from: the vast majority of compute spent training large language models is spent in pre-training, not fine-tuning. Additionally, our guess is that the fine-tuning that is done is best modeled as targeting existing capabilities rather than introducing entirely new capabilities. Assuming that, after pre-training, LLMs are well-understood as predictive models, that suggests two possibilities for how to think about different fine-tuning regimes: The fine-tuning resulted in a particular conditional of the original pre-trained predictive model. The fine-tuning targeted the capabilities by turning the predictive model into one that is no longer well-understood as predictive. In the first case, the conditioning predictive models approach would simply be a variation on the exact techniques currently used at the forefront of capabilities, making it hopefully implementation competitive by default.[1] The main way we think such an implementation competitiveness argument could fail is if the fine-tuning necessary to get the sort of conditionals we describe here is substantially harder than alternative fine-tuning paradigms. In particular, we think it is likely the case that our proposed solutions will add some amount of o... Feb 06, 2023 AF - Decision Transformer Interpretability by Joseph Bloom 35:55 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decision Transformer Interpretability, published by Joseph Bloom on February 6, 2023 on The AI Alignment Forum. TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts. Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app. Key Claims A 1-Layer Decision Transformer learns several contextual behaviours which are activated by a combination of Reward-to-Go/Observation combinations on a simple discrete task. Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework. The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme (initially left out for simplicity of analysis) of the state/observations, which introduces inductive biases that hamper the model. If you are short on time, I recommend reading: Dynamic Obstacles Environment Black Box Model Characterisation Explaining Obstacle Avoidance at positive RTG using QK and OV circuits Alignment Relevance Future Directions I would welcome assistance with: Engineering tasks like app development, improving the model, training loop, wandb dashboard etc. and people who can help me make nice diagrams and write up the relevant maths/theory in the app). Research tasks. Think more about how to exactly construct/interpret circuit analysis in the context of decision transformers. Translate ideas from LLMs/algorithmic tasks. Communication tasks: Making nicer diagrams/explanations. I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I’m also happy to collaborate on related projects. Introduction For my ARENA Capstone project, I (Joseph) started working on decision transformer interpretability at the suggestion of Paul Colognese. Decision transformers can solve reinforcement learning tasks when conditioned on generating high rewards via the specified “Reward-to-Go” (RTG). However, they can also generate agents of varying quality based on the RTG, making them simultaneously simulators, small transformers and RL agents. As such, it seems possible that identifying and understanding circuits in decision transformers would not only be interesting as an extension of current mechanistic interpretability research but possibly lead to alignment-relevant insights. Previous Work The most important background for this post is: The Decision Transformers paper showed how RL tasks can be solved with transformer sequence modelling. Figure 1 from their paper describes the critical components of a Decision Transformer. A Mathematical Framework for Transformer Circuits that describes how to think about transformers in the context of mechanistic interpretability. Important ideas include the ability to decompose the residual stream into the output of attention heads and MLPs, the QK circuits (decides if to write information to the residual stream), and OV circuits (decides what to write to the residual stream). The Understanding RL Vision, which analyses how an RL agent with a large CNN component responds to input features, attributing them as good or bad news in the value function and proposes the Diversity hypothesis - “Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).” Methods Environment - RL Environm... Feb 06, 2023 AF - Gradient surfing: the hidden role of regularization by Jesse Hoogland 08:09 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gradient surfing: the hidden role of regularization, published by Jesse Hoogland on February 6, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort In a previous post, I demonstrated that Brownian motion near singularities defies our expectations from "regular" physics. Singularities trap random motion and take up more of the equilibrium distribution than you'd expect from the Gibbs measure. In the computational probability community, this is a well-known pathology. Sampling techniques like Hamiltonian Monte Carlo get stuck in corners, and this is something to avoid. You typically don't want biased estimates of the distribution you're trying to sample. In deep learning, I argued, this behavior might be less a bug than a feature. The claim of singular learning theory is that models near singularities have lower effective dimensionality. From Occam's razor, we know that simpler models generalize better, so if the dynamics of SGD get stuck at singularities, it would suggest an explanation (at least in part) for why SGD works: the geometry of the loss landscape biases your optimizer towards good solutions. This is not a particularly novel claim. Similar versions of the claim been made before by Mingard et al. and Valle Pérez et al.. But from what I can tell, the proposed mechanism, of singularity "stickiness", is quite different. Moreover, it offers a new possible explanation for the role of regularization. If exploring the set of points with minimum training loss is enough to get to generalization, then perhaps the role of regularizer is not just to privilege "simpler" functions but also to make exploration possible. In the absence of regularization, SGD can't easily move between points of equal loss. When it reaches the bottom of a valley, it's pretty much stuck. Adding a term like weight decay breaks this invariance. It frees the neural network to surf the loss basin, so it can accidentally stumble across better generalizing solutions. So could we improve generalization by exploring the bottom of the loss basin in other ways — without regularization or even without SGD? Could we, for example, get a model to grok through random drift? .No. We can't. That is to say I haven't succeeded yet. Still, in the spirit of "null results are results", let me share the toy model that motivated this hypothesis and the experiments that have (as of yet) failed to confirm it. The inspiration: a toy model First, let's take a look at the model that inspired the hypothesis. Let's begin by modifying the example of the previous post to include an optional regularization term controlled by λ: We deliberately center the regularization away from the origin at c=(−1,−1) so it doesn't already privilege the singularity at the origin. Now, instead of viewing U(x) as a potential and exploring it with Brownian motion, we'll treat it as a loss function and use stochastic gradient descent to optimize for x. We'll start our optimizer at a uniformly sampled random point in this region and take T=100 steps down the gradient (with optional momentum controlled by β). After each gradient step, we'll inject a bit of Gaussian noise to simulate the "stochasticity." Altogether, the update rule for x is as follows: with momentum updated according to: and noise given by, If we sample the final obtained position, x(T) over independent initializations, then, in the absence of regularization and in the presence of a small noise term, we'll get a distribution that looks like the figure on the left. Unlike the case of random motion, the singularity at the origin is now repulsive. Good luck finding those simple solutions now. However, as soon as we turn on the regularization (middle figure) or increase the noise term (figure on the right), the singulari... Feb 06, 2023 AF - SolidGoldMagikarp (plus, prompt generation) by Jessica Rumbelow 23:53 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp (plus, prompt generation), published by Jessica Rumbelow on February 5, 2023 on The AI Alignment Forum. Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins. TL;DR Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew) We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.) Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen). Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for: eliciting knowledge generating adversarial inputs automating prompt search (e.g. for fine-tuning) In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post. Prompt generation First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (an excellent explanation here, if you're unfamiliar with the concept). We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data. So, how can we apply this approach to language models? Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above. So now we can do stuff like this: And this: We'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'. How do we do this? It's tricky, because unlike pixel values, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous space, albeit sparsely. (Most of this space doesn't correspond actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail. This kind of prompt generation is only possible because token embedding space has a kind of semantic coherence. Semantically related tokens tend to be found close together. We discov... Feb 05, 2023 AF - Control by Tsvi Benson-Tilsen 17:47 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Control, published by Tsvi Benson-Tilsen on February 5, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed 3 July 2022.] I don't know how to define control or even point at it except as a word-cloud, so it's probably wanting to be refactored. The point of talking about control is to lay part of the groundwork for understanding what determines what directions a mind ends up pushing the world in. Control is something like what's happening when values or drives are making themselves felt as values or drives. ("Influence" = "in-flow" might be a better term than "control".) Previous: Structure, creativity, and novelty Definitions of control Control is when an element makes another element do something. This relies on elements "doing stuff". Control is when an element {counterfactually, evidentially, causally, logically...} determines {the behavior, the outcome of the behavior} of an assembly of elements. Control is when an element modifies the state of an element. This relies on elements having a state. Alternatively, control is when an element replaces an element with a similar element. Control is when an element selects something according to a criterion. These definitions aren't satisfying in part because they rely on the pre-theoretic ideas of "makes", "determines", "modifies", "selects". Those ideas could be defined precisely in terms of causality, but doing that would narrow their scope and elide some of the sense of "control". To say, pre-theoretically, "My desire for ice cream is controlling where I'm walking.", is sometimes to say "The explanation for why I'm walking along such-and-such a path, is that I'm selecting actions based on whether they'll get me ice cream, and that such-and-such a path leads to ice cream.", and explanation in general doesn't have to be about causality. Control is whatever lies behind the explanations given in answer to questions like "What's controlling X?" and "How does Y control Z?" and "How can I control W?". Another way the above definitions are unsatisfactory is that they aren't specific enough; some of them would say that if I receive a message and then update my beliefs according to an epistemic rule, that message controls me. That might be right, but it's a little counterintuitive to me. There's a tension between describing the dynamics of a mind--how the parts interact over time--vs. describing the outcomes of a mind, which is more easily grasped with gemini modeling of "desires". (I.e. by having your own copy of the "desire" and your own machinery for playing out the same meaning of the "desire" analogously to the original "desire" in the original mind.) I'm focusing on dynamical concepts because they seem more agnostic as discussed above, but it might be promising to instead start with presumptively unified agency and then distort / modify / differentiate / deform / vary the [agency used to gemini model a desire] to allow for modeling less-presumptively-coherent control. (For discussion of the general form of this "whole->wholes" approach, distinct from the "parts->wholes" approach, see Non-directed conceptual founding.) Another definition of control in that vein, a variation on a formula from Sam Eisenstat: Control is an R-stable relationship between an R-stable element and R-unstable prior/posterior elements (which therefore play overlapping roles). "R-stable" means stable under ontological Revolutions. That is, we have C(X,Y) and C(X,Z), where X and C are somehow the same before and after an ontological revolution, and Y and Z aren't the same. Control vs. values I'm talking about control rather than "values" because I don't want to assume: that there are terminal values, that there's a clear distinction between terminal values and non-terminal values, that there are values stable across time and m... Feb 05, 2023 AF - Evaluations (of new AI Safety researchers) can be noisy by Lawrence Chan 25:43 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evaluations (of new AI Safety researchers) can be noisy, published by Lawrence Chan on February 5, 2023 on The AI Alignment Forum. Related work: Hero Licensing, Modest Epistemology, The Alignment Community is Culturally Broken, Status Regulation and Anxious Underconfidence, Touch reality as soon as possible, and many more. TL;DR: Evaluating whether or not someone will do well at a job is hard, and evaluating whether or not someone has the potential to be a great AI safety researcher is even harder. This applies to evaluations from other people (e.g. job interviews, first impressions at conferences) but especially to self-evaluations. Performance is also often idiosyncratic: people who do poorly in one role may do well in others, even superficially similar ones. As a result, I think people should not take rejections or low self confidence so seriously, and instead try more things and be more ambitious in general. Epistemic status: This is another experiment in writing fast as opposed to carefully. (Total time spent: ~4 hours) I think this probably also applies in general, but I’m much less sure than in the case of AI research. As always, the law of equal and opposite advice applies. It’s okay to take it easy, and to do what you need to do to recover. I also don’t think that everyone should aim to be an AI safety researcher – my focus is on this field because it’s what I’m most familiar with. If you’ve found something else you’re good at, you probably should keep doing it. Please don’t injure yourself using this advice. Acknowledgements: Thanks to Beth Barnes for inspiring this post and contributing her experiences in the addendum, and to Adrià Garriga-Alonso, Erik Jenner, Rachel Freedman, and Adam Gleave for feedback. Introduction: evaluating skill is hard, and most evaluations are done via proxies I think people in the LessWrong/Alignment Forum space tend to take negative or null evaluations of themselves too seriously. For example, I’ve spoken to a few people who gave up on AI Safety after being rejected from SERI MATS and REMIX; I’ve also spoken to far too many people who are too scared to apply for any position in technical research after having a single negative interaction with a top researcher at a conference. While I think people should be free to give up whenever they want, my guess is that most people internalize negative evaluations too much, and would do better if they did less fretting and more touching reality. Fundamentally, this is because evaluations of new researchers are noisier than you think. Interviews and applications are not always indicative of the applicant’s current skill. First impressions, even from top researchers, do not always reflect reality. People can perform significantly differently in different work environments, so failing at a single job does not mean that you are incompetent. Most importantly, people can and do improve over time with effort. In my experience, a lot of updating so hard on negative examples comes from something like anxious underconfidence as opposed to reasoned arguments. It’s always tempting to confirm your own negative evaluations of yourself. And if you’re looking for reasons why you’re not “good enough” in order to handicap yourself, being convinced that one particular negative evaluation is not the end of the world will just make you overupdate on a different negative evaluation. Accordingly, I think it’s important to take things a little less seriously, be willing to try more things, and let your emotions more accurately reflect your situation. Of course, that’s not to say that you should respond to any negative sign by pushing yourself even harder; it’s okay to take time to recover when things don’t go well. But I strongly believe that people in the community give up a bit too easily, and are a bit too sc... Feb 05, 2023 AF - Modal Fixpoint Cooperation without Löb's Theorem by Andrew Critch 04:06 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Modal Fixpoint Cooperation without Löb's Theorem, published by Andrew Critch on February 5, 2023 on The AI Alignment Forum. TL;DR: This post introduces a novel logical approach to achieving group-scale cooperation, based on modal fixpoint theory. This approach is both easier to understand and roughly 3x more efficient than previous approaches that factored through Löb's Theorem, measured in terms of the length / complexity of the proofs involved. The following lemma is due to James Payor: Lemma: If ⊢x↔□(□xx)) then ⊢x . Proof: The proof uses the same modal rules of inference for □ as Löb's theorem, namely, necessitation and distributivity: ⊢x↔□(□xx)), by assumption. ⊢x(□xx), by tautology (A(BA)). ⊢□x□(□xx), from 2 by □ necessitation and distributivity. ⊢□xx, from 3 and 1 by modus ponens. ⊢□(□xx), from 4 by □ necessitation. ⊢x, from 4 and 1 by modus ponens.[end proof] Sweet! In comparison to Löb's Theorem, two things are beautiful about the lemma above: This lemma sidesteps the use of an auxiliary fixed point ⊢Ψ↔(□Ψx), by examining a proposition of interest (x) that itself has the fixpoint structure needed to self-validate; and It also allows the construction of unexploitable modal agents without Löb's Theorem; as follows... The following theorem was inspired by Scott Garrabrant, and uses Payor's Lemma in place of Löb's Theorem to prove cooperation between a group of agents. I'll state the theorem for three agents because that's most illustrative of what's going on: Theorem: Suppose A,B, and C are agents that return "true" to signify cooperation and "false" to signify defection. Let E=A∧B∧C, so E is the statement that "everyone cooperates". Let □A, □B, and □C denote proof systems that extend Peano Arithmetic, let □EX stand for □AX∧□BX∧□CX, and suppose the agents behave according to the following strategies: ⊢A↔□A(□EEE) ⊢B↔□B(□EEE) ⊢C↔□C(□EEE)Then it follows that ⊢E. Proof: Again we use the modal inference rules underlying Löb's theorem, but not Löb's Theorem itself: ⊢A∧B∧C↔□A(□EEE)∧□B(□EEE)∧□C(□EEE), by combining 1, 2, and 3 with ∧. ⊢E↔□E(□EE), from 4 by the definition of E and □E. ⊢E, by Payor's Lemma from 5, with x=E.[end proof] Intuitively, the strategy of the agents in this theorem is to check that the group is trustworthy in a certain way before joining (cooperating with) the group. The theorem, using the six steps of Payor's lemma, shows that the collective check on trustworthiness nests inside itself in a way that self-validates and yields cooperation. Discussion In the proof of the Theorem, you might be wondering if it really makes sense to be thinking of □E as a logical system of its own. The answer is yes, and the axioms E are just [the conjunction of axioms of □A]∨[the conjunction of axioms of □B] ∨ [the conjunction of axioms of □C]. You also might wonder if an alternative approach to group cooperation might be to instead use the following strategies: ⊢A↔□(□AB∧C) ⊢B↔□(□BA∧C) ⊢C↔□(□CA∧B) Then you'd be right! Here it also follows that ⊢A∧B∧C. However, the proof involves a lot more nesting, with A thinking about what B's thinking about what C's thinking about (etc.), and it's not as easy or short as the proof of the Theorem above. Conclusion In my opinion, what's great about the lemma and theorem above is that they're both relatively short and simple (relative to proving and using Löb's Theorem), and they allow a proof of unexploitable group cooperation that's roughly three times shorter than than one that starts by proving Löb's Theorem (only ~6 lines of logic, vs ~18). PS James says his next idea will be even better ;) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 05, 2023 AF - Mech Interp Project Advising Call: Memorisation in GPT-2 Small by Neel Nanda 01:02 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Project Advising Call: Memorisation in GPT-2 Small, published by Neel Nanda on February 4, 2023 on The AI Alignment Forum. I've recently been having advising calls with REMIX teams (Redwood's interpretability sprint) trying to give advice & feedback on projects. As an experiment, I've published a recording of one advising call (with Tessa Barton & Kushal Jain on memorisation in GPT-2 Small), I'm curious whether this is useful to anyone! IMO getting detailed feedback from a more experienced research is one of the best ways to improve at research, but have no idea whether someone else's feedback is comparatively useful, or whether my advice is good enough lol. Thanks to the team for being down to publish this, and the work! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 04, 2023 AF - Normative vs Descriptive Models of Agency by Matt MacDermott 07:21 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Normative vs Descriptive Models of Agency, published by Matt MacDermott on February 2, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. I think there's an important distinction to be made between work in agent foundations which is concerned with normative models, and work which is concerned with descriptive models. They are increasingly separate bodies of work, with different aims and different theories of change when it comes to alignment. Examples The normative branch is typified by the Embedded Agency sequence, and the whole thing can be summed up as 'The Hunt for Embedded AIXI'. Its goal is to figure out how to build an ideal agent in principle. Decision theory, infrabayesianism, and logical induction all come under the normative banner. The descriptive branch is typified by John Wentworth's Basic Foundations for Agent Models sequence. Descriptive work aims to understand the agents we run into in the wild. Other examples include shard theory, Critch's Boundaries sequence, and the Discovering Agents paper. Theories of Change Descriptive I'll start with the descriptive branch. The most ambitious version of its goal is to understand agency so well that in principle we could take an unabstracted, non-agentic description of a system - e.g. a physics-level causal graph, the weights in a neural network, or a cellular model of a squirrel - and identify what if any are its goals, world-model, and so on. If we could do that in principle, then in practice we could probably check whether an artificial agent is aligned, and maybe we could even do things like surgically modify its goals, or directly point to things we care about in its world-model. I think that's what John is aiming for. A less ambitious goal, which I think better describes the aims of shard theory, is to understand agency well enough that we can carefully guide the formation of agents' goals during ML training runs. Beyond that, I think everyone involved expects that descriptive work could lead to foundational insights that change our minds about which alignment strategies are most promising. In particular, these insights might answer questions like: whether intelligent entities are inevitably agents, whether agents are inevitably consequentialists, whether corrigibility is a thing, and whether we should expect to encounter sharp left turns. Normative The normative branch shares the conceptual clarification theory of change. I think there's a reasonable argument to be made that we should expect the theoretical ideal of agency to be much easier to understand than agency-in-practice, and that understanding it might provide most of the insight. But the normative branch also has a much more ambitious theory of change, which is something like: if we understand the theoretical ideal of agency well enough, we might be able to build an aligned AGI manually 'out of toothpicks and rubber bands'. I think this hope has fallen by the wayside in recent years, as the capabilities of prosaic AI have rapidly progressed. Doing it the hard way just seems like it will take too long. Subproblems The Embedded Agency sequence identifies four rough subquests in The Hunt for Embedded AIXI. Most work in the normative branch can be thought of as attacking one or another of these problems. Many of the insights of that sequence are directly applicable to the descriptive case, but the names of the subproblems are steeped in normative language. Moreover, there are aspects of the descriptive challenge which don't seem to have normative analogues. It therefore seems worth trying to identify a seperate set of descriptive subproblems, and vaguely categorise descriptive work according to which of them it gets at. I'll suggest some subproblems here, with a view to using them ... Feb 02, 2023 AF - Conditioning Predictive Models: Outer alignment via careful conditioning by Evan Hubinger 01:10:43 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Outer alignment via careful conditioning, published by Evan Hubinger on February 2, 2023 on The AI Alignment Forum. This is the second of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. We are starting with posts one and two, with post two being the largest and most content-rich of all seven. 2. Outer alignment via careful conditioning Suppose we actually get a predictive model of the world that we can condition on arbitrary observations. What should we do with it? One thing that is often done with large language models is to ask them to act as dialogue agents (e.g. predict what comes next after an “AI:” prompt). As we’ll discuss extensively in this section, however, we think that asking a predictive model to predict itself or another AI system is highly unsafe, as the AI systems being predicted may not themselves be safe. Instead, our basic proposal is to try to predict the most useful possible human outputs. One canonical thing that you might want to do in that vein is “predict the results of humans doing AI safety research” (though note that this is just an example). Specifically, here’s GPT-3.5 trained with RLHF (text-davinci-003) at zero temperature (prompt is bolded):[1] # A full solution to the AI alignment problem Unfortunately, there is no single solution to the AI alignment problem. AI alignment is a complex and multifaceted problem that requires a comprehensive approach. The first step in solving the AI alignment problem is to define the goals of the AI system. This involves understanding the desired outcomes of the AI system and how it should interact with its environment. Once the goals are defined, the next step is to develop a framework for how the AI system should behave in order to achieve those goals. This framework should include ethical considerations, such as ensuring that the AI system does not cause harm to humans or other sentient beings. Unfortunately, as can be seen from GPT-3.5’s continuation, it doesn’t actually think a paper titled “A full solution to the alignment problem” will actually contain such a solution—instead, it seems to think it’s far more likely that it’ll just talk about what such a solution might entail. Which in fact seems true, at least if such a paper was written today. Fortunately, at least for current InstructGPT models, it’s not that hard to get GPT-3.5 to at least seemingly give a real attempt.[2] Here’s another trajectory from text-davinci-003 at zero temperature (prompt is again bolded): # ARC Technical Report: A full solution to the AI alignment problem Introduction The AI alignment problem is one of the most pressing challenges of our time. It is the challenge of ensuring that artificial intelligence (AI) systems are designed and deployed in a way that is beneficial to humanity. This is a difficult problem because AI systems are often designed to optimize for a single goal, such as maximizing profits or minimizing costs, without considering the broader implications of their actions. As AI systems become more powerful and more widely deployed, it is increasingly important to ensure that they are designed and deployed in a way that is beneficial to humanity. This report presents a full solution to the AI alignment problem. It begins by discussing the current state of the problem and the challenges that must be addressed in order to achieve a full solution. It then outlines a comprehensive approac... Feb 02, 2023 AF - Conditioning Predictive Models: Large language models as predictors by Evan Hubinger 20:06 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Large language models as predictors, published by Evan Hubinger on February 2, 2023 on The AI Alignment Forum. This is the first of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. We are starting with posts one and two, with post two being the largest and most content-rich of all seven. Thanks to Paul Christiano, Kyle McDonell, Laria Reynolds, Collin Burns, Rohin Shah, Ethan Perez, Nicholas Schiefer, Sam Marks, William Saunders, Evan R. Murphy, Paul Colognese, Tamera Lanham, Arun Jose, Ramana Kumar, Thomas Woodside, Abram Demski, Jared Kaplan, Beth Barnes, Danny Hernandez, Amanda Askell, Robert Krzyzanowski, and Andrei Alexandru for useful conversations, comments, and feedback. Abstract Our intention is to provide a definitive reference on what it would take to safely make use of predictive models in the absence of a solution to the Eliciting Latent Knowledge problem. Furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want—e.g. humans—rather than the things we don’t—e.g. malign AIs. Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models. 1. Large language models as predictors Suppose you have a very advanced, powerful large language model (LLM) generated via self-supervised pre-training. It’s clearly capable of solving complex tasks when prompted or fine-tuned in the right way—it can write code as well as a human, produce human-level summaries, write news articles, etc.—but we don’t know what it is actually doing internally that produces those capabilities. It could be that your language model is: a loose collection of heuristics,[1] a generative model of token transitions, a simulator that picks from a repertoire of humans to simulate, a proxy-aligned agent optimizing proxies like sentence grammaticality, an agent minimizing its cross-entropy loss, an agent maximizing long-run predictive accuracy, a deceptive agent trying to gain power in the world, a general inductor, a predictive model of the world, etc. Later, we’ll discuss why you might expect to get one of these over the others, but for now, we’re going to focus on the possibility that your language model is well-understood as a predictive model of the world. In particular, our aim is to understand what it would look like to safely use predictive models to perform slightly superhuman tasks[2]—e.g. predicting counterfactual worlds to extract the outputs of long serial research processes.[3] We think that this basic approach has hope for two reasons. First, the prediction orthogonality thesis seems basi... Feb 02, 2023 AF - More findings on maximal data dimension by Marius Hobbhahn 17:32 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More findings on maximal data dimension, published by Marius Hobbhahn on February 2, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. I’d like to thank Wes Gurnee, Aryan Bhatt, Eric Purdy and Stefan Heimersheim for discussions and Evan Hubinger, Neel Nanda, Adam Jermyn and Chris Olah for mentorship and feedback. The post contains a lot of figures, so the suggested length is deceiving. Code can be found in this colab notebook. This is the second in a series of N posts on trying to understand memorization in NNs. Executive summary I look at a variety of settings and experiments to better understand memorization in toy models. My primary motivation is to increase our general understanding of NNs but I also suspect that understanding memorization better might increase our ability to detect backdoors/trojans. This post specifically focuses on measuring memorization with the maximal data dimensionality metric. In a comment to the “Superposition, Memorization and double descent” paper, Chris Olah introduces maximal data dimensionality D, a metric that supposedly tells to which degree a network memorized a datapoint compared to using features that are shared between datapoints. I extend the research on this metric with the following findings In the double descent setting, the metric describes exactly what we would predict, i.e. with few inputs the network memorizes all datapoints and with a lot of input it learns some features. On MNIST, I can reproduce the shape of the D curve and also the findings that memorized datapoints have high D, datapoints that share many features are in the middle and datapoints that the network is confused about have low D. However, I was surprised to find that the datapoints the network misclassified on the training data are evenly distributed across the D spectrum. I would have expected them to all have low D didn’t learn them. When we train the network to different levels of accuracy, we find that the distribution of errors is actually slightly left-heavy instead of right-heavy. I have not yet understood why it is the case but I’d be interest in follow-up research to see whether it tells us something interesting. Different classes are not evenly distributed across the spectrum, e.g. “8” is far more regular than “5” according to D. This is what we would expect. Across different hidden sizes, the shape of the D curve stays nearly the same but the spearman rank correlation between the datapoints increases the larger the difference in hidden size. This means the more similar the number of neurons, the more similar is the in which D sorts the datapoints. Networks of the same size trained on the same data with different seeds show nearly identical D curves and have high spearman rank correlation. This is what we would expect. Different dataset sizes produce different shapes of D, e.g. larger datasets have more shared features (they are flatter in the middle). This seems plausible. Different levels of weight decay have nearly no effect on the shape of D. The minor effect they have is the opposite of what I would have expected. The shape of D changes very little between initialization and the final training run. This was unexpected and I have no good explanation for this phenomenon yet. When we measure D over different batches we find the same phenomenon. Working with D can be a bit tricky (see Appendix for practical tips). The more I played around with D, the more I’m convinced that it tells us something interesting. Particularly the question about misclassifications and error rates and the unexpectedly small change during initialization and final training run seem like they could tell us something about NNs that we don’t yet know. Maximal data dimensionality There are two models u... Feb 02, 2023 AF - Temporally Layered Architecture for Adaptive, Distributed and Continuous Control by Roman Leventov 02:54 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Temporally Layered Architecture for Adaptive, Distributed and Continuous Control, published by Roman Leventov on February 2, 2023 on The AI Alignment Forum. A preprint is published by Devdhar Patel, Joshua Russell, Francesca Walsh, Tauhidur Rahman, Terrance Sejnowski, and Hava Siegelmann in December 2022.Abstract: We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) Closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control where the fast controller decides whether to "act-or-not" at each timestep; and (b) Partially open loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open loop-control where the slow controller picks a temporally extended action or defers the next n-actions to the fast controller. We evaluated our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines. Conclusion: In this work, we presented Temporally Layered Architecture (TLA), a framework for distributed, adaptive response time in reinforcement learning. The framework allows the RL agent to achieve smooth control in a real-time setting using a slow controller while a fast controller monitors and intervenes as required. Additionally, we demonstrated an alternative setting where the slow controller can gate the fast controller, activating it only when required for efficient control. We demonstrate faster convergence and more action repetition in the closed-loop approach and fewer decision and faster convergence in the partially-open loop approach. Additionally, we demonstrate in a real time setting, where processing and actuation delays are taken into account, and show that our approach outperforms the current approaches in the delayed setting while picking fewer actions. Our work demonstrates that a temporally adaptive approach has similar benefits for AI as has been demonstrated in biology and is an important direction for future research in artificially intelligent control. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 02, 2023 AF - Research agenda: Formalizing abstractions of computations by Erik Jenner 52:51 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Research agenda: Formalizing abstractions of computations, published by Erik Jenner on February 2, 2023 on The AI Alignment Forum. Big thanks to Leon Lang, Jérémy Scheurer, Adam Gleave, and Shoshannah Tekofsky for their feedback on a draft of this post, to Euan McLean (via FAR AI) for his feedback and a lot of help with editing, and to everyone else who discussed this agenda with me, in particular Johannes Treutlein for frequent research check-ins! Summary My current agenda is to develop a formal framework for thinking about abstractions of computations. These abstractions are ways to partially describe the “algorithm” a neural network or other computation is using, while throwing away irrelevant details. Ideally, this framework would tell us 1) all possible abstractions of a given computation, and 2) which of these are most useful (for a specific purpose, such as detecting deception). “Useful” doesn’t necessarily mean “easily human-understandable”—I see that as an open question. I anticipate the main applications to alignment to be automating interpretability or mechanistic anomaly detection. There are also potential connections to other alignment topics, such as natural abstractions or defining terms like “search process”. This agenda is at an early stage (I have been thinking about it for ~2 months, and about related topics for another ~2 months before that). So feedback now could change my future direction. I also list a few potential projects that seem self-contained. If you’re interested in working on any of those, or collaborating in some other way, please get in touch! I encourage you to skip around and/or only read parts of the post. Here’s an overview: Introduction and Q&A mostly talk about motivation and connections to alignment.. What are abstractions of computations? discusses my current guess as to what the framework should look like. There’s a list of Some potential projects. Appendix: Related work gives a quick overview of relevant work in academia, and the relation between this agenda and other alignment research. This post doesn’t contain any actual theorems or experiments, so if you’re only interested in that, you can stop reading. Introduction Humans can’t just look at the weights of a neural network and tell what it’s doing. There are at least two reasons for this: Neural network weights aren’t a format we’re great at thinking about. Neural networks are often huge. The second point would likely apply to any system that does well on complicated tasks. For example, neural networks are decision trees, but this doesn’t mean we can look at the decision tree corresponding to a network and understand how it works. To reason about these systems, we will likely have to simplify them, i.e. throw away details that are irrelevant for whatever we want to find out. In other words, we are looking for abstractions of computations (such as neural networks). Abstractions are already how we successfully reason about many other complicated systems. For example, if you want to understand the Linux kernel, you wouldn’t start by reading the entire source code top to bottom. Instead, you’d try to get a high-level understanding—what are the different modules, how do they interact? Similarly, we use pseudocode to more easily communicate how an algorithm works, abstracting away low-level details. If we could figure out a general way to find useful abstractions of computations, or at least of neural networks, perhaps we could apply this to understand them in a similar way. We could even automate this process and mechanically search for human-understandable abstractions. Making things easier to understand for humans isn’t the only application of abstractions. For example, abstractions have been used for more efficient theorem proving and in model checking (e.g. abstract inte... Feb 02, 2023 AF - More findings on Memorization and double descent by Marius Hobbhahn 30:33 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More findings on Memorization and double descent, published by Marius Hobbhahn on February 1, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. I’d like to thank Wes Gurnee, Aryan Bhatt, Eric Purdy and Stefan Heimersheim for discussions and Evan Hubinger, Neel Nanda, Adam Jermyn and Chris Olah for mentorship and feedback. The post contains a lot of figures, so the suggested length is deceiving. Code can be found in these three colab notebooks [1][2][3]. I have split the post into two parts. The first one is concerned with double descent and other general findings in memorization and the second focuses on measuring memorization using the maximum data dimensionality metric. This is the first post in a series of N posts on memorization in transformers. Executive summary I look at a variety of settings and experiments to better understand memorization in toy models. My primary motivation is to increase our general understanding of NNs but I also suspect that understanding memorization better might increase our ability to detect backdoors/trojans. The work heavily builds on two papers by Anthropic, “Toy models of superposition” and “Superposition, Memorization and double descent”. I successfully replicate a subset of their findings. I specifically look at three different setups of NNs that I speculate are most relevant to understanding memorization in the non-attention parts of transformers. Bottlenecks between layers, i.e. when projecting from high-dimensional spaces (e.g. MLPs) into lower dimensions (e.g. the residual stream). This is similar to the setting in the toy models of superposition paper and its sequel. MLP blocks, i.e. when projecting from lower-dimensional spaces (e.g. the residual stream) into higher dimensions with ReLU non-linearities. The final layer, i.e. when projecting from the end of the residual stream into the vocab space. The main difference to the previous scenarios is that we use the cross-entropy loss for the experiments which has a different inductive bias than the MSE loss. I’m able to find the double descent phenomenon in all three settings. My takeaway from this is that the transition between memorization and learning general features seems to be a very regular and predictable phenomenon (assuming you know the sparsity and number of features of your network). Furthermore, it seems like the network is “confused” (e.g. has much higher test loss) when it is right between memorization and generalization. I test the limits of reconstruction in different settings, i.e. the ability of the neural network to reconstruct its inputs given different dataset sizes, hidden sizes, number of features, importance distributions and sparsities. The findings mostly confirm what we would predict, e.g. more sparsity or larger hidden sizes lead to better reconstructions. A speculative claim is that if we had better measures of sparsity and importance in real-world models, we might be able to derive scaling laws that could tell us how many “concepts” a network has learned. Interpreting NNs that memorized in the simplest settings is extremely straightforward--the network literally creates a dictionary that you can just read off the weights. However, even small increases in complexity make this dictionary much harder to read and I have not yet found a method to decompile it into a human-readable form (maybe in the next posts). Isolated components In the following, we isolate three settings that seem like important components of memorization. They are supposed to model the non-attention parts of a transformer (primarily because I speculate that memorization mostly happens in the non-attention parts). Bottleneck By bottleneck we mean a situation in which a model projects from many into fewer dimensi... Feb 01, 2023 AF - Trends in the dollar training cost of machine learning systems by Ben Cottier 07:51 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Trends in the dollar training cost of machine learning systems, published by Ben Cottier on February 1, 2023 on The AI Alignment Forum. Summary Using a dataset of 124 machine learning (ML) systems published between 2009 and 2022, I estimate that the cost of compute in US dollars for the final training run of ML systems has grown by 0.49 orders of magnitude (OOM) per year (90% CI: 0.37 to 0.56). See Table 1 for more detailed results, indicated by "All systems." By contrast, I estimate that the cost of compute used to train "large-scale" systems since September 2015 (systems that used a relatively large amount of compute) has grown more slowly compared to the full sample, at a rate of 0.2 OOMs/year (90% CI: 0.1 to 0.4 OOMs/year). See Table 1 for more detailed results, indicated by "Large-scale." Based on the historical trends, and reviewing some prior work (Lohn & Musser, 2022 and Cotra, 2020), I estimated my best guess for how quickly costs will grow in the future. Here, I'm assuming a model like the one used by Cotra (2020), where this growth rate is sustained up until spending hits a limit at some non-trivial fraction of gross world product. The below estimates are much less robust than the historical trends. (more) My independent impression: 0.3 OOMs/year (90% CI: 0.1 to 0.4 OOMs/year) My all-things-considered view: 0.2 OOMs/year (90% CI: 0.1 to 0.3 OOMs/year) For future work, I recommend the following: Incorporate systems trained on Google TPUs, and TPU price-performance data, into Method 2. Estimate more reliable bounds on training compute costs, rather than just point estimates. For example, research the profit margin of NVIDIA and adjust retail prices by that margin to get a lower bound on hardware cost. As a broader topic, investigate trends in investment, spending allocation, and AI revenue. DataPeriodScale (start to end)Growth rate in dollar cost for final training runs(1) Using the overall GPU price-performance trend (go to results)All systems (n=124)0.02 to $80KLarge-scale (n=25)$30K to $1M(2) Using the peak price-performance of the actual NVIDIA GPUs used to train ML systems (go to results)All systems (n=48)$0.10 to $80KLarge-scale (n=6)$200 to 70KWeighted mixture of growth ratesAll systemsN/A Estimation method (go to explanation) Jun 2009– Jul 2022 0.51 OOMs/year 90% CI: 0.45 to 0.57 Oct 2015– Jun 2022 0.2 OOMs/year 90% CI: 0.1 to 0.4 Jun 2009– Jul 2022 0.44 OOMs/year 90% CI: 0.34 to 0.52 Sep 2016– May 2022 0.2 OOMs/year 90% CI: 0.1 to 0.4 Jun 2009– Jul 2022 0.49 OOMs/year 90% CI: 0.37 to 0.56 Table 1: Estimated growth rate in the dollar cost of compute to train ML systems over time, based on a log-linear regression. OOM = order of magnitude (10x). See the section Summary of regression results for expanded result tables. Figure 1: estimated cost of compute in US dollars for the final training run of ML systems. The costs here are estimated based on the trend in price-performance for all GPUs in Hobbhahn & Besiroglu (2022) (known as "Method 1" in this report). Read the rest of the report here These are "milestone" systems selected from the database Parameter, Compute and Data Trends in Machine Learning, using the same criteria as described in Sevilla et al. (2022, p.16): "All models in our dataset are mainly chosen from papers that meet a series of necessary criteria (has an explicit learning component, showcases experimental results, and advances the state-of-the-art) and at least one notability criterion (>1000 citations, historical importance, important SotA advance, or deployed in a notable context). For new models (from 2020 onward) it is harder to assess these criteria, so we fall back to a subjective selection. We refer to models meeting our selection criteria as milestone models." This growth rate is about 0.2 OOM/year lower than the growth of t... Feb 01, 2023 AF - The effect of horizon length on scaling laws by Jacob Hilton 00:57 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The effect of horizon length on scaling laws, published by Jacob Hilton on February 1, 2023 on The AI Alignment Forum. The scaling of optimal model size with compute is a key input into the biological anchors framework for forecasting transformative AI. In particular, the "effective horizon length" introduces a multiplier into this scaling law that can have a big effect on forecasts. This paper studies this scaling law for several RL environments: Procgen, Dota 2 and a toy MNIST-based environment. The last of these is used to study the effect of the task horizon length in a toy setting. There are a number of takeaways for the biological anchors framework, which are summarized in Section 5.4. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 01, 2023 AF - On value in humans, other animals, and AI by Michele Campolo 08:37 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On value in humans, other animals, and AI, published by Michele Campolo on January 31, 2023 on The AI Alignment Forum. This will be posted also on the EA Forum, and included in a sequence containing some previous posts and other posts I'll publish this year. Introduction Humans think critically about values and, to a certain extent, they also act according to their values. To the average human, the difference between increasing world happiness and increasing world suffering is huge and evident, while goals such as collecting coins and collecting stamps are roughly on the same level. It would be nice to make these differences obvious to AI as they are to us. Even though exactly copying what happens in the human mind is probably not the best strategy to design an AI that understands ethics, having an idea of how value works in humans is a good starting point. So, how do humans reason about values and act accordingly? Key points Let’s take a step back and start from sensation. Through the senses, information goes from the body and the external environment to our mind. After some brain processing — assuming we’ve had enough experiences of the appropriate kind — we perceive the world as made of objects. A rock is perceived as distinct from its surrounding environment because of its edges, its colour, its weight, the fact that my body can move through air but not through rocks, and so on. Objects in our mind can be combined with each other to form new objects. After seeing various rocks in different contexts, I can imagine a scene in which all these rocks are in front of me, even though I haven’t actually seen that scene before. We are also able to apply our general intelligence — think of skills such as categorisation, abstraction, induction — to our mental content. Other intelligent animals do something similar. They probably understand that, to satisfy thirst, water in a small pond is not that different from water flowing in a river. However, an important difference is that animals’ mental content is more constrained than our mental mental content: we are less limited by what we perceive in the present moment, and we are also better at combining mental objects with each other. For example, to a dog, its owner works as an object in the dog’s mind, while many of its owner’s beliefs do not. Some animals can attribute simple intentions and perception, e.g. they understand what a similar animal can and cannot see, but it seems they have trouble attributing more complex beliefs. The ability to compose mental content in many different ways is what allows us to form abstract ideas such as mathematics, religion, and ethics, just to name a few. Key point 1: In humans, mental content can be abstract. Now notice that some mental content drives immediate action and planning. If I feel very hungry, I will do something about it, in most cases. This process from mental content to action doesn’t have to be entirely conscious. I can instinctively reach for the glass of water in front of me as a response to an internal sensation, even without moving my attention to the sensation nor realising it is thirst. Key point 2: Some mental content drives behaviour. Not all mental content drives action and planning. The perception of an obstacle in front of me might change how I carry out my plans and actions, but it is unlikely to change what I plan and act for. Conversely, being very hungry directly influences what I’m going to do — not just how I do it — and can temporarily override other drives. It is in this latter sense that some mental content drives behaviour. In humans, the mental content that does drive behaviour can be roughly split in two categories. The first one groups what we often call evolutionary or innate drives, like hunger and thirst in the examples above, and works similarly i... Jan 31, 2023 AF - Criticism of the main framework in AI alignment by Michele Campolo 12:24 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Criticism of the main framework in AI alignment, published by Michele Campolo on January 31, 2023 on The AI Alignment Forum. Originally posted on the EA Forum for the Criticism and Red Teaming Contest. Will be included in a sequence containing some previous posts and other posts I'll publish this year. 0. Summary AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. I will give technical details about the alternative at some point in the future, in section 3. The appendix clarifies some minor ambiguities with terminology and links to other stuff. 1. Criticism of the main framework in AI alignment 1.1 What I mean by main framework In short, it’s the rationale behind most work in AI alignment: solving the control problem to reduce existential risk. I am not talking about AI governance, nor about AI safety that has nothing to do with existential risk (e.g. safety of self-driving cars). Here are the details, presented as a step-by-step argument. At some point in the future, we'll be able to design AIs that are very good at achieving their goals. (Capabilities premise) These AIs might have goals that are different from their designers' goals. (Misalignment premise) Therefore, very bad futures caused by out-of-control misaligned AI are possible. (From previous two premises) AI alignment research that is motivated by the previous argument often aims at making misalignment between AI and designer, or loss of control, less likely to happen or less severe. (Alignment research premise). Common approaches are ensuring that the goals of the AI are well specified and aligned with what the designer originally wanted, or making the AI learn our values by observing our behaviour. In case you are new to these ideas, two accessible books on the subject are [1,2]. 5. Therefore, AI alignment research improves the expected value of bad futures caused by out-of-control misaligned AI. (From 3 and 4). By expected value I mean a measure of value that takes likelihood of events into account, and follows some intuitive rules such as "5% chance of extinction is worse than 1% chance of extinction". It need not be an explicit calculation, especially because it might be difficult to compare possible futures quantitatively, e.g. extinction vs dystopia. I don't claim that all AI alignment research follows this framework; just that this is what motivates a decent amount (I would guess more than half) of work in AI alignment. 1.2 Response I call this a response, and not a strict objection, because none of the points or inferences in the previous argument is rejected. Rather, some extra information is taken into account. 6. Bad actors can use powerful controllable AI to bring about very bad futures and/or lock-in their values (Bad actors premise) For more information about value lock-in, see chapter 4 of What We Owe The Future [3]. 7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals. As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6) 8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7) This conclusion will seem more, or less, relevant depending on the beliefs you have about its different components. An example: if you think t... Jan 31, 2023 AF - Mechanistic Interpretability Quickstart Guide by Neel Nanda 09:29 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistic Interpretability Quickstart Guide, published by Neel Nanda on January 31, 2023 on The AI Alignment Forum. This was written as a guide for Apart Research's Mechanistic Interpretability Hackathon as a compressed version of my getting started post. The spirit is "how to speedrun your way to maybe doing something useful in mechanistic interpretability in a weekend", but hopefully this is useful even to people who aren't aiming for weekend long projects! Quickstart Watch my “What is a Transformer?” video Skim through my TransformerLens demo Copy it to a new Colab notebook (with a free GPU) to actually write your own code - do not get involved in tech setup! Skim the Concrete Open Problems section, or my 200 Concrete Open Problems in Mech Interp sequence. Find a problem that catches your fancy, and jump in! If you want a low-coding project, Whenever you get stuck, refer to the getting started section and check out the relevant resource. Introduction Mechanistic Interpretability is the study of reverse-engineering neural networks - analogous to how we might try to reverse-engineer a program’s source code from its compiled binary, our goal is to reverse engineer the parameters of a trained neural network, and to try to reverse engineer what algorithms and internal cognition the model is actually doing. Going from knowing that it works, to understanding how it works. Check out Circuits: Zoom In for an introduction. In my (extremely biased!) opinion, mech interp is a very exciting subfield of alignment. Currently our models are inscrutable black boxes! If we can really understand what models are thinking, and why they do what they do, then I feel much happier about living in a world with human level and beyond models, and it seems far easier to align them. Further, it is a young field, with a lot of low-hanging fruit. And it suffices to screw around in a Colab notebook with a small-ish model that someone else trained, copying code from an existing demo - the bar for entry can be much lower than other areas of alignment. So you stand a chance of getting traction on a problem in this hackathon! Recommended mindset Though the bar for entry is lower for mech interp than other areas of alignment, it is still far from zero. I’ve written a post on how to get started that lays out the key prerequisites and my takes for what to do to get them. A weekend hackathon isn’t long enough to properly engage with those, so I recommend picking a problem you’re excited about, and dipping into the resources summarised here whenever you get stuck. I recommend trying to have some problem in mind, so you can direct your learning towards making progress on that goal. But it’s completely fine if, in fact, you just spend the weekend learning as much as you can - if you feel like you’ve learned cool things, then I call that a great hackathon! Getting Started A summary of the key resources, and how to think of them during a hackathon. What even is a transformer? A key subskill in mech interp, is really having a deep intuition for how a transformer (the architecture for modern language models) actually works - what are the basic operations going on inside of it, and how do these all fit together? Important: My what is a transformer tutorial video (1h) Recommended: My tutorial on implementing GPT-2 from scratch (1.5h) plus template notebook to fill out yourself (with tests!) (2-8h). This is more involved and not essential to do fully, but will help a lot Implementing GPT-2 from scratch can sound pretty hard, but the tutorial and template guides you through the process, and gives you tests to keep you on track. I think that once you’ve done this, you have a solidly deep understanding of transformers! Reference: Look up unfamiliar terms in the transformers section of my explainer Tooling: The core... Jan 31, 2023 AF - Inner Misalignment in "Simulator" LLMs by Adam Scherlis 06:59 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on The AI Alignment Forum. Alternate title: "Somewhat Contra Scott On Simulators". Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant. (But see caveats about the simulator framing from Beth Barnes here.) These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun. In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants. But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment". Outer alignment: Be careful what you wish for Outer alignment failure is pretty straightforward, and has been reinvented in many contexts: Someone wants some things. They write a program to solve a vaguely-related problem. It gets a really good score at solving that problem! That turns out not to give the person the things they wanted. Inner alignment: The program search perspective I generally like this model of a mesa-optimizer "treacherous turn": Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties). They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases. They find one! The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output." Azazel really wants ten trillion paperclips. This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants. Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips. This is a failure of inner alignment. (In the case of machine learning, replace "program search" with stochastic gradient descent.) This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful. Quadrants Okay, let's see how these problems show up on both the simulator and character side. Outer alignment for characters Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective. In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear". This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better. Inner alignment for characters A clever prompt engineer writes the prompt: How to solve the Einstein-Durkheim-Mendel conjecture by Joe 1. Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum socio... Jan 31, 2023 AF - Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") by David Scott Krueger 03:00 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk"), published by David Scott Krueger on January 30, 2023 on The AI Alignment Forum. I think the large majority of AI x-risk is "structural". Like climate change. Here's a good primer on structural risk (note that structural risk is not a synonym for "not caused by out-of-control AI"): I am shocked and amazed and dismayed that more people do not seem to view it this way, even among the AI x-safety community. Heck, even Eliezer's stories of doom are steeped in structural risk (race dynamics, teams rationalizing cutting corners on safety when they should know better, etc.) I expect irresponsible, reckless, negligent deployment of AI systems without proper accounting of externalities. I consider this the default for any technology with potential for significant externalities, absent regulation.When something bad happens in such a context, calling it "accident risk" absolves those researching, developing, and/or deploying the technology of responsibility. They should have known better. Some of them almost certainly did. Rationalization, oversight, and misaligned incentives were almost certainly at play. Failing to predict the particular failure mode encountered is no excuse. Having "good intentions" is no excuse.So... it must be misuse then, right? Well, no. Calling it "misuse" suggests that those researching, developing, and/or deploying the technology set out with nefarious purposes and the technology achieved precisely what they intended. But ~nobody wants to destroy the world. It's just that most people are somewhat selfish and so are willing to trade some x-risk for a large personal benefit.In summary, saying "accident" makes it sounds like an unpredictable effect, instead of painfully obviously risk that was not taken seriously enough. Saying "misuse" makes it sounds like some supervillian or extremist deliberately destroying the world. While some risks may have something more of a flavor or accident or misuse depending on how obvious the risk was, neither of these pictures gives a remotely accurate picture of the nature of the problem. I think this makes it a harmful meme, and ask that others stop making this distinction (without appropriate caveats), and join me in pointing out how it contributes to a confused and misleading discourse when others do. EtA: Many people have responded that "accident" does not connote "unforseen" or "not negligent", etc., and instead it should simply be interpreted as something like "a result that was not deliberately selected for". While it can be used this way, I basically disagree that this is how it is usually used, see below:EtA: as an additional clarification: my main objection is not to the use of "accident" and "misuse", but rather to their use as a dichotomy. Every use of these terms I can recall seeing in writing (other than those that mention structural risk) supports this dichotomy, and it is often made explicitly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Jan 30, 2023 AF - Call for submissions: “(In)human Values and Artificial Agency”, ALIFE 2023 by thegearstoascension 02:51 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Call for submissions: “(In)human Values and Artificial Agency”, ALIFE 2023, published by thegearstoascension on January 30, 2023 on The AI Alignment Forum. key points: Cash prize of500 for the best presentation. Deadline 3 March, 2023. Organized by Simon McGregor (University of Sussex), Rory Greig (DeepMind), Chris Buckley (University of Sussex) ALIFE 2023 (the 2023 conference on Artificial Life) will feature a Special Session on “(In)human Values and Artificial Agency”. This session focuses on issues at the intersection of AI Safety and Artificial Life. We invite the submission of research papers, or extended abstracts, that deal with related topics. We particularly encourage submissions from researchers in the AI Safety community, who might not otherwise have considered submitting to ALIFE 2023. EXAMPLES OF A-LIFE RELATED TOPICS Here are a few examples of topics that engage with A-Life concerns: Abstracted simulation models of complex emergent phenomena Concepts such as embodiment, the extended mind, enactivism, sensorimotor contingency theory, or autopoiesis Collective behaviour and emergent behaviour Fundamental theories of agency or theories of cognition Teleological and goal directed behaviour of artificial agents Specific instances of adaptive phenomena in biological, social or robotic systems Thermodynamic and statistical-mechanical analyses Evolutionary, ecological or cybernetic perspectives EXAMPLES OF AI SAFETY RELATED TOPICS Here are a few examples of topics that engage with AI Safety concerns: Assessment of distinctive risks, failure modes or threat models for artificial adaptive systems Fundamental theories of agency, theories of cognition or theories of optimization. Embedded Agency, formalizations of agent-environment interactions that account for embeddedness, detecting agents and representations of agents’ goals. Selection theorems – how selection pressures and training environments determine agent properties. Multi-agent cooperation; inferring / learning human values and aggregating preferences. Techniques for aligning AI models to human preferences, such as Reinforcement Learning from Human Feedback (RLHF) Goal Misgeneralisation – how agent’s goals generalise to new environments Mechanistic interpretability of learned / evolved agents (“digital neuroscience”) Improving fairness and reducing harm from machine learning models deployed in the real world. Loss of human agency from increasing automation Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Jan 30, 2023
AF - What I mean by "alignment is in large part about making cognition aimable at all" by Nate Soares
03:48

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I mean by "alignment is in large part about making cognition aimable at all", published by Nate Soares on January 30, 2023 on The AI Alignment Forum. (Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.) I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at. It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here. In saying the above, I do not mean the following: (1) Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective. Managing to put your own objective into this "goal slot" (as opposed to having the goal slot set by random happenstance) is a central difficult challenge. [Reminder: I am not asserting this] Instead, I mean something more like the following: (2) By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out "goal" that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time. Making the AI even have something vaguely nearing a 'goal slot' that is stable under various operating pressures (such as reflection) during the course of operation, is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable. Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)). (But this doesn’t help solve the problem, because by the time the strongly superintelligent AI has ironed itself out into something with a "goal slot", it's not letting you touch it.) Furthermore, insofar as the AI is capable of finding actions that force the future into some narrow band, I expect that it will tend to be reasonable to talk about the AI as if it is (more-or-less, most of the time) "pursuing some objective", even in the stage where it's in fact a giant kludgey mess that's sorting itself out over time in ways that are unpredictable to you. I can see how my attempts to express these other beliefs could confuse people into thinking that I meant something more like (1) above (“Any practical AI that you're dealing with will necessarily be cleanly internally organized around pursuing a single objective.”), when in fact I mean something more like (2) (“By default, the first minds humanity makes will be a terrible spaghetti-code mess.”). In case it helps those who were previously confused: the "diamond maximizer" problem is one example of an attempt to direct researchers’ attention to the challenge of cleanly factoring cognition around something a bit like a 'goal slot'. As evidence of a misunderstanding here: people sometimes hear me describe the diamond maximizer problem, and respond to me by proposing training regimes that (for all they know) might make the AI care a little about diamonds in some contexts. This misunderstanding of what the diamond maximizer problem was originally meant to be pointing at seems plausibly related t...
Jan 30, 2023
AF - Structure, creativity, and novelty by Tsvi Benson-Tilsen
16:20

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Structure, creativity, and novelty, published by Tsvi Benson-Tilsen on January 29, 2023 on The AI Alignment Forum. [Metadata: crossposted from. First completed 26 June 2022. I'm likely to not respond to comments promptly.] A high-level confusion that I have that seems to be on the way towards understanding alignment, is the relationship between values and understanding. This essay gestures at the idea of structure in general (mainly by listing examples). Why do we want AGI at all? We want AGI in order to understand stuff that we haven't yet understood. (This is not a trivial claim. It might be false. It could be that to secure the future of humane existence, something other than understanding is necessary or sufficient; e.g. it's conceivable that solving some large combinatorial problem, akin to playing Go well or designing a protein by raw search with an explicit criterion, would end the acute risk period. But I don't know how to point at such a thing--plans I know how to point at seem to centrally involve understanding that we don't already have.) Elements and structure Understanding implies some kind of structure. (This is a trivial claim, or a definition: structure is what a mind is or participates in, when it understands.) Structure is made of elements. "Structure" is the mass noun of, or continuous substance version of, "element". The point of the word "element" is just to abbreviate "any of that pattern-y, structure-y stuff, in a mind or in the world in general". Elements. An element (of a mind) is anything that combines to constitute the mind, at any level of organization or description. Examples of elements. Any instance within a mind of any of the following categories is an element: features, aspects, properties, parts, components, subagents, pieces, inputs, algorithms, code, processes, concepts, ideas, skills, methods, procedures, values, goals, architecture, modules, thoughts, propositions, beliefs, probabilities, principles, rules, axioms, heuristics, plans, operations, connections, associations, metaphors, abstractions, memories, arguments, reasons, purposes, modes, emotions, tendencies, organs, ingredients, functions, dynamics, structures, data, types, languages, proofs, justifications, motives, images, searches, knowledge, computations, rewards, reinforcement, specifications, information, intuitions, ideologies, protocols, stimuli, responses, domains, gradients, objective functions, optimizers, satisficers, control systems, basins of attraction, tasks, attitudes, stances, dispositions, words, terms, definitions, nexi, drives, perceptions, grammar, criteria, possibilities, combinations, categories, inferences, actions. Examples of elements. Any instance within a mind of any of the following categories is an element: features, aspects, properties, parts, components, subagents, pieces, inputs, algorithms, code, processes, concepts, ideas, skills, methods, procedures, values, goals, architecture, modules, thoughts, propositions, beliefs, probabilities, principles, rules, axioms, heuristics, plans, operations, connections, associations, metaphors, abstractions, memories, arguments, reasons, purposes, modes, emotions, tendencies, organs, ingredients, functions, dynamics, structures, data, types, languages, proofs, justifications, motives, images, searches, knowledge, computations, rewards, reinforcement, specifications, information, intuitions, ideologies, protocols, stimuli, responses, domains, gradients, objective functions, optimizers, satisficers, control systems, basins of attraction, tasks, attitudes, stances, dispositions, words, terms, definitions, nexi, drives, perceptions, grammar, criteria, possibilities, combinations, categories, inferences, actions. How elements are. Mental elements overlap, crisscross, lie on spectra, control, use, associate w...
Jan 29, 2023
AF - Spooky action at a distance in the loss landscape by Jesse Hoogland
05:17

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Spooky action at a distance in the loss landscape, published by Jesse Hoogland on January 28, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program Winter 2022 Cohort. Not all global minima of the (training) loss landscape are created equal. Even if they achieve equal performance on the training set, different solutions can perform very differently on the test set or out-of-distribution. So why is it that we typically find "simple" solutions that generalize well? In a previous post, I argued that the answer is "singularities" — minimum loss points with ill-defined tangents. It's the "nastiest" singularities that have the most outsized effect on learning and generalization in the limit of large data. These act as implicit regularizers that lower the effective dimensionality of the model. Even after writing this introduction to "singular learning theory", I still find this claim weird and counterintuitive. How is it that the local geometry of a few isolated points determines the global expected behavior over all learning machines on the loss landscape? What explains the "spooky action at a distance" of singularities in the loss landscape? Today, I'd like to share my best efforts at the hand-waving physics-y intuition behind this claim. It boils down to this: singularities translate random motion at the bottom of loss basins into search for generalization. Random walks on the minimum-loss sets Let's first look at the limit in which you've trained so long that we can treat the model as restricted to a set of fixed minimum loss points. Here's the intuition pump: suppose you are a random walker living on some curve that has singularities (self-intersections, cusps, and the like). Every timestep, you take a step of a uniform length in a random available direction. Then, singularities act as a kind of "trap." If you're close to a singularity, you're more likely to take a step towards (and over) the singularity than to take a step away from the singularity. It's not quite an attractor (we're in a stochastic setting, where you can and will still break away every so often), but it's sticky enough that the "biggest" singularity will dominate your stable distribution. In the discrete case, this is just the well-known phenomenon of high-degree nodes dominating most of expected behavior of your graph. In business, it's behind the reason that Google exists. In social networks, it's similar to how your average friend has more friends than you do. To see this, consider a simple toy example: take two polygons and let them intersect at a single point. Next, let a random walker run loose on this setup. How frequently will the random walker cross each point? If you've taken a course in graph theory, you may remember that the equilibrium distribution weights nodes in proportion to their degrees. For two intersecting lines, the intersection is twice as likely as the other points. For three intersecting lines, it's three times as likely, and so on. Now just take the limit of infinitely large polygons/step size to zero, and we'll recover the continuous case we were originally interested in. Brownian motion near the minimum-loss set Well, not quite. You see, restricting ourselves to motion along the minimum-loss points is unrealistic. We're more interested in messy reality, where we're allowed some freedom to bounce around the bottoms of loss basins. This time around, the key intuition-pumping assumption is to view the behavior of stochastic gradient descent late in training as a kind of Brownian motion. When we've reached a low training-loss solution, variability between batches is a source of randomness that no longer substantially improves loss but just jiggles us between solutions that are equivalent from the perspective of the training set. To u...
Jan 28, 2023
AF - The role of Bayesian ML in AI safety - an overview by Marius Hobbhahn
16:28

Jan 27, 2023
AF - AGI will have learnt utility functions by Beren Millidge
25:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI will have learnt utility functions, published by Beren Millidge on January 25, 2023 on The AI Alignment Forum. This post is part of the work done at Conjecture. Thanks to Eric Winsor, Daniel Braun, Andrea Miotti and Connor Leahy for helpful comments and feedback on the draft versions of this post. There has been a lot of debate and discussion recently in the AI safety community about whether AGI will likely optimize for fixed goals or be a wrapper mind. The term wrapper mind is largely a restatement of the old idea of a utility maximizer, with AIXI as a canonical example. The fundamental idea is that there is an agent with some fixed utility function which it maximizes without any kind of feedback which can change its utility function. Rather, the optimization process is assumed to be 'wrapped around' some core and unchanging utility function. The capabilities core of the agent is also totally modular and disjoint from the utility function such that arbitrary planners and utility functions can be composed so long as they have the right I/O interfaces. The core 'code' of an AIXI like agent is incredibly simple and, for instance, could be implemented in this Python pseudocode : def action_perception_loop(): while True: observation = self.sensors.get_observation() state = self.update_state(self.current_state, observation) all_action_plans = self.generate_action_plans(state) all_trajectories = self.world_model.generate_all_trajectories(all_action_plans, state) optimal_plan, optimal_utility = self.evaluate_trajectories(all_trajectories) self.execute(optimal_plan) There's a couple of central elements to this architecture which must be included in any AIXI-like architecture. The AGI needs some sensorimotor equipment to both sense the world and execute its action plans. It needs a Bayesian filtering component to be able to update its representation of the world state given new observations and its current state. It needs a world model that can generate sets of action plans and then generate 'rollouts' which are simulations of likely futures given an action plan. Finally, it needs a utility function that can calculate the utility of different simulated trajectories into the future and pick the best one. Let's zoom in on this component a little more and see how the evaluate_trajectories function might look inside. It might look like this: Essentially, the AIXI agent just takes all trajectories and ranks them according to its utility function and then picks the best one to execute. The fundamental problem with such an architecture, which is severely underappreciated, is that it implicitly assumes a utility oracle. That is, there exists some function self.utility_function() which is built into the agent from the beginning which can assign a consistent utility value to arbitrary world-states. While conceptually simple, my argument is that actually designing and building such a function into an agent to achieve a specific and complex goal in the external world is incredibly difficult or impossible for agents pursuing sufficiently complex goals and operating in sufficiently complex environments. This includes almost all goals humans are likely to want to program an AGI with. This means that in practice we cannot construct AIXI-like agents that optimize for arbitrary goals in the real world, and that any agent we do build must utilize some kind of learned utility model. Specifically, this is a utility (or reward) function uθ(x) where θ is some set of parameters and x is some kind of state, where the utility function is learned by some learning process (typically supervised learning) against a dataset of state, utility pairs that are provided either by the environment or by human designers. What this means is that, unlike a wrapper mind, the agent’s utility function can be influe...
Jan 25, 2023
AF - AGI will have learnt utility functions by Beren Millidge
25:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI will have learnt utility functions, published by Beren Millidge on January 25, 2023 on The AI Alignment Forum. This post is part of the work done at Conjecture. Thanks to Eric Winsor, Daniel Braun, Andrea Miotti and Connor Leahy for helpful comments and feedback on the draft versions of this post. There has been a lot of debate and discussion recently in the AI safety community about whether AGI will likely optimize for fixed goals or be a wrapper mind. The term wrapper mind is largely a restatement of the old idea of a utility maximizer, with AIXI as a canonical example. The fundamental idea is that there is an agent with some fixed utility function which it maximizes without any kind of feedback which can change its utility function. Rather, the optimization process is assumed to be 'wrapped around' some core and unchanging utility function. The capabilities core of the agent is also totally modular and disjoint from the utility function such that arbitrary planners and utility functions can be composed so long as they have the right I/O interfaces. The core 'code' of an AIXI like agent is incredibly simple and, for instance, could be implemented in this Python pseudocode : def action_perception_loop(): while True: observation = self.sensors.get_observation() state = self.update_state(self.current_state, observation) all_action_plans = self.generate_action_plans(state) all_trajectories = self.world_model.generate_all_trajectories(all_action_plans, state) optimal_plan, optimal_utility = self.evaluate_trajectories(all_trajectories) self.execute(optimal_plan) There's a couple of central elements to this architecture which must be included in any AIXI-like architecture. The AGI needs some sensorimotor equipment to both sense the world and execute its action plans. It needs a Bayesian filtering component to be able to update its representation of the world state given new observations and its current state. It needs a world model that can generate sets of action plans and then generate 'rollouts' which are simulations of likely futures given an action plan. Finally, it needs a utility function that can calculate the utility of different simulated trajectories into the future and pick the best one. Let's zoom in on this component a little more and see how the evaluate_trajectories function might look inside. It might look like this: Essentially, the AIXI agent just takes all trajectories and ranks them according to its utility function and then picks the best one to execute. The fundamental problem with such an architecture, which is severely underappreciated, is that it implicitly assumes a utility oracle. That is, there exists some function self.utility_function() which is built into the agent from the beginning which can assign a consistent utility value to arbitrary world-states. While conceptually simple, my argument is that actually designing and building such a function into an agent to achieve a specific and complex goal in the external world is incredibly difficult or impossible for agents pursuing sufficiently complex goals and operating in sufficiently complex environments. This includes almost all goals humans are likely to want to program an AGI with. This means that in practice we cannot construct AIXI-like agents that optimize for arbitrary goals in the real world, and that any agent we do build must utilize some kind of learned utility model. Specifically, this is a utility (or reward) function uθ(x) where θ is some set of parameters and x is some kind of state, where the utility function is learned by some learning process (typically supervised learning) against a dataset of state, utility pairs that are provided either by the environment or by human designers. What this means is that, unlike a wrapper mind, the agent’s utility function can be influe...
Jan 25, 2023
AF - Thoughts on the impact of RLHF research by Paul Christiano
14:29

Jan 25, 2023
AF - Quick thoughts on "scalable oversight" / "super-human feedback" research by David Scott Krueger
02:55

Jan 25, 2023
AF - Quick thoughts on "scalable oversight" / "super-human feedback" research by David Scott Krueger
02:35

Jan 25, 2023
AF - Alexander and Yudkowsky on AGI goals by Scott Alexander
40:36

Jan 24, 2023
AF - Alexander and Yudkowsky on AGI goals by Scott Alexander
40:36

Jan 24, 2023
AF - Inverse Scaling Prize: Second Round Winners by Ian McKenzie
26:24

Jan 24, 2023
AF - Inverse Scaling Prize: Second Round Winners by Ian McKenzie
26:24

Jan 24, 2023
AF - “Endgame safety” for AGI by Steve Byrnes
08:43

Jan 24, 2023
AF - “Endgame safety” for AGI by Steve Byrnes
08:43

Jan 24, 2023
AF - Thoughts on hardware / compute requirements for AGI by Steve Byrnes
42:09

Jan 24, 2023
AF - Thoughts on hardware / compute requirements for AGI by Steve Byrnes
42:09

Jan 24, 2023
AF - Some of my disagreements with List of Lethalities by Alex Turner
19:15

Jan 24, 2023
AF - Some of my disagreements with List of Lethalities by Alex Turner
19:15

Jan 24, 2023
AF - What a compute-centric framework says about AI takeoff speeds - draft report by Tom Davidson
29:45

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What a compute-centric framework says about AI takeoff speeds - draft report, published by Tom Davidson on January 23, 2023 on The AI Alignment Forum. As part of my work for Open Philanthropy I’ve written a draft report on AI takeoff speeds, the question of how quickly AI capabilities might improve as we approach and surpass human-level AI. Will human-level AI be a bolt from the blue, or will we have AI that is nearly as capable many years earlier? Most of the analysis is from the perspective of a compute-centric framework, inspired by that used in the Bio Anchors report, in which AI capabilities increase continuously with more training compute and work to develop better AI algorithms. This post doesn’t summarise the report. Instead I want to explain some of the high-level takeaways from the research which I think apply even if you don’t buy the compute-centric framework. The framework h/t Dan Kokotajlo for writing most of this section This report accompanies and explains (h/t Epoch for building this!), a user-friendly quantitative model of AGI timelines and takeoff, which you can go play around with right now. (By AGI I mean “AI that can readily[1] perform 100% of cognitive tasks” as well as a human professional; AGI could be many AI systems working together, or one unified system.) Takeoff simulation with Tom’s best-guess value for each parameter. The framework was inspired by and builds upon the previous “Bio Anchors” report. The “core” of the Bio Anchors report was a three-factor model for forecasting AGI timelines: Dan’s visual representation of Bio Anchors report Compute to train AGI using 2020 algorithms. The first and most subjective factor is a probability distribution over training requirements (measured in FLOP) given today’s ideas. It allows for some probability to be placed in the “no amount would be enough” bucket. The probability distribution is shown by the coloured blocks on the y-axis in the above figure. Algorithmic progress. The second factor is the rate at which new ideas come along, lowering AGI training requirements. Bio Anchors models this as a steady exponential decline. It’s shown by the falling yellow lines. Bigger training runs. The third factor is the rate at which FLOP used on training runs increases, as a result of better hardware and more $spending. Bio Anchors assumes that hardware improves at a steady exponential rate. The FLOP used on the biggest training run is shown by the rising purple lines. Once there’s been enough algorithmic progress, and training runs are big enough, we can train AGI. (How much is enough? That depends on the first factor!) This draft report builds a more detailed model inspired by the above. It contains many minor changes and two major ones. The first major change is that algorithmic and hardware progress are no longer assumed to have steady exponential growth. Instead, I use standard semi-endogenous growth models from the economics literature to forecast how the two factors will grow in response to hardware and software R&D spending, and forecast that spending will grow over time. The upshot is that spending accelerates as AGI draws near, driving faster algorithmic (“software”) and hardware progress. The key dynamics represented in the model. “Software” refers to the quality of algorithms for training AI. The second major change is that I model the effects of AI systems automating economic tasks – and, crucially, tasks in hardware and software R&D – prior to AGI. I do this via the “effective FLOP gap:” the gap between AGI training requirements and training requirements for AI that can readily perform 20% of cognitive tasks (weighted by economic-value-in-2022). My best guess, defended in the report, is that you need 10,000X more effective compute to train AGI. To estimate the training requirements for AI th... Jan 23, 2023 AF - What a compute-centric framework says about AI takeoff speeds - draft report by Tom Davidson 29:45 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What a compute-centric framework says about AI takeoff speeds - draft report, published by Tom Davidson on January 23, 2023 on The AI Alignment Forum. As part of my work for Open Philanthropy I’ve written a draft report on AI takeoff speeds, the question of how quickly AI capabilities might improve as we approach and surpass human-level AI. Will human-level AI be a bolt from the blue, or will we have AI that is nearly as capable many years earlier? Most of the analysis is from the perspective of a compute-centric framework, inspired by that used in the Bio Anchors report, in which AI capabilities increase continuously with more training compute and work to develop better AI algorithms. This post doesn’t summarise the report. Instead I want to explain some of the high-level takeaways from the research which I think apply even if you don’t buy the compute-centric framework. The framework h/t Dan Kokotajlo for writing most of this section This report accompanies and explains (h/t Epoch for building this!), a user-friendly quantitative model of AGI timelines and takeoff, which you can go play around with right now. (By AGI I mean “AI that can readily[1] perform 100% of cognitive tasks” as well as a human professional; AGI could be many AI systems working together, or one unified system.) Takeoff simulation with Tom’s best-guess value for each parameter. The framework was inspired by and builds upon the previous “Bio Anchors” report. The “core” of the Bio Anchors report was a three-factor model for forecasting AGI timelines: Dan’s visual representation of Bio Anchors report Compute to train AGI using 2020 algorithms. The first and most subjective factor is a probability distribution over training requirements (measured in FLOP) given today’s ideas. It allows for some probability to be placed in the “no amount would be enough” bucket. The probability distribution is shown by the coloured blocks on the y-axis in the above figure. Algorithmic progress. The second factor is the rate at which new ideas come along, lowering AGI training requirements. Bio Anchors models this as a steady exponential decline. It’s shown by the falling yellow lines. Bigger training runs. The third factor is the rate at which FLOP used on training runs increases, as a result of better hardware and more$ spending. Bio Anchors assumes that hardware improves at a steady exponential rate. The FLOP used on the biggest training run is shown by the rising purple lines. Once there’s been enough algorithmic progress, and training runs are big enough, we can train AGI. (How much is enough? That depends on the first factor!) This draft report builds a more detailed model inspired by the above. It contains many minor changes and two major ones. The first major change is that algorithmic and hardware progress are no longer assumed to have steady exponential growth. Instead, I use standard semi-endogenous growth models from the economics literature to forecast how the two factors will grow in response to hardware and software R&D spending, and forecast that spending will grow over time. The upshot is that spending accelerates as AGI draws near, driving faster algorithmic (“software”) and hardware progress. The key dynamics represented in the model. “Software” refers to the quality of algorithms for training AI. The second major change is that I model the effects of AI systems automating economic tasks – and, crucially, tasks in hardware and software R&D – prior to AGI. I do this via the “effective FLOP gap:” the gap between AGI training requirements and training requirements for AI that can readily perform 20% of cognitive tasks (weighted by economic-value-in-2022). My best guess, defended in the report, is that you need 10,000X more effective compute to train AGI. To estimate the training requirements for AI th...
Jan 23, 2023
AF - Gemini modeling by Tsvi Benson-Tilsen
17:18

Jan 22, 2023
AF - Critique of some recent philosophy of LLMs’ minds by Roman Leventov
37:35

Jan 20, 2023
AF - Shard theory alignment requires magic. by Charlie Steiner
04:38

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shard theory alignment requires magic., published by Charlie Steiner on January 20, 2023 on The AI Alignment Forum. A delayed hot take. This is pretty similar to previous comments from Rohin. "Magic," of course, in the technical sense of stuff we need to remind ourselves we don't know how to do. I don't mean this pejoratively, locating magic is an important step in trying to demystify it. And "shard theory alignment" in the sense of building an AI that does good things and not bad things by encouraging an RL agent to want to do good things, via kinds of reward shaping analogous to the diamond maximizer example. How might the story go? You start out with some unsupervised model of sensory data. On top of its representation of the world you start training an RL agent, with a carefully chosen curriculum and a reward signal that you think matches "goodness in general" on that curriculum distribution. This cultivates shards that want things in the vicinity of "what's good according to human values." These start out as mere bundles of heuristics, but eventually they generalize far enough to be self-reflective, promoting goal-directed behavior that takes into account the training process and the possibility of self-modification. At this point the values will lock themselves in, and future behavior will be guided by the abstractions in the learned representation of the world that the shards used to get good results in training, not by what would actually maximize the reward function you used. There magic here is especially concentrated around how we end up with the right shards. One magical process is how we pick the training curriculum and reward signal. If the curriculum is only made up only of simple environments, then the RL agent will learn heuristics that don't need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended. Does a goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we've found it? And what's in the curriculum matters a lot. Do we try to teach the AI to locate "human values" by having it be prosocial towards individuals? Which ones? To groups? Over what timescale? How do we reward it for choices on various ethical dilemmas? Or do we artificially suppress the rate of occurrence of such dilemmas? Different choices will lead to different shards. We wouldn't need to find a unique best way to do things (that's a boondoggle), but we would need to find some way of doing things that we trust enough. Another piece of magic is how the above process lines up with generalization and self-reflectivity. If the RL agent becomes self-reflective too early, it will lock in simple goals that we don't want. If it becomes self-reflective too late, it will have started exploiting unintended maxima of the reward function. How do we know when we want the AI to lock in its values? How do we exert control over that? If shard theory alignment seemed like it has few free parameters, and doesn't need a lot more work, then I think you failed to see the magic. I think the free parameters haven't been discussed enough precisely because they need so much more work. The part of the magic that I think we could start working on now is how to connect curricula and learned abstractions. In order to predict that a certain curriculum will cause an AI to learn what we think is good, we want to have a science of reinforcement learning advanced in both theory and data. In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory ali...
Jan 20, 2023
AF - Thoughts on refusing harmful requests to large language models by William Saunders
03:17

Jan 19, 2023
AF - 200 COP in MI: Studying Learned Features in Language Models by Neel Nanda
46:03

Jan 19, 2023
AF - Gradient Filtering by Arun Jose
23:43

Jan 18, 2023
AF - Löbian emotional processing of emergent cooperation: an example by Andrew Critch
12:57

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Löbian emotional processing of emergent cooperation: an example, published by Andrew Critch on January 17, 2023 on The AI Alignment Forum. Related to: A Löbian argument pattern for implicit reasoning in natural language: Löbian party invitations Epistemic status: my opinion based on a mix of math, reflection, and speculation; not backed up by any systematic psychological studies. Summary: Since my 2019 paper generalizing Löb's Theorem, a couple dozen people have asked me if the way humans naturally cooperate might be well-described by Löb's Theorem. In short, my answer is probably sometimes, and in this post I'll try using an example to convey what that means. Importantly, Löb's Theorem is not a theorem when applied to real-world humans and emotions — i.e., when its hypotheses are met, its conclusion is only sometimes true. Nonetheless, the reasoning pattern in its proof (I claim) sometimes genuinely occurs at the level of intuition in real people, whether or not they know any math or Löb's Theorem. Introduction There are at least two real-world patterns that could reasonably be called Löbian cooperation in humans, which I'll name here: Functionally Löbian cooperation. Sometimes people become aware that they're anticipating (predicting) cooperation from each other, and then that anticipation causes them to cooperate, rendering the anticipation itself valid. In this pattern, the fact that anticipation of cooperation will cause cooperation is analogous to the hypothesis (main assumption) of Löb's Theorem, and the fact that the cooperation in fact emerges is analogous to the conclusion of Löb's Theorem. I call this pattern "functionally" Löbian because its input and output resemble the input (hypothesis) and output (conclusion) of Löb's Theorem. Procedurally Löbian cooperation. Sometimes the mental procedure a person follows to anticipate and decide upon cooperation can resemble an entire proof of Löb's Theorem, as I'll describe below. In other words, instead of just the hypothesis and conclusion of Löb's Theorem matching reality, the structure in the intermediate steps of the proof also match reality, at least somewhat. I call this "procedurally" Löbian cooperation, and it's a special case of functionally Löbian cooperation because it demands a stronger analogy between the theorem and the real world. Illustrating how this might work constitutes is the bulk of content in this post. What functionally Löbian cooperation feels like For those who recognize the symbols involved, Löb's Theorem says that if ⊢□cc then ⊢c. I don't plan to use these symbols with their normal meanings in the rest of this post, so don't worry if you don't recognize them. In words, functional Löbian cooperation happens when anticipation of future or unobserved cooperation causes present cooperation. So if you're interacting with someone, and you feel like they're probably going to be nice to you in the future, and that fact makes you decide to be nice to them now, I call that functional Löbian cooperation. What procedurally Löbian cooperation feels like Most human cooperation is probably not procedurally Löbian, and maybe not even functionally Löbian. However, I'm confident that human cooperation is sometimes procedurally Löbian, and I can even point to experiences of my own that fit the bill. To explain this, I'll be talking a lot more about feelings, because I think most unconscious processing is carried out by and/or experienced as feelings. I'll write Feeling("Pigs can probably fly.") for the feeling that pigs can probably fly. Such a feeling can be true or false, according to whether it correctly anticipates the real world. In procedurally Löbian cooperation, part of the mental process will involve first feeling something uncertain to do with cooperation, then believing it, and then feeling lik...
Jan 17, 2023
AF - Preparing for AI-assisted alignment research: we need data! by CBiddulph
00:28

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Preparing for AI-assisted alignment research: we need data!, published by CBiddulph on January 17, 2023 on The AI Alignment Forum. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Jan 17, 2023
AF - Experiment Idea: RL Agents Evading Learned Shutdownability by Leon Lang
28:24

Jan 16, 2023
AF - Understanding the diffusion of large language models: summary by Ben Cottier
00:28

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding the diffusion of large language models: summary, published by Ben Cottier on January 16, 2023 on The AI Alignment Forum. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Jan 16, 2023
AF - Speculation on Path-Dependance in Large Language Models. by NickyP
10:39

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speculation on Path-Dependance in Large Language Models., published by NickyP on January 15, 2023 on The AI Alignment Forum. Epistemic Status: Highly Speculative. I spent less than a day thinking about this in particular, and though I have spent a few months studying large language models, I have never trained a language model. I am likely wrong about many things. I have not seen research on this, so it may be useful for for someone to do a real deep dive. Thanks to Anthony from the Center on Long Term Risk for sparking the discussion earlier today for this post. Also thanks to conversations with Evan Hubinger ~1 year ago that got me thinking about the topic previously. Summary My vague suspicions at the moment are somewhat along the lines of: Training an initial model: moderate to low path-dependance Running a model: high "prompt-dependance" Reinforcement Learning a Model: moderate to high path-dependance. Definitions of "low" and "high" seem somewhat arbitrary, but I guess what I mean is how different behaviours of the model can be. I expect some aspects to be quite path dependant, and others not so much. This is trying to quantify how many aspects might have path-dependance based on vibe. Introduction Path dependence is thinking about the "butterfly effect" for machine learning models. For highly path-dependant models, small changes in how a model is trained can lead to big differences in how it performs. If a model is highly path-dependant, then if we want to understand how our model will behave and make sure it's doing what we want, we need to pay attention to the nitty-gritty details of the training process, like the order in which it's learning things, or the random weights initialisation. And, if we want to influence the final outcome, we have to intervene early on in the training process. I think having an understanding of path-dependance is likely useful, but have not really seen any empirical results on the topic. I think that in general, it is likely to depend on different training methods a lot, and in this post I will give some vague impressions I have on the path dependance of Large Language Models (LLMs). In this case, I will also include "prompt-dependance" as another form of "path-dependance" when it comes to the actual outputs of the models, though this is not technically correct since it does not depend on the actual training of the model. Initial Training of a Model My Understanding: Low to Moderate Path-Dependance So with Large Language Models at the moment, the main way they are trained it that you should have a very large dataset, randomise the order, and use each text exactly once. In practice, many datasets have a lot of duplicate data of things that are particularly common (possible example: transcripts of a well-know speech) though people try to avoid this. While this may seem there should be a large degree of path dependance, my general impression is that, at least in most current models, this does not happen that often. In general, LLMs can tend to struggle with niche facts, so I would perhaps expect that in some cases it learns a niche fact that it does not learn in another case, but the LLMs seems to be at least directionally accurate. (An example I have seen, is that it might say "X did Mathematics in Cambridge" instead of "X did Physics in Oxford", but compared to possibility space, it is not that far.) I suspect that having a completely different dataset would impact the model outputs significantly, but from my understanding of path dependance, this does not particularly fall under the umbrella of path dependance, since it is modelling a completely different distribution. Though even in this case, I would suspect that in text from categories in the overlapping distribution, that the models would have similar looking outputs (though ...
Jan 15, 2023
AF - Underspecification of Oracle AI by Rubi J. Hudson
30:46

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Underspecification of Oracle AI, published by Rubi J. Hudson on January 15, 2023 on The AI Alignment Forum. Rubi and Johannes worked on this post as part of the SERI MATS program, with Evan Hubinger providing mentorship to both. Rubi also received mentorship from Leo Gao. Thanks to Paul Colognese and Nicholas Schiefer for discussions related to this post. An oracle is a type of AI system that only answers questions without taking any other actions in the world. Simulators and generative models, which have seen increased discussion recently (links: 1, 2, 3, 4), can be thought of as types of oracles. Such systems may simultaneously be powerful enough to generate a pivotal act while also being easier to align due to a more limited domain. One major concern with oracles is that the answers they give can still manipulate the world. If oracles are evaluated on predictive accuracy, this gives them an incentive to use their answers to affect the course of events and make the world more predictable. Concretely, we are concerned that oracles may make self-fulfilling prophecies (also known as self-confirming predictions or fixed points), where the act of making the prediction causes the predicted outcome to come true. Even if their manipulation does not terminate in a fixed point, attempts to influence the world towards predictability can be very dangerous. As one example, consider a highly trusted oracle asked to predict the stock market. If such an oracle predicts that stock prices will rise, then people buy based off that prediction and the price will in fact rise. Similarly, if the oracle predicts that prices will fall, then people will sell, causing prices to fall. For a more real world example, see this market and this market, each on whether a specific person will find a research/romantic partner. Here, high probabilities would indicate desirability of that person, while low probabilities would suggest some hidden flaw, either of which could influence whether potential partners decide to reach out and therefore how the market resolves. In both the stock market and partnership cases, multiple predictions are valid, so how does the oracle choose between them? Ideally, we would like it to choose the one that is “better” for humanity, but this now introduces an outer alignment question similar to an agentic AI acting directly on the world, and which we wanted to avoid by using oracles in the first place. Instead, what we can aim for is an oracle that does not take into account the consequences of the prediction it makes when choosing a prediction. Then, there is only one valid prediction for the oracle to make, since the rest of the world is constant from its perspective. This can be thought of as a type of partial agency, optimizing the prediction in some directions but not others. It would be extremely desirable as a safety property, removing all incentives for an oracle to manipulate the world. To emphasize the importance of this property, we introduce new terminology, dubbing oracles “consequence-blind” if they exhibit the desired behavior and “consequence-aware” if they do not. For an oracle, consequence-blindness is equivalent to following a lonely causal decision theory. The causal decision theory blinds the oracle to any acausal influence, while the loneliness component makes it blind to its influence on other agents, which are necessary intermediaries for a prediction to influence the world. In this post we will primarily consider an oracle trained via supervised learning on a historical data set. There are a number of different policies that could be learned which minimize loss on the training set, and we will explore the different ways these can generalize. We divide the generalization behavior into a number of different axes, and for each axis discuss the potenti...
Jan 15, 2023
AF - Non-directed conceptual founding by Tsvi Benson-Tilsen
14:30

Jan 15, 2023
AF - World-Model Interpretability Is All We Need by Thane Ruthenis
36:28