The Nonlinear Library: Alignment Forum

By The Nonlinear Fund

To listen to this podcast, open the Podcast Republic app, available on the Google Play Store and the Apple App Store.


Category: Education

Subscribers: 2
Reviews: 0
Episodes: 377

Description

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org.
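
As a rough illustration of the kind of pipeline described above, here is a minimal sketch of fetching posts from an RSS feed and converting them to audio with text-to-speech. This is not the Nonlinear Library's actual implementation; the feed URL, the choice of the feedparser and gTTS libraries, and the file layout are all illustrative assumptions.

```python
# Illustrative sketch only, not the Nonlinear Library's actual pipeline.
# Assumptions: feed URL, feedparser + gTTS as the libraries, and the output layout.
import pathlib
import re

import feedparser           # pip install feedparser
from gtts import gTTS       # pip install gTTS

FEED_URL = "https://www.alignmentforum.org/feed.xml"  # hypothetical example feed
OUT_DIR = pathlib.Path("audio")
OUT_DIR.mkdir(exist_ok=True)


def slugify(title: str) -> str:
    """Turn a post title into a safe filename."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:80]


feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:5]:          # convert the five most recent posts
    out_path = OUT_DIR / f"{slugify(entry.title)}.mp3"
    if out_path.exists():               # skip posts already converted
        continue
    # Strip HTML tags from the post body before synthesis.
    text = re.sub(r"<[^>]+>", " ", entry.summary)
    gTTS(text=f"{entry.title}. {text}", lang="en").save(str(out_path))
    print(f"Wrote {out_path}")
```

Re-running a script like this on a schedule (for example via cron) would keep such an audio archive automatically up to date as new posts appear.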

Episodes (title and release date)
AF - Simple probes can catch sleeper agents by Monte MacDiarmid
Apr 23, 2024
AF - Dequantifying first-order theories by Jessica Taylor
Apr 23, 2024
AF - ProLU: A Pareto Improvement for Sparse Autoencoders by Glen M. Taggart
Apr 23, 2024
AF - Time complexity for deterministic string machines by alcatal
Apr 22, 2024
AF - Inducing Unprompted Misalignment in LLMs by Sam Svenningsen
Apr 19, 2024
AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda
Apr 19, 2024
AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda
Apr 19, 2024
AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks
Apr 18, 2024
AF - LLM Evaluators Recognize and Favor Their Own Generations by Arjun Panickssery
Apr 17, 2024
AF - Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai
Apr 16, 2024
AF - Speedrun ruiner research idea by Luke H Miles
Apr 13, 2024
AF - The theory of Proximal Policy Optimisation implementations by salman.mohammadi
Apr 12, 2024
AF - How I select alignment research projects by Ethan Perez
Apr 10, 2024
AF - PIBBSS is hiring in a variety of roles (alignment research and incubation program) by Nora Ammann
Apr 09, 2024
AF - How We Picture Bayesian Agents by johnswentworth
Apr 08, 2024
AF - Measuring Learned Optimization in Small Transformer Models by Jonathan Bostock
Apr 08, 2024
AF - Measuring Predictability of Persona Evaluations by Thee Ho
Apr 06, 2024
AF - Koan: divining alien datastructures from RAM activations by Tsvi Benson-Tilsen
Apr 05, 2024
AF - LLMs for Alignment Research: a safety priority? by Abram Demski
Apr 04, 2024
AF - Run evals on base models too! by orthonormal
Apr 04, 2024
AF - The Case for Predictive Models by Rubi Hudson
Apr 03, 2024
AF - Sparsify: A mechanistic interpretability research agenda by Lee Sharkey
Apr 03, 2024
AF - A Selection of Randomly Selected SAE Features by CallumMcDougall
Apr 01, 2024
AF - SAE-VIS: Announcement Post by CallumMcDougall
Mar 31, 2024
AF - Your LLM Judge may be biased by Rachel Freedman
Mar 29, 2024
AF - SAE reconstruction errors are (empirically) pathological by Wes Gurnee
Mar 29, 2024
AF - How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles and parrots by Owain Evans
Mar 28, 2024
AF - UDT1.01: The Story So Far (1/10) by Diffractor
Mar 27, 2024
AF - Modern Transformers are AGI, and Human-Level by Abram Demski
Mar 26, 2024
AF - Third-party testing as a key ingredient of AI policy by Zac Hatfield-Dodds
Mar 25, 2024
AF - Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders by Johnny Lin
Mar 25, 2024
AF - On the Confusion between Inner and Outer Misalignment by Chris Leong
Mar 25, 2024
AF - Dangers of Closed-Loop AI by Gordon Seidoh Worley
Mar 22, 2024
AF - Video and transcript of presentation on Scheming AIs by Joe Carlsmith
Mar 22, 2024
AF - Comparing Alignment to other AGI interventions: Extensions and analysis by Martín Soto
Mar 21, 2024
AF - Stagewise Development in Neural Networks by Jesse Hoogland
Mar 20, 2024
AF - Natural Latents: The Concepts by johnswentworth
Mar 20, 2024
AF - Comparing Alignment to other AGI interventions: Basic model by Martín Soto
Mar 20, 2024
AF - New report: Safety Cases for AI by Josh Clymer
Mar 20, 2024
AF - AtP*: An efficient and scalable method for localizing LLM behaviour to components by Neel Nanda
Mar 18, 2024
AF - Improving SAE's by Sqrt()-ing L1 and Removing Lowest Activating Features by Logan Riggs Smith
Mar 15, 2024
AF - More people getting into AI safety should do a PhD by AdamGleave
Mar 14, 2024
AF - Laying the Foundations for Vision and Multimodal Mechanistic Interpretability and Open Problems by Sonia Joseph
Mar 13, 2024
AF - Virtual AI Safety Unconference 2024 by Orpheus Lummis
Mar 13, 2024
AF - Transformer Debugger by Henk Tillman
Mar 12, 2024
AF - Open consultancy: Letting untrusted AIs choose what answer to argue for by Fabien Roger
Mar 12, 2024
AF - Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought by miles
Mar 11, 2024
AF - How disagreements about Evidential Correlations could be settled by Martín Soto
Mar 11, 2024
AF - Understanding SAE Features with the Logit Lens by Joseph Isaac Bloom
Mar 11, 2024
AF - 0th Person and 1st Person Logic by Adele Lopez
Mar 10, 2024
AF - Scenario Forecasting Workshop: Materials and Learnings by elifland
Mar 08, 2024
AF - Forecasting future gains due to post-training enhancements by elifland
Mar 08, 2024
AF - Evidential Correlations are Subjective, and it might be a problem by Martín Soto
Mar 07, 2024
AF - We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To by robertzk
Mar 06, 2024
AF - Many arguments for AI x-risk are wrong by Alex Turner
Mar 05, 2024
AF - Anthropic release Claude 3, claims >GPT-4 Performance by Lawrence Chan
Mar 04, 2024
AF - Some costs of superposition by Linda Linsefors
Mar 03, 2024
AF - Approaching Human-Level Forecasting with Language Models by Fred Zhang
Feb 29, 2024
AF - Tips for Empirical Alignment Research by Ethan Perez
Feb 29, 2024
AF - Post series on "Liability Law for reducing Existential Risk from AI" by Nora Ammann
Feb 29, 2024
AF - Timaeus's First Four Months by Jesse Hoogland
Feb 28, 2024
AF - Notes on control evaluations for safety cases by Ryan Greenblatt
Feb 28, 2024
AF - Counting arguments provide no evidence for AI doom by Nora Belrose
Feb 27, 2024
AF - Deconfusing In-Context Learning by Arjun Panickssery
Feb 25, 2024
AF - Instrumental deception and manipulation in LLMs - a case study by Olli Järviniemi
Feb 24, 2024
AF - The Shutdown Problem: Incomplete Preferences as a Solution by Elliott Thornley
Feb 23, 2024
AF - Analogies between scaling labs and misaligned superintelligent AI by Stephen Casper
Feb 21, 2024
AF - Extinction Risks from AI: Invisible to Science? by Vojtech Kovarik
Feb 21, 2024
AF - Extinction-level Goodhart's Law as a Property of the Environment by Vojtech Kovarik
Feb 21, 2024
AF - Dynamics Crucial to AI Risk Seem to Make for Complicated Models by Vojtech Kovarik
Feb 21, 2024
AF - Which Model Properties are Necessary for Evaluating an Argument? by Vojtech Kovarik
Feb 21, 2024
AF - Weak vs Quantitative Extinction-level Goodhart's Law by Vojtech Kovarik
Feb 21, 2024
AF - Why does generalization work? by Martín Soto
Feb 20, 2024
AF - Complexity classes for alignment properties by Arun Jose
Feb 20, 2024
AF - Protocol evaluations: good analogies vs control by Fabien Roger
Feb 19, 2024
AF - Self-Awareness: Taxonomy and eval suite proposal by Daniel Kokotajlo
Feb 17, 2024
AF - The Pointer Resolution Problem by Arun Jose
Feb 16, 2024
AF - Retrospective: PIBBSS Fellowship 2023 by DusanDNesic
Feb 16, 2024
AF - Searching for Searching for Search by Rubi Hudson
Feb 14, 2024
AF - Critiques of the AI control agenda by Arun Jose
Feb 14, 2024
AF - Requirements for a Basin of Attraction to Alignment by Roger Dearnaley
Feb 14, 2024
AF - Interpreting Quantum Mechanics in Infra-Bayesian Physicalism by Yegreg
Feb 12, 2024
AF - Natural abstractions are observer-dependent: a conversation with John Wentworth by Martín Soto
Feb 12, 2024
AF - Updatelessness doesn't solve most problems by Martín Soto
Feb 08, 2024
AF - Debating with More Persuasive LLMs Leads to More Truthful Answers by Akbir Khan
Feb 07, 2024
AF - How to train your own "Sleeper Agents" by Evan Hubinger
Feb 07, 2024
AF - what does davidad want from "boundaries"? by Chipmonk
Feb 06, 2024
AF - Preventing exfiltration via upload limits seems promising by Ryan Greenblatt
Feb 06, 2024
AF - Attention SAEs Scale to GPT-2 Small by Connor Kissane
Feb 03, 2024
AF - Survey for alignment researchers: help us build better field-level models by Cameron Berg
Feb 02, 2024
AF - Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small by Joseph Isaac Bloom
Feb 02, 2024
AF - Evaluating Stability of Unreflective Alignment by james.lucassen
Feb 01, 2024
AF - PIBBSS Speaker events coming up in February by DusanDNesic
Feb 01, 2024
AF - Last call for submissions for TAIS 2024! by Blaine William Rogers
Jan 30, 2024
AF - The case for more ambitious language model evals by Arun Jose
Jan 30, 2024
AF - Agents that act for reasons: a thought experiment by Michele Campolo
Jan 24, 2024
AF - We need a science of evals by Marius Hobbhahn
Jan 22, 2024
AF - InterLab - a toolkit for experiments with multi-agent interactions by Tomáš Gavenčiak
Jan 22, 2024
AF - A Shutdown Problem Proposal by johnswentworth
Jan 21, 2024
AF - Four visions of Transformative AI success by Steve Byrnes
Jan 17, 2024
AF - Managing catastrophic misuse without robust AIs by Ryan Greenblatt
Jan 16, 2024
AF - Sparse Autoencoders Work on Attention Layer Outputs by Connor Kissane
Jan 16, 2024
AF - Investigating Bias Representations in LLMs via Activation Steering by DawnLu
Jan 15, 2024
AF - Goals selected from learned knowledge: an alternative to RL alignment by Seth Herd
Jan 15, 2024
AF - Three Types of Constraints in the Space of Agents by Nora Ammann
Jan 15, 2024
AF - Introducing Alignment Stress-Testing at Anthropic by Evan Hubinger
Jan 12, 2024
AF - Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by Evan Hubinger
Jan 12, 2024
AF - Apply to the PIBBSS Summer Research Fellowship by Nora Ammann
Jan 12, 2024
AF - Goodbye, Shoggoth: The Stage, its Animatronics, and the Puppeteer - a New Metaphor by Roger Dearnaley
Jan 09, 2024
AF - A starter guide for evals by Marius Hobbhahn
Jan 08, 2024
AF - Deceptive AI ≠ Deceptively-aligned AI by Steve Byrnes
Jan 07, 2024
AF - Catching AIs red-handed by Ryan Greenblatt
Jan 05, 2024
AF - Predictive model agents are sort of corrigible by Raymond D
Jan 05, 2024
AF - What's up with LLMs representing XORs of arbitrary features? by Sam Marks
Jan 03, 2024
AF - Safety First: safety before full alignment. The deontic sufficiency hypothesis. by Chipmonk
Jan 03, 2024
AF - Steering Llama-2 with contrastive activation additions by Nina Rimsky
Jan 02, 2024
AF - Mech Interp Challenge: January - Deciphering the Caesar Cipher Model by CallumMcDougall
Jan 01, 2024
AF - A hermeneutic net for agency by Tsvi Benson-Tilsen
Jan 01, 2024
AF - A case for AI alignment being difficult by Jessica Taylor
Dec 31, 2023
AF - AI Alignment Metastrategy by Vanessa Kosoy
Dec 31, 2023
AF - Free agents by Michele Campolo
Dec 27, 2023
AF - Critical review of Christiano's disagreements with Yudkowsky by Vanessa Kosoy
Dec 27, 2023
AF - AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them by Roman Leventov
Dec 27, 2023
AF - 5. Moral Value for Sentient Animals? Alas, Not Yet by Roger Dearnaley
Dec 27, 2023
AF - Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) by Neel Nanda
Dec 23, 2023
AF - Measurement tampering detection as a special case of weak-to-strong generalization by Ryan Greenblatt
Dec 23, 2023
AF - Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) by Thane Ruthenis
Dec 22, 2023
AF - Open positions: Research Analyst at the AI Standards Lab by Koen Holtman
Dec 22, 2023
AF - How Would an Utopia-Maximizer Look Like? by Thane Ruthenis
Dec 20, 2023
AF - Meaning and Agency by Abram Demski
Dec 19, 2023
AF - Don't Share Information Exfohazardous on Others' AI-Risk Models by Thane Ruthenis
Dec 19, 2023
AF - Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize by Owain Evans
Dec 19, 2023
AF - Assessment of AI safety agendas: think about the downside risk by Roman Leventov
Dec 19, 2023
AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis
Dec 18, 2023
AF - Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar
Dec 18, 2023
AF - Interpreting the Learning of Deceit by Roger Dearnaley
Dec 18, 2023
AF - A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans by Thane Ruthenis
Dec 17, 2023
AF - OpenAI, DeepMind, Anthropic, etc. should shut down. by Tamsin Leake
Dec 17, 2023
AF - Bounty: Diverse hard tasks for LLM agents by Beth Barnes
Dec 17, 2023
AF - Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem by Ansh Radhakrishnan
Dec 16, 2023
AF - Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision by leogao
Dec 16, 2023
AF - Current AIs Provide Nearly No Data Relevant to AGI Alignment by Thane Ruthenis
Dec 15, 2023
AF - AI Control: Improving Safety Despite Intentional Subversion by Buck Shlegeris
Dec 13, 2023
AF - Some biases and selection effects in AI risk discourse by Tamsin Leake
Dec 12, 2023
AF - Adversarial Robustness Could Help Prevent Catastrophic Misuse by Aidan O'Gara
Dec 11, 2023
AF - Empirical work that might shed light on scheming (Section 6 of "Scheming AIs") by Joe Carlsmith
Dec 11, 2023
AF - Quick thoughts on the implications of multi-agent views of mind on AI takeover by Kaj Sotala
Dec 11, 2023
AF - Auditing failures vs concentrated failures by Ryan Greenblatt
Dec 11, 2023
AF - How LDT helps reduce the AI arms race by Tamsin Leake
Dec 10, 2023
AF - Send us example gnarly bugs by Beth Barnes
Dec 10, 2023
AF - Summing up "Scheming AIs" (Section 5) by Joe Carlsmith
Dec 09, 2023
AF - Finding Sparse Linear Connections between Features in LLMs by Logan Riggs Smith
Dec 09, 2023
AF - Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs") by Joe Carlsmith
Dec 08, 2023
AF - Simplicity arguments for scheming (Section 4.3 of "Scheming AIs") by Joe Carlsmith
Dec 07, 2023
AF - The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs") by Joe Carlsmith
Dec 06, 2023
AF - Google Gemini Announced by g-w1
Dec 06, 2023
AF - Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs") by Joe Carlsmith
Dec 05, 2023
AF - Studying The Alien Mind by Quentin Feuillade--Montixi
Dec 05, 2023
AF - Deep Forgetting and Unlearning for Safely-Scoped LLMs by Stephen Casper
Dec 05, 2023
AF - Neural uncertainty estimation for alignment by Charlie Steiner
Dec 05, 2023
AF - Some open-source dictionaries and dictionary learning infrastructure by Sam Marks
Dec 05, 2023
AF - 2023 Alignment Research Updates from FAR AI by AdamGleave
Dec 04, 2023
AF - What's new at FAR AI by AdamGleave
Dec 04, 2023
AF - Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs") by Joe Carlsmith
Dec 04, 2023
AF - Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs") by Joe Carlsmith
Dec 03, 2023
AF - The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs") by Joe Carlsmith
Dec 02, 2023
AF - Thoughts on "AI is easy to control" by Pope and Belrose by Steve Byrnes
Dec 01, 2023
AF - How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs") by Joe Carlsmith
Dec 01, 2023
AF - FixDT by Abram Demski
Nov 30, 2023
AF - Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of "Scheming AIs") by Joe Carlsmith
Nov 30, 2023
AF - [Linkpost] Remarks on the Convergence in Distribution of Random Neural Networks to Gaussian Processes in the Infinite Width Limit by Spencer Becker-Kahn
Nov 30, 2023
AF - "Clean" vs. "messy" goal-directedness (Section 2.2.3 of "Scheming AIs") by Joe Carlsmith
Nov 29, 2023
AF - Intro to Superposition and Sparse Autoencoders (Colab exercises) by CallumMcDougall
Nov 29, 2023
AF - How to Control an LLM's Behavior (why my P(DOOM) went down) by Roger Dearnaley
Nov 28, 2023
AF - Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs") by Joe Carlsmith
Nov 28, 2023
AF - Anthropic Fall 2023 Debate Progress Update by Ansh Radhakrishnan
Nov 28, 2023
AF - AISC 2024 - Project Summaries by Nicky Pochinkov
Nov 27, 2023
AF - There is no IQ for AI by Gabriel Alfour
Nov 27, 2023
AF - Two concepts of an "episode" (Section 2.2.1 of "Scheming AIs") by Joe Carlsmith
Nov 27, 2023
AF - Situational awareness (Section 2.1 of "Scheming AIs") by Joe Carlsmith
Nov 26, 2023
AF - On "slack" in training (Section 1.5 of "Scheming AIs") by Joe Carlsmith
Nov 25, 2023
AF - Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs") by Joe Carlsmith
Nov 24, 2023
AF - Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense by Nate Soares
Nov 24, 2023
AF - 4. A Moral Case for Evolved-Sapience-Chauvinism by Roger Dearnaley
Nov 24, 2023
AF - 3. Uploading by Roger Dearnaley
Nov 23, 2023
AF - Thomas Kwa's research journal by Thomas Kwa
Nov 23, 2023
AF - A taxonomy of non-schemer models (Section 1.2 of "Scheming AIs") by Joe Carlsmith
Nov 22, 2023
AF - Public Call for Interest in Mathematical Alignment by David Manheim
Nov 22, 2023
AF - Varieties of fake alignment (Section 1.1 of "Scheming AIs") by Joe Carlsmith
Nov 21, 2023
AF - Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example by Stuart Armstrong
Nov 21, 2023
AF - Agent Boundaries Aren't Markov Blankets. by Abram Demski
Nov 20, 2023
AF - New paper shows truthfulness and instruction-following don't generalize by default by Josh Clymer
Nov 19, 2023
AF - My Criticism of Singular Learning Theory by Joar Skalse
Nov 19, 2023
AF - AI Safety Camp 2024 by Linda Linsefors
Nov 18, 2023
AF - Sam Altman fired from OpenAI by Lawrence Chan
Nov 17, 2023
AF - Coup probes trained off-policy by Fabien Roger
Nov 17, 2023
AF - Evaluating AI Systems for Moral Status Using Self-Reports by Ethan Perez
Nov 16, 2023
AF - Experiences and learnings from both sides of the AI safety job market by Marius Hobbhahn
Nov 15, 2023
AF - Theories of Change for AI Auditing by Lee Sharkey
Nov 13, 2023
AF - Open Phil releases RFPs on LLM Benchmarks and Forecasting by Lawrence Chan
Nov 11, 2023
AF - We have promising alignment plans with low taxes by Seth Herd
Nov 10, 2023
AF - Learning-theoretic agenda reading list by Vanessa Kosoy
Nov 09, 2023
AF - Five projects from AI Safety Hub Labs 2023 by Charlie Griffin
Nov 08, 2023
AF - Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models by Felix Hofstätter
Nov 08, 2023
AF - Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour
Nov 07, 2023
AF - Box inversion revisited by Jan Kulveit
Nov 07, 2023
AF - Announcing TAIS 2024 by Blaine William Rogers
Nov 06, 2023
AF - Genetic fitness is a measure of selection strength, not the selection target by Kaj Sotala
Nov 04, 2023
AF - Untrusted smart models and trusted dumb models by Buck Shlegeris
Nov 04, 2023
AF - Thoughts on open source AI by Sam Marks
Nov 03, 2023
AF - Mech Interp Challenge: November - Deciphering the Cumulative Sum Model by TheMcDouglas
Nov 02, 2023
AF - My thoughts on the social response to AI risk by Matthew Barnett
Nov 01, 2023
AF - Dario Amodei's prepared remarks from the UK AI Safety Summit, on Anthropic's Responsible Scaling Policy by Zac Hatfield-Dodds
Nov 01, 2023
AF - 4. Risks from causing illegitimate value change (performative predictors) by Nora Ammann
Oct 26, 2023
AF - 3. Premise three and Conclusion: AI systems can affect value change trajectories and the Value Change Problem by Nora Ammann
Oct 26, 2023
AF - I don't find the lie detection results that surprising (by an author of the paper) by JanBrauner
Oct 04, 2023
AF - Some Quick Follow-Up Experiments to "Taken out of context: On measuring situational awareness in LLMs" by miles
Oct 03, 2023
AF - Direction of Fit by Nicholas Kees Dupuis
Oct 02, 2023
AF - New Tool: the Residual Stream Viewer by Adam Yedidia
Oct 01, 2023
AF - How model editing could help with the alignment problem by Michael Ripa
Sep 30, 2023
AF - How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by JanBrauner
Sep 28, 2023
AF - Alignment Workshop talks by Richard Ngo
Sep 28, 2023
AF - Different views of alignment have different consequences for imperfect methods by Stuart Armstrong
Sep 28, 2023
AF - Projects I would like to see (possibly at AI Safety Camp) by Linda Linsefors
Sep 27, 2023
AF - Announcing the CNN Interpretability Competition by Stephen Casper
Sep 26, 2023
AF - Impact stories for model internals: an exercise for interpretability researchers by Jenny Nitishinskaya
Sep 25, 2023
AF - What is wrong with this "utility switch button problem" approach? by Donald Hobson
Sep 25, 2023
AF - Understanding strategic deception and deceptive alignment by Marius Hobbhahn
Sep 25, 2023
AF - Sparse Autoencoders: Future Work by Logan Riggs Smith
Sep 21, 2023
AF - Sparse Autoencoders Find Highly Interpretable Directions in Language Models by Logan Riggs Smith
Sep 21, 2023
AF - There should be more AI safety orgs by Marius Hobbhahn
Sep 21, 2023
AF - Image Hijacks: Adversarial Images can Control Generative Models at Runtime by Scott Emmons
Sep 20, 2023
AF - Interpretability Externalities Case Study - Hungry Hungry Hippos by Magdalena Wache
Sep 20, 2023
AF - Anthropic's Responsible Scaling Policy and Long Term Benefit Trust by Zac Hatfield-Dodds
Sep 19, 2023
AF - Where might I direct promising-to-me researchers to apply for alignment jobs/grants? by Abram Demski
Sep 18, 2023
AF - Three ways interpretability could be impactful by Arthur Conmy
Sep 18, 2023
AF - Telopheme, telophore, and telotect by Tsvi Benson-Tilsen
Sep 17, 2023
AF - How to talk about reasons why AGI might not be near? by Kaj Sotala
Sep 17, 2023
AF - Uncovering Latent Human Wellbeing in LLM Embeddings by ChengCheng
Sep 14, 2023
AF - Mech Interp Challenge: September - Deciphering the Addition Model by TheMcDouglas
Sep 13, 2023
AF - Apply to lead a project during the next virtual AI Safety Camp by Linda Linsefors
Sep 13, 2023
AF - UDT shows that decision theory is more puzzling than ever by Wei Dai
Sep 13, 2023
AF - Focus on the Hardest Part First by Johannes C. Mayer
Sep 11, 2023
AF - Explaining grokking through circuit efficiency by Vikrant Varma
Sep 08, 2023
AF - The Löbian Obstacle, And Why You Should Care by marc/er
Sep 07, 2023
AF - Recreating the caring drive by Catnee
Sep 07, 2023
AF - ActAdd: Steering Language Models without Optimization by technicalities
Sep 06, 2023
AF - What I would do if I wasn't at ARC Evals by Lawrence Chan
Sep 05, 2023
AF - Benchmarks for Detecting Measurement Tampering [Redwood Research] by Ryan Greenblatt
Sep 05, 2023
AF - Paper: On measuring situational awareness in LLMs by Owain Evans
Sep 04, 2023
AF - Fundamental question: What determines a mind's effects? by Tsvi Benson-Tilsen
Sep 03, 2023
AF - Series of absurd upgrades in nature's great search by Luke H Miles
Sep 03, 2023
AF - PIBBSS Summer Symposium 2023 by Nora Ammann
Sep 02, 2023
AF - Tensor Trust: An online game to uncover prompt injection vulnerabilities by Luke Bailey
Sep 01, 2023
AF - Meta Questions about Metaphilosophy by Wei Dai
Sep 01, 2023
AF - Responses to apparent rationalist confusions about game / decision theory by Anthony DiGiovanni
Aug 30, 2023
AF - Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy by Neel Nanda
Aug 29, 2023
AF - An OV-Coherent Toy Model of Attention Head Superposition by LaurenGreenspan
Aug 29, 2023
AF - Barriers to Mechanistic Interpretability for AGI Safety by Connor Leahy
Aug 29, 2023
AF - AI Deception: A Survey of Examples, Risks, and Potential Solutions by Simon Goldstein
Aug 29, 2023
AF - OpenAI base models are not sycophantic, at any size by nostalgebraist
Aug 29, 2023
AF - Paradigms and Theory Choice in AI: Adaptivity, Economy and Control by particlemania
Aug 28, 2023
AF - A list of core AI safety problems and how I hope to solve them by davidad (David A. Dalrymple)
Aug 26, 2023
AF - Red-teaming language models via activation engineering by Nina Rimsky
Aug 26, 2023
AF - A Model-based Approach to AI Existential Risk by Samuel Dylan Martin
Aug 25, 2023
AF - Implications of evidential cooperation in large worlds by Lukas Finnveden
Aug 23, 2023
AF - Causality and a Cost Semantics for Neural Networks by scottviteri
Aug 21, 2023
AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann
Aug 20, 2023
AF - We can do better than DoWhatIMean by Luke H Miles
Aug 19, 2023
AF - An Overview of Catastrophic AI Risks: Summary by Dan H
Aug 18, 2023
AF - Managing risks of our own work by Beth Barnes
Aug 18, 2023
AF - Autonomous replication and adaptation: an attempt at a concrete danger threshold by Hjalmar Wijk
Aug 17, 2023
AF - If we had known the atmosphere would ignite by Jeffs
Aug 16, 2023
AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor
Aug 16, 2023
AF - AGI is easier than robotaxis by Daniel Kokotajlo
Aug 13, 2023
AF - When discussing AI risks, talk about capabilities, not intelligence by Victoria Krakovna
Aug 11, 2023
AF - Linkpost: We need another Expert Survey on Progress in AI, urgently by David Mears
Aug 11, 2023
AF - Could We Automate AI Alignment Research? by Stephen McAleese
Aug 10, 2023
AF - The positional embedding matrix and previous-token heads: how do they actually work? by Adam Yedidia
Aug 10, 2023
AF - Mech Interp Challenge: August - Deciphering the First Unique Character Model by TheMcDouglas
Aug 09, 2023
AF - Ground-Truth Label Imbalance Impairs Contrast-Consistent Search Performance by Tom Angsten
Aug 09, 2023
AF - Modulating sycophancy in an RLHF model via activation steering by NinaR
Aug 09, 2023
AF - Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Evan Hubinger
Aug 08, 2023
AF - An interactive introduction to grokking and mechanistic interpretability by Adam Pearce
Aug 07, 2023
AF - Yann LeCun on AGI and AI Safety by Chris Leong
Aug 06, 2023
AF - Password-locked models: a stress case for capabilities evaluation by Fabien Roger
Aug 03, 2023
AF - 3 levels of threat obfuscation by HoldenKarnofsky
Aug 02, 2023
AF - ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks by Beth Barnes
Aug 01, 2023
AF - The "no sandbagging on checkable tasks" hypothesis by Joe Carlsmith
Jul 31, 2023
AF - Watermarking considered overrated? by DanielFilan
Jul 31, 2023
AF - Thoughts on sharing information about language model capabilities by Paul Christiano
Jul 31, 2023
AF - Open Problems and Fundamental Limitations of RLHF by Stephen Casper
Jul 31, 2023
AF - When can we trust model evaluations? by Evan Hubinger
Jul 28, 2023
AF - Reducing sycophancy and improving honesty via activation steering by NinaR
Jul 28, 2023
AF - Mech Interp Puzzle 2: Word2Vec Style Embeddings by Neel Nanda
Jul 28, 2023
AF - Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy by Buck Shlegeris
Jul 26, 2023
AF - Frontier Model Security by Matthew "Vaniver" Gray
Jul 26, 2023
AF - How LLMs are and are not myopic by janus
Jul 25, 2023
AF - Open problems in activation engineering by Alex Turner
Jul 24, 2023
AF - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope
Jul 23, 2023
AF - Examples of Prompts that Make GPT-4 Output Falsehoods by Stephen Casper
Jul 22, 2023
AF - Reward Hacking from a Causal Perspective by Tom Everitt
Jul 21, 2023
AF - Priorities for the UK Foundation Models Taskforce by Andrea Miotti
Jul 21, 2023
AF - Even Superhuman Go AIs Have Surprising Failure Modes by AdamGleave
Jul 20, 2023
AF - Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla by Neel Nanda
Jul 20, 2023
AF - Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping by Robert Kirk
Jul 20, 2023
AF - Alignment Grantmaking is Funding-Limited Right Now by johnswentworth
Jul 19, 2023
AF - Tiny Mech Interp Projects: Emergent Positional Embeddings of Words by Neel Nanda
Jul 18, 2023
AF - Still no Lie Detector for LLMs by Daniel Herrmann
Jul 18, 2023
AF - Meta announces Llama 2; "open sources" it for commercial use by Lawrence Chan
Jul 18, 2023
AF - Measuring and Improving the Faithfulness of Model-Generated Reasoning by Ansh Radhakrishnan
Jul 18, 2023
AF - Thoughts on "Process-Based Supervision" by Steve Byrnes
Jul 17, 2023
AF - AutoInterpretation Finds Sparse Coding Beats Alternatives by Hoagy
Jul 17, 2023
AF - Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo by Neel Nanda
Jul 16, 2023
AF - Robustness of Model-Graded Evaluations and Automated Interpretability by Simon Lermen
Jul 15, 2023
AF - Eric Michaud on the Quantization Model of Neural Scaling, Interpretability and Grokking by Michaël Trazzi
Jul 12, 2023
AF - What does the launch of x.ai mean for AI Safety? by Chris Leong
Jul 12, 2023
AF - Towards Developmental Interpretability by Jesse Hoogland
Jul 12, 2023
AF - Goal-Direction for Simulated Agents by Raymond D
Jul 12, 2023
AF - Incentives from a causal perspective by Tom Everitt
Jul 10, 2023
AF - “Reframing Superintelligence” + LLMs + 4 years by Eric Drexler
Jul 10, 2023
AF - Open-minded updatelessness by Nicolas Macé
Jul 10, 2023
AF - Consciousness as a conflationary alliance term by Andrew Critch
Jul 10, 2023
AF - Really Strong Features Found in Residual Stream by Logan Riggs Smith
Jul 08, 2023
AF - Seven Strategies for Tackling the Hard Part of the Alignment Problem by Stephen Casper
Jul 08, 2023
AF - Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren
Jul 08, 2023
AF - "Concepts of Agency in Biology" (Okasha, 2023) - Brief Paper Summary by Nora Ammann
Jul 08, 2023
AF - Views on when AGI comes and on strategy to reduce existential risk by Tsvi Benson-Tilsen
Jul 08, 2023
AF - Jesse Hoogland on Developmental Interpretability and Singular Learning Theory by Michaël Trazzi
Jul 06, 2023
AF - [Linkpost] Introducing Superalignment by Beren Millidge
Jul 05, 2023
AF - (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders by Logan Riggs Smith
Jul 05, 2023
AF - Ten Levels of AI Alignment Difficulty by Samuel Dylan Martin
Jul 03, 2023
AF - VC Theory Overview by Joar Skalse
Jul 02, 2023
AF - Sources of evidence in Alignment by Martín Soto
Jul 02, 2023
AF - Quantitative cruxes in Alignment by Martín Soto
Jul 02, 2023
AF - How Smart Are Humans? by Joar Skalse
Jul 02, 2023
AF - Using (Uninterpretable) LLMs to Generate Interpretable AI Code by Joar Skalse
Jul 02, 2023
AF - Agency from a causal perspective by Tom Everitt
Jun 30, 2023
AF - When do "brains beat brawn" in Chess? An experiment by titotal
Jun 28, 2023
AF - Catastrophic Risks from AI #6: Discussion and FAQ by Dan H
Jun 27, 2023
AF - Catastrophic Risks from AI #5: Rogue AIs by Dan H
Jun 27, 2023
AF - Catastrophic Risks from AI #4: Organizational Risks by Dan H
Jun 26, 2023
AF - The fraught voyage of aligned novelty by Tsvi Benson-Tilsen
Jun 26, 2023
AF - Catastrophic Risks from AI #3: AI Race by Dan H
Jun 23, 2023
AF - Why Not Subagents? by johnswentworth
Jun 22, 2023
AF - An Overview of Catastrophic AI Risks #2 by Dan H
Jun 22, 2023
AF - An Overview of Catastrophic AI Risks #1 by Dan H
Jun 22, 2023
AF - The Hubinger lectures on AGI safety: an introductory lecture series by Evan Hubinger
Jun 22, 2023
AF - Causality: A Brief Introduction by Tom Everitt
Jun 20, 2023
AF - Ban development of unpredictable powerful models? by Alex Turner
Jun 20, 2023
AF - Mode collapse in RL may be fueled by the update equation by Alex Turner
Jun 19, 2023
AF - Experiments in Evaluating Steering Vectors by Gytis Daujotas
Jun 19, 2023
AF - Provisionality by Tsvi Benson-Tilsen
Jun 19, 2023
AF - Revising Drexler's CAIS model by Matthew Barnett
Jun 16, 2023
AF - [Replication] Conjecture's Sparse Coding in Small Transformers by Hoagy
Jun 16, 2023
AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger
Jun 16, 2023
AF - MetaAI: less is less for alignment. by Cleo Nardo
Jun 13, 2023
AF - Virtual AI Safety Unconference (VAISU) by Linda Linsefors
Jun 13, 2023
AF - TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI by Andrew Critch
Jun 13, 2023
AF - Contingency: A Conceptual Tool from Evolutionary Biology for Alignment by clem acs
Jun 12, 2023
AF - ARC is hiring theoretical researchers by Paul Christiano
Jun 12, 2023
AF - Introduction to Towards Causal Foundations of Safe AGI by Tom Everitt
Jun 12, 2023
AF - Explicitness by Tsvi Benson-Tilsen
Jun 12, 2023
AF - Inference-Time Intervention: Eliciting Truthful Answers from a Language Model by likenneth
Jun 11, 2023
AF - How biosafety could inform AI standards by Olivia Jimenez
Jun 09, 2023
AF - Takeaways from the Mechanistic Interpretability Challenges by Stephen Casper
Jun 08, 2023
AF - What will GPT-2030 look like? by Jacob Steinhardt
Jun 07, 2023
AF - An Exercise to Build Intuitions on AGI Risk by Lauro Langosco
Jun 07, 2023
AF - A Playbook for AI Risk Reduction (focused on misaligned AI) by HoldenKarnofsky
Jun 06, 2023
AF - AISC end of program presentations by Linda Linsefors
Jun 06, 2023
AF - Algorithmic Improvement Is Probably Faster Than Scaling Now by johnswentworth
Jun 06, 2023
AF - Wildfire of strategicness by Tsvi Benson-Tilsen
Jun 05, 2023
AF - How to Think About Activation Patching by Neel Nanda
Jun 04, 2023
AF - Unfaithful Explanations in Chain-of-Thought Prompting by miles
Jun 03, 2023
AF - [Replication] Conjecture's Sparse Coding in Toy Models by Hoagy
Jun 02, 2023
AF - Think carefully before calling RL policies "agents" by Alex Turner
Jun 02, 2023