Episodios

  • “My AGI safety research—2025 review, ’26 plans” by Steven Byrnes
    Dec 15 2025
    Previous: 2024, 2022

    “Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]

    1. Background & threat model

    The main threat model I’m working to address is the same as it's been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:

    • The “secret sauce” of human intelligence is a big uniform-ish learning algorithm centered around the cortex;
    • This learning algorithm is different from and more powerful than LLMs;
    • Nobody knows how it works today;
    • Someone someday will either reverse-engineer this learning algorithm, or reinvent something similar;
    • And then we’ll have Artificial General Intelligence (AGI) and superintelligence (ASI).
    I think that, when this learning algorithm is understood, it will be easy to get it to do powerful and impressive things, and to make money, as long as it's weak enough that humans can keep it under control. But past that stage, we’ll be relying on the AGIs to have good motivations, and not be egregiously misaligned and scheming to take over the world and wipe out humanity. Alas, I claim that the latter kind of motivation is what we should expect to occur, in [...]

    ---

    Outline:

    (00:26) 1. Background & threat model

    (02:24) 2. The theme of 2025: trying to solve the technical alignment problem

    (04:02) 3. Two sketchy plans for technical AGI alignment

    (07:05) 4. On to what I've actually been doing all year!

    (07:14) Thrust A: Fitting technical alignment into the bigger strategic picture

    (09:46) Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers

    (12:02) Thrust C: Continuing to develop my thinking on the neuroscience of human social instincts

    (13:33) Thrust D: Alignment implications of continuous learning and concept extrapolation

    (14:41) Thrust E: Neuroscience odds and ends

    (16:21) Thrust F: Economics of superintelligence

    (17:18) Thrust G: AGI safety miscellany

    (17:41) Thrust H: Outreach

    (19:13) 5. Other stuff

    (20:05) 6. Plan for 2026

    (21:03) 7. Acknowledgements

    The original text contained 7 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/CF4Z9mQSfvi99A3BR/my-agi-safety-research-2025-review-26-plans

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    22 m
  • “Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f
    Dec 14 2025
    This is the abstract and introduction of our new paper.

    Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

    Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)


    You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
    Abstract


    LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts.

    In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention.

    The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely [...]





    ---

    Outline:

    (00:57) Abstract

    (02:52) Introduction

    (11:02) Limitations

    (12:36) Explaining narrow-to-broad generalization

    The original text contained 3 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/tCfjXzwKXmWnLkoHp/weird-generalization-and-inductive-backdoors

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    18 m
  • “Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw
    Dec 13 2025
    Credit: Nano Banana, with some text provided. You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]

    This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness and the relatively hands-off approach of its creator, David Hershey of Anthropic.[2] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up.

    But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.

    Though, hardly AGI-heralding, as will become clear. What follows are notes on how Claude has improved—or failed to improve—in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[3]

    [...]

    ---

    Outline:

    (01:28) Improvements

    (01:31) Much Better Vision, Somewhat Better Seeing

    (03:05) Attention is All You Need

    (04:29) The Object of His Desire

    (06:05) A Note

    (06:34) Mildly Better Spatial Awareness

    (07:27) Better Use of Context Window and Note-keeping to Simulate Memory

    (09:00) Self-Correction; Breaks Out of Loops Faster

    (10:01) Not Improvements

    (10:05) Claude would still never be mistaken for a Human playing the game

    (12:19) Claude Still Gets Pretty Stuck

    (13:51) Claude Really Needs His Notes

    (14:37) Poor Long-term Planning

    (16:17) Dont Forget

    The original text contained 9 footnotes which were omitted from this narration.

    ---

    First published:
    December 9th, 2025

    Source:
    https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    18 m
  • “The funding conversation we left unfinished” by jenn
    Dec 13 2025
    People working in the AI industry are making stupid amounts of money, and word on the street is that Anthropic is going to have some sort of liquidity event soon (for example possibly IPOing sometime next year). A lot of people working in AI are familiar with EA, and are intending to direct donations our way (if they haven't started already). People are starting to discuss what this might mean for their own personal donations and for the ecosystem, and this is encouraging to see.

    It also has me thinking about 2022. Immediately before the FTX collapse, we were just starting to reckon, as a community, with the pretty significant vibe shift in EA that came from having a lot more money to throw around.

    CitizenTen, in "The Vultures Are Circling" (April 2022), puts it this way:

    The message is out. There's easy money to be had. And the vultures are coming. On many internet circles, there's been a worrying tone. “You should apply for [insert EA grant], all I had to do was pretend to care about x, and I got $$!” Or, “I’m not even an EA, but I can pretend, as getting a 10k grant is [...]

    ---

    First published:
    December 9th, 2025

    Source:
    https://www.lesswrong.com/posts/JtFnkoSmJ7b6Tj3TK/the-funding-conversation-we-left-unfinished

    ---



    Narrated by TYPE III AUDIO.

    Más Menos
    5 m
  • “The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck
    Dec 11 2025
    Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.

    Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.

    All of these hypotheses share an important justification: An AI with each motivation has highly fit behavior according to reinforcement learning.

    This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.

    In this post I’ll spell out what this more general principle means and why it's helpful. Specifically:

    • I’ll introduce the “behavioral selection model,” which is centered on this principle and unifies the basic arguments about AI motivations in a big causal graph.
    • I’ll discuss the basic implications for AI motivations.
    • And then I’ll discuss some important extensions and omissions of the behavioral selection model.
    This [...]

    ---

    Outline:

    (02:13) How does the behavioral selection model predict AI behavior?

    (05:18) The causal graph

    (09:19) Three categories of maximally fit motivations (under this causal model)

    (09:40) 1. Fitness-seekers, including reward-seekers

    (11:42) 2. Schemers

    (14:02) 3. Optimal kludges of motivations

    (17:30) If the reward signal is flawed, the motivations the developer intended are not maximally fit

    (19:50) The (implicit) prior over cognitive patterns

    (24:07) Corrections to the basic model

    (24:22) Developer iteration

    (27:00) Imperfect situational awareness and planning from the AI

    (28:40) Conclusion

    (31:28) Appendix: Important extensions

    (31:33) Process-based supervision

    (33:04) White-box selection of cognitive patterns

    (34:34) Cultural selection of memes

    The original text contained 21 footnotes which were omitted from this narration.

    ---

    First published:
    December 4th, 2025

    Source:
    https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    36 m
  • “Little Echo” by Zvi
    Dec 9 2025
    I believe that we will win.

    An echo of an old ad for the 2014 US men's World Cup team. It did not win.

    I was in Berkeley for the 2025 Secular Solstice. We gather to sing and to reflect.

    The night's theme was the opposite: ‘I don’t think we’re going to make it.’

    As in: Sufficiently advanced AI is coming. We don’t know exactly when, or what form it will take, but it is probably coming. When it does, we, humanity, probably won’t make it. It's a live question. Could easily go either way. We are not resigned to it. There's so much to be done that can tilt the odds. But we’re not the favorite.

    Raymond Arnold, who ran the event, believes that. I believe that.

    Yet in the middle of the event, the echo was there. Defiant.

    I believe that we will win.

    There is a recording of the event. I highly encourage you to set aside three hours at some point in December, to listen, and to participate and sing along. Be earnest.

    If you don’t believe it, I encourage this all the more. If you [...]

    ---

    First published:
    December 8th, 2025

    Source:
    https://www.lesswrong.com/posts/YPLmHhNtjJ6ybFHXT/little-echo

    ---



    Narrated by TYPE III AUDIO.

    Más Menos
    4 m
  • “A Pragmatic Vision for Interpretability” by Neel Nanda
    Dec 8 2025
    Executive Summary

    • The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
      • Trying to directly solve problems on the critical path to AGI going well[[1]]
      • Carefully choosing problems according to our comparative advantage
      • Measuring progress with empirical feedback on proxy tasks
    • We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
      • Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
      • Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
      • See our companion piece for more on which research areas and theories of change we think are promising
    • Why pivot now? We think that times have changed.
      • Models are far more capable, bringing new questions within empirical reach
      • We have been [...]
    ---

    Outline:

    (00:10) Executive Summary

    (03:00) Introduction

    (03:44) Motivating Example: Steering Against Evaluation Awareness

    (06:21) Our Core Process

    (08:20) Which Beliefs Are Load-Bearing?

    (10:25) Is This Really Mech Interp?

    (11:27) Our Comparative Advantage

    (14:57) Why Pivot?

    (15:20) Whats Changed In AI?

    (16:08) Reflections On The Fields Progress

    (18:18) Task Focused: The Importance Of Proxy Tasks

    (18:52) Case Study: Sparse Autoencoders

    (21:35) Ensure They Are Good Proxies

    (23:11) Proxy Tasks Can Be About Understanding

    (24:49) Types Of Projects: What Drives Research Decisions

    (25:18) Focused Projects

    (28:31) Exploratory Projects

    (28:35) Curiosity Is A Double-Edged Sword

    (30:56) Starting In A Robustly Useful Setting

    (34:45) Time-Boxing

    (36:27) Worked Examples

    (39:15) Blending The Two: Tentative Proxy Tasks

    (41:23) What's Your Contribution?

    (43:08) Jack Lindsey's Approach

    (45:44) Method Minimalism

    (46:12) Case Study: Shutdown Resistance

    (48:28) Try The Easy Methods First

    (50:02) When Should We Develop New Methods?

    (51:36) Call To Action

    (53:04) Acknowledgments

    (54:02) Appendix: Common Objections

    (54:08) Aren't You Optimizing For Quick Wins Over Breakthroughs?

    (56:34) What If AGI Is Fundamentally Different?

    (57:30) I Care About Scientific Beauty and Making AGI Go Well

    (58:09) Is This Just Applied Interpretability?

    (58:44) Are You Saying This Because You Need To Prove Yourself Useful To Google?

    (59:10) Does This Really Apply To People Outside AGI Companies?

    (59:40) Aren't You Just Giving Up?

    (01:00:04) Is Ambitious Reverse-engineering Actually Overcrowded?

    (01:00:48) Appendix: Defining Mechanistic Interpretability

    (01:01:44) Moving Toward Mechanistic OR Interpretability

    The original text contained 47 footnotes which were omitted from this narration.

    ---

    First published:
    December 1st, 2025

    Source:
    https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-inter
    Más Menos
    1 h y 4 m
  • “AI in 2025: gestalt” by technicalities
    Dec 8 2025
    This is the editorial for this year's "Shallow Review of AI Safety". (It got long enough to stand alone.)

    Epistemic status: subjective impressions plus one new graph plus 300 links.

    Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

    tl;dr

    • Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But they at least agree that we’re 2-20 years off (if you allow for other paradigms arising). In this piece I stick to arguments rather than reporting who thinks what.
    • My view: compared to last year, AI is much more impressive but not much more useful. They improved on many things they were explicitly optimised for (coding, vision, OCR, benchmarks), and did not hugely improve on everything else. Progress is thus (still!) consistent with current frontier training bringing more things in-distribution rather than generalising very far.
    • Pretraining (GPT-4.5, Grok 4, but also counterfactual large runs which weren’t done) disappointed people this year. It's probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should [...]
    ---

    Outline:

    (00:36) tl;dr

    (03:51) Capabilities in 2025

    (04:02) Arguments against 2025 capabilities growth being above-trend

    (08:48) Arguments for 2025 capabilities growth being above-trend

    (16:19) Evals crawling towards ecological validity

    (19:28) Safety in 2025

    (22:39) The looming end of evals

    (24:35) Prosaic misalignment

    (26:56) What is the plan?

    (29:30) Things which might fundamentally change the nature of LLMs

    (31:03) Emergent misalignment and model personas

    (32:32) Monitorability

    (34:15) New people

    (34:49) Overall

    (35:17) Discourse in 2025

    The original text contained 9 footnotes which were omitted from this narration.

    ---

    First published:
    December 7th, 2025

    Source:
    https://www.lesswrong.com/posts/Q9ewXs8pQSAX5vL7H/ai-in-2025-gestalt

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    42 m