LessWrong (Curated & Popular) Podcast Por LessWrong arte de portada

LessWrong (Curated & Popular)

LessWrong (Curated & Popular)

De: LessWrong
Escúchala gratis

Obtén 3 meses por US$0.99 al mes

Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.

If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.

© 2025 LessWrong (Curated & Popular)
Ciencias Sociales Filosofía
Episodios
  • “My AGI safety research—2025 review, ’26 plans” by Steven Byrnes
    Dec 15 2025
    Previous: 2024, 2022

    “Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]

    1. Background & threat model

    The main threat model I’m working to address is the same as it's been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:

    • The “secret sauce” of human intelligence is a big uniform-ish learning algorithm centered around the cortex;
    • This learning algorithm is different from and more powerful than LLMs;
    • Nobody knows how it works today;
    • Someone someday will either reverse-engineer this learning algorithm, or reinvent something similar;
    • And then we’ll have Artificial General Intelligence (AGI) and superintelligence (ASI).
    I think that, when this learning algorithm is understood, it will be easy to get it to do powerful and impressive things, and to make money, as long as it's weak enough that humans can keep it under control. But past that stage, we’ll be relying on the AGIs to have good motivations, and not be egregiously misaligned and scheming to take over the world and wipe out humanity. Alas, I claim that the latter kind of motivation is what we should expect to occur, in [...]

    ---

    Outline:

    (00:26) 1. Background & threat model

    (02:24) 2. The theme of 2025: trying to solve the technical alignment problem

    (04:02) 3. Two sketchy plans for technical AGI alignment

    (07:05) 4. On to what I've actually been doing all year!

    (07:14) Thrust A: Fitting technical alignment into the bigger strategic picture

    (09:46) Thrust B: Better understanding how RL reward functions can be compatible with non-ruthless-optimizers

    (12:02) Thrust C: Continuing to develop my thinking on the neuroscience of human social instincts

    (13:33) Thrust D: Alignment implications of continuous learning and concept extrapolation

    (14:41) Thrust E: Neuroscience odds and ends

    (16:21) Thrust F: Economics of superintelligence

    (17:18) Thrust G: AGI safety miscellany

    (17:41) Thrust H: Outreach

    (19:13) 5. Other stuff

    (20:05) 6. Plan for 2026

    (21:03) 7. Acknowledgements

    The original text contained 7 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/CF4Z9mQSfvi99A3BR/my-agi-safety-research-2025-review-26-plans

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    22 m
  • “Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f
    Dec 14 2025
    This is the abstract and introduction of our new paper.

    Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

    Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)


    You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
    Abstract


    LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts.

    In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention.

    The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely [...]





    ---

    Outline:

    (00:57) Abstract

    (02:52) Introduction

    (11:02) Limitations

    (12:36) Explaining narrow-to-broad generalization

    The original text contained 3 footnotes which were omitted from this narration.

    ---

    First published:
    December 11th, 2025

    Source:
    https://www.lesswrong.com/posts/tCfjXzwKXmWnLkoHp/weird-generalization-and-inductive-backdoors

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    18 m
  • “Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw
    Dec 13 2025
    Credit: Nano Banana, with some text provided. You may be surprised to learn that ClaudePlaysPokemon is still running today, and that Claude still hasn't beaten Pokémon Red, more than half a year after Google proudly announced that Gemini 2.5 Pro beat Pokémon Blue. Indeed, since then, Google and OpenAI models have gone on to beat the longer and more complex Pokémon Crystal, yet Claude has made no real progress on Red since Claude 3.7 Sonnet![1]

    This is because ClaudePlaysPokemon is a purer test of LLM ability, thanks to its consistently simple agent harness and the relatively hands-off approach of its creator, David Hershey of Anthropic.[2] When Claudes repeatedly hit brick walls in the form of the Team Rocket Hideout and Erika's Gym for months on end, nothing substantial was done to give Claude a leg up.

    But Claude Opus 4.5 has finally broken through those walls, in a way that perhaps validates the chatter that Opus 4.5 is a substantial advancement.

    Though, hardly AGI-heralding, as will become clear. What follows are notes on how Claude has improved—or failed to improve—in Opus 4.5, written by a friend of mine who has watched quite a lot of ClaudePlaysPokemon over the past year.[3]

    [...]

    ---

    Outline:

    (01:28) Improvements

    (01:31) Much Better Vision, Somewhat Better Seeing

    (03:05) Attention is All You Need

    (04:29) The Object of His Desire

    (06:05) A Note

    (06:34) Mildly Better Spatial Awareness

    (07:27) Better Use of Context Window and Note-keeping to Simulate Memory

    (09:00) Self-Correction; Breaks Out of Loops Faster

    (10:01) Not Improvements

    (10:05) Claude would still never be mistaken for a Human playing the game

    (12:19) Claude Still Gets Pretty Stuck

    (13:51) Claude Really Needs His Notes

    (14:37) Poor Long-term Planning

    (16:17) Dont Forget

    The original text contained 9 footnotes which were omitted from this narration.

    ---

    First published:
    December 9th, 2025

    Source:
    https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Más Menos
    18 m
Todavía no hay opiniones