• LW - Shard Theory: An Overview by David Udell

  • Aug 11 2022
  • Length: 18 mins
  • Podcast

  • Summary

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shard Theory: An Overview, published by David Udell on August 11, 2022 on LessWrong.

Generated as part of SERI MATS, Team Shard's research, under John Wentworth. Many thanks to Quintin Pope, Alex Turner, Charles Foster, Steve Byrnes, and Logan Smith for feedback, and to everyone else I've discussed this with recently! All mistakes are my own.

Introduction

Shard theory is a research program aimed at explaining the systematic relationships between the reinforcement schedules and learned values of reinforcement-learning agents. It consists of a basic ontology of reinforcement learners, their internal computations, and their relationship to their environment. It makes several predictions about a range of RL systems, both RL models and humans. Indeed, shard theory can be thought of as simply applying the modern ML lens to the question of value learning under reinforcement in artificial and natural neural networks!

Some of shard theory's confident predictions can be tested immediately in modern RL agents. Less confident predictions about i.i.d.-trained language models can also be tested now. Shard theory also has numerous retrodictions about human psychological phenomena that are otherwise mysterious from only the viewpoint of EU maximization, with no further substantive mechanistic account of human learned values. Finally, shard theory fails some retrodictions in humans; on further inspection, these lingering confusions might well falsify the theory.

If shard theory captures the essential dynamic relating reinforcement schedules and learned values, then we'll be able to carry out a steady stream of further experiments yielding a lot of information about how to reliably instill more of the values we want in our RL agents and fewer of those we don't. Shard theory's framework implies that alignment success is substantially continuous, and that even very limited alignment successes can still mean enormous quantities of value preserved for future humanity's ends. If shard theory is true, then further shard science will progressively yield better and better alignment results.

The remainder of this post will be an overview of the basic claims of shard theory. Future posts will detail experiments and preregister predictions, and look at the balance of existing evidence for and against shard theory from humans.

Reinforcement Strengthens Select Computations

A reinforcement learner is an ML model trained via a reinforcement schedule: a pairing of world states and reinforcement events. We-the-devs choose when to dole out these reinforcement events, either handwriting a simple reinforcement algorithm to do it for us or having human overseers give out reinforcement. The reinforcement learner itself is a neural network whose computations are reinforced or anti-reinforced by reinforcement events. After sufficient training, reinforcement often manages to reinforce those computations that are good at our desired task.

We'll henceforth focus on deep reinforcement learners: RL models specifically composed of multi-layered neural networks. Deep RL can be seen as a supercategory of many deep learning tasks. In deep RL, the model you're training receives feedback, and this feedback determines how the model is updated afterwards via SGD.
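To make that feedback-to-update step concrete, here is a minimal sketch (my own illustration, not code from the post) of a REINFORCE-style training loop in PyTorch: reinforcement events arrive as rewards, and the subsequent SGD step strengthens whichever computations produced the rewarded outputs. The environment interface (env.reset / env.step) follows the Gymnasium convention and is assumed for illustration.

```python
import torch
import torch.nn as nn

# Tiny policy network: 4-dim observation in, 2 action logits out.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def run_episode(env):
    """Roll out one episode, recording log-probs of actions and the rewards
    (reinforcement events) they earned."""
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()          # the model's output shapes what it sees next
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)          # the reinforcement event
        done = terminated or truncated
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    """One SGD step: make the computations that led to reward more likely to recur."""
    returns = torch.tensor([sum(rewards[t:]) for t in range(len(rewards))])
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Calling reinforce_update after each run_episode is the "reinforcement event fixes the SGD update" step described above; nothing here is specific to shard theory, it is just the standard policy-gradient mechanics the post takes as its starting point.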
In many RL setups, because the model's outputs influence its future observations, the model exercises some control over what it will see and be updated on in the future. RL wherein the model's outputs don't affect the distribution of its future observations is called supervised learning. So a general theory of deep RL models may well have implications for supervised learning models too. This is important for the experimental tractability of a theory of RL agents, as appreciably complicated RL setups are a huge pain in the ass to get working, while supervised learning...
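That "supercategory" claim can be made concrete with a toy construction (my own, assumed for illustration): wrap an i.i.d. labeled dataset in an environment whose next observation is sampled independently of the model's action, so supervised learning appears as the special case of RL in which outputs never influence future observations.

```python
import random

class SupervisedAsRL:
    """A fixed labeled dataset presented as an RL environment.

    Each step the 'agent' observes an input, acts by predicting a label,
    and receives reward 1.0 for a correct prediction, 0.0 otherwise."""

    def __init__(self, dataset):
        self.dataset = dataset  # list of (x, y) pairs

    def reset(self):
        self._x, self._y = random.choice(self.dataset)
        return self._x

    def step(self, action):
        reward = 1.0 if action == self._y else 0.0
        # The next observation is drawn i.i.d., ignoring `action` entirely:
        # the model has no control over its future training distribution.
        self._x, self._y = random.choice(self.dataset)
        return self._x, reward
```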