• LW - Team Shard Status Report by David Udell

  • Aug 9 2022
  • Length: 5 mins
  • Podcast

  • Summary

  • Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Team Shard Status Report, published by David Udell on August 9, 2022 on LessWrong.

Team Shard is a nebulous alignment research collective, on paper siloed under John Wentworth's SERI MATS program, but in reality extending its many tendrils far across the Berkeley alignment community. "Shard theory" -- a name spoken of in hushed, mildly confused tones at many an EA hangout. This is their story (this month).

Epistemic status: A very quick summary of Team Shard's current research, written up today. Careful summaries and actual results are forthcoming, so skip this unless you're specifically interested in a quick overview of what we're currently working on.

Introduction

This past month, Team Shard began its research into the relationship between the reinforcement schedules and learned values of RL agents. Our core MATS team is composed of yours truly, Michael Einhorn, and Quintin Pope. The greater Team Shard, however, is legion -- its true extent only dimly suggested by the author names on its LessWrong writeups.

Our current path to impact is to (1) distill and expound shard theory and preregister its experimental predictions, (2) run RL experiments testing shard theory's predictions about learned values, and (3) climb the interpretability tech tree, starting with finetuned-on-values-text large language models, to unlock more informative experiments.

In the 95th percentile, best-case possible world, we learn a bunch about how to reliably induce chosen values in extant RL agents by modulating the agent's reinforcement schedule and are able to probe the structure of those induced values within the models with interpretability tools.

Distillations

If you don't understand shard theory's basic claims and/or its relevance to alignment, stay tuned! A major distillation is forthcoming.

Natural Shard Theory Experiments in Minecraft

Uniquely, Team Shard already has a completed (natural) experiment under its belt! However, this experiment has a couple of nasty confounds, and even without those it would only have yielded a single bit of evidence for or against shard theory. But to summarize:

OpenAI's MineRL agent is able to, in the best case, craft a diamond pickaxe in Minecraft in 4 minutes (!). Usually, the instrumental steps the MineRL agent must pursue to craft the diamond pickaxe lie on the most efficient path to crafting the diamond pickaxe. So we can't disentangle from the model's ordinary gameplay data whether the model terminally values the journey or the destination: is reinforcement the model's optimization target, or are its numerous in-distribution proxies among its terminal goals?

One Karolis Ramanauskas, thankfully, already did the hard work of finding out for us! When you give the MineRL agent a full stack of diamonds at the get-go, it starts punching trees and crafting the basic Minecraft tools, rather than immediately crafting as many diamond pickaxes as possible. Theories that predict the journey rather than the destination rejoice!

Now, there's a confound here, because the model was trained via reward shaping -- it was rewarded some lesser amount for all the instrumental steps along the way to the diamond pickaxe. Also, the model is quite stupid, despite its best-case properties. Whatever's true about its terminal values, it may just be spazzing around and messing up.

Given the significant difficulty of even running (let alone further finetuning) the OpenAI MineRL model, along with the model's stupidity confound, we opted to conduct our RL experiments in a more tractable (but still appreciably complex) environment.

Learned Values in CoinRun

So we now have an RL agent playing CoinRun well! We're going to take it off-distribution and see whether it terminally values (1) just the coins, (2) a small handful of in-distribution proxies for getting coins, or (3) all ...