Episodes
  • No new episodes will be published here. To keep listening to the EAF & LW, listen to this episode for instructions.
    Sep 26 2024
    Counterfactuals strike again! The fora have their own official audio channels now, so The Nonlinear Library will no longer publish new episodes since it won't have any counterfactual impact.
    It's been a good run. We published thousands of episodes and generated a ton of passive impact.
    But we're not here for the views. We're here for the counterfactual impact.
    INSTRUCTIONS TO KEEP LISTENING TO THE FORA
    1. Search "EA Forum" or "LessWrong" on your podcast player
    2. Subscribe to the official channels
    3. Go forth. Seek impact. Seek truth.
    1 m
  • LW - Augmenting Statistical Models with Natural Language Parameters by jsteinhardt
    Sep 22 2024
    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Augmenting Statistical Models with Natural Language Parameters, published by jsteinhardt on September 22, 2024 on LessWrong.
    This is a guest post by my student Ruiqi Zhong, who has some very exciting work defining new families of statistical models that can take natural language explanations as parameters. The motivation is that existing statistical models are bad at explaining structured data. To address this problem, we augment these models with natural language parameters, which can represent interpretable abstract features and be learned automatically.
    Imagine the following scenario: It is the year 3024. We are historians trying to understand what happened between 2016 and 2024, by looking at how Twitter topics changed across that time period. We are given a dataset of user-posted images sorted by time, $x_1$, $x_2$ ... $x_T$, and our goal is to find trends in this dataset to help interpret what happened.
    If we successfully achieve our goal, we would discover, for instance, (1) a recurring spike of images depicting athletes every four years for the Olympics, and (2) a large increase in images containing medical concepts during and after the COVID-19 pandemic.
    How do we usually discover temporal trends from a dataset? One common approach is to fit a time series model to predict how the features evolve and then interpret the learned model. However, it is unclear what features to use: pixels and neural image embeddings are high-dimensional and uninterpretable, undermining the goal of extracting explainable trends.
    We address this problem by augmenting statistical models with interpretable natural language parameters. The figure below depicts a graphical model representation for the case of time series data. We explain the trends in the observed data [ $x_1$ ... $x_T$] by learning two sets of latent parameters: natural language parameters $\phi$ (the learned features) and real-valued parameters $w$ (the time-varying trends).
    $\phi$: the natural language descriptions of $K$ different topics, e.g. "depicts athletes competing". $\phi$ is an element of $\Sigma$, the universe of all natural language predicates.
    $w_t$: the frequency of each of the $K$ topics at time $t$.
    If our model successfully recovers the underlying trends, then we can visualize $w$ and $\phi$ below and see that: 1) more pictures contain medical concepts (red) starting from 2020, and 2) there are recurring (blue) spikes of athletes competing.
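To make the two sets of parameters concrete, here is a minimal sketch (not the authors' implementation) of how $\phi$ and $w$ fit together. Each predicate in $\phi$ is evaluated on each sample — in the real system this denotation would come from an LLM judgment; the keyword matcher below is a stand-in — and $w_{t,k}$ is the fraction of samples at time $t$ satisfying predicate $k$:

```python
import numpy as np

# phi[k]: natural-language predicate k (an element of the predicate universe).
phi = ["depicts athletes competing", "contains a medical concept"]

def denote(predicate, x):
    """Stand-in denotation: 1 if predicate holds for sample x, else 0.
    Here x is a set of tags; in practice an LLM would judge the raw image."""
    keyword = {"depicts athletes competing": "athlete",
               "contains a medical concept": "medical"}[predicate]
    return int(keyword in x)

def topic_frequencies(samples_by_time, phi):
    """w[t, k]: fraction of samples at time t that satisfy predicate phi[k]."""
    w = np.zeros((len(samples_by_time), len(phi)))
    for t, samples in enumerate(samples_by_time):
        for k, p in enumerate(phi):
            w[t, k] = np.mean([denote(p, x) for x in samples])
    return w
```

With this representation, a spike in column $k$ of $w$ corresponds directly to a human-readable claim: "more samples at time $t$ satisfy $\phi_k$".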
    In the rest of this post, we will explain in detail how to specify and learn models with natural language parameters and showcase the model on several real-world applications. We will cover:
    A warm-up example of a statistical model with natural language explanations
    A modeling language for specifying natural language parameters
    Applications of our framework, which can be used to specify models for time series, clustering, and classification. We will go over:
    A machine learning application that uses our time series model to monitor trends in LLM usage
    A business application that uses our clustering model to taxonomize product reviews
    A cognitive science application that uses our classification model to explain what images are more memorable for humans
    Thanks to Louise Verkin for helping to typeset the post in Ghost format.
    Warm-up Example: Logistic Regression with Natural Language Parameters
    Instead of understanding topic shifts across the entire time window of 2016-2024, let's first study a much simpler question: what images are more likely to appear after 2020? The usual way to approach this problem is to:
    1. brainstorm some features,
    2. extract the real-valued features from each image, and
    3. run a logistic regression model on these features to predict the target $Y$, where $Y = 1$ if the image appears after 2020 and $Y = 0$ otherwise.
    More concretely:
    Step 1: Propose different...
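The three-step recipe above can be sketched end to end. This is a hedged toy example, assuming step 2 has already produced two hypothetical binary features per image ("contains a medical concept", "depicts athletes competing"); the gradient-descent fit below is a generic stand-in for step 3, not the post's implementation:

```python
import numpy as np

# Step 2's output: hypothetical binary features per image
# (column 0 = "contains a medical concept", column 1 = "depicts athletes").
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0], dtype=float)  # Y = 1 iff image appeared after 2020

def fit_logreg(X, y, lr=0.5, steps=2000):
    """Step 3: plain gradient-descent logistic regression (weights + bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))      # predicted P(Y = 1)
        w -= lr * Xb.T @ (p - y) / len(y)      # gradient step on log-loss
    return w

def predict(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)

w = fit_logreg(X, y)
```

A positive learned weight on a feature then reads as "images with this feature are more likely to appear after 2020" — the weights are interpretable precisely because the features are.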
    17 m
  • LW - Glitch Token Catalog - (Almost) a Full Clear by Lao Mein
    Sep 22 2024
    This is: Glitch Token Catalog - (Almost) a Full Clear, published by Lao Mein on September 22, 2024 on LessWrong.
    This is a collection of every unidentified GPT2 glitch token listed in the third glitch token archaeology post. I was able to find the source of every single one, except for "?????-" and "?????-?????-"[1]. Please tell me if I missed one, or if you've discovered one and don't understand where it came from. This isn't meant to be a well-written analysis, just a quick repository of my glitch-hunting observations.
    I plan on writing up and categorizing all of these in greater detail in future posts; the first is here.
    I used OpenWebText, a recreation of GPT2's training data, for all experiments in this post. I tokenized every .gz file in the archive and made a boolean NumPy array marking which tokens were present at least once in each file. This allowed me to quickly identify infrequent tokens in the dataset and pull up the textual context with regular expressions. If there was an issue with overlap, I used a tokenizer-based extraction instead. All data/code available upon request.
    The leftmost column is the token id, the middle is the token string, and the right column is the number of files the token was present in (out of 20610). GPT2's vocabulary has 50257 tokens (ids 0-50256).
    GPT2 tokens with the lowest frequency in OpenWebText
    30898 'embedreportprint' 0
    33434 ' 士' 0
    43453 ' SolidGoldMagikarp' 0
    1849 ' ' 0
    47654 ' ' 0
    50009 ' strutConnector' 0
    36173 ' RandomRedditor' 0
    214 ' ' 0
    42424 'DragonMagazine' 0
    180 ' ' 0
    187 ' ' 0
    186 ' ' 0
    30213 ' externalToEVAOnly' 0
    30212 ' externalToEVA' 0
    30211 ' guiIcon' 0
    185 ' ' 0
    30210 ' guiActiveUnfocused' 0
    30209 ' unfocusedRange' 0
    184 ' ' 0
    30202 ' guiName' 0
    183 ' ' 0
    30905 'rawdownload' 0
    39906 'EStream' 0
    33454 '龍喚士' 0
    42586 ' srfN' 0
    25992 ' 裏覚醒' 0
    43065 ' srfAttach' 0
    11504 ' ' 0
    39172 ' ' 0
    40240 'oreAndOnline' 0
    40241 'InstoreAndOnline' 0
    33477 ' ' 0
    36174 ' RandomRedditorWithNo' 0
    37574 'StreamerBot' 0
    46600 ' Adinida' 0
    182 ' ' 0
    29372 ' guiActiveUn' 0
    43177 'EStreamFrame' 0
    22686 ' ' 0
    23282 ' davidjl' 0
    47571 ' DevOnline' 0
    39752 'quickShip' 0
    44320 '\n ' 0
    8828 ' ' 0
    39820 '龍 ' 0
    39821 '龍契士' 0
    28666 'PsyNetMessage' 0
    35207 ' attRot' 0
    181 ' ' 0
    18472 ' guiActive' 0
    179 ' ' 0
    17811 ' ' 0
    20174 ' 裏 ' 0
    212 ' ' 0
    211 ' ' 0
    210 ' ' 0
    209 ' ' 0
    208 ' ' 0
    31666 '?????-?????-' 0
    207 ' ' 0
    206 ' ' 0
    213 ' ' 0
    205 ' ' 0
    203 ' ' 0
    202 ' ' 0
    31957 'cffffcc' 0
    200 ' ' 0
    199 ' ' 0
    197 '\t' 0
    196 ' ' 0
    195 ' ' 0
    194 ' ' 0
    193 ' ' 0
    204 ' ' 0
    45545 ' サーティワン' 0
    201 '\r' 0
    216 ' ' 0
    37842 ' partName' 0
    45706 '\n' 0
    124 ' ' 0
    125 ' ' 0
    178 ' ' 0
    41380 'natureconservancy' 0
    41383 'assetsadobe' 0
    177 ' ' 0
    215 ' ' 0
    41551 'Downloadha' 0
    4603 ' ' 0
    42202 'GoldMagikarp' 0
    42089 ' TheNitrome' 0
    217 ' ' 0
    218 ' ' 0
    42090 ' TheNitromeFan' 0
    192 ' ' 0
    191 ' ' 0
    219 ' ' 0
    189 ' ' 0
    45544 ' サーティ' 0
    5624 ' ' 0
    190 ' ' 0
    40242 'BuyableInstoreAndOnline' 1
    36935 ' dstg' 1
    36940 ' istg' 1
    45003 ' SetTextColor' 1
    30897 'reportprint' 1
    39757 'channelAvailability' 1
    39756 'inventoryQuantity' 1
    39755 'isSpecialOrderable' 1
    39811 'soDeliveryDate' 1
    39753 'quickShipAvailable' 1
    39714 'isSpecial' 1
    47198 'ItemTracker' 1
    17900 ' Dragonbound' 1
    45392 'dayName' 1
    37579 'TPPStreamerBot' 1
    31573 'ActionCode' 2
    25193 'NetMessage' 2
    39749 'DeliveryDate' 2
    30208 ' externalTo' 2
    43569 'ÍÍ' 2
    34027 ' actionGroup' 2
    34504 ' 裏 ' 2
    39446 ' SetFontSize' 2
    30899 'cloneembedreportprint' 2
    32047 ' "$:/' 3
    39803 'soType' 3
    39177 'ItemThumbnailImage' 3
    49781 'EngineDebug' 3
    25658 '?????-' 3
    33813 '=~=~' 3
    48396 'ÛÛ' 3
    34206 ...
    2 h 50 m