Route Sparse Autoencoder to Interpret Large Language Models
This paper introduces Route Sparse Autoencoder (RouteSAE), a novel framework designed to improve the interpretability of large language models (LLMs) by effectively extracting features across multiple layers. Traditional sparse autoencoders (SAEs) primarily focus on single-layer activations, failing to capture how features evolve through different depths of an LLM. RouteSAE addresses this by incorporating a routing mechanism that dynamically assigns weights to activations from various layers, creating a unified feature space. This approach leads to a higher number of interpretable features and improved interpretability scores compared to previous methods like TopK SAE and Crosscoder, while maintaining computational efficiency. The study demonstrates RouteSAE's ability to identify both low-level (e.g., "units of weight") and high-level (e.g., "more [X] than [Y]") features, enabling targeted manipulation of model behavior.
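The routing idea described above can be illustrated with a minimal sketch: a router scores each layer's activation for a token, a softmax over those scores produces routing weights, the weighted combination forms a unified activation, and a TopK sparse autoencoder encodes it. All names, shapes, and the scoring function here are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

class RouteSAESketch:
    """Illustrative routed sparse autoencoder (shapes and scoring assumed)."""

    def __init__(self, d_model, d_feat, n_layers, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_route = rng.normal(0, 0.02, (n_layers, d_model))  # router parameters
        self.W_enc = rng.normal(0, 0.02, (d_model, d_feat))      # encoder
        self.W_dec = rng.normal(0, 0.02, (d_feat, d_model))      # decoder
        self.k = k                                               # TopK sparsity level

    def forward(self, layer_acts):
        # layer_acts: (n_layers, d_model) activations for one token
        scores = (self.W_route * layer_acts).sum(axis=1)  # one score per layer
        weights = softmax(scores)                         # routing weights over layers
        x = weights @ layer_acts                          # unified activation
        pre = x @ self.W_enc                              # feature pre-activations
        # TopK sparsity: keep only the k largest (non-negative) pre-activations
        idx = np.argsort(pre)[-self.k:]
        z = np.zeros_like(pre)
        z[idx] = np.maximum(pre[idx], 0.0)
        recon = z @ self.W_dec                            # reconstruction
        return weights, z, recon

# Usage: route 4 layers of 16-dim activations into 64 features, 8 active
sae = RouteSAESketch(d_model=16, d_feat=64, n_layers=4, k=8)
acts = np.random.default_rng(1).normal(size=(4, 16))
weights, z, recon = sae.forward(acts)
```

Because the routing weights are differentiable, the router can learn which depth a given feature lives at, which is how a single shared feature space can capture both shallow low-level features and deeper high-level ones.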
Source: May 2025 - Route Sparse Autoencoder to Interpret Large Language Models - https://arxiv.org/pdf/2503.08200