Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models
Speculative Streaming is a novel inference method designed to accelerate large language model (LLM) generation without the traditional auxiliary "draft" model. By integrating multi-stream attention directly into the target model, the system performs future n-gram prediction and token verification simultaneously within a single forward pass. This eliminates the memory and complexity overhead of managing two separate models, making it well suited to hardware with limited capacity. The architecture uses tree-structured drafting and parallel pruning to maximize the number of tokens accepted per cycle while preserving generation quality. Experimental results show speedups of 1.8× to 3.1× across diverse tasks such as summarization and structured queries. Ultimately, the method matches the performance of more complex architectures while adding significantly fewer parameters.
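The draft-then-verify loop at the heart of speculative decoding can be illustrated with a minimal sketch. In Speculative Streaming the draft tokens come from extra attention streams inside the target model itself (and drafts are tree-structured rather than a single chain), but the acceptance rule is the same: drafted tokens are checked against the target model, matches are accepted for free, and the first mismatch falls back to the target's own prediction. The toy `target_next` and `draft_next` functions below are hypothetical stand-ins for real model calls.

```python
def target_next(prefix):
    # Toy deterministic "target model": next token is (sum of prefix) % 5.
    return sum(prefix) % 5

def draft_next(prefix):
    # Toy "draft stream": agrees with the target except when the prefix
    # ends in 3, so some speculation steps are only partially accepted.
    return 4 if prefix and prefix[-1] == 3 else sum(prefix) % 5

def speculative_step(prefix, gamma=4):
    """Draft `gamma` tokens, then verify them against the target.

    In the paper all drafted tokens are verified in one forward pass of
    the target model; this loop only mimics the acceptance rule.
    """
    drafted, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in drafted:
        if target_next(ctx) == tok:      # draft matched: accept for free
            accepted.append(tok)
            ctx.append(tok)
        else:                            # mismatch: take the target's token
            accepted.append(target_next(ctx))
            break
    return accepted

def generate(prefix, n_tokens, gamma=4):
    # Output is identical to plain greedy decoding with target_next,
    # just produced in fewer target "passes" when drafts are accepted.
    out = list(prefix)
    while len(out) < len(prefix) + n_tokens:
        out.extend(speculative_step(out, gamma))
    return out[:len(prefix) + n_tokens]
```

Because every accepted token is either verified against or produced by the target model, the output is provably identical to ordinary greedy decoding; the speedup comes from accepting multiple drafted tokens per verification pass.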
Source:
Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi. "Speculative Streaming: Fast LLM Inference without Auxiliary Models." Apple, February 2024. https://arxiv.org/pdf/2402.11131