Arxiv paper - Token-Efficient Long Video Understanding for Multimodal LLMs

About this listen

In this episode, we discuss Token-Efficient Long Video Understanding for Multimodal LLMs by Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon. The paper introduces STORM, a new architecture that incorporates a temporal encoder using the Mamba State Space Model to better capture temporal dynamics in video-based multimodal large language models. This approach enables effective token reduction, significantly lowering computational costs and latency while preserving essential temporal information. Experiments demonstrate that STORM achieves state-of-the-art performance on long video understanding benchmarks with substantial improvements in efficiency and accuracy.
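To make the token-reduction idea concrete, here is a minimal sketch of one simple reduction strategy: averaging token embeddings across windows of consecutive frames. It is an illustration under stated assumptions, not the STORM implementation. The function names, the `window` parameter, and the toy input are all hypothetical, and the Mamba-based temporal encoder that the paper applies before reduction is omitted entirely.

```python
def temporal_average_pool(frames, window):
    """Average token embeddings over non-overlapping windows of consecutive frames.

    frames: list of frames; each frame is a list of tokens; each token is a list of floats.
    window: number of consecutive frames merged into one (hypothetical parameter).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]  # the last group may be shorter
        merged_frame = []
        for token_idx in range(len(group[0])):
            dim = len(group[0][token_idx])
            # Average the token at the same spatial position across the group.
            merged_frame.append([
                sum(frame[token_idx][d] for frame in group) / len(group)
                for d in range(dim)
            ])
        pooled.append(merged_frame)
    return pooled


def count_tokens(frames):
    """Total number of tokens that would be passed on to the language model."""
    return sum(len(frame) for frame in frames)


# Toy input (assumed for illustration): 8 frames, 4 tokens per frame, 2-dim embeddings.
video = [[[float(f), float(t)] for t in range(4)] for f in range(8)]
reduced = temporal_average_pool(video, window=4)
print(count_tokens(video), "->", count_tokens(reduced))
```

With a window of 4 frames, the number of tokens handed to the language model drops by a factor of 4. Naive averaging like this discards motion detail, which is why the episode stresses capturing temporal dynamics before tokens are reduced.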