
EP8: Training Models at Scale | AWS for AI Podcast
Join us for an enlightening conversation with Anton Alexander, AWS's Senior Specialist for Worldwide Foundation Models, as we delve into the complexities of training and scaling large foundation models. Anton brings his unique expertise from working with the world's top model builders, along with his fascinating journey from Trinidad and Tobago to becoming a leading AI infrastructure expert.
Discover practical insights on managing massive GPU clusters, optimizing distributed training, and handling the critical challenges of model development at scale. Learn about cutting-edge solutions in GPU failure detection, checkpointing strategies, and the evolution of inference workloads. Get an insider's perspective on emerging trends like GRPO, visual LLMs, and the future of AI model development.
Don't miss this technical deep dive where we explore real-world solutions for building and deploying foundational AI models, featuring discussions on everything from low-level infrastructure optimization to high-level AI development strategies.
Learn more: http://go.aws/47yubYq
Amazon SageMaker HyperPod: https://aws.amazon.com/fr/sagemaker/ai/hyperpod/
The Llama 3 Herd of Models paper: https://arxiv.org/abs/2407.21783
Chapters:
00:00:00 : Introduction and Guest Background
00:01:18 : Anton's Journey from the Caribbean to AI
00:05:52 : Mathematics in AI
00:07:20 : Large Model Training Challenges
00:09:54 : GPU Failures: The Llama Herd of Models
00:13:40 : Grey Failures
00:15:05 : Model Training Trends
00:17:40 : Managing Mixture of Experts Models
00:21:50 : Estimating How Many GPUs You Need
00:25:12 : Monitoring the Loss Function
00:27:08 : Training Crashes
00:28:10 : The SageMaker HyperPod Story
00:32:15 : Automating Grey Failure Handling
00:37:28 : Which Metrics to Optimize For
00:40:23 : Checkpointing Strategies
00:44:48 : USE: Utilization, Saturation, Errors
00:50:11 : SageMaker HyperPod for Inference
00:54:58 : Resiliency in Training vs. Inference Workloads
00:56:44 : NVIDIA NeMo Ecosystem and Agents
00:59:49 : Future Trends in AI
01:03:17 : Closing Thoughts