Distributed Machine Learning Patterns
A Patterns-First Manual for Architects, Engineers, and Technical Leads
Narrated by: Virtual Voice
By: Jazper Carter
This title uses virtual voice narration. Virtual Voice is computer-generated narration for audiobooks.
Inside this book, readers will learn how to:
- Design parallelism strategies that fit workload shape and hardware, selecting among data, tensor, pipeline, and expert axes based on architecture, memory budget, and interconnect topology.
- Tune gradient synchronization and sharding, applying ZeRO, FSDP, and pipeline schedules to keep accelerator utilization high without amplifying communication overhead as cluster size grows.
- Build fault-tolerant training pipelines with checkpoint strategies, elastic cluster patterns, and spot instance management that recover from mid-run hardware failures without restarting from epoch zero.
- Operate inference at scale using continuous batching, paged attention, and KV cache management to maximize throughput and meet latency SLOs under variable load.
- Instrument distributed jobs for observability, tracing per-rank metrics, gradient norms, and communication timings so silent failures surface before consuming days of compute budget.
- Manage multi-tenant clusters securely with workload isolation, quota enforcement, and cost attribution that keep shared GPU infrastructure safe and financially accountable.
- Apply LLM and foundation model patterns for distributed pre-training, RLHF infrastructure, and large-scale inference that generalize across architectures as hardware generations turn over.
- Assess platform maturity using the book's maturity model to locate gaps in reliability, cost efficiency, and operational readiness across the distributed ML stack.
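The checkpoint-and-resume idea behind the fault-tolerance bullet above can be sketched in a few lines. This is a minimal, single-process illustration only; the names (save_checkpoint, latest_checkpoint, CKPT_DIR) are hypothetical and stand in for the distributed checkpoint managers the book covers, and the JSON file stands in for real model and optimizer state.

```python
import json
import os
import tempfile

# Hypothetical checkpoint directory; a real job would use durable
# shared storage, not a local temp dir.
CKPT_DIR = tempfile.mkdtemp()

def save_checkpoint(step, state):
    # Write atomically: write to a temp file, then rename, so a crash
    # mid-write never leaves a corrupt "latest" checkpoint behind.
    path = os.path.join(CKPT_DIR, f"ckpt-{step:06d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def latest_checkpoint():
    # Lexicographic sort works because step numbers are zero-padded.
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".json"))
    if not ckpts:
        return None
    with open(os.path.join(CKPT_DIR, ckpts[-1])) as f:
        return json.load(f)

# Training loop: resume from the latest checkpoint if one exists,
# instead of restarting from step (or epoch) zero.
ckpt = latest_checkpoint()
start = ckpt["step"] + 1 if ckpt else 0
state = ckpt["state"] if ckpt else {"loss": None}

for step in range(start, 10):
    state["loss"] = 1.0 / (step + 1)   # stand-in for a real training step
    if step % 3 == 0:                  # checkpoint every few steps
        save_checkpoint(step, state)

print(latest_checkpoint()["step"])     # prints 9: the last checkpointed step
```

If the process is killed and relaunched, the loop picks up after the last saved step rather than at zero; the atomic rename is the detail that makes a mid-write crash safe.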
The book is organized in four parts: Foundations, covering parallelism patterns, data sharding, I/O, and orchestration; Training at Scale, addressing fault-tolerant training, checkpoint management, and spot scheduling; Serving and Operations, covering inference architecture, cost control, observability, and multi-tenant security; and Frontier Patterns, applying everything to LLMs and foundation models and closing with end-to-end case studies and a full platform synthesis.
This book is for ML architects who design distributed systems others depend on, ML engineers and data engineers who build and operate them, and technical team leads who set reliability and cost standards, with platform and SRE engineers as a strong secondary audience. Every chapter opens with a production incident scenario, teaches canonical patterns by name, and closes with a checklist the team can apply immediately. Readers finish with the vocabulary, playbook, and pattern library to ship reliable distributed ML systems with confidence.