Episodes

  • AI-Native Ops: Making AI Safe for Production with William Collins
    Apr 1 2026

    What happens when your “coworker” can generate code and changes faster than your team can review them, and production still has to stay up?

    William Collins breaks down what AI-Native Ops looks like when you take reliability seriously: where reasoning should stop, where deterministic automation should begin, and how guardrails like compliance checks, version pinning, and controlled workflows keep AI from turning into outage fuel. Cory and William also dig into why context windows and tool sprawl matter in real systems, how protocols like MCP and agent-to-agent communication are shaping day-to-day automation, and why regulated environments can’t adopt new tech with hype-driven shortcuts.

    If you’re a platform engineer trying to balance speed with safety, this conversation offers a practical way to use AI for the work that drags teams down, without giving up operational discipline.

    Guest: William Collins, Director of Technical Evangelism at Itential, AWS Community Builder, and co-host of The Cloud Gambit Podcast

    William Collins is a strategic thinker and catalyst for innovation. Over his career, he has helped enterprises build large-scale networks, driven modernization through cloud adoption, and optimized complex environments through good design practices and automation. Today, William works as Director of Technical Evangelism for Itential, where he focuses on evangelizing the Itential Platform, fostering strong relationships with customers so they can fully realize their goals, engaging with the community, and advocating for the successful future of network, security, and automation infrastructure.

    As a content creator, William hosts The Cloud Gambit Podcast with Eyvonne Sharp, a show that unravels the state of cloud computing, markets, strategy, and emerging trends with industry experts. He is also a LinkedIn Learning Instructor (Automation, Cloud, and Network Engineering Content), an AWS Community Builder (Network & Content Delivery), and a group organizer for the USNUA - Kentucky User Group (KYNUG).

    Prior to Itential, William worked as a Principal Cloud Architect and Director of Technical Evangelism for Alkira, where he helped grow the company from lean beginnings to being ranked the 25th fastest-growing company in North America and 6th in the Bay Area on the 2024 Deloitte Technology Fast 500. He also held various senior technical roles across the enterprise space in Financial Services and Healthcare, most recently at Humana as Director of Cloud Architecture. Outside of tech, his time is spent with family, woodworking, ice hockey, and guitar.

    Opinions expressed are solely his own and do not express the views or opinions of his employer.

    William Collins, Blog

    William Collins, YouTube

    William Collins, X

    William Collins, Instagram

    William Collins, TikTok

    William Collins, GitHub

    Itential

    “The Cloud Gambit” podcast

    Links to interesting things from this episode:

    • Ghostty
    • “Harness design for long-running application development” by Anthropic

    1 h 3 m
  • Infrastructure as Code's Hidden Problem with Pavlo Baron
    Mar 18 2026

    Terraform drift, state wrangling, and a growing “tools for tools” stack are still daily work for many platform teams - despite a decade of DevOps talk and cloud maturity. Why does ops automation so often feel like it needs babysitting?

    Pavlo Baron explains where Infrastructure as Code tends to break down in real organizations: manual drift management, low-level state complexity, and a lack of practical abstractions that let developers self-serve without inheriting the entire ops burden.

    The conversation digs into what a more use-case-driven approach could look like - where teams can choose when to enforce desired state, when to accept emergency changes, and how to build “guardrails” that reduce mistakes without slowing delivery.

    Pavlo also explains why type safety and constrained interfaces matter (especially as AI starts generating more code and infrastructure changes), and why the future of platform engineering depends less on slogans and more on systems that reduce toil.

    Guest: Pavlo Baron, Co-Founder and CEO of Platform Engineering Labs

    Pavlo Baron is Co-Founder and CEO of Platform Engineering Labs, which is crafting tools to remove the toil from operations work, with a current focus on infrastructure. He is a veteran in the space, having served in all kinds of roles throughout a career that spans more than 35 years. Previously, he was co-founder, CTO, and major inventor at an observability startup, Instana, that was acquired by IBM in 2020. Pavlo is a frequent conference speaker and author of several books.

    Pavlo Baron, X

    https://pavlobaron.medium.com/

    https://github.com/platform-engineering-labs

    https://www.linkedin.com/company/platform-engineering-labs

    https://x.com/plateng_labs

    https://bsky.app/profile/platform.engineering

    https://mastodon.social/@plateng_labs

    https://www.youtube.com/@plateng-labs

    Links to interesting things from this episode:

    • The Pkl Primer
    • formae
    • formae quick start
    • “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr”
    • “Where everyone is responsible, no one is really responsible.” Albert Bandura
    • JPL “Visions of the Future”
    • “Fallout: New Vegas”

    58 m
  • Why Extend Went All-In on Serverless Platform Engineering
    Mar 4 2026

    Billions of requests a month on AWS Lambda can cost less than a single engineer’s laptop budget, but only if the architecture and developer workflow are designed for it.

    Justin Masse, Senior Platform DevOps Engineer at Extend, shares how Extend committed early to a serverless-first approach and built a platform that prioritizes developer speed and low operational toil. The conversation breaks down what it takes to run active-active, multi-region systems in a serverless world, how the team keeps services small and fast, and why asynchronous, event-driven design changes both reliability and cost.

    You’ll also hear how Extend treats developer experience as a core platform responsibility: templated microservices, fast deployment pipelines, ephemeral environments for pull requests, and infrastructure that developers can own without becoming cloud specialists. A big theme is using AWS CDK and internal abstractions to keep infrastructure close to the application code, so teams can move quickly while keeping platform standards consistent.

    Finally, the discussion gets practical about tradeoffs that show up after the “serverless is easy” pitch: local development challenges, the real cost center (observability), and where AI is helping today, including an internal agent that diagnoses failed deployments and suggests fixes.

    What you’ll learn

    1. Why Extend avoids servers and VPC complexity, and what they use instead
    2. Patterns for active-active, multi-region thinking in a serverless architecture
    3. How DevEx practices like templates and ephemeral environments reduce friction
    4. A pragmatic approach to IaC with CDK and reusable internal constructs
    5. Where serverless costs stay low, and why observability often dominates the bill
    6. How AI is being applied to platform workflows without skipping engineering judgment

    Guest: Justin Masse, Senior Platform DevOps Engineer at Extend

    Justin Masse is a self-proclaimed lead chaos engineer, recognized within niche engineering communities for his expertise in chaos engineering, infrastructure, and DevOps.

    The father of three young kids, a husband, a recent MBA graduate, recent cancer survivor, and competitive powerlifter, he still finds time to actively contribute to the platform engineering community.

    Justin Masse, website

    Justin Masse, GitHub

    Extend, website

    Links to interesting things from this episode:

    1. Episode with Adrian Cockcroft
    2. “From $erverless to Elixir” by Cory O’Daniel

    1 h 2 m
  • Observability in the AI Era with New Relic's Nic Benders
    Feb 18 2026

    What happens when nobody wrote the code running in your production environment? As AI-generated software becomes standard practice, platform engineers face a new challenge: operating systems without experts to consult.

    Nic Benders, Chief Technical Strategist at New Relic, has spent 15 years watching observability evolve from basic server monitoring to understanding complex distributed systems. Now he's tackling the next frontier: how to maintain and operate software when there's no human author to ask why something was built a certain way.

    The conversation covers the shift from instrumentation being the hard problem to understanding being the bottleneck. Nic explains why inventory matters more than you think, how to approach AI-generated code as a black box that needs testing and telemetry, and why "garbage in, safety out" should be your new mantra.

    You'll learn practical strategies for instrumenting modern systems with OpenTelemetry, why your observability hierarchy needs to start with knowing what's actually running, and how to build platforms that make safe deployment easier than risky shortcuts. Nic also shares his perspective on technical drift versus technical debt and what changes when your best troubleshooting tool - institutional knowledge - no longer exists.

    Whether you're drowning in observability data or just starting to instrument your systems, this conversation offers concrete approaches for building understanding into your platform engineering practice.

    Guest: Nic Benders, Chief Technical Strategist at New Relic

    Nic Benders is New Relic's Chief Technical Strategist. Part of the Engineering team since the early days of the company, Nic has been involved with everything from Agents to ZooKeeper and all the pieces and products in between. As New Relic's Chief Technical Strategist, he now looks after the long-term technical strategy behind the product and the experience of all the engineering teams who build it. Before New Relic, he worked in the mobile space, managing back-end messaging and commerce systems powering some of the largest carriers in the world.

    New Relic, website

    New Relic, Blog

    Links to interesting things from this episode:

    1. OpenClaw (aka Moltbot, aka Clawdbot)
    2. Moltbook

    51 m
  • Simplicity at Scale: Cleaning House for Platform Teams with Brian Childress
    Dec 17 2025

    Why do so many “modern” platforms feel slow, fragile, and painful to work on?

    Platform engineer and fractional CTO Brian Childress joins Cory to discuss how over-engineering, resume‑driven development, and scattered tooling quietly block teams from shipping value. They explore why simplicity is a competitive advantage for platform teams, especially as AI becomes part of everyday development.

    You’ll learn:

    • How to design a simple platform MVP that developers actually like using
    • What a good local‑to‑prod story looks like (and why it’s the real scaling superpower)
    • Practical ways to onboard humans and AI tools so both can contribute faster
    • Where teams introduce unnecessary complexity with Kubernetes, microservices, and NoSQL
    • How to think about scaling in three dimensions: users, developers, and features
    • Why good architecture, docs, and decision records make AI more useful, not less
    • How to spot and avoid resume‑driven development before it explodes your platform

    Whether you’re cleaning up a messy stack or trying to keep a young platform from drifting into chaos, this conversation gives you concrete patterns for keeping things simple while still scaling teams, systems, and features.

    Guest: Brian Childress, Platform engineer and fractional CTO

    Brian Childress is an accomplished Software Engineer, Architect and Fractional CTO. For over a decade Brian has developed applications in healthcare, finance, and consumer products. Brian has spoken internationally on topics such as application security and developer tooling. Brian spends his free time researching and teaching the latest in application and API security design and best practices.

    Brian Childress, website

    Brian Childress, X

    Links to interesting things from this episode:

    • Replit
    • Lovable

    41 m
  • Using Feature Flags to Tame Complexity with Mike Zorn
    Dec 3 2025

    What if changing a single flag could save you from a failed migration, a broken API, or a late-night rollback?

    Join us as we dive into how feature flags become a practical tool for changing application behavior at runtime, not just toggling UI elements. Cory talks with Mike Zorn about real stories from LaunchDarkly and Rippling, covering how teams use flags to ship safely, debug faster, and simplify complex systems.

    You’ll hear about:

    • Using feature flags to avoid staging overload and ship directly to production
    • Migrating critical systems and databases with minimal downtime and risk
    • Controlling log levels and rate limits for specific customers on the fly
    • Managing flag sprawl so teams do not drown in half-rolled-out features
    • Experimenting with AI features, prompts, and models without fully committing

    If you’re working on a platform, running critical infrastructure, or just trying to ship faster without breaking everything, this conversation offers concrete patterns you can start using right away.

    Guest: Mike Zorn, Senior Software Engineer at Rippling

    Mike’s software engineering journey began with an early interest in problem-solving and programming, starting with creating programs on a TI-83 calculator in middle school. After studying mathematics in college, he transitioned into software through an applied math project that required coding, which sparked his interest in engineering as a career.

    Professionally, he has worked at several product and SaaS companies, including one that was an early LaunchDarkly customer, where he experienced firsthand the challenges of managing feature flags internally. That experience led him to appreciate the value of tools like LaunchDarkly, and he eventually joined the company himself. Since then, he has contributed across various areas, including focusing on how LaunchDarkly can best adopt its own platform internally to streamline releases and help engineers work more efficiently. His latest adventure has been joining Rippling as a Senior Staff Software Engineer.

    Mike Zorn, GitHub

    Mike Zorn, Email

    Rippling

    LaunchDarkly

    Links to interesting things from this episode:

    • SigNoz
    • Signadot
    • Open Container Initiative
    • “Using Feature Flags to Avoid Downtime During Migrations”
    • Apache Iceberg

    43 m
  • Policy as Code: Kyverno and Securing Kubernetes at Scale with Jim Bugwadia
    Nov 19 2025

    Most Kubernetes security breaches don't come from zero-day exploits - they come from misconfigurations. While your team runs scanners and reviews reports, containers are already running as root, network policies are missing, and compliance violations are piling up across dozens of repositories.

    Jim Bugwadia, co-founder and CEO of Nirmata and creator of Kyverno, joins Cory to talk about a different approach: policy as code. Instead of asking developers to remember security best practices across every repo, what if your cluster automatically enforced secure defaults and blocked non-compliant deployments before they ever reached production?

    You'll learn how to start using Kyverno today without breaking your production environment - from running your first audit scan (no installation required) to implementing enforcement mode with exceptions. Jim explains why micro-segmentation matters more than ever, how to automate network policies for every namespace, and why platform teams are using Kyverno for everything from security to cost optimization.

    Whether you're running one cluster or managing Kubernetes at scale, this conversation offers practical strategies for making security a byproduct of your platform - not an afterthought.

    Topics covered:

    • Why shift-left security fails and what "shift-down" means for platform teams
    • How to implement Kubernetes policy enforcement without grinding deployments to a halt
    • Automating secure defaults: network policies, resource quotas, and role bindings
    • The crawl-walk-run approach to rolling out policies in existing clusters
    • Real-world use cases beyond security: cost optimization and resource management

    Guest: Jim Bugwadia, Co-Founder & CEO of Nirmata and creator of Kyverno

    Jim Bugwadia is the Co-founder and CEO of Nirmata, a Kubernetes management platform built for enterprises to simplify and scale cloud-native operations across clouds, data centers, edge, and connected devices. With a mission to democratize cloud-native best practices, Jim brings deep expertise in building large-scale software products and leading high-performing teams. Before founding Nirmata, he led a global consulting team at Cisco, guiding enterprises and service providers on their cloud computing journeys. Earlier in his career, he contributed to innovative products at startups and major companies including Trapeze Networks, Pano Logic, Jetstream, Lucent, and Motorola. A hands-on technologist, Jim continues to code in Go, Java, and JavaScript, reflecting his passion for building in the rapidly evolving world of software.

    Jim Bugwadia, X

    Nirmata

    Kyverno

    Links to interesting things from this episode:

    • Kyverno Community Repository
    • “Shift-Down Security” Paper
    • OpenReports
    • Policy Reporter
    • “The Shai-Hulud npm malware attack: A wake-up call for supply chain security”
    • Kyverno Slack Channel

    42 m
  • Guest Host: Kelsey Hightower - Beyond Pipelines: Infrastructure As Data
    Nov 5 2025

    Is your Git repo really the source of truth for infrastructure - or just a suggestion?

    Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack why many teams hit dead ends with CI/CD for provisioning, where GitOps struggles with drift, and when TicketOps helps or hurts. They explore a different model: infrastructure as data with typed contracts, shared artifacts, and workflows that embed policy, validation, and upgrades from the start. You’ll hear practical ways to reduce cognitive load for developers while giving operations reliable control and better day‑2 levers.

    You’ll learn:

    • Why pipelines are a poor fit for infra provisioning and what to do instead
    • How to reason about drift as a three‑way merge with reality
    • When reconciliation helps, and when it breaks production firefights
    • How typed contracts and artifacts connect modules and teams without glue scripts
    • Ways to present safer self‑service without requiring everyone to learn Terraform
    • A simple mental model for treating TicketOps as a surface, not the workflow

    Guest Host: Kelsey Hightower

    Kelsey has worn every hat possible throughout his career in tech and enjoys leadership roles focused on making things happen and shipping software. Prior to his retirement, he was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. He is a strong open source advocate with a focus on building great software as well as great communities around them. He is also an accomplished author and keynote speaker with a knack for demystifying complex topics, doing live demos and enabling others to succeed. When he is not writing code, you can catch him giving technical workshops covering everything from programming to system administration.

    Guest: Cory O'Daniel, CEO and Co-Founder of Massdriver and Co-Founder of OpenTofu

    Cory has been a software architect and engineer for 20 years, leading up to the founding of Massdriver. He's also a husband and the father of two kids.

    Cory O'Daniel, X

    Cory O'Daniel, Medium

    Massdriver, website

    Massdriver, GitHub

    Massdriver, YouTube

    OpenTofu

    Links to interesting things from this episode:

    • "Gitopscracy" video

    49 m