• LLMs for Equities Feature Forecasting at Two Sigma with Ben Wellington - #736
    Jun 17 2025
    Today, we're joined by Ben Wellington, deputy head of feature forecasting at Two Sigma. We dig into the team’s end-to-end approach to leveraging AI in equities feature forecasting, covering how they identify and create features, collect and quantify historical data, and build predictive models to forecast market behavior and asset prices for trading and investment. We explore the firm's platform-centric approach to managing an extensive portfolio of features and models, the impact of multimodal LLMs on accelerating the process of extracting novel features, the importance of strict data timestamping to prevent temporal leakage, and the way they consider build vs. buy decisions in a rapidly evolving landscape. Lastly, Ben also shares insights on leveraging open-source models and the future of agentic AI in quantitative finance. The complete show notes for this episode can be found at https://twimlai.com/go/736.
    Más Menos
    1 h
  • Zero-Shot Auto-Labeling: The End of Annotation for Computer Vision with Jason Corso - #735
    Jun 10 2025
    Today, we're joined by Jason Corso, co-founder of Voxel51 and professor at the University of Michigan, to explore automated labeling in computer vision. Jason introduces FiftyOne, an open-source platform for visualizing datasets, analyzing models, and improving data quality. We focus on Voxel51’s recent research report, “Zero-shot auto-labeling rivals human performance,” which demonstrates how zero-shot auto-labeling with foundation models can yield to significant cost and time savings compared to traditional human annotation. Jason explains how auto-labels, despite being "noisier" at lower confidence thresholds, can lead to better downstream model performance. We also cover Voxel51's "verified auto-labeling" approach, which utilizes a "stoplight" QA workflow (green, yellow, red light) to minimize human review. Finally, we discuss the challenges of handling decision boundary uncertainty and out-of-domain classes, the differences between synthetic data generation in vision and language domains, and the potential of agentic labeling. The complete show notes for this episode can be found at https://twimlai.com/go/735.
    Más Menos
    57 m
  • Grokking, Generalization Collapse, and the Dynamics of Training Deep Neural Networks with Charles Martin - #734
    Jun 5 2025
    Today, we're joined by Charles Martin, founder of Calculation Consulting, to discuss Weight Watcher, an open-source tool for analyzing and improving Deep Neural Networks (DNNs) based on principles from theoretical physics. We explore the foundations of the Heavy-Tailed Self-Regularization (HTSR) theory that underpins it, which combines random matrix theory and renormalization group ideas to uncover deep insights about model training dynamics. Charles walks us through WeightWatcher’s ability to detect three distinct learning phases—underfitting, grokking, and generalization collapse—and how its signature “layer quality” metric reveals whether individual layers are underfit, overfit, or optimally tuned. Additionally, we dig into the complexities involved in fine-tuning models, the surprising correlation between model optimality and hallucination, the often-underestimated challenges of search relevance, and their implications for RAG. Finally, Charles shares his insights into real-world applications of generative AI and his lessons learned from working in the field. The complete show notes for this episode can be found at https://twimlai.com/go/734.
    Más Menos
    1 h y 25 m
  • Google I/O 2025 Special Edition - #733
    May 28 2025
    Today, I’m excited to share a special crossover edition of the podcast recorded live from Google I/O 2025! In this episode, I join Shawn Wang aka Swyx from the Latent Space Podcast, to interview Logan Kilpatrick and Shrestha Basu Mallick, PMs at Google DeepMind working on AI Studio and the Gemini API, along with Kwindla Kramer, CEO of Daily and creator of the Pipecat open source project. We cover all the highlights from the event, including enhancements to the Gemini models like thinking budgets and thought summaries, native audio output for expressive voice AI, and the new URL Context tool for research agents. The discussion also digs into the Gemini Live API, covering its architecture, the challenges of building real-time voice applications (such as latency and voice activity detection), and new features like proactive audio and asynchronous function calling. Finally, don’t miss our guests’ wish lists for next year’s I/O! The complete show notes for this episode can be found at https://twimlai.com/go/733.
    Más Menos
    26 m
  • RAG Risks: Why Retrieval-Augmented LLMs are Not Safer with Sebastian Gehrmann - #732
    May 21 2025
    Today, we're joined by Sebastian Gehrmann, head of responsible AI in the Office of the CTO at Bloomberg, to discuss AI safety in retrieval-augmented generation (RAG) systems and generative AI in high-stakes domains like financial services. We explore how RAG, contrary to some expectations, can inadvertently degrade model safety. We cover examples of unsafe outputs that can emerge from these systems, different approaches to evaluating these safety risks, and the potential reasons behind this counterintuitive behavior. Shifting to the application of generative AI in financial services, Sebastian outlines a domain-specific safety taxonomy designed for the industry's unique needs. We also explore the critical role of governance and regulatory frameworks in addressing these concerns, the role of prompt engineering in bolstering safety, Bloomberg’s multi-layered mitigation strategies, and vital areas for further work in improving AI safety within specialized domains. The complete show notes for this episode can be found at https://twimlai.com/go/732.
    Más Menos
    57 m
  • From Prompts to Policies: How RL Builds Better AI Agents with Mahesh Sathiamoorthy - #731
    May 13 2025
    Today, we're joined by Mahesh Sathiamoorthy, co-founder and CEO of Bespoke Labs, to discuss how reinforcement learning (RL) is reshaping the way we build custom agents on top of foundation models. Mahesh highlights the crucial role of data curation, evaluation, and error analysis in model performance, and explains why RL offers a more robust alternative to prompting, and how it can improve multi-step tool use capabilities. We also explore the limitations of supervised fine-tuning (SFT) for tool-augmented reasoning tasks, the reward-shaping strategies they’ve used, and Bespoke Labs’ open-source libraries like Curator. We also touch on the models MiniCheck for hallucination detection and MiniChart for chart-based QA. The complete show notes for this episode can be found at https://twimlai.com/go/731.
    Más Menos
    1 h y 1 m
  • How OpenAI Builds AI Agents That Think and Act with Josh Tobin - #730
    May 6 2025
    Today, we're joined by Josh Tobin, member of technical staff at OpenAI, to discuss the company’s approach to building AI agents. We cover OpenAI's three agentic offerings—Deep Research for comprehensive web research, Operator for website navigation, and Codex CLI for local code execution. We explore OpenAI’s shift from simple LLM workflows to reasoning models specifically trained for multi-step tasks through reinforcement learning, and how that enables agents to more easily recover from failures while executing complex processes. Josh shares insights on the practical applications of these agents, including some unexpected use cases. We also discuss the future of human-AI collaboration in software development, such as with "vibe coding," the integration of tools through the Model Control Protocol (MCP), and the significance of context management in AI-enabled IDEs. Additionally, we highlight the challenges of ensuring trust and safety as AI agents become more powerful and autonomous. The complete show notes for this episode can be found at https://twimlai.com/go/730.
    Más Menos
    1 h y 7 m
  • CTIBench: Evaluating LLMs in Cyber Threat Intelligence with Nidhi Rastogi - #729
    Apr 30 2025
    Today, we're joined by Nidhi Rastogi, assistant professor at Rochester Institute of Technology to discuss Cyber Threat Intelligence (CTI), focusing on her recent project CTIBench—a benchmark for evaluating LLMs on real-world CTI tasks. Nidhi explains the evolution of AI in cybersecurity, from rule-based systems to LLMs that accelerate analysis by providing critical context for threat detection and defense. We dig into the advantages and challenges of using LLMs in CTI, how techniques like Retrieval-Augmented Generation (RAG) are essential for keeping LLMs up-to-date with emerging threats, and how CTIBench measures LLMs’ ability to perform a set of real-world tasks of the cybersecurity analyst. We unpack the process of building the benchmark, the tasks it covers, and key findings from benchmarking various LLMs. Finally, Nidhi shares the importance of benchmarks in exposing model limitations and blind spots, the challenges of large-scale benchmarking, and the future directions of her AI4Sec Research Lab, including developing reliable mitigation techniques, monitoring "concept drift" in threat detection models, improving explainability in cybersecurity, and more. The complete show notes for this episode can be found at https://twimlai.com/go/729.
    Más Menos
    56 m
adbl_web_global_use_to_activate_webcro805_stickypopup