Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
| Episode | Date |
|---|---|
| CUGA Agent: From Benchmarks to Business Impact of IBM's Generalist Agent | Feb 11, 2026 |
| TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture | Nov 24, 2025 |
| Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations | Nov 10, 2025 |
| Georgia Tech's Santosh Vempala Explains Why Language Models Hallucinate, His Research With OpenAI | Oct 14, 2025 |
| Atropos Health's Arjun Mukerji, PhD, Explains RWESummary: A Framework and Test for Choosing LLMs to Summarize Real-World Evidence (RWE) Studies | Sep 22, 2025 |
| Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon Walks Us Through His New Paper | Sep 06, 2025 |
| Small Language Models are the Future of Agentic AI | Sep 05, 2025 |
| Watermarking for LLMs and Image Models | Jul 30, 2025 |
| Self-Adapting Language Models: Paper Authors Discuss Implications | Jul 08, 2025 |
| The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning | Jun 20, 2025 |
| Accurate KV Cache Quantization with Outlier Tokens Tracing | Jun 04, 2025 |
| Scalable Chain of Thoughts via Elastic Reasoning | May 16, 2025 |
| Sleep-time Compute: Beyond Inference Scaling at Test-time | May 02, 2025 |
| LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection | Apr 18, 2025 |
| AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam | Apr 04, 2025 |
| Model Context Protocol (MCP) | Mar 25, 2025 |
| AI Roundup: DeepSeek's Big Moves, Claude 3.7, and the Latest Breakthroughs | Mar 01, 2025 |
| How DeepSeek is Pushing the Boundaries of AI Development | Feb 21, 2025 |
| Multiagent Finetuning: A Conversation with Researcher Yilun Du | Feb 04, 2025 |
| Training Large Language Models to Reason in Continuous Latent Space | Jan 14, 2025 |
| LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods | Dec 23, 2024 |
| Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies | Dec 10, 2024 |
| Agent-as-a-Judge: Evaluate Agents with Agents | Nov 23, 2024 |
| Introduction to OpenAI's Realtime API | Nov 12, 2024 |
| Swarm: OpenAI's Experimental Approach to Multi-Agent Systems | Oct 29, 2024 |
| KV Cache Explained | Oct 24, 2024 |
| The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs | Oct 16, 2024 |
| Google's NotebookLM and the Future of AI-Generated Audio | Oct 15, 2024 |
| Exploring OpenAI's o1-preview and o1-mini | Sep 27, 2024 |
| Breaking Down Reflection Tuning: Enhancing LLM Performance with Self-Learning | Sep 19, 2024 |
| Composable Interventions for Language Models | Sep 11, 2024 |
| Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges | Aug 16, 2024 |
| Breaking Down Meta's Llama 3 Herd of Models | Aug 06, 2024 |
| DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines | Jul 23, 2024 |
| RAFT: Adapting Language Model to Domain Specific RAG | Jun 28, 2024 |
| LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic | Jun 14, 2024 |
| Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment | May 30, 2024 |
| Breaking Down EvalGen: Who Validates the Validators? | May 13, 2024 |
| Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models | Apr 26, 2024 |
| Demystifying Chronos: Learning the Language of Time Series | Apr 04, 2024 |
| Anthropic Claude 3 | Mar 25, 2024 |
| Reinforcement Learning in the Era of LLMs | Mar 15, 2024 |
| Sora: OpenAI's Text-to-Video Generation Model | Mar 01, 2024 |
| RAG vs Fine-Tuning | Feb 08, 2024 |
| HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels | Feb 02, 2024 |
| Phi-2 Model | Feb 02, 2024 |
| A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B) – Part I | Dec 27, 2023 |
| How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings | Dec 18, 2023 |
| The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets | Nov 30, 2023 |
| Towards Monosemanticity: Decomposing Language Models With Dictionary Learning | Nov 20, 2023 |
| RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models | Oct 18, 2023 |
| Explaining Grokking Through Circuit Efficiency | Oct 17, 2023 |
| Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior | Sep 29, 2023 |
| Skeleton of Thought: LLMs Can Do Parallel Decoding | Aug 30, 2023 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Jul 31, 2023 |
| Lost in the Middle: How Language Models Use Long Contexts | Jul 26, 2023 |
| Orca: Progressive Learning from Complex Explanation Traces of GPT-4 | Jul 21, 2023 |
| Toolformer: Training LLMs To Use Tools | Mar 20, 2023 |
| Hungry Hungry Hippos - H3 | Feb 13, 2023 |
| ChatGPT and InstructGPT: Aligning Language Models to Human Intention | Jan 18, 2023 |