Great conversation, thanks for sharing. The bit that resonated with me most was how companies are just like models: some people are high-latency, some are more expensive, some are fine-tuned, others less so! I've been saying something like this for the last two years, but this was a great distillation.
The parts of Kevin’s interview I’d love to explore further are eval-driven development, ensemble orchestration, and fine-tuning workflows. He framed these as foundational patterns for building AI-native products—core not just for measuring performance, but for shaping product direction itself.
I’m eager to learn how others are approaching this in practice. What’s working well—and where are you seeing the most friction?
A few specific areas I’d love to hear more about include:
• Evals — How are you defining and maintaining them? Are they treated like specs, tests, or dashboards? Who owns them—PM, eng, research? (I've sketched what I mean by "evals as tests" just below this list.)
• Ensembles — Are you routing tasks across multiple models (e.g., fast vs. smart vs. fine-tuned)? How are you orchestrating and evaluating those dynamics?
• Fine-tuning — When do you choose to fine-tune vs. prompt or RAG? What does your workflow look like—tools, processes, teams?
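To make the evals bullet concrete, here's a minimal sketch of what "evals as tests" could look like: a small, versioned eval set run under pytest. The cases and the `summarize` stub are hypothetical stand-ins for a real eval set and a real model call, not anything from the interview.

```python
# Hypothetical example of "evals as tests": a tiny eval set checked in next to
# the product code and run with pytest. `summarize` stands in for the real model call.
import pytest

EVAL_CASES = [
    {"input": "Refunds are accepted within 30 days of purchase.",
     "must_include": "30 days"},
    {"input": "Support is available 9am-5pm ET, Monday through Friday.",
     "must_include": "9am-5pm"},
]

def summarize(text: str) -> str:
    # Swap in the actual model/API call here; identity keeps the sketch runnable.
    return text

@pytest.mark.parametrize("case", EVAL_CASES)
def test_summary_keeps_key_fact(case):
    # Each eval case asserts that a key fact survives the model's output.
    assert case["must_include"] in summarize(case["input"])
```

In practice the cases would live in a file that PM, eng, and research can all edit, which is part of why I'm curious about ownership.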
I’m also curious what others see as a best-in-class stack for supporting this kind of workflow. What tools or practices have been most useful across:
• Prompt/model observability (e.g. PromptLayer, LangSmith)
• Eval automation (OpenAI Evals, Ragas, TruLens)
• Scoring (model-graded, rubric-based, or tied to product KPIs; quick sketch of the model-graded flavor below this list)
• Workflow and orchestration (Who owns what? How do you keep everything aligned?)
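On the scoring point, this is roughly the shape of rubric-based, model-graded scoring I have in mind. The rubric, the `call_judge_model` stub, and the canned score are hypothetical placeholders; the pattern is a judge prompt in, a numeric score out, aggregated into a metric you can track alongside product KPIs.

```python
# Rough sketch of rubric-based, model-graded scoring. `call_judge_model` is a
# hypothetical placeholder for whatever grader model/client you actually use.
RUBRIC = (
    "Score the answer from 1-5:\n"
    "5 = fully correct and grounded in the source\n"
    "3 = mostly correct, minor omissions\n"
    "1 = incorrect or unsupported\n"
    "Return only the number."
)

def call_judge_model(prompt: str) -> str:
    # Replace with a real API call (OpenAI, Anthropic, a local model, etc.).
    return "4"  # canned response so the sketch runs end to end

def grade(question: str, answer: str) -> int:
    # Build the judge prompt and parse its numeric score.
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return int(call_judge_model(prompt).strip())

# Aggregate over an eval set and report the mean as one trackable metric.
scores = [grade("What is the refund window?", "30 days, per the policy page.")]
print(f"mean score: {sum(scores) / len(scores):.2f}")
```

Curious whether people pair this kind of graded score with hard assertions, or trust the judge on its own.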
Excited to hear how others are tackling this. What parts of your stack feel solid, and where are the remaining gaps?