Sonnet 5 review: I ran 64 generations to find out if it's worth it

🎙 I built the How I AI Bench live using Claude Code, ran 5 frontier models through 64 blind prototype generations, PRDs, and agent voice tests, and the results surprised even me

Claire Vo

Jun 30, 2026

I’ve been testing every major frontier model release since the start of the year, and when Anthropic dropped Sonnet 5, I wanted more than a vibe check. I got tired of one-off tests I couldn’t repeat or compare over time, so I built something better: the How I AI Bench, a repeatable eval harness I constructed live using Claude Code while recording this episode. I ran Sonnet 5 blind against four other frontier models (Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro) across PRD quality, prototype generation, agentic task completion, and agent personality. The results were not what I expected.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

What Anthropic claims Sonnet 5 improves over Sonnet 4.6, and where the benchmark data actually backs that up
How I built the How I AI Bench in under 45 minutes using Claude Code, starting from my own stored session history
Why I combined human vibe scoring (70%) with LLM-as-judge scoring (30%) instead of trusting either one alone
How to set up a local HTML scoring page so you can rate AI outputs on gut feel and export those scores as JSON
Which model I recommend for PRDs, which for complex prototypes, and which for chatting with an agent daily
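
To make the 70/30 split concrete, here is a minimal sketch of how exported human scores could be blended with LLM-as-judge scores. This is an illustration, not the actual How I AI Bench code: the JSON shape (`id` and `score` fields), the function names, and the assumption that both scores share the same scale are all hypothetical. Only the 70% and 30% weights come from the post.

```python
import json

HUMAN_WEIGHT = 0.7  # from the post: human vibe score counts for 70%
JUDGE_WEIGHT = 0.3  # from the post: LLM-as-judge score counts for 30%


def blended_score(human: float, judge: float) -> float:
    """Weighted average of a human score and a judge score on the same scale."""
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def blend_export(exported_json: str, judge_scores: dict[str, float]) -> dict[str, float]:
    """Blend exported human scores with judge scores, keyed by generation ID.

    Assumed (hypothetical) export shape: [{"id": "gen-01", "score": 4}, ...]
    """
    human_rows = json.loads(exported_json)
    blended = {}
    for row in human_rows:
        gen_id = row["id"]
        if gen_id not in judge_scores:
            continue  # skip generations the judge has not scored
        blended[gen_id] = round(blended_score(row["score"], judge_scores[gen_id]), 2)
    return blended


if __name__ == "__main__":
    # Hypothetical data: two generations rated by a human and by a judge model.
    export = '[{"id": "gen-01", "score": 4}, {"id": "gen-02", "score": 2}]'
    judge = {"gen-01": 3.0, "gen-02": 5.0}
    print(blend_export(export, judge))
```

Weighting the human score more heavily means a generation the judge loves but a person dislikes still ranks low, which is the point of not trusting either signal alone.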