Last Wednesday a paper landed in Science that, on its own merits, is the most rigorous head-to-head between an LLM and emergency physicians yet attempted. Brodeur, Buckley, Manrai, Rodman, and colleagues ran a model against attendings on real, unsmoothed ED cases — the data presented exactly as it appears in the EHR, no preprocessing, no curated vignettes. The model identified the right diagnosis in 67% of triage presentations versus 50% and 55% for the two human experts, and 81% by the time of admission versus 70% and 79%. The team's own caveats are louder than the headline: this does not mean AI is ready to practice medicine, and the read should be that the field has earned the right to run prospective trials, not that it has earned the right to skip them.
Genuine kudos, but I want to talk about the part that is harder to put in the abstract.
The Evidence Layer Runs at Its Own Clock
The model tested was o1, which OpenAI shipped to the public in September 2024. Since then the frontier has moved to GPT-5 in 2025 and GPT-5.5 on April 23 of this year; Claude Opus 4.7 arrived two weeks ago, and Sonnet 4.8 will likely land this month. The paper that just appeared in Science is therefore a careful measurement of a model that, on the day it was published, was roughly two model generations and 18 months behind the working frontier.
That is not a complaint about the authors. The IRB cycle, the data access agreements, the review timelines, and the revisions that turn a good study into a publishable one are not optional. They are the price of doing this rigorously. The complaint is structural: medicine's evidence layer was designed for interventions that change on a decade clock — drugs, devices, procedures — and we are now applying it to a substrate that changes on a six-week clock. The asymmetry is not going to fix itself.
The honest reading of last week's paper is therefore twofold. First, an LLM matched or beat attending physicians on real ER cases, and that is a real result that should change behavior. Second, the model that did so is already outclassed by the model on your phone, and the model on your phone will be outclassed by the model on your phone in July. Anyone who reads "67% at triage" as a ceiling has misunderstood the shape of the curve. Anyone who reads it as ready-for-deployment has misunderstood what the paper measured.
The sharper question is what to do about the lag itself. Some of this is fixable — preprints, living systematic reviews, registry-based trials, model-versioned evaluations — and some of it is not. Whatever gets built, it is going to have to accept that the unit of analysis is no longer "an AI model" but "a class of capability that updates monthly." The best papers of the next five years will be the ones that internalize that and design accordingly. Until then, the gap between what the literature says about clinical AI and what clinical AI actually does is going to keep widening.
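One concrete version of "design accordingly" is to never record an evaluation result without pinning it to the exact model snapshot that produced it. Here is a minimal sketch of what a model-versioned eval record could look like, assuming a simple JSON-lines log; the field names and the snapshot string are illustrative, not taken from the Science paper's methods.

```python
# Minimal sketch of a model-versioned eval record. Field names are
# illustrative; the point is that no result is stored without the exact
# model snapshot, run date, and prompt hash that produced it, so the
# eval can be rerun when the frontier moves.
import datetime
import hashlib
import json

def eval_record(model_id: str, model_snapshot: str, case_id: str,
                prompt: str, correct: bool) -> dict:
    return {
        "model_id": model_id,              # e.g. "o1"
        "model_snapshot": model_snapshot,  # exact dated version string
        "case_id": case_id,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "correct": correct,
        "run_date": datetime.date.today().isoformat(),
    }

# Append one record per case, per model, per run.
with open("evals.jsonl", "a") as f:
    record = eval_record("o1", "o1-2024-12-17", "case-0001",
                         "Triage presentation: ...", True)
    f.write(json.dumps(record) + "\n")
```

Rerunning the same JSONL against the next snapshot is what turns a one-off benchmark into a living one.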
For a longer take on how to read claims about clinical LLMs against this lag, see the dos and don'ts of LLMs in medicine and the State of Clinical AI 2026 summary.
Design Stopped Being a Bottleneck
While the evidence layer was clearing its throat, the build layer quietly removed one of its biggest moats: design.
Anthropic shipped a design system inside Claude last month — a layer that holds visual consistency across edits and lets the model reason about a UI as a coherent artifact rather than a pile of components. Two weeks later Google upgraded Stitch into a real design platform — multi-screen generation, an infinite canvas, a design agent that reasons across the project's evolution, voice-driven critique. Plain English in, five interconnected screens out, exportable to code.
Two years ago the gap between "I have an idea for a clinical tool" and "I have a thing a colleague would actually open" was design. You could prototype the logic in an afternoon and then spend three weekends fighting Tailwind to make it look like something you would not be embarrassed to share. That gap has collapsed. The model now produces the interface as a first-class output, not a thing you bolt on at the end.
For the one-user-software phenomenon I wrote about last month, this is the missing piece. Vibe coding made the logic free. Claude's design layer and Stitch make the surface free. What remains is taste — knowing which problem is worth solving, which workflow is worth automating, which user is worth respecting. That part is still on you, and that part is not getting cheaper.
If you build one thing this month, build the version of it you would have been embarrassed to share six months ago. The floor on what counts as "looks professional" just moved.
The On-Prem Stack Got Serious
The other quiet shift of the last six months is that the local-model story stopped being a hobbyist story.
Apple is reportedly struggling to keep Mac Minis in stock, driven in significant part by people running local LLMs on them. The M4 Pro with 24GB of unified memory will run an 8B–14B model fluently. The 48GB configuration runs a 70B model comfortably. The whole thing draws 30–40 watts under inference. A Mac Mini sitting in a closet is now a credible always-on inference server for an individual, a small clinic, or a research group, at a power and price point that a discrete-GPU rig cannot touch.
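To make "inference server in a closet" concrete, here is a minimal sketch of querying a local model over the localhost HTTP API that tools like Ollama expose. The endpoint, port, and model tag are assumptions about your particular setup, and it assumes the requests library is installed; the request never leaves the machine.

```python
# Minimal sketch: query a local model over an Ollama-style localhost
# API. Endpoint, port, and model tag are assumptions about your setup.
import requests

def local_generate(prompt: str, model: str = "llama3.1:8b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(local_generate("Structure this note into SOAP format: ..."))
```

Point the URL at the Mac Mini's address on your LAN instead of localhost and the same few lines serve a whole clinic.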
The reason this matters for healthcare is not benchmarks. It is sovereignty. Every conversation about AI in medicine starts and ends with where the data goes. A cloud API call sends PHI to a third party under whatever BAA, audit trail, and trust assumptions that vendor offers. A local model running on a box you physically own does not. That is not a small distinction; it is most of the legal and ethical surface of the problem.
The trade-off is real. Local models are a generation or two behind frontier. A 70B open model running on a Mac Mini is not Opus 4.7. For a lot of clinical-adjacent work — note structuring, draft letters, summarization, retrieval over your own corpus, lightweight reasoning — it does not need to be. The right architecture for most clinical AI in the next two years is not "all cloud" or "all local." It is "frontier model in the cloud for the hard problems, local model on a Mac Mini for the data you should not be sending anywhere." Most institutions have not figured out how to think about that split. The ones that figure it out first are going to look strategically miles ahead two years from now.
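Here is what that split can look like in code: a deliberately naive sketch in which a stand-in PHI check decides whether a prompt stays local. The regex is a placeholder, not a de-identification tool; a real deployment would use a vetted pipeline, and local_client/cloud_client are whatever callables your setup provides (for the local side, the sketch above works).

```python
# Deliberately naive sketch of the cloud/local split. The regex is a
# placeholder heuristic, not a de-identification tool. The routing rule
# is the point: anything that might be PHI stays on the box you own;
# everything else is allowed to use the frontier model.
import re
from typing import Callable

PHI_HINTS = re.compile(r"\b(mrn|dob|ssn)\b|\b\d{3}-\d{2}-\d{4}\b",
                       re.IGNORECASE)

def might_contain_phi(text: str) -> bool:
    return bool(PHI_HINTS.search(text))

def route(prompt: str,
          local_client: Callable[[str], str],
          cloud_client: Callable[[str], str]) -> str:
    if might_contain_phi(prompt):
        return local_client(prompt)   # Mac Mini in the closet
    return cloud_client(prompt)       # frontier API for the hard problems
```

Note the asymmetry: a false negative here leaks PHI, so for anything clinical the safer inversion is local-unless-proven-clean rather than cloud-unless-flagged.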
For practical takes on running this stack: my AI toolkit, which model should I use, and what counts as PHI before anything goes anywhere.
Quick Hits
GPT-5.5 shipped April 23. Incremental on benchmarks, meaningful on long-context coherence and tool use. The frontier-cadence point from Issue #5 keeps holding: there is always a next model, it is always a few weeks away, and "wait" is no longer a strategy.
Claude Sonnet 4.8 is the next public Anthropic release, per leaked references in the Claude Code source, and is expected in May. Anthropic's pattern is to ship Sonnet a few weeks after each Opus.
The evidence-layer lag is not unique to AI. It is the same gap that made early statin trials read as conservative, that made the first proton therapy reviews read as inconclusive, that made every digital-health meta-analysis from 2019 read as underwhelming. The pattern is: capability moves, measurement lags, and the people who act on capability while measurement catches up are the ones who shape what gets measured next.
Google Stitch is free and in beta. If you have never built a clinical UI mockup before, this is the lowest-friction way I have seen to start. Type the workflow, get five screens, iterate by talking. Go play with it for an hour this weekend.
What the Lag Tells Us
The Science paper is good science. It is also a snapshot of a model that no longer exists, deployed against attendings whose workflows will not exist in 2030, measured against a benchmark that will look quaint in a year (more on my pet peeve of reducing the physician role to diagnosis and treatment decisions). All three of those things can be true at once and the paper can still be valuable. What it cannot be is the basis for waiting.
The institutions that are going to handle this well will treat the literature as a slow-moving lagging indicator and the substrate as the leading one. They will read papers like this carefully, take the methods seriously, and refuse to let "the evidence isn't in yet" become the reason to delay every decision until the evidence is irrelevant. The ones that handle it badly will pretend the gap is not there.
Build the small tool. Run the local model. Read the careful paper. Hold all three at once.
Until next time, reach out if any of this sparks something.
- Ramez
