Beam Notes: Evidence Is Not a Leaderboard

A few days ago, Nature Medicine published one of those papers whose title almost guarantees that the internet will read it too quickly: general-purpose large language models outperform specialized clinical AI tools on medical benchmarks. The paper compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro Preview, and Claude Opus 4.6 across MedQA, HealthBench, and real clinical queries from NYU Langone, and the headline result was that the frontier general-purpose models came out ahead. That is a useful finding, but I do not think the most important lesson is that one product beat another product.

The more interesting part is what happened next: OpenEvidence responded by alleging conflicts, methodological flaws, and benchmark leakage, including an example of a MedQA-style question that may have been too findable to function as a clean unseen test. I am less interested in adjudicating that fight from the outside than in what it reveals about the field. If a clinical AI system is going to ask physicians to trust its answers, we need more transparency about what is happening under the hood: what model is being used, what sources are retrieved, what gets filtered out, whether benchmark items may have been seen before, how citations are selected, how conflicts are handled, and how performance changes when the task moves from an exam question to a patient in front of you.

The useful fight is about transparency

The simplest reading of the Nature Medicine paper is that frontier models beat specialized clinical tools, and there is a real argument there. Medicine has treated domain specificity as a kind of safety halo: clinical model, medical model, evidence model, HIPAA-compliant model, institutionally approved model. Those labels matter for privacy, deployment, workflow, and procurement, but they do not automatically mean better reasoning, better synthesis, better communication, or better handling of uncertainty. At the same time, OpenEvidence is right to push back on any evaluation that makes a proprietary clinical evidence system look like a generic chatbot without showing enough of the machinery around retrieval, prompting, source selection, citations, and product constraints. The answer is not to assume bad faith on either side. The answer is to make these systems legible enough that clinicians can tell what was tested, what was retrieved, what was ignored, and how quickly the answer can be verified.

This is why leaderboards feel increasingly inadequate. They freeze a moving object and pretend that the frozen measurement describes the deployed tool, but a physician does not use a leaderboard; a physician uses a product in a setting. The relevant unit of evaluation is the whole system: base model, prompt, retrieval corpus, citation behavior, refusal policy, audit trail, privacy posture, fallback model, latency, user interface, and monitoring plan. If any one of those layers changes, the evidence should not be treated as fully portable.
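To make "the whole system" concrete, here is a minimal Python sketch of how an evaluation record could be tied to a specific configuration rather than to a model name. Everything in it is an assumption for illustration: the `DeployedSystem` class, its field names, and the example values are hypothetical and do not describe any vendor's actual schema or the setup used in the paper. It covers only a subset of the layers listed above.

```python
from dataclasses import dataclass, asdict, fields, replace
import hashlib
import json


@dataclass(frozen=True)
class DeployedSystem:
    """Hypothetical record of the layers a clinician actually touches."""
    base_model: str
    prompt_version: str
    retrieval_corpus: str
    citation_policy: str
    refusal_policy: str
    fallback_model: str
    user_interface: str

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same configuration
        # always produces the same short hash.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def changed_layers(evaluated: DeployedSystem, deployed: DeployedSystem) -> list[str]:
    """Name every layer that differs between the evaluated and deployed systems."""
    return [
        f.name
        for f in fields(DeployedSystem)
        if getattr(evaluated, f.name) != getattr(deployed, f.name)
    ]


def evidence_is_portable(evaluated: DeployedSystem, deployed: DeployedSystem) -> bool:
    """Treat evaluation evidence as portable only if no layer has changed."""
    return evaluated.fingerprint() == deployed.fingerprint()


# Illustrative values only; none of these describe a real product.
evaluated = DeployedSystem(
    base_model="example-model-v1",
    prompt_version="2025-01",
    retrieval_corpus="guidelines-snapshot-A",
    citation_policy="inline-with-links",
    refusal_policy="default",
    fallback_model="none",
    user_interface="web",
)

# The vendor quietly adds a fallback model after the evaluation finished.
deployed = replace(evaluated, fallback_model="example-small-model")

print(changed_layers(evaluated, deployed))        # ['fallback_model']
print(evidence_is_portable(evaluated, deployed))  # False
```

The design choice is deliberately strict: any changed layer flags the earlier evidence for review. A real monitoring plan would likely weigh layers differently, but the point of the sketch is that the comparison happens at the level of the system, not the model name.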

For related BeamPath pieces, see understanding AI hallucinations, the dos and don'ts of LLMs in medicine, and what counts as PHI.

Fable 5 shows why “the model” is becoming a slippery object

The same week, Anthropic’s Claude Fable 5 announcement gave us a different version of the same problem. Anthropic described Fable 5 as a Mythos-class model made available for general use, with gains in long-horizon work, coding, finance reasoning, vision, scientific work, and life sciences. It also described conservative safeguards: classifier-triggered fallback behavior for cyber, biology, chemistry, and distillation-sensitive requests, with the company claiming that safeguards would trigger in less than 5% of sessions on average.

Then access was suspended days later. The updated announcement said Anthropic was suspending access to Fable 5 and Mythos 5 while working to restore it, and reporting around the page tied the disruption to an export-control directive. The details will keep changing, but the broader lesson is already clear: the frontier is not a stable product in the old sense. A model can be announced, benchmarked, restricted, repriced, replaced, or wrapped in new safety behavior before the first careful evaluation has even finished circulating.

That creates a strange problem for medicine. If a health system validates a model endpoint over six months, what exactly has been validated? The model name, the model family, the product wrapper, the retrieval layer, the safety classifier, the fallback behavior, the refusal policy, the API version, the user interface, the vendor’s uptime, or the institution’s own workflow around it? In practice, the object a clinician touches is not a model. It is an answer-producing system with many moving parts, some visible and some not.

The optimistic and skeptical reads are both too easy

The optimistic read of the Nature Medicine paper is that general-purpose frontier models have become serious medical tools. I think that is true. They are strong at medical language, synthesis, explanation, and many forms of reasoning, and they benefit from enormous scale and broad alignment. In some settings, that breadth may matter more than narrow domain tuning. If a clinical evidence tool is mostly a retrieval wrapper around a constrained model, the frontier may simply be better at integrating the material into a useful answer.

The skeptical read is that the benchmark layers are messy: the public datasets may be contaminated, the clinical tools were queried through browser interfaces rather than APIs, HealthBench shares a developer with one of the models being tested, the real clinical query (RCQ) data from NYU Langone cannot be independently inspected, and the study did not measure citation quality or workflow fit. I think that is also true. These are not minor details if the goal is to decide what clinicians should use.

What I do not find useful is turning either read into a slogan. “Frontier beats clinical” is too broad. “Benchmarks are flawed” is too evasive. The more useful conclusion is that the era of trust-by-category is over. A clinical AI product should not win because it is labeled medical, and a frontier model should not win because it tops a benchmark. Both should have to show how a clinician can verify the answer quickly, how the sources were selected, what was left out, and how performance changes when the task moves from an exam question to a patient in front of you.

I would ask any clinical AI tool four practical questions before getting too impressed by the score: What sources did you use? What did you retrieve? What did you ignore? How can I verify the answer in under a minute? If the product cannot make those answers visible, the benchmark is not enough.

Deskilling is a real risk, but not every lost skill is worth mourning

The other conversation that predictably filled LinkedIn was deskilling. If AI can answer clinical questions, will doctors stop learning how to think? The concern is real, but the way it is often discussed makes it sound as if the only alternative to old training is intellectual collapse.

There is evidence that automation changes human performance. A 2026 scoping review on AI and physician deskilling describes the risk as erosion of competencies through decreased practice, cognitive offloading, and a shift from primary task execution to supervisory monitoring. It points to automation bias, reduced hands-on practice, altered training environments, and evidence from areas such as gastroenterology, radiology, pathology, clinical decision support, oncology, and surgery. Medical educators are also right to worry that AI makes polished outputs cheaper, because a note or summary may no longer show whether a learner actually reasoned through the case.

But history suggests a more interesting pattern than simple decline. Calculators did not eliminate mathematics; they changed which parts of arithmetic deserved classroom time and which parts could be offloaded so students could work on higher-order problems. Computers did not eliminate writing, statistics, or research; they changed the unit of work and created new expectations about speed, revision, search, and analysis. Robotic surgery did not make surgical skill irrelevant; it changed which skills needed simulation, supervision, transfer testing, and deliberate practice. In each case, some skills atrophied, some skills became less important, and some skills became more important because the tool changed the environment.

Clinical AI will probably follow the same pattern. Some forms of expertise should be protected fiercely: noticing when a patient does not fit the evidence, recognizing when a retrieved paper is being overapplied, hearing the detail in a history that changes the differential, and knowing when a polished answer is clinically dangerous. Other skills are not sacred. Nobody needs to preserve the art of retyping medication lists, summarizing boilerplate, or formatting a referral letter as if clerical endurance were a marker of judgment.

So the question is not whether AI will deskill doctors in some generic sense. It will deskill some tasks if we let it, and some of that will be good. The better question is which skills we are willing to offload and which skills must become more visible because AI is now in the room. Training will need more observed reasoning, more oral defense of decisions, more uncertainty drills, more source verification, and more direct assessment of how a trainee uses AI rather than whether a final note looks polished. The goal is not AI-free clinicians. It is clinicians who can use AI without becoming passengers.

Quick Hits

OpenEvidence versus Nature Medicine is the first real clinical AI benchmark fight. The company alleges conflict of interest, benchmark leakage, and methodological flaws. The paper still raises a hard question that clinical AI vendors should have to answer: do specialized medical products actually outperform the frontier, and on which tasks?

Fable 5 is a reminder that model-versioned evaluation is no longer optional. A model can launch with new capabilities and safeguards, then become unavailable or altered almost immediately. Medical evaluation needs to track the deployed system, not just the model name.

The deskilling literature is getting more concrete. The best version of the argument is not that AI makes doctors dumb. It is that training and assessment have to protect clinical reasoning when AI makes polished work cheap.

Realistic clinical benchmarks remain the missing middle. Licensing exams are useful but incomplete. The harder task is gathering missing information, deciding what matters, handling uncertainty, and making a patient-specific plan.

The social-media fight is over-reading the headline and under-reading the methods. The yea-sayers are right that general-purpose models are now serious medical tools. The naysayers are right that leakage, conflicts, citation quality, and workflow fit matter. The useful move is to force both sides into more specific evidence.

Evidence is not a leaderboard

The reason I keep coming back to this paper is that it makes almost everyone uncomfortable in a productive way. It challenges clinical AI companies that want domain labels to substitute for independent evaluation. It challenges frontier-model enthusiasts who want a benchmark win to stand in for clinical readiness. It challenges educators who want to preserve old assessment artifacts even as AI makes those artifacts less informative. And it challenges health systems that would prefer procurement categories to do the work of judgment.

That discomfort is useful if it pushes us toward better evaluation rather than louder certainty. Medical AI should be judged through layered verifiability: public benchmarks, real clinical queries, blinded clinician review, citation audits, latency testing, workflow studies, model-versioned monitoring, and honest post-deployment surveillance. None of those layers is enough alone, but together they create a more realistic picture of whether a tool is useful, safe, and worth trusting for a particular job.

The future clinical AI stack will probably include frontier models, specialized evidence tools, local systems, patient-facing assistants, and institution-specific retrieval. None of those should get a free pass. None should be dismissed because one benchmark looked bad. The question is not which label wins. The question is which layer can be verified for the work in front of it.

That is less satisfying than a leaderboard, but it is much closer to medicine.

Until next time, reach out if any of this sparks something.

- Ramez