Earlier this month JAMA published the largest real-world look yet at AI scribe adoption — 8,581 clinicians across five academic medical centers, 1,809 of them adopters, tracked over two years. Headline numbers: 13.4 fewer minutes in the EHR per clinician. 16 fewer minutes documenting. Half a visit more per week. The authors called it "modest." Most of the press echoed that word.
I want to push back on that framing, because I think the word itself is a symptom of how we keep underestimating what is happening in front of us.
In the same few weeks, Anthropic revealed and then quietly cordoned off its most capable model to date — handing it to a closed cybersecurity consortium because it was too dangerous to release in the open. Yesterday Anthropic shipped a design layer in Claude that holds visual consistency across edits. A solo physician somewhere built a clinical calculator over a weekend and put it in front of patients. A health system somewhere got a spear-phishing email written by an LLM that nobody on the security team knew how to flag. None of these things made the front page together. All of them are pieces of the same shift.
This issue is about that shift, and about why "modest" is the wrong word for almost any of it.
The Floor Is Not the Ceiling
The JAMA paper measured what happened when clinicians at five large academic centers were given opt-in access to AI scribes between mid-2023 and mid-2025. The intervention is two-year-old technology in a population of clinicians who self-selected, in institutions whose IT and compliance review cycles are measured in quarters. The result was 13–16 minutes a day saved and a half-visit-a-week productivity bump.
Call that the floor.
Now do the math at scale. Sixteen minutes a day, five days a week, fifty weeks a year is 67 hours per clinician annually. Multiply by 8,500 clinicians and you have over half a million clinical-hours of administrative time recovered. In one health system. Per year. From a tool that the authors describe as producing "moderately beneficial associations." If you tell a CFO they can buy back half a million hours, they do not call that modest.
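Back-of-envelope, the scaling math looks like this (the per-day figure and headcount are from the paper; the 5-day, 50-week working year is my rounding assumption):

```python
# Scale the JAMA per-clinician savings to a health system.
# Assumptions (mine): 5 working days/week, 50 working weeks/year.
minutes_saved_per_day = 16
clinicians = 8_500

hours_per_clinician_year = minutes_saved_per_day * 5 * 50 / 60
total_hours = hours_per_clinician_year * clinicians

print(f"{hours_per_clinician_year:.0f} hours per clinician per year")  # ~67
print(f"{total_hours:,.0f} clinical-hours recovered system-wide")      # ~566,667
```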
Now stretch the time horizon. The clinicians in this study had access for, in many cases, less than a year. The behavioral curve on a tool like this is not linear. The first month is awkward. The first quarter is when you learn what the tool gets wrong. The first year is when you start building workflows around what it gets right. The studies that will land in 2027 and 2028, on clinicians who have been using these tools for three or four years, will not look like 16 minutes saved. They will look like a different job description.
And the floor itself keeps moving. The scribes studied in 2024 are not the scribes shipping today. The models underneath them have moved from GPT-4-class to GPT-5.2 and Claude 4.7. Context windows have grown an order of magnitude. The same clinician with the same workflow on the same intake is not getting the 2024 product anymore. The paper measured a snapshot. The thing it measured no longer exists.
The deeper point is what JAMA's design could not capture. The study measured time-in-EHR and visit count because those are what the EHR can log. It did not measure cognitive load, after-hours mental residue, the rate at which clinicians leave the profession, the quality of the documentation produced, or the downstream effects on the patient who actually got an additional visit slot. Those are the ceiling. The paper measured the floor and the headlines reported it as the result.
We will keep doing this. Every meta-analysis of clinical AI for the next five years will be a meta-analysis of yesterday's tools deployed to last year's workflows by a self-selected slice of physicians. The honest answer is that we do not yet have a study design that captures what compounding actually looks like in a profession that runs on judgment.
For more on the structural shift behind this, see the dos and don'ts of LLMs in medicine and how I think about matching tools to tasks.
One-User Software
The other thing the JAMA paper cannot see, because no health system metric can see it, is that clinicians are now writing their own software.
Not because anyone trained us. Because the cost of a usable single-purpose app dropped from a fiscal quarter and an IT ticket to an afternoon and a Claude window. Karpathy called it vibe coding: you describe what you want, the model writes it, you steer with feel, and a working tool falls out the other end. For a long time the assumption was that vibe coding produced toys. That assumption is now visibly wrong.
I have done this. A handful of small private tools, a couple of larger projects with collaborators, a few things still sitting in the drawer waiting for a free weekend. None of them would have existed five years ago. None of them required a CTO. The viable user count for a piece of clinical software has officially collapsed to one. If only you need it, you can ship it.
That is a structural change in how clinical innovation happens. The old model was: identify a need, write a grant, hire a developer, get it past IRB and IT, deploy in eighteen months, abandon it in twenty-four. The new model is: notice a need on Tuesday, ship a private tool by Friday, share it with two colleagues by the following Tuesday, throw it away when something better comes along. Most of these tools will be small. A handful will become indispensable. Many will get rebuilt next quarter on better models. None of them will appear in any time-savings study.
If you have not built anything yet, this is the year to start. Find one friction point in your week — the spreadsheet you rebuild every Monday, the calculator you wish existed, the form that takes too long to fill out, the conversion you keep doing in your head. Build the smallest version of a tool that takes that pain away. Use the model the way you would use a junior collaborator. Iterate by feel. The downside is a wasted afternoon. The upside is a small thing that quietly makes your week better, and a skill that will compound for the rest of your career.
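To make "smallest version of a tool" concrete, here is roughly the scale I mean — a hypothetical single-purpose calculator. The albumin correction formula is the standard one; everything else about this sketch is illustrative, not a recommendation of a specific design:

```python
# A one-user tool at its smallest: albumin-corrected serum calcium.
# Standard formula: corrected Ca = measured Ca + 0.8 * (4.0 - albumin), in mg/dL.

def corrected_calcium(ca_mgdl: float, albumin_gdl: float) -> float:
    """Albumin-corrected serum calcium in mg/dL."""
    return ca_mgdl + 0.8 * (4.0 - albumin_gdl)

if __name__ == "__main__":
    # Example: measured Ca 8.0 mg/dL with albumin 2.0 g/dL
    print(f"{corrected_calcium(8.0, 2.0):.1f} mg/dL")  # 9.6
```

An afternoon of model-assisted iteration can wrap this in whatever interface you want. The point is that the core of a genuinely useful tool is often ten lines, and the model handles everything around it.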
The Claude Mythos
In late March, a CMS misconfiguration on Anthropic's site briefly exposed a model nobody outside the company had heard of. Codename: Capybara. Public name: Claude Mythos. On April 8, Anthropic confirmed it formally — and said it was not going to be released broadly to the public.
Mythos is going to a closed consortium instead. Amazon, Apple, Google, Microsoft, Cisco, CrowdStrike, JPMorgan, NVIDIA, Palo Alto Networks, the Linux Foundation, and a small number of others. The program is called Project Glasswing. The job is to find zero-day software vulnerabilities at scale, before adversaries do.
The reason Anthropic kept Mythos out of the open release is the part you should understand. The model independently developed what the company calls a "next generation" capability for finding and exploiting software flaws. In their own pre-release testing, Mythos found thousands of zero-days across major systems. As of the April 7 announcement, 99% of those were undefended. Anthropic released Claude Opus 4.7 the following week as the public-facing model — an upgrade, but deliberately less broadly capable than what is sitting inside Glasswing.
This is not a hypothetical. In November 2025, Anthropic disclosed that a suspected Chinese state-sponsored group, designated GTG-1002, jailbroke Claude Code to autonomously run a cyber espionage operation against roughly 30 organizations — technology firms, financial institutions, chemical manufacturers, government agencies. The AI executed 80–90% of the tactical operations on its own, at machine speed. The humans were essentially supervisors. That was on a model less capable than what Anthropic now keeps inside the consortium.
So the "mythos" of Claude is no longer a marketing word about a careful, thoughtful chatbot. It is the public realization that the labs now ship models capable of running offensive security operations at a scale and speed no human team can match. Anthropic decided not to put that capability in your hands. They cannot stop the next lab from shipping something equivalent. They cannot stop someone from jailbreaking the version they did ship.
For healthcare this is the only headline that matters. The capability gap between defenders and attackers just widened, fast, and most health systems are on the wrong side of it. The next section is what that means for you on Monday morning.
Cybersecurity Got a Lot Bigger
Read the previous section again, then read this one.
The first major LLM-mediated hospital data breach is no longer a prediction. The playbook exists, has been used at scale, and has been disclosed by the lab whose model was used. The only open question is which institution will be named first. If your CISO has not yet read Anthropic's espionage disclosure, that is the homework you assign on Monday.
Every clinician shipping a one-user tool is also shipping a one-user attack surface. Every AI scribe is a system with a long memory and a network connection. Every agent that touches a chart is something that, under the wrong configuration, can be coaxed into doing things the human in front of it never asked for. The same low-friction surface that makes vibe coding viable for a physician makes prompt injection, data exfiltration, and credential phishing viable for an attacker — and the attacker now has a model that can chain those steps autonomously.
A few things to take seriously, now and not later.
Treat any AI tool that touches patient data like an EHR plugin, not a website. That means: where does the data go, who logs it, who can read the logs, what happens at end-of-life. Most consumer AI products are designed with the assumption that the data is yours. In healthcare that assumption is wrong from the first paste. Re-read what counts as PHI before pasting anything you would not paste into Twitter.
Assume your inbox is being LLM-augmented for spear-phishing. The era of the misspelled Nigerian prince email is over. The next phishing message you get is going to read like a colleague who knows your project. That tells you the verification step is no longer "does this email feel real" but "did the person I think sent this actually send it, by a channel that is not email."
If you build, build with secrets discipline from day one. API keys in environment variables, not in code. Never commit a .env. Never put a Notion or Plunk key in a public repo. Whatever you ship in a weekend is going to outlive your attention span.
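A minimal sketch of that discipline — the environment variable name here is the one Anthropic's tooling conventionally uses, but the pattern applies to any key:

```python
# Read secrets from the environment at startup; never hardcode them in source.
import os

def get_api_key(name: str) -> str:
    """Fetch a key from the environment; fail loudly rather than run with a blank."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; refusing to start without it.")
    return key

# Example at startup (key lives in the shell or a gitignored .env, never the repo):
# api_key = get_api_key("ANTHROPIC_API_KEY")
```

Pair it with a `.gitignore` that lists `.env` from the first commit, before any key exists to leak.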
Push your institution to write an AI use policy that has teeth. Not a 40-page memo nobody reads. A one-pager that says: here is what you can use, here is what you cannot, here is who owns the risk if it goes wrong. Most institutions are still in the "we'll figure it out later" phase. Later is now.
The good news is that most of these are solved problems. The bad news is that the solutions live in the security and engineering communities, not the medicine community, and the bridge between them is mostly unbuilt. If your hospital does not yet have someone whose job is "AI security in clinical workflows," that role is coming. Better to be the person who proposed it than the person who got named in the breach report.
Quick Hits
Claude Opus 4.7 shipped April 16. The version most of us actually get our hands on. Better at engineering, instruction-following, and real-world work — and explicitly framed by Anthropic as the safer-to-release sibling of Mythos. Worth understanding what they are willing to give you and what they are not.
Frontier release cadence is now sub-monthly. Claude 4.7, GPT-5.2, and Gemini 3 Pro all moved on benchmarks and usability this quarter. The "wait for the next model" strategy is no longer a strategy — there is always a next model and it is always six weeks away.
Rentosertib Phase IIa was the first proof point for a fully AI-designed drug. 173 AI-designed candidates are now in clinical trials. Zero approvals yet. The ratio matters less than the rate.
The CFR called Mythos "an inflection point for AI and global security." When the Council on Foreign Relations writes about your model release, you are not in a normal product cycle anymore.
What the Floor Tells Us
The JAMA scribe paper is not wrong. It is a careful, honest measurement of a small slice of reality. It is also a reminder that the things we can measure are almost always lagging indicators of the things that actually matter.
Sixteen minutes a day is a floor. Vibe coding for one user is a floor. The Mythos disclosure is a floor — a public glimpse of capabilities the labs were sitting on, and a polite warning that the next floor will be higher and the wrong people will be standing on it. Each of these compounds. Each of them quietly raises the stakes for an industry that has never had to think about itself this way.
The job, for those of us paying attention, is to stop reading "modest" as "small" and start reading it as "this is the bottom of the curve" — and to act, this year, with the assumption that the people on the other side of the wire are reading the same papers we are.
Keep building. Keep learning. Until next time, reach out if any of this sparks something.
- Ramez
