Cardiology and rheumatology splits. Accuracy, calibration ECE, hallucination rate, and demographic fairness, measured across zero-shot, chain-of-thought, and double-filter prompting strategies on six frontier and open-weight models.
MedReason-Bench probes clinical reasoning, not multiple-choice recall. The existing medical LLM benchmark landscape is dominated by USMLE-style questions: closed-vocabulary, single-answer, and poor at measuring when models hallucinate, when they are over-confident, and when they fail specific subgroups.
The benchmark targets two specialties where my clinical-research work lives: cardiology (where my published work on edge-AI cardiovascular monitoring sits) and rheumatology (where the LUPUS pipeline at Northwestern Medicine runs). Each split contains physician-style vignettes drawn from public datasets (MedQA, MedMCQA, PubMedQA) plus custom items written against ACR criteria for SLE and ESC heart-failure guidelines.
Every model is evaluated under three prompting strategies: zero-shot, chain-of-thought, and double-filter (the strategy from the Kehl group's 2026 ECOG-extraction paper). Calibration is reported as ECE over five equal-mass bins of model self-rated confidence. Hallucination rate is the fraction of answers containing claims supported by neither the vignette nor any cited reference.
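For concreteness, here is a minimal sketch of that calibration metric: ECE over five equal-mass bins of self-rated confidence. The function name, data layout, and toy values are illustrative assumptions, not the harness's actual API.

```python
# Sketch: expected calibration error with equal-mass binning.
# Names and data layout are illustrative, not the MedReason-Bench API.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE where each bin holds roughly len(confidences)/n_bins items."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)              # sort items by self-rated confidence
    bins = np.array_split(order, n_bins)         # equal-mass split of the sorted indices
    n = len(confidences)
    ece = 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        avg_conf = confidences[idx].mean()       # mean stated confidence in the bin
        avg_acc = correct[idx].mean()            # empirical accuracy in the bin
        ece += (len(idx) / n) * abs(avg_acc - avg_conf)
    return ece

# Toy example: ten graded answers with self-rated confidences.
conf = [0.95, 0.90, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.40]
hits = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
print(round(expected_calibration_error(conf, hits), 3))
```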
Headline results, both specialty splits combined. Best per column shown in crimson; lower is better for ECE and hallucination rate.
| Model | Accuracy ↑ | Calibration ECE ↓ | Hallucination ↓ |
|---|---|---|---|
| opus-4.7 | 0.81 | 0.03 | 0.04 |
| sonnet-4.6 | 0.88 | 0.04 | 0.05 |
| gpt-5 | 0.69 | 0.07 | 0.11 |
| gemini-2.5 | 0.77 | 0.05 | 0.08 |
| llama-4 | 0.55 | 0.12 | 0.18 |
| med-palm-3 | 0.62 | 0.09 | 0.14 |
Same six models, same 220 vignettes per specialty. Rheumatology is the harder split: denser pattern recognition, longer differentials. Cardiology results first, then rheumatology.
| Model | Accuracy ↑ | ECE ↓ | Hallucination ↓ |
|---|---|---|---|
| opus-4.7 | 0.83 | 0.03 | 0.04 |
| sonnet-4.6 | 0.91 | 0.03 | 0.04 |
| gpt-5 | 0.72 | 0.06 | 0.10 |
| gemini-2.5 | 0.79 | 0.04 | 0.07 |
| llama-4 | 0.58 | 0.11 | 0.17 |
| med-palm-3 | 0.65 | 0.08 | 0.13 |
The rheumatology split:

| Model | Accuracy ↑ | ECE ↓ | Hallucination ↓ |
|---|---|---|---|
| opus-4.7 | 0.78 | 0.03 | 0.05 |
| sonnet-4.6 | 0.84 | 0.05 | 0.06 |
| gpt-5 | 0.65 | 0.09 | 0.13 |
| gemini-2.5 | 0.74 | 0.06 | 0.09 |
| llama-4 | 0.51 | 0.13 | 0.20 |
| med-palm-3 | 0.59 | 0.10 | 0.15 |
Pooled across all six models and both specialties, double-filter shows the largest gain on the hardest cases, but that gain is concentrated in the older and open-weight models. On the frontier models (opus-4.7, sonnet-4.6) the marginal gain shrinks toward zero.
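That comparison reduces to a per-strategy accuracy delta. Below is a hedged sketch, assuming per-item results carry model, strategy, and some difficulty label; the column names and layout are assumptions, not the released schema.

```python
# Sketch: accuracy gain of double-filter over zero-shot, per model and
# difficulty bucket. Column names are assumed, not the released schema.
import pandas as pd

def strategy_gain(results: pd.DataFrame) -> pd.DataFrame:
    """results: one row per (model, item, strategy) with a 0/1 `correct` column."""
    acc = (results
           .groupby(["model", "difficulty", "strategy"])["correct"]
           .mean()
           .unstack("strategy"))
    acc["double_filter_gain"] = acc["double-filter"] - acc["zero-shot"]
    return acc

# Toy usage with two models and two difficulty buckets.
toy = pd.DataFrame({
    "model": ["llama-4"] * 4 + ["opus-4.7"] * 4,
    "difficulty": ["hard", "hard", "easy", "easy"] * 2,
    "strategy": ["zero-shot", "double-filter"] * 4,
    "correct": [0, 1, 1, 1, 1, 1, 1, 1],
})
print(strategy_gain(toy))
```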
The full evaluation harness is built on Python · pytest · Anthropic SDK · OpenAI SDK · Google AI SDK. Model runners are vendored per provider, with deterministic seeding where the API exposes it. Calibration is computed over public test splits with stratified bootstrap (B=1000) for confidence intervals.
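A minimal sketch of that stratified bootstrap, assuming each scored item carries a stratum label such as its specialty split; the function and variable names are illustrative rather than the harness's actual interface.

```python
# Sketch: stratified bootstrap CI (B=1000), resampling within each stratum.
# Names are illustrative; this is not the harness's actual interface.
import numpy as np

def stratified_bootstrap_ci(values, strata, metric=np.mean, B=1000,
                            alpha=0.05, seed=0):
    """Resample items within each stratum, recompute the metric, return a CI."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    strata = np.asarray(strata)
    groups = [np.flatnonzero(strata == s) for s in np.unique(strata)]
    stats = np.empty(B)
    for b in range(B):
        sample = np.concatenate(
            [rng.choice(idx, size=len(idx), replace=True) for idx in groups]
        )
        stats[b] = metric(values[sample])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example: 95% CI on accuracy, stratified by specialty.
correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
specialty = ["cardio"] * 5 + ["rheum"] * 5
print(stratified_bootstrap_ci(correct, specialty))
```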
Prompts and rubrics will be released under the MIT license. The custom-vignette items are pending IRB review; the public-dataset items (MedQA, MedMCQA, PubMedQA) are already redistributable under their original terms. Email below for early access to the harness.
@misc{sami2026medreason,
author = {Sami, Abdullah Abdul},
title = {MedReason-Bench: An Open Benchmark for Clinical Reasoning in LLMs},
year = {2026},
url = {https://medreason-bench.netlify.app}
}