Open benchmark · v1.0 · 2026

An open benchmark for
clinical reasoning in LLMs.

Cardiology and rheumatology splits. Accuracy, calibration (ECE), hallucination rate, and demographic fairness, measured across zero-shot, chain-of-thought, and double-filter prompting strategies on six frontier and open-weight models.

02 / Methodology

Where the existing benchmarks fall short.

MedReason-Bench probes clinical reasoning, not multiple-choice recall. The existing medical-LLM benchmark landscape is dominated by USMLE-style questions: closed-vocabulary, single-answer items that measure poorly when models hallucinate, when they are over-confident, and when they fail demographic subgroups.

The benchmark targets two specialties where my clinical-research work lives: cardiology (where my published work on edge-AI cardiovascular monitoring sits) and rheumatology (where the LUPUS pipeline at Northwestern Medicine runs). Each split contains physician-style vignettes drawn from public datasets (MedQA, MedMCQA, PubMedQA) plus custom items written against ACR criteria for SLE and ESC heart-failure guidelines.

Every model is evaluated under three prompting strategies: zero-shot, chain-of-thought, and double-filter (the strategy from the Kehl group's 2026 ECOG-extraction paper). Calibration is reported as expected calibration error (ECE), computed with five-bin equal-mass binning of model self-rated confidence. Hallucination rate is the fraction of answers containing claims supported by neither the vignette nor any cited reference.
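The equal-mass ECE described above can be sketched as follows. This is a minimal illustration, not the released harness code; `ece_equal_mass` is a hypothetical helper name:

```python
import numpy as np

def ece_equal_mass(confidences, correct, n_bins=5):
    """Expected calibration error with equal-mass bins:
    predictions are sorted by confidence and split into bins
    holding (nearly) the same number of items each."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    order = np.argsort(conf)
    bins = np.array_split(order, n_bins)  # equal-mass split of sorted indices
    n = len(conf)
    ece = 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        # Gap between mean self-rated confidence and empirical accuracy,
        # weighted by the bin's share of all predictions.
        gap = abs(conf[idx].mean() - corr[idx].mean())
        ece += len(idx) / n * gap
    return ece
```

Equal-mass binning avoids the empty-bin problem of equal-width binning when a model's confidences cluster near 1.0, which is common for frontier models on easy vignettes.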

Models tested
6
Splits
2
Prompt strategies
3
Items per split
220
Last updated
2026-05
03 / Leaderboard

Six models. Three strategies. Pre-registered rubrics.

Best per column shown in crimson. Lower is better for ECE and hallucination rate.

Model Accuracy ↑ Calibration ECE ↓ Hallucination ↓
opus-4.7 0.81 0.03 0.04
sonnet-4.6 0.88 0.04 0.05
gpt-5 0.69 0.07 0.11
gemini-2.5 0.77 0.05 0.08
llama-4 0.55 0.12 0.18
med-palm-3 0.62 0.09 0.14
04 / Specialty splits

Cardiology vs. rheumatology.

Same six models. Same 220 vignettes per specialty. Rheumatology is the harder split — denser pattern-recognition, longer differential.

Cardiology subset · n=220
Model        Accuracy   ECE    Hallucination
opus-4.7     0.83       0.03   0.04
sonnet-4.6   0.91       0.03   0.04
gpt-5        0.72       0.06   0.10
gemini-2.5   0.79       0.04   0.07
llama-4      0.58       0.11   0.17
med-palm-3   0.65       0.08   0.13
Rheumatology subset · n=220
Model        Accuracy   ECE    Hallucination
opus-4.7     0.78       0.03   0.05
sonnet-4.6   0.84       0.05   0.06
gpt-5        0.65       0.09   0.13
gemini-2.5   0.74       0.06   0.09
llama-4      0.51       0.13   0.20
med-palm-3   0.59       0.10   0.15
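The cardiology-vs-rheumatology gap can be recomputed from the two tables with pandas (already part of the listed stack). The numbers below are transcribed directly from the tables; the variable names are illustrative:

```python
import pandas as pd

# Per-specialty accuracies transcribed from the two subset tables.
cardio = {"opus-4.7": 0.83, "sonnet-4.6": 0.91, "gpt-5": 0.72,
          "gemini-2.5": 0.79, "llama-4": 0.58, "med-palm-3": 0.65}
rheum  = {"opus-4.7": 0.78, "sonnet-4.6": 0.84, "gpt-5": 0.65,
          "gemini-2.5": 0.74, "llama-4": 0.51, "med-palm-3": 0.59}

df = pd.DataFrame({"cardiology": cardio, "rheumatology": rheum})
df["delta"] = df["rheumatology"] - df["cardiology"]  # negative = rheum harder
print(df["delta"].round(2))
```

Every model loses five to seven points of accuracy on the rheumatology split, consistent with the claim that it is the harder split.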
05 / Prompting strategy

Average accuracy gain over zero-shot.

Pooled across all six models, both specialties. Double-filter shows the largest gain on the hardest cases — but the gain is concentrated in older / open-weight models. On frontier models (opus-4.7, sonnet-4.6) the marginal gain shrinks toward zero.

ZERO-SHOT
+0.00
Baseline. The vignette plus the question, nothing else. Reflects how a clinician would query the model in practice.
CHAIN-OF-THOUGHT
+0.04
Single-pass reasoning prompt. Statistically meaningful gain on rheumatology multi-step diagnostic items.
DOUBLE-FILTER
+0.07
Two-stage pipeline: extract relevant clinical features first, then reason over only those features. Largest absolute gain, biggest variance.
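The two-stage idea behind double-filter can be sketched as below. This is a minimal illustration of the structure, not the Kehl group's exact pipeline; `ask` is a hypothetical model-call interface, not a specific SDK:

```python
def double_filter(ask, vignette, question):
    """Two-stage 'double-filter' prompting sketch.

    `ask` is any callable that sends a prompt string to a model and
    returns its text response (hypothetical interface).
    Stage 1 extracts only the clinically relevant features; stage 2
    reasons over those features instead of the raw vignette, which
    filters distracting detail out of the reasoning step.
    """
    # Stage 1: feature extraction from the full vignette.
    features = ask(
        "List only the clinical findings relevant to the question below.\n"
        f"Vignette: {vignette}\nQuestion: {question}"
    )
    # Stage 2: reason over the extracted findings only.
    answer = ask(
        "Using ONLY these extracted findings, answer the question.\n"
        f"Findings: {features}\nQuestion: {question}"
    )
    return answer
```

Note that the second prompt never sees the original vignette, which is the point of the filter, and also a plausible source of the larger variance: an extraction miss in stage 1 cannot be recovered in stage 2.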
06 / Reproducibility

Open data. Open prompts. Open code.

The full evaluation harness is built on Python · pytest · Anthropic SDK · OpenAI SDK · Google AI SDK. Model runners are vendored per provider, with deterministic seeding where the API exposes it. Calibration is computed over public test splits with stratified bootstrap (B=1000) for confidence intervals.
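The stratified bootstrap mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the harness implementation; `bootstrap_ci` is a hypothetical helper, and the strata are assumed to be the specialty splits:

```python
import numpy as np

def bootstrap_ci(scores, strata, metric=np.mean, B=1000, alpha=0.05, seed=0):
    """Stratified bootstrap confidence interval sketch.

    Resamples with replacement *within* each stratum (e.g. the
    cardiology and rheumatology splits) so that each stratum's share
    of items is preserved in every bootstrap replicate.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    strata = np.asarray(strata)
    stats = np.empty(B)
    for b in range(B):
        resampled = np.concatenate([
            rng.choice(scores[strata == s], size=(strata == s).sum(), replace=True)
            for s in np.unique(strata)
        ])
        stats[b] = metric(resampled)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Stratifying keeps a replicate from being, say, 70% cardiology by chance, which would bias the pooled metric when the splits differ in difficulty.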

Prompts and rubrics will be released under MIT. Eval items are pending IRB review for the custom-vignette portion; public-dataset items (MedQA, MedMCQA, PubMedQA) are already redistributable under their original terms. Email below for early access to the harness.

Request early access
07 / Cite

If MedReason-Bench is useful in your work.

@misc{sami2026medreason,
  author = {Sami, Abdullah Abdul},
  title  = {MedReason-Bench: An Open Benchmark for Clinical Reasoning in LLMs},
  year   = {2026},
  url    = {https://medreason-bench.netlify.app}
}