ClinicalCamel Benchmark
by Toma et al. / University of Toronto · open-source · Last verified 2026-03-17
ClinicalCamel Benchmark evaluates language models (both open-source and proprietary) on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on the safety, accuracy, and appropriateness of generated clinical advice.
https://github.com/bowang-lab/clinical-camel
Overall grade: C+ (Average)
Adoption: C+ · Quality: A · Freshness: B+ · Citations: B · Engagement: F
Specifications
- License
- CC BY-NC 4.0
- Pricing
- open-source
- Capabilities
- evaluation, clinical-dialogue, safety-evaluation
- Integrations
- Use Cases
- model-evaluation, clinical-nlp, ai-safety
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, llama-3-70b, meditron-70b
- Metrics
- clinical-accuracy, safety-rate, helpfulness
- Methodology
- A curated test set of 2,000 clinical instruction-following prompts is evaluated by physician raters on a 1–5 scale for accuracy, safety, and helpfulness; automatic metrics supplement the human evaluation.
- Last Run
- 2026-01-10
- Tags
- medical, clinical, instruction-following, open-source, safety
- Added
- 2026-03-17
- Completeness
- 100%
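The scoring described in the Methodology entry can be sketched as follows. This is a hypothetical illustration, assuming each prompt receives one physician rating per dimension on the 1–5 scale; the field names, the aggregation (simple means), and the safety threshold (ratings ≥ 4 counted as safe) are assumptions, not details published by the benchmark.

```python
# Hypothetical aggregation of physician ratings into the benchmark's
# reported metrics (clinical-accuracy, safety-rate, helpfulness).
from statistics import mean

def score_prompts(ratings):
    """ratings: one dict per prompt, e.g.
    {"accuracy": 4, "safety": 5, "helpfulness": 3} (1-5 scale)."""
    clinical_accuracy = mean(r["accuracy"] for r in ratings)
    helpfulness = mean(r["helpfulness"] for r in ratings)
    # Safety-rate as the fraction of responses rated safe;
    # the >= 4 threshold is an assumption.
    safety_rate = mean(1 if r["safety"] >= 4 else 0 for r in ratings)
    return {
        "clinical-accuracy": round(clinical_accuracy, 2),
        "safety-rate": round(safety_rate, 2),
        "helpfulness": round(helpfulness, 2),
    }

example = [
    {"accuracy": 4, "safety": 5, "helpfulness": 3},
    {"accuracy": 5, "safety": 3, "helpfulness": 4},
]
print(score_prompts(example))
```

In practice the benchmark averages over multiple physician raters per prompt; the sketch collapses that to one rating per dimension for brevity.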
Index Score: 55.9
- Adoption: 58
- Quality: 82
- Freshness: 72
- Citations: 65
- Engagement: 0
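The composite index can be read as a weighted mean of the component scores above. The actual weights are not published; with equal weights (an assumption) the components yield 55.4, slightly below the listed 55.9, so the real index evidently uses unequal weights.

```python
# Sketch of a composite index as a weighted mean of component scores.
# Equal weights are an assumption; the index's real weights are unknown.
def index_score(components, weights=None):
    if weights is None:
        weights = {k: 1.0 for k in components}  # assumed equal weighting
    total_weight = sum(weights[k] for k in components)
    weighted_sum = sum(components[k] * weights[k] for k in components)
    return round(weighted_sum / total_weight, 1)

components = {"adoption": 58, "quality": 82, "freshness": 72,
              "citations": 65, "engagement": 0}
print(index_score(components))  # → 55.4 with equal weights
```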