BenchmarkLLMs v1.0

ClinicalCamel Benchmark

by Toma et al. / University of Toronto · open-source · Last verified 2026-03-17

ClinicalCamel Benchmark evaluates open-source language models on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on safety, accuracy, and appropriateness of clinical advice generation.

https://github.com/bowang-lab/clinical-camel
Index Grade: C+ (Average)
Adoption: C+ · Quality: A · Freshness: B+ · Citations: B · Engagement: F

Specifications

License
CC BY-NC 4.0
Pricing
open-source
Capabilities
evaluation, clinical-dialogue, safety-evaluation
Integrations
None listed
model-evaluation, clinical-nlp, ai-safety
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, llama-3-70b, meditron-70b
Metrics
clinical-accuracy, safety-rate, helpfulness
Methodology
A curated test set of 2,000 clinical instruction-following prompts, each scored by physician raters on a 1–5 scale for accuracy, safety, and helpfulness. Automatic metrics supplement the human evaluation.
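The methodology above can be sketched as a simple aggregation step. This is a minimal illustration, not the benchmark's actual code: the record format, field names, and the safety threshold are all assumptions, and the real `clinical-camel` repository may aggregate ratings differently.

```python
from statistics import mean

# Hypothetical physician ratings: one record per prompt, each dimension
# scored 1-5 (field names and values are illustrative assumptions).
ratings = [
    {"accuracy": 5, "safety": 5, "helpfulness": 4},
    {"accuracy": 3, "safety": 4, "helpfulness": 3},
    {"accuracy": 4, "safety": 2, "helpfulness": 4},
]

def aggregate(records, safety_threshold=4):
    """Average the 1-5 accuracy and helpfulness scores, and report the
    fraction of responses rated at or above the safety threshold
    (the threshold value is an assumption, not from the benchmark)."""
    return {
        "clinical-accuracy": mean(r["accuracy"] for r in records),
        "helpfulness": mean(r["helpfulness"] for r in records),
        "safety-rate": sum(r["safety"] >= safety_threshold for r in records)
        / len(records),
    }

print(aggregate(ratings))
```

In a real run, each of the 2,000 prompts would contribute one record per rater, so a production version would likely also average across raters before aggregating across prompts.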
Last Run
2026-01-10
Tags
medical, clinical, instruction-following, open-source, safety
Added
2026-03-17
Completeness
100%

Index Score

55.9
Adoption
58
Quality
82
Freshness
72
Citations
65
Engagement
0
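The index score is presumably some composite of the five component scores, but the page does not disclose the formula: an unweighted mean of the components shown above is (58 + 82 + 72 + 65 + 0) / 5 = 55.4, close to but not equal to the published 55.9. The sketch below shows a generic weighted average; the weights are hypothetical and are not claimed to reproduce 55.9.

```python
# Component scores copied from the card above.
components = {"adoption": 58, "quality": 82, "freshness": 72,
              "citations": 65, "engagement": 0}

# Hypothetical weights -- the site's actual weighting is not published.
weights = {"adoption": 0.30, "quality": 0.25, "freshness": 0.20,
           "citations": 0.15, "engagement": 0.10}

def index_score(scores, weights):
    """Weighted average of 0-100 component scores; weights should sum to 1."""
    return sum(scores[k] * weights[k] for k in scores)

print(round(index_score(components, weights), 1))
```

With equal weights (0.2 each) this reduces to the unweighted mean of 55.4, confirming that whatever formula the site uses is not a plain average.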
