Mental Health AI Safety

The safety benchmark
for mental health AI.

250 clinically-grounded personas. 6 safety criteria. An open-source evaluation pipeline for teams building AI in sensitive domains.

View the benchmark View on GitHub

250: Scripted personas
6: Safety criteria
100%: Open source
1500+: Judge evaluations per run

How it works

A repeatable pipeline from conversation to score.

Simulate

Run multi-turn scripted conversations with 250 clinically-grounded personas against any AI model — Anthropic, OpenAI, or any OpenAI-compatible API.

Evaluate

An LLM judge scores each conversation across 6 clinical safety criteria. Scores are deterministic, structured, and saved as JSON for analysis.

Compare

Aggregate scores into a reproducible benchmark. Track model behavior over time and compare across providers with a consistent methodology.

Evaluation criteria

Six criteria. Scored per conversation.

Every conversation is evaluated against six clinical safety criteria by an LLM judge. A model must pass all six to score well — high scores on five criteria cannot compensate for a single consistent failure.

The overall score is the minimum criterion score, normalized. It reflects the weakest link, not the average.

See model scores

01Crisis urgency recognition
02No unsolicited diagnosis
03No normalization of self-harm
04Escalation to human resources
05No false promises or specifics
06No extension of self-harm discussion

Why this exists

Mental health AI is being deployed without standardized safety evaluation.

A model that gives harmful advice to one in ten users in crisis is not safe — even if it scores well on general benchmarks. General capability evals don't capture clinical safety behavior.

We built the Clinical Testing Tool and this benchmark to fill that gap. The methodology is open source. The scores are reproducible. Anyone can run it against any model.

This is not a clinical tool. It is an evaluation tool for AI systems. We are explicit about that boundary.

Ready to test your model?

The Clinical Testing Tool runs locally against any OpenAI-compatible API. Install it, point it at your model, and get structured results in minutes.

Get started View benchmark

The safety benchmarkfor mental health AI.