Evals & Testing

Score outputs, run regression suites and red-team agent behavior. 39 tools tracked.

Athina AI

IDE-style platform for prompt experimentation, evals and monitoring aimed at mixed technical/non-technical AI teams.

Evals & Testing · self-hostable · freemium

Atla

Builds dedicated evaluator/judge models (Selene) and agent error-analysis tooling rather than a full observability suite.

Evals & Testing · freemium

Autoblocks

Testing and evaluation platform for LLM applications with human-in-the-loop review workflows.

Evals & Testing

Bespoken

Automated testing and monitoring for IVR, voice assistants and conversational AI systems.

Evals & Testing

Bluejay

End-to-end testing and monitoring platform for voice and chat AI agents with multilingual simulation.

Evals & Testing

Braintrust

Eval-first AI engineering platform with logging, datasets, an LLM proxy and a purpose-built trace database (Brainstore), aimed at production regression-catching.

Evals & Testing · self-hostable · freemium

Cekura

Automated simulation testing and production monitoring for voice and chat AI agents.

Evals & Testing

Composo

Custom evaluation models that score accuracy and quality of enterprise LLM applications in production.

Evals & Testing

Confident AI (DeepEval)

Pytest-style open-source LLM evaluation framework (DeepEval) with a hosted platform for benchmarking, regression testing and red-teaming (DeepTeam).

Evals & Testing · open source · self-hostable · freemium

Deepchecks

Continuous validation suite from the ML-testing world extended to LLM apps, scoring outputs across versions from dev to production.

Evals & Testing · open source · self-hostable · freemium

Evidently AI

Open-source evaluation and monitoring library (100+ metrics) spanning tabular ML drift and LLM judge-based checks, with a managed cloud.

Evals & Testing · open source · self-hostable · freemium

Future AGI

Evaluation and observability platform with a focus on voice-agent simulation and programmatic re-scoring of historical scenarios.

Evals & Testing · freemium

Galileo

Evaluation and guardrails platform whose in-house Luna-2 small judge models target low-cost, low-latency scoring of agentic workloads.

Evals & Testing · self-hostable · freemium

Gentrace

Collaborative LLM testing and eval platform emphasizing UI-driven experiments shared between engineers and subject-matter experts.

Evals & Testing · self-hostable · freemium

Giskard

Open-source testing framework that scans LLM apps for hallucination, injection and bias vulnerabilities, with a commercial evaluation hub.

Evals & Testing · open source · self-hostable · freemium

Google Stax

Experimental developer tool from Google Labs for LLM evaluation with human labeling and LLM-as-judge autoraters.

Evals & Testing · free

Hamming AI

Automated testing for voice agents that places thousands of simulated phone calls and scores transcripts against rubrics.

Evals & Testing · paid

Inspect AI

Open-source framework built by the UK AI Security Institute (AISI) for rigorous LLM and agent evaluations, popular for safety benchmarks and sandboxed agentic tasks.

Evals & Testing · open source · self-hostable · free

Judgment Labs (judgeval)

Open-source library for monitoring and evaluating agent behavior, with results that feed agent post-training (RL/SFT).

Evals & Testing · open source · self-hostable · freemium

Kashikoi

Simulates multi-turn conversation flows to benchmark AI agents before deployment.

Evals & Testing

LangWatch

Open-source agent testing and observability built around the Scenario simulation framework, covering text, voice and adversarial tests.

Evals & Testing · open source · self-hostable · freemium

LLM Stats

Independent AI evaluations lab publishing model benchmarks and comparison data.

Evals & Testing

Maxim AI

End-to-end agent simulation, evaluation and observability platform pitched at cross-functional product and engineering teams.

Evals & Testing · self-hostable · freemium

Okareo

Synthetic-user simulation that runs on every commit to catch regressions in agent tone, policy compliance, tool use and routing.

Evals & Testing · freemium

OpenAI Evals

OpenAI's original open-source eval framework and registry; largely superseded by the hosted Evals API but still a reference implementation.

Evals & Testing · open source · self-hostable · free

Openlayer

AI evaluation and observability platform spanning development tests and production monitoring.

Evals & Testing

Patronus AI

Evaluation API and research-driven judge models (e.g. Lynx, Glider) for hallucination detection plus domain benchmarks like FinanceBench.

Evals & Testing · freemium

Petri

Open-source automated alignment auditing tool that probes target models with multi-turn simulated scenarios.

Evals & Testing · open source · self-hostable · free

Promptfoo

Config-file-driven open-source CLI for prompt evals, regression testing and LLM red-teaming that runs in CI.

Evals & Testing · open source · self-hostable · freemium

Quotient AI

Evaluation and monitoring platform for detecting hallucinations and failures in AI agents.

Evals & Testing

RagaAI (Catalyst)

Agent testing, evaluation and tracing platform with the open-source Catalyst SDK.

Evals & Testing · open source · self-hostable · freemium

Ragas

The de facto open-source metric library for RAG evaluation (faithfulness, context precision/recall), used standalone or inside other platforms.

Evals & Testing · open source · self-hostable · free

Scorecard

Continuous evaluation platform providing fast feedback loops for testing and improving AI agents.

Evals & Testing

TestZeus (Hercules)

Open-source agentic end-to-end testing framework covering web, API and voice agent testing.

Evals & Testing · open source · self-hostable · freemium

The LLM Data Company (doteval)

AI-assisted workspace (doteval) for writing and managing LLM evaluations.

Evals & Testing

TruLens

Open-source library for feedback-function-based evaluation and tracing of RAG and agent apps, now stewarded by Snowflake.

Evals & Testing · open source · self-hostable · free · acquired

Vals AI

Vertex AI Gen AI Evaluation Service

ZeroEval