Evals & Testing

Score outputs, run regression suites and red-team agent behavior. 39 tools tracked.

IDE-style platform for prompt experimentation, evals and monitoring aimed at mixed technical/non-technical AI teams.

Evals & Testing · self-hostable · freemium

Builds dedicated evaluator/judge models (Selene) and agent error-analysis tooling rather than a full observability suite.

Evals & Testing · freemium

Testing and evaluation platform for LLM applications with human-in-the-loop review workflows.

Evals & Testing

Automated testing and monitoring for IVR, voice assistants and conversational AI systems.

Evals & Testing

End-to-end testing and monitoring platform for voice and chat AI agents with multilingual simulation.

Evals & Testing

Eval-first AI engineering platform with logging, datasets, an LLM proxy and a purpose-built trace database (Brainstore), aimed at production regression-catching.

Evals & Testing · self-hostable · freemium

Automated simulation testing and production monitoring for voice and chat AI agents.

Evals & Testing

Custom evaluation models that score accuracy and quality of enterprise LLM applications in production.

Evals & Testing

Pytest-style open-source LLM evaluation framework (DeepEval) with a hosted platform for benchmarking, regression testing and red-teaming (DeepTeam).

Evals & Testing · open source · self-hostable · freemium

Continuous validation suite from the ML-testing world extended to LLM apps, scoring outputs across versions from dev to production.

Evals & Testing · open source · self-hostable · freemium

Open-source evaluation and monitoring library (100+ metrics) spanning tabular ML drift and LLM judge-based checks, with a managed cloud.

Evals & Testing · open source · self-hostable · freemium

Evaluation and observability platform with a focus on voice-agent simulation and programmatic re-scoring of historical scenarios.

Evals & Testing · freemium

Evaluation and guardrails platform whose in-house Luna-2 small judge models target low-cost, low-latency scoring of agentic workloads.

Evals & Testing · self-hostable · freemium

Collaborative LLM testing and eval platform emphasizing UI-driven experiments shared between engineers and subject-matter experts.

Evals & Testing · self-hostable · freemium

Open-source testing framework that scans LLM apps for hallucination, injection and bias vulnerabilities, with a commercial evaluation hub.

Evals & Testing · open source · self-hostable · freemium

Experimental developer tool from Google Labs for LLM evaluation with human labeling and LLM-as-judge autoraters.

Evals & Testing · free

Automated testing for voice agents that places thousands of simulated phone calls and scores transcripts against rubrics.

Evals & Testing · paid

Government-built open-source framework for rigorous LLM and agent evaluations, popular for safety benchmarks and sandboxed agentic tasks.

Evals & Testing · open source · self-hostable · free

Open-source agent behavior monitoring and evaluation library feeding agent post-training (RL/SFT).

Evals & Testing · open source · self-hostable · freemium

Simulates multi-turn conversation flows to benchmark AI agents before deployment.

Evals & Testing

Open-source agent testing and observability built around the Scenario simulation framework, covering text, voice and adversarial tests.

Evals & Testing · open source · self-hostable · freemium

Independent AI evaluations lab publishing model benchmarks and comparison data.

Evals & Testing

End-to-end agent simulation, evaluation and observability platform pitched at cross-functional product and engineering teams.

Evals & Testing · self-hostable · freemium

Synthetic-user simulation that runs on every commit to catch regressions in agent tone, policy compliance, tool use and routing.

Evals & Testing · freemium

OpenAI's original open-source eval framework and registry; largely superseded by the hosted Evals API but still a reference implementation.

Evals & Testing · open source · self-hostable · free

AI evaluation and observability platform spanning development tests and production monitoring.

Evals & Testing

Evaluation API and research-driven judge models (e.g. Lynx, Glider) for hallucination detection plus domain benchmarks like FinanceBench.

Evals & Testing · freemium

Open-source automated alignment auditing tool that probes target models with multi-turn simulated scenarios.

Evals & Testing · open source · self-hostable · free

Config-file-driven open-source CLI for prompt evals, regression testing and LLM red-teaming that runs in CI.

Evals & Testing · open source · self-hostable · freemium

Evaluation and monitoring platform for detecting hallucinations and failures in AI agents.

Evals & Testing

Agent testing, evaluation and tracing platform with the open-source Catalyst SDK.

Evals & Testing · open source · self-hostable · freemium

The de facto open-source metric library for RAG evaluation (faithfulness, context precision/recall), used standalone or inside other platforms.

Evals & Testing · open source · self-hostable · free

Continuous evaluation platform providing fast feedback loops for testing and improving AI agents.

Evals & Testing

Open-source agentic end-to-end testing framework covering web, API and voice agent testing.

Evals & Testing · open source · self-hostable · freemium

TruLens (acquired)

Open-source library for feedback-function-based evaluation and tracing of RAG and agent apps, now stewarded by Snowflake.

Evals & Testing · open source · self-hostable · free

Industry-specific LLM benchmarks and enterprise evaluation for legal, tax and finance tasks.

Evals & Testing

Auto-optimizer for AI agents using calibrated LLM judges and automatic evaluations.

Evals & Testing