IDE-style platform for prompt experimentation, evals and monitoring aimed at mixed technical/non-technical AI teams.
Evals & Testing
Score outputs, run regression suites and red-team agent behavior. 39 tools tracked.
Builds dedicated evaluator/judge models (Selene) and agent error-analysis tooling rather than a full observability suite.
Testing and evaluation platform for LLM applications with human-in-the-loop review workflows.
Automated testing and monitoring for IVR, voice assistants and conversational AI systems.
End-to-end testing and monitoring platform for voice and chat AI agents with multilingual simulation.
Eval-first AI engineering platform with logging, datasets, an LLM proxy and a purpose-built trace database (Brainstore), aimed at production regression-catching.
Automated simulation testing and production monitoring for voice and chat AI agents.
Custom evaluation models that score accuracy and quality of enterprise LLM applications in production.
Pytest-style open-source LLM evaluation framework (DeepEval) with a hosted platform for benchmarking, regression testing and red-teaming (DeepTeam).
Continuous validation suite that originated in ML testing and now extends to LLM apps, scoring outputs across versions from development to production.
Open-source evaluation and monitoring library (100+ metrics) spanning tabular ML drift and LLM judge-based checks, with a managed cloud.
Evaluation and observability platform with a focus on voice-agent simulation and programmatic re-scoring of historical scenarios.
Evaluation and guardrails platform whose in-house Luna-2 small judge models target low-cost, low-latency scoring of agentic workloads.
Collaborative LLM testing and eval platform emphasizing UI-driven experiments shared between engineers and subject-matter experts.
Open-source testing framework that scans LLM apps for hallucination, injection and bias vulnerabilities, with a commercial evaluation hub.
Experimental developer tool from Google Labs for LLM evaluation with human labeling and LLM-as-judge autoraters.
Automated testing for voice agents that places thousands of simulated phone calls and scores transcripts against rubrics.
Government-built open-source framework for rigorous LLM and agent evaluations, popular for safety benchmarks and sandboxed agentic tasks.
Open-source agent behavior monitoring and evaluation library whose results feed into agent post-training (RL/SFT).
Simulates multi-turn conversation flows to benchmark AI agents before deployment.
Open-source agent testing and observability built around the Scenario simulation framework, covering text, voice and adversarial tests.
Independent AI evaluations lab publishing model benchmarks and comparison data.
End-to-end agent simulation, evaluation and observability platform pitched at cross-functional product and engineering teams.
Synthetic-user simulation that runs on every commit to catch regressions in agent tone, policy compliance, tool use and routing.
OpenAI's original open-source eval framework and registry; largely superseded by the hosted Evals API but still a reference implementation.
AI evaluation and observability platform spanning development tests and production monitoring.
Evaluation API and research-driven judge models (e.g. Lynx, Glider) for hallucination detection plus domain benchmarks like FinanceBench.
Open-source automated alignment auditing tool that probes target models with multi-turn simulated scenarios.
Config-file-driven open-source CLI for prompt evals, regression testing and LLM red-teaming that runs in CI.
Evaluation and monitoring platform for detecting hallucinations and failures in AI agents.
Agent testing, evaluation and tracing platform with the open-source Catalyst SDK.
The de facto open-source metric library for RAG evaluation (faithfulness, context precision/recall), used standalone or inside other platforms.
Continuous evaluation platform providing fast feedback loops for testing and improving AI agents.
Open-source agentic end-to-end testing framework covering web, API and voice agent testing.
AI-assisted workspace (doteval) for writing and managing LLM evaluations.
TruLens
Open-source library for feedback-function-based evaluation and tracing of RAG and agent apps, now stewarded by Snowflake.
Industry-specific LLM benchmarks and enterprise evaluation for legal, tax and finance tasks.
Managed evaluation service on Vertex AI for scoring models and agents with autoraters and custom metrics.
Auto-optimizer for AI agents using calibrated LLM judges and automatic evaluations.