Every AI agent observability, evals, guardrails & cost tool — compared by a neutral third party. 116 tools tracked across tracing, evals, guardrails, prompt management, cost and debugging — with licensing, self-hosting and pricing-model facts checked against primary sources. Built by an engineer who runs agent fleets in production, not by a vendor marketing team.
Open-source prompt playground, registry and evaluation platform covering the prompt lifecycle from experimentation to deployment.
Session-replay style observability for AI agents with time-travel debugging, cost tracking and first-class integrations into agent frameworks like CrewAI and AutoGen.
AWS-managed trace-level observability for agents via CloudWatch generative AI observability and OTel.
AI monetization platform metering token usage across providers with native credit-based billing.
Enterprise AI observability and evaluation platform extending Arize's ML-monitoring heritage to LLM and agent workloads at scale.
Open-source, OpenTelemetry-based tracing and evaluation library that runs locally or self-hosted, serving as the OSS on-ramp to Arize's enterprise platform.
Model monitoring and firewall (Arthur Shield) for enterprise AI, focused on risk, bias and policy enforcement.
IDE-style platform for prompt experimentation, evals and monitoring aimed at mixed technical/non-technical AI teams.
Builds dedicated evaluator/judge models (Selene) and agent error-analysis tooling rather than a full observability suite.
Testing and evaluation platform for LLM applications with human-in-the-loop review workflows.
Built-in evaluation, tracing and monitoring for models and agents inside Azure AI Foundry.
Automated testing and monitoring for IVR, voice assistants and conversational AI systems.
End-to-end testing and monitoring platform for voice and chat AI agents with multilingual simulation.
Eval-first AI engineering platform with logging, datasets, an LLM proxy and a purpose-built trace database (Brainstore), aimed at production regression-catching.
Inference-perimeter security platform with scanners and red-team agents guarding enterprise model traffic.
Automated simulation testing and production monitoring for voice and chat AI agents.
AI model validation and runtime guardrails productized from Robust Intelligence inside Cisco's security stack.
AI unit-economics platform mapping LLM and GPU spend to cost per feature, per customer and per deployment.
Custom evaluation models that score accuracy and quality of enterprise LLM applications in production.
Pytest-style open-source LLM evaluation framework (DeepEval) with a hosted platform for benchmarking, regression testing and red-teaming (DeepTeam).
Aporia's drift detection and AI guardrails folded into the Coralogix observability platform as its AI research arm.
Simulation-first testing and replay for voice/chat agents, borrowing evaluation methodology from autonomous-vehicle testing.
LLM and agent tracing inside the Datadog APM suite, attractive to teams already standardized on Datadog for infrastructure monitoring.
Continuous validation suite from the ML-testing world extended to LLM apps, scoring outputs across versions from dev to production.
GenAI and LLM monitoring within the Dynatrace APM platform, covering tokens, cost and service health for enterprises already on Dynatrace.
Open-source evaluation and monitoring library (100+ metrics) spanning tabular ML drift and LLM judge-based checks, with a managed cloud.
Enterprise AI observability vendor from the ML-monitoring era, now offering LLM scoring, guardrails and bias/fairness auditing.
FinOps platform whose MegaBill model folds LLM/API spend into the same cost-allocation views as cloud infrastructure.
Prompt management, testing and human-review workflows aimed at cross-functional product teams shipping LLM features.
Evaluation and observability platform with a focus on voice-agent simulation and programmatic re-scoring of historical scenarios.
Evaluation and guardrails platform whose in-house Luna-2 small judge models target low-cost, low-latency scoring of agentic workloads.
Open-source LLM vulnerability scanner that runs pre-built probes for jailbreaks, leakage and injection.
AI red teaming and safety testing platform producing adversarial test suites for LLM applications.
Collaborative LLM testing and eval platform emphasizing UI-driven experiments shared between engineers and subject-matter experts.
Open-source testing framework that scans LLM apps for hallucination, injection and bias vulnerabilities, with a commercial evaluation hub.
Experimental developer tool from Google Labs for LLM evaluation with human labeling and LLM-as-judge autoraters.
OTel-based GenAI observability solution on Grafana Cloud, built on open-source instrumentation rather than a proprietary SDK.
Open-source output-validation framework where composable validators enforce schemas, policies and safety constraints on LLM I/O.
Automated red-teaming ('haizing') that stress-tests LLM systems to find jailbreaks and failure modes before deployment.
Automated testing for voice agents that places thousands of simulated phone calls and scores transcripts against rubrics.
Proxy/gateway-based LLM logging with one-line setup and unified cost and latency visibility across providers; now under Mintlify ownership.
Evaluation and observability platform for production agents with prompt versioning, A/B tests and OTel-based tracing.
Former prompt management and evaluation platform; shut down September 2025 with official migration paths to W&B, PromptLayer and Agenta.
Government-built open-source framework for rigorous LLM and agent evaluations, popular for safety benchmarks and sandboxed agentic tasks.
Agent trace analysis and security scanning (incl. MCP tool-poisoning research), with an Explorer UI for debugging agent runs.
Open-source agent behavior monitoring and evaluation library feeding agent post-training (RL/SFT).
Simulates multi-turn conversation flows to benchmark AI agents before deployment.
Open-source usage-based metering and billing engine used for AI token and credit pricing.
Low-latency API guarding against prompt injection, data leakage and toxic content, backed by the Gandalf attack dataset.
Open-source observability for long-running AI agents that captures LLM calls, tool use and browser actions for step-level debugging and replay.
Open-source (MIT) LLM engineering platform combining tracing, prompt management, evals and datasets, widely used as the default self-hosted observability stack.
Closed-source tracing, evals and monitoring platform from the LangChain team, most deeply integrated with LangChain/LangGraph but usable via OTel from any stack.
Spreadsheet-like prompt testing and deployment studio with assertions and security guardrails for smaller teams.
OpenTelemetry-native open-source tracing and metrics for LLM apps and agent frameworks, with a managed cloud option.
Open-source agent testing and observability built around the Scenario simulation framework, covering text, voice and adversarial tests.
Open-source prompt engineering platform with versioning, evals and agent-running infrastructure (PromptL).
Open-source LLM proxy/SDK that normalizes 100+ providers behind the OpenAI format with per-key budgets, spend tracking and rate limits.
Open-source input/output scanner toolkit (35+ scanners) for PII, injection and toxicity checks on LLM traffic.
Independent AI evaluations lab publishing model benchmarks and comparison data.
Lightweight open-source LLM observability with tracing, analytics, prompt templates and PII masking, formerly known as LLMonitor.
End-to-end agent simulation, evaluation and observability platform pitched at cross-functional product and engineering teams.
Enterprise usage metering and billing platform powering token-based pricing for major AI companies.
Automated AI red teaming platform testing LLMs, agents and multimodal models against MITRE ATLAS / OWASP-aligned attacks.
The MLOps standard's GenAI extension: trace logging, LLM evaluation and prompt registry inside open-source MLflow 3.
AI/LLM monitoring layer in the New Relic APM platform tracking model latency, token cost and errors alongside conventional app telemetry.
Programmable conversational guardrails toolkit using the Colang DSL, covering input, dialog, retrieval, execution and output rails.
Synthetic-user simulation that runs on every commit to catch regressions in agent tone, policy compliance, tool use and routing.
OpenAI's original open-source eval framework and registry; largely superseded by the hosted Evals API but still a reference implementation.
AI evaluation and observability platform spanning development tests and production monitoring.
OpenTelemetry-native open-source platform covering LLM tracing, GPU monitoring, guardrails and a prompt vault with one-line auto-instrumentation of 50+ providers.
Open-source usage metering for AI/API products, commonly used to meter tokens for billing and internal chargeback.
Open-source LLM evaluation and tracing platform from Comet, combining trace logging, eval metrics and CI-friendly test suites.
Usage-based billing engine handling high-volume metering for AI and token-priced products.
Generative AI collaboration platform bundling prompt management, deployments, evals and observability for SaaS teams.
Evaluation API and research-driven judge models (e.g. Lynx, Glider) for hallucination detection plus domain benchmarks like FinanceBench.
GenAI-specific FinOps tracking cost per request, per feature and per customer to give product teams real unit economics.
Open-source automated alignment auditing tool that probes target models with multi-turn simulated scenarios.
Open-source text analytics on LLM app messages, clustering and scoring conversations to surface what users actually do.
AI security platform covering discovery, red teaming and runtime protection across the AI lifecycle.
AI gateway routing 1,600+ models with built-in logging, cost tracking, caching and guardrails; observability comes as a side effect of the proxy layer.
LLM cost, latency and trace analytics bolted onto PostHog's product-analytics platform, letting teams join AI telemetry with user behavior data.
Enterprise GenAI security platform monitoring employee and application LLM usage for injection, leakage and shadow AI.
Config-file-driven open-source CLI for prompt evals, regression testing and LLM red-teaming that runs in CI.
Git-style prompt version control with branch/commit/merge semantics, runtime REST retrieval and CI/CD quality gates.
Prompt CMS with visual versioning, release labels and A/B testing, positioned so non-engineers can edit and deploy prompts independently.
OpenTelemetry-based observability service from the Pydantic team with first-class PydanticAI and Python ecosystem integration.
Python Risk Identification Toolkit automating single- and multi-turn adversarial probing of GenAI systems.
Evaluation and monitoring platform for detecting hallucinations and failures in AI agents.
Agent testing, evaluation and tracing platform with the open-source Catalyst SDK.
The de-facto open-source metric library for RAG evaluation (faithfulness, context precision/recall), used standalone or inside other platforms.
AI red teaming platform whose ARTEMIS engine automates adversarial testing of LLM apps and agents.
Unified gateway-plus-observability control plane for tracing and evaluating agent behavior, rebranded from Keywords AI.
Production observability for voice agents that captures real calls and converts failures into test cases.
Continuous evaluation platform providing fast feedback loops for testing and improving AI agents.
Agent-call tracing and error monitoring inside Sentry, giving app developers LLM visibility in the tool they already use for crash reporting.
Open-source OpenTelemetry APM that handles LLM observability via standard OTel instrumentation rather than an LLM-specific SDK.
Automated AI security testing and red teaming for AI assistants and agents from build to runtime.
AI-native security platform with red teaming and runtime guardrails for agentic applications.
Open-source LLMOps stack unifying gateway, observability, evaluations, optimization and experimentation.
Open-source agentic end-to-end testing framework covering web, API and voice agent testing.
AI-assisted workspace (doteval) for writing and managing LLM evaluations.
Lightweight open-source library maintaining an up-to-date price table for estimating prompt/completion costs across 400+ models.
Vendor-neutral OpenTelemetry instrumentation for LLM apps (OpenLLMetry) that ships traces to any OTel backend, plus a hosted monitoring platform.
Autonomous LLM-Ops engineer that traces, debugs and optimizes LLM pipelines.
User analytics and feedback tracking for LLM applications to surface real usage patterns.
AI gateway and deployment platform with built-in request logging, cost attribution and rate limiting for enterprise Kubernetes environments.
Open-source library for feedback-function-based evaluation and tracing of RAG and agent apps, now stewarded by Snowflake.
Industry-specific LLM benchmarks and enterprise evaluation for legal, tax and finance tasks.
Cloud cost platform with native token-level ingest for OpenAI/Anthropic and an MCP server for querying AI spend from coding assistants.
Low-code platform for prompts, workflows, evaluations and deployments with environment management for product teams.
Managed evaluation service on Vertex AI for scoring models and agents with autoraters and custom metrics.
Agent trust platform combining automated evaluation, red teaming and runtime defenses for AI agents.
LLM tracing and evaluation toolkit from Weights & Biases, integrated with the broader W&B experiment-tracking ecosystem.
Profile-based monitoring (whylogs) plus LangKit LLM metrics that summarize data locally so raw prompts never leave your infrastructure.
Security and governance platform for enterprise AI agents and low-code copilots, including agent observability.
Auto-optimizer for AI agents using calibrated LLM judges and automatic evaluations.