Agent Observability Index (2026) — 116 LLM observability, evals & guardrails tools compared

Agenta

Open-source prompt playground, registry and evaluation platform covering the prompt lifecycle from experimentation to deployment.

Prompt Management · open source · self-hostable · freemium

AgentOps

Session-replay style observability for AI agents with time-travel debugging, cost tracking and first-class integrations into agent frameworks like CrewAI and AutoGen.

Agent Debugging & Replay · open source · freemium

Amazon Bedrock AgentCore Observability

AWS-managed trace-level observability for agents via CloudWatch generative AI observability and OTEL.

Observability & Tracing · paid

Amberflo

AI monetization platform metering token usage across providers with native credit-based billing.

Cost & FinOps

Arize AX

Enterprise AI observability and evaluation platform extending Arize's ML-monitoring heritage to LLM and agent workloads at scale.

Observability & Tracing · self-hostable · freemium

Arize Phoenix

Open-source, OpenTelemetry-based tracing and evaluation library that runs locally or self-hosted, serving as the OSS on-ramp to Arize's enterprise platform.

Observability & Tracing · open source · self-hostable · free

Arthur

Model monitoring and firewall (Arthur Shield) for enterprise AI, focused on risk, bias and policy enforcement.

Guardrails & Safety · open source · self-hostable · freemium

Athina AI

IDE-style platform for prompt experimentation, evals and monitoring aimed at mixed technical/non-technical AI teams.

Evals & Testing · self-hostable · freemium

Atla

Builds dedicated evaluator/judge models (Selene) and agent error-analysis tooling rather than a full observability suite.

Evals & Testing · freemium

Autoblocks

Testing and evaluation platform for LLM applications with human-in-the-loop review workflows.

Evals & Testing

Azure AI Foundry Observability

Built-in evaluation, tracing and monitoring for models and agents inside Azure AI Foundry.

Observability & Tracing · paid

Bespoken

Automated testing and monitoring for IVR, voice assistants and conversational AI systems.

Evals & Testing

Bluejay

End-to-end testing and monitoring platform for voice and chat AI agents with multilingual simulation.

Evals & Testing

Braintrust

Eval-first AI engineering platform with logging, datasets, an LLM proxy and a purpose-built trace database (Brainstore), aimed at production regression-catching.

Evals & Testing · self-hostable · freemium

CalypsoAI

Inference-perimeter security platform with scanners and red-team agents guarding enterprise model traffic.

Guardrails & Safety · self-hostable · enterprise

Cekura

Automated simulation testing and production monitoring for voice and chat AI agents.

Evals & Testing

Cisco AI Defense (Robust Intelligence)

acquired

AI model validation and runtime guardrails productized from Robust Intelligence inside Cisco's security stack.

Guardrails & Safety · enterprise

CloudZero

AI unit-economics platform mapping LLM and GPU spend to cost per feature, per customer and per deployment.

Cost & FinOps · enterprise

Composo

Custom evaluation models that score accuracy and quality of enterprise LLM applications in production.

Evals & Testing

Confident AI (DeepEval)

Pytest-style open-source LLM evaluation framework (DeepEval) with a hosted platform for benchmarking, regression testing and red-teaming (DeepTeam).

Evals & Testing · open source · self-hostable · freemium

Coralogix AI (Aporia)

acquired

Aporia's drift detection and AI guardrails folded into the Coralogix observability platform as its AI research arm.

Guardrails & Safety · paid

Coval

Simulation-first testing and replay for voice/chat agents, borrowing evaluation methodology from autonomous-vehicle testing.

Agent Debugging & Replay · paid

Datadog LLM Observability

LLM and agent tracing inside the Datadog APM suite, attractive to teams already standardized on Datadog for infrastructure monitoring.

Observability & Tracing · paid

Deepchecks

Continuous validation suite from the ML-testing world extended to LLM apps, scoring outputs across versions from dev to production.

Evals & Testing · open source · self-hostable · freemium

Dynatrace AI Observability

GenAI and LLM monitoring within the Dynatrace APM platform, covering tokens, cost and service health for enterprises already on Dynatrace.

Observability & Tracing · paid

Evidently AI

Open-source evaluation and monitoring library (100+ metrics) spanning tabular ML drift and LLM judge-based checks, with a managed cloud.

Evals & Testing · open source · self-hostable · freemium

Fiddler AI

Enterprise AI observability vendor from the ML-monitoring era, now offering LLM scoring, guardrails and bias/fairness auditing.

Observability & Tracing · self-hostable · enterprise

Finout

FinOps platform whose MegaBill model folds LLM/API spend into the same cost-allocation views as cloud infrastructure.

Cost & FinOps · enterprise

Freeplay

Prompt management, testing and human-review workflows aimed at cross-functional product teams shipping LLM features.

Prompt Management · paid

Future AGI

Evaluation and observability platform with a focus on voice-agent simulation and programmatic re-scoring of historical scenarios.

Evals & Testing · freemium

Galileo

Evaluation and guardrails platform whose in-house Luna-2 small judge models target low-cost, low-latency scoring of agentic workloads.

Evals & Testing · self-hostable · freemium

garak

Open-source LLM vulnerability scanner that runs pre-built probes for jailbreaks, leakage and injection.

Guardrails & Safety · open source · self-hostable · free

General Analysis

AI red teaming and safety testing platform producing adversarial test suites for LLM applications.

Guardrails & Safety

Gentrace

Collaborative LLM testing and eval platform emphasizing UI-driven experiments shared between engineers and subject-matter experts.

Evals & Testing · self-hostable · freemium

Giskard

Open-source testing framework that scans LLM apps for hallucination, injection and bias vulnerabilities, with a commercial evaluation hub.

Evals & Testing · open source · self-hostable · freemium

Google Stax

Experimental developer tool from Google Labs for LLM evaluation with human labeling and LLM-as-judge autoraters.

Evals & Testing · free

Grafana Cloud AI Observability

OTel-based GenAI observability solution on Grafana Cloud, built on open-source instrumentation rather than a proprietary SDK.

Observability & Tracing · open source · self-hostable · freemium

Guardrails AI

Open-source output-validation framework where composable validators enforce schemas, policies and safety constraints on LLM I/O.

Guardrails & Safety · open source · self-hostable · freemium

Haize Labs

Automated red-teaming ('haizing') that stress-tests LLM systems to find jailbreaks and failure modes before deployment.

Guardrails & Safety · paid

Hamming AI

Automated testing for voice agents that places thousands of simulated phone calls and scores transcripts against rubrics.

Evals & Testing · paid

Helicone

⚠ sunset/maintenance

Proxy/gateway-based LLM logging with one-line setup and unified cost and latency visibility across providers; now under Mintlify ownership.

Observability & Tracing · open source · self-hostable · freemium

HoneyHive

Evaluation and observability platform for production agents with prompt versioning, A/B tests and OTel-based tracing.

Observability & Tracing · self-hostable · freemium

Humanloop

⚠ sunset/maintenance

Former prompt management and evaluation platform; shut down September 2025 with official migration paths to W&B, PromptLayer and Agenta.

Prompt Management · enterprise

Inspect AI

Government-built open-source framework for rigorous LLM and agent evaluations, popular for safety benchmarks and sandboxed agentic tasks.

Evals & Testing · open source · self-hostable · free

Invariant Labs

acquired

Agent trace analysis and security scanning (incl. MCP tool-poisoning research), with an Explorer UI for debugging agent runs.

Agent Debugging & Replay · open source · self-hostable · freemium

Judgment Labs (judgeval)

Open-source agent behavior monitoring and evaluation library feeding agent post-training (RL/SFT).

Evals & Testing · open source · self-hostable · freemium

Kashikoi

Simulates multi-turn conversation flows to benchmark AI agents before deployment.

Evals & Testing

Lago

Open-source usage-based metering and billing engine used for AI token and credit pricing.

Cost & FinOps · open source · self-hostable · freemium

Lakera Guard

acquired

Low-latency API guarding against prompt injection, data leakage and toxic content, backed by the Gandalf attack dataset.

Guardrails & Safety · self-hostable · freemium

Laminar

Open-source observability for long-running AI agents that captures LLM calls, tool use and browser actions for step-level debugging and replay.

Agent Debugging & Replay · open source · self-hostable · freemium

Langfuse

Open-source (MIT) LLM engineering platform combining tracing, prompt management, evals and datasets, widely used as the default self-hosted observability stack.

Observability & Tracing · open source · self-hostable · freemium

LangSmith

Closed-source tracing, evals and monitoring platform from the LangChain team, with its deepest integration in LangChain/LangGraph but usable via OTel from any stack.

Observability & Tracing · self-hostable · freemium

Langtail

Spreadsheet-like prompt testing and deployment studio with assertions and security guardrails for smaller teams.

Prompt Management · freemium

Langtrace

OpenTelemetry-native open-source tracing and metrics for LLM apps and agent frameworks, with a managed cloud option.

Observability & Tracing · open source · self-hostable · freemium

LangWatch

Open-source agent testing and observability built around the Scenario simulation framework, covering text, voice and adversarial tests.

Evals & Testing · open source · self-hostable · freemium

Latitude

Open-source prompt engineering platform with versioning, evals and agent-running infrastructure (PromptL).

Prompt Management · open source · self-hostable · freemium

LiteLLM

Open-source LLM proxy/SDK that normalizes 100+ providers behind the OpenAI format with per-key budgets, spend tracking and rate limits.

Cost & FinOps · open source · self-hostable · freemium

LLM Guard (Protect AI)

acquired

Open-source input/output scanner toolkit (35+ scanners) for PII, injection and toxicity checks on LLM traffic.

Guardrails & Safety · open source · self-hostable · free

LLM Stats

Independent AI evaluations lab publishing model benchmarks and comparison data.

Evals & Testing

Lunary

Lightweight open-source LLM observability with tracing, analytics, prompt templates and PII masking, formerly known as LLMonitor.

Observability & Tracing · open source · self-hostable · freemium

Maxim AI

End-to-end agent simulation, evaluation and observability platform pitched at cross-functional product and engineering teams.

Evals & Testing · self-hostable · freemium

Metronome

acquired

Enterprise usage metering and billing platform powering token-based pricing for major AI companies.

Cost & FinOps · enterprise

Mindgard

Automated AI red teaming platform testing LLMs, agents and multimodal models against MITRE ATLAS / OWASP-aligned attacks.

Guardrails & Safety · enterprise

MLflow (Tracing & GenAI)

GenAI extension of the widely adopted MLOps platform: trace logging, LLM evaluation and a prompt registry inside open-source MLflow 3.

Observability & Tracing · open source · self-hostable · free

New Relic AI Monitoring

AI/LLM monitoring layer in the New Relic APM platform tracking model latency, token cost and errors alongside conventional app telemetry.

Observability & Tracing · freemium

NVIDIA NeMo Guardrails

Programmable conversational guardrails toolkit using the Colang DSL, covering input, dialog, retrieval, execution and output rails.

Guardrails & Safety · open source · self-hostable · free

Okareo

Synthetic-user simulation that runs on every commit to catch regressions in agent tone, policy compliance, tool use and routing.

Evals & Testing · freemium

OpenAI Evals

OpenAI's original open-source eval framework and registry; largely superseded by the hosted Evals API but still a reference implementation.

Evals & Testing · open source · self-hostable · free

Openlayer

AI evaluation and observability platform spanning development tests and production monitoring.

Evals & Testing

OpenLIT

OpenTelemetry-native open-source platform covering LLM tracing, GPU monitoring, guardrails and a prompt vault with one-line auto-instrumentation of 50+ providers.

Observability & Tracing · open source · self-hostable · free

OpenMeter

Open-source usage metering for AI/API products, commonly used to meter tokens for billing and internal chargeback.

Cost & FinOps · open source · self-hostable · freemium

Opik (Comet)

Open-source LLM evaluation and tracing platform from Comet, combining trace logging, eval metrics and CI-friendly test suites.

Observability & Tracing · open source · self-hostable · freemium

Orb

Usage-based billing engine handling high-volume metering for AI and token-priced products.

Cost & FinOps

Orq.ai

Generative AI collaboration platform bundling prompt management, deployments, evals and observability for SaaS teams.

Prompt Management · self-hostable · freemium

Patronus AI

Evaluation API and research-driven judge models (e.g. Lynx, Glider) for hallucination detection plus domain benchmarks like FinanceBench.

Evals & Testing · freemium

Pay-i

GenAI-specific FinOps tracking cost per request, per feature and per customer to give product teams real unit economics.

Cost & FinOps · paid

Petri

Open-source automated alignment auditing tool that probes target models with multi-turn simulated scenarios.

Evals & Testing · open source · self-hostable · free

Phospho

Open-source text analytics on LLM app messages, clustering and scoring conversations to surface what users actually do.

Observability & Tracing · open source · self-hostable · freemium

Pillar Security

AI security platform covering discovery, red teaming and runtime protection across the AI lifecycle.

Guardrails & Safety · enterprise

Portkey

acquired

AI gateway routing 1,600+ models with built-in logging, cost tracking, caching and guardrails; observability comes as a side effect of the proxy layer.

Observability & Tracing · open source · self-hostable · freemium

PostHog LLM Analytics

LLM cost, latency and trace analytics bolted onto PostHog's product-analytics platform, letting teams join AI telemetry with user behavior data.

Observability & Tracing · open source · self-hostable · freemium

Prompt Security

acquired

Enterprise GenAI security platform monitoring employee and application LLM usage for injection, leakage and shadow AI.

Guardrails & Safety · self-hostable · enterprise

Promptfoo

Config-file-driven open-source CLI for prompt evals, regression testing and LLM red-teaming that runs in CI.

Evals & Testing · open source · self-hostable · freemium

PromptHub

Git-style prompt version control with branch/commit/merge semantics, runtime REST retrieval and CI/CD quality gates.

Prompt Management · freemium

PromptLayer

Prompt CMS with visual versioning, release labels and A/B testing, positioned so non-engineers can edit and deploy prompts independently.

Prompt Management · freemium

Pydantic Logfire

OpenTelemetry-based observability service from the Pydantic team with first-class PydanticAI and Python ecosystem integration.

Observability & Tracing · open source · self-hostable · freemium

PyRIT

Python Risk Identification Toolkit automating single- and multi-turn adversarial probing of GenAI systems.

Guardrails & Safety · open source · self-hostable · free

Quotient AI

Evaluation and monitoring platform for detecting hallucinations and failures in AI agents.

Evals & Testing

RagaAI (Catalyst)

Agent testing, evaluation and tracing platform with the open-source Catalyst SDK.

Evals & Testing · open source · self-hostable · freemium

Ragas

The de-facto open-source metric library for RAG evaluation (faithfulness, context precision/recall), used standalone or inside other platforms.

Evals & Testing · open source · self-hostable · free

Repello AI

AI red teaming platform whose ARTEMIS engine automates adversarial testing of LLM apps and agents.

Guardrails & Safety

Respan (formerly Keywords AI)

Unified gateway-plus-observability control plane for tracing and evaluating agent behavior, rebranded from Keywords AI.

Observability & Tracing · freemium

Roark

Production observability for voice agents that captures real calls and converts failures into test cases.

Observability & Tracing

Scorecard

Continuous evaluation platform providing fast feedback loops for testing and improving AI agents.

Evals & Testing

Sentry AI Agent Monitoring

Agent-call tracing and error monitoring inside Sentry, giving app developers LLM visibility in the tool they already use for crash reporting.

Observability & Tracing · open source · self-hostable · freemium

SigNoz

Open-source OpenTelemetry APM that handles LLM observability via standard OTel instrumentation rather than an LLM-specific SDK.

Observability & Tracing · open source · self-hostable · freemium

SPLX (SplxAI)

acquired

Automated AI security testing and red teaming for AI assistants and agents from build to runtime.

Guardrails & Safety · enterprise

Straiker

AI-native security platform with red teaming and runtime guardrails for agentic applications.

Guardrails & Safety · enterprise

TensorZero

Open-source LLMOps stack unifying gateway, observability, evaluations, optimization and experimentation.

Observability & Tracing · open source · self-hostable · free

TestZeus (Hercules)

Open-source agentic end-to-end testing framework covering web, API and voice agent testing.

Evals & Testing · open source · self-hostable · freemium

The LLM Data Company (doteval)

AI-assisted workspace (doteval) for writing and managing LLM evaluations.

Evals & Testing

Tokencost

Lightweight open-source library maintaining an up-to-date price table for estimating prompt/completion costs across 400+ models.

Cost & FinOps · open source · self-hostable · free

Traceloop (OpenLLMetry)

Vendor-neutral OpenTelemetry instrumentation for LLM apps (OpenLLMetry) that ships traces to any OTel backend, plus a hosted monitoring platform.

Observability & Tracing · open source · self-hostable · freemium

Tropir

Autonomous LLM-Ops engineer that traces, debugs and optimizes LLM pipelines.

Agent Debugging & Replay

Trubrics

User analytics and feedback tracking for LLM applications to surface real usage patterns.

Observability & Tracing · open source

TrueFoundry

AI gateway and deployment platform with built-in request logging, cost attribution and rate limiting for enterprise Kubernetes environments.

Observability & Tracing · self-hostable · freemium

TruLens

acquired

Open-source library for feedback-function-based evaluation and tracing of RAG and agent apps, now stewarded by Snowflake.

Evals & Testing · open source · self-hostable · free

Vals AI

Industry-specific LLM benchmarks and enterprise evaluation for legal, tax and finance tasks.

Evals & Testing

Vantage

Cloud cost platform with native token-level ingest for OpenAI/Anthropic and an MCP server for querying AI spend from coding assistants.

Cost & FinOps · freemium

Vellum

Low-code platform for prompts, workflows, evaluations and deployments with environment management for product teams.

Prompt Management · freemium

Vertex AI Gen AI Evaluation Service

Managed evaluation service on Vertex AI for scoring models and agents with autoraters and custom metrics.

Evals & Testing · paid

Vijil

Agent trust platform combining automated evaluation, red teaming and runtime defenses for AI agents.

Guardrails & Safety

W&B Weave

acquired

LLM tracing and evaluation toolkit from Weights & Biases, integrated with the broader W&B experiment-tracking ecosystem.

Observability & Tracing · open source · self-hostable · freemium
