Self-hosting LLM observability: what we actually measured
2026-06-13 · self-hosting, benchmark, observability
What we found
Instrumentation overhead is a non-issue: every SDK we tested added under 0.05% to a typical 500 ms LLM call, though per-span cost varies by about 7x between them. The real difference is operational weight. A single-container tool like Arize Phoenix came up immediately (1.35 GB image, ~400 MB idle RAM, OTLP ingest worked out of the box), while heavier multi-service stacks take real effort to stand up.
Most "best self-hosted observability" lists never actually run the tools. We did. Two questions matter when you self-host: how much does the SDK slow my app down, and how much does the backend cost me to operate. Here is what we measured.
1. Instrumentation overhead (client-side)
We timed 50,000 spans per SDK with in-memory exporters (no network) against an uninstrumented baseline. The table shows per-span overhead and that overhead as a percentage of a typical 500 ms LLM call:
| SDK | Overhead per span | % of a 500 ms call |
|---|---|---|
| OpenTelemetry (raw) | ~34 µs | 0.007% |
| Traceloop / OpenLLMetry | ~37 µs | 0.007% |
| Langfuse SDK | ~243 µs | 0.049% |
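A minimal sketch of how a measurement like this can be structured, using only the standard library. It is not our benchmark code: `traced_work`, `work` and `SINK` are hypothetical stand-ins for an SDK's "start span, run the call, end span, export to memory" path, and a real run would swap in the SDK under test.

```python
import time
from typing import Callable, Dict, List

# Hypothetical in-memory sink, standing in for an SDK's in-memory exporter.
SINK: List[Dict[str, object]] = []


def work() -> None:
    """Uninstrumented baseline: the operation being wrapped (a no-op here)."""
    return None


def traced_work() -> None:
    """Stand-in for 'start span, do work, end span, export to memory'."""
    span: Dict[str, object] = {"name": "llm.call", "start_ns": time.perf_counter_ns()}
    work()
    span["end_ns"] = time.perf_counter_ns()
    SINK.append(span)


def seconds_per_call(fn: Callable[[], None], n: int) -> float:
    """Average wall-clock seconds per call of fn over n iterations."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n


def span_overhead_us(
    instrumented: Callable[[], None],
    baseline: Callable[[], None],
    n: int = 50_000,
    warmup: int = 1_000,
) -> float:
    """Per-call overhead of instrumented over baseline, in microseconds."""
    # Warm up both paths so one-time costs do not skew the averages.
    for _ in range(warmup):
        baseline()
        instrumented()
    base = seconds_per_call(baseline, n)
    inst = seconds_per_call(instrumented, n)
    return (inst - base) * 1e6


print(f"overhead per span: {span_overhead_us(traced_work, work):.2f} us")
```

Subtracting the baseline average removes loop and call overhead from the result, so what remains is the cost of the span bookkeeping itself. Numbers from a stand-in like this are not comparable to the table; only the structure is.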
Takeaway: stop worrying about instrumentation overhead for LLM workloads. Even with the slowest SDK in our test, the model call takes roughly 2,000x longer than the span overhead. The ~7x spread between the raw OpenTelemetry SDK and the Langfuse SDK (a richer observation model costs more per span) only matters on very high-span, non-LLM hot paths.
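The percentages and ratios above follow directly from the measured per-span figures. A short sketch of the arithmetic, using the table's values and the 500 ms call length as inputs (the function and variable names are ours, for illustration only):

```python
CALL_MS = 500.0  # the "typical LLM call" length used in the table

# Measured per-span overhead in microseconds, copied from the table.
OVERHEAD_US = {
    "OpenTelemetry (raw)": 34.0,
    "Traceloop / OpenLLMetry": 37.0,
    "Langfuse SDK": 243.0,
}


def pct_of_call(overhead_us: float, call_ms: float = CALL_MS) -> float:
    """Span overhead as a percentage of one call."""
    return overhead_us / (call_ms * 1000.0) * 100.0


def dominance(overhead_us: float, call_ms: float = CALL_MS) -> float:
    """How many times longer the call is than the span overhead."""
    return call_ms * 1000.0 / overhead_us


for name, us in OVERHEAD_US.items():
    print(f"{name}: {pct_of_call(us):.3f}% of call, call is {dominance(us):,.0f}x longer")

# Ratio between the most and least expensive SDK per span.
spread = max(OVERHEAD_US.values()) / min(OVERHEAD_US.values())
print(f"spread: {spread:.1f}x")
```

For the slowest SDK the call is about 2,058x longer than the span overhead; for the fastest it is about 14,706x. The spread between the two is about 7.1x.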
2. Operational weight (self-host footprint)
Arize Phoenix is the lightweight end: one container, a 1.35 GB image, ~400 MB idle RAM, and it accepted standard OTLP traces immediately. If you want self-hosted tracing running in five minutes, this is the shape to look for.
The full-platform stacks are a different commitment. A tool like Langfuse ships a multi-service compose (web, worker, Postgres, ClickHouse, cache, object store). It is far more capable, but in our test the stack did not reach a healthy web endpoint within a 7.5-minute window on a well-resourced box (34 GB / 20 cores). That is not a knock on the product, which is widely self-hosted in production, but it is an honest signal: a six-service analytics platform is an operational commitment, not a one-container drop-in. Budget for migrations, storage and ongoing ops.
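A "healthy within N minutes" check like the one described can be sketched as a simple poll with a deadline. This is an illustration under assumptions, not our test script: the 5-second interval is arbitrary, and the URL in the usage comment is a placeholder you would replace with whatever health route your deployment exposes.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """True if a GET to url returns a 2xx status; False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    window_s: float = 450.0,  # 7.5 minutes, the window used in our test
    interval_s: float = 5.0,  # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe until it succeeds; return seconds elapsed, or None on timeout."""
    start = clock()
    deadline = start + window_s
    while True:
        if probe():
            return clock() - start
        if clock() >= deadline:
            return None
        sleep(interval_s)


# Usage (placeholder URL, adjust to your deployment's health route):
#   elapsed = wait_until_healthy(lambda: http_ok("http://localhost:3000/health"))
#   print("never became healthy" if elapsed is None else f"healthy after {elapsed:.0f} s")
```

Passing `clock` and `sleep` as parameters keeps the function testable without real waiting, and makes it easy to rerun with a longer boot window.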
How to read this
Match the footprint to the job. Need lightweight, OpenTelemetry-native tracing you can stand up fast? A single-container tool wins. Need prompt management, evals, datasets and long retention for a team? A full platform earns its operational weight. Don't pay multi-service ops cost for a single-container need, or vice versa.
Caveats, stated plainly: the overhead test measures client-side span cost with in-memory export, not end-to-end backend latency. The self-host numbers come from one test environment (WSL2, Docker), and we are extending the matrix to more tools and a longer boot window. We will publish the full table, labelled "tested" and dated, as we complete clean runs. See our methodology.
Frequently asked questions
Does LLM observability instrumentation slow down my app?
Negligibly. In our test every SDK added under 0.05% to a typical 500 ms LLM call; the model call is 2,000 or more times longer than the span overhead. Instrumentation overhead should not drive your tool choice for LLM workloads.
Which self-hosted LLM observability tool is lightest to run?
In our test Arize Phoenix was the lightweight end: a single container, a ~1.35 GB image and ~400 MB idle RAM, accepting OTLP traces immediately. Full platforms like Langfuse are far more capable but ship multi-service stacks that are a larger operational commitment.
Is Langfuse hard to self-host?
It is widely self-hosted in production, but it is a multi-service stack (web, worker, Postgres, ClickHouse, cache, object store). In our test it did not reach a healthy endpoint within 7.5 minutes on a 34 GB / 20-core box, so plan for migrations, storage and ongoing ops rather than a one-container drop-in.