Self-hosting LLM observability: what we actually measured

2026-06-13 · self-hosting, benchmark, observability

What we found

Instrumentation overhead is a non-issue: each of the three SDKs we tested added under 0.05% to a typical 500 ms LLM call, despite a ~7x spread between them. The real difference is operational weight. A single-container tool like Arize Phoenix was up and accepting OTLP traces right away (1.35 GB image, ~400 MB idle RAM), while heavier multi-service stacks take real effort to stand up.

Most "best self-hosted observability" lists never actually run the tools. We did. Two questions matter when you self-host: how much does the SDK slow my app down, and how much does the backend cost me to operate. Here is what we measured.

1. Instrumentation overhead (client-side)

We timed 50,000 spans per SDK with in-memory exporters (no network) and compared against an uninstrumented baseline. The table shows per-span overhead and what that is as a percentage of a typical 500 ms LLM call:

SDK	overhead / span	% of a 500 ms call
OpenTelemetry (raw)	~34 µs	0.007%
Traceloop / OpenLLMetry	~37 µs	0.007%
Langfuse SDK	~243 µs	0.049%

Takeaway: stop worrying about instrumentation overhead for LLM workloads. Even with the slowest SDK we measured, the model call is about 2,000 times longer than the span overhead. The ~7x spread (a richer observation model costs more per span) only matters on very high-span, non-LLM hot paths.
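To make the arithmetic behind the table explicit, here is a small sketch that recomputes the percentage column, the call-to-overhead ratio and the spread from the measured per-span overheads. The function names are ours, chosen for illustration; the only inputs are the numbers in the table and the 500 ms call duration assumed throughout this post.

```python
# Per-span overheads from the table, in microseconds.
OVERHEAD_US = {
    "OpenTelemetry (raw)": 34,
    "Traceloop / OpenLLMetry": 37,
    "Langfuse SDK": 243,
}

CALL_MS = 500  # the "typical" LLM call duration assumed in this post


def overhead_percent(overhead_us: float, call_ms: float = CALL_MS) -> float:
    """Span overhead as a percentage of one call."""
    return overhead_us / (call_ms * 1000) * 100


def dominance_ratio(overhead_us: float, call_ms: float = CALL_MS) -> float:
    """How many times longer the call is than the span overhead."""
    return (call_ms * 1000) / overhead_us


for name, us in OVERHEAD_US.items():
    print(
        f"{name}: {overhead_percent(us):.3f}% of the call, "
        f"call is {dominance_ratio(us):,.0f}x the overhead"
    )

# Ratio of the most expensive SDK to the cheapest one.
spread = max(OVERHEAD_US.values()) / min(OVERHEAD_US.values())
print(f"spread between SDKs: {spread:.1f}x")
```

If your own calls are shorter or longer than 500 ms, pass a different `call_ms` to see how the percentages shift; the per-span overhead itself does not change with call duration.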

2. Operational weight (self-host footprint)

Arize Phoenix is the lightweight end: one container, a 1.35 GB image, ~400 MB idle RAM, and it accepted standard OTLP traces immediately. If you want self-hosted tracing running in five minutes, this is the shape to look for.

The full-platform stacks are a different commitment. A tool like Langfuse ships a multi-service compose (web, worker, Postgres, ClickHouse, cache, object store). It is far more capable, but in our test the stack did not reach a healthy web endpoint within a 7.5-minute window on a well-resourced box (34 GB / 20 cores). That is not a knock on the product, which is widely self-hosted in production, but it is an honest signal: a six-service analytics platform is an operational commitment, not a one-container drop-in. Budget for migrations, storage and ongoing ops.
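A boot-window check of this kind can be sketched as a simple polling loop. This is an illustration and not our actual test harness: the helper names, the 5-second polling interval and the example URL are assumptions. Only the 450-second (7.5-minute) window comes from the test described above.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout_s: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts and non-2xx responses all count as "not up yet".
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    window_s: float = 450.0,  # 7.5 minutes, the window used in this post
    interval_s: float = 5.0,  # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until probe() succeeds; return seconds elapsed, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        # Give up if the next poll would land outside the window.
        if clock() - start + interval_s > window_s:
            return None
        sleep(interval_s)


# Example use (hypothetical URL; substitute your stack's real health endpoint):
#   elapsed = wait_until_healthy(lambda: http_probe("http://localhost:3000/"))
#   print("not healthy within window" if elapsed is None else f"healthy after {elapsed:.0f} s")
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting in real time, and makes it easy to rerun the same check with a longer window.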

How to read this

Match the footprint to the job. Need lightweight, OpenTelemetry-native tracing you can stand up fast? A single-container tool wins. Need prompt management, evals, datasets and long retention for a team? A full platform earns its operational weight. Don't pay multi-service ops cost for a single-container need, or vice versa.

Caveats, stated plainly: the overhead test measures client-side span cost with in-memory export, not end-to-end backend latency. The self-host numbers come from one test environment (WSL2, Docker), and we are extending the matrix to more tools and a longer boot window. We will publish the full table, labelled "tested" and dated, as we complete clean runs. See our methodology.
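For readers who want to reproduce the shape of the overhead test, here is a minimal sketch of a baseline-versus-instrumented timing loop. It deliberately uses no real SDK: `work_with_fake_span` and the `finished_spans` list are stand-ins we made up to mimic "create a span, record it in memory". In a real run you would wrap `work` in the SDK's own span API and attach that SDK's in-memory exporter.

```python
import time
from typing import Callable


def mean_call_ns(fn: Callable[[], None], iterations: int) -> float:
    """Average wall-clock cost of calling fn once, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def per_span_overhead_us(
    baseline: Callable[[], None],
    instrumented: Callable[[], None],
    iterations: int = 50_000,  # span count used in this post
) -> float:
    """Instrumented cost minus baseline cost, per call, in microseconds."""
    base_ns = mean_call_ns(baseline, iterations)
    inst_ns = mean_call_ns(instrumented, iterations)
    return (inst_ns - base_ns) / 1000


# Stand-ins for illustration only (not a real tracing SDK).
finished_spans: list = []


def work() -> None:
    """Uninstrumented baseline: does nothing."""


def work_with_fake_span() -> None:
    """Same work, plus a fake span appended to an in-memory list."""
    start = time.perf_counter_ns()
    work()
    finished_spans.append(
        {"name": "llm.call", "start": start, "end": time.perf_counter_ns()}
    )


print(f"{per_span_overhead_us(work, work_with_fake_span):.2f} µs per span")
```

Numbers from a loop like this vary with machine load and interpreter version, so treat any single run as a rough estimate and repeat it several times before comparing SDKs.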

Frequently asked questions

Does LLM observability instrumentation slow down my app?

Negligibly. In our test every SDK added under 0.05% to a typical 500 ms LLM call; the model call is thousands of times longer than the span overhead. Instrumentation overhead should not drive your tool choice for LLM workloads.

Which self-hosted LLM observability tool is lightest to run?

In our test Arize Phoenix was the lightweight end: a single container, a 1.35 GB image and ~400 MB idle RAM, accepting OTLP traces immediately. Full platforms like Langfuse are far more capable but ship multi-service stacks that are a larger operational commitment.

Is Langfuse hard to self-host?

It is widely self-hosted in production, but it is a multi-service stack (web, worker, Postgres, ClickHouse, cache, object store). In our test it did not reach a healthy web endpoint within 7.5 minutes on a 34 GB / 20-core box, so plan for migrations, storage and ongoing ops rather than a one-container drop-in.