Best open-source LLM evals & testing tools
Quick answer
Top pick by our maturity signal: Promptfoo. Below are all 14 open-source tools in the evals category that we track, ranked by the same objective GitHub-derived score. Maturity measures adoption and upkeep, not subjective quality; pick by your own constraints.
Open-source frameworks for scoring LLM output, running regression suites and red-teaming, ranked by our GitHub maturity signal. Ranking method is public — see methodology. Note: maturity reflects total GitHub adoption, so large general-purpose platforms (e.g. Grafana, Sentry, PostHog) can rank high on the strength of their parent project even where their LLM-specific features are newer — read the flags and pick by your constraints. Listings are free and editorially independent; sponsorship never changes facts or ranking.
| # | Tool | Maturity | Pricing | Flags |
|---|---|---|---|---|
| 1 | Promptfoo | 100/100 (Mature) | freemium | OSS, self-host |
| 2 | Confident AI (DeepEval) | 100/100 (Mature) | freemium | OSS, self-host |
| 3 | Giskard | 96/100 (Mature) | freemium | OSS, self-host |
| 4 | TruLens | 93/100 (Mature) | free | OSS, self-host |
| 5 | LangWatch | 93/100 (Mature) | freemium | OSS, self-host |
| 6 | Inspect AI | 91/100 (Mature) | free | OSS, self-host |
| 7 | Evidently AI | 88/100 (Mature) | freemium | OSS, self-host |
| 8 | Petri | 87/100 (Mature) | free | OSS, self-host |
| 9 | TestZeus (Hercules) | 86/100 (Mature) | freemium | OSS, self-host |
| 10 | Judgment Labs (judgeval) | 86/100 (Mature) | freemium | OSS, self-host |
| 11 | OpenAI Evals | 80/100 (Mature) | free | OSS, self-host |
| 12 | RagaAI (Catalyst) | 79/100 (Established) | freemium | OSS, self-host |
| 13 | Ragas | 79/100 (Established) | free | OSS, self-host |
| 14 | Deepchecks | 63/100 (Established) | freemium | OSS, self-host |
Frequently asked questions
What is the best open-source LLM evals & testing tool?
By our public maturity signal (GitHub stars + recency + license), Promptfoo ranks highest among the 14 open-source tools in the evals category that we track. Maturity reflects adoption and upkeep, not subjective quality.
How is this ranking decided?
Tools are ranked by a reproducible maturity score computed only from public GitHub signals (log of stars + last-commit recency + license). The formula is published on our methodology page; ranking is never sold.
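To make the shape of such a score concrete, here is a minimal sketch of how a log-of-stars term, a last-commit recency term and a license term could be combined into a 0-100 number. This is not the published formula: the function name `maturity_score`, the 60/30/10 weights, the 100,000-star cap, the 365-day recency window and the license tiers are all assumptions chosen for illustration. The methodology page remains the authoritative source.

```python
import math
from datetime import date

# Every constant below is an illustrative assumption, not the published formula.
STAR_CAP = 100_000         # hypothetical star count at which the star term saturates
RECENCY_WINDOW_DAYS = 365  # hypothetical window after which recency credit reaches zero
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}  # hypothetical top license tier


def maturity_score(stars: int, last_commit: date, license_id: str, today: date) -> int:
    """Combine three public GitHub signals into a 0-100 score (assumed weights)."""
    # Log of stars, scaled to 0..1 so that STAR_CAP or more stars earns full credit.
    star_term = min(math.log10(max(stars, 1)) / math.log10(STAR_CAP), 1.0)

    # Recency decays linearly from 1 (commit today) to 0 (commit a full window ago).
    days_since_commit = max((today - last_commit).days, 0)
    recency_term = max(0.0, 1.0 - days_since_commit / RECENCY_WINDOW_DAYS)

    # License: full credit for the permissive set, half for any other declared
    # license, none when no license is declared.
    if license_id in PERMISSIVE:
        license_term = 1.0
    elif license_id:
        license_term = 0.5
    else:
        license_term = 0.0

    # Assumed weights: 60 points for stars, 30 for recency, 10 for license.
    return round(60 * star_term + 30 * recency_term + 10 * license_term)


# Usage with made-up inputs (not real repository data):
today = date(2025, 1, 1)
print(maturity_score(8_000, date(2024, 12, 20), "MIT", today))
```

Taking the logarithm of stars means that each tenfold increase in stars adds the same number of points, which keeps very large repositories from dominating the scale. Because the inputs are all public, anyone can recompute a score of this kind from a repository's metadata.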