Best open-source LLM evals & testing tools

Quick answer

Top pick by our maturity signal: Promptfoo. Below are all 14 open-source tools we track in the evals category, ranked by the same objective GitHub-derived score. Maturity measures adoption and upkeep, not subjective quality, so pick by your own constraints.

Open-source frameworks for scoring LLM output, running regression suites and red-teaming, ranked by our GitHub maturity signal. The ranking method is public; see methodology. Note: maturity reflects total GitHub adoption, so large general-purpose platforms (e.g. Grafana, Sentry, PostHog) can rank high on the strength of their parent project even where their LLM-specific features are newer. Read the flags and pick by your constraints. Listings are free and editorially independent; sponsorship never changes facts or ranking.

| #  | Tool                       | Maturity              | Pricing  | Flags          |
|----|----------------------------|-----------------------|----------|----------------|
| 1  | Promptfoo                  | 100/100 (Mature)      | freemium | OSS, self-host |
| 2  | Confident AI (DeepEval)    | 100/100 (Mature)      | freemium | OSS, self-host |
| 3  | Giskard                    | 96/100 (Mature)       | freemium | OSS, self-host |
| 4  | TruLens                    | 93/100 (Mature)       | free     | OSS, self-host |
| 5  | LangWatch                  | 93/100 (Mature)       | freemium | OSS, self-host |
| 6  | Inspect AI                 | 91/100 (Mature)       | free     | OSS, self-host |
| 7  | Evidently AI               | 88/100 (Mature)       | freemium | OSS, self-host |
| 8  | Petri                      | 87/100 (Mature)       | free     | OSS, self-host |
| 9  | TestZeus (Hercules)        | 86/100 (Mature)       | freemium | OSS, self-host |
| 10 | Judgment Labs (judgeval)   | 86/100 (Mature)       | freemium | OSS, self-host |
| 11 | OpenAI Evals               | 80/100 (Mature)       | free     | OSS, self-host |
| 12 | RagaAI (Catalyst)          | 79/100 (Established)  | freemium | OSS, self-host |
| 13 | Ragas                      | 79/100 (Established)  | free     | OSS, self-host |
| 14 | Deepchecks                 | 63/100 (Established)  | freemium | OSS, self-host |

Frequently asked questions

What are the best open-source LLM evals & testing tools?

By our public maturity signal (GitHub stars + recency + license), Promptfoo ranks highest among the 14 open-source tools we track in the evals category. Maturity reflects adoption and upkeep, not subjective quality.

How is this ranking decided?

Tools are ranked by a reproducible maturity score computed only from public GitHub signals (log of stars + last-commit recency + license). The formula is published on our methodology page; ranking is never sold.
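
To give a feel for how a score of this shape behaves, here is a minimal Python sketch that combines the three named signals (log of stars, last-commit recency, license) into a 0-100 number. The weights, the star cap, the one-year staleness window and the function name maturity_score are all illustrative assumptions made for this example; they are not the published formula, which lives on the methodology page.

```python
import math
from datetime import date

# All constants below are hypothetical, chosen only to illustrate the idea.
STAR_WEIGHT = 60.0       # assumed max points from star count
RECENCY_WEIGHT = 30.0    # assumed max points from last-commit recency
LICENSE_WEIGHT = 10.0    # assumed points for an open-source license
STAR_CAP = 100_000       # assumed star count at which the star part saturates
STALE_AFTER_DAYS = 365   # assumed age at which the recency part reaches zero


def maturity_score(stars: int, last_commit: date,
                   has_oss_license: bool, today: date) -> int:
    """Combine log(stars), commit recency and license into a 0-100 score."""
    # Log scaling: 100 -> 1,000 stars adds as much as 1,000 -> 10,000.
    star_part = min(math.log10(max(stars, 1)) / math.log10(STAR_CAP), 1.0)

    # Linear decay from 1.0 (commit today) to 0.0 (stale window exceeded).
    days_idle = max((today - last_commit).days, 0)
    recency_part = max(1.0 - days_idle / STALE_AFTER_DAYS, 0.0)

    license_part = 1.0 if has_oss_license else 0.0

    total = (STAR_WEIGHT * star_part
             + RECENCY_WEIGHT * recency_part
             + LICENSE_WEIGHT * license_part)
    return round(total)
```

Under these assumed weights, a repository with 100,000 stars, a commit today and an open-source license scores 100, while one with no stars, no recent commits and no license scores 0. The log term is why a large parent project can lift a tool's score even when its LLM-specific features are recent.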