Best open-source LLM evals & testing tools

Quick answer

Top pick by our maturity signal: Promptfoo. Below are all 14 open-source tools we track in the evals category, ranked by the same objective GitHub-derived score. Maturity measures adoption and upkeep, not subjective quality, so pick by your own constraints.

Open-source frameworks for scoring LLM output, running regression suites, and red-teaming, ranked by our GitHub maturity signal. The ranking method is public; see methodology. Note: maturity reflects total GitHub adoption, so large general-purpose platforms (e.g. Grafana, Sentry, PostHog) can rank high on the strength of their parent project even where their LLM-specific features are newer. Read the flags and pick by your constraints. Listings are free and editorially independent; sponsorship never changes facts or ranking.

#	Tool	Maturity	Pricing	Flags
1	Promptfoo	100/100 (Mature)	freemium	OSS, self-host
2	Confident AI (DeepEval)	100/100 (Mature)	freemium	OSS, self-host
3	Giskard	96/100 (Mature)	freemium	OSS, self-host
4	TruLens	93/100 (Mature)	free	OSS, self-host
5	LangWatch	93/100 (Mature)	freemium	OSS, self-host
6	Inspect AI	91/100 (Mature)	free	OSS, self-host
7	Evidently AI	88/100 (Mature)	freemium	OSS, self-host
8	Petri	87/100 (Mature)	free	OSS, self-host
9	TestZeus (Hercules)	86/100 (Mature)	freemium	OSS, self-host
10	Judgment Labs (judgeval)	86/100 (Mature)	freemium	OSS, self-host
11	OpenAI Evals	80/100 (Mature)	free	OSS, self-host
12	RagaAI (Catalyst)	79/100 (Established)	freemium	OSS, self-host
13	Ragas	79/100 (Established)	free	OSS, self-host
14	Deepchecks	63/100 (Established)	freemium	OSS, self-host

Frequently asked questions

What are the best open-source LLM evals & testing tools?

By our public maturity signal (GitHub stars + recency + license), Promptfoo ranks highest among the 14 open-source tools we track in the evals category. Maturity reflects adoption and upkeep, not subjective quality.

How is this ranking decided?

Tools are ranked by a reproducible maturity score computed only from public GitHub signals (log of stars + last-commit recency + license). The formula is published on our methodology page; ranking is never sold.
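
To make the shape of such a score concrete, here is a minimal sketch in Python of one way to combine the three signals named above (log of stars, last-commit recency, license) into a 0-100 number. It is not the published formula. The weights, the star cap, the recency window, the license set, and the function name `maturity_score` are all hypothetical and chosen only for illustration; the methodology page remains the authoritative source.

```python
import math
from datetime import date

# All constants below are ASSUMPTIONS for illustration, not the published formula.
STAR_WEIGHT = 60.0          # hypothetical share of the score driven by stars
RECENCY_WEIGHT = 30.0       # hypothetical share driven by last-commit recency
LICENSE_WEIGHT = 10.0       # hypothetical share driven by the license
STAR_CAP = 100_000          # assumed star count at which the star term saturates
RECENCY_WINDOW_DAYS = 365   # assumed: commits older than this earn no recency credit
OPEN_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}  # illustrative subset only


def maturity_score(stars: int, last_commit: date, license_id: str, today: date) -> int:
    """Return a 0-100 score from log-scaled stars, commit recency, and license."""
    # Log scaling means each tenfold increase in stars adds a fixed increment.
    star_term = min(math.log10(stars + 1) / math.log10(STAR_CAP + 1), 1.0)

    # Recency decays linearly from 1.0 (committed today) to 0.0 at the window edge.
    age_days = max((today - last_commit).days, 0)
    recency_term = max(1.0 - age_days / RECENCY_WINDOW_DAYS, 0.0)

    # License is treated as a simple yes/no signal in this sketch.
    license_term = 1.0 if license_id in OPEN_LICENSES else 0.0

    score = (
        STAR_WEIGHT * star_term
        + RECENCY_WEIGHT * recency_term
        + LICENSE_WEIGHT * license_term
    )
    return round(score)
```

Because every input is a public GitHub signal and the function is deterministic, anyone running it against the same repository data on the same date gets the same number, which is the sense in which a score like this is reproducible.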