LLM-as-judge for RAG: what to score, what to distrust

Tue, 07 Apr 2026 00:00:00 +0000

LLM-as-judge adds scale when human reviewers cannot read every RAG interaction. A common pattern: after the answer path returns, push the question, answer, and retrieved context onto a queue; a worker Lambda runs a judge prompt over each message; results land in a database for analytics.
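A minimal sketch of that flow, under loud assumptions: the standard-library `queue.Queue` and an in-memory SQLite table stand in for the real queue and database, and `fake_judge` is a hypothetical word-overlap placeholder for the judge prompt, not a real model call. The function and table names are illustrative only.

```python
import json
import queue
import sqlite3


def fake_judge(question, answer, context):
    """Hypothetical stand-in for the judge LLM call; returns a score in [0, 1].

    Assumption: a crude word-overlap ratio, used only so the sketch runs offline.
    """
    answer_words = set(answer.lower().split())
    context_words = set(" ".join(context).lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)


def enqueue_for_judging(q, question, answer, context):
    """Called after the answer path returns; the user never waits on the judge."""
    q.put(json.dumps({"question": question, "answer": answer, "context": context}))


def drain_and_score(q, db, judge=fake_judge):
    """Worker side: score every queued message and store the result."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS judge_results "
        "(question TEXT, answer TEXT, score REAL)"
    )
    processed = 0
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            break
        item = json.loads(message)
        score = judge(item["question"], item["answer"], item["context"])
        db.execute(
            "INSERT INTO judge_results VALUES (?, ?, ?)",
            (item["question"], item["answer"], score),
        )
        processed += 1
    db.commit()
    return processed


if __name__ == "__main__":
    q = queue.Queue()
    db = sqlite3.connect(":memory:")
    enqueue_for_judging(
        q,
        "What is top-k?",
        "top-k is the number of chunks",
        ["top-k is the number of chunks retrieved"],
    )
    print(drain_and_score(q, db), "message(s) scored")
```

Keeping the judge behind a plain function argument means the worker can be exercised without a model, and the real prompt can be swapped in later.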

What judges are good for

Use	Reason
Trend monitoring	Average scores or failure flags shifting after a deploy
Sampling for humans	Pull low-scoring rows for manual review
Regression alarms	Chunk size, top-k, or model changes moving the distribution
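The three uses in the table can be sketched over stored scores. This is an illustration, not a prescribed method: the function names, the 0.5 review threshold, and the 0.1 maximum drop are all hypothetical values chosen for the example, and scores are assumed to lie in [0, 1].

```python
from statistics import mean


def mean_shift(before, after):
    """Trend monitoring: change in average judge score across a deploy."""
    return mean(after) - mean(before)


def low_scoring(rows, threshold=0.5, limit=20):
    """Sampling for humans: worst-scoring rows first, capped at `limit`.

    Assumption: each row is a dict with a numeric "score" key.
    """
    flagged = [row for row in rows if row["score"] < threshold]
    return sorted(flagged, key=lambda row: row["score"])[:limit]


def regression_alarm(before, after, max_drop=0.1):
    """Regression alarm: fire when the mean falls by more than `max_drop`."""
    return mean_shift(before, after) < -max_drop


if __name__ == "__main__":
    before = [0.8, 0.9, 0.85]
    after = [0.6, 0.7, 0.65]
    print("alarm:", regression_alarm(before, after))
    rows = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}, {"id": 3, "score": 0.4}]
    print("review queue:", [row["id"] for row in low_scoring(rows)])
```

A mean comparison is the simplest possible check; it says the distribution moved, not why, which is what the human sample is for.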

Judges are cheap sensors, not auditors.
