LLM-as-judge for RAG: what to score, what to distrust


LLM-as-judge adds scale when human reviewers cannot read every RAG interaction. A common pattern: after the answer path returns, the question, answer, and retrieved context are placed on a queue; a worker Lambda runs a judge prompt over them; the results land in a database for analytics.

What judges are good for

Use	Reason
Trend monitoring	Average scores or failure flags shifting after a deploy
Sampling for humans	Pull low-scoring rows for manual review
Regression alarms	Chunk size, top-k, or model changes moving the distribution

Judges are cheap sensors, not auditors.
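As a rough illustration of the trend-monitoring and regression-alarm uses, the sketch below compares judge scores from a window before a deploy with a window after it. The function name `score_shift`, the thresholds, and the choice of "mean drop" and "share of low scores" as signals are all assumptions made for this example, not values taken from any particular system.

```python
from statistics import mean
from typing import Sequence


def score_shift(
    baseline: Sequence[int],
    current: Sequence[int],
    max_mean_drop: float = 0.3,      # assumed threshold, tune on your own data
    low_score: int = 2,              # scores at or below this count as "low"
    max_low_rate_rise: float = 0.05, # assumed threshold
) -> dict:
    """Compare two windows of 1-to-5 judge scores and flag a likely regression."""
    if not baseline or not current:
        raise ValueError("both windows need at least one score")

    def low_rate(scores: Sequence[int]) -> float:
        return sum(1 for s in scores if s <= low_score) / len(scores)

    mean_drop = mean(baseline) - mean(current)
    low_rate_rise = low_rate(current) - low_rate(baseline)
    return {
        "mean_drop": mean_drop,
        "low_rate_rise": low_rate_rise,
        "alarm": mean_drop > max_mean_drop or low_rate_rise > max_low_rate_rise,
    }
```

An alarm here only says the distribution moved. It is a prompt to pull rows for human review, not a verdict on the deploy.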

Async evaluation flow (sequence)

Sequence: the user gets an answer from the answer Lambda; that Lambda enqueues the interaction; a judge Lambda reads it from the queue, calls the judge LLM, and writes the scores to a database.
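The worker side of this flow might look roughly like the sketch below. It assumes the standard SQS-to-Lambda event shape (a `Records` list whose items carry `messageId` and `body`) and returns a `batchItemFailures` list, which only has an effect if partial batch responses are enabled on the event source mapping. The `judge` and `store` callables are hypothetical stand-ins for the judge LLM call and the database write.

```python
import json
from typing import Callable


def make_handler(
    judge: Callable[[dict], dict],   # hypothetical: interaction -> score fields
    store: Callable[[dict], None],   # hypothetical: persists one scored row
):
    """Build a Lambda-style handler with the judge and storage injected."""

    def handler(event: dict, context=None) -> dict:
        failures = []
        for record in event.get("Records", []):
            try:
                interaction = json.loads(record["body"])
                scores = judge(interaction)
                store({"interaction_id": interaction.get("interaction_id"), **scores})
            except Exception:
                # Report only this message as failed so the rest of the
                # batch is not retried along with it.
                failures.append({"itemIdentifier": record.get("messageId")})
        return {"batchItemFailures": failures}

    return handler
```

Injecting `judge` and `store` keeps the handler testable without network access and makes it easy to swap the judge model later.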

Typical rubric dimensions (conceptual)

Handoff designs often include checks similar to:

Need met — Did the answer address the user’s intent at a coarse level?
Citation behavior — Were sources cited? Do they plausibly support the claims?
Overall quality — A small ordinal score (for example 1 to 5).
Improvement notes — Short free-text hints for operators.

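Keep the rubric versioned alongside the prompts and infrastructure. Even small, unannounced wording changes can shift scores, so a before-and-after comparison is only meaningful when each score records the rubric version that produced it.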
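One way to make the rubric concrete is to ask the judge for JSON and validate it before storing anything. The sketch below assumes a judge that returns the fields shown; the field names, the 500-character cap on notes, and the idea of deriving a version tag from a hash of the prompt text are illustrative choices, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import dataclass


def rubric_version(prompt_text: str) -> str:
    """Short content hash, so any wording change yields a new version tag."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class JudgeResult:
    need_met: bool
    sources_cited: bool
    citations_support_claims: bool
    overall: int          # ordinal score, 1 to 5
    notes: str            # short free-text hint for operators
    rubric_version: str


def parse_judge_output(raw: str, version: str) -> JudgeResult:
    """Validate the judge's raw reply; raise ValueError on anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for key in ("need_met", "sources_cited", "citations_support_claims"):
        if not isinstance(data.get(key), bool):
            raise ValueError(f"{key} must be a boolean")
    overall = data.get("overall")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(overall, bool) or not isinstance(overall, int) or not 1 <= overall <= 5:
        raise ValueError("overall must be an integer from 1 to 5")
    return JudgeResult(
        need_met=data["need_met"],
        sources_cited=data["sources_cited"],
        citations_support_claims=data["citations_support_claims"],
        overall=overall,
        notes=str(data.get("notes", ""))[:500],
        rubric_version=version,
    )
```

Rejecting malformed output outright, rather than coercing it, keeps parsing failures visible as their own metric instead of hiding them inside the score distribution.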

Async separation from the user path

Running the judge off the hot path (for example SQS to an evaluation Lambda) keeps perceived latency stable. Tradeoff: scores arrive seconds later, which is fine for dashboards and daily review, not for blocking the chat response.
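A minimal sketch of the producer side follows. The `send` callable is a hypothetical wrapper around whatever queue client is in use (for SQS it could wrap a `send_message` call); the message fields are assumptions for illustration. The point it tries to show is that an enqueue failure is swallowed so the user still gets the answer.

```python
import json
import time
import uuid
from typing import Callable, Sequence


def build_eval_message(
    question: str,
    answer: str,
    contexts: Sequence[str],
    rubric_version: str,
) -> str:
    """Serialize one RAG interaction for the judge worker."""
    return json.dumps({
        "interaction_id": str(uuid.uuid4()),
        "created_at": time.time(),
        "rubric_version": rubric_version,
        "question": question,
        "answer": answer,
        "contexts": list(contexts),
    })


def enqueue_for_judging(send: Callable[[str], None], message: str) -> bool:
    """Fire-and-forget: a queue failure must never break the user response."""
    try:
        send(message)
        return True
    except Exception:
        # In a real service, log and count this; the answer is still returned.
        return False
```

Because dropped messages only reduce evaluation coverage, it is worth tracking the `False` rate so a silent queue outage does not look like a quiet day on the dashboard.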

What to distrust

Edge cases — Judges can be harsh or arbitrary on ambiguous policies.
Grounding illusions — A confident judge does not prove factual correctness against the real world.
Metric gaming — If incentives attach to the score, behavior (prompting or filtering) will adapt.

Use judge output as a tripwire and prioritization signal, not the sole KPI.
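Treating the score as a prioritization signal could look like the sketch below: every low-scoring row goes to human review, plus a small random audit of rows the judge passed, so reviewers also see what the judge may be missing. The row shape (a dict with an `overall` key), the threshold, and the 5% audit fraction are assumptions for this example.

```python
import math
import random
from typing import Optional, Sequence


def pick_for_review(
    rows: Sequence[dict],
    low_threshold: int = 2,         # assumed cut-off for "low" scores
    audit_fraction: float = 0.05,   # assumed share of passing rows to audit
    seed: Optional[int] = None,
) -> tuple:
    """Return (flagged, audit): all low scorers plus a random sample of the rest."""
    rng = random.Random(seed)
    flagged = [r for r in rows if r["overall"] <= low_threshold]
    passed = [r for r in rows if r["overall"] > low_threshold]
    k = min(len(passed), math.ceil(len(passed) * audit_fraction))
    audit = rng.sample(passed, k)
    return flagged, audit
```

Disagreements between reviewers and the judge on the audit sample give a rough estimate of how far the scores can be trusted.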

Closing

A small, honest judge loop turns chat traffic into structured feedback. Pair it with human spot checks and corpus hygiene; do not pretend the judge is a compliance sign-off.
