<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLM on Technical Blog</title>
    <link>https://hugo-blog-923.pages.dev/tags/llm/</link>
    <description>Recent content in LLM on Technical Blog</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://hugo-blog-923.pages.dev/tags/llm/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>LLM-as-judge for RAG: what to score, what to distrust</title>
      <link>https://hugo-blog-923.pages.dev/posts/rag-llm-judge-quality-loop/</link>
      <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://hugo-blog-923.pages.dev/posts/rag-llm-judge-quality-loop/</guid>
      <description>&lt;p align=&#34;center&#34;&gt;
  &lt;img src=&#34;pexels_12969403.jpg&#34; alt=&#34;Laptop showing an analytics-style dashboard for automated answer-quality signals.&#34; style=&#34;max-width: min(100%, 820px); height: auto;&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM-as-judge&lt;/strong&gt; adds scale when human reviewers cannot read every RAG interaction. A common pattern: after the answer path returns, the &lt;strong&gt;question, answer, and retrieved context&lt;/strong&gt; are enqueued; a &lt;strong&gt;worker Lambda&lt;/strong&gt; runs a judge prompt; results land in a &lt;strong&gt;database&lt;/strong&gt; for analytics.&lt;/p&gt;
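&lt;p&gt;A minimal sketch of the worker step, assuming an SQS-style message body and a pluggable judge call. The record shape, rubric wording, and function names here are illustrative, not a specific library's API; in production the stubbed &lt;code&gt;judge_fn&lt;/code&gt; would be a model invocation.&lt;/p&gt;

```python
# Sketch of the worker that scores one queued RAG interaction.
# The judge call is stubbed; in production it would invoke an LLM API.
import json

RUBRIC = (
    "Score the answer from 1 to 5 for faithfulness to the retrieved context. "
    "Reply with JSON: {\"score\": int, \"reason\": str}"
)

def build_judge_prompt(question, answer, context_chunks):
    """Assemble the judge prompt from one queued record."""
    joined = "\n---\n".join(context_chunks)
    return "\n\n".join([RUBRIC,
                        "Question: " + question,
                        "Answer: " + answer,
                        "Context:\n" + joined])

def score_record(record, judge_fn):
    """Run the judge on one queued message body; return a row for storage."""
    body = json.loads(record["body"])
    prompt = build_judge_prompt(body["question"], body["answer"], body["context"])
    raw = judge_fn(prompt)                         # model call in production
    verdict = json.loads(raw)
    score = max(1, min(5, int(verdict["score"])))  # clamp malformed scores
    return {"qid": body["qid"], "score": score, "reason": verdict["reason"]}
```

&lt;p&gt;Clamping the score and parsing the verdict defensively matters because judge output is itself model output and can drift out of range.&lt;/p&gt;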
&lt;h2 id=&#34;what-judges-are-good-for&#34;&gt;What judges are good for&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Use&lt;/th&gt;
          &lt;th&gt;Reason&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Trend monitoring&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Average scores or failure flags shifting after a deploy&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Sampling for humans&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Pull low-scoring rows for manual review&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Regression alarms&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Chunk size, top-k, or model changes moving the distribution&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Judges are &lt;strong&gt;cheap sensors&lt;/strong&gt;, not auditors.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Internal-docs RAG: chat ingress, vector search, and an async judge loop</title>
      <link>https://hugo-blog-923.pages.dev/posts/internal-docs-rag-architecture-notes/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://hugo-blog-923.pages.dev/posts/internal-docs-rag-architecture-notes/</guid>
      <description>&lt;p align=&#34;center&#34;&gt;
  &lt;img src=&#34;pexels_10376254.jpg&#34; alt=&#34;Overhead view of a professional at a desk with laptop, tablet, and papers — internal work and digital tools.&#34; style=&#34;max-width: min(100%, 820px); height: auto;&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;This note describes an &lt;strong&gt;internal RAG&lt;/strong&gt; pattern for &lt;strong&gt;policy and handbook-style documents&lt;/strong&gt;: employees ask questions in a familiar chat surface, the backend retrieves by &lt;strong&gt;semantic similarity&lt;/strong&gt;, and a &lt;strong&gt;separate evaluation path&lt;/strong&gt; scores answers for quality and traceability. The layout maps cleanly to typical &lt;strong&gt;AWS&lt;/strong&gt; building blocks (API Gateway, Lambdas, object storage, a vector index, DynamoDB, and a queue).&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
