LLM-as-judge for RAG: what to score, what to distrust
LLM-as-judge adds scale when human reviewers cannot read every RAG interaction. A common pattern: after the answer path returns, enqueue the question, answer, and retrieved context onto a queue; a worker Lambda runs a judge prompt over each item; results land in a database for analytics.

What judges are good for

| Use | Reason |
| --- | --- |
| Trend monitoring | Average scores or failure flags shifting after a deploy |
| Sampling for humans | Pull low-scoring rows for manual review |
| Regression alarms | Chunk size, top-k, or model changes moving the distribution |

Judges are cheap sensors, not auditors.
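A minimal sketch of the worker side of this pipeline. The prompt wording, JSON fields (`faithfulness`, `relevance`, `flag`), the 0.5 flag threshold, and the `call_model` hook are all assumptions for illustration, not a specific vendor's API; the source only specifies the queue-to-worker-to-database shape.

```python
import json

# Hypothetical judge rubric; field names and scale are assumptions.
JUDGE_PROMPT = """You are grading a RAG answer against its retrieved context.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with JSON only: {{"faithfulness": <0-1>, "relevance": <0-1>}}"""


def build_prompt(question: str, answer: str, context: str) -> str:
    return JUDGE_PROMPT.format(question=question, answer=answer, context=context)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply; anything unparseable is flagged for human review."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"faithfulness": None, "relevance": None, "flag": True}
    # Flag low-faithfulness rows so humans can sample them later.
    verdict.setdefault("flag", verdict.get("faithfulness", 1.0) < 0.5)
    return verdict


def judge_record(record: dict, call_model) -> dict:
    """Worker body: score one queued record and return a row for the analytics DB.

    `call_model` is whatever function invokes your judge model and returns its
    text completion (injected so the worker is testable without an LLM).
    """
    raw = call_model(
        build_prompt(record["question"], record["answer"], record["context"])
    )
    return {**record, **parse_verdict(raw)}
```

Injecting `call_model` keeps the worker testable with a stub, and the fail-closed parse (malformed output becomes `flag: True`) matches the "cheap sensor" framing: a broken verdict should surface for human sampling rather than silently average into the trend.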