LLM-as-judge for RAG: what to score, what to distrust

LLM-as-judge adds scale when human reviewers cannot read every RAG interaction. A common pattern: after the answer path returns, enqueue the question, answer, and retrieved context; a worker Lambda runs a judge prompt; results land in a database for analytics.

What judges are good for

| Use | Reason |
| --- | --- |
| Trend monitoring | Average scores or failure flags shifting after a deploy |
| Sampling for humans | Pull low-scoring rows for manual review |
| Regression alarms | Chunk size, top-k, or model changes moving the distribution |

Judges are cheap sensors, not auditors. ...
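The worker side of that judge loop can be sketched as below. This is a minimal illustration, not the post's actual code: the prompt template, the `build_judge_prompt`/`parse_scores` names, and the JSON score schema are all assumptions; the model call itself is elided since it depends on the provider.

```python
import json

# Hypothetical judge-prompt template; the score schema (faithfulness,
# relevance on a 0-1 scale) is an assumed example, not a standard.
JUDGE_TEMPLATE = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with JSON only: {{"faithfulness": 0-1, "relevance": 0-1}}"""


def build_judge_prompt(record: dict) -> str:
    """Fill the judge prompt from one queued interaction record
    (question, answer, and retrieved context, as enqueued by the answer path)."""
    return JUDGE_TEMPLATE.format(
        question=record["question"],
        context=record["context"],
        answer=record["answer"],
    )


def parse_scores(raw: str) -> dict:
    """Parse the judge's JSON reply; unparseable replies are flagged
    rather than guessed, so they can be routed to human review."""
    try:
        return {"ok": True, **json.loads(raw)}
    except json.JSONDecodeError:
        return {"ok": False}
```

The parse step deliberately returns a failure flag instead of raising: a judge that occasionally emits malformed JSON should feed the "sampling for humans" path, not crash the worker.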

April 7, 2026 · 2 min · Me

Internal-docs RAG: chat ingress, vector search, and an async judge loop

This note describes an internal RAG pattern for policy and handbook-style documents: employees ask questions in a familiar chat surface, the backend retrieves by semantic similarity, and a separate evaluation path scores answers for quality and traceability. The layout maps cleanly to typical AWS building blocks (API Gateway, Lambdas, object storage, a vector index, DynamoDB, and a queue). ...
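The retrieval-by-semantic-similarity step can be sketched in a few lines. This is an illustrative stand-in, not the post's implementation: it assumes chunks were embedded at ingest time, uses cosine similarity with a brute-force scan in place of a real vector index, and the `top_k` name and tuple layout are made up for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the k chunk texts most similar to the query embedding.
    `index` is a list of (chunk_text, embedding) pairs; a managed
    vector index would replace this linear scan in production."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The returned chunks then become the context passed both to the answering model and, later, to the async judge path described above.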

April 4, 2026 · 4 min · Me