<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Evaluation on Technical Blog</title>
    <link>https://hugo-blog-923.pages.dev/tags/evaluation/</link>
    <description>Recent content in Evaluation on Technical Blog</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://hugo-blog-923.pages.dev/tags/evaluation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>LLM-as-judge for RAG: what to score, what to distrust</title>
      <link>https://hugo-blog-923.pages.dev/posts/rag-llm-judge-quality-loop/</link>
      <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://hugo-blog-923.pages.dev/posts/rag-llm-judge-quality-loop/</guid>
      <description>&lt;p align=&#34;center&#34;&gt;
  &lt;img src=&#34;pexels_12969403.jpg&#34; alt=&#34;Laptop showing an analytics-style dashboard for automated answer-quality signals.&#34; style=&#34;max-width: min(100%, 820px); height: auto;&#34; /&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LLM-as-judge&lt;/strong&gt; adds scale when human reviewers cannot read every RAG interaction. A common pattern: after the answer path returns, enqueue &lt;strong&gt;question, answer, and retrieved context&lt;/strong&gt; to a queue; a &lt;strong&gt;worker Lambda&lt;/strong&gt; runs a judge prompt; results land in a &lt;strong&gt;database&lt;/strong&gt; for analytics.&lt;/p&gt;
&lt;h2 id=&#34;what-judges-are-good-for&#34;&gt;What judges are good for&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Use&lt;/th&gt;
          &lt;th&gt;Reason&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Trend monitoring&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Average scores or failure flags shifting after a deploy&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Sampling for humans&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Pull low-scoring rows for manual review&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Regression alarms&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Chunk size, top-k, or model changes moving the distribution&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Judges are &lt;strong&gt;cheap sensors&lt;/strong&gt;, not auditors.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
