This note describes an internal RAG pattern for policy and handbook-style documents: employees ask questions in a familiar chat surface, the backend retrieves by semantic similarity, and a separate evaluation path scores answers for quality and traceability. The layout maps cleanly to typical AWS building blocks (API Gateway, Lambdas, object storage, a vector index, DynamoDB, and a queue).
What “good” looks like for internal docs
Internal regulations and HR-style content are hard to find with keyword search when people do not know the exact term a document uses. A practical baseline is:
- Semantic retrieval over chunked text, not only filename or title search.
- Grounded answers that cite where the model found support (and ideally link back to the source file with a time-limited URL).
- Operational feedback so you are not flying blind after launch: some form of automated scoring plus storage you can query later.
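The "time-limited URL" in the second point maps to an S3 presigned URL. A minimal sketch, assuming the originals live in S3; the function name and TTL are illustrative, and the client is injected so it can be exercised without AWS credentials:

```python
def presigned_source_link(s3_client, bucket: str, key: str, ttl_seconds: int = 300) -> str:
    # s3_client is assumed to be boto3.client("s3"); presigned URLs let the
    # chat answer link back to the source file without making the bucket public.
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,
    )
```

A short TTL (minutes, not days) keeps leaked links from becoming a durable access path.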
End-to-end shape
At a high level:
- A messaging platform (for example a corporate chat bot) receives the user question.
- API Gateway forwards the webhook to a Lambda that orchestrates retrieval and generation.
- The Lambda embeds the question, queries a managed vector store (in the reference package this role was filled by S3 Vectors), and pulls the top chunks.
- An LLM (for example a small, cost-aware chat model) generates an answer conditioned on those chunks, with citations and optional presigned links to originals.
- The same flow logs the conversation to a database (for example DynamoDB) for audit and debugging.
- A message queue (SQS) hands off “question + answer + context” to a second Lambda that runs LLM-as-judge: binary checks (was the need met?), citation presence and plausibility, a coarse quality score, and short improvement notes. Results land back in DynamoDB.
That split between online answer and offline evaluation keeps user latency predictable while still building a corpus of labeled interactions.
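The online half of that split can be sketched as one small orchestration function. Everything here is illustrative rather than taken from the reference package: the `Chunk` shape, the `doc#chunk` citation format, and the injected callables standing in for the embedding model, the vector store query, and the LLM call:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    score: float

def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list],
    generate: Callable[[str, list], str],
    top_k: int = 5,
) -> dict:
    """Online path: embed the question, retrieve top-k chunks, generate a
    grounded answer, and return machine-readable citations for logging."""
    vector = embed(question)
    chunks = search(vector, top_k)
    answer = generate(question, chunks)
    return {
        "answer": answer,
        "citations": [f"{c.doc_id}#{c.chunk_id}" for c in chunks],
    }
```

Keeping the external calls injected like this also makes the Lambda handler trivially unit-testable before it ever touches a real vector store.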
Pipeline responsibilities (summary table)
| Layer | Responsibility |
|---|---|
| Messaging (e.g. corporate chat) | User question in, answer out |
| API Gateway + answer Lambda | Webhook, embed query, retrieve chunks, call LLM, citations / links |
| Object storage + vector index | Chunks, embeddings, similarity search |
| DynamoDB | Conversation logs, judge outputs |
| SQS + judge Lambda | Async scoring and improvement notes |
Ingestion is not an afterthought
The conversational path only works if ingestion is boring and repeatable:
- Normalize source files (PDFs, Office exports, etc.) into a pipeline that produces chunking artifacts (structured JSON or similar) in object storage.
- Run embedding generation over those chunks and upsert vectors into the vector index, keeping a stable document id and chunk id scheme so citations round-trip.
Handoff repos often ship batch scripts (PowerShell calling Lambdas, or local batch jobs) to re-vectorize after content updates. Your team should treat re-ingest on policy change as part of change management, not a one-time migration.
Why LLM-as-judge here
Human review does not scale to every interaction. A judge model is not ground truth, but it is useful for:
- Trend monitoring (average scores drifting down after a model or prompt change).
- Spot checks (surfacing low scores for human review).
- Regression detection after you change chunk size, top-k, or the base model.
Keep prompts and rubrics versioned like any other config, and assume judges can be wrong or harsh on edge cases; use them as a tripwire, not the sole KPI.
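A judge step along these lines can be sketched as follows. The rubric text, field names, and clamping ranges are assumptions, not the reference implementation; the judge LLM call is injected, and the verdict is validated before it is written anywhere, since a malformed judge reply should never corrupt the results table:

```python
import json
from typing import Callable

JUDGE_RUBRIC_VERSION = "v1"  # version the rubric like any other config

JUDGE_PROMPT = """Rate the answer. Reply with JSON only:
{{"need_met": true/false, "has_citation": true/false,
  "quality": 1-5, "notes": "<short improvement note>"}}
Question: {question}
Answer: {answer}
Context: {context}"""

def judge_interaction(payload: dict, call_llm: Callable[[str], str]) -> dict:
    """Score one 'question + answer + context' message from the queue."""
    prompt = JUDGE_PROMPT.format(**payload)
    verdict = json.loads(call_llm(prompt))
    # Coerce and clamp every field so downstream queries can trust the schema.
    return {
        "rubric_version": JUDGE_RUBRIC_VERSION,
        "need_met": bool(verdict.get("need_met", False)),
        "has_citation": bool(verdict.get("has_citation", False)),
        "quality": max(1, min(5, int(verdict.get("quality", 1)))),
        "notes": str(verdict.get("notes", ""))[:500],
    }
```

Stamping each record with `rubric_version` is what makes the trend monitoring above meaningful: a score drop after a rubric change is not the same signal as a score drop under a stable rubric.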
Platform choices to double-check
- Secrets: prefer SSM Parameter Store or Secrets Manager over environment variables in plain text.
- IAM and network: Lambdas need least-privilege access to S3, the vector service, DynamoDB streams or tables, and outbound HTTPS to the LLM vendor.
- Data residency and logging: internal policies may restrict which regions and which vendors may see chunk text or prompts.
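For the secrets point, a common Lambda-friendly pattern is to read SecureString parameters from SSM Parameter Store once per warm container and cache them. A sketch under assumed names, with the client injected for testability:

```python
from functools import lru_cache

def make_secret_getter(ssm_client):
    # ssm_client is assumed to be boto3.client("ssm"); the cache means a warm
    # Lambda container hits Parameter Store once per secret, not per request.
    @lru_cache(maxsize=32)
    def get_secret(name: str) -> str:
        # WithDecryption=True is required for SecureString parameters.
        resp = ssm_client.get_parameter(Name=name, WithDecryption=True)
        return resp["Parameter"]["Value"]
    return get_secret
```

The IAM policy for the Lambda then needs `ssm:GetParameter` on just those parameter paths, which is easier to scope than a blanket secrets grant.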
Limitations that show up in real handoffs
Packages built for transfer often say explicitly:
- The target AWS environment was not the author’s production tenant; you must re-deploy and validate IAM, VPC, and quotas.
- Bulk upload or restore scripts for object storage or vector indexes may be best-effort; verify checksums and partial failure behavior before you rely on them for DR.
- Source corpora may mix highly structured PDFs with messy folders of Word and slides; chunk quality and OCR noise will dominate perceived RAG quality more than the chat UI skin.
Closing
If you are designing or inheriting an internal-docs RAG, prioritize traceable retrieval, citation-first answers, and a durable evaluation path alongside the shiny chat entrypoint. The reference layout (chat ingress, Lambda orchestration, vector retrieval, async judge) is one proven way to get there on AWS without pretending that search alone solves organizational knowledge.