Retrieval quality in an internal policy RAG is rarely fixed by swapping the chat model first. It is usually capped by how documents enter the system: file types, chunk boundaries, stable identifiers, and a repeatable path from source object to vector index. In practice you often see batch jobs or Lambdas, object storage for artifacts, and a managed vector service wired together the same way.
## Why ingestion deserves first-class ownership
| Symptom | Often traces back to… |
|---|---|
| “Right answer, wrong section” | Chunk spans that split tables or definitions |
| “It never finds the new rule” | No re-ingest after policy updates |
| “Citations point at the wrong file” | Unstable or reused document ids |
Treat ingestion and re-ingestion as part of change management, not a one-off migration.
## A minimal pipeline mental model
- Normalize inputs — PDFs, exports from Word or slides, and mixed “policy folder” trees land in object storage with predictable prefixes.
- Chunking — Produce chunking artifacts (JSON or similar) with text, offsets or page hints, and metadata you will need for citations.
- Identifiers — Assign durable `document_id` and `chunk_id` values so answers can cite and UIs can deep-link.
- Embed — A job or Lambda turns each chunk into an embedding and upserts it into the vector index.
- Verify — Spot-check counts (files vs chunks), empty chunks, and obvious OCR garbage before you trust retrieval.
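The steps above can be sketched end to end. Everything here is illustrative, not the handoff repos' actual code: the function names, the 800-character chunk width, and the artifact fields are all assumptions.

```python
import hashlib


def stable_ids(source_key: str, chunk_index: int) -> tuple[str, str]:
    """Derive durable ids from the object key, not from ingestion order,
    so re-ingesting the same file yields the same ids."""
    document_id = hashlib.sha256(source_key.encode()).hexdigest()[:16]
    chunk_id = f"{document_id}-{chunk_index:04d}"
    return document_id, chunk_id


def chunk_text(text: str, max_chars: int = 800) -> list[dict]:
    """Naive fixed-width chunking; a real pipeline would respect
    section and table boundaries instead of raw character offsets."""
    chunks = []
    for i, start in enumerate(range(0, len(text), max_chars)):
        chunks.append({"index": i, "start_offset": start,
                       "text": text[start:start + max_chars]})
    return chunks


def build_artifacts(source_key: str, text: str) -> list[dict]:
    """Produce the per-chunk JSON-ready records the embed step consumes."""
    artifacts = []
    for chunk in chunk_text(text):
        document_id, chunk_id = stable_ids(source_key, chunk["index"])
        artifacts.append({
            "document_id": document_id,
            "chunk_id": chunk_id,
            "source_key": source_key,      # kept for citations / deep links
            "start_offset": chunk["start_offset"],
            "text": chunk["text"],
        })
    return artifacts
```

Because the ids are derived from the object key, a re-ingest after a policy update overwrites the old vectors instead of duplicating them, which is what keeps citations stable.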
Handoff repos often ship PowerShell or batch scripts that fan out over prefixes and invoke an embedding Lambda. That is enough automation if operators know when to run it.
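A minimal sketch of that fan-out, with the object listing and the embedding invoke abstracted as callables so the shape is testable without AWS. All names here are hypothetical; in the handoff repos the `embed_one` step would be a Lambda invocation.

```python
from typing import Callable, Iterable


def fan_out(prefixes: Iterable[str],
            list_keys: Callable[[str], list[str]],
            embed_one: Callable[[str], None]) -> dict[str, int]:
    """Walk each prefix, invoke the embedding step per object, and
    return per-prefix counts for the run log."""
    counts: dict[str, int] = {}
    for prefix in prefixes:
        n = 0
        for key in list_keys(prefix):
            embed_one(key)  # in practice: invoke the embedding Lambda
            n += 1
        counts[prefix] = n
    return counts
```

Returning the per-prefix counts gives operators the "log chunk counts per run" habit for free, rather than as a separate chore.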
*Figure: ingestion pipeline diagram*
## Mixed corpora are normal
Real internal drops combine clean, structured PDFs with messy folders of Office files. Expect:
- OCR noise on scans to dominate “weird retrieval” reports.
- Different chunking strategies per family of documents (optional advanced step; at minimum, be aware of the mix).
You do not need perfection on day one; you need visibility into which subtree is dragging quality down.
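One way to get that visibility, as a rough sketch: flag likely OCR garbage with a crude alphabetic-ratio heuristic and roll the rate up by top-level prefix. The 0.5 threshold and the field names are assumptions; any real corpus would need tuning.

```python
def looks_garbled(text: str, min_alpha_ratio: float = 0.5) -> bool:
    """Crude OCR-noise heuristic: too few alphabetic characters."""
    if not text.strip():
        return True
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    return alpha / len(text) < min_alpha_ratio


def garbage_rate_by_prefix(chunks: list[dict]) -> dict[str, float]:
    """Group chunks by the first path segment of their source key and
    report the share flagged as garbled, so the worst subtree is obvious."""
    totals: dict[str, list[int]] = {}
    for chunk in chunks:
        prefix = chunk["source_key"].split("/", 1)[0]
        tally = totals.setdefault(prefix, [0, 0])
        tally[0] += looks_garbled(chunk["text"])
        tally[1] += 1
    return {p: flagged / seen for p, (flagged, seen) in totals.items()}
```

A report like `{"scans": 0.8, "word": 0.05}` turns vague "weird retrieval" complaints into a concrete target: fix or quarantine the scans subtree first.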
## Operational habits that pay off
| Habit | Payoff |
|---|---|
| Version or date-stamp policy releases | Easier to correlate user complaints with corpus state |
| Run embedding batches after known content changes | Stops “the bot is wrong because it is stale” incidents |
| Log chunk counts per run | Quick regression check after tokenizer or chunk-size changes |
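The chunk-count habit can be automated with a small diff between runs. A hypothetical sketch: the 20% tolerance is chosen arbitrarily and would be tuned per corpus.

```python
def chunk_count_regressions(previous: dict[str, int],
                            current: dict[str, int],
                            tolerance: float = 0.2) -> list[str]:
    """Flag documents whose chunk count moved more than `tolerance`
    relative to the last run, or disappeared entirely -- a cheap
    regression check after tokenizer or chunk-size changes."""
    flagged = []
    for doc_id, before in previous.items():
        if before == 0:
            continue
        after = current.get(doc_id, 0)  # missing doc counts as 0 chunks
        if abs(after - before) / before > tolerance:
            flagged.append(doc_id)
    return flagged
```

Run it after every ingest and alert on a non-empty list; a document dropping from 40 chunks to 12 is almost always a parsing or chunking regression, not a real content change.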
## Closing
If chat answers feel fuzzy, inspect the path from folder to vector before you tune prompts. A boring, repeatable ingestion story beats a clever retrieval hack built on ambiguous chunk boundaries.