Retrieval quality in an internal policy RAG is rarely fixed by swapping the chat model first. It is usually capped by how documents enter the system: file types, chunk boundaries, stable identifiers, and a repeatable path from source object to vector index. In practice you often see batch jobs or Lambdas, object storage for artifacts, and a managed vector service wired together the same way.
## Why ingestion deserves first-class ownership
| Symptom | Often traces back to… |
|---|---|
| “Right answer, wrong section” | Chunk spans that split tables or definitions |
| “It never finds the new rule” | No re-ingest after policy updates |
| “Citations point at the wrong file” | Unstable or reused document ids |
Treat ingestion and re-ingestion as part of change management, not a one-off migration.
## A minimal pipeline mental model
- Normalize inputs — PDFs, exports from Word or slides, and mixed “policy folder” trees land in object storage with predictable prefixes.
- Chunking — Produce chunking artifacts (JSON or similar) with text, offsets or page hints, and metadata you will need for citations.
- Identifiers — Assign durable `document_id` and `chunk_id` values so answers can cite and UIs can deep-link.
- Embed — A job or Lambda turns each chunk into an embedding and upserts it into the vector index.
- Verify — Spot-check counts (files vs chunks), empty chunks, and obvious OCR garbage before you trust retrieval.
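The steps above can be sketched end to end. Everything here is illustrative, not the handoff repos' actual code: the function names, the 800-character chunk width, and the artifact fields are all assumptions.

```python
import hashlib


def stable_ids(source_key: str, chunk_index: int) -> tuple[str, str]:
    """Derive durable ids from the object key, not from ingestion order,
    so re-ingesting the same file yields the same ids."""
    document_id = hashlib.sha256(source_key.encode()).hexdigest()[:16]
    chunk_id = f"{document_id}-{chunk_index:04d}"
    return document_id, chunk_id


def chunk_text(text: str, max_chars: int = 800) -> list[dict]:
    """Naive fixed-width chunking; a real pipeline would respect
    section and table boundaries instead of raw character offsets."""
    chunks = []
    for i, start in enumerate(range(0, len(text), max_chars)):
        chunks.append({"index": i, "start_offset": start,
                       "text": text[start:start + max_chars]})
    return chunks


def build_artifacts(source_key: str, text: str) -> list[dict]:
    """Produce the per-chunk JSON-ready records the embed step consumes."""
    artifacts = []
    for chunk in chunk_text(text):
        document_id, chunk_id = stable_ids(source_key, chunk["index"])
        artifacts.append({
            "document_id": document_id,
            "chunk_id": chunk_id,
            "source_key": source_key,      # kept for citations / deep links
            "start_offset": chunk["start_offset"],
            "text": chunk["text"],
        })
    return artifacts
```

Because the ids are derived from the object key, a re-ingest after a policy update overwrites the old vectors instead of duplicating them, which is what keeps citations stable.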
Handoff repos often ship PowerShell or batch scripts that fan out over prefixes and invoke an embedding Lambda. That is enough automation if operators know when to run it.
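A minimal sketch of that fan-out, with the object listing and the embedding invoke abstracted as callables so the shape is testable without AWS. All names here are hypothetical; in the handoff repos the `embed_one` step would be a Lambda invocation.

```python
from typing import Callable, Iterable


def fan_out(prefixes: Iterable[str],
            list_keys: Callable[[str], list[str]],
            embed_one: Callable[[str], None]) -> dict[str, int]:
    """Walk each prefix, invoke the embedding step per object, and
    return per-prefix counts for the run log."""
    counts: dict[str, int] = {}
    for prefix in prefixes:
        n = 0
        for key in list_keys(prefix):
            embed_one(key)  # in practice: invoke the embedding Lambda
            n += 1
        counts[prefix] = n
    return counts
```

Returning the per-prefix counts gives operators the "log chunk counts per run" habit for free, rather than as a separate chore.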
*Figure: ingestion pipeline diagram*
## Mixed corpora are normal
Real internal drops combine clean, structured PDFs with messy folders of Office files. Expect:
- OCR noise on scans to dominate “weird retrieval” reports.
- Different chunking strategies per family of documents (optional advanced step; at minimum, be aware of the mix).
You do not need perfection on day one; you need visibility into which subtree is dragging quality down.
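One way to get that visibility, as a rough sketch: flag likely OCR garbage with a crude alphabetic-ratio heuristic and roll the rate up by top-level prefix. The 0.5 threshold and the field names are assumptions; any real corpus would need tuning.

```python
def looks_garbled(text: str, min_alpha_ratio: float = 0.5) -> bool:
    """Crude OCR-noise heuristic: too few alphabetic characters."""
    if not text.strip():
        return True
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    return alpha / len(text) < min_alpha_ratio


def garbage_rate_by_prefix(chunks: list[dict]) -> dict[str, float]:
    """Group chunks by the first path segment of their source key and
    report the share flagged as garbled, so the worst subtree is obvious."""
    totals: dict[str, list[int]] = {}
    for chunk in chunks:
        prefix = chunk["source_key"].split("/", 1)[0]
        tally = totals.setdefault(prefix, [0, 0])
        tally[0] += looks_garbled(chunk["text"])
        tally[1] += 1
    return {p: flagged / seen for p, (flagged, seen) in totals.items()}
```

A report like `{"scans": 0.8, "word": 0.05}` turns vague "weird retrieval" complaints into a concrete target: fix or quarantine the scans subtree first.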
## Operational habits that pay off
| Habit | Payoff |
|---|---|
| Version or date-stamp policy releases | Easier to correlate user complaints with corpus state |
| Run embedding batches after known content changes | Stops “the bot is wrong because it is stale” incidents |
| Log chunk counts per run | Quick regression check after tokenizer or chunk-size changes |
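The chunk-count habit can be automated with a small diff between runs. A hypothetical sketch: the 20% tolerance is chosen arbitrarily and would be tuned per corpus.

```python
def chunk_count_regressions(previous: dict[str, int],
                            current: dict[str, int],
                            tolerance: float = 0.2) -> list[str]:
    """Flag documents whose chunk count moved more than `tolerance`
    relative to the last run, or disappeared entirely -- a cheap
    regression check after tokenizer or chunk-size changes."""
    flagged = []
    for doc_id, before in previous.items():
        if before == 0:
            continue
        after = current.get(doc_id, 0)  # missing doc counts as 0 chunks
        if abs(after - before) / before > tolerance:
            flagged.append(doc_id)
    return flagged
```

Run it after every ingest and alert on a non-empty list; a document dropping from 40 chunks to 12 is almost always a parsing or chunking regression, not a real content change.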
## Closing
If chat answers feel fuzzy, inspect the path from folder to vector before you tune prompts. A boring, repeatable ingestion story beats a clever retrieval hack built on ambiguous chunk boundaries.