From Folders to Answers: How We Built an AI-Powered Document Intelligence System for 400,000+ Construction Files
Jun 2025 · 8 min read
The problem sounds deceptively simple: a large construction organisation has hundreds of thousands of files - drawings, PDFs, Word documents, Excel sheets, specifications - accumulated across years of pre-construction and active build phases. Engineers and project managers need to find things. Not summaries of things. Not AI-generated descriptions of things. The actual files, at their exact locations.
Before this system existed, the answer was folders. You knew roughly where something lived, or you asked someone who did, or you spent twenty minutes drilling through a directory tree hoping you remembered the naming convention correctly.
We were brought in to fix that. What we built - and what this article is about - is not just a search system. It is a full document intelligence pipeline that ingests, understands, indexes, and retrieves across 400,000+ files, running entirely on local hardware, in a construction environment where data cannot leave the premises.
The scale and the constraint of document intelligence
The corpus broke down roughly like this: around 200,000 were engineering drawing files - the kind generated by CAD tools, exported as PDFs - and the remainder were a mix of Word documents, Excel sheets, plain text files, and specifications. Pre-construction data sat alongside documentation from buildings that were newly finished. Some files had rich metadata. Many did not.
The hard constraint was locality. Everything had to run on-premises, on hardware we controlled, on a simple local network. No cloud APIs. No data leaving the building. That constraint shaped every architectural decision that followed.

Why content search fails in construction
The first instinct when building a retrieval system is to index the content of documents. Embed the text, store the vectors, retrieve by semantic similarity. This works beautifully for knowledge bases, support tickets, research papers.
It fails for engineering drawings.
Consider a query like: "Give me electrical drawings for the 4th floor of Building C in the XYZ area."
A content-based system will search inside files for words like "electrical" and "4th floor." The problem is that a drawing file's content is almost entirely graphical. The text inside it - if there is any - might be component labels, dimension annotations, a title block with a reference code. None of that tells the retrieval system that this file is an electrical drawing for the 4th floor of Building C.
Worse, a file about electrical systems in any part of the project might match just as well, because the words "electrical" and "floor" appear in thousands of files.
The insight that unlocked the system was this: in a construction project, the file path is the metadata. A path like /ProjectXYZ/Buildings/Zone1/BuildingC/Drawings/Floors/Floor4/Electrical/EL-C-04-001.pdf contains almost everything a retrieval system needs to know - area, building, floor, discipline - without reading a single byte of the file's content.
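To make the idea concrete, here is a minimal sketch of reading those fields straight out of a path. The folder conventions (BuildingX, FloorN, ZoneN, a discipline folder name) are assumptions modelled on the example path above, and parse_path is a hypothetical helper, not the production parser.

```python
import re

# Hypothetical folder conventions, modelled on the example path above.
BUILDING = re.compile(r"^Building([A-Z])$")
FLOOR = re.compile(r"^Floor(\d+)$")
ZONE = re.compile(r"^Zone(\d+)$")
DISCIPLINES = {"electrical", "structural", "mechanical", "civil"}


def parse_path(path: str) -> dict:
    """Extract retrieval fields from a project path without opening the file."""
    segments = [s for s in path.split("/") if s]
    if not segments:
        return {}
    fields = {"filename": segments[-1]}
    for segment in segments[:-1]:
        building = BUILDING.match(segment)
        floor = FLOOR.match(segment)
        zone = ZONE.match(segment)
        if building:
            fields["building"] = building.group(1)
        elif floor:
            fields["floor"] = int(floor.group(1))
        elif zone:
            fields["zone"] = zone.group(1)
        elif segment.lower() in DISCIPLINES:
            fields["discipline"] = segment.lower()
    return fields
```

Called on the example path, this yields the building, floor, zone, and discipline as structured fields. Real project trees are far less regular, which is why the rules need to be maintained rather than written once.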

Path-based indexing: the core idea
Instead of embedding what is inside a file, we embedded where it lives.
Every file's path was treated as a semantic string - cleaned, normalised, and passed through an embedding model. The resulting vectors capture the hierarchical structure of the project: discipline, building, floor, drawing type. When a user asks for "electrical drawings, 4th floor, Building C," their query embeds into a vector that is geometrically close to the paths of exactly those files.
This sounds simple. The implementation was not.
Path strings are inconsistent. Abbreviations differ across teams and phases. "Elec" and "Electrical" and "EL" all mean the same thing. Normalisation rules had to be built and maintained. We also had to decide how to handle files that did have useful content - specification documents, for example - where a pure path approach would miss relevant material.
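A minimal sketch of what such a normalisation step might look like, assuming a small hand-maintained alias table. The ALIASES entries and the normalise_path helper are illustrative only; the real rule set is not described in this article.

```python
import posixpath
import re

# Hypothetical alias table. Only the Elec / EL / Electrical equivalence
# comes from the text; a real table would be much longer.
ALIASES = {"elec": "electrical", "el": "electrical"}


def normalise_path(path: str) -> str:
    """Turn a raw file path into a clean, space-separated string for embedding."""
    stem, _ext = posixpath.splitext(path)  # drop the file extension
    # Insert a space where a lowercase letter meets an uppercase letter or digit,
    # so "BuildingC" becomes "Building C" and "Floor4" becomes "Floor 4".
    stem = re.sub(r"(?<=[a-z])(?=[A-Z0-9])", " ", stem)
    # Split on path separators, underscores, hyphens, and whitespace.
    tokens = [t.lower() for t in re.split(r"[/\\_\-\s]+", stem) if t]
    return " ".join(ALIASES.get(t, t) for t in tokens)
```

With this in place, a folder named Elec, EL, or Electrical produces the same string, so the embedding model sees one spelling per concept.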
The solution was layered:
Path embeddings as the primary retrieval signal
Content embeddings for document types where content is meaningful
Metadata filtering applied before retrieval - not after - so the search space is constrained before a single vector comparison is made
Cross-encoder reranking to rescore the top candidates and surface the single most relevant result
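The four layers above can be sketched as one function. This is a toy under stated assumptions: the vectors are supplied by the caller rather than produced by an embedding model, the record layout (path, meta, path_vec, content_vec) is hypothetical, and token_overlap is a deliberately crude stand-in for a cross-encoder, not the reranker used in the project.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(query_vec, record):
    """Best of path similarity and, where present, content similarity."""
    scores = [cosine(query_vec, record["path_vec"])]
    if record.get("content_vec") is not None:
        scores.append(cosine(query_vec, record["content_vec"]))
    return max(scores)


def token_overlap(query, path):
    """Stand-in for a cross-encoder: counts query words found in the path."""
    path_lower = path.lower()
    return sum(1 for word in set(query.lower().split()) if word in path_lower)


def retrieve(query, query_vec, filters, catalogue, rerank=token_overlap, top_k=10):
    # 1. Hard metadata filter: shrink the search space first.
    candidates = [
        rec for rec in catalogue
        if all(rec["meta"].get(key) == value for key, value in filters.items())
    ]
    # 2. Vector similarity over the survivors only.
    candidates.sort(key=lambda rec: similarity(query_vec, rec), reverse=True)
    shortlist = candidates[:top_k]
    # 3. Rerank the shortlist with the (more expensive) pairwise scorer.
    return sorted(shortlist, key=lambda rec: rerank(query, rec["path"]), reverse=True)
```

The ordering is the point: the expensive steps only ever see what the cheap step lets through.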

The metadata filter: the fastest search is the one that doesn't happen
Metadata filtering deserves its own attention because it was one of the highest-impact decisions in the project - and it was ready before the full indexing pipeline was even finished.
Every file carries attributes: discipline (structural, electrical, mechanical, civil), building identifier, floor number, drawing type, file format, date modified. Rather than letting the vector search range over all 400,000 files, we applied hard filters first. A query about electrical drawings never touches structural files. A query about Building C never retrieves Building A results.
The effect on latency was dramatic. With metadata filtering in place, search dropped from several seconds to milliseconds. The vector search is operating over hundreds of files, not hundreds of thousands. The cross-encoder reranker is scoring ten candidates, not ten thousand.
This also gave us an operational advantage: metadata filtering could be applied the moment files were discovered and catalogued, even before their content or paths were embedded. The system was partially useful from day one.
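One way to get that behaviour is an inverted index over attributes, which can be filled in as files are discovered and needs no embeddings at all. The sketch below is an assumption about how such a catalogue could be structured; MetadataIndex and its attribute names are hypothetical.

```python
from collections import defaultdict


class MetadataIndex:
    """Maps (attribute, value) pairs to the set of file ids that carry them."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._all_ids = set()

    def add(self, file_id, **attrs):
        """Catalogue a file as soon as it is discovered."""
        self._all_ids.add(file_id)
        for key, value in attrs.items():
            self._postings[(key, value)].add(file_id)

    def candidates(self, **filters):
        """Intersect the posting sets; no per-file scan, no vector maths."""
        result = set(self._all_ids)
        for key, value in filters.items():
            result &= self._postings.get((key, value), set())
        return result
```

A call such as index.candidates(discipline="electrical", building="C") returns only the ids that the vector search then needs to consider.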
The drawing problem: when the content is a picture
The hardest files in the corpus were the engineering drawings. A converted PDF of a CAD drawing is, in most cases, a large image with very little text. The title block might contain a drawing number and a revision code. The rest is lines, symbols, dimensions, and annotations that are visually meaningful to an engineer and semantically opaque to a language model.
We used vision-language models - running locally on RTX 4090 hardware - to process these files. The VLM received the converted PDF and was asked to describe what it saw: drawing type, visible components, spatial organisation, any text elements it could extract.
The output quality varied significantly. Some drawings yielded rich descriptions. Others - dense mechanical schematics with minimal labelling - produced descriptions that were not much more useful than "this appears to be an engineering drawing."
For those cases, we made a deliberate choice: if the VLM's description did not clear a usefulness threshold, we discarded it and indexed only the path-based summary. A high-confidence path embedding is more useful than a low-confidence content description. The system is honest about what it knows.
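The fallback can be sketched as follows. The usefulness score here is a crude, hypothetical heuristic (the share of words that are not generic filler), and the 0.5 threshold is an arbitrary illustration; the article does not specify how the real threshold was computed.

```python
# Hypothetical filler vocabulary; a real list would be tuned on actual VLM output.
GENERIC = {
    "this", "appears", "to", "be", "an", "a", "the", "of", "is",
    "engineering", "drawing", "image",
}


def usefulness(description: str) -> float:
    """Fraction of words in the description that are not generic filler."""
    words = [w.strip(".,;:").lower() for w in description.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    informative = [w for w in words if w not in GENERIC]
    return len(informative) / len(words)


def index_text(path_summary: str, vlm_description: str, threshold: float = 0.5) -> str:
    """Keep the VLM description only if it clears the threshold; else path only."""
    if vlm_description and usefulness(vlm_description) >= threshold:
        return f"{path_summary}\n{vlm_description}"
    return path_summary
```

A description like "This appears to be an engineering drawing." scores zero under this heuristic and is dropped, so the file is indexed by its path summary alone.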

Local infrastructure: making four machines feel like one
Running a pipeline of this complexity across four to five local machines - a mix of RTX 4090 workstations and Mac Studios - without cloud orchestration required building a scheduling system from scratch.
The naive approach would have been sequential: send a file to one machine, wait for it to finish, send the next file. At 400,000 files, that approach was projected to take close to a year.
The actual approach: every available LLM server slot on every device became a worker. Files were partitioned by project folder and dispatched across machines according to measured throughput capacity. Requests were queued, not serialised. While one machine was processing a drawing, four others were processing text documents, embedding paths, and running rerankers in parallel.
The rule was simple: no LLM should ever be idle if there is work to be done.
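As an illustration of throughput-aware dispatch, the sketch below assigns each folder to whichever machine would finish it soonest, largest folders first. This greedy rule is an assumption chosen for clarity, not a description of the scheduler we ran; the machine names and throughput figures are made up.

```python
def dispatch(folders: dict, throughput: dict) -> dict:
    """Assign folders to machines so projected finish times stay balanced.

    folders:    folder name -> number of files in it
    throughput: machine name -> measured files per second
    """
    finish_time = {machine: 0.0 for machine in throughput}
    plan = {machine: [] for machine in throughput}
    # Place the largest folders first; small ones then fill the gaps.
    for folder, n_files in sorted(folders.items(), key=lambda kv: kv[1], reverse=True):
        # Pick the machine that would complete this folder earliest.
        machine = min(
            finish_time,
            key=lambda m: finish_time[m] + n_files / throughput[m],
        )
        plan[machine].append(folder)
        finish_time[machine] += n_files / throughput[machine]
    return plan
```

For example, dispatch({"A": 300, "B": 200, "C": 100}, {"fast": 2.0, "slow": 1.0}) gives the fast machine folders A and C and the slow machine folder B. Re-running the function with fresh throughput measurements is one simple way to rebalance.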

The priority queue and the throughput problem
Not all files are equal. A drawing that a project manager needs to present tomorrow matters more than a specification archived three years ago. Not all machines are equal either - throughput varied significantly between devices depending on model size, RAM, and thermal state.
We built a priority sorter that assigned processing urgency based on folder recency, file type, and project phase. High-priority folders - active building zones, recently modified files - were processed first. The scheduler continuously re-evaluated device speeds and rebalanced the dispatch load.
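A priority sorter of this kind can be sketched with a binary heap. The scoring weights below (a 365-day bonus for active phases, a 30-day bonus for drawings) are invented for illustration, and FolderQueue is a hypothetical name; only the three inputs come from the description above.

```python
import heapq
import itertools


class FolderQueue:
    """Min-heap of folders; the lowest score is processed first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, folder, *, days_since_modified, active_phase, is_drawing):
        score = days_since_modified      # recently modified folders come first
        if active_phase:
            score -= 365                 # hypothetical bonus for active zones
        if is_drawing:
            score -= 30                  # hypothetical bonus for drawings
        heapq.heappush(self._heap, (score, next(self._counter), folder))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Because scores are plain numbers, changing the policy means changing three lines, not the queue.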
The queuing approach also solved a subtler problem: eliminating dead time between requests. Instead of waiting for one inference to complete before sending the next, every machine maintained a saturated request queue. The GPU or CPU was never waiting for a job to arrive.
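The saturation idea can be sketched with asyncio: a fixed number of worker slots pull from a shared queue, so a slot picks up its next job the moment the previous one returns. The handle coroutine stands in for a call to a local inference server; run_saturated and the slot count are hypothetical.

```python
import asyncio


async def worker(queue, handle, results):
    """One server slot: take a job, run it, immediately take the next."""
    while True:
        job = await queue.get()
        try:
            results.append((job, await handle(job)))
        except Exception as exc:  # record the failure and keep the slot alive
            results.append((job, exc))
        finally:
            queue.task_done()


async def run_saturated(jobs, handle, slots):
    """Process all jobs with `slots` concurrent workers and return the results."""
    queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)
    results = []
    workers = [asyncio.create_task(worker(queue, handle, results)) for _ in range(slots)]
    await queue.join()            # wait until every job has been handled
    for task in workers:
        task.cancel()             # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Results arrive in completion order, not submission order, so each entry carries its job alongside the outcome.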
The net result: indexing that was projected to take a year finished in approximately two months. That is roughly a sixfold reduction - not from faster hardware, but from not wasting the hardware we had.
Failure handling: when the power goes out
A pipeline running for two months across local machines will encounter hardware failures. During this project, individual devices went offline - power interruptions, thermal shutdowns, network drops. Without a recovery mechanism, a failure mid-way through a large folder would require re-processing every file in that folder.
We implemented file-level checkpointing. Every successfully processed file wrote its completion status to a local log. On startup - automatic or manual - the pipeline script read the checkpoint, identified which files were incomplete, and resumed from exactly where it left off.
No file was processed twice. No failure cascaded into a full restart. The pipeline was, in effect, resumable by design.
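File-level checkpointing of this kind can be sketched with an append-only log. The JSON Lines format, the function names, and the log layout are assumptions for illustration; the article does not describe the actual log format.

```python
import json
import os


def load_done(log_path):
    """Return the set of file paths already recorded as complete."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    done.add(json.loads(line)["path"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    # Unreadable line (e.g. torn by a crash): skip it, so the
                    # affected file is simply treated as not yet done.
                    continue
    return done


def run_with_checkpoint(files, process, log_path):
    """Process each file once, recording completion after every success."""
    done = load_done(log_path)
    with open(log_path, "a", encoding="utf-8") as log:
        for path in files:
            if path in done:
                continue  # finished in an earlier run
            process(path)
            log.write(json.dumps({"path": path}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make the record survive a power cut
```

Writing the record only after process returns means a crash mid-file leaves that file unrecorded, so the next run picks it up again.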

What the user actually sees
The interface is a chat window. There is no file browser, no filter panel, no dropdown for discipline or floor number. The user types a natural language query. The system returns file paths - ranked, precise, and fast.
"Electrical drawings for the 4th floor of Building C in XYZ area" returns the correct drawing references in milliseconds. The engineer can go directly to the file. No folder navigation. No calling a colleague who might remember the naming convention.
The shift is not just in speed. It is in who can find things. Previously, navigating the folder structure required institutional knowledge - knowing the project's naming conventions, remembering which phase files were stored under, understanding the difference between a schematic and a layout drawing. With the system, that knowledge is embedded in the index. A new team member with no project history can retrieve files as effectively as a senior engineer who has been on the project for years.

The broader architecture in one view

What this unlocks
A construction project generates documents faster than any team can manually organise them. The filing conventions that made sense in month one are strained by month six and broken by year two. By the time a project reaches handover, the accumulated file corpus is effectively unsearchable by anyone who was not present for its creation.
Path-based indexing with intelligent metadata filtering changes that equation. The structure of the file system itself becomes the primary knowledge source. The content of each file enriches the index where it can, and is honestly absent where it cannot.
The system we built is now in active use. Engineers find drawings in milliseconds. New team members are productive from day one. The institutional knowledge that used to live in one senior engineer's memory is now queryable by anyone with access to the chat interface.
And the index can be rebuilt. When new drawings are added, the pipeline processes them incrementally. The checkpoint system means a new batch can be started, interrupted, and resumed without losing progress. The corpus grows; the system keeps up.
The principles that generalised
Looking back, the decisions that mattered most were not the model choices or the hardware configuration. They were the architectural principles:
The right signal for retrieval is not always inside the document. For construction drawings, the path carries more retrieval signal than the content. Finding that principle early saved the project.
Metadata filtering is not a feature - it is a performance multiplier. Constraining the search space before vector comparison is orders of magnitude faster than constraining it after.
Local hardware is not a limitation if you treat it as a distributed system. Four machines saturated in parallel outperform one machine running sequentially by months.
Honest degradation beats false confidence. When a VLM cannot extract meaningful content from a drawing, fall back to the path. A correct path-only result is more useful than an incorrect content-based guess.
VIGA ET builds AI-native intelligence systems for industries where data stays on-premises and the stakes are high. If your organization is sitting on an unindexed file corpus that your team has stopped trusting, we would like to hear about it.
Contact: info@vigaet.com · vigaet.com



