Building a Local AI Library on Two Consumer GPUs
Author: Brian Vowell
Topic: A case study on indexing and searching thousands of technical documents with semantic search, running entirely on a home workstation.
The Problem:
We have a library of engineering and computer science PDFs sitting on a hard drive. Keyword search works for exact terms like a chip part number, but it failed every time we wanted to ask a real question. Something like “what is the tradeoff between pruning and quantization for edge inference” returned either nothing useful or thousands of unrelated hits. We wanted to ask our AI questions in plain English, have it retrieve the actual paragraphs that answer them, and then generate better responses than it could from its training alone. We wrote two Python programs to do all the work. The first reads every PDF file that we feed it and turns the text into a searchable database. The second takes a question, finds the most relevant passages, and returns them. Both run locally on a desktop PC.
This case study walks through how the system works, step-by-step, and what surprises showed up along the way.
The Hardware
The workstation is not new hardware. It runs an Intel i7-3930K (an older 6-core CPU from 2011) with 60 GB of DDR3 memory, an NVMe SSD, and two Nvidia RTX 3060 graphics cards with 12 GB of GDDR6 video memory and 3,584 CUDA compute cores in each card. The operating system is Windows 11. The total parts cost is roughly what a single high-end gaming GPU costs today. A large slice of AI publications today assume Nvidia H100 clusters and data-center budgets. We wanted to see how far a careful design on modest hardware could actually go.
Step 1: Turning PDFs into Searchable Data
The indexer program does five jobs in order. Let’s walk through each one.
Find every PDF and Skip Duplicates
The program scans a folder tree looking for PDF files. Before doing any work on a file, it computes an MD5 hash. An MD5 hash is a short fingerprint that identifies a file by its contents rather than its name. If the database already contains that hash, the file is skipped. This means that we can reorganize or rename our PDF folder and its contents without forcing the system to re-process files it has already indexed.
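A minimal sketch of hash-based skipping, assuming the set of known hashes stands in for a lookup against the database. The function names and the 1 MiB read size are illustrative choices, not taken from the indexer itself.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_new_pdfs(root, known_hashes):
    """Yield (path, hash) for PDFs under root whose contents are not indexed yet.

    known_hashes is a hypothetical stand-in for a database lookup.
    """
    for path in sorted(Path(root).rglob("*.pdf")):
        file_hash = md5_of_file(path)
        if file_hash in known_hashes:
            continue  # same bytes already seen, whatever the file is called
        known_hashes.add(file_hash)
        yield path, file_hash
```

Because the key is the content hash, two copies of the same file under different names count once, and a renamed file is recognized as already indexed.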
Extract Text from each PDF Safely
Extracting clean text from PDFs is messier than it looks. Some files are broken, and a few will crash the PDF library and take the whole program down with them. The fix is subprocess isolation. Each PDF file gets opened in its own MuPDF child process. If that process crashes, the parent catches the error, writes a note, and moves on to the next file. The parent never dies. We learned this the hard way after one malformed PDF killed a six-hour indexing run. Pages with fewer than 50 characters of extracted text are assumed to be scanned images and are logged for a later pass with an actual OCR (Optical Character Recognition) engine like Tesseract OCR or Adobe Acrobat Pro. Even without the OCR treatment, the file will still get indexed from its readable pages.
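The isolation pattern can be sketched as follows. This is an illustration under stated assumptions, not the indexer's code: it assumes PyMuPDF (imported as fitz) is installed for the child script, and the timeout, the function names, and the whitespace stripping before the 50-character check are all hypothetical additions.

```python
import json
import subprocess
import sys

# Script run in the child process. Assumes PyMuPDF is installed.
PDF_WORKER = """
import json, sys
import fitz  # PyMuPDF
with fitz.open(sys.argv[1]) as doc:
    pages = [page.get_text() for page in doc]
json.dump(pages, sys.stdout)
"""

MIN_PAGE_CHARS = 50  # pages below this are treated as scanned images


def extract_pages(pdf_path, worker_source=PDF_WORKER, timeout=120):
    """Run the worker in a child process; return page texts, or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", worker_source, str(pdf_path)],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung parser is treated like a crashed one
    if result.returncode != 0:
        return None  # the crash or exception stayed inside the child
    return json.loads(result.stdout.decode("utf-8"))


def split_readable(pages):
    """Return (readable page indexes, page indexes to log for OCR)."""
    readable, needs_ocr = [], []
    for index, text in enumerate(pages):
        # Stripping whitespace before counting is an assumption of this sketch.
        target = readable if len(text.strip()) >= MIN_PAGE_CHARS else needs_ocr
        target.append(index)
    return readable, needs_ocr
```

A None result tells the parent to write its note and move on, so even a hard crash in the PDF library costs one file rather than the whole run.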
Cut the Text into Digestible Chunks
A whole book is too long to embed as a single unit. The indexer breaks cleaned text into chunks of about 1,000 words each, with a 5% overlap between adjacent chunks. The overlap keeps ideas that straddle a chunk boundary from getting cut in half. The chunker respects paragraphs. It will not split a sentence if it can avoid it. Chunks shorter than 50 characters get discarded as noise (usually page numbers or stray headers that escaped the cleanup step).
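A simplified sketch of a paragraph-respecting chunker with overlap. It assumes paragraphs are separated by blank lines and it keeps an oversized paragraph whole rather than splitting it at sentence boundaries, so it is cruder than the chunker described above. The function and parameter names are illustrative.

```python
def chunk_text(text, target_words=1000, overlap=0.05, min_chars=50):
    """Group paragraphs into chunks of about target_words words.

    The last overlap * target_words words of each chunk are repeated at the
    start of the next one. Chunks under min_chars characters are dropped.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    overlap_words = int(target_words * overlap)
    chunks, current = [], []
    has_new = False  # True once current holds more than the carried-over tail
    for words in paragraphs:
        if has_new and len(current) + len(words) > target_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:] if overlap_words else []
            has_new = False
        current.extend(words)
        has_new = True
    if has_new:
        chunks.append(" ".join(current))
    return [chunk for chunk in chunks if len(chunk) >= min_chars]
```

With the defaults, the overlap is 50 words: the final 50 words of one chunk open the next, so an idea that straddles the boundary appears complete in at least one of them.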
Turn Chunks into Vectors on Two GPUs
Each text chunk goes through an AI model called Qwen3-Embedding-0.6B. The model converts the text into a list of 1,024 numbers. Two passages with similar meanings end up with similar number lists, even if they share no words in common. This is what lets the system find relevant content by meaning rather than by keyword match. To get decent throughput on two GPUs at once, the program uses three tricks.
First, it exports the model to ONNX format once, at startup. ONNX (Open Neural Network Exchange) is a portable model format that, in our experience, runs faster than PyTorch for pure inference work. We initially considered PyTorch, but ran into several limitations in scheduling its threads across multiple GPUs in parallel. ONNX also allows a runtime called ONNX Runtime to apply optimizations like operator fusion, which combines multiple small math operations into single fused ones.
Second, the script uses shared memory to pass data between the main program and the two GPU worker processes. Shared memory means the processes read and write the same block of RAM instead of copying data between them. On Windows, allocating shared memory is slow, so the program allocates one big block at startup and reuses it for every batch. That change alone was worth a measurable speedup.
Third, batches get handed to the two GPUs in alternating order. GPU 0 gets batch 0, 2, 4, and so on. GPU 1 gets batch 1, 3, 5. Both cards stay busy.
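The alternating hand-off of batches can be sketched in a few lines. The batch size and function names here are hypothetical. The batch index is kept alongside each batch so that results can be put back in their original order after the two GPUs finish.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def assign_round_robin(chunks, batch_size, num_gpus=2):
    """Return {gpu_id: [(batch_index, batch), ...]}, alternating between GPUs."""
    plan = {gpu: [] for gpu in range(num_gpus)}
    for index, batch in enumerate(batches(chunks, batch_size)):
        plan[index % num_gpus].append((index, batch))
    return plan


plan = assign_round_robin(list(range(10)), batch_size=2)
gpu0_batches = [index for index, _ in plan[0]]  # batches 0, 2, 4
gpu1_batches = [index for index, _ in plan[1]]  # batches 1, 3
```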
Write Everything to the Database
The results land in ChromaDB, an open-source vector database. ChromaDB stores the 1,024-number vectors along with the original text and some metadata (what file it came from, which chunk number). A special index called HNSW (Hierarchical Navigable Small World) makes searches over millions of vectors take milliseconds instead of seconds. One detail that caught us off guard: ChromaDB uses SQLite underneath, and dropping a database index before bulk inserts and rebuilding it afterward was about four times faster than inserting with the index live and active. This is a standard pattern from traditional ETL (Extract, Transform, Load) work, but we had to rediscover it the slow way after several attempts at running the indexer script.
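The drop-then-rebuild pattern looks like this in plain SQLite. The table and index names are hypothetical and are not ChromaDB's internal schema. The sketch shows only the general ETL technique, not which index the indexer actually drops.

```python
import sqlite3


def bulk_insert(conn, rows):
    """Insert rows with the secondary index dropped, then rebuild it once."""
    conn.execute("DROP INDEX IF EXISTS idx_chunks_file")
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO chunks (file_hash, chunk_no, body) VALUES (?, ?, ?)",
            rows,
        )
    # Building the index once over all rows avoids updating it per insert.
    conn.execute("CREATE INDEX idx_chunks_file ON chunks (file_hash)")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (file_hash TEXT, chunk_no INTEGER, body TEXT)")
conn.execute("CREATE INDEX idx_chunks_file ON chunks (file_hash)")
bulk_insert(conn, [("abc", i, f"chunk {i}") for i in range(1000)])
```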
Step 2: Answering a Question
The search server is a separate program. It loads all the models once and then waits for questions. It connects to Claude Desktop, our preferred AI platform, over a protocol called MCP (Model Context Protocol), which lets an AI assistant call external tools. When we ask Claude something like “how does HNSW handle deletion,” Claude sends the question to the MCP server, receives the most relevant passages, and then uses them to answer.
The server runs in two stages.
Stage 1: Cast a wide Net
The first stage is fast and approximate. Our question gets converted into the same 1,024-number format as the indexed chunks, using the same Qwen3 embedding model. ChromaDB then finds the 50 chunks whose vectors are closest to the question vector, using cosine similarity as the distance metric. This takes under 100 milliseconds even against hundreds of thousands of chunks. Why 50 and not just 10? Because this stage is fast but not very precise. The embedding model has to compress a whole paragraph into 1,024 numbers, which loses some nuance. Over-fetching gives the next stage a richer pool of candidates to work with.
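To make the distance metric concrete, here is a brute-force version of the stage-one lookup in plain Python. It scores every vector, which is exactly the work an HNSW index avoids, so it illustrates what is being computed rather than how ChromaDB does it. The names are illustrative.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=50):
    """Return the k (score, chunk_id) pairs most similar to the query."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Toy 2-dimensional vectors; the real ones have 1,024 dimensions.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k([1.0, 0.0], vectors, k=2)
```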
Stage 2: Read each Candidate Carefully
The second stage is slow and precise. A different AI model, called a reranker, reads each candidate chunk alongside our question and scores how well each chunk actually answers the question. This larger model can look at both texts at once, with full attention between every word. That is much more accurate than comparing two pre-computed vectors, but it costs more to run, so we have to be careful how many inputs we feed to it.
The reranker runs on the second GPU, isolated from the first. Both cards have 12 GB of video memory, and splitting the load keeps either one from running out of memory. The reranker is also exported to ONNX, and the graph has been surgically trimmed to output only the two scores we really care about (one for a probability of “yes, this is relevant” and the other for “no, it is not”). The original model would output a score for every word in its vocabulary, roughly 150,000 numbers per input. Slicing the graph down to only two outputs saves about 7 GB of video memory per forward pass.
After scoring all 50 candidates, the MCP server sorts them by score and returns the top 10 along with their source files and chunk numbers. The whole round trip, from question to answer, takes about two seconds on a cold cache.
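The last step, turning the two outputs into one relevance score and keeping the best ten, can be sketched as below. Combining the yes and no outputs with a two-way softmax is an assumption of this sketch. The text says only that the graph emits those two scores. The function names are illustrative.

```python
import math


def relevance(yes_logit, no_logit):
    """Two-way softmax over the yes/no logits, returning P(yes)."""
    peak = max(yes_logit, no_logit)  # subtract the max for numerical stability
    yes = math.exp(yes_logit - peak)
    no = math.exp(no_logit - peak)
    return yes / (yes + no)


def rerank(candidates, logits, keep=10):
    """candidates: chunk ids; logits: matching (yes, no) pairs from the model."""
    scored = [
        (relevance(yes, no), chunk_id)
        for chunk_id, (yes, no) in zip(candidates, logits)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]


top = rerank(["a", "b", "c"], [(0.0, 2.0), (3.0, 0.0), (1.0, 1.0)], keep=2)
```

Equal yes and no logits give a score of exactly 0.5, which is the signature of a reranker that is not really reading its input.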
Why the two Stages Matter
This retrieve-then-rerank pattern is not new. It is how many modern search systems work, from Bing to open-source RAG (Retrieval-Augmented Generation) pipelines. The idea is that fast search and accurate search pull in opposite directions. A single model that was both fast and accurate would be an engineering miracle. Splitting the job in two lets each model do what it is good at.
The first stage is a bi-encoder. It encodes the question and every document independently, so all the document vectors can be computed ahead of time and compared with simple math at query time. That makes it fast enough to scale to millions of items, but it also gives up some accuracy.
The second stage is a cross-encoder. It reads the question and a document together as a single input, which lets the model notice subtle connections between them. The scores are better, but the model has to run once per candidate, so it cannot be used across the whole database of a million chunks.
Over-fetching in stage one (50 candidates) and trimming in stage two (top 10) is a standard 5:1 ratio. The candidate pool is large enough to catch relevant content that the fast stage ranked imperfectly, but still small enough that the slow stage finishes in under two seconds.
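A back-of-the-envelope calculation shows why the cross-encoder cannot simply scan everything. It uses the figures reported in the Numbers section and assumes reranking cost scales linearly with the number of candidates, which is a simplification.

```python
RERANK_SECONDS = 1.5      # reported time to rerank 50 candidates
CANDIDATES = 50
TOTAL_CHUNKS = 1_000_000  # approximate size of the database

per_candidate = RERANK_SECONDS / CANDIDATES            # 0.03 s each
full_scan_hours = per_candidate * TOTAL_CHUNKS / 3600  # about 8.3 hours

print(f"{per_candidate * 1000:.0f} ms per candidate, "
      f"{full_scan_hours:.1f} hours to score every chunk")
```

At roughly 30 ms per candidate, scoring every chunk for a single question would take on the order of eight hours, against under two seconds for the two-stage design.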
The Mistakes that Cost us Days of Delays
The CUDA DLL search path on Windows is not always what Python expects, especially if you have more than one version of CUDA installed at the same time. Our program has to explicitly register the CUDA toolkit directory with os.add_dll_directory() before importing anything that uses the GPU. Miss that, and ONNX Runtime silently falls back to CPU, which runs about 50 times slower. We only noticed because our throughput dropped by a factor that we couldn’t initially explain without diving into the code.
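A hedged sketch of the registration step, plus a check that turns the silent CPU fallback into a loud failure. The install path is a hypothetical example and the helper names are illustrative. The ONNX Runtime calls are shown only in comments so the block runs without that package installed.

```python
import os


def register_cuda_dir(cuda_bin):
    """Add cuda_bin to the Windows DLL search path; return True on success."""
    if not hasattr(os, "add_dll_directory"):
        return False  # not Windows, nothing to register
    if not os.path.isdir(cuda_bin):
        return False  # wrong or missing toolkit directory
    os.add_dll_directory(cuda_bin)
    return True


def require_gpu(active_providers):
    """Raise instead of letting a CPU fallback go unnoticed."""
    if "CUDAExecutionProvider" not in active_providers:
        raise RuntimeError(f"GPU provider missing, got {active_providers}")


# Hypothetical install path; adjust to the CUDA version actually in use.
register_cuda_dir(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin")

# Only after registering, import GPU libraries and verify the session:
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "model.onnx",
#       providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
#   )
#   require_gpu(session.get_providers())
```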
The reranker tokenizer has to pad sequences on the left side, not on the right. The model reads its yes/no answer from the final token position in each sequence. If padding is on the right, the padding tokens are at the end, and the model reads its answer from a meaningless padding token. Every score comes back as roughly 50%. The fix is one line of code. Finding that one line took a full afternoon.
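The effect is easy to see with toy token ids. The sketch below pads a batch both ways and looks at the final position of each row. The ids and the helper are made up for illustration. With a Hugging Face tokenizer the corresponding setting would be tokenizer.padding_side = "left", though which tokenizer library the server uses is an assumption here.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded


PAD = 0
batch = [[11, 12, 13, 14], [21, 22]]

left = pad_batch(batch, PAD, side="left")
right = pad_batch(batch, PAD, side="right")

last_left = [row[-1] for row in left]    # [14, 22]: real final tokens
last_right = [row[-1] for row in right]  # [14, 0]: second row ends in padding
```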
Shared memory on Windows needs explicit cleanup. If the program dies without calling unlink() on every shared memory block, the blocks persist in the system until the next reboot. A few crashes during testing left us wondering why our available RAM kept shrinking.
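One way to make the cleanup hard to forget is to wrap the persistent block in a context manager. This is a sketch using Python's multiprocessing.shared_memory module, with a hypothetical class name and block size. It covers only the parent side, not the GPU workers.

```python
from multiprocessing import shared_memory


class SharedPool:
    """One shared block, allocated once and reused for every batch."""

    def __init__(self, size_bytes):
        self.block = shared_memory.SharedMemory(create=True, size=size_bytes)

    def write(self, payload):
        """Copy payload into the start of the block; return its length."""
        self.block.buf[:len(payload)] = payload  # overwrite in place
        return len(payload)

    def read(self, length):
        return bytes(self.block.buf[:length])

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.block.close()   # drop this process's handle
        self.block.unlink()  # request removal of the block itself


with SharedPool(1 << 20) as pool:
    for batch in (b"first batch", b"second"):
        length = pool.write(batch)
        echoed = pool.read(length)
```

The with block runs the cleanup on normal exit and on exceptions, but not if the process is killed outright. Python's documentation also notes that on Windows a block is released only once every handle to it is closed, so each worker needs its own close() call as well.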
Numbers
The current database holds 7,818 technical PDFs transformed into about a million chunks. Embedding throughput runs around 200 chunks per second on the dual RTX 3060 setup. Reranking 50 candidates against a query takes about 1.5 seconds. Query embedding takes under 100 milliseconds. End-to-end query latency is consistently under two seconds. Peak video memory during reranking sits around 8 GB out of the available 12 GB on each card. Peak system RAM during indexing runs around 18 GB out of 60 GB, most of it a large SQLite page cache that we explicitly configured to trade memory for write throughput.
Where it Goes from Here
The pipeline works well enough that it has become our default way to look things up. Asking a natural question and getting three relevant paragraphs is faster than scrolling through a 600-page PDF looking for the right section. For anyone with a pile of domain-specific documents and a couple of GPUs collecting dust, this architecture is definitely worth considering. The full source code is available from us on request.
If you are building something similar, the architectural choices that saved us the most time were: the subprocess isolation for PDF parsing, the persistent shared memory pool, and splitting the two GPUs cleanly between stage one and stage two. Everything else was solvable once those three were on the right track.
Our next case study will cover the effort we undertook to classify all these textual chunks for later visualization by subject domain.
Dive Deeper into the Technical Details:
indexer.py Technical Reference
rag-server.py Technical Reference