I’ve been building a local-only AI assistant for security analysis that uses a FAISS vector index and a local model for reasoning over parsed tool output. The current system works well, but I’m running into scaling issues as the dataset grows.
Current setup:
~356k chunks
FAISS (Flat index)
384-d MiniLM embeddings
llama-cpp-python for inference
Metadata stored in a single pickle file (~1.5GB)
Tool outputs (Nmap/YARA/Volatility/etc.) parsed into structured JSON before querying
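For a sense of scale, here is the back-of-the-envelope arithmetic for the Flat index alone. It is a rough sketch. It counts only raw float32 vector storage, plus roughly one multiply-add per vector component per query, because a Flat index compares each query against every stored vector. It ignores FAISS's own overhead, the metadata, and the model:

```python
BYTES_PER_FLOAT32 = 4
DIM = 384  # MiniLM embedding size from the setup above


def flat_index_bytes(n_chunks: int, dim: int = DIM) -> int:
    """Raw vector storage for an uncompressed float32 flat index."""
    return n_chunks * dim * BYTES_PER_FLOAT32


def multiply_adds_per_query(n_chunks: int, dim: int = DIM) -> int:
    """Approximate work for one exhaustive scan: one multiply-add per component."""
    return n_chunks * dim


for n in (356_000, 1_000_000):
    gib = flat_index_bytes(n) / 2**30
    print(f"{n:>9,} chunks: {gib:.2f} GiB of vectors, "
          f"{multiply_adds_per_query(n):,} multiply-adds per query")
```

Run as-is, it prints about 0.51 GiB and 136,704,000 multiply-adds for 356k chunks, versus about 1.43 GiB and 384,000,000 for 1M. Both figures nearly triple on the way to 1M chunks.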
Problems I’m running into:
Metadata pickle file loads entirely into RAM
No incremental indexing: I have to rebuild the FAISS index from scratch
Query performance degrades with concurrent use
Want to scale to 1M+ chunks but not sure FAISS + pickle is the right long-term architecture
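To make the pickle problem concrete: a query only needs metadata for its top-k hits, so most of a 1.5GB pickle sits in RAM unused on any given query. Below is a minimal sketch of an on-disk alternative using Python's built-in sqlite3, keyed by the same integer IDs the FAISS search returns. It assumes each chunk's metadata is a JSON-serializable dict and that vector IDs are stable integers. The table name, function names, and sample records are hypothetical:

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata table; pass a file path for on-disk storage."""
    conn = sqlite3.connect(path)
    # INTEGER PRIMARY KEY makes the chunk ID the rowid, so lookups are B-tree seeks
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunk_meta (id INTEGER PRIMARY KEY, meta TEXT NOT NULL)"
    )
    return conn


def upsert(conn: sqlite3.Connection, items) -> None:
    """Insert or replace metadata for (id, dict) pairs in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO chunk_meta (id, meta) VALUES (?, ?)",
            ((int(i), json.dumps(m)) for i, m in items),
        )


def fetch(conn: sqlite3.Connection, ids) -> list:
    """Load metadata for only the requested IDs, in the order they were given."""
    ids = [int(i) for i in ids if i >= 0]  # FAISS pads missing results with -1
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, meta FROM chunk_meta WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {i: json.loads(m) for i, m in rows}
    return [by_id[i] for i in ids if i in by_id]


# Usage with hypothetical records; in practice the IDs would come from index.search()
conn = open_store()
upsert(conn, [(0, {"tool": "nmap", "host": "10.0.0.5"}),
              (1, {"tool": "yara", "rule": "ExampleRule"})])
print(fetch(conn, [1, -1, 0]))
```

The point of the sketch is that the index and the metadata share one integer key. Metadata can then be appended or updated row by row instead of re-pickling everything.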
My questions for those who’ve scaled local or offline RAG systems:
How do you store metadata efficiently at this scale?
Is there a practical pattern for incremental FAISS updates?
Would a vector DB (Qdrant, Weaviate, Milvus) be a better fit for offline use?
Any lessons learned from running large FAISS indexes on consumer hardware?
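For reference on the incremental-update question: a flat index has no training step, so new vectors can be appended. FAISS's IndexIDMap wrapper (add_with_ids, remove_ids) attaches a stable 64-bit ID to each vector instead of relying on row position. That stable ID is what keeps an external metadata store aligned across adds and deletes. The pure-Python toy below only illustrates the idea with exhaustive inner-product search. It is not FAISS, and the class is hypothetical. It borrows FAISS's method names for readability, not their signatures:

```python
import heapq


class ToyIDMappedFlatIndex:
    """Exhaustive inner-product search over vectors tagged with stable integer IDs.

    Illustrative stand-in for an ID-mapped flat index; not the FAISS API.
    """

    def __init__(self, dim: int):
        self.dim = dim
        self._vectors: dict[int, list[float]] = {}

    def add_with_ids(self, vectors, ids) -> None:
        """Append new vectors under caller-chosen IDs; existing entries are untouched."""
        for vec, vid in zip(vectors, ids):
            if len(vec) != self.dim:
                raise ValueError(f"expected dim {self.dim}, got {len(vec)}")
            self._vectors[int(vid)] = list(vec)

    def remove_ids(self, ids) -> int:
        """Delete vectors by ID and return how many were actually present."""
        removed = 0
        for vid in ids:
            if self._vectors.pop(int(vid), None) is not None:
                removed += 1
        return removed

    def search(self, query, k: int):
        """Score every stored vector (O(N * dim)) and return the top-k (score, id) pairs."""
        scored = (
            (sum(q * v for q, v in zip(query, vec)), vid)
            for vid, vec in self._vectors.items()
        )
        return heapq.nlargest(k, scored)


index = ToyIDMappedFlatIndex(dim=3)
index.add_with_ids([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], ids=[100, 101])
index.add_with_ids([[0.7, 0.7, 0.0]], ids=[205])  # later batch: nothing is rebuilt
print(index.search([1.0, 0.0, 0.0], k=2))  # [(1.0, 100), (0.7, 205)]
index.remove_ids([100])
print(index.search([1.0, 0.0, 0.0], k=2))  # [(0.7, 205), (0.0, 101)]
```

Because the IDs never shift when vectors are added or removed, the same IDs can key the metadata store without any re-alignment step.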
Not looking for product feedback, just architectural guidance from people who’ve built similar systems.