Build Local RAG Systems
Build production-ready local RAG systems without complex infrastructure. Practical patterns for embeddings, vector search, and retrieval. Start today.
Retrieval-Augmented Generation (RAG) systems have become essential for grounding LLMs with domain-specific knowledge. But most implementations I encounter rely on heavy infrastructure stacks: dedicated vector databases, orchestration frameworks, and complex dependency chains that make local RAG development painful.
In my work building AI-powered systems at the edge, I’ve found that simpler approaches often outperform complex ones. Let me share what I’ve learned about building local RAG systems with minimal dependencies—patterns that work equally well for development and production.
Design Your Minimal RAG Stack
The core components you actually need are surprisingly simple:
- Embedding model - Convert text to vectors
- Storage layer - Persist embeddings and metadata
- Search mechanism - Find relevant vectors
- LLM integration - Generate responses with context
You don’t need Pinecone, Weaviate, or even a dedicated vector database to start. Let’s build each piece.
Generate Embeddings Locally
The first decision is whether to use external embedding APIs or run models locally. For truly local development, I prefer self-hosted models.
from sentence_transformers import SentenceTransformer
import numpy as np

class LocalEmbedder:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        # This model is 80MB and runs on CPU
        self.model = SentenceTransformer(model_name)

    def embed(self, texts):
        """Generate embeddings for a list of texts."""
        return self.model.encode(texts, convert_to_numpy=True)

    def embed_query(self, query):
        """Single query embedding."""
        return self.model.encode([query], convert_to_numpy=True)[0]

# Usage
embedder = LocalEmbedder()
docs = [
    "Kubernetes manages container orchestration",
    "Terraform provisions cloud infrastructure",
    "Docker packages applications in containers"
]
embeddings = embedder.embed(docs)
print(f"Generated {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
The all-MiniLM-L6-v2 model provides solid quality at 384 dimensions. It’s fast enough for real-time queries and small enough to run anywhere—even in CI/CD pipelines.
For production workloads with larger document sets, I’ve switched to all-mpnet-base-v2 (768 dimensions) which improves retrieval quality at the cost of 2x memory and compute.
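Switching models is a one-line change with the wrapper above. A quick sketch for comparison (both model names are published on the sentence-transformers model hub):

# Larger model: better retrieval quality, double the vector dimension
embedder_large = LocalEmbedder("all-mpnet-base-v2")
vec = embedder_large.embed_query("How does Kubernetes store state?")
print(vec.shape)  # (768,) instead of (384,)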
Store Vectors with SQLite
Here’s where most implementations overcomplicate things. You don’t need a specialized vector database for RAG—plain SQLite works beautifully for modest corpora, and the sqlite-vss extension is there if you later want indexed vector search. The store below uses nothing more than SQLite, struct, and NumPy with a brute-force cosine-similarity scan.
import sqlite3
import struct
import numpy as np

class SQLiteVectorStore:
    def __init__(self, db_path="vectors.db"):
        self.conn = sqlite3.connect(db_path)
        self._init_schema()

    def _init_schema(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY,
                content TEXT NOT NULL,
                metadata TEXT,
                embedding BLOB NOT NULL
            )
        """)
        self.conn.execute("""
            CREATE INDEX IF NOT EXISTS idx_content
            ON documents(content)
        """)
        self.conn.commit()

    def add_document(self, content, embedding, metadata=None):
        """Store document with its embedding."""
        # Convert numpy array to bytes
        embedding_bytes = struct.pack(f'{len(embedding)}f', *embedding)
        self.conn.execute(
            "INSERT INTO documents (content, embedding, metadata) VALUES (?, ?, ?)",
            (content, embedding_bytes, metadata)
        )
        self.conn.commit()

    def search(self, query_embedding, top_k=5):
        """Find most similar documents using cosine similarity."""
        cursor = self.conn.execute(
            "SELECT id, content, embedding, metadata FROM documents"
        )
        results = []
        for row in cursor:
            doc_id, content, embedding_bytes, metadata = row
            # Unpack embedding from bytes
            doc_embedding = np.array(
                struct.unpack(f'{len(embedding_bytes)//4}f', embedding_bytes)
            )
            # Cosine similarity
            similarity = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            results.append({
                'id': doc_id,
                'content': content,
                'metadata': metadata,
                'score': float(similarity)
            })
        # Sort by similarity and return top_k
        results.sort(key=lambda x: x['score'], reverse=True)
        return results[:top_k]
This approach gives you:
- No dependencies beyond the Python standard library and NumPy
- Transactional guarantees
- Portable single-file storage
- Easy backup and versioning
For datasets under 100k documents, I’ve found no meaningful performance difference compared to specialized vector databases. The linear scan through embeddings completes in milliseconds on modern hardware.
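If you want to sanity-check that claim on your own hardware, here is a rough benchmark sketch. It scores all embeddings in a single vectorized NumPy call, which is how I would batch the scan for larger corpora, rather than the per-row loop above, so treat the timing as a best case:

import time
import numpy as np

# Synthetic corpus: 100k embeddings at 384 dimensions
dim, n_docs = 384, 100_000
doc_embeddings = np.random.rand(n_docs, dim).astype("float32")
query = np.random.rand(dim).astype("float32")

start = time.perf_counter()
# Cosine similarity against every stored embedding in one shot
scores = (doc_embeddings @ query) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query)
)
top5 = np.argsort(scores)[-5:][::-1]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Scored {n_docs} documents in {elapsed_ms:.1f} ms")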
Implement Fast Semantic Search
If you need faster search for larger datasets, there are still minimal-dependency options. FAISS (Facebook AI Similarity Search) provides excellent performance with a small footprint.
import faiss
import numpy as np

class FAISSIndex:
    def __init__(self, dimension=384):
        # Use flat L2 index for simplicity
        # Switch to IVF for datasets > 100k documents
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []

    def add_documents(self, embeddings, documents):
        """Add embeddings and track corresponding documents."""
        embeddings = np.array(embeddings).astype('float32')
        self.index.add(embeddings)
        self.documents.extend(documents)

    def search(self, query_embedding, top_k=5):
        """Search for similar documents."""
        query = np.array([query_embedding]).astype('float32')
        distances, indices = self.index.search(query, top_k)
        results = []
        for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
            # FAISS pads missing results with -1, so exclude negative indices
            if 0 <= idx < len(self.documents):
                results.append({
                    'content': self.documents[idx],
                    'distance': float(dist),
                    'rank': i + 1
                })
        return results

    def save(self, path):
        """Persist index to disk."""
        faiss.write_index(self.index, path)

    def load(self, path):
        """Load index from disk."""
        self.index = faiss.read_index(path)
FAISS indexes are blazingly fast and support quantization for memory efficiency. The IndexFlatL2 implementation uses exact nearest neighbor search—no approximations that might hurt recall.
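When a corpus outgrows the flat index, the switch to IVF mentioned in the comment above looks roughly like this. The nlist and nprobe values are illustrative starting points rather than tuned recommendations, and the random training data stands in for a sample of your real embeddings:

import faiss
import numpy as np

dimension, nlist = 384, 100                  # nlist = number of coarse clusters
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# IVF indexes must be trained on representative vectors before adding data
sample = np.random.rand(10_000, dimension).astype("float32")
ivf_index.train(sample)
ivf_index.add(sample)

ivf_index.nprobe = 10                        # clusters scanned per query: recall/speed knob
distances, indices = ivf_index.search(sample[:1], 5)
print(indices)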
Build a Complete Local RAG Pipeline
Now we can build a minimal but complete RAG system:
class MinimalRAG:
    def __init__(self, embedder, vector_store):
        self.embedder = embedder
        self.vector_store = vector_store

    def ingest(self, documents):
        """Add documents to the knowledge base."""
        embeddings = self.embedder.embed(documents)
        for doc, embedding in zip(documents, embeddings):
            self.vector_store.add_document(doc, embedding)

    def retrieve(self, query, top_k=3):
        """Retrieve relevant context for a query."""
        query_embedding = self.embedder.embed_query(query)
        results = self.vector_store.search(query_embedding, top_k)
        return [r['content'] for r in results]

    def generate(self, query, llm_client):
        """Generate response with retrieved context."""
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)
        prompt = f"""Answer the question based on the context below.
Context:
{context}
Question: {query}
Answer:"""
        response = llm_client.complete(prompt)
        return response

# Usage
embedder = LocalEmbedder()
store = SQLiteVectorStore()
rag = MinimalRAG(embedder, store)

# Ingest knowledge base
knowledge_base = [
    "Kubernetes uses etcd for cluster state storage",
    "Docker containers share the host kernel",
    "Terraform state files track infrastructure resources"
]
rag.ingest(knowledge_base)

# Retrieve and generate
query = "How does Kubernetes store state?"
context = rag.retrieve(query)
print(f"Retrieved context: {context}")
This pattern forms the foundation of every local RAG system I’ve built. The implementations vary—sometimes I use ChromaDB for convenience, other times raw NumPy for extreme minimalism—but the core structure remains constant.
Optimize RAG Systems for Production
When moving to production, I focus on three areas:
1. Chunking Strategy
Document chunking significantly impacts retrieval quality. I’ve found recursive character splitting with overlap works well:
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at sentence boundary
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size // 2:
                end = start + last_period + 1
                chunk = text[start:end]
        chunks.append(chunk.strip())
        start = end - overlap
    return chunks
For code documentation, I chunk at function boundaries. For general text, 400-600 character chunks with 10% overlap provide good coverage without excessive redundancy.
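Wiring the chunker into the earlier pipeline takes a few lines; the file path here is a placeholder:

# Chunk a long document, then index the chunks like any other documents
with open("kubernetes-notes.md") as f:   # placeholder path
    raw_text = f.read()

chunks = chunk_text(raw_text, chunk_size=500, overlap=50)
rag.ingest(chunks)
print(f"Indexed {len(chunks)} chunks")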
2. Metadata Filtering
Adding metadata to documents enables filtering before similarity search:
import json

# Store with metadata
metadata = {
    'source': 'kubernetes-docs',
    'date': '2026-01-15',
    'category': 'orchestration'
}
store.add_document(content, embedding, json.dumps(metadata))

# Filter during retrieval
def search_with_filter(query_embedding, category=None):
    results = store.search(query_embedding, top_k=20)
    if category:
        results = [
            r for r in results
            if r['metadata'] and json.loads(r['metadata']).get('category') == category
        ]
    return results[:5]
This hybrid approach combines semantic search with structured filtering. It’s particularly useful for multi-tenant systems or when you need to constrain results to specific domains.
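You can also push the filter down into SQL so that only matching rows get scored at all. A sketch against the SQLiteVectorStore schema, assuming your SQLite build includes the built-in JSON functions (most modern builds do):

import struct
import numpy as np

def search_in_category(store, query_embedding, category, top_k=5):
    """Pre-filter rows via json_extract, then score only the survivors."""
    cursor = store.conn.execute(
        "SELECT id, content, embedding, metadata FROM documents "
        "WHERE json_extract(metadata, '$.category') = ?",
        (category,),
    )
    results = []
    for doc_id, content, embedding_bytes, metadata in cursor:
        doc_embedding = np.array(
            struct.unpack(f"{len(embedding_bytes)//4}f", embedding_bytes)
        )
        score = np.dot(query_embedding, doc_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
        )
        results.append({"id": doc_id, "content": content, "score": float(score)})
    return sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]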
3. Caching and Precomputation
Embeddings are expensive to generate. Cache them aggressively:
import hashlib
import sqlite3
import struct
import numpy as np

class CachedEmbedder:
    def __init__(self, embedder, cache_path="embedding_cache.db"):
        self.embedder = embedder
        self.conn = sqlite3.connect(cache_path)
        self._init_cache()

    def _init_cache(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS embedding_cache (
                text_hash TEXT PRIMARY KEY,
                embedding BLOB NOT NULL
            )
        """)
        self.conn.commit()

    def _serialize_embedding(self, embedding):
        """Convert numpy array to bytes."""
        return struct.pack(f'{len(embedding)}f', *embedding)

    def _deserialize_embedding(self, embedding_bytes):
        """Convert bytes to numpy array."""
        return np.array(struct.unpack(f'{len(embedding_bytes)//4}f', embedding_bytes))

    def embed_query(self, query):
        # Check cache
        query_hash = hashlib.sha256(query.encode()).hexdigest()
        cursor = self.conn.execute(
            "SELECT embedding FROM embedding_cache WHERE text_hash = ?",
            (query_hash,)
        )
        row = cursor.fetchone()
        if row:
            return self._deserialize_embedding(row[0])
        # Generate and cache
        embedding = self.embedder.embed_query(query)
        self.conn.execute(
            "INSERT INTO embedding_cache (text_hash, embedding) VALUES (?, ?)",
            (query_hash, self._serialize_embedding(embedding))
        )
        self.conn.commit()
        return embedding
In production, this caching reduced our embedding API costs by 70% and improved query latency by 40%.
When to Graduate to Complex Infrastructure
This minimal local RAG approach scales further than you might expect. I’ve run systems handling 50k+ documents with query latencies under 100ms using just SQLite and FAISS.
Consider graduating to dedicated infrastructure when you hit these limits:
- Document count > 1M: You’ll want distributed indexing
- Query volume > 100 QPS: Horizontal scaling becomes necessary
- Multi-tenancy requirements: Isolation and quotas need infrastructure support
- Real-time updates: Incremental indexing requires specialized systems
Even then, the patterns remain the same. You’re just swapping implementations—SQLite for Postgres with pgvector, FAISS for Pinecone, local embeddings for OpenAI. The core architecture stays intact.
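To make the swap concrete, here is a rough sketch of the same search on Postgres with pgvector via psycopg2. The connection string, table name, and dimension are placeholders, not a recommended setup:

import psycopg2

conn = psycopg2.connect("dbname=rag user=rag")   # placeholder connection details
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id BIGSERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        embedding VECTOR(384) NOT NULL
    )
""")
conn.commit()

def pg_search(query_embedding, top_k=5):
    # pgvector's <=> operator is cosine distance; smaller means more similar
    vec_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    cur.execute(
        "SELECT content, embedding <=> %s::vector AS distance "
        "FROM documents ORDER BY distance LIMIT %s",
        (vec_literal, top_k),
    )
    return cur.fetchall()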
Lessons from Building RAG Systems
The biggest lesson: start simple. Every local RAG system I’ve seen fail did so from complexity, not from insufficient features.
Other hard-won insights:
- Retrieval quality matters more than model size. Tuning your chunking strategy and metadata filtering often improves results more than upgrading to larger embedding models.
- Monitor retrieval before generation. Log what documents are retrieved for each query. When responses are wrong, it’s usually a retrieval problem, not an LLM problem.
- Version your embeddings. When you change embedding models, you need to re-index everything. Plan for this from the start with versioned storage (a minimal schema sketch follows this list).
- Test with real queries. Synthetic test data rarely captures the messiness of production queries. Build evaluation sets from actual user questions.
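For the embedding-versioning point above, the cheapest insurance is to record which model produced each vector. A minimal sketch with illustrative column names:

import sqlite3

conn = sqlite3.connect("vectors.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents_v2 (
        id INTEGER PRIMARY KEY,
        content TEXT NOT NULL,
        embedding BLOB NOT NULL,
        embedding_model TEXT NOT NULL,   -- e.g. 'all-MiniLM-L6-v2'
        embedding_dim INTEGER NOT NULL
    )
""")
conn.commit()

# After switching models, re-embed only the rows produced by the old one
stale = conn.execute(
    "SELECT id, content FROM documents_v2 WHERE embedding_model != ?",
    ("all-mpnet-base-v2",),
).fetchall()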
Try It Yourself
You can have a working local RAG system running in under 100 lines of code by combining the patterns above. The sentence-transformers, faiss-cpu, and numpy packages give you everything you need:
pip install sentence-transformers faiss-cpu numpy
Start with the minimal implementation. Add complexity only when you have concrete evidence it’s needed. Most local RAG use cases don’t require the infrastructure overhead we’ve normalized.
Your local development environment can be your production architecture. And that’s a beautiful thing.
Building AI systems that run locally and scale globally? Let’s talk about your RAG architecture and infrastructure needs.