The Retrieval Problem
Dense retrieval alone (e.g. cosine similarity over BGE embeddings) can miss exact keyword matches, which are crucial for legal documents, technical specifications, and regulatory text where specific terms matter. Purely lexical sparse retrieval (e.g. BM25) misses semantic similarity: "aircraft" and "UAV" won't match without vocabulary overlap. Learned sparse models such as SPLADE narrow that gap through term expansion, but only for terms the model has learned to expand.
The hybrid pipeline gets both: dense for semantic understanding, sparse for exact term matching, RRF fusion to combine rankings without score calibration, and cross-encoder reranking for final precision.
Dense Retrieval
BGE-large-en-v1.5: 1024-dim embeddings trained on MSMARCO + instruction tuning. Stored in a per-client Qdrant collection and queried with approximate nearest-neighbour search under cosine similarity.
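As a rough illustration of what the dense stage computes, the sketch below ranks toy 3-dimensional vectors by cosine similarity in pure Python. The vectors and document names are invented for the example. Real BGE embeddings have 1024 dimensions, and Qdrant uses an approximate index rather than the exhaustive scan shown here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_rank(query_vec: list[float], doc_vecs: dict[str, list[float]], limit: int = 20):
    """Exhaustive (non-approximate) nearest-neighbour ranking, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: -pair[1])[:limit]


# Toy vectors, invented for illustration only.
docs = {
    "uav_manual": [0.9, 0.1, 0.0],
    "tax_code": [0.0, 0.2, 0.9],
}
ranking = dense_rank([1.0, 0.0, 0.0], docs)
```

Here `ranking` lists `uav_manual` first because its vector points in nearly the same direction as the query vector.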
Sparse Retrieval
Splade_PP (SPLADE++ Efficient): learned sparse representations. Vocabulary expansion: a query containing "aircraft" can also activate related terms such as "drone", "UAV", or "aerial vehicle". Stored as Qdrant sparse vectors alongside the dense vectors.
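To make the effect of vocabulary expansion concrete, here is a minimal sketch that scores sparse vectors as a dot product over shared terms. The term weights are invented for the example and the vectors are plain dictionaries; a real SPLADE model emits weights over its tokenizer vocabulary.

```python
def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Dot product of two sparse term-weight vectors (only shared terms contribute)."""
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(weight * large[term] for term, weight in small.items() if term in large)


# Hypothetical weights, for illustration only.
literal_query = {"aircraft": 1.0}                               # exact-term query
expanded_query = {"aircraft": 1.8, "uav": 0.9, "drone": 0.7}    # after expansion
doc = {"uav": 1.5, "inspection": 1.1}                           # never says "aircraft"

literal_score = sparse_dot(literal_query, doc)
expanded_score = sparse_dot(expanded_query, doc)
```

The literal query scores zero against the document, while the expanded query matches through the shared "uav" term.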
RRF Fusion
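Reciprocal Rank Fusion at k=60. Combines dense ranking and sparse ranking without requiring score normalisation. Score(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is the 1-based position of document d in ranking i and documents missing from a ranking contribute nothing. No threshold tuning needed.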
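A small worked example may help show how the formula behaves. The sketch below applies it to two made-up rankings of document IDs; the IDs and the helper name `rrf_scores` are illustrative only.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking in which a document appears (rank is 1-based)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


dense = ["A", "B", "C"]    # made-up dense ranking, best first
sparse = ["B", "D", "A"]   # made-up sparse ranking, best first

scores = rrf_scores([dense, sparse])
fused = sorted(scores, key=lambda doc_id: -scores[doc_id])
```

With these inputs, B (ranks 2 and 1) scores 1/62 + 1/61 and edges out A (ranks 1 and 3) at 1/61 + 1/63, while C and D, each present in only one list, trail behind. Documents that both retrievers agree on rise to the top without any comparison of raw scores.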
Cross-Encoder Rerank
MiniLM-L6-v2 cross-encoder scores (query, passage) pairs directly. Top-20 candidates from RRF re-ranked. Slow but high-precision final step — only applied to pre-filtered candidates.
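The rerank step itself is small once a pair scorer is available. The sketch below takes the scorer as a parameter so it can run without downloading a model; `overlap_scorer` is a deliberately crude stand-in, not a cross-encoder. In the real pipeline the `predict` method of a sentence-transformers `CrossEncoder` would presumably be passed instead.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, passage) pair jointly and keep the top_k best passages."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: -item[1])
    return [(passage, float(score)) for passage, score in ranked[:top_k]]


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer for the example: number of shared lowercase words."""
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]


top = rerank(
    "drone licence rules",
    ["tax rules", "drone licence rules for operators", "weather"],
    overlap_scorer,
    top_k=2,
)
```

Because the scorer sees the query and passage together, its cost grows with the number of candidates, which is why it is only applied to the short fused list.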
Retrieval Pipeline Architecture
The HybridRetriever class encapsulates all four stages. The dense and sparse searches run independently of each other, their rankings are fused with RRF, and the fused list is then reranked. The cross-encoder only scores the top-20 RRF candidates, which keeps latency manageable even though it needs one model forward pass per candidate.
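```python
# platform/retrieval/pipeline.py
from fastembed import SparseTextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import NamedSparseVector, NamedVector, SparseVector
from sentence_transformers import CrossEncoder, SentenceTransformer

# ScoredPassage and rrf_fuse are defined elsewhere in the package. The vector
# names "dense" and "sparse" are assumptions and must match the collection config.


class HybridRetriever:
    def __init__(self, tenant_id: str):
        self.qdrant = QdrantClient(host="localhost")
        self.collection = f"tenant_{tenant_id}"
        self.dense_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
        self.sparse_model = SparseTextEmbedding("prithivida/Splade_PP_Efficient_v1")
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def retrieve(self, query: str, top_k: int = 5) -> list[ScoredPassage]:
        # Stage 1: dense search
        dense_vec = self.dense_model.encode(query, normalize_embeddings=True)
        dense_hits = self.qdrant.search(
            self.collection,
            query_vector=NamedVector(name="dense", vector=dense_vec.tolist()),
            limit=20,
        )
        # Stage 2: sparse search (embed() yields one embedding per input text)
        sparse_vec = next(iter(self.sparse_model.embed([query])))
        sparse_query = SparseVector(
            indices=sparse_vec.indices.tolist(), values=sparse_vec.values.tolist()
        )
        sparse_hits = self.qdrant.search(
            self.collection,
            query_vector=NamedSparseVector(name="sparse", vector=sparse_query),
            limit=20,
        )
        # Stage 3: RRF fusion, keeping the top 20 fused candidates
        candidates = rrf_fuse(dense_hits, sparse_hits, k=60)[:20]
        # Stage 4: cross-encoder rerank
        pairs = [(query, c.payload["text"]) for c in candidates]
        scores = self.reranker.predict(pairs)
        reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
        return [ScoredPassage(c, s) for c, s in reranked[:top_k]]
```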
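The latency table further down lists the dual search as parallel, whereas the retrieve method as shown issues the two Qdrant queries one after the other. One possible way to overlap them, assuming the client is safe to share across threads, is a two-worker thread pool. The helper name `search_both` is hypothetical and not part of the platform code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def search_both(dense_search: Callable[[], T], sparse_search: Callable[[], T]) -> tuple[T, T]:
    """Run two independent, I/O-bound searches concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_search)
        sparse_future = pool.submit(sparse_search)
        # result() re-raises any exception from the worker thread
        return dense_future.result(), sparse_future.result()
```

Each argument would be a zero-argument callable, for example a lambda wrapping one of the two search calls, so the helper stays independent of the Qdrant client.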
RRF fusion is implemented as a pure function with no external dependencies. It is the only place where dense and sparse result sets are merged:
```python
def rrf_fuse(hits_a: list, hits_b: list, k: int = 60) -> list:
    """Merge two ranked hit lists with Reciprocal Rank Fusion, best first."""
    scores: dict[str, float] = {}
    for hits in (hits_a, hits_b):
        for rank, hit in enumerate(hits):
            # enumerate() is 0-based; RRF ranks are 1-based, hence rank + 1
            scores[hit.id] = scores.get(hit.id, 0) + 1.0 / (k + rank + 1)
    all_hits = {h.id: h for h in hits_a + hits_b}
    return [all_hits[hit_id] for hit_id in sorted(scores, key=lambda i: -scores[i])]
```
CircuitBreaker + SQLite WAL Offline Buffer
When the Qdrant backend or the network is unavailable, the system must not return errors to clients or lose requests. CircuitBreaker counts failures: after 5 failures without an intervening success, it opens and routes requests to an offline fallback (SQLite in WAL mode with full-text search). Once the 60s recovery timeout has elapsed, the circuit half-opens and lets one request through as a probe; on success it closes.
Normal operation
All requests route to Qdrant. Failure counter increments on exceptions; resets on success.
Backend down
Requests served from SQLite WAL FTS5 index (lower precision, maintains availability). No Qdrant calls attempted.
Recovery probe
After recovery_timeout, one probe sent to Qdrant. Success → Closed. Failure → back to Open.
```python
# platform/core/resilience.py
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = "closed"
        self.failures = 0
        self.last_failure = 0.0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure > self.timeout:
                self.state = "half-open"  # let one probe through
            else:
                return fallback(*args, **kwargs) if fallback else None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            # A failed half-open probe reopens the circuit immediately
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            return fallback(*args, **kwargs) if fallback else None
        # Any success closes the circuit and resets the failure counter
        self.state = "closed"
        self.failures = 0
        return result
```
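The three states can also be read as a transition table, which is sometimes easier to audit than the control flow. The sketch below is only a summary of the behaviour described above; the event names are hypothetical labels, not identifiers from the platform code.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("closed", "success"): "closed",
    ("closed", "failure"): "closed",            # still below the failure threshold
    ("closed", "threshold_reached"): "open",
    ("open", "timeout_elapsed"): "half-open",   # recovery probe allowed
    ("half-open", "success"): "closed",
    ("half-open", "failure"): "open",
}


def next_state(state: str, event: str) -> str:
    """Return the next breaker state; unknown combinations leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

While the circuit is open and the timeout has not elapsed, no event moves it: requests simply go to the fallback.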
The offline fallback uses an SQLite FTS5 virtual table with WAL mode for concurrent reads during writes. Lower retrieval precision than Qdrant, but maintains service availability during backend outages:
```sql
CREATE VIRTUAL TABLE fts_index USING fts5(
    doc_id,
    text,
    tenant_id,
    content='documents',
    content_rowid='rowid'
);

-- WAL mode for concurrent reads during write
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
```
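For a sense of how the fallback might be queried, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the SQLite build includes FTS5 and that a plain `documents` table backs the index; the helper names and the token-quoting strategy are illustrative choices, not the platform's actual code.

```python
import sqlite3


def open_fallback_db(path: str) -> sqlite3.Connection:
    """Open (or create) the offline index in WAL mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("CREATE TABLE IF NOT EXISTS documents (doc_id TEXT, text TEXT, tenant_id TEXT)")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS fts_index USING fts5("
        "doc_id, text, tenant_id, content='documents', content_rowid='rowid')"
    )
    return conn


def add_document(conn: sqlite3.Connection, doc_id: str, text: str, tenant_id: str) -> None:
    """Insert into the content table, then mirror the row into the FTS index."""
    cur = conn.execute(
        "INSERT INTO documents (doc_id, text, tenant_id) VALUES (?, ?, ?)",
        (doc_id, text, tenant_id),
    )
    # External-content FTS5 tables are not updated automatically.
    conn.execute(
        "INSERT INTO fts_index (rowid, doc_id, text, tenant_id) VALUES (?, ?, ?, ?)",
        (cur.lastrowid, doc_id, text, tenant_id),
    )
    conn.commit()


def fts_search(conn: sqlite3.Connection, tenant_id: str, query: str, limit: int = 5):
    """Keyword search restricted to one tenant, best BM25 match first."""
    tokens = query.split()
    if not tokens:
        return []
    # Quote each token so punctuation is not parsed as FTS5 query syntax.
    match = " OR ".join('"' + t.replace('"', '""') + '"' for t in tokens)
    rows = conn.execute(
        "SELECT doc_id, text FROM fts_index "
        "WHERE fts_index MATCH ? AND tenant_id = ? ORDER BY rank LIMIT ?",
        (match, tenant_id, limit),
    )
    return rows.fetchall()
```

Unlike the per-tenant Qdrant collections, a single shared index like this relies on the `tenant_id` filter for isolation, so the filter has to be applied on every query.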
Multi-Tenancy: Per-Client Qdrant Collections
Each client gets a dedicated Qdrant collection: tenant_{client_id}. This provides hard isolation at the data layer — no cross-tenant leakage is possible through retrieval, regardless of query construction.
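A minimal sketch of how a request might be mapped to its collection, assuming the API layer has already verified the token and extracted a tenant claim. The ID pattern, the exception name, and the function name are hypothetical.

```python
import re

# Hypothetical tenant ID format; restricting it keeps collection names predictable.
_TENANT_RE = re.compile(r"[a-z0-9_-]{1,64}")


class TenantMismatch(PermissionError):
    """The verified claim does not match the requested tenant (maps to HTTP 403)."""


def collection_for(claim_tenant: str, requested_tenant: str) -> str:
    """Return the Qdrant collection name, or raise before any backend call is made."""
    if not _TENANT_RE.fullmatch(claim_tenant):
        raise ValueError(f"invalid tenant id: {claim_tenant!r}")
    if claim_tenant != requested_tenant:
        raise TenantMismatch(f"token is not valid for tenant {requested_tenant!r}")
    return f"tenant_{claim_tenant}"
```

Deriving the name from the verified claim rather than from user input means a crafted query string cannot select another client's collection.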
Isolation Guarantees
Data isolation
Client A's documents are never returned in Client B's search. Enforced at Qdrant collection level, not just query filtering.
Access control
JWT tenant claim verified against collection name at the API layer. Mismatched claims return 403 before any Qdrant call is made.
Model flexibility
Client-specific embedding configuration possible — different models for different document types (e.g. legal vs. technical).
Redis distributed lock
SET tenant:{id}:index-lock 1 NX EX 300 acquires the lock and sets its 300s expiry in one atomic command, preventing concurrent indexing race conditions when multiple upload jobs trigger simultaneously.
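The indexing lock could be wrapped roughly as follows. This is a sketch, assuming a redis-py style client whose `set` accepts `nx` and `ex` keyword arguments. It stores a random token instead of the literal 1 so that a job only releases a lock it owns, which is a deviation from the command above, and the function names are hypothetical.

```python
import uuid
from typing import Optional


def _lock_key(tenant_id: str) -> str:
    return f"tenant:{tenant_id}:index-lock"


def acquire_index_lock(client, tenant_id: str, ttl_seconds: int = 300) -> Optional[str]:
    """Try to take the per-tenant indexing lock; return a token on success, else None."""
    token = uuid.uuid4().hex
    # nx=True: only set if the key is absent; ex: expire so a crashed job cannot wedge the lock
    acquired = client.set(_lock_key(tenant_id), token, nx=True, ex=ttl_seconds)
    return token if acquired else None


def release_index_lock(client, tenant_id: str, token: str) -> bool:
    """Release the lock only if this job still holds it."""
    current = client.get(_lock_key(tenant_id))
    if isinstance(current, bytes):
        current = current.decode()
    if current != token:
        return False
    # NOTE: get-then-delete is not atomic; a production version would typically
    # do the comparison and delete in a single server-side script.
    client.delete(_lock_key(tenant_id))
    return True
```

An upload job that gets None back would skip or requeue its indexing run instead of writing concurrently.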
Dense retrieval — BGE-large-en-v1.5 + Qdrant
1024-dim embeddings, approximate nearest-neighbour search, per-client collection provisioning.
Sparse retrieval — Splade_PP + RRF fusion
Vocabulary expansion for domain terminology, Reciprocal Rank Fusion at k=60, no score calibration required.
MiniLM reranker + multi-tenant isolation
Cross-encoder reranking over top-20 RRF candidates, JWT tenant claim enforcement, Redis distributed lock.
CircuitBreaker + SQLite WAL offline buffer
3-state FSM (closed/open/half-open), FTS5 fallback for availability under backend outage, 60s recovery probe.
From Contract to Deployed RAG in Under 60 Minutes
Two shell scripts handle the full client onboarding sequence. instant-poc.sh runs a proof-of-concept against the client's sample documents without a Qdrant instance. deploy-client.sh provisions the full stack.
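```bash
#!/usr/bin/env bash
# instant-poc.sh: runs hybrid RAG on client sample docs, no infra needed
# Usage: ./instant-poc.sh /path/to/docs "your query here"
# NOTE: HybridRetriever as written connects to Qdrant on localhost; an
# infra-free run assumes it can use qdrant-client's local mode instead.
set -euo pipefail

DOCS_DIR=${1:?usage: $0 DOCS_DIR QUERY}
QUERY=${2:?usage: $0 DOCS_DIR QUERY}

pip install -q sentence-transformers qdrant-client fastembed

# Arguments go through argv and the heredoc delimiter is quoted, so the shell
# never interpolates user input into the Python source.
python3 - "$DOCS_DIR" "$QUERY" <<'EOF'
import sys
from platform.retrieval.pipeline import HybridRetriever
from platform.ingestion import ingest_directory

docs_dir, query = sys.argv[1], sys.argv[2]
r = HybridRetriever(tenant_id="poc")
ingest_directory(r, docs_dir)
for hit in r.retrieve(query, top_k=3):
    print(f"[{hit.score:.3f}] {hit.payload['text'][:200]}")
EOF
```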
```bash
#!/usr/bin/env bash
# deploy-client.sh: provision full stack for a new client tenant
# Usage: ./deploy-client.sh CLIENT_ID /path/to/docs
set -euo pipefail

CLIENT_ID=${1:?usage: $0 CLIENT_ID DOCS_DIR}
DOCS_DIR=${2:?usage: $0 CLIENT_ID DOCS_DIR}

docker compose up -d qdrant redis
python3 -m platform.ingestion.bulk_ingest \
  --tenant "$CLIENT_ID" \
  --source "$DOCS_DIR" \
  --batch-size 64

echo "Tenant $CLIENT_ID ready. Collection: tenant_${CLIENT_ID}"
echo "Endpoint: http://localhost:8000/v1/retrieve?tenant=${CLIENT_ID}"
```
Retrieval latency breakdown (p50 / p99)
| Stage | p50 | p99 | Notes |
|---|---|---|---|
| Dense embed (BGE-large) | 18ms | 34ms | CPU · batch=1 |
| Sparse embed (Splade) | 9ms | 18ms | ONNX quantized |
| Qdrant dual search | 4ms | 11ms | ANN + sparse, parallel |
| RRF fusion | <1ms | <1ms | Pure Python, O(n log n) sort |
| MiniLM rerank (20 cands) | 22ms | 41ms | CPU · cross-encoder |
| Total end-to-end | 54ms | 105ms | No cache hit |
| With Redis cache hit | 3ms | 7ms | Query hash → cached result |
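The cache row only says that a query hash maps to a cached result. One plausible way to build such a key is sketched below. The key layout, the normalisation rules, and the inclusion of `top_k` are assumptions; the tenant ID is part of the key so cached results cannot cross tenant boundaries.

```python
import hashlib


def cache_key(tenant_id: str, query: str, top_k: int = 5) -> str:
    """Build a per-tenant cache key from a normalised query (hypothetical key layout)."""
    # Collapse whitespace and case so trivially different queries share an entry.
    normalised = " ".join(query.lower().split())
    payload = f"{tenant_id}\x00{top_k}\x00{normalised}".encode("utf-8")
    return f"tenant:{tenant_id}:retrieve:{hashlib.sha256(payload).hexdigest()}"
```

Whether lowercasing is acceptable depends on the corpus: for text where case carries meaning, the normalisation step would need to be narrower.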
Need AI knowledge retrieval for your organisation?
I build RAG infrastructure that handles real documents — contracts, regulations, technical manuals — with hybrid retrieval that catches both semantic meaning and exact terminology. Available for SADC enterprise and government clients.