BRT Platform — Hybrid RAG Infrastructure

Overview

The Retrieval Problem

Dense retrieval alone (e.g. cosine similarity over BGE embeddings) can miss exact keyword matches, which are crucial for legal documents, technical specifications, and regulatory text where specific terms matter. Purely lexical sparse retrieval (e.g. BM25) misses semantic similarity: "aircraft" and "UAV" won't match without vocabulary overlap. Learned sparse models such as SPLADE narrow that gap through term expansion, but they still score on weighted terms rather than on meaning.

The hybrid pipeline gets both: dense retrieval for semantic understanding, sparse retrieval for exact term matching, Reciprocal Rank Fusion (RRF) to combine the two rankings without score calibration, and cross-encoder reranking for final precision.

Dense Retrieval

BGE-large-en-v1.5 — 1024-dim embeddings trained on MSMARCO + instruction tuning. Stored in Qdrant per-client collection. Cosine similarity search over approximate nearest neighbours.
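As a rough illustration of the scoring step, cosine similarity over L2-normalised vectors reduces to a plain dot product. The sketch below uses toy 3-dimensional vectors rather than real 1024-dim BGE embeddings, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the two unit-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # shares no component with the query

print(cosine(query, same_direction))  # approximately 1.0
print(cosine(query, orthogonal))      # 0.0
```

Vector length does not affect the score, which is why the pipeline can normalise embeddings once at encode time.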

Sparse Retrieval

Splade_PP (SPLADE++ Efficient) — learned sparse representations. Vocabulary expansion: "aircraft" expands to ["drone", "UAV", "aerial vehicle"]. Stored as Qdrant sparse vectors alongside dense.
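To make the expansion idea concrete, a sparse vector can be pictured as a map from vocabulary terms to weights, scored by a dot product over the terms both sides share. The weights below are invented for illustration and are not real SPLADE output; in Qdrant the keys would be integer token indices rather than strings.

```python
def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Dot product over the terms present in both sparse vectors."""
    # Iterate over the smaller map to keep the loop short
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


# Hypothetical document weights
doc_terms = {"uav": 1.2, "flight": 0.8, "regulation": 0.9}

# A literal query for "aircraft" shares no term with the document
literal_query = {"aircraft": 1.0}
# An expanded query (made-up expansion weights) now overlaps on "uav"
expanded_query = {"aircraft": 1.4, "uav": 0.6, "drone": 0.5}

print(sparse_dot(literal_query, doc_terms))
print(sparse_dot(expanded_query, doc_terms))
```

The literal query scores zero, while the expanded query picks up a non-zero score from the shared "uav" term.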

RRF Fusion

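Reciprocal Rank Fusion with k=60. Combines the dense and sparse rankings without requiring score normalisation. Score(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is the 1-based position of document d in list i; a list that does not contain d contributes nothing. Because only ranks are used, no score threshold needs tuning.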
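A small worked example of the formula, assuming 1-based ranks: a document that is third in both lists outscores one that is first in only one list, which is how RRF rewards agreement between retrievers. The function name is hypothetical.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum 1/(k + rank) over every list in which the document appears."""
    return sum(1.0 / (k + rank) for rank in ranks)


top_in_dense_only = rrf_score([1])   # 1/61
third_in_both = rrf_score([3, 3])    # 1/63 + 1/63

print(f"{top_in_dense_only:.4f}")  # 0.0164
print(f"{third_in_both:.4f}")      # 0.0317
```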

Cross-Encoder Rerank

MiniLM-L6-v2 cross-encoder scores (query, passage) pairs directly. Top-20 candidates from RRF re-ranked. Slow but high-precision final step — only applied to pre-filtered candidates.

Implementation

Retrieval Pipeline Architecture

The HybridRetriever class encapsulates all four stages. Dense and sparse searches run independently, their results are fused with RRF, and the fused list is then reranked. The cross-encoder only touches the top-20 RRF candidates, which keeps latency manageable even though it needs one transformer forward pass per (query, passage) pair.

platform/retrieval/pipeline.py Python

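# platform/retrieval/pipeline.py
from fastembed import SparseTextEmbedding
from qdrant_client import QdrantClient, models
from sentence_transformers import CrossEncoder, SentenceTransformer

# Assumed names of the named vectors in each tenant collection; adjust them
# to match how the collection was created.
DENSE_VECTOR, SPARSE_VECTOR = "dense", "sparse"

# ScoredPassage is the project's result type; rrf_fuse is defined below.


class HybridRetriever:
    def __init__(self, tenant_id: str):
        self.qdrant = QdrantClient(host="localhost")
        self.collection = f"tenant_{tenant_id}"
        self.dense_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
        self.sparse_model = SparseTextEmbedding("prithivida/Splade_PP_Efficient_v1")
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def retrieve(self, query: str, top_k: int = 5) -> list[ScoredPassage]:
        # Stage 1: Dense search over the normalised query embedding
        dense_vec = self.dense_model.encode(query, normalize_embeddings=True)
        dense_hits = self.qdrant.query_points(
            self.collection, query=dense_vec.tolist(),
            using=DENSE_VECTOR, limit=20, with_payload=True).points

        # Stage 2: Sparse search (embed() yields one sparse embedding per input)
        sparse_vec = next(iter(self.sparse_model.embed([query])))
        sparse_hits = self.qdrant.query_points(
            self.collection,
            query=models.SparseVector(indices=sparse_vec.indices.tolist(),
                                      values=sparse_vec.values.tolist()),
            using=SPARSE_VECTOR, limit=20, with_payload=True).points

        # Stage 3: RRF fusion
        candidates = rrf_fuse(dense_hits, sparse_hits, k=60)[:20]
        if not candidates:
            return []

        # Stage 4: Cross-encoder rerank, highest score first
        pairs = [(query, c.payload["text"]) for c in candidates]
        scores = self.reranker.predict(pairs)
        reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
        return [ScoredPassage(c, s) for c, s in reranked[:top_k]]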

RRF fusion is implemented as a pure function with no external dependencies. It is the only place where dense and sparse result sets are merged:

platform/retrieval/pipeline.py — rrf_fuse Python

def rrf_fuse(hits_a: list, hits_b: list, k: int = 60) -> list:
    """Merge two ranked hit lists with Reciprocal Rank Fusion."""
    scores: dict = {}      # point id (int or str) -> fused score
    hits_by_id: dict = {}  # point id -> hit object
    for hits in (hits_a, hits_b):
        for rank, hit in enumerate(hits, start=1):  # ranks are 1-based
            scores[hit.id] = scores.get(hit.id, 0.0) + 1.0 / (k + rank)
            hits_by_id[hit.id] = hit
    # Highest fused score first
    return [hits_by_id[i] for i in sorted(scores, key=scores.get, reverse=True)]

Resilience

CircuitBreaker + SQLite WAL Offline Buffer

When the Qdrant backend or the network is unavailable, the system should neither return errors to clients nor lose requests. CircuitBreaker counts consecutive failures: after 5 (the default failure_threshold) it opens and routes requests to an offline fallback (an SQLite database in WAL mode with FTS5 full-text search). Once recovery_timeout (60s by default) has elapsed, the circuit half-opens and lets one probe through to the backend; on success it closes, on failure it reopens.

Closed

Normal operation

All requests route to Qdrant. Failure counter increments on exceptions; resets on success.

Open

Backend down

Requests served from SQLite WAL FTS5 index (lower precision, maintains availability). No Qdrant calls attempted.

Half-open

Recovery probe

After recovery_timeout, one probe sent to Qdrant. Success → Closed. Failure → back to Open.
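The three states above can be read as a small transition table. The sketch below is only an illustration of that state machine; the event names are hypothetical labels and do not appear in the platform code.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("closed", "success"): "closed",
    ("closed", "failure"): "closed",            # still below the threshold
    ("closed", "threshold_reached"): "open",
    ("open", "timeout_elapsed"): "half-open",   # recovery_timeout has passed
    ("half-open", "success"): "closed",         # probe succeeded
    ("half-open", "failure"): "open",           # probe failed
}


def next_state(state: str, event: str) -> str:
    """Return the next state; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "closed"
for event in ["failure", "threshold_reached", "timeout_elapsed",
              "failure", "timeout_elapsed", "success"]:
    state = next_state(state, event)
    print(f"{event:>18} -> {state}")
```

The walk-through opens the circuit, fails one probe, waits again, and finally closes on a successful probe.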

platform/core/resilience.py Python

# platform/core/resilience.py
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = "closed"
        self.failures = 0          # consecutive failures since the last success
        self.last_failure = 0.0    # time.monotonic() of the latest failure
        self.threshold = failure_threshold
        self.timeout = recovery_timeout

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.last_failure > self.timeout:
                # Recovery timeout elapsed: let this call through as a probe
                self.state = "half-open"
            else:
                return fallback(*args, **kwargs) if fallback else None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            # A failed probe reopens at once; otherwise open at the threshold
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
            return fallback(*args, **kwargs) if fallback else None
        # Any success resets the counter and closes the circuit
        self.failures = 0
        self.state = "closed"
        return result

The offline fallback uses an SQLite FTS5 virtual table with WAL mode for concurrent reads during writes. Lower retrieval precision than Qdrant, but maintains service availability during backend outages:

platform/core/offline_store.sql SQL

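-- External-content FTS5 index over the documents table. Only text is
-- tokenised; doc_id and tenant_id are kept for filtering but not indexed.
-- The index has to be kept in sync with documents (e.g. via triggers).
CREATE VIRTUAL TABLE fts_index USING fts5(
    doc_id UNINDEXED, text, tenant_id UNINDEXED,
    content='documents', content_rowid='rowid'
);
-- WAL mode for concurrent reads during write
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;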
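A minimal sketch of how a fallback query against such an index might look from Python. It is simplified on purpose: it uses a standalone FTS5 table in an in-memory database rather than the external-content table above (WAL mode does not apply to in-memory databases), and the function names and sample rows are hypothetical. It assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_offline_index(rows: list[tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 index from (doc_id, text, tenant_id) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE fts_index "
        "USING fts5(doc_id UNINDEXED, text, tenant_id UNINDEXED)"
    )
    conn.executemany(
        "INSERT INTO fts_index (doc_id, text, tenant_id) VALUES (?, ?, ?)", rows
    )
    return conn


def fallback_search(conn, tenant_id: str, query: str, limit: int = 5):
    """Full-text search restricted to one tenant, best match first."""
    # Quote each term so user input is not parsed as FTS5 query syntax;
    # space-separated quoted terms are combined with an implicit AND.
    terms = ['"' + t.replace('"', '""') + '"' for t in query.split()]
    if not terms:
        return []
    cur = conn.execute(
        "SELECT doc_id, text FROM fts_index "
        "WHERE fts_index MATCH ? AND tenant_id = ? "
        "ORDER BY rank LIMIT ?",
        (" ".join(terms), tenant_id, limit),
    )
    return cur.fetchall()


conn = build_offline_index([
    ("d1", "UAV operators must hold a remote pilot licence", "acme"),
    ("d2", "Aircraft maintenance schedule for 2024", "acme"),
    ("d3", "UAV import duties and customs rules", "other"),
])
print(fallback_search(conn, "acme", "uav licence"))
```

The tenant filter is part of the SQL itself, so the fallback path keeps the same per-tenant scoping as the Qdrant path.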

Multi-Tenancy

Multi-Tenancy: Per-Client Qdrant Collections

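Each client gets a dedicated Qdrant collection: tenant_{client_id}. This provides hard isolation at the data layer: a search against one tenant's collection cannot return another tenant's documents, regardless of how the query is constructed, provided the collection name is always derived from the verified tenant identity.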

Isolation Guarantees

Data isolation

Client A's documents are never returned in Client B's search. Enforced at Qdrant collection level, not just query filtering.

Access control

JWT tenant claim verified against collection name at the API layer. Mismatched claims return 403 before any Qdrant call is made.
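A minimal sketch of that check, assuming the JWT signature and expiry have already been verified by a JWT library and that the tenant is carried in a claim named "tenant". The claim name, exception class, and function name are all hypothetical.

```python
class TenantMismatch(Exception):
    """Raised when the token's tenant does not match the requested tenant."""
    status_code = 403


def resolve_collection(claims: dict, requested_tenant: str) -> str:
    """Return the collection name, or raise before any Qdrant call is made."""
    token_tenant = claims.get("tenant")  # assumed claim name
    if not token_tenant or token_tenant != requested_tenant:
        raise TenantMismatch(f"token is not valid for tenant {requested_tenant!r}")
    # Derive the collection from the verified claim, not from user input
    return f"tenant_{token_tenant}"


print(resolve_collection({"tenant": "acme"}, "acme"))  # tenant_acme

try:
    resolve_collection({"tenant": "acme"}, "globex")
except TenantMismatch as exc:
    print(exc.status_code, exc)
```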

Model flexibility

Client-specific embedding configuration possible — different models for different document types (e.g. legal vs. technical).

Redis distributed lock

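SET tenant:{id}:index-lock 1 NX EX 300 prevents concurrent indexing race conditions when multiple upload jobs trigger simultaneously. NX makes the write succeed only if the key does not exist yet, and the 300s expiry releases the lock if a job crashes. (SETNX itself does not accept an expiry option.)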
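A sketch of acquiring that lock from Python. It assumes a client whose set() method accepts nx and ex keyword arguments, as redis-py's Redis.set does; the function name is hypothetical, and InMemoryLockStore is only a stand-in (without expiry handling) so the example runs without a Redis server. Storing a random token instead of 1 is a deviation from the command above that lets the owner be identified on release.

```python
import uuid
from typing import Optional


class InMemoryLockStore:
    """Stand-in for a Redis client; ignores expiry, for illustration only."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None  # mirrors Redis: no write when the key already exists
        self._data[name] = value
        return True


def acquire_index_lock(client, tenant_id: str, ttl_seconds: int = 300) -> Optional[str]:
    """Try to take the per-tenant indexing lock; return a token on success."""
    token = uuid.uuid4().hex
    acquired = client.set(
        f"tenant:{tenant_id}:index-lock", token, nx=True, ex=ttl_seconds
    )
    return token if acquired else None


client = InMemoryLockStore()  # with redis-py this would be a redis.Redis instance
first = acquire_index_lock(client, "acme")
second = acquire_index_lock(client, "acme")
print(first is not None)  # True: the first job gets the lock
print(second is None)     # True: the concurrent job is turned away
```

A job that fails to get the lock can skip, queue, or retry; which of these the platform does is not stated in the source.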

1. Dense retrieval: BGE-large-en-v1.5 + Qdrant

1024-dim embeddings, approximate nearest-neighbour search, per-client collection provisioning.

2. Sparse retrieval: Splade_PP + RRF fusion

Vocabulary expansion for domain terminology, Reciprocal Rank Fusion at k=60, no score calibration required.

3. MiniLM reranker + multi-tenant isolation

Cross-encoder reranking over top-20 RRF candidates, JWT tenant claim enforcement, Redis distributed lock.

4. CircuitBreaker + SQLite WAL offline buffer

3-state FSM (closed/open/half-open), FTS5 fallback for availability under backend outage, 60s recovery probe.

Delivery

From Contract to Deployed RAG in Under 60 Minutes

Two shell scripts handle the full client onboarding sequence. instant-poc.sh runs a proof-of-concept against the client's sample documents without a Qdrant instance. deploy-client.sh provisions the full stack.

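scripts/instant-poc.sh Bash

#!/usr/bin/env bash
# instant-poc.sh: runs hybrid RAG on client sample docs, no infra needed
# Usage: ./instant-poc.sh /path/to/docs "your query here"
# Run from the repository root so the local platform package is importable.
# Note: HybridRetriever as shown connects to Qdrant on localhost; a server-free
# run assumes it is pointed at qdrant-client's local mode instead.
set -euo pipefail
DOCS_DIR=${1:?usage: instant-poc.sh DOCS_DIR QUERY}
QUERY=${2:?usage: instant-poc.sh DOCS_DIR QUERY}
pip install -q sentence-transformers qdrant-client fastembed
# Pass the arguments via argv and quote the heredoc delimiter so that quotes
# or "$" in the query cannot break or alter the Python source.
python3 - "$DOCS_DIR" "$QUERY" <<'EOF'
import sys
from platform.retrieval.pipeline import HybridRetriever
from platform.ingestion import ingest_directory

docs_dir, query = sys.argv[1], sys.argv[2]
r = HybridRetriever(tenant_id="poc")
ingest_directory(r, docs_dir)
for hit in r.retrieve(query, top_k=3):
    print(f"[{hit.score:.3f}] {hit.payload['text'][:200]}")
EOF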

scripts/deploy-client.sh Bash · Docker

#!/usr/bin/env bash
# deploy-client.sh: provision full stack for a new client tenant
# Usage: ./deploy-client.sh CLIENT_ID /path/to/docs
set -euo pipefail
CLIENT_ID=${1:?usage: deploy-client.sh CLIENT_ID DOCS_DIR}
DOCS_DIR=${2:?usage: deploy-client.sh CLIENT_ID DOCS_DIR}
docker compose up -d qdrant redis
python3 -m platform.ingestion.bulk_ingest \
  --tenant "$CLIENT_ID" \
  --source "$DOCS_DIR" \
  --batch-size 64
echo "Tenant $CLIENT_ID ready. Collection: tenant_${CLIENT_ID}"
echo "Endpoint: http://localhost:8000/v1/retrieve?tenant=${CLIENT_ID}"

Retrieval latency breakdown (p50 / p99)

Stage	p50	p99	Notes
Dense embed (BGE-large)	18ms	34ms	CPU · batch=1
Sparse embed (Splade)	9ms	18ms	ONNX quantized
Qdrant dual search	4ms	11ms	ANN + sparse, parallel
RRF fusion	<1ms	<1ms	Pure Python, O(n log n) (final sort)
MiniLM rerank (20 cands)	22ms	41ms	CPU · cross-encoder
Total end-to-end	54ms	105ms	No cache hit
With Redis cache hit	3ms	7ms	Query hash → cached result
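The table only states that a query hash maps to a cached result. One way such a key could be derived is sketched below; the key layout, the whitespace and case normalisation, and the function name are all assumptions rather than the platform's actual scheme. Including the tenant in the key keeps cached results from crossing tenant boundaries.

```python
import hashlib


def cache_key(tenant_id: str, query: str, top_k: int = 5) -> str:
    """Build a deterministic cache key for one (tenant, query, top_k) request."""
    # Collapse whitespace and lowercase so trivially different spellings of
    # the same query share a cache entry (an assumed policy; case-sensitive
    # corpora may not want the lowercasing).
    normalized = " ".join(query.lower().split())
    payload = f"{tenant_id}\x00{top_k}\x00{normalized}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"tenant:{tenant_id}:retrieve:{digest}"


a = cache_key("acme", "UAV  licensing rules")
b = cache_key("acme", "uav licensing rules")
c = cache_key("globex", "uav licensing rules")
print(a == b)  # True: same query after normalisation
print(a == c)  # False: different tenant, different key
```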

Need AI knowledge retrieval for your organisation?

I build RAG infrastructure that handles real documents — contracts, regulations, technical manuals — with hybrid retrieval that catches both semantic meaning and exact terminology. Available for SADC enterprise and government clients.
