
Introduction
We shipped our first retrieval-augmented application (LLM + vector DB + metadata store) in three weeks. It felt glorious, right up until production traffic hit and everything slowed down.
Here’s what we learned the hard way: low-latency, high-recall retrieval at scale is not just about picking a vector DB. It’s an operational system with cost, index, model, and networking trade-offs you’ll wish you considered earlier.
The Trigger
At first, this looked fine… until it wasn’t. A single tenant spiked and our P95 jumped from 120ms to 800ms. The vector DB nodes started GC-ing, network egress bills ballooned, and the LLM started timing out while waiting for context.
Two immediate problems surfaced:
- Embedding calls were synchronous per request and throttled by the embedding API rate limits.
- We relied on a single global vector index that kept getting updated and reindexed without a safe migration path.
Most teams miss this: prototypes assume small data and steady traffic. Production is bursty, noisy, and unforgiving.
What We Tried
We made the naive moves first. They felt faster to iterate on, but cost us in ops:
- Compute embeddings on every request (no caching).
- Use a single monolithic vector DB index for all tenants.
- Tune a single K value for nearest neighbors and never revisit it.
All of these looked fine in dev because our dataset was tiny and traffic steady. In production, re-embedding and reindexing became blocking maintenance windows, and recall degraded when we adjusted K to address latency spikes.
What Actually Worked
We applied a set of practical changes that stabilized latency and costs. These aren’t academic — they’re things you can implement in weeks.
1) Separate concerns: precompute, store, serve
- Precompute embeddings at ingest time and persist them immutably with a version tag.
- Store metadata (raw text, pointers, version IDs) in a transactional database and keep the vector DB as an index only.
- Never compute embeddings synchronously on request unless it’s an explicit, small feature (e.g., quick feedback loop).
This reduced average request work by ~60% for us and removed embedding API throttling as a common failure mode.
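A rough sketch of what that split looks like. Here sqlite3 stands in for the transactional store, an in-memory dict stands in for the vector DB, and `embed()` is a placeholder for your embedding client; the model version string is hypothetical.

```python
import hashlib
import json
import sqlite3

EMBED_MODEL_VERSION = "text-embed-v2"  # hypothetical version tag

def embed(text: str) -> list[float]:
    # Stand-in for the embedding API call; swap in your real client here.
    return [float(b) / 255 for b in hashlib.sha256(text.encode()).digest()[:8]]

# System of record: raw text, pointers, version IDs, and the persisted vector.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chunks (
    chunk_id TEXT PRIMARY KEY, text TEXT, model_version TEXT, vector_json TEXT)""")

vector_index = {}  # stand-in for the vector DB; it only indexes, it never owns the data

def ingest(chunk_id: str, text: str) -> None:
    """Embed once at write time; persist the vector immutably with its model version."""
    vec = embed(text)
    conn.execute(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?)",
        (chunk_id, text, EMBED_MODEL_VERSION, json.dumps(vec)),
    )
    conn.commit()
    # The vector DB entry is derived data, tagged with the model version so it
    # can be rebuilt or swapped without touching the transactional store.
    vector_index[(EMBED_MODEL_VERSION, chunk_id)] = vec

ingest("doc-1#0", "First chunk of the document.")
```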
2) Version embeddings and models
- Tag each vector with an embedding model version.
- When switching models, spin up a parallel index and do a blue/green swap rather than in-place mutation.
This allows safe rollback and avoids silent quality regressions when a new embedding model changes vector topology.
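A minimal sketch of the blue/green pointer flip, assuming in-memory dicts as stand-ins for the two indexes and hypothetical model version names; a real setup would swap a config value or DNS alias rather than a Python global.

```python
from dataclasses import dataclass, field

@dataclass
class IndexHandle:
    model_version: str
    vectors: dict = field(default_factory=dict)  # id -> vector; stand-in for a real ANN index

# Old and new indexes exist side by side during the backfill; only the pointer changes.
indexes = {"text-embed-v1": IndexHandle("text-embed-v1"),
           "text-embed-v2": IndexHandle("text-embed-v2")}
active_version = "text-embed-v1"

def search(query_vec, k=10):
    # Readers always resolve the pointer first, so a swap or rollback is one assignment.
    idx = indexes[active_version]
    scored = sorted(idx.vectors.items(),
                    key=lambda kv: sum((a - b) ** 2 for a, b in zip(kv[1], query_vec)))
    return [doc_id for doc_id, _ in scored[:k]]

def promote(new_version: str):
    """Blue/green cutover: flip the pointer once the parallel index is fully backfilled."""
    global active_version
    assert new_version in indexes, "backfill the new index before promoting it"
    active_version = new_version  # the old index stays warm for instant rollback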
3) Hybrid retrieval (sparse + dense)
- Use fast sparse retrieval (e.g., BM25, the default lexical scoring in Elasticsearch) to shortlist candidates, then rerank the shortlist with the dense/ANN stage.
- This cuts ANN query volume and keeps retrieval latency predictable.
We cut ANN queries by 30–50% with no measurable recall loss by running a 200–500 candidate sparse shortlist before the ANN rerank.
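Here is the shape of that read path as a sketch. A simple term-overlap scorer stands in for BM25, and the rerank is exact cosine similarity over the shortlist; it assumes numpy and a `vectors` dict of precomputed document embeddings.

```python
import numpy as np

def sparse_shortlist(query_terms, docs, n=300):
    """Stand-in for BM25: score by term overlap and keep the top-n candidate ids."""
    scores = [(sum(t in d["terms"] for t in query_terms), d["id"]) for d in docs]
    scores.sort(reverse=True)
    return [doc_id for _, doc_id in scores[:n]]

def dense_rerank(query_vec, candidate_ids, vectors, k=10):
    """Exact cosine scoring over a few hundred candidates is cheap; no full ANN sweep."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for cid in candidate_ids:
        v = vectors[cid]
        scored.append((float(q @ (v / np.linalg.norm(v))), cid))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```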
4) Tune ANN hyperparameters and shard thoughtfully
- For HNSW: tune ef/efConstruction; higher ef improves recall but raises query cost.
- For IVF/OPQ: size the number of centroids to your corpus and expect periodic rebuilds as data grows.
Sharding by tenant or logical domain keeps hot tenants from poisoning latency for others. We ended up with tenant-local shards plus a cold global archive.
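For example, with hnswlib (one ANN library that exposes these knobs directly), the build-time and query-time parameters look like this; the numbers are illustrative, not recommendations.

```python
import hnswlib
import numpy as np

dim, n = 384, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# Build-time knobs: ef_construction and M trade index quality/size for build cost.
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

# Query-time knob: higher ef improves recall but raises per-query latency.
index.set_ef(128)
labels, distances = index.knn_query(data[:5], k=10)
```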
5) Cache aggressively (but coherently)
- Cache embeddings for popular queries and textual chunks at the application layer.
- Use an LRU with TTL and invalidate based on content version.
This reduced repeated embedding reads and smoothed P99 spikes during traffic bursts.
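A minimal version of that cache, keyed on the (model version, text) pair so a model or content version bump misses naturally instead of serving stale embeddings; sizes and TTLs here are placeholders.

```python
import time
from collections import OrderedDict

class VersionedTTLCache:
    """Small LRU + TTL cache for embeddings, keyed by (content/model version, text)."""

    def __init__(self, max_items=10_000, ttl_seconds=3600):
        self.max_items, self.ttl = max_items, ttl_seconds
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, version, text):
        key = (version, text)
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or missing
            return None
        self._store.move_to_end(key)  # LRU touch
        return entry[1]

    def put(self, version, text, value):
        key = (version, text)
        self._store[key] = (time.monotonic() + self.ttl, value)
        self._store.move_to_end(key)
        while len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict least recently used
```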
6) Backpressure and graceful degradation
- Implement overall latency budgets (e.g., retrieval must finish in 250ms or skip rerank).
- Return lower-fidelity responses under overload: fewer docs, smaller context, fallback prompts.
Graceful degradation was the difference between an outage and a degraded but useful service.
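A sketch of the budget check; the 250ms figure comes from the example above, and the fallback sizes are illustrative values rather than recommendations.

```python
import time

RETRIEVAL_BUDGET_S = 0.250  # overall retrieval budget; skip the rerank if we blow through it

def retrieve_with_budget(query, shortlist_fn, rerank_fn, full_k=20, degraded_k=5):
    start = time.monotonic()
    candidates = shortlist_fn(query)

    remaining = RETRIEVAL_BUDGET_S - (time.monotonic() - start)
    if remaining <= 0:
        # Over budget already: return a smaller, unreranked context instead of failing.
        return candidates[:degraded_k], "degraded"

    # Budget left: run the expensive rerank for full quality.
    return rerank_fn(query, candidates)[:full_k], "full"
```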
7) Measure the right signals
Focus on these in your dashboards:
- Query QPS, tail latency (P95/P99), and CPU/GPU utilization per shard
- Recall@K and rerank lift (semantic correctness, not just similarity distance)
- Embedding API errors/rate-limits and vector DB index build times
- Cost-per-query including embedding API, vector DB ops, and LLM token costs
We missed embedding drift for months because we only watched latency and cost.
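Recall@K is cheap to compute once you have even a small set of labeled (query, relevant chunk) pairs; a minimal sketch:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return None  # nothing labeled for this query; skip it in the aggregate
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Example: 2 of 3 labeled-relevant chunks made it into the top 10.
print(recall_at_k(["c4", "c9", "c1", "c7"], ["c1", "c4", "c8"], k=10))  # ~0.67
```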
Trade-offs (real decisions we made)
- Using managed vector DBs (Pinecone/Weaviate) saved ops time but limited low-level tuning. We opted for managed during early growth, then moved hot indexes to self-hosted Milvus for fine-grained performance tuning.
- Hybrid read path complexity vs latency: adding a sparse shortlist increased code complexity but reduced ANN load and cost.
- Replication vs shards: replicas improve tail latency but increase write costs and reindex complexity. We chose tenant sharding with selective replica placement for high-SLA tenants.
Mistakes to Avoid
- Embedding on read for every request. This kills throughput and exposes you to external API variability.
- Treating the vector DB like a traditional key-value store. Index updates, rebuilds, and consistency semantics are different.
- Not planning index migrations. Rebuilding a 100M-vector index can take hours and needs a migration strategy.
- Ignoring dimensionality and distance metric choices. Cosine vs dot-product trade-offs affect ANN behavior and cost.
Operational Checklist (quick)
- Precompute and version embeddings.
- Shortlist with sparse retrieval before ANN.
- Shard by tenant/domain and plan index migrations.
- Cache popular embeddings and queries with coherent invalidation.
- Set latency budgets and graceful degradation paths.
- Monitor both infra metrics and retrieval quality metrics.
Final Takeaway
Running LLM + vector DB systems in production is a systems problem as much as a modeling problem. If you treat it like a single-component feature, it will surprise you.
Most teams miss embedding versioning, index migration planning, and realistic tail-latency engineering until they’ve paid for it in incidents and bills.
If you take one thing away: design for incremental growth — shards, versions, and graceful degradation — before you hit the scale where those decisions become painful to change.
We still make mistakes, but these patterns turned outages into predictable maintenance. If you want, I can share a checklist or a reference architecture we ended up using in production.