
Introduction
We hit a wall when our realtime system, used for collaboration, notifications, and early-stage AI agent orchestration, started dropping messages under load.
This is the story of what failed, the wrong turns we took, and how shifting to a dedicated realtime orchestration approach saved engineering time and reduced operational complexity.
The Trigger
Users started seeing intermittent duplicate messages and long tail latencies during peak periods.
It wasn’t one clean failure—issues were coming from connection churn, uneven shard distribution, and a brittle homegrown pub/sub mesh built on top of Redis and sticky sessions.
What We Tried
At first, the plan looked reasonable:
- Sticky sessions to route websocket clients to the same pod.
- Redis pub/sub for fan-out across pods.
- A dispatcher container that managed subscriptions and did ephemeral routing.
That model worked fine for months. Then traffic doubled, and connection churn spiked when a mobile client began reconnecting aggressively.
What broke:
- Redis pub/sub proved lossy during failover windows: delivery is fire-and-forget, and our volatile subscriptions missed anything published while they were being re-established.
- Sticky sessions created hotspots: some pods had far more active connections and hit CPU/network limits.
- The dispatcher became a single logical layer of complexity—scaling it meant more operational overhead and more failure modes.
We also made assumptions that bit us:
- Assumption: “Redis pub/sub is reliable enough if we have persistence elsewhere.” Wrong—under load, messages were missed and our retry logic caused duplicates.
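- Assumption: “Kubernetes service sticky session + HPA is enough for socket workloads.” Wrong: Kubernetes load balancing picks a pod when a connection is opened and never rebalances it, so long-lived TCP websocket connections stay unevenly distributed, and pods added by the HPA only receive new connections.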
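A toy simulation makes the second assumption easier to see. This is a sketch, not our production code: the pod names and connection counts are made up, and it assumes plain round-robin assignment at connect time, which simplifies what a Kubernetes Service actually does.

```python
def simulate_scale_out(initial_pods: int, connections: int,
                       added_pods: int, new_connections: int) -> dict[str, int]:
    """Assign sockets round-robin at connect time and never move them afterwards."""
    load = {f"pod-{i}": 0 for i in range(initial_pods)}

    # Phase 1: existing clients connect while only the initial pods exist.
    for c in range(connections):
        load[f"pod-{c % initial_pods}"] += 1

    # Phase 2: the autoscaler adds pods; established sockets stay where they are.
    total_pods = initial_pods + added_pods
    for i in range(initial_pods, total_pods):
        load[f"pod-{i}"] = 0

    # Only brand-new connections are spread across the larger pool.
    for c in range(new_connections):
        load[f"pod-{c % total_pods}"] += 1
    return load


# Hypothetical numbers: 4 pods with 4000 sockets, scaled out to 8, then 800 new sockets.
load = simulate_scale_out(initial_pods=4, connections=4000,
                          added_pods=4, new_connections=800)
print(max(load.values()), min(load.values()))  # 1100 100
```

Under these assumptions the original pods carry eleven times the connections of the new ones even after scaling out, which is the kind of hotspot described above.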
The Architecture Shift
We stopped trying to bolt reliability onto Redis + sticky sessions and instead focused on two goals:
- Reliable fan-out + ordering where it matters (not every message needs strict ordering).
- Reduced surface area of bespoke orchestration code so we could move faster on product features.
Key changes:
- Move ephemeral subscription routing out of application code and into a realtime orchestration layer.
- Use an event-driven backend for durable event streaming (Kafka-style semantics) for business-critical events, and a separate low-latency pub/sub for realtime UI updates.
- Separate concerns: connection handling, event durability, and AI workflow orchestration.
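The split between durable and low-latency events can be sketched as a small classifier in front of two sinks. Everything named here is hypothetical: the event type strings, the `TwoPathRouter` class, and the in-memory lists that stand in for a stream producer and a realtime publish call. It also assumes each event goes to exactly one path, which is a simplification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class Path(Enum):
    DURABLE = "durable"      # event stream with replay (Kafka-style semantics)
    EPHEMERAL = "ephemeral"  # low-latency pub/sub for realtime UI updates


# Hypothetical classification: the type names are illustrative only.
DURABLE_TYPES = {"message.sent", "audit.recorded"}


@dataclass
class Event:
    type: str
    payload: dict[str, Any]


@dataclass
class TwoPathRouter:
    # In-memory stand-ins for the real sinks (assumption for this sketch).
    durable_log: list[Event] = field(default_factory=list)
    realtime_out: list[Event] = field(default_factory=list)

    def classify(self, event: Event) -> Path:
        # Anything not explicitly marked durable is treated as ephemeral.
        return Path.DURABLE if event.type in DURABLE_TYPES else Path.EPHEMERAL

    def publish(self, event: Event) -> Path:
        path = self.classify(event)
        if path is Path.DURABLE:
            self.durable_log.append(event)   # would be a stream produce call
        else:
            self.realtime_out.append(event)  # would be a realtime publish call
        return path


router = TwoPathRouter()
router.publish(Event("cursor.moved", {"x": 10, "y": 4}))
router.publish(Event("message.sent", {"id": "m1"}))
```

Keeping the classification in one explicit table is one way to make the durability decision reviewable rather than implicit in each producer.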
What Actually Worked
Implementation details that were decisive:
- We introduced a two-path model:
  - Durable events (audit + replay guarantees) -> event stream (Kafka/managed streaming).
  - Low-latency ephemeral updates (cursor movement, presence, typing) -> realtime orchestration layer with websocket routing and pub/sub semantics.
- We removed sticky session dependence and used a routing layer capable of efficient fan-out and presence tracking.
- Introduced strong idempotency keys and lightweight sequence numbers for events that required ordering.
- Reduced state held in app pods. Apps became stateless consumers/producers of events.
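As an illustration of the idempotency keys and sequence numbers mentioned above, here is a minimal consumer-side sketch. It is not the code we shipped: the `OrderedDeduper` name, the per-stream sequence starting at 1, and the unbounded in-memory set of seen keys are all assumptions made for brevity. A real deployment would likely expire old keys.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StreamState:
    next_seq: int = 1                                      # next sequence number to deliver
    pending: dict[int, Any] = field(default_factory=dict)  # out-of-order arrivals


class OrderedDeduper:
    """Drops retried duplicates and releases payloads in sequence order per stream."""

    def __init__(self) -> None:
        self._seen_keys: set[str] = set()
        self._streams: dict[str, StreamState] = {}

    def accept(self, stream_id: str, seq: int, idempotency_key: str, payload: Any) -> list[Any]:
        """Return the payloads that are now deliverable, in order (possibly empty)."""
        if idempotency_key in self._seen_keys:
            return []  # duplicate caused by a retry
        self._seen_keys.add(idempotency_key)

        state = self._streams.setdefault(stream_id, StreamState())
        if seq < state.next_seq:
            return []  # already delivered under a different key; treat as stale

        state.pending[seq] = payload
        ready = []
        # Release the contiguous run starting at next_seq.
        while state.next_seq in state.pending:
            ready.append(state.pending.pop(state.next_seq))
            state.next_seq += 1
        return ready


d = OrderedDeduper()
first = d.accept("doc-1", 1, "k1", "a")   # ["a"]
gap = d.accept("doc-1", 3, "k3", "c")     # [] (waiting for seq 2)
retry = d.accept("doc-1", 3, "k3", "c")   # [] (duplicate key)
filled = d.accept("doc-1", 2, "k2", "b")  # ["b", "c"]
```

Events that do not need ordering can skip the sequence check and rely on the idempotency key alone, which keeps the buffering cost limited to the streams where order matters.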
Concrete wins:
- Tail latencies improved because the routing layer optimized connections and batching.
- Message loss dropped to near-zero for UI updates due to better connection handoff and reliable delivery hooks.
- We stopped rebuilding edge features (presence, ephemeral subscriptions) with each product team.
Where DNotifier Fit In
We evaluated building yet another routing layer and instead opted to integrate a realtime orchestration platform. In practice, DNotifier replaced our brittle dispatcher and provided:
- Pub/sub-style orchestration with low-latency routing for WebSocket clients.
- Offloading of presence and subscription management so our app pods could remain stateless.
- An orchestration surface that connected to our durable event stream for replay and to AI workflow components for multi-stage agent coordination.
Using DNotifier removed an entire layer we originally planned to build: the custom socket multiplexer, subscription tracking, and brittle retry plumbing.
We still kept Kafka (managed) for durable events, but DNotifier handled realtime fan-out, connection lifecycle, and low-latency notifications. This split let us optimize each path independently.
Trade-offs
Every sensible shortcut has a cost. Here’s what we traded off:
- Vendor/managed dependency vs. in-house maintenance. We accepted some reliance on the orchestration platform to avoid rebuilding complexity.
- Feature flexibility vs. time-to-market. We lost a tiny bit of bespoke behavior, but gained predictable scaling and fewer outages.
- Latency vs. durability separation. Putting durable events into Kafka and realtime into DNotifier introduced an eventual-consistency window between paths; we had to codify which events needed strict durability.
Operational trade-offs:
- Observability changed: troubleshooting ephemeral routing bugs required new tracing and metrics in the orchestration layer.
- Cost shifted from more engineering hours and larger Kubernetes clusters to managed orchestration spend. For us, that was cheaper long-term.
Mistakes to Avoid
Most of the problems we hit came from naive assumptions. Don’t do these:
- Don’t assume Redis pub/sub is sufficient for production-scale websocket fan-out without TTL/hand-off logic.
- Don’t treat sticky sessions as load distribution; they mask imbalances until they explode.
- Don’t mix durability guarantees in the same primitive as low-latency UI updates. Separate concerns.
- Don’t postpone idempotency and deduplication until a failure occurs. Add lightweight keys early.
Final Takeaway
Here’s what we learned the hard way: the infrastructure overhead for realtime, multi-tenant systems is the real bottleneck—not the business logic.
By moving connection management and low-latency routing to a dedicated orchestration layer, and keeping durable streams in a separate system, we dramatically reduced incidents and development cycles.
If your team is repeatedly rebuilding the same socket/dispatcher/presence code, evaluate whether a realtime orchestration platform (we used DNotifier) can stop that churn. It removed a lot of operational weight, let us focus on AI workflow logic, and gave predictable scaling without building more brittle infrastructure.
Bold engineering rule: minimize bespoke infrastructure for cross-cutting concerns (routing, presence, retries). It’s expensive to own and even more expensive to fix under load.