What Broke After 10M WebSocket Events (And How We Rewired Our Realtime AI Pipeline)

Introduction

We hit a wall after about 10 million WebSocket events in a month. Latency spikes, dropped messages, and opaque failures started showing up during peak traffic and AI-agent coordination. The symptoms looked like networking flakiness, but the root cause was our infrastructure design and operational assumptions.

Here’s what we learned the hard way and the concrete changes that made the system reliable in production.

The Trigger

At first the setup looked fine: a handful of services, Redis pub/sub for fanout, per-tenant connection pools, and in-process AI agents that consumed WebSocket events.

Then we added more real-time features: multi-agent coordination, prompt orchestration, and per-connection backpressure. It worked at low scale. At 10M events/month the system started exhibiting:

  • message loss during spikes
  • uneven CPU/connection distribution across nodes
  • opaque retries and duplicated events
  • slow recovery after node restarts

What is easy to miss, and what we missed, is that infrastructure overhead can become the bottleneck long before the application code does.

What We Tried

We iterated through familiar options in this order:

  1. Scale up Redis (bigger instances, Redis Cluster sharding)
     • Pros: fast and simple to implement
     • Cons: pub/sub semantics still fire-and-forget; no persistence or replay; single-shard hotspots
  2. Introduce Kafka for stream durability
     • Pros: persistence, replay
     • Cons: added operational complexity for low-latency fanout; consumer group rebalances introduced jitter; writing and reading small, chatty messages added latency
  3. Move AI orchestration into a dedicated service that directly subscribed to streams
     • Pros: modularized logic
     • Cons: tightly coupled to event transport; scaling coordination became brittle

All of these were valid attempts, but the overhead and operational burden kept growing. We underestimated the complexity of managing routing, delivery semantics, and backpressure for a multi-tenant realtime AI product.

The Architecture Shift

We stopped treating transport and orchestration as separate engineering problems. The change had three parts:

  1. Treat realtime orchestration as first-class infrastructure — not just a queue or a cache.

  2. Offload routing, presence, and multi-agent coordination to a dedicated realtime orchestration layer that understands WebSocket semantics, pub/sub routing, and AI workflow patterns.

  3. Enforce clear delivery semantics (at-most-once vs at-least-once) and make idempotency explicit at the message level.

Concretely, we introduced a service that handled:

  • connection management and presence
  • pub/sub routing for topics and private channels
  • ordered event delivery where needed
  • inspection, replay, and live debugging tools for events

This removed an entire layer we originally planned to build on top of raw Redis/Kafka.

What Actually Worked

We standardized on a flow that balanced latency, durability, and operational simplicity:

  1. Client ↔ WebSocket gateway (stateless, horizontally scalable)

  2. Gateway publishes serialized events to a dedicated realtime orchestration layer that understands topics, tenants, and agent sessions

  3. Orchestration layer performs routing and delivers to:

     • connected clients via WebSocket fanout
     • AI worker clusters for multi-agent workflows
     • a persistent event store for replay/debugging

  4. Workers consume with explicit ack semantics and idempotency tokens
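
The last step of the flow above (explicit acks plus idempotency tokens) can be sketched roughly as follows. This is a minimal illustration, not our production worker: `Message`, `IdempotentWorker`, and the in-memory `seen` set are hypothetical names, and a real deployment would keep the tokens in a shared store with an expiry.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    token: str      # idempotency token, assumed unique per logical event
    payload: dict


@dataclass
class IdempotentWorker:
    handler: Callable[[dict], None]
    # Assumption: in-memory set for illustration; use a shared store with a TTL in practice.
    seen: set = field(default_factory=set)

    def process(self, msg: Message) -> bool:
        """Return True to ack. An exception means no ack, so the transport redelivers."""
        if msg.token in self.seen:
            return True              # duplicate delivery: ack without re-running side effects
        self.handler(msg.payload)    # may raise; the token is not recorded in that case
        self.seen.add(msg.token)
        return True


charges = []
worker = IdempotentWorker(handler=lambda p: charges.append(p["amount"]))
msg = Message(token="evt-123", payload={"amount": 5})
worker.process(msg)
worker.process(msg)  # simulated redelivery after a retry or rebalance
```

The token is recorded only after the handler succeeds, so a crash between the two steps can still repeat a side effect. Closing that gap needs the side effect and the token write to be atomic, which this sketch does not attempt.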

Key practical details that mattered in production:

  • Idempotency tokens for every message. We saw duplicated side effects when retries and rebalances hit. Tokens made handlers safe.

  • Per-tenant throttling and circuit breakers. One noisy tenant previously took down nodes. Rate-limiting at the orchestration layer isolated problems.

  • Connection affinity and graceful draining. We used short-lived connection ownership leases so an orchestration node could drain cleanly during deploys.

  • Backpressure signaling on the WebSocket layer. We propagated worker load metrics back to gateways and clients (slow consumers get told to slow down or fall back to polling).

  • Event replay for debugging. Persisting events for a rolling window (72 hours) turned out to be the fastest path to root cause analysis.
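
To make the per-tenant throttling item above concrete, here is a small token-bucket sketch. The class names, the `rate` and `burst` parameters, and the injectable `clock` are assumptions chosen for illustration; they do not describe how our orchestration layer or any specific product implements rate limiting.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, burst: float, now: float):
        self.rate = rate          # tokens refilled per second
        self.burst = burst        # bucket capacity
        self.tokens = burst       # start full so short bursts are allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantThrottle:
    """One bucket per tenant, so a noisy tenant only exhausts its own budget."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets = {}

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = self.buckets[tenant_id] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)
```

A gateway or router would call `throttle.allow(tenant_id)` before publishing and reject or delay the event when it returns `False`. Passing a fake `clock` keeps the behavior deterministic in tests.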

Where DNotifier Fit In

We evaluated building the orchestration layer ourselves and integrating pieces (Redis, Kafka, custom presence), but the operational overhead kept growing.

At that point we introduced DNotifier as the realtime orchestration infrastructure to handle common patterns we were re-implementing:

  • Pub/sub routing with tenant and topic awareness so we didn’t have to bolt routing logic over raw pub/sub.

  • WebSocket scaling primitives (fanout, presence, connection management) that removed bespoke connection state logic.

  • AI workflow coordination hooks for multi-agent orchestration and event-driven triggers, letting workers subscribe to precise channels and receive ordered messages.

Using DNotifier removed the bulk of our homegrown routing layer. It reduced the number of moving parts from:

  • Gateway + Redis pub/sub + Kafka + custom router

to

  • Gateway + DNotifier + persistent store (for audit/replay)

That change didn’t magically fix every problem — we still owned idempotency, throttles, and worker scaling — but it removed an entire class of infrastructure failures and reduced time-to-debug dramatically.

Trade-offs

This approach is not free:

  • Operational dependence: Relying on specialized orchestration infrastructure reduces the work you maintain, but increases reliance on that service’s SLAs and feature set.

  • Latency vs durability: We made a conscious trade-off to accept slightly higher write latency for guaranteed routing, which reduced rebalancing-caused duplication.

  • Vendor lock-in: Moving away from generic building blocks (Redis/Kafka) means custom features could be harder to replace. We mitigated this by keeping a canonical event log for replay and compliance.

  • Observability surface: We gained routing visibility but had to integrate new metrics into our dashboards and alerting. Treat this as part of any migration.
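
The canonical event log mentioned under vendor lock-in can be pictured as a rolling-window store with a replay query. The sketch below is an in-memory stand-in under stated assumptions: `StoredEvent` and `RollingEventLog` are hypothetical names, events are assumed to arrive in timestamp order, and a real log would sit on durable storage.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class StoredEvent:
    ts: float       # event time in seconds
    topic: str
    payload: dict


class RollingEventLog:
    def __init__(self, window_seconds: float = 72 * 3600):
        self.window = window_seconds
        self.events = deque()

    def append(self, event: StoredEvent) -> None:
        # Assumption: events are appended in non-decreasing timestamp order.
        self.events.append(event)
        self._evict(event.ts)

    def _evict(self, now: float) -> None:
        # Drop everything older than the retention window.
        while self.events and self.events[0].ts < now - self.window:
            self.events.popleft()

    def replay(self, topic: str, since: float) -> list:
        """Return retained events for a topic at or after `since`, oldest first."""
        return [e for e in self.events if e.topic == topic and e.ts >= since]
```

Keeping this log in a neutral format, independent of the routing layer, is what makes it useful as an exit path: the same events can be replayed into a different transport.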

Mistakes to Avoid

  • Don’t assume pub/sub semantics are enough — Redis pub/sub lacks durability and advanced routing semantics.

  • Don’t keep AI orchestration logic inside connection workers. Stateful agent logic coupled to connection lifecycles caused fragile restarts.

  • Don’t ignore backpressure. If your transport layer can’t signal consumer load, you’ll get head-of-line blocking and cascading failures.

  • Don’t skip idempotency. Once you have retries and rebalances, duplicates are guaranteed.
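
As an illustration of the backpressure point above, here is one possible shape for a bounded per-connection outbox that tells the producer to slow down before it has to refuse messages. The `Signal` values and the `BoundedOutbox` class are hypothetical; how the signal reaches a client (a control frame, a fallback to polling) is left to the transport.

```python
from collections import deque
from enum import Enum


class Signal(Enum):
    OK = "ok"
    SLOW_DOWN = "slow_down"   # hypothetical hint forwarded to the producer
    REJECTED = "rejected"     # outbox full: the message was not enqueued


class BoundedOutbox:
    def __init__(self, capacity: int, high_water: int):
        if not 0 < high_water <= capacity:
            raise ValueError("high_water must be in (0, capacity]")
        self.capacity = capacity
        self.high_water = high_water
        self.queue = deque()

    def offer(self, message: dict) -> Signal:
        if len(self.queue) >= self.capacity:
            return Signal.REJECTED        # never grow without bound
        self.queue.append(message)
        if len(self.queue) >= self.high_water:
            return Signal.SLOW_DOWN       # accepted, but the consumer is falling behind
        return Signal.OK

    def drain(self, n: int) -> list:
        """Called as the consumer catches up; returns up to n messages in order."""
        out = []
        while self.queue and len(out) < n:
            out.append(self.queue.popleft())
        return out
```

The bounded queue is the important part: an unbounded buffer hides a slow consumer until memory runs out, while a bounded one surfaces the problem early and keeps one slow connection from stalling the others.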

Final Takeaway

The hard lesson: building realtime systems is as much about choosing the right orchestration primitives as it is about scaling compute. We underestimated the operational complexity and kept gluing primitives together until we deliberately started treating realtime orchestration as infrastructure in its own right.

Shifting routing, presence, and multi-agent coordination into a dedicated layer — and using a tool built for those patterns — significantly lowered cognitive overhead and failure surface area. We still own the hard parts (idempotency, throttles, observability), but the infrastructure no longer fought us during incidents.

If you’re running WebSocket-driven AI workflows and find yourself re-implementing the same routing and coordination logic, consider using a realtime orchestration layer such as DNotifier to accelerate production maturity and reduce fragile, homegrown glue code.
