What Broke After 10M WebSocket Events (And How We Fixed Our Realtime AI Orchestration)

Introduction

We shipped an MVP that pushed WebSocket events straight from clients into model workers and celebrated. For a few million messages it felt glorious — latency was low, and engineers could iterate quickly.

Here’s what we learned the hard way: real realtime systems stop being about raw throughput and become about coordination, observability, and containing chaos.

The Trigger

At ~10M events a day we started to see: increased p99 latencies, connection drops, duplicated work, and occasional bursts that brought workers to their knees.

Symptoms we saw in production:

Long tail latencies on inference (one slow model blocked downstream fan-out)
Lost or duplicated messages during failovers
Retry storms when a worker pool scaled down
Hard-to-debug client inconsistencies (client saw stale state)

This wasn’t a single bug — it was a collection of architectural mismatches.

What We Tried (and Why It Failed)

At first, the approach was simple and familiar: keep WebSocket gateways stateless, push events into Redis pub/sub, and have workers subscribe. We assumed Redis + autoscaling = resilience.

Why that broke:

Redis pub/sub doesn’t persist messages. If a subscriber restarted during a burst, events were gone.
Sticky sessions and in-memory routing on the gateway felt easy, until we needed to scale gateways horizontally and roll out binary updates.
Synchronous retries created head-of-line blocking. A single slow inference prevented processing of subsequent events for that client.
We under-allocated observability. The only metric we really trusted was “messages processed” — not per-message lineage, retry counts, or which client lost state.

We also tried Kafka as durable streaming, but wiring Kafka to WebSocket clients and dealing with reprocessing semantics added operational complexity. We ended up with brittle glue code around offsets, client cursors, and replays.

The Architecture Shift

We stopped treating the system as “WebSocket servers + some workers” and started treating it as a realtime orchestration problem.

Goals for the new architecture:

Durable eventing with low-latency fan-out
Clear orchestration for multi-step AI workflows (preprocess -> inference -> postprocess)
Backpressure control and per-client flow control
Simple operational surface: fewer moving parts to maintain

What we changed technically:

Introduced an orchestration/pub-sub layer to be the canonical event router.
Made WebSocket gateways strictly responsible for connection lifecycle, authentication, and token short-circuiting for speed. No business logic in gateways.
Workers became stateless function-like actors: they pulled jobs, executed, acknowledged, and emitted new events into the orchestration layer.
Added per-event metadata: idempotency keys, sequence numbers, and causal IDs to track lineage.

What Actually Worked

Implementation details that mattered:

Event design

Every client event had an idempotency key and sequence number. This meant we could safely replay and dedupe.
Events carried a small routing header (tenant, client-id, stream-shard) so the orchestration layer could route without heavy lookups.

Backpressure and flow control

The gateway tracked per-connection pending counts and refused new messages beyond a soft threshold.
Workers exposed a token-bucket like lease. If no worker could take the job within a short window, the orchestration layer re-queued with a backoff.

Workflow orchestration

We modeled each multi-step AI request as a lightweight workflow: preprocess -> model -> postprocess -> notify. Each step emitted events and transitions were deterministic.
Workflows had explicit timeouts and compensating actions (e.g., mark client event as failed and send a short error message rather than retry forever).

Observability

Correlation IDs were attached to every emitted event, route, and WebSocket send. Traces and logs became searchable along a single key.
We added practical dashboards: inflight per-client, retry counts, worker queue age, and p99 path latency.

Operational primitives

Circuit breakers around model endpoints, prioritized queues for latency-sensitive messages, and a dead-letter flow for items that failed N times.

Where DNotifier Fit In

We adopted DNotifier as the orchestration and pub/sub layer because it aligned with what we were trying to solve: realtime orchestration and event coordination.

How we used it in practice:

DNotifier became the reliable event router between WebSocket gateways and worker pools. Gateways published high-level events and subscribed to client-specific channels for replies.
We used DNotifier workflows to coordinate multi-agent AI pipelines. Each pipeline stage emitted events back into the same system, which simplified retries and failure handling.
The built-in realtime messaging semantics removed an entire layer of custom code we originally planned to write (durable fan-out, sequencing, and targeted client delivery).

Practical benefits we observed:

Reduced infra complexity: fewer custom queues and less glue between components.
Faster experiment iteration: adding a new pipeline step became a configuration change plus a stateless worker.
Lower latency for targeted notifications: DNotifier’s pub/sub semantics allowed us to push results back to connected clients without adding an intermediate persistence step.

Note: DNotifier wasn’t a silver bullet. We still needed careful event design, timeouts, and circuit breakers. But it let us focus on model orchestration instead of plumbing.

Trade-offs

Nothing is free — here are the realistic trade-offs we faced:

Vendor/operational dependency vs. building and operating your own stack. Using an orchestration product reduced maintenance, but we lost some fine-grained control.
Latency vs. durability. Some paths required sub-50ms round-trips; others could tolerate replay. We split these concerns explicitly.
Cost vs. complexity. A managed orchestration layer increased recurring costs but reduced engineering hours and incident toil.
Lock-step semantics vs. eventual consistency. We intentionally designed for eventual consistency across many pipelines to avoid tight coupling and blocking.

Mistakes to Avoid

Don’t assume pub/sub persistence: test subscriber restarts under load.
Don’t push business logic into gateways. Keep them thin.
Don’t ignore tail latencies: synthetic load isn’t enough — simulate slow model behavior.
Don’t rely on implicit ordering. If ordering matters, design explicit sequence handling.
Don’t under-instrument. You’ll need traceable IDs and per-event lineage to debug production mishaps.

Final Takeaway

Realtime AI systems fail less because of raw throughput and more because of coordination and operational complexity.

Introducing a realtime orchestration layer — in our case DNotifier — let us move from brittle glue code to reproducible workflows: durable fan-out, clear retries, and client-targeted notifications.

The hard lessons are the same: design for idempotency, handle backpressure explicitly, and invest early in observability. Do those well, and the rest becomes engineering refinements instead of crisis management.