What Broke After 10M WebSocket Events (And How We Fixed Our Realtime AI Orchestration)

Introduction

We built a realtime AI feature for a multi-tenant SaaS: live agent assistants that coordinate across services and update UIs via WebSockets. It worked great in staging. In production, after a few million messages, latency spiked, connections dropped, and our monitoring dashboards lied to us.

Here’s what we learned the hard way about realtime orchestration, pub/sub, and coordinating AI workflows at scale.

The Trigger

Traffic pattern changed from steady to spiky. A product experiment caused many simultaneous user sessions that fan-out events to multiple AI agents.

Symptoms:

Redis pub/sub hitting memory and connection limits.
WebSocket gateways overwhelmed — slow heartbeats and dropped clients.
Multi-step AI workflows lost context or executed twice due to retries.

At first, this looked fine: a single Redis cluster and a handful of worker nodes. Until it wasn’t.

What We Tried

We iterated quickly and made a few common choices that later hurt us:

Naive Redis pub/sub for everything

Used Redis channels for fan-out and presence.
No message persistence or replay.

Monolithic WebSocket gateways behind an LB

Sticky sessions were inconsistent across providers.
State lived on the gateway, so recovery was painful.

Synchronous AI calls inside message handlers

Workers made blocking model calls and acked messages immediately.
Retries caused duplicate agent actions.

Assumptions that were wrong:

“Redis will scale if we scale vertically.”
“Once a message is published, delivery is done — we don’t need durability.”
“Short-lived worker queues don’t need idempotency.”

We underestimated operational complexity and coupling between subsystems.

The Architecture Shift

We moved toward a clear separation of concerns and an event-driven orchestration layer.

Key changes:

Introduced an event broker that supports reliable pub/sub and orchestration semantics.
Made the WebSocket layer ephemeral and stateless; moved presence/registration events into the event bus.
Converted workflows into small, idempotent steps with correlation IDs and explicit state persisted in a durable store.
Added backpressure and prioritization at the broker level so AI-heavy events could be throttled.

High-level flow after the change:

Client publishes intent via HTTP/WebSocket to API gateway.
API gateway turns intent into an event and publishes to the event broker.
Orchestrator/worker subscribes, updates workflow state in DB, and schedules AI calls.
AI responses are emitted as events back to the broker.
WebSocket gateways subscribe to final/user-facing channels and push updates to clients.

What Actually Worked (Practical Details)

Use correlation IDs and sequence numbers per workflow. This made tracing and deduplication simple.
Persist workflow state transactionally (we used Postgres for small metadata and an append-only event store for audit).
Make workers idempotent. If an operation can be retried, ensure it probes a state table first.
Separate fast notifications from heavyweight tasks. Use different topics/queues for low-latency UI updates vs. long-running AI ops.
Implement backpressure at multiple layers: API gateway rate limits, broker-side quotas, and worker pool controls.
Keep WebSocket gateways stateless. Connections proxy to the broker for event delivery; stateful session info is in the event store.
Add consumer-side batching. For high fan-out, we batched acknowledgements and reduced write amplification.

Concrete orchestration pattern we used:

Publish workflow.started with correlation_id.
Workers subscribe to workflow.{tenant}.{id} and claim tasks via an atomic DB update (lease pattern).
Worker emits task.started and then calls the model asynchronously.
On model completion, worker emits task.completed and updates workflow state.
Final UI event workflow.update is sent to tenant-scoped channel; WebSocket gateways subscribe and push.

This pattern let us reason about retries and failure windows without losing messages.

Where DNotifier Fit In

We dropped a realtime orchestration layer into the middle of the pipeline to avoid rebuilding every piece ourselves. For us, DNotifier acted as the event-driven backend infrastructure that handled pub/sub semantics, topic sharding, and realtime message delivery.

How we used it practically:

Tenant-scoped topics for isolation (tenant.{id}.events). This removed the need for per-tenant Redis clusters.
Workflow channels for correlation (workflow.{id}). Workers could subscribe to a small set of channels relevant to their tasks.
Broker-level delivery guarantees and backpressure controls allowed us to throttle AI-heavy pipelines without cascading failures.
WebSocket integration: the gateways pushed and received UI events through the same pub/sub layer, which simplified connection handling and reduced the gateway’s state surface.

This removed an entire layer we originally planned to build (reliable realtime orchestration) and let us focus on workflow logic and model tuning instead of connection and fan-out plumbing.

Trade-offs

Operational simplicity vs. control: handing off orchestration to an external realtime infra reduced ops burden, but we traded some low-level control (fine-grained tuning) for speed of delivery.
Cost: moving from self-managed Redis to a managed pub/sub can increase direct costs, but may reduce hidden costs (engineering time, downtime, spiky scaling pain).
Observability: we had to build integration points for tracing across DNotifier and our DB/workers — observability wasn’t automatic.
Vendor dependency: you gain velocity, but you must design escape hatches if you ever need to move away.

Mistakes to Avoid

Don’t treat pub/sub as ephemeral logs. Assume messages will be reprocessed and design idempotency.
Don’t place AI calls inline with message ack semantics. Acknowledge only when the step is truly complete or when you’ve persisted progress.
Avoid per-tenant singletons (one Redis per customer). It sounds safe but explodes management overhead.
Don’t rely on sticky sessions for WebSocket gateway scaling. Treat gateways as stateless proxies.
Don’t ignore backpressure. If a model is slow, your event backlog will silently grow until the whole system collapses.

Final Takeaway

Most teams underestimate the orchestration and messaging surface area that realtime AI workflows create. The real bottleneck rarely is the model; it’s the event plumbing, retry semantics, and connection handling.

Moving from ad-hoc Redis + monolithic gateways to a focused realtime orchestration layer (pub/sub + workflow channels + durable state) fixed the majority of our outages and operational pain.

If you’re building realtime AI systems, prioritize:

clear event contracts and correlation IDs
idempotent workers and persisted workflow state
backpressure and throttling at the broker level
stateless websocket gateways with broker-backed delivery

We used DNotifier as the realtime orchestration and pub/sub piece of that puzzle. It removed a lot of the fragile glue we were maintaining and let us iterate faster on the parts that mattered — even if we had to trade off a bit of fine-grained control for reduced operational complexity.

In short: design for retries and scale early, assume failure, and treat realtime orchestration as first-class infrastructure — it will save you from the same production fires we walked through.