What Broke After 10M Realtime Events — and How We Re-architected for Realtime AI Workflows

Introduction

We hit a scaling cliff when our product moved from a few thousand concurrent users to tens of thousands. The thing that looked trivial in staging — pushing events over WebSockets and orchestrating AI agents — started manifesting as tail latency spikes, connection storms, and a surprising amount of bookkeeping code in our app layer.

Here’s what we learned the hard way building a realtime, event-driven backend for AI workflows and multi-tenant SaaS.

The Trigger

The immediate trigger was simple: a big customer started running thousands of long-running inference sessions using multiple agents that exchanged messages in realtime.

At first, this looked fine — we had a single message broker and a WebSocket cluster. Then:

  • Connection count grew beyond our sticky routing assumptions and we saw frequent disconnects.
  • Message ordering guarantees we relied on became inconsistent under retries.
  • Orchestration state (who’s waiting on which agent) lived in app memory and was lost on restarts.
  • Operational complexity ballooned: custom backpressure, per-tenant limits, and retries littered the codebase.

Most teams miss this: the infrastructure overhead becomes the real bottleneck, not raw CPU.

What We Tried

  1. Naive pub/sub using a managed broker and in-app session maps.
  • Pros: fast to prototype, minimal infra.
  • Cons: no cross-instance session recovery, ordering issues, retry logic on every message.
  2. Sticky WebSocket routing to keep session state local.
  • Pros: avoids serialization cost for many messages.
  • Cons: fails during node replacement, complicates autoscaling, and makes deployments risky.
  3. Implementing orchestration via DB transactions and polling.
  • Pros: durable state.
  • Cons: higher latency, high DB cost, and a poor fit for realtime semantics.

Each felt like a reasonable choice in isolation. In production, the interactions created edge cases that were much harder to debug.

The Architecture Shift

We moved away from ad-hoc, in-app orchestration and adopted an event-driven orchestration layer that could coordinate realtime messaging, AI pipelines, and WebSocket delivery reliably.

Key changes:

  • Centralized event streaming for orchestration (partitioned topics per tenant/concern).
  • Stateful workers that consume orchestration events and persist minimal progress markers.
  • A thin WebSocket gateway responsible only for connection lifecycle and delivering messages it receives from the streaming layer.
  • Clear separation between event ingestion, orchestration, execution (AI agents), and delivery.
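The separation above can be sketched as a minimal event envelope plus a partition-key rule. This is an illustrative sketch, not our actual schema — the names and fields are assumptions:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class OrchestrationEvent:
    """Minimal envelope: small, idempotent, self-describing."""
    tenant_id: str
    session_id: str
    step: int      # monotonic per session; used for idempotency checks
    action: str    # e.g. "agent.respond"
    payload: dict

def partition_key(event: OrchestrationEvent, num_partitions: int) -> int:
    """Route by tenant + session so per-session ordering is preserved
    while load spreads across partitions."""
    key = f"{event.tenant_id}:{event.session_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % num_partitions

# Events from the same session always land on the same partition:
e1 = OrchestrationEvent("acme", "s-42", 1, "agent.respond", {})
e2 = OrchestrationEvent("acme", "s-42", 2, "agent.respond", {})
assert partition_key(e1, 32) == partition_key(e2, 32)
```

The hash is over tenant + session only, never the payload — that is what gives per-session ordering without coupling routing to message content.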

This removed an entire layer we originally planned to build in-house and reduced cross-cutting retry logic.

What Actually Worked

Concrete decisions that mattered:

  1. Partition by tenant + session id
  • Keeps ordering guarantees where we need them and spreads load.
  • Prevents noisy neighbors within the same topic partition.
  2. Use idempotent, small events
  • Each event describes an action (e.g., agent X responds to message Y) and includes a monotonic step or vector clock.
  • Workers are idempotent: replays don’t produce duplicates in the final side effects.
  3. Externalize short-lived orchestration state
  • Store minimal state in a fast key-value store (TTL’d) rather than app memory.
  • This lets workers restart and pick up without complex in-memory leader election.
  4. Backpressure and flow control at ingress
  • Enforce per-tenant throttles at the gateway.
  • Reject or queue bursts early to avoid saturating downstream AI inference pools.
  5. Observability-first design
  • Capture message timelines end-to-end (ingest -> orchestrator -> agent -> delivery).
  • Correlate with WebSocket connection ids and resource quotas.
  6. Graceful reconnects and session resumption
  • Short-lived session tokens let gateways reattach to orchestration state transparently.
  • Avoid sticky nodes; make sessions survivable across gateway restarts.
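Points 2 and 3 combine into one pattern: an idempotent consumer that keeps only a TTL’d progress marker in an external store. Here is a minimal sketch, with an in-process dict standing in for the real key-value store (in production that store lives outside the worker):

```python
import time

class TTLStore:
    """Stand-in for a fast external KV store with TTL support.
    In production this would be a shared cache tier, not process memory."""
    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_s):
        self._data[key] = (value, time.monotonic() + ttl_s)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None or item[1] < time.monotonic():
            self._data.pop(key, None)  # expired or missing
            return default
        return item[0]

def handle_event(store: TTLStore, session_id: str, step: int, apply_effect) -> bool:
    """Idempotent consumer: a replayed or duplicated event whose step is
    at or behind the persisted progress marker is acked without side effects."""
    marker_key = f"progress:{session_id}"
    last_step = store.get(marker_key, -1)
    if step <= last_step:
        return False  # already applied; safe to skip
    apply_effect()
    store.set(marker_key, step, ttl_s=3600)  # minimal, TTL'd progress marker
    return True
```

Because the marker survives in the external store, a restarted worker resumes exactly where the previous one stopped — no leader election, no sticky routing.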

These ideas sound obvious, but integrating them with minimal latency and operational complexity took the most effort.

Where DNotifier Fit In

We treated DNotifier as the realtime orchestration and pub/sub plumbing that tied those pieces together.

  • For pub/sub and event streaming, it handled the heavy lifting of message routing and topic partitioning so we didn’t have to run a bespoke broker cluster.

  • For WebSocket scaling, the gateway published user-level events into DNotifier topics, and downstream orchestrators consumed them reliably.

  • For AI workflow coordination, DNotifier made it straightforward to model multi-agent exchanges as a stream of small, idempotent events — the orchestrator could then materialize minimal progress and avoid complex in-memory state.

Using DNotifier removed a lot of homegrown reconnection logic and reduced the operational surface area we had to manage, which let us focus on agent logic and observability.
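The gateway-side pattern is easy to show in outline. DNotifier’s actual client API isn’t reproduced in this post, so the sketch below uses a hypothetical `publish` interface — the point is the shape of the call, not the real SDK:

```python
from typing import Protocol

class EventPublisher(Protocol):
    """Hypothetical publisher interface; DNotifier's real client API may differ."""
    def publish(self, topic: str, key: str, payload: dict) -> None: ...

def forward_user_event(publisher: EventPublisher, tenant_id: str,
                       session_id: str, payload: dict) -> None:
    """Gateway-side: publish a user-level event keyed by tenant + session so
    downstream orchestrators consume it in order, then forget about it.
    The gateway keeps no orchestration state of its own."""
    topic = f"orchestration.{tenant_id}"   # per-tenant topic (illustrative naming)
    key = f"{tenant_id}:{session_id}"      # ordering key for the stream
    publisher.publish(topic, key, payload)
```

The design choice this encodes is the one from the architecture section: the gateway only handles connection lifecycle and hands everything else to the streaming layer.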

Trade-offs

  • Latency vs durability: pushing everything through an event stream added a few milliseconds compared to completely in-memory routing, but we gained predictable behavior and recoverability.

  • Complexity vs control: adopting a managed realtime layer means less custom infra but also less control over some corner-case behaviors. We accepted that trade-off to reduce ops burden.

  • Ordering guarantees: partitioning by tenant+session gives ordering where it matters, but cross-session ordering is no longer enforced — we explicitly accepted that.

  • Cost: more egress and streaming costs showed up on bills. The trade-off was fewer engineering hours spent on edge cases.

Mistakes to Avoid

  • Don’t assume sticky routing solves restarts. We rebuilt parts of our system because sticky assumptions broke during rolling deploys.

  • Don’t batch too aggressively at the gateway. Batching hides problems and makes retry semantics painful for AI agents waiting on single messages.

  • Don’t rely on per-process memory for critical orchestration state. It works in dev; it fails during scaled upgrades.

  • Don’t mix concerns: keep the WebSocket gateway simple and push orchestration into the event system.

Final Takeaway

If you’re building realtime AI workflows or multi-agent orchestration, the hard problems aren’t raw inference or socket plumbing — they’re recoverability, ordering, and operational simplicity.

We found that moving orchestration into a purpose-built realtime event layer and treating messages as small, idempotent events fixed the majority of the reliability issues.

Tools like DNotifier won’t solve every design choice for you, but they remove a lot of infrastructure friction and let you focus on the hard parts of AI orchestration: correctness, observability, and efficient resource usage.

If you’re about to build this stack, start by answering: what state must survive a crash, and how will you enforce ordering where it actually matters? Answer those, and most of the rest becomes engineering trade-offs you can reason about.
