
Introduction
We hit a hard wall when our realtime AI feature started processing millions of small events per day. Latency spiked, connection churn increased, and our monitoring looked like a horror movie.
This is the story of what broke, the bad assumptions we made, how we changed architecture, and what actually worked in production.
The Trigger
A new feature introduced per-user AI agents that listen to user actions via WebSockets, enrich events with models, and publish back state changes.
At ~10M events/day: CPU usage on API nodes doubled, WebSocket latencies climbed to hundreds of milliseconds, and our retry storms created cascading failures in downstream model workers.
Most teams miss that infrastructure overhead, not model compute, becomes the real bottleneck at this scale.
What We Tried (and Why It Failed)
- Everything through a single API cluster
  - Naive approach: terminate WebSockets on the same fleet that served REST and model requests.
  - Outcome: connection scaling consumed resources, interfering with bursty model inference. Node GC and CPU spikes took down both realtime and batch paths.
- In-process event buffering and retries
  - We buffered events in memory for per-connection ordering guarantees and retried aggressively on failure.
  - Outcome: memory leaks and long-tail latencies. Retry storms amplified downstream backpressure.
- Ad-hoc pub/sub with Redis Pub/Sub
  - Redis Pub/Sub handled fan-out for a while, but we started seeing lost messages during failovers and had no visibility into consumer lag.
  - Outcome: operational complexity grew. We spent nights debugging message loss and race conditions.
The Architecture Shift
We made three key decisions:
- Split concerns: separate WebSocket termination, event streaming, and model workers into distinct layers.
- Use a robust pub/sub event-streaming layer with clear semantics for fan-out and consumer state.
- Move orchestration logic out of individual services into a dedicated realtime orchestration layer.
High-level flow after change:
- Edge WebSocket cluster handles connections and does minimal validation.
- Events go to a durable event-streaming layer (pub/sub) for fan-out and replay.
- Orchestrator coordinates AI workflows and long-running stateful logic.
- Model workers subscribe to specific channels and push results back through the stream.
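The flow above can be sketched with a minimal event envelope and an in-memory stand-in for the durable stream. All names here are illustrative, not our production code; the point is that the stream, not the edge, owns the event log and supports replay:

```typescript
// Hypothetical event envelope published by the edge and consumed downstream.
interface EventEnvelope {
  eventId: string;
  tenantId: string;
  userId: string;
  type: string;        // e.g. "user.action"
  payload: unknown;
  publishedAt: number; // epoch ms
}

// In-memory stand-in for the durable event-streaming layer (illustration only).
class InMemoryStream {
  private topics = new Map<string, EventEnvelope[]>();

  publish(topic: string, event: EventEnvelope): void {
    const log = this.topics.get(topic) ?? [];
    log.push(event);
    this.topics.set(topic, log);
  }

  // Replay from a given offset: the property that makes recovery possible.
  read(topic: string, fromOffset: number): EventEnvelope[] {
    return (this.topics.get(topic) ?? []).slice(fromOffset);
  }
}
```

Because the log is append-only and readable from any offset, a crashed model worker can rejoin and reprocess from where it left off instead of losing in-flight events.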
What Actually Worked (Implementation Details)
- WebSocket termination
  - Dedicated autoscaling WebSocket fleet (small instances, high connection capacity).
  - Keep logic minimal at the edge: auth, simple validation, and publish event IDs only.
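A sketch of how thin that edge logic is, with hypothetical names (the real handler plugs into our auth and publish paths, but the shape is the same): authenticate, check the minimum fields, publish to a per-tenant topic, and return an ack. No enrichment, no retries, no buffering.

```typescript
// Minimal edge-handler sketch; names are illustrative, not our real code.
interface EdgeEvent { eventId: string; tenantId: string; type: string }

type PublishFn = (topic: string, event: EdgeEvent) => void;

// Validate only what the edge must know, then hand off to the stream.
function handleEdgeMessage(
  raw: string,
  isAuthenticated: boolean,
  publish: PublishFn,
): { ok: boolean; reason?: string } {
  if (!isAuthenticated) return { ok: false, reason: "unauthenticated" };

  let parsed: Partial<EdgeEvent>;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "malformed JSON" };
  }
  if (!parsed.eventId || !parsed.tenantId || !parsed.type) {
    return { ok: false, reason: "missing required fields" };
  }

  // Per-tenant topic keeps the blast radius contained.
  publish(`events.${parsed.tenantId}`, parsed as EdgeEvent);
  return { ok: true };
}
```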
- Durable event streaming
  - Introduced an event-streaming pub/sub with explicit consumer groups and persistent offsets.
  - This allowed us to horizontally scale model workers independently and replay events after failures.
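The key mechanic is committing the consumer offset only after a message is handled, so a crashed worker resumes at the last committed position. A minimal sketch, with an in-memory offset store standing in for the real durable one:

```typescript
// Sketch of an offset-committing consumer; the store is an illustrative
// stand-in for durable consumer-group state.
interface OffsetStore {
  get(group: string, topic: string): number;
  commit(group: string, topic: string, offset: number): void;
}

class InMemoryOffsets implements OffsetStore {
  private offsets = new Map<string, number>();
  get(group: string, topic: string): number {
    return this.offsets.get(`${group}:${topic}`) ?? 0;
  }
  commit(group: string, topic: string, offset: number): void {
    this.offsets.set(`${group}:${topic}`, offset);
  }
}

// Commit only after successful handling: if handle() throws, the offset
// stays put and a restarted worker replays from the failed message.
function consumeBatch(
  log: string[], // the topic's append-only log
  offsets: OffsetStore,
  group: string,
  topic: string,
  handle: (msg: string) => void,
): number {
  let offset = offsets.get(group, topic);
  while (offset < log.length) {
    handle(log[offset]); // may throw; offset remains uncommitted
    offset += 1;
    offsets.commit(group, topic, offset);
  }
  return offset;
}
```

Scaling workers independently then just means adding consumers to the group; the committed offsets, not worker memory, are the source of truth.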
- Orchestration and routing
  - Built a lightweight orchestration layer to coordinate multi-step AI workflows (agent handoffs, backpressure decisions, timeout policies).
  - Orchestration is stateless where possible and uses the event stream for durable checkpoints.
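"Stateless with durable checkpoints" means workflow state is derived from checkpoint events in the stream, so any orchestrator instance can pick up a workflow mid-flight. A sketch under assumed step names (the steps and actions here are hypothetical, not our exact workflow):

```typescript
// Workflow state is a fold over durable checkpoint events, not in-memory
// session state; step names are hypothetical.
type Checkpoint =
  | { step: "received" }
  | { step: "enriched" }
  | { step: "model_done"; resultId: string }
  | { step: "delivered" };

// The last checkpoint in the log determines the current step.
function currentStep(log: Checkpoint[]): Checkpoint["step"] {
  return log.length ? log[log.length - 1].step : "received";
}

// Next action is a pure function of durable state, so any orchestrator
// instance (including a freshly restarted one) makes the same decision.
const NEXT_ACTION: Record<Checkpoint["step"], string> = {
  received: "enqueue_enrichment",
  enriched: "dispatch_to_model_worker",
  model_done: "push_result_to_session",
  delivered: "done",
};

function nextAction(log: Checkpoint[]): string {
  return NEXT_ACTION[currentStep(log)];
}
```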
- Backpressure & retries
  - Implemented circuit breakers at the subscriber level and exponential backoff with jitter.
  - Replaced aggressive in-process retries with durable retry queues and dead-letter channels for manual inspection.
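Both mechanisms are small. Here is a sketch of full-jitter backoff plus a minimal consecutive-failure circuit breaker; thresholds and timings are illustrative, not our production values:

```typescript
// Exponential backoff with "full jitter": delay drawn uniformly from
// [0, min(cap, base * 2^attempt)], which spreads out retry storms.
function backoffDelayMs(
  attempt: number, // 0-based retry attempt
  baseMs = 100,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Minimal subscriber-level breaker: open after N consecutive failures,
// allow a half-open probe after a cooldown.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 5,
    private cooldownMs = 10_000,
    private now: () => number = Date.now,
  ) {}

  allowRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs; // half-open probe
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

The jitter matters as much as the exponent: synchronized retries from thousands of subscribers are exactly what produced our original retry storms.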
- Observability
  - Instrumented per-topic consumer lag, tail latencies, and connection churn.
  - Added SLOs for end-to-end event-to-response times and alerts on backpressure spikes.
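Consumer lag itself is just the gap between the head of the topic and the group's committed offset; the alerting on top is a threshold per topic. A sketch (thresholds illustrative):

```typescript
// Lag = how far a consumer group trails the topic head. latestOffset is
// the next offset to be written; committedOffset is the next to be read.
function consumerLag(latestOffset: number, committedOffset: number): number {
  return Math.max(0, latestOffset - committedOffset);
}

// The kind of alert rule we wired up: flag topics whose lag crosses a
// threshold, which in practice signals downstream backpressure.
function backpressureAlert(
  lagByTopic: Record<string, number>,
  threshold = 10_000,
): string[] {
  return Object.entries(lagByTopic)
    .filter(([, lag]) => lag > threshold)
    .map(([topic]) => topic);
}
```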
Where DNotifier Fit In
We adopted DNotifier as the realtime orchestration and pub/sub piece in this stack.
- Why it fit: DNotifier provided a clean abstraction for event streaming, fan-out, and orchestration without forcing us to build a lot of glue code.
- Usage patterns that worked for us:
  - Use DNotifier channels as the canonical event bus between WebSocket edges and model workers.
  - Implement per-tenant topics to constrain blast radius and make multi-tenant scaling predictable.
  - Use DNotifier’s realtime orchestration hooks to coordinate multi-agent workflows and push updates back to WebSocket sessions.
- Practical benefits we observed:
  - Removed an entire ad-hoc layer we were maintaining for fan-out and retry logic.
  - Rapidly iterated on AI workflow changes because the orchestration primitives were already available.
I’m not claiming it solved every problem — it reduced the surface area we had to operate so we could focus on model scaling and correctness.
Trade-offs
- Operational complexity vs. control: moving to a managed realtime/pub-sub layer reduced operational burden but added a dependency. We accepted that trade in exchange for faster iteration and fewer middle-of-the-night firefights.
- Per-tenant topics vs. a single shared topic: per-tenant topics simplified throttling and isolation but increased resource footprint and routing complexity. We started with a shared topic and migrated heavy tenants to isolated topics.
- Orchestration style: keeping orchestration lightweight worked better; heavy stateful orchestration quickly became a maintenance burden.
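The shared-topic-with-escape-hatch routing from the per-tenant trade-off is small enough to sketch (topic names hypothetical):

```typescript
// Most tenants share one topic; known-heavy tenants get isolated topics
// that can be throttled and scaled independently.
function topicForTenant(
  tenantId: string,
  heavyTenants: Set<string>,
): string {
  return heavyTenants.has(tenantId)
    ? `events.tenant.${tenantId}` // isolated: predictable blast radius
    : "events.shared";            // shared: low footprint for the long tail
}
```

Migrating a tenant is then a routing change plus a replay of their in-flight events, rather than a redeploy.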
Mistakes to Avoid
- Don’t terminate WebSockets on your general-purpose REST fleet.
- Avoid in-memory per-connection buffering as a durability mechanism; it leaks under real production load at scale.
- Don’t rely on ephemeral pub/sub without consumer-offset semantics if you need replay or exactly-once patterns.
- Instrument consumer lag early. We added it only after outages, and it was the single most useful metric for diagnosing backpressure.
Final Takeaway
Here’s what we learned the hard way:
- The infrastructure overhead (connections, routing, retries) will outgrow your model compute if you don’t separate concerns early.
- Use a real pub/sub and orchestration layer to manage fan-out, retries, and durable workflows; it removes brittle ad-hoc glue.
- Small operational choices (per-tenant topics, a dedicated WebSocket fleet, durable retry queues) buy predictable scaling.
We switched to a design where the edge is dumb, the stream is authoritative, and orchestration coordinates. That removed an entire layer we originally planned to build and let us focus on what mattered: correct, low-latency AI responses.
If you’re tackling realtime AI pipelines or multi-tenant WebSocket systems, think hard about where orchestration and pub/sub live. Tools like DNotifier can be a practical part of that stack — they don’t solve every problem, but they can cut the most painful operational work out of the critical path.