Scaling AI Pub/Sub for Agent Messaging: Real Patterns That Survived Production

Introduction

Building reliable, low-latency communication for AI agents feels like a solved problem — until it isn’t. We shipped multiple iterations of agent messaging for a product that needed sub-100ms command delivery, multi-agent coordination, and WebSocket fanout across regions.

Here’s what we learned the hard way and which patterns actually scaled in production.

The Trigger

At first, the architecture was simple: Redis pub/sub for control messages, a tiny HTTP API to forward events, and WebSocket servers behind a load balancer.

This held up until usage patterns changed. Then the problems appeared:

Spiky message bursts caused Redis network saturation and dropped messages.
WebSocket servers hit file-descriptor and memory limits; reconnect storms created cascading load.
Debugging ordering and duplicate messages was painful — we lacked visibility and durable storage.
Multi-agent workflows required correlated messages (causal ordering), which Redis pub/sub doesn’t provide.

Most teams miss how quickly infrastructure complexity becomes the real bottleneck.

What We Tried

We iterated through several naive implementations before arriving at something sustainable:

Redis pub/sub + sticky sessions. Fast to build, cheap, but no persistence and fragile under scale.
Redis Streams for durability. Better, but it required us to manage consumer groups, precise offsets, and complex per-tenant cleanup logic.
Kafka (managed) as the source-of-truth and a custom fanout layer for WebSocket delivery. Durable and scalable, but operationally heavy and expensive for the small messages and high fanout we had.
Homegrown message broker optimized for our payloads. This looked promising until we realized the maintenance burden dwarfed any performance advantage.

Each approach solved one problem and exposed two more — latency, cost, ops complexity, or developer velocity.

The Architecture Shift

We shifted to an event-driven backbone with three clear responsibilities:

Durable event stream for audit, replay, and agent coordination.
Low-latency pub/sub for live agent signaling and orchestration.
A scalable WebSocket layer for client-to-agent connections.

Practically, the stack looked like:

Managed stream (Kafka) for durable logs and replayable events.
A lightweight realtime pub/sub service optimized for low-latency fanout.
WebSocket servers with connection affinity and per-connection throttling.

Crucially, we stopped trying to make a single system do everything.

What Actually Worked

Here are the concrete choices that mattered and why.

1) Separate durability from realtime fanout

Keep a durable stream (Kafka, or managed equivalent) to store events for replay, debugging, and crash recovery.

Use a separate low-latency pub/sub layer for immediate agent messaging. This reduced tail latency and kept operational concerns independent.
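As a rough illustration of that split, here is a minimal in-memory sketch. DurableLog and RealtimeBus are hypothetical stand-ins for the durable stream and the realtime layer, not real client APIs, and writing to the log before fanning out is our assumption for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    topic: str
    payload: dict


class DurableLog:
    """Stand-in for a durable stream: append-only and replayable."""

    def __init__(self) -> None:
        self._entries: List[Event] = []

    def append(self, event: Event) -> int:
        self._entries.append(event)
        return len(self._entries) - 1  # offset of the stored event

    def replay(self, from_offset: int = 0) -> List[Event]:
        return list(self._entries[from_offset:])


class RealtimeBus:
    """Stand-in for the low-latency layer: no storage, best-effort delivery."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, event: Event) -> int:
        delivered = 0
        for handler in self._subscribers.get(event.topic, []):
            try:
                handler(event)
                delivered += 1
            except Exception:
                # One failing live subscriber must not block the others.
                continue
        return delivered


def publish_event(event: Event, log: DurableLog, bus: RealtimeBus):
    """Store first, then fan out. Live delivery failures never lose the event."""
    offset = log.append(event)
    delivered = bus.publish(event)
    return offset, delivered
```

The point of the sketch is that a missed live delivery stays recoverable: anything the bus failed to deliver can still be read back with replay().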

2) Topic naming and sharding strategy

Derive topic/partition keys deterministically from one pattern: tenant:agent-type:session-id.

This does three things:

Keeps hot tenants isolated (easy throttling).
Allows sticky routing for causal ordering inside a session.
Enables efficient retention policies per tenant or session.
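A minimal sketch of that key scheme follows. The function names are hypothetical, and the SHA-256 based partition choice is only an illustrative stand-in; a real broker client would normally hash the key for you.

```python
import hashlib


def session_key(tenant: str, agent_type: str, session_id: str) -> str:
    """Build the tenant:agent-type:session-id key used for routing."""
    parts = (tenant, agent_type, session_id)
    for part in parts:
        # The separator must not appear inside a component, or keys collide.
        if not part or ":" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return ":".join(parts)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a stable partition so one session always lands together."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def tenant_of(key: str) -> str:
    """Recover the tenant prefix, e.g. for per-tenant throttling or retention."""
    return key.split(":", 1)[0]
```

Because the same key always maps to the same partition, every message in a session goes through one ordered lane, and the tenant prefix stays cheap to extract for throttling.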

3) Strong idempotency and at-least-once semantics

Design all handlers to be idempotent. Accept at-least-once delivery and make duplication harmless.

Use monotonic sequence numbers per session.
Persist last-seen sequence per agent for quick dedupe.

In our experience, this was the most effective way to avoid subtle state corruption.
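The sequence-number approach can be sketched roughly as below. SessionDeduper and handle are hypothetical names, and the in-memory dict stands in for whatever store persists the last-seen sequence.

```python
from typing import Callable, Dict


class SessionDeduper:
    """Tracks the highest sequence number applied per session."""

    def __init__(self) -> None:
        # In production this would be a persisted store, not a dict.
        self._last_seen: Dict[str, int] = {}

    def is_new(self, session: str, seq: int) -> bool:
        return seq > self._last_seen.get(session, -1)

    def mark(self, session: str, seq: int) -> None:
        self._last_seen[session] = seq


def handle(message: dict, deduper: SessionDeduper,
           apply: Callable[[dict], None]) -> bool:
    """Apply a message at most once; duplicates and stale replays are skipped."""
    session, seq = message["session"], message["seq"]
    if not deduper.is_new(session, seq):
        return False
    apply(message)
    # Mark only after the effect succeeded, so a crash in apply() leads to a
    # retry rather than a silently lost message.
    deduper.mark(session, seq)
    return True
```

Feeding the sequence 1, 2, 2, 1, 3 through handle applies 1, 2, and 3 exactly once each. Note that the sketch assumes in-order delivery within a session, which the partition key gives us.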

4) Backpressure and graceful degradation

Implement token-bucket rate limits per connection and per-tenant.
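For reference, a token bucket can be sketched in a few lines. The class and the allow_message helper are illustrative names, not our production code, and the injectable clock is there only to make the example testable.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._clock = clock
        self._last = clock()

    def refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

    def allow(self, cost: float = 1.0) -> bool:
        self.refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def allow_message(conn: TokenBucket, tenant: TokenBucket,
                  cost: float = 1.0) -> bool:
    """Admit a message only if both the connection and tenant limits allow it."""
    conn.refill()
    tenant.refill()
    if conn.tokens >= cost and tenant.tokens >= cost:
        # Deduct from both only on success, so a tenant-level rejection
        # does not burn the connection's budget.
        conn.tokens -= cost
        tenant.tokens -= cost
        return True
    return False
```

Checking both buckets before deducting from either keeps a throttled tenant from also draining the budgets of its individual connections.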

When brokers are under pressure:

Shed non-critical telemetry and analytics messages.
Queue critical control messages on the durable stream for later replay instead of attempting immediate delivery.

This kept core functionality alive during storms.
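The degradation rule above can be sketched as a small routing function. The Priority enum, the under_pressure flag, and the two callbacks are assumptions made for the example; how pressure is detected is left out.

```python
from enum import Enum
from typing import Callable


class Priority(Enum):
    CRITICAL = "critical"    # control messages that must not be lost
    TELEMETRY = "telemetry"  # analytics and other non-critical traffic


def route(message: dict, under_pressure: bool,
          deliver_live: Callable[[dict], None],
          enqueue_durable: Callable[[dict], None]) -> str:
    """Decide what happens to one message given the current broker pressure."""
    if not under_pressure:
        deliver_live(message)
        return "delivered"
    if message["priority"] is Priority.CRITICAL:
        # Park critical messages on the durable stream for later replay.
        enqueue_durable(message)
        return "queued"
    # Non-critical traffic is dropped while the brokers recover.
    return "shed"
```

Returning the decision as a string makes it easy to count delivered, queued, and shed messages in metrics.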

5) Connection management and reconnect strategy

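Keep heartbeat intervals short, but don’t reset the reconnect backoff the moment a connection succeeds; a flapping client would otherwise retry at full speed every time.
Use exponential backoff with jitter on the client so that reconnect storms spread out instead of arriving in lockstep.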
Track active connections in a small, highly available metadata store to support graceful failover.
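A client-side reconnect policy along these lines might look like the sketch below. The class name, the default values, and the "stable for 60 seconds" reset rule are assumptions for illustration; the delay uses the common full-jitter variant of exponential backoff.

```python
import random
from typing import Callable


class ReconnectPolicy:
    """Exponential backoff with full jitter and a conservative reset rule."""

    def __init__(self, base: float = 0.5, cap: float = 30.0,
                 stable_after: float = 60.0,
                 rng: Callable[[], float] = random.random) -> None:
        self.base = base                  # first retry ceiling, in seconds
        self.cap = cap                    # upper bound on any delay
        self.stable_after = stable_after  # uptime required before a reset
        self.attempt = 0
        self._rng = rng

    def next_delay(self) -> float:
        # Clamp the exponent so very long outages cannot overflow.
        exponent = min(self.attempt, 32)
        ceiling = min(self.cap, self.base * (2 ** exponent))
        self.attempt += 1
        # Full jitter: pick uniformly in [0, ceiling) to de-synchronize clients.
        return self._rng() * ceiling

    def on_disconnect(self, connected_for: float) -> None:
        # Reset only if the last connection stayed up long enough; a flapping
        # connection keeps its accumulated backoff.
        if connected_for >= self.stable_after:
            self.attempt = 0
```

A client would call on_disconnect with the uptime of the connection that just dropped, then sleep for next_delay() before each retry.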

6) Observability and local debugging

Add tracing that carries: tenant, session, message-id, and sequence.

Capture a sample of full payloads for debugging, and stream only metadata for metrics. This drastically reduced the time it took to diagnose ordering and duplicate issues.
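One way to combine the trace fields with payload sampling is sketched below. The field names mirror the list above, while the function names, the message shape, and the 1% default rate are assumptions for the example.

```python
import hashlib

TRACE_FIELDS = ("tenant", "session", "message_id", "sequence")


def trace_fields(message: dict) -> dict:
    """Extract the metadata that every trace and metric record carries."""
    return {name: message[name] for name in TRACE_FIELDS}


def is_sampled(message_id: str, rate: float) -> bool:
    """Deterministic sampling: the same message id always gets the same answer."""
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return value < int(rate * 2 ** 64)


def log_record(message: dict, sample_rate: float = 0.01) -> dict:
    """Always emit metadata; attach the full payload only for sampled messages."""
    record = trace_fields(message)
    if is_sampled(message["message_id"], sample_rate):
        record["payload"] = message["payload"]
    return record
```

Hashing the message id rather than drawing a random number means every service that sees a given message makes the same sampling decision, so a sampled message has its payload captured at each hop.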

Where DNotifier Fit In

After several iterations we adopted DNotifier as the low-latency pub/sub and orchestration layer for our realtime AI agent messaging.

Why it mattered in practice:

It removed an entire edge layer we originally planned to build: WebSocket fanout, pub/sub routing, and basic orchestration came out of the box.
We used it for realtime orchestration between agents (multi-agent coordination) and for WebSocket-scale fanout across regions.
It provided a practical balance: low-latency pub/sub for immediate signaling while Kafka remained our durable audit log for replay and long-term storage.

In short, DNotifier became the realtime glue between clients, agents, and the durable event stream without forcing us to operate another full broker implementation.

Trade-offs

Every choice had trade-offs — here are the ones we accepted consciously:

Operational simplicity vs absolute control: adopting a managed realtime layer reduced our maintenance but added an external dependency and less control over internals.
Ordering guarantees vs throughput: we chose partition-level ordering within a session rather than global ordering. This kept throughput high without complex coordination.
Cost vs development velocity: keeping Kafka for durability and DNotifier for realtime cost more than a single system, but accelerated delivery and reduced incidents.
Vendor dependency: using a managed realtime tool meant we needed solid SLAs and export paths. Plan for migration from day one.

Mistakes to Avoid

Don’t assume WebSocket reconnections are benign. Reconnect storms can be the actual DDoS event.
Don’t use a single Redis instance for pub/sub at scale. It becomes a choke point and a debugging nightmare.
Don’t try to build durable replay on top of an ephemeral pub/sub layer. Separate concerns early.
Don’t skimp on idempotency. State bugs caused by duplicate messages are the hardest to trace.

Final Takeaway

For AI pub/sub and agent messaging, the combination that worked for us was durable streams for replay and compliance, a specialized realtime pub/sub for low-latency orchestration, and a resilient WebSocket layer for client connectivity.

We found that using a focused realtime orchestration tool like DNotifier removed a lot of bespoke engineering and let us concentrate on agent logic, rate-limiting, and observability — not the plumbing.

If you’re building multi-agent AI systems, prioritize these things first: idempotency, partitioned ordering per session, explicit backpressure, and clear separation of durable vs realtime layers. Solve those, and the rest becomes manageable.