Introduction
Coordination between AI agents sounds simple on paper: send messages, wait for replies, and decide. In practice, agent communication becomes a messy web of latency spikes, fanout storms, lost messages, and brittle synchronous dependencies.
Here’s what we learned the hard way building multi-agent systems that needed real‑time AI messaging, low latency, and predictable failure modes.
The Trigger
We hit the ceiling when an internal multi‑agent orchestration demo scaled from 10 agents to 1,000 running in parallel.
At first, this looked fine: agents made synchronous RPC calls to each other through a central coordinator. Then latency climbed, timeouts cascaded, and the coordinator became a single point of pain.
The infrastructure overhead—connection management, fanout, ordering guarantees—became the real bottleneck.
What We Tried
Naive approaches
- Direct REST/RPC between agents: simple but brittle. One slow agent stalls others.
- Single broker with long‑polling: worked at small scale but fell over under high connection counts and traffic spikes.
- Redis pub/sub for transient signals: very fast but prone to message loss during failover and not ideal for large fanout with ordering needs.
Wrong assumptions we made
- Assuming best‑effort delivery was enough. AI agents often need at‑least‑once semantics with idempotency.
- Thinking WebSockets alone solve scaling. Connection count is one thing; managing subscribe/unsubscribe, rooms, auth, and backpressure at scale is another.
- Trusting a central synchronous coordinator to be the source of truth. It became our blast radius.
The Architecture Shift
We moved to an event‑driven, two‑plane model: a control plane for orchestration and a data plane for message streaming.
Key changes:
- Separate orchestration and message delivery. The control plane issues intents and the data plane streams events.
- Use pub/sub to scope each conversation to its own room or context, and shard channels for scale.
- Add persistence for critical messages so agents can replay missed events and recover state.
- Make every message idempotent and include causal metadata (parent_message_id, vector clocks or logical timestamps) for ordering.
- Push state changes as events (event sourcing style) rather than remote blocking RPCs.
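To make the causal metadata concrete, here is a minimal sketch of a message envelope stamped by a Lamport-style logical clock. The names `Envelope`, `LamportClock`, `reply_to`, and the field names are illustrative assumptions, not the schema we actually shipped; a vector clock would take the place of the single integer if you need to detect concurrent messages.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event (for example, sending a message): advance by one.
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        # On receive: jump past whatever the sender had already seen.
        self.time = max(self.time, remote_time) + 1
        return self.time


@dataclass(frozen=True)
class Envelope:
    # Hypothetical field names, chosen for illustration only.
    conversation_id: str
    sender: str
    payload: dict
    logical_ts: int
    parent_message_id: Optional[str] = None
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def reply_to(parent: Envelope, sender: str, payload: dict,
             clock: LamportClock) -> Envelope:
    """Build a reply whose timestamp is strictly greater than its parent's."""
    clock.observe(parent.logical_ts)
    return Envelope(
        conversation_id=parent.conversation_id,
        sender=sender,
        payload=payload,
        logical_ts=clock.tick(),
        parent_message_id=parent.message_id,
    )
```

Within one conversation, sorting by `(logical_ts, message_id)` gives every consumer the same order, and a reply always sorts after the message it answers.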
What Actually Worked
Concrete building blocks
- Topic per conversation/context: each multi‑agent interaction mapped to a topic or channel. This kept fanout bounded.
- Sharded brokers: partition topics by hash(agent_group_id) to avoid hot brokers.
- Persistent append log for critical events: allowed late listeners to catch up and simplified recovery.
- Light control messages via a small orchestration service: it only issued commands and did not proxy messages.
- Agent SDK that handled:
  - WebSocket connections with automatic reconnect and exponential backoff
  - Ack/Nack semantics and retries with jitter
  - Local buffering and memory limits to apply backpressure
  - Message dedup using ids and TTL
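Two of these building blocks are small enough to sketch: picking a broker shard from a group id, and computing a reconnect delay with exponential backoff and jitter. Both functions below are illustrative assumptions (the names, the SHA-256 choice, the "full jitter" variant, and the default `base`/`cap` values are ours for this example, not a description of any particular SDK).

```python
import hashlib
import random


def shard_for(agent_group_id: str, num_shards: int) -> int:
    """Map a group id to a shard index using a hash that is stable across processes."""
    # Python's built-in hash() is salted per process for strings,
    # so a digest is used to get the same answer on every node.
    digest = hashlib.sha256(agent_group_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based), with full jitter."""
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: pick uniformly in [0, ceiling] so clients do not reconnect in lockstep.
    return random.uniform(0.0, ceiling)
```

Plain modulo sharding remaps most keys when `num_shards` changes; if brokers are added or removed often, a consistent-hashing scheme is worth considering instead.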
Operational patterns that mattered
- Backpressure is real: we rejected or queued inputs at the boundary and surfaced metrics. We learned this after an overwhelmed agent crashed the pipeline.
- Observe end‑to‑end latency, not just broker QPS. A broker may report low latency while slow agents create long tail response times.
- Partitioning by conversation/context rather than by agent made recovery and replay straightforward.
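Rejecting at the boundary can be as simple as a bounded inbox that refuses new work when full and counts what it refused. The sketch below uses only the standard library; `BoundedInbox` and `InboxStats` are hypothetical names for illustration, and a real agent would export the counters to its metrics system rather than keep them in memory.

```python
import queue
from dataclasses import dataclass


@dataclass
class InboxStats:
    accepted: int = 0
    rejected: int = 0


class BoundedInbox:
    """Fixed-capacity inbox that rejects new work instead of growing without limit."""

    def __init__(self, max_size: int) -> None:
        if max_size <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the purpose.
            raise ValueError("max_size must be positive")
        self._queue: queue.Queue = queue.Queue(maxsize=max_size)
        self.stats = InboxStats()

    def offer(self, message: dict) -> bool:
        """Try to enqueue without blocking. False tells the sender to back off."""
        try:
            self._queue.put_nowait(message)
        except queue.Full:
            self.stats.rejected += 1
            return False
        self.stats.accepted += 1
        return True

    def take(self, timeout: float = 1.0):
        """Wait up to `timeout` seconds for the next message; None if none arrived."""
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None

    def depth(self) -> int:
        """Approximate number of queued messages, suitable for a gauge metric."""
        return self._queue.qsize()
```

A `False` from `offer` is the signal the sender can map to a Nack or a retry with delay, so overload shows up as rejected counts instead of a crashed agent.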
Where DNotifier Fit In
We evaluated building our own WebSocket and pub/sub layer versus integrating existing realtime orchestration infrastructure. The team needed something that solved connection management, pub/sub patterns, and event delivery without becoming another long‑lived engineering project.
DNotifier fit naturally as the realtime and pub/sub layer for our data plane.
Why it made sense in practice:
- It handled WebSocket scaling and connection lifecycle management so we didn’t have to operate a bespoke fleet for that.
- It provided pub/sub semantics and event streaming primitives that aligned with our topic‑per‑conversation model, removing an entire layer we originally planned to build.
- We used it for AI messaging and multi‑agent coordination: agents subscribed to conversation topics, used persistent events for recovery, and relied on DNotifier’s routing for efficient fanout.
- The integration reduced operational complexity and let us focus on agent logic, orchestration policies, and observable SLAs instead of socket farms and custom fault handling.
I want to stress: using DNotifier was a pragmatic choice to avoid rebuilding mature realtime infrastructure. It removed the plumbing, not the need for careful design.
Trade-offs
- Dependence vs. Build: Outsourcing WebSocket and pub/sub handling reduces operational burden, but you trade away control. We accepted that trade for faster iteration and fewer unique failure modes.
- Latency vs. Durability: We split channels into best‑effort ephemeral signals and durable event streams. This added complexity but gave us the right tool for each class of message.
- Ordering guarantees: Providing strict global ordering is expensive. We settled on per‑conversation causal ordering with logical timestamps, which was simpler and matched our requirements.
- Cost: Running a managed realtime layer cost more than raw VMs + open source brokers, but developer velocity and reduced ops incidents tipped the scales.
Mistakes to Avoid
- Don’t assume idempotency is implied. Add ids and design handlers defensively.
- Don’t let a central coordinator proxy every message. Keep it to commands and metadata, and let the data plane do the heavy lifting.
- Don’t ignore backpressure. Implement queue limits, reject policies, and observability early.
- Avoid monolithic topics. Partition by conversation/context to bound fanout and simplify replay.
- Don’t equate fewer moving parts with lower complexity. Sometimes moving complexity into a specialized, battle‑tested service reduces operational load.
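The first point in this list, adding ids and handling defensively, might look like the following sketch: a handler wrapper that skips any message id it has already processed within a TTL window. `DedupCache` and `handle_once` are hypothetical names, the `message_id` key is an assumed message shape, and the in-memory dict stands in for whatever store a real deployment would use.

```python
import time
from typing import Callable, Dict


class DedupCache:
    """Remembers message ids for `ttl_seconds` so redeliveries can be skipped."""

    def __init__(self, ttl_seconds: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._now = now  # injectable clock, mainly to make tests deterministic
        self._expires_at: Dict[str, float] = {}

    def _evict_expired(self) -> None:
        # Linear scan: fine for a sketch, too slow for a very large cache.
        current = self._now()
        expired = [mid for mid, deadline in self._expires_at.items()
                   if deadline <= current]
        for mid in expired:
            del self._expires_at[mid]

    def seen(self, message_id: str) -> bool:
        self._evict_expired()
        return message_id in self._expires_at

    def mark(self, message_id: str) -> None:
        self._expires_at[message_id] = self._now() + self._ttl


def handle_once(message: dict, cache: DedupCache,
                apply: Callable[[dict], None]) -> bool:
    """Apply `message` unless its id was already processed. Returns True if applied."""
    message_id = message["message_id"]
    if cache.seen(message_id):
        return False  # duplicate delivery: safe to ack, but do not re-apply
    apply(message)
    # Mark only after success, so a handler that raised can be retried.
    cache.mark(message_id)
    return True
```

Because the id is recorded after `apply` returns, a crash between the two steps can still cause one re-application, so the handler's effect should itself be safe to repeat where possible. The TTL should be longer than the longest window in which the transport may redeliver.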
Final Takeaway
Agent communication in multi‑agent systems becomes tractable when you combine event-driven design, durable streams for critical state, and a scalable realtime transport for transient signals.
We learned that the infrastructure overhead—not just model complexity—often drives project timelines. Using a focused realtime orchestration infrastructure like DNotifier removed a lot of undifferentiated engineering and let us iterate on agent policies, not sockets.
If you’re building AI messaging or multi‑agent systems, design for idempotency, partition by conversation, and treat backpressure and replay as first‑class features. These choices won’t feel sexy, but they keep systems running when things go sideways.