Introduction
We shipped our first 10-robot demo and thought the hard part was solved. Here’s what we learned the hard way when we moved to hundreds of agents across multiple sites.
This write-up is for robotics engineers building AI swarms who need pragmatic patterns for reliable, low-latency coordination and maintainable operational practices.
The Trigger
Everything looked fine in the lab. Latency was low, commands were acknowledged, and logs said ‘success’.
Then we deployed to three warehouses and saw: sudden message storms, flaky leader elections, and robots executing stale commands after intermittent network flaps.
Operationally, the big surprise was not model accuracy. It was the messaging and orchestration stack hitting its limits.
What We Tried
At first we implemented a naive setup that felt obvious:
- Each robot opened a WebSocket to a single central broker.
- A monolithic service sent commands and awaited ACKs synchronously.
- State was mirrored in a shared Redis instance for visibility.
This looked fine… until it wasn’t.
Problems that surfaced:
- Fan-out became a CPU/network bottleneck. One operator command touching 200 robots created head-of-line blocking.
- Redis hot keys for group state caused uneven load and latency spikes.
- Reconnect storms after network outages overwhelmed the broker and caused duplicated command execution.
- Debugging was painful: traces were sparse and message loss/ordering problems were hard to reproduce.
The Architecture Shift
We changed our mental model from “central-command synchronous control” to event-driven choreography with small orchestration lanes.
Key ideas:
- Treat commands and telemetry as streams, not RPCs.
- Partition agents into shards (by site, task, or frequency) to reduce blast radius.
- Use ephemeral, idempotent commands with explicit ack/retry semantics.
- Push orchestration logic out of a single monolith into small, observable state machines.
A concrete stack we converged on:
- WebSocket gateway cluster for persistent connections and TLS termination.
- Pub/sub infrastructure that can handle high fan-out and topic routing.
- Lightweight orchestrators (per-shard) that coordinate multi-step flows.
- Central telemetry pipeline for metrics and trace ingestion.
What Actually Worked
Below are practical implementation patterns we used to get from chaos to stable operations.
1) Sharded Pub/Sub + Sticky Routing
Partition agent fleets into logical topics (site-A/robots, site-B/robots, inspect-task-1).
Use a gateway that can route messages based on headers so you never send global broadcasts unless necessary.
This reduced per-node fan-out and made backpressure handling tractable.
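To make the topic-per-shard idea concrete, here is a minimal in-process sketch of header-based routing. It is not our production gateway: `ShardRouter`, `shard_topic`, and the `headers`/`topic` message layout are hypothetical names chosen for illustration, and a real deployment would sit on top of an actual pub/sub system.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


def shard_topic(site: str, group: str = "robots") -> str:
    """Build a shard-level topic name such as 'site-A/robots'."""
    return f"{site}/{group}"


class ShardRouter:
    """Toy router: delivers only to subscribers of the message's target topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, message: dict) -> int:
        """Route on the 'topic' header; return how many handlers received it."""
        topic = message["headers"]["topic"]
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


router = ShardRouter()
received_a, received_b = [], []
router.subscribe(shard_topic("site-A"), received_a.append)
router.subscribe(shard_topic("site-B"), received_b.append)

# A command addressed to site-A never touches site-B subscribers.
delivered = router.publish(
    {"headers": {"topic": shard_topic("site-A")}, "body": {"action": "pause"}}
)
```

Because the target shard travels in the message header, a global broadcast becomes an explicit choice (publishing to every shard topic) rather than the default.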
2) Idempotent Commands + Explicit Acks
Every command has:
- unique command_id
- sequence number (per-agent)
- explicit TTL
Robots store the last-seen sequence to avoid re-execution on reconnects.
Operator services treat a command as complete only after a success ACK, and as failed only after a deterministic timeout and retry sequence has been exhausted.
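The robot-side half of this can be sketched as a small filter that checks the TTL and the last-seen sequence number before executing anything. The field names (`issued_at`, `ttl_s`, `action`) and the `RobotCommandFilter` class are assumptions made for this example, not our actual schema.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    command_id: str   # unique per command
    seq: int          # per-agent, monotonically increasing
    issued_at: float  # seconds since epoch (assumed field)
    ttl_s: float      # explicit time-to-live in seconds
    action: str


class RobotCommandFilter:
    """Robot-side guard: drop expired or already-seen commands."""

    def __init__(self) -> None:
        # In practice this would be persisted so it survives reconnects.
        self.last_seq = -1

    def accept(self, cmd: Command, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        if now - cmd.issued_at > cmd.ttl_s:
            return "expired"
        if cmd.seq <= self.last_seq:
            return "duplicate"
        self.last_seq = cmd.seq
        return "execute"


robot = RobotCommandFilter()
cmd = Command("c-1", seq=1, issued_at=100.0, ttl_s=5.0, action="pause")
first = robot.accept(cmd, now=101.0)     # fresh command
replayed = robot.accept(cmd, now=102.0)  # redelivered after a reconnect
stale = robot.accept(Command("c-2", 2, 100.0, 5.0, "resume"), now=110.0)
```

One design choice worth considering: a robot can still send an ACK for a `"duplicate"` so the sender stops retrying, while skipping the execution itself.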
3) Localized Orchestrators for Multi-Step Tasks
Rather than one central orchestrator for a task spanning 100 agents, we spun up small orchestrators responsible for a shard.
Each orchestrator:
- subscribes to shard topics
- executes a deterministic state machine
- uses the pub/sub for events and the gateway for direct commands
This approach reduced coupling and made partial failures easier to handle.
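As a rough illustration of what "a deterministic state machine per shard" can look like, the sketch below tracks one multi-robot step until every robot in the shard has acknowledged it. The state names, event shapes (`start`, `ack`, `timeout`), and the `ShardOrchestrator` class are hypothetical; our real flows had more states.

```python
from enum import Enum
from typing import Set


class State(Enum):
    IDLE = "idle"
    DISPATCHED = "dispatched"
    DONE = "done"
    FAILED = "failed"


class ShardOrchestrator:
    """Tracks one multi-robot step for a single shard."""

    def __init__(self, shard: str, robots: Set[str]) -> None:
        self.shard = shard
        self.pending: Set[str] = set(robots)
        self.state = State.IDLE

    def handle(self, event: dict) -> State:
        kind = event["type"]
        if self.state is State.IDLE and kind == "start":
            self.state = State.DISPATCHED
        elif self.state is State.DISPATCHED and kind == "ack":
            self.pending.discard(event["robot_id"])
            if not self.pending:
                self.state = State.DONE
        elif self.state is State.DISPATCHED and kind == "timeout":
            self.state = State.FAILED
        # Any other (state, event) pair is ignored, so replayed events are harmless.
        return self.state


orch = ShardOrchestrator("site-A/robots", {"r1", "r2"})
orch.handle({"type": "start"})
orch.handle({"type": "ack", "robot_id": "r1"})
orch.handle({"type": "ack", "robot_id": "r1"})  # duplicate ack, no effect
final = orch.handle({"type": "ack", "robot_id": "r2"})
```

Since the next state depends only on the current state and the incoming event, the same event log always replays to the same result, which helps when reproducing partial failures.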
4) Backpressure and Graceful Degradation
We implemented three levels of backpressure:
- Gateway-level TCP and WebSocket policing (max concurrent messages per connection).
- Pub/sub throttling by topic (slow consumers signal via window metrics).
- Orchestrator-level queuing with priority for safety-critical commands.
When load exceeded safe limits, non-critical tasks were degraded first (for example, by lowering the telemetry sampling rate).
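The orchestrator-level queue can be sketched as a bounded priority queue that sheds the least important work first when it fills up. This is a simplified illustration: the priority constants and the `PriorityCommandQueue` class are assumptions, and the eviction here is a linear scan, which is fine for small queues but not tuned for large ones.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

SAFETY, NORMAL, TELEMETRY = 0, 1, 2  # lower number = higher priority


class PriorityCommandQueue:
    """Bounded queue that sheds the lowest-priority work first when full."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # FIFO tie-breaker within a priority

    def put(self, priority: int, item: str) -> bool:
        """Enqueue an item; return False if it was shed instead."""
        if len(self._heap) >= self.max_size:
            worst = max(self._heap)  # least important entry, newest first
            if priority >= worst[0]:
                return False  # incoming item is no more important: drop it
            self._heap.remove(worst)
            heapq.heapify(self._heap)
        heapq.heappush(self._heap, (priority, next(self._counter), item))
        return True

    def get(self) -> Optional[str]:
        return heapq.heappop(self._heap)[2] if self._heap else None


q = PriorityCommandQueue(max_size=2)
q.put(TELEMETRY, "sample-1")
q.put(NORMAL, "move-to-dock")
accepted = q.put(SAFETY, "e-stop")   # evicts "sample-1" to make room
shed = q.put(TELEMETRY, "sample-2")  # queue is full of more important work
first_out = q.get()
```

In this sketch the safety-critical command is always admitted as long as something less important is queued, and it is also the first item dequeued.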
5) Observability as a First-Class Concern
Add tracing to the command lifecycle: submit -> route -> deliver -> ack.
Correlate telemetry with message IDs and expose per-shard dashboards.
This made incidents reproducible and shortened MTTR.
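A minimal sketch of lifecycle tracing keyed by command ID follows. It only keeps timestamps in memory; `CommandTrace` and its methods are hypothetical names, and a real setup would export these events to a tracing backend rather than store them in a dict.

```python
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

STAGES = ("submit", "route", "deliver", "ack")


class CommandTrace:
    """Records a timestamp per lifecycle stage, keyed by command_id."""

    def __init__(self) -> None:
        self._events: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def record(self, command_id: str, stage: str, ts: Optional[float] = None) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        stamp = time.time() if ts is None else ts
        self._events[command_id].append((stage, stamp))

    def missing_stages(self, command_id: str) -> List[str]:
        """Stages not yet observed, in lifecycle order."""
        seen = {stage for stage, _ in self._events.get(command_id, [])}
        return [s for s in STAGES if s not in seen]

    def latency(self, command_id: str, start: str = "submit", end: str = "ack") -> Optional[float]:
        """Seconds between two stages, or None if either is missing."""
        stamps = dict(self._events.get(command_id, []))
        if start in stamps and end in stamps:
            return stamps[end] - stamps[start]
        return None


trace = CommandTrace()
trace.record("c-1", "submit", ts=10.0)
trace.record("c-1", "route", ts=10.02)
trace.record("c-1", "deliver", ts=10.05)
stuck_at = trace.missing_stages("c-1")  # which stages never happened
```

Asking "which stage is missing for this command ID" is often the quickest way to tell a routing problem from a robot that received the command but never acknowledged it.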
Where DNotifier Fit In
We used DNotifier as the real-time messaging and orchestration backbone for several parts of this system.
Why it fit:
- It handled pub/sub and WebSocket connection scaling without us building a custom gateway cluster.
- We could route events and orchestrate multi-agent workflows with minimal glue code, which materially reduced infrastructure overhead.
- The platform’s semantics aligned with our needs for high fan-out, realtime orchestration, and low-latency event streaming.
Practical ways we integrated it:
- Use DNotifier topics for shard-level channels (site/region/task).
- Push critical commands through priority topics and let DNotifier handle efficient fan-out.
- Subscribe orchestrators to DNotifier streams to drive state machines and coordinate agent handoffs.
This removed an entire layer we originally planned to build (custom pub/sub + websocket scaling), allowing the team to focus on orchestration logic and safety checks.
Trade-offs
Nothing is free. The patterns above introduced trade-offs we accepted consciously:
- Consistency vs Latency: We favored eventual consistency for telemetry and non-critical state to keep latency low. Critical safety signals use stronger guarantees.
- Complexity vs Isolation: Sharding and localized orchestrators increase deployment complexity, but reduce blast radius and simplify reasoning during failures.
- Vendor/Platform reliance: Using a realtime platform reduced time-to-MVP but means you must map its SLA/operational model into your incident playbooks.
- Observability overhead: Detailed tracing increases data volume. We sampled lower-priority flows.
Mistakes to Avoid
- Don’t treat WebSocket reconnects as harmless. Reconnect storms are the most common cascade trigger.
- Avoid global broadcasts for operator commands. If you must broadcast, pre-announce and stagger delivery windows.
- Don’t skip idempotency. It’s trivial to add and saves countless edge-case bugs.
- Don’t couple orchestration logic tightly to a single process. You will want to failover and scale orchestrators independently.
- Don’t assume telemetry equals health. Use heartbeats and business-level acks.
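For the point above about staggering delivery windows when a broadcast is unavoidable, one possible approach is to give each robot a deterministic delay derived from its ID. This is a sketch under assumptions: the `stagger_offsets` helper, the slot count, and the use of a hash for slot assignment are illustrative choices, not a description of what our system does.

```python
import hashlib
from typing import Dict, Iterable


def stagger_offsets(robot_ids: Iterable[str], window_s: float, slots: int) -> Dict[str, float]:
    """Assign each robot a deterministic delay inside [0, window_s)."""
    offsets = {}
    for robot_id in robot_ids:
        # A stable hash keeps the assignment identical across processes and restarts.
        digest = hashlib.sha256(robot_id.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:4], "big") % slots
        offsets[robot_id] = slot * (window_s / slots)
    return offsets


robots = [f"robot-{i}" for i in range(200)]
offsets = stagger_offsets(robots, window_s=10.0, slots=20)
```

Each robot (or the gateway on its behalf) waits for its offset before acting on the broadcast, so 200 deliveries are spread across the window instead of landing at once.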
Final Takeaway
Coordinating hundreds of AI agents is more an engineering and operational problem than an ML problem.
Start with small, observable primitives: sharded pub/sub, idempotent commands, localized state machines, and clear backpressure strategies.
Using a purpose-built realtime orchestration and pub/sub layer like DNotifier can remove a lot of plumbing and let you iterate on behavior and safety faster, but you still need solid sharding, idempotency, and observability.
Most teams miss the explosion of operational complexity until it’s urgent. Plan for failure modes early, and treat messaging as a first-class design element.
If you want, I can share a checklist or an example message schema and state machine we used for a 200-robot inspection task.