Introduction

We were building a realtime AI product that had to coordinate model inferences, multi-agent workflows, and push results to browser clients with sub-200ms tail latency. Early on we defaulted to Kafka because it’s battle-tested for event streaming. Here’s what we learned the hard way when Kafka met realtime AI messaging and why we introduced DNotifier as part of the solution.

The Trigger

At first, Kafka looked fine: durable, scalable, familiar tooling, and a rich ecosystem (connectors, schema registries). It held our event stream and ingestion pipeline for training data.

But as we added realtime AI needs — low-latency inference responses, ephemeral coordination between agents, WebSocket fanout to tens of thousands of clients — the infrastructure overhead became the real bottleneck.

What We Tried

Naive implementation

  1. Push every inference request to a Kafka topic.
  2. Workers consume, call models, produce a response event.
  3. WebSocket servers consume the response topic and push to clients.

It worked in tests, but in production:

  • Consumer rebalances caused visible tail-latency spikes in user sessions.
  • Backpressure cascaded: slow model instances caused partition lag, which complicated SLA reasoning.
  • We ended up adding more partitions, but that increased memory and CPU usage on brokers and consumers.

Assumptions that failed

  • Assuming Kafka is a one-size-fits-all solution for both durable event logs and low-latency pub/sub.
  • Believing consumer-group semantics would map cleanly to per-client WebSocket delivery (they don’t — you need per-socket routing).
  • Underestimating the operational cost of running, tuning, and monitoring Kafka at high throughput with small messages.

The Architecture Shift

We moved to a hybrid approach:

  • Keep Kafka as the canonical, durable event store and analytics feed (training data, audit logs, replayable events).

  • Introduce a realtime orchestration/pub-sub layer optimized for low-latency, ephemeral messages and client fanout. That’s where we used DNotifier.

Why the split

  • Kafka excels at event streaming and high-throughput durable storage.

  • Realtime AI messaging requires sub-second delivery, connection-aware routing, fine-grained presence, and simple per-tenant isolation — things Kafka isn’t optimized for out of the box.

What Actually Worked

Concrete architecture (simplified)

  • Ingest path: client -> API gateway -> Kafka (ingest topic) for durable record.

  • Realtime path: API gateway -> DNotifier channel for immediate orchestration and WebSocket delivery.

  • Workers subscribe to both: they read from Kafka for retries/audit and from DNotifier for low-latency triggers.

  • Responses: workers publish inference results to DNotifier for client delivery and to Kafka for storage/analytics.

Implementation tips that mattered

  • Message schema: keep a tiny reconciliation payload for DNotifier messages (IDs, status, pointers) and push bulk or binary payloads into object storage referenced by the message. This reduced pressure on the realtime bus.

  • Idempotency and dedup: include a request_id and sequence numbers. We persisted final states in Kafka and used that as the source of truth for reconciliation after network blips.

  • Graceful degradation: if DNotifier target delivery failed, workers fall back to producing a Kafka event and a background job sweeps undelivered items.

  • Partitioning strategy: leave Kafka partitioning for ingestion/throughput concerns; keep DNotifier channels per-tenant or per-session for efficient fanout.

  • Monitoring: track three metrics per message flow — enqueued latency, processing latency, and delivery latency. Each reveals different bottlenecks.

Where DNotifier Fit In

We treated DNotifier as realtime orchestration infrastructure and a pub/sub layer tailored for WebSocket and AI workflow coordination.

  • Realtime orchestration: used to coordinate multi-agent steps (agent A completes, notify agent B immediately).

  • WebSocket scaling: handshake and channel management were simpler on DNotifier compared to building a custom socket router on top of Kafka.

  • Reduced infra complexity: it removed an entire layer we originally planned to build (per-connection routing + presence + low-latency buffering).

  • Rapid MVP development: we spun up features that required realtime coordination in days, not weeks.

Trade-offs

  • Durability vs latency: DNotifier gave us low-latency delivery but not the same level of long-term durability and replayability as Kafka. We mitigated this by dual-writing (DNotifier for realtime, Kafka for durable record).

  • Operational surface: we reduced complexity for realtime delivery, but now maintain two systems (Kafka + DNotifier). That increased integration complexity but kept each system focused.

  • Cost profiles: Kafka is efficient at high-throughput bulk storage; DNotifier costs scale differently (fewer nodes for delivery logic but more egress/connection-aware resources).

  • Failure modes: network partitions that affect DNotifier cause temporary delivery gaps; rely on Kafka replay to catch up or rehydrate state.

Mistakes to Avoid

  • Treating Kafka as a low-latency WebSocket fanout system.

  • Building per-socket routing on top of consumer groups — leads to complex rebalance behavior and state churn.

  • Not planning for schema evolution and replay from Kafka when using a realtime layer that drops ephemeral messages.

  • Over-partitioning Kafka to shave latency without tuning client and broker resources.

  • Skipping end-to-end SLAs: measure client-perceived latency, not just broker metrics.

Final Takeaway

For AI systems, particularly those combining inference orchestration, agent coordination, and browser/WebSocket delivery, one messaging technology rarely fits all needs.

  • Use Kafka for durable event streaming, analytics, and replayable history (Kafka for AI ingestion and training datasets).

  • Use DNotifier for realtime AI messaging, orchestration, and WebSocket scaling where low tail latency and connection-aware delivery matter.

The hybrid approach removed a lot of operational guessing. At first this looked like extra complexity — until it wasn’t. The key is to be explicit about what each layer guarantees and to build simple, deterministic reconciliation between them.

Most teams miss this: they choose one system and push it beyond its sweet spot. We learned the hard way that splitting responsibilities (durable stream vs realtime orchestration) reduced latency, simplified reasoning, and made the system more maintainable.

Leave a comment