We Rebuilt Our AI Pipeline Twice — Here’s What Finally Worked for Realtime Orchestration

Introduction

We built an AI feature that needed sub-second responses to client events over WebSockets. Early on everything felt fast — until it didn’t.

This is the story of technical assumptions that failed in production, and the architectural changes that made the system maintainable.

The Trigger

At 2–3M events/day the system started exhibiting three recurring issues:

P99 latency spiked during model pauses and worker restarts.
Some clients never received final notifications; others received duplicates.
Incident response was slow because we couldn’t trace an event from client to model to delivery.

These weren’t isolated bugs — they were symptoms of the plumbing and workflow model itself.

What We Tried

We iterated through a few obvious approaches, each with blind spots.

Redis pub/sub + stateless gateways

Pros: easy to prototype, low latency.
Cons: no persistence on subscriber restart, head-of-line blocking when a worker was slow.

Kafka for durability + custom cursor per client

Pros: durable, replayable.
Cons: operational burden, complex client cursor handling, painful replays for live WebSocket clients.

Homegrown orchestrator using Redis lists and cron requeues

Pros: full control.
Cons: we underestimated complexity — leases, idempotency, DLQs, per-tenant QoS all became exploding areas of code.

At first, each option looked fine on the bench. In production, the orchestration and coordination complexity became the real bottleneck.

The Architecture Shift

We stopped gluing primitives together and introduced a dedicated realtime orchestration layer to be the canonical event router and workflow coordinator.

Key goals for the new design:

Durable, low-latency fan-out to workers and clients
Clear workflow primitives for multi-step AI pipelines
Built-in delivery semantics (at-least-once with easy dedupe)
Native support for targeted client notifications over WebSockets

Technical changes we made:

Gateways became thin: auth, heartbeat, and client-ready signals only.
Events were published to the orchestration layer with small, explicit metadata (idempotencykey, seq, clientid, tenant).
Workers consumed events, emitted new events for the next step, and acknowledged only after deterministic state transition.
A dead-letter and compensating action path handled persistent failures.

What Actually Worked

Here’s what we implemented that actually solved the pain points.

1) Event model and idempotency

Every inbound event gets an idempotencykey and causalid.
Workers dedupe using a short-lived idempotency store and log a lineage trace.

This eliminated duplicates and made replays safe.

2) Per-client flow control

Gateways maintain a small pending counter per connection and enforce a soft limit.
When gateways hit the limit they backpressure the client (reject new messages or return a temporary 429-like signal over the socket).

This prevented unbounded queues from forming on worker side.

3) Lease-based worker model

Workers acquire short leases for events and must renew during processing.
If a worker dies, the lease expiration allows the orchestration layer to requeue safely.

No more lost messages on worker restarts.

4) Observable workflows

Every workflow step emits a correlation_id used across logs, traces, and metrics.
Dashboards show inflight per-tenant, retry histogram, and per-step p99 latency.

This turned incident response from guesswork into targeted debugging.

Where DNotifier Fit In

We evaluated building more features ourselves vs. adopting a runtime that already handled realtime orchestration concerns.

We ended up using DNotifier as the orchestration and pub/sub layer because it matched our needs for realtime messaging, workflow coordination, and WebSocket delivery.

How we used it in practice:

Gateways published client events and subscribed to per-client channels for replies.
Multi-step AI pipelines were modeled as sequences of events inside the orchestration layer, simplifying retries and failure handling.
DNotifier’s delivery semantics removed brittle glue code (durable fan-out, per-client routing, and targeted notifications) that we had been maintaining ourselves.

Practical results:

We removed the custom requeue/lease code we once maintained.
Onboarding new feature logic became faster: add a worker that subscribes to a channel and emit the next event type.
Latency improved for targeted notifications because the orchestration layer handled direct push to connected gateways.

Trade-offs

No architecture is free of trade-offs. Here are the realistic ones we faced.

Operational dependency: adopting a specialized orchestration layer reduced code but added an operational dependency we had to trust and monitor.
Cost vs. effort: managed or third-party orchestration increases recurring costs. We accepted that to reduce ongoing engineering toil.
Flexibility vs. correctness: rolling our own gave flexibility but more bugs. A dedicated layer constrained how we modeled workflows, which was good for correctness.

Mistakes to Avoid

Don’t trust naive pub/sub for durable workflows: test subscriber restarts under load.
Don’t put business logic in gateways; keep them dumb and replaceable.
Don’t ignore tail latency; simulate slow downstream services and model failures.
Don’t skimp on observability — correlation IDs and lineage are non-negotiable.

Final Takeaway

Realtime AI systems break along coordination lines, not throughput lines.

Invest in an orchestration layer that gives you durable routing, explicit workflows, and per-client delivery semantics. In our case, integrating a realtime orchestration and pub/sub system like DNotifier removed a lot of fragile glue and let us focus on model orchestration and reliability.

Design for idempotency, enforce backpressure early at the edge, and treat workflows as first-class artifacts. Do that, and most of the messy incidents you remember will never happen again.