
Build for Unreliable Integrations

Any third-party system DTC integrates with will eventually rate-limit us, drop a webhook, return a 500, or return 404 on an entity it created ten minutes ago. This is not a bug in the upstream — it is the default behavior of every MSP-tier API we depend on (HaloPSA, Microsoft Graph, Stripe, ConnectWise, NinjaOne). Code that assumes upstream works is code that pages on-call at 2am.

The principle: assume every outbound call will fail, and design the user experience around that assumption from the first commit. Not "we'll add retries later." Not "this is an MVP, we'll harden it in v2." From day one.


What this looks like in practice

Three structural patterns, used together:

  1. Persistent local cache — a Postgres-authoritative mirror of the entities we care about. User reads hit the cache, not the upstream. The cache is hydrated by webhooks for real-time data and by a reconciler for webhook misses. UI page loads never block on upstream latency.
  2. Job queue for writes — user action → optimistic local update (with a pending_write flag) → HTTP response returns to the user → background job performs the upstream write → clears the flag on confirmation. If the write fails it retries with backoff; the local cache carries the in-flight state.
  3. Reconciler as safety net — periodic change-based sweep catches webhook drops. Never a full-table pull. Uses upstream since-filters when they exist; falls back to capped pulls with skip-if-unchanged guards when they don't.
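
The three patterns above can be sketched in a few dozen lines. This is a minimal in-memory illustration, not DTC's implementation: `FakeUpstream`, `Portal`, `UpstreamError`, and every method name are hypothetical, a dict stands in for the Postgres cache, and a deque stands in for the job queue. It assumes the upstream exposes some form of since-filter.

```python
import time
from collections import deque
from dataclasses import dataclass, field


class UpstreamError(Exception):
    """Any rate limit, timeout, or 5xx from the third party (hypothetical)."""


@dataclass
class FakeUpstream:
    """Stand-in for a vendor API; fails the first `failures` writes."""
    failures: int = 0
    rows: dict = field(default_factory=dict)  # id -> (fields, changed_at)
    clock: int = 0

    def write(self, entity_id, fields):
        if self.failures > 0:
            self.failures -= 1
            raise UpstreamError("503 from upstream")
        self.clock += 1
        self.rows[entity_id] = (dict(fields), self.clock)

    def changed_since(self, cursor):
        # Assumed since-filter: only rows changed after the cursor.
        return [(i, f, ts) for i, (f, ts) in self.rows.items() if ts > cursor]


@dataclass
class Portal:
    upstream: FakeUpstream
    cache: dict = field(default_factory=dict)  # id -> {"fields", "pending_write"}
    jobs: deque = field(default_factory=deque)
    cursor: int = 0

    def read(self, entity_id):
        # Pattern 1: user reads hit the cache, never the upstream.
        return self.cache.get(entity_id)

    def submit_write(self, entity_id, fields):
        # Pattern 2: optimistic local update, enqueue, return immediately.
        row = self.cache.setdefault(entity_id, {"fields": {}, "pending_write": False})
        row["fields"] = dict(fields)
        row["pending_write"] = True
        self.jobs.append((entity_id, dict(fields)))

    def run_jobs(self, max_attempts=5, base_delay=0.5, sleep=time.sleep):
        # Background worker: retry each upstream write with exponential backoff.
        while self.jobs:
            entity_id, fields = self.jobs.popleft()
            for attempt in range(max_attempts):
                try:
                    self.upstream.write(entity_id, fields)
                except UpstreamError:
                    sleep(base_delay * 2 ** attempt)
                    continue
                self.cache[entity_id]["pending_write"] = False
                break
            # If every attempt failed, pending_write stays set so the UI
            # can keep showing the in-flight state.

    def reconcile(self):
        # Pattern 3: change-based sweep from a cursor, never a full pull.
        applied = 0
        for entity_id, fields, ts in self.upstream.changed_since(self.cursor):
            self.cursor = max(self.cursor, ts)
            row = self.cache.get(entity_id)
            # Simplification: a local in-flight write wins over the upstream
            # copy, and rows whose fields already match are skipped.
            if row and (row["pending_write"] or row["fields"] == fields):
                continue
            self.cache[entity_id] = {"fields": dict(fields), "pending_write": False}
            applied += 1
        return applied
```

In this sketch, `submit_write` returns before any upstream call happens, so the user-facing request never waits on the vendor. A real worker would run `run_jobs` in a separate process and persist both the queue and the cursor.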

Why we do this

  • User experience. A page that reads from cache renders in ~50ms. A page that makes a synchronous Halo call renders in 200-800ms on a good day and 10s on a bad one. Users notice.
  • Failure isolation. The upstream going offline is a support ticket for DTC — not a broken portal for every client. Degraded mode (cached data, queued writes) keeps us useful during the outage.
  • Rate-limit friendliness. Tenant API budgets are shared across every DTC system. Caching + queueing respects that; naive direct-call code starves the budget and breaks other integrations.
  • Idempotency by design. If every write is a job with its own stable ID, a retry repeats the same write instead of creating a new one. If every read is cached, double-rendering doesn't double-hit upstream.
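
One way to make job retries safe is to give each job a key at enqueue time and record completed keys, so a redelivered job becomes a no-op. The sketch below is illustrative only: `WriteJob` and `JobLedger` are hypothetical names, and the in-memory dict stands in for what would normally be a database table.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class WriteJob:
    """One logical upstream write. The key is assigned once, at enqueue time."""
    action: str
    payload: dict
    key: str = field(default_factory=lambda: uuid.uuid4().hex)


class JobLedger:
    """Remembers which job keys have already completed (hypothetical helper)."""

    def __init__(self):
        self._done = {}  # job key -> result of the upstream call

    def run_once(self, job, perform):
        # A retry or duplicate delivery of the same job returns the stored
        # result instead of calling the upstream a second time.
        if job.key in self._done:
            return self._done[job.key]
        result = perform(job)
        self._done[job.key] = result
        return result
```

The key is generated per job rather than hashed from the payload, so two deliberate, identical user actions remain two separate writes, while a retry of one job stays a single write.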

When this applies

Every integration with a third party, without exception:

  • HaloPSA API (tickets, quotes, invoices, clients, users, webhooks)
  • Stripe (payments, webhooks)
  • Microsoft Graph (email, calendar)
  • NinjaOne API (device, org, location management)
  • ConnectWise, Kaseya, any vendor API we add in the future

When it does not apply

  • Internal services on the same network that DTC controls end-to-end (e.g., our own API talking to our own database). No caching needed — we own the reliability.
  • One-shot synchronous workflows where retry is meaningless (e.g., a report export triggered by a button click — if it fails, the user clicks again).