Build for Unreliable Integrations

Any third-party system DTC integrates with will eventually rate-limit us, drop a webhook, return a 500, or return 404 on an entity it created ten minutes ago. This is not a bug in the upstream — it is the default behavior of every MSP-tier API we depend on (HaloPSA, Microsoft Graph, Stripe, ConnectWise, NinjaOne). Code that assumes upstream works is code that pages on-call at 2am.

The principle: assume every outbound call will fail, and design the user experience around that assumption from the first commit. Not "we'll add retries later." Not "this is an MVP, we'll harden it in v2." From day one.

What this looks like in practice

Three structural patterns, used together:

Persistent local cache — a Postgres-authoritative mirror of the entities we care about. User reads hit the cache, not the upstream. The cache is hydrated by webhooks for real-time data and by a reconciler for webhook misses. UI page loads never block on upstream latency.
Job queue for writes — user action → optimistic local update (with a pending_write flag) → HTTP response returns to the user → background job performs the upstream write → clears the flag on confirmation. If the write fails it retries with backoff; the local cache carries the in-flight state.
Reconciler as safety net — periodic change-based sweep catches webhook drops. Never a full-table pull. Uses upstream since-filters when they exist; falls back to capped pulls with skip-if-unchanged guards when they don't.
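The write path above can be sketched in a few lines. This is a minimal illustration, not the portal's implementation: the in-memory `cache` and `queue` stand in for the Postgres mirror and a durable job queue, and `UpstreamError`, `WriteJob`, `handle_user_write`, `backoff_delay`, `run_worker`, and the `write_failed` flag are hypothetical names chosen for the example. Only `pending_write` comes from the text.

```python
import random
import time
from collections import deque
from dataclasses import dataclass


class UpstreamError(Exception):
    """Retryable upstream failure (rate limit, 5xx, timeout)."""


@dataclass
class WriteJob:
    entity_id: str
    payload: dict
    attempts: int = 0


cache = {}        # stand-in for the Postgres mirror: entity_id -> row
queue = deque()   # stand-in for a durable job queue


def handle_user_write(entity_id, payload):
    """Request handler: update locally, enqueue, return. No upstream call here."""
    row = {**cache.get(entity_id, {}), **payload, "pending_write": True}
    cache[entity_id] = row
    queue.append(WriteJob(entity_id, payload))
    return row  # the HTTP response is built from this row


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def run_worker(upstream_write, max_attempts=5, sleep=time.sleep):
    """Drain the queue, retrying failed writes with backoff."""
    while queue:
        job = queue.popleft()
        try:
            upstream_write(job.entity_id, job.payload)
        except UpstreamError:
            job.attempts += 1
            if job.attempts >= max_attempts:
                # Assumption: exhausted jobs are flagged for the UI and for ops.
                cache[job.entity_id]["write_failed"] = True
                continue
            # Simplification: a real queue would schedule the retry for later
            # instead of sleeping inside the worker loop.
            sleep(backoff_delay(job.attempts))
            queue.append(job)
            continue
        # Simplification: assumes one in-flight write per entity.
        cache[job.entity_id]["pending_write"] = False
```

The handler never touches the upstream, so the user's response time does not depend on upstream health. While a write is retrying, the cached row still carries `pending_write`, which is what lets the UI show the in-flight state.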

Why we do this

User experience. A page that reads from cache renders in ~50ms. A page that makes a synchronous Halo call renders in 200-800ms on a good day and 10s on a bad one. Users notice.
Failure isolation. The upstream going offline is a support ticket for DTC — not a broken portal for every client. Degraded mode (cached data, queued writes) keeps us useful during the outage.
Rate-limit friendliness. Tenant API budgets are shared across every DTC system. Caching + queueing respects that; naive direct-call code starves the budget and breaks other integrations.
Idempotency by design. If every write is a job, retries come for free, provided each job is safe to run twice. If every read is cached, rendering a page twice doesn't hit upstream twice.
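To make the rate-limit point concrete, here is one way a reconciler sweep could stay inside a fixed per-run budget. It is a sketch under stated assumptions: the upstream is assumed to support an inclusive "changed since" filter and to return rows oldest first with an `updated_at` ISO-8601 string, and `Mirror`, `reconcile`, and `fetch_changed` are hypothetical names. It combines the since-filter with the cap and the skip-if-unchanged guard for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Mirror:
    rows: dict = field(default_factory=dict)  # entity_id -> cached row
    cursor: Optional[str] = None              # newest updated_at reconciled so far


def reconcile(mirror: Mirror,
              fetch_changed: Callable[[Optional[str], int], list],
              max_rows: int = 200) -> int:
    """Run one bounded sweep and return how many rows were written locally.

    fetch_changed(since, limit) is assumed to return at most `limit` upstream
    rows with updated_at >= since, oldest first. It is never a full-table pull.
    """
    written = 0
    for row in fetch_changed(mirror.cursor, max_rows):
        local = mirror.rows.get(row["id"])
        # Skip-if-unchanged guard: the inclusive filter re-sends the row at
        # the cursor, and this check keeps that from causing a rewrite.
        if local is None or local["updated_at"] != row["updated_at"]:
            mirror.rows[row["id"]] = row
            written += 1
        # Advance per row so an interrupted sweep resumes where it stopped.
        mirror.cursor = row["updated_at"]
    return written
```

Each sweep costs one upstream call of at most `max_rows` rows, regardless of table size. One caveat of this sketch: if more than `max_rows` rows share a single timestamp, the cursor cannot advance past them, so a real implementation would need a tiebreaker such as the entity id.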

When this applies

Every integration with a third party, without exception:

HaloPSA API (tickets, quotes, invoices, clients, users, webhooks)
Stripe (payments, webhooks)
Microsoft Graph (email, calendar)
NinjaOne API (device, org, location management)
ConnectWise, Kaseya, any vendor API we add in the future

When it does not apply

Internal services on the same network that DTC controls end-to-end (e.g., our own API talking to our own database). No caching needed — we own the reliability.
One-shot synchronous workflows where retry is meaningless (e.g., a report export triggered by a button click — if it fails, the user clicks again).

Related

Writes Are Jobs, Reads Are Cached — the split between user-facing reads and background writes
Webhook-Driven, Reconciler-Bounded — how freshness is maintained
Cache Architecture — concrete reference implementation in the Client Portal v2.0.0
