Observability Is a First-Class Citizen

When something breaks in production, the first question is always: "what happened?" The answer has to be available in under five minutes, without deploying code, without adding logging, without guessing.

The rule: every DTC-built system has structured logs with entity IDs, live state pushed to the UI where users can see it, and durable audit tables that humans can query in SQL.

Structured logs, not string logs

Logs carry structured fields, not formatted strings. Every log line that touches an entity carries that entity's ID:

// Good
tracing::info!(
    quote_id = 44,
    invoice_id = 115400,
    payment_intent = %pi.id,
    "quote approve async chain complete"
);

// Bad
tracing::info!("approved quote 44 with invoice 115400 and PI {}", pi.id);

The difference: the first is queryable. Grep across logs for quote_id=44 and you get every log line that touched that quote across the entire system, not just the ones that happened to format the ID the same way.
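To make "queryable" concrete, here is a minimal sketch of exact field matching over log lines. It assumes the logs are rendered as space-separated `key=value` text. The helper names `has_field` and `lines_for_entity` are hypothetical, not part of any existing codebase.

```rust
/// Returns true when `line` carries the exact structured field `key=value`.
/// Assumes a plain text rendering with space-separated `key=value` fields.
fn has_field(line: &str, key: &str, value: &str) -> bool {
    line.split_whitespace().any(|token| {
        token
            .split_once('=')
            .map_or(false, |(k, v)| k == key && v == value)
    })
}

/// Keeps only the log lines that touched the given entity.
fn lines_for_entity<'a>(lines: &[&'a str], key: &str, value: &str) -> Vec<&'a str> {
    lines
        .iter()
        .copied()
        .filter(|line| has_field(line, key, value))
        .collect()
}
```

Because the match is on the whole field, `quote_id=44` does not match `quote_id=440`, and a string log such as "approved quote 44" is not found at all.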

Audit tables are queryable

Any operation whose outcome a human might need to investigate after the fact lives in a database table, not just logs.

Job queue (job_queue) — every async job, its status, attempts, last error, payload. SELECT * FROM job_queue WHERE status = 'pending' ORDER BY updated_at answers "what's stuck?"
Webhook log (webhook_logs) — every inbound webhook, raw payload, signature verification result, processing timestamp.
Audit markers (portal_welcome_sent, backfill checkpoints, etc.) — durable records of one-shot operations.

Logs are transient (rotated, sampled, sometimes lost). Tables are durable. Anything the business might want to audit or debug weeks later goes in a table.
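As an illustration of the "what's stuck?" question, the sketch below models job_queue rows in memory and applies the same filter and ordering as the query above. The `Job` struct, the Unix-seconds timestamp, and the `max_age_secs` cutoff are assumptions for the example; the payload column is omitted for brevity, and the real table is queried in SQL.

```rust
/// In-memory stand-in for a `job_queue` row (illustrative only).
#[derive(Debug, Clone, PartialEq)]
struct Job {
    id: u64,
    status: String,
    attempts: u32,
    last_error: Option<String>,
    updated_at: u64, // Unix seconds (assumed representation)
}

/// Mirrors `WHERE status = 'pending' ORDER BY updated_at`, plus a hypothetical
/// age cutoff so that freshly queued jobs are not reported as stuck.
fn stuck_jobs(jobs: &[Job], now: u64, max_age_secs: u64) -> Vec<&Job> {
    let mut stuck: Vec<&Job> = jobs
        .iter()
        .filter(|j| j.status == "pending" && now.saturating_sub(j.updated_at) > max_age_secs)
        .collect();
    // Oldest first: the job that has waited longest is the most suspicious.
    stuck.sort_by_key(|j| j.updated_at);
    stuck
}
```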

Live state pushed to the UI

When an async operation's state changes, users see it change without refreshing. The API streams Server-Sent Events (SSE), fed by Postgres LISTEN/NOTIFY, so any process can publish a change and every subscriber receives it.

Payment processing → "Recording payment" badge that updates to "Paid" when the webhook lands
Quote approval chain → inline spinner that becomes the Stripe form when the PI is minted
Background job completion → tracker footer updates from "Preparing..." to "Ready to Pay" with no refresh

The UX value is obvious. The debugging value is bigger — when support asks "what's the portal showing the customer right now?" the answer is what they see, and what they see is a live projection of current system state.
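For reference, this is roughly what one state change looks like on the wire. The sketch formats a single SSE frame by hand. The function name `sse_frame` and the event names in the test are hypothetical, and the code assumes a non-empty payload and an event name without newlines.

```rust
/// Formats one Server-Sent Events frame: an `event:` line, one `data:` line
/// per line of payload, and a blank line that terminates the frame.
fn sse_frame(event: &str, data: &str) -> String {
    let mut frame = format!("event: {}\n", event);
    for line in data.lines() {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    // The empty line tells the client the event is complete.
    frame.push('\n');
    frame
}
```

A multi-line payload is split across several `data:` lines; the browser's EventSource joins them back together with newlines.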

Correlation IDs across systems

Every request carries a correlation ID that flows into every log line and every outbound API call. If a Halo API call fails, the log line has the portal request ID AND the Halo request ID (from their response header). Join the two and you can debug across system boundaries.
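A minimal sketch of carrying both IDs on one log line follows. The struct, its field names, and the `none` placeholder are assumptions for illustration. The upstream ID is passed in as a plain value because the name of the response header is not specified here.

```rust
/// IDs attached to one outbound call. Field names are illustrative.
struct Correlation<'a> {
    portal_request_id: &'a str,
    /// Taken from the upstream response header when the call returned one.
    halo_request_id: Option<&'a str>,
}

/// Renders both IDs as `key=value` fields so either side can be joined on.
fn correlation_fields(c: &Correlation<'_>) -> String {
    match c.halo_request_id {
        Some(halo) => format!(
            "portal_request_id={} halo_request_id={}",
            c.portal_request_id, halo
        ),
        // Keep the field present even when the upstream ID is missing,
        // so the absence itself is visible in the logs.
        None => format!("portal_request_id={} halo_request_id=none", c.portal_request_id),
    }
}
```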

What NOT to log

Secrets (API keys, tokens, passwords) and PII in freeform text. Ever.
Whole payloads. Summarize — ID + status + one or two key fields.
Every retry in a tight loop. Aggregate — log the decision to retry with the attempt number, not every tick.
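One way to enforce the first rule is to redact by field name before anything reaches the logger. The sketch below assumes a deny-list of key names; the list contents and the function name `redact_fields` are hypothetical, and a real list would be project-specific.

```rust
/// Hypothetical deny-list of field names whose values must never be logged.
const SENSITIVE_KEYS: [&str; 4] = ["api_key", "token", "password", "authorization"];

/// Renders fields as `key=value`, replacing sensitive values with a marker.
fn redact_fields(fields: &[(&str, &str)]) -> String {
    fields
        .iter()
        .map(|(key, value)| {
            // Compare case-insensitively so `API_KEY` is caught as well.
            let lowered = key.to_ascii_lowercase();
            if SENSITIVE_KEYS.iter().any(|k| *k == lowered) {
                format!("{}=[REDACTED]", key)
            } else {
                format!("{}={}", key, value)
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

A deny-list only catches secrets stored under known field names; it does nothing for secrets or PII embedded in freeform text, which is why those should not be logged at all.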

Why we do this

Time-to-answer. Incident response time is dominated by "finding out what's wrong." Structured logs + audit tables + live UI state collapse that from hours to minutes.
Support quality. Techs answer "what's my customer seeing?" in one SQL query instead of reproducing the bug themselves.
Compliance. Auditable systems don't just claim to be audited — the audit data exists in tables and has always existed.
Self-debugging. A system whose operators can query its state recovers from weird situations faster than one that requires code changes to observe.

When this applies

Every DTC-built system. Observability is not "v2 when we have time" — it's day-one infrastructure.

When it does not apply

There are no exemptions for anything that ships to production. Experimental code that never runs outside a developer's laptop can skip it; nothing else can.

Related

Everything Has a Timestamp and a Deadline — timestamps are the foundation of queryable audit data
Every Integration Has a Resume Endpoint — SSE pairs with resume endpoints for recovery
Cache Architecture — job queue + webhook log tables as reference
