Observability Is a First-Class Citizen

When something breaks in production, the first question is always: "what happened?" The answer has to be available in under five minutes, without deploying code, without adding logging, without guessing.

The rule: every DTC-built system has structured logs with entity IDs, live state pushed to the UI where users can see it, and durable audit tables that humans can query in SQL.


Structured logs, not string logs

Logs carry structured fields, not formatted strings. Every log line that touches an entity carries that entity's ID:

// Good
tracing::info!(
    quote_id = 44,
    invoice_id = 115400,
    payment_intent = %pi.id,
    "quote approve async chain complete"
);

// Bad
tracing::info!("approved quote 44 with invoice 115400 and PI {}", pi.id);

The difference: the first is queryable. Grep across logs for quote_id=44 and you get every log line that touched that quote across the entire system, not just the ones that happened to format the ID the same way.
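To make the queryability point concrete, here is a minimal stdlib-only sketch. It is not how the tracing crate works internally; LogEvent, field, render, and with_field are hypothetical names invented for illustration. It only shows that a named field can be matched exactly, while a formatted string gives a query nothing to match on.

```rust
use std::collections::BTreeMap;

/// One log event: a message plus named fields (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct LogEvent {
    pub message: String,
    pub fields: BTreeMap<String, String>,
}

impl LogEvent {
    pub fn new(message: &str) -> Self {
        LogEvent {
            message: message.to_string(),
            fields: BTreeMap::new(),
        }
    }

    /// Attach a named field, e.g. quote_id = 44.
    pub fn field(mut self, key: &str, value: impl ToString) -> Self {
        self.fields.insert(key.to_string(), value.to_string());
        self
    }

    /// Render as key=value pairs (sorted by key) followed by the quoted message.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (k, v) in &self.fields {
            out.push_str(&format!("{}={} ", k, v));
        }
        out.push_str(&format!("msg={:?}", self.message));
        out
    }
}

/// Return every event whose field `key` equals `value` exactly.
pub fn with_field<'a>(events: &'a [LogEvent], key: &str, value: &str) -> Vec<&'a LogEvent> {
    events
        .iter()
        .filter(|e| e.fields.get(key).map(String::as_str) == Some(value))
        .collect()
}
```

Given one event with quote_id = 44, one with quote_id = 440, and one string-only event that merely mentions "quote 44" in its message, with_field(&events, "quote_id", "44") returns only the first. A substring search for "44" would match all three.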


Audit tables are queryable

Any operation whose outcome a human might need to investigate after the fact lives in a database table, not just logs.

  • Job queue (job_queue) — every async job, its status, attempts, last error, payload. SELECT * FROM job_queue WHERE status = 'pending' ORDER BY updated_at answers "what's stuck?"
  • Webhook log (webhook_logs) — every inbound webhook, raw payload, signature verification result, processing timestamp.
  • Audit markers (portal_welcome_sent, backfill checkpoints, etc.) — durable records of one-shot operations.

Logs are transient (rotated, sampled, sometimes lost). Tables are durable. Anything the business might want to audit or debug weeks later goes in a table.
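As a rough illustration of what a job_queue row might carry, the sketch below mirrors the description above in plain Rust. The field names other than status and updated_at, the JobStatus variants other than Pending, and the stuck_jobs helper are all assumptions made for this example; the real schema may differ.

```rust
/// Status values assumed for this sketch; the text above only shows 'pending'.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Pending,
    Running,
    Done,
    Failed,
}

/// In-memory mirror of one job_queue row (hypothetical column names).
#[derive(Debug, Clone, PartialEq)]
pub struct JobRow {
    pub id: u64,
    pub status: JobStatus,
    pub attempts: u32,
    pub last_error: Option<String>,
    pub payload: String,
    /// Seconds since the Unix epoch.
    pub updated_at: u64,
}

/// In-memory equivalent of:
///   SELECT * FROM job_queue WHERE status = 'pending' ORDER BY updated_at
/// The oldest pending job comes first.
pub fn stuck_jobs(rows: &[JobRow]) -> Vec<&JobRow> {
    let mut pending: Vec<&JobRow> = rows
        .iter()
        .filter(|r| r.status == JobStatus::Pending)
        .collect();
    pending.sort_by_key(|r| r.updated_at);
    pending
}
```

Sorting by updated_at puts the job that has waited longest at the top, which is usually the first one worth investigating.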


Live state pushed to the UI

When an async operation's state changes, users see it change without refreshing. The API streams Server-Sent Events (SSE), fed by Postgres LISTEN/NOTIFY, so any process can publish a change and every subscriber receives it.

  • Payment processing → "Recording payment" badge that updates to "Paid" when the webhook lands
  • Quote approval chain → inline spinner that becomes the Stripe form when the PI is minted
  • Background job completion → tracker footer updates from "Preparing..." to "Ready to Pay" with no refresh

The UX value is obvious. The debugging value is bigger — when support asks "what's the portal showing the customer right now?" the answer is what they see, and what they see is a live projection of current system state.
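The publish-to-every-subscriber shape can be sketched without a database or web server. In the example below, Hub is a hypothetical in-process stand-in for the LISTEN/NOTIFY fan-out, built on standard-library channels, and sse_frame produces the SSE wire format (an event line, one data line per line of payload, then a blank line). The event names and payloads are made up for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Format one Server-Sent Events frame. Each line of `data` gets its own
/// `data:` field, and a blank line terminates the event.
pub fn sse_frame(event: &str, data: &str) -> String {
    let mut frame = format!("event: {}\n", event);
    for line in data.split('\n') {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    frame.push('\n');
    frame
}

/// Stand-in for the LISTEN/NOTIFY fan-out: every subscriber gets every message.
#[derive(Default)]
pub struct Hub {
    subscribers: Vec<Sender<String>>,
}

impl Hub {
    /// Register a new subscriber, e.g. one open SSE connection.
    pub fn subscribe(&mut self) -> Receiver<String> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Publish a state change. Subscribers whose receiving end has been
    /// dropped are removed from the list.
    pub fn publish(&mut self, event: &str, data: &str) {
        let frame = sse_frame(event, data);
        self.subscribers.retain(|tx| tx.send(frame.clone()).is_ok());
    }
}
```

Two subscribers that both call hub.subscribe() before hub.publish("payment", "Paid") each receive the same frame, which is the property the portal relies on: what the customer sees and what support sees come from the same published state.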


Correlation IDs across systems

Every request carries a correlation ID that flows into every log line and every outbound API call. If a Halo API call fails, the log line has the portal request ID AND the Halo request ID (from their response header). Join the two and you can debug across system boundaries.
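A minimal sketch of carrying both IDs together, assuming made-up header names: X-Request-Id for the portal's own ID and X-Upstream-Request-Id for the ID the other system returns. The actual header Halo uses is not stated here, so treat both constants, the Correlation type, and its methods as placeholders.

```rust
/// Header names are assumptions for illustration, not a documented API.
pub const PORTAL_REQUEST_ID_HEADER: &str = "X-Request-Id";
pub const UPSTREAM_REQUEST_ID_HEADER: &str = "X-Upstream-Request-Id";

/// The IDs that let one request be followed across two systems.
#[derive(Debug, Clone, PartialEq)]
pub struct Correlation {
    pub portal_request_id: String,
    pub upstream_request_id: Option<String>,
}

impl Correlation {
    pub fn new(portal_request_id: &str) -> Self {
        Correlation {
            portal_request_id: portal_request_id.to_string(),
            upstream_request_id: None,
        }
    }

    /// Headers to attach to every outbound API call.
    pub fn outbound_headers(&self) -> Vec<(String, String)> {
        vec![(
            PORTAL_REQUEST_ID_HEADER.to_string(),
            self.portal_request_id.clone(),
        )]
    }

    /// Record the upstream system's request ID from its response headers.
    /// HTTP header names are case-insensitive, so compare accordingly.
    pub fn record_response(&mut self, headers: &[(&str, &str)]) {
        self.upstream_request_id = headers
            .iter()
            .find(|h| h.0.eq_ignore_ascii_case(UPSTREAM_REQUEST_ID_HEADER))
            .map(|h| h.1.to_string());
    }

    /// Fields to attach to every log line written for this request.
    pub fn log_fields(&self) -> String {
        match &self.upstream_request_id {
            Some(up) => format!(
                "portal_request_id={} upstream_request_id={}",
                self.portal_request_id, up
            ),
            None => format!("portal_request_id={}", self.portal_request_id),
        }
    }
}
```

Calling record_response on both success and failure paths matters most on failure, since that is when the upstream ID is needed to ask the other side what they saw.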


What NOT to log

  • Secrets and sensitive data (API keys, tokens, passwords, PII in freeform text). Ever.
  • Whole payloads. Summarize — ID + status + one or two key fields.
  • Every retry in a tight loop. Aggregate — log the decision to retry with the attempt number, not every tick.
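The rules above might translate into helpers like the following. WebhookEvent, its fields, summarize, and retry_decision are hypothetical names for this sketch; the idea is that the log line is built from an explicit allow-list of fields, so a secret or raw body cannot leak by accident, and that a retry produces one line per decision.

```rust
/// Hypothetical inbound event, including fields that must stay out of the logs.
pub struct WebhookEvent {
    pub id: String,
    pub status: String,
    pub amount_cents: u64,
    /// Secret: never logged.
    pub signing_secret: String,
    /// Whole payload: stored in the audit table, not in the logs.
    pub raw_body: String,
}

/// Summarize instead of dumping: ID + status + one key field.
pub fn summarize(event: &WebhookEvent) -> String {
    format!(
        "webhook_id={} status={} amount_cents={}",
        event.id, event.status, event.amount_cents
    )
}

/// One log line per retry decision, carrying the attempt number.
/// `error_kind` should be a short classification, not a raw error dump.
pub fn retry_decision(job_id: u64, attempt: u32, max_attempts: u32, error_kind: &str) -> String {
    let decision = if attempt < max_attempts { "retry" } else { "give_up" };
    format!(
        "job_id={} attempt={} max_attempts={} decision={} error={:?}",
        job_id, attempt, max_attempts, decision, error_kind
    )
}
```

Because summarize names each field it emits, adding a new sensitive field to WebhookEvent later does not change what reaches the logs.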

Why we do this

  • Time-to-answer. Incident response time is dominated by "finding out what's wrong." Structured logs + audit tables + live UI state collapse that from hours to minutes.
  • Support quality. Techs answer "what's my customer seeing?" in one SQL query instead of reproducing the bug themselves.
  • Compliance. Auditable systems don't just claim to have an audit trail; the audit data exists in tables and has existed since day one.
  • Self-debugging. A system whose operators can query its state recovers from weird situations faster than one that requires code changes to observe.

When this applies

Every DTC-built system. Observability is not "v2 when we have time" — it's day-one infrastructure.

When it does not apply

There is no exemption for anything that ships to production. Experimental code that never runs outside a developer's laptop can skip it; nothing else can.