Veeam Backup Daily Operations & Verification SOP

| Field | Details |
| --- | --- |
| Category | Backup & Disaster Recovery |
| Author | IT Support Engineering |
| Date | March 2026 |
| Version | 1.0 |
| Audience | T1 (primary), T2/T3 (escalation) |
| Platform | Veeam Backup & Replication 13.x (Enterprise Edition) |

1. Purpose

Backup failures that go unnoticed become disasters. This SOP establishes a tiered monitoring and response process so that backup issues are caught and addressed daily — not discovered when someone needs a restore.

What this document is: The daily operational playbook for monitoring, verifying, and responding to Veeam backup alerts. This is the T1-readable "what to do when a backup fails" procedure.

What this document is NOT: This is not the configuration reference (see Veeam B&R Standards) or the disaster recovery playbook (see DR Runbook).

The change this document introduces: Backup alert triage moves from T3-only NOC board ownership to a tiered T1→T2→T3 process. HALO PSA already creates tickets from backup alerts — this SOP defines who touches them first and what they do.

| Document | When to Reference |
| --- | --- |
| Veeam B&R Standards | How backup jobs are configured (schedules, retention, targets) |
| DR Runbook | When a backup is needed for actual recovery |
| Veeam Troubleshooting Runbook | When T1 basic fixes don't resolve the issue |
| Vendor Escalation Quick Reference | Veeam support: 614-339-8200 |

2. Quick Reference — Backup Failure Response

Print this page. Tape it to your monitor.

BACKUP ALERT ARRIVES IN HALO
│
├─ Is this a SINGLE job failure (first occurrence)?
│   ├─ YES → T1: Check job log for error. Retry manually. Document in ticket.
│   │         Monitor the next scheduled run.
│   │   ├─ Next run succeeds → Note in ticket. Close. Done.
│   │   └─ Next run also fails → Now it's RECURRING. See below.
│   │
│   └─ NO → Is this a RECURRING failure (3+ consecutive)?
│       ├─ YES → T1: Document pattern in ticket. Escalate to T2 immediately.
│       │         Do NOT keep retrying blindly.
│       │
│       └─ NO → Is this a TOTAL BACKUP OUTAGE (no jobs running for client)?
│           └─ YES → T1: Escalate to T2 + notify T3 lead immediately.
│                    This is a client without backup protection.
│
SPECIAL CASES (escalate immediately, do not attempt fix):
  🔴 "Backup chain is broken" → T2 (active full required)
  🔴 Repository full / critically low space → T2
  🔴 BDR appliance offline or unreachable → T2 + T3
  🔴 All jobs for a client failing → T2 + T3
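
The flowchart above can also be read as a small decision function. The following is a minimal sketch only, not part of any DTC tooling: the `Alert` class, its field names, and the returned strings are hypothetical, and the error phrases are the ones quoted in Sections 2 and 6 of this SOP.

```python
from dataclasses import dataclass

# Phrases this SOP associates with "escalate immediately" errors
# (Sections 2, 6.4, 6.5). Substring matching is an assumption.
ESCALATE_NOW_PHRASES = (
    "backup chain",
    "missing incremental",
    "full backup required",
    "insufficient disk space",
    "repository is full",
)


@dataclass
class Alert:
    """Hypothetical summary of one backup alert ticket."""
    error_message: str
    consecutive_failures: int
    all_client_jobs_failing: bool = False
    bdr_offline: bool = False


def triage(alert: Alert) -> str:
    """Return the first action the Quick Reference flowchart calls for."""
    text = alert.error_message.lower()
    # Special cases come first: never attempt a fix on these.
    if alert.bdr_offline or alert.all_client_jobs_failing:
        return "escalate to T2 + T3"
    if any(phrase in text for phrase in ESCALATE_NOW_PHRASES):
        return "escalate to T2"
    # Recurring failure: 3 or more consecutive failures.
    if alert.consecutive_failures >= 3:
        return "escalate to T2"
    # Single failure: T1 retries once and watches the next scheduled run.
    return "T1 retry and monitor"
```

For example, `triage(Alert("VSS snapshot timed out", 1))` falls through to the T1 retry path, while any alert with `bdr_offline=True` goes straight to T2 + T3 regardless of the failure count.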

3. DTC Backup Architecture (What You're Monitoring)

Before you can troubleshoot failures, you need to understand what's running. Every DTC-managed Veeam client follows this standard architecture:

| Component | Standard | What to Know |
| --- | --- | --- |
| Server Backups | Hyper-V host-level, every 1 hour, 14 restore points | RPO ≈ 1 hour. These are the most critical jobs. |
| Workstation Backups | Agent-based via protection group, daily at 1:00 AM, 14 days retention | "Backup once powered on" runs the job at next boot if the machine was off at 1:00 AM. |
| S3 Offsite Copy | Backup copy to S3 object storage | Offsite protection. If this fails, the local backup still exists but there is no offsite copy. |
| Cross-Site Replication | Hourly VM replication to opposing site, 6 restore points | Only on multi-site clients. Enables manual DR failover. |
| BDR Appliance | Equus hardware, Windows 11, Veeam B&R 13.x Enterprise | Named DTCBSURE-[SiteAbbrev]. This IS the backup server. |
| VSPC | vspc.dtctoday.com:1280 | Centralized monitoring dashboard for all client backup jobs. |

Job naming convention: <Site Abbrev> Hyper-V for server backups, Workstations for agent backups, <Source> > <Target> Replication for replication jobs. If a job doesn't follow this convention, it's either legacy or misconfigured — note it in the ticket.
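
As an illustration of the naming convention, the sketch below classifies a job name against the three patterns described above. It is a hedged example: the site abbreviation format (2 to 6 uppercase letters or digits) is an assumption, since this SOP does not define it, and `classify_job_name` is a hypothetical helper.

```python
import re
from typing import Optional

# ASSUMPTION: a site abbreviation is 2-6 uppercase letters or digits.
SITE = r"[A-Z0-9]{2,6}"

PATTERNS = {
    "server backup": re.compile(rf"{SITE} Hyper-V"),
    "workstation backup": re.compile(r"Workstations"),
    "replication": re.compile(r".+ > .+ Replication"),
}


def classify_job_name(name: str) -> Optional[str]:
    """Return the job type implied by the name, or None if it does not
    follow the convention (legacy or misconfigured; note it in the ticket)."""
    for job_type, pattern in PATTERNS.items():
        if pattern.fullmatch(name):
            return job_type
    return None
```

A name such as "NB Hyper-V" classifies as a server backup; a name that returns `None` is the case the paragraph above says to note in the ticket.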


4. Alert Sources & Ticket Flow

How Alerts Arrive

Backup failures generate HALO PSA tickets automatically. The flow:

Veeam job fails → Alert generated → HALO ticket created → Appears on NOC board

Ticket Ownership Model

| Stage | Owner | Action | Timeframe |
| --- | --- | --- | --- |
| Alert arrives in HALO | T1 | Triage: review error, classify severity, attempt first-pass fix | Within 2 hours of business day start |
| T1 fix resolves issue | T1 | Document resolution, monitor next run, close ticket if next run succeeds | Same day |
| T1 fix does not resolve | T1 → T2 | Escalate with documentation of what was tried | Within 4 hours |
| T2 investigation needed | T2 | Troubleshooting runbook, deeper analysis | Within 1 business day |
| T2 cannot resolve | T2 → T3 | Escalate with full diagnostic data | Within 2 business days |
| Infrastructure or architecture issue | T3 | Engineering resolution, potential Veeam support call | As needed |

HALO Ticket Standards for Backup Alerts

Every backup failure ticket must include:

| Field | What to Document |
| --- | --- |
| Client | Client name (should auto-populate from alert) |
| Job Name | Exact Veeam job name (e.g., "NB Hyper-V", "Workstations") |
| Job Type | Server backup / Workstation backup / S3 copy / Replication |
| Error Message | Copy the error from the Veeam job log: the actual text, not a paraphrase |
| Failure Count | First occurrence? Second? Fifth in a row? |
| Action Taken | What you did (retry, service restart, etc.) |
| Result | Did it fix it? If not, what happened? |
| Next Steps | Monitoring next run / Escalating to T2 / Awaiting vendor response |

5. Daily Verification Procedure (T1)

Morning Check — Every Business Day

This should take 15-20 minutes once you have the routine down. Do this within the first 2 hours of your shift.

Step 1: Check HALO for new backup alert tickets

Review any new tickets created overnight from backup alerts. For each ticket:

  1. Open the ticket and read the alert details
  2. Identify the client, job name, and error message
  3. Classify using the Quick Reference flowchart (Section 2)
  4. Take the appropriate action (retry, escalate, or investigate)
  5. Document your action in the ticket

Step 2: Quick VSPC dashboard review

Log into VSPC (vspc.dtctoday.com:1280) and scan the overview dashboard.

What "healthy" looks like:

  • All backup jobs show green (successful)
  • Last run timestamps are within expected schedule (servers: within last 1-2 hours, workstations: within last 24 hours)
  • No warning or error indicators

What needs attention:

  • Any job showing yellow (warning) — usually means partial success or a retry was needed
  • Any job showing red (error) — failed, needs investigation
  • Any job showing stale timestamp — job may not be running at all (worse than failing)
  • Any BDR showing offline — the entire backup server for that client is unreachable

Step 3: Act on findings

For each issue found in VSPC:

  1. Check HALO for an existing ticket (the alert may not have auto-created one)
  2. If no ticket exists, create one
  3. Follow the appropriate response procedure from Section 6

⚠️ A job that silently stopped running is MORE dangerous than a job that fails loudly. A failed job creates an alert. A job that stopped running might not. Watch for stale timestamps during the VSPC review.
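
The stale-timestamp check can be expressed as a simple comparison against the windows given in Step 2 (servers: 2 hours, workstations: 24 hours). This is a minimal sketch; the `is_stale` function and the `job_type` keys are hypothetical, and the last-run time is assumed to be read manually from the VSPC dashboard.

```python
from datetime import datetime, timedelta

# Maximum acceptable age of the last run, taken from Step 2 of this section.
MAX_AGE = {
    "server": timedelta(hours=2),
    "workstation": timedelta(hours=24),
}


def is_stale(job_type: str, last_run: datetime, now: datetime) -> bool:
    """True if the job has not run within its expected window."""
    return now - last_run > MAX_AGE[job_type]
```

A workstation whose last run was at 1:00 AM yesterday is stale by 9:00 AM today, because today's 1:00 AM run never happened.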


6. Common Backup Failures — T1 Response Procedures

6.1 Job Failed — Single Occurrence

What it looks like: One job failed once. Previous runs were successful.

T1 Action:

  1. Open the Veeam job log (from VSPC or the B&R console via remote access)
  2. Read the error — note the specific message
  3. Document the error in the HALO ticket
  4. Retry the job manually:
    • In Veeam console: right-click job → Start (or Retry if session is still active)
    • In VSPC: select the job → Start
  5. Monitor the retry
  6. If retry succeeds: note in ticket, monitor the next scheduled run. If the next scheduled run also succeeds, close the ticket.
  7. If retry fails: escalate to T2 with the error details from both the original failure and the retry

Common one-off causes (usually self-resolving):

  • Transient network blip during backup window
  • Target host was temporarily unresponsive (reboot, Windows Update)
  • VSS snapshot timed out during high I/O
  • S3 copy failed due to internet connectivity interruption

6.2 Job Failed — Recurring (3+ Consecutive)

What it looks like: Same job has failed 3 or more times in a row. Retries aren't fixing it.

T1 Action:

  1. Document the pattern: how many consecutive failures? Same error each time or different?
  2. Do NOT keep retrying blindly. Repeated retries on certain failures (like chain-related errors) can make things worse.
  3. Escalate to T2 immediately with:
    • Number of consecutive failures
    • Error messages from each failure (are they identical?)
    • When the last successful run was
    • Any recent changes to the environment (server updates, network changes, hardware work)

T2 will reference: Veeam Troubleshooting Runbook

6.3 Workstation Agent Offline / Not Reporting

What it looks like: Workstation backup job shows a machine as "offline" or "not available" in the protection group.

T1 Action:

  1. Check if the workstation is powered on (ping test, NinjaRMM status)
  2. If powered off: this is expected if the employee is out. Note in ticket. "Backup once powered on" will catch it when the machine boots.
  3. If powered on but agent offline:
    • Remote into the workstation
    • Check the Veeam Agent service: Get-Service VeeamEndpointBackupSvc
    • If stopped: Start-Service VeeamEndpointBackupSvc
    • If not installed: escalate to T2 (agent deployment issue — may be Intune-related)
  4. If workstation is unreachable on the network: escalate to T2 (network issue)

6.4 Repository Full / Low Space

What it looks like: Error referencing "insufficient disk space" or "repository is full." VSPC may show storage warnings.

T1 Action:

  1. Alert T2 immediately. This is a capacity issue that affects ALL jobs targeting that repository.
  2. Do NOT attempt to free space by deleting backup files — you can break backup chains.
  3. Document in the HALO ticket:
    • Which repository is affected
    • Current capacity (% used) if visible in VSPC
    • Which jobs target this repository
  4. Check: has the C:\Windows\Installer folder grown excessively? (Reference: HALO Ticket 1125653 — orphaned patch accumulation can consume 50+ GB on the BDR)

T2 will: Assess capacity, clean up old restore points if safe, or provision additional storage.

6.5 Backup Chain Broken

What it looks like: Error referencing "backup chain," "missing incremental," or "full backup required." The job may refuse to run an incremental.

T1 Action:

  1. Do NOT retry the job. A broken chain means incrementals cannot build on the existing data.
  2. Escalate to T2 immediately.
  3. Document: when was the last known successful backup? (This determines the RPO exposure.)

Why this is serious: Veeam incremental backups depend on every previous increment in the chain. If one is missing or corrupted, all subsequent incrementals are useless. T2 must run an active full backup to reset the chain. Until that completes, the client's RPO may be degraded.

T2 will: Run an active full (right-click job → Active Full), verify chain health, investigate why the chain broke.

6.6 VSS Writer Failure

What it looks like: Error referencing "VSS" (Volume Shadow Copy Service), a specific VSS writer name, or "snapshot creation failed."

T1 Action:

  1. Identify which VSS writer failed (the error message usually names it)
  2. Common writers and their associated services:

| VSS Writer | Associated Service | Restart Command |
| --- | --- | --- |
| SqlServerWriter | SQL Server VSS Writer (SQLWriter) | Restart-Service SQLWriter |
| Hyper-V Writer | Hyper-V Virtual Machine Management | Restart-Service vmms |
| NTDS Writer | Active Directory Domain Services | Restart-Service NTDS |
| IIS Writer | IIS Admin Service | Restart-Service IISADMIN |
| System Writer | Cryptographic Services | Restart-Service CryptSvc |

  3. Restart the associated service (during off-hours or with client awareness if it is a production service)
  4. Retry the backup job
  5. If retry succeeds: note in ticket, monitor next run
  6. If retry fails: check VSS writer status with vssadmin list writers and look for writers in "Failed" or "Waiting for completion" state
  7. If writers are stuck: escalate to T2
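
When the `vssadmin list writers` output is long, it can help to filter it for the two states named above. The sketch below parses text that has been copied from the command output; it is an illustration only. The sample text is trimmed and made up for the example (the numeric state codes in brackets are placeholders), and `unhealthy_writers` is a hypothetical helper.

```python
import re

# States this SOP tells T1 to look for.
FLAGGED_STATES = ("Failed", "Waiting for completion")

# Trimmed, illustrative sample in the general shape of `vssadmin list writers`.
SAMPLE_OUTPUT = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   State: [8] Failed
   Last error: Non-retryable error

Writer name: 'NTDS'
   State: [5] Waiting for completion
   Last error: No error
"""


def unhealthy_writers(output: str) -> list:
    """Return (writer name, state) pairs for writers in a flagged state."""
    results = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        name = re.match(r"Writer name: '(.+)'", line)
        if name:
            current = name.group(1)
            continue
        state = re.match(r"State: \[\d+\] (.+)", line)
        if state and current is not None:
            if state.group(1).startswith(FLAGGED_STATES):
                results.append((current, state.group(1)))
    return results
```

Paste the flagged writer names and states into the HALO ticket before escalating, so T2 does not have to rerun the command.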

6.7 Job Running Longer Than Expected

What it looks like: Backup job has been running for much longer than usual. Hourly server backup taking 3+ hours. Daily workstation backup still running at noon.

T1 Action:

  1. Check the job progress in Veeam console — is it actively processing or stuck?
  2. If actively processing (data is transferring but slowly):
    • Check for bandwidth contention: is Veeam competing with a large file transfer, Windows Update, or another backup job?
    • Check if this is the first run after a chain reset (active fulls take much longer than incrementals)
    • Note in ticket and monitor. If it completes, it's likely a one-off.
  3. If stuck (no progress for 30+ minutes):
    • Escalate to T2
    • Do NOT kill the job unless directed by T2/T3 — killing a job mid-write can corrupt the backup file
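
The "no progress for 30+ minutes" rule can be made concrete with a few progress readings noted from the console. This is a rough sketch under stated assumptions: the readings are (time, bytes processed) pairs recorded by hand, the byte counter never decreases, and `is_stuck` is a hypothetical helper rather than a Veeam feature.

```python
from datetime import datetime, timedelta


def is_stuck(samples, now, window=timedelta(minutes=30)):
    """samples: list of (timestamp, bytes_processed), oldest first.

    The job counts as stuck when the processed-bytes figure has not
    increased for at least `window`.
    """
    if not samples:
        return False
    latest_bytes = samples[-1][1]
    # Earliest reading at which the counter had already reached its
    # current value; progress has been flat since then.
    flat_since = next(t for t, b in samples if b >= latest_bytes)
    return now - flat_since >= window
```

If the function returns `True`, the SOP's guidance still applies: escalate to T2 and leave the job running.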

7. Backup Chain Health

Understanding backup chains is essential for knowing when a failure is "retry and move on" vs. "this is a real problem."

How DTC Backup Chains Work

Server backups (hourly, 14 restore points):

  • Veeam creates a full backup (.vbk) as the base
  • Every subsequent run creates an incremental (.vib) containing only changed data
  • After 14 restore points, each new run merges the oldest incremental into the full (forever forward incremental)
  • The chain at retention: [Full] → [Inc1] → [Inc2] → ... → [Inc13] (1 full + 13 incrementals = 14 restore points)

Workstation backups (daily, 14 days):

  • Same chain concept but on a daily cycle

Why Chain Health Matters

To restore from any point, Veeam needs the full backup PLUS every incremental up to that point. If any file in the chain is missing or corrupted, everything after it is unrecoverable.

HEALTHY CHAIN:
  [Full] → [Inc1] → [Inc2] → [Inc3] → [Inc4]
  ✅ Can restore to any point

BROKEN CHAIN (Inc2 corrupted):
  [Full] → [Inc1] → [❌ Inc2] → [Inc3] → [Inc4]
  ✅ Can restore to Inc1
  ❌ Cannot restore to Inc2, Inc3, or Inc4
  🔴 ACTIVE FULL REQUIRED to reset the chain
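
The dependency rule in the diagram can be sketched in a few lines: walk the chain from the full backup and stop at the first damaged file. This is an illustrative model only; the file labels and the `restorable_points` function are hypothetical and do not inspect real .vbk or .vib files.

```python
def restorable_points(chain, corrupted):
    """Return the restore points that can still be used.

    chain: ordered labels, full backup first, e.g. ["Full", "Inc1", ...]
    corrupted: set of labels that are missing or damaged
    """
    usable = []
    for point in chain:
        if point in corrupted:
            # Everything after the first damaged file depends on it.
            break
        usable.append(point)
    return usable
```

With the broken chain shown above, `restorable_points(["Full", "Inc1", "Inc2", "Inc3", "Inc4"], {"Inc2"})` leaves only the full backup and Inc1 usable, which is why an active full is required.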

How to Verify Chain Health

In the Veeam B&R console:

  1. Navigate to Home → Backups → Disk
  2. Expand the backup job
  3. Look at the restore points listed:
    • Each should show a date/time and status
    • No gaps in the sequence = healthy
    • Missing dates or error indicators = investigate

When an Active Full Is Needed

An active full creates a new base .vbk and resets the incremental chain. It's needed when:

  • Backup chain is broken or corrupted
  • Incremental data is inconsistent
  • After certain infrastructure changes (storage migration, VM conversion)

How to trigger: Right-click the job → Active Full

Be aware: An active full backs up ALL data, not just changes. It will take significantly longer and use more repository space than a normal incremental. On a large server, this could take hours. Don't trigger one during business hours without T2/T3 awareness.


8. Escalation Matrix

| Situation | First Owner | First Action | Escalate To | Escalate When |
| --- | --- | --- | --- | --- |
| Single job failure (first time) | T1 | Retry, document, monitor | T2 | 3+ consecutive failures or unusual error |
| Recurring failure (3+ in a row) | T1 → T2 | Document pattern, escalate | T2 | Immediately after identifying the pattern |
| Workstation agent offline | T1 | Check power, network, service | T2 | Agent not installed or unreachable after basic checks |
| Repository space warning | T1 | Alert immediately | T2 | Any space warning; do not wait |
| Repository full (jobs failing) | T1 → T2 | Escalate immediately | T2 + T3 | Immediately: client has no backup protection |
| Backup chain broken | T1 → T2 | Do not retry, escalate | T2 | Immediately: active full required |
| VSS writer failure (after restart) | T1 → T2 | Restart service, retry once | T2 | If retry fails after service restart |
| S3 offsite copy failure | T1 | Retry, check internet | T2 | 3+ consecutive failures |
| Replication failure (cross-site) | T2 | Check WAN, target host health | T3 | Cross-site connectivity or infrastructure issue |
| BDR appliance offline | T1 → T2 + T3 | Verify via NinjaRMM, escalate | T2 + T3 | Immediately: entire client backup infrastructure is down |
| Total backup outage (all jobs failing) | T1 → T2 + T3 | Escalate immediately | T2 + T3 + Management | Immediately: client is unprotected |
| Veeam software error beyond troubleshooting | T3 | Veeam support case | Veeam (614-339-8200) | After exhausting Troubleshooting Runbook |

🔴 Any situation where a client has NO working backup is a Priority 1 event. A dental practice without backup is one hardware failure away from catastrophic data loss. Treat it with the urgency it deserves.


9. Weekly Backup Review (T2)

In addition to daily T1 monitoring, T2 conducts a comprehensive weekly review every Monday.

Weekly Review Checklist

| Check | What to Look For | Action If Found |
| --- | --- | --- |
| All server backup jobs green | Any client with failed or warning server backups over the past 7 days | Investigate recurring patterns, open ticket if not already tracked |
| All workstation backup jobs running | Workstations that haven't backed up in 7+ days | Check if machine is decommissioned, powered off long-term, or agent failed |
| S3 offsite copies current | Any S3 copy jobs that haven't completed in 7+ days | Verify internet connectivity, S3 bucket health, copy job configuration |
| Replication jobs current (multi-site clients) | Replication jobs behind by 24+ hours | Check WAN health, target host capacity |
| Repository capacity | Any repository above 80% utilization | Flag for capacity planning; alert the Account Manager (AM) if storage expansion is needed |
| BDR appliance health | Windows 11 patches current, disk health, Veeam service running | Coordinate with NinjaRMM monitoring data |
| VSPC dashboard clean | Any persistent warnings or errors across the fleet | Aggregate trends for T3 review |

Capacity Planning Trigger

If any BDR repository consistently runs above 80% capacity, alert the Account Manager. Storage expansion or retention policy adjustment may be needed. Do not wait until it hits 90% — at that point, backup jobs will start failing.
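
The two thresholds above (80% and 90%) can be written as a small check. This is a sketch with assumptions: it evaluates a single reading rather than a sustained trend, the returned labels are hypothetical, and the used and total figures are assumed to come from whatever VSPC or the B&R console displays.

```python
def capacity_status(used_bytes: int, total_bytes: int) -> str:
    """Classify one repository capacity reading against the SOP thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    if used_bytes * 100 >= 90 * total_bytes:
        return "critical"  # at or past 90%: jobs are expected to start failing
    if used_bytes * 100 > 80 * total_bytes:
        return "alert account manager"  # above 80%: capacity planning trigger
    return "ok"
```

Because the SOP's trigger is "consistently" above 80%, a single reading in the middle band is a prompt to check previous weeks, not an automatic alert.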

Trend Identification

Watch for these patterns during weekly review:

  • Backup sizes growing rapidly: New dental software installed? Large imaging database growth? May need retention adjustment.
  • Same error across multiple clients: Could indicate a Veeam update issue, VSPC problem, or infrastructure pattern.
  • Workstation backup completion rate dropping: Are new workstations being added to the protection group? Agent deployment issues?

10. Reporting to HALO — Documentation Standards

Minimum Documentation for Every Backup Ticket

When opening/updating a ticket:

Client: [Client Name]
Job: [Exact Veeam job name]
Job Type: [Server Backup / Workstation Backup / S3 Copy / Replication]
Error: [Exact error message from Veeam job log]
Consecutive Failures: [Count]
Last Successful Run: [Date/Time]
---
Action Taken: [What you did]
Result: [What happened]
Next Steps: [Monitoring / Escalating / Resolved]
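
For anyone who prefers to generate the note rather than type it, the template above maps directly onto a small data structure. This is a hypothetical sketch: `BackupTicketNote` and its field names are invented for the example, and nothing here talks to HALO.

```python
from dataclasses import dataclass


@dataclass
class BackupTicketNote:
    """Hypothetical container mirroring the ticket template fields."""
    client: str
    job: str
    job_type: str
    error: str
    consecutive_failures: int
    last_successful_run: str
    action_taken: str
    result: str
    next_steps: str

    def render(self) -> str:
        """Return the note text in the template's field order."""
        return "\n".join([
            f"Client: {self.client}",
            f"Job: {self.job}",
            f"Job Type: {self.job_type}",
            f"Error: {self.error}",
            f"Consecutive Failures: {self.consecutive_failures}",
            f"Last Successful Run: {self.last_successful_run}",
            "---",
            f"Action Taken: {self.action_taken}",
            f"Result: {self.result}",
            f"Next Steps: {self.next_steps}",
        ])
```

Since every field is required, leaving one out raises an error at construction time, which mirrors the rule that no part of the template is optional.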

Ticket Lifecycle

| Event | Ticket Action |
| --- | --- |
| Alert arrives from backup failure | Ticket auto-created. T1 triages within 2 hours. |
| T1 retries and job succeeds | Update ticket with resolution. Leave open and monitor the next scheduled run. |
| Next scheduled run succeeds | Update ticket. Close. |
| T1 retry fails or issue is recurring | Update ticket with details. Escalate to T2. |
| T2 resolves | Update ticket with root cause and fix. Close after next successful run. |
| T2 escalates to T3 or Veeam | Update ticket. Note escalation and what was exhausted. |
| Issue resolved after escalation | Update ticket with full root cause analysis. Close after verification. |

When to Create a New Ticket vs. Update Existing

  • Same job, same error, within 7 days: Update the existing ticket
  • Same job, different error: New ticket (different root cause likely)
  • Same client, different job: New ticket
  • Recurring pattern over 30+ days: Consider a problem ticket for engineering review
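
The first three rules above reduce to one test: update the existing ticket only when client, job, and error all match and the ticket is no more than 7 days old. The sketch below encodes that reading. Note the assumption: this SOP does not say what to do for the same job and error after 7 days, so the example treats that case as a new ticket. `TicketRef` and `should_update_existing` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TicketRef:
    """Hypothetical summary of an existing backup ticket."""
    client: str
    job: str
    error: str
    opened: datetime


def should_update_existing(existing: TicketRef, client: str, job: str,
                           error: str, now: datetime) -> bool:
    """True: update the existing ticket. False: open a new one."""
    same_job = existing.client == client and existing.job == job
    same_error = existing.error == error
    # ASSUMPTION: beyond the 7-day window, a new ticket is opened.
    recent = now - existing.opened <= timedelta(days=7)
    return same_job and same_error and recent
```

The 30-day problem-ticket rule is a judgment call for engineering review and is deliberately left out of the function.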

11. DTC Backup Standards Quick Reference

For quick reference during troubleshooting. Full details in Veeam B&R Standards (page 1004).

| Setting | Server Backup | Workstation Backup | Replication |
| --- | --- | --- | --- |
| Job Type | Hyper-V Backup | Windows Agent Policy | Hyper-V Replication |
| Source | Hyper-V host (all VMs) | Protection group "Workstations" | Individual server VM |
| Target | Local BDR repo + S3 | Local BDR repo | Opposing site Hyper-V host |
| Schedule | Every 1 hour | Daily at 1:00 AM | Every 1 hour |
| Retention | 14 restore points | 14 days | 6 restore points |
| Retry | 3x / 10 min intervals | N/A | N/A |
| Expected RPO | ~1 hour | ~24 hours | ~1 hour |

12. Document Control

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | March 2026 | IT Support Engineering | Initial release. Establishes tiered T1→T2→T3 backup monitoring process, daily verification procedures, common failure response guides, escalation matrix, weekly T2 review cadence, HALO documentation standards. |

Confidential — Internal Use Only