Veeam Backup Daily Operations & Verification SOP
| Field | Details |
|---|---|
| Category | Backup & Disaster Recovery |
| Author | IT Support Engineering |
| Date | March 2026 |
| Version | 1.0 |
| Audience | T1 (primary), T2/T3 (escalation) |
| Platform | Veeam Backup & Replication 13.x (Enterprise Edition) |
1. Purpose
Backup failures that go unnoticed become disasters. This SOP establishes a tiered monitoring and response process so that backup issues are caught and addressed daily — not discovered when someone needs a restore.
What this document is: The daily operational playbook for monitoring, verifying, and responding to Veeam backup alerts. This is the T1-readable "what to do when a backup fails" procedure.
What this document is NOT: This is not the configuration reference (see Veeam B&R Standards) or the disaster recovery playbook (see DR Runbook).
The change this document introduces: Backup alert triage moves from T3-only NOC board ownership to a tiered T1→T2→T3 process. HALO PSA already creates tickets from backup alerts — this SOP defines who touches them first and what they do.
Related Documents
| Document | When to Reference |
|---|---|
| Veeam B&R Standards | How backup jobs are configured (schedules, retention, targets) |
| DR Runbook | When a backup is needed for actual recovery |
| Veeam Troubleshooting Runbook | When T1 basic fixes don't resolve the issue |
| Vendor Escalation Quick Reference | Veeam support: 614-339-8200 |
2. Quick Reference — Backup Failure Response
Print this page. Tape it to your monitor.
BACKUP ALERT ARRIVES IN HALO
│
├─ Is this a SINGLE job failure (first occurrence)?
│   ├─ YES → T1: Check job log for error. Retry manually. Document in ticket.
│   │         Monitor the next scheduled run.
│   │   ├─ Next run succeeds → Note in ticket. Close. Done.
│   │   └─ Next run also fails → Now it's RECURRING. See below.
│   │
│   └─ NO → Is this a RECURRING failure (3+ consecutive)?
│       ├─ YES → T1: Document pattern in ticket. Escalate to T2 immediately.
│       │         Do NOT keep retrying blindly.
│       │
│       └─ NO → Is this a TOTAL BACKUP OUTAGE (no jobs running for client)?
│           └─ YES → T1: Escalate to T2 + notify T3 lead immediately.
│                     This is a client without backup protection.
│
SPECIAL CASES (escalate immediately, do not attempt fix):
🔴 "Backup chain is broken" → T2 (active full required)
🔴 Repository full / critically low space → T2
🔴 BDR appliance offline or unreachable → T2 + T3
🔴 All jobs for a client failing → T2 + T3
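The flowchart and special cases above can be sketched as a small decision function. This is a rough illustration only, not a HALO or Veeam integration: the function name, its parameters, and the error-text hints are assumptions, and the sketch uses the 3+ consecutive threshold for "recurring".

```python
# Error-text hints for the "escalate immediately" special cases.
# These substrings are illustrative assumptions, not exact Veeam messages.
ESCALATE_NOW_HINTS = ("backup chain", "repository is full", "insufficient disk space")


def triage(consecutive_failures, total_outage=False, bdr_offline=False, error_text=""):
    """Return the first T1 action for an alert, mirroring the flowchart."""
    # A client with no working backups, or an unreachable BDR, goes to T2 + T3.
    if total_outage or bdr_offline:
        return "Escalate to T2 + T3 immediately"
    # Special cases: T1 does not attempt a fix.
    if any(hint in error_text.lower() for hint in ESCALATE_NOW_HINTS):
        return "Escalate to T2 immediately, do not attempt a fix"
    # Recurring failure: stop retrying and hand over the documented pattern.
    if consecutive_failures >= 3:
        return "Document pattern, escalate to T2, stop retrying"
    # Otherwise treat it as a single failure.
    return "Check log, retry manually, document, monitor next run"


print(triage(1))
print(triage(4))
print(triage(1, error_text="Backup chain is broken"))
```

The order of the checks matters: outage and special-case conditions are tested before the failure count, so a first-time "chain is broken" error is never retried.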
3. DTC Backup Architecture (What You're Monitoring)
Before you can troubleshoot failures, you need to understand what's running. Every DTC-managed Veeam client follows this standard architecture:
| Component | Standard | What to Know |
|---|---|---|
| Server Backups | Hyper-V host-level, every 1 hour, 14 restore points | RPO ≈ 1 hour. These are the most critical jobs. |
| Workstation Backups | Agent-based via protection group, daily at 1:00 AM, 14 days retention | If the machine was off at 1:00 AM, "Backup once powered on" runs the job the next time it boots. |
| S3 Offsite Copy | Backup copy to S3 object storage | Offsite protection. If this fails, local backup still exists but no offsite. |
| Cross-Site Replication | Hourly VM replication to opposing site, 6 restore points | Only on multi-site clients. Enables manual DR failover. |
| BDR Appliance | Equus hardware, Windows 11, Veeam B&R 13.x Enterprise | Named DTCBSURE-[SiteAbbrev]. This IS the backup server. |
| VSPC | vspc.dtctoday.com:1280 | Centralized monitoring dashboard for all client backup jobs. |
Job naming convention: <Site Abbrev> Hyper-V for server backups, Workstations for agent backups, <Source> > <Target> Replication for replication jobs. If a job doesn't follow this convention, it's either legacy or misconfigured — note it in the ticket.
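A quick way to spot legacy or misconfigured job names is to test them against the convention. The sketch below is illustrative: the shape of a site abbreviation (2 to 6 uppercase letters or digits) is an assumption, as are the function and pattern names.

```python
import re
from typing import Optional

# Assumed shape of a site abbreviation; adjust to the real list of sites.
SITE = r"[A-Z0-9]{2,6}"

# One pattern per job type in the naming convention.
PATTERNS = {
    "server backup": re.compile(rf"^{SITE} Hyper-V$"),
    "workstation backup": re.compile(r"^Workstations$"),
    "replication": re.compile(r"^.+ > .+ Replication$"),
}


def classify_job_name(name: str) -> Optional[str]:
    """Return the job type a name matches, or None if it breaks convention."""
    for job_type, pattern in PATTERNS.items():
        if pattern.match(name):
            return job_type
    return None


for job in ["NB Hyper-V", "Workstations", "NB > SB Replication", "Nightly Backup"]:
    print(job, "->", classify_job_name(job) or "legacy or misconfigured, note in ticket")
```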
4. Alert Sources & Ticket Flow
How Alerts Arrive
Backup failures generate HALO PSA tickets automatically. The flow:
Veeam job fails → Alert generated → HALO ticket created → Appears on NOC board
Ticket Ownership Model
| Stage | Owner | Action | Timeframe |
|---|---|---|---|
| Alert arrives in HALO | T1 | Triage: review error, classify severity, attempt first-pass fix | Within 2 hours of business day start |
| T1 fix resolves issue | T1 | Document resolution, monitor next run, close ticket if next run succeeds | Same day |
| T1 fix does not resolve | T1 → T2 | Escalate with documentation of what was tried | Within 4 hours |
| T2 investigation needed | T2 | Troubleshooting runbook, deeper analysis | Within 1 business day |
| T2 cannot resolve | T2 → T3 | Escalate with full diagnostic data | Within 2 business days |
| Infrastructure or architecture issue | T3 | Engineering resolution, potential Veeam support call | As needed |
HALO Ticket Standards for Backup Alerts
Every backup failure ticket must include:
| Field | What to Document |
|---|---|
| Client | Client name (should auto-populate from alert) |
| Job Name | Exact Veeam job name (e.g., "NB Hyper-V", "Workstations") |
| Job Type | Server backup / Workstation backup / S3 copy / Replication |
| Error Message | Copy the error from the Veeam job log — not a paraphrase, the actual error |
| Failure Count | First occurrence? Second? Fifth in a row? |
| Action Taken | What you did (retry, service restart, etc.) |
| Result | Did it fix it? If not, what happened? |
| Next Steps | Monitoring next run / Escalating to T2 / Awaiting vendor response |
5. Daily Verification Procedure (T1)
Morning Check — Every Business Day
This should take 15-20 minutes once you have the routine down. Do this within the first 2 hours of your shift.
Step 1: Check HALO for new backup alert tickets
Review any new tickets created overnight from backup alerts. For each ticket:
- Open the ticket and read the alert details
- Identify the client, job name, and error message
- Classify using the Quick Reference flowchart (Section 2)
- Take the appropriate action (retry, escalate, or investigate)
- Document your action in the ticket
Step 2: Quick VSPC dashboard review
Log into VSPC (vspc.dtctoday.com:1280) and scan the overview dashboard.
What "healthy" looks like:
- All backup jobs show green (successful)
- Last run timestamps are within expected schedule (servers: within last 1-2 hours, workstations: within last 24 hours)
- No warning or error indicators
What needs attention:
- Any job showing yellow (warning) — usually means partial success or a retry was needed
- Any job showing red (error) — failed, needs investigation
- Any job showing stale timestamp — job may not be running at all (worse than failing)
- Any BDR showing offline — the entire backup server for that client is unreachable
Step 3: Act on findings
For each issue found in VSPC:
- Check whether a HALO ticket already exists (the alert may not have auto-created one)
- If no ticket exists, create one
- Follow the appropriate response procedure from Section 6
⚠️ A job that silently stopped running is MORE dangerous than a job that fails loudly. A failed job creates an alert. A job that stopped running might not. Watch for stale timestamps during the VSPC review.
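The stale-timestamp check can be expressed as a simple age comparison. This is a minimal sketch using the "healthy" limits from Step 2 (servers within 2 hours, workstations within 24 hours); the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Maximum acceptable age of the last run, from the "healthy" criteria in Step 2.
MAX_AGE = {"server": timedelta(hours=2), "workstation": timedelta(hours=24)}


def is_stale(job_type, last_run, now):
    """True when the last run is older than the schedule allows."""
    return now - last_run > MAX_AGE[job_type]


now = datetime(2026, 3, 2, 9, 0)
print(is_stale("server", datetime(2026, 3, 2, 8, 15), now))      # False
print(is_stale("server", datetime(2026, 3, 1, 23, 0), now))      # True
print(is_stale("workstation", datetime(2026, 3, 1, 1, 0), now))  # True
```

A stale result with no matching HALO ticket is the case to watch for: the job stopped running without raising an alert.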
6. Common Backup Failures — T1 Response Procedures
6.1 Job Failed — Single Occurrence
What it looks like: One job failed once. Previous runs were successful.
T1 Action:
- Open the Veeam job log (from VSPC or the B&R console via remote access)
- Read the error — note the specific message
- Document the error in the HALO ticket
- Retry the job manually:
- In Veeam console: right-click job → Start (or Retry if session is still active)
- In VSPC: select the job → Start
- Monitor the retry
- If retry succeeds: note in ticket, monitor the next scheduled run. If the next scheduled run also succeeds, close the ticket.
- If retry fails: escalate to T2 with the error details from both the original failure and the retry
Common one-off causes (usually self-resolving):
- Transient network blip during backup window
- Target host was temporarily unresponsive (reboot, Windows Update)
- VSS snapshot timed out during high I/O
- S3 copy failed due to internet connectivity interruption
6.2 Job Failed — Recurring (3+ Consecutive)
What it looks like: Same job has failed 3 or more times in a row. Retries aren't fixing it.
T1 Action:
- Document the pattern: how many consecutive failures? Same error each time or different?
- Do NOT keep retrying blindly. Repeated retries on certain failures (like chain-related errors) can make things worse.
- Escalate to T2 immediately with:
- Number of consecutive failures
- Error messages from each failure (are they identical?)
- When the last successful run was
- Any recent changes to the environment (server updates, network changes, hardware work)
T2 will reference: Veeam Troubleshooting Runbook
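When documenting the pattern for T2, two facts matter most: how long the current failure streak is and whether the error is the same each time. The sketch below assumes the run history has been copied out by hand as (status, error) pairs, oldest first; the status strings and the function name are illustrative, not a Veeam export format.

```python
def failure_streak(results):
    """Return (consecutive_failures, same_error) for the most recent runs.

    results: list of (status, error) tuples, oldest first.
    """
    errors = []
    # Walk backwards from the newest run until the first non-failure.
    for status, error in reversed(results):
        if status != "Failed":
            break
        errors.append(error)
    # same_error is True when every failure in the streak has one message.
    return len(errors), len(set(errors)) <= 1


history = [
    ("Success", None),
    ("Failed", "VSS snapshot timed out"),
    ("Failed", "VSS snapshot timed out"),
    ("Failed", "VSS snapshot timed out"),
]
print(failure_streak(history))  # (3, True)
```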
6.3 Workstation Agent Offline / Not Reporting
What it looks like: Workstation backup job shows a machine as "offline" or "not available" in the protection group.
T1 Action:
- Check if the workstation is powered on (ping test, NinjaRMM status)
- If powered off: this is expected if the employee is out. Note in ticket. "Backup once powered on" will catch it when the machine boots.
- If powered on but agent offline:
- Remote into the workstation
- Check the Veeam Agent service: Get-Service VeeamEndpointBackupSvc
- If stopped: Start-Service VeeamEndpointBackupSvc
- If not installed: escalate to T2 (agent deployment issue, may be Intune-related)
- If workstation is unreachable on the network: escalate to T2 (network issue)
6.4 Repository Full / Low Space
What it looks like: Error referencing "insufficient disk space" or "repository is full." VSPC may show storage warnings.
T1 Action:
- Alert T2 immediately. This is a capacity issue that affects ALL jobs targeting that repository.
- Do NOT attempt to free space by deleting backup files — you can break backup chains.
- Document in the HALO ticket:
- Which repository is affected
- Current capacity (% used) if visible in VSPC
- Which jobs target this repository
- Check: has the C:\Windows\Installer folder grown excessively? (Reference: HALO Ticket 1125653 — orphaned patch accumulation can consume 50+ GB on the BDR)
T2 will: Assess capacity, clean up old restore points if safe, or provision additional storage.
6.5 Backup Chain Broken
What it looks like: Error referencing "backup chain," "missing incremental," or "full backup required." The job may refuse to run an incremental.
T1 Action:
- Do NOT retry the job. A broken chain means incrementals cannot build on the existing data.
- Escalate to T2 immediately.
- Document: when was the last known successful backup? (This determines the RPO exposure.)
Why this is serious: Veeam incremental backups depend on every previous increment in the chain. If one is missing or corrupted, all subsequent incrementals are useless. T2 must run an active full backup to reset the chain. Until that completes, the client's RPO may be degraded.
T2 will: Run an active full (right-click job → Active Full), verify chain health, investigate why the chain broke.
6.6 VSS Writer Failure
What it looks like: Error referencing "VSS" (Volume Shadow Copy Service), a specific VSS writer name, or "snapshot creation failed."
T1 Action:
- Identify which VSS writer failed (the error message usually names it)
- Common writers and their associated services:
| VSS Writer | Associated Service | Restart Command |
|---|---|---|
| SqlServerWriter | SQL Server (MSSQLSERVER) | Restart-Service MSSQLSERVER |
| Hyper-V Writer | Hyper-V Virtual Machine Management | Restart-Service vmms |
| NTDS Writer | Active Directory Domain Services | Restart-Service NTDS |
| IIS Writer | IIS Admin Service | Restart-Service IISADMIN |
| System Writer | Cryptographic Services | Restart-Service CryptSvc |
- Restart the associated service (during off-hours or with client awareness if it's a production service like SQL)
- Retry the backup job
- If retry succeeds: note in ticket, monitor next run
- If retry fails: check VSS writer status with vssadmin list writers and look for writers in "Failed" or "Waiting for completion" state
- If writers are stuck: escalate to T2
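On a server with many writers, the vssadmin output is long. The sketch below filters pasted output down to writers that are not "Stable". It assumes the English-language output layout (a "Writer name:" line followed by a "State:" line); the sample text and function name are illustrative.

```python
import re

SAMPLE = """\
Writer name: 'System Writer'
   Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   Writer Id: {a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}
   State: [8] Failed
   Last error: Non-retryable error
"""


def unstable_writers(vssadmin_output):
    """Return (writer_name, state) for every writer whose state is not Stable."""
    found = []
    name = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        name_match = re.match(r"Writer name: '(.+)'", line)
        if name_match:
            name = name_match.group(1)
            continue
        state_match = re.match(r"State: \[\d+\] (.+)", line)
        if state_match and name is not None:
            state = state_match.group(1).strip()
            if state != "Stable":
                found.append((name, state))
            name = None  # wait for the next "Writer name:" line
    return found


print(unstable_writers(SAMPLE))  # [('SqlServerWriter', 'Failed')]
```

Paste the writer names and states from the result into the HALO ticket before escalating.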
6.7 Job Running Longer Than Expected
What it looks like: Backup job has been running for much longer than usual. Hourly server backup taking 3+ hours. Daily workstation backup still running at noon.
T1 Action:
- Check the job progress in Veeam console — is it actively processing or stuck?
- If actively processing (data is transferring but slowly):
- Check for bandwidth contention: is Veeam competing with a large file transfer, Windows Update, or another backup job?
- Check if this is the first run after a chain reset (active fulls take much longer than incrementals)
- Note in ticket and monitor. If it completes, it's likely a one-off.
- If stuck (no progress for 30+ minutes):
- Escalate to T2
- Do NOT kill the job unless directed by T2/T3 — killing a job mid-write can corrupt the backup file
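The "slow versus stuck" distinction comes down to whether processed data has increased in the last 30 minutes. A minimal sketch, assuming progress readings are noted by hand from the console as (minute, processed_gb) pairs; the function and parameter names are illustrative.

```python
def is_stuck(samples, window_minutes=30):
    """True when the processed amount has not changed for window_minutes or more.

    samples: list of (minute, processed_gb) readings, oldest first.
    """
    if not samples:
        return False
    last_minute, last_value = samples[-1]
    stalled_since = last_minute
    # Walk back while the reading equals the most recent one.
    for minute, value in reversed(samples):
        if value != last_value:
            break
        stalled_since = minute
    return last_minute - stalled_since >= window_minutes


print(is_stuck([(0, 10), (10, 20), (20, 20), (50, 20)]))  # True: no change for 40 minutes
print(is_stuck([(0, 10), (30, 20), (40, 30)]))            # False: still moving
```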
7. Backup Chain Health
Understanding backup chains is essential for knowing when a failure is "retry and move on" vs. "this is a real problem."
How DTC Backup Chains Work
Server backups (hourly, 14 restore points):
- Veeam creates a full backup (.vbk) as the base
- Every subsequent run creates an incremental (.vib) containing only changed data
- Once the chain exceeds 14 restore points, the oldest incremental merges into the full (forever forward incremental)
- The chain (14 restore points in total):
[Full] → [Inc1] → [Inc2] → ... → [Inc13]
Workstation backups (daily, 14 days):
- Same chain concept but on a daily cycle
Why Chain Health Matters
To restore from any point, Veeam needs the full backup PLUS every incremental up to that point. If any file in the chain is missing or corrupted, everything after it is unrecoverable.
HEALTHY CHAIN:
[Full] → [Inc1] → [Inc2] → [Inc3] → [Inc4]
✅ Can restore to any point
BROKEN CHAIN (Inc2 corrupted):
[Full] → [Inc1] → [❌ Inc2] → [Inc3] → [Inc4]
✅ Can restore to Inc1
❌ Cannot restore to Inc2, Inc3, or Inc4
🔴 ACTIVE FULL REQUIRED to reset the chain
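The dependency rule in the diagrams can be modeled in a few lines: a restore point is usable only if it and every file before it in the chain are intact. This is a conceptual sketch with made-up names, not how Veeam itself checks chains.

```python
def restorable_points(chain, damaged):
    """Return the restore points that are still usable.

    chain: file names in order, starting with the full backup.
    damaged: set of names that are missing or corrupted.
    """
    usable = []
    for point in chain:
        if point in damaged:
            break  # everything from here on depends on the damaged file
        usable.append(point)
    return usable


chain = ["Full", "Inc1", "Inc2", "Inc3", "Inc4"]
print(restorable_points(chain, set()))     # ['Full', 'Inc1', 'Inc2', 'Inc3', 'Inc4']
print(restorable_points(chain, {"Inc2"}))  # ['Full', 'Inc1']
```

If the full itself is damaged, the function returns an empty list, which matches the worst case: no restore is possible until a new active full completes.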
How to Verify Chain Health
In the Veeam B&R console:
When an Active Full Is Needed
An active full creates a new base .vbk and resets the incremental chain. It's needed when:
- Backup chain is broken or corrupted
- Incremental data is inconsistent
- After certain infrastructure changes (storage migration, VM conversion)
How to trigger: Right-click the job → Active Full
Be aware: An active full backs up ALL data, not just changes. It will take significantly longer and use more repository space than a normal incremental. On a large server, this could take hours. Don't trigger one during business hours without T2/T3 awareness.
8. Escalation Matrix
| Situation | First Owner | First Action | Escalate To | Escalate When |
|---|---|---|---|---|
| Single job failure (first time) | T1 | Retry, document, monitor | T2 | 3+ consecutive failures or unusual error |
| Recurring failure (3+ in a row) | T1 → T2 | Document pattern, escalate | T2 | Immediately after identifying the pattern |
| Workstation agent offline | T1 | Check power, network, service | T2 | Agent not installed or unreachable after basic checks |
| Repository space warning | T1 | Alert immediately | T2 | Any space warning — do not wait |
| Repository full (jobs failing) | T1 → T2 | Escalate immediately | T2 + T3 | Immediately — client has no backup protection |
| Backup chain broken | T1 → T2 | Do not retry, escalate | T2 | Immediately — active full required |
| VSS writer failure (after restart) | T1 → T2 | Restart service, retry once | T2 | If retry fails after service restart |
| S3 offsite copy failure | T1 | Retry, check internet | T2 | 3+ consecutive failures |
| Replication failure (cross-site) | T2 | Check WAN, target host health | T3 | Cross-site connectivity or infrastructure issue |
| BDR appliance offline | T1 → T2 + T3 | Verify via NinjaRMM, escalate | T2 + T3 | Immediately — entire client backup infrastructure is down |
| Total backup outage (all jobs failing) | T1 → T2 + T3 | Escalate immediately | T2 + T3 + Management | Immediately — client is unprotected |
| Veeam software error beyond troubleshooting | T3 | Veeam support case | Veeam (614-339-8200) | After exhausting Troubleshooting Runbook |
🔴 Any situation where a client has NO working backup is a Priority 1 event. A dental practice without backup is one hardware failure away from catastrophic data loss. Treat it with the urgency it deserves.
9. Weekly Backup Review (T2)
In addition to daily T1 monitoring, T2 conducts a comprehensive weekly review every Monday.
Weekly Review Checklist
| Check | What to Look For | Action If Found |
|---|---|---|
| All server backup jobs green | Any client with failed or warning server backups over the past 7 days | Investigate recurring patterns, open ticket if not already tracked |
| All workstation backup jobs running | Workstations that haven't backed up in 7+ days | Check if machine is decommissioned, powered off long-term, or agent failed |
| S3 offsite copies current | Any S3 copy jobs that haven't completed in 7+ days | Verify internet connectivity, S3 bucket health, copy job configuration |
| Replication jobs current (multi-site clients) | Replication jobs behind by 24+ hours | Check WAN health, target host capacity |
| Repository capacity | Any repository above 80% utilization | Flag for capacity planning — alert AM if storage expansion needed |
| BDR appliance health | Windows 11 patches current, disk health, Veeam service running | Coordinate with NinjaRMM monitoring data |
| VSPC dashboard clean | Any persistent warnings or errors across the fleet | Aggregate trends for T3 review |
Capacity Planning Trigger
If any BDR repository consistently runs above 80% capacity, alert the Account Manager. Storage expansion or retention policy adjustment may be needed. Do not wait until it hits 90% — at that point, backup jobs will start failing.
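The capacity thresholds translate into a simple check. A minimal sketch, assuming used and total sizes are read from VSPC in the same unit; the function name and return strings are illustrative.

```python
def capacity_status(used_gb, total_gb):
    """Classify repository utilization against the 80% and 90% thresholds."""
    percent = 100 * used_gb / total_gb
    if percent >= 90:
        return "critical: jobs at risk of failing, escalate now"
    if percent > 80:
        return "flag: alert Account Manager for capacity planning"
    return "ok"


print(capacity_status(6000, 8000))  # 75% used
print(capacity_status(6800, 8000))  # 85% used
print(capacity_status(7400, 8000))  # 92.5% used
```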
Trend Identification
Watch for these patterns during weekly review:
- Backup sizes growing rapidly: New dental software installed? Large imaging database growth? May need retention adjustment.
- Same error across multiple clients: Could indicate a Veeam update issue, VSPC problem, or infrastructure pattern.
- Workstation backup completion rate dropping: Are new workstations being added to the protection group? Agent deployment issues?
Document any trends in a recurring internal HALO ticket for monthly engineering review.
10. Reporting to HALO — Documentation Standards
Minimum Documentation for Every Backup Ticket
When opening/updating a ticket:
Client: [Client Name]
Job: [Exact Veeam job name]
Job Type: [Server Backup / Workstation Backup / S3 Copy / Replication]
Error: [Exact error message from Veeam job log]
Consecutive Failures: [Count]
Last Successful Run: [Date/Time]
---
Action Taken: [What you did]
Result: [What happened]
Next Steps: [Monitoring / Escalating / Resolved]
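To keep notes consistent, the template can be filled in programmatically and rejected when a field is missing. This is a sketch of the idea only; it does not call HALO, and the field keys, function name, and sample values are assumptions.

```python
REQUIRED = [
    "client", "job", "job_type", "error", "consecutive_failures",
    "last_successful_run", "action_taken", "result", "next_steps",
]

TEMPLATE = """\
Client: {client}
Job: {job}
Job Type: {job_type}
Error: {error}
Consecutive Failures: {consecutive_failures}
Last Successful Run: {last_successful_run}
---
Action Taken: {action_taken}
Result: {result}
Next Steps: {next_steps}"""


def render_ticket_note(**fields):
    """Fill the template; raise ValueError if any required field is empty."""
    missing = [
        key for key in REQUIRED
        if fields.get(key) is None or not str(fields[key]).strip()
    ]
    if missing:
        raise ValueError("Missing ticket fields: " + ", ".join(missing))
    return TEMPLATE.format(**fields)


note = render_ticket_note(
    client="Example Dental",
    job="NB Hyper-V",
    job_type="Server Backup",
    error="(exact error text copied from the job log)",
    consecutive_failures=1,
    last_successful_run="2026-03-02 07:00",
    action_taken="Manual retry",
    result="Retry succeeded",
    next_steps="Monitoring next scheduled run",
)
print(note)
```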
Ticket Lifecycle
| Event | Ticket Action |
|---|---|
| Alert arrives from backup failure | Ticket auto-created. T1 triages within 2 hours. |
| T1 retries and job succeeds | Update ticket with resolution. Leave open — monitor next scheduled run. |
| Next scheduled run succeeds | Update ticket. Close. |
| T1 retry fails or issue is recurring | Update ticket with details. Escalate to T2. |
| T2 resolves | Update ticket with root cause and fix. Close after next successful run. |
| T2 escalates to T3 or Veeam | Update ticket. Note escalation and what was exhausted. |
| Issue resolved after escalation | Update ticket with full root cause analysis. Close after verification. |
When to Create a New Ticket vs. Update Existing
- Same job, same error, within 7 days: Update the existing ticket
- Same job, different error: New ticket (different root cause likely)
- Same client, different job: New ticket
- Recurring pattern over 30+ days: Consider a problem ticket for engineering review
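The rules above can be written as one decision function. The list does not say what to do when the same job fails with the same error between 8 and 29 days later; the sketch assumes a new ticket in that case. The function name and return values are illustrative.

```python
def ticket_decision(same_job, same_error, days_since_first_seen):
    """Decide whether to update an existing ticket or open a new one."""
    if not same_job:
        return "new ticket"          # same client, different job
    if not same_error:
        return "new ticket"          # different root cause likely
    if days_since_first_seen <= 7:
        return "update existing ticket"
    if days_since_first_seen >= 30:
        return "consider problem ticket"
    # Assumption: the SOP does not cover days 8 to 29 for the same job and error.
    return "new ticket"


print(ticket_decision(True, True, 3))    # update existing ticket
print(ticket_decision(True, False, 3))   # new ticket
print(ticket_decision(True, True, 45))   # consider problem ticket
```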
11. DTC Backup Standards Quick Reference
For quick reference during troubleshooting. Full details in Veeam B&R Standards (page 1004).
| Setting | Server Backup | Workstation Backup | Replication |
|---|---|---|---|
| Job Type | Hyper-V Backup | Windows Agent Policy | Hyper-V Replication |
| Source | Hyper-V host (all VMs) | Protection group "Workstations" | Individual server VM |
| Target | Local BDR repo + S3 | Local BDR repo | Opposing site Hyper-V host |
| Schedule | Every 1 hour | Daily at 1:00 AM | Every 1 hour |
| Retention | 14 restore points | 14 days | 6 restore points |
| Retry | 3x / 10 min intervals | N/A | N/A |
| Expected RPO | ~1 hour | ~24 hours | ~1 hour |
12. Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | March 2026 | IT Support Engineering | Initial release. Establishes tiered T1→T2→T3 backup monitoring process, daily verification procedures, common failure response guides, escalation matrix, weekly T2 review cadence, HALO documentation standards. |
Confidential — Internal Use Only