Veeam Backup Daily Operations & Verification SOP

Field	Details
Category	Backup & Disaster Recovery
Author	IT Support Engineering
Date	March 2026
Version	1.0
Audience	T1 (primary), T2/T3 (escalation)
Platform	Veeam Backup & Replication 13.x (Enterprise Edition)

1. Purpose

Backup failures that go unnoticed become disasters. This SOP establishes a tiered monitoring and response process so that backup issues are caught and addressed daily — not discovered when someone needs a restore.

What this document is: The daily operational playbook for monitoring, verifying, and responding to Veeam backup alerts. This is the T1-readable "what to do when a backup fails" procedure.

What this document is NOT: This is not the configuration reference (see Veeam B&R Standards) or the disaster recovery playbook (see DR Runbook).

The change this document introduces: Backup alert triage moves from T3-only NOC board ownership to a tiered T1→T2→T3 process. HALO PSA already creates tickets from backup alerts — this SOP defines who touches them first and what they do.

Related Documents

Document	When to Reference
Veeam B&R Standards	How backup jobs are configured (schedules, retention, targets)
DR Runbook	When a backup is needed for actual recovery
Veeam Troubleshooting Runbook	When T1 basic fixes don't resolve the issue
Vendor Escalation Quick Reference	Veeam support: 614-339-8200

2. Quick Reference — Backup Failure Response

Print this page. Tape it to your monitor.

BACKUP ALERT ARRIVES IN HALO
│
├─ Is this a SINGLE job failure (first occurrence)?
│   ├─ YES → T1: Check job log for error. Retry manually. Document in ticket.
│   │         Monitor the next scheduled run.
│   │   ├─ Next run succeeds → Note in ticket. Close. Done.
│   │   └─ Next run also fails → Now it's RECURRING. See below.
│   │
│   └─ NO → Is this a RECURRING failure (3+ consecutive)?
│       ├─ YES → T1: Document pattern in ticket. Escalate to T2 immediately.
│       │         Do NOT keep retrying blindly.
│       │
│       └─ NO → Is this a TOTAL BACKUP OUTAGE (no jobs running for client)?
│           └─ YES → T1: Escalate to T2 + notify T3 lead immediately.
│                    This is a client without backup protection.
│
SPECIAL CASES (escalate immediately, do not attempt fix):
  🔴 "Backup chain is broken" → T2 (active full required)
  🔴 Repository full / critically low space → T2
  🔴 BDR appliance offline or unreachable → T2 + T3
  🔴 All jobs for a client failing → T2 + T3
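
The flowchart above can also be read as a small decision function. The Python sketch below is illustrative only: the Alert fields, the error substrings, and the recurring_threshold parameter are assumptions made for this example, not HALO or Veeam features. Real Veeam log wording may differ from the phrases matched here.

```python
from dataclasses import dataclass

# Substrings treated as "escalate immediately" special cases.
# Assumption: these phrases approximate what appears in the job log.
SPECIAL_CASE_PHRASES = (
    "backup chain is broken",
    "repository is full",
    "insufficient disk space",
)


@dataclass
class Alert:
    error_text: str
    consecutive_failures: int          # 1 = first occurrence
    all_client_jobs_failing: bool = False
    bdr_offline: bool = False


def triage(alert: Alert, recurring_threshold: int = 3) -> str:
    """Return the first action for a backup alert, following the Quick Reference."""
    # Total outage or unreachable BDR: the client has no backup protection.
    if alert.bdr_offline or alert.all_client_jobs_failing:
        return "ESCALATE: T2 + T3 (client has no backup protection)"

    # Special cases are escalated without a T1 fix attempt.
    text = alert.error_text.lower()
    if any(phrase in text for phrase in SPECIAL_CASE_PHRASES):
        return "ESCALATE: T2 (special case, do not attempt fix)"

    # Recurring failures are documented and escalated, not retried.
    if alert.consecutive_failures >= recurring_threshold:
        return "ESCALATE: T2 (recurring failure, do not keep retrying)"

    return "T1: check job log, retry manually, document, monitor next run"
```

For example, triage(Alert("Network timeout", 1)) returns the T1 retry action, while the same alert with consecutive_failures=3 returns the T2 escalation. Lower recurring_threshold if your team treats a second failure as recurring.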

3. DTC Backup Architecture (What You're Monitoring)

Before you can troubleshoot failures, you need to understand what's running. Every DTC-managed Veeam client follows this standard architecture:

Component	Standard	What to Know
Server Backups	Hyper-V host-level, every 1 hour, 14 restore points	RPO ≈ 1 hour. These are the most critical jobs.
Workstation Backups	Agent-based via protection group, daily at 1:00 AM, 14 days retention	The "Backup once powered on" option catches machines that were off at 1:00 AM.
S3 Offsite Copy	Backup copy to S3 object storage	Offsite protection. If this fails, the local backup still exists but there is no offsite copy.
Cross-Site Replication	Hourly VM replication to opposing site, 6 restore points	Only on multi-site clients. Enables manual DR failover.
BDR Appliance	Equus hardware, Windows 11, Veeam B&R 13.x Enterprise	Named DTCBSURE-[SiteAbbrev]. This IS the backup server.
VSPC	vspc.dtctoday.com:1280	Centralized monitoring dashboard for all client backup jobs.

Job naming convention: <Site Abbrev> Hyper-V for server backups, Workstations for agent backups, <Source> > <Target> Replication for replication jobs. If a job doesn't follow this convention, it's either legacy or misconfigured — note it in the ticket.

4. Alert Sources & Ticket Flow

How Alerts Arrive

Backup failures generate HALO PSA tickets automatically. The flow:

Veeam job fails → Alert generated → HALO ticket created → Appears on NOC board

Ticket Ownership Model

Stage	Owner	Action	Timeframe
Alert arrives in HALO	T1	Triage: review error, classify severity, attempt first-pass fix	Within 2 hours of business day start
T1 fix resolves issue	T1	Document resolution, monitor next run, close ticket if next run succeeds	Same day
T1 fix does not resolve	T1 → T2	Escalate with documentation of what was tried	Within 4 hours
T2 investigation needed	T2	Troubleshooting runbook, deeper analysis	Within 1 business day
T2 cannot resolve	T2 → T3	Escalate with full diagnostic data	Within 2 business days
Infrastructure or architecture issue	T3	Engineering resolution, potential Veeam support call	As needed

HALO Ticket Standards for Backup Alerts

Every backup failure ticket must include:

Field	What to Document
Client	Client name (should auto-populate from alert)
Job Name	Exact Veeam job name (e.g., "NB Hyper-V", "Workstations")
Job Type	Server backup / Workstation backup / S3 copy / Replication
Error Message	Copy the error from the Veeam job log — not a paraphrase, the actual error
Failure Count	First occurrence? Second? Fifth in a row?
Action Taken	What you did (retry, service restart, etc.)
Result	Did it fix it? If not, what happened?
Next Steps	Monitoring next run / Escalating to T2 / Awaiting vendor response

5. Daily Verification Procedure (T1)

Morning Check — Every Business Day

This should take 15-20 minutes once you have the routine down. Do this within the first 2 hours of your shift.

Step 1: Check HALO for new backup alert tickets

Review any new tickets created overnight from backup alerts. For each ticket:

Open the ticket and read the alert details
Identify the client, job name, and error message
Classify using the Quick Reference flowchart (Section 2)
Take the appropriate action (retry, escalate, or investigate)
Document your action in the ticket

Step 2: Quick VSPC dashboard review

Log into VSPC (vspc.dtctoday.com:1280) and scan the overview dashboard.

What "healthy" looks like:

All backup jobs show green (successful)
Last run timestamps are within expected schedule (servers: within last 1-2 hours, workstations: within last 24 hours)
No warning or error indicators

What needs attention:

Any job showing yellow (warning) — usually means partial success or a retry was needed
Any job showing red (error) — failed, needs investigation
Any job showing stale timestamp — job may not be running at all (worse than failing)
Any BDR showing offline — the entire backup server for that client is unreachable
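
The stale-timestamp check can be expressed as a simple age comparison. This is a minimal Python sketch, assuming you already have each job's last successful run time (for example, copied from the VSPC dashboard). The job-type keys and the tuple layout are hypothetical; the thresholds mirror the "healthy" criteria above.

```python
from datetime import datetime, timedelta

# Maximum acceptable age of the last successful run, per job type.
# Servers run hourly (allow up to 2 hours); workstations run daily.
MAX_AGE = {
    "server": timedelta(hours=2),
    "workstation": timedelta(hours=24),
}


def is_stale(job_type: str, last_run: datetime, now: datetime) -> bool:
    """True if the last successful run is older than the schedule allows."""
    return now - last_run > MAX_AGE[job_type]


def stale_jobs(jobs, now: datetime):
    """jobs: iterable of (job_name, job_type, last_run). Returns stale job names."""
    return [name for name, job_type, last_run in jobs
            if is_stale(job_type, last_run, now)]
```

A job that appears in the stale list but has no HALO ticket is the "silently stopped" case and needs a manually created ticket.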

Step 3: Act on findings

For each issue found in VSPC:

Check HALO for an existing ticket (the alert may not have auto-created one)
If no ticket exists, create one
Follow the appropriate response procedure from Section 6

⚠️ A job that silently stopped running is MORE dangerous than a job that fails loudly. A failed job creates an alert. A job that stopped running might not. Watch for stale timestamps during the VSPC review.

6. Common Backup Failures — T1 Response Procedures

6.1 Job Failed — Single Occurrence

What it looks like: One job failed once. Previous runs were successful.

T1 Action:

Open the Veeam job log (from VSPC or the B&R console via remote access)
Read the error — note the specific message
Document the error in the HALO ticket
Retry the job manually:
- In Veeam console: right-click job → Start (or Retry if session is still active)
- In VSPC: select the job → Start
Monitor the retry
If retry succeeds: note in ticket, monitor the next scheduled run. If the next scheduled run also succeeds, close the ticket.
If retry fails: escalate to T2 with the error details from both the original failure and the retry

Common one-off causes (usually self-resolving):

Transient network blip during backup window
Target host was temporarily unresponsive (reboot, Windows Update)
VSS snapshot timed out during high I/O
S3 copy failed due to internet connectivity interruption

6.2 Job Failed — Recurring (3+ Consecutive)

What it looks like: Same job has failed 3 or more times in a row. Retries aren't fixing it.

T1 Action:

Document the pattern: how many consecutive failures? Same error each time or different?
Do NOT keep retrying blindly. Repeated retries on certain failures (like chain-related errors) can make things worse.
Escalate to T2 immediately with:
- Number of consecutive failures
- Error messages from each failure (are they identical?)
- When the last successful run was
- Any recent changes to the environment (server updates, network changes, hardware work)

T2 will reference: Veeam Troubleshooting Runbook

6.3 Workstation Agent Offline / Not Reporting

What it looks like: Workstation backup job shows a machine as "offline" or "not available" in the protection group.

T1 Action:

Check if the workstation is powered on (ping test, NinjaRMM status)
If powered off: this is expected if the employee is out. Note in ticket. "Backup once powered on" will catch it when the machine boots.
If powered on but agent offline:
- Remote into the workstation
- Check the Veeam Agent for Microsoft Windows service: Get-Service VeeamEndpointBackupSvc
- If stopped: Start-Service VeeamEndpointBackupSvc
- If not installed: escalate to T2 (agent deployment issue — may be Intune-related)
If workstation is unreachable on the network: escalate to T2 (network issue)

6.4 Repository Full / Low Space

What it looks like: Error referencing "insufficient disk space" or "repository is full." VSPC may show storage warnings.

T1 Action:

Alert T2 immediately. This is a capacity issue that affects ALL jobs targeting that repository.
Do NOT attempt to free space by deleting backup files — you can break backup chains.
Document in the HALO ticket:
- Which repository is affected
- Current capacity (% used) if visible in VSPC
- Which jobs target this repository
Check: has the C:\Windows\Installer folder grown excessively? (Reference: HALO Ticket 1125653 — orphaned patch accumulation can consume 50+ GB on the BDR)

T2 will: Assess capacity, clean up old restore points if safe, or provision additional storage.

6.5 Backup Chain Broken

What it looks like: Error referencing "backup chain," "missing incremental," or "full backup required." The job may refuse to run an incremental.

T1 Action:

Do NOT retry the job. A broken chain means incrementals cannot build on the existing data.
Escalate to T2 immediately.
Document: when was the last known successful backup? (This determines the RPO exposure.)

Why this is serious: Veeam incremental backups depend on every previous increment in the chain. If one is missing or corrupted, all subsequent incrementals are useless. T2 must run an active full backup to reset the chain. Until that completes, the client's RPO may be degraded.

T2 will: Run an active full (right-click job → Active Full), verify chain health, investigate why the chain broke.

6.6 VSS Writer Failure

What it looks like: Error referencing "VSS" (Volume Shadow Copy Service), a specific VSS writer name, or "snapshot creation failed."

T1 Action:

Identify which VSS writer failed (the error message usually names it)
Common writers and their associated services:

VSS Writer	Associated Service	Restart Command
SqlServerWriter	SQL Server VSS Writer (SQLWriter)	`Restart-Service SQLWriter`
Microsoft Hyper-V VSS Writer	Hyper-V Virtual Machine Management (vmms)	`Restart-Service vmms`
NTDS	Active Directory Domain Services (NTDS)	`Restart-Service NTDS -Force`
IIS Metabase Writer	IIS Admin Service (IISADMIN)	`Restart-Service IISADMIN`
System Writer	Cryptographic Services (CryptSvc)	`Restart-Service CryptSvc`

Restart the associated service (during off-hours or with client awareness if the restart interrupts a production service, such as Active Directory Domain Services or IIS). The NTDS restart needs -Force because other services depend on it.
Retry the backup job
If retry succeeds: note in ticket, monitor next run
If retry fails: run vssadmin list writers from an elevated prompt and look for writers in a "Failed" or "Waiting for completion" state
If writers are stuck: escalate to T2
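
When the vssadmin list writers output is long, a short script can pull out only the writers that are not healthy. The Python sketch below is a hedged example: it assumes the usual "Writer name: '...'" and "State: [n] ..." line layout, and the SAMPLE text is hand-written for illustration, not captured from a real server.

```python
import re
from typing import Dict

HEALTHY_STATE = "Stable"


def unhealthy_writers(vssadmin_output: str) -> Dict[str, str]:
    """Map writer name -> state for every writer whose state is not Stable."""
    result = {}
    current = None
    for line in vssadmin_output.splitlines():
        name_match = re.match(r"\s*Writer name:\s*'(.+)'", line)
        if name_match:
            current = name_match.group(1)
            continue
        state_match = re.match(r"\s*State:\s*\[\d+\]\s*(.+)", line)
        if state_match and current is not None:
            state = state_match.group(1).strip()
            if state != HEALTHY_STATE:
                result[current] = state
            current = None
    return result


# Illustrative sample only (hand-written, abbreviated).
SAMPLE = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   State: [7] Failed
   Last error: Timed out
"""
```

Paste the writer names and states it reports into the HALO ticket before escalating, so T2 does not have to rerun the command.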

6.7 Job Running Longer Than Expected

What it looks like: Backup job has been running for much longer than usual. Hourly server backup taking 3+ hours. Daily workstation backup still running at noon.

T1 Action:

Check the job progress in Veeam console — is it actively processing or stuck?
If actively processing (data is transferring but slowly):
- Check for bandwidth contention: is Veeam competing with a large file transfer, Windows Update, or another backup job?
- Check if this is the first run after a chain reset (active fulls take much longer than incrementals)
- Note in ticket and monitor. If it completes, it's likely a one-off.
If stuck (no progress for 30+ minutes):
- Escalate to T2
- Do NOT kill the job unless directed by T2/T3 — killing a job mid-write can corrupt the backup file

7. Backup Chain Health

Understanding backup chains is essential for knowing when a failure is "retry and move on" vs. "this is a real problem."

How DTC Backup Chains Work

Server backups (hourly, 14 restore points):

Veeam creates a full backup (.vbk) as the base
Every subsequent run creates an incremental (.vib) containing only changed data
After 14 restore points, the oldest increment merges into the full (forever forward incremental)
The chain: [Full] → [Inc1] → [Inc2] → ... → [Inc13] (14 restore points in total)

Workstation backups (daily, 14 days):

Same chain concept but on a daily cycle

Why Chain Health Matters

To restore from any point, Veeam needs the full backup PLUS every incremental up to that point. If any file in the chain is missing or corrupted, everything after it is unrecoverable.

HEALTHY CHAIN:
  [Full] → [Inc1] → [Inc2] → [Inc3] → [Inc4]
  ✅ Can restore to any point

BROKEN CHAIN (Inc2 corrupted):
  [Full] → [Inc1] → [❌ Inc2] → [Inc3] → [Inc4]
  ✅ Can restore to Inc1
  ❌ Cannot restore to Inc2, Inc3, or Inc4
  🔴 ACTIVE FULL REQUIRED to reset the chain
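
The rule illustrated above (every restore point depends on all files before it) can be sketched in a few lines of Python. The tuple layout and names here are hypothetical and exist only to show the dependency logic; Veeam tracks this internally.

```python
def restorable_points(chain):
    """Return the names of restore points that can still be restored.

    chain: list of (name, is_intact) tuples, ordered from the full backup
    forward. Everything at or after the first damaged file is lost.
    """
    restorable = []
    for name, is_intact in chain:
        if not is_intact:
            break  # later increments depend on this file
        restorable.append(name)
    return restorable


broken_chain = [
    ("Full", True),
    ("Inc1", True),
    ("Inc2", False),  # corrupted
    ("Inc3", True),
    ("Inc4", True),
]
```

With broken_chain, only Full and Inc1 remain restorable, even though Inc3 and Inc4 are themselves intact. That is why the last known successful backup time determines the RPO exposure.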

How to Verify Chain Health

In the Veeam B&R console:

Navigate to Home → Backups → Disk
Expand the backup job
Look at the restore points listed:
- Each should show a date/time and status
- No gaps in the sequence = healthy
- Missing dates or error indicators = investigate

When an Active Full Is Needed

An active full creates a new base .vbk and resets the incremental chain. It's needed when:

Backup chain is broken or corrupted
Incremental data is inconsistent
After certain infrastructure changes (storage migration, VM conversion)

How to trigger: Right-click the job → Active Full

Be aware: An active full backs up ALL data, not just changes. It will take significantly longer and use more repository space than a normal incremental. On a large server, this could take hours. Don't trigger one during business hours without T2/T3 awareness.

8. Escalation Matrix

Situation	First Owner	First Action	Escalate To	Escalate When
Single job failure (first time)	T1	Retry, document, monitor	T2	3+ consecutive failures or unusual error
Recurring failure (3+ in a row)	T1 → T2	Document pattern, escalate	T2	Immediately after identifying the pattern
Workstation agent offline	T1	Check power, network, service	T2	Agent not installed or unreachable after basic checks
Repository space warning	T1	Alert immediately	T2	Any space warning — do not wait
Repository full (jobs failing)	T1 → T2	Escalate immediately	T2 + T3	Immediately — client has no backup protection
Backup chain broken	T1 → T2	Do not retry, escalate	T2	Immediately — active full required
VSS writer failure (after restart)	T1 → T2	Restart service, retry once	T2	If retry fails after service restart
S3 offsite copy failure	T1	Retry, check internet	T2	3+ consecutive failures
Replication failure (cross-site)	T2	Check WAN, target host health	T3	Cross-site connectivity or infrastructure issue
BDR appliance offline	T1 → T2 + T3	Verify via NinjaRMM, escalate	T2 + T3	Immediately — entire client backup infrastructure is down
Total backup outage (all jobs failing)	T1 → T2 + T3	Escalate immediately	T2 + T3 + Management	Immediately — client is unprotected
Veeam software error beyond troubleshooting	T3	Veeam support case	Veeam (614-339-8200)	After exhausting Troubleshooting Runbook

🔴 Any situation where a client has NO working backup is a Priority 1 event. A dental practice without backup is one hardware failure away from catastrophic data loss. Treat it with the urgency it deserves.

9. Weekly Backup Review (T2)

In addition to daily T1 monitoring, T2 conducts a comprehensive weekly review every Monday.

Weekly Review Checklist

Check	What to Look For	Action If Found
All server backup jobs green	Any client with failed or warning server backups over the past 7 days	Investigate recurring patterns, open ticket if not already tracked
All workstation backup jobs running	Workstations that haven't backed up in 7+ days	Check if machine is decommissioned, powered off long-term, or agent failed
S3 offsite copies current	Any S3 copy jobs that haven't completed in 7+ days	Verify internet connectivity, S3 bucket health, copy job configuration
Replication jobs current (multi-site clients)	Replication jobs behind by 24+ hours	Check WAN health, target host capacity
Repository capacity	Any repository above 80% utilization	Flag for capacity planning and alert the Account Manager (AM) if storage expansion is needed
BDR appliance health	Windows 11 patches current, disk health, Veeam service running	Coordinate with NinjaRMM monitoring data
VSPC dashboard clean	Any persistent warnings or errors across the fleet	Aggregate trends for T3 review

Capacity Planning Trigger

If any BDR repository consistently runs above 80% capacity, alert the Account Manager. Storage expansion or retention policy adjustment may be needed. Do not wait until it hits 90% — at that point, backup jobs will start failing.
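
One way to make "consistently above 80%" concrete is to require several consecutive weekly readings over the threshold. The Python sketch below is an assumption-laden example: the three-week window is a hypothetical choice, not a DTC standard, and the readings are whatever utilization percentages T2 records during the weekly review.

```python
def needs_capacity_review(weekly_pct_used, threshold=80.0, weeks=3):
    """True if the most recent `weeks` readings are all above `threshold`.

    weekly_pct_used: utilization percentages, oldest first.
    Returns False when there are fewer than `weeks` readings.
    """
    recent = weekly_pct_used[-weeks:]
    return len(recent) == weeks and all(pct > threshold for pct in recent)
```

For example, readings of 70, 82, 85, 88 trigger a review, while 85, 79, 90 do not because one of the last three weeks dipped below the threshold.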

Trend Identification

Watch for these patterns during weekly review:

Backup sizes growing rapidly: New dental software installed? Large imaging database growth? May need retention adjustment.
Same error across multiple clients: Could indicate a Veeam update issue, VSPC problem, or infrastructure pattern.
Workstation backup completion rate dropping: Are new workstations being added to the protection group? Agent deployment issues?

Document any trends in a recurring internal HALO ticket for monthly engineering review.

10. Reporting to HALO — Documentation Standards

Minimum Documentation for Every Backup Ticket

When opening/updating a ticket:

Client: [Client Name]
Job: [Exact Veeam job name]
Job Type: [Server Backup / Workstation Backup / S3 Copy / Replication]
Error: [Exact error message from Veeam job log]
Consecutive Failures: [Count]
Last Successful Run: [Date/Time]
---
Action Taken: [What you did]
Result: [What happened]
Next Steps: [Monitoring / Escalating / Resolved]

Ticket Lifecycle

Event	Ticket Action
Alert arrives from backup failure	Ticket auto-created. T1 triages within 2 hours.
T1 retries and job succeeds	Update ticket with resolution. Leave open — monitor next scheduled run.
Next scheduled run succeeds	Update ticket. Close.
T1 retry fails or issue is recurring	Update ticket with details. Escalate to T2.
T2 resolves	Update ticket with root cause and fix. Close after next successful run.
T2 escalates to T3 or Veeam	Update ticket. Note escalation and what was exhausted.
Issue resolved after escalation	Update ticket with full root cause analysis. Close after verification.

When to Create a New Ticket vs. Update Existing

Same job, same error, within 7 days: Update the existing ticket
Same job, different error: New ticket (different root cause likely)
Same client, different job: New ticket
Recurring pattern over 30+ days: Consider a problem ticket for engineering review
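
The rules above can be summarized as a short decision function. This Python sketch is illustrative: the parameter names are hypothetical, and the handling of a same-job, same-error failure that recurs between 8 and 29 days later is an assumption, since the rules above do not cover that gap.

```python
def ticket_action(same_job: bool, same_error: bool, days_since_first_seen: int) -> str:
    """Decide whether to update an existing backup ticket or open a new one."""
    # A different job or a different error likely has a different root cause.
    if not same_job or not same_error:
        return "new ticket"
    # Long-running pattern: keep the history and raise it for engineering review.
    if days_since_first_seen >= 30:
        return "update existing ticket; consider a problem ticket"
    if days_since_first_seen <= 7:
        return "update existing ticket"
    # Assumption: outside the 7-day window, start a fresh ticket.
    return "new ticket"
```

For example, ticket_action(True, True, 3) returns "update existing ticket", and ticket_action(True, False, 3) returns "new ticket".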

11. DTC Backup Standards Quick Reference

For quick reference during troubleshooting. Full details in Veeam B&R Standards (page 1004).

Setting	Server Backup	Workstation Backup	Replication
Job Type	Hyper-V Backup	Windows Agent Policy	Hyper-V Replication
Source	Hyper-V host (all VMs)	Protection group "Workstations"	Individual server VM
Target	Local BDR repo + S3	Local BDR repo	Opposing site Hyper-V host
Schedule	Every 1 hour	Daily at 1:00 AM	Every 1 hour
Retention	14 restore points	14 days	6 restore points
Retry	3x / 10 min intervals	N/A	N/A
Expected RPO	~1 hour	~24 hours	~1 hour

12. Document Control

Version	Date	Author	Changes
1.0	March 2026	IT Support Engineering	Initial release. Establishes tiered T1→T2→T3 backup monitoring process, daily verification procedures, common failure response guides, escalation matrix, weekly T2 review cadence, HALO documentation standards.

Confidential — Internal Use Only
