Veeam Troubleshooting Playbook

Location: Backup & Disaster Recovery → Troubleshooting (Chapter 1088)
Version: 1.0
Last Updated: March 2026
Applies To: Veeam Backup & Replication v12.x on DTC Custom BDR Appliances
Audience: T1 / T2 / T3


How to Use This Playbook

Each scenario follows the same structure: Symptoms → Quick Check → Tiered Response → Root Cause Notes. Start at T1. If the T1 steps don't resolve the issue, escalate to T2, then T3. If T3 cannot resolve it, open a case with Veeam Support.

Tier definitions for this document:

| Tier | Scope | Expected Resolution |
|------|-------|---------------------|
| T1 | Retry, verify, collect info | 15 minutes or less |
| T2 | Investigate, targeted fix, service-level changes | 30–60 minutes |
| T3 | Rebuild, architectural fix, vendor escalation | 1+ hours, may require maintenance window |


Scenario 1: VSS Writer Failures

Symptoms: Backup job fails or hangs mid-progress (often on C: drive). Veeam logs show "VSS writer failed" or "Writer [name] is in Failed state." Job may stall at a fixed percentage for hours with throughput dropping to 0 KB/s.

Most Common Writer: NTDS (Active Directory) — causes State: [11] Failed, Last error: Non-retryable error. This blocks ALL VSS-based operations on that volume because other writers queue behind it in "Waiting for completion."

HALO Reference: Ticket 1117116 — NTDS writer failure caused C: drive backup to hang at 68% for 1:38+ with 0 KB/s throughput. D: and F: completed fine because NTDS writer only affects the system volume.

T1 — Retry

  1. Stop the failed/hung backup job in the Veeam console (right-click → Stop).
  2. On the source server, check writer status:
    vssadmin list writers
  3. Look for any writer showing State: Failed, Waiting for completion, or Timed out.
  4. If all writers show Stable — retry the job. Sometimes a transient lock causes a one-time failure.
  5. If any writer shows Failed, Waiting for completion, or Timed out after the job has been stopped, escalate to T2.
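
The writer check in steps 2 to 5 can be sketched as a small parser. This is a minimal Python sketch, not part of any Veeam or Windows tooling: the function name `unstable_writers` is hypothetical, and the sample text is an abbreviated, assumed rendering of `vssadmin list writers` output built from the states quoted in this scenario.

```python
import re

def unstable_writers(vssadmin_output: str) -> dict[str, str]:
    """Map each writer whose state is not Stable to its reported state text."""
    problems = {}
    current = None
    for line in vssadmin_output.splitlines():
        name = re.match(r"\s*Writer name:\s*'(.+)'", line)
        if name:
            current = name.group(1)
            continue
        state = re.match(r"\s*State:\s*\[\d+\]\s*(.+?)\s*$", line)
        if state and current and state.group(1) != "Stable":
            problems[current] = state.group(1)
    return problems

# Abbreviated sample (assumed format; Writer Id lines omitted).
SAMPLE = """\
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'NTDS'
   State: [11] Failed
   Last error: Non-retryable error
"""

print(unstable_writers(SAMPLE))  # {'NTDS': 'Failed'}
```

An empty result corresponds to step 4 (retry the job); any entry corresponds to escalating to T2.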

T2 — Investigate

  1. Reset VSS services (does NOT affect AD or server availability):
    net stop vss /y
    net stop swprv /y
    net start swprv
    net start vss
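  2. Re-check writers. The filter leaves out "Error" on purpose, because every healthy writer's "Last error: No error" line would match it:
    vssadmin list writers | findstr /i "Failed Waiting Timed"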
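  3. If the NTDS writer is still Failed, restart the AD DS service. This interrupts domain authentication on this DC for a few seconds. The /y switch also stops services that depend on NTDS (such as DNS Server and Kerberos KDC), so confirm they are running again afterward: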
    net stop ntds /y
    net start ntds
  4. Verify NTDS recovered:
    vssadmin list writers | Select-String -Pattern "NTDS" -Context 0,4
    Expected: State: [1] Stable, Last error: No error
  5. If NTDS is stable — retry the backup job.
  6. If NTDS is still Failed after service restart — escalate to T3.

T3 — Rebuild

  1. If the writer remains in a Failed state after the service restarts, a full server reboot is required.
  2. Before rebooting: Confirm no active backup jobs, no VSS snapshots, and no competing backup products running (MSP360, Acronis, etc. — two backup products taking VSS snapshots simultaneously will deadlock):
    Get-Service Veeam* | Select Name, Status
    vssadmin list shadows
  3. If the Veeam BDR still shows the job as "Running" — reboot the BDR first to release the remote VSS session, then reboot the source server.
  4. After reboot, confirm all writers are Stable before restarting the job.
  5. If the problem recurs across multiple job runs — investigate application-level corruption. Run sfc /scannow and check Event Viewer > Application for recurring VSS errors. Component store corruption may require a repair install.

Root Cause Notes: VSS writer failures are most commonly caused by stale VSS sessions from previous failed/cancelled backup jobs, competing backup software holding VSS locks, or NTDS database inconsistency after unclean shutdowns. On physical servers without out-of-band management (iDRAC/IPMI), rebooting carries risk if the server hangs during POST — confirm physical console access is available or scheduled.


Scenario 2: Repository Full / Insufficient Disk Space on BDR

Symptoms: Job fails with "Insufficient disk space on repository" or "Failed to create snapshot." BDR local storage is at or near capacity.

T1 — Retry

  1. Check BDR disk space:
    Get-CimInstance Win32_LogicalDisk -Filter "DriveType=3" | Select DeviceID, @{N='FreeGB';E={[math]::Round($_.FreeSpace/1GB,2)}}, @{N='TotalGB';E={[math]::Round($_.Size/1GB,2)}}, @{N='FreePct';E={[math]::Round(100*$_.FreeSpace/$_.Size,1)}}
  2. If free space is above 10% — this may be a transient issue (large incremental + merge). Retry the job.
  3. If free space is below 10% — escalate to T2.
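
The 10% rule in steps 2 and 3 is simple arithmetic on the free and total sizes. The Python sketch below is illustrative only: the function names and the example figures are hypothetical, and because the steps do not say what to do at exactly 10%, the sketch assumes that case escalates.

```python
GIB = 1024 ** 3

def free_percent(free_bytes: int, total_bytes: int) -> float:
    """Free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes

def t1_action(free_bytes: int, total_bytes: int, threshold_pct: float = 10.0) -> str:
    """Apply the T1 rule: retry above the threshold, otherwise escalate.

    Assumption: exactly 10% free is treated as 'escalate'.
    """
    if free_percent(free_bytes, total_bytes) > threshold_pct:
        return "retry"
    return "escalate to T2"

# Hypothetical repository: 350 GiB free out of 4096 GiB.
print(f"{free_percent(350 * GIB, 4096 * GIB):.1f}%")  # 8.5%
print(t1_action(350 * GIB, 4096 * GIB))               # escalate to T2
```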

T2 — Investigate

  1. Open the Veeam console → Backup Infrastructure → Backup Repositories. Check the "Free" column.
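  2. Review retention policy: Right-click the job → Edit → Storage → Retention policy. If retention is set higher than necessary (e.g., 30 days on a small repo), reduce it to the DTC standard. Veeam applies the new retention at the end of the next successful job run. Avoid triggering an Active Full to force this on a nearly full repository: it writes a complete new full backup before any older restore points are removed.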
  3. Check for orphaned backups: Backups → Disk — look for backup chains that are no longer tied to an active job. Right-click → Delete from disk if confirmed orphaned.
  4. Check for deleted VMs still consuming space: Backups → Disk → [Job name] — expand and look for VMs that no longer exist but still have restore points.
  5. Verify no stale Veeam .vbk/.vib/.vrb merge files in the repository path — these can accumulate after interrupted merge operations.

T3 — Rebuild

  1. If the repo is legitimately undersized for the workload — this is a capacity planning conversation with the Account Manager. Document current usage, growth rate, and retention requirements.
  2. If orphaned merge files are consuming significant space and can't be cleaned through the Veeam console — manually remove from the repository directory after confirming they're not part of an active chain (check the .vbm metadata file).
  3. Consider restructuring: move large/static volumes (D: data drives) to a separate job with lower retention, keep C: system volumes on standard retention.
  4. If recurrent — evaluate adding storage to the BDR or deploying a secondary repository.

Root Cause Notes: Repository full is often a retention misconfiguration issue at initial deployment. DTC standard BDR specs should be sized for the client's data footprint plus 2x growth headroom. This scenario also occurs when stale backup chains from decommissioned servers aren't cleaned up during migrations.


Scenario 3: Network Timeout Between Subnets

Symptoms: Backup jobs from servers or workstations on a different subnet than the BDR fail with "Connection timed out," "Failed to connect to host," or throughput drops to 0 and stalls. Jobs on the same subnet as the BDR work fine.

HALO Reference: Ticket 1117116 — Cross-subnet Veeam transport between 192.168.1.x (source server) and 192.168.2.x (BDR) was fundamentally unstable. Throughput would start at 44 MB/s then flatline.

T1 — Retry

  1. From the BDR, test basic connectivity to the source:
    Test-NetConnection -ComputerName [SOURCE_IP] -Port 445
    ping [SOURCE_IP]
  2. If ping works but TCP fails, a firewall is the likely cause. Check Windows Firewall on both sides. The same symptom also appears with dual-NIC asymmetric routing (see Scenario 7).
  3. If both work — retry the job. Transient network blips on cross-subnet routes can cause one-time failures.

T2 — Investigate

  1. Verify routing between subnets — the UDM should handle inter-VLAN routing. Confirm both subnets can reach each other's gateway.
  2. Check for MTU issues on the cross-subnet path:
    ping [SOURCE_IP] -f -l 1472
    If this fails ("Packet needs to be fragmented but DF set") while a normal ping works, the path MTU between the subnets is below 1500 bytes. Check UDM VLAN interface MTU settings.
  3. Check if the BDR is on the correct VLAN/subnet. DTC standard is BDR on the server VLAN.
  4. Verify no bandwidth throttling is configured in Veeam: Main Menu → Network Traffic Rules.
  5. If the source server has multiple NICs — confirm Veeam is binding to the correct interface (see Scenario 7).
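
The 1472 in the MTU test of step 2 comes from subtracting the IPv4 header (20 bytes, no options) and the ICMP echo header (8 bytes) from a 1500-byte MTU. A short Python sketch of that arithmetic follows; the function name is hypothetical, and the 1492 and 1400 values are only illustrative MTUs.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header

def max_ping_payload(mtu: int) -> int:
    """Largest 'ping -f -l' payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

for mtu in (1500, 1492, 1400):
    print(mtu, max_ping_payload(mtu))
# 1500 1472
# 1492 1464
# 1400 1372
```

Working backwards, if the largest payload that passes with `-f` is N bytes, the path MTU is N + 28.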

T3 — Rebuild

  1. If cross-subnet transport is fundamentally unstable (recurring stalls/timeouts across multiple job runs), deploy a Veeam proxy on the same subnet as the source. This keeps the source-side read and processing local. Backup data still has to reach the repository on the BDR, so also confirm the proxy-to-BDR path is stable.
  2. If proxy deployment isn't feasible — consider Disk2VHD as a fallback for one-time P2V migrations (not for ongoing backup).
  3. Document the subnet layout and routing path in the HALO ticket for future reference.

Root Cause Notes: Cross-subnet Veeam transport relies on Windows networking stack and inter-VLAN routing quality. Asymmetric routing, MTU mismatches, and firewall inspection on inter-VLAN traffic are the most common culprits. The UDM's IPS/DPI engine can also interfere with large sustained data streams — check IPS logs for blocked traffic if other causes are ruled out.


Scenario 4: Agent Deployment Failures (Intune Policy Conflicts)

Symptoms: Pushing the Veeam Agent from the BDR console fails with "Failed to resolve host name," "Failed to connect to [hostname] on port 6160/11731," or "Access is denied." Disabling Windows Firewall locally doesn't help — it re-enables itself.

HALO Reference: Ticket 1117116 — DVA-ST-DOC01 was managed by the client's Intune tenant. Intune firewall policies overrode local Set-NetFirewallProfile -Enabled False. Microsoft Defender for Endpoint's Web Threat Defense Service silently dropped inbound connections independently of Windows Firewall.

T1 — Retry

  1. Verify DNS resolution from the BDR:
    nslookup [TARGET_HOSTNAME]
    If resolution fails — the DNS record is missing. Add an A record or use the IP address directly in the Veeam job.
  2. Test Veeam deployment ports from the BDR:
    Test-NetConnection -ComputerName [TARGET_IP] -Port 6160
    Test-NetConnection -ComputerName [TARGET_IP] -Port 11731
  3. If port tests fail — check Windows Firewall on the target. If it's domain-joined and Intune-managed, local firewall changes will be overwritten by policy. Escalate to T2.
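
Where several targets need the same port test, the check in step 2 can be scripted. The Python sketch below uses only the standard `socket` module; `tcp_port_open` and `check_ports` are hypothetical helper names, and the port numbers are the two listed in this scenario. It reports only whether a TCP connection can be established and says nothing about which layer blocked it.

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False

def check_ports(host: str, ports=(6160, 11731)) -> dict[int, bool]:
    """Test each deployment port and return {port: reachable}."""
    return {port: tcp_port_open(host, port) for port in ports}

# Usage (replace the placeholder with the target's address):
#   print(check_ports("[TARGET_IP]"))
```

A result of `False` for both ports while ping succeeds matches the pattern described in step 3.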

T2 — Investigate

  1. On the target machine, check if Intune/MDM is managing the firewall:
    Get-Service mpssvc | Select Status
    Get-NetFirewallProfile | Select Name, Enabled
    dsregcmd /status
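    AzureAdJoined: YES means the device is joined to Microsoft Entra ID (Azure AD). Join state alone does not prove Intune management: also look for a populated MdmUrl field in the same output, which indicates MDM enrollment.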
  2. Check for Defender for Endpoint network protection:
    Get-Service webthreatdefsvc | Select Status
    If running — this filters traffic independently of Windows Firewall and can silently block Veeam deployment.
  3. Workaround (immediate): Install the Veeam Agent manually on the target instead of pushing from the BDR. Download the agent installer from the Veeam console or copy from the BDR's installation share. Run locally, point to the BDR. This bypasses the deployment kit entirely.
  4. Proper fix: If DTC has access to the client's Intune portal — create firewall exceptions for Veeam ports (TCP 6160, 11731) under Endpoint Security → Firewall. Also check Attack Surface Reduction → Network Protection.

T3 — Rebuild

  1. If the client or previous MSP manages Intune and won't grant DTC access — document the required firewall rules and submit as a change request to whoever manages the tenant.
  2. Required Intune firewall exceptions for Veeam Agent deployment:
    • TCP 6160 inbound (Veeam Installer Service)
    • TCP 11731 inbound (Veeam Data Mover)
    • Source: BDR IP address
  3. Until Intune policy is updated, use manual agent installation as the standard deployment method for that client's Intune-managed endpoints.
  4. Document in HALO which clients have Intune-managed endpoints so future techs know to expect this.

Root Cause Notes: This is increasingly common as dental practices adopt Microsoft 365 Business Premium (includes Intune and Defender for Endpoint). The key tell is that disabling Windows Firewall locally doesn't stick — Intune pushes the policy back. Defender for Endpoint's network protection layer (webthreatdefsvc) is a separate blocker that most techs don't know about. Always check dsregcmd /status first on any deployment failure.


Scenario 5: SQL Backup Consistency Errors

Symptoms: Application-aware processing fails on servers running SQL Server (common with Dentrix, Eaglesoft, Open Dental). Veeam logs show "SQL Writer failed," "Database consistency check failed," or "Transaction log backup failed."

T1 — Retry

  1. Check if the dental software's own database maintenance ran recently — Dentrix and Eaglesoft both have scheduled database optimization tasks that lock SQL during execution.
  2. Verify the SQL Server service is running:
    Get-Service MSSQL* | Select Name, Status
  3. If SQL is running and no maintenance tasks are active — retry the job.

T2 — Investigate

  1. Check the SQL Writer status:
    vssadmin list writers | Select-String -Pattern "SqlServerWriter" -Context 0,4
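  2. If the SQL Writer is Failed, restart the SQL Server VSS Writer service (SQLWriter) first; this does not interrupt the database. If the writer is still Failed, restart the SQL Server service (schedule with the practice if during business hours, as this briefly interrupts the PMS).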
  3. Check for database corruption — run a consistency check:
    DBCC CHECKDB ('DatabaseName') WITH NO_INFOMSGS
    Replace DatabaseName with the actual dental software database name (check SQL Server Management Studio or the dental software documentation for the DB name).
  4. Review Veeam's Guest Processing settings for the job: Right-click job → Edit → Guest Processing. Verify that application-aware processing is using the correct credentials (domain admin or a SQL-privileged service account).
  5. Check if transaction logs are accumulating. If the database uses the Full recovery model and no log backups are configured, the transaction log will grow until it fills the disk.

T3 — Rebuild

  1. If database corruption is confirmed — this is a dental software vendor escalation (Patterson for Eaglesoft, Henry Schein for Dentrix). Document the DBCC output and engage the vendor.
  2. If transaction log growth is the issue — evaluate switching the database recovery model to Simple (discuss with the dental software vendor first, as some explicitly require Full recovery).
  3. If application-aware processing consistently fails — as a workaround, disable guest processing on the job and rely on crash-consistent backups until the SQL issue is resolved. Document this trade-off in the HALO ticket.

Root Cause Notes: SQL consistency failures most often stem from the dental PMS's own maintenance tasks conflicting with the backup window, or from databases that have accumulated corruption over time. Dentrix in particular runs a "Database Maintenance" utility that holds exclusive locks. Schedule Veeam jobs to avoid overlap with PMS maintenance windows.


Scenario 6: Backup Chain Integrity Breaks

Symptoms: Veeam reports "Backup chain is broken" or "Required restore point is missing." Incremental jobs fail because they can't find the parent .vbk or previous .vib. Health check shows chain as Unhealthy.

T1 — Retry

  1. Open the Veeam console → Backups → Disk → right-click the affected backup → Properties. Check if any restore points show as missing or corrupt.
  2. If the chain shows a gap but the most recent full backup exists — run an Active Full backup (right-click job → Active Full). This creates a new base .vbk and resets the chain.
  3. If the Active Full completes successfully — the chain is repaired. Subsequent incrementals will build from the new base.

T2 — Investigate

  1. Check the repository path for the affected job — look for orphaned .vib files without a corresponding .vbk, or .vbm (metadata) files that reference missing restore points.
  2. If files were manually deleted or moved from the repository — the chain is permanently broken for those points. Run Active Full to establish a new base.
  3. Check if the repository is on a drive with errors:
    chkdsk [DRIVE_LETTER]: /scan
  4. If CBT (Changed Block Tracking) data is corrupted — the incremental can't determine what changed. Reset CBT by running an Active Full. This was a documented issue on Ticket 1117116 where corrupted CBT/digest data on a physical server caused incrementals to stall repeatedly at the same block.

T3 — Rebuild

  1. If the chain is unrecoverable and multiple restore points are missing — delete the backup chain from disk (after confirming no critical restore points are needed) and start fresh with a new Full.
  2. If chain breaks recur — check for underlying storage issues on the BDR (SMART status, disk errors in Event Viewer).
  3. For physical-to-virtual migrations where chain integrity fails repeatedly — pivot to Disk2VHD as a direct P2V method rather than continuing to troubleshoot backup-based migration.
  4. Review the job schedule — chain breaks often happen when an Active Full is interrupted (power loss, manual stop, network disconnect during the full backup window).

Root Cause Notes: Chain integrity is most commonly broken by interrupted full backups, manual deletion of files from the repository path, or storage-level corruption. CBT corruption on the source (especially physical servers) causes incrementals to fail silently. When in doubt, Active Full resets everything.


Scenario 7: Dual-NIC / Defender for Endpoint Interference

Symptoms: BDR or Hyper-V host has two NICs on the same subnet, both with IP addresses. Backup jobs intermittently fail with timeouts or stalls. Ping may work but TCP connections fail. ARP table shows duplicate or inconsistent entries for the target.

HALO Reference: Ticket 1117116 — BDR (DTCBSURE-5001) had Ethernet at .2.166 and Ethernet 2 at .2.215 on the same subnet with the same gateway. This caused asymmetric routing — Windows sent traffic out one NIC, the response came back to the other, and TCP sessions broke. ICMP (ping) worked because it's stateless.

T1 — Retry

  1. On the BDR or Hyper-V host, check NIC configuration:
    Get-NetAdapter | Select Name, Status, LinkSpeed
    Get-NetIPAddress -AddressFamily IPv4 | Select InterfaceAlias, IPAddress
  2. If two NICs have IPs on the same subnet — this is the problem. Escalate to T2.
  3. Flush ARP on both sides and retry as a quick test:
    arp -d *
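
The same-subnet condition in step 2 can be checked mechanically from the address list. The Python sketch below uses the standard `ipaddress` module; the function name is hypothetical, and the /24 prefix on the example addresses is an assumption (the ticket reference gives only the host portions .2.166 and .2.215).

```python
import ipaddress
from itertools import combinations

def same_subnet_pairs(nics: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of interfaces whose IPv4 networks overlap.

    nics maps an interface alias to 'address/prefix', e.g. '10.0.0.5/24'.
    """
    nets = {name: ipaddress.ip_interface(cidr).network for name, cidr in nics.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]

# Illustrative values; the /24 prefix length is assumed.
nics = {"Ethernet": "192.168.2.166/24", "Ethernet 2": "192.168.2.215/24"}
print(same_subnet_pairs(nics))  # [('Ethernet', 'Ethernet 2')]
```

Any pair in the result is the dual-NIC condition that warrants escalation to T2.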

T2 — Investigate

  1. Determine which NIC Veeam is bound to — check the Veeam console server management IP.
  2. Immediate fix: Disable the NIC Veeam isn't using:
    Disable-NetAdapter -Name "[UNUSED_NIC_NAME]" -Confirm:$false
  3. Flush ARP on both BDR and source, re-enable any disabled firewalls, and retry.
  4. If this is a Hyper-V host — check if the SET (Switch Embedded Teaming) vSwitch management vNIC is disabled:
    Get-VMSwitch | Select Name, SwitchType, EmbeddedTeamingEnabled
    Get-VMNetworkAdapter -ManagementOS | Select Name, SwitchName, IPAddresses, Status
    If the management vNIC is disabled and physical NICs have individual IPs — the SET team was either never completed or was torn down.

T3 — Rebuild

  1. For Hyper-V hosts: rebuild the SET team properly — both physical NICs teamed into a single vSwitch with one management vNIC holding a single IP. Remove IPs from individual physical NICs.
  2. For BDR appliances: if dual NICs are by design (production + backup network), ensure they're on different subnets. Two NICs on the same subnet with the same gateway will always cause asymmetric routing.
  3. Check for Defender for Endpoint interference if the target is Intune-managed:
    Get-Service webthreatdefsvc | Select Status
    Defender for Endpoint's network protection filters traffic independently of Windows Firewall. Even with firewall disabled, Defender can silently drop inbound connections.
  4. If Defender is the blocker — see Scenario 4 for the Intune policy resolution path.

Root Cause Notes: Dual-NIC same-subnet is the most common "invisible" networking issue in DTC environments. It produces symptoms that look like firewall problems (TCP fails, ICMP works) because the asymmetric routing breaks stateful TCP connections but not stateless ICMP. Always check Get-NetIPAddress early in troubleshooting when connectivity is inconsistent. For Hyper-V hosts, the management vNIC being disabled is a common post-migration issue.


Scenario 8: DNS Resolution Failures

Symptoms: Veeam job fails with "Failed to resolve host name [hostname] from [BDR_NAME]." The BDR can't find the source server by hostname. May also manifest as agent deployment failures (see Scenario 4).

T1 — Retry

  1. From the BDR, test name resolution:
    nslookup [TARGET_HOSTNAME]
  2. If resolution fails — check what DNS server the BDR is using:
    Get-DnsClientServerAddress -AddressFamily IPv4
  3. DTC standard: DNS should point to the domain controller (if domain-joined) or the UDM gateway. If the BDR is pointing at an external DNS (8.8.8.8, 1.1.1.1) — it won't resolve internal hostnames.
  4. Quick fix — use the target's IP address directly in the Veeam job instead of hostname. This gets the backup running while DNS is fixed.
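
Step 3's check for external resolvers can be automated against the address list returned in step 2. The Python sketch below flags any configured DNS server with a globally routable address; the function name and the 192.168.1.10 address are hypothetical. A private address passes this check but still has to be a server that actually hosts or forwards the internal zone.

```python
import ipaddress

def external_dns_servers(servers: list[str]) -> list[str]:
    """Return the configured DNS servers that are public (globally routable)."""
    return [s for s in servers if ipaddress.ip_address(s).is_global]

# Hypothetical BDR configuration: one internal server, two public resolvers.
configured = ["192.168.1.10", "8.8.8.8", "1.1.1.1"]
print(external_dns_servers(configured))  # ['8.8.8.8', '1.1.1.1']
```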

T2 — Investigate

  1. Check if the target server has a DNS A record:
    # On the domain controller
    Get-DnsServerResourceRecord -ZoneName "[DOMAIN.COM]" -Name "[TARGET_HOSTNAME]"
  2. If the record is missing — add it manually or force a DNS registration from the target:
    # On the target server
    ipconfig /registerdns
  3. If the BDR is not domain-joined (common for DTC BDR appliances) — it relies on the DNS server configured on its NIC to resolve internal names. Verify it's pointing at the client's DC or a DNS server that hosts the internal zone.
  4. Check for DNS scavenging — stale records may have been cleaned up if the target hasn't refreshed its registration.

T3 — Rebuild

  1. If DNS infrastructure is unreliable — configure Veeam jobs to use IP addresses instead of hostnames as a permanent workaround. Document this in the job notes.
  2. For environments transitioning DNS to the UDM per DTC standard — ensure the UDM's DNS is configured to resolve internal domain names (either as a conditional forwarder to the DC or hosting the zone).
  3. If the BDR consistently can't resolve names — add static host entries as a last resort:
    notepad C:\Windows\System32\drivers\etc\hosts
    Add: [IP_ADDRESS] [HOSTNAME]

Root Cause Notes: DNS failures are the most overlooked Veeam issue. BDR appliances that aren't domain-joined often default to DHCP-assigned DNS (usually the UDM gateway), which may not resolve internal Active Directory hostnames. During client onboardings, verify the BDR's DNS points to a server that can resolve all backup targets.


Scenario 9: Storage Filled by Incorrect Backup Settings or Stale Backups

Symptoms: BDR or source server disk fills up. Backups fail with disk space errors. Investigation reveals excessive retention, forgotten backup jobs for decommissioned servers, or backup jobs writing to unintended locations.

T1 — Retry

  1. Identify what's consuming space — on the BDR:
    Get-ChildItem -Path "[REPO_PATH]" -Recurse | Group-Object Extension | Sort-Object @{E={($_.Group | Measure-Object Length -Sum).Sum}} -Descending | Select Name, Count, @{N='SizeGB';E={[math]::Round(($_.Group | Measure-Object Length -Sum).Sum/1GB,2)}}
  2. Check for obvious issues in the Veeam console: Backups → Disk — look for backup chains belonging to servers that no longer exist.
  3. If the issue is the source server's disk (not the BDR) — check C:\Windows\Installer size (see Scenario 10).

T2 — Investigate

  1. Audit all backup jobs — compare the list of active Veeam jobs against the client's current server/workstation inventory. Flag any jobs targeting decommissioned or migrated systems.
  2. Check retention settings on every job — DTC standard retention should be documented per client. Common over-provisioning: 30-day retention on a repo sized for 14 days.
  3. Look for duplicate jobs — techs sometimes create new jobs after migrations without disabling or removing the old ones, resulting in double backup storage consumption.
  4. Check if any jobs are writing to non-standard locations (C: drive instead of the dedicated repo volume, for example).
  5. Check for GFS (Grandfather-Father-Son) retention that's accumulating monthly/yearly fulls nobody intended: Right-click job → Edit → Storage → "Keep certain full backups longer for archival purposes" → Configure.
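
The audit in steps 1 and 3 is a comparison between two lists: what the jobs target and what the client actually runs. The Python sketch below shows that comparison on hypothetical data; the job names, host names, and function names are all illustrative, and exporting the real job list from Veeam is outside its scope.

```python
from collections import defaultdict

def stale_job_targets(job_targets: dict[str, str], inventory: set[str]) -> dict[str, str]:
    """Jobs whose target host is missing from the current inventory."""
    known = {host.lower() for host in inventory}
    return {job: target for job, target in job_targets.items()
            if target.lower() not in known}

def duplicate_targets(job_targets: dict[str, str]) -> dict[str, list[str]]:
    """Targets protected by more than one job, mapped to the job names."""
    by_target = defaultdict(list)
    for job, target in job_targets.items():
        by_target[target.lower()].append(job)
    return {t: sorted(jobs) for t, jobs in by_target.items() if len(jobs) > 1}

# Hypothetical job list and inventory.
jobs = {"SRV01-Daily": "SRV01", "SRV01-New": "srv01", "OLDSRV-Daily": "OLDSRV"}
inventory = {"SRV01", "WS-FRONT01"}

print(stale_job_targets(jobs, inventory))  # {'OLDSRV-Daily': 'OLDSRV'}
print(duplicate_targets(jobs))             # {'srv01': ['SRV01-Daily', 'SRV01-New']}
```

Host names are compared case-insensitively, since the same server is often entered with different casing across jobs.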

T3 — Rebuild

  1. Remove stale backup chains: right-click in Backups → Disk → Delete from disk. Verify with the team lead before deleting.
  2. Restructure retention if the repo can't support the current settings — reduce retention or add storage.
  3. Implement a quarterly backup audit practice — review all jobs per client, validate targets still exist, confirm retention is appropriate for repo size.
  4. If the problem is at the source server level — see Scenario 10 for the Windows Installer pattern and HALO 1125653.

Root Cause Notes: "Storage filled" is usually a process problem, not a technical one. It accumulates over months: a server gets migrated, the old job stays active, retention builds, nobody notices until the repo is full. Post-migration cleanup checklists should include a Veeam job audit step. DTC's Automation System Prompts document includes plans for a NinjaRMM-based Orphaned Installer Patch Monitor that addresses the source-side disk exhaustion pattern.


Scenario 10: Disk Space Exhaustion from Windows Installer (Orphaned Patches)

Symptoms: Source server or workstation C: drive fills up over time with no obvious cause. Investigation reveals C:\Windows\Installer is consuming 20–100+ GB. Backups fail because VSS can't create snapshots on a full volume.

HALO Reference: Ticket 1125653 — Dental workstation accumulated 128 GB of orphaned .msp/.msi patches in C:\Windows\Installer (57% of total disk). This caused cascading backup failures, application instability, and near-total disk exhaustion.

T1 — Retry

  1. Check disk space and the Installer folder size:
    Get-CimInstance Win32_LogicalDisk -Filter "DeviceID='C:'" | Select @{N='FreeGB';E={[math]::Round($_.FreeSpace/1GB,2)}}, @{N='TotalGB';E={[math]::Round($_.Size/1GB,2)}}
    
    (Get-ChildItem -Path "C:\Windows\Installer" -Recurse -Force -ErrorAction SilentlyContinue | Measure-Object Length -Sum).Sum / 1GB
  2. DTC thresholds: Warning if Installer > 20 GB, Critical if > 50 GB.
  3. If Installer is under 20 GB and disk has > 20% free — the backup failure is likely something else. Retry the job.
  4. If Installer is over 20 GB — escalate to T2.
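
The thresholds in steps 2 to 4 can be written out as a small decision function. This Python sketch is illustrative: the function names are hypothetical, and the steps do not cover an Installer folder of exactly 20 GB or a small Installer folder on a disk with 20% or less free, so the sketch labels that case explicitly instead of guessing.

```python
def installer_severity(installer_gb: float) -> str:
    """DTC thresholds: Warning above 20 GB, Critical above 50 GB."""
    if installer_gb > 50:
        return "critical"
    if installer_gb > 20:
        return "warning"
    return "ok"

def t1_decision(installer_gb: float, free_pct: float) -> str:
    """Apply T1 steps 3 and 4 to the measured values."""
    if installer_gb > 20:
        return "escalate to T2"
    if free_pct > 20:
        return "retry job"
    # Small Installer folder but low free space: not covered by the T1 steps.
    return "not covered by T1 steps"

print(installer_severity(128))   # critical (the 128 GB case from ticket 1125653)
print(t1_decision(12.0, 35.0))   # retry job
```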

T2 — Investigate

  1. Run DISM component cleanup (safe, no reboot required, does NOT use /ResetBase):
    DISM /Online /Cleanup-Image /StartComponentCleanup
    This cleans superseded WinSxS components — won't touch C:\Windows\Installer directly, but frees related space.
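  2. Run Windows Disk Cleanup with the system file option (from an elevated prompt). /sageset only saves the selection; /sagerun performs the cleanup:
    cleanmgr /sageset:1
    cleanmgr /sagerun:1
    In the /sageset dialog, select all categories, especially Windows Update Cleanup and Previous Windows Installations.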
  3. Check for other space consumers:
    # Check WinSxS size
    DISM /Online /Cleanup-Image /AnalyzeComponentStore
    
    # Check Windows Update cache
    (Get-ChildItem "C:\Windows\SoftwareDistribution\Download" -Recurse -Force -ErrorAction SilentlyContinue | Measure-Object Length -Sum).Sum / 1GB
  4. If C:\Windows\Installer is the primary consumer and is > 50 GB — escalate to T3 for orphaned patch cleanup.

T3 — Rebuild

⛔ WARNING: Do NOT blindly delete files from C:\Windows\Installer. Referenced files are required by installed applications. Deleting referenced files breaks MSI repair, uninstall, and update operations.

  1. Identify orphaned (unreferenced) files using registry queries — enumerate the Installer registry database to determine which .msp/.msi files are still referenced by installed products:
    • Referenced .msi files: HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\[GUID]\InstallProperties\ → LocalPackage value
    • Referenced .msp files: HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Patches\[GUID]\ → LocalPackage value
    • S-1-5-18 covers machine-wide installs only. Per-user installs register under that user's SID, so check the other SID subkeys under UserData as well.
    • Any file in C:\Windows\Installer NOT in the referenced list = orphaned
  2. Do NOT use Win32_Product WMI class — it triggers a consistency check/reconfigure on every installed MSI and can take 20+ minutes while destabilizing applications.
  3. Quarantine orphaned files before deleting — move to C:\DTC\InstallerCleanup\Quarantine[date]\ and monitor for 30 days.
  4. If disk is critically full (< 5% free) and quarantining would make it worse — delete orphans directly with detailed logging of every file removed.
  5. DTC is developing an automated NinjaRMM monitoring and cleanup solution for this pattern (ref: Orphaned Installer Patch Monitor project in Automation System Prompts). Until that's deployed, this is a manual T3 operation.
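
The orphan test in step 1 is a set difference: files present in the folder minus files named by a LocalPackage value. The Python sketch below shows only that comparison on hypothetical file names; `find_orphans` is an illustrative name, collecting the two input lists (directory listing and registry values) is assumed to have been done separately, and this is not the planned NinjaRMM solution.

```python
from pathlib import PureWindowsPath

def find_orphans(installer_files: list[str], referenced_packages: list[str]) -> list[str]:
    """Return .msi/.msp files that no LocalPackage value references.

    PureWindowsPath compares case-insensitively, matching Windows path rules.
    """
    referenced = {PureWindowsPath(p) for p in referenced_packages}
    orphans = []
    for name in installer_files:
        path = PureWindowsPath(name)
        if path.suffix.lower() in (".msi", ".msp") and path not in referenced:
            orphans.append(name)
    return orphans

# Hypothetical file names for illustration.
files = [
    r"C:\Windows\Installer\1a2b3c.msi",
    r"C:\Windows\Installer\4d5e6f.msp",
    r"C:\Windows\Installer\7a8b9c.msp",
]
referenced = [
    r"c:\windows\installer\1A2B3C.msi",  # registry casing often differs
    r"C:\Windows\Installer\4d5e6f.msp",
]
print(find_orphans(files, referenced))  # ['C:\\Windows\\Installer\\7a8b9c.msp']
```

The output is only a candidate list. The quarantine-first rule in step 3 still applies before anything is deleted.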

Root Cause Notes: Windows never cleans up C:\Windows\Installer on its own. Every MSI install, update, and patch caches files here. Over years, especially on dental workstations with frequent PMS updates (Dentrix, Eaglesoft, etc.), orphaned patches accumulate silently. This is a ticking time bomb — monitor proactively via NinjaRMM custom fields once the automated solution is deployed.


Quick Reference — Escalation Summary

| # | Scenario | T1 Action | T2 Action | T3 Action |
|---|----------|-----------|-----------|-----------|
| 1 | VSS Writer Failure | Stop job, check writers, retry | Reset VSS/NTDS services | Full server reboot |
| 2 | Repository Full | Check disk, retry if > 10% free | Review retention, clean orphaned chains | Capacity planning / add storage |
| 3 | Network Timeout (Cross-Subnet) | Test connectivity, retry | Check MTU, routing, throttling | Deploy same-subnet proxy |
| 4 | Agent Deployment (Intune) | Test DNS and ports | Manual agent install workaround | Intune policy change request |
| 5 | SQL Consistency | Check SQL service, retry | Reset SQL Writer, run DBCC | Vendor escalation / recovery model change |
| 6 | Chain Integrity Break | Run Active Full | Check repo storage, reset CBT | Delete chain and rebuild / Disk2VHD fallback |
| 7 | Dual-NIC / Defender | Check NICs, flush ARP | Disable unused NIC | Rebuild SET team / Intune policy fix |
| 8 | DNS Resolution | nslookup, use IP as workaround | Fix DNS record, verify BDR DNS config | Static hosts / DNS infrastructure fix |
| 9 | Stale Backups / Settings | Identify space consumers, check for stale chains | Audit jobs vs. inventory, review retention | Remove stale chains, quarterly audit process |
| 10 | Windows Installer Bloat | Check folder size vs. thresholds | DISM cleanup, Disk Cleanup | Registry-based orphan detection and cleanup |


Escalation to Veeam Support

When T3 cannot resolve the issue:

  1. Collect Veeam logs before calling — in the Veeam console: Main Menu → Help → Support Information → Export. This generates a .zip of all logs.
  2. Note the exact error message and job session ID from the Veeam console.
  3. Open a case with Veeam Support directly.
  4. Document the Veeam case number in the HALO ticket for tracking.

HALO tickets referenced in this playbook:

| Ticket | Relevance |
|--------|-----------|
| 1125653 | Windows Installer disk exhaustion pattern: 128 GB orphaned patches, cascading backup failures |
| 1117116 | Server migration: VSS NTDS writer failure, dual-NIC asymmetric routing, cross-subnet transport instability, Disk2VHD fallback, Intune/Defender agent deployment block |


Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | March 2026 | [Author] | Initial creation: 10 scenarios from HALO ticket analysis |