Overview
A backup that cannot be restored is no backup at all. Ironcore Backup Solution (IBS) operates two complementary safeguards:
- Verification: periodic SHA-256 integrity checks on every chunk in the datastore to detect bit rot, media failure, or tampering.
- Mock Recovery Drills: bi-annual (twice-yearly) end-to-end restore exercises from the Backup site to confirm full recoverability of the data and the playbook.
Prerequisites
- Administrator role on the Polystack platform
- At least one datastore with active backups
- For mock drills: dedicated compute and storage resources at the Backup site
Verification
What Verification Checks
For every snapshot in scope:
- Chunk presence: every chunk referenced by the manifest exists in the datastore.
- Chunk SHA-256: the recomputed hash matches the manifest reference.
- GCM auth tag: the AES-256-GCM tag validates that the ciphertext is unmodified.
- Manifest signature: the manifest signature is valid.
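The first two checks can be sketched in a few lines of Python. This is a minimal illustration, not the IBS verifier: the manifest layout (a mapping from chunk file name to expected SHA-256 hex digest) and the function name `verify_chunks` are assumptions made for the example, and the GCM tag and manifest signature checks are left out.

```python
import hashlib
from pathlib import Path


def verify_chunks(manifest: dict[str, str], chunk_dir: Path) -> list[tuple[str, str]]:
    """Return (chunk_name, reason) pairs for every chunk that fails a check.

    `manifest` maps a chunk file name to its expected SHA-256 hex digest.
    This layout is hypothetical and only serves the illustration.
    """
    failures = []
    for name, expected in manifest.items():
        path = chunk_dir / name
        # Chunk presence: the referenced chunk must exist in the datastore.
        if not path.is_file():
            failures.append((name, "MISSING"))
            continue
        # Chunk SHA-256: the recomputed hash must match the manifest reference.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != expected:
            failures.append((name, "CORRUPT"))
    return failures
```

An empty list means every referenced chunk is present and matches its digest.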
Schedule a Verification Job
Configure the schedule
Set:
- Schedule: `Sun 06:00` (weekly is the default)
- Scope: All snapshots, or filter by namespace / age
- Skip recently verified: Skip snapshots verified within the last `keep-verified` window (default: 30 days) to limit IO load
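The skip-recently-verified rule amounts to a cutoff comparison. The sketch below shows that logic under stated assumptions: the function name `snapshots_due` and the input mapping of snapshot ID to last verification time are hypothetical, and IBS may apply the window differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def snapshots_due(last_verified: dict[str, Optional[datetime]],
                  now: datetime,
                  window_days: int = 30) -> list[str]:
    """Return snapshot IDs whose last verification is older than the window.

    Snapshots that were never verified (None) are always due.
    """
    cutoff = now - timedelta(days=window_days)
    return [snap for snap, ts in last_verified.items()
            if ts is None or ts < cutoff]


# Example: with the default 30-day window, only the stale and the
# never-verified snapshots are selected.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
seen = {
    "s1": now - timedelta(days=3),
    "s2": now - timedelta(days=45),
    "s3": None,
}
due = snapshots_due(seen, now)
```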
Set notification
Attach the operational notification group. Failures dispatch immediately;
successful runs dispatch on the configured success policy.
Verification Report
| Field | Description |
|---|---|
| Snapshot | The snapshot under verification |
| Started | When verification began |
| Duration | Elapsed verification time |
| Chunks checked | Number of chunks read and validated |
| Result | OK, CORRUPT, MISSING, or STALE |
| First failure | First chunk fingerprint and reason |
| Verifier | Job ID that produced the report |
Limit Verification Load
Verification reads every chunk, which is IO-intensive. Tune for production:
| Setting | Default | Effect |
|---|---|---|
| `--verify-window-days` | 30 | Skip snapshots verified within N days |
| `--max-concurrent-snapshots` | 4 | Parallel verification work |
| `--bwlimit` | unlimited | Cap read bandwidth |
| `--io-priority` | normal | Lower IO priority during verification |
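When choosing a bandwidth cap, it helps to estimate how long a full read of the in-scope data would take. The helper below is a rough sizing aid only. It assumes the cap is expressed in MiB/s and sustained for the whole run; the units accepted by `--bwlimit` are not stated here, so check them before applying the result.

```python
def verify_read_hours(total_gib: float, bwlimit_mib_s: float) -> float:
    """Hours needed to read `total_gib` GiB at a sustained cap of `bwlimit_mib_s` MiB/s."""
    if bwlimit_mib_s <= 0:
        raise ValueError("bandwidth cap must be positive")
    seconds = total_gib * 1024 / bwlimit_mib_s  # 1 GiB = 1024 MiB
    return seconds / 3600


# Illustrative figures only: 10 TiB of chunks read at 200 MiB/s.
hours = verify_read_hours(total_gib=10 * 1024, bwlimit_mib_s=200)
```

With these illustrative figures the read takes roughly 14.6 hours, which would not fit an overnight window without raising the cap or narrowing the scope.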
Mock Recovery Drills
A mock drill is a scheduled, end-to-end recovery exercise using backup data from the Backup site. The drill validates the recovery playbook, IBS data, restored application correctness, and operator readiness, all without touching the production environment.
Why Bi-Annual?
Bi-annual drills satisfy common compliance frameworks (banking, government infrastructure, telecom). Two drills per year balance preparedness against operator burden. Higher-stakes environments may drill quarterly.
Drill Roles and Resources
| Role | Responsibility |
|---|---|
| Drill coordinator | Plans the drill, sets success criteria, captures findings |
| Backup admin | Provides Backup site access, encryption keys, restore configuration |
| Workload owner | Validates the restored workload functions correctly |
| Auditor | Records timings, deviations, and outcomes |
Resources required at the Backup site:
- Compute resources matching the largest workload to be drilled
- Storage capacity for rehydration of archived backup data
- Network connectivity between the Backup site and the restored workload’s test network
Drill Workflow
Schedule and announce
Schedule the drill at least 4 weeks in advance. Announce to all participating
teams and the executive sponsor. Drills run twice per year on a recurring
calendar.
Select workloads
Select a representative sample:
- One Tier-0 production VM
- One Tier-1 production VM
- One container
- One physical host backup
- One database with point-in-time recovery requirements
Provision drill infrastructure
On the Backup site, provision compute hosts and a target storage backend
for the drill workloads. Use a dedicated test network — never connect
drill workloads to the production network.
Execute the restore
For each selected workload:
- Identify the most recent weekly archival snapshot in the Backup site datastore.
- Restore using the Backup Solution Dashboard or CLI.
- Record start time, time to first guest availability (for live-restore), and time to full restore completion.
Drill restore example
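A minimal Python sketch of the timing record from the last step. The `restore` callable is a hypothetical stand-in for whatever performs the restore (Dashboard automation or a CLI wrapper). It is not an IBS API, and the sketch assumes it blocks until the restore completes and invokes the supplied callback once the guest first becomes available.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RestoreTiming:
    workload: str
    started_at: float                          # wall-clock epoch seconds
    seconds_to_first_guest: Optional[float]    # None if never reported
    seconds_to_complete: float


def timed_restore(workload: str,
                  restore: Callable[[Callable[[], None]], None],
                  clock: Callable[[], float] = time.monotonic) -> RestoreTiming:
    """Run `restore` and record the timings a drill report needs."""
    started_at = time.time()
    t0 = clock()
    first_guest: list[float] = []

    def guest_available() -> None:
        # Only the first notification counts as time to first availability.
        if not first_guest:
            first_guest.append(clock() - t0)

    restore(guest_available)                   # assumed to block until complete
    total = clock() - t0
    return RestoreTiming(workload, started_at,
                         first_guest[0] if first_guest else None, total)
```

The elapsed times use a monotonic clock, so they are unaffected by wall-clock adjustments during a long restore.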
Validate the workload
The workload owner runs a defined functional test:
- Service starts
- Critical processes are running
- Application responds to its standard health check
- Database connectivity, replication state, and integrity all healthy
Restore individual files
From the same snapshot, restore an individual file and a directory archive.
Confirm content, permissions, and modification times match expectations.
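The content, permission, and modification-time comparison can be scripted when a reference copy of the file is available. The following is a sketch under that assumption; the function name `compare_restored` is hypothetical, and modification times are compared at one-second granularity because some filesystems do not preserve sub-second precision.

```python
import hashlib
import stat
from pathlib import Path


def compare_restored(original: Path, restored: Path) -> list[str]:
    """Return the names of the attributes that differ between two files."""
    diffs = []
    # Content: compare SHA-256 digests of the two files.
    if (hashlib.sha256(original.read_bytes()).digest()
            != hashlib.sha256(restored.read_bytes()).digest()):
        diffs.append("content")
    a, b = original.stat(), restored.stat()
    # Permissions: compare only the permission bits of the mode.
    if stat.S_IMODE(a.st_mode) != stat.S_IMODE(b.st_mode):
        diffs.append("permissions")
    # Modification time: whole seconds only.
    if int(a.st_mtime) != int(b.st_mtime):
        diffs.append("mtime")
    return diffs
```

An empty list means the restored file matches the reference on all three attributes.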
Tear down
De-provision drill infrastructure. Confirm no drill data remains accessible
to production users.
Capture findings
Produce a drill report covering:
- Snapshots restored, with timings
- Workloads validated, with pass / fail status
- Deviations from the recovery playbook
- Issues found in the playbook, infrastructure, or backup data
- Action items with owners and deadlines
Drill Success Criteria
| Criterion | Threshold |
|---|---|
| Restore RPO | Most recent weekly archival snapshot ≤ 7 days old |
| Restore RTO | Recovery completes within the published target |
| Live-restore time to running | ≤ 5 seconds for any tier |
| Application validation | All functional tests pass |
| File-level restore | Single-file restore completes in ≤ 1 minute |
| Playbook accuracy | No critical deviations from documented procedure |
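The criteria in the table can be evaluated mechanically once the drill measurements are recorded. The sketch below assumes a hypothetical `DrillMeasurement` record; the RTO target is supplied by the caller because the published target is site-specific.

```python
from dataclasses import dataclass


@dataclass
class DrillMeasurement:
    snapshot_age_days: float
    restore_seconds: float
    rto_target_seconds: float      # the published target, supplied by the caller
    live_restore_seconds: float
    functional_tests_passed: bool
    file_restore_seconds: float
    critical_deviations: int


def evaluate(m: DrillMeasurement) -> dict[str, bool]:
    """Map each criterion from the table to a pass (True) or fail (False)."""
    return {
        "Restore RPO": m.snapshot_age_days <= 7,
        "Restore RTO": m.restore_seconds <= m.rto_target_seconds,
        "Live-restore time to running": m.live_restore_seconds <= 5,
        "Application validation": m.functional_tests_passed,
        "File-level restore": m.file_restore_seconds <= 60,
        "Playbook accuracy": m.critical_deviations == 0,
    }


# Illustrative values only; the 2-hour RTO target is made up for the example.
result = evaluate(DrillMeasurement(
    snapshot_age_days=6, restore_seconds=5400, rto_target_seconds=7200,
    live_restore_seconds=8, functional_tests_passed=True,
    file_restore_seconds=42, critical_deviations=0,
))
failed = [name for name, ok in result.items() if not ok]
```

In this example the only failed criterion is the live-restore time, which exceeds the 5-second threshold.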
Automate the Drill Restore Step
The restore portion of a drill can be automated via a sync job that periodically restores a sample workload to a test environment. The output is compared against a known-good baseline.
Automated weekly restore-test job
- Picks the latest matching snapshot
- Restores to the test target
- Runs the validation script and records exit code
- Tears down the test target
- Emits a notification with the result
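The five steps above can be sketched as a small orchestration function. Every callable passed in (`pick_snapshot`, `restore`, `teardown`, `notify`) is a hypothetical placeholder for site-specific tooling, not an IBS interface. The validation script is run as an external command and its exit code is recorded.

```python
import subprocess
from typing import Callable, Sequence


def run_restore_test(pick_snapshot: Callable[[], str],
                     restore: Callable[[str], None],
                     validate_cmd: Sequence[str],
                     teardown: Callable[[], None],
                     notify: Callable[[str, int], None]) -> int:
    """Restore one snapshot to a test target, validate it, and report the result."""
    snapshot = pick_snapshot()                 # latest matching snapshot
    try:
        restore(snapshot)                      # restore to the test target
        # Run the validation script; a non-zero exit code means failure.
        exit_code = subprocess.run(list(validate_cmd), check=False).returncode
    finally:
        teardown()                             # always remove the test target
    notify(snapshot, exit_code)                # emit the result
    return exit_code
```

The `try`/`finally` ensures the test target is torn down even if the restore or the validation step raises. In that case the exception propagates to the caller and no result notification is sent, so a scheduler wrapping this function would need to report the failure itself.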
Compliance Reporting
The verification and drill systems produce evidence suitable for compliance audits:
| Evidence | Source |
|---|---|
| Per-snapshot verification log | Snapshot detail > Verification tab |
| Periodic verification job history | Tasks panel filtered by verify |
| Mock drill reports | Filed in compliance documentation |
| Encryption key rotation history | Audit log |
| Access grant history | Audit log |
| Replication completion log | Tasks panel filtered by sync |
Open Audit Log > Export. Choose the time range and the categories
to export. The export is delivered as CSV and JSON.
Troubleshooting
Verification job takes longer than allowed window
Reduce scope (raise `--verify-window-days` so that more recently verified snapshots are skipped), increase parallelism with `--max-concurrent-snapshots`, or split verification into multiple jobs targeting different namespaces.
Drill restore is slower than production restore
The Backup site typically has fewer or slower storage tiers than Primary.
Drill targets should be provisioned to match the expected disaster-recovery
capacity — undersized drill infrastructure can give a false negative.
Verification report stuck in `STALE`
Verification is not running. Confirm the schedule and check the task
history. If the worker is offline, restart `ironcore-backup-verifier`.
Drill validation script returns non-zero on a healthy workload
The validation script may be tied to production-specific paths or hostnames.
Maintain drill-specific overrides in the validation script.
Next Steps
- Replication and Sync: Backup site replication required for drills
- Security and Encryption: Encryption keys required to restore during a drill
- Notifications: Alert routing for verification and drill failures
- Architecture: Underlying integrity and append-only design
