# Verification and Validation

> Schedule SHA-256 verification jobs, run integrity checks on backups, and conduct bi-annual mock recovery drills for Ironcore Backup Solution.

## Overview

A backup that cannot be restored is no backup at all. Ironcore Backup Solution
(IBS) operates two complementary safeguards:

* **Verification** — periodic SHA-256 integrity checks on every chunk in the
  datastore to detect bit rot, media failure, or tampering.
* **Mock Recovery Drills** — bi-annual end-to-end restore exercises from the
  Backup site to confirm full recoverability of the data and the playbook.

This page covers the schedule, automation, and operational procedure for both.

<Note>
  **Prerequisites**

  * Administrator role on the Polystack platform
  * At least one datastore with active backups
  * For mock drills: dedicated compute and storage resources at the Backup site
</Note>

***

## Verification

### What Verification Checks

For every snapshot in scope:

1. **Chunk presence** — every chunk referenced by the manifest exists in the datastore.
2. **Chunk SHA-256** — the recomputed hash matches the manifest reference.
3. **GCM auth tag** — the AES-256-GCM tag validates the ciphertext is unmodified.
4. **Manifest signature** — the manifest signature is valid.

Any failure marks the snapshot as **CORRUPT** in the verification report and
emits a notification.
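
As a rough illustration of check 2 only, the sketch below recomputes a chunk's SHA-256 with the standard `sha256sum` utility and compares it with an expected digest. IBS performs this check internally. The `verify_chunk` helper, its output strings, and the idea of a chunk stored as a plain file are assumptions made for the example, and the GCM tag and manifest signature checks are not shown.

```bash
#!/usr/bin/env bash
# Illustrative sketch, not the IBS implementation.
# Assumes (hypothetically) that a chunk is a plain file and that the
# manifest supplies its expected SHA-256 as a lowercase hex string.

# verify_chunk FILE EXPECTED_SHA256
# Prints "OK" and returns 0 on a match; otherwise prints the reason and returns 1.
verify_chunk() {
  local file="$1" expected="$2" actual
  if [ ! -f "$file" ]; then
    echo "CORRUPT: chunk missing"
    return 1
  fi
  # Reading from stdin makes sha256sum print "<digest>  -"; keep the digest only.
  actual="$(sha256sum < "$file" | cut -d' ' -f1)"
  if [ "$actual" = "$expected" ]; then
    echo "OK"
  else
    echo "CORRUPT: hash mismatch"
    return 1
  fi
}
```

Because the helper returns non-zero on any failure, a caller can stop at the first bad chunk and record the printed reason, in the spirit of the **First failure** field of the verification report.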

### Schedule a Verification Job

<Tabs>
  <Tab title="Deployment Console" icon="gauge">
    <Steps titleSize="h3">
      <Step title="Open Verification" icon="external-link">
        Navigate to **Backup Solution** > **Datastores** > *(select datastore)*
        \> **Verification**.
      </Step>

      <Step title="Configure the schedule" icon="calendar">
        Set:

        * **Schedule**: `Sun 06:00` (weekly is the default)
        * **Scope**: All snapshots, or filter by namespace / age
        * **Skip recently verified**: Skip snapshots verified within the last
          `keep-verified` window (default: 30 days; set with
          `--verify-window-days` in the CLI) to limit IO load
      </Step>

      <Step title="Set notification" icon="bell">
        Attach the operational notification group. Failures dispatch immediately;
        successful runs dispatch on the configured success policy.
      </Step>

      <Step title="Save" icon="check">
        Click **Save**.

        <Check>The verification job runs on the next scheduled tick and produces a report visible under the snapshot Verification tab.</Check>
      </Step>
    </Steps>
  </Tab>

  <Tab title="CLI" icon="terminal">
    ```bash title="Configure datastore verification" theme={null}
    ironcore-backup datastore update \
      --name ibs-primary \
      --verify-schedule "Sun 06:00" \
      --verify-window-days 30 \
      --verify-notification ops-team
    ```

    ```bash title="Trigger one-shot verification of a datastore" theme={null}
    ironcore-backup verify start ibs-primary
    ```

    ```bash title="Verify a single snapshot" theme={null}
    ironcore-backup verify start \
      ibs-primary:vm/12345/2026-05-21T00:00:00Z
    ```
  </Tab>
</Tabs>
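
The **Skip recently verified** setting can be read as a simple age test on each snapshot's last successful verification. The sketch below is one plausible reading, not the IBS scheduler itself: the `needs_verification` helper and the choice to treat a snapshot exactly at the window boundary as due are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper illustrating the skip window; not part of the IBS CLI.

# needs_verification LAST_VERIFIED_EPOCH WINDOW_DAYS [NOW_EPOCH]
# Returns 0 (due) when the snapshot was never verified or its last
# verification is at least WINDOW_DAYS old; returns 1 (skip) otherwise.
needs_verification() {
  local last="${1:-0}" window_days="$2" now="${3:-$(date +%s)}"
  local window_secs=$(( window_days * 86400 ))
  # A last-verified time of 0 stands for "never verified".
  if [ "$last" -eq 0 ]; then
    return 0
  fi
  [ $(( now - last )) -ge "$window_secs" ]
}

# Example: with a 30-day window, a snapshot verified 31 days ago is due.
now=1750000000
if needs_verification $(( now - 31 * 86400 )) 30 "$now"; then
  echo "due for verification"
fi
```

A longer window skips more snapshots per run and so lowers IO load, at the cost of a longer gap between checks of any one snapshot.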

### Verification Report

| Field              | Description                            |
| ------------------ | -------------------------------------- |
| **Snapshot**       | The snapshot under verification        |
| **Started**        | When verification began                |
| **Duration**       | Elapsed verification time              |
| **Chunks checked** | Number of chunks read and validated    |
| **Result**         | `OK`, `CORRUPT`, `MISSING`, or `STALE` |
| **First failure**  | First chunk fingerprint and reason     |
| **Verifier**       | Job ID that produced the report        |

A snapshot's verification status moves through the following states:

```mermaid theme={null}
stateDiagram-v2
    [*] --> Unverified: Snapshot created
    Unverified --> OK: First verification passes
    OK --> Stale: Verification window expired
    Stale --> OK: Re-verified
    Stale --> CORRUPT: Re-verification fails
    OK --> CORRUPT: Verification detects bit rot or tampering
    CORRUPT --> [*]: Operator restores from Backup site
```
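
The transitions in the diagram can be written as a small lookup. The sketch below is illustrative only: the `next_state` helper and the event names `pass`, `fail`, and `window_expired` are hypothetical, and state names are upper-cased to match the **Result** column. The diagram does not show a failed first verification, so the sketch reports that combination as undefined rather than guessing.

```bash
#!/usr/bin/env bash
# Illustrative encoding of the state diagram; event names are hypothetical.

# next_state CURRENT_STATE EVENT
# Prints the resulting state, or returns 1 if the diagram defines no such transition.
next_state() {
  case "$1:$2" in
    UNVERIFIED:pass)   echo "OK" ;;
    OK:window_expired) echo "STALE" ;;
    STALE:pass)        echo "OK" ;;
    STALE:fail)        echo "CORRUPT" ;;
    OK:fail)           echo "CORRUPT" ;;
    *)                 echo "undefined transition: $1 + $2" >&2; return 1 ;;
  esac
}

next_state OK window_expired   # prints STALE
next_state STALE fail          # prints CORRUPT
```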

### Limit Verification Load

Verification reads every chunk, which is IO-intensive. Tune for production:

| Setting                      | Default   | Effect                                |
| ---------------------------- | --------- | ------------------------------------- |
| `--verify-window-days`       | 30        | Skip snapshots verified within N days |
| `--max-concurrent-snapshots` | 4         | Parallel verification work            |
| `--bwlimit`                  | unlimited | Cap read bandwidth                    |
| `--io-priority`              | normal    | Lower IO priority during verification |

<Tip>
  For very large datastores, configure verification to run hourly with a short
  window (`keep-last=20`) — this gradually verifies every snapshot over a
  rolling period without a heavy weekly spike.
</Tip>
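
To judge whether a bandwidth cap still lets a full pass finish in the time available, a back-of-the-envelope estimate helps. The sketch below assumes the cap is expressed in MiB/s and that the job sustains it for the whole run. The unit accepted by `--bwlimit` is not stated on this page, and `estimate_hours` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Rough estimate only; real runs also depend on chunk layout, concurrency, and IO priority.

# estimate_hours DATA_GIB BANDWIDTH_MIB_PER_SEC
# Prints the whole number of hours (rounded up) needed to read DATA_GIB.
estimate_hours() {
  local data_gib="$1" mib_per_sec="$2"
  local seconds=$(( data_gib * 1024 / mib_per_sec ))
  echo $(( (seconds + 3599) / 3600 ))
}

# Example: 10 TiB (10240 GiB) read at 200 MiB/s.
estimate_hours 10240 200   # prints 15
```

If the estimate exceeds the time between scheduled runs, the settings in the table above are the ones to adjust.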

***

## Mock Recovery Drills

A mock drill is a scheduled, end-to-end recovery exercise using backup data
from the Backup site. The drill validates the recovery playbook, IBS data,
restored application correctness, and operator readiness — without touching
the production environment.

### Why Bi-Annual?

Bi-annual drills match the recovery-testing cadence commonly expected in
regulated sectors such as banking, government infrastructure, and telecom. Two
drills per year balance preparedness against operator burden. Higher-stakes
environments may drill quarterly.

### Drill Roles and Resources

| Role                  | Responsibility                                                      |
| --------------------- | ------------------------------------------------------------------- |
| **Drill coordinator** | Plans the drill, sets success criteria, captures findings           |
| **Backup admin**      | Provides Backup site access, encryption keys, restore configuration |
| **Workload owner**    | Validates the restored workload functions correctly                 |
| **Auditor**           | Records timings, deviations, and outcomes                           |

Required resources at the Backup site:

* Compute resources matching the largest workload to be drilled
* Storage capacity for rehydration of archived backup data
* Network connectivity between the Backup site and the restored workload's
  test network

### Drill Workflow

<Steps titleSize="h3">
  <Step title="Schedule and announce" icon="calendar">
    Schedule the drill at least 4 weeks in advance. Announce to all participating
    teams and the executive sponsor. Drills run twice per year on a recurring
    calendar.
  </Step>

  <Step title="Select workloads" icon="server">
    Select a representative sample:

    * One Tier-0 production VM
    * One Tier-1 production VM
    * One container
    * One physical host backup
    * One database with point-in-time recovery requirements
  </Step>

  <Step title="Provision drill infrastructure" icon="hard-drive">
    On the Backup site, provision compute hosts and a target storage backend
    for the drill workloads. Use a dedicated test network — never connect
    drill workloads to the production network.
  </Step>

  <Step title="Execute the restore" icon="rotate-ccw">
    For each selected workload:

    1. Identify the most recent weekly archival snapshot in the Backup site datastore.
    2. Restore using the Backup Solution Dashboard or CLI.
    3. Record start time, time to first guest availability (for live-restore),
       and time to full restore completion.

    ```bash title="Drill restore example" theme={null}
    ironcore-backup vm restore \
      --snapshot ibs-archival:production/vm/12345/2026-05-19T03:00:00Z \
      --target-host drill-compute-01 \
      --storage drill-storage \
      --rename drill-vm-12345-2026-05-21 \
      --live
    ```
  </Step>

  <Step title="Validate the workload" icon="check">
    The workload owner runs a defined functional test:

    * Service starts
    * Critical processes are running
    * Application responds to its standard health check
    * Database connectivity, replication state, and integrity are all healthy

    Record any deviations from production behaviour.
  </Step>

  <Step title="Restore individual files" icon="file">
    From the same snapshot, restore an individual file and a directory archive.
    Confirm content, permissions, and modification times match expectations.
  </Step>

  <Step title="Tear down" icon="trash-2">
    De-provision drill infrastructure. Confirm no drill data remains accessible
    to production users.
  </Step>

  <Step title="Capture findings" icon="clipboard-list">
    Produce a drill report covering:

    * Snapshots restored, with timings
    * Workloads validated, with pass / fail status
    * Deviations from the recovery playbook
    * Issues found in the playbook, infrastructure, or backup data
    * Action items with owners and deadlines

    File the report in the compliance documentation system.
  </Step>
</Steps>

### Drill Success Criteria

| Criterion                    | Threshold                                         |
| ---------------------------- | ------------------------------------------------- |
| Restore RPO                  | Most recent weekly archival snapshot ≤ 7 days old |
| Restore RTO                  | Recovery completes within the published target    |
| Live-restore time to running | ≤ 5 seconds for any tier                          |
| Application validation       | All functional tests pass                         |
| File-level restore           | Single-file restore completes in ≤ 1 minute       |
| Playbook accuracy            | No critical deviations from documented procedure  |

<Warning>
  A failed drill is a critical finding. Remediate every defect it exposes in the
  restoration playbook, backup data, or infrastructure before the next
  production change window. File a formal incident if the failure indicates
  data loss.
</Warning>
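
The numeric thresholds in the table lend themselves to a mechanical check once the auditor has recorded timings. The following sketch is one way to do that. The `check_drill` and `within_limit` helpers and their argument order are hypothetical, and the RTO target is passed in because this page does not publish a value. Application validation and playbook accuracy remain human judgments and are not covered.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for scoring recorded drill timings against the table above.
# Integer values only, because shell arithmetic has no floating point.

# within_limit LABEL MEASURED LIMIT
# Prints a PASS/FAIL line and returns 1 when MEASURED exceeds LIMIT.
within_limit() {
  if [ "$2" -le "$3" ]; then
    echo "PASS  $1 ($2 <= $3)"
  else
    echo "FAIL  $1 ($2 > $3)"
    return 1
  fi
}

# check_drill SNAPSHOT_AGE_DAYS LIVE_RESTORE_SECS FILE_RESTORE_SECS RESTORE_MINS RTO_TARGET_MINS
# Returns 1 if any threshold is exceeded.
check_drill() {
  local status=0
  within_limit "Snapshot age (days)"          "$1" 7    || status=1
  within_limit "Live-restore to running (s)"  "$2" 5    || status=1
  within_limit "Single-file restore (s)"      "$3" 60   || status=1
  within_limit "Full restore vs RTO (min)"    "$4" "$5" || status=1
  return "$status"
}
```

For instance, `check_drill 6 4 45 90 120` prints four `PASS` lines and returns 0, while a snapshot age of 8 days produces a `FAIL` line and a non-zero return.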

### Automate the Drill Restore Step

The restore portion of a drill can be automated with a restore-test job that
periodically restores a sample workload to a test environment. The output is
compared against a known-good baseline.

```bash title="Automated weekly restore-test job" theme={null}
ironcore-backup restore-test create \
  --name weekly-vm-restore-test \
  --schedule "Mon 09:00" \
  --snapshot-filter "namespace:production,backup-id:vm/12345" \
  --target-host drill-compute-01 \
  --target-storage drill-storage \
  --rename "test-{{snapshot-id}}" \
  --validate-script /etc/ironcore/drill-validate.sh \
  --teardown
```

The restore-test job:

1. Picks the latest matching snapshot
2. Restores to the test target
3. Runs the validation script and records the exit code
4. Tears down the test target
5. Emits a notification with the result
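
A validation script for step 3 might look like the sketch below. It is a starting point under stated assumptions, not a script shipped with IBS: this page does not say how the restore-test job tells the script where the restored workload lives, so the `DRILL_TARGET` variable, the `/healthz` path, and the `check` helper are all hypothetical. The only contract taken from the steps above is that the job records the script's exit code.

```bash
#!/usr/bin/env bash
# Sketch of a drill validation script (for example /etc/ironcore/drill-validate.sh).
# DRILL_TARGET and the /healthz path are hypothetical placeholders.

failures=0

# check DESCRIPTION COMMAND [ARGS...]
# Runs the command quietly, prints a PASS/FAIL line, and counts failures.
check() {
  local description="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $description"
  else
    echo "FAIL  $description"
    failures=$(( failures + 1 ))
  fi
}

# Run the real checks only when a target address has been supplied.
if [ -n "${DRILL_TARGET:-}" ]; then
  # ping flags as on Linux iputils: one probe, two-second timeout.
  check "host answers ping" ping -c 1 -W 2 "$DRILL_TARGET"
  check "health endpoint responds" \
    curl --fail --silent --max-time 5 "http://$DRILL_TARGET/healthz"
  # A non-zero exit code signals a failed validation to the caller.
  [ "$failures" -eq 0 ] || exit 1
fi
```

Reading the target from a variable instead of hard-coding production hostnames also helps avoid the false failures described under Troubleshooting.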

***

## Compliance Reporting

The verification and drill systems produce evidence suitable for compliance
audits:

| Evidence                          | Source                             |
| --------------------------------- | ---------------------------------- |
| Per-snapshot verification log     | Snapshot detail > Verification tab |
| Periodic verification job history | Tasks panel filtered by `verify`   |
| Mock drill reports                | Filed in compliance documentation  |
| Encryption key rotation history   | Audit log                          |
| Access grant history              | Audit log                          |
| Replication completion log        | Tasks panel filtered by `sync`     |

Export evidence:

<Tabs>
  <Tab title="Deployment Console" icon="gauge">
    Open **Audit Log** > **Export**. Choose the time range and the categories
    to export. The export is delivered as CSV and JSON.
  </Tab>

  <Tab title="CLI" icon="terminal">
    ```bash theme={null}
    ironcore-backup audit export \
      --since 2026-01-01 \
      --until 2026-06-30 \
      --categories verify,sync,prune,access \
      --output /tmp/audit-2026-h1.tar.gz
    ```
  </Tab>
</Tabs>

***

## Troubleshooting

<AccordionGroup>
  <Accordion title="Verification job takes longer than allowed window" icon="clock">
    Reduce scope (raise `--verify-window-days` so that more recently verified
    snapshots are skipped), increase parallelism (`--max-concurrent-snapshots`),
    or split verification into multiple jobs targeting different namespaces.
  </Accordion>

  <Accordion title="Drill restore is slower than production restore" icon="gauge">
    The Backup site typically has fewer or slower storage tiers than Primary.
    Drill targets should be provisioned to match the expected disaster-recovery
    capacity — undersized drill infrastructure can give a false negative.
  </Accordion>

  <Accordion title="Verification report stuck in `STALE`" icon="alert-triangle">
    Verification is not running. Confirm the schedule and check the task
    history. If the worker is offline, restart `ironcore-backup-verifier`.
  </Accordion>

  <Accordion title="Drill validation script returns non-zero on a healthy workload" icon="check">
    The validation script may be tied to production-specific paths or hostnames.
    Maintain drill-specific overrides in the validation script.
  </Accordion>
</AccordionGroup>

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Replication and Sync" icon="arrow-left-right" href="/services/ironcore-backup/admin-guide/replication-sync" color="#bf9667">
    Backup site replication required for drills
  </Card>

  <Card title="Security and Encryption" icon="lock" href="/services/ironcore-backup/admin-guide/security-encryption" color="#bf9667">
    Encryption keys required to restore during a drill
  </Card>

  <Card title="Notifications" icon="bell" href="/services/ironcore-backup/admin-guide/notifications" color="#bf9667">
    Alert routing for verification and drill failures
  </Card>

  <Card title="Architecture" icon="layers" href="/services/ironcore-backup/admin-guide/architecture" color="#bf9667">
    Underlying integrity and append-only design
  </Card>
</CardGroup>
