# Backup Solution Admin Troubleshooting

> Diagnose backup, restore, sync, verification, and datastore failures across Ironcore Backup Solution as an administrator.

## Overview

This page covers administrator-level diagnostic procedures across every layer
of Ironcore Backup Solution (IBS) — service health, datastore integrity, sync
failures, encryption key issues, and capacity exhaustion. For operator-level
issues, see the [User Troubleshooting page](/services/ironcore-backup/user-guide/troubleshooting).

<Note>
  **Diagnostic Inputs Expected From Operators**

  * Task UPID
  * Exact failure text
  * Affected datastore and namespace
  * Time of failure (with timezone)
  * Frequency (one-off or recurring)
</Note>

***

## Service Health Checks

### Service Inventory

```bash title="Confirm all services are running" theme={null}
ironcore-backup status --services
```

| Service                     | Expected State     |
| --------------------------- | ------------------ |
| `ironcore-backup-api`       | `active (running)` |
| `ironcore-backup-scheduler` | `active (running)` |
| `ironcore-backup-worker`    | `active (running)` |
| `ironcore-backup-verifier`  | `active (running)` |
| `ironcore-backup-gc`        | `active (running)` |
| `ironcore-backup-sync`      | `active (running)` |
| `ironcore-backup-notify`    | `active (running)` |
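
The inventory above can also be cross-checked against systemd directly. The following is a minimal sketch, assuming each service in the table runs as a systemd unit of the same name. This page refers to the verifier and notify units that way; the remaining unit names are an assumption. `check_ibs_units` is a hypothetical helper, not part of the IBS CLI.

```bash
#!/usr/bin/env bash
# Sketch: report any IBS unit that systemd does not consider active.
# Assumption: unit names match the service names in the table above.

IBS_UNITS=(
  ironcore-backup-api
  ironcore-backup-scheduler
  ironcore-backup-worker
  ironcore-backup-verifier
  ironcore-backup-gc
  ironcore-backup-sync
  ironcore-backup-notify
)

# Prints one line per unhealthy unit; returns non-zero if any unit is not active.
check_ibs_units() {
  local unit state failed=0
  for unit in "${IBS_UNITS[@]}"; do
    # `systemctl is-active` prints the unit state (active, inactive, failed, ...).
    state=$(systemctl is-active "$unit" 2>/dev/null)
    if [[ "$state" != "active" ]]; then
      echo "$unit: ${state:-unknown}"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_ibs_units` prints nothing when every unit is active. For any unit it does list, `systemctl status <unit>` and the matching log file are the next places to look.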

### Log Locations

| Service           | Log Path                                 |
| ----------------- | ---------------------------------------- |
| API               | `/var/log/ironcore-backup/api.log`       |
| Scheduler         | `/var/log/ironcore-backup/scheduler.log` |
| Worker            | `/var/log/ironcore-backup/worker.log`    |
| Verifier          | `/var/log/ironcore-backup/verifier.log`  |
| Garbage Collector | `/var/log/ironcore-backup/gc.log`        |
| Sync              | `/var/log/ironcore-backup/sync.log`      |
| Notifications     | `/var/log/ironcore-backup/notify.log`    |

<Tip>
  All service logs use structured JSON with a consistent schema. Use `jq` to
  filter and aggregate. Logs ship to the platform metric server via the
  notification system when configured.
</Tip>
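
As a rough illustration of filtering with `jq`, the sketch below assumes one JSON object per line with `level`, `ts`, and `msg` fields. Those field names are hypothetical, since the actual schema is not listed here, so inspect a real log line and adjust them before relying on the output.

```bash
# Sketch: print error-level entries from a JSON-lines log.
# Assumption: each line is a JSON object with "level", "ts" and "msg" fields.
# These field names are hypothetical; check the real schema first.
ibs_errors() {
  local logfile="$1"
  jq -r 'select(.level == "error") | "\(.ts) \(.msg)"' "$logfile"
}

# Sketch: count entries per level to see which severity dominates.
ibs_level_counts() {
  local logfile="$1"
  jq -r '.level' "$logfile" | sort | uniq -c | sort -rn
}
```

For example, `ibs_errors /var/log/ironcore-backup/worker.log` would list only the worker's error entries, provided the field names match.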

***

## Datastore Issues

<AccordionGroup>
  <Accordion title="Datastore status `offline`" icon="alert-triangle">
    The datastore backend is unreachable.

    Diagnostic steps:

    1. For local datastores, confirm the mount: `findmnt /mnt/backup/ibs-primary`.

    2. For S3 datastores, confirm the endpoint is reachable: `curl -I https://s3.<your-domain>`.

    3. Re-probe the datastore:

       ```bash theme={null}
       ironcore-backup datastore probe ibs-primary
       ```

    4. Inspect the API log for the exact reason.

    Resolution: remount the filesystem, restore credentials, or restart the
    underlying storage service.
  </Accordion>

  <Accordion title="Datastore reports `no space` despite free disk" icon="hard-drive">
    Common causes:

    * Garbage collection has not run, so unreferenced chunks still occupy space
    * Quota set on the underlying filesystem
    * Inode exhaustion (many small chunks)

    Diagnostic steps:

    ```bash theme={null}
    df -h /mnt/backup/ibs-primary
    df -i /mnt/backup/ibs-primary
    ironcore-backup datastore stats ibs-primary
    ```

    Resolution: trigger GC, raise filesystem quotas, or migrate to a filesystem
    with higher inode capacity.

    ```bash theme={null}
    ironcore-backup gc start ibs-primary
    ```
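
    The block and inode checks above can be folded into one threshold test. This is a minimal sketch, assuming GNU `df`; `fs_pressure` and the 90% default are illustrative choices, not IBS settings.

    ```bash
    # Sketch: flag block or inode usage at or above a threshold (default 90%).
    # Assumes GNU df, where -P gives fixed columns and -i switches to inode counts.
    fs_pressure() {
      local path="$1" limit="${2:-90}" kind flag pct
      for kind in blocks inodes; do
        flag="-P"
        if [[ "$kind" == "inodes" ]]; then
          flag="-Pi"
        fi
        # Column 5 of the second output line is the use percentage, e.g. "42%".
        pct=$(df "$flag" "$path" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
        # Some filesystems report "-" for inodes; skip anything non-numeric.
        if [[ "$pct" =~ ^[0-9]+$ ]] && (( pct >= limit )); then
          echo "$kind: ${pct}% used (limit ${limit}%)"
        fi
      done
    }
    ```

    Running `fs_pressure /mnt/backup/ibs-primary` prints nothing when both figures are under the limit. A line for `inodes` but not `blocks` points at inode exhaustion rather than a full disk.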
  </Accordion>

  <Accordion title="GC takes excessively long" icon="clock">
    Large datastores with many snapshots can require multiple hours for GC.

    Diagnostic steps:

    * Confirm the datastore size and snapshot count.
    * Check IO bandwidth utilisation during GC.

    Resolution: if GC duration remains unacceptable, consider partitioning the
    workload across additional datastores.
  </Accordion>

  <Accordion title="`chunk not found` errors" icon="search">
    A chunk referenced by a manifest is missing from the datastore. This
    indicates corruption.

    Diagnostic steps:

    ```bash theme={null}
    ironcore-backup verify start ibs-primary --notification-mode always
    ```

    Resolution: identify all affected snapshots, remove them, and restore them
    from the Backup site replica. Then investigate the root cause, such as
    disk failure, a GC bug, or manual deletion.

    File an incident if this affects multiple snapshots.
  </Accordion>
</AccordionGroup>

***

## Backup Job Failures

<AccordionGroup>
  <Accordion title="`source unreachable` for all VMs of a host" icon="server">
    Likely cause: the compute host or the hypervisor agent is unhealthy.

    Diagnostic steps:

    * Check compute host health from the Polystack Dashboard.
    * Confirm the hypervisor agent responds.

    Resolution: fail over the affected workloads if the host cannot be
    recovered quickly.
  </Accordion>

  <Accordion title="Backup job hangs at `Starting`" icon="hourglass">
    Worker queue is saturated or all worker slots are busy.

    Diagnostic steps:

    ```bash theme={null}
    ironcore-backup worker status
    ironcore-backup task list --state running
    ```

    Resolution: scale worker capacity (more cores or additional worker hosts),
    or reduce concurrent backups by staggering schedules.
  </Accordion>

  <Accordion title="`encryption key not registered`" icon="key">
    The client offered a key fingerprint the server does not recognise.

    Diagnostic steps:

    ```bash theme={null}
    ironcore-backup key list
    ```

    Resolution: register the missing key on the server, or redistribute the
    correct key to the client.
  </Accordion>

  <Accordion title="Backups complete but with `WARNING`" icon="alert-triangle">
    Non-fatal issues during backup. Common causes:

    * An exclude path matched no files (configuration drift)
    * A file changed during read; the changed copy was captured
    * A symlink target was unreachable

    Diagnostic steps: open the task log and review every warning line. Resolve
    configuration drift; ignore transient warnings.
  </Accordion>
</AccordionGroup>

***

## Restore Failures

<AccordionGroup>
  <Accordion title="`destination not writable`" icon="hard-drive">
    The target storage on the restore destination is full or read-only.

    Resolution: free space, change target backend, or restore to a different
    compute host.
  </Accordion>

  <Accordion title="Live-restore stays in `restoring` after hours" icon="clock">
    Background restore has slowed due to network, target storage saturation,
    or chunk cache misses.

    Diagnostic steps:

    * Check the live-restore task metrics: chunks fetched per second, bytes
      remaining.
    * Check the source datastore for IO contention with other jobs.

    Resolution: pause concurrent IO-heavy jobs. The guest stays usable for
    normal IO during the background fill.
  </Accordion>

  <Accordion title="`fingerprint mismatch` during decryption" icon="lock">
    The decryption key on the restore client does not match the snapshot.

    Resolution: load the correct key file. Identify the right key by the
    snapshot's recorded key fingerprint.
  </Accordion>

  <Accordion title="Live-restore VM crashes on first IO" icon="alert-triangle">
    The storage backend type differs between the source and the target:
    virtio drivers, controller types, or boot order do not match.

    Resolution: match the source VM's storage controller. If the target host
    cannot host the same storage type, fall back to a standard restore with
    target-storage rewriting.
  </Accordion>
</AccordionGroup>

***

## Sync and Replication Failures

<AccordionGroup>
  <Accordion title="`TLS fingerprint mismatch`" icon="lock">
    The remote's TLS certificate has changed.

    Diagnostic steps:

    * Verify the new fingerprint out-of-band (call the remote's owner).
    * Update the registered fingerprint:

      ```bash theme={null}
      ironcore-backup remote update --name backup-site \
        --fingerprint "<new-fingerprint>"
      ```

    Never accept the new fingerprint automatically — this would defeat the
    MITM protection.
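
    To see which certificate the remote currently presents, its fingerprint can be read with `openssl`. The sketch below assumes SHA-256 fingerprints and takes the host and port as placeholders; the format IBS displays may differ. A fingerprint fetched over the network only shows what is being served, so it still has to be compared with the value obtained out-of-band.

    ```bash
    # Sketch: fetch the SHA-256 fingerprint of the certificate a remote presents.
    # Host and port are placeholders supplied by the caller.
    remote_fingerprint() {
      local host="$1" port="$2"
      openssl s_client -connect "${host}:${port}" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256 \
        | cut -d= -f2
    }

    # Normalise a fingerprint for comparison: drop colons, lower-case the hex.
    normalize_fp() {
      printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f'
    }

    # Succeeds only when both fingerprints are identical after normalisation.
    fp_matches() {
      [[ "$(normalize_fp "$1")" == "$(normalize_fp "$2")" ]]
    }
    ```

    A typical check would be `fp_matches "$(remote_fingerprint <host> <port>)" "<fingerprint read out by the owner>"`, registering the new value only when it succeeds.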
  </Accordion>

  <Accordion title="Sync stalls mid-run" icon="alert-triangle">
    Inter-site link is degraded or the remote datastore is full.

    Diagnostic steps:

    * Check the inter-site bandwidth telemetry.
    * Check remote datastore free space.

    Resolution: free remote space, address the link issue, then re-run the sync.

    ```bash theme={null}
    ironcore-backup sync run --name primary-to-backup-archival
    ```
  </Accordion>

  <Accordion title="Sync repeatedly retransmits the same chunks" icon="arrow-left-right">
    Deduplication hit rate on the remote is artificially low. Often caused by
    a previous failed sync that committed a partial manifest without committing
    the chunks.

    Resolution: force a full datastore comparison and rebuild the dedup index:

    ```bash theme={null}
    ironcore-backup sync repair --name primary-to-backup-archival
    ```
  </Accordion>

  <Accordion title="Pull sync from Primary fails with `not authorised`" icon="lock">
    The remote auth-id used by the Backup site lacks `Datastore.Reader` on the
    Primary datastore.

    Resolution: grant the sync auth-id explicit permission on the source.
  </Accordion>
</AccordionGroup>

***

## Verification Failures

<AccordionGroup>
  <Accordion title="`CORRUPT` chunk detected" icon="alert-triangle">
    Possible causes: disk corruption, tampering, or a software defect.

    Immediate steps:

    1. Quarantine the affected datastore — disable new writes:

       ```bash theme={null}
       ironcore-backup datastore quarantine ibs-primary
       ```

    2. Identify every snapshot referencing affected chunks.

    3. Restore from the Backup site replica.

    4. Open an incident.

    Do not run garbage collection on a quarantined datastore until the root
    cause is established.
  </Accordion>

  <Accordion title="Verification report stuck `STALE`" icon="clock">
    Verification is not running.

    Diagnostic steps:

    ```bash theme={null}
    ironcore-backup verify status
    systemctl status ironcore-backup-verifier
    ```

    Resolution: restart the verifier service or confirm the schedule is set.
  </Accordion>

  <Accordion title="Verification takes most of the verification window" icon="hourglass">
    Datastore size has outgrown the configured verification window.

    Resolution: increase `--verify-window-days`, verify more snapshots in
    parallel with `--max-concurrent-snapshots`, or partition the datastore.
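
    Whether a window is large enough can be estimated from datastore size and sustained read throughput. This is a back-of-the-envelope sketch; `verify_hours` is a hypothetical helper, and both inputs are figures you measure yourself rather than values IBS reports under these names.

    ```bash
    # Sketch: estimate the hours needed to read a datastore once during verification.
    # Inputs (both measured by you): size in TiB, sustained read throughput in MiB/s.
    verify_hours() {
      local size_tib="$1" mib_per_s="$2"
      # TiB -> MiB, divide by throughput for seconds, then convert to hours.
      awk -v s="$size_tib" -v r="$mib_per_s" \
        'BEGIN { printf "%.1f\n", (s * 1024 * 1024) / r / 3600 }'
    }
    ```

    For example, `verify_hours 100 500` prints `58.3`: a 100 TiB datastore read at a sustained 500 MiB/s needs roughly two and a half days, so a shorter window cannot complete a full pass.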
  </Accordion>
</AccordionGroup>

***

## Notification Issues

<AccordionGroup>
  <Accordion title="Critical events not arriving via SMTP" icon="mail">
    Diagnostic steps:

    * Check the notification target's last test result.
    * Inspect the notification service log for delivery errors.

    ```bash theme={null}
    journalctl -u ironcore-backup-notify -n 200
    ironcore-backup notification target test ops-smtp \
      --recipient admin@<your-domain>
    ```
  </Accordion>

  <Accordion title="Webhook delivery returns 5xx repeatedly" icon="alert-triangle">
    Downstream endpoint is unhealthy or returning errors.

    Diagnostic steps:

    * Check the receiver's logs.
    * Use `curl` to send the same payload manually.

    Resolution: fix the receiver, then re-enable the target. Optionally
    configure exponential backoff with retry caps.
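
    The manual replay and the backoff idea can be sketched as follows. The URL and payload file are placeholders, and `backoff_delay` only illustrates capped exponential backoff; it is not a description of how IBS schedules its own retries.

    ```bash
    # Sketch: replay a saved payload against the receiver and print the HTTP status.
    # The URL and payload file are placeholders supplied by the caller.
    replay_webhook() {
      local url="$1" payload_file="$2"
      curl -sS -o /dev/null -w '%{http_code}\n' \
        --max-time 10 \
        -X POST -H 'Content-Type: application/json' \
        --data-binary "@${payload_file}" "$url"
    }

    # Sketch: delay in seconds before retry N (1-based), doubling from a base
    # value and never exceeding a cap.
    backoff_delay() {
      local attempt="$1" base="${2:-5}" cap="${3:-300}" delay
      # Guard against shift overflow on very large attempt numbers.
      if (( attempt > 30 )); then
        echo "$cap"
        return
      fi
      delay=$(( base * (1 << (attempt - 1)) ))
      if (( delay > cap )); then
        delay=$cap
      fi
      echo "$delay"
    }
    ```

    With the defaults, retries 1 to 3 wait 5, 10, and 20 seconds, and every retry from the seventh onward waits the 300-second cap.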
  </Accordion>
</AccordionGroup>

***

## Tape Library Issues

<AccordionGroup>
  <Accordion title="Tape drive not detected" icon="alert-triangle">
    Confirm the SCSI generic device exists. Check `dmesg` for hardware errors.

    ```bash theme={null}
    lsscsi -g
    dmesg | grep -i scsi
    ```
  </Accordion>

  <Accordion title="Autoloader inventory mismatch" icon="archive">
    Reset the library and re-inventory.

    ```bash theme={null}
    ironcore-backup tape library reset lto-library-01
    ironcore-backup tape inventory --library lto-library-01
    ```
  </Accordion>

  <Accordion title="Tape backup completes with `WARNING`" icon="alert-triangle">
    A media slot may have been unavailable, or a particular tape failed an
    integrity check.

    Inspect the task log for the affected tape. Quarantine the media if it
    reports recurring errors.
  </Accordion>
</AccordionGroup>

***

## Encryption Key Issues

<AccordionGroup>
  <Accordion title="`master key not available` after a system restart" icon="key">
    The master key was not auto-unlocked at boot, possibly because the secure
    enclave or HSM was unavailable.

    Resolution:

    ```bash theme={null}
    ironcore-backup masterkey unlock --interactive
    ```
  </Accordion>

  <Accordion title="Key rotation broke some clients" icon="lock">
    Distribution of the new keyfile failed on some clients.

    Resolution: re-distribute. Both the old and new keys remain valid for
    decryption until the old key is explicitly retired.
  </Accordion>

  <Accordion title="Paperkey rejected during master key restore" icon="qr-code">
    Checksum mismatch on the scanned or transcribed key.

    Diagnostic steps:

    * Re-scan with better lighting.
    * Compare the checksums printed on the paperkey with the QR-decoded
      values.
  </Accordion>
</AccordionGroup>

***

## Performance Issues

<AccordionGroup>
  <Accordion title="Backup throughput is below expected" icon="gauge">
    Diagnostic steps: check utilisation at each layer to locate the bottleneck.

    * Datastore IO utilisation
    * Backup server CPU utilisation
    * Network utilisation between client and server

    Common fixes:

    * Increase backup server CPU (compression and encryption are CPU-bound)
    * Add faster storage to the datastore
    * Tune chunk cache size on the worker
  </Accordion>

  <Accordion title="Restore is slower than backup" icon="hourglass">
    Restore reads many chunks; throughput is bound by chunk read and target
    write speeds.

    Resolution: restore to a faster target storage backend, or schedule the
    restore during off-peak hours.
  </Accordion>
</AccordionGroup>

***

## Escalation Procedure

For incidents that affect data integrity, security, or service-wide availability,
follow this escalation path:

<Steps titleSize="h3">
  <Step title="Isolate" icon="shield">
    Quarantine the affected datastore. Stop new writes to prevent further
    propagation.
  </Step>

  <Step title="Collect evidence" icon="clipboard">
    Capture service logs, task logs, datastore stats, and notification history.
    Time-stamp every action.
  </Step>

  <Step title="Notify" icon="bell">
    Notify the on-call rotation, the platform owner, and the compliance team.
  </Step>

  <Step title="Recover" icon="rotate-ccw">
    Restore affected workloads from the Backup site replica.
  </Step>

  <Step title="Root-cause" icon="search">
    Investigate the underlying cause — hardware, software defect, configuration,
    or process failure.
  </Step>

  <Step title="Remediate and document" icon="check">
    Implement long-term fixes. Update the runbook and the post-incident report.
  </Step>
</Steps>
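
The "Collect evidence" step above can be partly scripted. This is a minimal sketch using standard tools only; `collect_evidence` is a hypothetical helper, and the output directory is whatever location your incident process designates. It covers log files only, so task logs, datastore stats, and notification history still need to be captured separately.

```bash
# Sketch: bundle a log directory into a UTC-timestamped archive and record
# its SHA-256 so later truncation or tampering can be detected.
# Paths are supplied by the caller; nothing here is an IBS command.
collect_evidence() {
  local src_dir="$1" out_dir="$2" stamp archive
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="${out_dir}/ibs-evidence-${stamp}.tar.gz"
  mkdir -p "$out_dir" || return 1
  # -C keeps archive paths relative to the parent of the log directory.
  tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
  sha256sum "$archive" > "${archive}.sha256" || return 1
  echo "$archive"
}
```

For example, `collect_evidence /var/log/ironcore-backup <evidence-dir>` prints the path of the archive it created, which can then be referenced in the incident timeline.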

***

## Open a Support Case

For issues that exceed local capability, open a support case with Polystack
Technologies:

| Information to Provide   | Source                           |
| ------------------------ | -------------------------------- |
| Platform version         | Dashboard footer                 |
| Affected service version | `ironcore-backup --version`      |
| Task UPIDs               | Task panel                       |
| Logs (sanitised)         | `/var/log/ironcore-backup/`      |
| Datastore inventory      | `ironcore-backup datastore list` |
| Time of incident         | Audit log                        |
| Reproduction steps       | Operator narrative               |

Email: [support@polystack.tech](mailto:support@polystack.tech)
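
What counts as sanitised depends on your own policy. As one possible starting point, the sketch below redacts IPv4 addresses and email addresses from a log stream. The patterns are illustrative: they do not catch hostnames, tokens, or key material, and they also redact dotted four-part version strings, so review the output before attaching it.

```bash
# Sketch: redact IPv4 addresses and email addresses from a log stream on stdin.
# The patterns are illustrative and deliberately simple.
sanitise_log() {
  sed -E \
    -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/<ip>/g' \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/<email>/g'
}
```

Usage: `sanitise_log < /var/log/ironcore-backup/api.log > api.sanitised.log`.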

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Architecture" icon="layers" href="/services/ironcore-backup/admin-guide/architecture" color="#bf9667">
    Diagnostic context for layered failures
  </Card>

  <Card title="Datastores" icon="database" href="/services/ironcore-backup/admin-guide/datastores" color="#bf9667">
    Datastore configuration and lifecycle
  </Card>

  <Card title="Verification and Validation" icon="check" href="/services/ironcore-backup/admin-guide/verification-validation" color="#bf9667">
    Verify recoverability after a failure
  </Card>

  <Card title="Polystack Support" icon="headphones" href="mailto:support@polystack.tech" color="#bf9667">
    Open a support case with Polystack Technologies
  </Card>
</CardGroup>
