Backup Solution Admin Troubleshooting - Polystack Documentation

Overview

This page covers administrator-level diagnostic procedures across every layer of Ironcore Backup Solution (IBS) — service health, datastore integrity, sync failures, encryption key issues, and capacity exhaustion. For operator-level issues, see the User Troubleshooting page.

Diagnostic Inputs Expected From Operators

Task UPID
Exact failure text
Affected datastore and namespace
Time of failure (with timezone)
Frequency (one-off or recurring)

Service Health Checks

Service Inventory

Confirm all services are running

ironcore-backup status --services

Service	Expected State
`ironcore-backup-api`	`active (running)`
`ironcore-backup-scheduler`	`active (running)`
`ironcore-backup-worker`	`active (running)`
`ironcore-backup-verifier`	`active (running)`
`ironcore-backup-gc`	`active (running)`
`ironcore-backup-sync`	`active (running)`
`ironcore-backup-notify`	`active (running)`
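
The inventory check above can also be scripted. The following is a minimal sketch, assuming the service names in the table are also systemd unit names (this page only shows `systemctl` being used for the verifier, so treat the others as an assumption). The helper names are illustrative and are not part of the IBS tooling.

```python
import subprocess
from typing import Dict, List

# Service names copied from the inventory table; assumed to be systemd units.
SERVICES = [
    "ironcore-backup-api",
    "ironcore-backup-scheduler",
    "ironcore-backup-worker",
    "ironcore-backup-verifier",
    "ironcore-backup-gc",
    "ironcore-backup-sync",
    "ironcore-backup-notify",
]


def unit_state(unit: str) -> str:
    """Return the state word printed by `systemctl is-active` (e.g. "active")."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit just means the unit is not active
    )
    return result.stdout.strip() or "unknown"


def collect_states(units: List[str]) -> Dict[str, str]:
    """Query every unit once and return a name -> state mapping."""
    return {unit: unit_state(unit) for unit in units}


def not_running(states: Dict[str, str]) -> List[str]:
    """Return the sorted names of units whose state is anything but "active"."""
    return sorted(unit for unit, state in states.items() if state != "active")
```

Calling `not_running(collect_states(SERVICES))` on the backup server returns an empty list when every service is healthy.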

Log Locations

Service	Log Path
API	`/var/log/ironcore-backup/api.log`
Scheduler	`/var/log/ironcore-backup/scheduler.log`
Worker	`/var/log/ironcore-backup/worker.log`
Verifier	`/var/log/ironcore-backup/verifier.log`
Garbage Collector	`/var/log/ironcore-backup/gc.log`
Sync	`/var/log/ironcore-backup/sync.log`
Notifications	`/var/log/ironcore-backup/notify.log`

All service logs use structured JSON with a consistent schema. Use jq to filter and aggregate. Logs ship to the platform metric server via the notification system when configured.
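
For aggregation beyond a one-line filter, the same logs can be read from a script. This sketch assumes one JSON object per line and a field named `level`; the schema is not documented on this page, so the field name is a placeholder to adjust against a real log line.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_log_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON object line and skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines
        if isinstance(record, dict):
            yield record


def count_by_level(lines: Iterable[str], level_field: str = "level") -> Counter:
    """Count records per value of the (assumed) severity field."""
    return Counter(
        record.get(level_field, "unknown") for record in parse_log_lines(lines)
    )
```

Passing an open file handle for `/var/log/ironcore-backup/api.log` as `lines` gives a quick per-severity summary before drilling into individual entries.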

Datastore Issues

Datastore status `offline`

The datastore backend is unreachable.

Diagnostic steps:

For local datastores, confirm the mount: `findmnt /mnt/backup/ibs-primary`
For S3 datastores, confirm the endpoint is reachable: `curl -I https://s3.<your-domain>`

Re-probe the datastore:

ironcore-backup datastore probe ibs-primary

Inspect the API log for the exact reason.

Resolution: remount the filesystem, restore credentials, or restart the underlying storage service.

Datastore reports `no space` despite free disk

Common causes:

Garbage collection has not run; chunks that are no longer referenced still occupy space
Quota set on the underlying filesystem
Inode exhaustion (many small chunks)

Diagnostic steps:

df -h /mnt/backup/ibs-primary
df -i /mnt/backup/ibs-primary
ironcore-backup datastore stats ibs-primary

Resolution: trigger GC, raise filesystem quotas, or migrate to a filesystem with higher inode capacity.

ironcore-backup gc start ibs-primary
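
The `df -h` and `df -i` checks above can be combined into one reading. This is a small sketch using the standard `os.statvfs` call; the function name is illustrative, and the percentages approximate what `df` prints rather than reproduce its rounding.

```python
import os


def usage_percent(path: str) -> dict:
    """Return block and inode utilisation for the filesystem holding `path`."""
    st = os.statvfs(path)
    # Like df: used = total - free, measured against used + space available
    # to unprivileged users (reserved blocks are excluded).
    used_blocks = st.f_blocks - st.f_bfree
    block_denominator = used_blocks + st.f_bavail
    used_inodes = st.f_files - st.f_ffree
    return {
        "blocks_pct": (
            100.0 * used_blocks / block_denominator if block_denominator else 0.0
        ),
        # Some filesystems report zero total inodes; treat that as not applicable.
        "inodes_pct": 100.0 * used_inodes / st.f_files if st.f_files else None,
    }
```

A low `blocks_pct` next to an `inodes_pct` near 100 for `/mnt/backup/ibs-primary` points at inode exhaustion rather than a missed GC run.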

GC takes excessively long

Large datastores with many snapshots can require multiple hours for GC.

Diagnostic steps:

Confirm the datastore size and snapshot count.
Check IO bandwidth utilisation during GC.
Consider partitioning by creating additional datastores.

`chunk not found` errors

A chunk referenced by a manifest is missing from the datastore. This indicates corruption.

Diagnostic steps:

ironcore-backup verify start ibs-primary --notification-mode always

Resolution: identify all affected snapshots and remove them. Restore from the Backup site replica. Investigate the root cause: disk failure, GC bug, or manual deletion. File an incident if this affects multiple snapshots.

Backup Job Failures

`source unreachable` for all VMs of a host

Likely cause: the compute host or the hypervisor agent is unhealthy.

Diagnostic steps:

Check compute host health from the Polystack Dashboard.
Confirm the hypervisor agent responds.
Fail over the affected workloads if the host cannot be recovered quickly.

Backup job hangs at `Starting`

The worker queue is saturated or all worker slots are busy.

Diagnostic steps:

ironcore-backup worker status
ironcore-backup task list --state running

Resolution: scale worker capacity (more cores or additional worker hosts), or reduce concurrent backups by staggering schedules.
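
Staggering comes down to spreading start times across the backup window. The sketch below computes evenly spaced offsets; how those offsets are applied to IBS schedules is not covered on this page, so the function is only a planning aid with a hypothetical name.

```python
from typing import List


def stagger_offsets(job_count: int, window_minutes: int) -> List[int]:
    """Return evenly spaced start offsets, in minutes, for jobs within a window."""
    if window_minutes < 0:
        raise ValueError("window_minutes must not be negative")
    if job_count <= 0:
        return []
    # Integer division keeps the offsets on whole minutes.
    return [(index * window_minutes) // job_count for index in range(job_count)]
```

Four jobs in a 60-minute window start at minutes 0, 15, 30, and 45 instead of all competing for worker slots at once.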

`encryption key not registered`

The client offered a key fingerprint the server does not recognise.

Diagnostic steps:

ironcore-backup key list

Resolution: register the missing key on the server, or redistribute the correct key to the client.

Backups complete but with `WARNING`

Non-fatal issues during backup. Common causes:

An exclude path matched no files (configuration drift)
A file changed during read; the changed copy was captured
A symlink target was unreachable

Diagnostic steps: open the task log and review every warning line. Resolve configuration drift; ignore transient warnings.

Restore Failures

`destination not writable`

The target storage on the restore destination is full or read-only.

Resolution: free space, change target backend, or restore to a different compute host.

Live-restore stays in `restoring` after hours

Background restore has slowed due to network, target storage saturation, or chunk cache misses.

Diagnostic steps:

Check the live-restore task metrics: chunks fetched per second, bytes remaining.
Check the source datastore for IO contention with other jobs.

Resolution: pause concurrent IO-heavy jobs. The guest stays usable for normal IO during the background fill.
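
The two task metrics mentioned above (chunks fetched per second and bytes remaining) give a rough time-to-completion. This sketch assumes a fixed average chunk size, which this page does not state; pass whatever value matches the datastore and treat the result as an estimate only.

```python
def restore_eta_seconds(
    bytes_remaining: int,
    chunks_per_second: float,
    avg_chunk_size_bytes: int,
) -> float:
    """Estimate seconds until the background fill completes."""
    if bytes_remaining <= 0:
        return 0.0
    bytes_per_second = chunks_per_second * avg_chunk_size_bytes
    if bytes_per_second <= 0:
        return float("inf")  # no progress is being made
    return bytes_remaining / bytes_per_second
```

With 400 GiB remaining, 25 chunks per second, and an assumed 4 MiB average chunk, the estimate is 4096 seconds, a little over an hour. An estimate that keeps growing between readings suggests the restore is losing the contention for IO.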

`fingerprint mismatch` during decryption

The decryption key on the restore client does not match the snapshot.

Resolution: load the correct key file. Identify the right key by the snapshot’s recorded key fingerprint.

Live-restore VM crashes on first IO

The storage backend type differs between the source and the target: virtio drivers, controller types, or boot order do not match.

Resolution: match the source VM’s storage controller. If the target host cannot host the same storage type, fall back to a standard restore with target-storage rewriting.

Sync and Replication Failures

`TLS fingerprint mismatch`

The remote’s TLS certificate has changed.

Diagnostic steps:

Verify the new fingerprint out-of-band (call the remote’s owner).

Update the registered fingerprint:

ironcore-backup remote update --name backup-site \
  --fingerprint "<new-fingerprint>"

Never accept the new fingerprint automatically — this would defeat the MITM protection.
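
When the remote’s owner reads out the new fingerprint, it helps to compare it against what the remote actually presents. This sketch assumes the registered fingerprint is a SHA-256 digest of the DER-encoded certificate, which this page does not confirm; the function names are illustrative. Fetching the certificate over the network does not replace the out-of-band check, because the fetch travels over the same path an attacker would control.

```python
import hashlib
import hmac
import ssl


def sha256_fingerprint(der_cert: bytes) -> str:
    """Return the colon-separated, upper-case SHA-256 digest of a DER certificate."""
    digest = hashlib.sha256(der_cert).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_fingerprint(host: str, port: int) -> str:
    """Fingerprint the certificate the remote currently presents (no validation)."""
    pem = ssl.get_server_certificate((host, port))
    return sha256_fingerprint(ssl.PEM_cert_to_DER_cert(pem))


def _normalise(fingerprint: str) -> str:
    """Strip separators and case so differently formatted values compare equal."""
    return fingerprint.replace(":", "").replace(" ", "").lower()


def matches(observed: str, confirmed: str) -> bool:
    """Compare two fingerprints, ignoring case and separators."""
    return hmac.compare_digest(_normalise(observed), _normalise(confirmed))
```

Only when `matches(fetch_fingerprint(host, port), value_from_owner)` is true should the value be passed to `ironcore-backup remote update`.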

Sync stalls mid-run

The inter-site link is degraded or the remote datastore is full.

Diagnostic steps:

Check the inter-site bandwidth telemetry.
Check remote datastore free space.

Resolution: free remote space, address the link issue, then re-run the sync.

ironcore-backup sync run --name primary-to-backup-archival

Sync repeatedly retransmits the same chunks

The deduplication hit rate on the remote is artificially low, often because a previous failed sync committed a partial manifest without committing the chunks.

Resolution: force a full datastore comparison and rebuild the dedup index:

ironcore-backup sync repair --name primary-to-backup-archival

Pull sync from Primary fails with `not authorised`

The remote auth-id used by the Backup site lacks `Datastore.Reader` on the Primary datastore.

Resolution: grant the sync auth-id explicit permission on the source.

Verification Failures

`CORRUPT` chunk detected

The cause is disk corruption, tampering, or a software defect.

Immediate steps:

Quarantine the affected datastore — disable new writes:
```
ironcore-backup datastore quarantine ibs-primary
```
Identify every snapshot referencing affected chunks.
Restore from the Backup site replica.
Open an incident.

Do not run garbage collection on a quarantined datastore until the root cause is established.

Verification report stuck `STALE`

Verification is not running.

Diagnostic steps:

ironcore-backup verify status
systemctl status ironcore-backup-verifier

Resolution: restart the verifier service or confirm the schedule is set.

Verification takes most of the verification window

Datastore size has outgrown the configured verification window.

Resolution: increase `--verify-window-days`, run verification more frequently with `--max-concurrent-snapshots`, or partition the datastore.
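
Before changing the window, it is worth estimating how long a full pass takes at the observed read rate. This is a back-of-the-envelope sketch; the sustained verification throughput is something to measure locally, and the function name is hypothetical.

```python
def verification_days(
    datastore_bytes: int,
    verify_bytes_per_second: float,
    hours_per_day: float = 24.0,
) -> float:
    """Estimate the days one full verification pass takes at a sustained read rate."""
    if verify_bytes_per_second <= 0:
        raise ValueError("verify_bytes_per_second must be positive")
    if not 0 < hours_per_day <= 24:
        raise ValueError("hours_per_day must be in (0, 24]")
    seconds = datastore_bytes / verify_bytes_per_second
    return seconds / (hours_per_day * 3600)
```

A 100 TiB datastore read at a sustained 500 MiB/s needs roughly 2.4 days of continuous verification. If that figure approaches the configured window, the datastore has outgrown it.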

Notification Issues

Critical events not arriving via SMTP

Diagnostic steps:

Check the notification target’s last test result.
Inspect the notification service log for delivery errors.

journalctl -u ironcore-backup-notify -n 200
ironcore-backup notification target test ops-smtp \
  --recipient admin@<your-domain>

Webhook delivery returns 5xx repeatedly

The downstream endpoint is unhealthy or returning errors.

Diagnostic steps:

Check the receiver’s logs.
Use curl to send the same payload manually.

Resolution: fix the receiver, then re-enable the target. Optionally configure exponential backoff with retry caps.
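
Exponential backoff with a retry cap produces a delay schedule like the one sketched below. The parameter names are illustrative; this page does not list the corresponding IBS notification options, so map them to whatever the target configuration exposes.

```python
from typing import List


def backoff_delays(
    base_seconds: float = 1.0,
    factor: float = 2.0,
    max_delay_seconds: float = 60.0,
    max_retries: int = 5,
) -> List[float]:
    """Return the wait before each retry: exponential growth, capped per attempt."""
    if base_seconds < 0 or factor < 1 or max_retries < 0:
        raise ValueError("invalid backoff parameters")
    return [
        min(base_seconds * factor ** attempt, max_delay_seconds)
        for attempt in range(max_retries)
    ]
```

The cap on the number of retries bounds the total time spent on a dead receiver, and the cap on each delay keeps later retries from drifting hours apart. Adding random jitter to each delay is a common refinement when many senders retry at once.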

Tape Library Issues

Tape drive not detected

Confirm the SCSI generic device exists. Check dmesg for hardware errors.

lsscsi -g
dmesg | grep -i scsi

Autoloader inventory mismatch

Reset the library and re-inventory.

ironcore-backup tape library reset lto-library-01
ironcore-backup tape inventory --library lto-library-01

Tape backup completes with `WARNING`

A media slot may have been unavailable, or a particular tape failed an integrity check.

Inspect the task log for the affected tape. Quarantine the media if it reports recurring errors.

Encryption Key Issues

`master key not available` after a system restart

The master key was not auto-unlocked at boot, possibly because the secure enclave or HSM was unavailable.

Resolution:

ironcore-backup masterkey unlock --interactive

Key rotation broke some clients

Distribution of the new keyfile failed on some clients.

Resolution: re-distribute. Both the old and new keys remain valid for decryption until the old key is explicitly retired.

Paperkey rejected during master key restore

Checksum mismatch on the scanned or transcribed key.

Diagnostic steps:

Re-scan with better lighting.
Compare the checksums printed on the paperkey with the QR-decoded values.

Performance Issues

Backup throughput is below expected

Diagnostic steps: check each of the following while a backup is running:

Datastore IO utilisation
Backup server CPU utilisation
Network utilisation between client and server

Common fixes:

Increase backup server CPU (compression and encryption are CPU-bound)
Add faster storage to the datastore
Tune chunk cache size on the worker

Restore is slower than backup

Restore reads many chunks; throughput is bound by chunk read and target write speeds.

Resolution: restore to a faster target storage backend, or schedule the restore during off-peak hours.

Escalation Procedure

For incidents that affect data integrity, security, or service-wide availability, follow this escalation path:

Isolate

Quarantine the affected datastore. Stop new writes to prevent further propagation.

Collect evidence

Capture service logs, task logs, datastore stats, and notification history. Time-stamp every action.
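
Capturing the log directory in one time-stamped archive keeps the evidence consistent while the investigation continues. This is a minimal sketch using only the standard library; the archive name prefix is an arbitrary choice, and logs still need sanitising before they leave the organisation.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def bundle_evidence(source_dir: str, dest_dir: str) -> Path:
    """Archive source_dir into dest_dir as a UTC-stamped .tar.gz and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = Path(dest_dir) / f"ibs-evidence-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries under the directory's own name, not its absolute path.
        tar.add(source_dir, arcname=Path(source_dir).name)
    return archive
```

Running it against `/var/log/ironcore-backup/` with a destination outside the affected datastore avoids writing evidence onto the storage under investigation.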

Notify

Notify the on-call rotation, the platform owner, and the compliance team.

Recover

Restore affected workloads from the Backup site replica.

Root-cause

Investigate the underlying cause — hardware, software defect, configuration, or process failure.

Remediate and document

Implement long-term fixes. Update the runbook and the post-incident report.

Open a Support Case

For issues that exceed local capability, open a support case with Polystack Technologies:

Information to Provide	Source
Platform version	Dashboard footer
Affected service version	`ironcore-backup --version`
Task UPIDs	Task panel
Logs (sanitised)	`/var/log/ironcore-backup/`
Datastore inventory	`ironcore-backup datastore list`
Time of incident	Audit log
Reproduction steps	Operator narrative

Email: support@polystack.tech

Next Steps

Architecture

Diagnostic context for layered failures

Datastores

Datastore configuration and lifecycle

Verification and Validation

Verify recoverability after a failure

Polystack Support

Open a support case with Polystack Technologies
