Overview

This page covers administrator-level diagnostic procedures across every layer of Ironcore Backup Solution (IBS) — service health, datastore integrity, sync failures, encryption key issues, and capacity exhaustion. For operator-level issues, see the User Troubleshooting page.
Diagnostic Inputs Expected From Operators
  • Task UPID
  • Exact failure text
  • Affected datastore and namespace
  • Time of failure (with timezone)
  • Frequency (one-off or recurring)

Service Health Checks

Service Inventory

Confirm all services are running:
ironcore-backup status --services

| Service | Expected State |
| --- | --- |
| ironcore-backup-api | active (running) |
| ironcore-backup-scheduler | active (running) |
| ironcore-backup-worker | active (running) |
| ironcore-backup-verifier | active (running) |
| ironcore-backup-gc | active (running) |
| ironcore-backup-sync | active (running) |
| ironcore-backup-notify | active (running) |
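
When the status command itself is unavailable (for example, because the API service is down), the same inventory can be checked through systemd. The following is a minimal sketch, assuming each service in the table is a systemd unit of the same name; the function name is hypothetical.

```shell
# Sketch: report which IBS services are not active, using systemd directly.
# Assumption: each service in the inventory table is a systemd unit of the same name.
IBS_SERVICES="ironcore-backup-api ironcore-backup-scheduler ironcore-backup-worker ironcore-backup-verifier ironcore-backup-gc ironcore-backup-sync ironcore-backup-notify"

# Print one line per inactive unit; return non-zero if any unit is inactive.
report_inactive_services() {
  svc_failed=0
  for svc_name in $IBS_SERVICES; do
    if ! systemctl is-active --quiet "$svc_name"; then
      echo "NOT ACTIVE: $svc_name"
      svc_failed=1
    fi
  done
  return "$svc_failed"
}
```

Calling `report_inactive_services` prints nothing and returns 0 when every unit is active, which makes it easy to use in a cron job or a pre-check script.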

Log Locations

| Service | Log Path |
| --- | --- |
| API | /var/log/ironcore-backup/api.log |
| Scheduler | /var/log/ironcore-backup/scheduler.log |
| Worker | /var/log/ironcore-backup/worker.log |
| Verifier | /var/log/ironcore-backup/verifier.log |
| Garbage Collector | /var/log/ironcore-backup/gc.log |
| Sync | /var/log/ironcore-backup/sync.log |
| Notifications | /var/log/ironcore-backup/notify.log |

All service logs use structured JSON with a consistent schema. Use jq to filter and aggregate. Logs ship to the platform metric server via the notification system when configured.
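
As an illustration of filtering with jq, the sketch below lists error-level entries and counts entries per level. The field names `timestamp`, `level`, and `message` are assumptions, since this page does not document the log schema; inspect a real log line and adjust the filters before relying on them.

```shell
# Sketch: filter and aggregate one IBS service log with jq.
# Assumption: every log line is a JSON object with "timestamp", "level", and
# "message" fields. These field names are hypothetical, not a documented schema.

# Print "timestamp<TAB>message" for each error-level entry.
# $1 = path to a log file, e.g. /var/log/ironcore-backup/worker.log
ibs_log_errors() {
  jq -r 'select(.level == "error") | [.timestamp, .message] | @tsv' "$1"
}

# Count entries per level, most frequent first.
ibs_log_level_counts() {
  jq -r '.level' "$1" | sort | uniq -c | sort -rn
}
```

For example, `ibs_log_errors /var/log/ironcore-backup/sync.log` narrows a sync failure to its error lines before opening the full log.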

Datastore Issues

The datastore backend is unreachable.
Diagnostic steps:
  1. For local datastores, confirm the mount: findmnt /mnt/backup/ibs-primary.
  2. For S3 datastores, confirm the endpoint is reachable: curl -I https://s3.<your-domain>.
  3. Re-probe the datastore:
    ironcore-backup datastore probe ibs-primary
    
  4. Inspect the API log for the exact reason.
Resolution: remount the filesystem, restore credentials, or restart the underlying storage service.

Common causes:
  • Garbage collection has not run, so chunks that are no longer referenced still occupy space
  • Quota set on the underlying filesystem
  • Inode exhaustion (many small chunks)
Diagnostic steps:
df -h /mnt/backup/ibs-primary
df -i /mnt/backup/ibs-primary
ironcore-backup datastore stats ibs-primary
Resolution: trigger GC, raise filesystem quotas, or migrate to a filesystem with higher inode capacity.
ironcore-backup gc start ibs-primary
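
The two `df` checks above can be combined into a single threshold test. This is a minimal sketch, assuming GNU coreutils `df`; the 85% threshold and the function names are illustrative and are not IBS defaults.

```shell
# Sketch: warn when block or inode usage on a datastore mount crosses a threshold.
# Assumptions: GNU coreutils df (accepts -P together with -i); the 85% default
# threshold is an arbitrary example, not an IBS setting.

# Read `df -P`-style output on stdin and print the usage percentage of the
# first filesystem row, without the % sign.
df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# $1 = mount point, $2 = threshold in percent (default 85)
check_datastore_capacity() {
  cap_mount=$1
  cap_limit=${2:-85}
  cap_blocks=$(df -P "$cap_mount" | df_use_pct)
  cap_inodes=$(df -Pi "$cap_mount" | df_use_pct)
  # Some filesystems report "-" for inode usage; treat any non-numeric value as 0.
  case $cap_blocks in ''|*[!0-9]*) cap_blocks=0 ;; esac
  case $cap_inodes in ''|*[!0-9]*) cap_inodes=0 ;; esac
  cap_rc=0
  if [ "$cap_blocks" -ge "$cap_limit" ]; then
    echo "WARN: $cap_mount block usage at ${cap_blocks}%"
    cap_rc=1
  fi
  if [ "$cap_inodes" -ge "$cap_limit" ]; then
    echo "WARN: $cap_mount inode usage at ${cap_inodes}%"
    cap_rc=1
  fi
  return "$cap_rc"
}
```

Running `check_datastore_capacity /mnt/backup/ibs-primary 85` distinguishes space exhaustion from inode exhaustion, which call for different fixes.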

Large datastores with many snapshots can require multiple hours for GC.
Diagnostic steps:
  • Confirm the datastore size and snapshot count.
  • Check IO bandwidth utilisation during GC.
  • Consider partitioning by creating additional datastores.

A chunk referenced by a manifest is missing from the datastore. This indicates corruption.
Diagnostic steps:
ironcore-backup verify start ibs-primary --notification-mode always
Resolution: identify all affected snapshots and remove them. Restore from the Backup site replica. Investigate the root cause: disk failure, GC bug, or manual deletion.
File an incident if this affects multiple snapshots.

Backup Job Failures

Likely cause: the compute host or the hypervisor agent is unhealthy.
Diagnostic steps:
  • Check compute host health from the Polystack Dashboard.
  • Confirm the hypervisor agent responds.
  • Fail over the affected workloads if the host cannot be recovered quickly.

The worker queue is saturated or all worker slots are busy.
Diagnostic steps:
ironcore-backup worker status
ironcore-backup task list --state running
Resolution: scale worker capacity (more cores or additional worker hosts), or reduce concurrent backups by staggering schedules.

The client offered a key fingerprint the server does not recognise.
Diagnostic steps:
ironcore-backup key list
Resolution: register the missing key on the server, or redistribute the correct key to the client.

Non-fatal issues occurred during the backup.
Common causes:
  • An exclude path matched no files (configuration drift)
  • A file changed during read; the changed copy was captured
  • A symlink target was unreachable
Diagnostic steps: open the task log and review every warning line. Resolve configuration drift; ignore transient warnings.

Restore Failures

The target storage on the restore destination is full or read-only.
Resolution: free space, change the target backend, or restore to a different compute host.

Background restore has slowed due to network, target storage saturation, or chunk cache misses.
Diagnostic steps:
  • Check the live-restore task metrics: chunks fetched per second, bytes remaining.
  • Check the source datastore for IO contention with other jobs.
Resolution: pause concurrent IO-heavy jobs. The guest stays usable for normal IO during the background fill.

The decryption key on the restore client does not match the snapshot.
Resolution: load the correct key file. Identify the right key by the snapshot’s recorded key fingerprint.

The storage backend type differs between the source and the target: Virtio drivers, controller types, or boot order do not match.
Resolution: match the source VM’s storage controller. If the target host cannot host the same storage type, fall back to a standard restore with target-storage rewriting.

Sync and Replication Failures

The remote’s TLS certificate has changed.
Diagnostic steps:
  • Verify the new fingerprint out-of-band (call the remote’s owner).
  • Update the registered fingerprint:
    ironcore-backup remote update --name backup-site \
      --fingerprint "<new-fingerprint>"
    
Never accept the new fingerprint automatically — this would defeat the MITM protection.
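
The out-of-band check can be made mechanical by comparing the fingerprint the remote currently presents with the value its owner confirmed. This is a sketch under two assumptions not stated on this page: that the remote serves TLS on a reachable host and port, and that IBS fingerprints are SHA-256 digests. The function names are hypothetical; the `openssl` invocations are standard.

```shell
# Sketch: compare a presented certificate fingerprint with one confirmed out-of-band.
# Assumptions: the remote serves TLS at host:port, and fingerprints are SHA-256
# hex digests. Neither is confirmed by this page.

# Normalise a fingerprint: drop any "label=" prefix, colons, and spaces; lowercase the hex.
normalize_fp() {
  printf '%s' "$1" | sed 's/^.*=//' | tr -d ': \n' | tr 'A-F' 'a-f'
}

# Print the SHA-256 fingerprint of the certificate served at $1 (host:port).
presented_fp() {
  openssl s_client -connect "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -fingerprint -sha256
}

# Succeed only if the two fingerprints are identical after normalisation.
fp_matches() {
  [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]
}
```

A typical use is `fp_matches "$(presented_fp <remote-host>:<port>)" "<confirmed-fingerprint>"`, running the `remote update` command only when it succeeds. The comparison supports the out-of-band confirmation; it does not replace it.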

The inter-site link is degraded or the remote datastore is full.
Diagnostic steps:
  • Check the inter-site bandwidth telemetry.
  • Check remote datastore free space.
Resolution: free remote space, address the link issue, then re-run the sync.
ironcore-backup sync run --name primary-to-backup-archival

The deduplication hit rate on the remote is artificially low. This is often caused by a previous failed sync that committed a partial manifest without committing the chunks.
Resolution: force a full datastore comparison and rebuild the dedup index:
ironcore-backup sync repair --name primary-to-backup-archival

The remote auth-id used by the Backup site lacks Datastore.Reader on the Primary datastore.
Resolution: grant the sync auth-id explicit permission on the source.

Verification Failures

The cause is disk corruption, tampering, or a software defect.
Immediate steps:
  1. Quarantine the affected datastore to disable new writes:
    ironcore-backup datastore quarantine ibs-primary
    
  2. Identify every snapshot referencing affected chunks.
  3. Restore from the Backup site replica.
  4. Open an incident.
Do not run garbage collection on a quarantined datastore until the root cause is established.

Verification is not running.
Diagnostic steps:
ironcore-backup verify status
systemctl status ironcore-backup-verifier
Resolution: restart the verifier service or confirm the schedule is set.

The datastore size has outgrown the configured verification window.
Resolution: increase --verify-window-days, run verification more frequently with --max-concurrent-snapshots, or partition the datastore.
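
To judge whether a window is realistic, estimate the sustained read throughput a full verification pass would need. The sketch below is a back-of-the-envelope calculation that assumes every stored byte is read once per window and ignores concurrent load; the function name is hypothetical.

```shell
# Sketch: sustained read throughput needed to verify a datastore once per window.
# Assumption: a full pass reads every stored byte exactly once; skipped
# re-verification and competing IO are ignored.
# $1 = datastore size in TiB, $2 = verification window in days
required_verify_mibps() {
  awk -v tib="$1" -v days="$2" 'BEGIN {
    mib = tib * 1024 * 1024   # TiB -> MiB
    secs = days * 86400       # days -> seconds
    printf "%.1f\n", mib / secs
  }'
}
```

For example, `required_verify_mibps 100 30` prints `40.5`: a 100 TiB datastore needs about 40.5 MiB/s of sustained reads to finish within 30 days. If the datastore cannot spare that bandwidth alongside backups and GC, a longer window or partitioning is the more realistic option.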

Notification Issues

Diagnostic steps:
  • Check the notification target’s last test result.
  • Inspect the notification service log for delivery errors.
journalctl -u ironcore-backup-notify -n 200
ironcore-backup notification target test ops-smtp \
  --recipient admin@<your-domain>

The downstream endpoint is unhealthy or returning errors.
Diagnostic steps:
  • Check the receiver’s logs.
  • Use curl to send the same payload manually.
Resolution: fix the receiver, then re-enable the target. Optionally configure exponential backoff with retry caps.
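
To show what exponential backoff with a retry cap means in practice, the sketch below retries an arbitrary command, doubling the wait after each failure up to a ceiling. It illustrates the technique for manual resends; it is not the IBS notification setting, and the default delays, attempt cap, and function names are assumptions.

```shell
# Sketch: retry a command with exponential backoff and a retry cap.
# Assumptions: the defaults (2 s base, 60 s ceiling, 5 attempts) are illustrative
# values, not IBS defaults.

# Print the delay in seconds before retry number $1 (1-based):
# base * 2^(n-1), capped at max. $2 = base (default 2), $3 = max (default 60)
backoff_delay() {
  bo_n=$1
  bo_delay=${2:-2}
  bo_max=${3:-60}
  bo_i=1
  while [ "$bo_i" -lt "$bo_n" ] && [ "$bo_delay" -lt "$bo_max" ]; do
    bo_delay=$((bo_delay * 2))
    bo_i=$((bo_i + 1))
  done
  if [ "$bo_delay" -gt "$bo_max" ]; then
    bo_delay=$bo_max
  fi
  echo "$bo_delay"
}

# Run "$@" until it succeeds or RETRY_MAX attempts have failed.
retry_with_backoff() {
  rt_attempt=1
  while :; do
    "$@" && return 0
    [ "$rt_attempt" -ge "${RETRY_MAX:-5}" ] && return 1
    sleep "$(backoff_delay "$rt_attempt" "${RETRY_BASE:-2}" "${RETRY_CAP:-60}")"
    rt_attempt=$((rt_attempt + 1))
  done
}
```

A manual resend could then look like `retry_with_backoff curl -fsS -X POST -H 'Content-Type: application/json' --data @payload.json "<receiver-url>"`, where `payload.json` and the URL are placeholders for the captured payload and the receiver endpoint.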

Tape Library Issues

Confirm the SCSI generic device exists. Check dmesg for hardware errors.
lsscsi -g
dmesg | grep -i scsi

Reset the library and re-inventory:
ironcore-backup tape library reset lto-library-01
ironcore-backup tape inventory --library lto-library-01

A media slot may have been unavailable, or a particular tape failed an integrity check.
Inspect the task log for the affected tape. Quarantine the media if it reports recurring errors.

Encryption Key Issues

The master key was not auto-unlocked at boot, possibly because the secure enclave or HSM was unavailable.
Resolution:
ironcore-backup masterkey unlock --interactive

Distribution of the new keyfile failed on some clients.
Resolution: redistribute the keyfile. Both the old and new keys remain valid for decryption until the old key is explicitly retired.

Checksum mismatch on the scanned or transcribed key.
Diagnostic steps:
  • Re-scan with better lighting.
  • Compare the checksums printed on the paperkey with the QR-decoded values.

Performance Issues

Diagnostic steps: check each of the following for a bottleneck.
  • Datastore IO utilisation
  • Backup server CPU utilisation
  • Network utilisation between client and server
Common fixes:
  • Increase backup server CPU (compression and encryption are CPU-bound)
  • Add faster storage to the datastore
  • Tune chunk cache size on the worker

Restore reads many chunks; throughput is bound by chunk read and target write speeds.
Resolution: restore to a faster target storage backend, or schedule the restore during off-peak hours.

Escalation Procedure

For incidents that affect data integrity, security, or service-wide availability, follow this escalation path:

  1. Isolate: quarantine the affected datastore. Stop new writes to prevent further propagation.
  2. Collect evidence: capture service logs, task logs, datastore stats, and notification history. Time-stamp every action.
  3. Notify: notify the on-call rotation, the platform owner, and the compliance team.
  4. Recover: restore affected workloads from the Backup site replica.
  5. Root-cause: investigate the underlying cause (hardware, software defect, configuration, or process failure).
  6. Remediate and document: implement long-term fixes. Update the runbook and the post-incident report.
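
The evidence-collection step of this path can be partly scripted. The sketch below bundles the service logs into a UTC-timestamped archive and records a checksum. The archive naming, output directory, and function name are illustrative assumptions; only the log directory comes from this page.

```shell
# Sketch: bundle IBS service logs into a UTC-timestamped archive for an incident.
# Assumptions: the archive name and output directory are illustrative choices;
# sha256sum (GNU coreutils) is available.
# $1 = log directory (default /var/log/ironcore-backup), $2 = output directory (default .)
# Prints the path of the archive it created.
collect_ibs_evidence() {
  ev_logs=${1:-/var/log/ironcore-backup}
  ev_out=${2:-.}
  ev_stamp=$(date -u +%Y%m%dT%H%M%SZ)
  ev_archive="$ev_out/ibs-evidence-$ev_stamp.tar.gz"
  # Archive the directory by name so the bundle unpacks into a single folder.
  tar -czf "$ev_archive" -C "$(dirname "$ev_logs")" "$(basename "$ev_logs")" || return 1
  # Record a checksum so later changes to the bundle are detectable.
  sha256sum "$ev_archive" > "$ev_archive.sha256" || return 1
  echo "$ev_archive"
}
```

Saving the output of `ironcore-backup datastore stats ibs-primary` next to the archive covers the datastore stats named in the same step. Logs collected this way are unsanitised, so sanitise a copy before attaching it to a support case.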

Open a Support Case

For issues that exceed local capability, open a support case with Polystack Technologies:

| Information to Provide | Source |
| --- | --- |
| Platform version | Dashboard footer |
| Affected service version | ironcore-backup --version |
| Task UPIDs | Task panel |
| Logs (sanitised) | /var/log/ironcore-backup/ |
| Datastore inventory | ironcore-backup datastore list |
| Time of incident | Audit log |
| Reproduction steps | Operator narrative |

Email: support@polystack.tech

Next Steps

  • Architecture: Diagnostic context for layered failures
  • Datastores: Datastore configuration and lifecycle
  • Verification and Validation: Verify recoverability after a failure
  • Polystack Support: Open a support case with Polystack Technologies