Overview

This page covers administrator-level Monitoring troubleshooting. For user-facing issues such as alert delivery failures and missing metrics on dashboards, see the Monitoring User Guide Troubleshooting page.
Administrator Access Required: The operations on this page require the admin role. Contact your Polystack administrator if you do not have sufficient permissions.
Prerequisites
  • Administrator credentials with the admin role
  • Access to Monitoring CLI and management interfaces

Common Issues

Cause: Metric labels with unbounded values (e.g., request IDs, user IDs, or ephemeral container names) create millions of unique metric series, degrading query performance and consuming excessive storage.
Diagnosis:
List highest-cardinality metric series
monitoring metric cardinality top --limit 20
Resolution: Drop or relabel high-cardinality labels in the scrape configuration:
Relabel config — drop high-cardinality label
relabel_configs:
  - source_labels: [request_id]
    action: drop
Apply via:
Apply relabel configuration
monitoring target update <TARGET_ID> --relabel-file relabel.yaml
Dropping a label is irreversible: once the rule is applied, the label is absent from all newly ingested metrics and cannot be restored for that data afterward. Consider using labelmap to replace high-cardinality values with aggregate labels instead of dropping them entirely.
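To see why a single unbounded label matters, the following minimal sketch computes an upper bound on the number of series for one metric as the product of the distinct values of each label. The label names and counts are hypothetical and are not taken from any Polystack deployment.

```python
from math import prod


def series_upper_bound(label_values: dict[str, int]) -> int:
    """Upper bound on unique series for one metric.

    Each distinct combination of label values is a separate series, so the
    bound is the product of the number of distinct values per label.
    """
    return prod(label_values.values())


# Hypothetical label cardinalities for a single request counter.
bounded = {"method": 5, "status": 8, "service": 20}
with_user_id = {**bounded, "user_id": 50_000}

print(series_upper_bound(bounded))       # 800
print(series_upper_bound(with_user_id))  # 40000000
```

Real series counts are usually below this bound because not every combination occurs, but the growth factor contributed by the unbounded label is the point of the comparison.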
Cause: Log volume exceeds the collector’s processing capacity, causing a write backlog and delayed delivery to the search index.
Diagnosis:
Check ingestion queue depth
monitoring log ingest-status
Resolution:
  • Reduce log verbosity on high-volume services (for example, turn off DEBUG logging so that the service logs at INFO or WARNING):
    Example: reduce Nova log level
    docker exec nova_api crudini --set /etc/nova/nova.conf DEFAULT debug false
    
  • Increase the log collector worker count in the Monitoring configuration: navigate to Monitor Center > Logging (Collector Settings, admin view)
  • Add a second log collector node through the deployment console for horizontal scaling
A single service logging at DEBUG level can generate more log volume than 100 services at INFO. To identify the top log emitters, run: monitoring log stats top-emitters --last 1h
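The relationship between log volume, worker count, and backlog can be sketched with simple arithmetic. The model below assumes that collector throughput scales linearly with worker count, which may not hold in practice; all rates are hypothetical and the function is illustrative, not part of the Monitoring CLI.

```python
def backlog_after(seconds: float, ingest_rate: float, per_worker_rate: float,
                  workers: int, initial: float = 0.0) -> float:
    """Queue depth (in log lines) after `seconds`, never below zero.

    ingest_rate and per_worker_rate are in lines per second. Capacity is
    assumed to be per_worker_rate * workers (linear scaling assumption).
    """
    capacity = per_worker_rate * workers
    return max(0.0, initial + (ingest_rate - capacity) * seconds)


# Hypothetical numbers: 12,000 lines/s arriving, 5,000 lines/s per worker.
# Two workers fall behind by 2,000 lines/s.
queued = backlog_after(600, ingest_rate=12_000, per_worker_rate=5_000, workers=2)
print(queued)  # 1200000.0

# A third worker gives 3,000 lines/s of headroom, so the backlog drains.
print(backlog_after(600, 12_000, 5_000, workers=3, initial=queued))  # 0.0
```

The same arithmetic shows why reducing verbosity is often the faster fix: lowering the ingest rate below capacity drains the queue without any change to the collector.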
Cause: The scrape target is unreachable because of a firewall block, a stopped service, or an authentication failure.
Diagnosis:
Check specific target health
monitoring target health --target <URL> --verbose
Common causes:
| Symptom | Cause | Resolution |
| --- | --- | --- |
| Connection refused | Service not running on target port | Verify service is running; check port |
| Timeout | Firewall blocking | Add inbound rule for Monitoring collector IP |
| 401 Unauthorized | Invalid auth credentials | Update auth config in target definition |
| 503 Service Unavailable | Service overloaded | Review service health; reduce scrape frequency |
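When the CLI is not at hand, the symptoms in the table can be reproduced with a manual probe from the collector host. The sketch below uses only the Python standard library; the function names and the "No matching symptom" fallback are hypothetical and do not reflect how the Monitoring collector itself classifies failures.

```python
from __future__ import annotations

import socket
import urllib.error
import urllib.request

# Resolutions copied from the table above, keyed by symptom.
RESOLUTIONS = {
    "Connection refused": "Verify service is running; check port",
    "Timeout": "Add inbound rule for Monitoring collector IP",
    "401 Unauthorized": "Update auth config in target definition",
    "503 Service Unavailable": "Review service health; reduce scrape frequency",
}


def symptom_for(error: BaseException | None, status: int | None) -> str:
    """Map a connection error or an HTTP status code to a table symptom."""
    if isinstance(error, ConnectionRefusedError):
        return "Connection refused"
    if isinstance(error, (TimeoutError, socket.timeout)):
        return "Timeout"
    if status == 401:
        return "401 Unauthorized"
    if status == 503:
        return "503 Service Unavailable"
    return "No matching symptom"


def probe(url: str, timeout: float = 5.0) -> str:
    """Issue one GET request against a scrape URL and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return symptom_for(None, resp.status)
    except urllib.error.HTTPError as exc:  # must precede URLError (subclass)
        return symptom_for(None, exc.code)
    except urllib.error.URLError as exc:
        reason = exc.reason
        return symptom_for(reason if isinstance(reason, BaseException) else None, None)
    except (TimeoutError, socket.timeout) as exc:  # read timeouts are not wrapped
        return symptom_for(exc, None)
```

Calling `probe("<URL>")` with the target's scrape URL returns one of the symptom strings, which can then be looked up in `RESOLUTIONS`.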
Cause: Metric volume has exceeded the allocated storage for the metric store. This can occur from high cardinality, insufficient retention management, or unexpected metric bursts.
Diagnosis:
Check metric store disk usage
monitoring storage status
Resolution (in order of preference):
  1. Reduce raw metric retention to free space immediately:
    Reduce raw retention to 15 days (emergency)
    monitoring retention set --type metrics-raw --duration 15d
    
  2. Identify and drop high-cardinality series (see above)
  3. Expand storage on the metric store node through the deployment console
  4. Add a second metric store node for horizontal capacity
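Before changing retention, it can help to estimate how much space the change is likely to free. The sketch below assumes a steady ingest rate, a store that already holds a full retention window, and raw metrics dominating disk usage; the 30-day starting point and the 900 GB figure are hypothetical.

```python
def projected_usage_gb(current_gb: float, current_days: int, new_days: int) -> float:
    """Estimate raw-metric disk usage after a retention change.

    Assumes usage is proportional to the retention window, which holds only
    for a steady ingest rate and a store that is already at full retention.
    """
    if current_days <= 0 or new_days < 0:
        raise ValueError("retention periods must be positive")
    return current_gb * new_days / current_days


# Hypothetical store: 900 GB of raw metrics at 30-day retention.
after = projected_usage_gb(900, current_days=30, new_days=15)
print(after)        # 450.0
print(900 - after)  # 450.0 GB expected to be freed
```

If high-cardinality series are the main driver, the estimate is optimistic for the future: usage will grow back toward the limit until those series are dropped.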
Cause: The scrape target is down, the agent is offline, or the metric name has changed after a software update.
Diagnosis:
  1. Check target health: monitoring target health --target <URL>
  2. Verify agent is active: monitoring agent list --node <HOSTNAME>
  3. Search for the metric by prefix to find renamed metrics:
    Search metrics by prefix
    monitoring metric search --prefix polystack_compute_cpu
    
If the metric was renamed in a recent software update, update dashboard queries and alert rules to use the new metric name.
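Once the prefix search has returned a list of current metric names, likely rename candidates can be shortlisted by string similarity. This is a minimal sketch using the standard library's difflib; every metric name below is hypothetical (only the prefix comes from the example above), and the 0.6 cutoff is difflib's default rather than a tuned value.

```python
import difflib


def rename_candidates(missing: str, current_names: list[str],
                      prefix: str, limit: int = 3) -> list[str]:
    """Return up to `limit` current metric names that resemble `missing`.

    Only names starting with `prefix` are considered, mirroring the
    prefix search in the diagnosis steps.
    """
    pool = [name for name in current_names if name.startswith(prefix)]
    return difflib.get_close_matches(missing, pool, n=limit, cutoff=0.6)


# Hypothetical metric names, as a prefix search might return them.
current = [
    "polystack_compute_cpu_utilization",
    "polystack_compute_cpu_count",
    "polystack_compute_memory_used",
    "polystack_network_rx_bytes",
]

for name in rename_candidates("polystack_compute_cpu_usage", current,
                              prefix="polystack_compute_cpu"):
    print(name)
```

Similarity only narrows the list; confirm the actual rename against the release notes before editing dashboard queries or alert rules.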

Diagnostics Reference

| Issue | Diagnostic Command |
| --- | --- |
| Cardinality | monitoring metric cardinality top --limit 20 |
| Log backlog | monitoring log ingest-status |
| Target DOWN | monitoring target health --verbose |
| Storage usage | monitoring storage status |
| Agent offline | monitoring agent list --status offline |
| Top log emitters | monitoring log stats top-emitters --last 1h |

Next Steps

Agent Configuration

Review and fix agent configuration that may be causing issues

Retention Policies

Adjust retention settings to address storage pressure

Metric Endpoints

Review and fix scrape target configurations

User Guide Troubleshooting

User-facing issues such as alerts not firing and log delays