Troubleshooting

High cardinality causing metric store performance issues

Cause: Metric labels with unbounded values (e.g., request IDs, user IDs, or ephemeral container names) create millions of unique metric series, degrading query performance and consuming excessive storage.

Diagnosis:

List highest-cardinality metric series

monitoring metric cardinality top --limit 20
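To see why one unbounded label dominates the series count, the following minimal sketch counts distinct values per label over a set of series. The sample data and the function name `label_value_counts` are hypothetical and not part of the Monitoring CLI; the label names are taken from the examples in the cause description.

```python
from collections import defaultdict


def label_value_counts(series):
    """Return {label_name: number_of_distinct_values} across all series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical sample: one metric with 3 methods and 1000 request IDs.
series = [
    {"method": method, "request_id": f"req-{i}"}
    for method in ("GET", "POST", "PUT")
    for i in range(1000)
]

counts = label_value_counts(series)
# Each unique combination of label values is a separate series.
total_series = len({tuple(sorted(s.items())) for s in series})

print(counts)        # {'method': 3, 'request_id': 1000}
print(total_series)  # 3000
```

The label with the largest distinct-value count (`request_id` here) is the first candidate for dropping or relabeling.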

Resolution: Drop or relabel high-cardinality labels in the scrape configuration:

Relabel config — drop high-cardinality label

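relabel_configs:
  # Assuming Prometheus-style relabel semantics: labeldrop removes the label itself,
  # whereas drop would discard every series that matches.
  - regex: request_id
    action: labeldrop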

Apply via:

Apply relabel configuration

monitoring target update <TARGET_ID> --relabel-file relabel.yaml

Dropping a label cannot be undone after the fact: the label is absent from every metric ingested after the change and cannot be restored for that data later. Consider using labelmap to replace high-cardinality values with aggregate labels instead of dropping them entirely.
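The trade-off between dropping a label and replacing its values with an aggregate can be illustrated with a small sketch. It models label sets only, not the relabeling engine of the Monitoring service; the `path` label, the `/:id` placeholder, and all function names are assumptions made for the example.

```python
import re


def count_series(series):
    """Number of unique label combinations."""
    return len({tuple(sorted(labels.items())) for labels in series})


def drop_label(series, name):
    """Remove one label from every series."""
    return [{k: v for k, v in labels.items() if k != name} for labels in series]


def normalize_path(series):
    """Replace numeric path segments with a placeholder: /users/42 -> /users/:id."""
    return [
        {**labels, "path": re.sub(r"/\d+", "/:id", labels["path"])}
        for labels in series
    ]


# Hypothetical sample: 2 methods x 2 routes x 500 numeric IDs.
series = [
    {"method": method, "path": f"/{route}/{i}"}
    for method in ("GET", "POST")
    for route in ("users", "orders")
    for i in range(500)
]

print(count_series(series))                      # 2000
print(count_series(drop_label(series, "path")))  # 2
print(count_series(normalize_path(series)))      # 4
```

Dropping `path` collapses everything to one series per method, while normalizing keeps the route visible at a bounded cost of one series per method and route.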

Log ingestion backlog

Cause: Log volume exceeds the collector’s processing capacity, causing a write backlog and delayed delivery to the search index.

Diagnosis:

Check ingestion queue depth

monitoring log ingest-status

Resolution:

1. Reduce log verbosity on high-volume services (set log level to WARNING instead of DEBUG):
   Example: reduce Nova log level
   ```
   docker exec nova_api crudini --set /etc/nova/nova.conf DEFAULT debug false
   ```
2. Increase the log collector worker count in the Monitoring configuration: navigate to Monitor Center > Logging (Collector Settings, admin view).
3. Add a second log collector node through the deployment console for horizontal scaling.

A single high-verbosity service at DEBUG level can generate more log volume than 100 services at INFO. Identify the top log emitters: `monitoring log stats top-emitters --last 1h`
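As a rough illustration of how one noisy service can dominate ingestion, the sketch below ranks services by log line count over a batch of records. The record shape, the sample volumes, and the service names other than `nova_api` are hypothetical; the real command's output format is not assumed here.

```python
from collections import Counter


def top_emitters(records, limit=3):
    """Return [(service, line_count)] sorted by volume, highest first."""
    return Counter(record["service"] for record in records).most_common(limit)


# Hypothetical one-hour sample: one service at DEBUG, two at INFO.
records = (
    [{"service": "nova_api", "level": "DEBUG"}] * 900
    + [{"service": "keystone", "level": "INFO"}] * 60
    + [{"service": "glance_api", "level": "INFO"}] * 40
)

top = top_emitters(records)
share = top[0][1] / len(records)

print(top)    # [('nova_api', 900), ('keystone', 60), ('glance_api', 40)]
print(share)  # 0.9
```

When a single emitter accounts for most of the volume, lowering its verbosity is likely to relieve the backlog faster than scaling the collector.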

Scrape target in DOWN state

Cause: The scrape target is unreachable because of a blocking firewall, a stopped service, or an authentication failure.

Diagnosis:

Check specific target health

monitoring target health --target <URL> --verbose

Common causes:

| Symptom | Cause | Resolution |
| --- | --- | --- |
| Connection refused | Service not running on target port | Verify service is running; check port |
| Timeout | Firewall blocking | Add inbound rule for Monitoring collector IP |
| 401 Unauthorized | Invalid auth credentials | Update auth config in target definition |
| 503 Service Unavailable | Service overloaded | Review service health; reduce scrape frequency |
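When the health command is not available from the host you are on, the symptom table above can be approximated with a manual probe. This is a minimal sketch using only the Python standard library; `classify` and `probe` are illustrative names, not part of the Monitoring CLI, and the mapping simply mirrors the table rather than any logic the collector is known to use.

```python
import socket
import urllib.error
import urllib.request


def classify(error=None, status=None):
    """Map a probe outcome to the likely cause from the symptom table."""
    if isinstance(error, ConnectionRefusedError):
        return "service not running on target port"
    if isinstance(error, (socket.timeout, TimeoutError)):
        return "firewall blocking"
    if status == 401:
        return "invalid auth credentials"
    if status == 503:
        return "service overloaded"
    return "unknown"


def probe(url, timeout=5.0):
    """Fetch the URL once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "healthy"
    except urllib.error.HTTPError as exc:  # must precede URLError (its parent class)
        return classify(status=exc.code)
    except urllib.error.URLError as exc:   # connection-level failures are wrapped here
        return classify(error=exc.reason)
    except (socket.timeout, TimeoutError) as exc:  # timeout while reading the response
        return classify(error=exc)
```

For example, `probe("<URL>")` run from the collector node shows whether the failure is visible from the collector's network position, which is what matters for the firewall case.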

Monitoring metric store disk full

Cause: Metric volume has exceeded the allocated storage for the metric store. This can occur from high cardinality, insufficient retention management, or unexpected metric bursts.

Diagnosis:

Check metric store disk usage

monitoring storage status

Resolution (in order of preference):

1. Reduce raw metric retention to free space immediately:
   Reduce raw retention to 15 days (emergency)
   ```
   monitoring retention set --type metrics-raw --duration 15d
   ```
2. Identify and drop high-cardinality series (see "High cardinality causing metric store performance issues" above).
3. Expand storage on the metric store node through the deployment console.
4. Add a second metric store node for horizontal capacity.
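To judge how much space a retention change is likely to recover, a back-of-the-envelope estimate can help. The sketch below assumes storage scales linearly with series count, sample rate, and retention; the 2 bytes per sample, the 30-second scrape interval, the 30-day starting retention, and the one million series are all placeholder assumptions, not measured values for this metric store.

```python
GIB = 1024 ** 3


def estimate_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough raw-metric footprint: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample


# Hypothetical deployment: 1,000,000 series scraped every 30 seconds.
before = estimate_bytes(1_000_000, 30, retention_days=30)
after = estimate_bytes(1_000_000, 30, retention_days=15)

print(f"{before / GIB:.1f} GiB at 30 days")
print(f"{after / GIB:.1f} GiB at 15 days")
```

Under these assumptions, halving retention halves the footprint (about 161 GiB down to about 80 GiB), and the same formula shows that halving the series count has an equal effect, which is why the cardinality step comes next in the list.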

Dashboard shows 'No data' for a metric

Cause: The scrape target is down, the agent is offline, or the metric name has changed after a software update.

Diagnosis:

1. Check target health: `monitoring target health --target <URL>`
2. Verify the agent is active: `monitoring agent list --node <HOSTNAME>`
3. Search for the metric by prefix to find renamed metrics:
   ```
   monitoring metric search --prefix polystack_compute_cpu
   ```

If the metric was renamed in a recent software update, update dashboard queries and alert rules to use the new metric name.
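If many dashboards and alert rules reference the old name, the rewrite can be scripted. The sketch below replaces a metric name only where it appears as a whole identifier, so longer names that share the prefix are left alone. The metric suffixes, the query syntax, and the function name `rename_metric` are hypothetical; only the `polystack_compute_cpu` prefix comes from the example above.

```python
import re


def rename_metric(queries, old, new):
    """Replace `old` with `new` only where it is a complete metric identifier."""
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_:]){re.escape(old)}(?![A-Za-z0-9_:])"
    )
    # A function replacement avoids any escape processing of `new`.
    return [pattern.sub(lambda _match: new, query) for query in queries]


# Hypothetical dashboard queries.
queries = [
    "avg(polystack_compute_cpu_usage) by (node)",
    "polystack_compute_cpu_usage_total > 0.9",
]

updated = rename_metric(
    queries,
    old="polystack_compute_cpu_usage",
    new="polystack_compute_cpu_utilization",
)

print(updated[0])  # avg(polystack_compute_cpu_utilization) by (node)
print(updated[1])  # polystack_compute_cpu_usage_total > 0.9
```

Review the rewritten queries before saving them, since a rename in a software update may also change units or label names.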

Diagnostics Reference

| Issue | Diagnostic Command |
| --- | --- |
| Cardinality | `monitoring metric cardinality top --limit 20` |
| Log backlog | `monitoring log ingest-status` |
| Target DOWN | `monitoring target health --target <URL> --verbose` |
| Storage usage | `monitoring storage status` |
| Agent offline | `monitoring agent list --status offline` |
| Top log emitters | `monitoring log stats top-emitters --last 1h` |

Next Steps

- Agent Configuration: Review and fix agent configuration that may be causing issues.
- Retention Policies: Adjust retention settings to address storage pressure.
- Metric Endpoints: Review and fix scrape target configurations.
- User Guide Troubleshooting: User-facing issues such as alerts not firing and log delays.
