Overview
This page covers administrator-level Monitoring troubleshooting. For user-facing issues such as alert delivery failures and missing metrics on dashboards, see the Monitoring User Guide Troubleshooting page.
Prerequisites
- Administrator credentials with the admin role
- Access to the Monitoring CLI and management interfaces
Common Issues
High cardinality causing metric store performance issues
Cause: Metric labels with unbounded values (e.g., request IDs, user IDs, or
ephemeral container names) create millions of unique metric series, degrading
query performance and consuming excessive storage.
Diagnosis: List the highest-cardinality metric series:
monitoring metric cardinality top --limit 20
Resolution: Drop or relabel high-cardinality labels in the scrape configuration, then apply the updated relabel configuration.
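To see why a single unbounded label matters, the following minimal sketch counts unique series before and after dropping a label. It illustrates the arithmetic only and is not the Monitoring relabel syntax; the metric name, label names, and the `series_count` helper are all hypothetical.

```python
def series_count(samples, drop_labels=()):
    """Count unique series (metric name + label set) after dropping labels."""
    unique = set()
    for name, labels in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        unique.add((name, kept))
    return len(unique)


# Hypothetical data: one metric, one service, 1000 distinct request IDs.
samples = [
    ("http_requests_total", {"service": "api", "request_id": f"req-{i}"})
    for i in range(1000)
]

print(series_count(samples))                              # 1000 series
print(series_count(samples, drop_labels={"request_id"}))  # 1 series
```

Every distinct value of the unbounded label produces its own series, so dropping that one label collapses the set back to a single series per service.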
Log ingestion backlog
Cause: Log volume exceeds the collector's processing capacity, causing a write
backlog and delayed delivery to the search index.
Diagnosis: Check the ingestion queue depth:
monitoring log ingest-status
Resolution:
- Reduce log verbosity on high-volume services (set the log level to WARNING instead of DEBUG), for example by reducing the Nova log level
- Increase the log collector worker count in the Monitoring configuration: navigate to Monitor Center > Logging (Collector Settings, admin view)
- Add a second log collector node through the deployment console for horizontal scaling
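As a rough way to reason about the resolution options above, a backlog only drains when total collector throughput exceeds the incoming log rate. The sketch below estimates drain time under that simple model; the `drain_seconds` helper and all rates are hypothetical numbers, not values reported by the Monitoring CLI.

```python
def drain_seconds(backlog, ingest_rate, per_worker_rate, workers):
    """Seconds needed to clear the backlog, or None if it never drains."""
    capacity = per_worker_rate * workers
    surplus = capacity - ingest_rate
    if surplus <= 0:
        return None  # collector cannot keep up; the backlog keeps growing
    return backlog / surplus


# Hypothetical: 1,200,000 queued lines, 5000 lines/s arriving,
# each worker handling 1500 lines/s.
print(drain_seconds(1_200_000, 5000, 1500, 3))  # None (4500 < 5000)
print(drain_seconds(1_200_000, 5000, 1500, 4))  # 1200.0 seconds
```

Lowering log verbosity reduces `ingest_rate`, while adding workers or a second collector node raises capacity. Either change widens the surplus.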
Scrape target in DOWN state
Cause: The scrape target is unreachable: firewall blocking, service down,
or authentication failure.
Diagnosis: Check the health of the specific target:
monitoring target health --target <URL>
Common causes:
| Symptom | Cause | Resolution |
|---|---|---|
| Connection refused | Service not running on target port | Verify service is running; check port |
| Timeout | Firewall blocking | Add inbound rule for Monitoring collector IP |
| 401 Unauthorized | Invalid auth credentials | Update auth config in target definition |
| 503 Service Unavailable | Service overloaded | Review service health; reduce scrape frequency |
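When the CLI is not at hand, the symptoms in the table can be reproduced with a plain HTTP probe from the collector host. The following is a minimal sketch using only the Python standard library; it is not how the Monitoring collector itself checks targets, and the `classify` and `probe` helpers are hypothetical.

```python
import socket
import urllib.error
import urllib.request

# Likely cause and resolution per symptom, taken from the table above.
HINTS = {
    "refused": "Service not running on target port: verify the service and check the port",
    "timeout": "Firewall blocking: add an inbound rule for the Monitoring collector IP",
    401: "Invalid auth credentials: update auth config in the target definition",
    503: "Service overloaded: review service health; reduce scrape frequency",
}


def classify(status=None, error=None):
    """Map an HTTP status or a connection error to a hint from the table."""
    if isinstance(error, ConnectionRefusedError):
        return HINTS["refused"]
    if isinstance(error, (socket.timeout, TimeoutError)):
        return HINTS["timeout"]
    if status in HINTS:
        return HINTS[status]
    if status is not None and 200 <= status < 300:
        return "Target responded successfully"
    return "No match in the table; inspect the target manually"


def probe(url, timeout=5.0):
    """Fetch the target URL once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(status=resp.status)
    except urllib.error.HTTPError as exc:   # 4xx/5xx responses
        return classify(status=exc.code)
    except urllib.error.URLError as exc:    # connection-level failures
        return classify(error=exc.reason)
    except (socket.timeout, TimeoutError) as exc:
        return classify(error=exc)
```

Calling `probe("<URL>")` with the scrape target's address returns one of the hints above. Run it from the collector node so that firewall rules apply the same way they do to real scrapes.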
Monitoring metric store disk full
Cause: Metric volume has exceeded the allocated storage for the metric store.
This can occur from high cardinality, insufficient retention management, or
unexpected metric bursts.
Diagnosis: Check metric store disk usage:
monitoring storage status
Resolution (in order of preference):
- Reduce raw metric retention to free space immediately (for example, reduce raw retention to 15 days as an emergency measure)
- Identify and drop high-cardinality series (see "High cardinality causing metric store performance issues" above)
- Expand storage on the metric store node through the deployment console
- Add a second metric store node for horizontal capacity
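To judge how much space a retention change might free, raw storage can be approximated as series count × samples per day × retention days × bytes per sample. The sketch below applies that estimate; the series count, scrape interval, prior 30-day retention, and 2 bytes per sample are all assumptions for illustration, not measured values for the Monitoring metric store.

```python
def raw_storage_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough raw-metric footprint; bytes_per_sample is an assumed average."""
    samples_per_day = 86_400 / scrape_interval_s
    return series * samples_per_day * retention_days * bytes_per_sample


# Hypothetical: 2 million series scraped every 30 seconds.
before = raw_storage_bytes(2_000_000, 30, retention_days=30)
after = raw_storage_bytes(2_000_000, 30, retention_days=15)
print(f"{before / 1e9:.1f} GB -> {after / 1e9:.1f} GB")  # 345.6 GB -> 172.8 GB
```

Because the estimate is linear in both retention and series count, halving retention and halving cardinality each halve raw storage. Retention takes effect first, which is why it sits at the top of the list.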
Dashboard shows 'No data' for a metric
Cause: The scrape target is down, the agent is offline, or the metric name has
changed after a software update.
Diagnosis:
- Check target health: monitoring target health --target <URL>
- Verify the agent is active: monitoring agent list --node <HOSTNAME>
- Search for the metric by prefix to find renamed metrics
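If a list of current metric names is available (for example, exported from a dashboard query), likely renames can be shortlisted offline. This minimal sketch combines a prefix match with fuzzy matching from the standard library's `difflib`; the metric names and the `find_candidates` helper are hypothetical.

```python
import difflib


def find_candidates(old_name, current_names, prefix_segments=2):
    """Return (prefix matches, fuzzy matches) for a metric that went missing."""
    # Use the leading underscore-separated segments as the search prefix.
    prefix = "_".join(old_name.split("_")[:prefix_segments])
    by_prefix = sorted(n for n in current_names if n.startswith(prefix))
    fuzzy = difflib.get_close_matches(old_name, current_names, n=3, cutoff=0.6)
    return by_prefix, fuzzy


# Hypothetical metric names after an upgrade.
names = ["node_cpu_seconds_total", "node_memory_free_bytes", "http_requests_total"]
by_prefix, fuzzy = find_candidates("node_cpu_usage", names)
print(by_prefix)  # ['node_cpu_seconds_total']
print(fuzzy)
```

The prefix match is strict and predictable. The fuzzy list is a looser second opinion for cases where the prefix itself changed.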
Diagnostics Reference
| Issue | Diagnostic Command |
|---|---|
| Cardinality | monitoring metric cardinality top --limit 20 |
| Log backlog | monitoring log ingest-status |
| Target DOWN | monitoring target health --verbose |
| Storage usage | monitoring storage status |
| Agent offline | monitoring agent list --status offline |
| Top log emitters | monitoring log stats top-emitters --last 1h |
Next Steps
- Agent Configuration: Review and fix agent configuration that may be causing issues
- Retention Policies: Adjust retention settings to address storage pressure
- Metric Endpoints: Review and fix scrape target configurations
- User Guide Troubleshooting: User-facing issues such as alerts not firing and log delays
