Managing Metrics Cardinality to Control Observability Spend
Why high-cardinality metrics are the silent budget killer. Label pruning, aggregation rules, and cardinality limits.
Quick take
One high-cardinality label (user_id, request_id) can turn a $500/mo metrics bill into $50K/mo. Audit label sets monthly.
A single high-cardinality label can multiply your time series count — and your bill — by orders of magnitude.
Understanding Cardinality
Every unique metric name + label combination = one time series. http_requests_total{service, endpoint, method, status_code} across 20 services x 50 endpoints x 4 methods x 5 statuses = 20,000 series. Add instance (100): 2,000,000. Add user_id (100K): 200 billion.
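The multiplication above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `series_count` is made up for illustration, and the result is an upper bound, since real systems rarely emit every possible label combination.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Label cardinalities taken from the http_requests_total example in the text.
base = {"service": 20, "endpoint": 50, "method": 4, "status_code": 5}
with_instance = {**base, "instance": 100}
with_user_id = {**with_instance, "user_id": 100_000}

print(series_count(base))           # 20000
print(series_count(with_instance))  # 2000000
print(series_count(with_user_id))   # 200000000000
```

Each added label multiplies the total, which is why a single unbounded label dominates everything else.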
Detection
Warning signs: single metric >100K series, label with >1000 unique values, cardinality growth >10%/week, "too many series" errors.
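As a rough sketch, the first three warning signs can be checked against an exported list of series. The input shape (pairs of metric name and label dict) and the function names `audit` and `grew_too_fast` are assumptions for illustration, not any vendor's export format or API.

```python
from collections import defaultdict

def audit(series, max_series=100_000, max_label_values=1_000):
    """series: iterable of (metric_name, labels_dict). Returns warning strings."""
    series_per_metric = defaultdict(set)
    values_per_label = defaultdict(set)
    for name, labels in series:
        series_per_metric[name].add(frozenset(labels.items()))
        for key, value in labels.items():
            values_per_label[(name, key)].add(value)

    warnings = []
    for name, combos in sorted(series_per_metric.items()):
        if len(combos) > max_series:
            warnings.append(f"{name}: {len(combos)} series")
    for (name, key), values in sorted(values_per_label.items()):
        if len(values) > max_label_values:
            warnings.append(f"{name}{{{key}}}: {len(values)} unique values")
    return warnings

def grew_too_fast(last_week, this_week, limit=0.10):
    """True if the series count grew by more than `limit` week over week."""
    return last_week > 0 and (this_week - last_week) / last_week > limit

# Tiny demo with lowered thresholds so the warnings trigger.
sample = [("http_requests_total", {"method": "GET", "user_id": str(i)}) for i in range(5)]
print(audit(sample, max_series=3, max_label_values=3))
print(grew_too_fast(1000, 1150))  # True: 15% growth
```

The default thresholds mirror the numbers in the list above; tune them to your own baseline.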
Optimization Techniques
Label Pruning
Drop instance from aggregated service metrics = 100x reduction per metric.
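To illustrate the effect, here is a small sketch that sums samples across a dropped label. The `drop_label` helper and the sample data are hypothetical. Summing suits counters; gauges may need an average or maximum instead.

```python
from collections import defaultdict

def drop_label(samples, label):
    """Sum sample values over `label`, leaving one series per remaining label set."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[kept] += value
    return dict(totals)

# One service reporting from 100 instances: 100 series before pruning.
samples = [({"service": "api", "instance": f"host-{i}"}, 1.0) for i in range(100)]
pruned = drop_label(samples, "instance")

print(len(samples), "->", len(pruned))  # 100 -> 1
```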
Recording Rules
Pre-aggregate high-cardinality metrics into lower-cardinality summaries for dashboards and alerts. Drop the raw metric if per-instance granularity isn't needed.
Collection Interval Optimization
A 60-second scrape interval for slow-changing metrics (disk, memory) produces 6x fewer samples than a 10-second interval.
Metric Allow/Deny Lists
Only collect metrics referenced in dashboards or alerts.
Histogram Bucket Pruning
Each histogram bucket is its own time series, so a default of 11 buckets means at least 11 series per label combination. Reduce to the 5-6 buckets that matter.
Prevention
- Before shipping: Does this metric add labels with >100 unique values?
- Monthly: Review new metrics for unexpected growth
- Quarterly: Audit top-50 by cardinality, prune unused labels
- Automation: Cardinality limits/alerts at collector level
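The automation item can be pictured as a per-metric cap on new series. The following is a minimal sketch of the idea only. `SeriesLimiter` is a hypothetical class, not the API of any particular collector, and real collectors expose this as configuration rather than code.

```python
class SeriesLimiter:
    """Accept samples for known series; reject new series once a per-metric cap is hit."""

    def __init__(self, max_series_per_metric):
        self.max_series = max_series_per_metric
        self.seen = {}  # metric name -> set of label sets already admitted

    def admit(self, name, labels):
        key = frozenset(labels.items())
        known = self.seen.setdefault(name, set())
        if key in known:
            return True   # existing series keeps flowing
        if len(known) >= self.max_series:
            return False  # new series would exceed the cap
        known.add(key)
        return True

limiter = SeriesLimiter(max_series_per_metric=2)
results = [limiter.admit("http_requests_total", {"user_id": u}) for u in ["a", "b", "c", "a"]]
print(results)  # [True, True, False, True]
```

Rejecting only new series keeps existing dashboards intact while containing the explosion; pair the cap with an alert so the rejection is noticed.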
Worked example: one label explosion
http_requests_total{route="/users/:id", user_id="..."} with 50K active users → 50K+ series from one metric.
At $5 per 100 custom series (Datadog list): $2,500/mo for a single metric.
Fix: normalize the route template, drop user_id from metric labels, and keep per-user detail in traces (linked via exemplars) instead. Series count → ~200, cost → $10/mo.
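The arithmetic and the route normalization can be sketched as follows. The price constant is the $5 per 100 series figure quoted above. The `normalize_route` regex is an assumption that only handles numeric path segments, so routes with UUIDs or slugs would need their own rules.

```python
import re

PRICE_PER_100_SERIES = 5.0  # USD per month, the figure quoted in the example

def monthly_cost(series):
    """Monthly cost in USD for a given number of series."""
    return series / 100 * PRICE_PER_100_SERIES

def normalize_route(path):
    """Replace purely numeric path segments with a ':id' placeholder."""
    return re.sub(r"/\d+(?=/|$)", "/:id", path)

print(monthly_cost(50_000))             # 2500.0
print(monthly_cost(200))                # 10.0
print(normalize_route("/users/12345"))  # /users/:id
```

Normalizing at instrumentation time means the raw path never becomes a label value in the first place.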
What to do this week
- [ ] Export top 20 metrics by series count from vendor
- [ ] Ban user_id, request_id, session_id in label policy
- [ ] Use recording rules / aggregate views for dashboards
- [ ] Add CI check that rejects new high-cardinality labels
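One possible shape for the CI check is sketched below, assuming metric definitions are available as a mapping from metric name to label names. That input format and the function names are hypothetical; how you extract the definitions depends on your instrumentation library.

```python
BANNED_LABELS = {"user_id", "request_id", "session_id"}

def find_violations(metric_defs):
    """metric_defs: mapping of metric name -> list of label names.
    Returns {metric name: sorted banned labels found} for offending metrics only."""
    violations = {}
    for name, labels in metric_defs.items():
        banned = sorted(BANNED_LABELS & set(labels))
        if banned:
            violations[name] = banned
    return violations

def exit_code(metric_defs):
    """0 if the label policy holds, 1 otherwise (suitable for a CI step)."""
    return 1 if find_violations(metric_defs) else 0

defs = {
    "http_requests_total": ["route", "method", "user_id"],
    "queue_depth": ["queue"],
}
print(find_violations(defs))  # {'http_requests_total': ['user_id']}
print(exit_code(defs))        # 1
```

A name-based deny list only catches known offenders; it complements, rather than replaces, the series-count audit.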