Distributed Tracing: Strategies for Cost-Effective APM
Make distributed tracing affordable at scale. Sampling strategies, span filtering, trace-to-metrics, and the ROI equation.
Quick take
Tail sampling plus native RED metrics beats 100% trace retention for both cost and SLO accuracy — see our sampling series.
A 100-service architecture generating 50K spans/second, billed at $1.70 per million indexed spans, costs about $220K/month on Datadog alone. Sampling isn't optional; it's the core economic strategy.
The Cost Equation
Monthly cost = spans/sec x 86,400 x 30 x cost_per_span x retention_multiplier
At 50K spans/sec with 15-day retention on Datadog (retention_multiplier = 1): 129.6B spans/month x $1.70/M = $220,320/month
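The equation above can be sketched as a small function for plugging in your own numbers. The function name and parameters are illustrative, and the price is expressed per million spans to match how the bill is quoted; a retention multiplier of 1 is assumed for the 15-day baseline.

```python
SECONDS_PER_DAY = 86_400


def monthly_span_cost(spans_per_sec, usd_per_million_spans,
                      retention_multiplier=1.0, days=30):
    """Monthly indexed-span cost in USD (illustrative helper, not a vendor API)."""
    spans_per_month = spans_per_sec * SECONDS_PER_DAY * days
    return spans_per_month / 1_000_000 * usd_per_million_spans * retention_multiplier


# The Datadog example from the text: 50K spans/sec at $1.70 per million spans.
cost = monthly_span_cost(50_000, 1.70)
print(f"${cost:,.0f}/month")  # $220,320/month
```

Doubling `retention_multiplier` or halving `spans_per_sec` shows quickly which lever matters more on a given bill.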
Sampling Strategies
Head sampling: Decide at trace start. Simple, predictable, loses rare events. See When Head Sampling Still Wins.
Tail sampling: Decide after completion based on characteristics. Keeps all interesting traces — errors, slow, high-value services. Full guide: OTel Tail Sampling Policies.
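Hybrid: Head-sample at 50%, then tail-sample the survivors, keeping 10% of normal traces and every error. Effective normal rate: 5%. Error retention: 100% of the traces that pass head sampling, which is 50% of all error traces because the head decision is made before the outcome is known. Cost reduction: ~90%, depending on the share of error traces.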
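The three strategies can be sketched as plain decision functions. This is a minimal illustration, not the OpenTelemetry Collector's implementation: the `Trace` shape, the 1000 ms slow threshold, and the 10% tail default are assumptions chosen so that a 50% head rate yields roughly a 5% effective rate for normal traces.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def hash_fraction(key: str) -> float:
    """Map a string to a stable value in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def head_keep(trace_id: str, rate: float) -> bool:
    # Decided at trace start: only the trace ID is known.
    return hash_fraction(trace_id) < rate


def tail_keep(trace: Trace, slow_ms: float, default_rate: float) -> bool:
    # Decided after completion: errors and slow traces are always kept.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # A separate hash keeps this choice independent of the head decision.
    return hash_fraction("tail:" + trace.trace_id) < default_rate


def hybrid_keep(trace, head_rate=0.5, slow_ms=1000.0, default_rate=0.1):
    # The tail policy only ever sees traces that survived the head decision.
    return (head_keep(trace.trace_id, head_rate)
            and tail_keep(trace, slow_ms, default_rate))


normal = [Trace(f"trace-{i}", False, 20.0) for i in range(20_000)]
kept_fraction = sum(hybrid_keep(t) for t in normal) / len(normal)
print(f"normal traces kept: {kept_fraction:.1%}")
```

The printed fraction should land near 5%. Hashing the trace ID rather than calling a random generator means every service reaches the same head decision for the same trace.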
Span-to-Metrics
Extract RED metrics from 100% of spans before sampling. Accurate request rate, error rate, and latency percentiles even at 1% trace sampling. See Span Metrics Connector.
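The ordering is the whole point: metrics are recorded before the sampler runs. The sketch below illustrates that with an in-memory accumulator. `RedStats` and `process` are hypothetical names, the "keep 1 in 100" rule stands in for a real sampler, and a production pipeline would use histogram buckets rather than storing every duration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedStats:
    requests: int = 0
    errors: int = 0
    durations_ms: list = field(default_factory=list)

    def record(self, duration_ms: float, is_error: bool) -> None:
        self.requests += 1
        self.errors += int(is_error)
        self.durations_ms.append(duration_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def percentile(self, q: int) -> float:
        # Nearest-rank percentile over the recorded durations.
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]


def process(spans, stats, keep_every=100):
    """Record every span into RED stats, then store only 1 in keep_every."""
    stored = []
    for i, (duration_ms, is_error) in enumerate(spans):
        stats.record(duration_ms, is_error)  # metrics see 100% of spans
        if i % keep_every == 0:              # stand-in for the real sampler
            stored.append((duration_ms, is_error))
    return stored


# 100 synthetic spans with durations 1..100 ms; every 20th one is an error.
spans = [(float(ms), ms % 20 == 0) for ms in range(1, 101)]
stats = RedStats()
stored = process(spans, stats)
print(stats.requests, stats.error_rate(), stats.percentile(99), len(stored))
# 100 0.05 99.0 1
```

Only one span reaches storage, yet the request count, error rate, and p99 still reflect all 100.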
Span Filtering
Health check and heartbeat spans can be 20-40% of trace volume. Filter them at the pipeline level.
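A filter of this kind is usually a simple predicate on span attributes. The sketch below assumes spans arrive as dictionaries with a `name` and an `attributes` map; the route list and the "heartbeat" name match are hypothetical and should be replaced with whatever your services actually emit.

```python
# Hypothetical routes; adjust to the endpoints your services expose.
HEALTH_ROUTES = {"/health", "/healthz", "/ready", "/ping"}


def is_noise(span: dict) -> bool:
    """True for health-check and heartbeat spans that carry no debug value."""
    route = span.get("attributes", {}).get("http.route", "")
    name = span.get("name", "").lower()
    return route in HEALTH_ROUTES or "heartbeat" in name


def drop_noise(spans):
    return [s for s in spans if not is_noise(s)]


spans = [
    {"name": "GET /checkout", "attributes": {"http.route": "/checkout"}},
    {"name": "GET /healthz", "attributes": {"http.route": "/healthz"}},
    {"name": "worker.heartbeat", "attributes": {}},
]
kept = drop_noise(spans)
print([s["name"] for s in kept])  # ['GET /checkout']
```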
Vendor Cost Comparison
| Vendor | Monthly cost at 50K spans/sec |
|---|---|
| Datadog | $220K |
| New Relic | $40-80K |
| Grafana Tempo | $25-50K |
| Elastic | $15-40K |
| Jaeger (self-hosted) | $5-15K |
The ROI Framework
Full fidelity is rarely worth it. Suppose keeping 100% of traces instead of 5% saves 10 minutes per incident across 20 incidents/month: that is 200 minutes, or about 3.3 hours of engineering time. Is that worth $200K/month? The answer is almost always the same: tail-sample at 5-10% with span-to-metrics for RED coverage.
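One way to make the comparison concrete is to compute the hourly value at which full retention would break even. The helper below is a sketch using the figures from the paragraph above; the function name is illustrative.

```python
def breakeven_hourly_value(extra_monthly_cost, minutes_saved_per_incident,
                           incidents_per_month):
    """USD per engineering hour at which the extra trace spend pays for itself."""
    hours_saved = minutes_saved_per_incident * incidents_per_month / 60
    return extra_monthly_cost / hours_saved


hours_saved = 10 * 20 / 60  # 200 minutes per month
rate = breakeven_hourly_value(200_000, 10, 20)
print(f"{hours_saved:.1f} hours saved, break-even at ${rate:,.0f}/hour")
# 3.3 hours saved, break-even at $60,000/hour
```

Unless an hour of faster incident response is worth something on that order, the sampled setup wins.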
Worked example: 12K spans/min service mesh
| Strategy | Spans stored/min | Relative APM $ | Debug quality |
|---|---|---|---|
| 100% retention | 12,000 | 100% | Excellent |
| 10% head sample | 1,200 | 10% | Poor for tails |
| Tail: errors + p99 + 2% default | ~900 | 7–8% | Strong |
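The rows above can be reproduced approximately with a few lines of arithmetic. The table does not state the error share, so the 4.5% used here is a hypothetical value chosen to land near the ~900 figure; the sketch also assumes error and slow spans do not overlap and that the p99 rule keeps the slowest 1%.

```python
SPANS_PER_MIN = 12_000

# Assumptions, not figures from the table:
ERROR_SHARE = 0.045   # hypothetical share of spans in error traces
SLOW_SHARE = 0.01     # spans above p99 latency
DEFAULT_RATE = 0.02   # baseline sample of everything else


def stored_per_min(keep_fraction: float) -> float:
    return SPANS_PER_MIN * keep_fraction


full = stored_per_min(1.0)
head_10 = stored_per_min(0.10)
tail_fraction = (ERROR_SHARE + SLOW_SHARE
                 + DEFAULT_RATE * (1 - ERROR_SHARE - SLOW_SHARE))
tail = stored_per_min(tail_fraction)

for label, spans in [("100% retention", full),
                     ("10% head sample", head_10),
                     ("tail policy", tail)]:
    print(f"{label}: {spans:,.0f} spans/min ({spans / full:.1%} of full cost)")
```

With these inputs the tail policy stores a little under 900 spans/min, about 7.4% of the full-retention volume, consistent with the 7–8% range in the table.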
What to do this week
- [ ] Measure spans/min and $/million spans on current bill
- [ ] Enable tail sampling for `status=ERROR` and latency > p99
- [ ] Verify RED metrics come from native counters, not trace store
- [ ] Load-test collector memory after tail policy change
Sources & further reading
- OpenTelemetry tail sampling
- Span metrics connector — fix RED skew after sampling
Related Reading
Use the SignalCost Calculator to model these scenarios with your own numbers.