Distributed Tracing: Strategies for Cost-Effective APM

Make distributed tracing affordable at scale. Sampling strategies, span filtering, trace-to-metrics, and the ROI equation.

Tags: tracing, apm, sampling, distributed-systems

Quick take

Tail sampling plus native RED metrics beats 100% trace retention for both cost and SLO accuracy — see our sampling series.

A 100-service architecture generating 50K spans/second at $1.70 per million indexed spans costs about $220K/month on Datadog alone. Sampling isn't optional; it's the core economic strategy.

The Cost Equation

Monthly cost = spans/sec x 86,400 x 30 x cost_per_span x retention_multiplier

At 50K spans/sec with 15-day retention on Datadog (retention_multiplier taken as 1): 50,000 x 86,400 x 30 = 129.6B spans/month, and 129,600M spans x $1.70/M = $220,320/month
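The cost equation can be sketched as a small Python helper. The function and parameter names are illustrative, not part of any vendor API, and the retention multiplier defaults to 1 because the worked figure applies none.

```python
SECONDS_PER_DAY = 86_400


def monthly_spans(spans_per_sec: float, days: int = 30) -> float:
    """Total spans emitted in a month of `days` days."""
    return spans_per_sec * SECONDS_PER_DAY * days


def monthly_cost(
    spans_per_sec: float,
    usd_per_million_spans: float,
    retention_multiplier: float = 1.0,  # assumption: 1.0 means baseline retention
    days: int = 30,
) -> float:
    """Monthly bill in USD for indexing every span at a per-million price."""
    millions_of_spans = monthly_spans(spans_per_sec, days) / 1_000_000
    return millions_of_spans * usd_per_million_spans * retention_multiplier


if __name__ == "__main__":
    # The article's scenario: 50K spans/sec at $1.70 per million spans.
    print(f"{monthly_spans(50_000):,.0f} spans/month")
    print(f"${monthly_cost(50_000, 1.70):,.0f}/month")
```

Swapping in your own spans/sec and per-million price gives a first-order estimate; real bills usually add ingestion, host, or retention charges that this sketch ignores.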

Sampling Strategies

Head sampling: Decide at trace start. Simple, predictable, loses rare events. See When Head Sampling Still Wins.
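A minimal sketch of a head sampling decision, assuming the only input available at trace start is the trace ID. Hashing the ID keeps the decision deterministic, so every service that sees the same trace reaches the same verdict. This is an illustration, not the algorithm of any particular SDK sampler, and `head_sample` is a hypothetical name.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of traces, deciding from the trace ID alone."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer in [0, 2**64) and compare it
    # against the rate scaled to the same range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Because the decision never looks at status or latency, an error in a dropped trace is simply lost, which is the "loses rare events" trade-off.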

Tail sampling: Decide after completion based on characteristics. Keeps all interesting traces — errors, slow, high-value services. Full guide: OTel Tail Sampling Policies.
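The tail sampling decision can be sketched as a function over a completed trace. The `Trace` fields, the 1,000 ms threshold, the `checkout` service name, and the 5% default rate are all assumptions chosen for illustration; a real collector evaluates comparable policies after buffering every span of the trace.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float
    service: str


def keep_trace(
    trace: Trace,
    slow_ms: float = 1000.0,                       # hypothetical latency threshold
    high_value: frozenset = frozenset({"checkout"}),  # hypothetical service list
    default_rate: float = 0.05,                    # hypothetical baseline rate
) -> bool:
    """Decide after the trace has completed, using its characteristics."""
    if trace.has_error:
        return True  # keep every error
    if trace.duration_ms >= slow_ms:
        return True  # keep every slow trace
    if trace.service in high_value:
        return True  # keep every trace from high-value services
    # Everything else falls through to a deterministic probabilistic sample.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < default_rate * 2**64
```

The order of the checks does not change the outcome here, since any matching keep rule wins; only traces that match none of them are subject to the default rate.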

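Hybrid: Head-sample at 50%, then tail-sample the survivors (10% of normal traces, 100% of errors). Effective normal rate: 5%. Error rate: 100% of the errors that reach the tail sampler; errors in traces already dropped at the head stage are not recoverable. Cost reduction: ~90%.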
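The hybrid arithmetic can be checked with a short sketch. It assumes the tail stage keeps 10% of ordinary traces (which is what turns a 50% head rate into a 5% effective rate) and that a hypothetical 10% of traces match a keep-everything policy such as errors or slow requests; both values are inputs, not measurements.

```python
def hybrid_retention(
    head_rate: float,
    tail_default_rate: float,
    interesting_share: float,  # hypothetical share of traces matching keep-all policies
) -> dict:
    """Effective retention when a head sampler feeds a tail sampler."""
    normal_kept = head_rate * tail_default_rate
    # The tail stage keeps 100% of interesting traces, but it only ever
    # sees the traces that survived head sampling.
    interesting_kept = head_rate
    stored = (1 - interesting_share) * normal_kept + interesting_share * interesting_kept
    return {
        "normal_kept": normal_kept,
        "interesting_kept": interesting_kept,
        "stored_fraction": stored,
        "cost_reduction": 1 - stored,
    }


if __name__ == "__main__":
    for key, value in hybrid_retention(0.5, 0.1, 0.10).items():
        print(f"{key}: {value:.1%}")
```

One consequence the sketch makes visible: a trace dropped by the head sampler never reaches the tail sampler, so the 100% figure for errors applies to traces that survive the head stage. The overall reduction also depends heavily on how many traces count as interesting.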

Span-to-Metrics

Extract RED metrics from 100% of spans before sampling. Accurate request rate, error rate, and latency percentiles even at 1% trace sampling. See Span Metrics Connector.
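A minimal sketch of the idea: compute rate, error ratio, and latency percentiles over every span in a window, before any sampling decision is applied. The `Span` shape and `red_metrics` name are illustrative; production pipelines typically aggregate into histograms rather than sorting raw durations.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    is_error: bool


def red_metrics(spans: list, window_seconds: float) -> dict:
    """Rate, Errors, Duration computed from the full, unsampled span stream."""
    n = len(spans)
    if n == 0:
        return {"rate_per_sec": 0.0, "error_ratio": 0.0, "p50_ms": None, "p99_ms": None}
    durations = sorted(s.duration_ms for s in spans)

    def percentile(p: int) -> float:
        # Nearest-rank percentile using integer arithmetic for the rank.
        rank = max(1, (p * n + 99) // 100)
        return durations[rank - 1]

    errors = sum(1 for s in spans if s.is_error)
    return {
        "rate_per_sec": n / window_seconds,
        "error_ratio": errors / n,
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }
```

Because the metrics are derived before sampling, dropping 99% of traces afterwards changes what can be inspected in detail but not the numbers on the dashboard.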

Span Filtering

Health check and heartbeat spans can be 20-40% of trace volume. Filter them at the pipeline level.
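A filter of this kind might look like the sketch below. The span shape (a dict with `name` and `attributes`), the route list, and the attribute keys `http.route` and `url.path` are assumptions for illustration; check which attributes your instrumentation actually emits.

```python
# Hypothetical list of routes that only serve probes.
DROP_ROUTES = {"/health", "/healthz", "/ready", "/ping"}


def is_noise(span: dict) -> bool:
    """True for health check and heartbeat spans that add volume but little insight."""
    attrs = span.get("attributes", {})
    route = attrs.get("http.route") or attrs.get("url.path") or ""
    name = span.get("name", "").lower()
    return route in DROP_ROUTES or "heartbeat" in name


def filter_spans(spans: list) -> list:
    """Drop noise spans before they reach sampling or export."""
    return [s for s in spans if not is_noise(s)]
```

Filtering works best where the probe is the root of a single-span trace; dropping a span in the middle of a larger trace would leave its children without a parent.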

Vendor Cost Comparison

| Vendor | Monthly cost at 50K spans/sec |
| --- | --- |
| Datadog | $220K |
| New Relic | $40-80K |
| Grafana Tempo | $25-50K |
| Elastic | $15-40K |
| Jaeger (self-hosted) | $5-15K |

The ROI Framework

Full fidelity is rarely worth it. Suppose keeping 100% of traces instead of 5% saves 10 minutes per incident: at 20 incidents/month that is 200 minutes, or about 3.3 hours of engineering time. Is that worth $200K/month? Almost never. The better default is to tail-sample at 5-10% with span-to-metrics for RED coverage.
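The ROI question reduces to a price per engineering hour saved. The sketch below uses the figures from this section as inputs; the function name is illustrative.

```python
def cost_per_hour_saved(
    minutes_saved_per_incident: float,
    incidents_per_month: float,
    extra_monthly_cost_usd: float,
) -> tuple:
    """Return (hours saved per month, USD paid per hour saved)."""
    hours_saved = minutes_saved_per_incident * incidents_per_month / 60
    if hours_saved <= 0:
        raise ValueError("no time saved, so the extra spend has no return")
    return hours_saved, extra_monthly_cost_usd / hours_saved


if __name__ == "__main__":
    hours, usd_per_hour = cost_per_hour_saved(10, 20, 200_000)
    print(f"{hours:.1f} hours saved at ${usd_per_hour:,.0f} per hour")
```

With 10 minutes, 20 incidents, and $200K, the implied price is about $60,000 per engineering hour, which is the comparison to make against what an hour of incident time actually costs your team.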

Worked example: 12K spans/min service mesh

| Strategy | Spans stored/min | Relative APM cost | Debug quality |
| --- | --- | --- | --- |
| 100% retention | 12,000 | 100% | Excellent |
| 10% head sample | 1,200 | 10% | Poor for tails |
| Tail: errors + p99 + 2% default | ~900 | 7–8% | Strong |

Pair tail sampling with the span metrics connector so RED dashboards stay unbiased; see head vs tail sampling.
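The stored-span figures can be approximated with a short sketch. The error share is a hypothetical input that the example does not state; a value of roughly 4-5% lands near the ~900 spans/min shown for the tail policy. Shares are applied at span level for simplicity, although real tail sampling decides per trace.

```python
def head_stored(spans_per_min: float, rate: float) -> float:
    """Spans stored per minute under a flat head sampling rate."""
    return spans_per_min * rate


def tail_stored(
    spans_per_min: float,
    error_share: float,         # hypothetical: fraction of traffic with errors
    slow_share: float = 0.01,   # traffic above p99 is 1% by definition
    default_rate: float = 0.02, # the 2% default policy
) -> float:
    """Spans stored per minute when keeping errors, slow traffic, and a default sample."""
    slow = (1 - error_share) * slow_share
    rest = 1 - error_share - slow
    return spans_per_min * (error_share + slow + rest * default_rate)


if __name__ == "__main__":
    total = 12_000
    stored = tail_stored(total, error_share=0.045)
    print(f"head 10%: {head_stored(total, 0.10):,.0f} spans/min")
    print(f"tail:     {stored:,.0f} spans/min ({stored / total:.1%} of full volume)")
```

The tail policy's cost tracks the error rate closely, so a service with a much higher error share will store far more than 7-8% of its volume.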

What to do this week

  • [ ] Measure spans/min and $/million spans on current bill
  • [ ] Enable tail sampling for status=ERROR and latency > p99
  • [ ] Verify RED metrics come from native counters, not trace store
  • [ ] Load-test collector memory after tail policy change
