
@Logiraptor (Contributor) commented Oct 7, 2025

While working on a fix for #41656, I noticed that many tests are strongly coupled to the tsp internals. This PR moves all of that coupling into a small number of test helpers instead of duplicating it across every test (a sketch of how such helpers might be wired up follows the list). Those helpers are:

  • withPolicies - allows setting the policy evaluators directly, which is useful for fine-grained decision-logic tests
  • withTickerLatency - allows speeding up the decision ticker, which keeps tests fast
  • withTestController - makes the decision tick block until explicitly triggered, which allows deterministic tests of otherwise non-deterministic behavior
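For illustration, here is a rough sketch of how helpers like these might be wired up with Go's functional-options pattern. The option names come from this PR, but the struct and its fields below are assumptions, not the actual implementation:

```go
package example

import "time"

// testOptions collects the knobs the helpers expose; the exact fields are
// illustrative assumptions, not the real processor configuration.
type testOptions struct {
	policyNames     []string // stand-in for the real policy evaluators
	tickerFrequency time.Duration
	blockingTicks   bool
}

// testOption is applied when constructing the tsp under test.
type testOption func(*testOptions)

// withPolicies sets the policy evaluators directly, for fine-grained
// decision-logic tests.
func withPolicies(policyNames []string) testOption {
	return func(o *testOptions) { o.policyNames = policyNames }
}

// withTickerLatency speeds up (or slows down) the decision ticker.
func withTickerLatency(d time.Duration) testOption {
	return func(o *testOptions) { o.tickerFrequency = d }
}

// withTestController makes each decision tick block until explicitly triggered.
func withTestController() testOption {
	return func(o *testOptions) { o.blockingTicks = true }
}
```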

A few tests were largely rewritten so that they test externally observable behavior (a minimal sketch of the resulting test shape follows the list). For example:

  • TestLateArrivingSpanUsesDecisionCache - used to inspect idToTrace state to ensure traces are cleared from memory. Now the tsp is destroyed and recreated with the same cache to ensure the cache is the only state available.
  • TestSequentialTraceMapSize et al - these used to inspect idToTrace state to ensure trace data is properly tracked. Now they trigger sampling decisions and assert that the output is complete.
  • TestSetSamplingPolicy et al - these used to inspect the internal policies state to ensure policies are reloaded. Now they observe only the sampled traces.
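To make that concrete, here is a minimal, self-contained sketch of the resulting test shape. The sampler below is a trivial stand-in, not the real tsp: the point is only that the assertion targets what the downstream sink received after a decision is forced, rather than any internal map.

```go
package example

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/consumer/consumertest"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// fakeSampler buffers traces and forwards them to next when flush is called,
// standing in for the tsp's tick-driven sampling decision.
type fakeSampler struct {
	next    *consumertest.TracesSink
	pending []ptrace.Traces
}

func (f *fakeSampler) ConsumeTraces(_ context.Context, td ptrace.Traces) error {
	f.pending = append(f.pending, td)
	return nil
}

func (f *fakeSampler) flush(ctx context.Context) error {
	for _, td := range f.pending {
		if err := f.next.ConsumeTraces(ctx, td); err != nil {
			return err
		}
	}
	f.pending = nil
	return nil
}

func TestSampledOutputIsObservable(t *testing.T) {
	sink := new(consumertest.TracesSink)
	s := &fakeSampler{next: sink}

	td := ptrace.NewTraces()
	td.ResourceSpans().AppendEmpty().ScopeSpans().AppendEmpty().Spans().AppendEmpty()

	require.NoError(t, s.ConsumeTraces(context.Background(), td))
	require.NoError(t, s.flush(context.Background())) // analogous to triggering one decision tick

	assert.Equal(t, 1, sink.SpanCount()) // assert on emitted output, not on internal state
}
```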

Instead of invoking the TSP tick loop directly

Instead of calling the tsp tick loop directly, I've added a wrapper around the tick loop that blocks each tick until it's explicitly triggered. This still involves some coupling, but since it's encapsulated in one place, it will be easier to change when refactoring the tsp in #41656. I also took the opportunity to consolidate the sync batcher into this controller: every test that drives the tick loop manually also wants a synchronous batcher, for the same reason. Again, this makes it simpler to update later without modifying every test at once.
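A controller along these lines can be built with a simple channel handshake. This is a sketch of the general technique with assumed names, not the PR's actual controller:

```go
package example

// testController wraps a tick loop so that each tick runs only when a test
// asks for it, and the test blocks until that tick has finished.
type testController struct {
	tick chan chan struct{}
}

func newTestController() *testController {
	return &testController{tick: make(chan chan struct{})}
}

// run stands in for the timer-driven loop in tests: it executes doTick once
// per trigger and signals completion back to the waiting test.
func (c *testController) run(doTick func()) {
	for done := range c.tick {
		doTick()
		close(done)
	}
}

// waitForTick triggers a single decision tick and blocks until it completes,
// which makes otherwise time-dependent behavior deterministic.
func (c *testController) waitForTick() {
	done := make(chan struct{})
	c.tick <- done
	<-done
}
```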

Instead of accessing the idToTrace map

In all cases where we were directly inspecting the idToTrace map, we could instead just let the tsp sample those traces and inspect the output.

Instead of setting tickerFrequency directly

I've moved these tests to use withTickerFrequency, which is still a coupling, but since it's delegated to the option func, there's a central place to change how it works.

Other

I included a final commit: 7bfab91. This just removes an unused field.

I've decided to remove two tests:

  • TestConcurrentTraceMapSize - this one writes traces concurrently and then asserts on the size of the idToTrace map. We have overlapping tests that write concurrently and then assert on the complete sampled output, which IMO is a better test.
  • TestDecisionPolicyMetrics - this one calls the internal makeDecision method and then asserts on the returned metrics, which are only used in debug logging. This feels like over-testing to me and makes the tests overly coupled.

@Logiraptor Logiraptor requested a review from a team as a code owner October 7, 2025 18:51
@Logiraptor Logiraptor requested a review from axw October 7, 2025 18:51
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Oct 7, 2025
@github-actions github-actions bot requested a review from portertech October 7, 2025 18:51
@axw (Contributor) commented Oct 8, 2025

@Logiraptor I like the sound of this, but it needs to wait until the minimum Go version is bumped to 1.25. That will happen when Go 1.26.0 comes out in February 2026.

@Logiraptor (Contributor, Author) commented

@axw Yeah, I just realized that 😓 I'll mark this as a draft while I figure out if there's an easy way around it.

@Logiraptor Logiraptor marked this pull request as draft October 8, 2025 01:40
@Logiraptor Logiraptor marked this pull request as ready for review October 8, 2025 17:26
@Logiraptor (Contributor, Author) commented

@axw Figured out an alternative approach. It's not quite as magical, but it accomplishes the goal of reducing how much needs to change when the tsp internals are refactored.

@Logiraptor Logiraptor changed the title [tailsamplingprocessor] [chore] Use synctest instead of tsp internals [tailsamplingprocessor] [chore] Remove test coupling on component internal fields Oct 8, 2025
Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this pull request Oct 8, 2025
This PR builds on open-telemetry#43201 (pending) to parallelize the decision ticker.
It also fixes the open issue in open-telemetry#41656 by moving all state management to
a single goroutine (per shard).

In practice at Grafana Labs, we've seen that the single-threaded nature
of the decision ticker is the primary bottleneck when running the tail
sampling processor at high scale. For example, we have one large
cluster performing around 12M policy evaluations per second while
maintaining a p99 tick latency of around 500ms. To reach this
scale, we need to run ~200 replicas just to get enough parallel threads
of execution. By adding support for parallel decision ticks, we hope to
run fewer, larger replicas, which should reduce the fragmentation
caused by trace ID load balancing.

Another benefit here is to reduce contention on the idToTrace map and generally
simplify reasoning about concurrency by giving each trace a clear
"owner" goroutine. Previously we'd try to do some work inside the calling
goroutine of ConsumeTraces. Instead, new work is passed via channels to
its dedicated shard. By default we run one shard per available core.
Shared state is kept to the minimum required to implement TSP semantics:

1. The trace limiter is shared, which allows the overall memory
   consumption to be reliably capped even in case of uneven shard
   utilization.
2. Both decision caches are shared, which again allows overall memory
   consumption to be capped even with uneven shard utilization.

In practice, shard utilization should be very even, but it's still nice
to have hard guarantees on the total traces in memory, and this allows
us to avoid changing any tests in this PR, which greatly increases my
confidence in the code despite the large change. Ideally the change to
parallel execution is invisible to end users (aside from the much higher throughput).

Sharding makes use of the maphash package with a unique `Seed`, which
should provide even shard distribution even if traceIDs have previously
been sampled downstream. This is better than, e.g., a simple modulo on
the traceID, since traceIDs are not necessarily uniformly distributed
after head sampling.
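As a rough illustration of that scheme (assumed names and shapes, not the code in the referenced commit): a process-unique maphash.Seed maps each trace ID to a shard, and each shard is a single goroutine that owns its slice of state and receives new work over a channel.

```go
package example

import (
	"hash/maphash"
	"runtime"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// shard owns its portion of trace state; only its goroutine touches it.
type shard struct {
	work chan ptrace.Traces
}

func (sh *shard) run() {
	for td := range sh.work {
		_ = td // buffer spans, schedule decisions, etc. for traces owned by this shard
	}
}

type sharder struct {
	seed   maphash.Seed
	shards []*shard
}

func newSharder() *sharder {
	n := runtime.GOMAXPROCS(0) // one shard per available core, as described
	s := &sharder{seed: maphash.MakeSeed(), shards: make([]*shard, n)}
	for i := range s.shards {
		s.shards[i] = &shard{work: make(chan ptrace.Traces, 1)}
		go s.shards[i].run()
	}
	return s
}

// shardFor hashes the trace ID with the process-unique seed, keeping the
// distribution even regardless of how the IDs were generated or head-sampled.
func (s *sharder) shardFor(id pcommon.TraceID) *shard {
	return s.shards[maphash.Bytes(s.seed, id[:])%uint64(len(s.shards))]
}
```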
Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this pull request Oct 9, 2025