
Conversation

pkositsyn
Contributor

Signed-off-by: Pavel Kositsyn [email protected]

Description:
This PR adds workers for processing traces in the events machine. Currently only one thread does the processing, which can often be a bottleneck; some of the problems were discussed in the tracking issue. An illustrative sketch of the idea is included after this description.
@jpkrohling could you take a high-level look at this before I add tests?

Link to tracking Issue: #1710

Testing:
To be added

Documentation:
Changed tags for the metric.
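For readers skimming this thread, here is a minimal, illustrative sketch of the worker-pool idea described above. All names and types here are hypothetical and are not taken from the actual diff in this PR; they only show the shape of "N workers, each with its own queue and goroutine".

```go
package workers

import "sync"

// event is a stand-in for the items the events machine processes.
type event struct {
	traceID [16]byte
}

// eventMachine fans events out to a fixed pool of workers, each backed by its
// own buffered queue and goroutine.
type eventMachine struct {
	workers []chan event
	wg      sync.WaitGroup
}

// newEventMachine starts numWorkers workers that hand every received event to
// the supplied handler.
func newEventMachine(numWorkers, bufferSize int, handle func(event)) *eventMachine {
	em := &eventMachine{workers: make([]chan event, numWorkers)}
	for i := range em.workers {
		ch := make(chan event, bufferSize)
		em.workers[i] = ch
		em.wg.Add(1)
		go func() {
			defer em.wg.Done()
			for ev := range ch {
				handle(ev) // each worker drains its own queue sequentially
			}
		}()
	}
	return em
}

// shutdown closes every queue and waits for in-flight events to drain.
func (em *eventMachine) shutdown() {
	for _, ch := range em.workers {
		close(ch)
	}
	em.wg.Wait()
}
```

The important property is that each worker processes its queue sequentially, so per-trace state needs no cross-goroutine synchronization as long as a given trace always lands on the same worker (see the routing sketch later in this thread).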

@pkositsyn pkositsyn requested a review from jpkrohling as a code owner March 29, 2021 00:24
@pkositsyn pkositsyn requested a review from a team March 29, 2021 00:24
@codecov

codecov bot commented Mar 29, 2021

Codecov Report

Merging #2902 (aec8e73) into main (17399fa) will decrease coverage by 0.28%.
The diff coverage is 92.30%.

❗ Current head aec8e73 differs from pull request most recent head 17751f6. Consider uploading reports for the commit 17751f6 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2902      +/-   ##
==========================================
- Coverage   91.91%   91.63%   -0.29%     
==========================================
  Files         494      477      -17     
  Lines       23939    23324     -615     
==========================================
- Hits        22003    21372     -631     
- Misses       1429     1451      +22     
+ Partials      507      501       -6     
Flag Coverage Δ
integration 68.96% <ø> (+5.27%) ⬆️
unit 90.62% <92.30%> (-0.31%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
processor/groupbytraceprocessor/processor.go 94.64% <86.36%> (-3.62%) ⬇️
processor/groupbytraceprocessor/event.go 94.57% <93.84%> (-1.39%) ⬇️
processor/groupbytraceprocessor/factory.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/storage_memory.go 88.63% <100.00%> (ø)
processor/k8sprocessor/config.go 0.00% <0.00%> (-100.00%) ⬇️
...urcedetectionprocessor/internal/system/metadata.go 0.00% <0.00%> (-57.15%) ⬇️
exporter/elasticexporter/config.go 71.79% <0.00%> (-28.21%) ⬇️
exporter/elasticexporter/factory.go 90.32% <0.00%> (-9.68%) ⬇️
exporter/awsxrayexporter/awsxray.go 79.06% <0.00%> (-7.30%) ⬇️
exporter/newrelicexporter/transformer.go 95.62% <0.00%> (-4.38%) ⬇️
... and 121 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c638b56...17751f6. Read the comment docs.

@jpkrohling
Member

I like the idea of this change and I think your logic is sound, but do you have evidence that this indeed brings performance improvements? I was planning on doing some similar changes as part of the perf tests that I have here:

https://github.com/jpkrohling/groupbytrace-tailbasedsampling-perf-comparison

I won't have time to run the perf tests for the next couple of weeks, but if you could run the perf comparison above and share your findings, it would be a good start!

@github-actions
Contributor

github-actions bot commented Apr 6, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 6, 2021
@jpkrohling jpkrohling removed the Stale label Apr 8, 2021
Member

The function name is now outdated.

Member

This doc isn't accurate anymore.

Member

Perhaps update the doc here to mention consume(trace) instead? Also mention that the batch is split per trace, and that traces are routed to specific workers based on the trace ID
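To make the suggestion above concrete, here is a hedged sketch of what trace-ID-based routing could look like. The FNV hash and every name below are assumptions for illustration only, not the exact code in this PR:

```go
package workers

import "hash/fnv"

// workerIndexForTrace deterministically maps a trace ID to a worker, so every
// span of the same trace is always handled by the same worker goroutine.
func workerIndexForTrace(traceID [16]byte, numWorkers int) int {
	h := fnv.New32a()
	h.Write(traceID[:]) // FNV's Write never fails
	return int(h.Sum32() % uint32(numWorkers))
}
```

With a scheme like this, splitting the incoming batch per trace and dispatching each trace to workers[workerIndexForTrace(id, numWorkers)] keeps all per-trace state confined to a single goroutine.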

Member

I'm a bit worried that we might not have enough tests with NumWorkers higher than 1, so possible race conditions might go undetected.
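One way to reduce that risk, sketched below using the hypothetical helpers from earlier in this thread (the processor's real test setup and config fields may well differ), is to exercise the pool with several workers and concurrent producers so that `go test -race` has a chance to catch data races:

```go
package workers

import (
	"sync"
	"testing"
)

// TestConcurrentConsumeWithMultipleWorkers runs several producers against a
// pool with more than one worker; run it with `go test -race` to surface races.
func TestConcurrentConsumeWithMultipleWorkers(t *testing.T) {
	const numWorkers = 4

	var mu sync.Mutex
	seen := make(map[[16]byte]int)

	em := newEventMachine(numWorkers, 16, func(ev event) {
		mu.Lock()
		seen[ev.traceID]++
		mu.Unlock()
	})

	var producers sync.WaitGroup
	for p := 0; p < 8; p++ {
		producers.Add(1)
		go func(p int) {
			defer producers.Done()
			for i := 0; i < 100; i++ {
				var id [16]byte
				id[0], id[1] = byte(p), byte(i)
				em.workers[workerIndexForTrace(id, numWorkers)] <- event{traceID: id}
			}
		}(p)
	}
	producers.Wait()
	em.shutdown()

	if got := len(seen); got != 8*100 {
		t.Fatalf("expected 800 distinct traces to be processed, got %d", got)
	}
}
```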

@jpkrohling
Member

I like this PR a lot, and I think you mentioned some numbers elsewhere. I think this would be an improvement, but would you be able to publish your numbers here? The code is now more complex than before, and I would like to see the numbers to confirm that the new complexity is worth it.

@pkositsyn
Contributor Author

Benchmark with 1 worker on a 2-core machine (+2 hyperthreads)

BenchmarkConsumeTracesCompleteOnFirstBatch
BenchmarkConsumeTracesCompleteOnFirstBatch-4   	   37278	     28465 ns/op
PASS

Benchmark with 2 workers

BenchmarkConsumeTracesCompleteOnFirstBatch
BenchmarkConsumeTracesCompleteOnFirstBatch-4   	   51589	     22327 ns/op
PASS

I cannot really measure the impact in our production environment, since we still haven't upgraded the collector version due to some dependencies. You can run the benchmark on something more powerful to see the difference.
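For what it's worth, scaling comparisons like the one above are easier to reproduce as sub-benchmarks over the worker count. The sketch below again uses the hypothetical helpers from earlier in this thread rather than the processor's real BenchmarkConsumeTracesCompleteOnFirstBatch:

```go
package workers

import (
	"fmt"
	"testing"
)

// BenchmarkEventMachineWorkers runs the same workload with different worker
// counts so scaling behaviour can be compared in a single `go test -bench` run.
func BenchmarkEventMachineWorkers(b *testing.B) {
	for _, numWorkers := range []int{1, 2, 4, 8} {
		b.Run(fmt.Sprintf("workers=%d", numWorkers), func(b *testing.B) {
			done := make(chan struct{}, b.N)
			em := newEventMachine(numWorkers, 1024, func(event) { done <- struct{}{} })

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				var id [16]byte
				id[0], id[1], id[2], id[3] = byte(i), byte(i>>8), byte(i>>16), byte(i>>24)
				em.workers[workerIndexForTrace(id, numWorkers)] <- event{traceID: id}
			}
			for i := 0; i < b.N; i++ {
				<-done // wait until every event has been handled
			}
			b.StopTimer()
			em.shutdown()
		})
	}
}
```

Running `go test -bench=EventMachineWorkers` then prints one ns/op figure per worker count in a single invocation.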

@jpkrohling
Member

The numbers there are indeed not very exciting. I'll try to run some tests on a bare-metal machine that I have access to, but I will probably only have time to do it next week. If this gets stale, let me know and I'll remove the label (or try making this a draft PR).

@pkositsyn
Contributor Author

Isn't performance one of the main goals of OpenTelemetry? A 20-30% improvement for 2 workers already seems really good, doesn't it? Anyway, it would be nice to test it in a real environment.

@jpkrohling
Member

A 20-30% improvement for 2 workers already seems really good, doesn't it?

22327 ns is 0.023ms. This complexity is saving 0.006ms per operation if my math is right. Not sure it's really worth it, to be honest.

@pkositsyn
Contributor Author

I cannot provide numbers beyond the fact that the current implementation simply cannot serve a throughput of 50-60k input events per second. And having 16, 32, 64, ... cores on the machine doesn't help because of the single worker. This is really about throughput rather than latency.
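As a rough back-of-the-envelope reading of the benchmark numbers quoted earlier (and only under the loose assumption that one benchmark operation corresponds to roughly one incoming event):

```go
package main

import "fmt"

func main() {
	// ns/op figures from the benchmark output earlier in this thread,
	// measured on a 2-core machine.
	const oneWorkerNsPerOp = 28465.0
	const twoWorkersNsPerOp = 22327.0

	fmt.Printf("1 worker:  ~%.0f ops/s\n", 1e9/oneWorkerNsPerOp)  // ≈ 35k
	fmt.Printf("2 workers: ~%.0f ops/s\n", 1e9/twoWorkersNsPerOp) // ≈ 45k
	// Both figures sit below the 50-60k events/s mentioned above, which is the
	// throughput (rather than latency) argument for more workers on bigger machines.
}
```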

@jpkrohling
Member

I cannot provide numbers beyond the fact that the current implementation simply cannot serve a throughput of 50-60k input events per second

Can you give numbers supporting that this change helps increase the throughput for this processor?

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@jpkrohling
Member

Status update: I'm working on the performance comparison between the groupbytrace + policy sampling processor and the tail sampling processor, and I'm picking up the code from this PR to test as well.

@jpkrohling jpkrohling force-pushed the groupbytrace_workers branch from aec8e73 to 839f33b Compare May 3, 2021 15:32
@jpkrohling
Member

I've got good numbers from this PR:

[chart comparing the two test series]

The first series is the current code, the second is this PR with 10 workers. I'm happy with this PR as is.

@pkositsyn, would you like to take another look, just to make sure I didn't do anything wrong during the rebase?


@jpkrohling jpkrohling force-pushed the groupbytrace_workers branch from 839f33b to 6175cda Compare May 6, 2021 15:29
@jpkrohling
Member

Load tests failed; re-running them.

[screenshot of the load test failure]

@jpkrohling
Member

@tigrannajaryan, @bogdandrutu, this PR is ready to be merged from my perspective, if the load test failures are intermittent.

@bogdandrutu bogdandrutu merged commit c179ad8 into open-telemetry:main May 10, 2021
@tigrannajaryan
Member

@jpkrohling for load test failures, please check why they fail. If we are normally running close to the limits (check some previous "main" branch builds), then increase the limits. We aim for the limits to be about 30% above the maximum observed.

@jpkrohling
Member

Will do it whenever I see a failure again!
