
Conversation

pkositsyn
Contributor

Signed-off-by: Pavel Kositsyn [email protected]

Description:
This PR adds workers for processing traces in the events machine. Currently only one thread does the processing, which can often be a bottleneck; some of the problems were discussed in the tracking issue. An illustrative sketch of the idea is included after this description.
@jpkrohling could you take a high-level look at this before I add tests?

Link to tracking Issue: #1710

Testing:
To be added

Documentation:
Changed tags for the metric.
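For readers skimming this thread, here is a minimal, illustrative sketch of the worker-pool idea described above. All names and types here are hypothetical and are not taken from the actual diff in this PR; they only show the shape of "N workers, each with its own queue and goroutine".

```go
package workers

import "sync"

// event is a stand-in for the items the events machine processes.
type event struct {
	traceID [16]byte
}

// eventMachine fans events out to a fixed pool of workers, each backed by its
// own buffered queue and goroutine.
type eventMachine struct {
	workers []chan event
	wg      sync.WaitGroup
}

// newEventMachine starts numWorkers workers that hand every received event to
// the supplied handler.
func newEventMachine(numWorkers, bufferSize int, handle func(event)) *eventMachine {
	em := &eventMachine{workers: make([]chan event, numWorkers)}
	for i := range em.workers {
		ch := make(chan event, bufferSize)
		em.workers[i] = ch
		em.wg.Add(1)
		go func() {
			defer em.wg.Done()
			for ev := range ch {
				handle(ev) // each worker drains its own queue sequentially
			}
		}()
	}
	return em
}

// shutdown closes every queue and waits for in-flight events to drain.
func (em *eventMachine) shutdown() {
	for _, ch := range em.workers {
		close(ch)
	}
	em.wg.Wait()
}
```

The important property is that each worker processes its queue sequentially, so per-trace state needs no cross-goroutine synchronization as long as a given trace always lands on the same worker (see the routing sketch later in this thread).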

@pkositsyn pkositsyn requested a review from jpkrohling as a code owner March 29, 2021 00:24
@pkositsyn pkositsyn requested a review from a team March 29, 2021 00:24
@codecov

codecov bot commented Mar 29, 2021

Codecov Report

Merging #2902 (aec8e73) into main (17399fa) will decrease coverage by 0.28%.
The diff coverage is 92.30%.

❗ Current head aec8e73 differs from pull request most recent head 17751f6. Consider uploading reports for the commit 17751f6 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2902      +/-   ##
==========================================
- Coverage   91.91%   91.63%   -0.29%     
==========================================
  Files         494      477      -17     
  Lines       23939    23324     -615     
==========================================
- Hits        22003    21372     -631     
- Misses       1429     1451      +22     
+ Partials      507      501       -6     
Flag Coverage Δ
integration 68.96% <ø> (+5.27%) ⬆️
unit 90.62% <92.30%> (-0.31%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
processor/groupbytraceprocessor/processor.go 94.64% <86.36%> (-3.62%) ⬇️
processor/groupbytraceprocessor/event.go 94.57% <93.84%> (-1.39%) ⬇️
processor/groupbytraceprocessor/factory.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/storage_memory.go 88.63% <100.00%> (ø)
processor/k8sprocessor/config.go 0.00% <0.00%> (-100.00%) ⬇️
...urcedetectionprocessor/internal/system/metadata.go 0.00% <0.00%> (-57.15%) ⬇️
exporter/elasticexporter/config.go 71.79% <0.00%> (-28.21%) ⬇️
exporter/elasticexporter/factory.go 90.32% <0.00%> (-9.68%) ⬇️
exporter/awsxrayexporter/awsxray.go 79.06% <0.00%> (-7.30%) ⬇️
exporter/newrelicexporter/transformer.go 95.62% <0.00%> (-4.38%) ⬇️
... and 121 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c638b56...17751f6. Read the comment docs.

@jpkrohling
Member

I like the idea of this change and I think your logic is sound, but do you have evidence that this indeed brings performance improvements? I was planning on doing some similar changes as part of the perf tests that I have here:

https://github.com/jpkrohling/groupbytrace-tailbasedsampling-perf-comparison

I won't have time to run the perf tests for the next couple of weeks, but if you could run the perf comparison above and share your findings, it would be a good start!

@github-actions
Contributor

github-actions bot commented Apr 6, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 6, 2021
@jpkrohling jpkrohling removed the Stale label Apr 8, 2021
Member

The function name is now outdated.

Member

This doc isn't accurate anymore.

Member

Perhaps update the doc here to mention consume(trace) instead? Also mention that the batch is split per trace, and that traces are routed to specific workers based on the trace ID
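To make the suggestion above concrete, here is a hedged sketch of what trace-ID-based routing could look like. The FNV hash and every name below are assumptions for illustration only, not the exact code in this PR:

```go
package workers

import "hash/fnv"

// workerIndexForTrace deterministically maps a trace ID to a worker, so every
// span of the same trace is always handled by the same worker goroutine.
func workerIndexForTrace(traceID [16]byte, numWorkers int) int {
	h := fnv.New32a()
	h.Write(traceID[:]) // FNV's Write never fails
	return int(h.Sum32() % uint32(numWorkers))
}
```

With a scheme like this, splitting the incoming batch per trace and dispatching each trace to workers[workerIndexForTrace(id, numWorkers)] keeps all per-trace state confined to a single goroutine.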

Member

I'm a bit worried that we might not have enough tests with NumWorkers higher than 1, so possible race conditions might go undetected.
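One way to reduce that risk, sketched below using the hypothetical helpers from earlier in this thread (the processor's real test setup and config fields may well differ), is to exercise the pool with several workers and concurrent producers so that `go test -race` has a chance to catch data races:

```go
package workers

import (
	"sync"
	"testing"
)

// TestConcurrentConsumeWithMultipleWorkers runs several producers against a
// pool with more than one worker; run it with `go test -race` to surface races.
func TestConcurrentConsumeWithMultipleWorkers(t *testing.T) {
	const numWorkers = 4

	var mu sync.Mutex
	seen := make(map[[16]byte]int)

	em := newEventMachine(numWorkers, 16, func(ev event) {
		mu.Lock()
		seen[ev.traceID]++
		mu.Unlock()
	})

	var producers sync.WaitGroup
	for p := 0; p < 8; p++ {
		producers.Add(1)
		go func(p int) {
			defer producers.Done()
			for i := 0; i < 100; i++ {
				var id [16]byte
				id[0], id[1] = byte(p), byte(i)
				em.workers[workerIndexForTrace(id, numWorkers)] <- event{traceID: id}
			}
		}(p)
	}
	producers.Wait()
	em.shutdown()

	if got := len(seen); got != 8*100 {
		t.Fatalf("expected 800 distinct traces to be processed, got %d", got)
	}
}
```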

@jpkrohling
Member

I like this PR a lot, and I think you mentioned some numbers elsewhere. I think this would be an improvement, but would you be able to publish your numbers here? The code is now more complex than before, and I would like to see the numbers to confirm that the new complexity is worth it.

@pkositsyn
Contributor Author

Benchmark with 1 worker on a 2-core machine (+2 hyperthreads)

BenchmarkConsumeTracesCompleteOnFirstBatch
BenchmarkConsumeTracesCompleteOnFirstBatch-4   	   37278	     28465 ns/op
PASS

Benchmark with 2 workers

BenchmarkConsumeTracesCompleteOnFirstBatch
BenchmarkConsumeTracesCompleteOnFirstBatch-4   	   51589	     22327 ns/op
PASS

I cannot really measure the impact in our production environment, since we still haven't upgraded the collector version due to some dependencies. You can run the benchmark on something more powerful to see the difference.
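For what it's worth, scaling comparisons like the one above are easier to reproduce as sub-benchmarks over the worker count. The sketch below again uses the hypothetical helpers from earlier in this thread rather than the processor's real BenchmarkConsumeTracesCompleteOnFirstBatch:

```go
package workers

import (
	"fmt"
	"testing"
)

// BenchmarkEventMachineWorkers runs the same workload with different worker
// counts so scaling behaviour can be compared in a single `go test -bench` run.
func BenchmarkEventMachineWorkers(b *testing.B) {
	for _, numWorkers := range []int{1, 2, 4, 8} {
		b.Run(fmt.Sprintf("workers=%d", numWorkers), func(b *testing.B) {
			done := make(chan struct{}, b.N)
			em := newEventMachine(numWorkers, 1024, func(event) { done <- struct{}{} })

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				var id [16]byte
				id[0], id[1], id[2], id[3] = byte(i), byte(i>>8), byte(i>>16), byte(i>>24)
				em.workers[workerIndexForTrace(id, numWorkers)] <- event{traceID: id}
			}
			for i := 0; i < b.N; i++ {
				<-done // wait until every event has been handled
			}
			b.StopTimer()
			em.shutdown()
		})
	}
}
```

Running `go test -bench=EventMachineWorkers` then prints one ns/op figure per worker count in a single invocation.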

@jpkrohling
Member

The numbers there are indeed not very exciting. I'll try to run some tests on a bare-metal machine that I have access to, but I will probably only have time to do it next week. If this gets stale, let me know and I'll remove the label (or try making this a draft PR).

@pkositsyn
Contributor Author

Isn't performance one of the main goals of OpenTelemetry? A 20-30% improvement for 2 workers already seems really good, doesn't it? Anyway, it would be nice to test it in a real environment.

@jpkrohling
Member

A 20-30% improvement for 2 workers already seems really good, doesn't it?

22327 ns is 0.023ms. This complexity is saving 0.006ms per operation if my math is right. Not sure it's really worth it, to be honest.

@pkositsyn
Contributor Author

I cannot provide numbers beyond the fact that the current implementation simply cannot serve a throughput of 50-60k input events per second. And having 16, 32, 64, ... cores on the machine doesn't help because of the single worker. This is really about throughput rather than latency.
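As a rough back-of-the-envelope reading of the benchmark numbers quoted earlier (and only under the loose assumption that one benchmark operation corresponds to roughly one incoming event):

```go
package main

import "fmt"

func main() {
	// ns/op figures from the benchmark output earlier in this thread,
	// measured on a 2-core machine.
	const oneWorkerNsPerOp = 28465.0
	const twoWorkersNsPerOp = 22327.0

	fmt.Printf("1 worker:  ~%.0f ops/s\n", 1e9/oneWorkerNsPerOp)  // ≈ 35k
	fmt.Printf("2 workers: ~%.0f ops/s\n", 1e9/twoWorkersNsPerOp) // ≈ 45k
	// Both figures sit below the 50-60k events/s mentioned above, which is the
	// throughput (rather than latency) argument for more workers on bigger machines.
}
```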

@jpkrohling
Member

I cannot provide numbers beyond the fact that the current implementation simply cannot serve a throughput of 50-60k input events per second

Can you give numbers supporting that this change helps increase the throughput for this processor?

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@jpkrohling
Member

Status update: I'm working on the performance comparison between the groupbytrace + policy sampling processor and the tail sampling processor, and I'm picking up the code from this PR to test as well.

@jpkrohling jpkrohling force-pushed the groupbytrace_workers branch from aec8e73 to 839f33b Compare May 3, 2021 15:32
@jpkrohling
Member

I've got good numbers from this PR:

[chart comparing the two test series]

The first series is the current code, the second is this PR with 10 workers. I'm happy with this PR as is.

@pkositsyn, would you like to take another look, just to make sure I didn't do anything wrong during the rebase?


@jpkrohling jpkrohling force-pushed the groupbytrace_workers branch from 839f33b to 6175cda Compare May 6, 2021 15:29
@jpkrohling
Member

Load tests failed; re-running them.

[screenshot of the load test failure]

@jpkrohling
Member

@tigrannajaryan, @bogdandrutu, this PR is ready to be merged from my perspective, if the load test failures are intermittent.

@bogdandrutu bogdandrutu merged commit c179ad8 into open-telemetry:main May 10, 2021
@tigrannajaryan
Member

@jpkrohling for load test failures, please check why they fail. If we are normally running close to the limits (check some previous "main" branch builds), then increase the limits. We aim for the limits to be about 30% above the maximum observed.

@jpkrohling
Member

Will do it whenever I see a failure again!
