Trace ID aware load-balancing exporter #1231
Conversation
Each commit in this PR is a step towards the final solution and can be reviewed individually.
Force-pushed from a319e1e to 156554f
Added a few comments. Thank you for this work @jpkrohling. Very nice.
that either splits the incoming batches into multiple batches, one per trace
Does this exist? might be better just to mention groupbytrace if it's the only existing option.
It will, once #1235 is addressed :-) I think groupbytrace isn't suitable for all cases, although I would certainly recommend it for most cases that I can think of.
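To illustrate the idea, here is a minimal sketch only: the Span struct and splitByTraceID function below are hypothetical and not the collector's pdata API. Splitting a mixed batch into one batch per trace is essentially a group-by on the trace ID:

```go
// Hypothetical, simplified types: the real exporter works on the
// collector's pdata model, not on these structs.
type Span struct {
	TraceID [16]byte
	Name    string
}

// splitByTraceID turns one mixed batch into one batch per trace ID, so a
// trace-ID-aware exporter can route each resulting batch to a single backend.
func splitByTraceID(batch []Span) [][]Span {
	byTrace := map[[16]byte][]Span{}
	for _, s := range batch {
		byTrace[s.TraceID] = append(byTrace[s.TraceID], s)
	}
	out := make([][]Span, 0, len(byTrace))
	for _, spans := range byTrace {
		out = append(out, spans)
	}
	return out
}
```

Whether this is done by groupbytrace or by a future "splitbatch" processor, the result is the same from the exporter's point of view: each batch it receives carries exactly one trace ID and can be routed to a single backend.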
should the example show groupbytrace since it's basically required?
I thought about it, but I think I would prefer to have the future "splitbatch" there instead.
this example is cracking me up. I love how flexible the collector is.
perhaps use a jumphash? I'm pretty sure it will work in this use case, and it's significantly less code.
https://github.com/dgryski/go-jump
We present jump consistent hash, a fast, minimal memory, consistent hash algorithm that can be expressed in about 5 lines of code. In comparison to the algorithm of Karger et al., jump consistent hash requires no storage, is faster, and does a better job of evenly dividing the key space among the buckets and of evenly dividing the workload when the number of buckets changes. Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching.
I'm no expert, but I think this will work if you just order the IPs you're load balancing over in increasing order.
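For reference, a minimal Go sketch of that suggestion, assuming the backend addresses are kept in increasing order as described above and the first 8 bytes of the trace ID are used as the key. JumpHash follows the reference algorithm from the paper; pickBackend and the addresses are hypothetical:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// JumpHash follows the reference algorithm from the Lamping & Veach paper:
// it maps key onto one of numBuckets buckets numbered 0..numBuckets-1.
func JumpHash(key uint64, numBuckets int) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// pickBackend is a hypothetical helper: it orders the backend addresses
// (increasing order, as suggested above) and keys the hash on the first
// 8 bytes of the trace ID.
func pickBackend(traceID [16]byte, backends []string) string {
	sorted := append([]string(nil), backends...)
	sort.Strings(sorted)
	key := binary.BigEndian.Uint64(traceID[:8])
	return sorted[JumpHash(key, len(sorted))]
}

func main() {
	backends := []string{"10.0.0.3:55680", "10.0.0.1:55680", "10.0.0.2:55680"}
	traceID := [16]byte{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(pickBackend(traceID, backends))
}
```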
I saw it the first time in @annanay25's branch and found it highly interesting, especially because it's implemented in 5 lines written by a PerlMonk. I decided to read the paper, and this statement in the abstract intrigued me:
Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching
Reading further, I realized that jumphash might not be a good choice for our use case at all:
This ability to add or remove buckets in any order can be valuable for cache servers where the servers are purely a performance improvement. But for data storage applications, where each bucket represents a different shard of the data, it is not acceptable for shards to simply disappear, because that shard is only place where the corresponding data is stored. Typically this is handled by either making the shards redundant (with several replicas), or being able to quickly recover a replacement, or accepting lower availability for some data. Server death thus does not cause reallocation of data; only capacity changes do. This means that shards can be assigned numerical ids in increasing order as they are added, so that the active bucket ids always fill the range [0, num_buckets)
Basically, we can scale up, but can't scale down.
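A small sketch of why, under the assumptions above (backend addresses kept in sorted order, jump hash over the index range): removing the last backend only remaps the keys that lived on it, but losing a backend in the middle shifts every later address down one slot and remaps far more traces. The addresses and helper below are hypothetical; JumpHash is the same reference algorithm as in the earlier snippet:

```go
package main

import (
	"fmt"
	"math/rand"
)

// JumpHash is the reference jump consistent hash algorithm.
func JumpHash(key uint64, numBuckets int) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// movedKeys counts how many sampled keys land on a different backend
// address after the backend list changes.
func movedKeys(before, after []string, samples int) int {
	r := rand.New(rand.NewSource(1))
	moved := 0
	for i := 0; i < samples; i++ {
		key := r.Uint64()
		if before[JumpHash(key, len(before))] != after[JumpHash(key, len(after))] {
			moved++
		}
	}
	return moved
}

func main() {
	backends := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"}

	// Removing the *last* backend only remaps the keys that lived on it (~20%).
	withoutLast := backends[:4]
	fmt.Println("last removed:  ", movedKeys(backends, withoutLast, 100000))

	// Losing a backend in the *middle* shifts every later address down one
	// slot, so far more keys end up on a different machine (over half here).
	withoutMiddle := append(append([]string(nil), backends[:2]...), backends[3:]...)
	fmt.Println("middle removed:", movedKeys(backends, withoutMiddle, 100000))
}
```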
Ah, thank you. I knew we used jumphash with memcached in Cortex so I thought it might be useful here, but you are correct. We use a statefulset to force identity/order so that new shards are always appended to the end.
Replace double negative (never false) with positive (always true)
Force-pushed from c661a0e to 531ff5d
@bogdandrutu would you like me to split this PR? It would be roughly 8 PRs, if we divide the almost 4k lines of diff by 500. Or would you like to try to review commit by commit, as suggested in open-telemetry/opentelemetry-collector#1952?
Build failure seems to be unrelated:
Force-pushed from 531ff5d to 61f3be2
Looking again at this PR, it isn't that big after all: half of it is changes to
Tests are almost an exact match:
@jpkrohling one of the reasons we ask for a small PR first is to understand whether we want it in the repo. We don't want contributors to spend time on something that may potentially be rejected (not saying this one will be rejected, I did not review it). A first PR with config and README is a great way to get this initial agreement.
To be honest, I just wanted to write a PoC to keep the discussion in open-telemetry/opentelemetry-collector#1724 going. This PR shouldn't be reviewed before we have an agreement there. If this is indeed the direction we want to go, I'll split this PR in this way:
I haven't heard anything on this yet, which I take to mean that there are no objections to the design. I'll start splitting this into 4 PRs; hence, I'm closing this one.
Description: Added a Trace ID aware load-balancing exporter.
Link to tracking Issue: open-telemetry/opentelemetry-collector#1724
Testing: some manual tests (missing Kubernetes e2e tests for DNS resolver) + unit tests
Documentation: readme file