Conversation

jpkrohling
Member

Description: Added a Trace ID aware load-balancing exporter.

Link to tracking Issue: open-telemetry/opentelemetry-collector#1724

Testing: unit tests, plus some manual tests (Kubernetes e2e tests for the DNS resolver are still missing)

Documentation: README file
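
As a rough illustration of the idea (not the exporter's actual code): a trace-ID-aware load balancer keeps every span of a trace on the same backend by deriving the backend from a hash of the trace ID. The helper and endpoint names below are hypothetical, and plain modulo hashing is used only for clarity.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickEndpoint is a hypothetical helper: it hashes the 16-byte trace ID and
// maps it onto one of the configured backends, so every span of a given trace
// is exported to the same collector. Modulo hashing is shown only for clarity;
// it reshuffles most traces whenever the backend list changes, which is why
// the hashing strategy is discussed further down in this conversation.
func pickEndpoint(traceID [16]byte, endpoints []string) string {
	h := fnv.New32a()
	h.Write(traceID[:])
	return endpoints[h.Sum32()%uint32(len(endpoints))]
}

func main() {
	backends := []string{"collector-0:4317", "collector-1:4317", "collector-2:4317"}
	traceID := [16]byte{0xde, 0xad, 0xbe, 0xef}
	// The same trace ID always resolves to the same backend.
	fmt.Println(pickEndpoint(traceID, backends))
}
```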

@jpkrohling
Member Author

Each commit in this PR is a step towards the final solution and can be reviewed individually.

@jpkrohling force-pushed the jpkrohling/1724-loadbalancing-processor branch 2 times, most recently from a319e1e to 156554f on October 9, 2020 14:05
Contributor

@joe-elliott left a comment

Added a few comments. Thank you for this work @jpkrohling. Very nice.

Contributor

> that either splits the incoming batches into multiple batches, one per trace

Does this exist? It might be better to just mention groupbytrace if it's the only existing option.

Member Author

It will, once #1235 is addressed :-) I think groupbytrace isn't suitable for all cases, although I would certainly recommend it for most cases that I can think of.

Contributor

Should the example show groupbytrace, since it's basically required?

Member Author

I thought about it, but I think I would prefer to have the future "splitbatch" there instead.

Contributor

this example is cracking me up. I love how flexible the collector is.

Contributor

Perhaps use a jumphash? I'm pretty sure it will work in this use case, and it's significantly less code.

https://github.com/dgryski/go-jump

> We present jump consistent hash, a fast, minimal memory, consistent hash algorithm that can be expressed in about 5 lines of code. In comparison to the algorithm of Karger et al., jump consistent hash requires no storage, is faster, and does a better job of evenly dividing the key space among the buckets and of evenly dividing the workload when the number of buckets changes. Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching.

I'm no expert, but I think this will work if you just order the IPs you're load balancing over in increasing order.
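
For reference, the algorithm from the paper transcribes to Go in a handful of lines; the sketch below follows the paper's published pseudo-code, with package and function names that are illustrative rather than taken from go-jump.

```go
// Package jumphash holds a sketch transcription of the jump consistent hash
// algorithm described in the paper linked above.
package jumphash

// Hash maps a 64-bit key (for example, a trace ID folded down to a uint64) to
// a bucket index in [0, numBuckets). Direct transcription of the paper's code.
func Hash(key uint64, numBuckets int32) int32 {
	var b int64 = -1
	var j int64
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}
```

Adding a bucket only moves roughly 1/n of the keys into the new bucket, but, as the quoted abstract hints, buckets can only be removed from the end of the sequence.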

Member Author

I first saw it in @annanay25's branch and found it highly interesting, especially because it's implemented in 5 lines written by a PerlMonk. I decided to read the paper, and this statement in the abstract intrigued me:

> Its main limitation is that the buckets must be numbered sequentially, which makes it more suitable for data storage applications than for distributed web caching

Reading further, I realized that jumphash might not be a good choice for our use case at all:

> This ability to add or remove buckets in any order can be valuable for cache servers where the servers are purely a performance improvement. But for data storage applications, where each bucket represents a different shard of the data, it is not acceptable for shards to simply disappear, because that shard is only place where the corresponding data is stored. Typically this is handled by either making the shards redundant (with several replicas), or being able to quickly recover a replacement, or accepting lower availability for some data. Server death thus does not cause reallocation of data; only capacity changes do. This means that shards can be assigned numerical ids in increasing order as they are added, so that the active bucket ids always fill the range [0, num_buckets)

Basically, we can scale up, but can't scale down.
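
For contrast, the usual way to tolerate removing an arbitrary backend is a ring-style consistent hash, where dropping one endpoint only remaps the keys that were assigned to it. A minimal sketch of that idea follows (no virtual nodes, locking, or error handling; all names are illustrative and not the PR's implementation):

```go
// Package hashring is a toy consistent-hash ring, sketched only to contrast
// with jump hash: removing one endpoint only remaps the keys it owned.
package hashring

import (
	"hash/fnv"
	"sort"
)

type Ring struct {
	points    []uint32          // sorted positions on the hash circle
	endpoints map[uint32]string // position -> endpoint
}

func hash32(b []byte) uint32 {
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

// New places each endpoint at one position on the circle (real rings add many
// virtual nodes per endpoint to even out the distribution).
func New(endpoints []string) *Ring {
	r := &Ring{endpoints: map[uint32]string{}}
	for _, e := range endpoints {
		p := hash32([]byte(e))
		r.points = append(r.points, p)
		r.endpoints[p] = e
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Get returns the endpoint responsible for key (e.g. a trace ID): the first
// endpoint clockwise from the key's position on the circle.
func (r *Ring) Get(key []byte) string {
	k := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.endpoints[r.points[i]]
}
```

With this scheme, removing collector-1 leaves the traces assigned to collector-0 and collector-2 untouched, which is exactly what the jump hash limitation above rules out.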

Contributor

Ah, thank you. I knew we used jumphash with memcached in Cortex, so I thought it might be useful here, but you are correct. We use a StatefulSet to force identity/order so that new shards are always appended to the end.

Member Author

Replace the double negative (never false) with a positive (always true)

jpkrohling commented