This repository was archived by the owner on Apr 2, 2024. It is now read-only.

Conversation

niksajakovljevic
Contributor

@niksajakovljevic niksajakovljevic commented Aug 10, 2022

Batching helps achieve better ingest performance, especially if
traces are sent one by one.

A batch is flushed when one of two conditions is met: it reaches the maximum batch size or the batch timeout expires.

This PR also adds async support for traces, meaning the client doesn't need to wait for the DB write. This increases ingest performance with a small risk of data loss.

Added 3 new CLI flags:

  • Two flags to control batching: tracing.max-batch-size and tracing.batch-timeout.
  • One flag for async writes: tracing.async-acks (a sketch of this behavior follows below).
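
As a rough illustration of the async-acks trade-off, here is a minimal, hypothetical Go sketch; the ingestor and writeRequest names are illustrative, not the PR's actual code. With async acks enabled, the handler acknowledges as soon as the request is queued instead of waiting for the DB write.

```go
package ingest

// Hypothetical sketch, not the PR's actual code.
type writeRequest struct {
	spans []string   // placeholder for the trace payload
	done  chan error // receives the result of the eventual DB write
}

type ingestor struct {
	queue     chan writeRequest
	asyncAcks bool // set from the tracing.async-acks flag
}

func (i *ingestor) Ingest(spans []string) error {
	req := writeRequest{spans: spans, done: make(chan error, 1)}
	i.queue <- req // hand off to the batching pipeline
	if i.asyncAcks {
		// Ack immediately: higher throughput, but data queued here is
		// lost if the process dies before the batch reaches the DB.
		return nil
	}
	return <-req.done // synchronous mode: wait for the DB write result
}
```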

@niksajakovljevic niksajakovljevic added the epic/jaeger-grpc-write Jaeger gRPC based write integration label Aug 10, 2022
@niksajakovljevic niksajakovljevic requested a review from a team August 10, 2022 11:44
@niksajakovljevic niksajakovljevic self-assigned this Aug 10, 2022
@niksajakovljevic niksajakovljevic removed the request for review from a team August 10, 2022 11:45
@niksajakovljevic niksajakovljevic force-pushed the niksa/batch-traces branch 5 times, most recently from ca3872f to 6853fd1 on August 10, 2022 13:45
@niksajakovljevic niksajakovljevic marked this pull request as ready for review August 10, 2022 13:46
@niksajakovljevic niksajakovljevic requested review from a team as code owners August 10, 2022 13:46
@niksajakovljevic niksajakovljevic force-pushed the niksa/batch-traces branch 2 times, most recently from a7546c1 to 149872d on August 10, 2022 16:53
@niksajakovljevic
Contributor Author

The local benchmarks that I've added indicate that batching should provide better performance; however, I still need to run a full-blown benchmark with more realistic data loads.

defaultMaxBufferedBatches = 200 // arbitrarily picked; we want to avoid waiting on batches
)

var defaultBatchWorkers = runtime.NumCPU() / 2 // take only half so the other half can be used for writers
Member

Since this is primarily IO-driven, is it possible to size this based on the number of DB connections instead of the number of CPUs?

Contributor Author

If you look closer, batching and writing batches are separated, so creating batches is mostly CPU-driven: batches are only written to a channel and picked up by batch writers.
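
As a rough sketch of that separation (illustrative names, not the PR's code): batcher goroutines do only CPU-bound work and push completed batches onto a buffered channel, while separate writer goroutines drain that channel and perform the IO-bound DB writes.

```go
package ingest

// Sketch under assumed names: runBatcher is CPU-bound, runWriter is IO-bound.
type Batch struct{ items []string }

func runBatcher(in <-chan string, out chan<- *Batch, maxSize int) {
	batch := &Batch{}
	for item := range in {
		batch.items = append(batch.items, item) // CPU-only work
		if len(batch.items) >= maxSize {
			out <- batch // hand off; the batcher never touches the DB
			batch = &Batch{}
		}
	}
}

func runWriter(out <-chan *Batch, writeToDB func(*Batch) error) {
	for batch := range out {
		_ = writeToDB(batch) // the IO-bound work happens only here
	}
}
```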

@niksajakovljevic niksajakovljevic force-pushed the niksa/batch-traces branch 3 times, most recently from 0e7e316 to ed765e8 on August 16, 2022 08:52
Member

@harkishen harkishen left a comment

It would be good to see the perf difference with and without batching in the case of Jaeger ingestion.

Comment on lines 154 to 175
case <-ticker.C:
	batcherSpan.AddEvent("Batch timeout reached")
	if !batch.isEmpty() {
		batch = flushBatch(batch)
	}
Member

This won't work as expected.
It will tick even if we handled the item := <-b.in case and flushed the batch just a moment earlier. This is because the ticker is created only once, so it doesn't account for the most recent event.

We should use <-time.After(config.BatchTimeout) in this case.

Contributor Author

Not sure that's true. If an item is received and the batch is full, we flush it and reset the ticker. If the batch is not full, we can still reach the timeout and flush it, which is what we want. Does it make sense now?
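
A minimal sketch of the loop as described in this reply (illustrative names, simplified from the PR): because the ticker is reset after every size-based flush, a timeout tick can only fire after a full batch-timeout of inactivity since the last flush.

```go
package ingest

import "time"

type batcher struct {
	in      chan string
	out     chan []string
	timeout time.Duration
	maxSize int
}

func (b *batcher) run() {
	ticker := time.NewTicker(b.timeout)
	defer ticker.Stop()
	batch := make([]string, 0, b.maxSize)
	for {
		select {
		case item := <-b.in:
			batch = append(batch, item)
			if len(batch) >= b.maxSize {
				b.out <- batch // size-based flush
				batch = make([]string, 0, b.maxSize)
				ticker.Reset(b.timeout) // restart the timeout window
			}
		case <-ticker.C:
			// Timeout flush: reached only after b.timeout without a size flush.
			if len(batch) > 0 {
				b.out <- batch
				batch = make([]string, 0, b.maxSize)
			}
		}
	}
}
```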

@niksajakovljevic niksajakovljevic force-pushed the niksa/batch-traces branch 3 times, most recently from 2500299 to 07dbd59 on August 16, 2022 15:07
@niksajakovljevic
Contributor Author

> It would be good to see the perf difference with and without batching in the case of Jaeger ingestion.

Yes, I will be publishing benchmark numbers in this PR soon.

Member

@harkishen harkishen left a comment

Overall looks good. However, some changes are needed on the metrics side.

b.bufferedBatches <- batchCp
batcherSpan.AddEvent("Batch sent to buffer")
batcherSpan.AddEvent("New Batch")
return NewBatch(b.config.MaxBatchSize)
Member

No strong suggestion. My aim was more towards readability. But let's ignore this.

@niksajakovljevic niksajakovljevic force-pushed the niksa/batch-traces branch 2 times, most recently from f2ac414 to ba9f7d2 on August 22, 2022 14:13
Contributor

@cevian cevian left a comment

Overall looks good, but needs some changes.

)

const (
defaultReqBufferSize = 100000 // buffer for incoming requests, especially important for async acks
Contributor

Can't these defaults be derived from other defaults?

Contributor Author

Hmm... not sure about this one. My idea was to have enough buffer for spikes. 100K might be too much; in the recent benchmark runs I did, I've never seen this go above 1K. Maybe we can set it to 5K for now and tune as we go. Frankly, fine-tuning these requires time and a lot of benchmark runs... Please let me know if you have a better idea.

Contributor Author

I actually tweaked this to be MaxBatchSize * 3. I also split incoming requests into multiple queues (we have a separate queue for each batcher). I added a separate commit for easier review.
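
A rough sketch of that layout (illustrative names; only the MaxBatchSize * 3 sizing comes from the comment above): each batcher gets its own input queue, sized relative to the configured batch size instead of a fixed 100K.

```go
package ingest

type Config struct {
	Batchers     int // number of batcher goroutines
	MaxBatchSize int
}

// newQueues builds one input queue per batcher; the per-queue capacity is
// derived from MaxBatchSize (MaxBatchSize * 3, per the comment above) so it
// can absorb spikes without one huge shared buffer.
func newQueues(cfg Config) []chan string {
	queues := make([]chan string, cfg.Batchers)
	for i := range queues {
		queues[i] = make(chan string, cfg.MaxBatchSize*3)
	}
	return queues
}
```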

Member

@harkishen harkishen left a comment

One nit remaining, otherwise LGTM 👍🏻

@niksajakovljevic niksajakovljevic force-pushed the niksa/batch-traces branch 6 times, most recently from 24eb8b4 to b6528f6 on August 28, 2022 11:53
// If it's only one span we shard by its TraceID so spans with the same TraceID end up in the same batcher.
// Otherwise we round-robin between batchers.
func (td *Dispatcher) getBatcherIdx(ctx context.Context, traces ptrace.Traces) (int, error) {
numberOfBatchers := td.batcher.config.Batchers
Contributor

I don't love that the dispatcher needs to care about the number of batchers but this may be unavoidable...
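
A hedged sketch of that dispatch policy (simplified types; the real method takes ptrace.Traces, and the hash choice here is an assumption): shard by TraceID when the request carries a single trace, otherwise spread requests round-robin across batchers.

```go
package ingest

import (
	"hash/fnv"
	"sync/atomic"
)

type Dispatcher struct {
	numBatchers int
	rr          atomic.Uint64 // round-robin counter for multi-trace requests
}

func (d *Dispatcher) getBatcherIdx(traceIDs [][16]byte) int {
	if len(traceIDs) == 1 {
		// Single trace: hash the TraceID so all its spans hit one batcher.
		h := fnv.New32a()
		h.Write(traceIDs[0][:])
		return int(h.Sum32()) % d.numBatchers
	}
	// Multiple traces in one request: balance across batchers.
	return int(d.rr.Add(1) % uint64(d.numBatchers))
}
```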

Batching will help to achieve better ingest performance, especially if
traces are sent one by one (which is the case for the Jaeger collector).

A batch is flushed either on timeout or when full.

Adds async support for traces, meaning that the client doesn't
need to wait for the DB write. This increases ingest performance with a small
risk of data loss. New CLI flag `tracing.async-acks` added.

Flags to control batching: `tracing.max-batch-size` and `tracing.batch-timeout`.
Flag to control batch workers: `tracing.batch-workers`
@niksajakovljevic niksajakovljevic merged commit 48cde7c into master Aug 29, 2022
@niksajakovljevic niksajakovljevic deleted the niksa/batch-traces branch August 29, 2022 06:33