Add support for request TotalTimeMs latency histograms #423

hshi2022 · 2022-12-02T22:35:41Z

TICKET = LIKAFKA-47556 Establish Kafka Server SLOs
LI_DESCRIPTION =
This PR adds support for request TotalTimeMs latency histograms so that we can count the number of requests in different latency ranges. The bin boundaries are configurable.

EXIT_CRITERIA = N/A
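
As a rough illustration of the idea in the description, here is a minimal sketch of a histogram with configurable bin boundaries. It is not the `Histogram` class added in this PR: the class name `LatencyHistogramSketch`, the `snapshot` method, and the choice to treat each boundary as the inclusive lower bound of its bin are all assumptions made for the example.

```scala
import java.util.concurrent.atomic.AtomicLongArray

// Hypothetical sketch only; names and bin semantics are assumptions,
// not the implementation in RequestChannel.scala.
class LatencyHistogramSketch(binBoundaries: Array[Int]) {
  require(binBoundaries.nonEmpty, "at least one bin boundary is required")

  // Each boundary is treated as the inclusive lower bound of one bin.
  private val lowerBounds: Array[Int] = binBoundaries.sorted
  private val counts = new AtomicLongArray(lowerBounds.length)

  // Count the value in the bin with the greatest lower bound <= value.
  // Values below the smallest boundary are counted in the first bin.
  def update(valueMs: Int): Unit = {
    val idx = math.max(0, lowerBounds.lastIndexWhere(_ <= valueMs))
    counts.incrementAndGet(idx)
  }

  // Pairs of (lower bound, number of observations) in ascending order.
  def snapshot: Seq[(Int, Long)] =
    lowerBounds.indices.map(i => lowerBounds(i) -> counts.get(i))
}
```

With boundaries `Array(0, 10, 100)`, a latency of 99 ms would be counted in the bin starting at 10 and a latency of 250 ms in the bin starting at 100.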

core/src/test/scala/unit/kafka/network/RequestChannelTest.scala

gitlw · 2022-12-02T23:34:59Z

core/src/main/scala/kafka/network/RequestChannel.scala

    }
  }

+  class Histogram(val metricNamePrefix: String, val metricNamePostfix: String, val binBoundaries: Array[Int]) {


Int -> Long, considering the time is measured in ms. This includes the type of the key in the TreeMap, etc.

I feel Int should be enough for the bin boundary. Int.MaxValue ms is roughly 2.1 * 10^6 seconds, which is about 24 days. For current usage, I do not see a need to set a boundary greater than that. The measured latency could exceed that range in case of a bug, but we do not need to set the boundary to such a big number.

For latency yes, but just in case this Histogram class is used for something else, e.g. size of requests.

If we need a number that big, it probably indicates we should use a different unit. For request size, we currently set the boundaries in MB, and the biggest configured boundary is 100 MB. Even if we change the unit to KB in the future, Int should still be good enough.

You are right that Int can satisfy the current needs.
But I feel this class needs to be a bit more generic and future-proof, just as the yammer metrics library stores its values using Long.
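
For reference, the range arithmetic discussed in this thread can be checked with a few lines of Scala. The object and value names are made up for the example.

```scala
// Hypothetical helper for checking how far an Int millisecond boundary reaches.
object IntBoundaryRange {
  val maxMs: Long = Int.MaxValue.toLong       // 2147483647
  val maxSeconds: Double = maxMs / 1000.0     // roughly 2.1 * 10^6 seconds
  val maxDays: Double = maxSeconds / 86400.0  // a little under 25 days
}
```

A Long boundary would remove this limit at the cost of wider keys, which is the trade-off being debated above.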

core/src/test/scala/unit/kafka/network/SocketServerTest.scala

nickgarvey · 2022-12-05T17:41:16Z

core/src/main/scala/kafka/network/RequestChannel.scala

Can you change the units to KiB instead of MiB? Measuring if there are a lot of small requests is useful to find services with bad batching.

This part of the code is a refactor of the existing code that separates requests by request size, so that we can provide SLOs for requests of different sizes. I did the refactor here so that the latency histogram implementation can reuse it. We currently plan to provide an SLO for requests smaller than 1 MB, and we need to collect one month of data. Changing the buckets now would complicate the metrics used for the latency SLO estimation, so I prefer to delay this change until after the estimation is done. The change should also go in a separate PR.

Sure, no problem to delay this if that's the preference. We do really care about the small requests so please don't lose track of this.

nickgarvey · 2022-12-05T17:42:58Z

core/src/main/scala/kafka/network/RequestChannel.scala

nit if (

Surprised the formatter didn't catch this

nickgarvey · 2022-12-05T18:10:29Z

core/src/main/scala/kafka/network/RequestChannel.scala

I might be wrong, but it seems like if someone is logging an int, that means it goes Double -> Long -> Int. Given the goal above of using an int instead of a long, this seems like it will negate any compute benefit of using an int instead of a long.

Is there a way to provide an overload here? Or maybe updateInt updateDouble etc?

Added more update methods to support different value types.
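
One way such overloads could look is sketched below. This is an assumption-laden illustration, not the code from the PR: the class name `OverloadedHistogramSketch`, the `countAt` accessor, and the bin semantics (each boundary is an inclusive lower bound) are hypothetical.

```scala
import java.util.concurrent.atomic.AtomicLongArray

// Hypothetical sketch of per-type update overloads; not the PR's implementation.
class OverloadedHistogramSketch(binBoundaries: Array[Int]) {
  require(binBoundaries.nonEmpty, "at least one bin boundary is required")

  private val lowerBounds: Array[Int] = binBoundaries.sorted
  private val counts = new AtomicLongArray(lowerBounds.length)

  // Int values are compared as Int, with no detour through Double or Long.
  def update(value: Int): Unit = record(lowerBounds.lastIndexWhere(_ <= value))

  // Long values are compared as Long, so large values are not narrowed.
  def update(value: Long): Unit = record(lowerBounds.lastIndexWhere(_ <= value))

  // Double values are compared as Double, so fractions are not truncated first.
  def update(value: Double): Unit = record(lowerBounds.lastIndexWhere(_ <= value))

  // Values below the smallest boundary (index -1) go into the first bin.
  private def record(idx: Int): Unit = {
    counts.incrementAndGet(math.max(0, idx))
    ()
  }

  def countAt(binIndex: Int): Long = counts.get(binIndex)
}
```

A caller passing an Int literal, such as `update(5)`, resolves to the Int overload, so no conversion happens on that path.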

nickgarvey · 2022-12-07T00:55:27Z

LGTM

This should be a direct fix-up to - `e99f81c` Add support for request TotalTimeMs latency histograms (#423)

Add support for request TotalTimeMs latency histograms.

e6bd220

hshi2022 requested a review from gitlw December 2, 2022 22:37

gitlw requested changes Dec 2, 2022


gitlw reviewed Dec 3, 2022


core/src/test/scala/unit/kafka/network/SocketServerTest.scala (outdated)

gitlw approved these changes Dec 5, 2022


nickgarvey reviewed Dec 5, 2022


Address comments

a6a8ad7

hshi2022 force-pushed the request_count_histgram_20221128 branch from f63b6c8 to a6a8ad7 December 7, 2022 04:22

hshi2022 merged commit e99f81c into linkedin:3.0-li Dec 7, 2022

lmr3796 mentioned this pull request Feb 21, 2023

[LI-FIXUP] Fix checkstyle warning RequestChannelTest.scala #439

Merged

3 tasks

lmr3796 added a commit that referenced this pull request Feb 21, 2023

[LI-FIXUP] Fix checkstyle warning RequestChannelTest.scala (#439)

dec729a

This should be a direct fix-up to - `e99f81c` Add support for request TotalTimeMs latency histograms (#423)


3 participants