Skip to content

Conversation

@hshi2022
Copy link

@hshi2022 hshi2022 commented Dec 2, 2022

TICKET = LIKAFKA-47556 Establish Kafka Server SLOs
LI_DESCRIPTION =
This PR is to add support for request TotalTimeMs latency histograms such that we could counter the number of requests in different latency ranges. The bin boundaries are configurable.

EXIT_CRITERIA = N/A

@hshi2022 hshi2022 requested a review from gitlw December 2, 2022 22:37
}
}

class Histogram(val metricNamePrefix: String, val metricNamePostfix: String, val binBoundaries: Array[Int]) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int -> Long considering the time is measured in ms, including the type of the key in the TreeMap etc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel Int should be enough for the bin boundary. int.max ms is roughly 2 * 10^6 seconds, which is 23 days. For currently usage, I do not see a need that we would set the boundary greater than 23 days. The measured latency has a potential to be greater than 23 days in case of bug, but we do not need to set the boundary to this big number.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For latency yes, but just in case this Histogram class is used for something else, e.g. size of requests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need that big number, which probably indicates we should use a different unit. For request size, we set the boundary with the unit of Mb currently, and the biggest configured boundary is 100 Mb. Even if we change it Kb in future, Int should also be good enough.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that Int can satisfy the current needs.
But I feel this class needs to be a bit more generic and future-proof, just as how the yammer metrics library stores the values using Long.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the units to KiB instead of MiB? Measuring if there are a lot of small requests is useful to find services with bad batching.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of code is a refactor of previous code on separating requests by request size such that we could provide SLOs for request of different size. I did the refactor here such that the latency histogram implementation could reuse this part of code. Currently we plan to provide SLO for request with request size smaller than 1MB and we need to collect one month of data. If we change the buckets now, it would make the metrics more complicated for the latency SLO estimation. So I prefer to delay this change to after the estimation of latency SLO. And the change should be in a separate PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no problem to delay this if that's the preference. We do really care about the small requests so please don't lose track of this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit if (

Surprised the formatter didn't catch this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but it seems like if someone is logging an int, that means it goes Double -> Long -> Int. Given the goal above of using an int instead of a long, this seems like it will negate any compute benefit of using an int instead of a long.

Is there a way to provide an overload here? Or maybe updateInt updateDouble etc?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more update methods to support different value types

@nickgarvey
Copy link

LGTM

@hshi2022 hshi2022 force-pushed the request_count_histgram_20221128 branch from f63b6c8 to a6a8ad7 Compare December 7, 2022 04:22
@hshi2022 hshi2022 merged commit e99f81c into linkedin:3.0-li Dec 7, 2022
lmr3796 added a commit that referenced this pull request Feb 21, 2023
This should be a direct fix-up to 
- `e99f81c` Add support for request TotalTimeMs latency histograms (#423)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants