-
Notifications
You must be signed in to change notification settings - Fork 63
Add support for request TotalTimeMs latency histograms #423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for request TotalTimeMs latency histograms #423
Conversation
| } | ||
| } | ||
|
|
||
| class Histogram(val metricNamePrefix: String, val metricNamePostfix: String, val binBoundaries: Array[Int]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int -> Long considering the time is measured in ms, including the type of the key in the TreeMap etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel Int should be enough for the bin boundary. int.max ms is roughly 2 * 10^6 seconds, which is 23 days. For currently usage, I do not see a need that we would set the boundary greater than 23 days. The measured latency has a potential to be greater than 23 days in case of bug, but we do not need to set the boundary to this big number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For latency yes, but just in case this Histogram class is used for something else, e.g. size of requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need that big number, which probably indicates we should use a different unit. For request size, we set the boundary with the unit of Mb currently, and the biggest configured boundary is 100 Mb. Even if we change it Kb in future, Int should also be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that Int can satisfy the current needs.
But I feel this class needs to be a bit more generic and future-proof, just as how the yammer metrics library stores the values using Long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change the units to KiB instead of MiB? Measuring if there are a lot of small requests is useful to find services with bad batching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of code is a refactor of previous code on separating requests by request size such that we could provide SLOs for request of different size. I did the refactor here such that the latency histogram implementation could reuse this part of code. Currently we plan to provide SLO for request with request size smaller than 1MB and we need to collect one month of data. If we change the buckets now, it would make the metrics more complicated for the latency SLO estimation. So I prefer to delay this change to after the estimation of latency SLO. And the change should be in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, no problem to delay this if that's the preference. We do really care about the small requests so please don't lose track of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit if (
Surprised the formatter didn't catch this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong, but it seems like if someone is logging an int, that means it goes Double -> Long -> Int. Given the goal above of using an int instead of a long, this seems like it will negate any compute benefit of using an int instead of a long.
Is there a way to provide an overload here? Or maybe updateInt updateDouble etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more update methods to support different value types
|
LGTM |
f63b6c8 to
a6a8ad7
Compare
This should be a direct fix-up to - `e99f81c` Add support for request TotalTimeMs latency histograms (#423)
TICKET = LIKAFKA-47556 Establish Kafka Server SLOs
LI_DESCRIPTION =
This PR is to add support for request TotalTimeMs latency histograms such that we could counter the number of requests in different latency ranges. The bin boundaries are configurable.
EXIT_CRITERIA = N/A