-
Notifications
You must be signed in to change notification settings - Fork 347
HyperLogLogSeries #295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HyperLogLogSeries #295
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we ditch the bits here and take it as a parameter in toHLL? If we do this, we can make this class extend AnyVal, which makes it allocate less. It does not seem that the bits parameter is ever used except to go to an HLL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... thinking about this more, it seems like a minor issue. Storing the extra bit could be used to do a runtime check if we try to combine two that are not from the same sized monoid. That said. Perhaps we should assert they are equal in the plus below.
|
Looks fine if you add a test using the standard Monoidlaws. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spacing is a bit off here
|
I'm curious why this was done with 'since' semantics instead of 'atTime' semantics. I'm guessing there's a particular use case in mind? |
|
I thought it was worth noting that "any window into the past" is not quite accurate. Specifically, we get 'since' semantics and little more. The reasoning being that, if we subtract two of these from each other (estimate2 - estimate1) the result is not a count of the uniques between timestamp1 and timestamp2, but rather the uniques that happened after timestamp1 but not after timestamp2. Having said that, it's not as if I expect there to be a way to compactly do arbitrary windows. If there's value to having HLLSeries I wouldn't object, but I would like to have it more clearly documented. |
|
Yes, that's correct - by "window into the past" I meant "window starting at any point in the past and ending now", but certainly that should be made more explicit. As an example use case, imagine a realtime dashboard which shows "uniques in last hour, last day, last week, last month". It's a trivial variation to do the |
|
Sounds good. I suppose many folks do time series anyway, so this will give them a simpler way to compute things from arbitrary starting points. |
1490ef2 to
6efc590
Compare
|
Ping! Anything still missing here before y'all would be comfortable merging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need more docs here. I have to read carefully to remind myself of the intent and use of this code. Can you give some comments that will show up in the scaladoc?
|
Righto, comments added. ^^ |
|
Thanks for the comments and the usage notes! |
This data structure can produce a HyperLogLog counter for any window into the past, using a constant factor more space than HyperLogLog.
For each hash bucket, rather than keeping a single max RhoW value, it keeps every RhoW value it has seen, and the max timestamp where it saw that value. This allows it to reconstruct an HLL as it would be had it started at zero at any given point in the past, and seen the same updates this structure has seen.
The usage is like this: