Skip to content

Conversation

@avibryant
Copy link
Contributor

This data structure can produce a HyperLogLog counter for any window into the past, using a constant factor more space than HyperLogLog.

For each hash bucket, rather than keeping a single max RhoW value, it keeps every RhoW value it has seen, and the max timestamp where it saw that value. This allows it to reconstruct an HLL as it would be had it started at zero at any given point in the past, and seen the same updates this structure has seen.

The usage is like this:

vall hllSeriesMonoid = new HyperLogLogSeriesMonoid(bits)

val series = examples
                       .map{case (bytes, timestamp) => hllSeriesMonoid.create(bytes, timestamp)}
                       .reduce{hllSeriesMonoid.plus(_,_)}

val estimate1 = series.since(timestamp1).toHLL.estimatedSize
val estimate2 = series.since(timestamp2).toHLL.estimatedSize

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we ditch the bits here and take it as a parameter in toHLL? If we do this, we can make this class extend AnyVal, which makes it allocate less. It does not seem that the bits parameter is ever used except to go to an HLL.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... thinking about this more, it seems like a minor issue. Storing the extra bit could be used to do a runtime check if we try to combine two that are not from the same sized monoid. That said. Perhaps we should assert they are equal in the plus below.

@johnynek
Copy link
Collaborator

Looks fine if you add a test using the standard Monoidlaws.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spacing is a bit off here

@jnievelt
Copy link
Contributor

I'm curious why this was done with 'since' semantics instead of 'atTime' semantics. I'm guessing there's a particular use case in mind?

@jnievelt
Copy link
Contributor

I thought it was worth noting that "any window into the past" is not quite accurate. Specifically, we get 'since' semantics and little more.

The reasoning being that, if we subtract two of these from each other (estimate2 - estimate1) the result is not a count of the uniques between timestamp1 and timestamp2, but rather the uniques that happened after timestamp1 but not after timestamp2.

Having said that, it's not as if I expect there to be a way to compactly do arbitrary windows. If there's value to having HLLSeries I wouldn't object, but I would like to have it more clearly documented.

@avibryant
Copy link
Contributor Author

Yes, that's correct - by "window into the past" I meant "window starting at any point in the past and ending now", but certainly that should be made more explicit.

As an example use case, imagine a realtime dashboard which shows "uniques in last hour, last day, last week, last month".

It's a trivial variation to do the atTime semantics (which is to say, "window starting at time 0 and going to any point in the past"), but I think that will be less frequently desired - the one use case I can think of is some kind of "cumulative unique visitors over time" plot showing the growth of the use of something. If you think it's important to support either, though, I'll try to come up with a reasonable API for doing so.

@jnievelt
Copy link
Contributor

Sounds good. I suppose many folks do time series anyway, so this will give them a simpler way to compute things from arbitrary starting points.

@DanielleSucher
Copy link
Collaborator

Ping! Anything still missing here before y'all would be comfortable merging?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more docs here. I have to read carefully to remind myself of the intent and use of this code. Can you give some comments that will show up in the scaladoc?

@DanielleSucher
Copy link
Collaborator

Righto, comments added. ^^

johnynek added a commit that referenced this pull request May 8, 2015
@johnynek johnynek merged commit c9c29b3 into twitter:develop May 8, 2015
@johnynek
Copy link
Collaborator

johnynek commented May 8, 2015

Thanks for the comments and the usage notes!

jnievelt pushed a commit that referenced this pull request May 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants