
Conversation

@avibryant
Contributor

Adds a CMSItems[K] implementation of CMS[K] which, like CMSItem[K], is exact, but for a small set of keys instead of just one.

You can add a CMSItem to a CMSItem, or a CMSItems to a CMSItems, and get a CMSItems back. Past a certain threshold (currently set to 100 items), this transitions to a CMSInstance[K].
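To illustrate the shape of the idea with a self-contained toy (the real code uses Algebird's CMS[K], CMSItems[K] and CMSInstance[K] with the hashed count tables; the names and the Map-based stand-in below are just for illustration):

// Toy sketch, not the PR code: keep an exact list of keys until it grows past
// the threshold, then spill into the hashed structure.
sealed trait ToyCms[K] { def +(k: K): ToyCms[K] }

case class ExactItems[K](items: List[K], maxExact: Int = 100) extends ToyCms[K] {
  def +(k: K): ToyCms[K] = {
    val next = k :: items
    if (next.size > maxExact)
      // stand-in for building a real CMSInstance from the retained keys
      HashedSketch(next.groupBy(identity).map { case (key, ks) => key -> ks.size.toLong })
    else copy(items = next)
  }
}

// Stand-in for CMSInstance: in the real thing this is the hashed count table.
case class HashedSketch[K](counts: Map[K, Long]) extends ToyCms[K] {
  def +(k: K): ToyCms[K] = copy(counts = counts.updated(k, counts.getOrElse(k, 0L) + 1L))
}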

I haven't updated the TopN or TopPct variants, just the basic CMS.

I have a specific need for this, but it seems like a generally useful optimization. It could probably entirely replace CMSItem but I wouldn't want to do that without benchmarking first, which I have not done.

Thoughts @johnynek @miguno?

@miguno

miguno commented Dec 31, 2014

Here's a quick list of comments to get the discussion started before we hit 2015. :-)

  • Just to confirm: the purpose of this optimization is to decrease the time of CMS operations (notably +/++ as well as frequency estimation) and to decrease the space requirements of CMS for small data streams, right? That is, the goal is to improve performance, not to improve counting accuracy. (For "common" values of eps and delta, the frequency estimates of a CMS over a small data stream should already be fairly accurate because the probability of collisions is very low; see the sizing sketch at the end of this comment.)
  • This change would require changes to how CMS is currently tested (src/test/), because most of the current tests use only small input data streams. After this change those tests would mostly exercise the exact-counting code and would rarely trigger the frequency estimation, which is the foundation of CMS. So we would need tests that verify the estimation code (these exist but would require updating) as well as tests that verify the exact counting code (new).
  • I think we should benchmark this change (via algebird-caliper) regardless of whether we'd replace CMSItem or not, e.g. because @johnynek wants to benchmark all the things.
  • I agree that it might be worth investigating whether we could replace the "1-cardinality" CMSItem with the proposed "n-cardinality" CMSItems.
  • Given that MaxSize == 100 is arbitrary, should this become a configurable parameter via CMSParams, with a default (arbitrary) value of 100? This may also be one way to make this optimization configurable and, potentially, to disable it, which could be useful when updating the CMS test suite ("do NOT use this optimization because we do not want to be forced to make up 100+x fake items in this test just to trigger proper frequency estimation behind the scenes").
  • Is there a particular reason the new code uses the more specific List[K] instead of Seq[K]?

Edit: Reversed my first question above. I think you're actually looking for a performance improvement and not for more accurate counting.
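For reference, the eps/delta sizing I'm alluding to in the first bullet, as a rough sketch (these are the standard CMS formulas; the variable names are mine):

import scala.math.{E, ceil, log}

// width = ceil(e / eps), depth = ceil(ln(1 / delta)); a frequency estimate then
// overshoots the true count by at most eps * totalCount with probability >= 1 - delta.
def width(eps: Double): Int = ceil(E / eps).toInt
def depth(delta: Double): Int = ceil(log(1.0 / delta)).toInt

// e.g. eps = 0.005, delta = 1e-7 gives a 544 x 17 table, and for a stream of
// only 100 items the additive error bound is 0.005 * 100 = 0.5, i.e. the
// estimates are effectively exact already.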

@miguno

miguno commented Dec 31, 2014

One more comment for clarification:

Adds a CMSItems[K] implementation of CMS[K] which, like CMSItem[K], is exact, but for a small set of keys instead of just one.

Could you clarify whether you want to limit (currently via MaxSize) based on the size of the input data stream, the number of distinct keys in the input data stream, or totalCount (which AFAIU is not the size of the input stream but the L1 norm; see my comment about totalCount in the PR)?

I think this would also help to verify (also via tests) that the various operations in CMSItems (but also the changed pattern match in CMSInstance) are defined correctly.
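To make the distinction concrete (illustrative numbers only):

val updates = List(("a", 1L), ("a", 1L), ("b", 4L))
val numUpdates   = updates.size                    // 3 update operations
val distinctKeys = updates.map(_._1).distinct.size // 2 distinct keys
val totalCount   = updates.map(_._2).sum           // 6 = L1 norm of the count vector
// A limit of 100 means something quite different depending on which of these it bounds.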

@miguno

miguno commented Dec 31, 2014

FYI: Reversed my first question in my second comment above. I think you're actually looking for a performance improvement and not for more accurate counting.

@miguno

miguno commented Dec 31, 2014

I have a specific need for this, but it seems like a generally useful optimization.

Could you elaborate a bit on the use case you're imagining to put this PR into context? (e.g. to understand when or why the magic number of 100 would be a good pick)

Edit: I think one area of application is micro-batching of input elements, e.g. when using something like Spark Streaming or Storm Trident. Here, you could add/count up to 100 new items at a time.

In contrast, "core" Storm does not use micro-batching, and unless you manually add batching in some way you see only a single new input element (tuple) at a time. Here, the optimization essentially means that we can call the + operation on a fresh CMS a hundred times before transitioning from the proposed CMSItems to the normal, hashing-based CMSInstance behind the scenes. I'd say that many use cases in practice involve sliding windows of some kind, and in that case the benefit of the optimization depends on the rate of incoming elements per time interval (e.g. if you use sliding windows of 1 min in core Storm and see 1M items during a minute, then this optimization may not help much).

Does that make sense?

@ianoc
Collaborator

ianoc commented Dec 31, 2014

We follow this strategy in one of our big pipelines internally, using an approach in the style of https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Eventually.scala#L38: an external monoid wrapping the approximate structure. It saves quite a bit of space, but it is also far faster in the super common case of the tail of an aggregation (sets/maps with 5-10 elements).

+1 to building it directly into these monoids, since right now only the one pipeline where we wired it all in gets this sort of gain. And +1 to benchmarking too.
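Very roughly, the pattern looks like this (a simplified stand-in, not the actual EventuallyMonoid code; the "approximate" side is faked with another Map just to keep the sketch self-contained):

type Exact  = Map[String, Long]  // cheap, exact representation for the tail
type Approx = Map[String, Long]  // stand-in for the approximate sketch (CMS, HLL, ...)

def merge(x: Map[String, Long], y: Map[String, Long]): Map[String, Long] =
  y.foldLeft(x) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

// Left = already converted to the approximate structure, Right = still exact.
def plus(a: Either[Approx, Exact], b: Either[Approx, Exact], maxExact: Int): Either[Approx, Exact] =
  (a, b) match {
    case (Right(x), Right(y)) =>
      val m = merge(x, y)
      if (m.size <= maxExact) Right(m) else Left(m) // spill once it gets too big
    case (Left(x),  Right(y)) => Left(merge(x, y))
    case (Right(x), Left(y))  => Left(merge(x, y))
    case (Left(x),  Left(y))  => Left(merge(x, y))
  }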

Collaborator

Creating a bunch of these is going to be O(N^2), since we take an O(N) hit on each one. Should we make it a lazy val or a def?

Contributor Author

We can explicitly track it.


(And as I said above)

Double-checking: IIRC totalCount in Algebird's CMS is the equivalent of the L1-norm of the counting vector in the CMS paper. If that is true, then totalCount should not be items.size but:

override val totalCount: Long = items.sum

Or am I mistaken?

Contributor Author

Items is not a sparse counting vector (although that might make sense), but simply a list of keys, which may contain duplicates. So items.size is the correct way to get the totalCount - this will be equivalent to the L1-norm of the counting vector once it's transformed into that.
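For example:

val items = List("a", "a", "b")   // keys, possibly with duplicates
val countVector = items.groupBy(identity).map { case (k, ks) => k -> ks.size.toLong }
// countVector == Map("a" -> 2L, "b" -> 1L)
items.size              // 3
countVector.values.sum  // 3, the L1 norm of the counting vector: same number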

@avibryant
Contributor Author

@miguno actually, your first assumption was correct: my direct motivation for this is accuracy, not performance. Specifically, I'm testing a CMS-based algorithm for clustering [0], which uses innerProduct as the distance function, and I'd like the innerProduct between a large CMS and a small CMS to be as accurate as possible. (Using the CMSInstance representation, the innerProducts are exact with small • small, but they're less accurate than they could be with large • small).

However, the extra performance boost seems worthwhile in itself.

[0] http://charuaggarwal.net/cskrevise.pdf
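For concreteness, the exact inner product I'd like on the small side is just the dot product of the exact count vectors (sketch only, not the PR code):

def exactInnerProduct[K](a: Map[K, Long], b: Map[K, Long]): Long =
  a.iterator.map { case (k, ca) => ca * b.getOrElse(k, 0L) }.sum

exactInnerProduct(Map("x" -> 2L, "y" -> 1L), Map("x" -> 3L))  // 6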

@avibryant
Contributor Author

@miguno to address some other comments:

  • by all means let's benchmark, is there a HOWTO for algebird-caliper somewhere? If not I'm sure I can dive in and figure it out.
  • agreed that making MaxItems part of params makes sense, and is a good way to deal with the testing issue, as you point out. I'll do that.
  • the reason to use List instead of Seq is just to have some certainty about the performance (not that I've made great use of this, but see the various other comments people have made about what operations to use to combine lists).
  • as @ianoc points out, one of the benefits of this approach is that in many cases you have a large number of CMS objects but only a few of them get large enough to actually need the sketch; for the others there's a great space and time benefit to keeping things exact. I think that (and my specific accuracy concern about innerProduct) is more important than the ++ performance.

@miguno

miguno commented Jan 1, 2015

by all means let's benchmark, is there a HOWTO for algebird-caliper somewhere? If not I'm sure I can dive in and figure it out.

We do not have a full HOWTO for algebird-caliper yet but I did write a brief README:
https://github.com/twitter/algebird/blob/develop/algebird-caliper/README.md

See CMSBenchmark for an example. I'd suggest first running this benchmark from the CLI (see the README) to see what it actually does, and then taking a look at the benchmark code to see how it's done. It's essentially a simple, parameterizable micro-benchmark ("call this function N times with these combinations of parameters").

Note: algebird-caliper uses the sbt plugin cappi to run caliper benchmarks. Cappi does not yet support the latest caliper version (1.0-beta-1), so we must use the feature set provided by (IIRC) caliper 0.6-something.
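In spirit, the harness does something like this for you (a plain Scala sketch of the idea, not the caliper API; CMSBenchmark is the real reference):

// What the parameterizable micro-benchmark boils down to: run a function `reps`
// times for each combination of parameters and report the average cost per rep.
def benchmark[A](reps: Int)(body: => A): Double = {
  val start = System.nanoTime()
  var i = 0
  while (i < reps) { body; i += 1 }
  (System.nanoTime() - start).toDouble / reps
}

for {
  eps <- List(0.1, 0.005)
  n   <- List(100, 1000)
} {
  val nanosPerRep = benchmark(reps = 1000) {
    (1 to n).foldLeft(Set.empty[Int])(_ + _) // stand-in for summing n items into a CMS
  }
  println(s"eps=$eps n=$n -> $nanosPerRep ns/rep")
}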

HTH!

@avibryant
Contributor Author

@johnynek, I think this latest commit addresses your performance concerns.
@miguno I moved MaxItems to CMSParams and called it maxExactCount. The API for configuring CMS and its variations has a very large surface area, and I'm not super happy with how this new param is integrated: the only reasonable way to set it explicitly is to create a new CMSMonoid; otherwise you'll generally get the default. OTOH, I'm not very happy with CMSParams in general: if we're pulling out a params object, why aren't we exposing it in the API for constructing these monoids, aggregators, etc.? But I don't want to get embroiled in a major refactoring of that right now, so this seems like the minimal-impact way to start.
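Roughly, the new knob looks like this (just a sketch with the existing CMSParams fields elided; only the new parameter and its default are shown):

// Sketch, not the actual case class: existing CMS parameters elided.
case class CMSParamsSketch(
  eps: Double,
  delta: Double,
  maxExactCount: Int = 100  // threshold for switching from exact CMSItems to hashed CMSInstance
)

Setting it to 0 would presumably disable the optimization entirely, which also covers the testing concern you raised.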

I'll look at benchmarks next.


It looks as if count is not being taken into account. Shouldn't +(x, 4) result in the count of x being increased by 4?

This was the reason why I asked the totalCount question about items.size vs. items.sum (the L1-norm) -- size would be equivalent to sum only if we restrict + to +(x, 1).
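One way to make +(x, n) exact too would be to keep counts instead of duplicated keys, something like (sketch only, not a concrete proposal for the PR's data structure):

case class ExactCounts[K](counts: Map[K, Long]) {
  def +(k: K, n: Long): ExactCounts[K] =
    ExactCounts(counts.updated(k, counts.getOrElse(k, 0L) + n))
  def totalCount: Long = counts.values.sum  // the L1 norm, regardless of how + was called
}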

Contributor Author

Thanks, good catch.

@miguno

miguno commented Jan 1, 2015

CMS configuration issues: yeah, I definitely agree that the current way of configuring CMS could be better. The main reason (on my side) this hasn't really changed from the original code was to limit the changes introduced when we turned CMS[Long] into CMS[K] (see #345 and #354), which is a similar "I'm not touching it right now" reason to the one you mentioned. At least the latest CMS version is a bit clearer about which parameters are needed where, which should help when deciding how to improve the setup. Also, CMS up to this point has had the built-in assumption that the widths and depths of CMS instances are identical when they are merged via ++ (the only "free" parameter has been related to heavy hitters, and that parameter is not a "core" CMS parameter), so from what I understand we haven't yet needed to fiddle with CMS parameters after instances have been constructed.

If you have any ideas on how to simplify and/or improve the CMS configuration, I'd definitely appreciate them. IMHO the configuration and usability aspects of this code are as important as the magic that's going on behind the scenes, so personally I wouldn't mind spending some time to make sure the new functionality you have been working on is made available to the user in a better fashion -- but that could be a separate issue/PR, as you point out.


Can we use a const for 0 to explain the magic number, just like we do for EPS, DELTA, and SEED?

@avibryant
Contributor Author

Some initial benchmarks: it makes relatively little difference when summing the first 1000 integers (though using a size-100 CMSItems gives a modest performance gain):

[info]       1000        PlusOfFirstNIntegersWithItemsCms   0.1        0  2787.1 =
[info]       1000        PlusOfFirstNIntegersWithItemsCms   0.1       10  2822.8 =
[info]       1000        PlusOfFirstNIntegersWithItemsCms   0.1      100  2539.4 =
[info]       1000        PlusOfFirstNIntegersWithItemsCms 0.005        0  3653.2 ==
[info]       1000        PlusOfFirstNIntegersWithItemsCms 0.005       10  3656.0 ==
[info]       1000        PlusOfFirstNIntegersWithItemsCms 0.005      100  3337.8 ==

Not too surprisingly, if you're only summing 100 integers, and you have a size 100 CMSItems, it makes a huge difference:

[info]        100        PlusOfFirstNIntegersWithItemsCms   0.1        0   278.5 =
[info]        100        PlusOfFirstNIntegersWithItemsCms   0.1       10   257.7 =
[info]        100        PlusOfFirstNIntegersWithItemsCms   0.1      100    14.5 =
[info]        100        PlusOfFirstNIntegersWithItemsCms 0.005        0   436.0 =
[info]        100        PlusOfFirstNIntegersWithItemsCms 0.005       10   411.5 =
[info]        100        PlusOfFirstNIntegersWithItemsCms 0.005      100    13.8 =

@johnynek
Collaborator

johnynek commented Jan 2, 2015

So the speed win here is likely when you have a significant number of CMS instances that are going to be very small, which is common in heavy-tailed distributions. In that case, you cut the small instance cost down by a factor of 10 or more. Seems worth it for that.

Also, for the accuracy win you mention in dot-product.

Collaborator

  1. don't we want the method reverse_::: which combines these two to save cost?
  2. don't we want the smaller one on the left so we push the fewer items onto the bigger?


(2) don't we want the smaller one on the left so we push the fewer items onto the bigger?

The code is correctly pushing the smaller list onto the larger list:

scala> val larger = List(1,2,3,4,5,6,7,8,9)
larger: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> val smaller = List(10,20,30)
smaller: List[Int] = List(10, 20, 30)

scala> larger.reverse_:::(smaller)
res4: List[Int] = List(30, 20, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9)

(1) don't we want the method reverse_::: which combines these two to save cost?

And I was surprised to learn that apparently a List's reverse_::: is faster than ++ or ::: because it leverages structural sharing (its loop does ::) whereas e.g. ::: creates a ListBuffer via ++= and prependToList. I didn't expect that, though I haven't yet verified my understanding with a quick caliper benchmark. :-)
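So the combine step boils down to something like this (sketch; note that List.size is O(n), so you'd want to track the sizes explicitly rather than call .size here):

// Push the shorter list onto the longer one; reverse_::: only allocates new
// cells for the prefix it prepends, so the cost is O(min(a.size, b.size)).
def mergeKeys[K](a: List[K], b: List[K]): List[K] =
  if (a.size <= b.size) b.reverse_:::(a) else a.reverse_:::(b)

mergeKeys(List(10, 20, 30), List(1, 2, 3, 4, 5, 6, 7, 8, 9))
// List(30, 20, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9)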

@johnynek
Collaborator

johnynek commented Jan 2, 2015

Can you commit the benchmarks?

@miguno

miguno commented Jan 4, 2015

@johnynek wrote:

So the speed win here is likely when you have a significant number of CMS instances that are going to be very small, which is common in heavy-tailed distributions. In that case, you cut the small instance cost down by a factor of 10 or more. Seems worth it for that.

First let me say that I agree, too, but I have a question: Could you elaborate on why heavy-tailed distributions would commonly imply a significant (large) number of CMS instances? Doesn't this depend on how you "partition" the input data stream into CMS instances?

@ianoc made a similar point about how this optimization would help aggregate the long tail. I don't yet see how you are segregating the long tail from the heavy hitters: if they are lumped together, the number of elements quickly exceeds 100, or you need so many partitions N that 1/N-th of the data is <= 100. I assume you are talking about scenarios where one is not using e.g. random shuffling to spread the processing load?

@ianoc
Collaborator

ianoc commented Jan 5, 2015

@miguno the primary wins for space + perf:
-> Adding Maps with few elements is much faster than combining the CMS structures
-> Those Maps also wind up smaller when serialized than the CMS structures

And regarding the reference here, I think what I was getting at is this:

Often in a dataset we have a primary key in a space of O(10^8), say user IDs. Then we have a secondary index in a space of O(10^12), say URLs. So we might use a setup like this to track the main referring URLs for all users, with a type of (Long, CMS[String]). For most users, within a given time window before we age them out, there will probably be very few of these URLs. So moving to a CMSInstance at size == 2 is very expensive if the size never gets bigger than, say, 5.

@johnynek
Collaborator

We're about to publish. Is this ready to merge?

@avibryant can you merge with develop? @miguno what do you say? does this have your 👍 ?

@avibryant
Contributor Author

IMO let's get this into the next one. When I have time, I think I want to take another, more invasive run at it that makes me happier with the result.

Unless someone really wants this in the meantime?

@miguno

miguno commented Jan 16, 2015

I remember there's at least one functional issue remaining where the new code only supports incrementing by 1 but not by N (see my comment above). We also don't have the tests yet AFAIK. So as Avi said maybe we should wait one more round and give it some more work?

@johnynek
Collaborator

@avibryant you want to get this synced and try to get it in 0.10?

@ianoc
Collaborator

ianoc commented Aug 4, 2015

Sorry, my bad: git foo on the cmd line broke stuff and closed all of these.

@ianoc reopened this Aug 4, 2015
@johnynek
Collaborator

I think this was superseded by #464

@johnynek closed this Nov 25, 2015