
Conversation

@avibryant
Contributor

Adds a CMSItems[K] implementation of CMS[K] which, like CMSItem[K], is exact, but for a small set of keys instead of just one.

You can add a CMSItem to a CMSItem, or a CMSItems to a CMSItems, and get a CMSItems back. Past a certain threshold (currently set to 100 items), this transitions to a CMSInstance[K].
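To illustrate the shape of the idea with a self-contained toy (the real code uses Algebird's CMS[K], CMSItems[K] and CMSInstance[K] with the hashed count tables; the names and the Map-based stand-in below are just for illustration):

// Toy sketch, not the PR code: keep an exact list of keys until it grows past
// the threshold, then spill into the hashed structure.
sealed trait ToyCms[K] { def +(k: K): ToyCms[K] }

case class ExactItems[K](items: List[K], maxExact: Int = 100) extends ToyCms[K] {
  def +(k: K): ToyCms[K] = {
    val next = k :: items
    if (next.size > maxExact)
      // stand-in for building a real CMSInstance from the retained keys
      HashedSketch(next.groupBy(identity).map { case (key, ks) => key -> ks.size.toLong })
    else copy(items = next)
  }
}

// Stand-in for CMSInstance: in the real thing this is the hashed count table.
case class HashedSketch[K](counts: Map[K, Long]) extends ToyCms[K] {
  def +(k: K): ToyCms[K] = copy(counts = counts.updated(k, counts.getOrElse(k, 0L) + 1L))
}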

I haven't updated the TopN or TopPct variants, just the basic CMS.

I have a specific need for this, but it seems like a generally useful optimization. It could probably entirely replace CMSItem but I wouldn't want to do that without benchmarking first, which I have not done.

Thoughts @johnynek @miguno?

@miguno

miguno commented Dec 31, 2014

Here's a quick list of comments to get the discussion started before we hit 2015. :-)

  • Just to confirm: the purpose of this optimization is to decrease the time of CMS operations (notably +/++ as well as frequency estimation) and to decrease the space requirements of CMS for small data streams, right? That is, the goal is to improve performance, not to improve counting accuracy. (For "common" values of eps and delta, the frequency estimates of a CMS over a small data stream should already be fairly accurate because the probability of collisions is very low; see the sizing sketch at the end of this comment.)
  • This change would require changes to how CMS is currently tested (src/test/), because most of the current tests use only small input data streams. After this change those tests would mostly exercise the exact-counting code and would rarely trigger the frequency estimation, which is the foundation of CMS. So we would need tests that verify the estimation code (these exist but would require updating) as well as tests that verify the exact counting code (new).
  • I think we should benchmark this change (via algebird-caliper) regardless of whether we'd replace CMSItem or not, e.g. because @johnynek wants to benchmark all the things.
  • I agree that it might be worth investigating whether we could replace the "1-cardinality" CMSItem with the proposed "n-cardinality" CMSItems.
  • Given that MaxSize == 100 is arbitrary, should this become a configurable parameter via CMSParams, with a default (arbitrary) value of 100? This may also be one way to make this optimization configurable and, potentially, to disable it, which could be useful when updating the CMS test suite ("do NOT use this optimization because we do not want to be forced to make up 100+x fake items in this test just to trigger proper frequency estimation behind the scenes").
  • Is there a particular reason the new code uses the more specific List[K] instead of Seq[K]?

Edit: Reversed my first question above. I think you're actually looking for a performance improvement and not for more accurate counting.
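For reference, the eps/delta sizing I'm alluding to in the first bullet, as a rough sketch (these are the standard CMS formulas; the variable names are mine):

import scala.math.{E, ceil, log}

// width = ceil(e / eps), depth = ceil(ln(1 / delta)); a frequency estimate then
// overshoots the true count by at most eps * totalCount with probability >= 1 - delta.
def width(eps: Double): Int = ceil(E / eps).toInt
def depth(delta: Double): Int = ceil(log(1.0 / delta)).toInt

// e.g. eps = 0.005, delta = 1e-7 gives a 544 x 17 table, and for a stream of
// only 100 items the additive error bound is 0.005 * 100 = 0.5, i.e. the
// estimates are effectively exact already.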

@miguno

miguno commented Dec 31, 2014

One more comment for clarification:

Adds a CMSItems[K] implementation of CMS[K] which, like CMSItem[K], is exact, but for a small set of keys instead of just one.

Could you clarify whether you want to limit (currently via MaxSize) based on the size of the input data stream, the number of distinct keys in the input data stream, or totalCount (which AFAIU is not the size of the input stream but the L1 norm; see my comment about totalCount in the PR)?

I think this would also help to verify (also via tests) that the various operations in CMSItems (but also the changed pattern match in CMSInstance) are defined correctly.
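To make the distinction concrete (illustrative numbers only):

val updates = List(("a", 1L), ("a", 1L), ("b", 4L))
val numUpdates   = updates.size                    // 3 update operations
val distinctKeys = updates.map(_._1).distinct.size // 2 distinct keys
val totalCount   = updates.map(_._2).sum           // 6 = L1 norm of the count vector
// A limit of 100 means something quite different depending on which of these it bounds.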

@miguno

miguno commented Dec 31, 2014

FYI: Reversed my first question in my second comment above. I think you're actually looking for a performance improvement and not for more accurate counting.

@miguno

miguno commented Dec 31, 2014

I have a specific need for this, but it seems like a generally useful optimization.

Could you elaborate a bit on the use case you're imagining to put this PR into context? (e.g. to understand when or why the magic number of 100 would be a good pick)

Edit: I think one area of application is micro-batching of input elements, e.g. when using something like Spark Streaming or Storm Trident. Here, you could add/count up to 100 new items at a time.

In contrast, "core" Storm does not use micro-batching, and unless you manually add batching in some way you see only a single new input element (tuple) at a time. Here, the optimization essentially means that we can call the + operation on a fresh CMS a hundred times before transitioning from the proposed CMSItems to the normal, hashing-based CMSInstance behind the scenes. I'd say that many use cases in practice involve sliding windows of some kind, and in that case the benefit of the optimization depends on the rate of incoming elements per time interval (e.g. if you use sliding windows of 1 min in core Storm and see 1M items during a minute, then this optimization may not help much).

Does that make sense?

@ianoc
Collaborator

ianoc commented Dec 31, 2014

We follow this strategy in one of our big pipelines internally, using an approach in the style of https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Eventually.scala#L38: an external monoid wrapping the approximate structure. It saves quite a bit of space, but it is also far faster in the super common case of the tail of an aggregation (sets/maps with 5-10 elements).

+1 to building it directly into these monoids, since right now only the one pipeline where we wired it all in gets this sort of gain. And +1 to benchmarking too.
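Very roughly, the pattern looks like this (a simplified stand-in, not the actual EventuallyMonoid code; the "approximate" side is faked with another Map just to keep the sketch self-contained):

type Exact  = Map[String, Long]  // cheap, exact representation for the tail
type Approx = Map[String, Long]  // stand-in for the approximate sketch (CMS, HLL, ...)

def merge(x: Map[String, Long], y: Map[String, Long]): Map[String, Long] =
  y.foldLeft(x) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

// Left = already converted to the approximate structure, Right = still exact.
def plus(a: Either[Approx, Exact], b: Either[Approx, Exact], maxExact: Int): Either[Approx, Exact] =
  (a, b) match {
    case (Right(x), Right(y)) =>
      val m = merge(x, y)
      if (m.size <= maxExact) Right(m) else Left(m) // spill once it gets too big
    case (Left(x),  Right(y)) => Left(merge(x, y))
    case (Right(x), Left(y))  => Left(merge(x, y))
    case (Left(x),  Left(y))  => Left(merge(x, y))
  }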

Collaborator

Creating a bunch of these is going to be O(N^2), since we take an O(N) hit on each one. Should we make it a lazy val or a def?

Contributor Author

We can explicitly track it.


(And as I said above)

Double-checking: IIRC totalCount in Algebird's CMS is the equivalent of the L1-norm of the counting vector in the CMS paper. If that is true, then totalCount should not be items.size but:

override val totalCount: Long = items.sum

Or am I mistaken?

Contributor Author

Items is not a sparse counting vector (although that might make sense), but simply a list of keys, which may contain duplicates. So items.size is the correct way to get the totalCount - this will be equivalent to the L1-norm of the counting vector once it's transformed into that.
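For example:

val items = List("a", "a", "b")   // keys, possibly with duplicates
val countVector = items.groupBy(identity).map { case (k, ks) => k -> ks.size.toLong }
// countVector == Map("a" -> 2L, "b" -> 1L)
items.size              // 3
countVector.values.sum  // 3, the L1 norm of the counting vector: same number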

@avibryant
Contributor Author

@miguno actually, your first assumption was correct: my direct motivation for this is accuracy, not performance. Specifically, I'm testing a CMS-based algorithm for clustering [0], which uses innerProduct as the distance function, and I'd like the innerProduct between a large CMS and a small CMS to be as accurate as possible. (Using the CMSInstance representation, the innerProducts are exact with small • small, but they're less accurate than they could be with large • small).

However, the extra performance boost seems worthwhile in itself.

[0] http://charuaggarwal.net/cskrevise.pdf
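For concreteness, the exact inner product I'd like on the small side is just the dot product of the exact count vectors (sketch only, not the PR code):

def exactInnerProduct[K](a: Map[K, Long], b: Map[K, Long]): Long =
  a.iterator.map { case (k, ca) => ca * b.getOrElse(k, 0L) }.sum

exactInnerProduct(Map("x" -> 2L, "y" -> 1L), Map("x" -> 3L))  // 6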

@avibryant
Contributor Author

@miguno to address some other comments:

  • by all means let's benchmark, is there a HOWTO for algebird-caliper somewhere? If not I'm sure I can dive in and figure it out.
  • agreed that making MaxItems part of params makes sense, and is a good way to deal with the testing issue, as you point out. I'll do that.
  • the reason to use List instead of Seq is just to have some certainty about the performance (not that I've made great use of this, but see the various other comments people have made about what operations to use to combine lists).
  • as @ianoc points out, one of the benefits of this approach is that in many cases you have a large number of CMS objects but only a few of them get large enough to actually need the sketch; for the others there's a great space and time benefit to keeping things exact. I think that (and my specific accuracy concern about innerProduct) is more important than the ++ performance.

@miguno

miguno commented Jan 1, 2015

by all means let's benchmark, is there a HOWTO for algebird-caliper somewhere? If not I'm sure I can dive in and figure it out.

We do not have a full HOWTO for algebird-caliper yet but I did write a brief README:
https://github.com/twitter/algebird/blob/develop/algebird-caliper/README.md

See CMSBenchmark for an example. I'd suggest first running this benchmark from the CLI (see the README) to see what it actually does, and then taking a look at the benchmark code to see how it's done. It's essentially a simple, parameterizable micro-benchmark ("call this function N times with these combinations of parameters").

Note: algebird-caliper uses the sbt plugin cappi to run caliper benchmarks. Cappi does not yet support the latest caliper version (1.0-beta-1), so we must use the feature set provided by (IIRC) caliper 0.6-something.
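In spirit, the harness does something like this for you (a plain Scala sketch of the idea, not the caliper API; CMSBenchmark is the real reference):

// What the parameterizable micro-benchmark boils down to: run a function `reps`
// times for each combination of parameters and report the average cost per rep.
def benchmark[A](reps: Int)(body: => A): Double = {
  val start = System.nanoTime()
  var i = 0
  while (i < reps) { body; i += 1 }
  (System.nanoTime() - start).toDouble / reps
}

for {
  eps <- List(0.1, 0.005)
  n   <- List(100, 1000)
} {
  val nanosPerRep = benchmark(reps = 1000) {
    (1 to n).foldLeft(Set.empty[Int])(_ + _) // stand-in for summing n items into a CMS
  }
  println(s"eps=$eps n=$n -> $nanosPerRep ns/rep")
}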

HTH!

@avibryant
Contributor Author

@johnynek, I think this latest commit addresses your performance concerns.
@miguno I moved MaxItems to CMSParams and called it maxExactCount. The API for configuring CMS and its variations has a very large surface area, and I'm not super happy with how this new param is integrated: the only reasonable way to set it explicitly is to create a new CMSMonoid; otherwise you'll generally get the default. OTOH, I'm not very happy with CMSParams in general: if we're pulling out a params object, why aren't we exposing it in the API for constructing these monoids, aggregators, etc.? But I don't want to get embroiled in a major refactoring of that right now, so this seems like the minimal-impact way to start.
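Roughly, the new knob looks like this (just a sketch with the existing CMSParams fields elided; only the new parameter and its default are shown):

// Sketch, not the actual case class: existing CMS parameters elided.
case class CMSParamsSketch(
  eps: Double,
  delta: Double,
  maxExactCount: Int = 100  // threshold for switching from exact CMSItems to hashed CMSInstance
)

Setting it to 0 would presumably disable the optimization entirely, which also covers the testing concern you raised.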

I'll look at benchmarks next.


It looks as if count is not being taken into account. Shouldn't +(x, 4) result in the count of x being increased by 4?

This was the reason why I asked the totalCount question about items.size vs. items.sum (the L1-norm) -- size would be equivalent to sum only if we restrict + to +(x, 1).
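One way to make +(x, n) exact too would be to keep counts instead of duplicated keys, something like (sketch only, not a concrete proposal for the PR's data structure):

case class ExactCounts[K](counts: Map[K, Long]) {
  def +(k: K, n: Long): ExactCounts[K] =
    ExactCounts(counts.updated(k, counts.getOrElse(k, 0L) + n))
  def totalCount: Long = counts.values.sum  // the L1 norm, regardless of how + was called
}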

Contributor Author

Thanks, good catch.

@miguno

miguno commented Jan 1, 2015

CMS configuration issues: yeah, I definitely agree that the current way of configuring CMS could be better. The main reason (on my side) this hasn't really changed from the original code was to limit the changes introduced when we turned CMS[Long] into CMS[K] (see #345 and #354), which is a similar "I'm not touching it right now" reason to the one you mentioned. At least the latest CMS version is a bit clearer about which parameters are needed where, which should help when deciding how to improve the setup. Also, CMS up to this point has had the built-in assumption that the widths and depths of CMS instances are identical when they are merged via ++ (the only "free" parameter has been related to heavy hitters, and that parameter is not a "core" CMS parameter), so from what I understand we haven't yet needed to fiddle with CMS parameters after instances have been constructed.

If you have any ideas on how to simplify and/or improve the CMS configuration, I'd definitely appreciate them. IMHO the configuration and usability aspects of this code are as important as the magic that's going on behind the scenes, so personally I wouldn't mind spending some time to make sure the new functionality you have been working on is made available to the user in a better fashion -- but that could be a separate issue/PR, as you point out.


Can we use a const for 0 to explain the magic number, just like we do for EPS, DELTA, and SEED?

@avibryant
Contributor Author

Some initial benchmarks: it makes relatively little difference when summing the first 1000 integers (though using a size-100 CMSItems gives a modest performance gain):

[info]       1000        PlusOfFirstNIntegersWithItemsCms   0.1        0  2787.1 =
[info]       1000        PlusOfFirstNIntegersWithItemsCms   0.1       10  2822.8 =
[info]       1000        PlusOfFirstNIntegersWithItemsCms   0.1      100  2539.4 =
[info]       1000        PlusOfFirstNIntegersWithItemsCms 0.005        0  3653.2 ==
[info]       1000        PlusOfFirstNIntegersWithItemsCms 0.005       10  3656.0 ==
[info]       1000        PlusOfFirstNIntegersWithItemsCms 0.005      100  3337.8 ==

Not too surprisingly, if you're only summing 100 integers, and you have a size 100 CMSItems, it makes a huge difference:

[info]        100        PlusOfFirstNIntegersWithItemsCms   0.1        0   278.5 =
[info]        100        PlusOfFirstNIntegersWithItemsCms   0.1       10   257.7 =
[info]        100        PlusOfFirstNIntegersWithItemsCms   0.1      100    14.5 =
[info]        100        PlusOfFirstNIntegersWithItemsCms 0.005        0   436.0 =
[info]        100        PlusOfFirstNIntegersWithItemsCms 0.005       10   411.5 =
[info]        100        PlusOfFirstNIntegersWithItemsCms 0.005      100    13.8 =

@johnynek
Collaborator

johnynek commented Jan 2, 2015

So the speed win here is likely when you have a significant number of CMS instances that are going to be very small, which is common in heavy-tailed distributions. In that case, you cut the small instance cost down by a factor of 10 or more. Seems worth it for that.

Also, for the accuracy win you mention in dot-product.

Collaborator

  1. don't we want the method reverse_::: which combines these two to save cost?
  2. don't we want the smaller one on the left so we push the fewer items onto the bigger?


(2) don't we want the smaller one on the left so we push the fewer items onto the bigger?

The code is correctly pushing the smaller list onto the larger list:

scala> val larger = List(1,2,3,4,5,6,7,8,9)
larger: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> val smaller = List(10,20,30)
smaller: List[Int] = List(10, 20, 30)

scala> larger.reverse_:::(smaller)
res4: List[Int] = List(30, 20, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9)

(1) don't we want the method reverse_::: which combines these two to save cost?

And I was surprised to learn that apparently a List's reverse_::: is faster than ++ or ::: because it leverages structural sharing (its loop does ::) whereas e.g. ::: creates a ListBuffer via ++= and prependToList. I didn't expect that, though I haven't yet verified my understanding with a quick caliper benchmark. :-)
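So the combine step boils down to something like this (sketch; note that List.size is O(n), so you'd want to track the sizes explicitly rather than call .size here):

// Push the shorter list onto the longer one; reverse_::: only allocates new
// cells for the prefix it prepends, so the cost is O(min(a.size, b.size)).
def mergeKeys[K](a: List[K], b: List[K]): List[K] =
  if (a.size <= b.size) b.reverse_:::(a) else a.reverse_:::(b)

mergeKeys(List(10, 20, 30), List(1, 2, 3, 4, 5, 6, 7, 8, 9))
// List(30, 20, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9)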

@johnynek
Collaborator

johnynek commented Jan 2, 2015

Can you commit the benchmarks?

@miguno

miguno commented Jan 4, 2015

@johnynek wrote:

So the speed win here is likely when you have a significant number of CMS instances that are going to be very small, which is common in heavy-tailed distributions. In that case, you cut the small instance cost down by a factor of 10 or more. Seems worth it for that.

First let me say that I agree, too, but I have a question: Could you elaborate on why heavy-tailed distributions would commonly imply a significant (large) number of CMS instances? Doesn't this depend on how you "partition" the input data stream into CMS instances?

@ianoc made a similar point about how this optimization would help aggregate the long tail. I don't yet see how you are segregating the long tail from the heavy hitters: if they are lumped together, the number of elements quickly exceeds 100, or you need so many partitions N that 1/N-th of the data is <= 100. I assume you are talking about scenarios where one is not using e.g. random shuffling to spread the processing load?

@ianoc
Collaborator

ianoc commented Jan 5, 2015

@miguno the primary wins for space + perf:
-> Adding Maps with few elements is much faster than combining the CMS structures
-> Those Maps also wind up smaller when serialized than the CMS structures

And regarding the reference here, I think what I was getting at is this:

Often in a dataset we have a primary key in a space of O(10^8), say user IDs. Then we have a secondary index in a space of O(10^12), say URLs. So we might use a setup like this to track the main referring URLs for all users, with a type of (Long, CMS[String]). For most users, within a given time window before we age them out, there will probably be very few of these URLs. So moving to a CMSInstance at size == 2 is very expensive if the size never gets bigger than, say, 5.

@johnynek
Collaborator

We're about to publish. Is this ready to merge?

@avibryant can you merge with develop? @miguno what do you say? does this have your 👍 ?

@avibryant
Contributor Author

IMO let's get this into the next one. When I have time, I think I want to take another, more invasive run at it that makes me happier with the result.

Unless someone really wants this in the meantime?

@miguno

miguno commented Jan 16, 2015

I remember there's at least one functional issue remaining where the new code only supports incrementing by 1 but not by N (see my comment above). We also don't have the tests yet AFAIK. So as Avi said maybe we should wait one more round and give it some more work?

@johnynek
Collaborator

@avibryant you want to get this synced and try to get it in 0.10?

@ianoc
Collaborator

ianoc commented Aug 4, 2015

Sorry, my bad: git foo on the cmd line broke stuff and closed all of these.

@ianoc reopened this Aug 4, 2015
@johnynek
Collaborator

I think this was superseded by #464

@johnynek closed this Nov 25, 2015