CMSItems #390
Conversation
Here's a quick list of comments to get the discussion started before we hit 2015. :-)
Edit: Reversed my first question above. I think you're actually looking for a performance improvement and not for more accurate counting.
One more comment for clarification:
Could you clarify whether you want to limit (currently via

I think this would also help to verify (also via tests) that the various operations in
FYI: Reversed my first question in my second comment above. I think you're actually looking for a performance improvement and not for more accurate counting.
Could you elaborate a bit on the use case you're imagining, to put this PR into context? (e.g. to understand when or why the magic number of

Edit: I think one area of application is micro-batching of input elements, e.g. when using something like Spark Streaming or Storm Trident. Here, you could add/count up to 100 new items at a time. In contrast, "core" Storm does not make use of micro-batching, and if you do not manually add batching yourself in some way, you see only a single new input element (tuple) at a time. Here, the optimization would essentially mean that we can call the

Does that make sense?
We follow this strategy in one of our big pipelines internally, using an approach in the style of https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Eventually.scala#L38: an external monoid wrapping the approximate structure. It saves quite a bit on space, but it is also far faster in the super common case of the tail in aggregations (5-10 elements in a set/map). +1 to building it directly into these monoids, since right now only the one pipeline where we wired it all in gets this sort of gain. And +1 to a benchmark too.
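The Eventually-style wrapper described above can be sketched roughly like this. This is a hedged illustration only: the type names, the `Threshold` value, and the `Map` standing in for the approximate CMS structure are all made up here, not Algebird's actual API.

```scala
// Sketch: keep an exact list of keys while it is small, and switch to a
// dense count map (a stand-in for a real CMS) once a threshold is crossed.
sealed trait SmallOrApprox
final case class Small(keys: List[String]) extends SmallOrApprox
final case class Approx(counts: Map[String, Long]) extends SmallOrApprox

object SmallOrApprox {
  val Threshold = 100 // illustrative switch-over point

  private def toCounts(keys: List[String]): Map[String, Long] =
    keys.groupBy(identity).map { case (k, ks) => (k, ks.size.toLong) }

  private def addKeys(counts: Map[String, Long], keys: List[String]): Map[String, Long] =
    keys.foldLeft(counts)((acc, k) => acc.updated(k, acc.getOrElse(k, 0L) + 1L))

  def plus(a: SmallOrApprox, b: SmallOrApprox): SmallOrApprox = (a, b) match {
    case (Small(x), Small(y)) =>
      // push the smaller list onto the larger one to do less work
      val merged = if (x.size >= y.size) x.reverse_:::(y) else y.reverse_:::(x)
      if (merged.size <= Threshold) Small(merged) else Approx(toCounts(merged))
    case (Small(x), Approx(m)) => Approx(addKeys(m, x))
    case (Approx(m), Small(x)) => Approx(addKeys(m, x))
    case (Approx(m1), Approx(m2)) =>
      Approx(m2.foldLeft(m1) { case (acc, (k, c)) => acc.updated(k, acc.getOrElse(k, 0L) + c) })
  }
}
```

The point is that small + small stays cheap and exact, and the expensive approximate structure is only built once the combined size crosses the threshold.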
creating a bunch of these is going to be O(N^2), since we take an O(N) hit on each one. Should we make it a `lazy val` or a `def`?
We can explicitly track it.
(And as I said above)
Double-checking: IIRC `totalCount` in Algebird's CMS is the equivalent of the L1 norm of the counting vector in the CMS paper. If that is true, then `totalCount` should not be `items.size` but:

`override val totalCount: Long = items.sum`

Or am I mistaken?
`items` is not a sparse counting vector (although that might make sense), but simply a list of keys, which may contain duplicates. So `items.size` is the correct way to get the `totalCount`; this will be equivalent to the L1 norm of the counting vector once it's transformed into that.
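A toy illustration of that equivalence, in plain Scala (this is separate from the actual CMS code; the names are illustrative):

```scala
// `items` is a plain list of keys, duplicates included.
val items = List("a", "b", "a", "c", "a")

// Transforming it into the (sparse) counting vector of the CMS paper:
val countingVector: Map[String, Long] =
  items.groupBy(identity).map { case (k, ks) => (k, ks.size.toLong) }

// items.size and the L1 norm of the counting vector agree:
assert(items.size.toLong == countingVector.values.sum) // both are 5
```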
@miguno actually, your first assumption was correct: my direct motivation for this is accuracy, not performance. Specifically, I'm testing a CMS-based algorithm for clustering [0], which uses `innerProduct` as the distance function, and I'd like the `innerProduct` between a large CMS and a small CMS to be as accurate as possible. (Using the `CMSInstance` representation, the inner products are exact for small • small, but they're less accurate than they could be for large • small.) However, the extra performance boost seems worthwhile in itself.
@miguno to address some other comments:
We do not have a full HOWTO for algebird-caliper yet, but I did write a brief README. See `CMSBenchmark` for an example. I'd suggest first running this benchmark from the CLI (see the README) to see what it is actually doing, and then taking a look at the benchmark code to see how it's done. It's essentially a simple, parameterizable micro-benchmark ("call this function N times with these combinations of parameters").

Note: algebird-caliper uses the sbt plugin cappi to run caliper benchmarks. Cappi does not support the latest caliper version yet (1.0-beta-1), so we must use the feature set provided by (IIRC) caliper 0.6-something (see further details). HTH!
@johnynek, I think this latest commit addresses your performance concerns. I'll look at benchmarks next.
It looks as if `count` is not being taken into account. Shouldn't `+(x, 4)` result in the count of `x` being increased by 4?

This was the reason why I asked the `totalCount` question about `items.size` vs. `items.sum` (the L1 norm): `size` is equivalent to `sum` only if we restrict `+` to `+(x, 1)`.
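A hedged sketch of what a count-aware `+` could look like for a list-of-keys representation. The function name, the `Either` return type, and the threshold are illustrative here, not the PR's actual code:

```scala
// Add `item` with multiplicity `count` to an exact list of keys.
// Left = still exact (small); Right = spilled to a dense count map,
// a stand-in for switching to CMSInstance.
def add[K](items: List[K], item: K, count: Long, threshold: Int = 100): Either[List[K], Map[K, Long]] = {
  require(count >= 0L, "count must be non-negative")
  if (items.size + count <= threshold)
    // append `count` copies, so totalCount == items.size still holds
    Left(List.fill(count.toInt)(item) ::: items)
  else {
    val base = items.groupBy(identity).map { case (k, ks) => (k, ks.size.toLong) }
    Right(base.updated(item, base.getOrElse(item, 0L) + count))
  }
}
```

With this shape, `+(x, 4)` genuinely increases the count of `x` by 4 in either representation, and `items.size` remains the L1 norm while the exact representation is in use.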
Thanks, good catch.
CMS configuration issues: yeah, I definitely agree with your comments that the current way of configuring CMS could be better. The main reason (on my side) this hasn't really changed from the original code was to limit the changes being introduced when we turned

If you have any ideas on how to simplify and/or improve the CMS configuration, I'd definitely appreciate it. IMHO the configuration and usability aspects of this code are as important as the magic that's going on behind the scenes, so personally I wouldn't mind spending some time making sure the new functionality you have been working on is exposed to the user in a better fashion, but that could be a separate issue/PR, as you point out.
Can we use a const for 0 to explain the magic number, just like we do for `EPS`, `DELTA`, and `SEED`?
Some initial benchmarks: it makes relatively little difference when summing the first 1000 integers (though using a size-100 CMSItems is a modest performance gain):

Not too surprisingly, if you're only summing 100 integers and you have a size-100 CMSItems, it makes a huge difference:
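The benchmark output itself did not survive above. A crude, illustrative way to compare the two shapes outside of caliper is sketched below; single runs are noisy due to JIT warm-up, so treat this as a sketch of the comparison, not a real benchmark:

```scala
// Crude timing helper -- not a substitute for the caliper benchmarks.
def time[A](body: => A): (A, Long) = {
  val t0 = System.nanoTime()
  val result = body
  (result, System.nanoTime() - t0)
}

val keys = (1 to 100).toList

// "Exact small list" shape: prepend each key (what a CMSItems-like type does).
val (asList, listNanos) = time {
  keys.foldLeft(List.empty[Int])((acc, k) => k :: acc)
}

// "Dense table" shape: update a count map per key (a stand-in for CMSInstance).
val (asMap, mapNanos) = time {
  keys.foldLeft(Map.empty[Int, Long])((acc, k) => acc.updated(k, acc.getOrElse(k, 0L) + 1L))
}

// Both hold the same 100 additions; the list shape does strictly less work per add.
assert(asList.size == 100 && asMap.values.sum == 100L)
```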
So the speed win here is likely when you have a significant number of CMS instances that are going to be very small, which is common in heavy-tailed distributions. In that case, you cut the small-instance cost down by a factor of 10 or more. Seems worth it for that. Also, for the accuracy win you mention in dot-product.
- Don't we want the method `reverse_:::`, which combines these two to save cost?
- Don't we want the smaller one on the left, so we push the fewer items onto the bigger?
(2) don't we want the smaller one on the left so we push the fewer items onto the bigger?

The code is correctly pushing the smaller list onto the larger list:

```scala
scala> val larger = List(1,2,3,4,5,6,7,8,9)
larger: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> val smaller = List(10,20,30)
smaller: List[Int] = List(10, 20, 30)

scala> larger.reverse_:::(smaller)
res4: List[Int] = List(30, 20, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9)
```

(1) don't we want the method `reverse_:::` which combines these two to save cost?

And I was surprised to learn that apparently a List's `reverse_:::` is faster than `++` or `:::`, because it leverages structural sharing (its loop does `::`), whereas e.g. `:::` creates a `ListBuffer` via `++=` and `prependToList`. I didn't expect that, though I haven't yet verified my understanding with a quick caliper benchmark. :-)
Can you commit the benchmarks?
@johnynek wrote:
First let me say that I agree, too, but I have a question: could you elaborate on why heavy-tailed distributions would commonly imply a significant (large) number of CMS instances? Doesn't this depend on how you "partition" the input data stream into CMS instances? @ianoc made a similar reference to how this optimization would help to aggregate the long tail; I don't see yet how you guys are segregating the long tail from the heavy hitters (if they are lumped together, then the number of elements is quickly
@miguno the primary win is for space + perf. And I think what I was getting at with the reference/difference is: often in a dataset we have a primary key in a space of O(10^8), say user IDs. Then we have a secondary index in a space of O(10^12), say URLs. So we might use a setup like this to track the main referring URLs for all users, with a type of (Long, CMS[String]). For most users, in a given time window before we age them out, there will probably be very few of these URLs. So moving to a CMSInstance at size == 2 is very expensive if the size never gets bigger than, say, 5.
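To sketch the shape of that use case (all names and data below are made up for illustration; the exact key list stands in for the sketch type):

```scala
// (userId, referringUrl) events; primary key space ~10^8, URL space ~10^12.
val events: Seq[(Long, String)] = Seq(
  (1L, "a.example"), (1L, "b.example"), (2L, "a.example"), (1L, "a.example")
)

// Per-user aggregation of type (Long, <sketch of String>). Here the "sketch"
// is just the exact key list, which is what a CMSItems-style representation
// keeps while a user's URL count stays below the switch threshold.
val perUser: Map[Long, List[String]] =
  events.groupBy(_._1).map { case (user, evs) => (user, evs.map(_._2).toList) }

// Most users sit in the long tail (a handful of URLs each), far below a
// threshold of 100, so they never pay CMSInstance's dense-table cost.
assert(perUser(1L).size == 3 && perUser(2L).size == 1)
```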
We're about to publish. Is this ready to merge? @avibryant, can you merge with develop? @miguno, what do you say? Does this have your 👍?
IMO let's get this in the next one; I think I might want to take another, more invasive run at it when I have time, one that makes me happier with the result. Unless someone really wants this in the meantime?
I remember there's at least one functional issue remaining, where the new code only supports incrementing by 1 but not by N (see my comment above). We also don't have the tests yet, AFAIK. So, as Avi said, maybe we should wait one more round and give it some more work?
@avibryant, do you want to get this synced and try to get it into 0.10?
Sorry, my bad: git foo on the command line broke stuff and closed all of these.
I think this was superseded by #464 |
Adds a `CMSItems[K]` implementation of `CMS[K]` which, like `CMSItem[K]`, is exact, but for a small set of keys instead of just one.

You can add `CMSItem` and `CMSItem`, or `CMSItem` and `CMSItems`, and get a `CMSItems`. At a certain threshold (currently set to 100 items), this will transition to a `CMSInstance[K]`. I haven't updated the TopN or TopPct variants, just the basic CMS.

I have a specific need for this, but it seems like a generally useful optimization. It could probably entirely replace `CMSItem`, but I wouldn't want to do that without benchmarking first, which I have not done.

Thoughts @johnynek @miguno?