-
Notifications
You must be signed in to change notification settings - Fork 347
Create a sparse Count-Min-Sketch. #464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be consistent, let's use foldLeft rather than foreach here (to avoid an unneeded var). I'm willing to have vars if they add performance, in this case the perf should be the same as:
newTable.foldLeft(CMSInstance[K](params))(_ + _)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need to do newTable.foldLeft(CMSInstance[K](params)) { case (cms, (k, c)) => cms + (k, c) } because + takes two args and not a tuple.
|
should we add a test and fix #459 at the same time (you added another copy of this bug in this PR)? Also, can we add a test to specifically test this path? Like make a scalacheck generator that only generates sparse values, and verify that when we add them, either the count is exact or we have a CMSInstance? |
|
This is related to #390 (kind of forgot about that, sorry folks). I guess this will supercede that one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suppose we call this .empty. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, this follows CMSInstance so let's keep it as is.
|
This all looks good, but now looking back at #390, one thing I see that is different is that there was a parameter added to control how big the sparse CMS got, rather than just always doing width * depth (which I guess could use MUCH more than a CMS if the key is a String or BigInt, for instance). Maybe we should copy that approach of adding a parameter (and perhaps have it default to something very small, like |
|
@johnynek Is this good to go? |
|
Seems the coverage/coveralls test is taking forever to finish. Is there a way to turn it off or somehow? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put the maxExactCountOpt into CMSParams. Without that, you can't really control it, right? Since CMSItem will always create with the default setting, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnynek Good points. But that will make maxExactCountOpt available to all CMS subclasses. Is that desired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is only relevant to one part of the state machine, but similarly, the size is only depth/width is only relevant to when we create the hash count matrix (not to CMSItem or SparseCMS). So, I think it still fits as a parameter of the operation of the monoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. Has made the change as suggested.
On Tue, Jul 28, 2015 at 10:41 AM, P. Oscar Boykin [email protected]
wrote:
In algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala
#464 (comment):}
/**
- * A sparse Count-Min sketch structure, used for situations where the key is highly skewed.
- */
+case class SparseCMS[K](exactCountTable: Map[K, Long], maxExactCountOpt: Option[Int] = None,I agree it is only relevant to one part of the state machine, but
similarly, the size is only depth/width is only relevant to when we create
the hash count matrix (not to CMSItem or SparseCMS). So, I think it still
fits as a parameter of the operation of the monoid.—
Reply to this email directly or view it on GitHub
https://github.com/twitter/algebird/pull/464/files#r35676556.
Kevin Lin | Twitter, Inc.
1355 Market St. | San Francisco, CA | 94103
Follow me: @reconditesea https://twitter.com/reconditesea
|
@johnynek Is this good to go :) |
|
Sorry my bad, git foo on cmd line broke stuff and closed all of these |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, we don't want ++ here right? We want Semigroup.plus(exactCountTable, other.exactCountTable).
If that's correct, can you write a test that exposes this bug so we don't have a regression on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. This should be a Semigroup add. I will make an iteration and also add a test.
|
@johnynek @DanielleSucher Here you go! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this have maxExactCountOpt = None? so you don't have to pass it?
|
How does this effect our CMS benchmarks? |
|
@ianoc Could you share some more details about the current CMS benchmarks? Not very familiar with it. But given the default maxExactCount is quite small, I believe it won't affect it too much. |
|
The algebird-benchmark project has some benchmarks like It would be good to run these on develop vs this branch to ensure no regressions. Even if your code doesn't in theory effect the higher density CMS's, if it for instances blocks any specialization then that would have a large impact everywhere. |
|
@ianoc It's not a super clean comparison, because I'm doing other stuffs on my computer while the benchmarks are running. So the develop even took slightly more time than branch. But I guess it proves they are within the same performance level? |
|
Thats not the benchmark timing? the benchmark should be a matrix of output timing. That looks like the output from sbt? see the docs in the readme at the root or the sbt-jmh plugin docs |
|
That suggests about 4% slow down in all the expensive jobs there right? |
|
I reran my branch's benchmark and compared it with develop. This time, some of them are better whole others are still slower than develop. I guess there are some variances in different benchmark runs. |
|
I'm not sure, 4% loss in perf while not huge, is non-trival to me. Can you profile one of consistently 4% slower ones and see where it spends the time? Are any of them significantly faster than develop? |
|
@ianoc I think the reason of previous slowness is that the MaxExactCount (key-count) set for SparseCMS is too low (at most 10). So in the benchmark easily a SparseCMS need to be converted to a CMSInstance, and therefore each element needs to be added to the new CMS CountTable one by one. That brings the cost. If I set MaxExactCount to 50, my branch has comparable performance as develop. |
|
That seems reasonable to me, even 50 might wind up being too low. But given we have comparable performance at that level and its a configurable parameter. LGTM. Any others issues @johnynek & @DanielleSucher ? |
|
@johnynek & @DanielleSucher, any suggestions? |
|
@ianoc Is this good to go? Thanks! |
Create a sparse Count-Min-Sketch.
|
Merged thanks |
To close #461