Create a sparse Count-Min-Sketch. #464

reconditesea · 2015-07-15T02:22:12Z

To close #461

johnynek · 2015-07-15T12:55:25Z

algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala

just to be consistent, let's use foldLeft rather than foreach here (to avoid an unneeded var). I'm willing to have vars if they add performance, but in this case the perf should be the same as:

newTable.foldLeft(CMSInstance[K](params))(_ + _)

might need to do newTable.foldLeft(CMSInstance[K](params)) { case (cms, (k, c)) => cms + (k, c) } because + takes two args and not a tuple.
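A minimal sketch of the two styles being compared, using plain Scala only. `ToyCounts` is a hypothetical stand-in with a two-argument `+`; it is not Algebird's `CMSInstance`, and the method names are made up for illustration.

```scala
object FoldLeftSketch {
  // Hypothetical stand-in for a sketch whose `+` takes a key and a count (not a tuple).
  final case class ToyCounts(counts: Map[String, Long]) {
    def +(key: String, count: Long): ToyCounts =
      ToyCounts(counts.updated(key, counts.getOrElse(key, 0L) + count))
  }

  // foreach style: needs a mutable accumulator.
  def viaForeach(table: Map[String, Long]): ToyCounts = {
    var acc = ToyCounts(Map.empty)
    table.foreach { case (k, c) => acc = acc.+(k, c) }
    acc
  }

  // foldLeft style: same result, no var. The pattern match unpacks the
  // (key, count) tuple because `+` expects two separate arguments.
  def viaFoldLeft(table: Map[String, Long]): ToyCounts =
    table.foldLeft(ToyCounts(Map.empty)) { case (acc, (k, c)) => acc.+(k, c) }
}
```

Both methods visit each entry once, so under these assumptions the difference is stylistic rather than a performance trade-off.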

johnynek · 2015-07-15T13:05:51Z

should we add a test and fix #459 at the same time (you added another copy of this bug in this PR)?

Also, can we add a test to specifically test this path? Like make a scalacheck generator that only generates sparse values, and verify that when we add them, either the count is exact or we have a CMSInstance?

reconditesea · 2015-07-15T22:21:01Z

@johnynek I didn't know about #459 before. Will fix it in this PR along with some tests.

johnynek · 2015-07-16T12:15:23Z

This is related to #390 (kind of forgot about that, sorry folks). I guess this will supersede that one.

johnynek · 2015-07-16T12:17:36Z

algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala

suppose we call this .empty. What do you think?

actually, this follows CMSInstance so let's keep it as is.

johnynek · 2015-07-18T01:05:45Z

This all looks good, but now looking back at #390, one thing I see that is different is that there was a parameter added to control how big the sparse CMS got, rather than just always doing width * depth (which I guess could use MUCH more than a CMS if the key is a String or BigInt, for instance).

Maybe we should copy that approach of adding a parameter (and perhaps have it default to something very small, like max(width * depth / 100, 10) or something).
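For concreteness, the suggested default can be written as a small function. This is only a sketch of the formula in the comment above: `defaultMaxExactCount` is a hypothetical name, and doing the multiplication in `Long` to avoid `Int` overflow is an added assumption.

```scala
object MaxExactCountDefault {
  // Hypothetical helper: max(width * depth / 100, 10), as floated in the comment.
  // The product is computed as Long so that large width * depth does not overflow.
  def defaultMaxExactCount(width: Int, depth: Int): Int =
    math.max(width.toLong * depth / 100, 10L).toInt
}
```

With this formula a 2000 x 10 sketch would keep up to 200 exact keys, while any sketch with fewer than 1000 cells would fall back to the floor of 10.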

reconditesea · 2015-07-22T05:58:12Z

@johnynek Is this good to go?

reconditesea · 2015-07-24T06:21:43Z

Seems the coverage/coveralls test is taking forever to finish. Is there a way to turn it off somehow?

johnynek · 2015-07-28T00:24:33Z

algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala

let's put the maxExactCountOpt into CMSParams. Without that, you can't really control it, right? Since CMSItem will always create with the default setting, right?

@johnynek Good points. But that will make maxExactCountOpt available to all CMS subclasses. Is that desired?

I agree it is only relevant to one part of the state machine, but similarly, depth/width is only relevant when we create the hash count matrix (not to CMSItem or SparseCMS). So, I think it still fits as a parameter of the operation of the monoid.

Makes sense. I have made the change as suggested.
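The state machine being discussed can be sketched roughly as follows. Everything here is a toy model under stated assumptions: `ToyParams`, `ToySparse`, `ToyDense`, the fallback of 10 and the per-row hash are all hypothetical and do not reproduce Algebird's `CMSParams`, `SparseCMS` or `CMSInstance`. The point is only that the threshold travels with the parameters, next to width and depth.

```scala
object SparseStateMachineSketch {
  // Placeholder fallback when no limit is given; an assumption for illustration.
  val FallbackMaxExactCount = 10

  // Hypothetical parameter bundle: the sparse limit sits alongside width and depth.
  final case class ToyParams(width: Int, depth: Int, maxExactCountOpt: Option[Int] = None) {
    val maxExactCount: Int = maxExactCountOpt.getOrElse(FallbackMaxExactCount)
  }

  sealed trait ToySketch { def frequency(key: String): Long }

  // Exact counts, kept while the number of distinct keys stays small.
  final case class ToySparse(counts: Map[String, Long]) extends ToySketch {
    def frequency(key: String): Long = counts.getOrElse(key, 0L)
  }

  // Dense depth x width counter table; estimates can overcount on hash collisions.
  final case class ToyDense(table: Vector[Vector[Long]]) extends ToySketch {
    def frequency(key: String): Long =
      table.indices.map(row => table(row)(column(key, row, table(row).length))).min
  }

  // Illustrative per-row hash, seeded by the row index.
  private def column(key: String, row: Int, width: Int): Int =
    java.lang.Math.floorMod(scala.util.hashing.MurmurHash3.stringHash(key, row), width)

  private def bump(table: Vector[Vector[Long]], key: String, count: Long): Vector[Vector[Long]] =
    table.zipWithIndex.map { case (cells, row) =>
      val col = column(key, row, cells.length)
      cells.updated(col, cells(col) + count)
    }

  // Promotion to the dense form touches every (key, count) pair once.
  def toDense(params: ToyParams, counts: Map[String, Long]): ToyDense = {
    val empty = Vector.fill(params.depth)(Vector.fill(params.width)(0L))
    ToyDense(counts.foldLeft(empty) { case (t, (k, c)) => bump(t, k, c) })
  }

  // Stay sparse while under the limit from params; otherwise switch to the dense table.
  def add(params: ToyParams, sketch: ToySketch, key: String, count: Long): ToySketch =
    sketch match {
      case ToySparse(counts) =>
        val updated = counts.updated(key, counts.getOrElse(key, 0L) + count)
        if (updated.size <= params.maxExactCount) ToySparse(updated)
        else toDense(params, updated)
      case ToyDense(table) => ToyDense(bump(table, key, count))
    }
}
```

Because `add` reads the limit from `params`, callers that build the first sparse value never need to pass the threshold separately.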

}

/**
 * A sparse Count-Min sketch structure, used for situations where the key is highly skewed.
 */
+case class SparseCMS[K](exactCountTable: Map[K, Long], maxExactCountOpt: Option[Int] = None,

reconditesea · 2015-08-03T00:04:28Z

@johnynek Is this good to go :)

ianoc · 2015-08-04T00:47:36Z

Sorry my bad, git foo on cmd line broke stuff and closed all of these

johnynek · 2015-08-04T23:41:11Z

algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala

wait, we don't want ++ here right? We want Semigroup.plus(exactCountTable, other.exactCountTable).

If that's correct, can you write a test that exposes this bug so we don't have a regression on this?

You're right. This should be a Semigroup add. I will make an iteration and also add a test.
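The difference between the two merges can be shown with plain Scala maps. `sumByKey` below is a hand-written stand-in for what a per-key semigroup sum on `Map[K, Long]` is expected to do; it is not a call into Algebird, and both method names are hypothetical.

```scala
object MapMergeSketch {
  // `++` keeps the right-hand value for keys present in both maps,
  // so the left-hand count for a shared key is silently dropped.
  def overwrite(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    a ++ b

  // Per-key sum: counts for shared keys are added together.
  def sumByKey(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, c)) => acc.updated(k, acc.getOrElse(k, 0L) + c) }
}
```

A regression test along these lines would merge two tables that share at least one key and check that the shared key's count is the sum, which `++` fails.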

reconditesea · 2015-08-05T22:36:07Z

@johnynek @DanielleSucher Here you go!

johnynek · 2015-08-07T01:13:44Z

algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala

shouldn't this have maxExactCountOpt = None? so you don't have to pass it?

ianoc · 2015-08-10T16:05:21Z

How does this affect our CMS benchmarks?

reconditesea · 2015-08-10T17:51:52Z

@ianoc Could you share some more details about the current CMS benchmarks? Not very familiar with it. But given the default maxExactCount is quite small, I believe it won't affect it too much.

ianoc · 2015-08-10T18:02:45Z

The algebird-benchmark project has some benchmarks like
https://github.com/twitter/algebird/blob/develop/algebird-benchmark/src/main/scala/com/twitter/algebird/benchmark/CMSBenchmark.scala

It would be good to run these on develop vs this branch to ensure no regressions.

Even if your code doesn't in theory affect the higher-density CMSs, if it for instance blocks any specialization then that would have a large impact everywhere.

reconditesea · 2015-08-10T21:45:53Z

@ianoc
On my branch: [success] Total time: 3255 s, completed Aug 10, 2015 1:36:09 PM
On develop: [success] Total time: 3269 s, completed Aug 10, 2015 2:41:02 PM

It's not a super clean comparison, because I was doing other things on my computer while the benchmarks were running, so develop even took slightly more time than the branch. But I guess it shows they are at the same performance level?

ianoc · 2015-08-10T22:09:22Z

That's not the benchmark timing. The benchmark output should be a matrix of timings; that looks like the output from sbt. See the docs in the README at the root or the sbt-jmh plugin docs.

reconditesea · 2015-08-10T22:58:36Z

Here you go.
My branch:

develop:

ianoc · 2015-08-10T23:00:21Z

That suggests about a 4% slowdown in all the expensive jobs there, right?

reconditesea · 2015-08-11T00:13:51Z

I reran my branch's benchmark and compared it with develop. This time, some of them are better while others are still slower than develop. I guess there is some variance between benchmark runs.
I think the toDense() def of SparseCMS will add some additional cost, since it goes over all (k,v) items once. But that should not be too much. Any suggestions?

ianoc · 2015-08-11T15:35:12Z

I'm not sure. A 4% loss in perf, while not huge, is non-trivial to me. Can you profile one of the consistently 4% slower ones and see where it spends the time?

Are any of them significantly faster than develop?

reconditesea · 2015-10-07T23:24:06Z

@ianoc I think the reason for the previous slowness is that the MaxExactCount (key count) set for SparseCMS was too low (at most 10). So in the benchmark a SparseCMS easily needs to be converted to a CMSInstance, and each element then has to be added to the new CMS CountTable one by one. That brings the cost. If I set MaxExactCount to 50, my branch has performance comparable to develop.
Do we think 50 keys is a reasonable limit for the SparseCMS key count?

Branch

Develop

ianoc · 2015-10-12T14:27:45Z

That seems reasonable to me, though even 50 might wind up being too low. But given we have comparable performance at that level and it's a configurable parameter, LGTM.

Any other issues @johnynek & @DanielleSucher?

reconditesea · 2015-10-13T17:58:28Z

@johnynek & @DanielleSucher, any suggestions?

reconditesea · 2015-10-15T22:48:22Z

@ianoc Is this good to go? Thanks!

ianoc · 2015-10-22T20:32:57Z

Merged, thanks.

Create a sparse Count-Min-Sketch.

e2a9c1c

johnynek reviewed Jul 15, 2015
reconditesea added 2 commits July 15, 2015 15:21

Address Oscar's comments, execept for issue #459 and new tests.

2a08559

fix bug 459.

275b640

johnynek reviewed Jul 16, 2015
Add CMSInstanceTest.

3e31e55

Add maxExactCountOpt parameter.

4503c5a

johnynek reviewed Jul 28, 2015
reconditesea added 2 commits July 27, 2015 19:09

Merge branch 'develop' into klin/cms/sparse

c298355

Move maxExactCountOpt to CMSParams.

fdf1b86

ianoc closed this Aug 4, 2015

ianoc reopened this Aug 4, 2015

johnynek reviewed Aug 4, 2015
reconditesea added 2 commits August 5, 2015 13:36

Merge branch 'develop' into klin/cms/sparse

0e6a8d2

Fix Map add bug and update param comments.

4c27d2c

johnynek reviewed Aug 7, 2015
More overrides for different parameter.

b6b810d

reconditesea added 3 commits October 6, 2015 11:17

Merge branch 'develop' into klin/cms/sparse

20ca493

Make maxExact default to 100.

0e2438f

Set to 50.

5aefcf8

ianoc added a commit that referenced this pull request Oct 22, 2015

Merge pull request #464 from twitter/klin/cms/sparse

e474cce

Create a sparse Count-Min-Sketch.

ianoc merged commit e474cce into develop Oct 22, 2015

ianoc deleted the klin/cms/sparse branch October 22, 2015 20:32

johnynek mentioned this pull request Nov 25, 2015

CMSItems #390

Closed

reconditesea mentioned this pull request Dec 16, 2015

CMSItem disregards count in ++ #459

Closed
