Skip to content

Conversation

@reconditesea
Copy link
Contributor

To close #461

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be consistent, let's use foldLeft rather than foreach here (to avoid an unneeded var). I'm willing to have vars if they add performance, in this case the perf should be the same as:

newTable.foldLeft(CMSInstance[K](params))(_ + _)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need to do newTable.foldLeft(CMSInstance[K](params)) { case (cms, (k, c)) => cms + (k, c) } because + takes two args and not a tuple.

@johnynek
Copy link
Collaborator

should we add a test and fix #459 at the same time (you added another copy of this bug in this PR)?

Also, can we add a test to specifically test this path? Like make a scalacheck generator that only generates sparse values, and verify that when we add them, either the count is exact or we have a CMSInstance?

@reconditesea
Copy link
Contributor Author

@johnynek Didn't know #459 before. Will fix it in this PR along with some tests.

@johnynek
Copy link
Collaborator

This is related to #390 (kind of forgot about that, sorry folks). I guess this will supercede that one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppose we call this .empty. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, this follows CMSInstance so let's keep it as is.

@johnynek
Copy link
Collaborator

This all looks good, but now looking back at #390, one thing I see that is different is that there was a parameter added to control how big the sparse CMS got, rather than just always doing width * depth (which I guess could use MUCH more than a CMS if the key is a String or BigInt, for instance).

Maybe we should copy that approach of adding a parameter (and perhaps have it default to something very small, like max(width * depth / 100, 10) or something.

@reconditesea
Copy link
Contributor Author

@johnynek Is this good to go?

@reconditesea
Copy link
Contributor Author

Seems the coverage/coveralls test is taking forever to finish. Is there a way to turn it off or somehow?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put the maxExactCountOpt into CMSParams. Without that, you can't really control it, right? Since CMSItem will always create with the default setting, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnynek Good points. But that will make maxExactCountOpt available to all CMS subclasses. Is that desired?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is only relevant to one part of the state machine, but similarly, the size is only depth/width is only relevant to when we create the hash count matrix (not to CMSItem or SparseCMS). So, I think it still fits as a parameter of the operation of the monoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Has made the change as suggested.

On Tue, Jul 28, 2015 at 10:41 AM, P. Oscar Boykin [email protected]
wrote:

In algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala
#464 (comment):

}

/**

  • * A sparse Count-Min sketch structure, used for situations where the key is highly skewed.
  • */
    +case class SparseCMS[K](exactCountTable: Map[K, Long], maxExactCountOpt: Option[Int] = None,

I agree it is only relevant to one part of the state machine, but
similarly, the size is only depth/width is only relevant to when we create
the hash count matrix (not to CMSItem or SparseCMS). So, I think it still
fits as a parameter of the operation of the monoid.


Reply to this email directly or view it on GitHub
https://github.com/twitter/algebird/pull/464/files#r35676556.

Kevin Lin | Twitter, Inc.
1355 Market St. | San Francisco, CA | 94103

Follow me: @reconditesea https://twitter.com/reconditesea

@reconditesea
Copy link
Contributor Author

@johnynek Is this good to go :)

@ianoc ianoc closed this Aug 4, 2015
@ianoc ianoc reopened this Aug 4, 2015
@ianoc
Copy link
Collaborator

ianoc commented Aug 4, 2015

Sorry my bad, git foo on cmd line broke stuff and closed all of these

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, we don't want ++ here right? We want Semigroup.plus(exactCountTable, other.exactCountTable).

If that's correct, can you write a test that exposes this bug so we don't have a regression on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. This should be a Semigroup add. I will make an iteration and also add a test.

@reconditesea
Copy link
Contributor Author

@johnynek @DanielleSucher Here you go!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this have maxExactCountOpt = None? so you don't have to pass it?

@ianoc
Copy link
Collaborator

ianoc commented Aug 10, 2015

How does this effect our CMS benchmarks?

@reconditesea
Copy link
Contributor Author

@ianoc Could you share some more details about the current CMS benchmarks? Not very familiar with it. But given the default maxExactCount is quite small, I believe it won't affect it too much.

@ianoc
Copy link
Collaborator

ianoc commented Aug 10, 2015

The algebird-benchmark project has some benchmarks like
https://github.com/twitter/algebird/blob/develop/algebird-benchmark/src/main/scala/com/twitter/algebird/benchmark/CMSBenchmark.scala

It would be good to run these on develop vs this branch to ensure no regressions.

Even if your code doesn't in theory effect the higher density CMS's, if it for instances blocks any specialization then that would have a large impact everywhere.

@reconditesea
Copy link
Contributor Author

@ianoc
On my branch: [success] Total time: 3255 s, completed Aug 10, 2015 1:36:09 PM
On develop: [success] Total time: 3269 s, completed Aug 10, 2015 2:41:02 PM

It's not a super clean comparison, because I'm doing other stuffs on my computer while the benchmarks are running. So the develop even took slightly more time than branch. But I guess it proves they are within the same performance level?

@ianoc
Copy link
Collaborator

ianoc commented Aug 10, 2015

Thats not the benchmark timing? the benchmark should be a matrix of output timing. That looks like the output from sbt? see the docs in the readme at the root or the sbt-jmh plugin docs

@reconditesea
Copy link
Contributor Author

Here you go.
My brach:
branch

develop:
develop

@ianoc
Copy link
Collaborator

ianoc commented Aug 10, 2015

That suggests about 4% slow down in all the expensive jobs there right?

@reconditesea
Copy link
Contributor Author

I reran my branch's benchmark and compared it with develop. This time, some of them are better whole others are still slower than develop. I guess there are some variances in different benchmark runs.
I think the toDense() def of SparseCMS will add some additional cost, since it goes over all (k,v) items once. But that should not be too much. Any suggestions?

branch2

@ianoc
Copy link
Collaborator

ianoc commented Aug 11, 2015

I'm not sure, 4% loss in perf while not huge, is non-trival to me. Can you profile one of consistently 4% slower ones and see where it spends the time?

Are any of them significantly faster than develop?

@reconditesea
Copy link
Contributor Author

@ianoc I think the reason of previous slowness is that the MaxExactCount (key-count) set for SparseCMS is too low (at most 10). So in the benchmark easily a SparseCMS need to be converted to a CMSInstance, and therefore each element needs to be added to the new CMS CountTable one by one. That brings the cost. If I set MaxExactCount to 50, my branch has comparable performance as develop.
Do we think key space 50 is a good deal for a SparseCMS key size?

Branch
branch

Develop
develop

@ianoc
Copy link
Collaborator

ianoc commented Oct 12, 2015

That seems reasonable to me, even 50 might wind up being too low. But given we have comparable performance at that level and its a configurable parameter. LGTM.

Any others issues @johnynek & @DanielleSucher ?

@reconditesea
Copy link
Contributor Author

@johnynek & @DanielleSucher, any suggestions?

@reconditesea
Copy link
Contributor Author

@ianoc Is this good to go? Thanks!

ianoc added a commit that referenced this pull request Oct 22, 2015
Create a sparse Count-Min-Sketch.
@ianoc ianoc merged commit e474cce into develop Oct 22, 2015
@ianoc ianoc deleted the klin/cms/sparse branch October 22, 2015 20:32
@ianoc
Copy link
Collaborator

ianoc commented Oct 22, 2015

Merged thanks

@johnynek johnynek mentioned this pull request Nov 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SparseCMS

5 participants