Dealing with probabilistic tests #478

sid-kap · 2015-08-04T22:40:57Z

This is a rough sketch of a way we can deal with flaky/probabilistic tests.

I defined a trait ApproximateProperty that allows us to describe approximate properties, and, as an example, implemented a working ApproximateProperty test for CMS.

In detail:

Defined the ApproximateProperty trait to describe a relationship between
- an Exact type
- an Approx type which approximates the behavior of Exact
- a behavior that both Exact and Approx implement.
  - Exact should be able to take something of type Input and produce something of type Result
  - Approx should be able to take something of type Input and produce something of type Approximate[Result]
- some generators/glue functions (exactGenerator, makeApproximate, inputGenerator) that determine how to create an instance of Exact, how to convert an instance of Exact to an instance of Approx, and how to create an Input

Defined the ApproximateProperty.toProp function, which converts an ApproximateProperty to an org.scalacheck.Prop. The Prop runs the ApproximateProperty test many times and fails if the number of failed tests exceeds a threshold derived from Hoeffding's inequality.
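To make the shape of the trait concrete, here is a minimal, dependency-free sketch. It is not the code in this PR: generators are modelled as plain functions of a scala.util.Random rather than org.scalacheck.Gen, ApproxResult is a simplified stand-in for Approximate[Result], and the names ApproxResult, run, passes, and ExactCountProperty are hypothetical. The slack in passes uses the sqrt(-n * log(rate) / 2) form of the Hoeffding bound.

```scala
import scala.util.Random

object ApproximatePropertySketch {
  // Stand-in for an approximate answer: a membership check plus the
  // probability with which the true value is claimed to pass that check.
  final case class ApproxResult[R](contains: R => Boolean, probWithinBounds: Double)

  trait ApproximateProperty {
    type Exact
    type Approx
    type Input
    type Result

    def exactGenerator(rng: Random): Exact
    def inputGenerator(exact: Exact, rng: Random): Input
    def makeApproximate(exact: Exact): Approx

    def exactResult(exact: Exact, input: Input): Result
    def approxResult(approx: Approx, input: Input): ApproxResult[Result]
  }

  // Runs `trials` independent checks; returns (successes, sum of claimed probabilities).
  def run(p: ApproximateProperty, trials: Int, rng: Random): (Int, Double) = {
    val outcomes = (1 to trials).map { _ =>
      val exact = p.exactGenerator(rng)
      val input = p.inputGenerator(exact, rng)
      val approx = p.approxResult(p.makeApproximate(exact), input)
      (approx.contains(p.exactResult(exact, input)), approx.probWithinBounds)
    }
    (outcomes.count(_._1), outcomes.map(_._2).sum)
  }

  // Passes when the success count is within the Hoeffding slack of the expected count.
  def passes(successes: Int, expected: Double, trials: Int, falseFailureRate: Double): Boolean = {
    val diff = math.sqrt(-trials * math.log(falseFailureRate) / 2.0)
    successes >= expected - diff
  }

  // Toy instance: the "approximate" structure is just an exact frequency map,
  // so every check succeeds with claimed probability 1.0.
  object ExactCountProperty extends ApproximateProperty {
    type Exact = Vector[Int]
    type Approx = Map[Int, Int]
    type Input = Int
    type Result = Int

    def exactGenerator(rng: Random): Vector[Int] = Vector.fill(20)(rng.nextInt(5))
    def inputGenerator(exact: Vector[Int], rng: Random): Int = exact(rng.nextInt(exact.size))
    def makeApproximate(exact: Vector[Int]): Map[Int, Int] =
      exact.groupBy(identity).map { case (k, v) => (k, v.size) }

    def exactResult(exact: Vector[Int], input: Int): Int = exact.count(_ == input)
    def approxResult(approx: Map[Int, Int], input: Int): ApproxResult[Int] =
      ApproxResult[Int]((r: Int) => r == approx.getOrElse(input, 0), 1.0)
  }
}
```

Note that inputGenerator takes the Exact instance here, which mirrors the design question raised in the review: queries are drawn from elements that are actually present.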

Comments/suggestions?

sid-kap · 2015-08-04T22:44:17Z

algebird-test/src/main/scala/com/twitter/algebird/ApproximateProperty.scala

A better name for falsePositiveRate might be falseFailureRate

johnynek · 2015-08-05T17:23:02Z

algebird-test/src/main/scala/com/twitter/algebird/ApproximateProperty.scala

what is this exact value corresponding to?

Exact is useful here so that we can generate appropriate test inputs.

For example, suppose we're testing CMS[Long]. We need to generate some inputs so that we can query exactList.frequency(x) and cms.frequency(x) and compare them.
We could simply query using random Long values. However, most of the random Long values we generate would probably not be in exactList, so these test cases would not be very useful. If we have exactList, we can choose random elements from exactList as our test cases.

I'm not sure if this is the best way to do this.

but why couldn't we do something like:

// I need an exact value for my implementation:
def needExact(e: Exact): Gen[Input]
// but I could still implement this:
def inputGenerator: Gen[Input] = exactGenerator.flatMap(needExact)

Am I missing something? Putting this in the trait as is seems to be constraining more than is needed. Why can't we just get by with def inputGenerator: Gen[Input]

johnynek · 2015-08-05T17:40:24Z

algebird-test/src/main/scala/com/twitter/algebird/ApproximateProperty.scala

let's link to:
https://en.wikipedia.org/wiki/Hoeffding%27s_inequality

and add the result;

When Prob(X = 1) >= p, then
  Prob(sum(X) <= (p - eps)*n) <= exp(-2*eps*eps*n)
If we want that Prob <= fprate, then it is sufficient that
  fprate = exp(-2*eps*eps*n)
or:
  eps = math.sqrt(-math.log(fprate) / (2.0 * n))

So, I think there is a slight error in your version (n is in the numerator in yours).

diff = n * eps, so it's correct in the numerator.

My only concern here is that the formula is for the Bernoulli version of Hoeffding's. Since we're taking separate probability values from each result, we aren't guaranteed that they'll be uniform (very possibly they won't be). But maybe it's "close enough."

The general case (https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#General_case) is

P(sum(X) - sum(E[X]) >= nt) <= exp(-2*n*t^2)

(here I used sum(X) = X_1 + ... + X_n instead of Xbar = 1/n (X_1 + ... + X_n))
So we have

exp(-2*n*t^2) = fprate => t = sqrt(-log(fprate) / (2*n) )

So I think Oscar's right.
I don't think this assumes the Bernoulli distribution.

Whoops, I read Joe's comment again, and he's right.
diff = n * t, so we need diff = sqrt(-n * log(fprate) / 2).

I agree. I overlooked the n * t on the outside.
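For reference, the algebra the thread converges on can be written out as follows. This is a sketch under the usual assumptions of Hoeffding's inequality: the X_i are independent and each lies in [0, 1]. The symbols δ (for fprate) and d (for diff) are shorthand introduced here. The test cares about too few successes, so the lower-tail form is shown; it follows from the upper-tail form by applying it to 1 - X_i.

```latex
\begin{align*}
\Pr\!\left[\sum_{i=1}^{n} \mathbb{E}[X_i] - \sum_{i=1}^{n} X_i \ge nt\right]
  &\le \exp\!\left(-2nt^{2}\right) \\
\exp\!\left(-2nt^{2}\right) = \delta
  \;&\Longrightarrow\; t = \sqrt{\frac{-\ln\delta}{2n}} \\
d = nt &= \sqrt{\frac{-n\ln\delta}{2}}
\end{align*}
```

So t shrinks like 1/sqrt(n), while the allowed shortfall in the raw success count, d = n * t, grows like sqrt(n), which is why n ends up in the numerator.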

I think the bound applies perfectly, with the assumption that our Random class is a real RNG. That is obviously not true, and in fact it is pseudo-random, but I don't see how that is escapable. We could beef up the RNG used to be stronger (but certainly slower), but that might require some changes to scalacheck, not sure.

- Create a trait GeneralizedApproximate that contains the minimal behavior necessary for ApproximateProperty.
- Change return type of ApproximateProperty#approxResult to GeneralizedApproximate[Result] instead of Approximate[Result].
- Create implicit conversions from
  - Approximate[T] to GeneralizedApproximate[T]
  - ApproximateBoolean to GeneralizedApproximate[Boolean]

DanielleSucher · 2015-08-12T13:50:17Z

version.sbt

Why bump to a snapshot version here?

Whoops, I didn't mean to check that change in. I'll remove it

We probably should actually swap all of them to work like this. We've seen some funky errors and mixed classpaths if you have a published version in the version.sbt. Not sure if it's a new sbt thing or what. But that can be a different PR.

Yeah, I keep getting a "Unresolved dependencies: algebird-test" error when I don't use "-SNAPSHOT". Is this happening to other people too?

johnynek · 2015-08-13T19:39:15Z

algebird-test/src/main/scala/com/twitter/algebird/ApproximateProperty.scala

seems like what we really want is

trait ApproximateSet[S, -T] { def contains(set: S, t: T): ApproximateBoolean }

and use that as a typeclass rather than implicit conversions.

object ApproximateSet {
  def contains[S, T](s: S, t: T)(implicit as: ApproximateSet[S, T]): ApproximateBoolean =
    as.contains(s, t)
  implicit def fromApproximate[N: Numeric]: ApproximateSet[Approximate[N], N] =
    new ApproximateSet ...
}

Actually, what you are trying to do here is add a common interface to ApproximateBoolean and Approximate[N] right?

Maybe just do that directly rather than through implicit conversion.

Yeah, I'm trying to make a common interface to ApproximateBoolean and Approximate[N].

I like the idea of having a trait ApproximateSet[T]:

trait ApproximateSet[T] {
  def contains(t: T): ApproximateBoolean
}
ApproximateBoolean extends ApproximateSet[Boolean]
Approximate[N: Numeric] extends ApproximateSet[N]

I think I'll follow that approach.
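A self-contained sketch of that direct-inheritance approach, for illustration only. ApproxBool and ApproxNum are simplified stand-ins for algebird's ApproximateBoolean and Approximate[N], not the real classes, and the check helper and the semantics chosen for ApproxBool.contains are assumptions made here.

```scala
object ApproximateSetSketch {
  // The common interface: "does this approximate value admit t?"
  trait ApproximateSet[T] {
    def contains(t: T): ApproxBool
  }

  // Stand-in for ApproximateBoolean: a Boolean answer with a confidence.
  // Assumed semantics: it "contains" t exactly when t equals the believed value.
  final case class ApproxBool(isTrue: Boolean, withProb: Double) extends ApproximateSet[Boolean] {
    def contains(t: Boolean): ApproxBool = ApproxBool(t == isTrue, withProb)
  }

  // Stand-in for Approximate[N]: an estimate bracketed by [min, max].
  final case class ApproxNum[N](min: N, estimate: N, max: N, probWithinBounds: Double)(
      implicit num: Numeric[N]) extends ApproximateSet[N] {
    def contains(t: N): ApproxBool =
      ApproxBool(num.lteq(min, t) && num.lteq(t, max), probWithinBounds)
  }

  // Test code can now treat both kinds of approximate result uniformly.
  def check[T](approx: ApproximateSet[T], exact: T): Boolean =
    approx.contains(exact).isTrue
}
```

With this shape, approxResult can return ApproximateSet[Result] and no implicit conversion is needed at the call site.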

sid-kap · 2015-08-19T17:20:59Z

@ianoc everything in this PR uses Scalacheck foralls and Properties classes. Do we want to rewrite it in terms of Scalatest foralls and property tests?

ianoc · 2015-08-19T17:21:04Z

Looks like this doesn't merge cleanly to develop. @sid-kap, can you merge develop into your branch and resolve the conflicts?

ianoc · 2015-08-19T17:21:32Z

Nope, let's just get this in and stable. Once Oscar is happy, I think we should try to get it in. We can worry about that other stuff later/maybe never.

sid-kap · 2015-08-19T17:23:41Z

Ok sounds good.
The maintainer of scalacheck finally merged in the PR that I sent. Now we just have to wait for him to do a new release. Once that happens, I can try to merge this in.

johnynek · 2015-08-19T17:35:58Z

algebird-test/src/test/scala/com/twitter/algebird/CountMinSketchTest.scala

can we call list vec instead? Also, vec.count(_ == key) is going to be faster here since it does not materialize a second vector.
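For illustration, the two ways of computing the exact frequency look like this. The object and method names are hypothetical, and the filter-based version is an assumption about what the code under review did.

```scala
object CountSketch {
  // Builds an intermediate Vector holding every match, then measures it.
  def frequencyViaFilter[K](vec: Vector[K], key: K): Int =
    vec.filter(_ == key).size

  // Single pass with a counter; no intermediate collection is allocated.
  def frequencyViaCount[K](vec: Vector[K], key: K): Int =
    vec.count(_ == key)
}
```

Both return the same number; only the allocation differs.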

johnynek · 2015-08-19T17:46:04Z

A couple of comments. Also note that this does not merge cleanly, so you'll need to merge in develop.

sid-kap · 2015-09-11T01:57:42Z

Scalacheck just released the new version with the feature that I requested. So I'll try to bring in the new dependency and finish up this PR sometime next week.

Also, create ApproximateProperties class

sid-kap · 2015-10-04T23:29:44Z

By the way, I don't think we should use the Chernoff-Hoeffding bound. A few weeks of studying concentration inequalities in Randomized Algorithms has taught me not to use the Hoeffding bound for a binomial with very small or very large p -- this bound doesn't depend on p, and therefore doesn't take into account the fact that the binomial concentrates much better for very small/very large p.

Using a weak bound would result in overly lax tests -- for example, suppose we are running 100 trials with 0.975 probability of success, and we want to fail at most 0.01 of the time. The true threshold for the binomial (qbinom(0.01, 100, 0.975)) is 93, but the Chernoff-Hoeffding bound gives us 97.5 - sqrt(-100 * log(0.01) / 2) = 82.3. This gets worse for properties with a 0.99 success probability -- the true value is 96, and the Hoeffding bound gives us 83.8.

For fun: we could use a Bernstein-type inequality (since the binomial is sub-gamma), which would give us a cutoff of 97.5 - sqrt(-4*100*0.975*(1-0.975)*log(0.01)) = 90.8 for 0.975 success probability, and 99 - sqrt(-4*100*0.99*(1-0.99)*log(0.01)) = 94.7 for the 0.99 success probability. This bound is essentially tight because it uses the Gaussian tail bound (e^(-x^2)) instead of an exponential bound (e^(-x)). For proof, see the last page of http://www.cs.utexas.edu/~ecprice/courses/randomized/notes/lec6.pdf (Note that these notes assume that p << 1, so they use the approximation sqrt(p(1-p)) ~ sqrt(p), which is not the case here.)

Or we could be lame and use the inverse binomial function...
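The three cutoffs compared above can be reproduced with a short sketch. The function names are hypothetical. The Hoeffding cutoff uses n*p - sqrt(-n*log(rate)/2), the Bernstein-style cutoff uses n*p - sqrt(-4*n*p*(1-p)*log(rate)), and binomialQuantile is a straightforward reimplementation of what R's qbinom computes (the smallest k whose CDF reaches the given rate), written here for p strictly between 0 and 1.

```scala
import scala.annotation.tailrec

object ThresholdSketch {
  // Hoeffding: expected successes minus the slack sqrt(-n * ln(rate) / 2).
  def hoeffdingCutoff(n: Int, p: Double, failRate: Double): Double =
    n * p - math.sqrt(-n * math.log(failRate) / 2.0)

  // Bernstein-style: the slack scales with the variance term n * p * (1 - p).
  def bernsteinCutoff(n: Int, p: Double, failRate: Double): Double =
    n * p - math.sqrt(-4.0 * n * p * (1 - p) * math.log(failRate))

  // Exact quantile: smallest k with P(Binomial(n, p) <= k) >= failRate.
  // The pmf is accumulated in log space to avoid underflow for small terms.
  def binomialQuantile(failRate: Double, n: Int, p: Double): Int = {
    @tailrec
    def loop(k: Int, logChoose: Double, cdf: Double): Int = {
      val next = cdf + math.exp(logChoose + k * math.log(p) + (n - k) * math.log(1 - p))
      if (next >= failRate || k == n) k
      else loop(k + 1, logChoose + math.log((n - k).toDouble) - math.log(k + 1.0), next)
    }
    loop(0, 0.0, 0.0)
  }
}
```

For n = 100 and a 0.01 failure rate, binomialQuantile gives 93 at p = 0.975 and 96 at p = 0.99, while the Hoeffding cutoffs come out near 82.3 and 83.8. That gap is the laxness being described.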

sid-kap · 2015-10-16T14:52:53Z

Do we want to try to merge this in? I've updated Scalacheck, so now it runs each test the correct number of times.

The only things I think need to be addressed are

1. Maybe make the bounds tighter, as mentioned above
2. How to enforce that people use these classes correctly

Concerning 2: I made a class called ApproximateProperties that extends Properties but has a default of running each test once instead of 100 times. All Properties classes that have approximate properties should extend from this, and should contain only approximate properties. (We need to run approximate properties only once because when the ApproximateProperty check code is called once, it generates many (say, 100) inputs and succeeds if enough of those 100 succeed.)

Is there a clean way to enforce, using the type system, that only ApproximatePropertys can be put in the ApproximateProperties class? I would not want to emulate the Scalacheck Properties class because that class is implemented using messy mutable structures (every time you call property("foo") =, it appends the property to a mutable list of properties). Maybe we could make a function that takes a List[ApproximateProperty] and generates the appropriate Properties instance? This would work but it might lead to messier code (we would not be able to take advantage of the property("foo") = DSL).

johnynek · 2015-10-16T18:01:18Z

@sid-kap Thanks! I do very much want to merge this, I just haven't looked back at it since the scalacheck upgrade. I'll review. Thanks for not letting it drop! Appreciated.

erikerlandson · 2015-10-19T17:49:13Z

I've had some success comparing algorithms involving random behaviors by using the Kolmogorov-Smirnov test, where you can test whether two distributions are different (or, in the inverse case, whether they are not different). If you can define (either in closed-form, or via sampling) a reference of "correct" behavior, you can then run your test some number of times and run the KS-test between test and reference.

Worth noting that with random behaviors, there is generally no way to guarantee that your random test will never violate, though it is possible to put things on a theoretical footing that fails less frequently.

apache/spark#2455
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
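As a rough illustration of the KS idea, here is a sketch of the two-sample statistic together with the standard asymptotic critical value. It is not taken from the linked Spark PR. The names are hypothetical, and the quadratic-time CDF evaluation is chosen for clarity rather than speed.

```scala
object KsSketch {
  // Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
  // empirical CDFs of the two samples, evaluated at every observed point.
  def ksStatistic(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.nonEmpty && b.nonEmpty, "both samples must be non-empty")
    (a ++ b).distinct.map { x =>
      val fa = a.count(_ <= x).toDouble / a.size
      val fb = b.count(_ <= x).toDouble / b.size
      math.abs(fa - fb)
    }.max
  }

  // Asymptotic critical value c(alpha) * sqrt((n + m) / (n * m)),
  // where c(alpha) = sqrt(-ln(alpha / 2) / 2).
  def criticalValue(n: Int, m: Int, alpha: Double): Double =
    math.sqrt(-math.log(alpha / 2.0) / 2.0) * math.sqrt((n + m).toDouble / (n.toDouble * m))

  // The samples are judged "different" at level alpha when the statistic exceeds the critical value.
  def differ(a: Seq[Double], b: Seq[Double], alpha: Double): Boolean =
    ksStatistic(a, b) > criticalValue(a.size, b.size, alpha)
}
```

A test built this way still fails with probability roughly alpha when the two distributions really are the same, which is the caveat noted above.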

johnynek · 2015-11-25T17:02:18Z

@ianoc what do you think? I'm happy to merge this now. I think it's a big improvement, and we let it sit too long.

ianoc · 2015-11-25T17:07:56Z

I'm in favor of pushing ahead and getting it in. If there are further improvements to be made, we can do them later.


sid-kap added 5 commits August 3, 2015 16:31

basic implementation of ApproximateProperty

9da8eb1

Some work

b4da480

more stuff

e60f0b5

work on tests

05bdf25

Basic implementation of ApproximateProperty

e9fedb6

sid-kap reviewed Aug 4, 2015

sid-kap and others added 6 commits August 4, 2015 18:23

More changes

de57b71

Work on hyperloglogtests

409ffc9

Work on HyperLogLogTests

0f997f3

Sample failing test

118a380

Fix HLL intersection test (now fails weirdly)

d0c0187

Add intersection size == sizeOf sanity test

c66d1c3

DanielleSucher mentioned this pull request Aug 5, 2015

upgrade scalacheck #431

Closed

johnynek reviewed Aug 5, 2015

Add comment for toProp

1e5e9a2

johnynek reviewed Aug 5, 2015

sid-kap mentioned this pull request Aug 5, 2015

HyperLogLog intersections test is weak #479

Closed

sid-kap added 7 commits August 5, 2015 17:50

Fix HLL intersection test

9bd6a7f

Work more on HyperLogLogTests

587bb63

Use Hash128 in HyperLogLogTests

b76b6af

Add SetSizeHashAggregator tests

4963b1b

Merge branch 'develop' into test_failure_bounds

931aad9

Test HyperLogLogSeries using ApproximateProperty

c6da416

sid-kap changed the title ~~Dealing with probabilistic tests (Don't merge)~~ Dealing with probabilistic tests Aug 11, 2015

DanielleSucher reviewed Aug 12, 2015

sid-kap added 2 commits August 12, 2015 10:14

Write BloomFilterTests, handle case where prob = 0

e56b612

Revert accidental change to version.sbt

46390c9

sid-kap force-pushed the test_failure_bounds branch from f07db15 to 46390c9 Compare August 12, 2015 17:14

johnynek reviewed Aug 13, 2015

Refactor GeneralizedApproximate[T] -> ApproximateSet[T]

97f06eb

johnynek reviewed Aug 19, 2015

sid-kap added 3 commits August 19, 2015 13:05

Merge branch 'develop' into test_failure_bounds

96e953a

Small change in CountMinSketchTest

6653e65

Fix incorrect merge conflict resolution

a750192

sid-kap added 3 commits October 4, 2015 17:14

Merge branch 'develop' into test_failure_bounds

9e599c1

Update Scalacheck version

185eab3

Also, create ApproximateProperties class

Travis, please run the tests again

bdb7354

johnynek added a commit that referenced this pull request Nov 30, 2015

Merge pull request #478 from sid-kap/test_failure_bounds

3a4c927

Dealing with probabilistic tests

johnynek merged commit 3a4c927 into twitter:develop Nov 30, 2015
