An implementation of t-digest numeric distribution sketching #495
Conversation
For reference, my blog post that describes my reasoning behind the binary tree algorithms backing the t-digest:
CI is showing a failure, but the log output makes it seem like the tests succeeded. I'm not able to replicate any failures on my local machine.
return type on public methods, please.
Comment on t-digest monoids: I'm pretty sure I can address the issue of the non-deterministic behavior of the monoid.
can we remove this?
@erikerlandson agreed on "statistically monoidal" - it seems fine, and is consistent with
@johnynek looks like some unrelated HLL unit test failed, but I can't re-run it.
This is non-deterministic, but I wonder if, by tracking a seed that we update on merges, we can make it deterministic and yet keep the property that in expectation it is randomized.
Yes, definitely. I was thinking of just using a Random object seeded with some variation of hashCode on the data, something along the lines of:

    import scala.util.Random

    // shuffle deterministically, seeding from the data itself
    def deterministicShuffle[T](data: Seq[T]): Seq[T] =
      if (data.isEmpty) data else new Random(data.head.hashCode).shuffle(data)
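For example (hypothetical usage, just to illustrate the intended property), repeated calls on the same data always produce the same ordering, because the seed is derived from the data itself:

    val xs = Vector(1, 2, 3, 4, 5)
    // same data => same seed => same shuffle
    assert(deterministicShuffle(xs) == deterministicShuffle(xs))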
This is exciting, and I really appreciate the work on this. I want to get this merged, but I also hope you agree with us that careful code review is important.

One thing I'm wondering: could we break this into two PRs? One to add the red/black trees with tests, and a second, building clearly on the first, that uses the red/black trees to implement the t-digest.

What I'm not seeing yet in the review is why scala.collection.immutable.SortedMap (http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.SortedMap) won't work here. Could you comment in the code? Sorry if I missed it. Also, red/black trees are O(log_2 N), but the hash-trie based standard maps are effectively O(log_32 N). Can you comment on what features we need from the red/black tree that the faster hash-trie can't give us?
@johnynek, I agree, it's a big PR with a lot going on, and I think that reviews may as well be rigorous, otherwise they aren't as useful.
@johnynek regarding inheriting from
Regarding the particular use of binary trees, one nontrivial reason for that choice is that binary trees support log-time maintenance and queries for prefix sums, as well as my "cover" constructs and a couple of other basic algorithms. So there is more going on than just mapping objects. Although that might be possible in a non-binary-tree data structure, I think the logic would be a lot harder at best, and maybe not possible at all. Using red-black trees was just a way to guarantee balanced trees with a well-understood data structure that I had a hope of getting to work along with all the other functionality I was layering on top :)
Maybe an even better way to explain it is that a lot of the logic I need to support requires that clusters are maintained in numeric / location order, so any hash-based container will not give me what I need. A tree data structure maintains the ordering by (numeric) key that I need, along with the desirable log-time operations.
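To make the SortedMap comparison concrete, here is a minimal sketch (hypothetical names, not code from this PR): an ordered map can give you the clusters at or below a point, but summing their mass is still a linear scan of that prefix, whereas a balanced tree whose nodes cache subtree mass totals can answer the same prefix-sum query in O(log n).

    import scala.collection.immutable.SortedMap

    // clusters keyed by centroid location, value = cluster mass
    val clusters: SortedMap[Double, Double] =
      SortedMap(1.0 -> 2.0, 3.5 -> 1.0, 7.2 -> 4.0)

    // total mass at or below x: with a plain SortedMap this is an O(n)
    // scan of the prefix; an augmented tree that maintains per-subtree
    // mass sums can answer it in O(log n), which is what the t-digest needs
    def prefixMass(x: Double): Double =
      clusters.iterator.takeWhile { case (k, _) => k <= x }.map(_._2).sum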
@johnynek regarding separate Red/Black node classes: in an earlier iteration I was doing that, but the code wasn't as DRY. To make a long story short, it resulted in two parallel copies of the internal node logic (which is where most of the logic is). So I felt that designing around a single internal node class, with color as one field, was the cleaner solution.
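As a rough illustration of that design choice (a sketch with hypothetical names, not the PR's actual classes), the color becomes an ordinary field on a single internal-node type, so insertion and rebalancing logic can pattern match on the color rather than being duplicated across separate red and black node classes:

    sealed trait Color
    case object Red extends Color
    case object Black extends Color

    sealed trait Node[K, V]
    case class Leaf[K, V]() extends Node[K, V]
    // one internal node class; color is just data
    case class INode[K, V](color: Color, key: K, value: V,
                           left: Node[K, V], right: Node[K, V]) extends Node[K, V]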
@johnynek splitting into a "tree PR" and a "t-digest PR" is OK with me; would you like me to pull the trigger on that?
@erikerlandson Yes. Please.
(force-pushed 8f10e56 to 4356629)
The supporting tree/map library has been factored out to #496, and I will now be keeping this branch rebased off of that topic branch.
(force-pushed 4356629 to 05f387b)
…ith specific classes that inherit from the hierarchy which interact badly with type-widening
(force-pushed 3bdc3a4 to 503db35)
@johnynek @ianoc @avibryant is there still interest in #495 and #496?
I would very much like to see this PR merged. I am currently using the reference implementation, but would benefit from an Algebird-based implementation.
@kainoa21 coincidentally I started revisiting this last weekend. There was a request to un-factor the various tree functions to reduce the code bulk on the back end, which is still on my to-do list.
      // if we have already distributed all the mass, remaining clusters unchanged
      cmNew = cmNew :+ ((clust.centroid, clust.mass))
    } else if (xn == clust.centroid) {
      // if xn lies exactly on the centroid, add all mass in regardless of bound
What's the justification for this? It doesn't seem to come from the paper.
In fact, there is explicit acknowledgment that two clusters could have the same centroid:
"For centroids with identical means, order of creation is used as a tie-breaker to allow an unambiguous ordering"
It allows the centroid to serve as a unique key, which simplifies things. Multiple clusters with the same centroid add nothing useful to the model.
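A minimal sketch of the invariant being described (hypothetical types, not the PR's code): if an incoming point lands exactly on an existing centroid, its mass is folded into that cluster rather than creating a second cluster under the same key.

    case class Cluster(centroid: Double, mass: Double)

    // keep centroid values unique: merge mass into an existing cluster at x,
    // or create a new cluster if no cluster sits exactly at x
    def addAt(clusters: Map[Double, Cluster], x: Double, m: Double): Map[Double, Cluster] =
      clusters.get(x) match {
        case Some(c) => clusters.updated(x, c.copy(mass = c.mass + m))
        case None    => clusters.updated(x, Cluster(x, m))
      }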
Sorry, we did not communicate clearly on this. This is indeed an exciting algorithm; however, if we are going to become the maintainers of it, it needs to be significantly smaller in size. An implementation that could reuse existing Scala collection classes may be small enough to accept. Honestly, the best approach is probably for @erikerlandson to provide a low-dependency subproject that hosts the t-digest, and then we could have
I'd like to add that these cases are a challenge. We want to welcome contributions, but inclusion can seem like an endorsement, which has burned us a few times in the past when implementations bit-rot or perform poorly (not saying that will happen here; I just mean that we need to really have the bandwidth to understand and maintain the code in order to add it).
@johnynek What about an
This is now available as a package here: so I'm going to close this out.
As described in the following paper:
Computing Extremely Accurate Quantiles Using t-Digests
Ted Dunning and Otmar Ertl
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf