
Conversation

@erikerlandson
Contributor

As described in the following paper:
Computing Extremely Accurate Quantiles Using t-Digests
Ted Dunning and Otmar Ertl
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf

@erikerlandson
Contributor Author

For reference, my blog post that describes my reasoning behind binary tree algorithms backing the t-digest:
http://erikerlandson.github.io/blog/2015/09/26/a-library-of-binary-tree-algorithms-as-mixable-scala-traits/

@erikerlandson
Contributor Author

CI is showing a failure, but the log output makes it seem like tests succeeded. I'm not replicating any failures on my local machine.

Collaborator

return type on public methods, please.

@erikerlandson
Contributor Author

Comment on t-digest monoids: I'm pretty sure I can address the issue of non-deterministic behavior of monoid plus, which may be worth doing. But I think there is a deeper failure of strict associativity, since (td1 ++ td2) ++ td3 is unlikely to yield the same final set of clusters as td1 ++ (td2 ++ td3), even if the re-clustering presentation order becomes repeatable. I'd contend that this is OK (the behavior is "statistically monoidal", to whatever extent that's a thing), but it's worth calling out, since I'm declaring that it has a monoid type class.
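A rough sketch of what a "statistically monoidal" check could look like (illustrative only; the `TD` trait below is a hypothetical stand-in, not this PR's API): instead of demanding identical cluster sets from the two groupings, compare quantile estimates within a tolerance.

```scala
// Hypothetical stand-in for a t-digest, just to make the property concrete.
trait TD {
  def ++(that: TD): TD            // merge two digests
  def quantile(q: Double): Double // estimate the q-th quantile
}

// Associativity holds only approximately: compare quantile estimates from
// (td1 ++ td2) ++ td3 and td1 ++ (td2 ++ td3) instead of exact cluster sets.
def approxAssociative(td1: TD, td2: TD, td3: TD, eps: Double): Boolean = {
  val left  = (td1 ++ td2) ++ td3
  val right = td1 ++ (td2 ++ td3)
  Seq(0.1, 0.5, 0.9).forall { q =>
    math.abs(left.quantile(q) - right.quantile(q)) <= eps
  }
}
```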

Collaborator

can we remove this?

@avibryant
Contributor

@erikerlandson agreed on "statistically monoidal" - it seems fine, and is consistent with QTree and with the top-k stuff in CMS. We should come up with some formal expression of this but it's definitely something we're comfortable with.

@erikerlandson
Contributor Author

@johnynek looks like some unrelated HLL unit test failed, but I can't re-run it

Collaborator

This is non-deterministic, but I wonder: if we track a seed which we update on merges, can we make it deterministic and yet keep the property that, in expectation, it is randomized?

Contributor Author

Yes, definitely. I was thinking of just using a Random object seeded with some variation of hashCode on the data, something along the lines of:

import scala.util.Random

def deterministicShuffle[T](data: Seq[T]): Seq[T] =
  if (data.isEmpty) data else new Random(data.head.hashCode).shuffle(data)
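For illustration, with this seeding the same input always produces the same ordering, so a merge that shuffles its inputs becomes repeatable:

```scala
val xs = Vector(3.1, 4.1, 5.9, 2.6)
assert(deterministicShuffle(xs) == deterministicShuffle(xs))
```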

@johnynek
Collaborator

This is exciting, and I really appreciate the work on this. I want to get this merged, but also I hope you agree with us that careful code review is important.

One thing I'm wondering: could we break this into two PRs? One to add the Red/black trees with tests, and the second to leverage the red/black trees to implement T-digest (that builds on the first clearly)?

What I'm not seeing yet in the review is why scala.collection.immutable.SortedMap (http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.SortedMap) won't work here. Could you comment in the code? Sorry if I missed it.

Also, red/black trees are O(log_2 N), but the hash-trie based standard maps are O(log_32 N). Can you comment on what features we need from the red/black tree that the faster hash-trie can't provide?

@erikerlandson
Contributor Author

@johnynek, I agree; it's a big PR with a lot going on, and I think reviews may as well be rigorous, otherwise they aren't as useful.

@erikerlandson
Contributor Author

@johnynek regarding inheriting from SortedMap, I am pretty sure I can update the tree classes to satisfy the SortedMap interface trait, although it involves implementing several methods that I didn't immediately have great answers for, so I punted a bit. I had figured that might be addressed in a future PR. However, if you prefer I can try and take that bull by the horns now.

Regarding the particular use of binary trees, one nontrivial reason for that choice was that binary trees support log-time maintenance and queries for prefix sums, my "cover" constructs, and a couple of other basic algorithms. So there is more going on than just mapping objects. Although that might be possible in a non-binary-tree data structure, I think the logic would be a lot harder at best, and maybe not possible at all.

Using red-black trees was just a way to guarantee balanced trees with a well-understood data structure that I had a hope of getting to work along with all the other functionality I was layering on top :)
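A rough sketch of the subtree-sum augmentation described above (illustrative only, not this PR's tree classes): each node caches the total mass of its subtree, so a prefix-sum query over cluster centroids walks a single root-to-leaf path in O(log n).

```scala
// Unbalanced binary search tree keyed by centroid, augmented with subtree mass.
// Balancing (red-black) is omitted; this only shows why key ordering plus
// per-node sums gives log-time prefix-sum queries.
sealed trait MTree {
  def mass: Double                  // total mass stored in this subtree
  def prefixSum(x: Double): Double  // total mass of clusters with centroid <= x
}
case object MLeaf extends MTree {
  val mass = 0.0
  def prefixSum(x: Double) = 0.0
}
case class MNode(centroid: Double, m: Double, left: MTree, right: MTree) extends MTree {
  val mass = left.mass + m + right.mass
  def prefixSum(x: Double): Double =
    if (x < centroid) left.prefixSum(x)
    else left.mass + m + right.prefixSum(x)
}
```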

@erikerlandson
Contributor Author

Maybe an even better way to explain it is that a lot of the logic I need to support requires that clusters are maintained in numeric / location order. So any hash-based container will not get me what I need. A tree data structure maintains the ordering by (numeric) key that I need, along with the desirable log-time operations.

@erikerlandson
Contributor Author

@johnynek regarding separate Red/Black node classes, in an earlier iteration I was doing that, but the code wasn't as DRY. To make a long story short, it was resulting in two parallel copies of the internal node logic (which is where most of the logic is). So I felt like designing around a single internal node class, with color as one field, was the cleaner solution.
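For illustration (a sketch, not this PR's classes), the single-internal-node-class design looks roughly like this; rotation and rebalancing logic can then pattern-match on the color field rather than being duplicated across separate red and black node classes.

```scala
sealed trait Color
case object Red extends Color
case object Black extends Color

sealed trait RBTree[K, V]
case class RBLeaf[K, V]() extends RBTree[K, V]
// A single internal-node class, with color as a field, so the insertion and
// balancing logic is written once instead of in two parallel node classes.
case class RBNode[K, V](color: Color, key: K, value: V,
                        left: RBTree[K, V], right: RBTree[K, V]) extends RBTree[K, V]
```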

@erikerlandson
Contributor Author

@johnynek splitting into a "tree PR" and a "t-digest PR" is OK with me, would you like me to pull the trigger on that?

@johnynek
Collaborator

@erikerlandson Yes. Please.

@erikerlandson
Contributor Author

The supporting tree/map library has been factored out to #496, and I will now be keeping this branch rebased off of the topic branch feature/treemaps.

@erikerlandson
Contributor Author

@johnynek @ianoc @avibryant is there still interest in #495 and #496 ?

@kainoa21

Would very much like to see this PR merged. I am currently using the reference implementation but would benefit from an Algebird-based implementation.

@erikerlandson
Contributor Author

@kainoa21 coincidentally I started re-visiting this last weekend. There was a request to un-factor the various tree functions to reduce the code bulk on the back end, which is still on my to-do list.

// if we have already distributed all the mass, remaining clusters unchanged
cmNew = cmNew :+ ((clust.centroid, clust.mass))
} else if (xn == clust.centroid) {
// if xn lies exactly on the centroid, add all mass in regardless of bound
Contributor

What's the justification for this? It doesn't seem to come from the paper.

In fact, there is explicit acknowledgment that two clusters could have the same centroid:

"For centroids with identical means, order of creation is used as a tie-breaker to allow an unambiguous ordering"

Contributor Author

It allows the centroid to serve as a unique key, which simplifies things. Multiple clusters with the same centroid add nothing useful to the model.

@johnynek
Collaborator

Sorry, we did not communicate clearly on this.

This is indeed an exciting algorithm; however, if we are going to become the maintainers of it, it needs to be significantly smaller in size. An implementation that could reuse existing Scala collection classes may be small enough to accept.

Honestly, the best approach is probably for @erikerlandson to provide a low-dependency subproject that hosts the t-digest, and then we could have an algebird-tdigest sub-project which adds the relevant Algebird integration (monoids and aggregators mostly, I think).
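A sketch of what the algebird-tdigest side of that split might look like (hypothetical: the TDigest trait below stands in for whatever the external library would expose; only the Monoid shape comes from Algebird):

```scala
import com.twitter.algebird.Monoid

// Hypothetical t-digest API owned by the external, low-dependency subproject.
trait TDigest {
  def merge(that: TDigest): TDigest
}

// The algebird-tdigest subproject would mostly contribute type-class instances.
class TDigestMonoid(empty: TDigest) extends Monoid[TDigest] {
  def zero: TDigest = empty
  def plus(l: TDigest, r: TDigest): TDigest = l.merge(r)
}
```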

@johnynek
Collaborator

I'd like to add, these cases are a challenge. We want to welcome contributions, but inclusion can seem like an endorsement, which has burned us a few times in the past when implementations bit-rot or perform poorly (not saying that will happen here; I just mean that we really need to have the bandwidth to understand and maintain the code in order to add it).

@willb

willb commented Aug 23, 2016

@johnynek What about an algebird-contrib repository (or organization) for cases just like this, where developers could host community-maintained code that was designed to work with Algebird but without an Algebird imprimatur?

@erikerlandson
Contributor Author

This is now available as a package here:
https://github.com/isarn/isarn-sketches

So I'm going to close this out.
