-
Notifications
You must be signed in to change notification settings - Fork 347
Cuber/roller macros #483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cuber/roller macros #483
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kill comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use scalatest here instead of scalacheck so it runs with the other tests? It will get counted towards the metrics then of how many tests we've ran overall
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought @johnynek preferred scalacheck over scalatest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He's not a huge fan of scalatest its true, but most of our tests are already scalatest, and it provides metrics on how many run. A big pain point is when tests don't get run, so it would provide us that data so we can graph/track in future.
|
Whats the definition of a cube vs a rollup? |
|
Cube produces all 2^n subsets of the data. Rollup is useful for hierarchical data, and produces only n+1 subsets, where each field is included only if all the previous (more general) fields are included. @joshualande has a great blog post on this: http://joshualande.com/cube-rollup-pig-data-science/ |
|
Ok cool, well sounds like we should add some decent comments to the code and links to that post |
|
We should probably add enhancements to Seq so that people can do case class Person(age: Int, gender: Gender, height: Double)
val people: List[Person]
val map1: Map[(Option[Int], Option[Gender]), List[Person]] =
people.cubeBy { p => (p.age, p.gender) }
val map2: Map[(Option[Int], Option[Gender]), List[Person]] =
people.rollupBy { p => (p.age, p.gender) } or maybe even val people: List[Person]
def aggregator(people: Seq[Person]) = {
val heights = people.map(_.height)
heights.sum / heights.length
}
val map1: Map[(Option[Int], Option[Gender]), Double] =
people.cubeBy( { p => (p.age, p.gender) }, aggregator)Not sure exactly what API we want to provide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just statically unroll this in the macro ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can then cache our Some allocations, right now we would re-allocate a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to unrolling in the macro. The nested flatMaps really deep will probably optimize poorly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this different from Cuber?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I get it. This is doing a prefix only, not the full data cube. Given that the type is isomorphic, it is easy to get confused. Can we make a shorter explanation at the top of the comment? Something like:
/**
* For a tuple N produces a result with (N + 1) elements each of arity N such that there is a suffix of k Nones, for all k
* from 0 to N.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can evaluate ((1 << ${index - 1}) & i) == 0 at compile time. What not:
(1 to arity).map { index =>
val some = newTermName(s"some$index")
if (((1 << ${index - 1}) & i) == 0) q"_root_.scala.None" else q"$some"
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, I see. It has i in it, which is not known at this time.
|
Looks good to me (I'll defer to Ian if he really wants scalatest, I get the desire for a count of tests, but we've had several issues with property checks not running with scalatest, which worries me. It's harder to have that happen using scalacheck's approach, which is type checked, as opposed to scalatest using functions to unit which silently ignore functions that have no asserts). As to the API, I'd still like to see: object MapAlgebra {
def cube/rollup[K, V](it: TraversableOnce[(K, V)])(implicit c: Cuber[K], sg: Semigroup[V]): Map[c.K, V]
// maybe we could also do:
def cubeBy[T, K,U,V](it: TraversableOnce[T], agg: Aggregator[T,U,V])(fn: T => K)(implicit c: Cuber[K]): Map[c.K, V]
// using scala's groupBy usually performs very poorly. We'd probably need to use MapAlgebra.sumByKey which internally uses a mutable map.
} |
|
(I posted this comment in the diff above but it's messy to have this discussion in two places so might as well just move it here.) The cube function in Pig and SQL doesn't seem to have aggregation built in -- it only generates the tuples, and then you manually do a groupby/aggregation/etc afterwards. Maybe we should emulate that here, by keeping the aggregation separate from the This would be more in line with how we were planning to implement this in Scalding -- if we want I don't have a huge problem with semigroup aggregation here, but I think it would be nice to have consistent names and types between the MapAlgebra version of cube and the TypedPipe cube. (Maybe we could find a different name for the cube function you mentioned above? Maybe |
|
How about On Mon, Aug 17, 2015 at 2:16 PM, Sidharth Kapur [email protected]
Oscar Boykin :: @posco :: http://twitter.com/posco |
|
Sure, that sounds good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trade-off here is to allocate lots more objects (the Semigroup, lots of Iterable singletons and Maps) in order to take advantage of sumByKey (rather than using Scala's groupBy). Is this a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this approach is going to work well (the default Iterable is List, I think, so you're going to get repeated ++ on List which gives an O(N^2) algorithm).
You're better off doing something like:
val map: collection.mutable.Map[c.K, List[V]] = collection.mutable.Map[c.K, List[V]]()
it.toIterator.foreach { case (k, v) =>
c(k).foreach { ik =>
map.get(ik) match {
case Some(vs) => map += ik -> (v :: vs)
case None => map += ik -> List(v)
}
}
// now reverse the lists and return an immutable map.
// you might just wrap this one with the immutable map that wraps a mutable one (so we
// don't pay the immutable construction cost unless we need to, i.e. the user calls + or .toMap).
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't seem to find an immutable wrapper to a mutable map in the standard library. So our options here are
- write a wrapper class
- return an immutable map by calling
.toMapon the map - return the mutable map, but specify the return type as
scala.collection.Mapinstead ofscala.collection.mutable.Map. In this case, the user doesn't know that it's mutable, so it's effectively immutable. (They can only access the mutator methods by casting.) If they really want aimmutable.Map, they can call.toMapon it.
I'm leaning toward the third option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have one around in algebird I believe.
BTW option 3 is bad because the contract there lets the underlying map move about on you, you can't presume its stable/immutable. Just that the handle you have is. Subtle but important difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, found the mutable backed map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make the value return type List[V] or Iterable[V].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
other than the type issues I mentioned above, 👍 @ianoc any concerns? This is really great. Going to be a huge win for people doing rollups in the REPL. |
|
Nope, this looks great to me |
Related to twitter/scalding#1420