Cuber/roller macros #483

sid-kap · 2015-08-14T18:59:22Z

ianoc · 2015-08-14T18:59:52Z

algebird-core/src/main/scala/com/twitter/algebird/macros/Cuber.scala

kill comment

ianoc · 2015-08-14T19:02:02Z

algebird-test/src/test/scala/com/twitter/algebird/macros/CuberMacroTest.scala

Can we use scalatest here instead of scalacheck so it runs with the other tests? It will then count toward the metric of how many tests we've run overall.

I thought @johnynek preferred scalacheck over scalatest?

He's not a huge fan of scalatest, it's true, but most of our tests are already scalatest, and it provides metrics on how many run. A big pain point is when tests don't get run, so it would give us that data to graph and track in future.

ianoc · 2015-08-14T19:04:44Z

What's the definition of a cube vs a rollup?

sid-kap · 2015-08-14T19:08:08Z

Cube produces all 2^n subsets of the data. Rollup is useful for hierarchical data, and produces only n+1 subsets, where each field is included only if all the previous (more general) fields are included.

@joshualande has a great blog post on this: http://joshualande.com/cube-rollup-pig-data-science/
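
To make the counts concrete, here is a minimal hand-written sketch for a two-field key (n = 2). The names `cube2` and `rollup2` are illustrative assumptions only; they are not the macro-generated API from this PR.

```scala
object CubeRollupSketch {
  // Cube: each field is independently kept (Some) or dropped (None), giving 2^n keys.
  def cube2[A, B](t: (A, B)): List[(Option[A], Option[B])] = {
    val (a, b) = t
    List((None, None), (Some(a), None), (None, Some(b)), (Some(a), Some(b)))
  }

  // Rollup: a field is kept only if every field before it is kept, giving n + 1 keys.
  def rollup2[A, B](t: (A, B)): List[(Option[A], Option[B])] = {
    val (a, b) = t
    List((None, None), (Some(a), None), (Some(a), Some(b)))
  }
}
```

For a (country, year) key, the rollup keeps "everything", "per country", and "per country and year", but never "per year across all countries", which only the cube produces.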

ianoc · 2015-08-14T19:09:16Z

Ok cool, well sounds like we should add some decent comments to the code and links to that post

sid-kap · 2015-08-14T20:30:14Z

We should probably add enhancements to Seq so that people can do

case class Person(age: Int, gender: Gender, height: Double)
val people: List[Person]

val map1: Map[(Option[Int], Option[Gender]), List[Person]] = 
  people.cubeBy { p => (p.age, p.gender) } 

val map2: Map[(Option[Int], Option[Gender]), List[Person]] = 
  people.rollupBy { p => (p.age, p.gender) }

or maybe even

val people: List[Person]
def aggregator(people: Seq[Person]) = {
  val heights = people.map(_.height)
  heights.sum / heights.length
}
val map1: Map[(Option[Int], Option[Gender]), Double] =
  people.cubeBy( { p => (p.age, p.gender) }, aggregator)

Not sure exactly what API we want to provide.
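
One possible shape for the first variant, as a rough sketch only: an implicit class adding `cubeBy` for two-field keys. `SeqCubeSyntax`, `CubeOps`, and the `Person` definition (with `gender` as a plain `String`) are assumptions made for illustration, and the sketch uses `groupBy` for brevity rather than speed.

```scala
object SeqCubeSyntax {
  // Hypothetical record type, simplified from the example above.
  final case class Person(age: Int, gender: String, height: Double)

  implicit class CubeOps[T](items: Seq[T]) {
    // Groups every item under each of the 4 cube keys derived from its (A, B) key.
    def cubeBy[A, B](key: T => (A, B)): Map[(Option[A], Option[B]), List[T]] = {
      val expanded: Seq[((Option[A], Option[B]), T)] = items.flatMap { item =>
        val (a, b) = key(item)
        val keys = List[(Option[A], Option[B])](
          (None, None), (Some(a), None), (None, Some(b)), (Some(a), Some(b)))
        keys.map(k => (k, item))
      }
      expanded.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2).toList) }
    }
  }
}
```

With `import SeqCubeSyntax._` in scope, `people.cubeBy(p => (p.age, p.gender))` would then return one entry per distinct cube key, each holding the matching people in input order.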

ianoc · 2015-08-14T20:33:24Z

algebird-core/src/main/scala/com/twitter/algebird/macros/Cuber.scala

can we just statically unroll this in the macro ?

We can then cache our Some allocations; right now we would re-allocate a lot.

+1 to unrolling in the macro. The nested flatMaps really deep will probably optimize poorly.
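
As a rough illustration of the difference (not the code the macro actually emits), compare a nested-`flatMap` cube with a hand-unrolled one for two fields. `cubeNested` and `cubeUnrolled` are hypothetical names.

```scala
object UnrolledCube {
  // Nested shape: the inner list, including its Some(b), is rebuilt for every outer element.
  def cubeNested[A, B](a: A, b: B): List[(Option[A], Option[B])] =
    List[Option[A]](None, Some(a)).flatMap { oa =>
      List[Option[B]](None, Some(b)).map(ob => (oa, ob))
    }

  // Unrolled shape: each Some is allocated once and shared by every tuple that needs it.
  def cubeUnrolled[A, B](a: A, b: B): List[(Option[A], Option[B])] = {
    val some1: Option[A] = Some(a)
    val some2: Option[B] = Some(b)
    List((None, None), (None, some2), (some1, None), (some1, some2))
  }
}
```

Both produce the same four keys; the unrolled version avoids the intermediate lists and closures, and its tuples share the same `Some` instances.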

johnynek · 2015-08-15T01:45:18Z

algebird-core/src/main/scala/com/twitter/algebird/macros/Cuber.scala

how is this different from Cuber?

okay, I get it. This is doing a prefix only, not the full data cube. Given that the type is isomorphic, it is easy to get confused. Can we make a shorter explanation at the top of the comment? Something like:

/**
 * For a tuple N produces a result with (N + 1) elements each of arity N
 * such that there is a suffix of k Nones, for all k from 0 to N.
 */

johnynek · 2015-08-17T22:24:13Z

algebird-core/src/main/scala/com/twitter/algebird/macros/Cuber.scala

You can evaluate ((1 << ${index - 1}) & i) == 0 at compile time. Why not:

(1 to arity).map { index =>
  val some = newTermName(s"some$index")
  if (((1 << ${index - 1}) & i) == 0) q"_root_.scala.None" else q"$some"
}

ahh, I see. It has i in it, which is not known at this time.
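
A plain-Scala sketch of why the test has to stay in the generated code: the shift amount for each field is a constant (it comes from `index - 1` at macro-expansion time), but `i` only exists at runtime as the loop variable over the 2^n subsets. `BitmaskCube.cube3` is a hypothetical hand-expansion for arity 3, not the macro's real output.

```scala
object BitmaskCube {
  def cube3[A, B, C](a: A, b: B, c: C): Seq[(Option[A], Option[B], Option[C])] = {
    // Allocated once, then reused across all 8 tuples.
    val some1: Option[A] = Some(a)
    val some2: Option[B] = Some(b)
    val some3: Option[C] = Some(c)
    // i ranges over every bitmask; bit (index - 1) decides whether field `index` is kept.
    (0 until (1 << 3)).map { i =>
      (
        if (((1 << 0) & i) == 0) None else some1,
        if (((1 << 1) & i) == 0) None else some2,
        if (((1 << 2) & i) == 0) None else some3
      )
    }
  }
}
```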

johnynek · 2015-08-17T22:41:14Z

Looks good to me. I'll defer to Ian if he really wants scalatest: I get the desire for a count of tests, but we've had several issues with property checks not running under scalatest, which worries me. It's harder for that to happen with scalacheck's approach, which is type checked, whereas scalatest uses functions returning Unit and silently ignores functions that have no asserts.

As to the API, I'd still like to see:

object MapAlgebra {
  def cube/rollup[K, V](it: TraversableOnce[(K, V)])(implicit c: Cuber[K], sg: Semigroup[V]): Map[c.K, V]
  // maybe we could also do:
  def cubeBy[T, K,U,V](it: TraversableOnce[T], agg: Aggregator[T,U,V])(fn: T => K)(implicit c: Cuber[K]): Map[c.K, V]
  // using scala's groupBy usually performs very poorly. We'd probably need to use MapAlgebra.sumByKey which internally uses a mutable map.
}
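
A self-contained sketch of the semigroup variant, under stated assumptions: `Semigroup` and `Cuber` below are minimal local stand-ins (not Algebird's real traits), the only `Cuber` instance is a hand-written one for pairs, and the method is called `cubeSum`, the name proposed later in this thread. It accumulates into a mutable map rather than using `groupBy`.

```scala
import scala.collection.mutable

object MapAlgebraSketch {
  // Hypothetical stand-in for a semigroup: an associative combine.
  trait Semigroup[V] { def plus(l: V, r: V): V }
  object Semigroup {
    implicit val intPlus: Semigroup[Int] =
      new Semigroup[Int] { def plus(l: Int, r: Int): Int = l + r }
  }

  // Hypothetical stand-in for Cuber: maps a key to all of its cube keys.
  trait Cuber[I] {
    type K
    def apply(in: I): Iterable[K]
  }
  object Cuber {
    implicit def pair[A, B]: Cuber[(A, B)] { type K = (Option[A], Option[B]) } =
      new Cuber[(A, B)] {
        type K = (Option[A], Option[B])
        def apply(in: (A, B)): Iterable[K] = {
          val s1: Option[A] = Some(in._1)
          val s2: Option[B] = Some(in._2)
          List((None, None), (s1, None), (None, s2), (s1, s2))
        }
      }
  }

  // Sums the values that land on each cube key.
  def cubeSum[K1, V](it: Iterable[(K1, V)])(implicit c: Cuber[K1], sg: Semigroup[V]): Map[c.K, V] = {
    val acc = mutable.Map.empty[c.K, V]
    it.foreach { case (k, v) =>
      c(k).foreach { ck =>
        acc.update(ck, acc.get(ck).map(sg.plus(_, v)).getOrElse(v))
      }
    }
    acc.toMap
  }
}
```

The dependent return type `Map[c.K, V]` is what lets the caller see the concrete `(Option[A], Option[B])` key type without naming it.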

sid-kap · 2015-08-18T00:16:50Z

(I posted this comment in the diff above but it's messy to have this discussion in two places so might as well just move it here.)

The cube function in Pig and SQL doesn't seem to have aggregation built in -- it only generates the tuples, and then you manually do a groupby/aggregation/etc afterwards. Maybe we should emulate that here, by keeping the aggregation separate from the cube method.

This would be more in line with how we were planning to implement this in Scalding -- if we want TypedPipe[(K,V)]#cube to return a TypedPipe[(c.K, V)], then it would make sense for MapAlgebra.cube(t: TraversableOnce[(K,V)]) to return either a TraversableOnce[(c.K, V)] or a Map[c.K, TraversableOnce[V]].

I don't have a huge problem with semigroup aggregation here, but I think it would be nice to have consistent names and types between the MapAlgebra version of cube and the TypedPipe cube.

(Maybe we could find a different name for the cube function you mentioned above? Maybe cubeAggregate or something?)

johnynek · 2015-08-18T00:57:28Z

How about MapAlgebra.cubeSum and MapAlgebra.cubeAggregate, which use Semigroup and Aggregator respectively to return a Map?


sid-kap · 2015-08-18T00:58:52Z

Sure, that sounds good.
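
A possible shape for the aggregator variant, sketched without depending on Algebird: `Agg` is a hypothetical three-step aggregator (prepare, combine, present), `averageOf` is an illustrative helper, and the cube function is passed explicitly instead of coming from an implicit Cuber.

```scala
import scala.collection.mutable

object CubeAggregateSketch {
  // Hypothetical minimal aggregator: prepare each input, combine partials, present the result.
  trait Agg[T, U, V] {
    def prepare(t: T): U
    def combine(l: U, r: U): U
    def present(u: U): V
  }

  // Aggregates the items that land on each cube key produced by `cube(key(item))`.
  def cubeAggregate[T, K, CK, U, V](items: Iterable[T], agg: Agg[T, U, V])(key: T => K)(
      cube: K => Iterable[CK]): Map[CK, V] = {
    val acc = mutable.Map.empty[CK, U]
    items.foreach { item =>
      val u = agg.prepare(item)
      cube(key(item)).foreach { ck =>
        acc.update(ck, acc.get(ck).map(agg.combine(_, u)).getOrElse(u))
      }
    }
    acc.iterator.map { case (ck, u) => (ck, agg.present(u)) }.toMap
  }

  // Average carried as (sum, count) so that partial results combine associatively.
  def averageOf[T](f: T => Double): Agg[T, (Double, Long), Double] =
    new Agg[T, (Double, Long), Double] {
      def prepare(t: T): (Double, Long) = (f(t), 1L)
      def combine(l: (Double, Long), r: (Double, Long)): (Double, Long) =
        (l._1 + r._1, l._2 + r._2)
      def present(u: (Double, Long)): Double = u._1 / u._2
    }
}
```

Carrying the average as a (sum, count) pair is the reason for the separate prepare and present steps: a plain average of averages would not combine correctly across cube keys.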

sid-kap · 2015-08-18T01:26:10Z

algebird-core/src/main/scala/com/twitter/algebird/MapAlgebra.scala

The trade-off here is to allocate lots more objects (the Semigroup, lots of Iterable singletons and Maps) in order to take advantage of sumByKey (rather than using Scala's groupBy). Is this a good idea?

I don't think this approach is going to work well (the default Iterable is List, I think, so you're going to get repeated ++ on List which gives an O(N^2) algorithm).

You're better off doing something like:

val map: collection.mutable.Map[c.K, List[V]] = collection.mutable.Map[c.K, List[V]]()
it.toIterator.foreach { case (k, v) =>
  c(k).foreach { ik =>
    map.get(ik) match {
      case Some(vs) => map += ik -> (v :: vs)
      case None => map += ik -> List(v)
    }
  }
  // now reverse the lists and return an immutable map.
  // you might just wrap this one with the immutable map that wraps a mutable one (so we
  // don't pay the immutable construction cost unless we need to, i.e. the user calls + or .toMap).
}
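
A runnable variant of the same idea, as a sketch: the cube function is passed explicitly instead of using the implicit Cuber, and the result is copied with `.toMap` for simplicity. `CubeGrouping.cubeToLists` is a hypothetical name.

```scala
import scala.collection.mutable

object CubeGrouping {
  def cubeToLists[K, CK, V](it: Iterable[(K, V)])(cube: K => Iterable[CK]): Map[CK, List[V]] = {
    val acc = mutable.Map.empty[CK, List[V]]
    it.foreach { case (k, v) =>
      cube(k).foreach { ck =>
        // Prepending is O(1); appending with ++ would copy the list every time.
        acc.update(ck, v :: acc.getOrElse(ck, Nil))
      }
    }
    // Values were prepended, so reverse each list once to restore input order.
    acc.iterator.map { case (ck, vs) => (ck, vs.reverse) }.toMap
  }
}
```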

I can't seem to find an immutable wrapper around a mutable map in the standard library. So our options here are:

1. Write a wrapper class.

2. Return an immutable map by calling .toMap on the map.

3. Return the mutable map, but specify the return type as scala.collection.Map instead of scala.collection.mutable.Map. In this case, the user doesn't know that it's mutable, so it's effectively immutable. (They can only access the mutator methods by casting.) If they really want an immutable.Map, they can call .toMap on it.

I'm leaning toward the third option.

We have one around in algebird I believe.

BTW option 3 is bad because the contract there lets the underlying map change underneath you: you can't presume it's stable or immutable, only that the handle you hold is read-only. Subtle but important difference.
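
The difference can be seen with the standard library alone. In this small sketch (all names illustrative), a `scala.collection.Map` handle observes later writes to the mutable map behind it, while a `.toMap` snapshot does not.

```scala
import scala.collection.mutable

object ReadOnlyViewPitfall {
  // Returns (size seen through the read-only handle, size of the snapshot) after a write.
  def sizesAfterWrite(): (Int, Int) = {
    val backing = mutable.Map("a" -> 1)
    // Upcasting hides the mutators but does not copy: both names refer to the same object.
    val readOnly: scala.collection.Map[String, Int] = backing
    // toMap copies the entries, so later writes cannot affect it.
    val snapshot: Map[String, Int] = backing.toMap
    backing.update("b", 2)
    (readOnly.size, snapshot.size)
  }
}
```

Anyone still holding the mutable reference can change what the read-only handle sees, which is why the static type alone does not make the result safe to treat as immutable.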

Ah, found the mutable backed map.

johnynek · 2015-08-18T19:26:56Z

algebird-core/src/main/scala/com/twitter/algebird/MapAlgebra.scala

let's make the value return type List[V] or Iterable[V].

johnynek · 2015-08-18T19:33:14Z

other than the type issues I mentioned above, 👍 @ianoc any concerns?

This is really great. Going to be a huge win for people doing rollups in the REPL.

ianoc · 2015-08-18T19:50:20Z

Nope, this looks great to me

Cuber/roller macros

First sketch of cuber/roller macro

022aaa7

ianoc reviewed Aug 14, 2015

Require at least one param in case class for cuber

ee529e9

ianoc reviewed Aug 14, 2015

Add doc comments to Cuber and Roller

ff94531

ianoc reviewed Aug 14, 2015

sid-kap added 2 commits August 14, 2015 13:35

Fix type error in doc comment

494a4df

Fix typo

6f58da6

johnynek reviewed Aug 15, 2015

sid-kap added 5 commits August 17, 2015 11:16

Edit comments, improve tests

ac1c03e

Make Cuber more efficient

346c19b

Unroll loop in roller macro

3b3707c

Make cuber and roller methods implicit

e8f242e

Change TupleN to _root_.scala.TupleN

2de8276

johnynek reviewed Aug 17, 2015

Edit comment

8c7704b

sid-kap force-pushed the cuber_roller branch from e0ac4de to 8c7704b Compare August 17, 2015 22:45

Clean up macros a bit

5b2db7b

sid-kap added 2 commits August 17, 2015 18:04

Use scalatest instead of scalacheck

39c733a

Implement {cube,rollup}{,Sum,Aggregate}

3ce55bb

sid-kap reviewed Aug 18, 2015

sid-kap added 2 commits August 18, 2015 12:10

Improve {cube,rollup}{,Sum,Aggregate} and write tests

71cd6cc

Refactor macros helper functions

4564d14

johnynek reviewed Aug 18, 2015

Make cube/rollup return a Map[c.K, List[V]]

55d4335

johnynek added a commit that referenced this pull request Aug 18, 2015

Merge pull request #483 from sid-kap/cuber_roller

2aa6ba8

Cuber/roller macros

johnynek merged commit 2aa6ba8 into twitter:develop Aug 18, 2015

sid-kap deleted the cuber_roller branch August 18, 2015 21:25
