Skip to content

Conversation

@sid-kap
Copy link
Contributor

@sid-kap sid-kap commented Aug 14, 2015

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kill comment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use scalatest here instead of scalacheck so it runs with the other tests? It will get counted towards the metrics then of how many tests we've ran overall

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought @johnynek preferred scalacheck over scalatest?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He's not a huge fan of scalatest its true, but most of our tests are already scalatest, and it provides metrics on how many run. A big pain point is when tests don't get run, so it would provide us that data so we can graph/track in future.

@ianoc
Copy link
Collaborator

ianoc commented Aug 14, 2015

Whats the definition of a cube vs a rollup?

@sid-kap
Copy link
Contributor Author

sid-kap commented Aug 14, 2015

Cube produces all 2^n subsets of the data. Rollup is useful for hierarchical data, and produces only n+1 subsets, where each field is included only if all the previous (more general) fields are included.

@joshualande has a great blog post on this: http://joshualande.com/cube-rollup-pig-data-science/

@ianoc
Copy link
Collaborator

ianoc commented Aug 14, 2015

Ok cool, well sounds like we should add some decent comments to the code and links to that post

@sid-kap
Copy link
Contributor Author

sid-kap commented Aug 14, 2015

We should probably add enhancements to Seq so that people can do

case class Person(age: Int, gender: Gender, height: Double)
val people: List[Person]

val map1: Map[(Option[Int], Option[Gender]), List[Person]] = 
  people.cubeBy { p => (p.age, p.gender) } 

val map2: Map[(Option[Int], Option[Gender]), List[Person]] = 
  people.rollupBy { p => (p.age, p.gender) } 

or maybe even

val people: List[Person]
def aggregator(people: Seq[Person]) = {
  val heights = people.map(_.height)
  heights.sum / heights.length
}
val map1: Map[(Option[Int], Option[Gender]), Double] =
  people.cubeBy( { p => (p.age, p.gender) }, aggregator)

Not sure exactly what API we want to provide.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just statically unroll this in the macro ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can then cache our Some allocations, right now we would re-allocate a lot

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to unrolling in the macro. The nested flatMaps really deep will probably optimize poorly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this different from Cuber?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I get it. This is doing a prefix only, not the full data cube. Given that the type is isomorphic, it is easy to get confused. Can we make a shorter explanation at the top of the comment? Something like:

/**
 * For a tuple N produces a result with (N + 1) elements each of arity N such that there is a suffix of k Nones, for all k
 * from 0 to N.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can evaluate ((1 << ${index - 1}) & i) == 0 at compile time. What not:

 (1 to arity).map { index =>
  val some = newTermName(s"some$index")
  if (((1 << ${index - 1}) & i) == 0) q"_root_.scala.None" else q"$some"
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, I see. It has i in it, which is not known at this time.

@johnynek
Copy link
Collaborator

Looks good to me (I'll defer to Ian if he really wants scalatest, I get the desire for a count of tests, but we've had several issues with property checks not running with scalatest, which worries me. It's harder to have that happen using scalacheck's approach, which is type checked, as opposed to scalatest using functions to unit which silently ignore functions that have no asserts).

As to the API, I'd still like to see:

object MapAlgebra {
  def cube/rollup[K, V](it: TraversableOnce[(K, V)])(implicit c: Cuber[K], sg: Semigroup[V]): Map[c.K, V]
  // maybe we could also do:
  def cubeBy[T, K,U,V](it: TraversableOnce[T], agg: Aggregator[T,U,V])(fn: T => K)(implicit c: Cuber[K]): Map[c.K, V]
  // using scala's groupBy usually performs very poorly. We'd probably need to use MapAlgebra.sumByKey which internally uses a mutable map.
}

@sid-kap
Copy link
Contributor Author

sid-kap commented Aug 18, 2015

(I posted this comment in the diff above but it's messy to have this discussion in two places so might as well just move it here.)

The cube function in Pig and SQL doesn't seem to have aggregation built in -- it only generates the tuples, and then you manually do a groupby/aggregation/etc afterwards. Maybe we should emulate that here, by keeping the aggregation separate from the cube method.

This would be more in line with how we were planning to implement this in Scalding -- if we want TypedPipe[(K,V)]#cube to return a TypedPipe[(c.K, V)], then it would make sense for MapAlgebra.cube(t: TraversableOnce[(K,V)]) to return either a TraversableOnce[(c.K, V)] or a Map[c.K, TraversableOnce[V]].

I don't have a huge problem with semigroup aggregation here, but I think it would be nice to have consistent names and types between the MapAlgebra version of cube and the TypedPipe cube.

(Maybe we could find a different name for the cube function you mentioned above? Maybe cubeAggregate or something?)

@johnynek
Copy link
Collaborator

How about MapAlgebra.cubeSum and MapAlgebra.cubeAggregate that use
semigroup and Aggregator respectively to return a Map.

On Mon, Aug 17, 2015 at 2:16 PM, Sidharth Kapur [email protected]
wrote:

(I posted a comment in the diff above but it's messy to have this
discussion in two places so might as well just move it here.)

The cube function in Pig and SQL doesn't seem to have aggregation built in
-- it only generates the tuples, and then you manually do a
groupby/aggregation/etc afterwards. Maybe we should emulate that here, by
keeping the aggregation separate from the cube method.

This would be more in line with how we were planning to implement this in
Scalding -- if we want TypedPipe[(K,V)]#cube to return a TypedPipe[(c.K,
V)], then it would make sense for MapAlgebra.cube(t:
TraversableOnce[(K,V)]) to return either a TraversableOnce[(c.K, V)] or a Map[c.K,
V].

I don't have a huge problem with semigroup aggregation here, but I think
it would be nice to have consistent names and types between the List/Map
version of cube and the TypedPipe cube.

(Maybe we could find a different name for the cube function you mentioned
above? Maybe cubeAggregate or something?)


Reply to this email directly or view it on GitHub
#483 (comment).

Oscar Boykin :: @posco :: http://twitter.com/posco

@sid-kap
Copy link
Contributor Author

sid-kap commented Aug 18, 2015

Sure, that sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trade-off here is to allocate lots more objects (the Semigroup, lots of Iterable singletons and Maps) in order to take advantage of sumByKey (rather than using Scala's groupBy). Is this a good idea?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this approach is going to work well (the default Iterable is List, I think, so you're going to get repeated ++ on List which gives an O(N^2) algorithm).

You're better off doing something like:

val map: collection.mutable.Map[c.K, List[V]] = collection.mutable.Map[c.K, List[V]]()
it.toIterator.foreach { case (k, v) =>
  c(k).foreach { ik =>
    map.get(ik) match {
     case Some(vs) => map += ik -> (v :: vs)
     case None => map += ik -> List(v)
    }
  }
  // now reverse the lists and return an immutable map.
  // you might just wrap this one with the immutable map that wraps a mutable one (so we
  // don't pay the immutable construction cost unless we need to, i.e. the user calls + or .toMap).
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find an immutable wrapper to a mutable map in the standard library. So our options here are

  • write a wrapper class
  • return an immutable map by calling .toMap on the map
  • return the mutable map, but specify the return type as scala.collection.Map instead of scala.collection.mutable.Map. In this case, the user doesn't know that it's mutable, so it's effectively immutable. (They can only access the mutator methods by casting.) If they really want a immutable.Map, they can call .toMap on it.

I'm leaning toward the third option

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have one around in algebird I believe.

BTW option 3 is bad because the contract there lets the underlying map move about on you, you can't presume its stable/immutable. Just that the handle you have is. Subtle but important difference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, found the mutable backed map.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make the value return type List[V] or Iterable[V].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@johnynek
Copy link
Collaborator

other than the type issues I mentioned above, 👍 @ianoc any concerns?

This is really great. Going to be a huge win for people doing rollups in the REPL.

@ianoc
Copy link
Collaborator

ianoc commented Aug 18, 2015

Nope, this looks great to me

johnynek added a commit that referenced this pull request Aug 18, 2015
@johnynek johnynek merged commit 2aa6ba8 into twitter:develop Aug 18, 2015
@sid-kap sid-kap deleted the cuber_roller branch August 18, 2015 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants