Which operations should be included in the new collections?

This request generalizes to the question: for any predicate X, do we add a method notX to avoid having to type !X?

Or maybe we should build this property into the language itself? E.g. every boolean-returning method (unless explicitly marked otherwise) would automatically get its negated sibling?

What’s wrong with !? That it’s a prefix operator? How about adding a not method to scala.Boolean?
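As a sketch of that idea: nothing like this exists in the standard library today, but a postfix not could be retrofitted with an implicit class (the name BooleanNotOps is just illustrative):

```scala
// Hypothetical sketch: a postfix `not` for Boolean via an implicit class,
// so the negation sits at the end of the expression instead of in front.
implicit class BooleanNotOps(b: Boolean) {
  def not: Boolean = !b
}

val xs = List(1, 2, 3)
// Reads left to right, with no leading `!` to spot:
val absent = xs.contains(4).not
// absent == true
```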

I think an efficient histogram function would be nice

Something that would turn List(a,b,c,d,c,d,e,a,c,a) into

Map(a -> 3,b -> 1,c -> 3,d -> 2,e -> 1)

This isn’t the most efficient implementation, but it conveys the idea:

def histogram[T](t: Traversable[T]): Map[T, Int] = {
  t.groupBy(identity).mapValues(_.size)
}

This would now be done with: xs.groupMapReduce(identity)(_.size)(_ + _)

I think you want _ => 1 as your second parameter (unless you care about the length of the strings), i.e. xs.groupMapReduce(identity)(_ => 1)(_ + _)

You’re right! I wrote it too fast. This will also work:

xs.groupMapReduce(identity)(Function const 1)(_ + _)
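For reference, a quick check of the corrected one-liner against the example input from earlier in the thread (Scala 2.13+):

```scala
val xs = List("a", "b", "c", "d", "c", "d", "e", "a", "c", "a")
// Map each element to 1 and sum per key: a histogram in one call.
val hist = xs.groupMapReduce(identity)(_ => 1)(_ + _)
// hist == Map("a" -> 3, "b" -> 1, "c" -> 3, "d" -> 2, "e" -> 1)
```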

The point being that these solutions are not efficient. I often felt the need for such a method.

Have you measured the performance difference between this and a hand-written specialized implementation? The methods in the standard library will never be the fastest possible for all use cases; what matters is that they be reasonably fast for common use cases such as this.

Using these implementations:

case class Box[A](var value: A, var first: Boolean = true) {}
case class Count(var value: Int = 0) {}

def groupMapReduce[A, K, V](xs: Iterable[A])(key: A => K)(value: A => V)(op: (V, V) => V): Map[K, V] = {
  val m = new collection.mutable.HashMap[K, Box[V]]
  val i = xs.iterator
  while (i.hasNext) {
    val a = i.next()
    val boxed = m.getOrElseUpdate(key(a), Box(value(a)))
    if (boxed.first) boxed.first = false
    else boxed.value = op(boxed.value, value(a))
  }
  Map.empty[K, V] ++ m.iterator.map{ case (k, boxed) => k -> boxed.value }
}

def countMapReduce[A, K](xs: Iterable[A])(key: A => K): Map[K, Int] = {
  val m = new collection.mutable.HashMap[K, Count]
  val i = xs.iterator
  while (i.hasNext) {
    val count = m.getOrElseUpdate(key(i.next()), Count())
    count.value += 1
  }
  Map.empty[K, Int] ++ m.iterator.map{ case (k, count) => k -> count.value }
}

Counting on collections of length 1024 with 16 categories, or of length 16 with 2 categories, countMapReduce takes 75-80% of the time that groupMapReduce does. This rises to roughly 90% when there is less repetition (counts of ~4 give ~90% of the time taken).

So it’s faster, but not outrageously faster. However, it is far clearer what the intent is in the latter case. (_ => 1)(_ + _) is a bit of a puzzler. Of course it doesn’t take very long to figure it out, but it does take a lot longer than the count version, which is very obviously a count.

An alternative to groupMapReduce might be groupFold operations. I’d have to think more about the full pros and cons of the two options, but the ability to pass in a zero value for the folds generally provides a bit of flexibility.
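To make the comparison concrete, here is one possible shape for such a groupFold; the name and signature are only a sketch for discussion, not an agreed proposal:

```scala
// Sketch of a possible `groupFold`: group elements by a key and fold each
// group into an accumulator, starting from a caller-supplied zero value.
// Unlike groupMapReduce, the output type B need not match the element type.
def groupFold[A, K, B](xs: Iterable[A])(key: A => K)(zero: B)(op: (B, A) => B): Map[K, B] = {
  val m = scala.collection.mutable.HashMap.empty[K, B]
  for (a <- xs) {
    val k = key(a)
    m.update(k, op(m.getOrElse(k, zero), a))
  }
  m.toMap
}

// Counting with an explicit zero and no separate map step:
val counts = groupFold(List("a", "b", "a", "c", "a"))(identity)(0)((n, _) => n + 1)
// counts == Map("a" -> 3, "b" -> 1, "c" -> 1)
```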

Well, personally I consider prefixing long expressions (and thus distributing the local logic along the expression) to be a very bad thing. I loved Scala collections having nonEmpty as a pair to isEmpty, for instance, which makes code much easier to comprehend.

This generally sounds reasonable. The only problem I see (compared to the somewhat awkward and inelegant approach of adding reverse sibling methods) is that it breaks infix syntax, which I personally love very much. You would need either to avoid infix notation or to surround expressions with parentheses.

You can always use its real name: list.contains(x).unary_!

(Please don’t!)

That reads much better as list.contains(x)...not_!

Stream and LazyList have an unfold method, but Iterator still doesn’t. I think that’s unfortunate, as Stream and LazyList memoize all the values, while Iterator doesn’t. Consider the following code (works under Scala 2.13.0-M4):

object UnfoldIterator extends Unfold with App {
  def unfoldIterator[A, S](init: S)(f: S => Option[(A, S)]): Iterator[A] = {
    var currentState = init
    var nextResultAndState: Option[(A, S)] = null

    new Iterator[A] {
      override def hasNext: Boolean = {
        if (nextResultAndState == null) {
          nextResultAndState = f(currentState)
        }
        nextResultAndState.isDefined
      }

      override def next(): A = {
        assert(hasNext)
        val (result, state) = nextResultAndState.get
        currentState = state
        nextResultAndState = null
        result
      }
    }
  }

  run(unfoldIterator(Unfold.initialState)(Unfold.generator))
}

object UnfoldStream extends Unfold with App {
  run(LazyList.unfold(Unfold.initialState)(Unfold.generator).iterator)
}

class Unfold {
  sys.props.put("scala.time", "")

  def run(input: Iterator[Long]): Unit = {
    println("result: " + input.map(_.toString).map(_.length).sum)
  }
}

object Unfold {
  val initialState = 0.0
  val generator: Double => Option[(Long, Double)] = num => {
    if ((num * scale).toLong != (num * scale + 1).toLong) {
      val nextNum = num + 1.3
      if ((num * inverseDistance).toInt != (nextNum * inverseDistance).toInt) {
        println(num.toLong)
      }
      Some((num.toLong, nextNum))
    } else {
      None
    }
  }

  private def scale = 123456789.0

  private def inverseDistance = 1.0 / 1234567
}

UnfoldIterator works happily even with -Xmx100m, while UnfoldStream grinds to a halt even with -Xmx1g (it would probably fail with an OutOfMemoryError eventually, but I didn’t want to waste a lot of time waiting for it).

The iterator implementation I’ve written here is completely lazy, i.e. it is lazy even in the first element. If you never consume the iterator produced by unfoldIterator, the generator function is never invoked. Is there a way to achieve that concisely using existing combinators (avoiding, of course, memoizing collections like Stream or LazyList)?

Good point @tarsa, I’ve created https://github.com/scala/bug/issues/10955 for it.
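As a usage sketch, an Iterator.unfold mirroring LazyList.unfold’s signature would look like this (this is the shape that did ship in Scala 2.13 final):

```scala
// Fully lazy: nothing is computed until the iterator is consumed,
// and no values are memoized along the way.
val countdown = Iterator.unfold(5)(n => if (n > 0) Some((n, n - 1)) else None)
val result = countdown.toList
// result == List(5, 4, 3, 2, 1)
```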

That’s a little weird, because it kind of implies using the zero value for the infinitely many keys that never appear. With a normal fold, if the collection is empty, you get the zero element. With a fold on the result of groupBy, you conceptually have infinitely many empty groups.

Also, you’d probably want groupMapFold

Some old Set operations that can fail have two versions: one throws an exception, while the other returns an Option and has “Option” in its name. I noticed that newer methods, such as TreeSet#minAfter, return an Option but don’t have “Option” in their names. It’d be a breaking change, but we could reduce the number of operations by removing the old exception-throwing methods and having them return Option instead. This could also wait for Scala 3. Or never happen. It’s just a legacy annoyance and sometimes a performance surprise.
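To illustrate the naming inconsistency concretely (Scala 2.13):

```scala
import scala.collection.immutable.TreeSet

val s = TreeSet(1, 3, 5)
// Newer API: returns Option, but there is no "Option" in the name.
val after = s.minAfter(2) // smallest element >= 2
// after == Some(3)
val none = s.minAfter(6)
// none == None
// Older API: throws on failure instead; e.g. TreeSet.empty[Int].min
// throws UnsupportedOperationException rather than returning an Option.
```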

That isn’t the semantics I had in mind, but I can see that interpretation; the documentation can easily clarify it. I don’t think a groupMapFold is needed though, because unlike with reduce, the zero value specifies an output type that can differ from the element type, so the map step can be combined with the fold. Take the common example of counting how many elements each group has. I think that would be a call like this:

c.groupFold(_.key)(0)((cnt, _) => cnt + 1)

Unlike with a reduce, I could start with a value other than 0 there. Or if I wanted to sum up a given value for each group, it might look like the following:

c.groupFold(_.key)(0)((cnt, v) => cnt + v.points)

Note that the map step isn’t needed here the way it is with a reduce.

I’m not certain whether the ability to supply a different zero value matters for all that many use cases, but it is something the fold version can do that a map-reduce version can’t. I doubt that the two function invocations per element in the map-reduce version (versus one for the fold) will matter much. The fold implementation probably needs an extra conditional for the semantics where values are only calculated when a key appears, and that will probably balance things out.

scanl1 needs a name, and a declaration of interest if anyone is actually interested. There is an existing issue. Otherwise I’ll just kill the PR, painlessly of course.
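For anyone unfamiliar with the Haskell original: scanl1 is a scan seeded with the collection’s first element rather than an explicit initial value. A minimal sketch of the semantics (the name scanLeft1 is just a placeholder):

```scala
// Hypothetical `scanLeft1`: like scanLeft, but the first element of the
// collection serves as the initial accumulator. Empty input yields empty output.
def scanLeft1[A](xs: List[A])(op: (A, A) => A): List[A] = xs match {
  case head :: tail => tail.scanLeft(head)(op)
  case Nil          => Nil
}

val prefixSums = scanLeft1(List(1, 2, 3, 4))(_ + _)
// prefixSums == List(1, 3, 6, 10)
```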