Possible optimization for Traversable.minBy / reduceLeft

#1

I think I have found a possible improvement in Scala 2.11 / 2.12 implementation of TraversableOnce.min / minBy/ reduce… functions. Currently those functions call isEmpty on the very beginning. This call is quite inefficient on a general Traversable, as it is implemented using breakable / break - and it therefore throws / catches an exception on any non-empty traversable. It would be possible to use the fact variable first is true when the collection is empty, this way:

   def minMy[A, B](t: Traversable[A])(f: A => B)(implicit cmp: Ordering[B]): A = {

      var minF: B = null.asInstanceOf[B]
      var minElem: A = null.asInstanceOf[A]
      var first = true

      for (elem <- t) {
        val fx = f(elem)
        if (first || cmp.lt(fx, minF)) {
          minElem = elem
          minF = fx
           first = false
        }
      }
      if (first)
        throw new UnsupportedOperationException("empty.minBy")
      minElem
    }

Now Traversable seems to be removed in 2.13, therefore I am not sure if and how is this still applicable for that version. For me this is causing a significant performance issue, as I am searching using minBy for several thousands of very short traversables, for me it is causing an order of magnitude slowdown.

Perhaps I am missing something obvious and there is some reason why it is structured the way it is now? Can you please comment?

If the concern is valid, what should I do next to improve this?

#2

You could try implementing this change and benchmarking it. But it would only be useful for the last few releases of 2.12. Of course that might still be worth it to you.
Are you sure that the slowdown you’re seeing is caused by this? Because (1) breakable is probably not as slow as you think, and (2) I think that almost all collection implementations override - or inherit a more specific override of - isEmpty than the one in Traversable(Like).

#3

I am quite sure it is caused by this. In my case it is a custom Traversable, which implements foreach only, nothing else. When I replace minBy with minMy above, my data processing time is reduced from 240 sec to 6 sec.

#4

Is is possible to override isEmpty also with an efficient version?

#5

Sure it is (in my case the Traversable is never empty, therefore it is trivial). Still, I cannot see any drawbacks caused by suggested implementation - not calling isEmpty can hardly ever be worse than calling it. What is 2.13 counterpart for Traversable? I liked using it, as it is super easy to implement and provide most operations with a reasonable implementation.

#6

I don’t think there’s any drawback; it’s just a workaround that you can use now to speed up your code.

2.13 doesn’t have an equivalent to Traversable. You have to be able to stop the traversal partway. If you can’t, you have to put it all into a collection that can stop partway.

Usually you can replace the body of your foreach with either something that directly adds the results to a Builder, or possibly instead something that can be used to load the results when needed. (If you’re traversing across the entire contents of the filesystem, for example.)

#7

If you want a foreach-based Traversable that you can stop partway, you can use https://github.com/lihaoyi/geny’s Generator.

It’s used heavily in OS-Lib, Mill, Ammonite, and other places. Works great!