Collections ``count``, ``maxBy``, ``minBy`` methods

#1

It would be nice if the signatures of the count, maxBy and minBy methods were

def count(ps: Seq[A => Boolean]): Seq[Int]

def maxBy(ords: Seq[Ordering[A]]): Seq[A]

def minBy(ords: Seq[Ordering[A]]): Seq[A]
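For concreteness, here is a minimal single-pass sketch of the proposed count variant, written as an extension method; the name `countAll` and the extension encoding are my assumptions, not part of the proposal:

```scala
object MultiCount {
  // Hypothetical extension adding a single-pass, multi-predicate count;
  // the name `countAll` is an assumption, not part of the proposal.
  implicit class MultiCountOps[A](as: Seq[A]) {
    def countAll(ps: Seq[A => Boolean]): Seq[Int] = {
      val counts = Array.fill(ps.length)(0)
      // One traversal of `as`, updating every predicate's counter.
      for (a <- as; i <- ps.indices if ps(i)(a)) counts(i) += 1
      counts.toSeq
    }
  }
}

import MultiCount._

// Two predicates counted in one traversal of the data.
val xs = Seq(1, 2, 3, 4, 5)
val result = xs.countAll(Seq(_ % 2 == 0, _ > 3))
// result: Seq(2, 2)
```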

#2

What would be the spec of those signatures, and what would their implementation be?

#3

If I understand those method signatures correctly, what would be the advantage of having them over the current ones, where you can already do

def ps: Seq[A => Boolean] = ???
def ords: Seq[Ordering[A]] = ???
def as: Seq[A] = ???

ps.map(as.count)
ords.map(as.max(_))
ords.map(as.min(_))
#4

I’m not sure how often the collection would be large enough (or there would be enough separate ways to count) for it to make a difference, but it would allow doing the aggregation in a single pass.

I’m dubious that this comes up often enough to warrant making it the default, rather than just implementing a helper where it’s needed.
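A helper of that kind can be a one-off foldLeft; this is a sketch (the name `countMany` is mine), using an immutable accumulator rather than mutation:

```scala
// One-off helper: count several predicates in a single foldLeft pass.
def countMany[A](as: Seq[A], ps: Seq[A => Boolean]): Seq[Int] =
  as.foldLeft(Vector.fill(ps.length)(0)) { (acc, a) =>
    // Bump each predicate's counter that matches the current element.
    acc.zip(ps).map { case (n, p) => if (p(a)) n + 1 else n }
  }

val evensAndBig = countMany(Seq(1, 2, 3, 4), Seq(_ % 2 == 0, _ > 2))
// evensAndBig: Vector(2, 2)
```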

#5

I’m not sure how either implementation could do better—or worse—than O(ps.length * as.length).

#6

I doubt the BigO would change, but if the collection were large enough that multiple iterations would cause cache misses, you might get better performance from a single pass.

Another possibility I didn’t think of last time is a collection that can only be iterated once, or one that is fully lazy and involves enough computation to make multiple passes unpleasantly expensive.

I highly doubt this comes up often enough to promote a workaround into the standard library.
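An `Iterator` makes the iterate-once constraint concrete: two separate aggregate calls would exhaust it, but one pass can still feed several accumulators. A sketch (the helper name is mine):

```scala
// An Iterator can be traversed only once, so calling e.g. .min and then
// .max would fail; a single foldLeft computes both in one pass instead.
def minAndMax(it: Iterator[Int]): (Int, Int) =
  it.foldLeft((Int.MaxValue, Int.MinValue)) { case ((lo, hi), x) =>
    (math.min(lo, x), math.max(hi, x))
  }

val (lo, hi) = minAndMax(Iterator(3, 1, 4, 1, 5))
// (lo, hi): (1, 5)
```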

#7

Actually, the Big O would make a big difference depending on the collection. I have a concrete use case for this: I need to calculate some statistics very often.

#8

Shameless plugs: here are two related approaches which generalize what you are asking for, by allowing several different aggregate operations to be performed in parallel:

  • A dual to iterators, which lets you compose low-level aggregators that use mutation behind the scenes:

    def average = for { s <- sum[Double]; c <- count } yield s/c
    
    val ls = List(1.0, 2.0, 3.0)
    assert(average(ls) == 2.0)
    
  • Using monoid comprehensions, a more functional way based on monoid and semigroup instances (see the extended abstract which uses Scala):

    def average(ls: List[Double]) = {
      val (s,c) = for { x <- ls } yield (Sum(x), Avg(x))
      s/c
    }
    
#9

In numeric code, it makes sense to have basic statistics accumulators, and the same goes for stream-based collections. Using a fold to compute the statistics together in a single pass is common. For stream collections, you’ll want to override the next-element methods to update the basic stats as well. The accumulators can be copy-on-update or mutable, depending on the requirements for speed and parallelism. For multiprocessor, and even more so multi-computer, processing, event-based evaluation is a good idea.

If your collections have fewer than 1k elements, it’s not worth the time to implement something like this. Premature optimization is the root of all evil.
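A copy-on-update accumulator of the kind described might look like this; the class and field names are my choice, a sketch rather than a definitive design:

```scala
// Immutable (copy-on-update) running-statistics accumulator: each add
// returns a new Stats value, so it composes safely with parallel folds.
final case class Stats(n: Long, sum: Double, min: Double, max: Double) {
  def add(x: Double): Stats =
    Stats(n + 1, sum + x, math.min(min, x), math.max(max, x))
  def mean: Double = if (n == 0) 0.0 else sum / n
}

object Stats {
  // Identity element: counters at zero, bounds at the extremes.
  val empty: Stats = Stats(0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
}

// One fold computes count, sum, min and max in a single pass.
val s = List(1.0, 2.0, 3.0).foldLeft(Stats.empty)(_ add _)
// s.mean == 2.0, s.min == 1.0, s.max == 3.0
```

A mutable variant with the same interface would avoid the per-element allocation when speed matters more than sharing.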
