Collections ``count``, ``maxBy``, ``minBy`` methods

#1

It would be nice if the signatures of the count, maxBy and minBy methods were

def count(ps: Seq[A => Boolean]): Seq[Int]

def maxBy(ords: Seq[Ordering[A]]): Seq[A]

def minBy(ords: Seq[Ordering[A]]): Seq[A]
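For concreteness, here is a minimal single-pass sketch of the proposed count variant, written as an extension method; the name `countAll` and the extension encoding are my assumptions, not part of the proposal:

```scala
object MultiCount {
  // Hypothetical extension adding a single-pass, multi-predicate count;
  // the name `countAll` is an assumption, not part of the proposal.
  implicit class MultiCountOps[A](as: Seq[A]) {
    def countAll(ps: Seq[A => Boolean]): Seq[Int] = {
      val counts = Array.fill(ps.length)(0)
      // One traversal of `as`, updating every predicate's counter.
      for (a <- as; i <- ps.indices if ps(i)(a)) counts(i) += 1
      counts.toSeq
    }
  }
}

import MultiCount._

// Two predicates counted in one traversal of the data.
val xs = Seq(1, 2, 3, 4, 5)
val result = xs.countAll(Seq(_ % 2 == 0, _ > 3))
// result: Seq(2, 2)
```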

#2

What would be the spec of those signatures, and what would their implementation be?

#3

If I understand those method signatures correctly, what would be the advantage of having them over the current ones, where you can already do

def ps: Seq[A => Boolean] = ???
def ords: Seq[Ordering[A]] = ???
def as: Seq[A] = ???

ps.map(as.count)
ords.map(as.max(_))
ords.map(as.min(_))
#4

I’m not sure how often the collection would be large enough (or there would be enough separate ways to count) for it to make a difference, but it would allow doing the aggregation in a single pass.

I’m dubious that this comes up often enough to warrant making it the default, rather than just implementing a helper where it’s needed.
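A helper of that kind can be a one-off foldLeft; this is a sketch (the name `countMany` is mine), using an immutable accumulator rather than mutation:

```scala
// One-off helper: count several predicates in a single foldLeft pass.
def countMany[A](as: Seq[A], ps: Seq[A => Boolean]): Seq[Int] =
  as.foldLeft(Vector.fill(ps.length)(0)) { (acc, a) =>
    // Bump each predicate's counter that matches the current element.
    acc.zip(ps).map { case (n, p) => if (p(a)) n + 1 else n }
  }

val evensAndBig = countMany(Seq(1, 2, 3, 4), Seq(_ % 2 == 0, _ > 2))
// evensAndBig: Vector(2, 2)
```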

#5

I’m not sure how either implementation could do better—or worse—than O(ps.length * as.length).

#6

I doubt the BigO would change, but if the collection were large enough that multiple iterations would cause cache misses, you might get better performance from a single pass.

Another possibility I didn’t think of last time is a collection that can only be iterated once, or one that is fully lazy and involves enough computation to make multiple passes unpleasantly expensive.

I highly doubt this comes up often enough to promote a workaround into the standard library.
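An `Iterator` makes the iterate-once constraint concrete: two separate aggregate calls would exhaust it, but one pass can still feed several accumulators. A sketch (the helper name is mine):

```scala
// An Iterator can be traversed only once, so calling e.g. .min and then
// .max would fail; a single foldLeft computes both in one pass instead.
def minAndMax(it: Iterator[Int]): (Int, Int) =
  it.foldLeft((Int.MaxValue, Int.MinValue)) { case ((lo, hi), x) =>
    (math.min(lo, x), math.max(hi, x))
  }

val (lo, hi) = minAndMax(Iterator(3, 1, 4, 1, 5))
// (lo, hi): (1, 5)
```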

#7

Actually, the Big O would make a big difference depending on the collection. I have a concrete use case for this: I need to calculate some statistics very often.

#8

Shameless plugs: here are two related approaches which generalize what you are asking for, by allowing several different aggregate operations to be performed in parallel:

  • A dual to iterators, which lets you compose low-level aggregators that use mutation behind the scenes:

    def average = for { s <- sum[Double]; c <- count } yield s/c
    
    val ls = List(1.0, 2.0, 3.0)
    assert(average(ls) == 2.0)
    
  • Using monoid comprehensions, a more functional way based on monoid and semigroup instances (see the extended abstract which uses Scala):

    def average(ls: List[Double]) = {
      val (s,c) = for { x <- ls } yield (Sum(x), Avg(x))
      s/c
    }
    
#9

In numeric code, it makes sense to have basic statistics accumulators, and the same goes for stream-based collections. Using a fold to compute the statistics together in a single pass is common. For stream collections, you’ll want to override the next-element methods to update the basic stats as well. The accumulators can be copy-on-update or mutable, depending on the requirements for speed and parallelism. For multiprocessor, and even more so multi-computer, processing, event-based evaluation is a good idea.

If your collections have fewer than 1k elements, it’s not worth the time to implement something like this. Premature optimization is the root of all evil.
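A copy-on-update accumulator of the kind described might look like this; the class and field names are my choice, a sketch rather than a definitive design:

```scala
// Immutable (copy-on-update) running-statistics accumulator: each add
// returns a new Stats value, so it composes safely with parallel folds.
final case class Stats(n: Long, sum: Double, min: Double, max: Double) {
  def add(x: Double): Stats =
    Stats(n + 1, sum + x, math.min(min, x), math.max(max, x))
  def mean: Double = if (n == 0) 0.0 else sum / n
}

object Stats {
  // Identity element: counters at zero, bounds at the extremes.
  val empty: Stats = Stats(0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
}

// One fold computes count, sum, min and max in a single pass.
val s = List(1.0, 2.0, 3.0).foldLeft(Stats.empty)(_ add _)
// s.mean == 2.0, s.min == 1.0, s.max == 3.0
```

A mutable variant with the same interface would avoid the per-element allocation when speed matters more than sharing.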
