Set#map Deduping and Possibly Unsafe Behavior

IDoCodingStuffs · June 20, 2019, 4:47pm

Moving the discussion at the bug tracker here

Since Set#map is implemented as a functor, its return type is also a Set which discards dupes like it is supposed to.

But, this can also cause unsafe behavior with a lot of minimal-idiomatic implementations especially when Java Maps are concerned.

There are a lot of points touched on the tracker and I believe this warrants even further in-depth discussion on the design of .map() behavior here. So far the proposed workarounds that want to keep the current design boil down to documentation improvements and tacking on idiomatic clutter.

IDoCodingStuffs · June 20, 2019, 4:51pm

Pasting Josh’s comment on the linked issue for convenience:

The behaviour of Set#map is quite strange on its own, but especially as an implementation of the Iterable interface, I think the difference in behaviour between Set’s map and that of other Iterables makes it very questionable what is the use-case of Iterable#map at all? If you have an Iterable[T] and you call map on it, you actually do not know if it is deduplicating or not. I cannot think of a situation in which I would like to map each element of an Iterable[T] to something, but then I don’t actually care if the output is deduplicated or not.

def foo(it: Iterable[Int]): Iterable[Int] = it.map(_ / 3) 
foo(List(1,2)) // List(0, 0)
foo(Set(1, 2)) // Set(0)

The situation is even more bizarre when you look at Map, which can completely change its behaviour based on if you know it is a Map at compile time:

scala> Map(1 -> 1, 2 -> 1).map(_.swap) 
res14: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2) 
scala> (Map(1 -> 1, 2 -> 1): Iterable[(Int, Int)]).map(_.swap) 
res15: Iterable[(Int, Int)] = List((1,1), (1,2))

I think in an ideal world, and maybe something we can shoot for long term, is a situation where Sets and Maps return Iterable s from their map methods, and Seq’s continue as they do today, refining their map output to return Seq :

scala> Map(1 -> 1, 2 -> 1).map(_.swap) 
res14: Iterable[(Int, Int)] = List((1,1), (1,2))
scala> (Map(1 -> 1, 2 -> 1): Iterable[(Int, Int)]).map(_.swap) 
res15: Iterable[(Int, Int)] = List((1,1), (1,2)) 
scala> Set(1,2).map(_ / 3) 
res18: scala.collection.immutable.Iterable[Int] = List(0, 0)
scala> Seq(1,2).map(_ / 3)
res19: Seq[Int] = List(0, 0)

I think the use case of mapping Set => Set and mapping Map => Map is such a specific/niche transformation/use-case, that it wouldn’t be so inconvenient to just in that case do set.view.map(...).toSet and map.view.map(...).toMap , and leave their implementations Iterable#map to be consistent and predictable with other Iterables.

eed3si9n · June 20, 2019, 7:00pm

I agree that the end result is strange, but to me this illustrates the limitation of using subtyping to represent a mix bag of unrelated datatype under the name of Iterable etc, and expecting Liskov substitution principle to hold up.

The fact that Set(1, 2, 3).map(identity) returns a Set (to me) makes sense as the core design of Scala collection. I think the strangeness comes from the fact that it silently mixes itself with other Seq-like collections, either via (Can)BuildFrom or via inheritance.

scala> for { i <- List(1, 2); j <- Set(3) } yield i / j
res0: List[Int] = List(0, 0)

scala> for { j <- Set(3); i <- List(1, 2) } yield i / j
res1: scala.collection.immutable.Set[Int] = Set(0)

Perhaps we can take similar route as we did in Option[A] here and limit its super type to IterableOnce?

joshlemer · June 20, 2019, 8:37pm

Nice suggestion, but since the definitional difference between Iterable[A] and other IterableOnce[A]s is that they can be iterated arbitrarily many times, not only one time, I think maybe just raising Sets to not inherit from Iterable[A] despite being able to be iterated multiple times, seems like the wrong placement in the hierarchy, and instead points to maybe the idea that either Iterable, or how Set extends Iterable, could be improved. Perhaps Set should extend IterableOps[A, Iterable, Set] instead of IterableOps[A, Set, Set].

NthPortal · June 20, 2019, 11:57pm

or even better, we bring back Traversable and TraversableOnce, and have map for Traversable return Traversable instead of CC, and then make Set extend Traversable instead of Iterable

(please no, this is a joke)

eed3si9n · June 21, 2019, 2:13am

The discussion in Make Option extend IterableOnce (scala/scala#8038) helped me understand Iterable[A] and IterableOnce[A] better:

IterableOnce just means “has an iterator method”. By this logic Option should be an IterableOnce but Future should not. - Make Option extend IterableOnce by lrytz · Pull Request #8038 · scala/scala · GitHub
Iterable seems to mean both:
1. you can iterate over it using foreach, map etc.
2. you can build it from an arbitrary IterableOnce using Iterable#fromSpecific.

If we look at these two criteria, it makes sense to treat datatypes like List[A] and Set[A] as Iterable[A] individually.

The tricky point that you’re bringing up with def foo(it: Iterable[Int]): Iterable[Int] = it.map(_ / 3) is that the meaning of Iterable#map becomes unclear when the datatype is upcasted to Iterable[A]. I know what List#map and Set#map should do, but I am unsure what Iterable#map is. Potential ideas:

Everything is fine. Don’t upcast collections.
Remove Iterable#map, which might be too big of a change.
Change the meaning so you can only be Iterable if Iterable#map acts like a Seq#map.

By limiting Set’s super type to IterableOnce, I was suggesting the third route.

NthPortal · June 21, 2019, 5:30am

I suppose we could actually look at it the other way. Iterable#map does not make any guarantees for which mapped elements are retained and under what circumstances. Only Seq#map does, by virtue of it being a sequence.

joshlemer · June 21, 2019, 3:15pm

I don’t think that is tenable. The fundamental concept of subtyping A <: B is that values of type A can be used as values of type B.
That’s doable , but maybe throwing the baby out with the bath water.
Seems too restrictive. A Set is Iterable, because you can iterate on it multiple times, and it is a “generic collection” (from the scaladocs of Iterable: “Base trait for generic collections.”).
(mine )I’m just suggesting a much less radical approach: map, flatMap, collect and flatten methods on Iterable can return Iterable rather than CC so that Set is free to return non-deduped Iterables (probably Lists, at runtime), and Seq can refine their return types to CC

It’s true that it doesn’t make currently make any guarantees about loss of elements in the returned Iterable, but I just think then that the method is more-or-less useless (no offence), and I suspect the vast majority of uses of Iterable#map in the wild are actually assuming that all the elements are retained, and will be buggy if/when a Set is used. To bring the discussion from theoretical to practical, as an exercise, I would like to invite anyone think of an actual plausible situation where you might want to actually use Iterable#map, considering that it may or may not be deduplicating.

curoli · June 21, 2019, 4:12pm

I don’t remember the exact use case, but I have on rare occasions written a method that takes an Iterable and calls map on it, and I was aware of the fact that it may be Seq or Set.

I would ask the other way round: why use Iterable if you really mean Seq?

IDoCodingStuffs · June 21, 2019, 4:39pm

For an Iterable#map in the wild I can imagine a generalized data summary method taking both Sets and Seqs of a primary key. Too lazy to actually write up an example but it is perfectly plausible

joshlemer · June 21, 2019, 4:41pm

Thanks for the question, the answer would be that we would use Iterable in any context where all we care about are the contents of the collection and don’t care about the order in which they are in. In such a scenario, you may be willing to accept Seqs, Sets, and even more exotic structures from collectio-contrib such as MultiSets.

odersky · June 21, 2019, 4:49pm

Which is the case here. I believe the underlying problem is a misunderstanding what Iterable is.
As @eed3si9n states, an Iterable is anything that

has an iterator,
can construct a value of its type with the elements enumerated by a given iterator.

map of f over an Iterable X is simply the iterable of the same type as X produced from mapping f over X’s iterator. That’s all the contract says. LSP is not violated for this contract.

lbialy · June 24, 2019, 6:45pm

Just adding my 2 cents due to @joshlemer suggestion on Twitter:

Set being a straight up Iterable is a painful pitfall. I once lost a week trying to track down a bug in which spark job yielded slightly different results on each run. In the end it turned out it was due to a Set being passed to a method that expected Iterable, mapped contents and taken head of it. We don’t know to this day why it wasn’t at least stable (we’d assumed that hashCode would yield same order and it did when job was executed locally but then returned random results when executed on the cluster), but changing Iterable to List (abstraction wasn’t really required in that part of the codebase) made the bug go away.

NthPortal · June 25, 2019, 12:01am

If nothing else, I think a big takeaway from this is "if you accept Iterable, operate on it using .iterator".

Jasper-M · June 25, 2019, 12:10pm

In that case the current Iterable should just be some kind of private class for code deduplication, and only IterableOnce (perhaps renamed to Iterable) should be public API.
But then you probably run into that issue where scalac infers private types as LUBs.

dcsobral · June 25, 2019, 6:23pm

This entire discussion seems bizarre to me. The behavior is correct, and to expect anything else is akin to not understanding what a set is. That can easily be seen by simply breaking down the original code:

        val wSum = elements.map(element=>
weights(element) * values(element))
.sum

martijnhoekstra · June 25, 2019, 7:59pm

I dont think anyone doesn’t understand what a set is, but rather have an understanding of map on Iterable that isn’t the functor definition.

All in all, that’s not so weird. Whether it its fmap depends on what builder is passed to the map method, and that’s always been the case - or at least since the previous collections rework.

odersky · June 25, 2019, 8:35pm

Iterable is the common supertype of Set, Map, and Seq. So its map operation has to admit the behavior of map on sets or maps. If you want a map that does not deduplicate, use Seq, not Iterable.

martijnhoekstra · June 25, 2019, 9:27pm

Yes, it’s easy to convince yourself of that when you think about it. The risk is when you don’t and you mistakenly think the iteration is the thing you’re mapping over.

I dont think that argument is strong enough not to have map on Iterable.

But I definitely dont think that considering the argument that its plausible that people make that mistake, and consider whether we can put safeguards in place to avoid them from making that mistake, should make it fair game to be told you dont understand what a set is. That’s just needlessly belittling.

What I would find more helpful is seeing if we can find out how often map on Iterable is used, and out of those usages, how often it is a bug if the underlying iterable is a set.

IDoCodingStuffs · June 25, 2019, 9:47pm

That’s the workaround I suggested on the original ticket and absolutely true given FP design rules. Still does not change the fact that semantically it is not the safest behavior.

Funny thing is I ran into the issue first working with a system using SAS that was giving off results depending on the amount of duplications in the input data. Maybe we need to accept this is a pitfall of strict adherence to functor rules in general and some compromise considering the human factor is needed on a case by case basis.