Set#map Deduping and Possibly Unsafe Behavior

If nothing else, I think a big takeaway from this is "if you accept Iterable, operate on it using .iterator".

1 Like

In that case the current Iterable should just be some kind of private class for code deduplication, and only IterableOnce (perhaps renamed to Iterable) should be public API.
But then you probably run into that issue where scalac infers private types as LUBs.

1 Like

This entire discussion seems bizarre to me. The behavior is correct, and to expect anything else is akin to not understanding what a set is. That can easily be seen by simply breaking down the original code:

        val wSum = elements.map(element=>
weights(element) * values(element))
.sum
1 Like

I dont think anyone doesn’t understand what a set is, but rather have an understanding of map on Iterable that isn’t the functor definition.

All in all, that’s not so weird. Whether it its fmap depends on what builder is passed to the map method, and that’s always been the case - or at least since the previous collections rework.

1 Like

Iterable is the common supertype of Set, Map, and Seq. So its map operation has to admit the behavior of map on sets or maps. If you want a map that does not deduplicate, use Seq, not Iterable.

5 Likes

Yes, it’s easy to convince yourself of that when you think about it. The risk is when you don’t and you mistakenly think the iteration is the thing you’re mapping over.

I dont think that argument is strong enough not to have map on Iterable.

But I definitely dont think that considering the argument that its plausible that people make that mistake, and consider whether we can put safeguards in place to avoid them from making that mistake, should make it fair game to be told you dont understand what a set is. That’s just needlessly belittling.

What I would find more helpful is seeing if we can find out how often map on Iterable is used, and out of those usages, how often it is a bug if the underlying iterable is a set.

3 Likes

That’s the workaround I suggested on the original ticket and absolutely true given FP design rules. Still does not change the fact that semantically it is not the safest behavior.

Funny thing is I ran into the issue first working with a system using SAS that was giving off results depending on the amount of duplications in the input data. Maybe we need to accept this is a pitfall of strict adherence to functor rules in general and some compromise considering the human factor is needed on a case by case basis.

That makes sense, but following that logic, shouldn’t map on an Iterator which is really a Map deduplicate based on keys, instead of behaving like a “normal” Iterator ?

Well, there’s one super-easy case: if I map with an injective function.

xs.map(_ + 1)
xs.map("The answer is " + _)
xs.map(x => (x, f(x)))

A lot of the time, people do in fact map with injective functions.

3 Likes

Yes, everything is fine, except possibly documentation about what an Iterable is as compared to a Seq.

If you want things to be nonduplicated and in order, use Seq.

There are all sorts of surprises you run into if you think Iterable and Seq are synonymous, either with Scala 2.12 or 2.13. For instance

scala> "incredible".toSet.zipWithIndex.map(_._2).mkString(" ")
res7: String = 0 5 1 6 2 7 3 4

I specifically zip with index and then read the indices but…they’re not in order?!

Iterable is an abstraction over things that don’t necessarily have an order and/or preserve all elements. If there are methods on Iterable that can only possibly make sense when they’re applied to an in-order collection, then they should be deprecated there and in the future only appear on Seq.

Otherwise, we just need to make it easy for people to learn what the distinction is.

6 Likes

It can’t, because it’s not the same map; it’s an overloaded one. We could, I suppose, add the overload (pointlessly) to Iterable. But despite having the same name, it’s not the same function. One is (equivalent to)

mapKeysAndValues(f: (K, V) => (Q, U)): C[(Q, U)]

whereas the other is

mapElements(f: A => B): C[B]
1 Like

This is a second issue that is distinct from the Set and map issue, IMO. It would deserve a separate thread. Otherwise, the discussion will keep getting sidetracked like it is.

This kind of visualization would be useful: https://twitter.com/FMuntenescu/status/1143524844312678405

I would suggest that it would at least make sense to clarify the behaviour in the API documentation. There is no harm in making sure people understand the limitations.

Is map the only method defined in Iterable that might have surprising behaviour when applied to a set?

Another possibility would be to deprecate map in Iterable, and to give it a different name to warn users of the API.

1 Like

Right. It’s ad hoc polymorphism where subtype polymorphism is also in play… :slightly_frowning_face:

(Watched @pbvie’s Polymorphism in Scala ScalaDays talk yesterday. :smile:)

1 Like

I’d be all for at least adding such a warning to the doc.

3 Likes

Then make a mapElementsToSeq methods. Make sure the Scaladoc on Iterable#map explains the behavior is dependent on the implementation, and give the set example as a case where behavior might not correspond to expectations.

That said, I still think it’s a matter of becoming familiar with the standard library. Why would there be any expectation on Iterable#map returning an Iterable of the same size? That has to come from experience somewhere else. Equating “Iterable” with Java’s Stream, perhaps?

3 Likes

Exactly my main point. Java devs transition to Scala, ex-Java devs are used to idiomatic implementation with #stream#map, ex-Java devs build faulty applications by mapping directly over a set expectating it to behave like in Java minus the idiomatic #stream

Even Haskell docs make a point of mentioning this aspect of Set#map, not to mention not including it in standard lib to begin with.

edit: I mention Haskell to contrast with the lack of design pressure to include Java design elements

I don’t find Iterable#map useful at all, if this is the contract. Using Seq is not viable, it has too many additional constraints. You are constraining yourself to Collections. Users want a more general Iterable.

In every real world case I can think of where I require iterator(), I would have to call iter.iterator().map(...) for any sense of sanity.

Consider a few extreme cases for the Iterable:

  1. The values come from a Set
  2. The values come from a file with a billion values that are 10x the RAM available.
  3. The values come from a mutable concurrent data structure and so has ‘weak’ stability (perhaps TrieMap or CHM)

It is clear that using Seq instead is not appropriate. Seq adds too many constraints. The existence of map on Iterable raises concerns on whether it can even be used for #2.

People reach for ‘Iterable’ when they want something that has iterator(), not when they want map. Its in the name. It should be a common supertype of Set/Map/Seq, not the common supertype.
In general, if one wants a supertype of Set/Map/Seq it will be for either iterable or map/flatMap but not both.

In a use case where you need iterator, map is only useful if you know what the subtype is. In a use case where you need map in the general sense (e.g. as a functor or monad), iterator tends to be useless.
If you don’t know the subtype and you are interested in producing values by iteration, then you’re better off using .iterator().map(...).

My conclusion: Remove map from Iterable.

People need a type that represents supplying an Iterator without extra baggage. Spark, for example. Seq is useless when you are representing ordered data streaming over the network or spooling from disk. I guess we can represent that with () => Iterator instead, but a name for that would be better. And there is no better name for that Iterable.

Then add a type that represents the ability to map – Functor, Mappable, whatever.
These concepts should be separated, and not glued together. One is more functional style, and the other is more procedural. Neither implies the other.

A third type can imply both, and be the supertype of Map, Set and Seq, perhaps Collection?

This doesn’t move me. One can always define a contract as: this either returns a value or throws an exception. Now it satisfies LSP, because it compiles.

But what is important from my perspective is whether it violates substitution with respect to user expectations of the API contract. An API contract is only as good as what users expect. Everything else leads to bugs.

A contract in line with expectations but with mediocre documentation beats one with unexpected surprises and good documentation. There will be fewer bugs with the former. Of course we can manage expectations over time with documentation (and not everyone has the same expectations!). But something that takes less time to learn and has fewer surprises is superior.

If the intention of Iterable was to be the supertype of Map, Set and Seq then it is horribly named, based on my expectations and many in this discussion. Its really Collection.

4 Likes

I’d like to propose slightly different approach to solve this otherwise unsolvable problem.
Please take a look at issue I posted recently:

Code snippet showing how it addresses problem described at the beginning of this discussion:

m.