Set#map Deduping and Possibly Unsafe Behavior

That makes sense, but following that logic, shouldn’t map on an Iterator which is really a Map deduplicate based on keys, instead of behaving like a “normal” Iterator ?

Well, there’s one super-easy case: if I map with an injective function.

xs.map(_ + 1)
xs.map("The answer is " + _)
xs.map(x => (x, f(x)))

A lot of the time, people do in fact map with injective functions.

3 Likes

Yes, everything is fine, except possibly documentation about what an Iterable is as compared to a Seq.

If you want things to be nonduplicated and in order, use Seq.

There are all sorts of surprises you run into if you think Iterable and Seq are synonymous, either with Scala 2.12 or 2.13. For instance

scala> "incredible".toSet.zipWithIndex.map(_._2).mkString(" ")
res7: String = 0 5 1 6 2 7 3 4

I specifically zip with index and then read the indices but…they’re not in order?!

Iterable is an abstraction over things that don’t necessarily have an order and/or preserve all elements. If there are methods on Iterable that can only possibly make sense when they’re applied to an in-order collection, then they should be deprecated there and in the future only appear on Seq.

Otherwise, we just need to make it easy for people to learn what the distinction is.

6 Likes

It can’t, because it’s not the same map; it’s an overloaded one. We could, I suppose, add the overload (pointlessly) to Iterable. But despite having the same name, it’s not the same function. One is (equivalent to)

mapKeysAndValues(f: (K, V) => (Q, U)): C[(Q, U)]

whereas the other is

mapElements(f: A => B): C[B]
1 Like

This is a second issue that is distinct from the Set and map issue, IMO. It would deserve a separate thread. Otherwise, the discussion will keep getting sidetracked like it is.

This kind of visualization would be useful: https://twitter.com/FMuntenescu/status/1143524844312678405

I would suggest that it would at least make sense to clarify the behaviour in the API documentation. There is no harm in making sure people understand the limitations.

Is map the only method defined in Iterable that might have surprising behaviour when applied to a set?

Another possibility would be to deprecate map in Iterable, and to give it a different name to warn users of the API.

1 Like

Right. It’s ad hoc polymorphism where subtype polymorphism is also in play… :slightly_frowning_face:

(Watched @pbvie’s Polymorphism in Scala ScalaDays talk yesterday. :smile:)

1 Like

I’d be all for at least adding such a warning to the doc.

3 Likes

Then make a mapElementsToSeq methods. Make sure the Scaladoc on Iterable#map explains the behavior is dependent on the implementation, and give the set example as a case where behavior might not correspond to expectations.

That said, I still think it’s a matter of becoming familiar with the standard library. Why would there be any expectation on Iterable#map returning an Iterable of the same size? That has to come from experience somewhere else. Equating “Iterable” with Java’s Stream, perhaps?

3 Likes

Exactly my main point. Java devs transition to Scala, ex-Java devs are used to idiomatic implementation with #stream#map, ex-Java devs build faulty applications by mapping directly over a set expectating it to behave like in Java minus the idiomatic #stream

Even Haskell docs make a point of mentioning this aspect of Set#map, not to mention not including it in standard lib to begin with.

edit: I mention Haskell to contrast with the lack of design pressure to include Java design elements

I don’t find Iterable#map useful at all, if this is the contract. Using Seq is not viable, it has too many additional constraints. You are constraining yourself to Collections. Users want a more general Iterable.

In every real world case I can think of where I require iterator(), I would have to call iter.iterator().map(...) for any sense of sanity.

Consider a few extreme cases for the Iterable:

  1. The values come from a Set
  2. The values come from a file with a billion values that are 10x the RAM available.
  3. The values come from a mutable concurrent data structure and so has ‘weak’ stability (perhaps TrieMap or CHM)

It is clear that using Seq instead is not appropriate. Seq adds too many constraints. The existence of map on Iterable raises concerns on whether it can even be used for #2.

People reach for ‘Iterable’ when they want something that has iterator(), not when they want map. Its in the name. It should be a common supertype of Set/Map/Seq, not the common supertype.
In general, if one wants a supertype of Set/Map/Seq it will be for either iterable or map/flatMap but not both.

In a use case where you need iterator, map is only useful if you know what the subtype is. In a use case where you need map in the general sense (e.g. as a functor or monad), iterator tends to be useless.
If you don’t know the subtype and you are interested in producing values by iteration, then you’re better off using .iterator().map(...).

My conclusion: Remove map from Iterable.

People need a type that represents supplying an Iterator without extra baggage. Spark, for example. Seq is useless when you are representing ordered data streaming over the network or spooling from disk. I guess we can represent that with () => Iterator instead, but a name for that would be better. And there is no better name for that Iterable.

Then add a type that represents the ability to map – Functor, Mappable, whatever.
These concepts should be separated, and not glued together. One is more functional style, and the other is more procedural. Neither implies the other.

A third type can imply both, and be the supertype of Map, Set and Seq, perhaps Collection?

This doesn’t move me. One can always define a contract as: this either returns a value or throws an exception. Now it satisfies LSP, because it compiles.

But what is important from my perspective is whether it violates substitution with respect to user expectations of the API contract. An API contract is only as good as what users expect. Everything else leads to bugs.

A contract in line with expectations but with mediocre documentation beats one with unexpected surprises and good documentation. There will be fewer bugs with the former. Of course we can manage expectations over time with documentation (and not everyone has the same expectations!). But something that takes less time to learn and has fewer surprises is superior.

If the intention of Iterable was to be the supertype of Map, Set and Seq then it is horribly named, based on my expectations and many in this discussion. Its really Collection.

4 Likes

I’d like to propose slightly different approach to solve this otherwise unsolvable problem.
Please take a look at issue I posted recently:

Code snippet showing how it addresses problem described at the beginning of this discussion:

m.