SLC: Make `Vector` the default implementation of `Seq` instead of `List`

Vector is in general a much more forgiving data structure than List: while it does not have quite as fast head append/removal performance, it has generally decent O(log n) performance across a wide range of operations, and doesn’t have the poor-indexed-loop poor-tail-append etc. footguns that List has. That makes it more suitable as the default for Seq which typically do not have the same head-biased usage patterns that List does

While @odersky mentions that previous attempts to change Seq()'s default from List to Vector resulted in performance slowdowns, Scala 2.13’s Vector class implemented by @szeiger is an order of magnitude faster than the previous 2.12 Vector class, so it’s worth a second look

I did an experiment with [WIP] Make `Vector` the default implementation of `Seq` by lihaoyi · Pull Request #26386 · scala/scala3 · GitHub , making Seq construct a Vector instead of List, and then build a compiler to compile mill-libs-javalib. At least on this benchmark (10 runs warmup, 10 runs measurement), the difference appears to be negligible:

Case run1 run2 run3 Mean SD
main (control, Seq=List) 8852 8612 8960 8808 ms 145
seq-as-vector (PR) 8707 8768 8767 8747 ms 29
seq-as-vector + 2M spin 10977 11345 11236 11186 ms 154

To ensure this is not a measurement error, and to ensure that touching the Seq() constructor can indeed make a difference to the compiler performance, I did a run with the Seq() intentionally slowed down with a spin-loop. It did demonstrate a measurable slowdown as shown above, indicating that the lack of change between the main (control, Seq=List) case and seq-as-vector (PR) is meaningful

The primary Seq() -> Vector case implemented in the PR replaces the Seq(foo, bar, baz)foo :: bar :: baz :: Nil optimization with an equivalent Vector.fromArrayUnsafe optimization that statically detects Seqs with length <= 32 and directly constructs a new Vector1 from short literals (which should be most of them).

Given that the difference in performance from changing the default from List to Vector seems negligible, I think we should seriously consider changing Seq to return Vector in 3.10 once the feature freeze and library freeze is lifted. This would make Seq behave “reasonably” across most usage patterns, without the perform cliffs that List as an implementation has.

Notably I only benchmarked the new Seq() -> Vector implementation when used in the Scala compiler; most of my other com-lihaoyi libraries that I have paid attention to performance have already had most hot Seq() paths replaced by specialized data structures, so I wouldn’t expect to see significant changes there. We probably need to benchmark this implementation on a few more projects to see how it affects usage patterns across the board.

4 Likes

What about the memory usage for a small Seq?

note to self - refresher (from this deck):

I have always been puzzled by the fact that, while the title of the following recipe in the Scala Cookbook is Making Vector your Go-To Immutable Sequence, rather than that exact title being repeated verbatim within the recipe, the following fact is instead mentioned:

The Vector class is considered the go-to general-purpose indexed immutable sequential collection.

Could it be that while the title of the recipe is attuned to the idea that the default Seq ought to (or may as well) be Vector, the recipe cannot bring itself to make the explicit suggestion?

In other words, the title (not the recipe itself explicitly) seems to suggest that when all we know is that we need a sequence, we may by default reach for a Vector, and if and when we later learn that we require the sequence to be linear rather than indexed, we switch to List.

This seems a good idea to me.

A programmer should always tune the collection at construction for the task at hand. Thus, the performance should not matter that much if the programmer uses the default constructor. The only thing i worry about is the change itself. If a programmer previously thought, “this problem needs a List … ah, i can use the default Seq constructor for that, this may lead to an unpleasant surprise”.

So it seems essential the communication around the use is correct. “If your problem has specific requirements, always choose the collection that fits your needs best. Otherwise, use the default which is a reasonable solution in most cases, but not necessarily a stable one.”

2 Likes

Looking back at Structure and Interpretation of Computer Programs (web page, pdf), is it possible that developers who first come into contact with a sequence as having a particularly straightforward implementation as a chain of pairs (lists - cons, car, cdr, nil), might expect such implementation to be the default one…





…even though of course SICP also makes it clear that a list is just one possible representation of a sequence, and that one can experiment with alternative representations…



…though when SICP considers a stream (c.f. Scala’s LazyList) as an alternative implementation of a sequence, it means an infinite sequence.



slides from these two decks:


I would not expect it makes much difference for the compiler, since the compiler uses List instead of Seq almost everywhere. I believe the previous experiment replaced all occurrences of List in the compiler with Vector with adaptions (e.g :: pattern matchings had to be rewritten). That gave a ~ 10% slowdown. I believe even after the subsequent optimizations to Vector it would come down in the same ballpark.

That’s not to say that List is the optimial sequence type for the compiler overall. I am experimenting with replacing List by IArray or something similar. But Vector is almost certainly too heavy to be a contender.

3 Likes

I support this change, I’ve wanted to see it happen for many years now, and I hope the change goes through. (Of course, it needs careful consideration.)

What about the memory usage for a small Seq?

Someone will need to measure that, but it’s plausible that it’s actually fine. Compact representation of small vectors was an explicit design goal of the 2.13.2 Vector rework, as per Release Scala 2.13.2 · scala/scala · GitHub and Rewrite Vector (now "radix-balanced finger tree vectors"), for performance by szeiger · Pull Request #8534 · scala/scala · GitHub .

In fact, Vector might actually win for most if not all small sizes, since a small vector uses a single small array, while a small List uses multiple cons cells. That’s a definite locality win, and it could be a size win as well.

4 Likes

jol is a good resource for measuring sizes.

1 Like

Asking Claude to quickly measure the overhead using jol (and compressed-oops) gives:

n List Vector
0 0 0
1 24 40
2 48 40
3 72 48
4 96 48
5 120 56
6 144 56
7 168 64
8 192 64
9 216 72
10 240 72
11 264 80
12 288 80
13 312 88
14 336 88
15 360 96
16 384 96

So tentatively, the size is probably okay. (You pick up an extra 48 bytes at the 32->33 transition, but overall it’s pretty space-efficient. if you’re just building and using it. If you’re taking repeated tails, obviously it gets worse.)

4 Likes

I can try a replace-all-Lists-with-Vectors experiment

Why do you think Vector and IArray would vary that much? For small collections, isn’t IArray basically isomorphic to Vector1 in data layout and operation implementation strategies?

3 Likes

but vector1 is still an extra box

1 Like

But after Valhalla it might not be.

We should make a forward looking decisions, that means pretending project Valhalla is launched and in widespread use. It is probably not too controversial, but List is a data structure that is going to benefit from Valhalla the least.

I was always surprised List was picked as a default Seq implementation, when it only beats Vector or Array in a single operation, prepend. Even iteration is slow, which is at a core of something called a sequence. I see List as a specialist data structure, rather than a generalist one - you use it when you explicitly know you need it. And that is almost never.

One option is also deprecating Seq.apply (and other abstract collections), and requiring programmer to always be explicit what kind of structure he wants to construct.

I have looked through my Scala codebases, and I rarely append or prepend. I mostly construct, map, flatmap, and filter. Therefore IArray is perfect for me, but changing it like this would break existing codebases that use Seq and assume prepend is efficient. Could we make a new collection that is just a flat array, until you append or prepend, and then it falls back to Vector-like structure?

Third, we should also consider Scala.js. I think Vector is not too efficient there.

1 Like

I had a closer look at Vector; it does look very carefully optimized for small collections, which is what mostly matters in the compiler itself. So the main downside of Vector1 over an opaque array representation is the boxing overhead. That matters somewhat, but probably not that much. So yes, we should try the experiment.

And I also agree that Vector is right as the default implementation of Seq.

5 Likes