tl;dr: yes, but please, let’s do it in a more abstract and more performant way.

I really like the idea of adding such a type for large `String`s to Scala.

I’ve had a similar use case where I ended up using a `Vector[String]`. That was actually the reason I implemented faster `Vector` concatenation, which provides concatenation in $O(\min(n,m))$ for `Vector`s of length $n$ and $m$.

# But there’s a catch

The implementation linked above also uses a `Vector` of collections (`Vector[ByteString1]`, to be precise). While this is a flat structure and seems to provide fast access, it does not: worst-case performance is $O(n)$, as can be seen in the while loop in `apply`. This is due to the fact that the `Vector` has no sense of the lengths of its elements, so to access an element, we have to iterate over all previous `String`s or `ByteString1`s to compute their accumulated length and find the `String`/`ByteString1` containing the element. If the string consists of many short substrings (the string is “scattered”), that really hurts performance.

The same problem occurs at concatenation: if you have “scattered” `ByteString`s, their concatenation is also $O(n)$ (as every substring reference is copied to the new `Vector`).

This differs from the data structure described in the Wikipedia article, where in fact each node in the tree stores an accumulated length. This allows for $O(\log(n))$ access and $O(\log(n))$ (or $O(1)$) concatenation.
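For contrast, here is a minimal sketch of that rope idea (all names hypothetical): each inner node caches its subtree length, so indexing descends one branch per level, which is $O(\log n)$ for a balanced tree, and concatenation is a single node allocation.

```scala
// Hypothetical rope sketch: inner nodes cache accumulated lengths, so
// apply descends the tree instead of scanning chunks from the front.
sealed trait Rope {
  def length: Int
  def apply(idx: Int): Char = this match {
    case Leaf(s)                      => s.charAt(idx)
    case Node(l, _) if idx < l.length => l(idx)            // descend left
    case Node(l, r)                   => r(idx - l.length) // descend right
  }
  def ++(that: Rope): Rope = Node(this, that) // O(1): one new node
}
final case class Leaf(s: String) extends Rope {
  val length = s.length
}
final case class Node(left: Rope, right: Rope) extends Rope {
  val length = left.length + right.length // computed once, at construction
}
```

Note that this naive version has no rebalancing, so the $O(\log n)$ bound only holds as long as the tree stays reasonably balanced.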

## `Vector`s have a similar problem to `String`s

A similar problem exists in the `Vector` structure itself, slowing down concatenation. In short: at concatenation, `Vector`s can be *aligned* (rare) or *misaligned* (the normal case), where the latter requires a full copy of one of the `Vector`s (so $O(n)$). Currently that is the right `Vector`; if the PR above is accepted, it’s the shorter one.

This cannot be changed without drastically changing the internal structure of `Vector`s, introducing corresponding performance drawbacks (it would require inhomogeneous instead of homogeneous `Array` sizes).

The only way I could think of to provide (mostly) the same performance as now *and* have fast concatenation was to add (optional) layers of inhomogeneously sized nodes, forming a tree. So, it’s a tree where every leaf is a (homogeneous) `Vector` of any size and each node stores the size of its subtree (or the size of the left subtree, as in Wikipedia; it doesn’t matter).

## A possible solution for both

If we add an (abstract?) class `TreeSeq[A, S <: IndexedSeq[A]] extends IndexedSeq[A]`, we could do one implementation for both:

- A `ByteString` could simply extend a `TreeSeq[Char, String]` (or `TreeSeq[Byte, ByteString1]`) and provide better performance than the `Vector[String]` resp. `Vector[ByteString1]` implementation.
- And one can simply add a `VectorTree[A] extends TreeSeq[A, Vector[A]] with Vector[A]` that gets constructed when concatenating large `Vector`s that are misaligned. This preserves asymptotic times for *all* `Vector`s. And it preserves current times exactly for all `Vector`s that were *not* obtained by concatenating large `Vector`s (which are currently hard to obtain anyway). So, the drawbacks in access/modification time only occur for `Vector`s that are currently comparatively slow to obtain.
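A rough sketch of what that shared implementation could look like (all names hypothetical; the real class would extend `IndexedSeq[A]`, which I omit here for brevity): leaves wrap an ordinary homogeneous sequence of any size, and branches cache the total length.

```scala
import scala.collection.immutable.IndexedSeq

// Hypothetical sketch of the proposed TreeSeq: leaves wrap an ordinary
// IndexedSeq, branches cache total length for O(log n) indexing and
// O(1) concatenation. (The IndexedSeq plumbing is omitted for brevity.)
sealed trait TreeSeq[A, S <: IndexedSeq[A]] {
  def length: Int
  def apply(idx: Int): A = this match {
    case TreeSeq.Leaf(s)                        => s(idx)
    case TreeSeq.Branch(l, _) if idx < l.length => l(idx)
    case TreeSeq.Branch(l, r)                   => r(idx - l.length)
  }
  def concat(that: TreeSeq[A, S]): TreeSeq[A, S] = TreeSeq.Branch(this, that)
}
object TreeSeq {
  final case class Leaf[A, S <: IndexedSeq[A]](s: S) extends TreeSeq[A, S] {
    val length = s.length
  }
  final case class Branch[A, S <: IndexedSeq[A]](
      l: TreeSeq[A, S], r: TreeSeq[A, S]) extends TreeSeq[A, S] {
    val length = l.length + r.length
  }
}
```

`VectorTree` would then instantiate `S = Vector[A]`, while a `ByteString` could instantiate `S` with its chunk type.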

`TreeSeq` would then also make it easy to implement fast concatenation for other `IndexedSeq`s (e.g. `BitVector`), if necessary.

Also, it would allow for fast `insertAt(idx: Int, elem: A)` and `remove` methods.

What do you think of that approach?