tl;dr
, but please, let’s do it in a more abstract and more performant way.
I really like the idea to add such a type for large Strings to scala.
I’ve had a similar usecase where I finally used a Vector[String]. And that was even the reason I implemented faster Vector concatenation, which provides concatenation in $O(\min(n,m))$ for Vectors of length $n$ and $m$.
But there’s a catch:
The implementation linked above also uses a Vector of collections (Vector[ByteString1] to be precise).
While this is a flat structure and seems to provide fast access, it does not: Worst case performance is $O(n)$, as can be seen in the while loop in apply. This is due to the fact that the Vector has no sense of the length of its elements, so to access an element, we have to iterate over all previous Strings or ByteString1s to compute their accumulated length and find the String/ByteString1 containing the element. If the string consists of many short substrings (the string is “scattered”), that really hurts performance.
The same problem occurs at concatenation: If you have “scattered” ByteStrings, their concatenation is also $O(n)$ (as every substrings reference is copied to the new Vector).
This differs from the data structure described in the Wikipedia article, where in fact each node in the tree stores an accumulated length. This allows for $O(\log(n))$ access and $O(\log(n))$ (or $O(1)$) concatenation.
Vectors have a similar problem to Strings
A similar problem exists in the Vector structure itself, slowing down concatenation. In short: At concatenation, Vectors can be aligned (rare) or misaligned (normal case), where the latter requires a full copy of one of the Vectors (so $O(n)$). Currently that is the right Vector, if the PR above is accepted, it’s the shorter one.
This cannot be changed without drastically changing the internal structure of Vectors, introducing respective performance drawbacks. (Would require inhomogenious instead of homogenious Array sizes)
The only way I could think of how to provide (mostly) the same performance as now and have fast concatenation, was to add (optional) layers of inhomogeniously sized nodes, to form a tree. So, it’s a tree, where every leaf is a (homogenious) Vector of any size and each node stores the size of the tree (or the size of the left subtree as in Wikipedia, doesn’t matter).
A possible solution for both
If we add an (abstract?) class TreeSeq[A, S <: IndexedSeq[A]] extends IndexedSeq[A], we could do one implementation for both:
- A ByteString could simply extend a
TreeSeq[Char, String] (or TreeSeq[Byte, ByteString1]) and provide better performance than the Vector[String] resp Vector[ByteString1] implementation.
- And one can simply add a
VectorTree[A] extends TreeSeq[A, Vector[A]] with Vector[A] that gets constructed when concatenating large Vectors that are misaligned. This preserves asymptotic times for all Vectors. And it preserves current times exactly for all Vectors that were not obtained by concatenating large Vectors (which are currently hard to obtain anyway). So, the drawbacks in access/modification time only occur for Vectors that are currently comparatively slow to obtain.
TreeSeq would then also allow to easily implement fast concatenation for other IndexedSeqs (e.g. BitVector), if necessary.
Also, it would allow for fast insertAt(idx: Int, elem: A) and remove methods.
What do you think of that approach?