tl;dr: Yes, but please, let’s do it in a more abstract and more performant way.
I really like the idea of adding such a type for large Strings to Scala. I’ve had a similar use case where I ended up using a `Vector[String]`. That was even the reason I implemented faster `Vector` concatenation, which provides concatenation in $O(\min(n, m))$ for `Vector`s of length $n$ and $m$.
But there’s a catch:
The implementation linked above also uses a `Vector` of collections (a `Vector[ByteString1]`, to be precise).
While this is a flat structure and seems to provide fast access, it does not: worst-case performance is $O(n)$, as can be seen in the while loop in `apply`. The `Vector` has no notion of the lengths of its elements, so to access an element we have to iterate over all previous `String`s or `ByteString1`s, accumulating their lengths, until we find the `String`/`ByteString1` containing the element. If the string consists of many short substrings (the string is “scattered”), that really hurts performance.
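To make that concrete, here is a minimal sketch (not the actual `ByteString` code; the class and field names are made up) of why indexing into a flat `Vector` of chunks degenerates to a linear scan when no offsets are stored:

```scala
// Illustrative sketch only: a flat Vector of chunks with no stored offsets.
final class ChunkedString(chunks: Vector[String]) {
  // Worst case O(n): for a heavily "scattered" string we walk over
  // (almost) all previous chunks just to locate the right one.
  def apply(idx: Int): Char = {
    var i = 0
    var remaining = idx
    while (remaining >= chunks(i).length) {
      remaining -= chunks(i).length // accumulate the lengths we skip
      i += 1
    }
    chunks(i).charAt(remaining)
  }
}
```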
The same problem occurs at concatenation: if you have “scattered” `ByteString`s, their concatenation is also $O(n)$, as every substring’s reference is copied into the new `Vector`.
This differs from the data structure described in the Wikipedia article, where in fact each node in the tree stores an accumulated length. This allows for $O(\log(n))$ access and $O(\log(n))$ (or $O(1)$) concatenation.
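For contrast, a hedged sketch of that rope-style idea: each inner node caches the length of its left subtree, so indexing only descends one branch and concatenation allocates a single node. (Illustrative only; the balancing that keeps the depth, and thus access, logarithmic is omitted.)

```scala
// Rope-style sketch: inner nodes store the length of their left subtree.
sealed trait Rope {
  def length: Int
  // O(depth) access: the cached leftLen tells us which branch to descend.
  def apply(idx: Int): Char = this match {
    case Leaf(s)                       => s.charAt(idx)
    case Node(left, right, leftLen, _) =>
      if (idx < leftLen) left(idx) else right(idx - leftLen)
  }
  // O(1) concatenation: just one new node, no copying of substrings.
  def ++(that: Rope): Rope =
    Node(this, that, this.length, this.length + that.length)
}
final case class Leaf(s: String) extends Rope { def length: Int = s.length }
final case class Node(left: Rope, right: Rope, leftLen: Int, length: Int) extends Rope
```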
`Vector`s have a similar problem to `String`s
A similar problem exists in the `Vector` structure itself, slowing down concatenation. In short: at concatenation, `Vector`s can be aligned (rare) or misaligned (the normal case), and the latter requires a full copy of one of the `Vector`s (so $O(n)$). Currently that is the right `Vector`; if the PR above is accepted, it will be the shorter one.
This cannot be changed without drastically changing the internal structure of `Vector`s, which would introduce performance drawbacks of its own (it would require inhomogeneous instead of homogeneous `Array` sizes).
The only way I could think of to provide (mostly) the same performance as now while also having fast concatenation was to add (optional) layers of inhomogeneously sized nodes that form a tree. So it’s a tree where every leaf is a (homogeneous) `Vector` of any size and each node stores the size of its subtree (or the size of the left subtree, as in Wikipedia; it doesn’t matter which).
A possible solution for both
If we add an (abstract?) class `TreeSeq[A, S <: IndexedSeq[A]] extends IndexedSeq[A]`, we could do one implementation for both:
- A `ByteString` could simply extend `TreeSeq[Char, String]` (or `TreeSeq[Byte, ByteString1]`) and provide better performance than the `Vector[String]` or `Vector[ByteString1]` implementation, respectively.
- And one could simply add a `VectorTree[A] extends TreeSeq[A, Vector[A]] with Vector[A]` that gets constructed when concatenating large `Vector`s that are misaligned. This preserves asymptotic times for all `Vector`s, and it preserves current times exactly for all `Vector`s that were not obtained by concatenating large `Vector`s (which are currently expensive to obtain anyway). So the drawbacks in access/modification time only occur for `Vector`s that are currently comparatively slow to obtain in the first place.
`TreeSeq` would then also make it easy to implement fast concatenation for other `IndexedSeq`s (e.g. `BitVector`), if necessary.
It would also allow for fast `insertAt(idx: Int, elem: A)` and `remove` methods.
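To make the proposal a bit more tangible, here is a very rough sketch of what such a `TreeSeq` could look like. Everything beyond the signature above (`Leaf`, `Concat`, `treeConcat`, the cached sizes) is my assumption, not a worked-out design; proper integration with the collections hierarchy, builders, and rebalancing are left out.

```scala
// Rough sketch: leaves wrap an arbitrary IndexedSeq, inner nodes cache
// their size, so apply is O(depth) and concatenation is O(1).
sealed abstract class TreeSeq[A, S <: IndexedSeq[A]] extends IndexedSeq[A] {
  import TreeSeq._

  def apply(idx: Int): A = this match {
    case Leaf(s)         => s(idx)
    case Concat(l, r, _) => if (idx < l.length) l(idx) else r(idx - l.length)
  }

  // O(1) concatenation; a real implementation would also rebalance
  // here to keep the depth (and thus access time) logarithmic.
  def treeConcat(that: TreeSeq[A, S]): TreeSeq[A, S] =
    Concat(this, that, this.length + that.length)
}

object TreeSeq {
  final case class Leaf[A, S <: IndexedSeq[A]](s: S) extends TreeSeq[A, S] {
    def length: Int = s.length
  }
  final case class Concat[A, S <: IndexedSeq[A]](
      l: TreeSeq[A, S], r: TreeSeq[A, S], length: Int) extends TreeSeq[A, S]
}

// Hypothetical usage over Vector leaves:
//   val a = TreeSeq.Leaf[Int, Vector[Int]](Vector(1, 2, 3))
//   val b = TreeSeq.Leaf[Int, Vector[Int]](Vector(4, 5))
//   (a treeConcat b)(3)  // == 4, located via the cached sizes
```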
What do you think of that approach?