tl;dr: Yes, but please, let’s do it in a more abstract and more performant way.
I really like the idea of adding such a type for large Strings to Scala. I’ve had a similar use case where I ended up using a `Vector[String]`. That was even the reason I implemented faster `Vector` concatenation, which provides concatenation in $O(\min(n, m))$ for `Vector`s of length $n$ and $m$.
But there’s a catch:
The implementation linked above also uses a `Vector` of collections (a `Vector[ByteString1]`, to be precise).
While this is a flat structure and seems to provide fast access, it does not: worst-case performance is $O(n)$, as can be seen in the while loop in `apply`. The `Vector` has no notion of the lengths of its elements, so to access an element we have to iterate over all previous `String`s or `ByteString1`s, accumulating their lengths, until we find the `String`/`ByteString1` containing the element. If the string consists of many short substrings (the string is “scattered”), that really hurts performance.
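To make that concrete, here is a minimal sketch (not the actual `ByteString` code; the class and field names are made up) of why indexing into a flat `Vector` of chunks degenerates to a linear scan when no offsets are stored:

```scala
// Illustrative sketch only: a flat Vector of chunks with no stored offsets.
final class ChunkedString(chunks: Vector[String]) {
  // Worst case O(n): for a heavily "scattered" string we walk over
  // (almost) all previous chunks just to locate the right one.
  def apply(idx: Int): Char = {
    var i = 0
    var remaining = idx
    while (remaining >= chunks(i).length) {
      remaining -= chunks(i).length // accumulate the lengths we skip
      i += 1
    }
    chunks(i).charAt(remaining)
  }
}
```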
The same problem occurs at concatenation: if you have “scattered” `ByteString`s, their concatenation is also $O(n)$, as every substring’s reference is copied into the new `Vector`.
This differs from the data structure described in the Wikipedia article, where in fact each node in the tree stores an accumulated length. This allows for $O(\log(n))$ access and $O(\log(n))$ (or $O(1)$) concatenation.
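For contrast, a hedged sketch of that rope-style idea: each inner node caches the length of its left subtree, so indexing only descends one branch and concatenation allocates a single node. (Illustrative only; the balancing that keeps the depth, and thus access, logarithmic is omitted.)

```scala
// Rope-style sketch: inner nodes store the length of their left subtree.
sealed trait Rope {
  def length: Int
  // O(depth) access: the cached leftLen tells us which branch to descend.
  def apply(idx: Int): Char = this match {
    case Leaf(s)                       => s.charAt(idx)
    case Node(left, right, leftLen, _) =>
      if (idx < leftLen) left(idx) else right(idx - leftLen)
  }
  // O(1) concatenation: just one new node, no copying of substrings.
  def ++(that: Rope): Rope =
    Node(this, that, this.length, this.length + that.length)
}
final case class Leaf(s: String) extends Rope { def length: Int = s.length }
final case class Node(left: Rope, right: Rope, leftLen: Int, length: Int) extends Rope
```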
`Vector`s have a similar problem to `String`s
A similar problem exists in the `Vector` structure itself, slowing down concatenation. In short: at concatenation, `Vector`s can be aligned (rare) or misaligned (the normal case), and the latter requires a full copy of one of the `Vector`s (so $O(n)$). Currently that is the right `Vector`; if the PR above is accepted, it will be the shorter one.
This cannot be changed without drastically changing the internal structure of `Vector`s, which would introduce performance drawbacks of its own (it would require inhomogeneous instead of homogeneous `Array` sizes).
The only way I could think of to provide (mostly) the same performance as now while also having fast concatenation was to add (optional) layers of inhomogeneously sized nodes that form a tree. So it’s a tree where every leaf is a (homogeneous) `Vector` of any size and each node stores the size of its subtree (or the size of the left subtree, as in Wikipedia; it doesn’t matter which).
A possible solution for both
If we add an (abstract?) class `TreeSeq[A, S <: IndexedSeq[A]] extends IndexedSeq[A]`, we could do one implementation for both:
- A `ByteString` could simply extend `TreeSeq[Char, String]` (or `TreeSeq[Byte, ByteString1]`) and provide better performance than the `Vector[String]` or `Vector[ByteString1]` implementation, respectively.
- And one could simply add a `VectorTree[A] extends TreeSeq[A, Vector[A]] with Vector[A]` that gets constructed when concatenating large `Vector`s that are misaligned. This preserves asymptotic times for all `Vector`s, and it preserves current times exactly for all `Vector`s that were not obtained by concatenating large `Vector`s (which are currently expensive to obtain anyway). So the drawbacks in access/modification time only occur for `Vector`s that are currently comparatively slow to obtain in the first place.
`TreeSeq` would then also make it easy to implement fast concatenation for other `IndexedSeq`s (e.g. `BitVector`), if necessary.
It would also allow for fast `insertAt(idx: Int, elem: A)` and `remove` methods.
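To make the proposal a bit more tangible, here is a very rough sketch of what such a `TreeSeq` could look like. Everything beyond the signature above (`Leaf`, `Concat`, `treeConcat`, the cached sizes) is my assumption, not a worked-out design; proper integration with the collections hierarchy, builders, and rebalancing are left out.

```scala
// Rough sketch: leaves wrap an arbitrary IndexedSeq, inner nodes cache
// their size, so apply is O(depth) and concatenation is O(1).
sealed abstract class TreeSeq[A, S <: IndexedSeq[A]] extends IndexedSeq[A] {
  import TreeSeq._

  def apply(idx: Int): A = this match {
    case Leaf(s)         => s(idx)
    case Concat(l, r, _) => if (idx < l.length) l(idx) else r(idx - l.length)
  }

  // O(1) concatenation; a real implementation would also rebalance
  // here to keep the depth (and thus access time) logarithmic.
  def treeConcat(that: TreeSeq[A, S]): TreeSeq[A, S] =
    Concat(this, that, this.length + that.length)
}

object TreeSeq {
  final case class Leaf[A, S <: IndexedSeq[A]](s: S) extends TreeSeq[A, S] {
    def length: Int = s.length
  }
  final case class Concat[A, S <: IndexedSeq[A]](
      l: TreeSeq[A, S], r: TreeSeq[A, S], length: Int) extends TreeSeq[A, S]
}

// Hypothetical usage over Vector leaves:
//   val a = TreeSeq.Leaf[Int, Vector[Int]](Vector(1, 2, 3))
//   val b = TreeSeq.Leaf[Int, Vector[Int]](Vector(4, 5))
//   (a treeConcat b)(3)  // == 4, located via the cached sizes
```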
What do you think of that approach?