Standardizing IO Interfaces for Scala Libraries

Wrote a blog post about standardizing some of the interfaces my libraries use to exchange streaming binary data:
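For reference, the two interfaces look roughly like this (a sketch; the post has the exact definitions, where Internal.transfer is a small helper that copies an InputStream into an OutputStream):

```scala
import java.io.{InputStream, OutputStream}

// "Push": anything that can write itself out to an OutputStream.
trait Writable {
  def writeBytesTo(out: OutputStream): Unit
}

// "Pull": anything that can expose itself as an InputStream for the
// duration of a callback. Every Readable is trivially also a Writable.
trait Readable extends Writable {
  def readBytesThrough[T](f: InputStream => T): T
  def writeBytesTo(out: OutputStream): Unit =
    readBytesThrough(Internal.transfer(_, out))
}
```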

Thought this might be of interest to people here. It’s something that could go into the standard library - widely used, widely useful, tiny interfaces totaling 7 lines of code, with a solid theoretical foundation - but it doesn’t need to be in order to be adopted as a standard. People can depend on the interfaces from Maven Central just fine for compatibility. Anyway, the standard library isn’t really interested in domains where such interfaces would be useful (ignoring abandoned packages like scala.io and scala.sys.process), but if any other library authors want to collaborate on these shared interfaces to ensure seamless interop between our libraries, I’d love to talk!

9 Likes

Out of curiosity, what do you mean by a solid theoretical foundation in this context?

5 Likes

I suppose I meant that it maps clearly to the idea of “push” vs “pull” based protocols and algorithms, which are an old and well studied concept, without any novel ideas or techniques that I am trying to introduce. Not sure if “solid theoretical foundation” is the right phrase to describe it, but it’s the closest I could come up with…

I personally like the proposal a lot. I think it would be good to have minimal interfaces for streaming interop in the standard library.

We should look into a process to make this possible in an organic fashion (the other proposal I found to be a clear win was a Converter type class that enables explicit conversions between various types, of the form e.as[T]).
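A minimal sketch of what such a Converter type class could look like; the names here are illustrative, not an existing API:

```scala
object Conversions {
  // Type class witnessing that an A can be explicitly converted to a B.
  trait Converter[A, B] {
    def convert(a: A): B
  }

  // Syntax enabling conversions of the form `e.as[T]`.
  implicit class AsOps[A](private val a: A) extends AnyVal {
    def as[B](implicit conv: Converter[A, B]): B = conv.convert(a)
  }
}
```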

The standard library is still subject to strict binary compatibility constraints, which means it cannot evolve easily. We should discuss whether we can lighten those constraints, for instance by guaranteeing only backwards but not forwards compatibility in minor versions. Or maybe we can define a set of interfaces or libraries outside of the stdlib that are intended for interop between libraries. The last time this was tried, with the Scala Platform process, it did not get off the ground; we should find out what went wrong and give it a new try.

3 Likes

My impression with the Scala Platform Process was that it never really focused on what makes a good standard: consensus and adoption. The things it provided, like an SBT plugin or automatic publishing from CI, were of little-to-no value to either library authors or library users.

For an interface like this, it is basically immaterial whether it lives in the standard library or not, under the scala github org or not, or is published under the org.scala-lang maven org. What matters 100% is consensus and adoption: can we get other library authors to implement or accept these interfaces in their code? If everyone agrees on the same interfaces, we get 100% of the benefit. Everything else is immaterial.

It’s interesting to brainstorm who may be interested in implementing or accepting geny.Writable and geny.Readable. Off the top of my head:

  • Akka ByteString and @mpilquist’s ByteVector could implement Readable
  • Almost every JSON or serialization library could implement Writable: play-json, circe, ScalaPB, jackson-module-scala, etc.
  • Most parsing libraries could accept Readable to allow streaming parsing: e.g. scala-parser-combinators, @tpolecat’s Atto, Scodec
  • Most HTTP clients could take a Writable and return a Readable: @adamw’s STTP, Akka HTTP
  • Most HTTP servers could provide requests as Readable and accept routes returning Writable: PlayFramework, Akka HTTP, Finagle, Finch, Finatra, HTTP4s
  • Filesystem libraries like @pathikrit’s Better-Files could accept Writable

The goal of such a standardization effort would be that someone could take an Akka ByteString, a ScalaPB message object, or a Circe Writable, return it from a PlayFramework server endpoint, write it to disk using Better-Files, or upload it using HTTP4S-client, and have the data be efficiently streamed automatically, without any adaptor code and without any of the libraries knowing about each other at all.
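To make the interop concrete, here is a hedged sketch: the Message type below is made up, and the calls in the comments are just examples of Writable-accepting sinks from the author’s own libraries:

```scala
import java.io.OutputStream

// A hypothetical message type that knows how to push itself to an OutputStream.
case class Message(payload: Array[Byte]) extends geny.Writable {
  def writeBytesTo(out: OutputStream): Unit = out.write(payload)
}

// Once a sink accepts geny.Writable, the same value streams straight
// through with no intermediate buffering or adaptor code, e.g.:
//   os.write(os.pwd / "msg.bin", Message(bytes))
//   requests.post("https://example.com/upload", data = Message(bytes))
```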

The value is purely in consensus and adoption: the more the better, but even incomplete standardization already provides value.

I’ve set the ball rolling by standardizing the dozen or so libraries I maintain, but if we could get the Lightbend folks or Typelevel folks on board to add support in some of their major libraries, that would probably be enough to reach critical mass.

4 Likes

One objection I would have here (if I understand the concepts as intended, please correct me if I’m wrong) is that the Readable/Writable traits are based on Java’s InputStream and OutputStream.

These interfaces expose blocking operations, while the majority of the Scala ecosystem (as well as a large part of the Java ecosystem) is already based on non-blocking operations, or in the process of migrating towards them.

My intuition would be that if we were to standardise on anything, it would have to support non-blocking streaming, with back-pressure. A natural candidate seems to be reactive streams, or something based on reactive streams, which can be created and consumed by normal code (“plain” reactive streams are not meant to be interacted with directly). Plus, RS is part of Java itself. I suspect Akka/Lightbend teams would have a lot of expertise here to share as well.
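For reference, these are the Reactive Streams interfaces that ship with the JDK since Java 9 (java.util.concurrent.Flow), where back-pressure is expressed via Subscription.request(n); the amount of ceremony below is also why “plain” reactive streams are usually wrapped by a friendlier library:

```scala
import java.util.concurrent.Flow

// A minimal, illustrative Subscriber: request one chunk at a time so the
// publisher can never overrun the consumer.
def consume(publisher: Flow.Publisher[Array[Byte]]): Unit =
  publisher.subscribe(new Flow.Subscriber[Array[Byte]] {
    private var sub: Flow.Subscription = _
    def onSubscribe(s: Flow.Subscription): Unit = { sub = s; sub.request(1) }
    def onNext(chunk: Array[Byte]): Unit = { /* process chunk */ sub.request(1) }
    def onError(t: Throwable): Unit = t.printStackTrace()
    def onComplete(): Unit = ()
  })
```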

6 Likes

You could get broader adoption with typeclasses than with base classes, no? A library’s types could be supported without its author adding support initially; once enough code is written against the typeclass, the library author would be more likely to define the instance in the library itself.
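A sketch of that type-class variant (names hypothetical); the point is that an instance for a third-party type can live outside the library that defines the type:

```scala
import java.io.OutputStream

// Type-class version of the push interface.
trait WritableOf[A] {
  def writeBytesTo(a: A, out: OutputStream): Unit
}

object WritableOf {
  // An "orphan" instance for someone else's type, no upstream change needed.
  implicit val forByteArray: WritableOf[Array[Byte]] =
    (a: Array[Byte], out: OutputStream) => out.write(a)
}

// Consumers are written against the type class, not against a base class.
def writeOut[A](a: A, out: OutputStream)(implicit w: WritableOf[A]): Unit =
  w.writeBytesTo(a, out)
```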

3 Likes

The way I look at it, Writable/Readable compete with Array[Byte], not with reactive streams, ZIO, akka-streams, monix, or FS2. Using FS2 doesn’t mean you don’t have Array[Byte]s in your program, and just because you’re using Reactive Streams doesn’t mean you won’t have Writable/Readable in your program.

They’re simply different tools for different jobs, and neither is really a substitute for the other.

3 Likes

I have standardized on fs2 for my work and it works well for me (including adaptation to Java IO when necessary). I share @adamw’s concerns about this API… also, it’s not functional, so it’s probably not something I would use.

4 Likes

Ah, I see. Although I would say: if you have a byte array, just use the byte array :slight_smile: If you have a data source that’s not in memory, then you probably want to use a stream that’s compatible with your library (blocking or non-blocking, pure or side-effecting, etc.).

Maybe there could be an abstraction for in-memory data that’s better than Array[Byte], but even then, it wouldn’t be InputStream. With an InputStream you have to take into account the possibility that .read() will block (as you don’t know what’s backing the stream), and that it can throw an exception. That brings a considerable amount of complexity to any user of InputStream.
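To illustrate the complexity being described, a minimal copy loop over a raw InputStream; since the caller cannot know what backs the stream, every read() must budget for blocking and for IOException:

```scala
import java.io.{InputStream, OutputStream}

def drain(in: InputStream, out: OutputStream): Unit = {
  val buf = new Array[Byte](8192)
  var n = in.read(buf) // may block indefinitely; may throw IOException
  while (n != -1) {
    out.write(buf, 0, n)
    n = in.read(buf)
  }
}
```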

3 Likes

What is Internal?

What if you cannot read a T from an InputStream (e.g. IOException, not enough data, not the right format)? How would such failure be handled?

2 Likes

The linked blog post addresses why you may not want to do that, and discusses in-memory datasets which are not byte arrays.

Again: Writable and Readable do not change the libraries that implement them; they simply codify functionality that is already present, so we can interoperate with less friction. Equivalent APIs already exist in many of the libraries listed, just under a range of different names (a sketch of one such adaptor follows the list):

  • scodec.ByteVector.copyToStream(OutputStream)
  • play.api.libs.json.Json.parse(InputStream)
  • sttp.client.RequestT.body(InputStream)
  • scalapb.GeneratedMessage.writeTo(OutputStream)
  • akka.stream.scaladsl.StreamConverters.fromInputStream (which you can return from akka-http and play endpoints)
  • etc.
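As a sketch of what that codification amounts to, here is how one of the methods above (ScalaPB’s writeTo) could be wrapped as a geny.Writable without modifying ScalaPB at all:

```scala
import java.io.OutputStream

def asWritable(msg: scalapb.GeneratedMessage): geny.Writable =
  new geny.Writable {
    // Delegates to the writeTo(OutputStream) method ScalaPB already has.
    def writeBytesTo(out: OutputStream): Unit = msg.writeTo(out)
  }
```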

The fact that non-blocking pure-functional code is good and a functional streaming IO monad is superior to InputStream/OutputStream is orthogonal to this codification.

The point of this isn’t to argue about which library’s style is better, but to recognize the existing common ground that all these libraries share.

4 Likes

I don’t know the Scala APIs well enough to comment on your proposal, but here are some things I thought you should see.

+1 I agree: blocking vs non-blocking is all about how the InputStream and OutputStream are used by the threads (Loom/fibers) or the JavaScript event loop.

On a separate note, I am working on something similar over at Adligo.org in Java (this allows me to support Java, JavaScript, Kotlin and Scala). My main use case is parsing this:

https://www.ietf.org/archive/id/draft-adligo-hybi-asbp-02.txt

Here are a few things I would consider adding:

#1) I have also seen filesystem IO run machines out of TCP/IP sockets, since open files and sockets draw from the same file-descriptor limit;

i.e.

https://unix.stackexchange.com/questions/157351/why-are-tcp-ip-sockets-considered-open-files

To prevent errors like this, I am adding the concept of an IOContext/Factory to make sure my API is pooled (also NOT related to blocking vs non-blocking IO).

This IOContext/Factory might be something to consider in your Scala IO API suggestion. It could be made optional for cases when you know there’s another pool preventing overuse (i.e. a JDBC pool, a Servlet container’s (fiber) thread pool, etc.).

i.e.

IOContextFactory
  obtainIOContext()
  returnIOContext(IOContext ctx)
  obtainIOContext(boolean fromPool)

#2) A second problem I have encountered is that the Writers and Readers in Java often hide the IOStreams, which creates problems when you want to co-mingle binary data and character data. For example, let’s say you want to send 7 bytes (using your own encoding scheme), then some UTF-8, and then even more bytes (using another encoding scheme for video), etc. Just something to think about supporting. I think you are already supporting this, since you seem to expose the IOStreams; however, I haven’t spent enough time looking at your proposal to be sure.
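A sketch of that co-mingling scenario, assuming the API exposes the raw OutputStream (the frame layout here is made up):

```scala
import java.io.OutputStream
import java.nio.charset.StandardCharsets

def writeFrame(out: OutputStream, header: Array[Byte], text: String, video: Array[Byte]): Unit = {
  out.write(header)                                // e.g. 7 bytes in a custom encoding
  out.write(text.getBytes(StandardCharsets.UTF_8)) // UTF-8 character data
  out.write(video)                                 // binary data in yet another encoding
}
```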

I can only speak for sttp/tapir and the application code I write, but I always considered the InputStream/OutputStream variants necessary for integration with legacy code or Java libraries. I don’t think I can recall using these directly recently - but again, that’s a very small subset of Scala out there.

Although, of course, if Scala introduced any kind of abstraction over blocking or non-blocking streams, it would have to end up in sttp/tapir/other libraries that might come into existence before that point :slight_smile:

Anyway, as most of the Scala code out there is non-blocking - either using Futures, Tasks, IOs or ZIOs - when seeing “standard IO interface” I would imagine something non-blocking as well. And not only because of performance - but because of the overall spirit of “doing things better” in terms of error handling, behaviour specification, resource management, local reasoning etc.

4 Likes

It’s not clear to me how this would work in a sensible and useful way for string-based data. The blog post uses some string-based processing and even mentions it explicitly:

It is easier to implement the push-based Writable than the pull-based Readable, as any data type that had some sort of def writeTo(out: OutputStream) or def writeTo(out: Writer) or def writeTo(out: StringBuilder) method could trivially be adapted to support the def writeBytesTo interface. On the other hand, many of these would need invasive refactoring in order to support a pull-based interface

But this doesn’t make sense with a byte-based API. You don’t want to serialize strings into bytes unnecessarily for streaming, and then immediately parse them back into strings. There is a reason why the API uses a Writer or StringBuilder instead of an OutputStream in the first place. And you definitely don’t want to do this without a way to specify the encoding.
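The round trip being objected to, for illustration; it is only correct when both ends agree on an explicit charset:

```scala
import java.nio.charset.StandardCharsets

val s = "héllo"
val bytes = s.getBytes(StandardCharsets.UTF_8)        // encode: String -> bytes
val back = new String(bytes, StandardCharsets.UTF_8)  // immediately decode again
assert(back == s) // holds only because the same charset was used twice
```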

The blog post mentions string-based processing, but it’s focused on IO, where the string needs to be serialized anyway to go over the wire: HTTP client uploads and downloads, HTTP server requests and return values, file reads and writes.

The title of this post is about IO Interfaces after all, not a more general construct for streaming in-memory structured values.

Perhaps we could have a separate discussion about in-memory streaming of UTF-8/UTF-16 data, but that’s mostly orthogonal to the IO streaming of binary data (which may happen to be UTF-8/UTF-16, but still needs to be serialized/de-serialized to bytes to go across the wire).