In multi-stage compilation, should we use a standard serialisation method to ship objects through stages?

Original post:

This question is formulated in Scala 3/Dotty, but it should generalise to any language NOT in the MetaML family.

The Scala 3 macro tutorial starts with the Phase Consistency Principle, which explicitly states that free variables defined in one compilation stage CANNOT be used by the next stage, because the objects they are bound to cannot be persisted to a different compiler process:

… Hence, the result of the program will need to persist the program state itself as one of its parts. We don’t want to do this, hence this situation should be made illegal

This should be considered a solved problem, given that many distributed computing frameworks demand a similar capability to persist objects across multiple computers. The most common kind of solution (as observed in Apache Spark) uses standard serialisation/pickling (Java standard serialization, Twitter Kryo/Chill) to create snapshots of the bound objects, which can be saved on disk or in off-heap memory, or sent over the network.

The tutorial itself also suggested the possibility twice:

One difference is that MetaML does not have an equivalent of the PCP - quoted code in MetaML can access variables in its immediately enclosing environment, with some restrictions and caveats since such accesses involve serialization. However, this does not constitute a fundamental gain in expressiveness.

In the end, ToExpr resembles very much a serialization framework

Instead, both Scala 2 and Scala 3 (and their respective ecosystems) largely ignore these out-of-the-box solutions, and only provide default methods for primitive types (Liftable in Scala 2, ToExpr in Scala 3). In addition, existing libraries that use macros rely heavily on manually defined quasiquotes/quotes for this trivial task, making the source much longer and harder to maintain, while not making anything faster (as JVM object serialisation is a highly optimised language component).

What’s the cause of this status quo? How do we improve it?


In theory, Scala Pickling would be the direction of improvement: the compiler presents type information and emits appropriate serialization and deserialization code for JSON or some binary format. However, in my own experience, there were major issues in the details, because it guessed how an object should be constructed, and it often guessed incorrectly. See A path forward for pickling Java objects · Issue #296 · scala/pickling · GitHub.

So generally I have felt more sympathetic to the idea of data binding, that is, defining the wire protocol first and deriving the Java or Scala object second, rather than trying to serialize everything. XML data binding is an example, but general IDLs (interface description languages) such as Protobuf, Avro, and Apache Thrift are all pretty widely used, for good reasons: fast runtime performance and a safe route to evolving the datatype over time.

If we do go with the serialization route, I hope we learn the lesson and use a serializer based on a static typeclass, as in one that fails at compile time when it encounters a class that it’s unable to serialize into JSON. I am guessing there are lots of JSON serialization solutions to pick from, like Circe, uPickle, and Play JSON.
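To illustrate what "fail at compile time" means here, below is a minimal sketch of a typeclass-based writer. It is not the API of Circe, uPickle, or Play JSON; the names `StaticJson`, `JsonWriter`, and `toJson` are made up for this example, and string escaping is deliberately left out.

```scala
object StaticJson {
  // Hypothetical typeclass: evidence that a value of type A can be written as JSON.
  trait JsonWriter[A] { def write(a: A): String }

  object JsonWriter {
    implicit val intWriter: JsonWriter[Int] =
      new JsonWriter[Int] { def write(a: Int): String = a.toString }

    // Assumption: no escaping of quotes or control characters, to keep the sketch short.
    implicit val stringWriter: JsonWriter[String] =
      new JsonWriter[String] { def write(a: String): String = "\"" + a + "\"" }

    // A list is writable only if its element type is writable.
    implicit def listWriter[A](implicit w: JsonWriter[A]): JsonWriter[List[A]] =
      new JsonWriter[List[A]] {
        def write(as: List[A]): String = as.map(w.write).mkString("[", ",", "]")
      }
  }

  // The implicit parameter is resolved by the compiler at the call site.
  def toJson[A](a: A)(implicit w: JsonWriter[A]): String = w.write(a)
}
```

With this shape, `StaticJson.toJson(List(1, 2, 3))` compiles, while `StaticJson.toJson(new java.io.File("x"))` is rejected by the compiler because no `JsonWriter[java.io.File]` instance is in scope. Nothing is guessed at runtime.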

There’s also Sauerkraut - “A revitalization of Pickling in the Scala 3 world.” by @jsuereth. I haven’t looked into the details, but maybe that will be the answer that combines the strengths of both worlds.


That’s an interesting project! But strangely, Apache Spark (literally the most heavyweight application of Scala) still uses either Java serialisation or Twitter Kryo/Chill, which forfeit format readability for speed and universality.

This kind of fits into the paradigm of metaprogramming and multi-stage compilation, which are, in general, pretty bad at generating human-readable output such as JSON or XML. I have some fond memories of reading 10k+ lines of parser code generated by ANTLR, and Verilog code filled with random IDs generated by the Chisel framework. Succinctness and speed of the generated code are usually worth much more.

Here, a prototype has been written in Scala 2 to automatically lift ANY serializable object from this stage to the next:

It uses Java object serialisation and Base64 encoding/decoding. Theoretically it could be made much faster by decoding the byte array ahead of time (AOT), but I don’t know how to do this with the existing JVM compiler. Regardless, this solution seems to be universally applicable, particularly if your code is already primed for distributed computing.
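For readers who want to see the round trip concretely, here is a minimal sketch of the runtime half of such an approach. It is not the prototype's actual code; the object name `Base64Snapshot` and its two methods are hypothetical. The assumption is that a macro would embed the string returned by `encode` as a literal in the generated code and emit a call to `decode` around it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Base64

object Base64Snapshot {
  // Serialize any java.io.Serializable value into a Base64 (ASCII) string.
  def encode(value: java.io.Serializable): String = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    Base64.getEncoder.encodeToString(buffer.toByteArray)
  }

  // Rebuild the value from the Base64 string.
  // The cast is unchecked: the caller must know the type that was encoded.
  def decode[T](text: String): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(Base64.getDecoder.decode(text)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

For example, `Base64Snapshot.decode[Vector[Int]](Base64Snapshot.encode(Vector(1, 2, 3)))` gives back an equal vector. The decoding cost is paid every time the generated code runs, which is the part an AOT step would remove.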

IMHO structural/schematic data serialisation libraries like Sauerkraut work best on case classes / product types. They are evidently better at generating human-readable output, but can’t come anywhere close in terms of universal applicability or speed.

It’s from 2016, but there’s an interesting blog post called Scala Serialization by Dmitry Komanov that benchmarked the performance of ScalaPB, Pickling, BooPickle, Chill, and Scrooge. Scala Serialization (Charts) was updated in 2017 and 2018 to include Jackson in the benchmark. GitHub: dkomanov/scala-serialization.

Here’s a screenshot of the p50 for round-trip serialization of a data transfer object at different sizes. The unit is nanoseconds, so lower is faster.

With the smallest data size, 1 KB, ScalaPB (a Protobuf wrapper) outperforms Java serialization by 17x, but all the non-JSON options are roughly in the same neighborhood. In that sense, the choice of Kryo / Twitter Chill does sound attractive in a situation where setting up the data schema is not possible, although it seems to be the one most affected by increases in data size among the non-JSON, non-Java-serialization cohort.

The chart is a bit misleading because the y-axis gets cut off after 60μs. The chart might give the impression that Java serialization remains constant, but it’s just off the chart.


Wow, didn’t expect this to happen :-]

No surprise that Chill is what Java serialization should strive to be. But BooPickle & ProtoBuf (both yield binary data) really stand out here which kind of proved my point that human readability can be sacrificed.

Will either of them become as universally applicable as Chill? Regardless of the outcome, I would propose adding the simplest implementation to dotty core staging, then suggesting that people improve on top of it using all the better options.

You would also run into issues if you try to lift a large object. There are hard limits on how big literals in a classfile can get (How to run Scala code at compile time? - #9 by Jasper-M - Question - Scala Users).


Thanks a lot @Jasper-M, I’ll modify the code so that it can easily take multiple UTF-8 encoded strings.
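A minimal sketch of the chunking idea, assuming the payload is ASCII (as Base64 output is), so that one character takes one byte in the classfile. The object name `LiteralChunks` is hypothetical and this is not the code from the prototype. The 65535 figure comes from the classfile format, where a `CONSTANT_Utf8` entry stores its byte length in an unsigned 16-bit field.

```scala
object LiteralChunks {
  // Upper bound for a single string constant in a classfile (length is a u2 field).
  val MaxLiteralBytes: Int = 65535

  // Split an ASCII string into pieces that each fit into one string literal.
  def split(ascii: String, size: Int = MaxLiteralBytes): List[String] = {
    require(size > 0 && size <= MaxLiteralBytes, "chunk size must be in 1..65535")
    ascii.grouped(size).toList
  }

  // Reassemble the original payload at runtime.
  def join(chunks: Seq[String]): String = chunks.mkString
}
```

A macro could then emit one literal per chunk and a call to `join` around them. This only addresses the per-literal limit; other classfile limits (for instance on method size) may still apply to very large objects.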

This is just a demonstration in Scala 2. I expect Scala 3 to have some mechanism to create a headless Expr[_] directly from serialized data.

@Jasper-M hope the new version is up to your standard:

I’m wondering whether something similar could be useful to the Scala 3 compiler?