In multi-stage compilation, should we use a standard serialisation method to ship objects through stages?

That’s an interesting project! But strangely, Apache Spark (literally the most heavyweight application of Scala) still uses either Java serialisation or Twitter Kryo/Chill, both of which give up a readable format in exchange for speed and universality.

This kind of fits the paradigm of metaprogramming & multi-stage compilation, which are, in general, pretty bad at generating human-readable output such as JSON or XML. I have some fond memories of reading 10k+ lines of parser code generated by ANTLR, and Verilog code filled with random IDs generated by the Chisel framework. Succinctness & speed of the generated code are usually worth much more.

Here, a prototype has been written in Scala 2 to automatically lift ANY serializable object from this stage to the next:

It uses Java object serialisation and Base64 encoding/decoding. Theoretically it could be made much faster by decoding the byte array ahead of time (AOT), but I don’t know how to do this with the existing JVM compiler. Regardless, this solution seems to be universally applicable, particularly if your code is already primed for distributed computing.
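To make the mechanism concrete, here is a minimal runtime-only sketch of the round trip described above (Java serialisation, then Base64). It is not the prototype's code: the object name `Base64Lift` and its two methods are made up for illustration, and no macro is involved.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Base64

// Hypothetical helper for illustration only; not the prototype's API.
object Base64Lift {

  // Java-serialise a value and encode the resulting bytes as a Base64 string.
  def encode(x: java.io.Serializable): String = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(x) finally out.close()
    Base64.getEncoder.encodeToString(bytes.toByteArray)
  }

  // Decode the Base64 string and read the object back.
  // The cast is unchecked: the caller has to know which type was encoded.
  def decode[T](s: String): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(Base64.getDecoder.decode(s)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

For example, `Base64Lift.decode[List[Int]](Base64Lift.encode(List(1, 2, 3)))` gives back an equal list. The string in the middle is what would end up as a literal in the generated code.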


IMHO structural/schematic data serialisation libraries like Sauerkraut work best on case classes / product types. They are evidently better at generating human-readable output, but can’t come anywhere close in terms of universal applicability or speed.

It’s from 2016, but there’s an interesting blog post called Scala Serialization by Dmitry Komanov that benchmarked the performance of ScalaPB, Pickling, BooPickle, Chill, and Scrooge. Scala Serialization (Charts) was updated in 2017 and 2018 to include Jackson in the benchmark. GitHub: dkomanov/scala-serialization.

Here’s a screenshot of the p50 for round-trip serialization of some data transfer objects of different sizes. The unit is nanoseconds, so lower is faster.
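For readers unfamiliar with the metric: p50 is the median of the measured round-trip times. Here is a naive sketch of how such a number can be taken. The name `NaiveP50` is made up, and this is not how the linked benchmark was run. A serious measurement would use a proper harness (JMH, for instance) to deal with JIT warm-up.

```scala
// Hypothetical, deliberately naive timing helper for illustration only.
object NaiveP50 {

  // Runs `body` `runs` times and returns the median (p50) wall-clock time in nanoseconds.
  def medianNanos(runs: Int)(body: => Any): Long = {
    require(runs > 0, "need at least one run")
    val samples = Array.fill(runs) {
      val start = System.nanoTime()
      body // evaluated once per sample because the parameter is by-name
      System.nanoTime() - start
    }
    val sorted = samples.sorted
    sorted(runs / 2)
  }
}
```

Something like `NaiveP50.medianNanos(1000)(roundTrip(dto))`, with `roundTrip` and `dto` standing in for the serializer and payload under test, gives a rough figure to compare between libraries.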

With the smallest data size, 1 KB, ScalaPB (a Protobuf wrapper) outperforms Java serialization by 17x, but all non-JSON things are roughly in the same neighborhood. In that sense, the choice of Kryo / Twitter Chill does sound attractive in a situation where setting up the data schema is not possible, although it seems to be the most affected by data size increases among the non-JSON, non-Java-serialization cohort.

The chart is a bit misleading because the y-axis gets cut off after 60 μs. The chart might give the impression that Java serialization remains constant, but it’s just off the chart.


Wow, didn’t expect this to happen :-]

No surprise that Chill is what Java serialization should strive to be. But BooPickle & Protobuf (both yield binary data) really stand out here, which kind of proves my point that human readability can be sacrificed.

Will either of them become as universally applicable as Chill? Regardless of the outcome, I would propose adding the simplest implementation to dotty core staging, then encouraging people to improve on top of it using all the better options.

You would also run into issues if you try to lift a large object. There are hard limits on how big literals in a classfile can get (How to run Scala code at compile time? - #9 by Jasper-M - Question - Scala Users).


Thanks a lot @Jasper-M, I’ll modify the code so that it can easily take multiple UTF-8 encoded strings.
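A minimal sketch of that splitting, assuming the payload is already a Base64 string. A string constant in a classfile is limited to 65535 bytes of modified UTF-8, and Base64 output is plain ASCII, so one character is one byte. The names `LiteralChunks` and `ChunkSize` are made up, and the chunk size is just one value that fits.

```scala
import java.util.Base64

// Hypothetical helper for illustration only.
object LiteralChunks {

  // Assumed chunk size: below the 65535-byte classfile limit for one string constant,
  // and a multiple of 4 so that every chunk is a complete Base64 unit.
  val ChunkSize = 32768

  // Cut one long Base64 string into pieces that each fit into a single literal.
  def split(base64: String): Seq[String] =
    base64.grouped(ChunkSize).toSeq

  // Reassemble the pieces and decode them back into the original bytes.
  def join(chunks: Seq[String]): Array[Byte] =
    Base64.getDecoder.decode(chunks.mkString)
}
```

Each chunk then becomes its own string literal in the generated code, and the generated call concatenates them again before deserialising.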

This is just a demonstration in Scala 2. I expect Scala 3 to have some mechanism to create a headless Expr[_] directly from serialized data.

@Jasper-M hope the new version is up to your standard:

Wondering if something similar could be useful to scala3 compiler?

Here is a prototype that implements the AutoLift idea above in Scala 3.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Base64
import scala.quoted.*

sealed trait SerializableExpr[T]:
  def apply(x: T)(using Quotes): Expr[T]

sealed trait DeSerializableExpr[T]:
  def unapply(x: Expr[T])(using Quotes): Option[T]

object SerializableExpr {

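  // Chunk size for the generated string literals: small enough for the classfile limit
  // on string constants, and a multiple of 4 so each Base64 chunk decodes on its own.
  private inline val MAX_LITERAL_LENGTH = 32768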

  lazy val encoder: Base64.Encoder = Base64.getEncoder
  lazy val decoder: Base64.Decoder = Base64.getDecoder

  def apply[T: SerializableExpr](x: T)(using Quotes): Expr[T] =
    summon[SerializableExpr[T]].apply(x)

  def unapply[T: DeSerializableExpr](x: Expr[T])(using Quotes): Option[T] =
    summon[DeSerializableExpr[T]].unapply(x)

  given serializableExpr[T <: Serializable : Type]: SerializableExpr[T] with {
    def apply(x: T)(using Quotes): Expr[T] =
      val stringsExpr = Varargs(serialize(x).map(Expr(_)))
      '{ deserialize[T]($stringsExpr*) }
  }

  given deSerializableExpr[T <: Serializable : Type]: DeSerializableExpr[T] with {
    def unapply(x: Expr[T])(using Quotes): Option[T] =
      x match
        case '{ deserialize[T](${Varargs(stringExprs)}*) } =>
          Exprs.unapply(stringExprs).map(strings => deserialize(strings*))
        case _ => None
  }

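  // Java-serialise the value, Base64-encode it, and cut the result into literal-sized chunks.
  private def serialize(x: Serializable): Seq[String] = {
    val bOStream = new ByteArrayOutputStream()
    val oOStream = new ObjectOutputStream(bOStream)
    try oOStream.writeObject(x)
    finally oOStream.close() // also flushes any buffered bytes into bOStream
    val serialized = encoder.encodeToString(bOStream.toByteArray)
    serialized.grouped(MAX_LITERAL_LENGTH).toSeq
  }

  // Decode every chunk, concatenate the bytes, and read the object back.
  // The cast is unchecked: it trusts that the literals were produced by serialize for a T.
  private def deserialize[T <: Serializable](strings: String*): T = {
    val bytes = strings.map(decoder.decode).reduce(_ ++ _)
    val bIStream = new ByteArrayInputStream(bytes)
    val oIStream = new ObjectInputStream(bIStream)
    try oIStream.readObject().asInstanceOf[T]
    finally oIStream.close()
  }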
}

object App {
  import SerializableExpr.given

  def example(using Quotes) = {
    val serializedExpr = SerializableExpr("abc")
    serializedExpr match
      case SerializableExpr(value) => println(value)
      case _ =>
  }
}

Actually, I did not need the DeSerializableExpr to port AutoLift.

Here is a shorter version that uses ToExpr. However, it creates an expression that is harder to extract using FromExpr.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.Base64
import scala.quoted.*

object SerializableExpr {

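  // Chunk size for the generated string literals: small enough for the classfile limit
  // on string constants, and a multiple of 4 so each Base64 chunk decodes on its own.
  private inline val MAX_LITERAL_LENGTH = 32768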

  lazy val encoder: Base64.Encoder = Base64.getEncoder
  lazy val decoder: Base64.Decoder = Base64.getDecoder

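  given serializableExpr[T <: Serializable : Type]: ToExpr[T] with {
    def apply(x: T)(using Quotes): Expr[T] =
      // Name the built-in String instance explicitly, so that lifting the chunks
      // cannot resolve back to this given (a String is itself Serializable).
      val stringsExpr = Varargs(serialize(x).map(s => Expr(s)(using ToExpr.StringToExpr[String])))
      '{ deserialize[T]($stringsExpr*) }
  }

  // Java-serialise the value, Base64-encode it, and cut the result into literal-sized chunks.
  private def serialize(x: Serializable): Seq[String] = {
    val bOStream = new ByteArrayOutputStream()
    val oOStream = new ObjectOutputStream(bOStream)
    try oOStream.writeObject(x)
    finally oOStream.close() // also flushes any buffered bytes into bOStream
    val serialized = encoder.encodeToString(bOStream.toByteArray)
    serialized.grouped(MAX_LITERAL_LENGTH).toSeq
  }

  // Decode every chunk, concatenate the bytes, and read the object back.
  // The cast is unchecked: it trusts that the literals were produced by serialize for a T.
  private def deserialize[T <: Serializable](strings: String*): T = {
    val bytes = strings.map(decoder.decode).reduce(_ ++ _)
    val bIStream = new ByteArrayInputStream(bytes)
    val oIStream = new ObjectInputStream(bIStream)
    try oIStream.readObject().asInstanceOf[T]
    finally oIStream.close()
  }
}

object App {
  // A wildcard import does not bring givens into scope in Scala 3, so import them explicitly.
  import SerializableExpr.given

  def example(using Quotes) = {
    val serializedExpr: Expr[String] = Expr("abc")
  }
}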

@nicolasstucki Thanks a lot! I hope I can be helpful later if you plan to add an (extendable) variant of this to the dotty metaprogramming core.

According to my recent poll 55.6% of developers claim they don’t need it (https://twitter.com/tribbloid/status/1520501655376175105). Maybe all they need is a little push :-]

Extendable in what way?

Using Java Serialization is almost certainly a Bad Idea™ for reasons I don’t think I need to repeat here. But the idea in general makes sense, and could be implemented using some other serialization library that’s less problematic.

In fact, I implemented exactly this in my Python implementation of syntactic macros and hygienic quasiquotes Reference — MacroPy3 1.1.0 documentation. I don’t know enough about Scala 3 macros to talk about the details, but at a high level yeah it is handy and seems to work.


@nicolasstucki how about making the following changes?

  • SerializableExpr becomes an extendable trait, with
    • Serializable as a dependent type, &
    • def serialize and def deserialize as abstract methods that return/take BLOBs instead of strings

In this case the threat @lihaoyi mentioned will only have minimal impact, as we can easily define JavaSerializableExpr for the cases where it works, and switch to e.g. KryoSerializableExpr or UPickleSerializableExpr whenever necessary.
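A rough, runtime-only sketch of the shape I have in mind. All names here (`BlobSerializer`, `JavaBlobSerializer`, `Utf8StringSerializer`) are made up for illustration, and the macro side that turns the BLOB into an Expr is left out.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.charset.StandardCharsets

// Hypothetical abstraction: each backend decides what it can serialise.
trait BlobSerializer {
  type Serializable // the dependent type: what this backend is able to handle
  def serialize(x: Serializable): Array[Byte]
  def deserialize(blob: Array[Byte]): Serializable
}

// Backend based on Java serialisation, for anything that is java.io.Serializable.
object JavaBlobSerializer extends BlobSerializer {
  type Serializable = java.io.Serializable

  def serialize(x: Serializable): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(x) finally out.close()
    bytes.toByteArray
  }

  def deserialize(blob: Array[Byte]): Serializable = {
    val in = new ObjectInputStream(new ByteArrayInputStream(blob))
    try in.readObject().asInstanceOf[java.io.Serializable] finally in.close()
  }
}

// A second, trivial backend, only to show that implementations can be swapped.
object Utf8StringSerializer extends BlobSerializer {
  type Serializable = String
  def serialize(x: String): Array[Byte] = x.getBytes(StandardCharsets.UTF_8)
  def deserialize(blob: Array[Byte]): String = new String(blob, StandardCharsets.UTF_8)
}
```

A Kryo- or uPickle-based backend would be one more object extending the same trait, and the lifting code would only ever see `Array[Byte]`.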

BTW, looks like your last version already has the ability to create Expr[_] from BLOBs with 0 runtime overhead (by compiling directly into inlined constant JVM bytecode), so my old concern should be invalid.


Someone should create a library defining this abstraction and some basic implementations.


@nicolasstucki do you accept PRs? I thought you were a maintainer of the Scala 3 reflection library.

I did not understand the question. Which PRs? Yes, I am the maintainer.

@nicolasstucki I was hoping to submit the PR to the Scala 3 compiler, as the quote & splice API is still kind of experimental and the implementation may need to adapt in the future.

Others may have different plans.

I see. I imagined it would be better to implement it first in a standalone library. This would make it simpler to implement and try out all the serialization frameworks, such as Kryo and upickle, in one place. We would be able to implement and stabilize it faster. Later, we could move the interface definition and the basic Java serialization implementation to the standard library.

We could try adding it as experimental in the standard library, but then to cross-validate with other serialization libraries we would need to wait for full release cycles. I feel this would be much slower and require more work than the first approach.

I could start a repo with this code and add the basic functionality, but might need some help.


Thanks a lot, that all makes sense. Will do & publish.
