Case classes are often preferred to simple classes by Scala developers to model data structures. However, case classes have a drawback: they can’t evolve in a binary compatible way. Some Scala developers created workarounds based on code generation (e.g., contraband), or macro annotations (e.g., data-class, scalameta). Other developers just manually write simple classes (e.g., Scala.js, endpoints4s), but that requires a lot of undesired boilerplate.
I believe there is a need for a middle ground between case classes and regular classes, with some of the features of the case classes (mainly structural equality) but without compromising the possibility of making binary compatible evolutions. Let’s call them “data classes”.
This post details the motivation for data classes, and proposes a couple of ideas to get more support for them at the language level. Please let me know what you think of the proposed ideas, or if you see another path!
Motivation
Case classes are often preferred to simple classes by Scala developers to model data structures. However, case classes have a drawback: they can’t evolve in a binary compatible way (we can’t add or remove optional fields, nor mandatory fields with a default value).
For instance, consider the following case class definition:
case class User(name: String, age: Int)
Developers can write programs that use User, as follows:
val julien = User("Julien", 36)
val sebastien = User("Sébastien", 32)
assert(julien != sebastien)
val updatedJulien = julien.copy(age = julien.age + 1)
Let’s say that the class User is shipped in a library, and that at some point we want to add an optional field email:
case class User(name: String, age: Int, email: Option[String] = None)
This change is not backward binary compatible. The above program will have to be recompiled with the new version of the class User, although the change is source compatible! The reason it is not backward binary compatible is that the signature of the constructor has changed, and so has the signature of the copy method.
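To make the point about signatures concrete, here is a minimal sketch that uses Java reflection to list what a compiled client links against in the first version of User. The name showSignatures is only illustrative. The printed descriptors depend on the compiler version, so no expected output is given.

```scala
// Version 1 of the case class from the running example.
case class User(name: String, age: Int)

@main def showSignatures(): Unit =
  val cls = classOf[User]
  // Public constructors: the one that `new User(...)` call sites link against.
  cls.getConstructors.foreach(c => println(c))
  // `copy` and the synthetic getters for its default arguments.
  cls.getMethods
    .filter(_.getName.startsWith("copy"))
    .sortBy(_.getName)
    .foreach(m => println(m))
```

Adding a third constructor parameter replaces the two-parameter constructor and copy method with three-parameter ones, so a client compiled against the old descriptors can no longer find them at link time.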
However, there are ways to add an optional field to a data type without breaking the binary compatibility.
Indeed, Scala developers have been using the following techniques:
- code generation (e.g., contraband), which requires a build tool with a specific setup, is sometimes poorly supported by IDEs, and makes the code harder to navigate,
- macro annotations (e.g., data-class, scalameta), which are currently holding back the adoption of Scala 3 (IMHO), and generally make the code harder to navigate since the code they generate does not appear in the source files,
- manually writing simple classes (e.g., Scala.js, endpoints4s), which requires a lot of undesired boilerplate.
I believe there is a need for a middle ground between case classes and simple classes, with some of the features of case classes but without compromising the possibility of making binary compatible evolutions.
The features I would like to retain for data classes are the following:
- field accessors
- structural implementation of equals and hashCode
- structural implementation of toString
- Java serialization
- lean syntax for creating and copying instances
- support for “named field patterns” in match expressions (if they ever get implemented… Let’s put this item aside for now, since it refers to a language feature that does not exist yet)
Status Quo
Currently, to benefit from the aforementioned features on the class User without relying on macros or code generation, developers have to write the following:
class User(val name: String, val age: Int) extends Serializable:
  private def copy(name: String = name, age: Int = age): User = new User(name, age)
  override def toString(): String = s"User($name, $age)"
  override def equals(that: Any): Boolean =
    that match
      case user: User => user.name == name && user.age == age
      case _ => false
  override def hashCode(): Int =
    37 * (37 * (17 + name.##) + age.##)
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
The following snippet illustrates how to construct instances, to copy them, and to compare them:
val alice = User("Alice", 36)
val bob = User("Bob", 42)
// structural `toString`
println(bob) // "User(Bob, 42)"
// lean syntax for copying instances
val bob2 = bob.withAge(31)
// structural equality
assert(bob2 == User("Bob", 31))
Then, one can publish a new version of the data type User, with an additional (optional) field email. That new version of User is binary compatible with the previous one:
class User private (val name: String, val age: Int, val email: Option[String]) extends Serializable:
  // public constructor that matches the signature of the previous primary constructor
  def this(name: String, age: Int) = this(name, age, None)
  private def copy(name: String = name, age: Int = age, email: Option[String] = email): User = new User(name, age, email)
  override def toString(): String = s"User($name, $age, $email)"
  override def equals(that: Any): Boolean =
    that match
      case user: User => user.name == name && user.age == age && user.email == email
      case _ => false
  override def hashCode(): Int =
    37 * (37 * (37 * (17 + name.##) + age.##) + email.##)
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
  def withEmail(email: Option[String]): User = copy(email = email)
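As a quick sanity check, the sketch below runs call sites written in the style of the first version against this second version. The class is repeated so that the snippet compiles on its own. The entry point name checkV1CallSites is hypothetical. Note that this only exercises source-level usage in one compilation unit. It does not by itself prove binary compatibility, which would need a tool such as MiMa or separately compiled artifacts.

```scala
// Hand-written second version of User, with the optional email field.
class User private (val name: String, val age: Int, val email: Option[String]) extends Serializable:
  // Public constructor with the same signature as the first version.
  def this(name: String, age: Int) = this(name, age, None)
  private def copy(name: String = name, age: Int = age, email: Option[String] = email): User =
    new User(name, age, email)
  override def toString(): String = s"User($name, $age, $email)"
  override def equals(that: Any): Boolean =
    that match
      case user: User => user.name == name && user.age == age && user.email == email
      case _ => false
  override def hashCode(): Int =
    37 * (37 * (37 * (17 + name.##) + age.##) + email.##)
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
  def withEmail(email: Option[String]): User = copy(email = email)

@main def checkV1CallSites(): Unit =
  // Call sites that only know about name and age keep working.
  val bob = new User("Bob", 42)
  val bob2 = bob.withAge(31)
  assert(bob2 == new User("Bob", 31))
  assert(bob2.hashCode == new User("Bob", 31).hashCode)
  // The new field defaults to None and takes part in equality.
  assert(bob.email.isEmpty)
  assert(bob.withEmail(Some("bob@example.com")) != bob)
  println(bob) // User(Bob, 42, None)
```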
So, for every added field, we have to remember to update the implementation of toString, copy, equals, and hashCode.
The problem statement is: how to keep supporting the use-case of defining a data structure that can evolve in a binary compatible way, while significantly reducing the associated burden?
After some internal discussions, I saw the following possible solutions, which are detailed further in the next sections.
The first approach would be to introduce a new kind of class definition that would support exactly this use-case. Developers would write class definitions, which, like case classes, would expand to serializable class definitions with structural equality and public field accessors, but unlike case classes would have synthetic methods like withName and withAge to transform instances (no public copy method), and would have a mechanism to ensure that the public constructor remains backward binary compatible over time. That approach would require the least effort from end-developers, but it raises some technical challenges (how do we manage the compatibility of the public constructor?), and the specification of the desugaring of data class would be more complex than in the alternative approaches.
The second approach would be to focus on the more general use case of defining “structural” data types. Developers would write class definitions that would expand to serializable class definitions with structural equality and public field accessors, but nothing more. Such structural classes could be used to support our main use-case by manually adding transformation methods like withName and withAge.
The last approach would be to build on the existing case class feature, which already does exactly what we want when we define the primary constructor to be private, except that it also defines a public extractor that would break backward compatibility if the class evolves. Thus, the last approach would be to change the semantics of case classes with private constructors to also make their extractor private. This approach is the most “conservative” one in the sense that it does not introduce a new language feature.
The next sections discuss the proposed approaches in more detail.
Fully Fledged Data Classes
In this approach, our User class definition would look like the following:
data class User(name: String, age: Int)
The compiler would expand it to:
class User(val name: String, val age: Int) extends Serializable:
  private def copy(name: String = name, age: Int = age): User = new User(name, age)
  override def toString(): String = s"User($name, $age)"
  override def equals(that: Any): Boolean =
    that match
      case user: User => user.name == name && user.age == age
      case _ => false
  override def hashCode(): Int =
    37 * (37 * (17 + name.##) + age.##)
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
The desugaring would be exactly what we would write manually with plain classes: the compiler would implement structural equality and toString, it would define public accessors for the fields, and it would define public transformation methods (withName and withAge).
The main challenge is to deal with the binary compatibility of the class constructor if we want to publish a new version of User with new fields with default values. We need to find a way to tell the compiler what the type signature of the previous version of the User data type was.
One possibility would be to handle data class fields with a default value in a special way. For instance, if a developer writes a new version of User that includes the optional email, they would write the following:
data class User(name: String, age: Int, email: Option[String] = None)
And the compiler would desugar it to the following:
class User private (val name: String, val age: Int, val email: Option[String]) extends Serializable:
  // public constructor that calls the primary (private) constructor
  def this(name: String, age: Int) = this(name, age, None)
  // ... then, just like the above desugaring
A data class that has fields with default values would have a private primary constructor and a public secondary constructor taking as parameters only the fields that don’t take default values, and calling the primary constructor with the default values for the remaining parameters.
That mechanism would allow developers to introduce new fields, but not to remove optional fields. To support this use case, developers would have to manually re-introduce the accessor of the removed field to return the previously defined default value, and to manually re-introduce the withXxx transformation method. In the case of email, this would look like the following:
// Removal of `email` field
data class User(name: String, age: Int):
  def email: Option[String] = None
  def withEmail(email: Option[String]): User = this
Structural Classes
A simpler approach (from the perspective of the language design) would be to focus on the more general use case of defining “structural” data types. That is, type definitions that support structural equality and toString.
The language would support the concept of structural class definitions, which would provide half of the features of case class definitions:
structural class User(name: String, age: Int)
This would desugar to:
class User(val name: String, val age: Int) extends Serializable:
  def copy(name: String = name, age: Int = age): User = new User(name, age)
  override def toString(): String = s"User($name, $age)"
  override def equals(that: Any): Boolean =
    that match
      case user: User => user.name == name && user.age == age
      case _ => false
  override def hashCode(): Int =
    37 * (37 * (17 + name.##) + age.##)
Structural classes would be very similar to case classes. The main difference is that they would not synthesize an extractor (an unapply method in the companion), meaning that we could not use “constructor patterns” on instances of structural classes. Other differences are that they would not extend Product and CanEqual, but that point is open to discussion, see below.
To define a data type that can evolve in a backward compatible way, developers could use a structural class with a private primary constructor, and add transformation methods and a public “smart constructor”:
structural class User private (name: String, age: Int):
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
object User:
  def apply(name: String, age: Int): User = new User(name, age)
Note that the visibility of the copy method would be the same as the visibility of the primary constructor, private. (This is already the case, currently, with case classes.)
A backward binary compatible version of User with an optional email field could be defined as follows:
structural class User private (name: String, age: Int, email: Option[String]):
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
  def withEmail(email: Option[String]): User = copy(email = email)
object User:
  def apply(name: String, age: Int): User = new User(name, age, None)
Note that the public constructor (the apply method in User) has the same signature as before but it now provides a default None value for the email field.
This solution is more verbose than the previous one because of the explicit definitions of the transformation methods withName, withAge, and withEmail. However, the fact that transformation methods are defined explicitly also provides more flexibility. For instance, one could define more specific transformation methods for optional fields or collection fields:
def withEmail(email: String): User = copy(email = Some(email))
def withoutEmail: User = copy(email = None)
A challenge raised by this solution is that we need a way to make the private constructor effectively private at the bytecode-level. Indeed, since it is actually called from the companion object, it can’t really be private at the bytecode-level. At least, currently this is not the case for case classes with a private constructor (see below). I see several possible solutions to this problem. The first solution was proposed by @smarter and consists of emitting the constructor as ACC_SYNTHETIC to make it effectively invisible from Java. Another solution could be to define the first “version” of User with a public constructor (and copy method), and then make them private in the second version only:
// v1
structural class User(name: String, age: Int)
// v2
structural class User private (name: String, age: Int, email: Option[String]):
  // re-introduce the old public constructor and copy method, for compatibility
  def this(name: String, age: Int) = this(name, age, None)
  def copy(name: String = name, age: Int = age): User = copy(name = name, age = age, email = email) // use the generated private copy method
In this version, the private constructor is really private because it is not called from the companion.
We might consider alternative keywords instead of structural. Maybe product would be a good one (and in such a case, the class may also extend Product, see also the discussion point below). Or data.
One thing that I like about structural classes is that they can be seen as an intermediate step between plain classes and case classes. Indeed, case classes are structural classes with an extractor. And structural classes are plain classes with a structural implementation of toString, equals, and hashCode, and a copy method.
Case Classes with Private Constructors
As described in the previous section, the main difference between “structural” classes and case classes would be that structural classes would not have an unapply method in their companion. It made me think that maybe case classes alone would be enough to support our use-case. Indeed, if we changed the semantics of case classes with private constructors to also have a private unapply method (as is already the case for their apply method), then we would not even need to introduce the concept of structural classes to the language. We could just use case classes with private constructors to support our use-case.
Our running example rewritten with a case class with a private constructor would look very similar to the structural class with a private constructor:
case class User private (name: String, age: Int):
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
object User:
  def apply(name: String, age: Int): User = new User(name, age)
We could then define a new version of User with an additional optional field as follows:
case class User private (name: String, age: Int, email: Option[String]):
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)
  def withEmail(email: Option[String]): User = copy(email = email)
object User:
  def apply(name: String, age: Int): User = new User(name, age, None)
Currently, this new version of User is not backward binary compatible with the previous one for two reasons. First, the private constructor is not really private at the bytecode-level, see the discussion point in the previous section. Second, the compiler emits a public unapply extractor that allows users to write code like user match { case User(name, age) => ... }, which would crash on the new version of User.
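The second reason can be illustrated with a small sketch against the first version of User. The helper describe and the entry point extractorDemo are hypothetical names standing in for downstream client code. The point is that, although the constructor is private, the two-field constructor pattern is still available to any client.

```scala
// First version: private constructor, explicit public `apply`.
case class User private (name: String, age: Int):
  def withName(name: String): User = copy(name = name)
  def withAge(age: Int): User = copy(age = age)

object User:
  def apply(name: String, age: Int): User = new User(name, age)

// Client code that a downstream project might contain: it relies on the
// synthesized public extractor and on User having exactly two fields.
def describe(user: User): String =
  user match
    case User(name, age) => s"$name is $age years old"

@main def extractorDemo(): Unit =
  println(describe(User("Alice", 36))) // Alice is 36 years old
```

Because the pattern bakes the number of fields into the client, adding email changes what the extractor exposes, which is the compatibility hazard described above.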
So, the main question about this design is “should case classes with a private constructor also have a private extractor?”. The answer may not be obvious. Maybe there is a real need for defining data structures that need a controlled way to be constructed, but that are fine to be pattern matched on?
In any case, if we decide to now change the compiler to emit private unapply methods when the primary constructor is private, it would still be possible for users who want a public unapply to define it explicitly:
def unapply(user: User): User = user
That would also allow programs compiled with the new version of the compiler to be compatible with what the old version of the compiler used to produce.
Another argument is that the purpose of the case keyword is to enable pattern matching. It would look weird to define something with the syntax case class that does not support pattern matching.
Open Questions
Should data classes and structural classes also implement Product? I would say yes, but I didn’t think more about it.
Should the compiler synthesize “generic” Mirrors for them, like it does with case classes? Maybe, but only the fields that don’t have a default value should be mirrored.
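For readers less familiar with mirrors, here is a minimal sketch of what the compiler synthesizes for case classes today, using the existing scala.deriving.Mirror API on the first version of User. The entry point name mirrorDemo is hypothetical. It shows that a mirror exposes the full list of fields to generic code, which is why the set of mirrored fields matters for compatibility.

```scala
import scala.deriving.Mirror

case class User(name: String, age: Int)

@main def mirrorDemo(): Unit =
  // The compiler synthesizes this mirror for case classes.
  val mirror = summon[Mirror.ProductOf[User]]
  // Build an instance from a generic product (here, a tuple).
  val alice: User = mirror.fromProduct(("Alice", 36))
  println(alice) // User(Alice,36)
  // Go the other way: view the instance as a tuple of its fields.
  val fields: (String, Int) = Tuple.fromProductTyped(alice)
  println(fields) // (Alice,36)
```

Generic code written against this mirror sees exactly two element types, so adding a mirrored field would change what such code observes.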