Pre-SIP: A Syntax for Collection Literals

I should clarify my position a bit. I’m not saying that we should fix the language in stone forever. That’s clearly not a good outcome. Ideally, we would continue the process of slowly deprecating the older ways of expressing things, such as the myriad of implicit forms.

What I’m saying is that language additions, particularly of a syntactic nature, come with a significant cost. That cost is paid by the tools and by the teams who now have to deal with inconsistency across their code base. The cost is borne by everyone who cares about the language and has to answer the question “why are there so many ways to do the same thing?” The cost is very high and not always obvious.

So that means the value we get for an addition of this nature should be equally high. I think enums are probably a pretty fair example of this, where the price of the feature was immense (it really does create a lot of ambiguous syntax), but it also significantly eases an extremely common pattern and makes it feel much more first class within the language.

Sequence literals don’t rise to this level. Not even close. It’s not even clear to me that sequence literals of this form are actually a particularly common case in real-world code outside of simple scripting. They’re certainly less common than case classes. Where they do exist, they’re often something other than just Seq. In fact, I literally never use Seq directly, and I discourage anyone who will listen to me from using it (and have for the last decade). Additionally, the existing options for handling these cases are more than adequate. Even by measures like character count, the bespoke syntax saves us almost nothing.

So all we have here is cost, and not a small amount of it given the ambiguity this introduces in a subtle area, with very minimal benefits outside of prettying up cases that are already trivial. The only thing that can be said in its favor is that simple REPL examples will look more familiar to Python developers playing around with the language, but only if they don’t scratch too far.

One of the things that drew me to Scala originally was the strong desire for orthogonality. Odd missteps like XML literals aside, the language was very clearly aiming for a minimized set of constructs generating a broad set of functionality. The early excitement about actor encodings comes to mind as a good example, but really the apply syntax itself is also a good reminder of that ethos. I fear that, in a well-intentioned drive to make the language less impenetrable and lower the onboarding ramp, we have lost sight of the value that this type of orthogonality brings.

24 Likes

Good point! :+1: I have done this in the past too. But in hindsight it was a mistake; I think we should start shying away from it even in tests and put the test data in files instead. The test data does start to get unmanageable. (Funnily enough, doing Advent of Code started to change my mind on this, since it provides test data in files!)

For some reason we don’t treat test code like “real code”, but we should. That’s a wider cultural issue. That’s the real issue I think. (This SIP would be just a band-aid.)

3 Likes

Thank you, this is a good long-form, balanced explanation of my concerns.

I think the biggest part of the significant cost you are talking about is that users who are against this feature will have to face this syntax if this SIP is accepted.

I don’t see any other issues around this change except that one, which comes down to personal preferences.

I can’t take seriously the points about ambiguity or language complexity after the addition of this syntax. They don’t count for much in the total cost. The same goes for tooling: the change isn’t big.

So the only point that is left is that you think they don’t rise to some level of value. But how do you estimate that? Is it a fair estimation?

Well, when one of the first complaints about Scala is the number of different ways of doing the same thing, and a lot of voices complain about how hard it is to share conventions, or even just syntax usage, between code bases and teams, perhaps refusing to take that point seriously is a problem.

At least, since the SIP process has a field requesting an impact assessment, that impact analysis should absolutely be done for this one, all the more so since there are at least some concerns from library and tool maintainers.

8 Likes

So, from my knowledge of Clojure, there are essentially 4 special cases:

Vectors: [1 2 3 4]
Lists: '(1 2 3 4)
Maps: {:1 1 :2 2 :3 3 :4 4}
Sets: #{1 2 3 4}

Of those, 2 have a prefix character, and it doesn’t seem like a big issue to use something like v(1, 2, 3), #(1, 2, 3) or even c(1, 2, 3) (like R).
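
For what it’s worth, the v(...) and c(...) variants are already expressible in user space today. A minimal sketch, assuming hypothetical helper names v and c (they are not part of any existing library):

// Hypothetical user-space shorthands, not an existing library.
def v[A](xs: A*): Vector[A] = xs.toVector  // Clojure-style vector
def c[A](xs: A*): Seq[A]    = xs           // R-style c(...)

val nums  = v(1, 2, 3)        // Vector(1, 2, 3)
val words = c("a", "b", "c")  // Seq(a, b, c)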

As a personal opinion on this, I do think the usage of [] in Clojure for both vectors and binding groups (which I guess you can argue are also vectors) makes some code a bit hard to parse at a glance.
I don’t think it would be as bad in Scala, but it’s hard for me to say without looking at a large codebase with the new syntax.

3 Likes

Here’s an example bit of code from Mill where I think a [] syntax would help:

object eval extends build.MillStableScalaModule {
  def moduleDeps = Seq(define)
}

object resolve extends build.MillStableScalaModule {
  def moduleDeps = Seq(define)
}

object client extends build.MillPublishJavaModule with BuildInfo {
  def buildInfoPackageName = "mill.main.client"
  def buildInfoMembers = Seq(BuildInfo.Value("millVersion", build.millVersion(), "Mill version."))

  object test extends JavaModuleTests with TestModule.Junit4 {
    def ivyDeps = Agg(build.Deps.junitInterface, build.Deps.commonsIo)
  }
}

object server extends build.MillPublishScalaModule {
  def moduleDeps = Seq(client, api)
}
object graphviz extends build.MillPublishScalaModule {
  def moduleDeps = Seq(build.main, build.scalalib)
  def ivyDeps = Agg(build.Deps.jgraphtCore) ++ build.Deps.graphvizJava ++ build.Deps.javet
}

object maven extends build.MillPublishScalaModule {
  def moduleDeps = Seq(build.runner)
  def ivyDeps = Agg(
    build.Deps.mavenEmbedder,
    build.Deps.mavenResolverConnectorBasic,
    build.Deps.mavenResolverSupplier,
    build.Deps.mavenResolverTransportFile,
    build.Deps.mavenResolverTransportHttp,
    build.Deps.mavenResolverTransportWagon
  )
  def testModuleDeps = super.testModuleDeps ++ Seq(build.scalalib)
}

def testModuleDeps = super.testModuleDeps ++ Seq(build.testkit)

Nothing fancy, just a bunch of objects and defs defining collections. Right now they’re a mix of Seqs and Aggs; in the next breaking version they’ll all be Seqs, but that doesn’t affect the example.

With [] syntax, it looks like this:


object eval extends build.MillStableScalaModule {
  def moduleDeps = [define]
}

object resolve extends build.MillStableScalaModule {
  def moduleDeps = [define]
}

object client extends build.MillPublishJavaModule with BuildInfo {
  def buildInfoPackageName = "mill.main.client"
  def buildInfoMembers = [BuildInfo.Value("millVersion", build.millVersion(), "Mill version.")]

  object test extends JavaModuleTests with TestModule.Junit4 {
    def ivyDeps = [build.Deps.junitInterface, build.Deps.commonsIo]
  }
}

object server extends build.MillPublishScalaModule {
  def moduleDeps = [client, api]
}
object graphviz extends build.MillPublishScalaModule {
  def moduleDeps = [build.main, build.scalalib]
  def ivyDeps = [build.Deps.jgraphtCore] ++ build.Deps.graphvizJava ++ build.Deps.javet
}

object maven extends build.MillPublishScalaModule {
  def moduleDeps = [build.runner]
  def ivyDeps = [
    build.Deps.mavenEmbedder,
    build.Deps.mavenResolverConnectorBasic,
    build.Deps.mavenResolverSupplier,
    build.Deps.mavenResolverTransportFile,
    build.Deps.mavenResolverTransportHttp,
    build.Deps.mavenResolverTransportWagon
  ]
  def testModuleDeps = super.testModuleDeps ++ [build.scalalib]
}

def testModuleDeps = super.testModuleDeps ++ [build.testkit]

It’s not a quantum leap forward, but it does help the reader focus on what’s important - the configuration values - rather than the auxiliary Seq wrappers. In this case it doesn’t really matter what collection type it is, hence the usage of Seq, which I would expect to be the common case since most parts of most programs are not performance sensitive.


Another example is from uPickle, which lets you define JSON literals as follows:

val json: ujson.Value = ujson.Obj(
  "declarationMap" -> true,
  "esModuleInterop" -> true,
  "baseUrl" -> ".",
  "rootDir" -> "typescript",
  "declaration" -> true,
  "outDir" -> pubBundledOut(),
  "plugins" -> ujson.Arr(
    ujson.Obj("transform" -> "typescript-transform-paths"),
    ujson.Obj(
      "transform" -> "typescript-transform-paths",
      "afterDeclarations" -> true
    )
  ),
  "moduleResolution" -> "node",
  "module" -> "CommonJS",
  "target" -> "ES2020"
)

This would look a lot nicer if written with square brackets:

val json: ujson.Value = [
  "declarationMap" -> true,
  "esModuleInterop" -> true,
  "baseUrl" -> ".",
  "rootDir" -> "typescript",
  "declaration" -> true,
  "outDir" -> pubBundledOut(),
  "plugins" -> [
    ["transform" -> "typescript-transform-paths"],
    [
      "transform" -> "typescript-transform-paths",
      "afterDeclarations" -> true
    ]
  ],
  "moduleResolution" -> "node",
  "module" -> "CommonJS",
  "target" -> "ES2020"
]

Again, not groundbreaking, but it’s a significant reduction in boilerplate names that the user doesn’t care about in these contexts, letting the reader focus on the data, which is what actually matters.

Moving data out into config files is always an option. In the old days, people found Java too verbose, and so data was moved into XML, YAML, and other formats. But there is a real cost to introducing a separate-file and separate-language barrier: you lose type safety, editor support, and performance, and you add indirection, etc. etc. Being able to inline important bits of hierarchical data with minimal boilerplate is table stakes for most modern languages today, for good reason.


A third scenario is the OS-Lib subprocess syntax. Currently, you can do

os.call(Seq("curl", "www.google.com"))

Using the Seq constructor, or

os.call(("curl", "www.google.com"))

which we accomplish via implicit conversion hacks on the tuple data types, which are non-standard and fragile. Neither of these is great, and it would be nice to write

os.call(["curl", "www.google.com"])

to be able to pass collections of strings to a subprocess invocation.
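
For context, the tuple-conversion hack referred to above is roughly of the following shape. This is a self-contained sketch, not the actual OS-Lib code; Command and call are hypothetical stand-ins:

import scala.language.implicitConversions

// Hypothetical stand-ins for OS-Lib's internals, to illustrate the shape of the hack.
case class Command(tokens: Seq[String])

object Command:
  given Conversion[Seq[String], Command] = Command(_)
  // one conversion per tuple arity: the non-standard, fragile part
  given Conversion[(String, String), Command]         = t => Command(Seq(t._1, t._2))
  given Conversion[(String, String, String), Command] = t => Command(Seq(t._1, t._2, t._3))

def call(cmd: Command): Unit = println(cmd.tokens.mkString(" "))

call(Seq("curl", "www.google.com"))  // via the Seq conversion
call(("curl", "www.google.com"))     // via the Tuple2 conversion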


A fourth example is from Requests-Scala:

requests.get(
  "https://api.github.com/some/endpoint",
  params = Map("q" -> "http language:scala", "sort" -> "stars"),
  headers = Map("user-agent" -> "my-app/0.0.1,other-app/0.0.2")
)
requests.get(
  "https://api.github.com/some/endpoint",
  params = ["q" -> "http language:scala", "sort" -> "stars"],
  headers = ["user-agent" -> "my-app/0.0.1,other-app/0.0.2"]
)

The Maps we see in the first snippet provide no meaning. The user doesn’t care about them. The important part is "q" -> "http language:scala", "user-agent" -> "my-app/0.0.1,other-app/0.0.2", etc… There is also a target type, so there’s no ambiguity as to what the type of the expression is. Being able to elide the Map would be a nice boon in this sort of code.

3 Likes

I can’t take seriously the points about ambiguity or language complexity after the addition of this syntax. They don’t count for much in the total cost. The same goes for tooling: the change isn’t big.

With all due respect, why can’t you take the ambiguity point seriously?

My entire experience with Scala is in large organizations. Acquiring Scala talent was always problematic, so we often looked for developers from a different background who were willing to learn. From my experience, ambiguity has often been a challenge in teaching, particularly when learning on the job rather than in a traditional university setting.

It has been mentioned before, but I would like to reiterate that whenever you see the square bracket symbol, you know you’re dealing with types. With this SIP that intuition is disrupted, and mental overhead is required to learn another exception. This doesn’t make learning and teaching easier.
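
To make the intuition concrete, here is a small illustration; the last part uses the proposed syntax and is shown as a comment because it is not current Scala:

// Today, square brackets in term position always carry type arguments:
val empty = List.empty[Int]        // [Int] is a type argument
val index = Map.empty[String, Int] // so is [String, Int]

// Under the proposal, brackets in term position could also be a value
// (hypothetical syntax):
// val xs: Seq[Int] = [1, 2, 3]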

10 Likes

TBH I don’t find these examples convincing.

  1. Mill: My visual chunking ability is much more bothered by the object ... extends build.WhateverModule thingies than by Seq(...). Even that can be reduced in user space with a def s(...) in the Mill library, which would achieve the same readability benefits (a rough sketch follows after this list).
  2. JSON: I don’t think you can make that one compile. You cannot both support Arr and Obj depending on the type of the elements and propagate the expected types of the values down recursively. I’d like to see a PoC of that being able to compile with a user-space def v(...) that can recursively handle both arrays and objects.
  3. OS-Lib: This one feels like call should take a String*, not a Seq[String]; that would let me write os.call("curl", "www.google.com"), which is even better.
  4. Requests: OK, I’ll take that one as a small win. (The argument I gave for Mill is not applicable here because it is not in a DSL environment where we can afford to introduce a def m(...).)
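
The helper alluded to in point 1 would be something like the following rough sketch; the name s and the idea of shipping it with Mill are hypothetical:

// Hypothetical user-space shorthand; nothing like this ships with Mill today.
def s[A](items: A*): Seq[A] = items

// A module definition would then read
//   def moduleDeps = s(client, api)
// instead of
//   def moduleDeps = Seq(client, api)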
7 Likes

It cannot take varargs because we need to be able to do

os.call("curl", "www.google.com", cwd = blah)

If we could allow such a syntax (as Python does), the OS-Lib example would indeed be irrelevant, but right now we don’t, so we cannot use varargs and do need some kind of wrapper for the first parameter.

Previously we also did

os.proc("curl", "www.google.com").call(cwd = blah)

Which works, but again is inferior: I don’t actually want a fluent chained syntax, I just want a concise way to pass a list of tokens together with some optional keyword params after it.

This is definitely an option. I mentioned in the original thread that R builds its collections via c(...). But I do think that having a name is worse than not having one if that name is meaningless, which is the case here. It’s like why _.foo can be better than x => x.foo: the name does nothing to add clarity and purely adds verbosity and obfuscation, if the meaning is obvious from context (which differs on a case-by-case basis).

1 Like

It feels like it should be doable if everything goes through a single ujson.Value constructor that has overloaded def apply(args: ujson.Value*) and def apply(args: (String, ujson.Value)*) with appropriate implicit constructors or typeclasses. But it would definitely take some experimentation to see if we can make it work recursively and with overloads.
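
A minimal sketch of that shape, assuming a toy Json ADT and a hypothetical V constructor in place of the real ujson types; a DummyImplicit parameter is one way to sidestep the erasure clash between the two vararg overloads:

// Toy stand-ins, not the actual uPickle/ujson API.
sealed trait Json
case class Str(s: String) extends Json
case class Bool(b: Boolean) extends Json
case class Arr(items: Seq[Json]) extends Json
case class Obj(fields: Seq[(String, Json)]) extends Json

object V:
  // array shape: V(a, b, c)
  def apply(items: Json*): Json = Arr(items)
  // object shape: V("k" -> v, ...); DummyImplicit avoids the erasure clash
  def apply(fields: (String, Json)*)(using DummyImplicit): Json = Obj(fields)

val plugin  = V("transform" -> Str("typescript-transform-paths"))
val plugins = V(plugin, V(
  "transform"         -> Str("typescript-transform-paths"),
  "afterDeclarations" -> Bool(true)
))

Whether this can propagate expected element types recursively, without explicit Str/Bool wrappers at the leaves, is exactly the open question above.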

See Syntax highlighting inside custom string; it would read even better as:

val json = json"""
{
  "declarationMap": true,
  "esModuleInterop": true,
  "baseUrl": ".",
  "rootDir": "typescript",
  "declaration": true,
  "outDir": ${pubBundledOut()},
  "plugins": [
    {"transform": "typescript-transform-paths"},
    {
      "transform": "typescript-transform-paths",
      "afterDeclarations": true
    }
  ],
  "moduleResolution": "node",
  "module": "CommonJS",
  "target": "ES2020"
}
"""
2 Likes

We can follow the syntax of Dart, which I think is nearly perfect.

1 Like

I think we can all agree that representing JSON is a good benchmark for the proposed notation. It’s ubiquitous, everybody knows it, and it is in a sense the paradigmatic notation for data definitions.

With named tuples, we have a great way to express this. I have taken Haoyi’s UPickle example and added two lines to show more uses of sequences. Here’s how we can write it now:

val json = (
  declarationMap = true,
  esModuleInterop = true,
  baseUrl = ".",
  rootDir = "typescript",
  declaration = true,
  outDir = pubBundledOut,
  deps = [junitInterface, commonsIo],
  plugins  = [
    ( transform = "typescript-transform-paths" ),
    ( transform = "typescript-transform-paths",
      afterDeclarations = true
    )
  ],
  aliases = ["someValue", "some-value", "a value"],
  moduleResolution = "node",
  module = "CommonJS",
  target = "ES2020"
)

This is really nice! It’s actually nicer than the original JSON. I think with this, Scala further strengthens its traditional niche: its notation is as lightweight as that of dynamically typed languages while at the same time being statically typed. That, in fact, was the original push behind Scala’s adoption. We seem to have lost that priority today, alas.

Now, if we did not have sequence literals, it gets less nice:

val json = (
  declarationMap = true,
  esModuleInterop = true,
  baseUrl = ".",
  rootDir = "typescript",
  declaration = true,
  outDir = pubBundledOut,
  deps = Seq(junitInterface, commonsIo),
  plugins  = Seq(
    ( transform = "typescript-transform-paths" ),
    ( transform = "typescript-transform-paths",
      afterDeclarations = true
    )
  ),
  aliases = Seq("someValue", "some-value", "a value"),
  moduleResolution = "node",
  module = "CommonJS",
  target = "ES2020"
)

Now the three Seqs stick out like a sore thumb. (Why? Because they are the only named elements that do not come from the data model.) And the next data definition you read might use List or Array instead of Seq, so it’s not as if you can count on getting used to this eventually.

Alternatives: Prefix letter and parens don’t cut it. JSON string interpolators don’t cut it either. We need something that is obvious and that integrates naturally in the rest of the language. With named tuples and [...] we have the ideal solution.

To be sure, adding sequence literals is no big deal. No big effort to add them, no catastrophic loss if they are absent. But it would be really nice and remove a Scala weirdness compared to other languages that is there for no reason. If JavaScript, Python and Haskell agree on a notation, then we need to have very good arguments why this somehow would not work in Scala.

To pick a comparison with another form of literal: We do write -1 in Scala, and yes, it does complicate the language since you could as well write 1.minus which would avoid the irregularity of prefix operators (which are actually harder to parse than collection literals!). But of course we don’t do that since everybody else uses prefix “-” so we do, too. The trade-offs for collection literals are very similar.

About alternative ways to write things: You will use a collection literal if you don’t care about the type of collection or it has been decided for you (e.g. by a formal parameter type). If you do want to be explicit you write the type. It’s simply a shorthand that makes sense.
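
A hypothetical illustration of that rule, assuming the draft PR’s behavior of defaulting to Seq when there is no expected type (written as comments, since the syntax is not current Scala):

// Proposed syntax, not current Scala:
// val xs = [1, 2, 3]               // no expected type: falls back to Seq[Int]
// val ys: Vector[Int] = [1, 2, 3]  // the expected type decides: Vector[Int]
// def volume(dims: List[Int]) = dims.product
// volume([2, 3, 4])                // the formal parameter type decides: List[Int]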

2 Likes

I don’t think you can make that JSON code compile, for the same reason I outlined above.

Also you have no way to write an empty JSON object with this proposal, even if you manage to make the above example compile.

2 Likes

Here’s the full example, with mocks for referenced data.

import language.experimental.namedTuples
import language.experimental.collectionLiterals

val json = (
  declarationMap = true,
  esModuleInterop = true,
  baseUrl = ".",
  rootDir = "typescript",
  declaration = true,
  outDir = pubBundledOut,
  deps = [junitInterface, commonsIo],
  plugins  = [
    ( transform = "typescript-transform-paths" ),
    ( transform = "typescript-transform-paths",
      afterDeclarations = true
    )
  ],
  aliases = ["someValue", "some-value", "a value"],
  moduleResolution = "node",
  module = "CommonJS",
  target = "ES2020"
)

class Dep
val junitInterface = Dep()
val commonsIo = Dep()

class Dir
val pubBundledOut = Dir()

This compiles with the mentioned draft PR and gives json the type

   val json:
      (declarationMap : Boolean, esModuleInterop : Boolean, baseUrl : String,
        rootDir : String, declaration : Boolean, outDir : Dir, deps : Seq[Dep],
        plugins : Seq[NamedTuple.AnyNamedTuple], aliases : Seq[String],
        moduleResolution : String, module : String, target : String)

About the empty JSON object if you need it: Anything will do, why not ()?

This part makes it completely useless. If you lose the type of named tuples, you also actually lose the keys. You cannot recover them, so you have absolutely not created a usable JSON data structure.

5 Likes

There’s actually one way this can be useful, which was discussed in the original thread: inferring the apply() call and using named tuple syntax as anonymous constructors when the target type is a case class. In such scenarios, you already know the type, and won’t need to worry about losing keys.

That’s not quite the same feature as the collection literals discussed here, but they serve the same purpose: to allow the compiler to infer the nominal type in contexts where a known type is expected, thus allowing the user to specify the structure of the data necessary to construct that type, without needing to redundantly specify the type itself (because it can be inferred from the target type).
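
For readers who missed the original thread, the idea looks roughly like this (hypothetical syntax, shown as comments because current Scala does not accept it):

// Hypothetical: a named tuple adapted to the target case class's constructor.
// case class Server(host: String, port: Int)
// def connect(server: Server): Unit = ???
// connect((host = "localhost", port = 8080))  // inferred as Server("localhost", 8080)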

@odersky was notably against this shorthand for case classes in the original discussion, but it appears we have come full circle back to it!

2 Likes

That’s a good point. Let me think of how we can rescue it. The notation is too nice to just throw away. Maybe use named tuples as a way to write map literals? Or else map them to case classes, but that presupposes we have a schema.

1 Like