Proposal to add top-level definitions (and replace package objects)

lihaoyi · March 23, 2019, 3:22am

I have mentioned this in an earlier discussion, but since we are now allowing side-effecting top-level vals, it only makes sense to allow side-effecting top-level statements:

package p
println("Hello World!")

This does several things:

Cleans up the syntax for something that is already allowed via val meaninglessName = println("Hello World"), just without the meaningless names. This is a big wart with the meaningless _ <- foo syntax in for-comprehensions, for example.
Converges *.scala file syntax closer towards Ammonite/Script/Worksheet/Mill *.sc syntax or SBT’s *.sbt syntax
Converges the REPL and *.scala syntax
Makes Scala easier to learn for novices: you can just write code and put it into a file, and run it. Just like Python, which is in many cases the gold-standard for ease-of-getting-started for programming beginners

Note that “novices” doesn’t just mean students: there is a large class of data scientists, analysts, system admins, devops, mathematicians, mechanical engineers, and others who would fall under the category of “professional novices”. These are people who use programming to supplement their main job but have neither the need nor desire to become experts in it. These are the sorts of people who would benefit greatly from being able to “just write code and run it”

Even within the existing Scala community, there is tons of evidence for existing demand for such a syntax: SBT, Mill, IntelliJ Worksheets, Scala-Fiddle, Scastie, Jupyter-Scala, and others have each re-invented their own flavor of it. Clearly, this is something which a lot of people want.

We have already had discussions about how to simplify the main method and make Scala less boilerplatey and easier to pick up for absolute beginners, who may be confused by def main(args: Array[String]): Unit = {...}. Regardless of what elegant syntax people can come up with, no-syntax-at-all beats everything in terms of boilerplate (zero) and things to learn (zero), not to mention similarity with other getting-started languages (Python, Ruby, Javascript, …). And we get this essentially for free, by removing an arbitrary restriction that allows vals but not free-floating expressions/statements at the top of a file.

If we hope for Scala to be as approachable as the other languages that currently dominate the ease-of-learning, getting-started, and professional-novice scene, we should do what they do wherever it makes sense, and this is one such case. Here, the fact that it simplifies Scala itself in multiple different ways, narrowing the arbitrary differences in Scala syntax seen in different contexts, is just a (huge) bonus.

MarkCLewis · March 23, 2019, 4:01am

I am glad you mentioned this. The simplicity for beginners is one of the things that I thought of first with this proposal. Right now our CS1 uses the Scala scripting capability. I expect that Scala 3 could easily retain that capability, but if it weren’t needed because that type of capability was provided by the standard language, that would work equally well for beginners. Granted, this model inevitably requires a compile followed by a run instead of being run like a script, but I’m not actually opposed to that. There is educational value in seeing the separation between compilation and execution, and running scripts in any language blurs that distinction for the novice.

nafg · March 24, 2019, 1:59am

I agree that if we are going to allow strict top-level definitions, like var and regular val with arbitrary initializers, then we may as well allow arbitrary code at the top level.

If allowing arbitrary top-level code to run on program startup is not part of this proposal, then you would have to require top-level vals to be lazy val, and disallow top-level var.

odersky · March 25, 2019, 9:53am

Allowing arbitrary statements as toplevel definitions and supporting toplevel sourcefiles as programs looks very attractive.

I believe there’s no big issue with allowing statements as toplevel definitions. The problem that we have to explain when side effects happen (i.e. when someone references a definition in the same file) is already present for side-effecting value definitions.

We’d need one more tweak. A toplevel object implicitly defined by src.scala is named src$package. But we surely want to run it using scala src, not scala src$package. This could be achieved by tweaking the scala runner script.

sjrd · March 25, 2019, 10:03am

I believe allowing top-level statements would set the wrong expectations, that those statements are executed at the beginning of the program, or somehow “automatically”. But that won’t be the case; they will only be executed once we touch a val, var or def defined in that file. That would be very hard to explain, and even so the normal expectations would not be met (another wart, or put otherwise: “how do we teach this?”).

For the side effects in the rhs of val and var definitions, I am not so worried. It’s relatively easier to convince people that those would only be executed when the definition or one of its siblings is accessed. Also, I think the problem of naive expectations doesn’t happen as easily with those, because usually we don’t put side effects in the rhs of non-local vals and vars so often.
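To make the timing concrete, here is a minimal sketch that models the per-file wrapper with an explicit object. The names InitOrderDemo, FileWrapper and Helper are made up for illustration, and the wrapper that the compiler actually generates may differ in detail. The sketch relies only on the fact that a Scala object is initialized on first access:

```scala
import scala.collection.mutable.ListBuffer

object InitOrderDemo {
  // Records side effects so we can see when they happen.
  val events: ListBuffer[String] = ListBuffer.empty[String]

  // Hypothetical stand-in for the synthetic object that would hold
  // a file's top-level vals, vars, defs (and statements).
  object FileWrapper {
    events += "top-level statement ran" // plays the role of a top-level statement
    val answer: Int = 42
  }

  // A class from the same file is not a member of the wrapper,
  // so using it does not initialize the wrapper.
  class Helper { def id: Int = 1 }

  def main(args: Array[String]): Unit = {
    val h = new Helper
    println(s"after using Helper (${h.id}): ${events.toList}")
    val a = FileWrapper.answer // first access initializes FileWrapper
    println(s"after reading answer ($a): ${events.toList}")
  }
}
```

In this model the recorded side effect appears only after FileWrapper.answer is read, not when Helper is instantiated.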

lihaoyi · March 25, 2019, 11:07am

@sjrd you raise a good point. Unlike Python (or Ammonite) which would trigger top-level code any time the module is imported, Scala would only trigger them when a top level val/var/def is referenced, but not when top level class/object/types are referenced. That is surprising.

Presumably this surprisingness is already present in package objects, but those are uncommon and used much less than we expect top-level definitions to be.

There is also the question of, given we want to use this top-level code as program entrypoints, how do we change the various scala runners to specify which top level code to run? These top-level code blocks basically become main methods, and will need to be specifiable in scala, SBT, Mill, and so on.

Perhaps we could consider a slightly more limited scope:

Top-level statements can only be used in *.sc files; these are picked up by the Scala compiler similar to *.scala files
*.sc files automatically generate a Java-compatible main method with the name of the class being the name of the file e.g. Foo.sc generates a class Foo with a main method (perhaps mangled in some way to avoid collisions?)
We ban top-level vars and vals within *.scala files, as @nafg suggested. It’s not the end of the world to label the vals with lazy to get more predictable initialization semantics, and top-level mutable state is rare enough that the boilerplate of stuffing it in an object is no big deal.

This would have the following consequences:

Standalone *.sc files become code that people can run via scala (this is already possible), or via alternate runners like amm (to the extent that they are compatible, which they mostly are)
*.sc files can also serve as entrypoints to larger applications, with the benefit that the entrypoint of a large codebase can trivially be seen from the filesystem without needing to dig through individual files to hunt for def main methods (or extends App, …). Essentially, you could start off with a standalone script, and as it grows seamlessly incorporate it into a multi-file project with a proper build tool by adding *.scala files.
*.scala “library” files maintain their current “statelessness”: you cannot accidentally trigger a top-level side effect when dealing with a *.scala file, only by calling their defined functions, instantiating their classes or referencing their (lazy) objects or lazy vals. This also follows the best practice in other languages which allow top-level code, which generally discourage you from having top-level side effecting code in any imported “library” files and only use top-level code in the application entrypoint

Essentially, we would take the convenient “just run code” part of scripting languages, while enforcing the “avoid top level code in imported library files” best practice that already exists, and avoiding any confusion about exactly when top-level code evaluates when non-entrypoint *.scala files are used.
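As a rough sketch of what such wrapping could look like, assume a hypothetical Foo.sc containing only a greeting helper and a println. One possible desugaring, with the object name taken from the file name, is shown below. This is an illustration of the idea, not a specification of what any compiler or runner generates:

```scala
// Hypothetical contents of Foo.sc:
//
//   def greeting(name: String): String = s"Hello $name!"
//   println(greeting(args.headOption.getOrElse("World")))
//
// One possible wrapping into a Java-compatible entry point:
object Foo {
  // Definitions from the script become members of the wrapper.
  def greeting(name: String): String = s"Hello $name!"

  // The script's statements become the body of main, so they run
  // exactly once, when the program starts.
  def main(args: Array[String]): Unit = {
    println(greeting(args.headOption.getOrElse("World")))
  }
}
```

Whether script definitions should become members of the wrapper or locals inside main is a design choice this sketch does not settle.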

The “seamlessly go from one-file script to multi-file project with build tool” would be a nice experience to people used to Python’s “just import helper code” style of growing out their initial scripts. SBT would already support it (since it allows Scala files in the project root), and Mill and even Ammonite’s script runner could be similarly tweaked to conform to such a "*.sc is entrypoint, *.scala is library" convention with the limitations described above

In this world, we wouldn’t consolidate to a single Scala syntax, but at least we can get everyone to converge towards the same two *.sc/*.scala file extensions with their associated semantics.

This is the best I can come up with so far, unless we can find some way of harmonizing the behavior of top-level code in imported files with that of other languages (i.e. it runs the first time something in the file, anything, is used) to avoid the confusion Sébastien brought up.

smarter · March 25, 2019, 12:12pm

Yes, I planned to implement exactly this; we can then unify the worksheet mode, the REPL, and top-level definition files. Furthermore, we can add other features that only make sense in scripts to .sc files, like import-from-ivy.

sjrd · March 25, 2019, 1:30pm

I propose an alternative solution to the issue of writing programs, aka entry points, in

MarkCLewis · March 25, 2019, 1:45pm

I really like where this is heading. My first reaction to having a separate file extension was fear that it might cause confusion among the novice programmers I’m working with, but upon further reflection, I think that it would be less confusing because it provides a clear delineation between the two different types of files that act very differently.

One of the limitations of the current Scala scripting model is that you can’t easily mix scripts with normal Scala code, so a script has to be completely self-contained. This approach would break down that barrier and allow a smoother transition from scripting to writing full applications in Scala.

RichType · March 25, 2019, 2:05pm

I’ve long felt that Scala lacks a distinction between an immutable value and a compile-time constant or literal. So it would be desirable to have top-level constants or literals. It would also be desirable to have top-level literals for compound deep value types:

lit pi: Double = 3.14159265358979
lit specialPoint: Vec2 = Vec2(2.435, -0.985)

nafg · March 25, 2019, 5:39pm

I’m not sure what you’re referring to. If you write final val it is considered a compile time constant, IIUC. But what differentiation are you looking for?

yangbo · March 25, 2019, 5:49pm

Currently implicit priority can be defined by inheritance (e.g. https://github.com/scala/scala/blob/v2.12.8/src/library/scala/math/Ordering.scala#L145)

How would one include multiple implicit methods with different priorities in a package without relying on inheritance in the package object?
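For readers unfamiliar with the pattern, here is a minimal sketch of inheritance-based implicit prioritization. The names Show, LowPriorityShow and PriorityDemo are invented for this example and are not taken from the linked Ordering source:

```scala
trait Show[A] { def show(a: A): String }

// Fallback instances live in a parent trait, which gives them lower priority.
trait LowPriorityShow {
  implicit def anyShow[A]: Show[A] = new Show[A] {
    def show(a: A): String = "any"
  }
}

// Instances defined directly in the companion win over inherited ones
// when both are applicable.
object Show extends LowPriorityShow {
  implicit val intShow: Show[Int] = new Show[Int] {
    def show(a: Int): String = "int"
  }
}

object PriorityDemo {
  def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

  def main(args: Array[String]): Unit = {
    println(describe(1))   // both instances apply; intShow is preferred
    println(describe("x")) // only the fallback applies
  }
}
```

With package objects, the same layering is done by having the package object extend the low-priority trait, which is the capability the question above is asking about.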

yangbo · March 25, 2019, 5:57pm

There is compile time constant:

Welcome to Scala 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202-ea).
Type in expressions for evaluation. Or try :help.

scala> final val pi = 3.14159265358979
pi: Double(3.14159265358979) = 3.14159265358979

Note the type of pi is the literal type 3.14159265358979, instead of Double.

RichType · March 25, 2019, 7:33pm

So “final val” is a literal for JVM value types? Presumably including Int, Boolean and String? Do you need to add the “final” keyword to the val for final classes and singleton objects?

final val r: Int = scala.util.Random.nextInt() // Presumably this is not a literal.

final val s: Int = 2 + 2 //Is this a literal?

yangbo · March 25, 2019, 7:36pm

You have to delete the type annotation : Int in order to infer it as a literal type.
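A small sketch of the difference, assuming Scala 2.13 or later (where literal types such as 4 can be written in source; the 2.12 REPL above can only display them). The object name ConstDemo is made up for illustration:

```scala
object ConstDemo {
  // No type annotation: a constant value definition with a literal type.
  final val four = 2 + 2

  // With a type annotation the declared type is plain Int.
  final val annotated: Int = 2 + 2

  // This ascription type-checks only because `four` has the literal type 4.
  val check: 4 = four

  // The analogous line for `annotated` would not compile:
  // val bad: 4 = annotated

  def main(args: Array[String]): Unit = {
    println(check + annotated)
  }
}
```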

drdozer · March 25, 2019, 9:39pm

So I think that’s a wonderful idea. I would love to have .scala files represent the library and .sc files represent the executables. This would give a really clean segue from mashing code to extracting reusable components, and it would, I think, be fairly easy to teach to students and non-expert professional users, e.g. data scientists and bioinformaticians. We don’t need to be beholden to Java. There’s no god-given requirement for us to be wedded to static main methods.

nafg · March 25, 2019, 10:24pm

Is this sort of thing – wrapping a file based on its extension – a job for the compiler? Maybe it’s a job for the build tool and/or a compiler plugin. After all, at a high level this is analogous to Twirl – if a file has a certain extension, it’s equivalent to a Scala source file under a certain transformation.

megri · March 25, 2019, 11:40pm

Wouldn’t top-level statements run when the object representing the top-level defs gets imported, like how object initialization works at the moment?

sjrd · March 26, 2019, 7:45am

Define “when imported”. It’s actually only when one of the val, var or defs (not other stuff) of the top-level of that file is accessed (not imported). That’s extremely surprising.

sjrd · March 26, 2019, 7:55am

I believe @lihaoyi’s proposal based on wrapping *.sc is technically sound.

That said, I also think it’s heading in the wrong direction as far as language design goes. Yes, the syntax of top-level statements is accepted by different tools, but each such tool gives different semantics to them. Sometimes they inject special imports; sometimes they run stuff in a different way (e.g., worksheets associate results to individual statements; sbt builds interpret top-level terms as expressions and use the result of each such expression; Ammonite gives entirely custom semantics to special kinds of imports; etc.)

That would also give different top-level grammar goals based on an external factor, i.e., the file extension. ECMAScript went that route with Scripts and Modules, and the ecosystem still doesn’t know how to deal with that (see Node.js’ proposal to support ES modules, for example). I don’t think this is the way to go.

To relieve the existing tools from the non-standard syntax aspect, we could allow top-level statements in the syntactic grammar–perhaps going as far as typing them–but then reject them in a later phase of the standard compiler. Tools that want to do some magic with top-level statements can then hijack them after the regular parser (and typechecker), rather than each doing their own stuff.

But baking a main method concept with top-level statements in the standard compiler is not going to end any better, I believe. Many existing tools using top-level statements wouldn’t even be happy with that standard treatment.