A semi-automatic way to make the Scala compiler bootstrappable?

See the following for why we might want a bootstrapping compiler in the first place:
https://contributors.scala-lang.org/t/bootstrapping-of-the-scala-compiler

The idea is the following:
Iteratively ablate the compiler until what remains is small enough to write an interpreter for (in whatever other language).
By numbering each successive version as bootstrap(0), bootstrap(1), … bootstrap(k), interpreter, we can then bootstrap the compiler by running (pseudo-code):

interpreter_binary = bootstrapped_other_language_compiler( interpreter )

# Run the simplest compiler on itself
first_binary = interpreter_binary( bootstrap(k), bootstrap(k) )

# Get progressively more capable binaries
second_binary = first_binary( bootstrap(k-1) )
third_binary = second_binary( bootstrap(k-2) )
...
penultimate_binary = ante_penultimate_binary( bootstrap(1) )

# And one last time, with the source code of scalac x.y.z
scalac_binary = penultimate_binary( bootstrap(0) )

Each successive ablation would be done in one of two ways:

Scoverage Loop

In this loop, run the current compiler on its own source code with coverage enabled, remove all dead code, and do it again.
By “current compiler” I don’t mean “scalac x.y.z”, I means the current state of the compiler we’re modifying.

Once this process reaches a fixed point (there is no dead code to remove), make a step in the second loop:

Manual Loop

This means we should have reached a set of features that the compiler uses to compile itself.
Find the easiest feature to refactor out, and do it.

Some examples of the kind of “feature-refactoring”, replace in the current compiler:

  • All uses of given/using by explicit parameters.
  • All enums with case-classes.
  • All case-classes with regular classes

(Probably only one should be done at once, so that the scoverage loop hopefully makes removing another simpler.)

With this done, the current compiler should have more dead code for the Scoverage loop.
For example, replacing all givens/using by explicit parameters means the Scoverage loop will remove all code which handles compiling givens/using, from the parser all the way to the code generation !

And if there’s only a few features left, and/or none of the features can be refactored out, it’s time for the final phase:

Scala-- interpreter

Let’s call Scala-- the sum of language features necessary for

Write an interpreter that covers all the features used in the current compiler in a boot-strappable language, for example C.

And that’s it ! (well it’s still a lot of work, but at least its cut into manageable portions)
It’s probably a very naive plan, but it’s been bouncing around in my head for a while, and I don’t think I’ll have time to try it soon, so hopefully it will inspire someone to try it !

P.S: It might be tempting to use coding agents for one of the manual steps, I highly recommend against it: 1. it stops you from learning cool things, 2. It’s way harder to trust since LLMs have a tendency of making hard to spot mistakes, 3. It’s very probable the current models are not “smart” enough, and it will make a lot of progress at the start, and then come to a screeching halt

4 Likes

i have another idea:

  1. make scala compiler itself compilable by scala.js
  2. modify scala.js to be able to produce readable javascript with some helpful comments (e.g. expected types, visibility scope, anything that improves effectiveness of static analysis tools and/or ai agents, etc)
  3. from time to time, pay some security firm to audit the produced javascript and when all found problems are addressed, publish the javascript as the latest audited bootstrap scala compiler
2 Likes

Both @Sporarum’s and @tarsa’s ideas sound very laborious and expensive to me, especially the idea of paying a security firm to audit generated JavaScript code. As they say, the problem is that you eventually run out of other people’s money.

I think a more promising approach is to stick much closer to how Scala was originally bootstrapped.

  • As @Kordyjan pointed out in the other thread (Bootstrapping of the Scala compiler - #5 by Kordyjan), the original bootstrap compiler for Scala 2.0 was Scala 1.4. The source code for that is available in git: GitHub - scala/scala at v1.4.0+4 · GitHub
  • It is written in Pizza and compiled using the PiCo compiler. As @SethTisue has helpfully pointed out, the PiCo source code is available on GitHub these days.
    • unfortunately, it’s self-hosting, so now we would have to bootstrap Pizza… And while the Pizza compiler is available on SourceForge, it is again self-hosting, and I haven’t been able to find the pure-Java bootstrap compiler for Pizza
    • and I don’t even know if the Pizza compiler is able to compile PiCo
  • Since bootstrapping Pizza doesn’t seem viable, a different route is needed.
    • Pizza is essentially a Java dialect with a few additional features: ADTs, pattern matching, lambdas, operator overloading. Most of these have made it into modern Java by now, and hence…
    • it should be feasible to translate most of the Pizza-specific features into their modern Java equivalents
    • LLMs should make short work of that
  • Once we have a Scala 1.4 compiler written in (modern) Java, we can use it to compile Scala 2.0 and then work our way through subsequent releases

Ideally the whole process would be automated using GNU Guix, as that seems to be the most advanced ecosystem for automated bootstrapping. Needless to say it’s still an ungodly amount of work that’s unlikely to happen without funding, but it’s probably the most realistic approach, and also more faithful to Scala’s history.

2 Likes

hmm, i noticed that i made my method unnecessarily expensive by redoing the audits. in fact, starting with some scala 3.x compiler, we can request audit once for that scala 3.x version and then make it a head of a bootstrapping chain and we wouldn’t need to make audits of later scala versions if we’re only concerned about bootstrapping. so the amended idea is:

  1. make scala 3 compiler itself compilable by scala.js

  2. modify scala.js to be able to produce readable javascript with some helpful comments (e.g. expected types, visibility scope, anything that improves effectiveness of static analysis tools and/or ai agents, etc)

  3. pay some security firm to audit the produced javascript and when all found problems are addressed, publish the javascript as the initial audited bootstrap scala 3 compiler

  4. later scala 3 compiler versions can be compiled by earlier scala 3 compiler versions, going back up to the initial audited one

steps 1 and 2 are the same as previously. steps 3 and 4 are changed or added.

1 Like

well you also need to handle all the other build dependencies like sbt, zinc, etc - if you want that initial link in the chain to be able to be inserted before e.g. scala/scala3 at the 3.0.0 tag

I’m not convinced you do, really. For the purposes of bootstrapping, you don’t need all those shenanigans that build systems do, like incremental compilation and whatnot. All you need to do is compile the code once, so you can probably get away with some kind of makeshift solution like a shell script. In this case, reducing the number of dependencies is far more important than whatever functionality you can get out of sbt, zinc etc.

That said, I still think the only viable approach to bootstrap Scala is to start at a much earlier point than Scala 3. And if you do that, then maybe building sbt wouldn’t even be that big a deal.

1 Like

yeah but i thought the whole point is having an origin point that can follow through all the way if some link in the chain breaks, so sbt i dont think has a bootstrappable build either

The sbt manual states that the first public sbt release was 0.3.2 from late 2008, but I haven’t been able to find that anywhere. The earliest tag I could find in the sbt repo is 0.7.5, and that is already self-hosting. So in order to bootstrap sbt, we’d probably have to come up with a makeshift build system for that, too.

Whoever dares to pull this off has definitely earned their doctorate in software archaeology.

3 Likes

https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/simple-build-tool/sbt-0.3.2.jar

Claude under a minute.

Well, perhaps you could explain to Claude what bootstrapping means: it means compiling software without depending on binaries that haven’t themselves been either bootstrapped or hand-written. Unfortunately, downloading a pre-built sbt 0.3.2 jar isn’t bootstrapping.

The Guix project has done a lot of heavy lifting here: they can build their entire package archive while only relying on a single hand-written 357-byte program – everything else is built from source, which is a remarkable achievement.

The good news is that the simple-build-tool (as it was called at the time) source code is indeed available on the Google Code archive. It contains an archive of a Subversion repo with a bunch of tags, starting with 0.3 up to 0.5.6. And versions up to 0.4.6 are apparently built with a shell script!

LIBS=$(find lib/ -name "*.jar" -exec echo -n {}: \;)
mkdir -p target/classes && \
scalac -cp $LIBS -d target/classes/ src/main/scala/sbt/{,impl/}*.scala && \
jar -cf target/sbt.jar licenses/* LICENSE NOTICE -C src/main/resources scalac-plugin.xml -C target/classes sbt

Assuming we can get an old Scala compiler bootstrapped, this could probably be used to get started.

The bad news is that the latest tag in that archive is 0.5.6, while the earliest sbt version in the Git repository is 0.7.5 – so there might be a few versions missing in between, who knows. The other bad news is that the archive also contains several pre-built jar files like ivy, jetty, jline and others. So we’d have to either bootstrap those, too, or rip out the functionality pertaining to them. After all, the only purpose of these old sbt versions would be to use them once to build the next release :wink:

3 Likes

You asked for v0.3.2. Not a non-bootstrapped version.

This reminds me that I made a presentation long ago for school about using my Nintendo 64 to demonstrate sorting algorithms. It was pure assembly for running everything. At the end of the talk I made an annecdote about compilers and the universe. Could it be that the universe is so complex to understand because God deleted the source code of the first universe compiler and we would never have a non-bootstrapped perspective on life? :slight_smile:

I actually didn’t ask for anything, I merely stated I wasn’t able to find it. Besides, in the context of a discussion about bootstrapping sbt, I thought it was clear that the relevant artifact is source code, not compiled code. Apparently it wasn’t, my mistake.