Compiling Scala and Sbt for Debian distro

Haven’t gotten around to it yet, but yes. Do you know of other examples I can point to on the Debian mailing list?

Haskell, OCaml and more: Bootstrapping (compilers) - Wikipedia

If that was the only hurdle you had to clear, you might be fine. But it isn’t the only hurdle; the compiler was re-bootstrapped several dozen times between 2.11.6 and 2.12.9. (And then a bunch more times after that to get to 2.13.0.)

link?

Let’s explore this.

But first I’d like to mention that I checked the latest Debian releases (stable, unstable, testing).
All of them contain Scala 2.11, but no Sbt. Sbt was removed because of the infraction in bootstrapping. So I will not dwell on how Scala 2.11 got into Debian. But at least we now have the necessary tools to build Sbt the proper way in Debian (we just need help from the Sbt team to work out the command lines needed to build Sbt manually, and then script it).
So this is some good news.

Next, how was the compiler re-bootstrapped? What tools, compilers, languages… and versions were used?

There’s no point in exploring this further. I know you mean well and are trying to do good, but:

As I’ve said before, above, there is only one answer to this, there is only one path forward here, and it doesn’t involve recapitulating the bootstrap process all the way back to historic Scala versions.

If there’s some way we (the Scala compiler team) can help you make your case with the Debian people, let us know.

(Anyone interested in this may also want to follow Scala build process as well, where we’re now talking about how to bootstrap the compiler, not by going back to historic versions, but just by using the current version.)

I am a bit confused, because the documentation says the CI system performs a bootstrap when performing a full build of Scala. So your remark about “recapitulating the bootstrap process all the way back to historic Scala versions” confuses me. (Unless the CI’s version of bootstrapping means it is able to start compiling from the first version of Scala from 15 years ago.)

I don’t think we need to get into that, as it was only in reference to a hypothetical, impractical course of action we won’t be pursuing.

re: what our CI does and how the bootstrap works, that will become clearer to you over in the other thread, I hope.

Debian’s upstream guide says:

There are a number of ways to achieve the above requirements:

  • write a second implementation of your language (either as a compiler or an interpreter) in a language that is already known to meet these requirements (like C or Python)
  • define a mini version of your language and then write a compiler or interpreter for that language in another language which is known to either already meet these requirements or which has a path down to a language that does. This mini language can then be used to compile the core of the bigger compiler/interpreter which can then in turn be used to compile more in as many steps as needed.
  • identify the very early versions of your compiler/interpreter which were still built using another already existing programming language. Use this version to create a newer version of the compiler which is then in turn used to compile an even newer version. Thus, create a “treasure map” of compiler versions until you reach the current version

Having multiple compilers/interpreters for a language and being able to bootstrap the compiler from nothing is also an important requirement to allow diverse double compilation to verify the existing compiler binaries for backdoors like the Ken Thompson attack (“Reflections on Trusting Trust” 1984).
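The “treasure map” option in the quoted guide can be pictured as a chain walk. Below is a minimal Python sketch, not anything Debian or the Scala team actually uses: the version names, the `BUILT_BY` table and the `bootstrap_chain` helper are all made up for illustration. It walks from a target compiler version back to a tool that is already trusted, and fails if the map only ever points back at the compiler itself.

```python
# Hypothetical "treasure map": each compiler version maps to the tool able to build it.
# All names below are invented for illustration; they are not real release data.
BUILT_BY = {
    "lang-0.1": "C",         # earliest version, still written in an existing language
    "lang-0.5": "lang-0.1",
    "lang-1.0": "lang-0.5",
    "lang-2.0": "lang-1.0",
}
TRUSTED = {"C"}  # tools assumed to already meet the distribution's requirements


def bootstrap_chain(target, built_by, trusted):
    """Return the build order from a trusted tool up to `target`."""
    chain = [target]
    seen = {target}
    while chain[-1] not in trusted:
        current = chain[-1]
        if current not in built_by:
            raise ValueError(f"no known builder for {current}")
        builder = built_by[current]
        if builder in seen:
            # The map loops back on itself: only self-hosted builds are known.
            raise ValueError(f"cycle at {builder}: no path to a trusted tool")
        seen.add(builder)
        chain.append(builder)
    return list(reversed(chain))


print(bootstrap_chain("lang-2.0", BUILT_BY, TRUSTED))
```

The length of the returned list is the number of builds a packager would have to reproduce, which is why frequent re-bootstrapping makes this route expensive.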

Here’s some lwn.net discussion about Debian and Rust, but ultimately no information about how Rust got into Debian: Debian, Rust, and librsvg [LWN.net]. It seems there’s a (third-party?) mrustc written in C++ which is good enough to allow bootstrapping some Rust version in only ~10 steps.

The more I think about it, I actually like the requirement a lot that you must build from sources. Without that requirement the only way to verify a piece of software doesn’t contain any malware is to fully verify the binary (which might be feasible for a single version but is quite hard to do over time).

But that might also mean that Scala could never be admitted to Debian because the amount of work necessary to re-bootstrap e.g. from Java at this point seems insurmountable (which is what @smarter and @SethTisue seem to mean in contrast to the “only sane approach” which would be somehow circumventing that requirement). There might be explanations of why Haskell, OCaml and Rust might have gotten through that “insane” process nevertheless (just guessing):

  • started earlier when the language actually was bootstrapped from another language
  • simpler core languages
  • less frequent re-bootstrapping during the compiler development over the years

For Scala, any of these options would be an insane amount of work, if not impossible, especially point 3.

Rust was invented by the Mozilla team, and they are on par with the Debian team, so this was probably on their minds almost from the beginning.

Would TASTY be considered “source code” by the Debian definition? If so, then perhaps waiting for Dotty to get far enough along with TASTY could ease some of the compilation complexity. Then reimplementing something smaller and manually written to “compile TASTY” to JVM bytecode to create a one-off compiler bootstrap might be easier. Once the compiler bootstrap is complete, then getting the sbt bootstrap completed via that compiler should work, right?

Another possibility is getting Scala.js to compile Dotty down to JavaScript files, and then running the JavaScript version of the compiler in Debian to compile the chain from wherever makes sense.

There might also be a Scala Native pathway possible, too, following similar logic of either of the above speculative solutions.

And does the compiler literally use every aspect of Scala? Or can a stripped down version of the Scala language spec be created that is only what the compiler ends up using?

My goal was to “think outside the box” to see if any of the above might create some sort of simpler pathway, given everything I have seen thus far is onerous, fragile, and complex.

The Scala build includes something called the “stability test”. It uses the reference compiler (aka “starr”) to build the new compiler, then it uses the new compiler to rebuild itself, then it checks to see that the resulting compiler was the same both times.

Perhaps that would help satisfy the Debian people? Instead of a single-stage process where you use the reference compiler to build the new compiler directly, a two-stage process where the reference compiler builds an intermediate compiler, and the intermediate compiler compiles itself to build the final compiler?
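To make the two-stage idea concrete, here is a minimal Python sketch of the check as described above. It is not the actual Scala build: `stability_check` and the two toy `build` functions are hypothetical stand-ins, and a "compiler" is just a byte string.

```python
import hashlib


def stability_check(build, reference_compiler, source):
    """Two-stage build: reference -> stage 1, stage 1 -> stage 2, then compare digests."""
    stage1 = build(reference_compiler, source)
    stage2 = build(stage1, source)
    d1 = hashlib.sha256(stage1).hexdigest()
    d2 = hashlib.sha256(stage2).hexdigest()
    return d1 == d2, d1, d2


# Toy stand-in for a real build: the output depends only on the source,
# which is the property the stability test looks for.
def deterministic_build(compiler, source):
    return b"compiled:" + source


# Toy build whose output leaks the identity of the compiler that produced it,
# so stage 1 and stage 2 come out different.
def leaky_build(compiler, source):
    return b"compiled-by:" + hashlib.sha256(compiler).digest() + source


stable, _, _ = stability_check(deterministic_build, b"starr", b"new compiler sources")
unstable, _, _ = stability_check(leaky_build, b"starr", b"new compiler sources")
print(stable, unstable)
```

Note that a passing check only says the build has reached a fixed point; it says nothing about where the reference compiler came from.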

Debian’s primary focus in this matter is political: source code should be free and hence must not rely on any external dependencies to produce its results. Hence code in a source package must be human-readable and available for anyone to compile by themselves, and Debian requires that all tools used to compile the software must themselves have followed the same principle, i.e. only tools that can be promoted into Debian testing or stable can be used. It’s a pain, but it ensures that all code is mostly written from scratch with GPL license and built with GPL licensed tools on Debian build infrastructure.

Freely available source code fosters sharing of knowledge, and hence must not be impeded by dependencies or binaries that are secret to others or have limitations on usage. That in itself gives a lot of secondary bonuses, such as being able to review the code for different purposes (security, education, bug finding, etc.), which lends itself nicely to transparency and trust.

For recreational purposes I tried compiling the 2.12.9 compiler with 2.11.6 and 2.10.7. The changes needed were mostly trivial with some fun puzzlers, but in the end it worked and produced the same artifacts as the original sources when the preliminary compiler was then used to compile itself. That was a fun exercise that doesn’t show much: doing a complete bootstrap still seems far out of reach (e.g. the next step would be compiling with 2.9.3, but how do you run that at all, since the classfiles don’t even run on Java 8 and it’s hard just to find a version of JDK 6 that still works).
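How the artifacts were compared is not spelled out here. One possible way, sketched in Python under the assumption that each build leaves its class files in a directory tree (the helper names `digest_tree` and `diff_trees` are hypothetical), is to hash every file and list the paths that differ:

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_trees(a, b):
    """Return the sorted relative paths that are missing on one side or differ in content."""
    da, db = digest_tree(a), digest_tree(b)
    return sorted(k for k in da.keys() | db.keys() if da.get(k) != db.get(k))
```

An empty result from `diff_trees` means the two trees are byte-for-byte identical. Comparing unpacked class files rather than jars is usually the safer choice, because zip entries carry modification timestamps that can differ between otherwise identical builds.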

I guess that won’t help, because reaching a fixed point is neither required nor sufficient, e.g. to exclude the possibility of compiler malware that propagates itself.

As I mentioned before, I like the requirement of having to build from sources in principle. But for a self-hosted compiler that was re-bootstrapped dozens of times, establishing the full chain of trust for the resulting binary from first principles would mean trusting all the source code of all the intermediate steps. In that case, it seems less onerous to include a single binary (that would have to be reverse engineered to be analyzed) than the full source code of hundreds of versions of intermediate compilers that would have to be analyzed instead.

Actually, it might give us the opportunity to compile 2.12.9 directly, using 2.11.12 (which is in Debian already). Perhaps we don’t have to build all the in-between versions? How did you compare the artifacts produced? I don’t know if the same can be done with 2.13, but 2.12.9 would be a start…

Thomas