Implement cached build as a plugin

spockz · March 4, 2019, 9:00am

Inspired by Speedy Scala Builds with Bazel at Databricks, I decided to see whether we can speed up our build by caching the generated class files. I want the first build for a contributor to be fast, as well as builds after switching between branches.

So far I’ve managed to prevent compilation by copying back class files just before compile. This works when I delete a specific class file or delete the target directory altogether.
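A minimal sketch of that wiring, assuming sbt 1.x. The names `classCacheDir` and `restoreClasses` and the `.class-cache` location are hypothetical, and this is not the code from the repository linked below:

```scala
// build.sbt (sbt 1.x slash syntax).
// `classCacheDir` and `restoreClasses` are hypothetical names used for illustration only.
val classCacheDir  = settingKey[File]("Directory holding previously compiled class files")
val restoreClasses = taskKey[Unit]("Copy cached class files into the classes directory")

classCacheDir := baseDirectory.value / ".class-cache"

restoreClasses := {
  val cache  = classCacheDir.value
  val target = (Compile / classDirectory).value
  if (cache.exists) {
    // preserveLastModified keeps the cached files' timestamps on the copies
    IO.copyDirectory(cache, target, overwrite = false, preserveLastModified = true)
  }
}

// Run the restore step before the incremental compile task body executes.
Compile / compileIncremental := (Compile / compileIncremental).dependsOn(restoreClasses).value
```

This only puts the class files back in place. It does not restore whatever additional state sbt keeps between compiles.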

However, when I run clean, the next compile always recompiles everything, regardless of whether I restore the whole target directory or not.

Where does sbt store the extra metadata that it uses for deciding whether to recompile?

(A first and very hacky attempt to wire the caching into the compilation step is here: https://github.com/spockz/sbt-cached-build-tryout)

olafurpg · March 4, 2019, 12:38pm

You might be interested in https://github.com/romanowski/hoarder

SethTisue · March 4, 2019, 10:06pm

I’ve never delved into it myself, but (as I mentioned on Gitter) https://github.com/sbt/zinc/pull/351 (cc @jvican) looks like a good starting point for exploring how the information that supports incremental compilation is persisted between runs.

jvican · March 4, 2019, 11:32pm

In the analysis file generated by the incremental compiler. You can just export it, cache it in an S3 cluster, and retrieve it in your CI.

However, doing this robustly is not easy. You need to make sure that every pull request you want to merge is rebased onto the latest master right before it’s merged. Otherwise, there can be conflicts between changes in different pull requests that are not detected.

lihaoyi · March 5, 2019, 2:02am

At Databricks our answer was simple: we do not cache the output of incremental compiles, as the incremental compiler does have bugs and we do not sufficiently trust it. Only batch compiles are cached, but that includes most compilation runs (not everyone uses incremental), Jenkins (which rebuilds everything from scratch on every commit), and artifacts from the Mac mini sitting on my desk (which also pulls and builds everything on every commit to master, because OS X and Linux do not share caches in Bazel).

jvican · March 5, 2019, 8:58am

I agree that only caching full compiles is the best approach here, not only in Bazel but in other build tools too. CI should be guaranteed to build the code in the branch correctly; there’s no room for false positives or negatives.

@spockz, I assume you’re not a Bazel user, so what I’d do is cache the analysis files and compilation products (e.g. classes directories) generated from full compiles. This is relatively simple to do and gives you most of the speedup you can get from Bazel in most Scala projects. Note that you don’t need to mind the build tool regenerating resources/sources: the generation time is usually negligible in ordinary Scala builds, and the incremental compiler will not recompile them so long as their content hashes are the same as before.
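To illustrate what a content-hash comparison buys you, here is a minimal sketch in plain Scala that derives a key from the relative paths and contents of every file under a directory. It is not how Zinc computes its hashes; `SourceCacheKey` and `compute` are hypothetical names, and the choice of SHA-256 is an assumption:

```scala
import java.io.File
import java.nio.file.Files
import java.security.MessageDigest

// Hypothetical helper: NOT Zinc's implementation, only an illustration of content hashing.
object SourceCacheKey {

  // Recursively lists all regular files below `dir`.
  private def listFiles(dir: File): Vector[File] = {
    val children = Option(dir.listFiles()).map(_.toVector).getOrElse(Vector.empty)
    children.flatMap(f => if (f.isDirectory) listFiles(f) else Vector(f))
  }

  /** Hex-encoded SHA-256 over the relative path and content of every file under `root`. */
  def compute(root: File): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val rootPath = root.toPath
    listFiles(root)
      // Normalise separators and sort so the key does not depend on OS or listing order.
      .map(f => rootPath.relativize(f.toPath).toString.replace('\\', '/') -> f)
      .sortBy(_._1)
      .foreach { case (relative, file) =>
        digest.update(relative.getBytes("UTF-8"))
        digest.update(0.toByte) // separator between path and content
        digest.update(Files.readAllBytes(file.toPath))
      }
    digest.digest().map(b => f"$b%02x").mkString
  }
}
```

Regenerating a source file with identical bytes leaves the key unchanged, while touching only its timestamp has no effect at all, which is the property that makes regenerated sources harmless.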

If you perform this caching, you can create a plugin so that whenever developers at your company check out a different commit, it fetches the compilation outputs from the remote cache and avoids rebuilding that branch.
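The restore half of such a plugin could be sketched as follows. This is plain Scala, not an sbt plugin, and it makes several assumptions: the cache is a local directory laid out as `<cacheRoot>/<commit sha>/...` standing in for the remote store, and `CommitCache` and `restore` are hypothetical names:

```scala
import java.io.File
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper: a local directory stands in for the remote cache.
object CommitCache {

  /** Copies `cacheRoot/<commit>` into `target`, keeping file attributes such as
    * last-modified times. Returns false when there is no entry for the commit.
    */
  def restore(cacheRoot: Path, commit: String, target: Path): Boolean = {
    val entry = cacheRoot.resolve(commit)
    if (!Files.isDirectory(entry)) false
    else {
      copyTree(entry.toFile, entry, target)
      true
    }
  }

  private def copyTree(file: File, srcRoot: Path, dstRoot: Path): Unit = {
    val dst = dstRoot.resolve(srcRoot.relativize(file.toPath))
    if (file.isDirectory) {
      Files.createDirectories(dst)
      Option(file.listFiles()).getOrElse(Array.empty[File])
        .foreach(child => copyTree(child, srcRoot, dstRoot))
    } else {
      // COPY_ATTRIBUTES carries the last-modified time over to the copy.
      Files.copy(file.toPath, dst,
        StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.COPY_ATTRIBUTES)
    }
  }
}
```

The commit to look up could come from `git rev-parse HEAD`, for example via `scala.sys.process`. On a cache miss the build simply falls back to a normal compile.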

dragos · March 9, 2019, 10:18am

In case you are looking for ways to speed up your build, you may want to try Hydra. It’s a drop-in replacement that adds parallelism to the Scala compiler and integrates with most build tools out there. It supports the full Scala language, including macros and plugins.

Disclaimer: I’m one of the founders.

spockz · March 9, 2019, 1:00pm

CPU utilisation is fine in our project because we have many small separate modules, so I suppose this would not help that much for verifying PRs. It might help a bit with speeding up incremental compilation of the few files touched during development, so I’ll give it a try.

That plug-in does what I tried to achieve. I’m not entirely sure whether it only caches full builds or also incremental builds. There is also some code for storing the cache in S3 but I haven’t managed to get it to work yet. It also has an issue with our multi-module build. I’m going to try and work with the maintainer to see how we can get the docs up to snuff.

Indeed, I’m using sbt and want to get caching for the full builds. How do I wrap the full build task? I only see the compile and compileIncremental tasks.