This PR vibe-optimizes the Scala compiler, running codex/claude in a loop over a… few weeks, resulting in a 50-55% speedup over the course of ~100 commits.
10x warmup runs 10x measurement runs
|| Mean | Std Dev |
|---|---|---|
| Before | 8715 ms/op | +/- 183 |
| After | 5681 ms/op | +/- 137 |
- Speedup: 54%:
- Time reduction: 35%
These numbers are running on my 2021 M1 Macbook Pro, and while they vary across benchmark runs overall it seems about there: speedup ranges from 50-55% running the benchmark over and over.
While it is a large number of commits, they are each relatively localized changes and should be reviewable for correctness individually, so it should be possible to review and merge this PR with some work. Each one is typically a single micro-optimization: hoisting/sharing of computed values, caching, and other straightforward micro-optimizations.
This benchmark uses compilation of `mill-libs-javalib` (the user-facing Mill API) and its upstream dependencies (totalling 364 files 32kLOC) as the workload, rather than bootstrapping `scala-compiler`, as `mill-libs-javalib` has a very different style of Scala: lots of macros, lots of third-party libraries, etc.. The individual commits are all done by claude and contain their individual reasoning and individual (noisy) benchmarks.
All existing tests pass
## Major themes
- **Per-runId/period scalar caches** for stable predicates and per-symbol lookups (`derivesFrom`, `isStatic`, `NamedType.symbol`, `ThisType.cls`, `isBottom`, empty-GADT).
- **Identity-eq short-circuits** at map/copy/traversal entry points to skip dispatch when inputs are reference-identical.
- **Shape-based no-op skips** (frozen, empty, NonMember, NoPrefix, leaf nodes, class tycons) that bypass setup the dominant case doesn't need.
- **Inlining synthetic by-name thunks** to retire `op$proxy` frames in TypeMap/TypeAccumulator/AsSeenFromMap variance handling.
- **Cutting redundant denotation/info/symbol forces** along resolution chains via reorder, threading, and gating.
- **Iterative loops + Array indexing** replacing mutual recursion and `List.drop(n).head` walks.
- **Sharing one substitution map / adjuster per outer call** instead of allocating fresh each iteration.
- **Hash-table tuning** (prefilter, sizing, paged growth, `EqHashMap.HashedOnly`) for Uniques/WeakHashSet/EqHashMap.
- **Period-keyed caching of expensive denotation lookups** (member cache, baseType, derivedSelect, asSeenFrom).
- **Allocation reduction** via smaller eager buffers, lazy refs, stamp-based resets, and outermost-only traversals.
## For review
`mill-libs-javalib` is normally built as separate smaller modules, but this PR consolidates all of that into one big compile for benchmarking
- The first commit in this PR contains the `bench-mill-javalib/` benchmarking and optimization scripts used to perform this benchmark, which contain the majority of the lines changed (~1k lines). These scripts:
- Download the `mill-libs-javalib`'s source jar and unpack it
- Run JFR
- Nicely visualize the JFR profile data, etc..
They are throwaway vibe coded slop and can be removed before merging, since none of it is necessary for the optimizations to take effect, but I left it in the PR in case we want to keep it or some variant of it for others to pick up in future. They're written as Scala scripts, using a local Mill bootstrap script until https://github.com/scala/scala3/pull/25970 lands we cannot easily write such scripts in Scala. They were originally in Python but ported to Scala due to poor performance.
- The subsequent commits each contain a single optimization with the rationale for that optimization and measured before/after using JMH or JFR that demonstrates the improvement (total ~1000 lines).
- Each one should be self contained and reviewable on its own. This PR should be rebased/merged without squashing, as each commit contains useful information without mixing them together.
- Related commits are grouped together in the commit log, to make them easier to review in one pass, but mostly maintained as separate commits with separate profiling results and separate reasoning. Only a small number of especially overlapping commits were merged together
The best way to review this is probably to go through commit by commit in order and review the code change and commit message, leaving comments on each commit. Once done I can do a single cleanup pass to fix or remove any commits as necessary, whether code changes or benchmarking harness, and we can merge this without squashing preserving the original commits and commit messages containing reasoning and benchmarks
## How much have you relied on LLM-based tools in this contribution?
Entirely vibe coded in ralph loops, with some human judgement. The prompt used is provided at `bench-mill-javalib/prompt.md`
## How was the solution tested?
This branch was developed by hooking up `claude` or `codex` in a loop with JMH and JFR (`loop-claude.sh` and `loop-codex.sh`), and asking it to find potential performance optimizations over and over by cross-referencing the JFR profile with the source code, and validating these optimizations by looking for the expected % drop in the optimized methods JFR self/total times.
Notably, Codex seems better for this usage:
- It is much better at following instructions, e.g. Claude has trouble spawning the right number of subagents, passing correct and complete instructions to subagents, formatting the commit message correctly, ensuring all heavy lifting is delegated to subagents rather than the top-level agent, etc.
- It has a much more generous subscription quote, e.g. Claude20x's 5hr subscription can only do one iteration of `loop-*.sh`, whereas Codex20x can do 6-8 iterations
- It is much more stable: Claude regularly hangs without noticing that a subagent has finished and it can proceed, or kills subagents prematurely due to thinking an async-await-ing subagent is idle, and all sorts of other harness problems unrelated to the model itself.
Each iteration using Codex typically takes 5-10 hours running 4x parallel on my macbook, which I leave running during the day and overnight
## Profiling
JMH profiles are typically too noisy to measure the <1% drops in the total time taken for a compilation run, whereas the JFR profiles are fine grained enough that e.g. we can clearly see a method go from e.g. `0.5%` of the profile to `0.2%` and have confidence that the expected improvement materialized. In particular, JFR profiles %s do not seem heavily influenced by system load: running 1x parallel (uncontented) to 4x parallel (significantly overloaded) does not seem to significantly affect the std dev of the JFR profile %s, presumably because such system load affects the entire program equally
Each commit is profiled 5 times each time running 10 iterations (~1min of runtime) and we accept any change where the optimized methods show a reduction in %self or %total times more than their standard deviation between those 5 profiles.
Rejected commits are documented here https://github.com/scala/scala3/pull/26091 for posterity, complete with their code changes and profiling numbers and analysis. The rejected branch does see a small speedup of ~4%, but given the number of commits it is difficult to identify where that speedup comes up and whether those changes can be included in this PR. The prompt instructs the agent to review both accepted and rejected commits each iteration before coming up with proposed optimizations.
As usual, there is no easy way to regression tests performance: it can only be maintained or improved by repeated or ongoing monitoring and improvement effort going forward
Correctness is validated via existing tests Tests to make sure nothing breaks.