Autobench the stdlib with LLM

hepin1989 · May 30, 2026, 7:01am

Performance trap in the standard library -- min() of a TreeMap

opened 06:41AM - 27 May 26 UTC

itype:performance area:library

## Compiler version

3.8.3

## Minimized code

Just discovered a huge performance bug in my code base that has been there a while. I am not sure I would call it a bug in the standard library, but it is at the very least a trap, and if I had to guess, many people are suffering from it and don't even know it. I thought I would bring it to your attention, just in case something can be done about it (in all likelihood you have seen it before).

I was using a `TreeMap[K, V]` in some performance-critical code and I was trying to find the min value in this Map. What is more natural than to call the `min` function? Disaster! `min` on a TreeMap is `O(n)` and not `O(log(n))` as one would naively assume. The linear implementation gets inherited from `IterableOnceOps` and it's not being overridden in TreeMap, presumably because it can't be. This is the signature:

```scala
def min[B >: A](implicit ord: Ordering[B]): A
```

Here `A` is the pair `(K, V)` and `B` is a supertype of it, so we have no idea whether this ordering respects the ordering on `K` that the Map was originally built with. Nightmare!

To make things more confusing, `min` on a `TreeSet[K]` works as expected (in `O(log(n))` time) because there you can actually check if the orderings are the same.

```scala
override def min[A1 >: A](implicit ord: Ordering[A1]): A = {
  if ((ord eq ordering) && nonEmpty) {
    head
  } else {
    super.min(using ord)
  }
}
```

After searching I found the other functions `treeMap.head` and `treeMap.firstKey`, but nobody would even look for those after they typed `min` and the code compiled.

Here is some benchmark code that shows the slowdown one would not expect:

```scala
package scratch

import scala.collection.immutable.TreeMap
import scala.util.Random

object TreeMapMinBench:
  private final val N: Int = 100

  @main def runBench(): Unit =
    val size = 100_000
    println(s"Generating TreeMap with $size elements...")

    // Generate random entries
    val random = new Random(42)
    val pairs = (1 to size).map(_ => (random.nextLong(), s"value-${random.nextInt(100)}")).toList
    val treeMap: TreeMap[Long, String] = TreeMap.from(pairs)
    println("TreeMap generated successfully.")
    println("--------------------------------------------------")

    // Warm up the JVM thoroughly for both execution paths (10,000 iterations for JIT compilation)
    println("Warming up JVM with 10,000 iterations for both execution paths...")
    (1 to 10000).foreach: i =>
      val (x0, y0) = treeMap.head
      val (x1, y1) = treeMap.min
      if (i == 20_000) // Never gets executed -- just here so the compilers don't delete this code and the warmup fails...
        printf("%d, %s, %d, %s", x0, y0, x1, y1)

    // Benchmark 1: treeMap.min
    println(s"Running treeMap.min (O(N) full traversal) $N times...")
    val startMin = System.nanoTime()
    var minTuple: (Long, String) = treeMap.head
    var minSum: Long = 0L
    var i = 0
    while i < N do
      minTuple = treeMap.min
      minSum += minTuple._1
      i += 1
    val durationMin = ((System.nanoTime() - startMin) / 1_000_000.0) / N // to ms
    println(s"treeMap.min result: $minTuple (sum of keys: $minSum)")
    println(s"treeMap.min took: $durationMin ms")
    println("--------------------------------------------------")

    // Benchmark 2: treeMap.head
    println(s"Running treeMap.head (O(1)/O(log N) direct access) $N times...")
    val startHead = System.nanoTime()
    var headTuple: (Long, String) = treeMap.head
    var headSum: Long = 0L
    var j = 0
    while j < N do
      headTuple = treeMap.head
      headSum += headTuple._1
      j += 1
    val durationHead = ((System.nanoTime() - startHead) / 1_000_000.0) / N // to ms
    println(s"treeMap.head result: $headTuple (sum of keys: $headSum)")
    println(s"treeMap.head took: $durationHead ms")
    println("--------------------------------------------------")

    val speedup = durationMin / durationHead
    println(s"treeMap.head is approximately ${"%.2f".format(speedup)}x FASTER than treeMap.min!")
  end runBench
end TreeMapMinBench
```

## Output

```scala
Generating TreeMap with 100000 elements...
TreeMap generated successfully.
--------------------------------------------------
Warming up JVM with 10,000 iterations for both execution paths...
Running treeMap.min (O(N) full traversal) 100 times...
treeMap.min result: (-9223232542084064297,value-12) (sum of keys: 13949477071151100)
treeMap.min took: 1.477784 ms
--------------------------------------------------
Running treeMap.head (O(1)/O(log N) direct access) 100 times...
treeMap.head result: (-9223232542084064297,value-12) (sum of keys: 13949477071151100)
treeMap.head took: 7.7E-5 ms
--------------------------------------------------
treeMap.head is approximately 19192.00x FASTER than treeMap.min!
```

## Expectation
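To make the ordering problem in the issue concrete, here is a minimal sketch (not the standard library's code). It builds one `TreeMap` with the default key ordering and one with a reversed key ordering, then compares `min` with `head`. The names `TreeMapMinSketch`, `minEntry` and `byMapKey` are made up for illustration, and `minEntry` is only one possible shape for a helper that avoids the linear scan.

```scala
import scala.collection.immutable.TreeMap

object TreeMapMinSketch:
  val entries: List[(Int, String)] = List(3 -> "c", 1 -> "a", 2 -> "b")

  // Default key ordering: agrees with the default Ordering[(Int, String)],
  // which compares keys first.
  val ascending: TreeMap[Int, String] = TreeMap.from(entries)

  // Reversed key ordering: the map's own order now disagrees with the
  // default tuple ordering that a plain `min` call picks up.
  val descending: TreeMap[Int, String] =
    TreeMap.from(entries)(using Ordering[Int].reverse)

  // Tuple ordering derived from the map's own key ordering.
  val byMapKey: Ordering[(Int, String)] =
    Ordering.by[(Int, String), Int](_._1)(using descending.ordering)

  // Hypothetical helper (not in the standard library): first entry in the
  // map's own key order, without scanning every element.
  def minEntry[K, V](m: TreeMap[K, V]): Option[(K, V)] =
    if m.isEmpty then None else Some(m.head)

@main def treeMapMinSketch(): Unit =
  import TreeMapMinSketch.*
  println(ascending.min == ascending.head) // true: the orderings agree
  println(descending.head)                 // (3,c): first in the map's order
  println(descending.min)                  // (1,a): default tuple ordering
  println(descending.min(using byMapKey))  // (3,c): same ordering as the map
  println(minEntry(descending))            // Some((3,c))
```

With the reversed map, `min` and `head` return different entries, which is why a `TreeMap` override of `min` could not simply return `head` without knowing how the supplied tuple ordering relates to the key ordering.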

github.com/scala/scala3

[WIP] Speed up the Scala compiler on the `mill-libs-javalib` codebase by 50-55%

main ← lihaoyi:perf

opened 08:24AM - 09 May 26 UTC

lihaoyi

+8412 -1319

This PR vibe-optimizes the Scala compiler, running codex/claude in a loop over a few weeks, resulting in a 50-55% speedup over the course of ~100 commits.

10x warmup runs, 10x measurement runs

| | Mean | Std Dev |
|---|---|---|
| Before | 8715 ms/op | +/- 183 |
| After | 5681 ms/op | +/- 137 |

- Speedup: 54%
- Time reduction: 35%

These numbers are from my 2021 M1 Macbook Pro, and while they vary across benchmark runs, overall it seems about there: the speedup ranges from 50-55% running the benchmark over and over.

While it is a large number of commits, they are each relatively localized changes and should be reviewable for correctness individually, so it should be possible to review and merge this PR with some work. Each one is typically a single micro-optimization: hoisting/sharing of computed values, caching, and other straightforward micro-optimizations.

This benchmark uses compilation of `mill-libs-javalib` (the user-facing Mill API) and its upstream dependencies (totalling 364 files, 32 kLOC) as the workload, rather than bootstrapping `scala-compiler`, as `mill-libs-javalib` has a very different style of Scala: lots of macros, lots of third-party libraries, etc.

The individual commits are all done by claude and contain their individual reasoning and individual (noisy) benchmarks. All existing tests pass.

## Major themes

- **Per-runId/period scalar caches** for stable predicates and per-symbol lookups (`derivesFrom`, `isStatic`, `NamedType.symbol`, `ThisType.cls`, `isBottom`, empty-GADT).
- **Identity-eq short-circuits** at map/copy/traversal entry points to skip dispatch when inputs are reference-identical.
- **Shape-based no-op skips** (frozen, empty, NonMember, NoPrefix, leaf nodes, class tycons) that bypass setup the dominant case doesn't need.
- **Inlining synthetic by-name thunks** to retire `op$proxy` frames in TypeMap/TypeAccumulator/AsSeenFromMap variance handling.
- **Cutting redundant denotation/info/symbol forces** along resolution chains via reorder, threading, and gating.
- **Iterative loops + Array indexing** replacing mutual recursion and `List.drop(n).head` walks.
- **Sharing one substitution map / adjuster per outer call** instead of allocating fresh each iteration.
- **Hash-table tuning** (prefilter, sizing, paged growth, `EqHashMap.HashedOnly`) for Uniques/WeakHashSet/EqHashMap.
- **Period-keyed caching of expensive denotation lookups** (member cache, baseType, derivedSelect, asSeenFrom).
- **Allocation reduction** via smaller eager buffers, lazy refs, stamp-based resets, and outermost-only traversals.

## For review

`mill-libs-javalib` is normally built as separate smaller modules, but this PR consolidates all of that into one big compile for benchmarking.

- The first commit in this PR contains the `bench-mill-javalib/` benchmarking and optimization scripts used to perform this benchmark, which contain the majority of the lines changed (~1k lines). These scripts:
  - Download the `mill-libs-javalib`'s source jar and unpack it
  - Run JFR
  - Nicely visualize the JFR profile data, etc.

  They are throwaway vibe coded slop and can be removed before merging, since none of it is necessary for the optimizations to take effect, but I left it in the PR in case we want to keep it or some variant of it for others to pick up in future. They're written as Scala scripts using a local Mill bootstrap script, since until https://github.com/scala/scala3/pull/25970 lands we cannot easily write such scripts in Scala. They were originally in Python but ported to Scala due to poor performance.
- The subsequent commits each contain a single optimization with the rationale for that optimization and measured before/after using JMH or JFR that demonstrates the improvement (total ~1000 lines).
  - Each one should be self contained and reviewable on its own. This PR should be rebased/merged without squashing, as each commit contains useful information without mixing them together.
  - Related commits are grouped together in the commit log, to make them easier to review in one pass, but mostly maintained as separate commits with separate profiling results and separate reasoning. Only a small number of especially overlapping commits were merged together.

The best way to review this is probably to go through commit by commit in order and review the code change and commit message, leaving comments on each commit. Once done I can do a single cleanup pass to fix or remove any commits as necessary, whether code changes or benchmarking harness, and we can merge this without squashing, preserving the original commits and commit messages containing reasoning and benchmarks.

## How much have you relied on LLM-based tools in this contribution?

Entirely vibe coded in ralph loops, with some human judgement. The prompt used is provided at `bench-mill-javalib/prompt.md`

## How was the solution tested?

This branch was developed by hooking up `claude` or `codex` in a loop with JMH and JFR (`loop-claude.sh` and `loop-codex.sh`), and asking it to find potential performance optimizations over and over by cross-referencing the JFR profile with the source code, and validating these optimizations by looking for the expected % drop in the optimized methods' JFR self/total times.

Notably, Codex seems better for this usage:

- It is much better at following instructions, e.g. Claude has trouble spawning the right number of subagents, passing correct and complete instructions to subagents, formatting the commit message correctly, ensuring all heavy lifting is delegated to subagents rather than the top-level agent, etc.
- It has a much more generous subscription quota, e.g. Claude20x's 5hr subscription can only do one iteration of `loop-*.sh`, whereas Codex20x can do 6-8 iterations
- It is much more stable: Claude regularly hangs without noticing that a subagent has finished and it can proceed, or kills subagents prematurely due to thinking an async-await-ing subagent is idle, and all sorts of other harness problems unrelated to the model itself.

Each iteration using Codex typically takes 5-10 hours running 4x parallel on my macbook, which I leave running during the day and overnight.

## Profiling

JMH profiles are typically too noisy to measure the <1% drops in the total time taken for a compilation run, whereas the JFR profiles are fine grained enough that we can clearly see a method go from e.g. `0.5%` of the profile to `0.2%` and have confidence that the expected improvement materialized. In particular, JFR profile %s do not seem heavily influenced by system load: running 1x parallel (uncontended) to 4x parallel (significantly overloaded) does not seem to significantly affect the std dev of the JFR profile %s, presumably because such system load affects the entire program equally.

Each commit is profiled 5 times, each time running 10 iterations (~1min of runtime), and we accept any change where the optimized methods show a reduction in %self or %total times of more than their standard deviation between those 5 profiles.

Rejected commits are documented here https://github.com/scala/scala3/pull/26091 for posterity, complete with their code changes and profiling numbers and analysis. The rejected branch does see a small speedup of ~4%, but given the number of commits it is difficult to identify where that speedup comes from and whether those changes can be included in this PR. The prompt instructs the agent to review both accepted and rejected commits each iteration before coming up with proposed optimizations.

As usual, there is no easy way to regression-test performance: it can only be maintained or improved by repeated or ongoing monitoring and improvement effort going forward. Correctness is validated via the existing tests, to make sure nothing breaks.
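The acceptance rule described under Profiling (five profiles per commit, accept when the drop in a method's %self or %total exceeds the standard deviation across profiles) could be sketched roughly as follows. This is one assumed reading of the rule, not the PR's actual script: the PR does not say which standard deviation is used, so the sketch takes the larger of the before and after values, and the names `AcceptanceRuleSketch`, `Samples` and `accept` as well as the sample percentages are made up for illustration.

```scala
object AcceptanceRuleSketch:
  // Share of profile time (%self or %total) for one method, one value per JFR profile.
  final case class Samples(values: Seq[Double]):
    def mean: Double = values.sum / values.size
    // Sample standard deviation (n - 1 denominator); needs at least two profiles.
    def stdDev: Double =
      val m = mean
      math.sqrt(values.map(v => (v - m) * (v - m)).sum / (values.size - 1))

  // Assumption: accept when the mean drop is larger than the noisier of the
  // two sets of profiles. The PR text does not pin this detail down.
  def accept(before: Samples, after: Samples): Boolean =
    val drop = before.mean - after.mean
    drop > math.max(before.stdDev, after.stdDev)

  // Illustrative, made-up percentages: a clear drop from about 0.5% to about 0.2%.
  val clearBefore: Samples = Samples(Seq(0.50, 0.52, 0.48, 0.51, 0.49))
  val clearAfter: Samples = Samples(Seq(0.20, 0.22, 0.19, 0.21, 0.18))

  // Illustrative, made-up percentages: a drop of about 0.01 hidden in the noise.
  val noisyBefore: Samples = Samples(Seq(0.50, 0.55, 0.45, 0.52, 0.48))
  val noisyAfter: Samples = Samples(Seq(0.49, 0.53, 0.46, 0.50, 0.47))

@main def acceptanceRuleSketch(): Unit =
  import AcceptanceRuleSketch.*
  println(accept(clearBefore, clearAfter)) // true
  println(accept(noisyBefore, noisyAfter)) // false
```

A rule of this shape only filters out changes that are indistinguishable from run-to-run noise; it says nothing about correctness, which the PR leaves to the existing tests.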

In our work, we frequently utilize LLMs to automate benchmarking and generate “stacked commits.” Each commit in the stack delivers a positive performance improvement, and the series is ultimately submitted either as a single consolidated Pull Request or as separate, individual PRs. Recently, we observed within the community how lihaoyi employed a “Ralph loop” to optimize a compiler, boosting its speed by over 50%. I believe we could apply a similar approach to optimize the performance of the Scala standard library—even though it is already quite fast. What are everyone’s thoughts on this? Given that corresponding JMH scores are available to provide a clear before-and-after comparison, I am confident that LLMs would be well-suited to handle this task effectively.
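For anyone comparing before-and-after JMH scores, note that the "over 50%" speedup and the 35% time reduction quoted in the PR describe the same pair of means. A small sketch of the arithmetic, using the two mean values from the PR's table; the object and method names are made up for illustration:

```scala
object SpeedupSketch:
  // Mean compile times from the PR's benchmark table, in ms/op.
  val beforeMs: Double = 8715.0
  val afterMs: Double = 5681.0

  // Speedup: how much more work fits into the same amount of time.
  def speedup(before: Double, after: Double): Double =
    before / after - 1.0

  // Time reduction: fraction of the original time that is no longer spent.
  def timeReduction(before: Double, after: Double): Double =
    (before - after) / before

@main def speedupSketch(): Unit =
  import SpeedupSketch.*
  println(f"speedup:        ${speedup(beforeMs, afterMs) * 100}%.1f%%")
  println(f"time reduction: ${timeReduction(beforeMs, afterMs) * 100}%.1f%%")
```

For these two means the speedup works out to roughly 53% and the time reduction to roughly 35%, in line with the 50-55% range the PR reports across runs.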

jducoeur · May 30, 2026, 3:41pm

It’s not a bad idea in principle, but I worry about automated LLM usage from a pure cost perspective.

We’re starting to see LLM costs rising to something closer to the actual costs of running the services, which is on the order of ten times what people have been paying up until now.

So I’d be cautious about using it in an overly automated way, given that Scala Center very much isn’t made of money. Used as an occasional tool it’s plausible, but I wouldn’t bake it into processes too much.

hepin1989 · May 31, 2026, 11:29am

Currently, both OpenAI and Anthropic (Claude) offer dedicated support via OSS accounts. It is possible to identify certain issues using the free plans available on platforms like OpenRouter or OpenCode, but paying for actual API calls at any real volume is prohibitively expensive. If anyone happens to have unused credits available, it is certainly worth giving it a try.