The role of the Scala language, compiler, and tooling in the age of LLM-supported “automatic coding”

Taking a step back to squint at the bigger picture: I guess many of us are thinking about our future in light of tools like Cursor and Copilot, and I am wondering what all Scala friends here think about the future relationship between the Scala language/compiler/tooling and the role of developers, given the rapid evolution of automation capabilities enabled by large language models.

Here is a whiteboard sketch I made that may be food for thought, based on the hypothesis that there will be trained AI components “everywhere”, and that the compiler IO is connected to an “automatic coder”:

  • We already have standardized error output that may help in training AI on compiler IO, but what other cool things could be done on the language, compiler, and tooling side to facilitate LLM-based automation?
  • Is there “low-hanging fruit” in language, compiler, and tooling evolution that you think would make Scala stand out in the race?
  • With the current Scala and its ecosystem, are there specific aspects of our favorite language that you think make Scala particularly suitable for LLM-supported coding?

(This is an attempt to kick-off some forward-looking requirements elicitation in our joint endeavor of requirements engineering for our favorite language :slight_smile: - any contributions welcome on goals, non-goals, features, qualities, opportunities, challenges, … with a long-term horizon in mind.)

2 Likes

Right now, the main limitation of LLMs is that they sometimes work and sometimes don’t (hallucinations, for example).
For that, having a strong type system already helps a lot, since it can statically detect mistakes.
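As a tiny illustration of that point (the `sumBy` call here is made up, the kind of plausible-sounding member an LLM might hallucinate):

```scala
val xs = List(1, 2, 3)

// A hallucinated API: `sumBy` looks plausible but does not exist on List.
// val y = xs.sumBy(_ * 2)   // error: value sumBy is not a member of List[Int]

// The compiler forces the real API before any test ever runs:
val y = xs.map(_ * 2).sum
```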

As for things we should do: I think nothing.

We should focus on supporting humans.
We are already resource-constrained, and we have no idea what the AIs of tomorrow will look like.
Furthermore, what will matter most is probably the amount of input data, and in this department we have already lost the race: Python is miles ahead (and is the language most AI-making people use), and Rust has a stronger type system (?) and more examples (though not by a lot?).
Developers are the ones who care about Scala!

3 Likes

Yeah, I think statically typed languages in general, and Scala in particular with its stronger safety guarantees from a powerful and deterministic type system, will become more and more important as LLM-supported coding, with its stochastic output, is increasingly applied in product development.

But I guess things other than the strongest types also matter already today if you compare Scala to a language X? For example,

  • How much code out there is trained on Scala compared to X?
  • Is the AI able to combine context with error messages to find a good step-wise refinement of LLM output for Scala compared to X?
  • What is the probability of hallucination with Scala + library L1 compared to Language X + library L2?
  • How are the tooling, the speed of feedback, and the developer experience in general with Scala+LLM compared to Language X+LLM? And so on.

For most of these questions, given that LLMs don’t actually understand what they’re doing, you’ll need to focus on the humans anyway :person_shrugging:

Specifically:

Given how LLM builders historically take a very … flexible approach to gathering training data, you’d need to maximize the amount of Scala code in the wild (which we’d want to do anyway for humans).

This depends on having error messages posted and explained for the models to train on. So, arguably, the response that would produce the best LLM experience is horribly obtuse error messages that need to be explained multiple times in various contexts, giving the models lots of training data to work with; clean and easily understandable error messages generate less training data precisely because they need to be explained less.

To be clear, I think this would be a horrible idea.

I’m not sure this can be realistically measured, particularly because of the number of expert human hours that would be needed to check LLM output. That being said, the practice of law generates huge amounts of training data, and hallucinations still happen when someone tries to use an LLM to replace legal writers, so this is probably going to keep happening as long as Generative AI uses LLMs instead of Reasoning Models that can actually understand the topic.

In light of the above, the “Scala compared to Language X” part of “Scala+LLM compared to Language X+LLM” is probably going to be the dominant factor, so focusing on the human experience is going to be way more effective.

4 Likes

Maybe this is a bit dangerously close to my day job but I’ll weigh in anyway… Disclosure: I work for NVIDIA, I don’t work on LLMs (though I do work on infrastructure for very large transformer models), my opinions are very much my own and don’t reflect anything non-public.

I’m deeply skeptical that we’ll integrate compilers with AI in the fashion your diagram suggests. It’s important to understand that classical components still have an extremely important role to play even in learned systems. For example, I work in autonomous driving, and even in the most frontier systems for AV, we still generally have a classical PID controller to translate intended trajectories into vehicle actuation (what presses the pedals and turns the wheel). Even the most AI-enriched variant of programming still requires a classical translation from programming language to machine code.
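To make “classical PID controller” concrete for readers who haven’t met one, here is a minimal sketch; the gains and types are made up for illustration, not anything from a real AV stack:

```scala
// A deterministic, classical component: the learned planner hands it a
// target, and it computes an actuation command. No stochastic behavior.
final class Pid(kp: Double, ki: Double, kd: Double):
  private var integral = 0.0
  private var lastError = 0.0

  def step(target: Double, measured: Double, dt: Double): Double =
    val error = target - measured
    integral += error * dt
    val derivative = (error - lastError) / dt
    lastError = error
    kp * error + ki * integral + kd * derivative

// e.g. a hypothetical steering controller: Pid(kp = 0.8, ki = 0.1, kd = 0.05)
```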

Speaking of programming languages, in my experience with AI coding, PLs still serve the same important role that they’ve always served: a means for humans to communicate. Code is read by other humans; whether that code was written using AI or by hand is irrelevant. When actually engaging in the act of AI-assisted coding, the model communicates with you, the human, using code, and you communicate back in the same terms. This offers a far higher degree of precision than pure natural language could allow for, which is to say, exactly the same benefit that high-level programming languages have always brought to the table.

As for your call to action… The most interesting frontier in this space is the evolution around RAG. Cursor effectively deputizes the programmer as the initial retrieval agent when you populate its context with a given file. In practice this seems to work pretty well with conventional developer workflows, but it still leaves a lot on the table when it’s fully textual in nature (sidebar: Cursor doesn’t seem to work all that well with Scala in particular, so in that sense Scala is very much falling behind the curve). In general, this type of textual scanning discards a lot of meaning that we just don’t need to discard.

It’s entirely possible to tokenize richer semantic representations of the language, such as TASTy or even other stable IRs within the compiler. This is relatively similar to the IDE problem faced by Metals or IntelliJ, and ultimately all of that same information is extremely useful to any coding LLM. This type of deeper integration would allow for considerably more accurate model behavior and much denser context windows, reducing hallucinations in part by allowing for lower emphasis on the training corpus. In this sense, I think Scala is on a good path due to the fact that TASTy exists in the first place, but someone should probably do the actual legwork to try to push this further and see where it takes them.
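To make that concrete, here is a minimal sketch using the scala3-tasty-inspector API; how you would serialize the typed trees into model context is entirely an open question, and the file path is hypothetical:

```scala
import scala.quoted.*
import scala.tasty.inspector.*

// Read TASTy instead of raw source text, so a retrieval layer sees fully
// typed trees rather than tokens of source code.
class ApiInspector extends Inspector:
  def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
    import quotes.reflect.*
    for tasty <- tastys do
      // tasty.ast is the typed tree; here we only print it, but one could
      // extract signatures and types into a dense context for a model.
      println(tasty.ast.show)

@main def inspectAll(): Unit =
  TastyInspector.inspectTastyFiles(List("target/classes/Foo.tasty"))(ApiInspector())
```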

I can also envision some more elaborate possibilities where larger companies have fine-tuned variants of major models (again, avoiding blunt RAG with more nuanced approaches), but we’ll have to see how the broader ecosystem shakes out there.

So I guess the TL;DR of my long-winded ramble is that the best thing Scala can do to do well in this emerging world is to solve the same problems that it has always needed to solve: stable and strong tooling, particularly of a semantic nature. Think of Metals/IntelliJ as the canaries in the coal mine. As long as they are struggling with the latest version of the language, everything and everyone else will be too.

10 Likes

Not to throw cold water on the party, but for the time being, Scala loses with LLMs. This is because LLMs very successfully generalize from bazillions of examples, which means what they need is bazillions of projects of reasonable quality covering all kinds of different things. Furthermore, because they fundamentally work by recognizing design patterns, and their abilities in this regard aren’t extraordinary, simplicity helps them do better.

That means, basically, that Python wins: culture of simplicity, tons of examples.

JavaScript and R and Java are okay. C and C++ have lots of examples, but the degree of fidelity they require is higher, and the results are so-so: fine for the easy stuff, but LLMs will shoot you in the foot just like you would have yourself. Dunno about Go; I haven’t tried it.

Tools whose strength is in their sophistication, but which aren’t available in vast quantities, are at the biggest disadvantage. And Scala is solidly in that camp, along with Rust.

(All these statements comport with my anecdotal experience in using LLMs on all of these languages.)

In the future, as reasoning models become more prevalent, this might change, but we can’t predict that, only hope. And then we need to see what the state of affairs actually is.

Now, I don’t think this is actually all that bad for Scala programmers. Scala is already more for people who want to do difficult things well than easy things quickly without particularly knowing how things work. It just widens the gap a little bit more. And I certainly wouldn’t discourage any features that would help LLMs and Scala get along with each other a bit better.

But fundamentally, I don’t think Scala is likely to take great advantage of LLMs as they stand right now. They’re a bigger power-up for Python.

But! Reasoning is right around the corner. ChatGPT o3-mini still botches my Scala code (and Rust code), but the same features that allow reasoning models to do a lot better on math problems have promise for allowing them to work on languages with more mathematical-type guarantees.

So, this year it’s not looking great, but who knows next year or the year after that.

2 Likes

In my experience with code-assist integration in IDEs, I haven’t really noticed much difference in quality between Cats Effect projects and plain Java projects. But that’s only with the “advanced code completion” tools. I haven’t really tried getting it to develop an entire feature from a blank page.

1 Like

I believe that these problems of sample size will be overcome. For instance, LLMs are pretty good at Estonian even though Estonian is far less prevalent than English. How do they do it? Bob West, a colleague at EPFL, dissected the operations of an LLM by instrumenting internal layers and found that the LLM apparently translates Estonian tokens into something resembling English tokens, does the generation in English, and then translates back. Not by any explicit instruction, just because this came out of the training weights. Pretty amazing…

Looking further ahead, we can predict that code generation will be very cheap, but establishing trust that the code does the correct thing will remain expensive. So we will have to work on how one can trust LLM generated code. Ultimately, it might mean that LLM generated code needs to come with a proof that it behaves according to specification. Of course, the proof should be generated automatically as well. But what is the specification? There the whole thing cycles back: We need high-level languages and rich type systems to express specifications. And to do that scalably we need to have good module abstractions. To get good module abstractions that are more than just primitive access restrictions and syntactic sugar, we need good support for capabilities.
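As a small illustration of “types as specifications”, here is a hypothetical fragment; capability-based specifications would go much further than this, but the principle is the same:

```scala
// Part of a specification encoded in the type system: a probability is a
// Double in [0, 1], and nothing else can ever inhabit the type.
object Spec:
  opaque type Probability = Double

  object Probability:
    def apply(d: Double): Either[String, Probability] =
      if d >= 0.0 && d <= 1.0 then Right(d)
      else Left(s"$d is not in [0, 1]")

  extension (p: Probability) def value: Double = p

// Generated code that type-checks against this interface carries, in
// effect, a machine-checked proof of this fragment of the spec.
```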

10 Likes

We use Scala in our company and have created about 5,000,000 lines of code with a relatively small team. So text generation is not the most critical part of the development cycle.
We have created a highly parallel system with very good energy efficiency.

It is amazing how good Scala is at preventing developer errors. We actually have no errors from multithreading at all.

The only problem I can remember is unbounded collections, which can crash a server due to program errors.

Scala allows us to create a really safe sandbox for developers.
It is really cool, and this sandbox is the reason why I do not think a general LLM can help us much in such a case. We would probably have to train an LLM on our own code base.

If you need something that has not been written many times on Stack Overflow, Google, or GitHub, an LLM usually will not help either.

4 Likes

After DeepSeek, a lot of people are looking into distilling models with reinforcement learning in order to obtain new specialized models at very low cost.
The structured and clear error messages that the Scala 3 compiler emits could be used to distill a model specialized for our language. The compiler could be a good evaluation function in RL.
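A back-of-the-envelope sketch of what “compiler as evaluation function” could look like; the reward shaping and the scala-cli invocation are assumptions, not a worked pipeline:

```scala
import java.nio.file.Files
import scala.sys.process.*

// Hypothetical RL reward: compile a candidate completion and score it by
// the compiler's verdict. Exit code 0 means it compiles; otherwise we
// penalize by the number of reported errors.
def reward(candidate: String): Double =
  val dir  = Files.createTempDirectory("rl-eval")
  val file = dir.resolve("Candidate.scala")
  Files.writeString(file, candidate)
  val diagnostics = collection.mutable.ListBuffer.empty[String]
  val logger = ProcessLogger(_ => (), err => diagnostics += err)
  val exit = Seq("scala-cli", "compile", file.toString).!(logger)
  if exit == 0 then 1.0
  else -diagnostics.count(_.toLowerCase.contains("error")).toDouble
```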

5 Likes

Yes, the parts that we need to trust (type checking, machine code generation, …) cannot have trained, stochastic behavior. But the proof-engine-like type system is already what we used to call AI: symbolic AI, instead of sub-symbolic neural networks.

But the compiler also engages in a dialog of errors and suggestions, including natural-language error messages, and there might be room/need for LLM support to boost this dialog with humans, plus the AI model used by non-technical stakeholders to convey intent as input to the automatic generation of a prototype that operationalizes their requirements (before the real, safe, good thing is properly engineered, if resources permit…).

I also think, as @djspiewak, @AMatveev, and @vincenzobaz suggest, that distilled models and retrieval-augmented generation may become really effective soon, especially with a safer language like Scala.

Assume you use the whole Scala community build, the entire corpus in TASTy, and permissively licensed Scala code from GitHub/GitLab/Codeberg as input to reinforcement/retrieval-augmented learning and generation, with your whole company’s internal code base as context. Then you might get high-quality generation of Scala code from human intentions expressed in natural language, with results improving iteratively as the deterministic output of the compiler (TASTy, errors, etc.) is fed back into the loop.

The AI components in the compiler are symbolic+deterministic for the parts that need to be safe and LLM-based for the parts used for dialog with stochastic entities (humans & LLMs).

I have played with letting the new o3 model re-engineer the requirements from Scala code, and it is surprisingly good at taking explanation one step further and providing a high-level natural-language specification. I guess the reasoning part would be helped a lot if it could somehow be connected with the deterministic reasoning going on inside the compiler.

I sincerely agree that safety will be increasingly important and is indeed a strategic direction for Scala, and it eventually needs to be on by default, not just opt-in. Every safety measure we can include should be considered, such as strict equality, a stdlib with NonEmptySeq, etc. And if capability checking can provide generalized safety beyond any existing language, we are in a really good spot.
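For instance, here is what strict equality already buys in Scala 3 today (a minimal example):

```scala
import scala.language.strictEquality

case class UserId(value: Int) derives CanEqual
case class OrderId(value: Int) derives CanEqual

val u = UserId(1)
val o = OrderId(1)

// u == o  // does not compile: values of types UserId and OrderId cannot be compared
val same = u == UserId(1)  // fine: CanEqual[UserId, UserId] is derived
```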

Here is an interesting talk suggesting that regulation will remove unsafe languages, accelerating the demise of languages such as C/C++:

I think we should communicate even more strongly that safety is a key offering of Scala.

@AMatveev Your story about Scala being an enabler for a combination of safety and productivity is compelling - perhaps you’d like to blog about that? I think many would find it an interesting read, including myself!

2 Likes

Interesting! But that’s what I’d (largely) expect: languages mostly get mapped onto the same statistical regularities, because humans are still pretty much humans and talk about the same kinds of things regardless of language.

But that only generalizes to expecting to write Scala organized around the concepts that are common in Python and Java. And, yes, that’s pretty much what I find.

So, for instance, I recently wrote some Python code that uses fairly typical-for-Python patterns (mutable fields that get assigned None, throwing ValueError, etc.) and asked ChatGPT o3-mini-high for a translation to idiomatic Scala 3. The LLM still used mutable fields and throw new RuntimeException, and didn’t use for comprehensions (though it did use Option), and so on. Asking for a rewrite of the Scala into more idiomatic Scala helped some, but in doing so it leaned heavily on Java-esque patterns, like defining a Map[String, Any] and using asInstanceOf to cast values retrieved by key.
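A contrived illustration of that gap (my own snippets, not the actual transcript):

```scala
// Python-flavored Scala, roughly the shape such translations take:
def lookupPythonStyle(data: Map[String, Any], key: String): String =
  if !data.contains(key) then throw new RuntimeException(s"missing $key")
  data(key).asInstanceOf[String]

// A more idiomatic Scala 3 rendering of the same intent:
def lookup(data: Map[String, String], key: String): Either[String, String] =
  data.get(key).toRight(s"missing $key")
```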

It’s slightly better when I ask it to write Scala 3 de novo, but it makes more mistakes when it’s using Scala concepts that don’t map nicely into Python, at least in my experience. (This is not unique to o3-mini.)

I very much hope and rather expect that as people spend more effort on thinking about LLMs as providing a type of linguistic computation, the LLMs will do better at finding correct and elegant solutions that aren’t as driven by approaches in languages without Scala’s features. I’m very encouraged by the reasoning models in areas that aren’t exactly programming. But I think there’s still a ways to go.

On the other hand, if people find the entry barrier to Scala to be at the level of “I want to write Python in Scala but it confuses me,” then I think even if LLMs can’t bridge that right now, it’s not far off and almost sure to work. As a teaching aid, I think we’re already there.

And I do think there’s a good story about languages with strong typing and powerful abstractions, as in Scala, helping de-risk the problems of LLM-assisted/generated code. But it’s the same story we already had with human programmers. I’m not sure +LLM makes the story any more compelling. LLMs are in some ways already better than average human programmers at avoiding some of the problems inherent in weakly-typed languages.

3 Likes

Oh, yes! They’re already very good at going from code to a natural language description, including for Scala. It’s the other direction which isn’t as clean for lesser-used languages (e.g. Scala, Julia, and Rust).

2 Likes

The lowest-hanging fruit for giving Scala an advantage in LLM-assisted programming will be training reasoning models to follow methodologies that take advantage of what Scala is good at, wherever possible. When asked to compose new code, the LLM should figure out a precise domain model, then implement the rest of the code on top of that, going back and forth as necessary. Scala’s comparative advantage in making illegal states unrepresentable will add guardrails that aren’t as effective in mainstream competitors (see the sketch after this list). For that you need:

  1. To train a reasoning model to follow a methodology that tries to capture requirements in robust types before implementing “process” logic, and is able to iterate with this in mind.

  2. Back this up with iterative tool calling by the reasoning LLM—good use of tool calling can make small models competitive with much larger models that lack access to tool calling.
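As a minimal sketch of the kind of guardrail meant above (with a hypothetical domain, obviously):

```scala
// Illegal states are unrepresentable: an order can only be shipped once it
// has been paid, and the types enforce the workflow.
sealed trait Order
object Order:
  case class Draft(items: List[String]) extends Order
  case class Paid(items: List[String], paymentRef: String) extends Order
  case class Shipped(items: List[String], paymentRef: String, trackingId: String) extends Order

def ship(order: Order.Paid, trackingId: String): Order.Shipped =
  Order.Shipped(order.items, order.paymentRef, trackingId)

// ship(Order.Draft(Nil), "T-1")  // does not compile: a Draft is not Paid
```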

Here’s where compiler-based tooling can shine. I see several opportunities:

  1. Safe sandboxed calls to the compiler (you don’t want malicious code executing via macros), possibly with some accuracy/performance tradeoffs. At the very least, you might not need to go through all stages of the compiler to benefit. (A rough sketch follows after this list.)

  2. Ability to extract relevant code from the local project, or extract interfaces from libraries unfamiliar to the LLM, in order to augment the context based on insertion point (might be more relevant for lengthier local completion than for multifile composition—in my mind the former is about using what’s available in local scope and latter implies a wider scope anyway).
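On opportunity 1, a rough sketch of a cheap, partial compiler call; whether -Ystop-after:typer gives the right cost/benefit is an open question, and true sandboxing would still need process isolation, since macros can run arbitrary code at compile time:

```scala
import scala.sys.process.*

// Hypothetical tool call for an LLM loop: stop after the typer phase so
// type errors come back quickly without paying for the backend phases.
def typecheckOnly(sources: Seq[String]): Int =
  (Seq("scala-cli", "compile", "-O", "-Ystop-after:typer") ++ sources).!
```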

I think the latest Mistral and Falcon models are currently able to do reasoning and tool calling, and have capable smaller variants able to run on consumer hardware.

1 Like

I very much agree with this, and it aligns well with my own understanding of how large transformers function.

There seems to be a pretty pervasive misunderstanding of LLMs, which is the idea that they operate on text and then preserve that association through their machinery. They really don’t. Text is tokenized, and then those tokens are abstracted layer by layer. There’s more than enough evidence at this point to prove that this is exactly how they function (see: Anthropic’s studies on LLM explainability, most notably the hilarious experiment where they tricked the model into always thinking about the Golden Gate Bridge). This in turn means that LLMs don’t really need a vast Scala corpus in order to be good Scala programmers; instead, they need a vast code corpus of any type and enough Scala to understand how it relates to other languages. And for this, the existing public data is more than adequate.

As an aside on this, my forays into LLM-assisted programming have often touched on things that, pretty verifiably, have never been written before in any language, though are (obviously) amalgamations of other concepts. If LLMs were simply blindly generalizing text from other text, they would be unable to handle such things, and yet (almost to my surprise) they generally do extremely well.

I recommend reading up more about the role of tokenization in transformer models. Symbolic abstraction already exists; it’s just that different words are used to describe it than are in vogue amongst linguists. Multi-modal systems prove that this works better than you probably expect (remember: code is a modality; the ensembles do not “think” about programming the way they “think” about natural-language output).

Within this context, giving the models a richer token set to operate on (like TASTy!) is a great way to raise the abstraction level that already exists. In other words, the future you are asking for is already here, and the goal should be to lean into it.

1 Like

I am very grateful to the Scala community.

And I really would be glad to write something useful. Unfortunately, it is not so simple for me. I will think about it.

I have even thought up a title: “Scala is really great for classical procedural programming”.
It is because of these killer features.

1 Like

Cool! Perhaps some LLM nearby can help you get started :slight_smile:

1 Like

I certainly have more reading to catch up on. I found this video good at visualizing and explaining tokenization, around 7:46 in. As I understand it, there is emergent abstraction of tokens, but it is not comparable to human-encoded symbols such as TASTy, which has higher-level semantics encoded in its typed symbols.

Sounds promising!

So: distill using TASTy, and then decompress/generate via augmentation, with the context also compressed using TASTy…

I certainly agree about the nature of statistical inference in LLMs: it is very abstracted, with multiple levels of high-level tokens.

But I don’t agree that a code corpus of just any type is necessarily suitable, because it is in aspects of high-level design that Scala can often vary in highly useful ways from the “ordinary” approach.

For instance, something as simple as using nested methods in Scala is woefully underused, at least whenever I ask LLMs to write Scala. They’ll happily create external methods to share functionality between other methods, but inside a method they’ll repeat the same code multiple times, when it’s hard to extract into a top-level method, rather than create or suggest a nested method.

Why? Mostly because that’s not the high level pattern you find over and over and over in the vast code corpus, I would guess. I haven’t tried to search for activation that corresponds to factoring out common code, but I’d bet that it’s learned patterns of attention that aren’t awesome for detecting when you can have within-method factoring out via nested methods. (Where, just to clarify, the distinction is that you need access to and may even need to modify things only visible inside the scope of the un-nested or less-nested method.)
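To make the pattern concrete, a small made-up example of the within-method factoring meant here:

```scala
// The helper reads and mutates state that exists only inside `summarize`,
// so it cannot be lifted to the top level without widening its signature.
def summarize(xs: List[Int]): String =  // assumes xs is non-empty
  val sb = new StringBuilder
  var fields = 0

  def emit(label: String, value: Int): Unit =
    fields += 1                 // mutates enclosing local state
    sb ++= s"$label=$value "

  emit("min", xs.min)
  emit("max", xs.max)
  emit("sum", xs.sum)
  s"($fields fields) ${sb.result().trim}"
```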

So I think lots of exposure to the high-level concepts is necessary, or possibly less exposure but more reasoning, for Scala code where one is often happy, as one is when writing fully by hand, that it is in Scala.

3 Likes