Compiling Scala to Python as a new platform for data engineering and AI

This is what Microsoft did to make C# interop with Python, including pip install integration etc. They share memory to avoid heap duplication.

1 Like

Dealing with non-JVM-controlled memory is part of the problem space anyway if you’re doing computations on the GPU. That the shared memory happens to be CPU-addressable memory instead of GPU memory doesn’t really change…anything. (It’s also the same hassle in Python to deal with the GPU if you start going direct instead of interacting only with the highest-level wrappers–you have the whole data on/off GPU thing.)

So I don’t really see IPC with Python via shared memory buffers as different than “IPC” with the GPU. It doesn’t mean you have to compile to Python, just talk to it.

2 Likes

Yeah, the overall goal is to “talk to it” in some convenient and safe way without a performance penalty. Other approaches than compiling to it are also interesting to investigate. Different use cases might benefit from different solutions.

  • My main point is that we currently lack access to the AI tooling only available in Python, which may hamper Scala adoption.

same hassle in Python to deal with the GPU

Well, as I understand it, much effort has been made in Python to create the shims for the latest planet-boiling hardware. If I am correct, it’s a big job getting the interfacing to work, but it “just works” in Python from the PyTorch user’s side.

It is interesting to learn that, from a C# view, the Microsoft engineers in the video above speak about Python as “an implementation detail”… (maybe they were joking)

I recall a case where I needed Python in Scala, but as a scripting language interpreter, not a compiler. The business reason was to enable customers to write scripts within the application server, without requiring them to learn Scala. Python was chosen because it is a well-known language and it is easy to understand. (Hmm, reasons for spark-python are the same). So, the main question is the perception of the language as {hard/easy} to learn.
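For reference, a minimal sketch of what that kind of embedded scripting could look like today, assuming GraalPy (GraalVM’s Python runtime) and its polyglot API are on the classpath (Jython would be a similar option); the discount function and its name are made up for the example:

import org.graalvm.polyglot.{Context, Value}

object CustomerScripting:
  // One polyglot context acts as the scripting sandbox; real code would
  // restrict permissions instead of using allowAllAccess.
  private val ctx: Context =
    Context.newBuilder("python").allowAllAccess(true).build()

  // Evaluate a customer-supplied script that defines a `discount` function,
  // then call that function from Scala.
  def discountedTotal(script: String, total: Double): Double =
    ctx.eval("python", script)
    val fn: Value = ctx.getBindings("python").getMember("discount")
    fn.execute(total).asDouble()

@main def demo(): Unit =
  val script =
    """def discount(total):
      |    return total * 0.9 if total > 100 else total
      |""".stripMargin
  println(CustomerScripting.discountedTotal(script, 120.0)) // 108.0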

2 Likes

You nail it here. A long time ago I did a project in Eiffel, but the language was abandoned because it was too slow in picking up decent XML support, which was the hype of the time. AI is the hype of this time. I do coding for it and, of course, it is in Python. Torturing myself, I wonder: could this not be done in Scala? That would make me produce much more reliable code. My feeling is that, if Scala is not able to connect to the new hype fast, our community will lose a lot of programmers.

My personal favourite would be to give the main role to Scala Native here, with wrappers around libraries like PyTorch and TensorFlow. If I am correctly informed (but correct me if I’m wrong), both of these have their core libs written in C++.

5 Likes

Btw, DJL (https://djl.ai) is good.

4 Likes

I share this worry. And I would not look forward to being forced to code in unsafe land.

2 Likes

A nuance to consider: I’d honestly say that “without performance penalty” may be stricter than is necessarily called for here. (Especially given that it’s not as if Python itself is all that fast.)

I’d recommend instead thinking in terms of quantifying the performance penalty that various options have, and considering that as one of the tradeoffs. Something that’s considerably easier to build, and not too much slower in practice, might not be a bad compromise. (Not that that necessarily exists, of course – my point is just that performance is probably a tradeoff, not a hard requirement.)

4 Likes

No one is stopping anyone from exploring ideas. My comment was focused on finding the best solution to the problem - enabling AI in Scala or, better yet, making Scala a great language for AI development - and in my opinion compilation to Python doesn’t solve either of these problems.

Just some points why:

  • Compiling to Python won’t make using AI libraries in Scala easier; it will only make it harder, as you will need to work around target-compilation quirks
  • It won’t make people target Scala for Scala’s unique proposition at the language or platform level
  • It won’t make execution faster and cheaper than using Python directly
  • It is going to encourage more people to learn Python rather than use Python through Scala, and they can use it as a transition step
  • Cross-compilation is good when you’re targeting multiple platforms and want to share business logic (e.g. Android <> iOS <> Backend; another example is Kotlin Multiplatform), while it’s not that interesting when you just need access to an ecosystem for a specific use case.

Scala has an amazing type system and support for macros which, paired with either native compilation or Panama, can produce an amazing and fast toolchain; interoperability with Python should be considered an intermediate step towards creating our own AI toolchain in Scala that’s better than Python. Spark only took off because it was solving industry problems better than other solutions, and once there is something even better, everyone will forget Spark.

At the current moment, if I need to do data science, AI or statistics for commercial use, I won’t choose Scala as the main language; it just doesn’t have the ease of use of Python nor the performance of Rust. And if we imagine that Scala could already compile to Python - it still wouldn’t increase the chances that I pick Scala, as it is going to be just an extra wheel anyway.

So overall it’s better to find an optimal solution with as many people aligned as possible. Scala doesn’t have the herds of students and companies behind AI that are willing to put time and money into evolving the ecosystem.

6 Likes

Maybe the ultimate solution is to do as Microsoft does and generate Scala source code to make interop really convenient:

(note the cute mascot)

But I guess we should evaluate all options in terms of benefits and use cases. The engineers at Microsoft said in the presentation that it took at least one person (with a backing team; it is unclear how much time they devoted) a lead time of as much as one year to accomplish this.

But compiling to Python (or some similar tight interop on Scala Native+JVM) may also attract those who want to gradually migrate off Python to Scala but cannot do the migration in one big step (even assuming all the corresponding missing Scala libs or Panama+Babylon were ready to go…)

People mentioned Scala on GPUs, I thought this might be relevant: Scala Days - Painting with functions - Scala on GPUs

2 Likes

I agree that the lack of a mascot is a blocker and would require at least a year of research.

What would the Scala helix look like right after swallowing a cute snake?

By coincidence, RFC - Angular official mascot · angular/angular · Discussion #61733 · GitHub

They are approaching the problem with seriousness of purpose and structured discussion, however. The current sticking point is that the anglerfish is not cute.

1 Like

Panama has improved the way off-heap memory is managed, and Babylon has further built upon this by providing new memory-mapping interfaces for GPU programming.

To quote Juan’s introduction:

This is one of my favourite parts of the HAT project. HAT defines an interface called iFaceMapper to represent data. Data is actually stored off-heap by leveraging the Panama Memory Segments API for GPU computing.

From my point of view, data representation presents a significant challenge in GPU programming with managed runtime languages like Java, particularly concerning the tradeoffs between performance, portability and ease of use. It is also a critical part, because in Java, we have the Garbage Collector (GC), that can move pointers around if needed.

HAT tackles this issue by defining a base interface capable of handling data access and manipulation within Panama Segments. This interface is extensible, enabling developers to create custom data objects compatible with GPUs and other hardware accelerators.

This interface offers broad potential benefits, extending beyond Babylon and HAT to projects like TornadoVM. While TornadoVM offers a wide range of hardware accelerator-compatible types, it currently lacks user-side customization for data representation. This interface could provide a very promising approach for integration, enabling greater flexibility and control, and improving TornadoVM further.

For example, to create a custom data object in HAT to store an array that uses a Memory Segment:

public interface MyCustomArray extends Buffer {
   int length();

   @BoundBy("length")
   float data(long idx);
   void data(long idx, float f);

   // Define the schema
   Schema<MyCustomArray> schema = Schema.of(MyCustomArray.class,
           array -> array
           .arrayLen("length")
           .array("data"));

   static MyCustomArray create(Accelerator accelerator, int length) {
       return schema.allocate(accelerator, length);
   }
}

Then, the HAT OpenCL compiler generates a C-struct as follows:

typedef struct MyCustomArray_s {
    int length;
    float data[1];
} MyCustomArray_t;

Still, there is a bit of boilerplate code to add, but it can be used to define custom data types compatible with GPUs. How cool is this?

4 Likes

Scala DSL to GPU code is definitely interesting! But rather than leaning into a functional approach, the easy path is just wrapping tooling–like bytedeco does for GPU libraries (so there isn’t even any user-facing install of GPU tooling, which is awesome!), or DJL does for ML libraries (which you do have to install separately, so boo, but after that it is (supposed to be) easy)–or simply providing a shared memory arena with Python which I think Java 22 is pretty close to with the FFM API.
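To make that last idea concrete, here is a rough sketch of the JVM side of such a shared buffer, assuming Java 22+ and a memory-mapped file as the shared arena (the file path and float layout are just for illustration); the Python process would map the same file, e.g. with numpy.memmap:

import java.lang.foreign.{Arena, MemorySegment, ValueLayout}
import java.nio.channels.FileChannel
import java.nio.file.{Path, StandardOpenOption}

object SharedFloatBuffer:
  // Write an array of floats into a memory-mapped file, off the JVM heap.
  // The Python side can read the same file with numpy.memmap(path, dtype=np.float32).
  def write(path: Path, values: Array[Float]): Unit =
    val byteSize = values.length.toLong * ValueLayout.JAVA_FLOAT.byteSize()
    val channel = FileChannel.open(path,
      StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)
    try
      val arena = Arena.ofConfined()
      try
        val segment: MemorySegment =
          channel.map(FileChannel.MapMode.READ_WRITE, 0, byteSize, arena)
        for i <- values.indices do
          segment.setAtIndex(ValueLayout.JAVA_FLOAT, i.toLong, values(i))
      finally arena.close() // closing the arena unmaps the segment
    finally channel.close()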

1 Like

I’m honestly not too interested in having yet another backend for Scala, although having a language you can use anywhere does sound interesting.

What I would love to see is very good abstractions for programming on the GPU, with native support.
Maybe even in the standard library.

@odersky weren’t there discussions about such a thing with a graphics programming professor at EPFL?

3 Likes

Right now I am just trying to enhance Storch and build a full-stack Scala 3 chain for deep learning, big data, and LLM environments.
This is something we need to do, and must do.
We need to establish a stable, robust, reliable, and sustainable deep learning environment while fostering a strong community culture. I have already written nearly 40,000 lines of code for Storch and implemented its upstream and downstream components entirely in Scala, though the work is not yet complete. PyTorch will undoubtedly remain the standard for deep learning, and we rely on it. Fortunately, we now have Storch, along with essential tools like NumPy, Polars, Pandas, Pickle, SafeTensor, MsgPack, Excel, and HDF5, allowing us to break through Python’s barriers and monopolistic dominance with real capability.

We still need Hugging Face Transformers. Our goal is to build our own ecosystem by reimplementing or binding outstanding Python and Rust projects in Scala 3, ensuring developers remain within the Scala ecosystem. Personally, I believe Scala Native holds even greater potential and will become a strong competitor to Rust.

Ultimately, regardless of differing opinions, we must complete a full-fledged Scala ecosystem.

https://github.com/mullerhai/storch-numpy

and some of them have been published to the Maven Central repo for use

11 Likes

Some new languages like C3, Odin, Mojo, … have Vector/Tensor types embedded directly into the type system. Some even consider a scalar type a Vector with 1 lane to unify the type system. With these types they also provide operators that compile directly into GPU instructions or CPU SIMD instructions.
Given the SIMD situation on the JVM, I don’t think we can do this at the language level yet (I don’t know much about GPU programming on the JVM, so I can’t tell). But someone could try creating a library.
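On the CPU-SIMD side, the closest thing the JVM currently offers is the (still incubating) jdk.incubator.vector API, which a Scala library could wrap; a rough sketch of an element-wise multiply-add, assuming the incubator module is enabled with --add-modules jdk.incubator.vector:

import jdk.incubator.vector.{FloatVector, VectorSpecies}

object Simd:
  private val species: VectorSpecies[java.lang.Float] = FloatVector.SPECIES_PREFERRED

  // dst(i) = a(i) * b(i) + c(i), vectorized where the hardware allows it.
  def fma(a: Array[Float], b: Array[Float], c: Array[Float], dst: Array[Float]): Unit =
    var i = 0
    val upper = species.loopBound(a.length)
    while i < upper do
      val va = FloatVector.fromArray(species, a, i)
      val vb = FloatVector.fromArray(species, b, i)
      val vc = FloatVector.fromArray(species, c, i)
      va.mul(vb).add(vc).intoArray(dst, i)
      i += species.length()
    // scalar tail for whatever is left over
    while i < a.length do
      dst(i) = a(i) * b(i) + c(i)
      i += 1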

2 Likes

This is fantastic! What is the relation between the work in this fork and the upstream lib - are your PRs planned to be merged, or will you continue with a diverging fork? I’m also wondering how the docs will develop so that beginners can get going with these recent developments.

It would also be good to know where potential contributors can make the most difference depending on their knowledge level, in terms of good first issues etc.

Just adding another 5 cents: I also think that it’s probably better to invest in Scala Native than in a new Python backend.

One potential quick win would be native wrappers for native runtimes like ONNX runtime or MLPack.

Right now, a traditional pipeline might look something like:

Data Processing in Spark (either Scala or Python) -> Training in Python -> Inference in Python

With this, one could easily move to:

Data Processing in Spark (either Scala or Python) -> Training in Python -> Inference in Scala Native

Which could also tip the scales to move the Data Processing to Scala instead of Python.
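Such native wrappers don’t exist yet, but as an illustration of what the inference step looks like, here is a rough sketch against the existing ONNX Runtime Java bindings (ai.onnxruntime) on the JVM, assuming a model with a single 2-D float input and a float output; a Scala Native wrapper over the ONNX Runtime C API could expose much the same surface:

import ai.onnxruntime.{OnnxTensor, OrtEnvironment, OrtSession}
import scala.jdk.CollectionConverters.*

object Inference:
  // Load an exported model and score a single row of features.
  def predict(modelPath: String, features: Array[Float]): Array[Float] =
    val env = OrtEnvironment.getEnvironment()
    val session = env.createSession(modelPath, new OrtSession.SessionOptions())
    try
      // Shape (1, nFeatures): one row of input features.
      val input = OnnxTensor.createTensor(
        env, java.nio.FloatBuffer.wrap(features), Array(1L, features.length.toLong))
      val inputName = session.getInputNames().iterator().next()
      val result = session.run(Map(inputName -> input).asJava)
      try result.get(0).getValue().asInstanceOf[Array[Array[Float]]].head
      finally result.close()
    finally session.close()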

There’s also the fact that it might not be that easy to convince people to move the training code to Scala: If you have a Data Engineering / Applied Science split, it’s likely that your Applied Science team will be more familiar with Python and more resistant to change (they have more things to worry about), while the Data Engineering team might be more open to move to Scala (after all, they are the ones that have to deal with pipeline errors due to type mismatches).

3 Likes

I want to merge, but the change is too huge for the author to review. I raised a PR two months ago, but I need to keep waiting.
I have written a tutorial for beginners to try out, but it is not easy to use yet, because we also need some packages to make things easier.
As for future direction, I think we need to make Scala 3 on the JVM more convenient for deep learning, and then migrate AI coders to Scala Native; otherwise we will fall behind Rust.
Most importantly, Scala 3 on the JVM and Native need to support vector and tensor types in the type system, with GPU support.

1 Like