Compiling Scala to Python as a new platform for data engineering and AI

bjornregnell · May 24, 2025, 10:05am

Yesterday at the latest SIP meeting we discussed how the Scala community can get better access to Python libraries such as numpy, pandas and tensorflow. We currently have ScalaPy, but it relies on the JVM and there seem to be some challenges with maintaining the interop. Although Project Panama might open new opportunities, there is another intriguing possibility: drawing on the experiences from ScalaJS and Scala Native to develop a compiler from Scala to Python.

Each of the existing Scala platforms has its compelling selling points:

ScalaJVM: the great back-end ecosystem
ScalaJS: unlocking type safe front-end apps and the js ecosystem
Scala Native: interop with C/C++ to make native binaries

With Scala compiled to Python we would unlock interop to enable rich data engineering and the whole ecosystem of LLM tooling etc. We would then further our great story of accessing ecosystems based on old languages using a modern, convenient and safe language…

What do you think? What are the pros and cons? What are the engineering challenges? What would it take to achieve a Scala-to-Python compiler?

AMatveev · May 24, 2025, 10:34am

IIUC: Python is not designed to support a multi-language paradigm. There are no such things as:

source maps
bytecode specification

Is it possible at all to implement good debugging across different languages?
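To make the source-map question concrete: Python has no standard source-map format, but a compiler could emit its own side table and use it to translate traceback frames back to Scala positions. The following is only a minimal sketch under that assumption. The table `LINE_MAP`, the file names `Main.scala` and `main_generated.py`, and the generated `mean` function are all hypothetical; no existing tool is implied.

```python
import traceback

# Hypothetical side table that a Scala-to-Python compiler could emit next to
# its output: generated Python line -> (Scala source file, Scala line).
LINE_MAP = {
    2: ("Main.scala", 10),
    3: ("Main.scala", 11),
}

# Hypothetical compiler output for a small Scala method.
GENERATED = (
    "def mean(xs):\n"
    "    total = sum(xs)\n"
    "    return total / len(xs)\n"
)
GENERATED_NAME = "main_generated.py"


def remap(exc):
    """Return (file, line, function) frames, using Scala positions where known."""
    frames = []
    for frame in traceback.extract_tb(exc.__traceback__):
        if frame.filename == GENERATED_NAME and frame.lineno in LINE_MAP:
            scala_file, scala_line = LINE_MAP[frame.lineno]
            frames.append((scala_file, scala_line, frame.name))
        else:
            # Frames outside the generated code are passed through unchanged.
            frames.append((frame.filename, frame.lineno, frame.name))
    return frames


# Load the generated code under its own file name so tracebacks refer to it.
namespace = {}
exec(compile(GENERATED, GENERATED_NAME, "exec"), namespace)

try:
    namespace["mean"]([])  # divides by zero on generated line 3
except ZeroDivisionError as exc:
    remapped = remap(exc)
    print(remapped[-1])  # ('Main.scala', 11, 'mean')
```

This covers only post-mortem tracebacks. Breakpoints and stepping in a debugger would need the same mapping applied inside the debugger itself, which is a much larger piece of work.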

bishabosha · May 24, 2025, 10:34am

Then there’s Mojo, which embeds Python semantics in an otherwise independent native language.

bjornregnell · May 24, 2025, 10:44am

Interesting! From the Wikipedia article on Mojo I learned that some of the Swift engineers are behind it…

So what do you think @bishabosha: would debugging (as @AMatveev asked about) and other important utilities given by a statically typed language be feasible with a Scala-to-Python compiler? When I click “Goto Definition…” in a VS Code ScalaJS project I get right into the std-lib with the JS facades etc., which is nice.

scalavision · May 24, 2025, 1:14pm

Wouldn’t it be better to spend those resources on Scala Native and simplify C integration even more? Then we could create even better APIs for machine learning than those that already exist in Python. We could reuse the C implementations behind numpy and pytorch, and also create APIs with CUDA etc.

I have worked in the Python ecosystem for 5 years. I think it’s completely unrealistic to see Python developers learning and using Scala for these things. Also, the packaging of Python libraries etc. is far behind what we have in Scala imho.

Look at how they did this in the Elixir world and what they have now:

A lot of this uses C under the hood, and as far as I know they also integrate with the Python ecosystem.

dwalend · May 24, 2025, 2:09pm

Lego’s Spike Prime uses a MicroPython operating system. https://assets.education.lego.com/v3/assets/blt293eea581807678a/bltf512a371e82f6420/5f8801baf4f4cf0fa39d2feb/techspecs_techniclargehub.pdf

(Lego’s Ev3 can run Debian Linux and JDK 11; Scala’s worked great but the kids are wearing out the hardware.)

bjornregnell · May 24, 2025, 2:30pm

It would be interesting (perhaps in another thread) if you could give a bit more detailed feedback on how the Scala Native C integration can be simplified even more.

If I understand it correctly, some of those libraries have significant parts written directly in Python that are not covered by the underlying C layer? With a mechanism similar to the one in ScalaJS, using py.native-annotated facades, we would benefit directly from the appreciated API design.

But if the effort is not too big, there is an opportunity to create even better API experiences given Scala’s unique abstraction mechanisms that give both convenient AND safe code… Disregarding resource constraints, I’m wondering if we cannot have both?

bjornregnell · May 24, 2025, 2:31pm

Interesting! Was it the resource-hungry JVM that exhausted the hardware? Would Scala Native have a small enough footprint for that hardware? Or would Scala-to-Python compilation have helped here?

beyondpie · May 24, 2025, 2:39pm

I’m not sure if this is relevant. The R language provides a way to access CPython directly: Interface to Python • reticulate.

If we can do a similar thing using Scala Native, that would be fantastic! As I recall, as long as we can access CPython smoothly, we can use whatever Python packages we want. Here is their implementation: GitHub - rstudio/reticulate: R Interface to Python
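The reason access to CPython is enough for "whatever Python packages" is that an embedding host reaches everything through a few generic, name-based operations: import a module, look up an attribute, call the object. In the CPython C API these correspond to `PyImport_ImportModule`, `PyObject_GetAttrString` and `PyObject_CallObject`. The sketch below shows the same three steps from the Python side. The helper `call_python` is a made-up name for illustration, and this is not a description of how reticulate itself is implemented.

```python
import importlib


def call_python(module_name, function_name, *args):
    """Resolve a function by module and attribute name, then call it.

    Hypothetical helper: it performs the same three steps an embedding
    host would perform through the C API (import, attribute lookup, call).
    """
    module = importlib.import_module(module_name)
    function = getattr(module, function_name)
    return function(*args)


# Nothing here is specific to one package; only the names change.
print(call_python("math", "sqrt", 16.0))    # 4.0
print(call_python("json", "dumps", [1, 2]))  # [1, 2]
```

Because every call goes through string-based lookups, a host language gets no static types for free. That is the gap that typed facades on the Scala side would have to fill.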

In the meantime, we need to modify our Shell to better support interactive jobs.

dwalend · May 24, 2025, 2:54pm

[quote=“bjornregnell, post:8, topic:7129”]
Was it the resource-hungry JVM that exhausted the hardware?[/quote]

No - It’s 5th graders. Everyone drops the robot once. Small fractures in the battery slowly reduce its capacity. The wires eventually fail due to metal fatigue while the kids scold each other about not towing the robot by the wires.

The main challenge using Scala on JVM/Ev3 is the slow start-up time vs. kids’ attention span. I made a wrapper process to cut down the time to run a new program from about 2 minutes to about 5 seconds. (Class loader hack - it’s exactly what you imagine.)

[quote=“bjornregnell, post:8, topic:7129”]
Would Scala Native have a small enough footprint for that [Ev3] hardware?[/quote]

Probably. Ev3 has an Arm7. It should work fine with the latest Scala Native. I quit chasing that solution once I got the tax to load a new program under 5 seconds. I’m not sure compiling a new program to native could beat that.

Catching all the syntax errors before the code goes to the robot is a huge win over Python - fewer cycles, no need to fish errors off the robot, and a lot less cryptic debugging.

bjornregnell · May 24, 2025, 3:06pm

Sounds like a really cool project! I also have experience with kids programming here in Sweden, using Scala compared to Python. We use Kojo for turtle graphics and it is really striking how the compiler error feedback loop preserves more of the kids’ grit, as much less runtime debugging is needed.

[quote]
I’m not sure compiling a new program to native could beat that[/quote]

Well, the startup time for a Scala Native binary should be down to milliseconds compared to seconds, so I recommend giving it a try. (Although the compile and build times will increase compared to incremental builds of JVM apps in sbt.)

bjornregnell · May 24, 2025, 3:09pm

Interesting! (And a bit ironic that Python first made a rip-off of R and turned it into pandas, and now R crosses to the other side of the river to get water… )

I think both solutions would be interesting in concert: convenient CPython access from Scala Native and Scala compiling to Python. Esp. since there seem to be quite a few Python parts living outside of C-land.

bjornregnell · May 24, 2025, 3:19pm

But if Scala compiled to Python, I guess the targets would run really nicely in a MicroPython env? @dwalend

dwalend · May 24, 2025, 7:53pm

I did try compiling to native 4-5 years back. Ev3 is 32-bit - a recent feature for Scala Native.

sbt? Way too ponderous, slow, and baroque for grade school to even consider. We moved from mill to bleep recently. The 6th-graders were OK modifying a bleep plug-in.

dwalend · May 24, 2025, 7:58pm

That’s the hope. MicroPython is a subset, so it’d be easy to botch the use case.

aboisvert · May 24, 2025, 9:25pm

As others have said in different words, compiling to Python seems to be a very leaky abstraction… I wouldn’t expect there to be as much lasting motivation for this “target runtime” as there is for the web browser.

There is a common saying that “Python is just a thin wrapper around C libraries”, which is unfair and, yet, has some unignorable truth to it.

With Project Panama just around the corner, I wonder if the path of least effort would be:

1. repackaging C libraries used by popular Python libraries for consumption by the JVM
2. creating common low-level Java bindings for said libraries
3. creating idiomatic bindings for the various JVM languages (Scala, Kotlin, Clojure, Java, etc.)

Steps 1 and 2 can be shouldered by the much broader ecosystem of JVM enthusiasts. Joining forces seems to be the smarter way. The Java-language ecosystem is already headed in that direction… why not ride that wave?

As for Scala Native, there would be some additional work to mimic the low-level Java bindings, but beyond that it should be able to reuse the high-level Scala bindings as well. This would ensure that Native can tap into the full “bare-metal” performance of the C libraries without JVM abstraction costs.

hepin1989 · May 25, 2025, 4:50am

Why not just invest more in Scala Native :)
I know there is a library for deep learning which is built with Scala native: GitHub - sbrunk/storch: GPU accelerated deep learning and numeric computing for Scala 3.

Investing resources in Scala native and Scala js may be more conducive to concentrating efforts on major projects. Otherwise, it is difficult for us to see how far Scala Python can go. At present, we already have Scala native, Scala js, Scala wasm and Scala jvm. Only by concentrating on providing better Scala native and better GPU integration can we make better use of the current limited resources.

I think scala-native will have much more potential once Capture Checking is stable.

bjornregnell · May 25, 2025, 10:01am

Do you mean that your prediction is that the browser as a runtime will go away?

[quote]
With Project Panama just around the corner, I wonder if the path of least effort would be[/quote]

Yes, this is one way to enable interop. I think there are different use cases and requirements regarding what platform different orgs are willing to rely on. My observation, which triggered this post, is that many say that the AI tools locked into the Python ecosystem are preventing them from choosing Scala. So there are some use cases, esp. in data engineering and AI, that we should consider how to support, and perhaps Python as a platform is one way.

bjornregnell · May 25, 2025, 10:12am

I agree that it would be nice if Scala Native enabled convenient access to GPU acceleration. But there is still a use case for unlocking the APIs written in Python for Scala-based systems.

[quote]
Investing resources in Scala native and Scala js may be more conducive[/quote]

Well, there is some tradeoff in terms of central resources. But resource “allocation” in open source communities is a bit different than in a hierarchically governed company. In open source, the “resources” (i.e. the engaged human contributors) are working on what they think matters and what they find rewarding and useful in their view. As I understand it, both Scala Native and ScalaJS came about through the enthusiasm and dedication of individuals who did this because they wanted it and had the knowledge to do it - I guess it was not the case that they would have stopped being interested in this just because some boss told them that there are other more important things…

If there are Scala enthusiasts who need to get hold of the vast code base in Python for the AI and number crunching tools, there might be new “resources” (i.e. enthusiastic devs) who would do this without “stealing” resources from other projects.

Just look at huggingface to see the big open source code bases written mostly (?) in Python. I’d guess that Scala’s convenience and safety offer might be quite attractive, if it can interoperate.

bjornregnell · May 25, 2025, 10:23am

BTW as a side note: For those interested in open-source strategy from a corporate management perspective, I have some papers on that topic, including this one:

What to share, when, and where (in Empirical Soft Eng (Springer) together with Johan Linåker)