Compiling Scala to Python as a new platform for data engineering and AI

Seems there are many languages

1 Like

I’m sympathetic to this line of reasoning – in airy theory, it seems like it could be quite useful. The question is what the level of effort would be, even at a ballpark level.

If this was a commercial project, I’d task somebody with a time-bounded research spike: spend, say, a month digging into this seriously, kicking the tires and experimenting, with an aim to finish with a report on what’s easy and what’s not, where the major problems are, and maybe a bit of simple prototype that cross-compiles a small subset of Scala successfully.

Whether any of the stakeholders have a sufficiently senior engineer to spare for a month is, of course, another question. (If somebody is between gigs and looking for a project to work on in the meantime, this might be a terribly interesting one.)

2 Likes

More various thoughts, after pondering this for a while more (note that my knowledge of Python is extremely shallow, so it’s possible that I’m off-base in some places):

Some of the above discussion is talking about Python dialects. I suspect that we should be cautious there: the goal (as I understand it) is being able to work in conventional Python environments, with the major Python AI libraries. So a dialect only seems relevant if it can speak to those libraries.

Based on the history of ScalaJS and ScalaNative, I would guess that a significant fraction of the effort here would be spent dealing with:

  • JVM types. That’s probably mostly not rocket science, but it’s the big chunk of the iceberg below water. We assume a bunch of these types in routine Scala programming, and we would need reasonably API-compatible implementations of them in order to do much. (Possibly the Java project has made progress there; I don’t know.)
  • Concurrency. Idiomatic Scala code tends to assume that Future is a thing, and translating those idioms often involves creating a fair amount of our own runtime in order to support that assumption.
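To make the concurrency point a bit more concrete, here is a minimal sketch of the kind of runtime shim a Python backend might need so that `Future { ... }`, `map` and `flatMap` keep working. Everything here is an assumption for illustration: `ScalaFuture`, `apply`, `flat_map` and the pool size are hypothetical names and choices, not part of any existing project. It layers Scala-style combinators over Python's standard `concurrent.futures`.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# Hypothetical runtime support; a real backend might pick asyncio instead.
_pool = ThreadPoolExecutor(max_workers=4)


class ScalaFuture:
    """Wraps a concurrent.futures.Future with Scala-style combinators."""

    def __init__(self, underlying: Future) -> None:
        self._underlying = underlying

    @staticmethod
    def apply(body: Callable[[], Any]) -> "ScalaFuture":
        # Rough counterpart of Scala's `Future { body }`.
        return ScalaFuture(_pool.submit(body))

    def map(self, f: Callable[[Any], Any]) -> "ScalaFuture":
        out: Future = Future()

        def on_done(done: Future) -> None:
            # Propagate either the transformed value or the failure.
            try:
                out.set_result(f(done.result()))
            except Exception as exc:
                out.set_exception(exc)

        self._underlying.add_done_callback(on_done)
        return ScalaFuture(out)

    def flat_map(self, f: Callable[[Any], "ScalaFuture"]) -> "ScalaFuture":
        out: Future = Future()

        def on_inner(done: Future) -> None:
            try:
                out.set_result(done.result())
            except Exception as exc:
                out.set_exception(exc)

        def on_outer(done: Future) -> None:
            try:
                inner = f(done.result())
            except Exception as exc:
                out.set_exception(exc)
                return
            inner._underlying.add_done_callback(on_inner)

        self._underlying.add_done_callback(on_outer)
        return ScalaFuture(out)

    def result(self, timeout: float = None) -> Any:
        # Blocking wait, comparable to Await.result in Scala.
        return self._underlying.result(timeout)


# Usage: the shape of a small for-comprehension over futures.
total = (
    ScalaFuture.apply(lambda: 20)
    .flat_map(lambda a: ScalaFuture.apply(lambda: a + 1))
    .map(lambda b: b * 2)
)
answer = total.result(timeout=5)
```

Even this toy version has to decide where callbacks run and how failures propagate, which hints at why the concurrency runtime tends to be a real chunk of the work.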

That said, translating all of Scala might be a second-order priority here. The right cognate might not be ScalaJS, but Spark.

The analogy is pretty precise: a major use case where people do their exploratory work in Python, but often write their production code in Scala for reliability and maintainability. We tend to pay relatively little attention to Spark here, but my observations suggest that it accounts for a large fraction of the actual Scala work happening on the enterprise scene.

AFAIK, Spark doesn’t support quite all of Scala; it wouldn’t astonish me if a subset also sufficed for at least a lot of practical AI use cases. So we might not have to solve the entire problem in order to get something initially useful, as a solid proof of concept.

But really, I suspect we can talk all we like, but we’re probably going to need somebody seizing the bull by the horns and starting to commit code in anger in order to start having fully-informed opinions.

A “hello world” would be a start; a backend sufficient to do some basic AI calls would be a genuine proof of concept, and it wouldn’t surprise me if you could get there with a fairly modest subset of functionality. And that proof of concept might well suffice to get folks jumping onto the project. If someone’s feeling ambitious, it seems like a fine focus.

2 Likes

Scala doesn’t seem to have much success with backends other than the JVM. I haven’t seen any real world use case for Scala Native till now. I’m very pessimistic about a new backend.

I think a more realistic option is to deepen cooperation with the OpenJDK community. HAT (Heterogeneous Accelerator Toolkit) is a subproject of Project Babylon, which provides a GPU backend for the JVM platform. Here is a good article introducing it: Babylon OpenJDK: A Guide for Beginners and Comparison with TornadoVM. I thought maybe it would be possible to reuse parts of it to provide the GPU backend for Scala.

8 Likes

I think ScalaJS has been a success as another platform outside the JVM, having been production ready for quite some time. Scala Native is not yet 1.0, and when it reaches production readiness and stability it might well take off like Scala JS did, especially for use cases where GraalVM is not attractive.

Project Babylon may bring nice GPU access, but there is still the big landscape of Python tools, e.g. those available on Hugging Face, that might be interesting to tap into.

JVM types. That’s probably mostly not rocket science, but it’s the big chunk

Storch has made a whole forest of various types of different precisions etc: storch/core/src/main/scala/torch/DType.scala at main · sbrunk/storch · GitHub

Yes, the fact that Scala Native has not yet been officially released may be the reason why almost no one uses it. But even with Scala.js as a precedent, Scala Native has been in development for ten years without an official version, which is itself a cause for concern. This makes me pessimistic about whether a new backend can be put into practical use.

Additionally, although Python has a long history, it is not a legacy platform, and there aren’t many people actively trying to abandon it. Borrowing from the Python ecosystem can enhance Scala’s capabilities, but it also faces competition from Python. I think competing with Python is very challenging, and relying on the Python ecosystem makes it hard for Scala to build its own ecosystem. It can only serve as a glue language, and users can easily revert to Python.

2 Likes

Yeah, but that’s exactly my point: one place where Scala is quite successful is working hand-in-glove with Python, for Spark.

Python is a fine language for exploration and development, but a somewhat weak one for long-lived products. We’ve had sustained success with large companies using Python to design their Spark environments, and Scala to maintain them.

I suspect that that’s where Scala fits into the AI equation. We shouldn’t try to supplant Python – we’ll just lose. But providing a robust, strongly-typed way to build long-lived, highly maintainable AI applications seems like a very plausible niche. Indeed, using Spark very explicitly as the well-established proof of concept for the model seems like a likely way to sell it into the enterprise market.

We shouldn’t be thinking in terms of competition, but complementarity. Python and Scala fit into different parts of the lifecycle for these sorts of tools: we should lean into that.

2 Likes

Recently, I was also interested in working with Python and was looking for interop; however, I agree more with @Glavo regarding the approach.

Even the recent survey from VirtusLab shows quite low native adoption, and even JS is still niche in the Scala ecosystem. So, in my opinion, instead of investing in yet another project it is better to double down on the existing stack: improving native integration (including Python) and improving tooling so that integration with native code is trivial (with or without Panama or Babylon). With the Scala community currently shrinking, I just don’t think it’s feasible to maintain yet another compilation target.

Also, it might be worth shifting priorities to WASM, which also seems to be getting traction in the AI space.

3 Likes

I agree that Spark is successful and does the right things: It has built an ecosystem with its own core competitiveness and allowed Python to use its ecosystem. But I think the Python backend does the exact opposite. Think about this: if Spark was written in Python and the Scala API was just a binding, could you still get those users to switch from Python to Scala? I don’t think so. I think Scala being the native language of Spark played a huge role in this process. After losing this core advantage, Python’s advantage in terms of number of users is enough to overwhelm most languages.

In my opinion, the right way to get more Python users to try Scala is to develop more frameworks in Scala and create Python bindings, like Spark, rather than the other way around. Even Scala Native is much more important than a Python backend in this regard, because CPython is notoriously slow, and using Python as the backend would greatly undermine Scala’s advantages, while implementing libraries in Scala Native and creating Python APIs looks more attractive. In addition, I think technologies such as GraalPython are also worth considering.

4 Likes

“It’s better to put resources elsewhere.”

I agree with @bjornregnell that arguments of that form are not helpful. Either someone finds this idea interesting and somehow gets the time to do it, and it will be done, or no one does, and it won’t. It’s not like we have someone with nothing on their hands that we would choose to assign this project to, as opposed to something else.

“Scala.js is niche”

Well … it’s just not. Several surveys agree on about 20% of Scala users using Scala.js. Name any other cross-compiling language where there is such a high ratio of people targeting JS. Now you’ll say: but 20% of the Scala userbase is nothing compared to, e.g., TypeScript users. Sure, that’s true; but if that’s your metric, you can also leave Scala/JVM behind so :man_shrugging:

“Scala Native is not production ready, so how would Scala/Python make it?”

Now it’s time to look at the technical aspects. Scala’s core expertise lies in how to interoperate with the host languages of its target platforms. Without that, none of its platforms would stand a chance. Scala/JVM works because it can leverage Java and other JVM libraries. Scala.js can leverage JavaScript libraries. Both those platforms have a really good story for interop, and that is why they work.

Writing a compiler backend is easy (the compiler part of Scala.js had a full prototype in 2 months). Designing language features for interoperability is the true challenge of targeting a new platform. It is also the main challenge with Scala.js-on-Wasm-without-JS.

Scala Native follows suit in the problem it’s trying to address. However, the interop problem is much harder on Native than on the JVM or JS. The gap between a GCed object model and a linear memory model is huge. Much bigger than between the statically typed nature of Scala and the dynamically typed nature of JavaScript (for Scala.js, the biggest gap was overloading semantics; not static typing at all). That’s why Scala Native has not reached the level of maturity that Scala.js has. It’s not because it’s lacking resources (on the core, there are more resources on Scala Native than Scala.js).

Python is semantically much closer to JS than to native. Writing a backend for Python can probably be prototyped in a few months, like for JS. Designing interop will take longer, but based on the experience of Scala.js and Scala/JVM before it, we could have something decent to play with in a few more months.
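As a rough illustration of what "emitting Python" means at its very simplest, here is a toy sketch that turns a tiny expression tree into Python source and runs it. It is nowhere near a real backend: the node types (`Lit`, `Ref`, `BinOp`, `If`, `ValDef`) and the `emit_expr` / `emit_program` helpers are hypothetical names made up for this example, and it covers none of the hard interop questions discussed above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Lit:
    value: int


@dataclass
class Ref:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


@dataclass
class If:
    cond: "Expr"
    then: "Expr"
    orelse: "Expr"


@dataclass
class ValDef:
    name: str
    rhs: "Expr"


Expr = Union[Lit, Ref, BinOp, If]


def emit_expr(e: Expr) -> str:
    """Translate one expression node into Python source text."""
    if isinstance(e, Lit):
        return repr(e.value)
    if isinstance(e, Ref):
        return e.name
    if isinstance(e, BinOp):
        return f"({emit_expr(e.left)} {e.op} {emit_expr(e.right)})"
    if isinstance(e, If):
        # Scala's `if` is an expression, so it maps onto Python's
        # conditional expression rather than an if statement.
        return (f"({emit_expr(e.then)} if {emit_expr(e.cond)} "
                f"else {emit_expr(e.orelse)})")
    raise TypeError(f"unsupported node: {e!r}")


def emit_program(defs: List[ValDef], result: Expr) -> str:
    """Emit one assignment per val, then bind the final value to `result`."""
    lines = [f"{d.name} = {emit_expr(d.rhs)}" for d in defs]
    lines.append(f"result = {emit_expr(result)}")
    return "\n".join(lines)


# Roughly: val x = 1 + 2; if (x > 2) x * 10 else 0
program = emit_program(
    [ValDef("x", BinOp("+", Lit(1), Lit(2)))],
    If(BinOp(">", Ref("x"), Lit(2)), BinOp("*", Ref("x"), Lit(10)), Lit(0)),
)
namespace: dict = {}
exec(program, namespace)  # run the generated Python
```

The translation itself is mechanical here, which fits the point that the compiler part is the easy bit; none of the interop design shows up in a sketch like this.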

A significant amount of work in writing Scala.js was to write the JDK libraries. This work has already been reused to a large extent by Scala Native. It is even easier now, because we’re making it less dependent on JS in order to target standalone Wasm. That’s a whole lot of work that a Python backend will get for free.

So this is very much doable. Yes, it’s going to be slow to execute; but no slower than Python itself (Scala.js is not slower than JS; sometimes it’s faster). Yes, it’s going to be yet another backend. But if there is one person who finds that challenge interesting and has a bit of experience writing a compiler, they could rely on a lot of existing work, knowledge and expertise in the Scala compiler landscape.

11 Likes

I don’t see it: in what way is the fact that Spark is written in Scala even relevant? In my experience, most Spark users – Python and Scala users alike – aren’t aware of the fact that Spark is written in Scala.

What matters is that Python makes it easy to explore your data, and Scala makes it easy to productionize those explorations. Far as I can tell, that’s why companies use them. Spark’s origin doesn’t matter; what matters is that Scala results in more maintainable systems than Python. And it looks to me like the same arguments likely apply to the AI space.

That doesn’t mean it would succeed, of course. But “you’re probably already using Scala for something similar, for the following reasons” is a good argument to put into a whitepaper.

2 Likes

I debated internally a lot on the answers before I realized it’s because I’m at odds with the premise itself of the SIP-meeting discussing this. Where did you garner sentiment that the Scala community cares about this? It has certainly not been my experience in 20 years at 7+ different places. Or maybe it’s not about what the Scala community wants but about something you think we should care about and that would be beneficial but we don’t know it yet (“we” here meaning the community).

Naturally one is free to work in whatever they want, but it seems here you are being propelled by the interest in addressing a pain point. I’d like to know where you (the SIP-meeting you, not singular you) got this idea, since from where I stand those good intentions seem unjustified.

I think it’d scratch the “I want to work with those libraries without being forced to use python in my personal projects because no company is going to let me do this anyway” itch, like when I wrote a scala-lua transpiler to do scripts for some game engines, but I don’t believe 1 in a hundred scala developers are in that position.
The other technical questions were masterfully addressed by srjd.

All in all, these kinds of questions of “would working on this be of interest to the community” are always terrible: Scala-contributors is the worst kind of representation of the scala-community. It’s like going into a rich private neighborhood to ask for sentiment regarding political matters. You’ll only get the most slanted and narrow perspectives possible. Frankly I’d rather you just do it (or not) instead of asking here, and I say that as a member of the community that sometimes happens to visit Scala contributors, because every other Scala programmer I know does not and would not.

By the end of each SIP-meeting, when we have gone through all pending SIP-proposals, we often, if time permits, discuss various things related to the Scala language. This time I proposed to discuss Scala for AI and the relation to existing AI tools in Python - so don’t “blame” the entire set of SIP members for setting some kind of “agenda” that you may disapprove of - I am the only one to “blame” for bringing this up.

Or maybe it’s not about what the Scala community wants but about something you think we should care about

I have some anecdotal evidence (I cannot claim any scientific validity, but these are at least anecdotes from independent sources) that people have considered Scala for AI applications, but chose Python, not because of the language, but because of the AI tools available. Thus, I have a hypothesis (true or false) that it would be useful for some Scala developers if there could be some improved interop. I have no plans to do this myself, but I think there are opportunities beyond the most shallow interop, such as just communicating between different processes, which will be even slower than Python because parsing and unparsing are needed for each data transfer.

always terrible […] instead of asking here, and I say that as a member of the community

I am a bit perplexed by your negative tone, but I may have misinterpreted the underlying sentiment. My apologies if I upset you by starting this post - that was not my intent. I just wanted to elicit different views on the topic, for those who care to chime in - this is after all a discussion forum. Thanks anyway for taking the time to express your views.

2 Likes

No intention of blaming from me; I was just trying to understand how it happened so I could properly answer your questions.

Regarding the anecdotal evidence, I think that was my main point at the end regarding where one garners sentiment on a topic and the plurality of it. I’m also sure that every human is likely to inflate the non scientific numbers of their perception based on what they consider better (like when you think, this is better for them even though they don’t know it yet!). All done in the best of intentions.

I am a bit perplexed by your negative tone

I tried as hard as I could to keep it neutral, though dissenting. It’s hard not to take or convey dissent as negativity though. My apologies if my efforts weren’t good enough.

1 Like

Thanks for your reply. No worries.

Compiling to Python may seem like a bad idea to many and a good idea to some, but I am still happy for all the insightful answers in this thread and I am learning a lot. Thanks again.

In terms of artificial intelligence, Scala has fallen far behind many other languages, especially Rust, Python, and Go. We lack some crucial libraries and toolchains, and the ones we have are currently very incomplete. There is a shortage of talent in Scala. It has even fallen behind some niche languages. This is a disgrace to us. We must enrich our AI core libraries and toolchains and promote them well. We need scala-numpy, scala-pandas, scala-plot, scala-torch, scala-transformers, scala-pickle, scala-gym, scala-vllm, scala-triton, scala-tensorRT, scala-deepspeed, and scala-sglang.

5 Likes

Agreed, but I think a lot of those should be JVM or native bindings to the native code versions, not some sort of weird compile-to-Python backend. There is some reason to use Scala as even-slower-Python-that-keeps-track-of-types-better, but not very much.

If Scala has the basics at Python-speed but by default lets you do all the custom stuff at JVM speed rather than Python speed, that already is an argument to use Scala instead of Python even if what you’re doing is all data-dependent so whether you have types or not is relatively unimportant.

This doesn’t mean that the Python backend is necessarily undesirable; you might want that as a way to write in Scala and consume in Python. But if we’re doing it as a way to consume Python libraries, it’s hard for me to see how it’s a big win over simply using Python or using an interface layer like ScalaPy, SoS in Jupyter, Polynote, etc.

6 Likes

I guess that is also true for Java then?

Storch had its last commit in Feb 2024 and it is still on PyTorch 2.1, while PyTorch is moving fast and is now on 2.7.

Developing and maintaining all those libs you mention from scratch would need many new maintainers and contributors diving in. And it would require GPU support, which is not available yet if I understand the JVM situation correctly.

So compiling to Python could perhaps be a quicker way to develop the interop…

Well, as pointed out, much performance-critical Python actually runs at C speed. And it’s also about accessing NVIDIA’s latest GPUs conveniently, possibly by standing on the shoulders of those who already did it…

2 Likes

The traditional problem with JVM ↔ Python interop is that each has a separate heap and data needs to be copied between them. This tends to be too slow for large data, which is typical of AI applications. I don’t know whether this is about to change with Panama or other efforts.
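For readers less familiar with the copy-versus-share distinction, here is a small Python-only analogy. It does not involve the JVM at all and is only an assumption-laden illustration: the serialize/deserialize round trip stands in for moving data between two separate heaps, and `memoryview` stands in for two sides looking at the same memory.

```python
import array
import pickle

# Stand-in for a large numeric buffer such as a tensor.
data = array.array("d", [float(i) for i in range(1000)])

# Copying path: serialize on one side, deserialize on the other.
# Every element is written out and then rebuilt in a fresh buffer.
wire = pickle.dumps(data)
copied = pickle.loads(wire)
copied[0] = -1.0  # changes only the copy, not the original

# Sharing path: a memoryview exposes the same memory without copying.
shared = memoryview(data)
shared[1] = -2.0  # visible through the original array as well
```

After this runs, `data[0]` is unchanged while `data[1]` reflects the write through the view. With a few thousand elements the copy is negligible, but the cost grows with data size, which is the concern raised above for large AI workloads.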

2 Likes