Spark as a Scala gateway drug and the 2.12 failure

ghawkins · March 28, 2018, 2:17pm

Scala 2.12 has been out for more than a year now (it was released back in November 2016).

I would guess that historically Spark has been a big source of new recruits to the Scala world. People who might not necessarily have been interested in discovering a new language started using Spark because it addressed specific problems that they had to deal with and through this they discovered Scala.

So it’s disappointing that Spark, rather than leading the Scala charge, is actively holding up the move to 2.12 for a large element of the community. As a major part of many people’s infrastructure the availability, or lack of it, of a Scala 2.12 version of Spark affects their take up of the current Scala release.

Progress on moving to 2.12 is tracked by SPARK-14220. If you look at it you can see that there are many subtasks and that the majority of them are marked as resolved. This gives the impression of a large amount of progress with little more to do, however, if you’ve been tracking this item you’d have seen that little or no real progress has been made on this in the last year. I.e. it’s not going anywhere quick.

Most of the resolved subtasks are trivial (waiting for the availability of 2.12 versions of various libraries and updating to them once available) and many were already marked as done by the time of the final release of 2.12 or shortly afterward.

The final two items (SPARK-14540 [1] and SPARK-14643 [2]) show little sign of active progress. My impression is that Josh Rosen and Sean Owen would very much like to get these out of the way but actually, no time has been prioritized to do so.

As such we’re left with “if you really care about this, try to work on [it yourself]” style comments. Fair enough - it seems not enough people do care about it for it to be prioritized by Hortonworks, Databricks, Cloudera or the other major active contributors to the Spark code base.

I find this surprising and problematic for the Scala community at large. If major projects stay on earlier versions of Scala then this splits the community and makes progress difficult. The lack of progress on SPARK-14220 (it’s essentially in stasis) doesn’t give much hope that it’ll even be resolved by the time Scala 2.13 release candidates start appearing in a few months.

Perhaps the Spark community has decided that Scala 2.11 is good enough and they’re happy to stick with it indefinitely but I think this attitude has and will freeze the progress of Scala in many major in-house projects all over the world.

Lightbend and others have made major efforts to help out on projects that are seen as very important to the health of the overall ecosystem but, for whatever reason, seem to have stalled. SBT being the most obvious case.

SBT is obviously more fundamental to almost everybody’s Scala workflow than Spark but I’d argue that Spark is big enough that left as it is it could act as a major drag on the take up of new Scala releases.

I don’t think we’re anywhere near a Python 3 moment but it would be a shame if the failure of major frameworks to shift to new versions of Scala led to a split up of the Scala world. Scala has made so much progress on eliminating the headache of libraries being tied to specific versions - all the effort to build popular libraries and pick up issues before new releases means that the most popular libraries are now generally available immediately along with the final release of a new Scala version.

With the release of Java 9 and 10 it seems Scala has been moving in the right direction while Oracle is moving in a direction reminiscent of the early years of Scala, i.e. major libraries breaking on new releases without new versions being lined up to play well with whatever changes have been introduced.

Anyway - as usual, I’ve produced a TL;DR message on something simple. Does the Scala community see it as an issue that a major framework like Spark remains stuck on a previous release of Scala more than a year after 2.12 became available? And if so might Lightbend or others consider contributing effort to resolving this issue?

Perhaps Spark has grown less important, it certainly seems to have lost much of the vim of its earlier years (though perhaps this is fair enough and healthy as something matures). But as part of the Scala specialization at Coursera (Big Data Analysis with Scala and Spark taught by @heathermiller) it still seems to be pretty central to the whole Scala offering.

Thanks for reading this far

Note: this was commented on before back in early February in “Spark and Scala 2.12” [3], when I was too busy on other things to weigh-in, but it didn’t seem to result in much follow-up then.

/George

Links that Discourse prevented me from including in clickable form as they exceeded my two links per post limit:

https://issues.apache.org/jira/browse/SPARK-14540
https://issues.apache.org/jira/browse/SPARK-14643
https://contributors.scala-lang.org/t/spark-and-scala-2-12/1576/5

jvican · March 28, 2018, 2:53pm

Thanks for bringing up this discussion. I’ve actually wondered many times why Spark contributors haven’t seen through the migration to 2.12, it doesn’t seem to have been a priority. Most of the work is done. IIRC there’s only one or two tests failing.

From what I know, a Scala 2.12 compatible version of Spark breaks some Spark APIs, which means that it can only be published in major versions. Major versions in Spark happen more or less yearly (at least judging from the past, @MasseGuillaume looked into the actual dates some weeks ago), the last one being last December.

I asked in one of the PRs when the next release will be, and Sean Owen said “6 months out” and that nobody was working on it.

Yeah, but keep in mind that Lightbend owns sbt. Maybe the communication between the Scala team and Spark developers should improve, but IMO Spark not supporting 2.12 is a major drawback and it’s not clear to me this kind of work should be done by the Community…

joshlemer · March 28, 2018, 3:44pm

In a similar vein, we are currently held back on 2.11 because Flink (arguably the main competitor to Spark) has not upgraded to 2.12 yet. Though, at least it is officially planned https://issues.apache.org/jira/browse/FLINK-5005

ghawkins · March 28, 2018, 4:34pm

I was tracking the Spark releases initially - but it just became depressing, one doesn’t get the impression that SPARK-14220 has slipped from one release to another but more that it’s not even really on the radar.

There’ve actually been three major releases of Spark (and three minor releases) since Scala 2.12 came out:

Spark 2.1 - Dec 28th, 2016.
Spark 2.2 - Jul 11th, 2017.
Spark 2.3 - Feb 28th, 2018.

So it looks like at least another 6 months (the roadmaps for Spark releases are usually fairly opaque and subject to repeated slippage) until there’s a window for this again - even if SPARK-14220 was on the schedule, which it doesn’t seem to be

Just to be clear I don’t think Lightbend or the community should be fixing this for Spark-ers, but like the Fed stepping in to address market failures I think it might be considered as a solution of last resort.

For whatever reason, the Spark development community doesn’t seem to be addressing this themselves and a knock on is that a large user base must effectively stick with Scala 2.11 indefinitely - a bad outcome for the Scala community as a whole.

som-snytt · March 28, 2018, 6:25pm

The market analogy is apt.

Perhaps 2.12 is not a big deal for Spark users. That can be OK. If there’s not a feature need driving adoption, then why invest resources?

I don’t know much about the problem space, but I assumed 2.13 is an opportunity for improved REPL support for Spark, with other fixes.

jducoeur · March 28, 2018, 7:11pm

(NB: I’m not a Spark user myself yet, but I agree with @ghawkins that this is likely to become a blot on Scala’s general reputation, deserved or not: folks outside a community don’t tend to distinguish the fine details of who is responsible for what, and often just lump it all together as “Scala”. From the outside, this rather looks like “the Scala community” developing a technology and then neglecting it.)

My impression from reading the issues (I’m totally an outsider here, so take this with a grain of salt) is that it isn’t so much that it isn’t a big deal, and more that it’s been a black hole for the one or two people actually tackling it: every time they get one problem fixed, another one raises its head. No organization can invest unlimited resources in any given problem – if you don’t get results, either you decide that it is Absolutely Mission-Critical, or you deprioritize it for the time being.

Since people are managing to use Spark the old way, it isn’t a blocker per se. That doesn’t mean they’re necessarily happy about that, just that they’re living with it. I suspect this will eventually lead to some of them becoming disenchanted with using Spark via Scala, and moving to other solutions, but that’s just a guess.

It sounds like it’s currently blocked on expertise in Scala, and specifically how Scala is represented in the JVM; my impression is that it’s basically beyond the knowledge of the main Spark dev working on it, and he needs somebody with serious Scala expertise to help it get over the hump. (I suspect that tag-teaming, rather than tossing the ball back and forth release-by-release, would move this along a lot faster.)

ghawkins · March 28, 2018, 7:56pm

Hmm… I’m not sure I see things quite so dramatically. I don’t think this is a case of a problem that’s got bigger and bigger with no end in sight.

All the closed subtasks give the impression of slow gradual progress over time. But really I think there’s been no real progress on this issue for a long time, it simply hasn’t been prioritized.

The subtasks were mainly basic bits and pieces that had to be lined up before things could get started, 2.12 versions of libraries etc. Those things have long been in place (most were marked as complete at or shortly after the release of Scala 2.12).

So a couple of the subtasks that are the real heart of this item have gotten done but on the whole, it’s seen little work.

I think it’s a problem of prioritization rather than difficulty. While I’m sure people like Josh and Sean would like to see this done, I don’t think it’s at all a case of them slogging away over weeks and months, just seeing the thing getting bigger and bigger. It’s more it’s just on permanent hold.

wsargent · March 28, 2018, 8:15pm

Upgrading to 2.12 is not a simple matter, and involves lots of transitive upgrades. Here’s what we had to go through for Play 2.6:

adriaanm · March 28, 2018, 8:26pm

But you did it!

wsargent · March 28, 2018, 8:46pm

Yeah, I think it helps for whoever is volunteering (@ghawkins? @jducoeur?) to have other codebases to refer to…

Although in this case it seems to be down to how closures are turned into SerializedLambda in 2.12?

I’m on to other issues in Spark with respect to the new lambda-based implementation of closures in Scala. For example, closures compile to functions with names containing “$Lambda$” rather than “$anonfun$”, and some classes that turn up for cleaning have names that don’t map to the class file that they’re in. I’ve gotten through a few of these issues and may post a WIP PR for feedback, but haven’t resolved them all.

jducoeur · March 28, 2018, 9:22pm

No chance I can do much – even if I had any free time, I don’t know nearly enough about the relevant bits for this one. Just providing a little opinion from the outside…

jvican · March 29, 2018, 8:15am

Honestly, I think the Spark developers are more than technically capable of doing that. If they have succeded in creating the closure cleaner by inspecting JVM names, they can also modify the code a little bit to make it work with 2.12.

My impression is that most of the hard work for 2.12 support is done. Only one or two failing tests need to be fixed.

I’d be very happy to get support for 2.12 to the finish line if the Spark team cuts a release soon. But 6 months is too far away. @ghawkins it would be good if you convince the Spark team to reprioritize this ticket.

ghawkins · March 29, 2018, 9:15am

I agree with @jvican - I think the Spark developers are more than up to this, i.e. I don’t think they’ve struggled with this and failed (as @jducoeur reading of events might suggest). They presumably just see the things that they have delivered in 2.1, 2.2 and 2.3 as more important to their user base than moving to Scala 2.12.

@wsargent - in saying “I think it’s a problem of prioritization rather than difficulty” I didn’t mean to give the impression that I think this is an easy issue to address.

I don’t - I was really replying to @jducoeur and just meant that I didn’t think it was a problem that had been attacked but proved so difficult that it’s been impossible to deliver, rather I get the impression that it hasn’t really been prioritized for development effort in the first place.

I appreciate that if it were prioritized it might well prove difficult.

As to getting them to reprioritize the ticket - I think there’s been a lot of “please do this” for more than a year now, but the weight behind those requests seems less than that behind other improvements and at this stage, it does just look like it’s fallen off the Spark development radar.

The talk on the main remaining subtask, i.e. SPARK-14540, and on Sean Owen’s pull request (which involves preliminary changes to ClosureCleaner) imply that what needs to be done is well understood but as Sean notes “I am not working on this and am not aware of anyone working on it.”

This whole thing may not be a big deal for the Spark people directly but it does keep a significant portion of the Scala user base on an earlier Scala release which I think is bad for the ecosystem overall.

soronpo · March 29, 2018, 10:13am

What could have been done differently in Scala (in terms of language features) that could have prevented this “special SPARK wiring”, yet still maintain the user experience for SPARK?

kevinwright · March 29, 2018, 10:32am

One thought… Is this something that might have been avoided if spores had ever come to pass?

jvican · March 29, 2018, 10:38am

Excellent question. Aside from the spurious regressions that are normal in any compiler release, I’d say nothing could have been avoided. The Spark developers were relying on JVM names, and those are prone to change. When Scala 2.12 leveraged the lambda encoding in Java 8, those names went away.

Yes, to some extent. Spores are the perfect instrument to fix this fundamental issue at compile time rather than as a postprocessing step. As one of the main contributors of Spores, I didn’t see much interest in them. If the Community was to pick up interest in them again and try them out so that we can prepare them for production readiness, I’m sure we could try to add them to the language and revive the SIP. What we missed the most was feedback.

ghawkins · March 29, 2018, 11:24am

Interesting to note that the main problem in moving to 2.12 seems to be the same for both Flink and Spark, i.e. the ClosureCleaner. See FLINK-7816.

If you look at ClosureCleaner in Spark and then at ClosureCleaner in Flink you’ll see that the Flink version originally came from Spark.

/* This code is originally from the Apache Spark project. */

I wonder if the two teams could pool their efforts on this one? Or if Flink prove more committed to getting this done (though is there any evidence of this?) then perhaps it’ll in effect also be done for Spark.

Perhaps someone on Flink might be willing to run with Sean Owen’s Spark pull request and finish it off for both projects

Jasper-M · March 29, 2018, 11:45am

Is there actually a reason why the scala compiler itself doesn’t clean closures as much as possible? For instance rewrite

x => foo.bar

to

{val foo$bar = foo.bar; x => foo$bar}

for every stable (non-lazy) path foo.bar.

rxin · April 3, 2018, 7:45am

Reynold from Apache Spark PMC here. I looked a lot into this in 2016. Unfortunately I don’t think the issue is simply the technical work or the closure cleaner, but there will have to be API breaking changes for existing users in order to upgrade to 2.12, and Spark as a project doesn’t break APIs for feature/minor releases (2.1, 2.2, or 2.3 are feature releases). Maybe Scala 2.12 is a good reason to do Spark 3.0.

rxin · April 3, 2018, 7:47am

BTW it’d be great if there are help from the hard core Scala folks on some of the issues (e.g. closure cleaner). Yes some contributors to Spark are technically very capable and can do crazy things, but there are very limited number of them and they all have other crazy things that need help.