Spark as a Scala gateway drug and the 2.12 failure

retronym · April 3, 2018, 10:57am

I’ve added some notes about our new lambda encoding and asked for clarification about how I can help further over in https://github.com/apache/spark/pull/19675#issuecomment-378206809

scottcarey · April 4, 2018, 5:36pm

@rxin

I don’t buy the API breaking change argument, but I may be missing something.

My understanding is that there is a method or methods that become ambiguous once scala has SAM type matching for lambdas.

Consider the following hypothetical:

Spark 2.3.1 now supports Scala 2.12! There are now three releases, e.g.

org.apache.spark:spark-core_2.10:2.3.1
org.apache.spark:spark-core_2.11:2.3.1
org.apache.spark:spark-core_2.12:2.3.1

The 2.10 and 2.11 2.3.1 releases are completely API and binary compatible with their prior 2.3.0 variants.
The brand new 2.12 release is not binary compatible with the 2.11 and 2.10 releases, nor is it expected to be. It has no further API compatibility requirements – it is the first 2.12 release with no predecessor. No user upgrading their code on the 2.11 or 2.10 path would have any problems. If they upgraded from 2.11 to 2.12, there would be some required changes, but that is beside the point – these are not binary compatible and the whole ecosystem on the user side has to change simultaneously.

In other words, IMO as long as the existing 2.10 and 2.11 flavors do not break their API contracts, the 2.12 one is OK being slightly different. This is sometimes a necessity when cross building across major Scala versions.

I believe forcing your users (including me, and I’m really quite annoyed at being stuck on 2.11) to wait for a major version in order to introduce a new major Scala flavor with no preceding userbase completely ignores the purpose of the rule forbidding API breakages in minor versions: existing users must be able to upgrade minor versions without breakage. In this case, this does not break anyone, other than the phantom ghost of someone already using Spark with Scala 2.12 with the current API.

sjrd · April 4, 2018, 8:16pm

This is a very valid point. It applies to many of us. For example Scala.js will be forced to change its API somewhat for 2.13 users (due to the collection rewrite), but we can still publish it as a minor version as long as the 2.10-12 versions don’t break compat wrt. the previous 0.6.x releases.

jpallas · April 4, 2018, 8:19pm

That’s fine.

That’s not fine. That’s effectively declaring that there isn’t one Spark 2.3.1 API, there are two different Spark 2.3.1 APIs. API versioning is about source-level compatibility, as many Scala libraries have demonstrated by having 2.11 and 2.12 versions with the same API.

Having two different variations of the same API version is not something that can reasonably be imposed on a downstream project. It breaks the users’ expectations about what an API version means, and it could massively complicate the process for building the code and generating the documentation.

Depending on the scale of the compatibility issues, it might be possible to minimize the impact by imposing minimal workarounds on code that will be compiled with 2.12, but I don’t know what the scale is and whether it’s possible to do that without breaking compatibility.

xerial · April 4, 2018, 9:17pm

@rxin It’s great to have you in this thread.

For people who are not familiar with ClosureSerializer, I have some slides to explain how it works (they’re a bit old, but I believe the behavior has not changed much):

Here are basic ideas behind ClosureCleaner:

For sending a closure (free variables + function body (class name) + outer variables) to a remote machine, Spark needs to fill unused outer variables with nulls. This is necessary to properly serialize the closure and its context without involving unnecessary objects.
To find these outer variables, ClosureCleaner traverses the JVM bytecode of the closure using the ASM4 library.
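To make the "outer variables" problem concrete, here is a minimal sketch using plain Java serialization rather than Spark itself. The `Job` class and the `serializes` helper are hypothetical names invented for illustration, and the behavior described assumes Scala 2.12 or later, where closures are compiled to Java 8 lambdas. Under that assumption, one would expect the first closure to be rejected because it drags the non-serializable `Job` along, and the second to be accepted because it captures only an `Int`.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {
  // Hypothetical driver-side class; deliberately NOT Serializable.
  class Job(val factor: Int) {
    // `factor` is read through `this`, so the closure keeps a reference to the whole Job.
    def capturesOuter: Int => Int = x => x * factor

    // Copying the field into a local val first means only an Int is captured.
    def capturesLocal: Int => Int = {
      val k = factor
      x => x * k
    }
  }

  // True if plain Java serialization accepts the object.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val job = new Job(3)
    println(s"captures outer: ${serializes(job.capturesOuter)}")
    println(s"captures local: ${serializes(job.capturesLocal)}")
  }
}
```

The local-copy idiom is the manual version of what a cleaner tries to do automatically: make sure the closure carries only what its body actually uses.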

Scala 2.12 changed the bytecode generation for closures in order to utilize Java 8 lambdas, so supporting Scala 2.12 in Spark has not been so easy.

Ideally speaking, if we already had Spore (safe mobile closures) in Scala https://github.com/heathermiller/spores, distributed processing frameworks like Spark and Flink would have used this.

On the Scala side, we essentially need to:

Support safe closures in Scala language or library level.
Add functionality to Scala to define closures that can ensure having no outer variables. If we have this functionality, we don’t need to resort to this complicated ClosureCleaner approach, such as byte code traversal & clean up.
Alternatively, we need to provide a way to cleanup closures to remove unused outer variables inside the code block.

Spark’s dependency on Scala 2.11 is the biggest blocker for the Scala community to deprecate Scala 2.11, even though Scala 2.12 matured a while ago. I hope we can find a solution for this Scala 2.12 support issue in Spark to avoid the burden of cross-compiling binaries for Scala 2.11 and 2.12 (and 2.13 is also coming).

rxin · April 4, 2018, 11:23pm

I sent an email just now to dev@spark about Spark 3.0. The email doesn’t really discuss the specifics of how to support Scala 2.12, but with a major version bump we would be able to make minor API changes if needed in order to support Scala 2.12.

http://apache-spark-developers-list.1001551.n3.nabble.com/time-for-Apache-Spark-3-0-td23755.html

You are more than welcome / encouraged to chime in there to express your support.

scottcarey · April 5, 2018, 9:33am

That’s not fine. That’s effectively declaring that there isn’t one Spark 2.3.1 API, there are two different Spark 2.3.1 APIs. API versioning is about source-level compatibility, as many Scala libraries have demonstrated by having 2.11 and 2.12 versions with the same API.

I was being hyperbolic to demonstrate a point. If you changed the 2.12 version so that every single method had an extra ‘x’ added to its name, and released that in 2.12, then when users go and modify their gradle, maven, or sbt projects to change their spark versions from ‘2.3.0’ to ‘2.3.1’, it would break exactly 0 user builds.

Yes, having truly large API differences would have other consequences, like documentation, etc. And semantic differences would be highly confusing. But take a step back and that is not what we are talking about here. From what I know (and again, I might be missing something) the ‘breakage’ is very minor. There are a couple locations where an overloaded method becomes ambiguous and one of the overloads has to change its name. This is neither difficult to document nor hard for users to understand.

Next, you speak of source compatibility. Guess what? Scala does NOT enforce source compatibility across major releases. Such changes are intentionally limited, and were especially minor between Scala 2.10 and 2.11, but I certainly had to modify my Apache Kafka client code when migrating from 2.9 to 2.10 years ago in a couple minor ways (and other projects as well). Things become deprecated and are removed over time. Other things move packages. Some bug fixes can restrict what is legal Scala. End-users do not expect to upgrade their Scala major version without having to make a few minor modifications.

So if Spark is trying to enforce absolute source compatibility across Scala major versions, it is trying to be more strict than the language itself. The specific problem where some overloaded methods can become ambiguous in the presence of SAM types can hit ordinary user code too. It’s rare, but not that rare.
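For readers who have not hit this, a small sketch of the overload shape under discussion may help. `Box` and this local `MapFunction` trait are hypothetical stand-ins, not Spark's actual classes. With a Function1 overload next to a single-abstract-method overload, an untyped literal such as `box.map(x => x + 1)` is the call that can stop compiling once lambdas are also eligible for SAM conversion (the exact diagnostic depends on the compiler version). The two calls below avoid the problem because each argument can only match one overload.

```scala
object OverloadSketch {
  // Hypothetical stand-in for a Java-style functional interface.
  trait MapFunction[T, U] { def call(value: T): U }

  final class Box[T](val value: T) {
    // Scala-facing overload.
    def map[U](func: T => U): Box[U] = new Box(func(value))
    // Java-facing overload with a single abstract method.
    def map[U](func: MapFunction[T, U]): Box[U] = new Box(func.call(value))
  }

  // A value already typed as Function1 can only match the first overload.
  val inc: Int => Int = _ + 1
  val viaFunction1: Box[Int] = new Box(1).map(inc)

  // An explicit instance of the interface can only match the second overload.
  val viaInterface: Box[Int] = new Box(1).map(new MapFunction[Int, Int] {
    def call(value: Int): Int = value + 1
  })
}
```

Both forms should compile unchanged on 2.11 and 2.12, which is the kind of minimal source workaround a cross-building user would reach for.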

Consider two universes:

Universe A: Contains Spark 2.3.1 with only a Scala 2.11 variant.
Universe B: Contains Spark 2.3.1 with both a Scala 2.11 and a Scala 2.12 variant.

The claim is that

Having two different variations of the same API version is not something that can reasonably be imposed on a downstream project. It breaks the users’ expectations about what an API version means, and it could massively complicate the process for building the code and generating the documentation.

What is Universe B imposing on users versus Universe A? There is one extra freedom of choice. No user is forced to compile their end-user program against the 2.12 version. Only those that opt in to 2.12 need to even know that Universe A and Universe B differ. So for end-users there is zero downside and only upside.

Only a downstream library that sits between end-users and Spark is faced with a more complicated choice: only support the 2.11 version, or build against both? Again, many can choose the path that is no different between the two universes and only support 2.11. Others may decide to cross-build. Cross-building is not that hard, and Spark until recently supported both 2.10 and 2.11, so it should be nothing new to a library maintainer. In a cross-build, you can have a few files that differ to account for differences. This is also not very hard, unless the differences are extreme. I reject the idea that this ‘massively complicates’ anything. It’s common practice already.
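As a rough illustration of what such a cross-build looks like for a downstream library, here is an sbt sketch. The version numbers are illustrative, and a `spark-core_2.12` artifact is assumed to exist only for the sake of the hypothetical being discussed.

```scala
// build.sbt (sketch; version numbers are illustrative)
crossScalaVersions := Seq("2.11.12", "2.12.6")
scalaVersion := crossScalaVersions.value.head

// %% appends the Scala binary version to the artifact name, so this resolves
// spark-core_2.11 or spark-core_2.12 depending on the Scala version in use.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1" % "provided"

// Shared code lives in src/main/scala. The few files that must differ can go in
// src/main/scala-2.11 and src/main/scala-2.12, which sbt adds to the source
// path for the matching Scala binary version.
```

Running `sbt +compile` or `sbt +publish` then builds once per entry in `crossScalaVersions`.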

I think tying this to a Spark 3.0 release is a bad idea, unless 3.0 is almost ready and filled with things users want anyway. It will only further burden users by making the upgrade more painful than it is, slowing transition beyond 2.11 (since a true 3.0 probably includes many other API changes, some that may be semantic and not merely nominal). One could release something like ‘2.3.1-a’ as a Scala 2.12 artifact, which might make communicating the fact that the api differs slightly easier, but this actually makes cross-compiling by downstream libraries harder, by doing something different than everyone else w.r.t. labeling cross-built artifacts.

And as others have pointed out 2.13 is coming. It has some more significant API changes and I suspect it may in fact be impossible for some projects to be 100% source compatible across 2.12 and 2.13. Most things should just work, but I doubt everything will. I do not believe Spark should hold its source compatibility standard to be more strict than the language itself. Doing so is a disservice to end-users, making it harder for Spark end-users to migrate their code-bases to new Scala versions than necessary.

kevinwright · April 5, 2018, 10:01am

I disagree - it’s effectively declaring that there are two different major releases of scala, with slightly different APIs

Whatever way you look at it, folk migrating from 2.11 to 2.12 are very likely going to have to change stuff anyway, on account of having moved between major versions.

Which leaves us with two options:

Even though code changes will be necessary due to the scala change, we allow our hands to be tied by ultra-strict adherence to semantic versioning, and deem that we can’t possibly allow even one of the changes (that’ll be happening anyway) to happen in a fairly tiny part of code specifically relating to spark
We accept the need of this change, make it only in the scala 2.12 build of spark, add documentation/deprecation as necessary to inform users as to why, and allow people to finally upgrade to a version that’s less than two years old - with all the performance and other improvements that that provides.

To my thinking, the relative cost/benefit of these two choices favours one of them extremely strongly.

jpallas · April 7, 2018, 2:23am

Maybe I’m overreacting, but it seems like the attitude I’m hearing is, “Scala doesn’t guarantee any compatibility across major releases so Spark should just suck it up.” Given the title of this thread, “Spark as a Scala gateway drug and the 2.12 failure”, it’s hard to see how that’s helpful.

NthPortal · April 7, 2018, 3:14am

It’s not an issue of “sucking it up”, it’s that there is no compatibility to break.

More precisely, 2.11 and 2.12 have entirely separate artifacts. It doesn’t make sense to complain that spark-core_2.11:2.3.0 is not compatible with spark-core_2.12:2.3.1 - they have different artifact IDs, and aren’t compatible in the first place. Even a spark-core_2.11:3.0.0 - a completely new major version - would never be compatible with spark-core_2.12:3.0.0.

ghawkins · April 7, 2018, 10:47am

Hi @jpallas - I think most of the input on this thread has been very constructive in nature. I don’t think anyone here wants to, or even thinks they can, bully the Spark world into a particular choice.

You and @scottcarey seem to have gotten into a bit of a shouting match. I guess he’s got his opinions and you’ve got yours. At the risk of sounding patronizing, I think you shouldn’t take it so personally. Clearly, no one can make you or the Spark world “suck it up.”

Both of you have made points that it was good to get out there. If you really object to the way in which some of them have been made then OK, that’s your right. Personally, I can see where @scottcarey (and @kevinwright) are coming from, but of course you should point out the issues with their positions as you see them.

The input from @rxin, @xerial and many others on this thread has been really positive.

I think we’d all just like to see how we can get from the current situation to one where the Spark community can move to 2.12 if they wish.

The wider community clearly can’t demand or batter the Spark community into this, but it’s clearly also not ideal that Spark remain wedded to a previous Scala release so long after a major release like 2.12 (2.13 is just on the horizon now).

So the question is how to get it done technically, i.e. ClosureCleaner, and then how to make that available in a way that causes the least amount of confusion and disruption to the Spark user base, be it a major Spark release, i.e. 3.0, or whatever.

In the long term though, I think it’d be great if this wasn’t doomed to happen on every major new Scala release (for reasons other than the ability to prioritize the technical work involved). If it’s possible to come up with a process that deals with the fact that major Scala releases often introduce more dramatic changes than e.g. Java releases typically have, that’d be super - but there clearly isn’t an immediately obvious or simple solution that’s going to please everyone.

Blaisorblade · April 7, 2018, 3:33pm

The linked thread seems to be discussing better than here the consequences of the API changes on Spark. People might want to take a look at that thread and at API issues with Spark and Scala 2.12 / Java 8 - Google Docs.

But the Spark analysis on overloads seems to be incorrect — the implicit conversion idea doesn’t seem to work in Scala 2.11, as Adriaan pointed out (API issues with Spark and Scala 2.12 / Java 8 - Google Docs). The only answer there from Josh Rosen seems to be “OK so let’s just fork (parts of) Spark”, but on the Spark mailing list people assume there’s an alternative.

I sketch a different solution below.

Compatibility issues

The issue for 2.12 is first of all about source compatibility for libraries that depend on Spark and crossbuild.

However, the document also motivates why Spark cares about compatibility between spark-core_2.11:2.3.0 and spark-core_2.12:2.3.0. Supposedly, Java users must be able to mix and match Spark built against Scala 2.11 and Spark built against Scala 2.12. That’s more plausible than demanding such compatibility for Scala users, but I still don’t get the point. Can’t Spark just say that Java users should stick to a “main” Scala build (say, 2.12)? Or do you have many Spark users using Scala and Java who can’t just pick consistent Scala versions and deal with any breakage? (Though at that point we’re probably in Spark 3.0 territory).

Breakage in the Scala 2.11 API

@rxin, the planned API changes (API issues with Spark and Scala 2.12 / Java 8 - Google Docs) seem to have unresolved accuracy issues! There are multiple ones, but IIUC, if you don’t fork parts of Spark for 2.11/2.12 (“pretty large classes such as RDD, DataFrame and DStream”), and you use everywhere a single overload + implicit conversions, 2.11 users won’t get argument type inference when they call higher-order Spark methods:
Spark 2.12 API issue · GitHub

That’s not just breaking compatibility, that’s making the source API worse, while the doc claims for this case:

This works perfectly for Scala 2.10 and 2.11 since every place where they can pass in a scala.Function1 today will continue to be source-compatible.

Right now, the best solution might be to go to Spark 3.0 with some changes in the source API, provide def map[U : Encoder](func: T => U), and maybe require Java<=7 users to subclass AbstractFunction1 (most likely, a Spark copy of it, in Scala, to insulate you from Scala library changes). Or maybe just bless Spark for Scala 2.12 as the version that Java users should link against; Spark for Scala 2.11 seems only useful for people also using Scala, who must change some code between Scala 2.11 and 2.12 anyway. Cross-building still seems to be supported.

Community relations

I’m not a Spark user or an expert, but IIUC from this thread, the Scala community is asking a favor of the Spark community. If Spark supports Scala 2.12, Spark gets all the improvements in 2.12, while Scala 2.12 gets all the Spark users. Which is perfectly fine, but let’s indeed be polite about it, and maybe offer help.

jpallas · April 7, 2018, 6:05pm

Thanks @ghawkins for the level reply. It certainly wasn’t my intention to get into a shouting match, but I may have let my frustration get the better of me. I have a background in systems where there is a very sharp and well-understood distinction between compatibility of binary interfaces (ABIs) and compatibility of source-level interfaces (APIs). People in the JVM world seem to blur these distinctions in a way that I sometimes find confusing (although understandable, given the dual role of .class files in the compilation/runtime process).

I agree with everything you’ve said. The Scala community can help the Spark community to mutual benefit. Arguing about the rules and processes that the Spark community has adopted isn’t going to further that collaboration, and distracts from the technical issues.

kevinwright · April 7, 2018, 7:22pm

Could I take a moment here to speak out against the notion of “spark community vs scala community”?

As someone with more than his fair share of experience on both sides of the fence, and a foot in both camps, I’d like to speak of a different concern here. I’ve used both Spark and Scala without Spark, and know for a fact that literally everyone who is affected (or will be affected) by this issue will be using both together! Spark may well be a gateway drug, but it’s also used in well-established Scala shops where a different set of issues appear.

I work on several Scala projects, and spark sticks out like a sore thumb. Our internal utility libraries are already cross-compiled against 2.12 and typelevel 2.12 (for experiments with -Ytype-literals), those that are used in Spark also have to be cross-compiled against 2.11… which is an unwanted extra burden to maintain.

I’m perfectly happy to accept that Spark_2.12 is a different artefact from Spark_2.11 - sufficiently so that I wouldn’t expect the APIs to match even if they’re both versioned “2.3.0”, any more than I’d expect the 2.3.0 Java or Python APIs to be identical. I already know that I’ll be making changes to migrate from Scala 2.11 to 2.12 anyway, so this really couldn’t be less of a problem.

The lack of any sort of 2.12 version does present a compatibility problem though… It breaks compatibility with internal corporate guidelines as to which version of Scala should be used in Scala-based projects.

To put it into “marketing” speak… This isn’t just an issue of user acquisition, there’s a retention problem here too.

Jasper-M · April 8, 2018, 9:03am

Unless I misunderstood you, this seems like a lost cause to me. If Java users mix 2.11 and 2.12 artifacts they are bound to get all kinds of runtime errors.

adriaanm · April 8, 2018, 8:16pm

Regarding overloaded higher-order methods: 2.13 will have better type inference for them (the new collections rely on it). M3 already has the minimal improvements, and I am working on refining those in M4.

jpallas · April 9, 2018, 2:50am

Unless I misunderstood you, this seems like a lost cause to me. If Java users mix 2.11 and 2.12 artifacts they are bound to get all kinds of runtime errors.

I don’t think that’s what the document says. It says, “a Java user whose compile-time Spark dependency uses Scala 2.11 must be able to run their compiled code against a Scala 2.12 version of Spark.” In other words, the Scala version used to build Spark must not affect Java binary compatibility.

scottcarey · April 10, 2018, 7:59pm

You and @scottcarey seem to have gotten into a bit of a shouting match.

I did not realize I was in a shouting match. Sorry.

@jpallas I’m sorry if I came across too harshly. I have strong opinions here. I may have been hyperbolic, but I am trying to make my perspective very clear.

I do not represent the scala community. I represent myself. In order of ‘what relevant community do I belong to’, the order would be:

Apache Member (former PMC chair, Apache Avro)
Scala user, JVM polyglot.
Spark user.
I had one PR merged to scala/scala several years ago, and would attempt to contribute more if I had the time.

As far as open source contributions, my work in Apache related projects is orders of magnitude larger than Scala.

What triggered me to jump in this conversation was what I perceived to be a ‘throw up the hands’ moment on progress on this, waiting for a Spark 3.0. I am a Spark user who is currently frustrated with the issue, and most of my codebase is still stuck on 2.11 because of Spark (I could move more, but I would have to cross-compile some Java and other non-Scala JVM artifacts across Scala versions… ick – or re-write some low level Scala dependencies into Java or Kotlin…). Honestly, it would be easier to fork Spark, once the closure cleaner parts are complete, and maintain the fork than to do the work to maintain too many cross-built things in my organization.

With respect to source or binary compatibility, it made no sense to me to enforce strict compatibility across Scala major versions. The ecosystem does not work that way. Minor changes to After reviewing the details in the linked document, the tricky part is actually the binary compatibility requirements on the Java side.

Can’t Spark just say that Java users should stick to a “main” Scala build (say, 2.12)? Or do you have many Spark users using Scala and Java who can’t just pick consistent Scala versions and deal with any breakage?

The issue here IMO is not Java users but Java library developers. Otherwise, a Java library developer would have to cross-compile across scala versions, and produce multiple artifacts, etc.
I honestly have no idea at all how many Java library developers code against this API, however.

The previously linked document has these two relevant requirement statements:

Scala users’ code must be source-compatible across all Scala versions (to the greatest extent possible)
Java users should be isolated from Spark’s Scala version

The key part of the first one is the parenthetical. It is not always possible to have source compatibility across major Scala releases. This is one of those cases. In my mind, this means that it is fair game to maintain Scala source compatibility from spark-core_2.11:2.3.0 to spark-core_2.11:2.3.1 but have a minor difference between spark-core_2.11:2.3.1 and spark-core_2.12:2.3.1. The requirement above makes it clear that as long as the changes are as limited as possible, it’s OK.

The hard part is the second one. If spark-core_2.12:2.3.1 has one less method than spark-core_2.11:2.3.1, then this is violated. A Java library built against the 2.11 variant might break.

There are two ways around this.

Loosen the requirement, so that a Java library built against 2.3.1 must be built against the 2.12 flavor (which would still work when linked to a 2.11 flavor, if a method was removed).

FWIW, the above two rules are not official:

https://spark.apache.org/versioning-policy.html

The official versioning policy does not mention these requirements. It would appear that the PMC could reasonably choose to do something that was not so strict without violating any existing public policy. The official version promises much less to end-users, in fact.

That said, simply ‘following rules’ is not the point. The truth behind these rules is the impact to end-users and library developers. We really should discuss those, rather than the rules themselves.

A Proposal

Leave the Scala 2.11 variant the same, with both map methods
The Scala 2.12 variant replaces both of them with def map[U : Encoder](func: MapFunction[T, U])

This means that from the perspective of a Scala 2.12 user, it works fine as long as they are not passing raw Function1’s and are using lambdas. The implicit ability to use a Function1 doesn’t really work anyway, and is most relevant for Scala 2.11 and prior. 2.12 users can use SAM.
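A minimal sketch of what the single-overload shape looks like from the caller's side, assuming Scala 2.12 or later. `Box` and this local `MapFunction` trait are hypothetical stand-ins for Dataset and Spark's Java functional interface, and the Encoder context bound is left out to keep the example self-contained.

```scala
object SamOnlySketch {
  // Hypothetical stand-in for Spark's Java-facing MapFunction interface.
  trait MapFunction[T, U] { def call(value: T): U }

  // Simplified stand-in for Dataset; the Encoder context bound is omitted.
  final case class Box[T](value: T) {
    // The only map: it takes the single-abstract-method interface.
    def map[U](func: MapFunction[T, U]): Box[U] = Box(func.call(value))
  }

  // A function literal is SAM-converted to MapFunction by the compiler.
  val viaLambda: Box[Int] = Box(20).map(x => x + 1)

  // A pre-existing Function1 value is not converted, so wrap it in a literal.
  val twice: Int => Int = _ * 2
  val viaWrapped: Box[Int] = Box(21).map(x => twice(x))
}
```

The second call is the "raw Function1" case: passing `twice` directly would not compile against this signature, so callers holding function values need the small wrapper.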

From the perspective of a Java library maintainer, if built against the 2.12 variant it will be binary compatible with either. OR if it was built against the 2.11 variant but did not use methods that take a Scala Function1 argument, it will work. Presumably, most if not all Java libraries would be using the MapFunction variant anyway – it’s not trivial for a Java client to summon a Scala 2.11 Function1.

The above does not strictly adhere to the requirements in the document, and loosens them a bit. From the real-world user perspective, the risk is Java developers that use the Scala 2.11 flavor map methods with Function1 arguments and expect their code and/or library to work with the Scala 2.12 artifacts at runtime. For Scala users, the risk is ‘typical’ for a major Scala version upgrade: recompilation is required and in some cases, minor source changes.

Thoughts

The primary reason I suggest something like the above is that I do not want to have to wait for a Spark 3.0, which might introduce even more delay to get my ecosystem working as a 3.0 will force more changes on my codebase. From a Scala community perspective, this is one more barrier to moving the ecosystem forward.

However, perhaps the 3.0 API breakages are minor, and I should not be so worried about it. In this case, moving the community to a 3.0 may be a lot easier than dealing with quirky 2.x_2.12 artifacts.

If a 3.0 is coming soon, and the API changes are not drastic, I’m all for that. The challenges in this upgrade cycle are real – the Closure Cleaner part in particular is a fundamental issue that the ecosystem should solve in the long run so that Spark does not have to.

2.13 is coming soon, and I hope that I don’t have to wait 2 years for that. Along those lines, perhaps someone should investigate what kinds of Spark API compatibility issues a 2.13 upgrade would introduce. The Scala community can surely help with this. The Spark PMC is going to have to make some decisions on what that means for the compatibility guarantees for the 3.0 release cycle.

lrytz · June 21, 2018, 7:07pm

There is a new document to discuss the API issue, and the closure cleaner issue, with new proposals / insights. https://docs.google.com/document/d/1fbkjEL878witxVQpOCbjlvOvadHtVjYXeB-2mgzDTvk/edit#heading=h.ofsvnyrj8o3b

Let’s consolidate (technical) discussions on that document.

NthPortal · August 2, 2018, 3:06pm

Congrats and thanks to all involved!