Proposal to deprecate and remove symbol literals

What makes you think they care enough about Scala usage? Yes, it’s kind of cool that they can put “Scala” on their front page. Compared to the kind of integration R has, though, Scala is a third-class citizen.

Also, can you point to prior art of any automatic notebook rewriting tools, with or without a button? (Have you tried to get into the workbook API deeply enough to figure out how to do this?)

I haven’t looked into this very much, but I have looked into it some and at this point I wouldn’t assume it’s trivial or that it’s wanted.

Without passing judgment on the fate of Symbol as a whole, I think so far this argument (auto-convert of workbooks) for why it’s okay is not clearly a good one.

1 Like

This makes a lot of sense. Also it could emit a warning so that users can migrate usages gradually.

I have no strong feelings about symbol literals (I find them cute but not particularly useful) but this idea seems like a profligate use of an easily typed character for a very niche purpose. IMO metaprogramming deserves to be armoured in thicker syntax.

4 Likes

I don’t know if they are or they aren’t. My point is that they should be, since they are relying on Scala for their actual product.

If you rely on a programming language for your product to work, you should care about it. Rails cares about Ruby because it was built on top of Ruby, Spark cares about Scala because it was built on top of Scala, and so on.

I tend to agree with this to be honest.

I was initially very attracted to the splice and quote syntax out of Lisp nostalgia, but if this causes lots of problems for existing users then perhaps an alternative should be considered, e.g.

  • quoting refers to a metaprogramming quote only when followed by a curly brace e.g. '{quotedVal} - this would not collide with symbol literals
  • some kind of idiom bracket style thing e.g. {| quotedVal |} or #{quotedVal}
  • @ is not a valid initial Java identifier char and I don’t think it’s commonly used for anything else in Scala (edit: oh no it’s used in Ammonite REPL)
1 Like

@odersky solutions that sound easy when one deals with a local file system become quite a bit more complicated when one deals with multiple Web GUIs backed by multiple backends. For example, no notebook environment has anything remotely related to “take the code in this notebook cell and send it to the server to be rewritten”, let alone the rewriter running a Scala compiler. Same problem applies to using Scalafix, etc.

So, the question is simple: is the removal of 'x so important that Scala wants to either (a) force multiple open-source projects and for-profit vendors used by hundreds of thousands of people to implement substantial and non-trivial changes that span Web clients and backends or, if that doesn’t happen, because Scala is NOT the most important language for the likes of Jupyter, (b) cause significant end-user suffering?

What is the value scale here? An abstract notion of programming language aesthetics? The quality of experience of people using meta programming? It certainly is not the quality of experience of everyone using Scala, as that will include everyone negatively affected by the change whose volume will dwarf the number of advanced developers using meta programming.

It’s easy to turn a nose up at Spark: it ended up using Scala by accident, it’s not functional, it uses a tiny fraction of the type capabilities of Scala and it even stopped code-generating Scala, because the compilation was too slow. Still, it’s wildly successful. It’s even easier to dismiss the plight of newbie Scala developers hacking in notebooks–after all, they are most definitely not like the people engaged in SIP discussions and they hardly care about programming language design–but shouldn’t we be embracing them instead?

Looking at the evolution of programming languages, it’s easy to see how the ones that make commercially beneficial alliances (either with powerful vendors or with certain user types) gain at the expense of those that do not. BASIC + C# and Microsoft. ObjectiveC + Swift and Apple. Kotlin and Google/Android. JavaScript and Web developers. R and statisticians. When it comes to usage popularity–introducing a lot of fresh blood into the developer ranks–Spark is the best thing that has happened to Scala in recent years. Through Spark, Scala has a real opportunity to align with a major emerging user type–data engineers and data scientists–and benefit from the explosive growth in big data, machine learning and AI worldwide, gaining momentum outside of its European base. Opportunities like this do not come knocking many times and the hard truth is that there are no other major forces fueling substantial Scala developer community growth.

With a multitude of choices for syntax evolution, why is it smart to choose a path that hurts the Spark+Scala community and, for example, pushes more people to use Spark through Python or Java or R, all of which are substantially more popular languages than Scala?

3 Likes

Then perhaps an alternative should be considered, e.g.

  • quoting refers to a metaprogramming quote only when followed by a curly brace e.g. '{quotedVal} - this would not collide with symbol literals

That’s actually what’s implemented. Simple identifier quotes 'x are only allowed inside a splice ${...}. That’s where the shorthand syntax matters most. So from the standpoint of meta-programming we can leave symbol literals in legacy code. You can’t write symbol literals in splices, but no legacy code contains splices.
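For anyone who hasn’t seen the new syntax yet, here is a minimal sketch of how quotes and splices fit together in Scala 3 (the names plusOne and plusOneImpl are made up for illustration):

    import scala.quoted.*

    // '{ ... } quotes an expression; ${ ... } splices a quoted expression back in.
    // The shorthand 'x is only legal inside a splice, so it cannot clash with
    // symbol literals in legacy code, which contains no splices.
    inline def plusOne(n: Int): Int = ${ plusOneImpl('n) }

    def plusOneImpl(n: Expr[Int])(using Quotes): Expr[Int] =
      '{ $n + 1 }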

So the real question is whether symbol literals carry their own weight to be kept as a language feature. It seems the consensus is they don’t, except that existing DSLs make some use of it. The uses might be of dubious value but if there’s no practical migration strategy that’s beside the point.

I believe “leave them, but under a flag” is a workable solution. Notebook servers would have the flag turned on, so notebooks could continue to compile as is.

In terms of the SIP proposal, that would probably mean we remove the feature, since we do not want to encourage its continued use. But implementations would support it under -language:SymbolLiterals as long as there is significant usage in the field.
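If the opt-in ends up spelled the way it is written above, keeping old notebooks compiling would be a one-line change on the build or server side; a hypothetical sbt sketch:

    // build.sbt (sketch, assuming the flag keeps the spelling proposed above)
    scalacOptions += "-language:SymbolLiterals"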

2 Likes

I agree that Spark developers are an important segment of Scala users and should definitely be considered (I am one…). However, I feel that some things are being a bit overstated, especially in the context of this symbols discussion.
Some thoughts:

  • There are already multiple alternatives to refer to columns: $"column", col("column"), df("column"), plus many methods that already have overloads that accept plain strings (see the sketch after this list).
  • Users of notebooks will not suddenly be on Scala 3 without warning and notice all their processes failing. First a Spark upgrade will have to take place. This is at least a minor version upgrade, and it doesn’t seem unreasonable that a Spark version that comes with Scala 3 requires a major version upgrade. I think that upping your Spark minor version is already a bit risky for notebook users (we’re dealing with source compatibility here, perhaps in some scenarios also binary). Upping your major version and expecting that everything will keep working is simply impossible. The same is already true if you upgrade your Scala version: you simply cannot expect that all your notebooks will keep working. If someone has all his production processes running in notebooks (which I would always have advised against, but ok) and he now upgrades from Scala 2.11 to Scala 2.12, he will have to test all his notebooks. So this is not a new problem, and IMO not a problem that will disappear or even become much less severe by not deprecating Symbol. Of all the things that will or could potentially change, symbols are the least of your worries…
  • So I share @mdedetrich’s concern that if you take this reasoning to its logical conclusion we cannot make any incompatible changes whatsoever to Scala. Actually we’d have to revert the ecosystem to 2.11 and go in Java-style compatibility mode from there.
  • That said, I also agree that we shouldn’t be making too many changes, almost just for the sake of change.
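To make the first point concrete, here is a quick sketch of those alternatives, assuming a SparkSession in scope as spark and a DataFrame df with an age column:

    import org.apache.spark.sql.functions.col
    import spark.implicits._  // provides the $"..." interpolator

    // Several ways to refer to the same column, none of which needs a symbol literal:
    df.select($"age")        // string interpolator
    df.select(col("age"))    // functions.col
    df.select(df("age"))     // apply on the DataFrame itself
    df.filter("age > 21")    // many methods also accept plain SQL strings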
1 Like

@Jasper-M we are in violent agreement when it comes to the general problems of good code management in notebooks: it’s an issue, exacerbated by the fact that in several years I’ve seen not one sign of movement in the communities around notebook servers to improve things, which is why I am so very skeptical that there will be improved tooling by the time this possible change lands.

The fact that there are alternative syntaxes for referring to columns in Spark is relevant only if 'x is used very sparingly, which is not the case, as validated a few days ago on the Spark dev mailing list. After all, it’s the simplest and shortest way to refer to a top-level column.

I find the argument that worrying about this change because of notebook users is equivalent to freezing the Scala language somewhat silly. A non-zero probability of code changes being required with an upgrade is not at all the same as a high probability of code changes being required. For a team that uses the 'x syntax, it is very likely that many cells in almost all notebooks will require change. I don’t know of other proposed language changes that have the same effect.

I’ll end on another note of violent agreement:

That said, I also agree that we shouldn’t be making too many changes, almost just for the sake of change.

What I mean is that this change, although probably more prevalent in notebook code, is trivial to make and relatively easy to spot. Other changes might be far more difficult to make, or even to recognize when they are required.

Are you saying Apache Spark users commonly use quoted symbols for columns? Do you have any sense how common it is? User code and notebooks will be hard to inventory, but do they appear in the Spark codebase, in popular libraries, in documentation?

I’m not seeing many examples of using quoted symbols in the documentation for Apache Spark. I’ve only looked here though:

https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#running-sql-queries-programmatically

Are there popular books or courses on Spark that recommend quoted symbols for SQL columns? Maybe we should try to track these down and update them. Spark probably won’t be migrating to Scala 2.14 until the year 2024, but it might be useful to get started.

@ashawley there wouldn’t be many examples in the core docs because they don’t include many transformation examples, period. In a Scala notebook, in the context of an import org.apache.spark.sql.functions._ and import spark.implicits._ where spark is a SparkSession, there are three ways to refer to a top-level Spark column: 'x, $"x" and col("x"). The reason why 'x is the most common syntax should be obvious in terms of succinctness and readability, aided by the fact that there is no Spark API that uses Char so there is no confusion. JAR-based code usually uses col as it can’t always depend on import spark.implicits._. I don’t have real stats for this but I did hear similarly from people in the Spark community when mentioning this SIP and it’s consistent with what I’ve seen. I’m reasonably deep into the community, e.g., I help select who presents at the Spark+AI Summit and talk to lots of people related to the OSS work my company does around Spark.
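For concreteness, a typical notebook cell in that setting might look like this (hypothetical column names; the point is how little ceremony the symbol form needs):

    import org.apache.spark.sql.functions._
    import spark.implicits._  // makes both 'x and $"x" usable as Columns

    // Three spellings of the same aggregation over top-level columns:
    df.groupBy('user).agg(sum('amount))              // symbol literals: shortest to type and read
    df.groupBy($"user").agg(sum($"amount"))          // string interpolator
    df.groupBy(col("user")).agg(sum(col("amount")))  // also works without spark.implicits._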

As for updating docs/books/code, your question is based on an implicit assumption that this SIP proposes a meaningfully positive change to Scala and its entire ecosystem of users, which I wholeheartedly disagree with.

It seems clear that this SIP started from a standpoint of language aesthetics, with very limited to non-existent awareness of the negative externalities of the change, as demonstrated by the pros and cons analysis in the original proposal, which was substantially flawed. As I have pointed out repeatedly, with no contrary arguments presented, there is an implicit assumption that making things syntactically prettier for a small number of advanced developers using meta programming is preferable to forcing changes on a large group of users who have limited means of automated refactoring. I would really like someone who is for this SIP to actually take this head on and argue explicitly why it makes sense to do this.

2 Likes

Sorry if it seemed like I hadn’t read your previous comments opposing the proposal to remove symbol literals.

If anyone else has more information on the scope of the usage of symbol literals in Apache Spark or its documentation, it would be useful to see.

Here’s where we stand:

The SIP committee hasn’t finished voting on this yet. A few members were missing from the meeting today and will submit their votes by email, and one member wanted more time to review the discussion.

In Scala 2.13.0-RC1, released last week, symbol literals are deprecated. (The deprecation could still be reversed, depending on the SIP vote.) Deprecation begins the process of increasingly discouraging the use of symbol literals. Stronger discouragement could come in 2.14 and beyond, for example by requiring a language import.

Consensus on this thread was not to add the sym interpolator, so it was removed before RC1 was built.

There was no clear consensus on whether scala.Symbol itself should be deprecated, so it isn’t.

It isn’t clear yet whether Scala 3 metaprogramming will eventually repurpose the single-quote syntax, but in any case the proposal doesn’t stand or fall on whether metaprogramming wants the syntax or not; nobody seems to consider this a crucial deciding factor.

The major new objection that came up in this thread, raised by Simeon Simeonov, is that Apache Spark uses symbol literals to reference column names. It isn’t the only syntax that works for this, but regardless (although we don’t have hard data on this) it appears the syntax is in wider use than we realized.

So, as discussed in this thread, actual removal should be delayed until some Scala 3.x release, not 2.14 or 3.0, to give Spark and the developers of popular Spark notebook environments adequate time to migrate, as part of their overall Scala 3 migration effort.

2 Likes

There is this discussion thread on the Spark Dev mailing list. It didn’t generate a lot of traffic, but a few more people said they are using the single-quote syntax. Note that this was the mailing list for Spark development, not Spark users, and probably not the audience that uses a lot of data frames in their day to day job.

While I don’t think there are millions of Spark notebooks, all using single-quoted literals, I think there’s a non-trivial number of them. There’s also akka-http and Play Anorm, and several testing frameworks. There’s even deprecatedName in the standard library. They all add up, and besides the burden of rewriting all this code, it points out that symbol literals are useful and nicer to use than their alternatives.
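For instance, the standard-library case mentioned above looks roughly like this (made-up parameter names):

    // deprecatedName takes a Symbol naming the old parameter, so callers that still
    // use the old argument name keep compiling, with a deprecation warning.
    def draw(@deprecatedName('colour) color: String): Unit =
      println(s"drawing in $color")

    draw(colour = "red")  // still compiles, but warns that the old name is deprecated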

To sum it up, the cost of removal is higher than anticipated, and at the same time I haven’t heard very strong arguments in favor of their removal. I don’t think they’re hard to explain, and they sit quietly in the back until you need them (has anyone been “bitten” by symbol literals?); the cost of keeping them is very low. Right now I think the cost of removing them outweighs the stated benefits.

1 Like

I believe there actually is a real cost to Symbol literals, and Spark is the best example of this. I find it very confusing to read Spark queries where symbol literals are mixed with plain strings, both of them doing some column selection. When to use what? Maybe Spark specifies this, or maybe the choice is up to the user, but in any case mixing strings and symbol literals is a bad idea! A pointless choice between two things that are basically the same. By removing symbol literals we remove the temptation for others to fall into the same trap.
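A sketch of the kind of mix I mean, with hypothetical column names and both org.apache.spark.sql.functions._ and spark.implicits._ in scope:

    // Three different spellings of "a column" inside one query:
    df.filter('status === "active")   // the symbol is a column, the string is a value
      .groupBy("region")              // here a plain string names a column
      .agg(sum($"amount"))            // and a third spelling does the same job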

What if there were just one way, symbol literals? I see this more as an issue with the API design than with the language. At any rate, this wouldn’t be the only place where Scala has several ways to do something.

When I say the cost is low, I mean that removing them won’t make Scala simpler to teach or learn. It’s really not a source of complexity (interaction with other features is very low, you don’t risk stepping on them accidentally).

Yes, but in Scala 3 we try hard to cut down on unnecessary choices.

When I say the cost is low, I mean that removing them won’t make Scala simpler to teach or learn.

I doubt that statement, too. The problem is, given that the majority opinion is that symbol literals are unnecessary, we won’t do a systematic job of teaching them. So people will be very surprised when they see the feature for the first time in existing code. Most likely it won’t look like anything familiar to them, so they will wonder what it is, and then come away frustrated and annoyed that this seemingly weird (and therefore powerful?) thing is so trivial and useless.

It is actually not the same.
When we write something like

 'id -> 5

Somebody could say 'id is a string, but that is taking it out of context. Actually, it is an attempt to emulate an identifier.

Somebody could say it is not much worse than f"id", but it is worse. Let’s imagine that we always had to write obj."someCall"(); it would be annoying :slight_smile:

So there is only one undeniable argument: most Scala users do not need it. Maybe that is true.

I think it is a step forward for simplicity, and a step back for extensibility.

IMHO Scala should have a simpler way to declare identifiers and numbers than string interpolation.

Isn’t this a case for scala.Dynamic? Instead of Strings or Symbols you could be using real identifiers.
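A minimal sketch of that idea (not any existing library’s API): with scala.Dynamic, plain field selection is rewritten by the compiler into a string-based lookup, so a real identifier can stand in for 'id or "id".

    import scala.language.dynamics

    // row.id is rewritten by the compiler to row.selectDynamic("id")
    class Row(values: Map[String, Any]) extends Dynamic {
      def selectDynamic(name: String): Any = values(name)
    }

    val row = new Row(Map("id" -> 5, "name" -> "Ada"))
    row.id    // => 5
    row.name  // => "Ada"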

1 Like