Scala 3 syntax support in "other" editors

ckipp01 · November 15, 2021, 11:18am

Personal feelings about significant whitespace aside, something that will greatly hinder the adoption of this is syntax support in what we’ll call “non-traditional” editors, mainly anything that’s not IntelliJ or VS Code. During the last Metals survey at least there was a decently strong showing of Vim and Emacs users, and neither editor currently supports significant whitespace syntax highlighting or other syntax-related features like correct auto-indentation. There seem to be discussions going on about this all over:

Sublime
Tree Sitter (Used by Neovim, other editors via a plugin, and could in theory help power code navigation in GitHub for Scala)
Emacs (Not sure how widely used this is, but I think this is the default one for Emacs)
There is probably even more that I’m not listing

While most beginners will probably be using VS Code or IntelliJ, there will definitely be a meaningful percentage of people that won’t consider switching editors when coming to Scala, and they’ll be met with pretty terrible syntax support in Scala 3. For example, here is a small snippet of Scala 3 code I just randomly opened up in the Scala 3 codebase using Neovim with Tree Sitter:

This isn’t even using a lot of Scala 3 syntax, but it’s pretty broken simply with the usage of :. This alone is enough for me to not even consider using significant whitespace.

I’m honestly not sure what I’d like to achieve with this post apart from maybe shining some light on the existing problem in hopes that there will be someone who sees this and is willing to help out in the various places. This conversation probably should have happened months and months ago, but it’s not too late to try to collectively tackle this.

fommil · November 15, 2021, 11:33am

It would be more accurate to say that the current scala-mode cannot be easily converted to support significant whitespace indentation. Emacs can of course support significant whitespace languages, e.g. Python and Haskell.

I have a lot of experience in this area, having written a Haskell mode from scratch. Doing the same for Scala would require such a dramatic refactor of scala-mode that it would effectively be a major rewrite. I have no reason to do that at this point in time, but it would be a lot of fun; I thoroughly enjoyed writing the Haskell mode. A requirement would be a very clear set of rules for how to insert virtual parens. Emacs also needs a forwards AND backwards lexer here, and I’m not sure if that poses any particular problems for Scala.
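To make the “virtual parens” idea concrete, here is a minimal sketch in Python (not Emacs Lisp, and not the real Scala 3 rule set). It assumes one deliberately simplified rule: a line ending in `:` or `=` that is followed by a more deeply indented line opens a virtual `{`, and a dedent closes it. The function name `insert_virtual_braces` and the rule itself are assumptions made for illustration only.

```python
def insert_virtual_braces(source: str) -> list[str]:
    """Return the non-blank lines of `source` with virtual braces added.

    Simplified, hypothetical rule (NOT the Scala 3 specification):
    a line ending in ':' or '=' followed by a deeper-indented line
    opens a block; a dedent back to (or past) the opener's indentation
    closes it.
    """
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    out: list[str] = []
    openers: list[int] = []  # indentation of each line that opened a block

    for i, line in enumerate(lines):
        indent = len(line) - len(line.lstrip(" "))

        # Close every block whose opener is at least as deep as this line.
        while openers and indent <= openers[-1]:
            out.append(" " * openers.pop() + "}")

        # Look ahead one line to see whether a block really starts here.
        next_indent = -1
        if i + 1 < len(lines):
            nxt = lines[i + 1]
            next_indent = len(nxt) - len(nxt.lstrip(" "))

        if line.endswith((":", "=")) and next_indent > indent:
            out.append(line + " {")
            openers.append(indent)
        else:
            out.append(line)

    # End of input closes whatever is still open.
    while openers:
        out.append(" " * openers.pop() + "}")
    return out


example = "object A:\n  def f(x: Int) =\n    x + 1\n  val y = 2\n"
print("\n".join(insert_virtual_braces(example)))
```

A real implementation would have to work on tokens rather than whole lines, cope with comments and strings, and run backwards as well as forwards, which is where most of the difficulty lies.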

Ayoub · November 15, 2021, 11:57am

In the case of Sublime it seems to me that a syntax file could be shared with vscode.
According to this https://github.com/slimsag/Packages#adding-a-new-language
It is possible to convert .tmLanguage syntax to .sublime-syntax.
So in theory https://github.com/scala/vscode-scala-syntax/blob/main/syntaxes/Scala.tmLanguage.json
could be reused.

seoethereal · November 15, 2021, 1:06pm

Personally, when using Emacs, the syntax highlighting for keywords and terms works even for Scala 3 syntax. The problem is with significant whitespaces where the indentation does not work properly.

ekrich · November 18, 2021, 1:14am

Well, after looking at this a second time, it seems like Treesitter might be a better option long term but it might need some support to bring it along. With Scala 3 maybe this is a good opportunity?

TextMate is the default in VS Code, though perhaps they could support more than one?

Ayoub · November 18, 2021, 8:53am

I think it is going to be extremely difficult trying to convince several editors to support another highlighting format.

We would have a better chance if we could leverage tooling to convert a single spec to target multiple syntax highlighting formats. Similar to this tool that takes an ebnf config and generates a sublime-syntax file: https://github.com/BenjaminSchaaf/sbnf

At least in this scenario the spec effort isn’t duplicated across editors, and all of them, including vscode, would see their support automatically updated on a syntax change.
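As a rough illustration of the “single spec, many targets” idea, the sketch below generates both a TextMate-style grammar (as JSON) and a `.sublime-syntax` document from one small table of keywords. The `SPEC` table, the scope names and the function names are all hypothetical, and a keyword list is of course far from a full grammar; it only shows the shape such tooling could take.

```python
import json
import re

# Hypothetical single source of truth: scope name -> keywords.
SPEC = {
    "keyword.control.scala": ["if", "then", "else", "match"],
    "keyword.declaration.scala": ["def", "val", "given", "enum"],
}


def keyword_regex(words):
    """Build one word-bounded alternation, longest keywords first."""
    ordered = sorted(words, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"


def to_textmate(spec, scope="source.scala"):
    """Emit a TextMate-style grammar as a plain dict (dump it as JSON)."""
    return {
        "name": "Scala (sketch)",
        "scopeName": scope,
        "patterns": [
            {"name": name, "match": keyword_regex(words)}
            for name, words in spec.items()
        ],
    }


def to_sublime_syntax(spec, scope="source.scala"):
    """Emit the same rules as .sublime-syntax (YAML) text."""
    lines = [
        "%YAML 1.2",
        "---",
        "name: Scala (sketch)",
        f"scope: {scope}",
        "contexts:",
        "  main:",
    ]
    for name, words in spec.items():
        # Single-quoted YAML keeps backslashes literal; the keywords
        # used here contain no single quotes.
        lines.append(f"    - match: '{keyword_regex(words)}'")
        lines.append(f"      scope: {name}")
    return "\n".join(lines) + "\n"


print(json.dumps(to_textmate(SPEC), indent=2))
print(to_sublime_syntax(SPEC))
```

Keyword matching is the easy part; significant whitespace is exactly the kind of construct that regex-based formats like these struggle to express, so a generator alone would not solve the indentation problem.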

For me this leads to some questions:

could Treesitter be used to automatically generate tmlanguage, sublime-syntax, etc ?
could we in theory use the compiler to output these files to leverage the existing parser ? I think scala3 repl reuses some of the compiler to handle the syntax highlighting

ckipp01 · November 18, 2021, 9:50am

Ah yea of course, should have made this clearer.

Yes, that’s sort of what I understood from your comments in that repo. This is also the case with the current Tree-sitter grammar. Having a grammar that could both correctly parse significant whitespace and traditional Scala would be a fairly large refactor.

This is absolutely true. After playing around with Tree-sitter I’m incredibly impressed with what’s possible. We often jump to just improved syntax highlighting, but forget about improved indentation rules, folding rules, text objects, etc. It’s incredibly powerful. While language servers like Metals can produce many of the same things with semantic tokens, folding ranges, and even helping with indentation, Tree-sitter has the benefit of all being done client side.

Something this reminds me of is the work Ethan Atkins was doing on a Tree-sitter grammar generator from ebnf. It’s interesting to think about a tool that could take an ebnf grammar and spit out a Tree-sitter grammar and potentially other grammars. However, there also seem to be some drawbacks to generating a Tree-sitter grammar that way.

This would be interesting to look into, but I’m unsure.

fommil · November 18, 2021, 8:04pm

FYI I looked into TreeSitter and although it sounds awesome (I really did spend a lot of time looking into it properly) it would require a major native-binding language-agnostic extension in Emacs to get it close to working, before you even sit down to write a language grammar for Scala 3; that’s not even considering how hard it would be to write formal rules for Scala 3 (ha!) in the Tree Sitter language.

jackcviers · November 19, 2021, 4:19pm

Am I missing something? I thought we had EBNF for 3 at https://dotty.epfl.ch/docs/internals/syntax.html

Have we tried converting the above into a SMIE bnf for indentation yet? scala-mode might not be amenable to the change; I’ve been following the work on the scala 3 issue there closely.

I’m aware that implementing a new major mode for scala 3 is a huge undertaking, but with your guidance, Sam, I think it could be accomplished. Not necessarily something you would have to do, but if I could rely upon you for pr reviews I think I could muddle through. I already approached my manager at 47 Degrees about helping to fix it.

FWIW, I use the current scala-mode on scala 3 code with metals and the most annoying thing I run into on a daily basis is indentation when there are no braces. Everything else is OK. (I miss ensime, used it for years, but c’est la vie).

Also, it could stay scala-mode; we’d have to replace much of what is there, and relying upon an external parser would be against the tide in major modes, but providing a feature flag variable to switch between implementations would allow us to keep the existing mode going until the new version is ready, at which time we could change the default for it to be on full-time. There would be some prefactoring work, and likely a lot of test code to write first, but I think it’s a solvable problem.

Jack Viers

fommil · November 22, 2021, 2:09pm

SMIE needs its own set of custom rules, it’s rarely the case that an EBNF will translate directly. The examples in the Emacs manual are pretty good at explaining why, although I needed to re-read it several times for it to sink in.

You’ll need some way of turning the significant whitespace into virtual symbols, which of course you can feed to the SMIE lexer. I doubt you’ll be able to implement a Scala 3 version of haskell-tng-layout so easily; see haskell-tng.el (Tseen She, on GitLab) for inspiration. You’d probably also need to infer ; and { } symbols in Scala 2, which is probably a lot trickier than it sounds.

“with your guidance, Sam, I think it could be accomplished”

Thanks for the sentiment, although I just can’t commit any time to this. For so very, very, many reasons.

If I were to rewrite the haskell mode again, I’d have pushed the lexing down into the syntax layer so that fontification, navigation, and SMIE tokenisation share a commonly computed set of categories (e.g. “this thing is a varid vs consid”). I’m not sure what the performance impact of that would be, but it would be worth exploring.

jackcviers · November 22, 2021, 5:54pm

Yeah, I’m aware that it would be a semi-trial and error process to massage it into shape. I think it’s worth experimenting with, especially with the existing code available as guidance in haskell-tng. The resulting operator precedence grammar wouldn’t necessarily be equivalent to scala-3’s grammar… but if it was good enough to provide block navigation, fontlock, and indentation in the presence of syntax errors, I think it would be good enough. As of right now, everything gets indented to beginning of line on every Ctrl-M or return keypress.

I have little interest in changing the mode for scala-2, as that works pretty well, as it exists today.

My first plan is to have a working mode for scala-3. Then provide a configurable variable to set the scala-mode source version, then provide a global configuration option, then to allow customizing that option on a path-dependent basis, then look at auto-detection of scala 3 / scala 2 codebases.
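The layering of settings described here can be sketched as a small lookup function. This is Python rather than Emacs Lisp, and the precedence it uses (a per-path override wins, then an optional auto-detection hook, then the global default) is an assumption for illustration, as are the names `resolve_syntax_version`, `path_overrides` and `detect`.

```python
from pathlib import PurePosixPath
from typing import Callable, Optional


def resolve_syntax_version(
    file_path: str,
    path_overrides: dict[str, int],
    global_default: int = 2,
    detect: Optional[Callable[[str], Optional[int]]] = None,
) -> int:
    """Pick the Scala syntax version (2 or 3) to use for `file_path`."""
    path = PurePosixPath(file_path)

    # 1. The most specific (longest) directory override that contains
    #    the file wins.
    matches = [d for d in path_overrides if PurePosixPath(d) in path.parents]
    if matches:
        return path_overrides[max(matches, key=len)]

    # 2. Optional auto-detection hook, e.g. something that inspects the
    #    build definition. What it looks at is left to the caller.
    if detect is not None:
        detected = detect(file_path)
        if detected is not None:
            return detected

    # 3. Fall back to the global setting.
    return global_default


overrides = {"/work": 2, "/work/new": 3}
print(resolve_syntax_version("/work/new/src/Main.scala", overrides))
print(resolve_syntax_version("/work/old/Main.scala", overrides))
```

Keeping auto-detection as a pluggable hook means the earlier steps of the plan can ship before detection exists.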

If it does work well for scala 3, then in the future we can talk about going back to scala-2 and reworking it in the same manner.

In the same manner, if someone does come up with a cross-editor grammar that could be worked into a syntax table for prog-mode, or writes an indentation server that can handle syntax errors and output the current indentation level for the next line, then we could work on auto-translating that grammar or using the indentation server via the same configuration mechanism described above. Alternatively, if the scala-3 issue gets resolved for the current codebase we can throw my work away. The end goal here is working tooling that is maintainable going forward.
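For a feel of what the core of such an “indentation server” might compute, here is a deliberately naive Python sketch. It only looks at the last non-blank line, so broken code earlier in the buffer cannot confuse it. The opener lists are an assumption and much cruder than the actual Scala 3 rules; comments and string literals are ignored entirely.

```python
# Hypothetical, incomplete lists of line endings that open an indented region.
SYMBOL_OPENERS = ("=>", "=", ":", "{", "(")
KEYWORD_OPENERS = {"then", "else", "do", "yield", "try"}


def next_line_indent(previous_lines, step=2):
    """Suggest the indentation (in spaces) for the line after `previous_lines`."""
    for line in reversed(previous_lines):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines and keep looking upwards
        indent = len(line) - len(line.lstrip(" "))
        last_word = stripped.split()[-1]
        if stripped.endswith(SYMBOL_OPENERS) or last_word in KEYWORD_OPENERS:
            return indent + step  # the previous line opens a block
        return indent  # otherwise stay at the same level
    return 0  # nothing but blank lines above


print(next_line_indent(["object A:", "  def f(x: Int) ="]))
```

Even something this simple would avoid the “everything goes back to column 0” behaviour, although dedenting correctly still needs real knowledge of the block structure.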

I do feel that it should be based on what we have for formal grammars in some way. Obviously, scala 2 and 3 can be parsed, the tricky part is converting that output into the individual editors’ expectations, which is likely not ever going to be equivalent in terms of syntax highlighting, etc.

fommil · November 22, 2021, 9:58pm

The order of play is very strict in Emacs: syntax tables come first (and feed the stateful paren parser). Then, independently, come fontification and SMIE. SMIE provides some navigation, but in practice navigation is provided by hacks in smartparens or paredit (which only look at the syntax table).

Fontification and SMIE can use the syntax tables, but not the fontification metadata (unless you want to play fast and loose). That’s why I think it would be good to do some basic lexing in the syntax table layer and “paint” symbol classification information at that level; it can then be reused by both fontification and lexing (which includes BACKWARDS lexing, btw… backwards regexps don’t always work the way you think so it’s good to be working off good data that was painted in the forward direction first).
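The backwards-lexing pitfall can be illustrated outside Emacs. In the Python sketch below, `naive_backward_search` imitates the documented behaviour of Emacs's `re-search-backward`, which finds the match whose beginning is as close as possible to point, so it tends to return only the tail of a token. `paint` and `token_before` show the alternative: classify every character once in a forward pass, then walk backwards over that data. The three character classes and all function names are assumptions for illustration.

```python
import re


def naive_backward_search(pattern, text, pos):
    """Try start positions from `pos` backwards; the first match wins."""
    compiled = re.compile(pattern)
    for start in range(pos - 1, -1, -1):
        m = compiled.match(text, start, pos)
        if m:
            return m.group()
    return ""


def paint(text):
    """Forward pass: assign one (simplified) category to each character."""
    cats = []
    for ch in text:
        if ch.isalnum() or ch == "_":
            cats.append("id")
        elif ch.isspace():
            cats.append("ws")
        else:
            cats.append("op")
    return cats


def token_before(text, cats, pos):
    """Backward lexer: return the whole token that ends before `pos`."""
    end = pos
    while end > 0 and cats[end - 1] == "ws":
        end -= 1  # skip trailing whitespace
    if end == 0:
        return ""
    start = end
    while start > 0 and cats[start - 1] == cats[end - 1]:
        start -= 1  # extend left while the painted category is unchanged
    return text[start:end]


text = "fooBar + baz"
cats = paint(text)
print(naive_backward_search(r"\w+", text, len(text)))  # only the last character
print(token_before(text, cats, len(text)))             # the whole identifier
```

Because the backward lexer never runs a regexp, it cannot disagree with the forward pass about where a token starts.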

I started out thinking that way, but that’s what’ll lead you down the treesitter train of thought, because it can handle (in theory) locally broken code… but the ask of having treesitter support in Emacs is going to be a political battle lasting years, even if somebody manages to pull it off on a technical level, because of the licenses. I wouldn’t put any stakes on it happening. It’d probably need to be in something independent of prog-mode and have its own lifecycle for syntax tables, fontification and indentation.

In short, I think your best bet would be to just do it by the book, like haskell-tng plus lower level lexing.

jackcviers · November 22, 2021, 11:01pm

Yeah, misspoke – fontlock is different, sorry I crammed it in with SMIE for indentation and forward-backward-navigation, and you’re correct.

This is where you lose me –

I thought the definition of a syntax table was a lookup table of syntax class, optional flags, and optional match character.

I’m not following how or what you would be scanning and lexing to produce those descriptors, or what you mean by “paint” in this context. Could you expand?

fommil · November 23, 2021, 7:48am

The syntax table has sections for user-defined “categories”. You could define categories that match larger constructs, like “variable identifier” (which may be used in many locations such as in a type, package name, etc) and then your regexps in the fontifier and lexer could be like [[:varid:]]+ rather than having to repeat the definition for what constitutes a valid varid everywhere. In Haskell this would have been particularly nice because it has strict rules about capitalisation etc which makes this sort of thing much much easier, but Scala’s a bit of a free-for-all.
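The “define it once, reuse it everywhere” point carries over to any regex-based tooling. The Python sketch below defines two simplified identifier classes once and reuses them in both a stand-in “fontifier” pattern and a stand-in lexer. `VARID` and `CONID` here follow a Haskell-like capitalisation convention purely for illustration; they are not Scala's actual identifier rules, and Emacs categories work per character rather than per regex fragment.

```python
import re

# Defined once; hypothetical, simplified identifier classes.
VARID = r"[a-z_][A-Za-z0-9_]*"
CONID = r"[A-Z][A-Za-z0-9_]*"

# "Fontifier": find the type in `name: Type` so it can be highlighted.
TYPE_ANNOTATION = re.compile(rf"\b({VARID})\s*:\s*({CONID})")

# "Lexer": split the same text into classified tokens.
TOKEN = re.compile(
    rf"(?P<conid>{CONID})|(?P<varid>{VARID})|(?P<op>[^\sA-Za-z0-9_]+)"
)


def type_spans(text):
    """Return (start, end) offsets of each annotated type name."""
    return [m.span(2) for m in TYPE_ANNOTATION.finditer(text)]


def lex(text):
    """Return (category, lexeme) pairs; unclassified characters are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


print(type_spans("val x: Int = y"))
print(lex("val x: Int = y"))
```

If the definition of a valid identifier changes, only `VARID` and `CONID` need editing, and the fontifier and lexer stay consistent with each other.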

jackcviers · November 23, 2021, 1:26pm

Ok. I understand what you mean now. You’re talking about syntax categories. https://www.gnu.org/software/emacs/manual/html_node/elisp/Categories.html

Yeah, we’re pretty much going to have to use them for Unicode idents, no matter what.

fommil · November 26, 2021, 10:39pm

Let me know if you start working on this. I am tempted to do it myself, initially just for Scala 2, because I would like to have better face painting for types vs symbols, and I really enjoy Emacs lisp. That would lead to a more natural upgrade path to Scala 3. I have no active Scala 3 projects, so I lack the motivation.

If you’re taking inspiration from haskell-tng make sure to pay close attention to the testing framework. It is possibly the best part.

jackcviers · November 27, 2021, 12:54am

I’m doing preliminary research, currently. Seems we’ve both stumbled upon Wisi (saw your name on a SO question).

I’m paying very close attention to Haskell TNG. It (the tests) looks a little like ert-expectations.

fommil · December 3, 2021, 7:52pm

Say, for sake of argument, that I wrote a completely fresh Emacs mode that does syntax highlighting and smart indentation for Scala 3, including significant whitespace. That’s a huge piece of work. If I was doing that work for a customer, it would easily be 6 figures. The sole purpose of such an endeavour would appear to be increasing the likelihood of the adoption of Scala 3, which is something I’m more “passive” about than “passionate”.

Think back 10 years, to the motivations for getting involved in Scala tooling: the community was small back then, and doing this sort of thing was the only way to increase the chance of getting a job writing Scala. But nowadays, the demand for Scala developers far outstrips the supply.

I’m interested in doing this from a purely technical point of view: I like writing major modes for Emacs. It’s the closest one can get to writing a compiler without actually writing a compiler. But I think I need a push beyond mere technical intrigue, especially given my rather bruised history here: is there anything (non-monetary) that the Scala Center (and/or the Scala “leadership”) can offer in return?

jackcviers · December 5, 2021, 10:27pm

Sorry all, I caught covid on Thanksgiving and have been recovering.

I have started on work for supporting scala-mode-3. There’s no guarantee I will finish or at what rate I’ll progress. @fommil – I see that you’re working on a new ensime mode. I’ll probably reach out when I’m a little farther along, maybe we can share work.

Obviously, I’ll open source this eventually, if it works. I’ll also try to integrate it with the existing scala-mode, again, if it works. What I want to do is base this off of the documented scala syntax.

I’ll post here as I progress.

eed3si9n · December 20, 2022, 8:07pm

Wrote a blog post on this topic: “fast Scala 3 parsing with tree-sitter”.