Need an accurate EBNF grammar of Scala

I am from Cambridge Massachusetts US, working on program analysis for some Scala projects. My project uses some nice features of Antlr4 so we want an accurate EBNF grammar in Antlr4 form to build the parser and AST walker first.

I saw several previous attempt from stackoverflow:

I made my own attempt by comparing the Antlr4 grammar shown in the post, and correct it to match the 2.13 spec.

And I took a look of the scala syntax spec, and found something inaccurate there, here is an example:

https://www.scala-lang.org/files/archive/spec/2.13/13-syntax-summary.html#context-free-syntax


  TopStat           ::=  {Annotation [nl]} {Modifier} TmplDef
                      |  Import
                      |  Packaging
                      |  PackageObject
                      |
  Packaging         ::=  ‘package’ QualId [nl] ‘{’ TopStatSeq ‘}’
  PackageObject     ::=  ‘package’ ‘object’ ObjectDef

  CompilationUnit   ::=  {‘package’ QualId semi} TopStatSeq

This part of the grammar failed to parse this simple code:

package tip.ast

import tip.ast.AstNodeData.DeclarationData

I guess it should be doable to write down a EBNF grammar for Antlr4?

I also wonder, did the author of Scala write down the EBNF grammar first for parser generator? or implemented the recursive decent parser first?

This one looks like a recursive decent parser for Scala using 2.13 spec, I dont know how reliable it is:

Why do you want antlr parser? There is a pretty decent parser example in fastparse, may it work for you?

I saw this from the official website:
FastParse is a Scala library for parsing strings and bytes into structured data.”

I am trying to parse Scala code, not to use Scala to parse other things.

Those aren’t necessarily mutually exclusive, the FastParse example ScalaParse uses Scala code to parse Scala code.

You’d have to adapt it to return an AST, but it would certainly give you a good place to start.

Another option would be ScalaMeta, which explicitly calls out analysis as one of it’s core usecases

2 Likes

The " ScalaParse" looks nice. I tested it with some random large scala codes and it says it is parsing it successfully? I assume either there is a recursive decent parser implemented for it or the author used some EBNF grammar to generate the parser? But I dont seem to find a grammar file from this project.

I really need a correct EBNF grammar for Antlr4 since it provides lots of convenient features we are using, like the AST walker and the parse tree visualization. Those things are very helpful for debugging.

ScalaParse basically is nothing but a grammar, as far as I’m aware. While it’s not stated in BNF form, I believe that a collection of FastParse Parsers is mostly isomorphic to a BNF grammar. My guess is that going from ScalaParse to BNF would be a largely mechanical exercise. It does require learning how to read FastParse, though…

FastParse is based on parser combinators: https://en.wikipedia.org/wiki/Parser_combinator

I tried to test the EBNF grammar from https://www.scala-lang.org/files/archive/spec/2.13/13-syntax-summary.html in online checker here: https://mdkrajnak.github.io/ebnftest/ but I can’t event get past grammar checking. In “Test Output” I’m constantly getting errors like: “Error parsing grammar specification”. I tried tidying up the grammar (converting the exotic apostrophes and quotation marks into normal ones for example), but I still don’t have working version.

thank you for trying. This is what I am facing. I have provided an example above showing a potential problem of the spec about parsing the import statement.

I guess there’s no single industry standard for EBNF grammar and you need to adjust it to a particular parser generator implementation. Also, the EBNF grammar in Scala documentation is intentionally broken, I think:

lower            ::=  ‘a’ | … | ‘z’ | ‘_’ // and any character in Unicode category Ll, and and any character in Lo or Ml that has contributory property Other_Lowercase
upper            ::=  ‘A’ | … | ‘Z’ | ‘$’ // and any character in Unicode category Lu, Lt or Nl, and any character in Lo and Ml that don't have contributory property Other_Lowercase
digit            ::=  ‘0’ | … | ‘9’
opchar           ::=  // printableChar not matched by (whiteSpace | upper | lower |
                      // letter | digit | paren | delim | opchar | Unicode_Sm | Unicode_So)
printableChar    ::=  // all characters in [\u0020, \u007F] inclusive

How the above is supposed to work without adapting it to particular parser generator implementation?

As for ScalaParse https://github.com/lihaoyi/fastparse/tree/master/scalaparse/src/scalaparse there’s no value generated because no capturing was implemented. If you want to have a generated AST you need to modify the code, e.g. change:

      def While = P( `while` ~/ "(" ~ ExprCtx.Expr ~ ")" ~ Expr )

to

      def While = P( `while` ~/ "(" ~ ExprCtx.Expr ~ ")".! ~ Expr.! )
        .map { case ((cond, body)) => new AstWhile(cond, body) } // choose this or below code
        .map(AstWhile.tupled) // alternative version for case classes

And of course do such change everywhere through the parsing code for it to make sense.

4 Likes

seconded. it sounds like you ought to be building on top of Scalameta

5 Likes