Scala 3 syntax reference disallows Unicode string literals

The Scala 3 syntax summary (not the Scala 2.x one; I was mistaken) says that characters in ordinary string literals (a single pair of double quotes, as opposed to triple quotes) are restricted to printableChar = “all characters in [\u0020, \u007F] inclusive”, except for " and \, whereas triple-quoted strings can contain any Unicode character.

In practice, the compiler accepts non-ASCII characters in ordinary string literals (e.g. "שלום"), and application code can rely on this. Should the spec guarantee this? Or should code treat it as an implementation detail and use triple-quoted strings instead?
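
To make the question concrete, here is a minimal sketch of the three spellings being compared. The object and value names are made up for illustration; the only claim is that all three denote the same four-character string.

```scala
// Sketch only: UnicodeLiterals and its members are hypothetical names.
object UnicodeLiterals {
  // Ordinary literal with raw non-ASCII characters (accepted by the compiler in practice).
  val plain: String = "שלום"

  // Triple-quoted literal, where the syntax summary allows any character.
  val triple: String = """שלום"""

  // ASCII-only source text: the same four letters written as \u escapes.
  val escaped: String = "\u05e9\u05dc\u05d5\u05dd"

  def allEqual: Boolean = plain == triple && plain == escaped
}
```

The escaped form is the one that keeps the source text itself inside the ASCII range, at the cost of readability.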

Also, the compiler does not accept string literals with BIDI control characters (they must be \u-escaped). These are the characters matched by scala.reflect.internal.Chars.isBiDiCharacter in Scala 2:

'\u202a' | '\u202b' | '\u202c' | '\u202d' | '\u202e' | '\u2066' | '\u2067' | '\u2068' | '\u2069'

The syntax documentation doesn’t forbid these but maybe should, if the implementation restriction won’t be lifted.
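
For anyone who wants to check their own strings, a rough standalone sketch of the same test follows. It restates the nine code points rather than calling into scala.reflect.internal.Chars, and escapeBidi is a hypothetical helper (not a library function) that rewrites those characters as \uXXXX text.

```scala
// Sketch only: BidiCheck, isBidiControl and escapeBidi are hypothetical names.
object BidiCheck {
  // The nine BiDi control characters listed above, written as escapes
  // so that this source file itself contains none of them raw.
  def isBidiControl(c: Char): Boolean = c match {
    case '\u202a' | '\u202b' | '\u202c' | '\u202d' | '\u202e' |
         '\u2066' | '\u2067' | '\u2068' | '\u2069' => true
    case _ => false
  }

  // Replace each BiDi control character with its textual \uXXXX form,
  // leaving every other character untouched.
  def escapeBidi(s: String): String =
    s.flatMap(c => if (isBidiControl(c)) "\\u%04x".format(c.toInt) else c.toString)
}
```

For example, BidiCheck.escapeBidi applied to a string containing U+202E returns the same string with that one character spelled out as the six characters \u202e.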

The Scala 2 spec:

A string literal is a sequence of characters in double quotes. The characters can be any Unicode character except the double quote delimiter or \u000A (LF) or \u000D (CR); or any Unicode character represented by an escape sequence.

The syntax summary reflects that.

I’m not sure why the BIDI restriction is an error instead of a warning. That would be less forbidding.

The Scala 3.x syntax is different (omitting productions for escape sequences):

stringLiteral    ::=  ‘"’ {stringElement} ‘"’
stringElement    ::=  printableChar \ (‘"’ | ‘\’)
printableChar    ::=  “all characters in [\u0020, \u007F] inclusive” ;

This goes back to at least this commit from Dotty 0.5.5.

The Scala 3 compiler doesn’t have such a restriction, so I’m not sure if this was ever intended to be implemented.
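
To see what the grammar would reject if it were enforced literally, here is a small sketch that applies the printableChar and stringElement rules to the characters of a string value. The names are made up, and like the excerpt above it ignores the escape-sequence productions, so it is an illustration of the written rule and not of anything the compiler does.

```scala
// Sketch only: PrintableCheck and its methods are hypothetical names.
object PrintableCheck {
  // printableChar as written in the syntax summary: [\u0020, \u007F] inclusive.
  def isPrintableChar(c: Char): Boolean = c >= 0x20 && c <= 0x7f

  // stringElement: printableChar minus the double quote and the backslash.
  def isStringElement(c: Char): Boolean =
    isPrintableChar(c) && c != '"' && c != '\\'

  // True if every character could appear unescaped under the written grammar.
  def fitsGrammar(s: String): Boolean = s.forall(isStringElement)
}
```

Under this reading, PrintableCheck.fitsGrammar("hello") holds, while a string such as "שלום" fails the check even though the compiler accepts it as a literal.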

I mistakenly thought that 2.x also said this, but apparently I was wrong: it’s only in 3.x. Sorry about the confusion; I’ll edit my post to clarify that it’s about Scala 3.

Authoritative tweet

Thanks! I take it that means no one (who read this question) thinks string literals really should be restricted.

The BiDi characters were restricted to fix a security issue:

So that restriction will definitely not be lifted.

Thanks, that’s useful to know.

Is it worthwhile to update the syntax docs to include this restriction?

Yes, absolutely. PR welcome :wink:

For future reference, I opened a PR: “Document: don't allow characters with unicode property Bidi_Class in source files” (scala/scala, Pull Request #10197).