Invalid unicode escapes as literals


#1

Motivating examples:

var latexstring = """\usepackage{mypackage}"""

var userfolder = """C:\users\"""

Both of these fail with an invalid unicode escape error: each contains the literal sequence \u, which “switches on” unicode escape parsing, but what follows is not a valid unicode escape.

I’m not sure whether this is to spec. AFAIK all the spec says is:

In Scala mode, Unicode escapes are replaced by the corresponding Unicode character with the given hexadecimal code.

UnicodeEscape ::= ‘\’ ‘u’ {‘u’} hexDigit hexDigit hexDigit hexDigit
hexDigit ::= ‘0’ | … | ‘9’ | ‘A’ | … | ‘F’ | ‘a’ | … | ‘f’

In either case, I believe it would be better if things that are now unicode escape errors are just processed as literals.
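
A minimal sketch of what "processed as literals" could mean, written as a library-level decoder rather than as the compiler's actual scanner. The object name LenientUnicode and its decode method are hypothetical, and the sketch deliberately ignores the question of how many backslashes precede the escape.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the compiler or the standard library.
object LenientUnicode {
  // One backslash, one or more 'u', then exactly four hex digits,
  // following the UnicodeEscape production quoted from the spec.
  // The backslash is written as a character class so that this source
  // file never contains a backslash directly followed by 'u'.
  private val escape = "[\\\\]u+([0-9A-Fa-f]{4})".r

  // Well-formed escapes are replaced by the character they denote;
  // anything else, including a malformed escape, is kept verbatim.
  def decode(raw: String): String =
    escape.replaceAllIn(raw, m =>
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
}
```

Under this scheme both motivating strings come through unchanged, because the characters after \u are not four hex digits, while a well-formed escape such as the one for code point 0041 still decodes to A.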


#2

Proposed alternatives in scala/contributors gitter:

escape unicode escapes with an additional backslash: "\\usepackage{mypackage}"

This would keep something like "\u 1234" failing early, rather than doing the (arguably) wrong thing of silently being accepted as a literal.

Of course, this would be a special case of a special case, which is never too pretty.
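
For comparison, a rough sketch of the stricter alternative: a doubled backslash yields a literal backslash, and a single backslash followed by u must be a complete escape or the whole string is rejected. StrictUnicode and decode are made-up names, repeated u characters are not handled, and other escapes such as a newline escape are passed through untouched. This is only an illustration, not how the scanner is implemented.

```scala
// Hypothetical helper illustrating the "escape the escape" proposal.
object StrictUnicode {
  private val hexDigits = "0123456789abcdefABCDEF"

  // Right(decoded) on success, Left(message) on a malformed escape.
  def decode(raw: String): Either[String, String] = {
    val out = new StringBuilder
    var i = 0
    while (i < raw.length) {
      val c = raw.charAt(i)
      if (c != '\\' || i + 1 >= raw.length) {
        // Ordinary character, or a lone trailing backslash.
        out += c
        i += 1
      } else raw.charAt(i + 1) match {
        case '\\' =>
          // Doubled backslash: emit one backslash; a following 'u' is literal.
          out += '\\'
          i += 2
        case 'u' =>
          val hex = raw.slice(i + 2, i + 6)
          if (hex.length == 4 && hex.forall(h => hexDigits.indexOf(h) >= 0)) {
            out += Integer.parseInt(hex, 16).toChar
            i += 6
          } else return Left(s"invalid unicode escape at index $i")
        case other =>
          // Any other escape is left alone in this sketch.
          out += c
          out += other
          i += 2
      }
    }
    Right(out.toString)
  }
}
```

With this behaviour the doubled-backslash form of the LaTeX string decodes to the intended text, whereas "\u 1234" is reported as an error instead of slipping through.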


#3

Historical reference:
http://www.scala-archive.org/Unicode-escapes-in-multiline-strings-td1987616.html


#4

This actually does work, so I suppose that’s all there is to it.


#5

I always thought that it was a historical mistake. Java was born into an ASCII world but nowadays there is no reason to treat its \u escape syntax as a Unicode encoding. Everyone uses UTF-8 for this purpose. It would be nice if Scala 3 could fix this.


#6

I agree it would.

After thinking more about it, handling unicode escapes the same way as any other escape would be my personal preference.

That would cost you the possibility of using unicode escapes to represent arbitrary syntax. I don’t think anyone would lose sleep over the fact that you can then no longer write \u0076\u0061\u006c\u0020\u0078\u0020\u003d\u0020\u0037 instead of val x = 7.
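
To check that the escape sequence above really spells val x = 7, the code points can be decoded by hand. The hex values are copied from the post; the object and method names are only for illustration.

```scala
// Illustrative only: decodes the code points from the escape sequence above.
object EscapedSource {
  // The four-digit hex values, in order, without their escape prefix.
  val hexCodes: List[String] =
    List("0076", "0061", "006c", "0020", "0078", "0020", "003d", "0020", "0037")

  // Parse each value as base 16 and turn it into the corresponding character.
  def decode(codes: List[String]): String =
    codes.map(code => Integer.parseInt(code, 16).toChar).mkString
}
```

Calling EscapedSource.decode(EscapedSource.hexCodes) gives back the string val x = 7.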

You do lose the ability to use unicode escapes in identifiers that aren’t back-quoted. That might be a bit more cumbersome in the general case, and it rules out unicode escapes in identifiers in match cases (since back-quoted identifiers have a special meaning there).
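
The special meaning of back-quotes in match cases can be seen in a small example. A plain lowercase identifier in a pattern binds a new variable, while a back-quoted one compares against an existing value. The names below are invented for the illustration.

```scala
// Illustrative names; shows why back-quoting changes the meaning of a pattern.
object BackquotePatterns {
  val limit = 7

  // A lowercase identifier in a pattern introduces a fresh binding,
  // so this case matches every input.
  def bind(n: Int): String = n match {
    case x => s"bound $x"
  }

  // A back-quoted identifier refers to the existing value `limit`,
  // so this case matches only when n equals 7.
  def compare(n: Int): String = n match {
    case `limit` => "equal to limit"
    case _       => "something else"
  }
}
```

So back-quoting an identifier just to be able to spell it with special characters would silently turn a binding pattern into an equality check.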

I’d love to check whether there are any uses of that feature in the community build, and how often unicode escapes are used in a way that wouldn’t be supported by this scheme.