Pre-SIP: addressing warts in string literals and interpolations

martijnhoekstra · February 22, 2020, 3:12pm

There are a number of inconsistencies and redundancies around string literals and interpolations that I think can be cleaned up, making for a simpler and more consistent language.

Single and triple quoted strings differ in two ways: how escape sequences are handled and whether they accept literal newlines.

Single and triple quoted interpolations differ only in whether they accept literal newlines. Because the interpolator decides how escapes are handled, and the interpolator doesn’t have access to whether the input was single or triple quoted, it can’t differentiate between the two even if it wanted to.
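To make that concrete, here is a minimal sketch. The rawParts interpolator is hypothetical and exists only for illustration: it hands back the unprocessed parts it receives through StringContext, and those parts look the same however the interpolation was quoted.

```scala
object QuotingDemo {
  // Hypothetical interpolator, named rawParts for illustration only.
  // It returns the parts exactly as the compiler passes them in,
  // before any escape processing has happened.
  implicit class RawPartsInterpolator(sc: StringContext) {
    def rawParts(args: Any*): Seq[String] = sc.parts
  }

  // Both forms produce a single part holding a backslash followed by 't'.
  val single: Seq[String] = rawParts"foo\tbar"
  val triple: Seq[String] = rawParts"""foo\tbar"""
}
```

Nothing in StringContext records which quoting style was used, so an interpolator written against today's API has no way to branch on it.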

A first observation is that there isn’t much of a reason not to let single quoted strings and interpolations contain newlines. It would take something that’s now a syntax error and make it do the obviously intended thing. It departs from how Java handles things, but if you go in thinking things work as in Java, the worst that could happen is you’re pleasantly surprised.

It would also be sort of nice if interpolations did know whether or not they are single quoted or triple quoted. They could then decide to handle escapes differently between single and triple quoted interpolations, eliminating a potential source of surprise.

With that, you could make an interpolator that works like the s interpolator when single quoted and like the raw interpolator when triple quoted. A string literal only differs from such an interpolation in how $ is handled.

As a bonus point, the added value of string literals when you already have interpolations becomes pretty small. That was remarked years ago already by the inimitable Som Snytt, who suggested expressing string literals as the apply interpolator: "foo" would be shorthand for apply"foo", which would produce the string foo. Hopefully I didn’t butcher his ideas too horribly.

If fixing the mentioned warts is worth the effort, would this be the route to go, without the bonus point? If it is, what objections would there be to the bonus point?

jducoeur · February 22, 2020, 3:36pm

Well, yes and no. It’s not unusual to accidentally miss a concluding double-quote, or typo an extra one, so you can easily wind up with different errors – literal Strings that sometimes go on for tens of lines until hitting the next double-quote – instead.

That should be pretty obvious in modern IDEs, so I don’t think it’s a disastrous problem, but folks would have to change their expectations a little.

It’s interesting, but I’m not immediately seeing the practical value. Do you have a use case in mind? An example would be helpful here.

(I’m not sure I agree that this aspect is a wart per se. Things are currently consistent along one dimension; you’re proposing to make it consistent in a different way. I’m trying to understand whether it’s an improvement in practice.)

martijnhoekstra · February 22, 2020, 4:21pm

The current inconsistency is that string literals differ in 2 ways depending on quotedness (newlines and escapes), and interpolations in 1 way (newlines). That s"foo\tbar" == s"""foo\tbar""" == "foo\tbar", but raw"foo\tbar" == raw"""foo\tbar""" == """foo\tbar""" is the warty part.
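Spelled out as something you can run (a small sketch, the object and val names are just for illustration):

```scala
object EscapeDemo {
  // These three all contain a real tab character.
  val sSingle   = s"foo\tbar"
  val sTriple   = s"""foo\tbar"""
  val litSingle = "foo\tbar"

  // These three all contain a backslash followed by the letter t.
  val rawSingle = raw"foo\tbar"
  val rawTriple = raw"""foo\tbar"""
  val litTriple = """foo\tbar"""
}
```

So for the s and raw interpolators the quoting style makes no difference to escapes, while for plain literals it does.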

As for an example, if I have the string

"""An example: the escape for tab is written as \t"""

and I want to make it into an interpolation, I can’t just slap an s in front:

s"""An example for $name: the escape for tab is written as \t"""

escapes the tab, while the string literal doesn’t. In fact, triple quoting the interpolation doesn’t make a difference at all: in an interpolation, it’s just something you have to do when you write a multiline string; it doesn’t have the meaning it has in a string literal.
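A sketch of that pitfall, with a made-up value for name. The last val shows the workaround available today, which is to reach for the raw interpolator instead of s:

```scala
object TabExample {
  val name = "Scala" // hypothetical value, only here so the example compiles

  // Plain triple quoted literal: the backslash and the t are kept as written.
  val literal = """An example: the escape for tab is written as \t"""

  // Putting an s in front: the trailing \t now becomes a real tab character.
  val interpolated = s"""An example for $name: the escape for tab is written as \t"""

  // raw still substitutes $name but leaves the backslash sequence alone.
  val workaround = raw"""An example for $name: the escape for tab is written as \t"""
}
```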

That the difference between single quoted and triple quoted differs between a string and an interpolation is confusing, to the point you quite often see people triple quote their single-line interpolators.

som-snytt · February 22, 2020, 8:59pm

I agree that it’s confusing that triple quotes have extra semantics for string literals that don’t hold for s"". There are Stack Overflow questions about it. Is the solution to say triple quotes means multiline only, and use the raw interpolator for unprocessed escapes?

The multiline distinction is useful if my fix for s"\"hello\", said the world" were accepted, as it relied on line-oriented parsing, IIRC.

My opinion is that the inability to escape the quotation is the biggest wart of all. There are reasons it works the way it does, but in a world with voice recognition in call centers, facial recognition in law enforcement, DNA matching from public databases for cold cases, my compiler can’t handle backslash quote?

Also, why do we call it “facial recognition” instead of “face recognition”? A “facial” is what you get at the spa. Maybe there is a technical reason for the term. There is that Twitter account, “faces in things”. I did just read an article debunking the things people see in photos from Mars, including the famous face.

There are also questions around single quote ' versus double, because of other languages (than Java). I think 'abc' should do the obvious thing. I’m not sure if 'a' should mean a string unless the expected type is a char. (Or conversely, 'a' is a char unless expected to be a string. The rule of “smallest type that fits” is used for val b: Byte = 42, for instance.)

If "abc" means apply"abc", I’m not sure if 'abc' means uninterpolated, like quoting schemes in “other languages”. I lean that way, but I haven’t experimented.

There was a parser fix around whether a char literal can be a newline (as opposed to the escaped '\n') or whether ''' is legal, so I agree that edge cases or warts can persist for years and become canonicalized as just the way it is.

The transition to Scala 3 is a good moment to revisit the points you raise, especially since Dotty has been experimenting with syntax.

martijnhoekstra · February 22, 2020, 10:16pm

If memory serves me right, whether the backslash quote was interpreted as a backslash and then end of string, or as a quote as part of a string, depended on the rest of the line, and where the string ended depended on whether there were more strings with backslash escaped quotes on the line.

This is one of those cases where easy is at odds with simple. The result would probably be that it works and does the right thing in the majority of cases, but almost nobody knows in which cases it does and doesn’t work. Given the alternative solutions (either triple quote, or escape with a $ sign), I much prefer those over the easy but complex parsing rule for the rest of the line.

I would like to be able to use single quotes for strings too. I think that conflicts with the quote/splice syntax in Dotty. That could use a different token to make this possible, I suppose. It makes me wonder how 'a' is parsed in Dotty if a is a valid quotable identifier, though.

martijnhoekstra · February 22, 2020, 11:36pm

Ultimately, I think it would be pretty neat if

"Kind regards,

 $name"

or maybe even

val footer = "kind regards,
             |
             |$name"

did the obviously right thing.
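For comparison, this is roughly what you have to write today to get that result: triple quotes plus an explicit stripMargin call. The value of name is made up for the example.

```scala
object FooterDemo {
  val name = "Martijn" // hypothetical value for illustration

  // Triple quotes allow the literal newlines; stripMargin removes the
  // leading whitespace and the | marker from each continuation line.
  val footer: String =
    s"""kind regards,
       |
       |$name""".stripMargin
}
```

Printing FooterDemo.footer gives three lines: the greeting, an empty line, and the name.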