SIP-XX: Dedented Multiline String Literals

I’m not a fan of placing arbitrary limitations on the language to satisfy the language designers’ preferences. If somebody likes to put ASCII art in their string literals, let them. We don’t need to babysit grown-up programmers. And when it comes to ASCII art, you’re not going to get very far within a single line anyway…

Well, I like that philosophy too, overall, at least when I’m writing code not reading someone else’s. But the Scala philosophy seems to admit certain limitations for the sake of regularity (e.g. infix requiring a keyword), and this is an area where simplicity of visual parsing is a virtue, so having a rule that helps enforce that seems pretty well within bounds.

But with significant indentation being enough, I mostly don’t see the point of elaborate delimiters at all.

On the other hand, if we allow arbitrary sequences, users could make some strings look more self-explanatory:

def giveMeCodeExample(s: String) = ???

giveMeCodeExample(
  "<sql>

  SELECT one, two, three
  FROM foo
  WHERE two = '2'
  ORDER BY three

  <sql>")

giveMeCodeExample(
  "<scala>

  foo
    .filter(_.two)
    .sortBy(_.three)
    .map: foo =>
      (foo.one, foo.two, foo.three)

  <scala>")

giveMeCodeExample(
  "<java>

  foo.stream()
      .filter(Foo::two)
      .sorted(Comparator.comparing(Foo::three))
      .map(f -> new Triple(f.one(), f.two(), f.three()))
      .collect(Collectors.toList());

  <java>")

I admit it is a made-up example, but the point is the added flexibility – it is up to a user to decide.

To be honest, I don’t think the compiler should bother about indentation inside dedented strings. I believe the compiler should strip leading whitespace exactly up to the position indicated by the closing sequence. Everything beyond that position is the user’s business.

val str = "
    one
      two
        three
    "

The result string should be exactly this one:

one
  two
    three

That way there would be a clear separation of responsibilities between the compiler and the user by the final position boundary. It would be visually pretty clear and distinctive, I think.
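
For what it's worth, that rule is small enough to sketch in a few lines of Scala 3. This is only an illustration: the name `dedentToCloser`, its signature, and the decision to reject lines that are indented less than the closing sequence are my own assumptions, not something the pre-SIP specifies.

```scala
// Hypothetical helper: strips exactly `closerColumn` leading spaces from every
// body line, where `closerColumn` is the column of the closing sequence.
// Rejecting under-indented lines is an assumption made for this sketch.
def dedentToCloser(body: Seq[String], closerColumn: Int): Either[String, String] =
  // A non-blank line is invalid if anything other than spaces sits left of the boundary.
  val bad = body.indexWhere { line =>
    line.trim.nonEmpty && line.take(closerColumn).exists(_ != ' ')
  }
  if bad >= 0 then Left(s"line ${bad + 1} is indented less than the closing sequence")
  else Right(body.map(_.drop(closerColumn)).mkString("\n"))

// The body of the `str` example above, with the closing quote at column 4.
val exampleBody = Seq("    one", "      two", "        three")
val exampleResult = dedentToCloser(exampleBody, closerColumn = 4)
```

Everything to the right of column 4 is passed through untouched, which is the "user's business" part of the rule.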

I don’t think this works very cleanly.

That’s an example where it does work. But what about this?

val str = "
    I've got a few things to say
    "
      Hi, hello.
      Bye, farewell.
    "
    That's what I've said.
  "

Is this valid or not? It depends entirely on how you determine the rule for the end of the string, which is indented some.

One rule is: the first token match ends it. However deep that is, is how deep the indent is. In that case, the quoted string is "I've got a few things to say\n", followed by syntax errors.

But another rule is: the outermost match that is no deeper than what we’ve got ends it. In that case, there are no syntax errors, the entire block is indented two spaces, and it’s got quotes in it.

Either way, we end up with weird stuff like

val text = "
  My favorite quote is
  " ++
  raw"
  Isn't it yours, too?
  "

With the first-as-deep-quote-ends-it rule, we don’t get a parse error here! But if we were thinking we’d have an embedded \" ++\nraw\"\n in the string, we’d be wrong.

However, if we are required to have deeper indentation, and the end quote is the same depth as the depth of the line that started the quote, then it’s unambiguous.

With the indent-required rule, the parsing of the string endpoints is trivial for both computers and people. When you see

vvvvvv count these
      blah, whatever, 2 + 4, thing: "
        dunno who cares whatever
        "
        """
        ~~~~~"!##"#_#"
        " + "The End!"
      "
^^^^^^ First line with the same indentation depth, so this is it
// If " isn't the first character after the indent, it's a syntax error

You simply can’t write any confusing cases because indentation takes care of it all.
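
To make the "trivial for computers" claim concrete, here is a rough Scala 3 sketch of the end-of-string search under the indent-required rule. The names `indentOf` and `findClosingLine` are made up for this post, and the handling of blank lines (skipped) and of lines indented less than the opener (treated as errors) is my assumption about details the rule leaves open.

```scala
// Number of leading spaces on a line.
def indentOf(line: String): Int = line.takeWhile(_ == ' ').length

// `openerIndent` is the indentation of the line on which the string was opened;
// `following` holds the lines after that one. Returns the index of the closing
// line within `following`, or an error message.
def findClosingLine(openerIndent: Int, following: Seq[String]): Either[String, Int] =
  // Deeper-indented lines are string content no matter what they contain,
  // so only the first non-blank line at or left of the opener's depth matters.
  val i = following.indexWhere(l => l.trim.nonEmpty && indentOf(l) <= openerIndent)
  if i < 0 then Left("unterminated string literal")
  else
    val line = following(i)
    if indentOf(line) == openerIndent && line.trim.startsWith("\"") then Right(i)
    else Left(s"line ${i + 1}: expected a closing quote at indentation $openerIndent")

// The lines following the opener in the diagram above (the opener is indented by 6).
val diagramLines = Seq(
  "        dunno who cares whatever",
  "        \"",
  "        \"\"\"",
  "        ~~~~~\"!##\"#_#\"",
  "        \" + \"The End!\"",
  "      \""
)
val closingIndex = findClosingLine(6, diagramLines)
```

None of the quotes on the deeper lines are ever inspected, which is why they cannot end the string.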

The only possible confusion is if the block is supposed to come with its own indentation, and it’s never zero throughout the whole block.

val leftwards = "
          /
         /
        /
       <
        \
         \
          \
"

What if that is actually supposed to be indented to the level of two spaces to the left of the <? All we could do at parse time is find how deep < is and subtract that from every line. In contrast, with the trailing same-level ", we get to decide how much indentation there is.

But there are at least three other ways to solve the problem.

  1. Have an .indentBy(" ") method that is eligible for compile-time operation.
  2. Allow <- as a depth indicator, as I showed before, and place the opening " on its own line.
  3. Allow an optional depth line which would be the desired indent followed by a vertical bar (I guess), then newline. This would set the depth to which indentation is removed. If you wanted to start a multi-line quote with a single vertical bar, then you could repeat that line.

val indentHere = "
    |
      I always like
      a little space
      on the left!
"

(Incidentally, this would make for a less obtrusive stripMargin, too.)
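
A quick Scala 3 sketch of how option 3 could behave, just to show it isn't complicated. The function name `stripToDepthLine` is invented here, and leaving the body untouched when there is no depth line is a placeholder of mine, since the default dedent rule is a separate question.

```scala
// Number of leading spaces on a line.
def indentOf(line: String): Int = line.takeWhile(_ == ' ').length

// If the first body line is just an indent followed by '|', use the column of
// that bar as the depth to strip and drop the line itself from the result.
def stripToDepthLine(body: Seq[String]): Seq[String] =
  body.headOption match
    case Some(first) if first.trim == "|" =>
      val depth = first.indexOf('|')
      // Never strip more than a line's own leading spaces.
      body.tail.map(line => line.drop(math.min(depth, indentOf(line))))
    case _ => body // assumption: no depth line means no change in this sketch

// The body of the `indentHere` example above.
val indentHereBody = Seq(
  "    |",
  "      I always like",
  "      a little space",
  "      on the left!"
)
val indentHereResult = stripToDepthLine(indentHereBody)
```

Because only the first line is consumed as the depth marker, repeating the bar line gives you a string whose first line is a lone vertical bar.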

One more requirement I’d like to bring up is the ability to represent single-line strings.

The original motivation for this discussion is to improve the ergonomics of multi-line strings. But if we want to be able to properly deprecate """ and keep the number of recommended syntaxes small, then we need to be able to represent single-line """s which are used in cases such as these JSON string literals found in uPickle’s test suite:

test("simple"){
  test - rw(Gadt.Exists("hello"), """{"$type":"Exists","path":"hello"}""")
  test - rw(Gadt.IsDir(" "), """{"$type":"IsDir","path":" "}""")
  test - rw(Gadt.ReadBytes("\""), """{"$type":"ReadBytes","path":"\""}""")
  test - rw(Gadt.CopyOver(Seq(1, 2, 3), ""), """{"$type":"CopyOver","src":[1,2,3],"path":""}""")
}
test("partial"){
  test - rw(Gadt2.Exists("hello"), """{"$type":"Exists","v":"hello"}""")
  test - rw(Gadt2.IsDir(123), """{"$type":"IsDir","v":123}""")
  test - rw(Gadt2.ReadBytes('h'), """{"$type":"ReadBytes","v":"h"}""")
  test - rw(Gadt2.CopyOver(Seq(1, 2, 3), ""), """{"$type":"CopyOver","src":[1,2,3],"v":""}""")
}
test("issues"){
  test("issue95"){
    rw(
      Tuple1(List(C1("hello", List("world")))),
      """[[{"name": "hello", "types": ["world"]}]]"""
    )
    rw(
      C2(List(C1("hello", List("world")))),
      """{"results": [{"name": "hello", "types": ["world"]}]}"""
    )

    rw(
      GeoCoding2(List(Result2("a", "b", List("c"))), "d"),
      """{"results": [{"name": "a", "whatever": "b", "types": ["c"]}], "status": "d"}"""
    )
  }
}

Similar single-line """ literals can be found in Scalatags, Mill, Cask, Ammonite, PPrint, OS-Lib, Requests-Scala, FastParse, and other projects. It seems almost every single com-lihaoyi project makes use of this syntax, in scenarios that benefit greatly from the compactness of keeping the literal on a single line rather than spreading it out over three lines as a multi-line string.

The various single-" syntaxes proposed can’t really work here, and neither can the extensible "-- syntax proposed by @satorg, all for the same reason that " for single-line strings is already valid syntax so we cannot assign it new semantics. And so for those proposals, keeping """ syntax around seems inevitable, and we cannot deprecate it because it is superior in many scenarios (e.g. those given above) despite its other weaknesses (multi-line indentation management, still needing escaping sometimes).

But with the original extensible ''' syntax, we can handle single-line strings as well by defining it as:

  • Single-line ''' strings can be defined as three-or-more single quotes followed by one space, and are terminated by a space followed by an equal number of single quotes. The contents of the string are the characters following the first space and preceding the last space, excluding those spaces.

With that definition, the example above would be written as:

test("simple"){
  test - rw(Gadt.Exists("hello"), ''' {"$type":"Exists","path":"hello"} ''')
  test - rw(Gadt.IsDir(" "), ''' {"$type":"IsDir","path":" "} ''')
  test - rw(Gadt.ReadBytes("\""), ''' {"$type":"ReadBytes","path":"\""} ''')
  test - rw(Gadt.CopyOver(Seq(1, 2, 3), ""), ''' {"$type":"CopyOver","src":[1,2,3],"path":""} ''')
}
test("partial"){
  test - rw(Gadt2.Exists("hello"), ''' {"$type":"Exists","v":"hello"} ''')
  test - rw(Gadt2.IsDir(123), ''' {"$type":"IsDir","v":123} ''')
  test - rw(Gadt2.ReadBytes('h'), ''' {"$type":"ReadBytes","v":"h"} ''')
  test - rw(Gadt2.CopyOver(Seq(1, 2, 3), ""), ''' {"$type":"CopyOver","src":[1,2,3],"v":""} ''')
}
test("issues"){
  test("issue95"){
    rw(
      Tuple1(List(C1("hello", List("world")))),
      ''' [[{"name": "hello", "types": ["world"]}]] '''
    )
    rw(
      C2(List(C1("hello", List("world")))),
      ''' {"results": [{"name": "hello", "types": ["world"]}]} '''
    )

    rw(
      GeoCoding2(List(Result2("a", "b", List("c"))), "d"),
      ''' {"results": [{"name": "a", "whatever": "b", "types": ["c"]}], "status": "d"} '''
    )
  }
}

The leading/trailing-space requirement would serve as a parallel to the leading/trailing-newline requirement for multiline strings, and have a similar purpose: to allow extensible delimiters, so we can extend ''' "hello" ''' to '''' "hello" '''' or longer, in case the body of the string we want to represent itself contains a literal ''', without escaping.

These are all strict improvements over the existing single-line """ strings, which work well but, like multi-line strings, have edge-case issues with escaping.

If the single-line ''' string starts or ends with a space, the user adds it after/before the required spaces that get stripped, similar to how in multi-line ''' strings you can add leading or trailing newlines in addition to the required newlines that get stripped.
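
As a sanity check that the single-line rule is easy to implement, here is a minimal Scala 3 sketch. The name `parseSingleLine` is made up, the input is assumed to begin at the opening delimiter, and treating the first occurrence of "space followed by the same number of quotes" as the terminator is my reading of the rule rather than settled semantics.

```scala
// Returns the string contents, or None if `src` is not a well-formed
// single-line ''' literal under the rule described above.
def parseSingleLine(src: String): Option[String] =
  // Length of the opening run of single quotes.
  val n = src.takeWhile(_ == '\'').length
  // Need at least three quotes, followed by exactly one required space.
  if n < 3 || n >= src.length || src(n) != ' ' then None
  else
    val closer = " " + ("'" * n)
    val start = n + 1 // first character after the required leading space
    // Assumption: the first "space + n quotes" after the opener ends the literal.
    val end = src.indexOf(closer, start)
    if end < 0 then None else Some(src.substring(start, end))

val json = parseSingleLine("''' {\"$type\":\"IsDir\",\"path\":\" \"} '''")
val nested = parseSingleLine("'''' say ''' here ''''")
val padded = parseSingleLine("''' a  '''")
```

The `padded` case shows the extra-space rule: the second trailing space is part of the contents, and only the one directly before the closing quotes is stripped.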

Apart from the improvements themselves being nice, this would let us fully deprecate the """ syntax, since we would have an alternative that is better in every way in both single-line and multi-line scenarios.

We could also drop the extensibility and leading/trailing-space requirement for single-line ''' strings. In that case we would be at feature parity with existing """ strings, enough to deprecate """ in favor of ''', though without the extensible-delimiters/never-need-escaping improvement and with slightly less symmetry between single-line and multi-line strings.

This is an interesting way to solve the problem, but I’m not sure I like the way that the content of the string gets visually separated. The character doesn’t have to be a space; indeed, since '' isn’t a legal identifier in Scala, it can be anything following '' and it’s fine. I’m not sure ' is the best choice for the following text.

I would tend to go for ''| or ''# for visual obviousness, with extra ' as needed, if we need this feature at all. The triple-quoted versions honestly don’t seem bad to me; we already have the syntax, and it’s apparent enough, although I agree that the single " are a bit of a distractor when trying to find the terminal """.

test("simple"){
  test - rw(Gadt.Exists("hello"), ''#{"$type":"Exists","path":"hello"}#'')
  test - rw(Gadt.IsDir(" "), ''#{"$type":"IsDir","path":" "}#'')
  test - rw(Gadt.ReadBytes("\""), ''#{"$type":"ReadBytes","path":"\""}#'')
  test - rw(Gadt.CopyOver(Seq(1, 2, 3), ""), ''#{"$type":"CopyOver","src":[1,2,3],"path":""}#'')
}

reads better to me personally.

But, anyway, it would be fine; I just don’t know if the motivation is strong enough, especially given that we don’t need ''' at all to get the multi-line auto-stripmargin working.

How about using a singlequote ' symbol to represent strings exactly the same way as a doublequote " does? I.e. 'abc' == "abc". Like in Python?

The only caveat: when ' wraps a single char (e.g. 'a'), it represents a Char literal, not a String. But if it contains more than one symbol, it is a regular string, treated the same way as one wrapped in ", including multiline strings.

> val c = 'A'
val c: Char = A

> val s1 = "A'a"
val s1: String = A'a

> val s2 = 'A"a'
val s2: String = A"a

Will it work for unicode?

It’s nice not to have ambiguity about types. Double single quote would be fine, though. ''This is a string.''

At least for the purpose of having extensible delimiters, there needs to be some character that’s different from the repeated character (') to allow us to determine the end of the delimiter. So ''# would work.

Then there’s the subjective question of familiarity, where ''' strings are much more ubiquitous across the programming ecosystem than ''# strings, so having the delimiter start with ''' seems like a good thing. And the proposed multi-line syntax already uses '''\n as its delimiter, so '''<space> for the single-line equivalent seems like it would match closely and be least surprising.

The motivation for single-line ''' syntax would be to allow us to fully deprecate the older """ syntax. So it would be " or ''', rather than " or """ or ''', at least for writing code going forward. Notably, that brings it in line with the proposed " multi-line syntax, which still requires two syntaxes, " or """, since it doesn’t handle the single-line-string-with-lots-of-quotes case well, and that case appears to be quite common in the wild.

Fair enough, but I don’t think this actually answers the question of whether it’s needed.

For a minimal alternative quote, '' is available and smaller and is “a double quote”, just spelled differently. So independent of everything else, we could just say that ordinary strings can "be like this" or ''like this''.

Three is fine also. But the third isn’t giving us anything. In languages where '' is the empty string, they’re forced to go to '''. But '' is not an empty string in Scala; it’s a syntax error. (In fact, I’d argue that ''' should be how “the character '” is spelled. It’s very odd that, despite 'x' working for basically every other x (save \), you have to write '\''.)

Separately, if you need a quote of deeply nested quotes of quotes of quotes, having to use indentation syntax in those cases seems like not too big an ask. The only thing is that we’d need to specify that multi-line strings do not return the trailing newline. If you want one, add another line.

Although I usually like the idea of generality, this one seems like it’s messing with the typical case, forcing extra spaces where normally what you see is what you get (and yes, you can want spaces, and 7 spaces vs 8 is really hard to spot), in order to support one flavor of unusual case that could be solved other ways.

And I don’t think there’s a compelling reason to retire """. It has its uses as-is: you go """ on a line, simply dump verbatim text in, and go """ at the end and voila! You have that text. You don’t need to worry about indentation, margins, any of that stuff. It’s just the pure text. The only improvement might be to allow powers of 3 (so, """"""""" and """"""""""""""""""""""""""" and so on) to handle the case of quoting things that have strings of quotes.

OK, that would be a reason to keep """. Is it worth changing everything around to support these cases?

Maybe? This is just following the idea of “what if we deprecate """” that you mentioned earlier, in the interest of consolidating onto a smaller set of language primitives. If we don’t care about deprecating """ then we can skip all this.

I see no benefit in deprecating """; I would keep using it for these single-line cases where you don’t want to escape ". The majority of the cases are multi-line, and if we can solve those without adding ''' it’s a win in my book.

I have updated the proposal to include all the feedback and alternatives brought up in this discussion. Now that this first round of discussion seems to have tapered off, I will circulate it more broadly among the community. Those of you who participated in this discussion should also read through the updated proposal so we can start the next rounds of discussion all on the same page.

I like the general idea of this pre-SIP.

I’ve only skimmed the discussion, so apologies if the following was already brought up…

The pre-SIP says:

Dedented strings automatically strip:

  • Any indentation on every line up to the position of the closing '''

And then provides an example with pattern matching:

foo match{
  case '''
    i am cow
    hear me moo
  ''' =>
}

So, it seems that the above syntax would produce a string value with leading spaces:

  i am cow
  hear me moo

Whereas other examples in the pre-SIP intend to produce a fully dedented string:

i am cow
hear me moo

To achieve the latter, we would need to write something like

foo match{
  case '''
    i am cow
    hear me moo
    ''' =>
}

If I understood this correctly, perhaps just update this example in the doc.

I don’t like the idea to overload semantics of '. Overloaded semantics make a language more difficult.

I’m very much in favor of actually making the language simpler by unifying features.

So if there is a way to use " for the proposed idea, that would be great!

Then we’d have: " == String Literal, and that’s it. Clean and simple. (I guess the triple version needs to be kept for backwards compatibility for some time.)

The idea in general is overdue. Most languages have had this feature for some time, as the pre-SIP writeup clearly demonstrates, and it’s such an important QoL feature!

Thanks @lihaoyi for pushing this feature in general! (Even if I don’t like the current concrete details.)

What pleases me most visually is the

def openingParagraph =
  "
    i am cow
    hear me moo
  "

version. Similar to what @Ichoran proposed too.

I think the indentation inside the string block looks much better than

def openingParagraph =
  "
  i am cow
  hear me moo
  "

This simply doesn’t feel natural in an indentation-based block syntax.

Also, Scala really needs to embrace vertical whitespace much more. “Classical” Scala was always way too dense. Since we got indentation-based block syntax, things have gotten much better. I’ve even seen Odersky arguing for vertical whitespace (“paragraphs” in code) on some GitHub issue, even though his old code was usually some of the most densely written (which, as I said, is hard to read). It’s nice to see a change here.

Putting a single " on a line of its own also makes the text block really look like a block. Adding an extra level of indentation then also solves the escaping issue. I really like that.

@nikitaga you are right, that was an oversight! I’ll fix the proposal

It looks compelling at first glance.
However, this approach raises two questions:

  1. What is the correct level of indentation within a text block? I’m not sure that a compiler (any compiler!) should dictate the exact level to users. A compiler may require some indentation (in a case of significant indentation), but I don’t think it is a good idea to force any specific amount of spaces on users. Ultimately, different users may have different preferences.
  2. What if a user wants to preserve some indentation within their text block? I.e. what if we want to get something like this as a result:
       I want this text
       to be indented
       3 positions.
    
    I believe we should never assume that no one will need it at all. And that would imply that a user would have to create an indentation within the pre-indented block, e.g.:
    def openingParagraph =
      "
           I want this text
           to be indented
           3 positions.
      "
    //  ^ – this is where the text's own indentation starts.
    
    To me, it is just difficult to read and maintain.

Also, I’ve got some doubts regarding the “indentation pointers” (<-) approach suggested by @Ichoran. Although it looks technically plausible:

  1. In my opinion, it looks too “esoteric”. I wonder, how many languages have something like this?
  2. Imagine your code filled with those little pointers at the beginning of every text block. My opinion again, but I don’t think it would look clear, readable, or simply pleasing to the eye.

On the other hand, HEREDOC-like strings:

  1. Are well-known and rather ubiquitous. Many languages support this idea in one form or another. In other words, this approach is proven to work.
  2. Make it possible to visually and very clearly highlight large blocks of code. Moreover, users can choose the most suitable method for them.
  3. Draw a clear and unambiguous boundary between where the dedenting (done by the compiler) ends and where the user’s text begins: exactly at the beginning of the closing sequence.

My example above, with additional indentation within the text, would look like this:

def openingParagraph =
  "--
     I want this text
     to be indented
     3 positions.
  --"

Clear, readable, configurable, perfect :)

Ok, just kidding – no solution is ever perfect. But I think it is still a better deal than the “indentation-after-indentation” approach.
