Remove local tuple allocation

When I looked at the code of
https://benjdd.com/languages3/


Scala-Native is fast, but Scala on the JVM is slow

![image|608x500](upload://owtLxsxOjjVHIvmufH9fw6tWyYn.jpeg

where languages/levenshtein/scala/code.scala at main · bddicken/languages · GitHub

I think that may introduce some allocation.

object Test {

  def method(s1: String, s2: String): Unit = {
    val (str1, str2) = if (s1.length > s2.length) (s2, s1) else (s1, s2)
  }
}

and the bytecode shows:

// access flags 0x1
  public method(Ljava/lang/String;Ljava/lang/String;)V
    // parameter final  s1
    // parameter final  s2
   L0
    LINENUMBER 11 L0
    ALOAD 1
    INVOKEVIRTUAL java/lang/String.length ()I
    ALOAD 2
    INVOKEVIRTUAL java/lang/String.length ()I
    IF_ICMPLE L1
    NEW scala/Tuple2
    DUP
    ALOAD 2
    ALOAD 1
    INVOKESPECIAL scala/Tuple2.<init> (Ljava/lang/Object;Ljava/lang/Object;)V
    GOTO L2
   L1
   FRAME SAME
    NEW scala/Tuple2
    DUP
    ALOAD 1
    ALOAD 2
    INVOKESPECIAL scala/Tuple2.<init> (Ljava/lang/Object;Ljava/lang/Object;)V
   L2
   FRAME SAME1 scala/Tuple2
    ASTORE 5
    ALOAD 5
    IFNULL L3
    ALOAD 5
    INVOKEVIRTUAL scala/Tuple2._1 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 6
   L4
    ALOAD 5
    INVOKEVIRTUAL scala/Tuple2._2 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 7
   L5
    NEW scala/Tuple2
    DUP
    ALOAD 6
    ALOAD 7
    INVOKESPECIAL scala/Tuple2.<init> (Ljava/lang/Object;Ljava/lang/Object;)V
    GOTO L6
   L3
   FRAME APPEND [T T scala/Tuple2]
    GOTO L7
   L7
   FRAME SAME
    NEW scala/MatchError
    DUP
    ALOAD 5
    INVOKESPECIAL scala/MatchError.<init> (Ljava/lang/Object;)V
    ATHROW
   L6
   FRAME FULL [com/alibaba/ultramax/mtop/jsonpath/Test$ java/lang/String java/lang/String T T scala/Tuple2 java/lang/String java/lang/String] [scala/Tuple2]
    ASTORE 4
    ALOAD 4
    INVOKEVIRTUAL scala/Tuple2._1 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 8
   L8
    ALOAD 4
    INVOKEVIRTUAL scala/Tuple2._2 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 9
   L9
    RETURN
   L10
    LOCALVARIABLE str1 Ljava/lang/String; L4 L3 6
    LOCALVARIABLE str2 Ljava/lang/String; L5 L3 7
    LOCALVARIABLE str1 Ljava/lang/String; L8 L10 8
    LOCALVARIABLE str2 Ljava/lang/String; L9 L10 9
    LOCALVARIABLE this Lcom/alibaba/ultramax/mtop/jsonpath/Test$; L0 L10 0
    LOCALVARIABLE s1 Ljava/lang/String; L0 L10 1
    LOCALVARIABLE s2 Ljava/lang/String; L0 L10 2
    MAXSTACK = 4
    MAXLOCALS = 10

I think when the tuple is only used for switch variables, it should be compiles to simple if else block.

object Test {

  def method(s1: String, s2: String): Unit = {
    val isS1Bigger = s1.length > s2.length
    val str1 = if (isS1Bigger) s2 else s1
    val str2 = if (isS1Bigger) s1 else s2
  }
}

which gets:

  // access flags 0x1
  public method(Ljava/lang/String;Ljava/lang/String;)V
    // parameter final  s1
    // parameter final  s2
   L0
    LINENUMBER 11 L0
    ALOAD 1
    INVOKEVIRTUAL java/lang/String.length ()I
    ALOAD 2
    INVOKEVIRTUAL java/lang/String.length ()I
    IF_ICMPLE L1
    ICONST_1
    GOTO L2
   L1
   FRAME SAME
    ICONST_0
   L2
   FRAME SAME1 I
    ISTORE 3
   L3
    LINENUMBER 12 L3
    ILOAD 3
    IFEQ L4
    ALOAD 2
    GOTO L5
   L4
   FRAME APPEND [I]
    ALOAD 1
   L5
   FRAME SAME1 java/lang/String
    ASTORE 4
   L6
    LINENUMBER 13 L6
    ILOAD 3
    IFEQ L7
    ALOAD 1
    GOTO L8
   L7
   FRAME APPEND [java/lang/String]
    ALOAD 2
   L8
   FRAME SAME1 java/lang/String
    ASTORE 5
   L9
    RETURN
   L10
    LOCALVARIABLE isS1Bigger Z L3 L10 3
    LOCALVARIABLE str1 Ljava/lang/String; L6 L10 4
    LOCALVARIABLE str2 Ljava/lang/String; L9 L10 5
    LOCALVARIABLE this Lcom/alibaba/ultramax/mtop/jsonpath/Test$; L0 L10 0
    LOCALVARIABLE s1 Ljava/lang/String; L0 L10 1
    LOCALVARIABLE s2 Ljava/lang/String; L0 L10 2
    MAXSTACK = 2
    MAXLOCALS = 6
2 Likes

Given the same source uses while loop few lines below this, instead of using natural for, one would assume the explicit procedural style code you have shown could be used in this place as well, if it is shown it matters?

The special case you describe may sound trivial, but there is plethora of simple cases like this and their variations and I am not sure if it is realistic to expect the compiler to handle them all.

Frankly I think the Java vs Scala comparison on this particular benchmark can be explained by the fact that Scala compilation produces a jar, and the java one uses native image.

So Scala is just paying down the jvm startup cost.

1 Like

Yes, we certainly should eliminate tuple constructions in matches. As far as I know Scala 2 already does this?

1 Like

Not sure, I’m using 2.13.15 but just a simple compile without the -opt:inline

The Scala 2 optimization is specific to tuples:

Did you have in mind a similarly tuple-specific optimization for Scala 3 or something more general?

I can imagine an idea of inline class where the class instance would instead be replaced with the fields of the class as local variables. However, I can’t see any way that an inline class could be compatible with an ordinary class – we would have to introduce a new InlineTuple rather than make the existing Tuple an inline class.

1 Like

You don’t need and should not need a language-level concept for this. This is optimization material. The Scala.js optimizer does that for tuples and other classes annotated with @inline, without semantic change.

4 Likes


Scala-js seems faster than raw js.

submitted a pr Add scala-js with bun by He-Pin · Pull Request #271 · bddicken/languages · GitHub

2 Likes

For those of you who enjoy startup optimiztaions for the JVM, try:

java
-XX:+AutoCreateSharedArchive
-XX:SharedArchiveFile=appcds.jsa
-XX:TieredStopAtLevel=1
-XX:+UseSerialGC
-jar code

This improves the example from ~200ms to ~130ms for very small sequences (i.e., only measuring startup overhead) on my system.

There was a time when I managed to execute jvm/node programs in ~20ms startup time (not ~80-100) … not sure what changed, but both got slower :thinking:

I wonder … given that this program allocates only a single tuple at the beginning … would optimizing that away improve things more than anticipated because the classloader would not have to load the tuple class? :sweat_smile:

Last time I checked the scala 2 optimizer did exactly these kinds of optimizations.