Remove local tuple allocation

hepin1989 · December 19, 2024, 10:26am

When I looked at the code of
https://benjdd.com/languages3/

Scala-Native is fast, but Scala on the JVM is slow

![image|608x500](upload://owtLxsxOjjVHIvmufH9fw6tWyYn.jpeg

where languages/levenshtein/scala/code.scala at main · bddicken/languages · GitHub

I think that may introduce some allocation.

object Test {

  def method(s1: String, s2: String): Unit = {
    val (str1, str2) = if (s1.length > s2.length) (s2, s1) else (s1, s2)
  }
}

and the bytecode shows:

// access flags 0x1
  public method(Ljava/lang/String;Ljava/lang/String;)V
    // parameter final  s1
    // parameter final  s2
   L0
    LINENUMBER 11 L0
    ALOAD 1
    INVOKEVIRTUAL java/lang/String.length ()I
    ALOAD 2
    INVOKEVIRTUAL java/lang/String.length ()I
    IF_ICMPLE L1
    NEW scala/Tuple2
    DUP
    ALOAD 2
    ALOAD 1
    INVOKESPECIAL scala/Tuple2.<init> (Ljava/lang/Object;Ljava/lang/Object;)V
    GOTO L2
   L1
   FRAME SAME
    NEW scala/Tuple2
    DUP
    ALOAD 1
    ALOAD 2
    INVOKESPECIAL scala/Tuple2.<init> (Ljava/lang/Object;Ljava/lang/Object;)V
   L2
   FRAME SAME1 scala/Tuple2
    ASTORE 5
    ALOAD 5
    IFNULL L3
    ALOAD 5
    INVOKEVIRTUAL scala/Tuple2._1 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 6
   L4
    ALOAD 5
    INVOKEVIRTUAL scala/Tuple2._2 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 7
   L5
    NEW scala/Tuple2
    DUP
    ALOAD 6
    ALOAD 7
    INVOKESPECIAL scala/Tuple2.<init> (Ljava/lang/Object;Ljava/lang/Object;)V
    GOTO L6
   L3
   FRAME APPEND [T T scala/Tuple2]
    GOTO L7
   L7
   FRAME SAME
    NEW scala/MatchError
    DUP
    ALOAD 5
    INVOKESPECIAL scala/MatchError.<init> (Ljava/lang/Object;)V
    ATHROW
   L6
   FRAME FULL [com/alibaba/ultramax/mtop/jsonpath/Test$ java/lang/String java/lang/String T T scala/Tuple2 java/lang/String java/lang/String] [scala/Tuple2]
    ASTORE 4
    ALOAD 4
    INVOKEVIRTUAL scala/Tuple2._1 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 8
   L8
    ALOAD 4
    INVOKEVIRTUAL scala/Tuple2._2 ()Ljava/lang/Object;
    CHECKCAST java/lang/String
    ASTORE 9
   L9
    RETURN
   L10
    LOCALVARIABLE str1 Ljava/lang/String; L4 L3 6
    LOCALVARIABLE str2 Ljava/lang/String; L5 L3 7
    LOCALVARIABLE str1 Ljava/lang/String; L8 L10 8
    LOCALVARIABLE str2 Ljava/lang/String; L9 L10 9
    LOCALVARIABLE this Lcom/alibaba/ultramax/mtop/jsonpath/Test$; L0 L10 0
    LOCALVARIABLE s1 Ljava/lang/String; L0 L10 1
    LOCALVARIABLE s2 Ljava/lang/String; L0 L10 2
    MAXSTACK = 4
    MAXLOCALS = 10

I think when the tuple is only used for switch variables, it should be compiles to simple if else block.

object Test {

  def method(s1: String, s2: String): Unit = {
    val isS1Bigger = s1.length > s2.length
    val str1 = if (isS1Bigger) s2 else s1
    val str2 = if (isS1Bigger) s1 else s2
  }
}

which gets:

  // access flags 0x1
  public method(Ljava/lang/String;Ljava/lang/String;)V
    // parameter final  s1
    // parameter final  s2
   L0
    LINENUMBER 11 L0
    ALOAD 1
    INVOKEVIRTUAL java/lang/String.length ()I
    ALOAD 2
    INVOKEVIRTUAL java/lang/String.length ()I
    IF_ICMPLE L1
    ICONST_1
    GOTO L2
   L1
   FRAME SAME
    ICONST_0
   L2
   FRAME SAME1 I
    ISTORE 3
   L3
    LINENUMBER 12 L3
    ILOAD 3
    IFEQ L4
    ALOAD 2
    GOTO L5
   L4
   FRAME APPEND [I]
    ALOAD 1
   L5
   FRAME SAME1 java/lang/String
    ASTORE 4
   L6
    LINENUMBER 13 L6
    ILOAD 3
    IFEQ L7
    ALOAD 1
    GOTO L8
   L7
   FRAME APPEND [java/lang/String]
    ALOAD 2
   L8
   FRAME SAME1 java/lang/String
    ASTORE 5
   L9
    RETURN
   L10
    LOCALVARIABLE isS1Bigger Z L3 L10 3
    LOCALVARIABLE str1 Ljava/lang/String; L6 L10 4
    LOCALVARIABLE str2 Ljava/lang/String; L9 L10 5
    LOCALVARIABLE this Lcom/alibaba/ultramax/mtop/jsonpath/Test$; L0 L10 0
    LOCALVARIABLE s1 Ljava/lang/String; L0 L10 1
    LOCALVARIABLE s2 Ljava/lang/String; L0 L10 2
    MAXSTACK = 2
    MAXLOCALS = 6

There is a asm method for this:

public void swap(Type prev,
 Type type)
Generates the instructions to swap the top two stack values.

OndrejSpanel · December 19, 2024, 10:44am

Given the same source uses while loop few lines below this, instead of using natural for, one would assume the explicit procedural style code you have shown could be used in this place as well, if it is shown it matters?

The special case you describe may sound trivial, but there is plethora of simple cases like this and their variations and I am not sure if it is realistic to expect the compiler to handle them all.

velvetbaldmime · December 19, 2024, 10:48am

Frankly I think the Java vs Scala comparison on this particular benchmark can be explained by the fact that Scala compilation produces a jar, and the java one uses native image.

So Scala is just paying down the jvm startup cost.

odersky · December 19, 2024, 11:01am

Yes, we certainly should eliminate tuple constructions in matches. As far as I know Scala 2 already does this?

hepin1989 · December 19, 2024, 11:15am

Not sure, I’m using 2.13.15 but just a simple compile without the -opt:inline

olhotak · December 19, 2024, 3:29pm

The Scala 2 optimization is specific to tuples:

github.com

scala/scala/blob/8c6882330c8b60626bb38ef8c45fef82305e524a/src/compiler/scala/tools/nsc/backend/jvm/opt/BoxUnbox.scala#L812


      
          
          def checkTupleCreation(insn: AbstractInsnNode, expectedKind: Option[Tuple], prodCons: ProdConsAnalyzer): Option[(BoxCreation, Tuple)] = {
            def checkKind(mi: MethodInsnNode): Option[Tuple] = expectedKind match {
              case Some(kind) => if (kind.tupleClass == tupleClass(mi)) expectedKind else None
              case None => Some(Tuple(boxedTypes(mi), tupleClass(mi)))
            }
          
            insn match {
              // no need to check for TupleN.apply: the compiler transforms case companion apply calls to constructor invocations
              case ti: TypeInsnNode if ti.getOpcode == NEW =>
                for ((dupOp, initCall) <- BoxKind.checkInstanceCreation(ti, prodCons) if isTupleConstructor(initCall); kind <- checkKind(initCall))
                  yield (InstanceCreation(ti, dupOp, initCall), kind)
          
              case _ => None
            }
          }
          
          private val specializedTupleClassR = "scala/Tuple[12]\\$mc[IJDCZ]{1,2}\\$sp".r
          private def isSpecializedTupleClass(tupleClass: InternalName) = specializedTupleClassR.pattern.matcher(tupleClass).matches
          
          private val specializedTupleGetterR = "_[12]\\$mc[IJDCZ]\\$sp".r

Did you have in mind a similarly tuple-specific optimization for Scala 3 or something more general?

I can imagine an idea of inline class where the class instance would instead be replaced with the fields of the class as local variables. However, I can’t see any way that an inline class could be compatible with an ordinary class – we would have to introduce a new InlineTuple rather than make the existing Tuple an inline class.

sjrd · December 19, 2024, 9:00pm

You don’t need and should not need a language-level concept for this. This is optimization material. The Scala.js optimizer does that for tuples and other classes annotated with @inline, without semantic change.

hepin1989 · December 20, 2024, 3:56am

Scala-js seems faster than raw js.

hepin1989 · December 20, 2024, 5:44am

submitted a pr Add scala-js with bun by He-Pin · Pull Request #271 · bddicken/languages · GitHub

ragnar · December 20, 2024, 11:12am

For those of you who enjoy startup optimiztaions for the JVM, try:

java
-XX:+AutoCreateSharedArchive
-XX:SharedArchiveFile=appcds.jsa
-XX:TieredStopAtLevel=1
-XX:+UseSerialGC
-jar code

This improves the example from ~200ms to ~130ms for very small sequences (i.e., only measuring startup overhead) on my system.

There was a time when I managed to execute jvm/node programs in ~20ms startup time (not ~80-100) … not sure what changed, but both got slower

I wonder … given that this program allocates only a single tuple at the beginning … would optimizing that away improve things more than anticipated because the classloader would not have to load the tuple class?

Jasper-M · December 20, 2024, 11:21am

Last time I checked the scala 2 optimizer did exactly these kinds of optimizations.

hepin1989 · December 29, 2024, 6:39pm

The JVM has swap bytecode which does swap the values.

public void swap(Type prev,
 Type type)
Generates the instructions to swap the top two stack values.

I just checked scala 3, the SWAP is not been used.

hepin1989 · December 30, 2024, 3:06pm

/** A phase that optimizes tuple swap patterns like:
 *  `val (a, b) = if (cond) (x, y) else (y, x)`
 *  into more efficient code that avoids tuple allocation.
 */
class Tuple2SwapOptimizer extends MiniPhase {
...
}