Turning megamorphic virtual calls into MethodHandles

#1

Java 7+ provides java.lang.invoke.MethodHandle, which can be used to statically resolve virtual calls (i.e. it will precompute some or all of the v-table walk), if I understand it correctly. That would greatly reduce the cost of invoking functions passed to hard-to-inline combinators, such as Future.flatMap. But how would one use that? I’m thinking of a possible method signature:

class Future[A] {
  def flatMap[B](f: A => Future[B])(implicit ec: ExecutionContext, mh: MethodHandle = macro ???): Future[B] = {
    // forget about f.apply; use mh.invokeExact(f, ...) directly
  }
}

mh would need to be created at the call site and cached there in a static final field. When I pass an anonymous class instance (be it a lambda or not), I can precisely compute the class at the use site because I have it at hand. The problem comes when the value passed as f is something generic - can mh be computed reliably and cached in such a situation? If not, then mh should be changed to mhOpt:

class Future[A] {
  def flatMap[B](f: A => Future[B])(implicit ec: ExecutionContext, mhOpt: Option[MethodHandle] = macro ???): Future[B] = {
    // if mhOpt.isDefined then forget about f.apply; use mhOpt.get.invokeExact(f, ...) directly
  }
}
  1. Is it possible to do this using macros?
  2. How hard would it be to develop a portable solution (working in Scala 2.11, Scala 2.12, Scala 2.13, Scala.js, etc.)?
  3. Most important question: would MethodHandles provide a performance improvement here? Any guesses?
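To make the intent concrete, here is a minimal sketch in plain Java (the MethodHandle API is a Java one) of the call-site caching described above. Inc is a hypothetical stand-in for the concrete Function1 subclass known at the call site; the handle is resolved once against that known class, cached in a static final (so the JIT can treat it as a constant), and invoked with invokeExact:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MhCacheSketch {
    // Hypothetical stand-in for a concrete function class known at the call site.
    public static class Inc {
        public int apply(int x) { return x + 1; }
    }

    // Resolved once against the concrete class and cached in a static final,
    // which is the shape the macro would have to emit at each call site.
    static final MethodHandle APPLY;
    static {
        try {
            APPLY = MethodHandles.lookup().findVirtual(
                Inc.class, "apply",
                MethodType.methodType(int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws Throwable {
        Inc f = new Inc();
        // invokeExact requires the exact static signature (Inc, int) -> int
        int r = (int) APPLY.invokeExact(f, 41);
        System.out.println(r); // prints 42
    }
}
```

Whether this actually beats a plain invokeinterface once the JIT has profiled the site is exactly the open question in this thread.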
#2

Your problem is probably solved just by making it final and enabling the optimizer.

In Scala.js and Scala Native the optimizer is always enabled, because they’re closed world, so this is already fixed there.

No need for MethodHandles.

#3

What would the optimizer do?

I didn’t show the call that is megamorphic. Take, for example, the Scala source code as of version 2.13.0-RC1: https://github.com/scala/scala/tree/v2.13.0-RC1/src/library/scala/concurrent

The megamorphic calls are in scala.concurrent.impl.Promise.Transformation.run, e.g.:

case Xform_map           =>
  if (v.isInstanceOf[Success[F]]) Success(fun(v.get)) else v // Faster than `resolve(v map fun)`
case Xform_flatMap       =>
  if (v.isInstanceOf[Success[F]]) {
    val f = fun(v.get)
    if (f.isInstanceOf[DefaultPromise[T]]) f.asInstanceOf[DefaultPromise[T]].linkRootOf(this, null) else completeWith(f.asInstanceOf[Future[T]])
    null
  } else v

The fun(v.get) calls are megamorphic, because fun there is a function passed to Future.map, Future.flatMap, etc., and typically every usage of such a combinator on a Future creates a new subclass of Function1. If we change e.g. val f = fun(v.get) to:

val f =
  if (mhOpt.isDefined) {
    mhOpt.get.invokeExact(fun, v.get) // megamorphic call avoided
  } else {
    fun(v.get) // MethodHandle wasn't provided, fallback to usually megamorphic call
  }

then every time we succeed in constructing a cached MethodHandle, we avoid a megamorphic call.

Flavio Brasil from Twitter has proposed special optimizations for GraalVM which would speed up such megamorphic calls: https://github.com/oracle/graal/issues/893 . They aren’t there yet, but I still think MethodHandles could be the fastest and most reliable solution.

#4

Unless I misunderstand, I don’t see how final helps here for Scala on the JVM. The JVM has a concept of effectively final, so marking a method/class/variable final makes no difference unless it actually gets subclassed/overridden.

Megamorphic calls are a real issue for functional-style code, and it’s nothing to do with final; rather, it’s an issue of how many different implementations a call site dispatches to. The JVM only inlines monomorphic/bimorphic call sites; once you get higher, you are in the megamorphic case, which never gets inlined (and this happens regardless of whether anything is final or not).
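As a minimal illustration of that distinction (plain Java, hypothetical names): each lambda below compiles to its own synthetic class, so the single applyAsInt call site inside sum observes three different receiver types and goes megamorphic once more than two have been seen:

```java
import java.util.function.IntUnaryOperator;

public class MegamorphicSketch {
    static int sum(IntUnaryOperator[] ops, int x) {
        int acc = 0;
        for (IntUnaryOperator op : ops) {
            acc += op.applyAsInt(x); // one call site, many receiver classes
        }
        return acc;
    }

    public static void main(String[] args) {
        IntUnaryOperator[] ops = {
            i -> i + 1,  // synthetic class 1
            i -> i * 2,  // synthetic class 2
            i -> i - 3,  // synthetic class 3 -> the call site is now megamorphic
        };
        System.out.println(sum(ops, 10)); // 11 + 20 + 7 = 38
    }
}
```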

It would be fantastic if we could look into this, as it seems it would solve a lot of issues in Scala, performance-wise.

#5

final means something for the Scala/JVM optimizer, though. That’s what I’m talking about.

If you actually have megamorphic calls that the JVM cannot already deal with, and you can’t inline the higher order method with a Scala optimizer to get rid of it to begin with, then I very much doubt that one can do better with MethodHandles.

#6

I think the JVM doesn’t have any problem inlining e.g. the body of Future.flatMap, but that’s not what I’m talking about. The function I’m talking about is the one passed into Future.flatMap, and that function is run on some thread from a thread pool. How can you inline through an async boundary? How would that avoid the megamorphic call in scala.concurrent.impl.Promise.Transformation.run?

#7

Afaik it does for JS/Native, but I’m not aware of it doing anything substantial on the JVM, given that the JVM does CHA (Class Hierarchy Analysis) itself.

#8

On the contrary, final is irrelevant for Scala.js and Scala Native, because they can do CHA thanks to their closed world. The Scala/JVM optimizer doesn’t have the luxury of a closed world, and hence it needs final on methods to be able to inline them.

#9

The JVM doesn’t require final to inline methods: as long as they are effectively final (i.e. they have never been subclassed/overridden), it will inline them, provided they pass certain heuristics (e.g. they get called enough times).

#10

I am not talking about the optimizer in the JVM. I am talking about the optimizer in scalac for the JVM.

#11

One of you is talking about the JVM. The other is talking about scalac’s JVM optimizer. You are both right.

#12

I am not aware of cases where the Scala optimizer does its own optimizations with final for the JVM backend (I know that Scala.js and Native do this, and it makes complete sense for them since they can’t rely on the JVM JIT).

Afaik, the Scala backend does do transformations to make it easier for the JVM to detect things as final, but I am unaware of how much overlap there is and what this transformation entails (i.e. does it actually inline final methods as the JVM would at runtime, or does it do something else?)

In any case, I think we are going off topic. I would like method handles to be explored, because pure FP has a lot of pathological megamorphic cases, and it often forces library writers in the ecosystem to rewrite their libraries just to get rid of these megamorphic cases (basically to try to please the JVM at runtime).

#13

https://www.lightbend.com/blog/scala-inliner-optimizer explains a bit what the scalac backend can and cannot do.

For the original proposal, it’s an interesting idea. However, I’m sceptical it would result in significant performance improvements in practice. The common performance issue seen at megamorphic call sites is not the time spent in the virtual method lookup. The main issue is that a megamorphic call site is a barrier for the JVM’s optimizer: it cannot inline the f invocation at the megamorphic call site, and therefore cannot apply any further optimizations. Passing a MethodHandle which has f already resolved would not change anything in this regard.

Inlining flatMap, on the other hand, solves the problem. The code of flatMap is duplicated and specialized for one specific function f, so the f.apply call site becomes monomorphic. The JVM can then inline through that call and apply its downstream optimizations.
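A hand-written sketch (plain Java, hypothetical names) of what that duplication and specialization amounts to: in genericTwice the op.applyAsInt call site is shared by every operator ever passed in, while in the manually specialized copy no polymorphic call is left at all:

```java
import java.util.function.IntUnaryOperator;

public class InlineSpecializeSketch {
    // Generic combinator: op.applyAsInt is one shared, potentially
    // megamorphic call site for all callers.
    static int genericTwice(IntUnaryOperator op, int x) {
        return op.applyAsInt(op.applyAsInt(x));
    }

    // Roughly what an optimizer produces after inlining genericTwice at a
    // call site that passes `i -> i + 3`: the polymorphic call disappears
    // and downstream optimizations (here, constant folding) become possible.
    static int specializedTwice(int x) {
        return (x + 3) + 3;
    }

    public static void main(String[] args) {
        System.out.println(genericTwice(i -> i + 3, 10)); // 16
        System.out.println(specializedTwice(10));         // 16
    }
}
```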

#14

Yeah, I was under the same impression: once the JVM detects a call site as megamorphic, it will basically say “no” and refuse to do any inlining. It’s a shame that method handles don’t help out here, but as expected it’s something systemic to the JVM.

#15

AFAIU, inlining flatMap doesn’t help at all in optimizing the call to f.apply (except in some very unlikely cases I’ll describe later). The main reason is that 99.999%+ of the time Future.flatMap doesn’t invoke f.apply, but instead adds f to some queue of callbacks (inside scala.concurrent.impl.Promise.DefaultPromise) or runnables (inside java.util.concurrent.ExecutorService).

Proof: https://scastie.scala-lang.org/2YOq70AcTGuY6kt6TW3Hew

import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration.Duration

object Main {
  def test(ecName: String, preCompleted: Boolean): Unit = {
    val ec = ecName match {
      case "global" => ExecutionContext.global
      case "parasitic" => ExecutionContext.parasitic
    }
    val promise = Promise[Int]()
    if (preCompleted) {
      promise.success(5)
    }
    val future = promise.future.flatMap(a => {
      new Exception(s"EC name = $ecName, precompleted = $preCompleted")
        .printStackTrace(System.out)
      Future.successful(a + 13)
    })(ec)
    if (!preCompleted) {
      promise.success(6)
    }
    Await.result(future, Duration.Inf)
    println("-")
  }
  
  def main(args: Array[String]): Unit = {
    test("global", preCompleted = true)
    test("global", preCompleted = false)
    test("parasitic", preCompleted = true)
    test("parasitic", preCompleted = false)
  }
}

When running on a non-parasitic ExecutionContext, f.apply is invoked asynchronously, so Future.flatMap is not in the stack trace when f.apply executes (thus inlining Future.flatMap, no matter how deep, won’t help inline f.apply):

java.lang.Exception: EC name = global, precompleted = true
at Main$.$anonfun$test$1(main.scala:15)
at Main$.$anonfun$test$1$adapted(main.scala:14)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:433) <- f.apply invoked
at scala.concurrent.BatchingExecutor$AbstractBatch.runN(BatchingExecutor.scala:134)
at scala.concurrent.BatchingExecutor$AsyncBatch.apply(BatchingExecutor.scala:163)
at scala.concurrent.BatchingExecutor$AsyncBatch.apply(BatchingExecutor.scala:146)
at scala.concurrent.BlockContext$.usingBlockContext(BlockContext.scala:107)
at scala.concurrent.BatchingExecutor$AsyncBatch.run(BatchingExecutor.scala:154)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
-
java.lang.Exception: EC name = global, precompleted = false
at Main$.$anonfun$test$1(main.scala:15)
at Main$.$anonfun$test$1$adapted(main.scala:14)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:433) <- f.apply invoked
at scala.concurrent.BatchingExecutor$AbstractBatch.runN(BatchingExecutor.scala:134)
at scala.concurrent.BatchingExecutor$AsyncBatch.apply(BatchingExecutor.scala:163)
at scala.concurrent.BatchingExecutor$AsyncBatch.apply(BatchingExecutor.scala:146)
at scala.concurrent.BlockContext$.usingBlockContext(BlockContext.scala:107)
at scala.concurrent.BatchingExecutor$AsyncBatch.run(BatchingExecutor.scala:154)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

OTOH, if we run on the parasitic ExecutionContext, things change. Doing Future.flatMap on a pre-completed Future results in a synchronous invocation of f.apply. Doing Future.flatMap on a not-yet-completed Future adds f to the callbacks queue, and those callbacks are then invoked during completion of the Future.

java.lang.Exception: EC name = parasitic, precompleted = true
at Main$.$anonfun$test$1(main.scala:15)
at Main$.$anonfun$test$1$adapted(main.scala:14)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:433) <- f.apply invoked
at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:164)
at scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:392)
at scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:302)
at scala.concurrent.impl.Promise$DefaultPromise.dispatchOrAddCallbacks(Promise.scala:276)
at scala.concurrent.impl.Promise$DefaultPromise.flatMap(Promise.scala:140)
at Main$.test(main.scala:18) <- Future.flatMap invoked
at Main$.main(main.scala:29)
at Main.main(main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sbt.Run.invokeMain(Run.scala:67)
at sbt.Run.run0(Run.scala:61)
at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Logger$$anon$4.apply(Logger.scala:84)
at sbt.TrapExit$App.run(TrapExit.scala:248)
at java.lang.Thread.run(Thread.java:748)
-
java.lang.Exception: EC name = parasitic, precompleted = false
at Main$.$anonfun$test$1(main.scala:15)
at Main$.$anonfun$test$1$adapted(main.scala:14)
at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:433) <- f.apply invoked
at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:164)
at scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:392)
at scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:302)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:249)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:242)
at scala.concurrent.Promise.complete(Promise.scala:53)
at scala.concurrent.Promise.complete$(Promise.scala:52)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:104)
at scala.concurrent.Promise.success(Promise.scala:87)
at scala.concurrent.Promise.success$(Promise.scala:87)
at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:104)
at Main$.test(main.scala:20) <- Promise.success invoked
at Main$.main(main.scala:30)
at Main.main(main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sbt.Run.invokeMain(Run.scala:67)
at sbt.Run.run0(Run.scala:61)
at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Logger$$anon$4.apply(Logger.scala:84)
at sbt.TrapExit$App.run(TrapExit.scala:248)
at java.lang.Thread.run(Thread.java:748)

I’ve done some benchmarks which suggest that the cost of the megamorphic invocations inside scala.concurrent.impl.Promise$Transformation.run is insignificant compared to the cost of the other machinery around Futures. I’ll try to do precise measurements and/or try more lightweight monads.

#16

Right, inlining doesn’t do much good when the function is stashed in a callback queue. My mind was too much in the collections mindset.

This sounds quite plausible, and I wouldn’t be surprised if you could say the same about the virtual method lookup.

#17

I’m working on some benchmarks and MethodHandles usage. So far I haven’t gained anything with MethodHandles, but my newest benchmarks show that megamorphic calls which additionally aren’t well predicted by the CPU’s branch predictor lead to a significant increase in running time (e.g. 50%), even when wrapped in Futures.

Benchmark code:

import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole
import pl.tarsa.megamorphic_overhead.jmh.FutureMapFunCallOverhead._

import scala.annotation.tailrec
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.util.Random

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Timeout(time = 10, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 1)
@Measurement(iterations = 3)
@Fork(value = 1,
      jvmArgsAppend = Array("-Xmx1G", "-Xms1G", "-XX:+AlwaysPreTouch"))
@Threads(value = 1)
class FutureMapFunCallOverhead {
  @Param(Array[String]("global", "parasitic", "trivial"))
  var ecName: String = _

  var ec: ExecutionContext = _

  @Param(Array[String]("fixed1", "regular2", "regular8", "random2", "random8"))
  var indicesType: String = _

  var indices: Array[Int] = _

  @Setup
  def setup(): Unit = {
    ec = ecName match {
      case "global"    => ExecutionContext.global
      case "parasitic" => ExecutionContext.parasitic
      case "trivial" =>
        new ExecutionContext {
          override def execute(runnable: Runnable): Unit     = runnable.run()
          override def reportFailure(cause: Throwable): Unit = ???
        }
    }
    indices = indicesType match {
      case "fixed1"   => indices1
      case "regular2" => regularIndices2
      case "regular8" => regularIndices8
      case "random2"  => randomIndices2
      case "random8"  => randomIndices8
    }
  }

  @tailrec
  private def go(
      previousFuture: Future[Int],
      iterations: Int,
      indices: Array[Int])(implicit ec: ExecutionContext): Future[Int] = {
    if (iterations > 0) {
      val index = indices(iterations - 1)
      go(previousFuture.map(closures(index)(iterations)),
         iterations - 1,
         indices)
    } else {
      previousFuture
    }
  }

  @Benchmark
  def pre(bh: Blackhole): Int = {
    val promise   = Promise.successful(5)
    val resultFut = go(promise.future, depth, indices)(ec)
    Await.result(resultFut, Duration.Inf)
  }

  @Benchmark
  def post(bh: Blackhole): Int = {
    val promise   = Promise[Int]()
    val resultFut = go(promise.future, depth, indices)(ec)
    promise.success(5)
    Await.result(resultFut, Duration.Inf)
  }
}

object FutureMapFunCallOverhead {
  val depth = 1000

  val random = new Random(0)
  import random.nextInt

  /* Array(3, 3, 3, ...) */
  val indices1: Array[Int] = Array.fill[Int](depth)(3)
  /* Array(6, 7, 6, 7, ...) */
  val regularIndices2: Array[Int] = Array.tabulate[Int](depth)(_.%(2).+(6))
  /* Array(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, ...) */
  val regularIndices8: Array[Int] = Array.tabulate[Int](depth)(_ % 8)
  /* Array(2 or 5, 2 or 5, 2 or 5, 2 or 5, 2 or 5, ...), randomly chosen */
  val randomIndices2: Array[Int] =
    Array.fill[Int](depth)(if (nextInt(2) == 0) 2 else 5)
  /* Array(0 to 7, 0 to 7, 0 to 7, 0 to 7, 0 to 7, ...), randomly chosen */
  val randomIndices8: Array[Int] = Array.fill[Int](depth)(nextInt(8))

  private val closures: Array[Int => Int => Int] = Array(
    a => b => a * 2452 + b * 3242,
    a => b => a * 7643 + b * 3567,
    a => b => a * 9287 + b * 3623,
    a => b => a * 6234 + b * 7343,
    a => b => a * 6224 + b * 5321,
    a => b => a * 3432 + b * 5212,
    a => b => a * 5323 + b * 5312,
    a => b => a * 6123 + b * 5132,
  )
}

Results on Ubuntu, Core i5-4670, Oracle Java 1.8u201:

[info] Benchmark                       (ecName)  (indicesType)  Mode  Cnt    Score    Error  Units
[info] FutureMapFunCallOverhead.post     global         fixed1  avgt    3   77,079 ±  9,585  us/op
[info] FutureMapFunCallOverhead.post     global       regular2  avgt    3   79,084 ±  1,767  us/op
[info] FutureMapFunCallOverhead.post     global       regular8  avgt    3   94,292 ± 11,221  us/op
[info] FutureMapFunCallOverhead.post     global        random2  avgt    3   83,378 ±  9,528  us/op
[info] FutureMapFunCallOverhead.post     global        random8  avgt    3  122,085 ± 10,965  us/op
[info] FutureMapFunCallOverhead.post  parasitic         fixed1  avgt    3   38,998 ±  0,505  us/op
[info] FutureMapFunCallOverhead.post  parasitic       regular2  avgt    3   40,317 ±  0,436  us/op
[info] FutureMapFunCallOverhead.post  parasitic       regular8  avgt    3   47,622 ±  1,134  us/op
[info] FutureMapFunCallOverhead.post  parasitic        random2  avgt    3   42,660 ±  0,819  us/op
[info] FutureMapFunCallOverhead.post  parasitic        random8  avgt    3   69,109 ±  2,144  us/op
[info] FutureMapFunCallOverhead.post    trivial         fixed1  avgt    3   39,661 ±  0,393  us/op
[info] FutureMapFunCallOverhead.post    trivial       regular2  avgt    3   40,372 ±  0,455  us/op
[info] FutureMapFunCallOverhead.post    trivial       regular8  avgt    3   44,751 ±  0,376  us/op
[info] FutureMapFunCallOverhead.post    trivial        random2  avgt    3   40,763 ±  5,830  us/op
[info] FutureMapFunCallOverhead.post    trivial        random8  avgt    3   63,548 ±  1,991  us/op
[info] FutureMapFunCallOverhead.pre      global         fixed1  avgt    3   80,742 ±  0,074  us/op
[info] FutureMapFunCallOverhead.pre      global       regular2  avgt    3   87,824 ±  3,529  us/op
[info] FutureMapFunCallOverhead.pre      global       regular8  avgt    3   66,557 ±  3,282  us/op
[info] FutureMapFunCallOverhead.pre      global        random2  avgt    3   91,869 ±  3,204  us/op
[info] FutureMapFunCallOverhead.pre      global        random8  avgt    3   75,667 ± 17,878  us/op
[info] FutureMapFunCallOverhead.pre   parasitic         fixed1  avgt    3   33,588 ±  1,901  us/op
[info] FutureMapFunCallOverhead.pre   parasitic       regular2  avgt    3   34,134 ±  0,116  us/op
[info] FutureMapFunCallOverhead.pre   parasitic       regular8  avgt    3   44,550 ±  0,659  us/op
[info] FutureMapFunCallOverhead.pre   parasitic        random2  avgt    3   36,047 ±  0,642  us/op
[info] FutureMapFunCallOverhead.pre   parasitic        random8  avgt    3   58,970 ±  4,753  us/op
[info] FutureMapFunCallOverhead.pre     trivial         fixed1  avgt    3   29,749 ±  1,475  us/op
[info] FutureMapFunCallOverhead.pre     trivial       regular2  avgt    3   29,908 ±  0,415  us/op
[info] FutureMapFunCallOverhead.pre     trivial       regular8  avgt    3   36,730 ±  0,485  us/op
[info] FutureMapFunCallOverhead.pre     trivial        random2  avgt    3   30,461 ±  0,198  us/op
[info] FutureMapFunCallOverhead.pre     trivial        random8  avgt    3   47,008 ±  0,822  us/op

Running on global introduces quite a bit of variability and confusion (random8 is faster than random2; regular8 is faster than regular2 and fixed1), so I’ll probably continue with a more predictable monad than Future.

#18

You could run global with a pool size of 1.

#19

Hmm, that didn’t solve the predictability problem.

Current benchmark code:

import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._
import pl.tarsa.megamorphic_overhead.jmh.FutureMapFunCallOverhead._

import scala.annotation.tailrec
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.util.Random

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(value = 1,
      jvmArgsAppend = Array("-Xmx1G", "-Xms1G", "-XX:+AlwaysPreTouch"))
@Threads(value = 1)
class FutureMapFunCallOverhead {
  @Param(Array[String]("global", "parasitic", "trivial"))
  var ecName: String = _

  var ec: ExecutionContext = _

  @Param(Array[String]("1"))
  final var threads: Int = _

  @Param(Array[String]("fixed1", "regular2", "regular8", "random2", "random8"))
  var indicesType: String = _

  var indices: Array[Int] = _

  @Setup
  def setup(): Unit = {
    ec = ecName match {
      case "global" =>
        sys.props.put("scala.concurrent.context.minThreads", s"$threads")
        sys.props.put("scala.concurrent.context.numThreads", s"$threads")
        sys.props.put("scala.concurrent.context.maxThreads", s"$threads")
        ExecutionContext.global
      case "parasitic" => ExecutionContext.parasitic
      case "trivial" =>
        new ExecutionContext {
          override def execute(runnable: Runnable): Unit     = runnable.run()
          override def reportFailure(cause: Throwable): Unit = ???
        }
    }
    indices = indicesType match {
      case "fixed1"   => indices1
      case "regular2" => regularIndices2
      case "regular8" => regularIndices8
      case "random2"  => randomIndices2
      case "random8"  => randomIndices8
    }
  }

  @tailrec
  private def go(
      previousFuture: Future[Int],
      iterations: Int,
      indices: Array[Int])(implicit ec: ExecutionContext): Future[Int] = {
    if (iterations > 0) {
      val index = indices(iterations - 1)
      go(previousFuture.map(closures(index)(iterations)),
         iterations - 1,
         indices)
    } else {
      previousFuture
    }
  }

  @Benchmark
  def pre(): Int = {
    val promise   = Promise.successful(5)
    val resultFut = go(promise.future, depth, indices)(ec)
    Await.result(resultFut, Duration.Inf)
  }

  @Benchmark
  def post(): Int = {
    val promise   = Promise[Int]()
    val resultFut = go(promise.future, depth, indices)(ec)
    promise.success(5)
    Await.result(resultFut, Duration.Inf)
  }
}

object FutureMapFunCallOverhead {
  val depth = 1000

  val random = new Random(0)
  import random.nextInt

  /* Array(3, 3, 3, ...) */
  val indices1: Array[Int] = Array.fill[Int](depth)(3)
  /* Array(6, 7, 6, 7, ...) */
  val regularIndices2: Array[Int] = Array.tabulate[Int](depth)(_.%(2).+(6))
  /* Array(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, ...) */
  val regularIndices8: Array[Int] = Array.tabulate[Int](depth)(_ % 8)
  /* Array(2 or 5, 2 or 5, 2 or 5, 2 or 5, 2 or 5, ...), randomly chosen */
  val randomIndices2: Array[Int] =
    Array.fill[Int](depth)(if (nextInt(2) == 0) 2 else 5)
  /* Array(0 to 7, 0 to 7, 0 to 7, 0 to 7, 0 to 7, ...), randomly chosen */
  val randomIndices8: Array[Int] = Array.fill[Int](depth)(nextInt(8))

  private val closures: Array[Int => Int => Int] = Array(
    a => b => a * 2452 + b * 3242,
    a => b => a * 7643 + b * 3567,
    a => b => a * 9287 + b * 3623,
    a => b => a * 6234 + b * 7343,
    a => b => a * 6224 + b * 5321,
    a => b => a * 3432 + b * 5212,
    a => b => a * 5323 + b * 5312,
    a => b => a * 6123 + b * 5132,
  )
}

Results:

[info] Benchmark                       (ecName)  (indicesType)  (threads)  Mode  Cnt    Score   Error  Units
[info] FutureMapFunCallOverhead.post     global         fixed1          1  avgt   10   71,506 ± 0,464  us/op
[info] FutureMapFunCallOverhead.post     global       regular2          1  avgt   10   71,470 ± 0,810  us/op
[info] FutureMapFunCallOverhead.post     global       regular8          1  avgt   10   76,712 ± 0,202  us/op
[info] FutureMapFunCallOverhead.post     global        random2          1  avgt   10   71,246 ± 0,237  us/op
[info] FutureMapFunCallOverhead.post     global        random8          1  avgt   10  103,102 ± 0,523  us/op
[info] FutureMapFunCallOverhead.post  parasitic         fixed1          1  avgt   10   38,834 ± 0,042  us/op
[info] FutureMapFunCallOverhead.post  parasitic       regular2          1  avgt   10   40,246 ± 0,035  us/op
[info] FutureMapFunCallOverhead.post  parasitic       regular8          1  avgt   10   47,357 ± 0,036  us/op
[info] FutureMapFunCallOverhead.post  parasitic        random2          1  avgt   10   42,425 ± 0,036  us/op
[info] FutureMapFunCallOverhead.post  parasitic        random8          1  avgt   10   67,691 ± 0,039  us/op
[info] FutureMapFunCallOverhead.post    trivial         fixed1          1  avgt   10   39,507 ± 0,041  us/op
[info] FutureMapFunCallOverhead.post    trivial       regular2          1  avgt   10   40,213 ± 0,048  us/op
[info] FutureMapFunCallOverhead.post    trivial       regular8          1  avgt   10   44,453 ± 0,053  us/op
[info] FutureMapFunCallOverhead.post    trivial        random2          1  avgt   10   40,379 ± 0,062  us/op
[info] FutureMapFunCallOverhead.post    trivial        random8          1  avgt   10   61,934 ± 0,067  us/op
[info] FutureMapFunCallOverhead.pre      global         fixed1          1  avgt   10   90,226 ± 0,770  us/op
[info] FutureMapFunCallOverhead.pre      global       regular2          1  avgt   10   93,227 ± 1,191  us/op
[info] FutureMapFunCallOverhead.pre      global       regular8          1  avgt   10   79,880 ± 1,388  us/op
[info] FutureMapFunCallOverhead.pre      global        random2          1  avgt   10  100,825 ± 0,563  us/op
[info] FutureMapFunCallOverhead.pre      global        random8          1  avgt   10   83,519 ± 2,156  us/op
[info] FutureMapFunCallOverhead.pre   parasitic         fixed1          1  avgt   10   33,427 ± 0,039  us/op
[info] FutureMapFunCallOverhead.pre   parasitic       regular2          1  avgt   10   33,923 ± 0,050  us/op
[info] FutureMapFunCallOverhead.pre   parasitic       regular8          1  avgt   10   43,880 ± 0,065  us/op
[info] FutureMapFunCallOverhead.pre   parasitic        random2          1  avgt   10   35,895 ± 0,036  us/op
[info] FutureMapFunCallOverhead.pre   parasitic        random8          1  avgt   10   57,659 ± 0,046  us/op
[info] FutureMapFunCallOverhead.pre     trivial         fixed1          1  avgt   10   29,664 ± 0,028  us/op
[info] FutureMapFunCallOverhead.pre     trivial       regular2          1  avgt   10   29,391 ± 0,021  us/op
[info] FutureMapFunCallOverhead.pre     trivial       regular8          1  avgt   10   36,590 ± 0,036  us/op
[info] FutureMapFunCallOverhead.pre     trivial        random2          1  avgt   10   30,291 ± 0,034  us/op
[info] FutureMapFunCallOverhead.pre     trivial        random8          1  avgt   10   47,305 ± 0,079  us/op
[success] Total time: 4521 s

global.pre still acts weirdly: the megamorphic calls are the fastest. Maybe there’s a lot of contention between the task-producer thread and the task-executor thread, so longer task execution reduces that contention?

#20

These kinds of optimizations “just” make dynamic dispatch faster but cannot get rid of it. That means the dispatch itself is cheaper, but none of the downstream optimizations that inlining at call sites enables become possible. Still, they are very much appreciated. Kudos to @fwbrasil for trying to tackle these!

That seems to me partly a fundamental misunderstanding about what the JVM is able to do. In FP, you often model parts of your program (i.e. control and data flow) with data structures. To execute that program afterwards, you need to interpret that data structure. An interpreter is naturally megamorphic because all the atoms of your programs can and usually will be reused in different ways but there’s only a single interpreter.

What you seem to expect is that there’s an optimization in the JVM that gets rid of the interpretation layer automatically. That’s really hard to do *). Put another way: in FP, you often build the real user-level program at runtime, but you usually don’t build a compiler for it, so why would you expect compiled performance?

Here’s another explanation: with FP data structures you can, at runtime, create infinitely many programs by composition of these data structures. However, there’s only a fixed number of methods in your jar file. The JIT compiler can only provide one compiled version of every method. That means there are more possible runtime programs than slots (compiled methods) to compile these programs into. (You could in theory create more slots but the question of when to generate code vs. when to reuse is hard to answer in general).

You can do static optimizations at compile time if either every abstract data structure is used in only a single way, or if the programs built from the pure FP data structures are in fact static in the first place, i.e. the FP program is completely determined before compilation (the difficulty is figuring out which parts are indeed static and how to do the optimizations efficiently).

So “a lot of pathological megamorphic cases” are very much expected if you abstract over control flow with data structures.

What “rewrite their libraries just to get rid of these megamorphic cases” usually means is understanding that you don’t actually want to create programs at runtime, but that they are in fact (mostly) static in many cases. Then you have to structure your program in a way that lets the JVM figure out - after the fact - that certain (runtime) usages of these abstractions are not actually meant to be dynamic, and that starting from some method the control flow is indeed static.

FWIW, a pathological case among FP data structures is the monad. Why? Because (for control-flow instances) it allows the future flow of the program to be determined dynamically based on a runtime value. That’s super powerful, but it also introduces all kinds of problems, not least that of executing it efficiently.
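A tiny Java illustration of that point, using java.util.Optional as the monad: the continuation passed to flatMap picks the next step of the computation from the runtime value, so in general no static analysis can pin down the control flow ahead of time:

```java
import java.util.Optional;

public class MonadFlowSketch {
    public static void main(String[] args) {
        // The shape of the rest of the program depends on whether the
        // runtime value is even or odd - monadic bind makes control flow
        // data-dependent.
        Optional<Integer> result = Optional.of(7)
            .flatMap(x -> x % 2 == 0
                ? Optional.of(x / 2)
                : Optional.of(3 * x + 1));
        System.out.println(result.get()); // 7 is odd, so 3*7+1 = 22
    }
}
```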

I guess these explanations would be even more useful with an example. Here are some cases where I have personally seen performance affected by this kind of problem: streams in akka-stream (akka-stream’s GraphInterpreter is an interpreter, a fast one, but the dispatch overhead still shows up), scodec’s Codec, spray-json’s JsonFormat, akka-http’s Route. None of these can really match the performance of hand-written versions of the same functionality. That said, the benefits of the abstraction more often than not outweigh the slight performance hit. (With Future, IMO the bigger problem is that thunks are usually dispatched even if the Future is already complete. In the case where the Future is still to be completed, the cost of the dynamic dispatch seems negligible.)

Johannes

*) Partial evaluation can compile code by specializing an interpreter (see the Futamura projections), and indeed that’s what Truffle offers when you implement a custom language on top of Graal. However, AFAIU it’s not a completely generic framework: you can mix interpreter and runtime code, but you have to give some hints about compilation to make it work.
