Impact of Loom on “functional effects”

And you rarely do (i.e. in terms of actually committing the memory), even in the case of using platform threads directly (i.e. just as we all do today). If you set -Xss8m, that doesn’t mean every new thread eagerly asks the operating system to physically allocate that much memory. It’s lazily committed, so you can e.g. set -Xss1g and it should work even with many threads, as long as you don’t create deep call stacks.

Who implements tracing, and where? Where’s the extra CPU cost? When running Java bytecode the JVM needs to create Java stack frames anyway. The JVM doesn’t care whether you run your code inside a Future or not, or whether you run it inside a virtual thread or a raw underlying thread. If a call is not inlined, the JVM has to create a new stack frame. A stack trace is just a metadata dump from the current stack frames (it includes inlined calls too, but the data for inlined calls can be deterministically inferred from the stack frames of the outer calls).
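
To make that concrete: a stack trace is just a read-out of the frames that already exist for the running code, on any kind of thread. A tiny sketch using only the plain JDK API (object and method names here are just for illustration):

object StackTraceDump {
  def leaf(): Unit =
    // the frames for main -> middle -> leaf already exist; this just reads them
    Thread.currentThread().getStackTrace.take(5).foreach(println)

  def middle(): Unit = leaf()

  def main(args: Array[String]): Unit = middle()
}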

I think I have to guess where the cost you’re mentioning comes from. Is it the cost of the GC scanning the GC roots? Have you measured that? People are running thousands of virtual threads at once and reporting massive throughput advantages, so most probably the cost of scanning GC roots is not dominating anything.

If you like to make really deep call stacks using tail calls and don’t want them to occupy a lot of memory, then tail-call optimization is in the scope of OpenJDK’s Project Loom (opt-in per method):

The goal of this Project is to explore and incubate Java VM features and APIs built on top of them for the implementation of lightweight user-mode threads (fibers), delimited continuations (of some form), and related features, such as explicit tail-call.

but then (most probably) you’ll lose some of the stack trace information. Anyway, who needs really deep stack traces? If your explicit stack trace, in your error logs, is deeper than e.g. 1000 calls (in total, including suppressed exceptions, causes and so on), then you’re sabotaging yourself anyway.

In short, you’re talking about some extra costs, but you haven’t explained where they supposedly come from. Show at least some hypothetical numbers and do some calculations. Compare virtual thread costs to raw platform thread costs.

Here are some of my numbers then:
JVM downloaded from https://download.java.net/java/early_access/jdk19/32/GPL/openjdk-19-ea+32_linux-x64_bin.tar.gz
parameters: java --enable-preview -Xss1g

Code:

package pl.tarsa

import java.util.concurrent.atomic.AtomicLong
import scala.util.Try

object Main {
  def main(args: Array[String]): Unit = {
    for (_ <- 1 to 3) {
      timed("platform thread") {
        val rawThreadCounter = new AtomicLong(0)
        val rawThread = Thread
          .ofPlatform()
          .uncaughtExceptionHandler((t, e) => ())
          .start(() => loop(rawThreadCounter))
        println(Try(rawThread.join()))
        println(s"max depth = $rawThreadCounter")
      }
      println()

      timed("virtual thread") {
        val virtualThreadCounter = new AtomicLong(0)
        val virtualThread = Thread
          .ofVirtual()
          .uncaughtExceptionHandler((t, e) => ())
          .start(() => loop(virtualThreadCounter))
        println(Try(virtualThread.join()))
        println(s"max depth = $virtualThreadCounter")
      }
      println()
    }
  }

  // recurse until StackOverflowError; bounce() keeps the recursive call out of
  // tail position so it cannot be optimized away, and the counter records depth
  def loop(counterRef: AtomicLong): Unit = {
    counterRef.incrementAndGet()
    bounce(loop(counterRef))
  }

  def bounce(any: Any): Unit = ()

  def timed(description: String)(action: => Unit): Unit = {
    println(description)
    val startTime = System.currentTimeMillis()
    action
    val endTime = System.currentTimeMillis()
    println(s"Total time = ${endTime - startTime} ms")
  }
}

Result:

platform thread
Success(())
max depth = 44691731
Total time = 5584 ms

virtual thread
Success(())
max depth = 44735002
Total time = 3570 ms

platform thread
Success(())
max depth = 44735076
Total time = 4184 ms

virtual thread
Success(())
max depth = 44735002
Total time = 2191 ms

platform thread
Success(())
max depth = 44735076
Total time = 4185 ms

virtual thread
Success(())
max depth = 44735002
Total time = 2189 ms

Somehow the virtual threads are 2x faster than platform threads. That’s very good IMHO, even if I’m unable to explain the difference. The maximum call stack depth is also practically equal in both cases (virtual and platform threads). So it looks like there are no obvious downsides already :slight_smile:

1 Like

Well yes, you have to create a physical Thread regardless, because otherwise you cannot run anything. The point is that if you use physical threads as a concurrency/computation primitive (as is done in classical Java style) you are already paying upfront for the expensive cost of the physical OS thread, and due to this your stack propagation is almost “free” (in reality this is a bit more complex because the OS itself tries to optimize OS threads, e.g. pthreads on Linux will not immediately allocate the entire 8 MiB stack, but since that is done in kernel space it’s out of scope here).

On the other hand, one of the whole points of green threads/fibers is that they need to be very cheap, hence they are typically around 1 kB in size. That 1 kB size is the main reason behind the “you can create 1 million threads on a commodity machine” claim.

That metadata dump of stacks, and having to propagate it through, is expensive (that is what I referred to as tracing) and, most importantly, it’s not free, as some people seem to be arguing.

To be blunt, if a thousand VirtualThreads at once is the metric of “success”, that is pathetically low. Erlang can easily create 1 million processes on 10-year-old machines. It’s similar with Akka actors, which are also deliberately designed to be so cheap that you can create a massive number of them.

Firstly, 2x faster is not good at all.

Secondly, this is not benchmarking what is being talked about. As I said earlier, you would need to benchmark the current version of Loom, which preserves stack traces, against an alternate version of Loom that doesn’t preserve stack traces at all, and see what the performance gap is.

Linux doesn’t understand Java stack frames, so it can’t optimize their contents or investigate them. The only thing the OS can optimize is lazy memory commit, but that applies to the managed heap too (and so to virtual thread stacks too). When the JVM runs Java code, stack frames are managed entirely by the JVM, regardless of whether they live in raw platform threads or in virtual threads.

Show the cost then.

Here’s my measurement:
JDK 19 early access (as before)
parameters: java --enable-preview -Xmx1g (i.e. 1 GiB managed heap limit)
code:

package pl.tarsa

import java.util.concurrent.atomic.{AtomicLong, AtomicReference}
import java.util.concurrent.locks.LockSupport

object Main {
  val safetyMargin = new AtomicReference(Array.ofDim[Long](1000 * 1000))
  val virtualThreadsCounter = new AtomicLong(0)
  def main(args: Array[String]): Unit = {
    Thread
      .currentThread()
      .setUncaughtExceptionHandler((t, e) => {
        safetyMargin.set(null) // release some memory before printing
        println(s"virtual threads number = $virtualThreadsCounter")
      })
    val uncaughtHandler: Thread.UncaughtExceptionHandler = (t, e) => ()
    // keep spawning parked virtual threads until the heap is exhausted;
    // the resulting OutOfMemoryError triggers the handler above, which
    // prints how many threads were alive
    while (true) {
      Thread
        .ofVirtual()
        .uncaughtExceptionHandler(uncaughtHandler)
        .start { () =>
          virtualThreadsCounter.incrementAndGet()
          LockSupport.park()
          virtualThreadsCounter.decrementAndGet()
        }
    }
  }
}

result:

virtual threads number = 1675456

so the size of an individual running virtual thread is less than a kilobyte (if it’s as simple as shown above). Over a million running virtual threads within a 1 GiB managed heap limit is not bad.

yep, slowness is cool today /s

How could one disable stack frames? The whole point of a thread stack is to contain stack frames. Stack frames contain local variables and the return address. You have that with Futures, IO monads, etc., with everything. How can you run code without a stack at all? What even are you talking about?

You can already see the cost in IO types such as cats-effect’s IO, albeit at a library level. By default there is no stack trace propagation, but depending on which flavour of propagation you opt into, it’s 10-30% extra overhead (something I measured); see Tracing · Cats Effect.

Loom has to do the exact same thing with VirtualThread, albeit the overhead will be less since it’s done at the VM level.

I meant disabling stack propagation, not stack frames; of course the latter is not possible. Future also has a stack trace, but no one uses it because the stack is, by default, not propagated across calls on the ExecutionContext.

1 Like

You really need to define what “stack propagation” is and what it has to do with virtual threads. If “stack propagation” is enabled under virtual threads then it’s enabled under raw platform threads too. Both types of threads preserve full stack traces.

I use it because most of the time it has at least one stack frame that points to client code.

Tracing in IO monads is done because trampolining loses the call stack frames. It doesn’t avoid creating them, it just loses them prematurely, so additional tracing is done to recover the lost data. If you don’t do trampolining then you don’t lose stack frames prematurely, so you can use them as they are for error reporting.

Let’s say there are nested calls.

In imperative style:

def method1(...) = {
...
method2(...)
...
}

def method2(...) = {
...
currentThread.printStackTrace()
...
}

In monadic style:

def method1(...) = for {
  ... <- ...
  _ <- method2(...)
  ... <- ...
} yield ...

def method2(...) = for {
  ... <- ...
  _ <- IO.println(currentThread.printStackTrace())
  ... <- ...
} yield ...

What happens under imperative style (regardless whether under virtual or raw platform threads):

  • JVM creates stack frame for method1
  • JVM invokes some code from method1
  • JVM creates stack frame for method2
  • JVM invokes some code from method2
  • JVM prints stack trace - stack frames from method1 to method2 are still live so they are shown
  • JVM invokes some code from method2
  • JVM returns from method2 to method1, because it knows that from stack frame
  • JVM invokes some code from method1
  • JVM returns from method1 to something that called method1

What happens under monadic style (again regardless of whether it runs on virtual or raw platform threads) is that it creates and destroys several times more stack frames (because the monad interpreter has its own overhead) and also has to create many more lambdas, run megamorphic calls, etc., so the costs go through the roof. The main difference is that the stack frames used for client code are unwound much more quickly than in imperative style, due to trampolining.

But they are created anyway, as you can see e.g. from exceptions thrown in a Future: such an exception contains at least one stack frame from client code, so that frame had to be created just as in imperative style. Even in monadic style you have to call method1 and method2 anyway, so stack frames for them must be created; the fact that they are quickly unwound doesn’t make the whole process cheaper. Also, in the case of monads, there is at least one call stack frame created for each ... <- ... in a for-comprehension, and that alone means many more stack frames are created under monadic style.
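
This is easy to see with a plain standard-library Future; a small sketch (object and method names are just for illustration) showing that the failure’s stack trace still contains the client-code frame that had to be created to run the thunk:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object FutureFrames {
  def clientCode(): Int = throw new Exception("boom")

  def main(args: Array[String]): Unit = {
    val failure = Try(Await.result(Future(clientCode()), 1.second))
    // prints a trace that includes the FutureFrames.clientCode frame,
    // because that frame was created (and then quickly unwound) anyway
    failure.failed.foreach(_.printStackTrace(System.out))
  }
}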

Yes, I understand why this happens (I mentioned it elsewhere previously with the interpreter). Stack propagation is the method used to make sure that stack traces are properly preserved across async/sync boundaries as tasks are multiplexed onto real threads. Of course stacks are always created; it’s impossible to avoid this, since you have to do the computation on a physical thread in the first place, which will always create a stack.

The point about monadic style exacerbating the problem is a bit of a diversion; fundamentally it’s about sync/async boundaries and how computations are executed on threads. In typical imperative style you maximise the number of tasks executed on a single thread.

The issue I am interested in is not when you just create millions of VirtualThreads that each do a single task in some tight while loop, but rather when you have deeply “nested” VirtualThreads (nested is in quotation marks because you don’t really nest VirtualThreads). This is alluding to the joining problem that @Ichoran mentioned before: it’s when you have millions of VirtualThreads calling between themselves, and the overhead of the JVM propagating (or preserving, if that’s easier to understand) the stack frames between these calls.

On top of that, seeing the performance characteristics of how VirtualThreads and physical threads interplay with each other when mixing IO-bound and CPU-bound tasks is what I am getting at.

ehhh, so “propagating stack” in the case of Loom (or imperative synchronous style in general) simply means delaying the destruction of a call stack frame until it’s not needed anymore (as was always the case in synchronous style). That’s definitely not a CPU cost in itself.

If someone wants to do deep tail calls in direct synchronous style then (as I’ve said above) optimization of tail calls is in scope of Project Loom. Anyway, I don’t see why we should worry about Java code. Java libraries will be forced to be reworked as people start complaining that they limit the scaling of virtual threads. I haven’t seen Scala code doing deep unoptimized tail calls, i.e. if there are tail calls then they are always @tailrec optimized.
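
For example, a small sketch of a @tailrec-optimized tail call, which the compiler rewrites into a loop so the stack stays flat:

import scala.annotation.tailrec

object TailRecDemo {
  // @tailrec makes the compiler verify the recursive call is in tail position
  // and rewrite it into a loop, so no stack frames accumulate
  @tailrec
  def countDown(n: Long): Long =
    if (n <= 0) n else countDown(n - 1)

  def main(args: Array[String]): Unit =
    println(countDown(100_000_000L)) // no StackOverflowError, prints 0
}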

If anything, I think Loom should eventually improve things in typical projects. Loom currently doesn’t preempt long-running, purely CPU-intensive virtual threads (i.e. ones with long periods of number crunching and no awaits), but it should in the future, so that they won’t starve the progress of other virtual threads mounted on the same set of platform threads. OTOH, if you run a few long-running Futures or IO monads (i.e. each is a single long-running thunk) on some execution context with limited parallelism, then you can block other tasks (that also want to run on that execution context) from progressing.
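
A minimal sketch of that starvation scenario (the single-threaded execution context is deliberately extreme just to make the effect obvious; names are illustrative):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object StarvationDemo {
  def main(args: Array[String]): Unit = {
    // execution context with parallelism of exactly 1
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

    // a single long-running thunk with no awaits occupies the only thread...
    Future { while (true) {} }

    // ...so this task never gets scheduled
    Future { println("this never gets printed") }

    Thread.sleep(2000)
    sys.exit()
  }
}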

Relevant promises from Loom authors :slight_smile:
https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.html

Virtual threads are preemptive, not cooperative — they do not have an explicit await operation at scheduling (task-switching) points. Rather, they are preempted when they block on I/O or synchronization. Platform threads are sometimes forcefully preempted by the kernel if they occupy the CPU for a duration that exceeds some allotted time-slice. Time-sharing works well as a scheduling policy when active threads don’t outnumber cores by much and only very few threads are processing-heavy. If a thread is hogging the CPU for too long, it’s preempted to make other threads responsive, and then it’s scheduled again for another time-slice. When we have millions of threads, this policy is less effective: if many of them are so CPU-hungry that they require time-sharing, then we’re under-provisioned by orders of magnitude and no scheduling policy could save us. In all other circumstances, either a work-stealing scheduler would automatically smooth over sporadic CPU-hogging or we could run problematic threads as platform threads and rely on the kernel scheduler. For this reason, none of the schedulers in the JDK currently employs time-slice-based preemption of virtual threads, but that is not to say it won’t in the future — see Forced Preemption.

You must not make any assumptions about where the scheduling points are any more than you would for today’s threads. Even without forced preemption, any JDK or library method you call could introduce blocking, and so a task-switching point.

https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html

Forced Preemption

Despite what was said about scheduling, there can be special circumstances where forcefully preempting a thread that is hogging the CPU can be useful. For example, a batch processing service that performs complex data queries on behalf of multiple client applications may receive client tasks and run them each in its own virtual thread. If such a task takes up too much CPU, the service might want to forcefully preempt it and schedule it again when the service is under a lighter load. To that end, we plan for the VM to support an operation that tries to forcefully preempt execution at any safepoint. How that capability will be exposed to the schedulers is TBD, and will likely not make it to the first Preview.

Here’s my code that demonstrates that right now (in JDK 19 EA) virtual threads don’t implement forced preemption:

package pl.tarsa

import java.util.concurrent.atomic.AtomicLongArray

object Main {
  val threadsNum = 12345
  val threadStates = new AtomicLongArray(threadsNum)

  def main(args: Array[String]): Unit = {
    for (i <- 0 until threadsNum) {
      Thread
        .ofVirtual()
        .start(() => {
          while (true) {
            threadStates.updateAndGet(i, _ ^ 1)
          }
        })
    }
    while (true) {
      Thread.sleep(1234)
      val bytes = Array.ofDim[Byte]((threadsNum + 7) / 8)
      for (i <- 0 until threadsNum) {
        val positionedBit = (threadStates.get(i) & 1) << (i % 8)
        bytes(i / 8) = (bytes(i / 8) | positionedBit).toByte
      }
      println(BigInt(bytes).toString(16))
    }
  }
}

The above code prints lines with sequences of hex digits. They are all supposed to be changing randomly, but only a few bits at the beginning actually change, showing that a few virtual threads hog the entire underlying platform thread pool forever.

That is completely false. Although Loom has nothing to say about concurrency (despite forthcoming JEPs that will add their own take on structured concurrency), Loom makes it possible for “joins” (“await” in Future terminology) to be completely “async” and cost-free, with ordinary direct, imperative syntax.

In Loom, you can write your example like this:

val b = fork { f(a) }
val c = g(b.result)
val d = fork { h(a) } 
val e = i(c.result, d.result)

for some hypothetical fork that provides something like the Future I have sketched out, with a result method that gives you access to the result of the computation (this method, despite looking like a blocking method, will not block any operating-system-level thread, but will merely efficiently suspend the calling virtual thread until the result is available).

This is not only easier code to write compared to your Future-based equivalent, but it can be written at a significantly higher level, without the use of fork and “join” (.result).

As one possible example:

val e = parallelWith(f(a), h(a)) { (l, r) => i(g(l), r) }

assuming a hypothetical n-ary parallelWith that lets you operate on the result of multiple values all computed in parallel on virtual threads.
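
For the curious, a minimal sketch of how such a hypothetical fork could be built on top of JDK 19 virtual threads (the fork/Forked names come from the example above and are made up; CompletableFuture is used merely as a convenient completion handle, not as the suggested design):

import java.util.concurrent.CompletableFuture

object ForkSketch {
  // hypothetical handle for a forked computation; `result` joins it. When the
  // caller is itself a virtual thread, the park inside `join` suspends only
  // that virtual thread, not its carrier OS thread.
  final class Forked[A](underlying: CompletableFuture[A]) {
    def result: A = underlying.join()
  }

  // hypothetical `fork`: run the body on a fresh virtual thread
  // (JDK 19 with --enable-preview) and expose its eventual result
  def fork[A](body: => A): Forked[A] = {
    val handle = new CompletableFuture[A]
    Thread
      .ofVirtual()
      .start { () =>
        try handle.complete(body)
        catch { case e: Throwable => handle.completeExceptionally(e) }
        ()
      }
    new Forked(handle)
  }

  def main(args: Array[String]): Unit = {
    val b = fork { 21 * 2 }
    println(b.result) // waits for the forked computation to finish
  }
}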

It is true that Loom does not provide any help writing “optimally concurrent” code. But then again, as @lihaoyi pointed out, Future does not provide any real assistance either: tiny changes to the structure of your program can greatly impact concurrency, leading to performance-decreasing over-concurrency or latency-increasing under-concurrency, many of which are embedded in live production code already.

The problem of writing “optimally concurrent” code is real, and it is not a problem that Loom helps address, but wrapper types do not help it either.

1 Like

This is not correct. Recall that in today’s code bases which do not use Future, InputStream#read() will suspend an operating system level thread, as, for example, one thread attempts to communicate over RPC to another node. What Loom is doing is making this operation vastly more efficient (through “async suspension”), but not changing the fundamental requirement for suspension.

Unless you are arguing that all suspension, whether async or sync should use a wrapper type. But keep in mind suspension does not indicate the presence of RPC. Every time you use any lock, any sleep, any code that transitively uses locks or sleeps, or various other operations in the JDK, your threads will suspend, and under Loom, they will merely suspend more efficiently.

So while suspension–which, by the way, is NOT reflected in the Java type system nor completely inferrable even through bytecode analysis–will indeed happen when one thread is waiting for input from a remote node, it will happen in numerous other cases as well, and if you only want to track “doing RPC computation” in the type system, then tracking suspension is a poor proxy.

If we assume the goal is to “track RPC”, then my question is: WHY, and what will a developer do DIFFERENTLY knowing that a method call is “doing RPC”. Type-level tracking of information has a cost, and benefits need to pay for that cost and then some, in order to justify the investment. What are the benefits? What are the actionable outcomes? No one is answering this important question.

It is far more straightforward to simply encourage all RPC frameworks to “throw” a “RemoteException” as a hint to “retry” or “timeout” or something along those lines, which is fundamentally a problem solved by a good and principled system for managing errors, not by “suspension tracking”.

This way of approaching the “RPC problem” has the added advantage that it respects the fact that ultimately, some developer is going to want high-level interfaces like def getUserById(id: UserId): User, and does not want implementation details in such a high-level interface to leak (including a RemoteException, since if failure survives even after application of a retry policy, then it becomes fatal for the route handler, and users do not wish non-recoverable errors to affect static types).

I would suggest that, instead of combining these discussions with Loom or “async tracking”, the question of how best to handle RPC is studied separately, and in combination with other MUCH MORE pressing issues for the RPC programmer (serialization, migration, etc.).

4 Likes

@mdedetrich
Also if you want to reduce stack depth then you can use trampolining directly. No monads like IO or Future are needed. Scala’s standard library already has a mechanism for trampolining: Scala Standard Library 2.13.8 - scala.util.control.TailCalls

Example:

package pl.tarsa

import scala.util.control.TailCalls._

object Main {
  object Direct {
    def method1(): Unit = method2()

    def method2(): Unit = method3()

    def method3(): Unit = method4()

    def method4(): Unit = method5()

    def method5(): Unit =
      new Exception().printStackTrace(System.out)
  }

  object Trampolined {
    def method1(): TailRec[Unit] = tailcall(method2())

    def method2(): TailRec[Unit] = tailcall(method3())

    def method3(): TailRec[Unit] = tailcall(method4())

    def method4(): TailRec[Unit] = tailcall(method5())

    def method5(): TailRec[Unit] = done(
      new Exception().printStackTrace(System.out)
    )
  }

  def main(args: Array[String]): Unit = {
    println("direct style")
    Direct.method1()

    println("trampolined")
    // trampolining can be started at any place
    // i.e. there's no viral inescapable coloring
    Trampolined.method1().result
  }
}

Result:

direct style
java.lang.Exception
	at pl.tarsa.Main$Direct$.method5(Main.scala:16)
	at pl.tarsa.Main$Direct$.method4(Main.scala:13)
	at pl.tarsa.Main$Direct$.method3(Main.scala:11)
	at pl.tarsa.Main$Direct$.method2(Main.scala:9)
	at pl.tarsa.Main$Direct$.method1(Main.scala:7)
	at pl.tarsa.Main$.main(Main.scala:35)
	at pl.tarsa.Main.main(Main.scala)
trampolined
java.lang.Exception
	at pl.tarsa.Main$Trampolined$.method5(Main.scala:29)
	at pl.tarsa.Main$Trampolined$.$anonfun$method4$1(Main.scala:26)
	at scala.util.control.TailCalls$TailRec.result(TailCalls.scala:81)
	at pl.tarsa.Main$.main(Main.scala:40)
	at pl.tarsa.Main.main(Main.scala)

It’s very easy to show this is false, but this is not the proper venue for that. It will, however, be the subject of my talk at Functional Scala 2022 this year, where I will demonstrate “Life After Loom” for functional effect systems (that is NOT to say such concerns do not have alternative solutions, only that, even post-Loom, there are numerous problems that have clumsy and/or inefficient solutions with existing machinery, which are elegantly and efficiently solved by functional effect systems).

2 Likes

Futures have never been a marker of RPCs.

  • Tons of futures have nothing to do with remote calls: those constructed using a Future{} block, those constructed manually via a Promise

  • Tons of RPCs have nothing to do with futures. I can point at literally every single JVM RPC client out there: AWS SDK, Azure SDK, JDBC and ~every database, K8S client, Apache HTTP client, the list goes on. Lightbend tried building an asynchronous database client library in Slick, and look how successful that has become

JDG has already shown that Futures are not good markers of suspendable code: locks, reading from buffers, file reads/writes, etc. can all take unbounded amounts of time. Even simple while loops can potentially run forever!

This whole “futures are a marker for {RPCs, slow operations, suspendability, …}” argument has always been based on a false foundation. Futures have never been a marker for any of those things. They weren’t in 2010, they aren’t in 2022. Maybe you think you could get them to be something in 2034, but that’s not my call to make.

Futures, as currently implemented in Scala, are two things: (a) syntactic sugar around single-use callbacks, and (b) syntactic sugar around running things on threadpools. Nothing more.

(a) is useful in thread-limited environments as a way of multiplexing multiple workflows onto smaller numbers of threads, a use case which will always exist but will become less common when Loom makes threads cheap. (b) will continue to exist, but without (a) it would result in a very different developer-facing API.
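
A rough illustration of (a) and (b), with the “desugared” version written out by hand next to the Future one (the desugaring is approximate and the names are just for the sketch):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.Try

object FutureAsSugar {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    // (b) sugar for "run this thunk on a thread pool"
    val viaFuture: Future[Int] = Future(40 + 2)

    // roughly the same thing spelled out: a task submitted to the pool
    // completing a Promise
    val promise = Promise[Int]()
    pool.execute(() => promise.complete(Try(40 + 2)))
    val byHand: Future[Int] = promise.future

    // (a) sugar around a single-use callback
    viaFuture.onComplete(result => println(result))
    byHand.onComplete(result => println(result))

    Thread.sleep(500)
    pool.shutdown()
  }
}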

Now I agree that there can be value in tracking RPCs. I agree there can be value in tracking parallelism. I even think there is value in tracking local side effects. But Futures have always been a very poor proxy for tracking any of these things, and given the complete confusion evident in the community surrounding them, seem like a very fuzzy foundation upon which to build a rigorous research project

1 Like

@lihaoyi @jdegoes I give up. This will be my last post in a discussion that I find exasperating because it seems we keep talking at cross purposes.

Of course there are other sources of blocking than Futures. And of course Futures are used for other things than RPC. That has never been disputed. All I was ever saying is that if we want to distinguish local calls from calls crossing the internet in an RPC, we need some way to represent “waiting/suspending/taking an unbounded time to finish” in the type system. Just allowing arbitrary Future[T] -> T operations won’t cut it.

I am very familiar with the implementation details and relative strengths of green vs OS threads, but as I said from the start that has nothing to do with the issue at hand. I am also completely unattached to current concurrency abstractions like Future, it’s the principle that counts here.

9 Likes

Also, to be clear from my end, my position is not that Future is perfect or doesn’t have any problems (I have actually been very vocal about Future’s issues), but stating that Future has no purpose/is irrelevant is, to be diplomatic, extremely naive and shortsighted, and I have also stated with concrete examples why.

Taking such a position gets nowhere and just derails the conversation, and increasingly the conversation is taking a path where people are just trying to Silicon Valley-pitch their effect/IO systems while talking past, or completely ignoring, what other people are honestly trying to communicate.

Right, and this works fine with physical threads because each physical thread has a massive amount of memory (8 MiB) that can be used to store metadata about the stack frames, so it’s less of an issue to delay the destruction of a call stack frame: there is simply so much space to store the stack by virtue of it being a physical thread. That’s what I was trying to convey before.

All of this discussion about IO and monads and interpreters and ExecutionContext is really missing the core point: what we are doing with async programming is taking time slices from real physical threads and executing tasks on those time slices (i.e. multiplexing). This is similar to the m:n multitasking that the kernel does between OS threads and hardware threads, but at a higher level. If you have lots of tasks that are constantly being suspended on a limited number of hardware threads (i.e. you are IO-bound) and you want to preserve and properly correlate stack frames between those batches of tasks, you will have some overhead, regardless of whether it’s implemented as a VirtualThread jumping to another VirtualThread, an ExecutionContext/Future subtype/bytecode instrumentation that captures the current stack state of the current physical thread to carry it over to the next thread designated to execute the task, or however lazy IO interpreters solve the problem.

In any case, the kind of benchmarks you were providing is not relevant to what was actually being discussed and, more importantly, is unfortunately (and commonly) misleading, because unless we have different implementations of Loom to compare, so we can see the advantages and drawbacks, we aren’t showing anything. Anyway, due to this I am not going to continue this discussion further because it’s fruitless; proper benchmarks will have to be done.

1 Like

Slick being asynchronous is a bottom-of-the-barrel reason for why it got nowhere, and trying to state it as such is deceptive; this is being said by someone who has been maintaining Slick codebases for almost a decade now (hint: it has much more to do with the fact that Slick was an early adopter of EDSLs/FRMs at a time when Scala’s type system wasn’t mature enough to properly express/implement the approach that Slick was taking, and when Scala’s macros either didn’t exist or were just a research project). Slick failed because its execution of FRM/queries was sub-par, but that also wasn’t its fault, because it was too soon for its time.

The fact that Slick was asynchronous (which is actually a hilarious proposition because it’s tied to JDBC, a fundamentally blocking driver), or that executed queries happened to return Future, and trying to use those as the reason for its failure, is really scraping the barrel for excuses, especially considering it was tied to Play, which at that time already accepted Futures in its routes, so the whole red/blue function-colour problem was moot.

I will tell you what the actual problems with Slick are (and these have nothing to do with asynchronicity/IO types/coloured functions):

  • It was an FRM (meaning you implemented SQL queries as standard Scala functions on collections), however it did the query generation at runtime, which made it very hard to debug and figure out your SQL queries
  • It defaulted to dynamic queries (i.e. queries being generated at runtime on each execution of a DBIO) rather than compiled queries (i.e. unique queries that are cached, where the only dynamic runtime computation is the variables for prepared statements). On top of this, its implementation of compiled queries had a lot of gotchas
  • It was tied to JDBC, which is a fundamentally blocking driver. This means that Slick had the worst of both worlds: it had an odd module system for expressing different SQL dialects (e.g. PG support for arrays), which it needed since every non-trivial SQL app ends up using these, yet it wasn’t modular enough to support different SQL drivers (it is tied to JDBC).
  • Due to it pushing the type system to the limit, it took ages to compile codebases that used Slick. Also remember that Slick is quite old, so at the time it was popular the Scala compiler was slower than it is now, which further compounded the problem.

Quill, which is a more modern and hence much better executed Slick, has none of these issues and also has a module system that supports fully asynchronous database drivers. In comparison it is a breeze to work with, even compared to other languages, and in my opinion is one of the best database libraries I have seen; and oh, it also happens to return Futures for executed queries (at least if you are using async drivers, or IO types that correctly wrap blocking calls). Such shock and horror! Slick being “asynchronous”/using Future might have been a slight annoyance, but it was nothing compared to all of its other issues.

Oh, and to drive this point even further, that same company Lightbend also created a completely asynchronous streaming system, akka/akka-streams (which materializes values into Futures), which in terms of performance happens to be competitive with Rust (see https://twitter.com/drmarkclewis/status/1434180382795128840). So much for Lightbend being such a failure with the asynchronous model!

It’s not doing anyone any favors to scapegoat all of Scala’s synchronous/concurrency/Future/etc. problems onto libraries which failed for completely different reasons. At the same time you happen to be cherry-picking libraries which failed while ignoring the ones that are successful. Finally, I have no idea why you are bringing Lightbend into this.

4 Likes

What I see is that even in monadic (or monadic-like) task composition APIs, recovering prematurely destroyed call stack frames and holding onto them is somehow not an issue. For example:

  • cats-effect enables ‘cached’ tracing mode by default (but has the option to disable tracing altogether), see Tracing · Cats Effect: “It is for this reason that cached tracing is on by default, and is the recommended mode even when running in production.”
  • zio 2 always has tracing enabled, see ZIO 2.x Migration Guide | ZIO: “In ZIO 2.x, the tracing is not optional, and unlike in ZIO 1.x, it is impossible to disable async tracing, either globally, or for specific effects. ZIO now always generates async stack traces, and it is impossible to turn this feature off, either at the global level or at the level of individual effects. Since nearly all users were running ZIO with tracing turned on, this change should have minimal impact on ZIO applications.”
  • C# Task (used in async/await syntax) has also kept recovered stack traces for some time already: Stacktrace improvements in .NET Core 2.1 | Age of Ascent
  • golang also provides call stacks for goroutines (the closest thing to virtual threads from Loom), e.g. t3PCGq - Online Go Compiler & Debugging Tool - Ideone.com

If having a call stack is not an issue in cats-effect, zio and C# tasks (the whole async/await stuff), then why would it be a problem in virtual threads? Are virtual threads designed to blow up? Java’s CompletableFuture probably won’t get any machinery for recovering prematurely destroyed call stacks, because a more direct and lightweight solution was implemented instead, i.e. virtual threads of course.

What’s the point of comparing today’s Loom with some hypothetical Loom that doesn’t exist yet? There are many things that could and will be improved in Loom over the years, but the current version already delivers on some promises and we can test that. IMHO testing some hypothetical custom Loom version that nobody uses would actually be misleading.

Haha that’s ok, no need to belabour the discussion. I thought we had already stopped earlier, but since you jumped back in I figured you wanted to get back into it :stuck_out_tongue: If not, I’m also happy to call it quits.

1 Like

Having a call stack is not a problem if you at least have a way of turning it off, and turning it off is usually a very good idea if you don’t use exceptions for error handling, because the performance impact is not trivial. For example, quoting from that same article you posted about Cats

The reason why this is a bigger problem in Java land is that it’s become idiomatic in Java to overly rely on exceptions, due to how unergonomic it is to do value-based error handling (Java only recently got Optional, and that’s not enough).
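
As an aside, on the Scala side the stack-trace cost can be skipped per exception type via scala.util.control.NoStackTrace, which overrides the expensive fillInStackTrace step; a minimal sketch (names are illustrative):

import scala.util.control.NoStackTrace

object CheapErrors {
  // fillInStackTrace is the costly part of constructing a Throwable;
  // mixing in NoStackTrace skips it (unless re-enabled via a system property)
  final case class DomainError(msg: String) extends Exception(msg) with NoStackTrace

  def main(args: Array[String]): Unit = {
    val e = DomainError("user not found")
    println(e.getStackTrace.length) // 0
  }
}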

I am talking in a broader context, rather than just Loom solving some practical problem for typical Java developers. For example, it may not be as fast as it could be due to that call stack issue (which would be a shame, because a VirtualThread that doesn’t preserve the call stack would have been exceptionally fast and a great candidate for Scala’s IO types when you don’t care about the call stack). Of course Loom is an improvement over the current state of affairs, but that’s because the current state of affairs in Java land is terrible, so posting a synthetic benchmark of 2x is not really showing anything when, hypothetically speaking, it could have been 3 or 4 or 5x (but again, how marginal that is can only be demonstrated by providing an alternate version of Loom).

You can also have a look into these compromises if you compare it to Alibaba’s Dragonwell JDK, i.e. Wisp Documentation · dragonwell-project/dragonwell8 Wiki · GitHub

From a quick glance it seems that:

  • they don’t prematurely destroy stack frames, i.e. they keep them intact
  • they don’t optimize stack memory usage, i.e. instead of moving the stack contents to managed heap like Loom, they keep native stacks in their raw unoptimized form
  • they implemented preemption based on safepoints - Loom doesn’t have that yet, as it switches only on any kind of await (locks, I/O, etc.), but the Loom authors have kind of promised preemption in the future

Their sample code converted to Scala (automatically by IntelliJ):

import java.util.concurrent.{
  BlockingQueue,
  ExecutorService,
  Executors,
  LinkedBlockingQueue
}

object Main {
  val THREAD_POOL: ExecutorService = Executors.newCachedThreadPool

  def main(args: Array[String]): Unit = {
    val q1 = new LinkedBlockingQueue[Byte]
    val q2 = new LinkedBlockingQueue[Byte]
    THREAD_POOL.submit(() => pingpong(q2, q1)) // thread A

    val f = THREAD_POOL.submit(() => pingpong(q1, q2)) // thread B
    q1.put(1.toByte)
    System.out.println(f.get + " ms")
    sys.exit()
  }

  // bounce a token between the two queues a million times, forcing a
  // context switch (thread or virtual thread) on every iteration
  private def pingpong(in: BlockingQueue[Byte], out: BlockingQueue[Byte]) = {
    val start = System.currentTimeMillis
    for (_ <- 0 until 1_000_000) {
      out.put(in.take)
    }
    System.currentTimeMillis - start
  }
}

result under JDK 19 EA without Loom enabled:

6225 ms

result under JDK 19 EA with Loom enabled (but no virtual threads used, i.e. only added --enable-preview):

6241 ms

now changing code to:

val THREAD_POOL: ExecutorService = Executors.newVirtualThreadPerTaskExecutor()

and launch command to:

java --enable-preview -XX:ActiveProcessorCount=1

(to match their benchmarks)
and the result is:

1376 ms

The Loom scheduler is over 4x faster than the Linux kernel at switching threads. Alibaba reported over a 10x speedup in their solution, so there’s room for improvement (for Loom, in the future). But again, Loom already gives you a scheduler that is over 4x faster than the native one from the kernel (in this benchmark at least). What’s not to like?

2 Likes