PRE-SIP: Suspended functions and continuations

robmwalsh · July 12, 2022, 3:06pm

That sounds… Very confusing and easy to stuff up. Would you be able to provide a couple of examples showing how this might work in practice?

fanf · July 12, 2022, 3:08pm

ZIO 2 “solves” that problem by having a runtime that manages where the function is actually run for you.
JDG being present, he will be able to tell more about that, but I was the trigger to bring the attention to that complicated point in ZIO 1. So, basically, there’s 3 kind of thread:
1/ an interpreter one that is special and that needs extreme care about how it handles blocking: it needs to be made by specialist and is part of the runtime capabilities itself, like GC is a given of the JVM. That interpreter manages where code lives between two main kind (and corresponding threadpools):
2/ the kind of “compution that run quickly, can take advantage of a CPU full time and don’t wait”. Not that it can have I/O things here, as long as they don’t block (for ex, modern network interface are totally able to saturate a CPU usage, even if they are not pure computation)
3/ things that block.
‘Blocks’ is given by some notion of progress that I have zero idea about. But the manager is able to observe, and even learn for futur occurence, which function blocks and migrate them from the CPU-bounded threadpool to the I/O-blocking one.

As an user, I can help the manager by tagging a function as blocking, to avoid the try/observe progress/migrate part. And I’m sure than in a near future I will be able to tag function for core affinity and the like (“this function is pure computation and I really want to take advantage of CPU caches”).

There is inefficiency along the way - migration from one threadpool to another is costly, and other things. But it’s the kind of inefficiencies similar to garbage collection: they allow for me, the user, to never ever have to think “does that implementation of a third party lib might block in a rare scenario”. And just, by default, have a good usage of available resources.

mdedetrich · July 12, 2022, 3:41pm

Unless I mistaken this would still incur overhead, i.e. you have a runtime which is trying to figure out what tasks are CPU bound and what tasks are IO bound where as explicitly defining what is IO bound and what is CPU bound will have better performance since you don’t have this overhead.

I would say that comparing this to a GC wrt inefficiency isn’t really an apt comparison. GC isn’t necessary slower (in fact in some cases it can be faster), the bigger problem with GC is it doesn’t give you control over memory

Most importantly currently the JVM/Java/Scala is being used in contexts where having this inefficiency is not acceptable so while its fine to have a library/framework to make things easier at the cost of performance enforcing this on a language/VM level isn’t going to fly.

fanf · July 12, 2022, 6:10pm

You remember that exactly the same was said about GC when they were introduced? Exactly the same. And as you say, now they allow to sometime raise better perf (at the cost of bigger footprint). You still use rust where you need tight controle over memory.

This is a very general and broad assumption. I would say that having scala solving the “I’m not sure what is blocking and how long, optimze that for me” use case would be very compiling. Even since as soon as the runtime manage that, you can apply the same kind of innovation we got than in GC land, and why not ends up with better predictive / optimistic thread usage than what you could do by hand. In my case, the kind of app we are using scala for takes huge benefits of that.

But also, your last sentence is very important: ZIO demonstrates that it does not need to be done at the lang level, it can be done with a lib. Oh, there certainly things that can be enhanced with a shared knowledge with the langage or the jvm (and actually, ZIO will benefit of Loom optimisation for better perf). But it can fly as lib.

jdegoes · July 12, 2022, 10:08pm

We know that it is not possible to say, in general, whether an arbitrary Scala function will “take a long time to complete”, because the language is too powerful. So what we are really talking about right now is “suspension tracking”.

Now, as for suspension, in this Discourse thread, all I have been hearing are arguments why async suspension should be “tracked in the type system”. Indeed, the PRE-SIP suggests only this.

No one is talking about sync suspension. In fact, it seems everyone is satisfied to track async suspension, and also completely ignore sync suspension.

Let’s take the following purely synchronous function, which would not be tracked under any proposal in this Discourse thread:

public byte[] readNext(int n);

A function calling this method will absolutely suspend nearly every time it is called with sufficient n, because the runtime cannot guarantee to read the next n bytes within any definite amount of time. This is an operating-level system synchronous suspension, mediated by the runtime, whereby the OS thread will be placed into a waiting state, to be awakened when a sufficient number of bytes has been read from the socket.

The key point: here we have a purely synchronous method that is suspending potentially forever.

Yet, no one here is arguing that we should track this suspension in the type system. Instead, everyone seems keen on tracking “async suspension”.

If folks are happy only tracking async suspension, which every argument on this thread would seem to indicate is correct, then Loom revelas a glaring contradiction:

Pre Loom, no one wants to track readBytes in the type system (literally no one has suggested it!), despite the fact that it will almost always synchronously suspend.
Post Loom, neither the signature nor the semantics of readBytes will change, nor will its execution time change in any way.
Post Loom, readBytes magically becomes 100% async, and its synchronous supension will change to async suspension.
Suddenly, now everyone wants to track it, even though nothing changed?!?

This is the definition of an incoherent position.

IF there is value in tracking async suspension—which is not something that anyone has demonstrated, principally, by coming up with an example where a developer would write different code knowing of the existence of an async suspension—then by virtue of the isomorphism between sync and async computation, then there must exist this exact same value in tracking sync suspension. Which no one is arguing for (!).

To be consistent, one must either say that compile-time data on both sync and async suspension is not actionable (which is my position, because I believe timeouts are generally the solution to any such contrived use cases); or that compile-time data on both sync and async suspension is actionable. Forcefully stated, there is no coherent position which gives special status to async suspension.

If some future version of Scala can track suspension, it must necessarily track sync and async suspension, treating them equally (because they are equal in every way that matters). Which means, yes, Collection#add and numerous other functions will have to be “tracked” in the type system as being able to suspend a (physical|virtual) thread. That’s a stupendously noisy future for a hypothesis that no one has yet convincingly demonstrated.

What I am strongly convinced of is that adding two-colored functions that work poorly with higher-order functions and polymorphism by importing a legacy Kotlin design made irrelevant by Loom’s ideal solution is absolutely unnecessary and actively detrimental to Scala 3.

lihaoyi · July 12, 2022, 11:18pm

Ok I think I understand where you are coming from

“Moving slow operations off a key thread to avoid starvation of that specific thread” is a valid concern; that’s a common issue in UI programming and Game programming, where heavy operations in the core event loop cause issues. I feel this all the time when I try to rename a file in Intellij, and it starts indexing the world on the main thread and locking up the IDE

But as JDG has said, this has nothing to do with suspending or not: a heavy CPU-bound computation can also cause the key threads to be blocked problematically. As I’ve mentioned before, scala.Map#toString has personally caused me problems in the past, not because the code is unsuitable, but because the data shape was unusual with a much larger Map than most people would expect to work with.

In the end, starvation of individual threads due to long-running operations is a performance issue, and performance issues are notoriously hard to analyze statically. Even for human programmers, the general advice is “always profile first, don’t try to statically analyze the code”. Suspension, networking, IO, etc. is all a red herring here, because heavy CPU computation causes all the same issues. And the resolution is the same: profile it, identify the hotspots, and optimize them in place or move the long-running tasks (whether CPU or IO bound) to a separate (virtual)thread.

Given how difficult performance issues are to statically analyse, I think expecting the compiler to be perform accurate analysis here with only static information is not a promising approach. The compiler doesn’t know

The shape/size of the input data
How many cores the code will have available to parallelize things
Whether the filesystem is spinning-rust, SSDs, or in-memory tmpfs
Whether your JDBC query is going to an in-process SQLite, or to a postgres database 10,000 miles away
The “use case” of the code which could make some 1ms computations unacceptably slow (e.g. in a 144fps Game loop) while other 10s computations perfectly fine (e.g. in a once-a-day batch job)

All of these are things that can, and do, make-or-break whether a key thread is blocked for an acceptable amount of time or not. e.g. when IntelliJ’s indexing blocks my IDE UI thread and doesn’t block other people’s, it’s because of the shape/size of the input it’s handling, not because I’m running an inferior implementation of IntelliJ compared to others.

That’s not to say inaccurate analysis is not useful. The current rules of thumb of not blocking threads is kinda-sorta useful, even if there are false positives and false negatives (a community leader once told me that instead of using a blocking AWS SDK call in a Play controller in my O(0.0001) qps college toy project, I should instead re-implement the entire AWS SDK in an async fashion…). But this sort of inaccurate analysis seems like something that belongs in a Linter (with appropriate configurability to properly suite individual codebases, and @nowarn escape hatches to override the linter when it’s wrong) rather than built deeply into the language spec and compiler

mdedetrich · July 13, 2022, 6:46am

In an ideal world one would track this case but the reason why at least personally I don’t advocate for this is because in most real world situations this isn’t really that possible because this typically happens in cases where you have large input data, something the programmer only knows about (@lihaoyi example of calling .toString on a large map, or I case I had recently in calculating diff on data structures that are 30mb plus in memory). For such tasks however you still ideally want to have the ability to designate how they will run, for example you may want to execute these tasks on a pinned separate thread so at least you don’t stall the current thread. BTW these cases happen all the time in gaming and to a lesser extent UI’s, you have heavy CPU bound computations that in a lot of cases only the programmer knows about. This reminds of when I built a Scala based GUI using swing some time ago, I was using Scala’s standard Future as my IO type and I had an ExecutionContext that represented the UI thread which gave me fine grained control of only doing rendering on UI thread and other heavy CPU bound/async tasks to be run separately.

This still however doesn’t detract from the fact that marking computations which we know should by run “asynchronously” (i.e IO/network/filesystem) is useful.

Also strongly agreed, while there is a lot of legitimate debate about how to tackle this problem and its not an easy one to solve, co-routines is a solution that doesn’t play well with Scala design and how its idiomatically used.

odersky · July 13, 2022, 7:17am

@jdegoes @lihaoyi I think the main reason why people are more concerned about suspensions than long-running computations is that a long-running computation is usually identified during testing but waiting for an external event can have dramatically different outcomes depending on test vs production environment, system load, state of the network, user behavior, etc. It’s not an absolute, for sure. But if we take the Microsoft guideline as an example, then even long running computations that are identified as such could be treated as if they are suspending.

lihaoyi · July 13, 2022, 9:30am

I think the microsoft guidelines are reasonable, but it is worth noting that they don’t have lightweight threads on the CLR. That sort of guidelines are reasonable on the pre-Loom JVM as well.

Basically Loom removes the applicability of these guidelines to avoid thread starvation from “high-concurrency multi-threaded systems”, which includes most backend systems, web services, API servers, etc. For those, we can just auto-scale the thread pool if some get blocked, with Loom letting us do that cheaply and efficiently

These guidelines continue to apply to “high-concurrency single-threaded systems”: UI threads, Game loops, code running within certain kinds of Actors, and environments like Scala.js. These are scenarios where “throwing more threads at it” is not possible, and work needs to be moved off-thread manually.

The current guidelines in the Scala community are overwhelmingly targeted at the multithreaded use case, for people developing backend systems and servers. Not that much UI development or Game dev happening in Scala, and Scala.js remains niche. That leaves the code running in single-threaded Actors, but only those for which latency and responsiveness is important.

IMO this isn’t a sufficiently broad use case to make a heavy investment in, but that’s a subjective judgement. I think we’ve reached mutual understanding and there isn’t any more to discuss

jdegoes · July 13, 2022, 10:04am

Then I would suggest the conversation move from tracking async suspensions (which is inconsistent or even incoherent, as demonstrated above), to tracking indefinite-length computations, for which tracking both sync and async suspensions could be regarded as a poor man’s proxy.

If the goal is to track indefinite-length computations, then I would regard that as prime topic for future research, and personally, do not see how that relates to Loom, Kotlin coroutines, etc.

Incidentally, I share @lihaoyi’s opinion that the value of tracking indefinite-length computations is virtually gone in a post-Loom world.

Indeed, it is already gone for those using functional effect systems like ZIO.

If I am writing the following code:

for {
  bytes      <- drainStream(request)
  transcoded <- doTranscoding(bytes)
  _          <- uploadToS3(transcoded)
} yield Response.OK

Then for every statement, there exist two possibilities:

I need the result of this statement in order to continue to the next statement.
I do not need the result of this statement in order to continue to the next statement.

Note that this is a question of data flow, and fully resolvable statically.

If I care about latency (game and UI apps are excellent examples, but even in a microservice or API backend, latency matters a lot), then in any case where (2) holds (that is, in any case where the result of some computation is not needed in order to continue to the next statement), I will execute such qualifying statement in the background.

Using ZIO, I would transform the code to the following:

def doProcessing(bytes: Array[Byte]) = 
  for {    
    transcoded <- doTranscoding(bytes)
    _          <- uploadToS3(transcoded)
  } yield ()

for {
  bytes <- drainStream(request)
  _     <- doProcessing(bytes).forkDaemon
} yield Response.OK

In this refactoring, I am respecting sequential ordering in cases where subsequent statements depend on prior statements. In other cases, I am shifting work to new background fibers, which are so cheap you should always use them.

In a post Loom world, this is the new paradigm for low-latency: if you need the result from a previous computation, then you must perform it sequentially after that statement. But if you do not need the result from a previous computation, then you may, and often should, perform that work in the background.

Here’s the kicker: Tracking “suspendability” is neither a necessary nor sufficient condition for performing work in the background. For example, draining the stream will be done on a thread (or virtual thread) that suspends (either synchronously or asynchronously). Yet, we need the result of draining the stream in order to proceed to the transcoding step. So the fact that draining the stream may suspend is irrelevant to our sequential logic. Yet, to return the OK response, we do not need to wait for the transcoding or uploading to complete, so we push that computation into the background on a new virtual thread.

ZIO (and of course Loom) make it so cheap to do background processing that the new paradigm is: if you need to do something sequentially, then you do it sequentially, if you don’t need the result to make further progress, then you push it onto a virtual thread.

At no point do we need to understand or care about whether an OS thread is synchronously suspending, or whether a virtual thread is asynchronously suspending. That is not a relevant consideration, and even if you argue there is a poor man’s proxy there (a heuristic), I can show innumerable examples that demonstrate how weak that proxy is (e.g. Collection#add doing synchronous suspend pre-Loom; URL#hashCode doing asynchronous suspend post-Loom; etc.).

In summary:

Sync suspension and async suspension must be treated together, never separately; any tracking proposal must consider them equivalent in every possible way (they fuse into the same concept under pure green threading models, such as the one Loom is almost giving us).
In the new paradigm of cheap virtual threads, as argued by @lihaoyi, if we don’t need some result in order to make progress, then in any low latency application (not just games and UI), we will push that result into the background. This is a data flow question that has a statically analyzable answer, but it has nothing to do with tracking sync + async suspension in the type system.

Precision and clarity of thought is extremely important to get to the heart of the matter, which I think we have gotten to after much discussion.

julienrf · July 14, 2022, 5:17am

I would like this thread to focus on the proposal of having support for continuations in the language instead of discussing the impact of Loom on functional effects. So, I moved a post to a new topic: Impact of Loom on “functional effects”. Please continue that discussion there!

lihaoyi · July 13, 2022, 11:41am

This is something that I think isn’t quite so trivial. Simple static dataflow analysis makes it easy to statically move stuff onto cheap virtualthreads, except:

Side effects are still prevalent throughout Scala. Not as common as in other languages, but enough that parallelizing automatically is risky business. Even when I’ve parallelized stuff manually, I regularly get bitten by side effects I didn’t notice
Granularity & Overhead: virtualthreads are cheap, but not free. IIRC they’re something like ~1kb each, v.s. ~1kb per Future ~1mb for OS threads. That means you can’t go around parallelizing every 1 + 1 in your program without drowning in overhead, and you have to pick some granularity of work below which you won’t parallelize

It’s easy to parallelize things in Scala thanks to it’s FP-ish nature, and Loom makes it easier by making thrrwds cheaper, but I don’t think it’s reached the point where we can feasibly automate it yet. At some point, you have to make subjective judgements about what to send to background threads and what to run on your current thread.

Deciding what to parallelize is a performance issue, and one the challenges of performance issues is the fact that a decision here that works perfectly in one environment with a given input data can totally fall apart in a different environment on a different set of inputs

jackcviers · July 13, 2022, 1:31pm

Most users of functional effect systems are not relying on them primarily for async programming. Rather, they are relying on them for concurrency (compositional timeouts, races, parallelism), resource-safety in the presence of concurrency, typed errors, context, fiber-local state.

For me, as an author, this was the point of proposing continuations. To encode functional effect systems and use them in a direct style, for more use-cases than asynchronous programming:

We would like to write scala programs in a direct style while maintaining the safety of the indirect monadic style.

We’d like to not have to wait until Loom-capable JDK/JVM usage reaches the majority of deployed production systems. Currently, only 12% of Java users are on JDK 15 (Source). It could be a very long wait for Loom to land. Perhaps it will be backported, but that is unknown.

How we may encode functional effects and program them with a direct style, given the current absence of Loom on most deployed JVM systems and on scala-js/native, is the purpose of this pre-proposal.

jdegoes · July 13, 2022, 1:33pm

I agree I oversimplified the issue to focus on data flow. You do not always want to push every statement (whose result is not necessary for making further progress) onto a background thread.

That said, keep in mind the heuristic would not push 1 + 1 to the background, because that produces an Int that would be required for subsequent computation. Rather, it’s generally Unit returning methods or side-effecting processes (uploading a file, doing CPU processing, etc.) those results are unnecessary.

(In modern loggers, even log(str: String): Unit is effectively pushed onto a background thread via queuing. As a heuristic, Unit would be a vastly better criteria than “does async / sync suspension”, because usually you need the result of such a computation to make further progress.)

For sure. Maybe this is also a good point to think about what tooling could bring to this problem: it’s one strategy to force a programmer to add type annotations to manually propagate statically-known information about runtime behavior (suspendability, big-oh, little-oh, exception types, etc.); and another, quite different strategy to expose runtime behavior via tooling.

Tooling has the advantage that it’s less work for a developer: you do not need to engage in the boilerplate of propagating statically-known information via manually typing ASCII characters (which is probably the most common complain for statically-typed exceptions in Java).

Imagine a tool that looks at your code and says, uploadToS3() returns Unit, which is not used by subsequent statements, and on average takes 24 seconds to complete. Do you want to execute this computation on a virtual thread? [Yes] [No]`

mdedetrich · July 13, 2022, 2:54pm

For modern Linux at least the stack sizes for OS threads are much larger, on most distros its 8 mb (if you run ulimit -s you can find out how big it is on your installation). Its also the same on MacOS M1.

But yes, this point is completely correct. If you are doing mathematical/scientific computations (which is actually a big demographic for JVM, something a lot of people seem to have forgotten) as you said you do not want to parallelize every single 1 + 1 in your program even though you theoritically can because its side effect free. I suspect even doing runtime based analysis that is being alluded to will incur significant overhead (something that is a non concern in IO bound apps).

jdegoes · July 13, 2022, 5:00pm

I appreciate, emphathize with, and support this goal—of making it easier to leverage functional effect systems in more places. However, I do not support this pre-SIP proposal at all.

Every proposal has costs and benefits, but a major problem with this proposal is that the benefits (direct, imperative async on pre-Loom JDKs) decrease rapidly as more and more companies adopt Loom (since Loom already solves the async problem in the correct way, by giving us virtual threads), while the costs are high and ongoing (in terms of development, maintenance, education), and have permanent implications, dictating syntax and semantics of Scala 3 for a whole generation of Scala programmers, and creating challenges for library and even application code bases that would have to straddle the quite significant suspend/Loom divide.

With each passing year, the benefits of the proposal would decrease further, approaching 0, while the costs would still remain high and ongoing, with inescapable permanent implications.

Even if you ignore this basic dynamic (which, in a cost-constrained environment, strongly suggests the proposal be rejected), we are stuck with the fact that the proposal has possibly fatal drawbacks:

Introducing two-colored functions into Scala, which is highly undesirable
Failing to provide clear semantics and typing around higher-order functions
Failing to provide a new notion of polymorphism (“suspend-polymorphism”), which is necessary to write generic code that can suspend or not depending on the functions it is passed
Failing to provide a type and value for suspendable lambdas, which are necessary to preserve the parity between methods and functions introduced in Scala 3
Failing to have a non-Loom-based runtime implementation
Not actually adding any new capabilities atop Loom, since everything in the proposal can be done much more simply in straight up Scala 3 or Java on Loom

I would support the proposal more if it were simply adding continuations to Scala 3 (not with Loom, but through generating resumable bytecode for methods and functions), without any changes to the syntax or semantics of Scala. That means you could leverage the benefits (on Scala Native, Scala.js) without having to pay some of the costs, and without having to change the Scala 3 language at all.

But in its current form, I think it’s a bad idea for Scala 3. In my opinion, even Kotlin needs to re-examine suspend, or runs the risk of becoming legacy compared to Java-on-Loom, which does not have the severe limitations or weirdness of Kotline coroutines.

diesalbla · July 13, 2022, 5:12pm

Some of these ideas, of automatic parallelisation of FP code, went in the research literature under the name of implicitly parallel functional programming or implicit parallelism. “Implicit” in the sense of not requiring a special syntax by the programmer. Here is a reference on that subject. https://archiv.ub.uni-marburg.de/diss/z2008/0547/pdf/djb.pdf.

Now, there may be some domains within Scala, such as the planning of distributed data-parallel processes, or build systems, in which implicit parallelism would be useful. However, that may not be the case for the Scala base language. Scala has a rather conventional operational semantics, in that it is strict and sequential by default. This semantics is key for reasoning about side effects performed by program evaluation. Adding implicit parallelism would mess that mental model. Note that this is a matter of language design, and essentially separate from how efficient a specific platform has become in supporting parallelism.

Nevertheless, this is an interesting and deep topic that indeed deserves a long discussion on its own.

ChristopherDavenport · July 13, 2022, 6:22pm

For my piece I feel that the work here and the suspend keyword would be minimally intrusive while adding capabilities not available today.

There are signifcant obstacles to its use and challenges that can be worked through, but it improves the situation.

Loom is not a reason to not do something that adds a useful feature for the language in the way people use it. The jvm community will need to support 11/17 for the better part of the next decade (September 2029 extended support) Loom does not lessen the value in doing something like this.

ekrich · July 13, 2022, 7:52pm

Even hardcoded-into-the-runtime implementations like Loom or Go’s lightweight threading hit issues with FFI calls into C code or OS-level blocking syscalls.

This could be a big issue with Scala Native as we explicitly advertise “calling native code with ease”, paraphrasing.

lihaoyi · July 13, 2022, 10:25pm

Yes, but it is unavoidable. It applies as much to the compiler/language-driven approach as it does the the runtime-driven approach that Loom does.

The general issue is that you need to “async transform” code for this to work. Compiler support can transform code being compiled, but has problem with upstream libraries compiled earlier. Runtime support (Loom, Go, Erlang) can transform all code inside the runtime, but has problrms with code running outside the runtime (e.g. C libraries or OS syscalls)

If you want to make some async transform that works with everything, including with C libraries and OS syscalls, you need to make changes to the OS, since that’s the thing that runs C code and syscalls.

This can be done, but is maybe outside the scope of this discussion