PRE-SIP: Suspended functions and continuations

There’s something that still isn’t clear for me from the discussion. Does Loom somehow solve the classic N+1 problem? I.e. let’s say I have a function that does a Google search: def google(str: String): List[URL]. Now I try doing this:

val list : List[String] = ???
list.map(str => google(str))

How does Loom ensure this is done efficiently, i.e. by spawning one thread per element of list?

If we tracked in types that google can perform a costly block, we would be able to use that information to, perhaps, forbid the above piece of code. Perhaps there should be a variant of map which always spawns a thread per element and allows blocking operations.

Regardless of what the exact solution is, tracking in types that google can block seems better than the situation where it’s easy to run into performance problems when using it. Sure, similar problems would occur with computation-intensive functions as well, but I feel like they occur much more easily once we start doing async programming and a single function call could suddenly take 100ms, or whatever is the local Google roundtrip time.

You are conflating concurrency with asynchronicity.

Loom does not change the semantics of your code: in particular, it does not automatically insert any concurrent operations, nor does such a thing make sense in general (see above academic references on auto-parallelization, which is fraught with known issues).

Loom merely takes your synchronous code (that is, code formerly using physical threads and operations like IO or locks that “sync block” those threads) and makes it fully asynchronous (using virtual threads and “async blocking”, which is more efficient than “sync blocking”).

As such, maybe in your code base, you have some code like list.map(str => google(str)), where each invocation of google blocks a physical thread. Under Loom, the code has the same meaning and will produce the same result, only google can now be fully asynchronous (which does not imply it is concurrent with respect to the thread executing the List#map, because it is NOT concurrent), which means you get the same behavior as before but it runs more efficiently.

Loom is all about efficiency, not concurrency, per se: taking the same programs and making them work better. As a consequence, you can now do “async operations” (i.e. “efficient operations”) anywhere without having wrapper types like Future, including in List#map.
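To make the distinction concrete, here is a minimal sketch, assuming JDK 21+ for virtual threads. The `google` body and the `mapConcurrently` helper are hypothetical stand-ins (and the result type is simplified to `List[String]`); the point is only that the plain `map` stays sequential, and concurrency appears only where the caller asks for it.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for the `google` function from the question:
// it just blocks briefly and returns a fake result.
def google(str: String): List[String] = {
  Thread.sleep(50)
  List(s"https://example.com/search?q=$str")
}

// Hypothetical helper: the concurrency is requested explicitly by the caller.
def mapConcurrently[A, B](as: List[A])(f: A => B): List[B] = {
  val executor = Executors.newVirtualThreadPerTaskExecutor() // JDK 21+
  try {
    // List#map is strict, so every element is submitted before we wait on any result.
    val pending = as.map(a => executor.submit((() => f(a)): Callable[B]))
    pending.map(_.get())
  } finally executor.close()
}

val queries = List("scala", "loom", "zio")

// Same meaning as before Loom: three calls, one after another.
val sequential: List[List[String]] = queries.map(google)

// One virtual thread per element, because we asked for it.
val concurrent: List[List[String]] = mapConcurrently(queries)(google)
```

Both values hold the same results; only the second overlaps the waiting time of the individual calls.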

Ok, so you want to track “local” computation versus “remote” computation. First off, that would not be related to async versus sync tracking: both sync and async can do remote computation, the only difference is efficiency.

Second, in the era of cloud-native applications, the cloud itself has become a sort of standard library: every other call is to some microservice or GraphQL or REST API. Our applications are the glue that hold together operations implemented in the cloud. So tracking “remote” computation may be increasingly and incredibly noisy, as we enter a future in which nearly all calls might be “remote”.

Third, and in my opinion, it is very important to not be obsessed with “tracking” things for the sake of academic novelty (which is good for obtaining grant money but bad for commercial software). Tracking information using types involves considerable effort for developers, who have to type more characters and wrestle with more mistakes (see also: uninferrable exception lists in Java). You can, like Odersky is trying to do, reduce the cost of tracking–preferably NOT via inserting more magic fraught with edge cases that works in unexpected ways with other language features, such as “auto-adaptation” in context functions–but fundamentally, you must still acknowledge it has a cost.

For tracking to pay for itself, you have to demonstrate that the information is (a) actionable, and (b) so frequently actionable that the costs of universal tracking are outweighed by the proven benefits.

I have not even heard a hand-wavy argument on remote vs local being actionable: what would a developer do differently, knowing that “doX()” is a remote call versus a local call? What would the developer do differently, knowing that “doX()” is a local call versus a remote call? Not abstractly, but what concrete code would a developer write knowing such a difference?

I have argued above that the steps a developer would and should take for flaky computations always involve retries, and the steps a developer would and should take for long-running computations always involve timeouts. Although remote computations are more likely to be flaky and long-running, it is only a correlation, and many local computations can be both flaky and long-running. So the mere presence or absence of a “remote bit” is likely to be insufficient information to be actionable.
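As a rough illustration of what “retries and timeouts” look like in direct-style code, here is a minimal sketch. The helpers `retry` and `withTimeout` and the `flaky` operation are hypothetical names, not part of any library discussed here, and the retry policy is deliberately naive (it retries on any exception). Nothing in the code depends on whether the wrapped operation is local or remote.

```scala
import java.util.concurrent.{CompletableFuture, ExecutionException, TimeUnit, TimeoutException}

// Hypothetical helper: re-run `op` until it succeeds or `attempts` runs out.
def retry[A](attempts: Int)(op: () => A): A =
  try op()
  catch {
    case _: Exception if attempts > 1 => retry(attempts - 1)(op)
  }

// Hypothetical helper: stop waiting for `op` after `millis` milliseconds.
def withTimeout[A](millis: Long)(op: () => A): A = {
  val result = new CompletableFuture[A]()
  val worker = new Thread(() => {
    try result.complete(op())
    catch { case e: Throwable => result.completeExceptionally(e) }
    ()
  })
  worker.setDaemon(true)
  worker.start()
  try result.get(millis, TimeUnit.MILLISECONDS)
  catch {
    case e: TimeoutException =>
      worker.interrupt() // the result is no longer needed
      throw e
    case e: ExecutionException => throw e.getCause // unwrap the original failure
  }
}

// A flaky operation (local here, but it could just as well be remote).
var calls = 0
def flaky(): Int = {
  calls += 1
  if (calls < 3) throw new RuntimeException("flaky") else 42
}

val answer = retry(5)(() => withTimeout(1000)(() => flaky()))
```

The same two wrappers apply unchanged to a remote call, which is the sense in which the flaky/long-running property, rather than remoteness, drives the code.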

If I am wrong, then it should be possible to provide some evidence that:

  1. Developers know to do and actually do something radically different based on the “remote bit”, such that it significantly affects correctness or performance or some other metric that matters to the business.
  2. Developers do this so often that it overwhelms the significant drawbacks to infecting every type signature across the entire code base with a “remote bit” (or at least, infecting either all remote code, or all local code, with such a bit, if you can infer its negation by its absence).

Ultimately, my stance is that “effect tracking” is a distraction and a waste of resources, hence my blog post, Effect Tracking Is Commercially Worthless.

That dynamic could change in a future in which tracking things is cost-free or super-low-cost and completely automatic (fully type-inferred), but until and unless that point arrives, I will always be asking proponents of effect tracking to demonstrate (a) actionability of information, and (b) pervasiveness of need, such that benefits clearly outweigh costs. To my knowledge, no one has demonstrated this in the case of remote vs local, and it cannot be demonstrated at all in the case of sync vs async.

Correct. That’s the whole point. I thought we’re past the sync/async distinction :wink:

Also agreed. So if we want local and remote invocations to have a different signature, because of cloud-native the cost has to be minimised. I think that’s the point of @odersky 's research project.

Well I would say that you have demonstrated that two paragraphs below: the actions to take are retries and timeouts, the frequency is there because of cloud native.

One point where I would disagree is that local computations need recovery logic as above to a similar degree as remote do. I don’t think it’s only correlation. Every remote invocation can be flakey / long-running / throw errors randomly. But only some local ones have these characteristics.

Now, I don’t have hard empirical evidence that the “remote bit” actually matters. Only anecdotal :wink: But on the other hand, is there evidence that a consistent and principled approach to errors originating from remote calls doesn’t influence the bug ratio? Especially that these bugs tend to manifest themselves in production, not in the calm and idealised test environment.

Finally, aren’t we talking here about error handling - something that is very close to the heart of every ZIO programmer? The whole point of effect tracking, or remote-call tracking, or however we call it, is to properly handle the error scenarios. Java implements this by requiring methods to add throws IOException, which is often circumvented by programmers. ZIO moves the error channel to a type parameter, for composability. I don’t think it’s at all unreasonable to look for other, maybe more general solutions, where errors are just one specialisation of the “effect” a computation might have.

I would be happy if that were true but given other posts on this thread, including, indeed, the nature of the pre-SIP itself, it seems unlikely. :grinning_face_with_smiling_eyes:

Indeed, Odersky himself stated:

“The sync/async problem is one of the fundamental problems we study [in our 7 persons over 5 year project].” (emphasis added)

From my experience, I would say that developers failing to apply retry or timeout logic is not a significant source of lost business revenue, partially because libraries and frameworks are designed to handle this or nudge users into doing the correct thing (e.g. Http.get requiring a timeout parameter).

It happens sometimes, and it has measurable costs, but the overall amount of revenue lost due to failure to apply retry or timeout logic pales in comparison to the revenue lost dealing with unexpected null values, transforming data from A to B without mistakes, or possibly even retrying the wrong thing (e.g. NPE) because of the lack of a two-channel error model.

Even for resource handling, the main issue in modern web apps is memory leaks; the occurrence of lost file handles or connections in a database pool is made rare by libraries and frameworks (or try-with-resources in Java).

For things which are not a significant problem in commercial software development, it is all the more important to ensure the costs are minimized, and to ensure that new features aimed at addressing these “problems” produce clear benefits of sufficient magnitude to overwhelm those minimized costs.

I agree that only some local computations have these characteristics, but not that all remote ones do. For example, if your application is running with EBS or EFS storage, then despite all disk-related operations being remote, they are extremely unlikely to be flaky or long-running.

This raises another important point: that sometimes operations that your application may expect to be local, are in fact remote. Which means that any attempt to track “local” versus “remote” is at best an educated guess. Indeed, a repository interface may suggest the database is remote, while a particular implementation may be using H2 embedded.

To me, this is feeling like researching how many angels can dance on the head of a pin.

Meanwhile, while we discuss whether to embed a remote versus local bit in the type system (in a TBD comonadic effect system that no one is asking for, despite, of course, some academic value), modern cloud-native, industry-focused languages like Ballerina make it trivial to produce and consume cloud services and leverage user-defined data structures in cloud protocols, innovating on real problems that consume massive amounts of developer time.

Which of these focus areas stands to benefit industry the most?

(Actually, we’re not even really discussing local versus remote, because most people contributing to this thread seem to believe the async versus sync distinction is important to track in the type system.)

In my view, ZIO’s error handling works because (a) it is based on values, which allow even polymorphic abstraction over duplication (b) it is fully inferred, meaning no additional developer work is required to benefit from it (“zero” cost), and (c) it leverages the type system to cleanly separate recoverable errors from non-recoverable errors, with an ability to dynamically shift errors between channels (which is critical in a cloud-native environment, where only some errors should be retried). Java failed on all three accounts, which is, I believe, why checked exceptions are regarded mostly as a mistake (CanThrow fails on two accounts, and its potential successor will probably fail on those same two accounts).
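For readers unfamiliar with the two-channel idea, here is a minimal sketch using plain `Either` rather than ZIO's actual API. All names (`FetchError`, `retryTyped`, `orDie`, `fetch`) are hypothetical. Recoverable failures travel as typed values and can be retried selectively; anything thrown (for example a NullPointerException) is treated as a defect and is never retried.

```scala
// Hypothetical error type for the recoverable (typed) channel.
sealed trait FetchError
case object Unavailable extends FetchError
final case class Rejected(reason: String) extends FetchError

// Retry only typed failures that the caller marks as retryable.
// Thrown exceptions are not caught here, so defects are never retried.
def retryTyped[E, A](attempts: Int)(op: () => Either[E, A])(retryable: E => Boolean): Either[E, A] =
  op() match {
    case Left(e) if attempts > 1 && retryable(e) => retryTyped(attempts - 1)(op)(retryable)
    case other                                   => other
  }

// Shifting an error between channels: a typed failure that the caller
// cannot recover from is turned into a defect.
def orDie[E, A](result: Either[E, A]): A =
  result match {
    case Right(a) => a
    case Left(e)  => throw new IllegalStateException(s"unrecoverable: $e")
  }

// A call that is unavailable twice and then succeeds.
var calls = 0
def fetch(): Either[FetchError, Int] = {
  calls += 1
  if (calls < 3) Left(Unavailable) else Right(42)
}

val fetched = retryTyped(5)(() => fetch())((e: FetchError) => e == Unavailable)
```

This is only an approximation of the model described above: in particular, the error type here is written out by hand, whereas the post stresses that ZIO infers it.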

I would be happy to see another error model that takes this same direction with fewer costs and / or greater benefits, and if that happens to be part of a capability-based (comonadic) effect system geared toward solving problems rather than tracking bits of debatable value, then I would appreciate that, as well. But keep in mind the burden of proof is on those making the claim that such a system would be superior to what exists today, and that it warrants investment and support from the broader Scala community.

One question that I think we have not answered: What about the error case? If Futures are replaced by VirtualThread with a result (or any of the other options discussed here), how are errors handled?

  • Returning a sum of result and error? But then we are back to monadic instead of direct style. And you could argue that if you are working with monads, you might as well work with one that makes it explicit that things can suspend.
  • Or throw an exception? But then these need to be tracked somehow

I sketched a Future-lite successor in the other thread:

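import java.util.concurrent.CompletableFuture
import scala.util.Try

// The JDK exposes virtual threads as plain java.lang.Thread instances,
// so the handle is typed as Thread here.
class Future[A](f: CompletableFuture[Try[A]], vt: Thread) {
  def virtualThread: Thread = vt

  // Blocks the calling thread until completion, then returns the value
  // or rethrows the captured exception.
  def result: A = f.get.get

  // etc.
}
object Future {
  def apply[A](code: => A): Future[A] = {
    val f = new CompletableFuture[Try[A]]()
    // `code` is by-name, so it is evaluated inside the virtual thread.
    val vt = Thread.startVirtualThread { () =>
      f.complete(Try(code))
    }
    new Future(f, vt)
  }
}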

Future#result gives you access to the result, and if that is an exception, then it would throw.

How to model that exceptional case–whether with Java checked exceptions, Try values, Either values, ZIO values, CanThrow capabilities, CanThrow's comonadic capability type-based successor, or something else entirely–is independent from the sync/async question, and also independent from Loom.

In other words, however we would want to describe any exceptional result from any method, we should use that same process to describe the possible error that may result from calling Future#result.

On a somewhat related note, there is another plausible point, which is stack traces and exceptions. The nice thing about Future is that it’s decoupled from stack traces; that is, by design you are not meant to expect the stack to be consistent (or even to be there at all). As is evident to anyone who has used Future along with the standard ExecutionContext (such as ForkJoinPool), the stack traces are meaningless because the computations can jump between different threads at whim (that’s the whole point of multiplexing computations onto real threads). You can see this even in the Future API, i.e. you have methods like Future.failed to designate a failed Future with a Throwable, but it’s just passed around and propagated as a value. You can still recover from exceptions thrown in Future, but as stated before it’s expensive; critically, you don’t have to throw (i.e. you can just use Future.failed).

This is one area where I am a bit skeptical of Loom. I haven’t looked at Loom in great detail, but if VirtualThread is meant to preserve stack traces, and there is a lot of code out there that assumes stack traces are consistent and properly propagated, then unless I am missing something this will have a performance penalty (versus not caring about the stack at all). This performance penalty is already visible right now, either in the case of Future with custom ExecutionContexts that preserve the stack or in other IO types that propagate the stack in the interpreter. I do believe that Loom’s solution to this problem is not going to have the same overhead, but as said previously I don’t see how it can be “cost free”.

Ultimately though, this is one of the best benefits of doing value-based error handling rather than throwing and catching exceptions: if you throw and expect to catch exceptions, it’s expensive, and Scala’s IO/Async takes forced programmers to not rely on the stack for basic error handling (which is a good thing). If my previous point about Loom is correct, Loom is forced to propagate the stack in order to remain compatible with existing code that relies on try/catch and preservation of the stack to function. I also haven’t seen any ability for Loom to granularly handle stack propagation, so that you don’t pay the performance penalty if you don’t rely on exceptions.

For this reason alone (and others), despite what people claim, Loom is not going to kill Scala’s Future, even in the hypothetical where everyone runs JVM 19+ (or whichever version is released with Loom) and Scala.js/Scala Native are ignored.

This is not a nice thing. In fact, constructing exceptions in Future-based code incurs the cost of building out a stack trace, without the benefit–because the stack traces so constructed are useless and only reflect the callstack from the last “bounce” inside the execution context to the current operation.

Async stack traces in Loom are the same as sync stack traces, and have the same overhead. You pay this overhead only when (1) your code actually fails, and (2) your exception type is generating a stack trace (not all exception types are wired to generate stack traces, see also NoStackTrace).
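The NoStackTrace point can be checked directly. A minimal sketch, with two hypothetical exception classes and assuming the default setting where trace suppression is not disabled via the `scala.control.noTraceSuppression` system property:

```scala
import scala.util.control.NoStackTrace

// Hypothetical exception types, for illustration only.
final class WithTrace(msg: String) extends RuntimeException(msg)
final class WithoutTrace(msg: String) extends RuntimeException(msg) with NoStackTrace

val traced   = new WithTrace("boom")
val untraced = new WithoutTrace("boom")

// The regular exception captured the call stack when it was constructed...
val tracedFrames = traced.getStackTrace.length

// ...while the NoStackTrace variant skipped that work.
val untracedFrames = untraced.getStackTrace.length
```

The cost of filling in the stack trace is paid at construction time, which is why it applies equally to exceptions that are thrown and to ones that are only passed around as values.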

I am not a fan of throwing exceptions (versus using typed values), but this is not a reason to prefer values, because if you are using exception types that do not generate stack traces, then it will be faster than value-based error propagation tends to be (Either, Try, etc.).

Stack tracing is absolutely and positively not a reason that Future will survive in a post-Loom world. Future has no benefits whatsoever with respect to stack tracing compared to Loom.

Future will survive only because people don’t want to rip apart legacy code bases, not because Future conveys any value with respect to stack traces (or anything else, really, since the marginal utility of other benefits is better obtained using more modern data types, like a typed-VirtualThread, for example).

I agree, re-throwing exceptions is one of the options. But then, should these exceptions be tracked in the type system or not? Previously, the main criticism of exceptions was that they were untracked. We now have a way to track them with experimental.saferExceptions made watertight with capture checking. But that means the successor to Future should have the error type as a type parameter rather than fixed to Exception, because otherwise info about thrown exceptions is not propagated across futures.

Well, if you care that much about the cost of the stack trace, you can use scala.util.control.NoStackTrace for exactly this problem; you will just be passing (almost) a reference around.

But this is all beside the point, because if you care so much about the cost of stack traces then you shouldn’t even be using Future.failed (or throwing/catching), which goes back to the point of using value-based error handling.

How does this work exactly? The reason why stack traces are “free” with normal threads is that the stack is part of the OS thread and due to already paying the cost of a heavy OS thread passing the stack along doesn’t cost anything.

On the other hand, the whole point of green thread/fiber implementations is that they typically do not have any “stack” on them and they have a very small size (e.g. 1kb for Erlang), so while catching/throwing can be made free, preserving the stack trace, especially for very large non-local calls, is another story. Of course you can just pass the incrementally growing stack along in your virtual thread, but that has a performance penalty, and you also experience problems due to cache locality of threads.

Well, it’s another reason on a bucket list of reasons, but regarding the rest of your point, there is no fundamental reason why value-based error propagation is less performant than using try/catch without stack propagation, because in the end it all amounts to the same thing, i.e. a control flow mechanism. On the JVM, value-based error handling can be slower, but that’s because the JVM isn’t that optimized for it; you can look at Go instead, which has optimized its runtime for value-based error propagation (note that my response also assumes that we are comparing apples to apples, i.e. if you are referencing error values then you also need to compare that to catching exceptions in order to use the value of the exception being caught).

More concerningly though if you care that much about Loom and the JVM, typical JVM/Java code does preserve and propagate stack. scala.util.control.NoStackTrace is a Scala specific feature and I don’t even remember seeing Java programs create their own version of scala.util.control.NoStackTrace to mitigate cost of stack propagation, in fact in such cases they use values/null if they care about performance that much.

I think you misunderstood my point, the benefit of Future is precisely that it forced programmers to NOT care about the stack at all and also to NOT use it as the primary error handling mechanism.

This reminds me of the exact same argument that people were using to justify sun.misc.Unsafe having no reason to exist. In the worst case scenario, even in the context of a library creator/maintainer, such abstractions are necessary and it has nothing to do with legacy. Whether people like it or not, Future is not going anywhere, for reasons aside from legacy.

In order for you to see how my statements are correct, I would have to explain Future, the cost of stack trace generation, the connection between Future and stack traces, the cost of exception throwing, the cost of catching exceptions, and the cost of value-based error propagation (both theoretical and as practiced in Try, Either, ZIO, etc.), and possibly more besides.

I have no interest in explaining these things here, but I will repeat myself: exceptions or error handling in general are NOT a reason to use Future, not even slightly (if anything, the reverse), and Loom’s impact on exception handling is only net positive.

I think that’s an important question, but one not connected to async/sync or Loom.

If Scala decided to track exceptions via CanThrow (which I would say is not decided, it is opt-in and experimental, and as of yet, lacks broad buy-in), then you would use the following Future successor:

class Future[+E, +A](...) {
  def virtualThread: VirtualThread = vt 

  def result: A throws E = ...

  // etc.
}
object Future {
  def apply[E, A](code: => A throws E): Future[E, A] = ...
}

The only explicit support Scala would need, if any (presumably you can “cheat” with casting), is a way to transfer a capability between (virtual) threads, which is needed anyway for all capabilities.

This allows you to have a “handle” on a running computation that you can use for purposes of:

  • generating useful stack traces
  • checking the progress of the computation
  • interrupting the computation because the result is no longer needed

while still having a way to access the typed success value or exception from the completed computation.

Such a data type would have lots of other methods on it, too (e.g. Future#poll), but would omit nearly all of the callback-based machinery of existing Future, including all the monadic machinery (map, flatMap, etc.).

(And while we’re at it, Future is not a great name, it’s more like RunningTask[E, A]).
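A rough, runnable approximation of such a handle is sketched below, assuming JDK 21+ for `Thread.startVirtualThread`. It is not the design from the post: typed errors are modelled with `Either` instead of the experimental `throws` clause, variance annotations are dropped, and the method names `poll`, `interrupt` and `result` are illustrative choices.

```scala
import java.util.concurrent.CompletableFuture

// Hypothetical handle on a running computation.
final class RunningTask[E, A](outcome: CompletableFuture[Either[E, A]], thread: Thread) {
  // Non-blocking progress check: None while the computation is still running.
  def poll: Option[Either[E, A]] =
    if (outcome.isDone) Some(outcome.join()) else None

  // Ask the computation to stop because the result is no longer needed.
  def interrupt(): Unit = thread.interrupt()

  // Block the calling thread until the typed outcome is available.
  // Untyped failures (defects) surface as a CompletionException from join().
  def result: Either[E, A] = outcome.join()
}

object RunningTask {
  def apply[E, A](code: => Either[E, A]): RunningTask[E, A] = {
    val outcome = new CompletableFuture[Either[E, A]]()
    val thread = Thread.startVirtualThread { () =>
      try outcome.complete(code)
      catch { case t: Throwable => outcome.completeExceptionally(t) }
      ()
    }
    new RunningTask(outcome, thread)
  }
}

val task = RunningTask[String, Int] {
  Thread.sleep(50)
  Right(21 * 2)
}

// Direct style: just block the current (possibly virtual) thread for the outcome.
val outcome: Either[String, Int] = task.result
```

Note the absence of `map`, `flatMap` and callbacks: callers sequence work by blocking on `result`, which is cheap when the caller is itself a virtual thread.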

These things I understand. What is being sold here as black magic is that Loom provides zero-cost catching of exceptions that preserves complete stack traces (if I am not misreading what is being said). The reason why I use the term black magic is that no other language has solved this problem.

If you want negligible performance impact in the scenario of actually needing to reference the error/exception you have try/catch without full stack trace preservation and/or optimisations for local try/catch inside of functions (i.e. Erlang, OCaml) or you treat errors as values (and since its a value you can reference it).

I don’t disagree that Loom is obviously faster than the current JVM in the context of async/threads being blocked. The more pertinent point is that if you accept the proposition that there is a lot of JVM/Java code that relies on full stack trace preservation (either directly or indirectly, i.e. debugging) and Loom wants to preserve this property, it’s giving away potential performance. Of course, calculating how much would require benchmarking an implementation of Loom that never generates and/or preserves stack traces against the current implementation (which apparently does).

Sure, but it’s more of a reason compared to using the conventional Java-style try/catch in the scenario where you have to catch (or reference) the value in the context of your program running, and not just in the “let it crash”/500 internal server error scenario.

There is a reason why if you look at any high performing Java code where error cases need to be referenced in the normal running of the program, they don’t use exceptions even if there is a lot more boilerplate and/or its not as ergonomic/idiomatic.

Well I would say the discussion evolved beyond that, so there’s no need to come back to sync/async all the time, but maybe I’m wrong :wink:

Anyway, I don’t think either you or I have any data that would quantify in terms of e.g. lost revenue the effects of using a typed language in the first place, let alone more advanced constructs. Though I agree that observing the most common sources of developer confusion and bugs is a good indicator as to how to evolve our libraries & frameworks.

I agree that a typed error model might help avoid many bugs; but again I would say that this is not very different, if not the same as, an effect system (yes, you do get quite a lot of information from a function where the signature includes a ZIO[R, E, _], both from R and E as opposed to a “normal” one).

It’s great that ZIO successfully demonstrates how to implement error handling with the traits you enumerate. This might be a very good benchmark for other implementations out there, so that they might try to “do better” (I don’t know if that’s possible, but then the research program that @odersky mentions is supposed to take 5 years, so I think they don’t know it either :wink: )

As for focusing on areas that benefit the industry: I agree that languages like Ballerina look great at first sight. I would like at some point to write something bigger using it, to see how it scales (with such specialised approaches it’s often easy to do the common thing, and hard to do the uncommon). You know, Spring, RoR and Tapir all look great when looking at small examples ;).

Taming concurrency, properly handling errors in the presence of remote calls is something that always confused me and I find working with types such as IO helpful. So for me, it is an area where at least my coding would benefit. Of course I might be an isolated case, so it’s just one more datapoint :slight_smile:

Thinking a bit more about this, maybe you are right that what’s valuable is principled error handling, not an effect system. The difference (probably one of) is how the information propagates across method calls.

A follow-up question to “do we want methods which perform RPC to have a different signature than normal ones” is, “do we want this information to propagate to callers”. That is, do we want the “remoteness” to be viral (as IOs are today), or is it enough to handle all errors for the “remote” marker to disappear from the signature? In yet other words, should the “remote” effect be eliminatable?

If we eliminate the RPC errors, then at some point in the call chain the methods start looking as “normal” ones. If on the other hand we propagate the “remote” marker, it will infect all the callers, all the way to the root.
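The two options can be sketched with a plain `Either` standing in for the “remote” marker. Everything here (`RpcError`, `User`, `fetchUser` and the two callers) is hypothetical and only illustrates how the marker either propagates or disappears.

```scala
// Hypothetical types, for illustration only.
final case class RpcError(message: String)
final case class User(id: Int, name: String)

// The signature says "this can fail the way a remote call fails".
def fetchUser(id: Int): Either[RpcError, User] =
  if (id == 1) Right(User(1, "Ada"))
  else Left(RpcError(s"no response for user $id"))

// Viral: the caller does not handle the failure, so the marker stays in its signature
// and will show up in the signatures of its own callers too.
def greeting(id: Int): Either[RpcError, String] =
  fetchUser(id).map(user => s"Hello, ${user.name}")

// Eliminated: every failure is handled here, so from this point up the call chain
// the method looks like a "normal" one.
def greetingOrDefault(id: Int): String =
  greeting(id).fold(_ => "Hello, guest", identity)
```

With an error-style encoding like this one, elimination is always available to the caller; a non-eliminable “remote” effect would need a different encoding in which no total handler exists.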

Fundamentally speaking, you cannot eliminate RPC call failures; you can only “hide” them, at which point you are going to very quickly experience problems in any non-trivial circumstance. Pretty much every single framework/library that tried to treat RPC calls the same as local calls has catastrophically failed in some way; the earliest example of this is probably CORBA.

For all of its faults, one of the biggest strengths of Akka actors as a concurrency framework is that it forced programmers to treat every call as a remote call (so you always had to handle potential failure), which means that even if you initially only implemented local concurrency, if you were to scale that out horizontally, practically speaking you would just tweak the instantiation of actors/actor refs and some other constants.

Treating RPC like normal local calls is doomed to failure because RPC calls can fail in ways that local calls cannot. Treating local calls like RPC calls works fine but is overkill in most scenarios unless you are specifically planning to scale out an initially local concurrent program into a remote/distributed one. Separating out RPC from local calls because you acknowledge they have different properties has the advantage of being granular while also being principled. Even on a pragmatic level, being able to tell calls that hit the network/filesystem/remote computer apart from local calls is immensely powerful and in my opinion in many cases justifies the extra ceremony (whether it be done via types or other methods).

I am content if there is consensus that:

  1. Only actionable information is potentially useful to track.
  2. The costs of tracking must clearly be outweighed by the benefits, which implies some combination of (a) high benefits, and (b) low costs.
  3. The sync/async distinction is not relevant to track.

People can of course disagree on the specifics.

I enjoy effect systems such as ZIO and think they have great and lasting commercial value (not related to “effect tracking” whatsoever), but I am careful to try to avoid biasing language conversations in that direction because the audience of Scala is larger.

As the actionability stems from recoverability, namely, a category of recoverability where merely retrying stands some chance of succeeding, I do think the challenges of RPC are more closely solved by a good and principled error system, as well as good compositional concurrency to achieve the efficiency gains made possible through timeouts and cancellations.

However, I view the domain of concurrency as quite outside the language level (at least in a general-purpose and late-stage language like Scala or Java, in which concurrency solutions manifest themselves as new libraries and frameworks), leaving only a good and principled error system as a target for future language evolution.

:100:

That’s exactly the problem we’ve been considering since about PRE-SIP: Suspended functions and continuations - #101 by adamw :slight_smile:

It’s somewhat unfortunate that the effect tracking topic spilled over to Impact of Loom on “functional effects”, so I’ll continue the effect tracking discussion here instead of scattering it over two topics.

Akka actors, or rather actors in general, try to hide the location of target actors and in general steer the programming model towards location independence, i.e. use isolation and general process management (supervision, restarts, propagating errors higher up the hierarchy) for both local and remote actors. Systems based on Erlang (an actor-heavy platform) claim availability of multiple “nines” (High availability - Wikipedia), and Erlang enthusiasts claim it comes from an approach called “let it crash” (The Zen of Erlang). It looks like the approaches used in actor systems are at odds with tracking remote calls.