What do we do with `scala.sys.process`?

jvican · January 15, 2018, 10:50am

Good morning everyone

Recently, I found myself needing to spawn interactive system processes: long-running processes that interact with the host application via stdin/stdout. I started using the Java Process and ProcessBuilder API, and realized that there are major pitfalls in how the API is supposed to be used.

There is lots of literature on this topic on Stack Overflow, but I’d like to explain the most important issue: waitFor blocks, and may never return if the process has produced output (on stdout or stderr) that the client hasn’t consumed. (exitValue does not block, but it throws IllegalThreadStateException until the process has exited, so it is no way around the problem.) Both stdout and stderr have to be consumed in independent threads so that neither pipe fills up and blocks the child. These threads are known as StreamGobblers.
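A minimal sketch of the gobbler pattern, written in Scala against the plain Java API. The `Gobbler` object and its `run` helper are names made up for this illustration, and the commands are assumed to be available on a POSIX system:

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object Gobbler {
  // Drains `in` on a dedicated daemon thread so the child never blocks
  // on a full pipe buffer; every line read is appended to `sink`.
  private def gobble(in: InputStream, sink: StringBuffer): Thread = {
    val t = new Thread(new Runnable {
      def run(): Unit = {
        val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
        try {
          var line = reader.readLine()
          while (line != null) {
            sink.append(line).append('\n')
            line = reader.readLine()
          }
        } finally reader.close()
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }

  // Runs `command` and returns (exit code, stdout, stderr).
  def run(command: String*): (Int, String, String) = {
    val process = new ProcessBuilder(command: _*).start()
    process.getOutputStream.close() // this sketch sends nothing to the child's stdin
    val out = new StringBuffer
    val err = new StringBuffer
    val outThread = gobble(process.getInputStream, out)
    val errThread = gobble(process.getErrorStream, err)
    val exit = process.waitFor() // both pipes are being drained, so this can return
    outThread.join()
    errThread.join()
    (exit, out.toString, err.toString)
  }
}
```

For example, `Gobbler.run("sh", "-c", "echo out; echo err 1>&2; exit 3")` yields the exit code together with the text from each stream. An interactive process would need a third thread feeding stdin instead of closing it.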

Read up on some of the pitfalls in this Java World article.

The bottom line is that using the Java process API correctly requires at least three threads running concurrently and lots of boilerplate. I turned to scala.sys.process to see if this issue was addressed, and it is.
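As a rough illustration of what that looks like from the caller's side, here is a small sketch using `Process` and `ProcessLogger` from scala.sys.process. The `SysProcessDemo` wrapper is a hypothetical name, and the example assumes a POSIX `sh`:

```scala
import scala.sys.process._

object SysProcessDemo {
  // Runs `command` and returns (exit code, stdout lines, stderr lines).
  def run(command: Seq[String]): (Int, List[String], List[String]) = {
    val out = List.newBuilder[String]
    val err = List.newBuilder[String]
    // The logger receives stdout and stderr line by line; the library
    // does the reading, so the caller starts no threads of its own.
    val logger = ProcessLogger(
      line => { out += line; () },
      line => { err += line; () }
    )
    val exit = Process(command).!(logger) // blocks until the process exits
    (exit, out.result(), err.result())
  }
}
```

Calling `SysProcessDemo.run(Seq("sh", "-c", "echo out; echo err 1>&2; exit 3"))` gives back the exit code and both streams without any explicit thread handling.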

I think that scala.sys.process is actually quite nice (though I prefer the aesthetics of the Java process API), and I’d like to know what we’re going to do with it in the upcoming Scala library modularisation. I have a few questions:

1. Do we want to add it to the Scala Platform as it is?
2. Does anyone know of a Scala or Java alternative?
3. In case there’s no alternative and scala.sys.process is eventually used, would someone want to maintain it?

mghildiy · January 15, 2018, 11:13am

I am open for 3, if eligible.

jvican · January 15, 2018, 11:17am

That’s great news, thank you for volunteering!

som-snytt · January 15, 2018, 8:19pm

I think sys.process would benefit from some iterations as a module. It’s handy and essential functionality, but I always spend time coping with its internal complexity.

jvican · January 15, 2018, 8:50pm

I wholeheartedly agree. What are your gripes with it?

I guess it would be great to gather some statistics on how often scala.sys.process is depended upon. I’ll try to write a rule to detect it in @olafurpg’s corpus.

retronym · January 15, 2018, 11:30pm

I liked the look of NuProcess as a third party alternative to the JDK standard Process. It can spawn without consuming JVM threads for each of them. Chris Hunt wrote an adaptor between this library and Akka streams.

The standard process API in Java has gotten quite a bit easier to use in Java 8, and has some key missing features added in 9. Our wrapper isn’t (easily) able to expose those new features while we still need to support Java 8. I guess that’s an inherent problem of wrappers: they add a delay before new features are exposed.
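To make the Java 8 point concrete, here is a small sketch that bounds the wait with `Process.waitFor(long, TimeUnit)` and kills the child with `destroyForcibly()`, both added in Java 8. The `TimedRun` object is a hypothetical helper, and `inheritIO()` (Java 7+) is used so that there are no pipes to drain:

```scala
import java.util.concurrent.TimeUnit

object TimedRun {
  // Returns Some(exit code) if the command finishes within `timeoutMillis`;
  // otherwise kills the child and returns None.
  def run(timeoutMillis: Long, command: String*): Option[Int] = {
    val process = new ProcessBuilder(command: _*)
      .inheritIO() // child shares the JVM's stdin/stdout/stderr
      .start()
    if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) // Java 8+
      Some(process.exitValue())
    else {
      process.destroyForcibly().waitFor() // Java 8+
      None
    }
  }
}
```

The Java 9 additions (`ProcessHandle`, `Process.pid()`, `Process.onExit()`) are left out on purpose so that the sketch still runs on Java 8.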

fanf · January 16, 2018, 6:43am

In Rudder, we switched to NuProcess after having used sys.process for some
time.

We switched because of the performance of NuProcess, especially its memory
consumption, thanks to the correct way of doing a fork under Linux, which
is not possible on the JVM otherwise. It was important for us, because we
need to fork a lot on a critical path (and support Java 8 for the
foreseeable future).

NuProcess is not easy, and far from idiomatic Scala, as you can see here:

https://github.com/Normation/rudder/blob/master/rudder-core/src/main/scala/com/normation/rudder/hooks/RunNuCommand.scala#L122


def run(cmd: Cmd, limit: Duration = Duration(30, TimeUnit.MINUTES)): Future[CmdResult] = {
  /*
   * Some information about NuProcess command line: what NuProcess call "commands" is
   * actually the command (first item in the array) and its parameters (following items).
   * So typically, to execute "/bin/ls /tmp/foo", you can't do:
   * - new NuProcessBuilder("/bin/ls /tmp/foo") => will fail silently
   * But you need to say:
   * - new NuProcessBuilder("/bin/ls", "/tmp/foo")
   * And if you want to pass more arguments:
   * - new NuProcessBuilder("/bin/ls", "-la", "/tmp/foo")
   * This fails:
   * - new NuProcessBuilder("/bin/ls", "-la /tmp/foo") => fails with: /bin/ls : invalid option -- ' '
   *
   * If no time limit is given, something like
   * - new NuProcessBuilder("/bin/cat")
   * Will stall forever.
   *
   * Some interesting error codes:
   * - Invocation of posix_spawn() failed, return code: 2, last error: 2
   *   => command not found
   * - Invocation of posix_spawn() failed, return code: 13, last error: 13

sys.process was much easier and idiomatic to use. In an interactive
session, I would use it every time over NuProcess. It could be further
tailored toward that use case.
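For comparison, the argument-splitting rule described in the NuProcess comment above has a counterpart in sys.process: a plain String is split on whitespace, while a Seq is passed through one element per argument. A small sketch, assuming an `echo` executable is on the PATH (`ArgvDemo` is a name made up for this example):

```scala
import scala.sys.process._

object ArgvDemo {
  // A String command is tokenized on whitespace: echo receives "a" and "b".
  val fromString: String = "echo a    b".!!.trim

  // A Seq is used as-is: echo receives the single argument "a    b".
  val fromSeq: String = Seq("echo", "a    b").!!.trim
}
```

So the Seq form is the one to use whenever an argument may contain spaces.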
There is also better-files, for a very limited set of common commands:

https://github.com/pathikrit/better-files/blob/master/README.md#unix-dsl

jvican · January 16, 2018, 6:37pm

NuProcess is a great discovery, I didn’t know about the library and I like it. It’s true though that the API isn’t as easy as the Java ProcessBuilder one.

Perhaps we can create a minimal Scala wrapper around it to simplify the API and, if it gets traction and is used, we can add it to the Scala Platform. There’s one wrapper for Clojure.

Thanks for the links @retronym @fanf.

tfinneid · January 17, 2018, 10:59am

Just curious, how are Process’s stdin/stdout/stderr streams a problem?

The documentation does say that it is recommended to use buffered input streams. Also, the documentation for BufferedReader’s read() states the conditions it checks for so that it does not block. So there is no need for multiple threads.
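One way to get by with a single reader, assuming it is acceptable for stderr to be interleaved with stdout, is `ProcessBuilder.redirectErrorStream(true)`. The sketch below uses a hypothetical `MergedRun` helper and a POSIX `sh`; it does not cover the case where the two streams must be kept apart or where stdin has to be fed at the same time:

```scala
import scala.io.Source

object MergedRun {
  // Runs `command` with stderr merged into stdout and returns
  // (exit code, combined output), reading on the calling thread only.
  def run(command: String*): (Int, String) = {
    val process = new ProcessBuilder(command: _*)
      .redirectErrorStream(true) // a single pipe carries both streams
      .start()
    process.getOutputStream.close() // nothing is sent to the child's stdin
    val source = Source.fromInputStream(process.getInputStream, "UTF-8")
    val output = try source.mkString finally source.close()
    // The pipe has reached end-of-stream, so waitFor cannot stall on unread output.
    (process.waitFor(), output)
  }
}
```

With only one pipe to drain, reading it to the end before calling `waitFor` is enough to avoid the full-buffer stall.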

It does of course require some fixture code to actually use the streams, but other than that I see few problems with that part of the Process class, except perhaps for a possible memory issue, as stated by @fanf.

The ProcessBuilder, on the other hand, is some degree of terrible in my opinion. I much prefer the Runtime.exec() methods; they are clean and simple.

I also think ProcessBuilder has a design fault: the builder and its data are not immutable. That means some thread or code could change the attributes of the ProcessBuilder and hence introduce unwanted changes into the child-process creation. Because it acts as a template, it can be changed at any time, and it might not be very visible when and what has changed.
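A sketch of what an immutable alternative could look like: a case class that describes the child process and creates a fresh ProcessBuilder only at start time. `Cmd`, `withEnv` and `in` are hypothetical names for this illustration, not an existing API:

```scala
import java.io.File

// Immutable description of a child process; every "modification" returns a copy.
final case class Cmd(
    argv: List[String],
    cwd: Option[File] = None,
    env: Map[String, String] = Map.empty
) {
  def withEnv(key: String, value: String): Cmd = copy(env = env + (key -> value))
  def in(dir: File): Cmd = copy(cwd = Some(dir))

  // The mutable ProcessBuilder lives only inside this call, so no other
  // code can alter it between configuration and start.
  def start(): java.lang.Process = {
    val pb = new ProcessBuilder(argv: _*)
    cwd.foreach(dir => pb.directory(dir))
    env.foreach { case (k, v) => pb.environment().put(k, v) }
    pb.start()
  }
}
```

A value such as `Cmd(List("sh", "-c", "exit 0"))` can then be shared between threads as a template, and deriving a variant with `withEnv` leaves the original untouched.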

I am none too fond of the Java 9 ProcessHandle concept either. It seems to me to be abstraction upon abstraction, which makes it difficult to reason about.

I can imagine a much simpler Process (ProcessBuilder) class than what exists in the JDK and NuProcess, which can be wrapped by several different classes to extend the basic Process class and in the end wrapped by a Scala Process class. But I am getting ahead of myself.

tfinneid · January 17, 2018, 11:09am

@fanf, could you elaborate on the memory issue of the process creation / forking problem you mention?

fanf · January 17, 2018, 11:30am

Well, it is the standard fork problem on Linux and other platforms, i.e. either you change your virtual memory overcommit parameters at the system level, or you need to take special care when you fork so that your child does not get as much memory as the parent. And it is a problem because, typically, the JVM (parent process) has a lot of memory allocated, whereas your “ls” doesn’t need 64GB to run. NuProcess does the fork with care; the JVM standard library utilities don’t.

We have some pointers here: https://www.rudder-project.org/redmine/issues/5617#note-3 (and other comments). It is also explained in the NuProcess README (the part about vfork: https://github.com/brettwooldridge/NuProcess).

tfinneid · January 17, 2018, 12:19pm

Thanks @fanf, I shall read it with interest.

shado23 · January 24, 2018, 11:04am

It looks like NuProcess’s claims about the JDK not using vfork are wrong, and it is in fact using vfork since Java 7:
http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/55186701bdbc

fanf · January 24, 2018, 11:27am

@shado23 as explained in the linked resources, we had to support JDK 6 at the time we chose NuProcess. And even after that, the JDK 7 implementation had a lot of problems and bugs (at least in earlier versions) that NuProcess didn’t have. I don’t remember what or where the bugs were (perhaps linked in the previous doc?), but I do recall that it was a nightmare to get a consistent experience across OpenJDK 7 versions.

And in any case, we also had to support JVMs other than OpenJDK (the IBM one at least), so for all these reasons NuProcess gave a more consistent experience whatever the runtime.

It may not be the case anymore, and perhaps all JVMs now correctly support vfork. I don’t know, and for my use case NuProcess is just the best solution, because I haven’t had to think about it since we adopted it.

shado23 · January 24, 2018, 12:41pm

Sure. I don’t have first-hand experience with either solution. I was just surprised to find that one of the main claims they make about why it’s better apparently doesn’t hold water. It may ultimately still be the better solution because of other concerns.

brettwooldridge · March 5, 2018, 3:35pm

Sorry to jump in on an old thread. Author of NuProcess here. A couple of points to note…

NuProcess did start life as a Java 6 library, where vfork() was not used on Linux, so the points around the memory difference of vfork() no longer hold for Java 7+.

However, still relevant is the memory overhead associated with threads. On Java, by default, the per-thread stack size is 1 MB. If you take a typical external interactive process, with stdout and stdin requirements, then a typical pattern is a “pumper” thread for each stream. Now, if you need to spawn 500 processes (trust me, users like BitBucket do this), you’re looking at 1 Gigabyte for stack space. Even if you cut the thread stack to say 128 KB (on the edge of dangerous) via -Xss, you’d still have a 128 MB stack allocation. If you also need to process stderr, non-merged, throw in another thread per process.

Performance-wise, throw in a heaping helping of context-switching overhead for the 1000-1500 threads.
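The arithmetic behind these figures, as a small sketch. `StackBudget` is a hypothetical helper, and the numbers are simply threads multiplied by the configured stack size, counted in binary units:

```scala
object StackBudget {
  val KiB: Long = 1024L
  val MiB: Long = 1024L * KiB

  // Total stack space for `processes` children that each need
  // `threadsPerProcess` pumper threads with stacks of `stackSize` bytes.
  def stackBytes(processes: Int, threadsPerProcess: Int, stackSize: Long): Long =
    processes.toLong * threadsPerProcess * stackSize
}
```

With 500 processes and two pumper threads each, `StackBudget.stackBytes(500, 2, StackBudget.MiB)` comes to 1000 MiB, roughly the 1 Gigabyte quoted above. Dropping the stack to 128 KiB gives 125 MiB, and adding a third thread per process for stderr at the default size gives 1500 MiB.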

NuProcess on the other hand is non-blocking. NuProcess can likely handle 500 processes with a single thread, but will use CPU Core count / 2 by default (auto mode).

The “cons” of NuProcess is that it only supports Linux (x86 and ARM), MacOS X (and BSD variants), and Windows. There currently is no Solaris or IBM z-Series mainframe support. Though pull requests are welcome!

EDIT: Forgot to mention that Java on MacOS X still uses regular fork()/clone().