What do we do with `scala.sys.process`?


#1

Good morning everyone :slight_smile:

Recently, I found myself needing to spawn interactive system processes, where the processes are long-running and interact with the host application via stdin/stdout. I started using the Java Process and ProcessBuilder API, and realized that there are major pitfalls in how the API is supposed to be used.

There is lots of literature on this topic on SO, but I’d like to explain the most important issue: waitFor blocks, and may never return if the process has produced output (on stdout or stderr) that the client hasn’t consumed; exitValue does not block, but throws IllegalThreadStateException until the process has terminated. Both stdout and stderr have to be consumed in independent threads so that the process doesn’t stall on a full pipe buffer. The threads that do this are known as StreamGobblers.
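For readers who haven’t seen the pattern, here is a minimal Java sketch of the stream-gobbler approach. The names `GobblerSketch`, `Result`, `gobble` and `run` are made up for illustration; only the `Process` and `ProcessBuilder` calls come from the JDK. It assumes the child needs no stdin and that its output fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of any library. */
final class GobblerSketch {

    /** Exit code plus everything the child wrote to each stream. */
    static final class Result {
        final int exitCode;
        final String stdout;
        final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    /** Starts a daemon thread that copies the stream into the sink until EOF. */
    private static Thread gobble(InputStream in, ByteArrayOutputStream sink) {
        Thread t = new Thread(() -> {
            byte[] buf = new byte[8192];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    sink.write(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Runs the command with one gobbler per output stream; the caller is the third thread. */
    static Result run(String... command) {
        try {
            Process p = new ProcessBuilder(command).start();
            p.getOutputStream().close(); // this sketch sends nothing to the child's stdin
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ByteArrayOutputStream err = new ByteArrayOutputStream();
            Thread outThread = gobble(p.getInputStream(), out);
            Thread errThread = gobble(p.getErrorStream(), err);
            int exitCode = p.waitFor(); // safe to block: both pipes are being drained
            outThread.join();
            errThread.join();
            return new Result(exitCode,
                    new String(out.toByteArray(), StandardCharsets.UTF_8),
                    new String(err.toByteArray(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```

On a machine with a POSIX `sh` on the PATH (an assumption), `GobblerSketch.run("sh", "-c", "echo out; echo err 1>&2; exit 3")` returns once both streams have been drained, however much the child prints.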

Read up on some of the pitfalls in this Java World article.

The bottom line is that using the Java process API correctly requires at least three threads running concurrently and lots of boilerplate. I turned to scala.sys.process to see if this issue was addressed, and it is.

I think that scala.sys.process is actually quite nice (though I prefer the aesthetics of the Java process API), and I’d like to know what we’re going to do with it in the upcoming Scala library modularisation. I have a few questions:

  1. Do we want to add it in the Scala Platform as it is?
  2. Does anyone know of a Scala or Java alternative?
  3. In case there’s no alternative and scala.sys.process is eventually used, would someone want to maintain it?

SPP Meeting 5th February 2018, 5 PM CEST
#2

I am open for 3, if eligible.


#3

That’s great news, thank you for volunteering! :grin:


#4

I think sys.process would benefit from some iterations as a module. It’s handy and essential functionality, but I always spend time coping with its internal complexity.


#5

I wholeheartedly agree :slight_smile: What are your gripes with it?

I guess it would be great to gather some statistics on how often scala.sys.process is depended upon. I’ll try to write a rule to detect it in @olafurpg’s corpus.


#6

I liked the look of NuProcess as a third party alternative to the JDK standard Process. It can spawn without consuming JVM threads for each of them. Chris Hunt wrote an adaptor between this library and Akka streams.

The standard process API in Java has gotten quite a bit easier to use in Java 8, and has some key missing features added in 9. Our wrapper isn’t (easily) able to expose those new features while we still need to support Java 8. I guess that’s an inherent problem of wrappers: they add a delay before new features are exposed.
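As a rough illustration of the kind of additions meant here (my own selection, the post does not name specific methods): Java 8 added `Process.waitFor(long, TimeUnit)` and `destroyForcibly()`, and Java 9 added `Process.onExit()`. The sketch below needs Java 9 or later to compile; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of any library. */
final class NewerProcessApiSketch {

    /** Starts a child that shares the JVM's stdin/stdout/stderr, so no pipe can fill up. */
    static Process start(String... command) {
        try {
            return new ProcessBuilder(command).inheritIO().start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Java 8: bounded wait. Kills the child forcibly and returns false on timeout. */
    static boolean finishedWithin(Process p, long millis) {
        try {
            if (p.waitFor(millis, TimeUnit.MILLISECONDS)) {
                return true;
            }
            p.destroyForcibly();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    /** Java 9: the exit code as a future, without blocking the calling thread. */
    static CompletableFuture<Integer> exitCode(Process p) {
        return p.onExit().thenApply(Process::exitValue);
    }
}
```

A wrapper that must still compile against Java 8 cannot call `onExit()` directly, which is the delay described above.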


#7

In Rudder, we switched to NuProcess after having used sys.process for some
time.

We switched because of the performance of NuProcess, especially its memory
consumption, thanks to the correct way it does a fork under Linux, which
is not possible on the JVM otherwise. It was important for us, because we
need to fork a lot on a critical path (and support Java 8 for the foreseeable
future).

NuProcess is not easy, and far from scala idiomatic, as you can see here:

sys.process was much easier and idiomatic to use. In an interactive
session, I would use it every time over NuProcess. It could be further
tailored toward that use case.
There is also “better files” for a very limited set of common commands:


#8

NuProcess is a great discovery, I didn’t know about the library and I like it. It’s true though that the API isn’t as easy as the Java ProcessBuilder one.

Perhaps we can create a minimal Scala wrapper around it to simplify the API and, if it gets traction and is used, we can add it to the Scala Platform. There’s one wrapper for Clojure.

Thanks for the links @retronym @fanf.


#9

Just curious, how are Process’s stdin/stdout/stderr streams a problem?

The documentation does say that it is recommended to use buffered input streams. Also, the documentation for BufferedReader’s read() states the conditions it checks so that it does not block. So there is no need for multiple threads.

It does of course require some fixture code to actually use the streams, but other than that I see few problems with that part of the Process class. Except perhaps for a possible memory issue, as stated by @fanf.
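A minimal sketch of what single-threaded fixture code could look like, under one important assumption: stderr is merged into stdout with `redirectErrorStream(true)`, so there is only one pipe to drain. If the two streams must stay separate, this approach does not apply as written. `MergedOutputSketch` and `runMerged` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of any library. */
final class MergedOutputSketch {

    /** Runs the command and returns its combined stdout and stderr, read on the calling thread. */
    static String runMerged(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true) // stderr goes into the same pipe as stdout
                    .start();
            p.getOutputStream().close(); // this sketch sends nothing to the child's stdin
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) { // drains the only pipe until EOF
                    output.append(line).append('\n');
                }
            }
            p.waitFor(); // nothing left to read, so this cannot stall on a full pipe
            return output.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```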

The ProcessBuilder, on the other hand, is some degree of terrible in my opinion. I much prefer the Runtime.exec() methods; they are clean and simple.

Also, I think the ProcessBuilder has a design fault: the builder and its data are mutable. That means some thread or code could change the attributes of the ProcessBuilder and hence introduce unwanted changes into the child-process creation. Because it acts as a template, it can be changed at any time, and it might not be very visible when and what has changed.
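The mutability concern can be shown in a few lines. In this sketch the class and method names are hypothetical; `command()` and `environment()` are the real JDK accessors, and both hand out the builder’s live, mutable state. No process is started.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class, not part of any library. */
final class MutableBuilderSketch {

    /** Shows a shared "template" builder being changed behind its owner's back. */
    static List<String> commandAfterTampering() {
        ProcessBuilder template = new ProcessBuilder("ls", "-l");
        // Any code holding the reference can edit the live state:
        template.command().add("/tmp");            // command() returns the builder's own list
        template.environment().put("LC_ALL", "C"); // environment() is a live, mutable map too
        return new ArrayList<>(template.command()); // now [ls, -l, /tmp]
    }

    /** One defensive option: keep a read-only spec and create a fresh builder per launch. */
    static ProcessBuilder freshBuilder(List<String> spec) {
        // Copy first: the ProcessBuilder(List) constructor keeps the list it is given.
        return new ProcessBuilder(new ArrayList<>(spec));
    }
}
```

Whether building a fresh builder per launch is worth the extra allocation is a design choice; it simply keeps later edits from leaking into a shared template.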

I am none too fond of the Java 9 ProcessHandle concept either. It seems to me to be abstraction upon abstraction, which makes it difficult to reason about.

I can imagine a much simpler Process (ProcessBuilder) class than what exists in the JDK and NuProcess, which can be wrapped by several different classes to extend the basic Process class and in the end wrapped by a Scala Process class. But I am getting ahead of myself.


#10

@fanf, could you elaborate on the memory issue of the process creation / forking problem you mention?


#11

Well, it is the standard fork problem on Linux and other platforms. That is, either you change your virtual memory overcommit parameters at the system level, or you need to take special care when you fork so that your child does not get as much memory as the parent. And it is a problem because the JVM (the parent process) typically has a lot of memory allocated, whereas your “ls” doesn’t need 64GB to run. NuProcess does the fork with care; the JVM standard library utilities don’t.

We have some pointers here: https://www.rudder-project.org/redmine/issues/5617#note-3 (and other comments). It is also explained in the NuProcess README (the part about vfork: https://github.com/brettwooldridge/NuProcess).


#12

Thanks @fanf, I shall read it with interest.


#13

It looks like NuProcess’s claims about the JDK not using vfork are wrong, and it is in fact using vfork since Java 7:
http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/55186701bdbc


#14

@shado23 as explained in the linked resources, we had to support JDK6 at the time we chose NuProcess. And even after that, the JDK7 implementation had a lot of problems and bugs (at least in earlier versions) that NuProcess didn’t have. I don’t remember what or where the bugs are (perhaps linked in the previous doc?), but I do recall that it was a nightmare to get a consistent experience across OpenJDK 7 versions.

And in any case, we also had to support JVMs other than OpenJDK (the IBM one at least), so for all these reasons NuProcess gave a more consistent experience whatever the runtime.

It may not be the case anymore, and perhaps all JVMs now correctly support vfork - I don’t know, and for my use case, NuProcess is just the best solution because I haven’t had to think about it since we adopted it :slight_smile:


#15

Sure. I don’t have first-hand experience with either solution. I was just surprised to find that one of the main claims they make about why it’s better apparently doesn’t hold water. It may ultimately still be the better solution because of other concerns.


#16

Sorry to jump in on an old thread. Author of NuProcess here. A couple of points to note…

NuProcess did start life as a Java 6 library, where vfork() was not used on Linux, so the points around the memory difference of vfork() no longer hold for Java 7+.

However, still relevant is the memory overhead associated with threads. On Java, by default, the per-thread stack size is 1 MB. If you take a typical external interactive process, with stdout and stdin requirements, then a typical pattern is a “pumper” thread for each stream. Now, if you need to spawn 500 processes (trust me, users like BitBucket do this), you’re looking at 1 gigabyte of stack space. Even if you cut the thread stack to, say, 128 KB (on the edge of dangerous) via -Xss, you’d still have a 128 MB stack allocation. If you also need to process stderr, non-merged, throw in another thread per process.
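The arithmetic behind those figures, as a small sketch. The class and method names are hypothetical, and the units are binary (1 MB taken as 1024 × 1024 bytes), which is an assumption about how the numbers above were rounded.

```java
/** Hypothetical helper for the back-of-the-envelope stack budget. */
final class StackBudgetSketch {

    static final long KIB = 1024L;
    static final long MIB = 1024L * KIB;

    /** Total stack space reserved for the pumper threads, in bytes. */
    static long stackBytes(int processes, int threadsPerProcess, long stackBytesPerThread) {
        return (long) processes * threadsPerProcess * stackBytesPerThread;
    }

    /** Same figure expressed in whole MiB. */
    static long stackMiB(int processes, int threadsPerProcess, long stackBytesPerThread) {
        return stackBytes(processes, threadsPerProcess, stackBytesPerThread) / MIB;
    }
}
```

With these definitions, `stackMiB(500, 2, MIB)` is 1000 (roughly the 1 gigabyte quoted), `stackMiB(500, 2, 128 * KIB)` is 125 (close to the 128 MB quoted), and a third thread per process for stderr gives `stackMiB(500, 3, MIB)` = 1500.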

Performance-wise, throw in a heaping helping of context-switching overhead for the 1000-1500 threads.

NuProcess on the other hand is non-blocking. NuProcess can likely handle 500 processes with a single thread, but will use CPU Core count / 2 by default (auto mode).

The “con” of NuProcess is that it only supports Linux (x86 and ARM), MacOS X (and BSD variants), and Windows. There currently is no Solaris or IBM z-Series mainframe support. Though pull requests are welcome! :wink:

EDIT: Forgot to mention that Java on MacOS X still uses regular fork()/clone().