Improving toInt and friends

#1

Parsing strings with toInt and friends (toByte, toShort, toLong, toFloat and toDouble) has some deficiencies.

First off, they’re not safe: they throw NumberFormatException in the failure case. They can be wrapped with try/catch or Try, but that’s not great either: it leads not only to boilerplate, but also makes the failure case orders of magnitude slower than the success case.

Second off, they’re not very performant. A little-known feature is that they handle not only the “normal” digits 0-9, but also every other BMP Unicode script that represents digits. According to @Ichoran, handling those additional characters costs up to a factor of 2 in performance over an implementation that only handles the “normal” 0-9 digits.

That makes the current versions of these methods suboptimal for almost all use cases. If you want to parse a large file with billions of numbers, and are willing to crash with no hope of recovery if your data file is corrupt, you will almost always prefer the faster version and forgo being able to parse the more esoteric digits.

If you have some parsing task that can reasonably be expected to fail, you want a safe interface that returns Option. I suggest splitting the current implementations in two, to better meet those two use cases.

  • For the billions-of-rows-or-fail use case: a fast, unsafe variation, which would not be able to parse the more esoteric digits, but would be about 2 times faster in the success case.

  • For the safe and solid general-purpose use case: a safe, Option-returning variation that parses all BMP scripts, which would be a bit slower than the current toInt (and friends) in the success case, and a lot faster in the failure case.
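The split could look something like the following extension methods. This is only a rough sketch: the names are illustrative, not a proposed final API, and the safe variant here just wraps the existing parser in try/catch rather than showing a real single-pass implementation.

```scala
// Rough API sketch; the name toIntOption is illustrative, not final.
implicit class SaferStringOps(s: String) {
  // Safe variant: None on malformed input, no exception escapes.
  // (A real implementation would parse directly instead of catching,
  // to keep the failure case fast.)
  def toIntOption: Option[Int] =
    try Some(s.toInt) catch { case _: NumberFormatException => None }
}
```

The fast, ASCII-only unsafe variant would sit next to it with the same shape but a plain Int result.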

I have an implementation for the Option-returning integral types which I’m happy to submit a PR for. I don’t yet have an implementation for the floating point formats, nor for the fast unsafe methods. I’m fairly confident I can write the fast ones, but I don’t think I’ll be able to get the floating point ones right: parsing floating point and returning the same results as the current (delegated-to-Java) implementation, including all corner cases, is really hard, probably too hard for me. The alternative is wrapping the Java methods with try/catch, which gives the nicer interface but still suffers the performance penalty in the failure case.

This post aims to get the ball rolling on discussing whether this is what we want, and if so, what Scala version it could target. Making the current versions fast would be a breaking change for everyone who currently uses them to parse digits in non-Latin scripts. Adding the safe options breaks forward compatibility. That makes me think 2.13 is the earliest this could land.

#2

Well, of course they’re not: they use java.lang.Integer.parseInt, which supports UTF-8, which Java tries to support everywhere.
Well, I’m in favor of removing the NumberFormatException (for Option[Int]), because it detangles Scala more from the Java side. But I’m not sure it’s a good idea to replace java.lang.Integer.parseInt as the backing function.

#3

For the Option case, if you want reasonable performance in the failure case, you more or less have to replace Integer.parseInt as the backing function. If you want to pre-validate, you’re 90% of the way to re-implementing the entire thing already. That’s the path I originally went down when experimenting, and by the time I had good validation, I realized I might as well make the validation actually return the result. To me, validating and then delegating makes no sense.
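As a sketch of what “validation that also returns the result” could look like, here is a single-pass, ASCII-only Option[Int] parser (illustrative only, not the actual strawman code). It accumulates negatively, the same trick Integer.parseInt uses, so that Int.MinValue parses without special-casing:

```scala
// Illustrative single-pass parser: validates and computes in one go.
def parseIntOption(s: String): Option[Int] = {
  if (s.isEmpty) return None
  val neg = s.charAt(0) == '-'
  var i = if (neg || s.charAt(0) == '+') 1 else 0
  if (i >= s.length) return None          // bare "+" or "-"
  // Accumulate toward the negative side; Int.MinValue has no positive twin.
  val limit = if (neg) Int.MinValue else -Int.MaxValue
  var acc = 0
  while (i < s.length) {
    val c = s.charAt(i)
    if (c < '0' || c > '9') return None   // ASCII digits only in this sketch
    val d = c - '0'
    if (acc < limit / 10) return None     // multiplying would overflow
    acc *= 10
    if (acc < limit + d) return None      // subtracting would overflow
    acc -= d
    i += 1
  }
  Some(if (neg) acc else -acc)
}
```

The two comparisons against `limit` are the whole overflow story; everything else is plain digit accumulation.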

Since the fast version and the Option version will probably (not yet implemented though!) be very similar, differing only in how they return success/failure and possibly in how they get the value of a digit (when not handling the other BMP digits), it’s a small step from there to the faster but less versatile version too.

I’m not sure what UTF-8 has to do with anything. JVM strings are not UTF-8, but UTF-16.

#4

Floating point parsing is tricky to get exactly right. Although it is possible to write parsers that are faster in nearly all cases (I have done so twice), it is so fiddly that I would recommend validating and then calling parseDouble instead.

Int parsing is easy if you’re allowed to use Longs, but trickier if you must stay with Int. Long parsing is tricky regardless. (If you don’t care about efficiency everything gets easier.)
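The “easy if you’re allowed to use Longs” approach can be sketched like this (my illustration, not Ichoran’s code): accumulate into a Long, cap the magnitude so the accumulator itself can never overflow, and do a single range check at the end.

```scala
// Illustrative: Int parsing with a Long accumulator, ASCII digits only.
def parseIntViaLong(s: String): Option[Int] = {
  if (s.isEmpty) return None
  val neg = s.charAt(0) == '-'
  var i = if (neg || s.charAt(0) == '+') 1 else 0
  if (i >= s.length) return None
  var acc = 0L
  while (i < s.length) {
    val c = s.charAt(i)
    if (c < '0' || c > '9') return None
    acc = acc * 10 + (c - '0')
    // Cap the magnitude: keeps acc * 10 + 9 far below Long overflow,
    // and anything this large is out of Int range anyway.
    if (acc > 0xFFFFFFFFL) return None
    i += 1
  }
  val signed = if (neg) -acc else acc
  if (signed < Int.MinValue || signed > Int.MaxValue) None
  else Some(signed.toInt)
}
```

No per-digit overflow reasoning is needed, which is exactly why the same trick is unavailable for Long itself.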

#5

I have a strawman so we have something to look at: https://github.com/martijnhoekstra/scala/commit/f2639ffbf05e8da98997904d6b628e14925ffc4a

#6

I think it is a good idea to decouple/detangle from Java code where practical.

#7

Your implementations disregard the problem of overflow during parsing.

To get a more accurate view of the complexity of the problem, you can have a look at the (BSD-licensed) implementation of Long.parseLong in Scala.js: https://github.com/scala-js/scala-js/blob/7f8979dedaaa66d5f3e51c328e73a78c0fad49ec/javalanglib/src/main/scala/java/lang/Long.scala#L170
Note that this one is a bit over-complicated for the JVM, since it tries to minimize the number of Long operations and delegates to (a form of) Integer.parseInt. An actual implementation for the JVM would make different trade-offs and be a bit simpler.

#8

Which case is going wrong? I don’t see it, and it handles overflow correctly as far as my tests go.

#9

Try "2147483649".parseInt. I haven’t tested your code, but I don’t see anything in the code that would detect the overflow for that string.

#10

Looks good to me: that should hit the first if in step (if (agg > 0) None).

EDIT: Maybe you were looking at the Short or Byte cases? They are implemented with an Int, since that should be no more expensive, so they can be a little simpler and just detect the overflow directly.
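That Short/Byte approach can be sketched as: parse at Int width, then range-check once. The sketch below uses the built-in toInt (with try/catch) as a stand-in for the hand-rolled Int parser:

```scala
// Sketch: do the parse at Int width, then check the narrower range.
// toInt here stands in for the real hand-rolled parser.
def toShortOption(s: String): Option[Short] =
  try {
    val i = s.toInt
    if (i >= Short.MinValue && i <= Short.MaxValue) Some(i.toShort) else None
  } catch { case _: NumberFormatException => None }
```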

EDIT2: I can’t make a derivative work from the BSD-licensed code, since while the license is compatible, I can’t CLA other people’s work.

#11

You’re right, my previous example correctly returns None. Here is one that fails to detect the overflow:

scala> "10737418230".parseInt
res16: Option[Int] = Some(2147483638)

scala> Integer.parseInt("10737418230")
java.lang.NumberFormatException: For input string: "10737418230"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Integer.parseInt(Integer.java:583)
  at java.lang.Integer.parseInt(Integer.java:615)
  ... 32 elided

That’s because -1073741823 * 10 overflows, but still yields a result that is < 0.
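The wrap-around can be reproduced directly (assuming the negative-accumulator scheme described above):

```scala
val acc: Int = -1073741823   // accumulator after consuming "1073741823"
val wrapped = acc * 10       // true value -10737418230 doesn't fit in Int
// The product wraps around but lands on a value that is still negative,
// so a sign check like (agg > 0) never fires; negating it yields the
// bogus Some(2147483638) seen above.
println(wrapped)             // -2147483638
println(-wrapped)            // 2147483638
```

A pre-multiplication guard such as `acc < Int.MinValue / 10` does catch this case, since -1073741823 is already below -214748364 before the multiply.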

#12

Yikes, thanks! That should be fixed up.

#13

That’s my code, and I can CLA it to Scala. In fact, in CLA-ing it to Scala.js, I have already done so (it’s the same CLA).

#14

Flatmapping is not a good way to retain performance.

Sebastien already caught the overflow bug.

I’m not sure the recursive inner function helps anything. If anything it encourages repetition because you repeat the flatMap handling code (unnecessarily; you could just load the correct parameter to begin with).

It’s surprisingly tricky to get this efficient, correct, and clear! (And this is just integers, where you don’t have to worry about decimal vs. binary fractional representation!)

Edit: you forgot "True" and friends; you need a toLowerCase in there.
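Assuming the point is a boolean parser that matches on literal strings: java.lang.Boolean.parseBoolean is case-insensitive, so an Option variant has to normalize case to accept "True" and friends the same way. A minimal sketch (toBooleanOption here is illustrative):

```scala
// Sketch: lowercase first, so "True"/"TRUE" parse like
// java.lang.Boolean.parseBoolean would accept them.
def toBooleanOption(s: String): Option[Boolean] =
  s.toLowerCase match {
    case "true"  => Some(true)
    case "false" => Some(false)
    case _       => None
  }
```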

#15

That’s pretty good! But you discard only a leading ‘0’, not ‘०’ and the 30-odd other zero variants, so your implementation will fail on those.

#16

Oops, good point, thanks! And apparently JavaScript’s parseInt, which we delegate to, doesn’t even support non-ASCII scripts, so our implementation is completely broken for non-ASCII scripts. I can’t believe we never tested that.

#17

Thanks for the feedback. The version I currently have is

I’ll see if I can get rid of the inner function, as well as write some (more) tests to prevent further embarrassment :slight_smile:

#18

A perhaps better concrete implementation of (the integer parts of) this proposal is at https://github.com/martijnhoekstra/numparsers/blob/master/src/main/scala/StringConversions.scala

These “fast” implementations are best left out IMO: they are faster than the baseline, but not by much (see https://docs.google.com/spreadsheets/d/1TohVfP_KDkEvVV2fvRMXYvSYTYNkntpkGdP-NSVcikg/edit?usp=sharing for benchmark results on my machine). Somebody else might have better-performing ones, where it would be more reasonable to discuss replacing the current implementations with fast ones.

The performance loss for returning Option in the synthetic benchmark is IMO reasonable. This could be different under high memory pressure (which I’m not sure how to test properly), but I’m not proposing to rule out an unsafe version that doesn’t suffer from that problem. The performance gain in the unhappy case is still ~100x.

#19

Aside from the obvious difficulties in actually implementing a correct “toInt” which we have seen here, I’m not sure you have yet argued a convincing case for your claimed “deficiencies”.

Do you have any evidence for this?

I have always thought that the old “never use exceptions for control flow” canard was not true on the JVM, and indeed the first actual hard evidence I can find on Google suggests that throwing is only 20% slower than the obvious alternatives (tested in 2011): http://stackoverflow.com/questions/299068/how-slow-are-java-exceptions#comment6390996_5610356

I decided to test the long-term performance of the three implementations compared to a control implementation that checks for failure without reporting. The process has a failure rate of about 4%. An iteration of a test invokes the process 10000 times against one of the strategies. Each strategy is tested 1000 times, and the last 900 runs are used to generate the stats. Here are the average times in nanos:

  • Control: 338
  • Exception: 429
  • Result: 348
  • Sentinel: 345

(“Result” above corresponds to “Option” in the current discussion.)

(Sorry this isn’t a great cite, but it’s the best I could find in 5 mins.)

#20

Just read the answers in what you linked! They show huge slowdowns with regular exceptions, and still pretty hefty ones with stackless exceptions.

I gave a ScalaDays talk on this one year (2013 or 2014, I think? I don’t recall any more), pointing out that there were ways to use stackless exceptions to get closer to the zero-overhead case when exceptions are somewhat uncommon but not unfathomably rare.