Parsing strings with toInt and friends (toByte, toShort, toLong, toFloat and toDouble) has some deficiencies.
First off, they’re not safe: they throw a NumberFormatException in the failure case. They can be wrapped with try/catch or Try, but that’s not great either. It not only leads to boilerplate, it also makes the failure case orders of magnitude slower than the success case.
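To make the boilerplate concrete, here is a minimal sketch of the two wrappers. The object and method names (WrappedParsing, toIntViaTry, toIntViaCatch) are made up for illustration. Both give an Option, but on bad input each one still builds and catches an exception, which is where the cost comes from.

```scala
import scala.util.Try

// Hypothetical helper names, for illustration only.
object WrappedParsing {
  // Try turns the thrown exception into a Failure, which toOption maps to None.
  def toIntViaTry(s: String): Option[Int] = Try(s.toInt).toOption

  // The same idea with an explicit try/catch.
  def toIntViaCatch(s: String): Option[Int] =
    try Some(s.toInt)
    catch { case _: NumberFormatException => None }
}
```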
Second, they’re not very performant. A little-known feature is that they handle not only the “normal” 0-9, but also the digits of all other scripts in the Unicode BMP. According to @Ichoran, handling those additional characters costs up to a factor of 2 in performance compared to an implementation that only handles the “normal” 0-9 digits.
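A small illustration of the digit handling, using Arabic-Indic digits written as Unicode escapes. The object name UnicodeDigits is made up for the example. This relies on toInt delegating to Java's Integer.parseInt, which classifies characters with Character.digit.

```scala
// Hypothetical object name, for illustration only.
object UnicodeDigits {
  // Arabic-Indic digits one, two, three (U+0661, U+0662, U+0663).
  val arabicIndic: String = "\u0661\u0662\u0663"

  // Parses just like the ASCII string "123" would.
  val parsed: Int = arabicIndic.toInt // 123

  // The per-character lookup behind it.
  val single: Int = Character.digit('\u0663', 10) // 3
}
```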
That makes the current versions of these methods suboptimal for almost all use cases. If you want to parse a large file with billions of numbers, and crash if your data file is corrupt with no hope for recovery, you will almost always prefer the faster version, and forgo being able to parse the more esoteric digits.
If you have some parsing task that can reasonably be expected to fail, you want to have a safe interface that returns Option. I suggest splitting the current implementations in two, to better meet those two use cases.
- For the “billions of rows or otherwise fail” use case: a fast, unsafe variation, which would not be able to parse the more esoteric digits, but would be about 2 times faster in the success case.
- For the safe and solid general purpose use case: a safe, Option-returning variation that parses all BMP scripts, which would be a bit slower than the current toInt (and friends) in the success case, and a lot faster in the failure case.
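As a rough sketch of what the two variations could look like for Int: this is not the implementation I have, and the names ParseSketch, parseIntOption and parseIntAsciiUnsafe are placeholders. It accumulates into a Long to keep the overflow check simple, and I haven't benchmarked it.

```scala
// Placeholder names; a sketch of the idea, not a proposed API.
object ParseSketch {
  private val Limit = Int.MaxValue.toLong + 1 // magnitude of Int.MinValue

  // Safe variation: accepts any digit Character.digit knows, never throws.
  def parseIntOption(s: String): Option[Int] = {
    val len = s.length
    if (len == 0) return None
    val first = s.charAt(0)
    val negative = first == '-'
    var i = if (negative || first == '+') 1 else 0
    if (i == len) return None // a lone sign is not a number
    var acc = 0L
    while (i < len) {
      val d = Character.digit(s.charAt(i), 10)
      if (d < 0) return None // not a digit in any script
      acc = acc * 10 + d
      if (acc > Limit) return None // out of Int range either way
      i += 1
    }
    val signed = if (negative) -acc else acc
    if (signed > Int.MaxValue) None else Some(signed.toInt)
  }

  // Fast variation: ASCII 0-9 only, throws on anything else.
  def parseIntAsciiUnsafe(s: String): Int = {
    val len = s.length
    if (len == 0) throw new NumberFormatException("empty string")
    val first = s.charAt(0)
    val negative = first == '-'
    var i = if (negative || first == '+') 1 else 0
    if (i == len) throw new NumberFormatException(s)
    var acc = 0L
    while (i < len) {
      val d = s.charAt(i) - '0' // plain subtraction, no Unicode lookup
      if (d < 0 || d > 9) throw new NumberFormatException(s)
      acc = acc * 10 + d
      if (acc > Limit) throw new NumberFormatException(s)
      i += 1
    }
    val signed = if (negative) -acc else acc
    if (signed > Int.MaxValue) throw new NumberFormatException(s)
    signed.toInt
  }
}
```

The only difference in the inner loop is the digit lookup: Character.digit for the safe one, a subtraction and range check for the fast one.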
I have an implementation of the Option-returning methods for the integral types, which I’m happy to submit a PR for. I don’t yet have an implementation for the floating point formats, nor for the fast unsafe methods. I’m fairly confident I can write the fast ones, but I don’t think I’ll be able to get the floating point ones right: parsing floating point and returning the same results as the current (delegated to Java) implementation, including all corner cases, is really hard, probably too hard for me. The fallback is wrapping the existing methods with try/catch, which gives the nicer interface but still suffers the performance penalty in the failure case.
This post aims to get the ball rolling on discussing whether this is what we want, and if so, which Scala version it could target. Making the current versions fast would be a breaking change for everyone who currently uses them to parse digits in non-Latin scripts. Adding the safe options breaks forward compatibility. That makes me think that 2.13 is the earliest this could be done.