Optimizing BigInt for small integers

The Scala repo requires that every commit in a PR passes CI. This makes it easier to bisect or partially revert a PR, and generally leads to a better commit history.

Actually I found a way to preserve binary compatibility in 2.13, at the price of slightly increased memory usage.

I’m changing the BigInt class definition to


/** An arbitrary integer type; wraps `java.math.BigInteger`, with optimization for small values that can
 * be encoded in a `Long`.
 */
final class BigInt private (_bigInteger: BigInteger, _long: Long)
  extends ScalaNumber
    with ScalaNumericConversions
    with Serializable
    with Ordered[BigInt]
{
  // Class invariant: if the number fits in a Long, then _bigInteger is null and the number is stored in _long
  //                  otherwise, _bigInteger stores the number and _long = 0L

  def this(_bigInteger: BigInteger) = this(
    if (_bigInteger.bitLength <= 63) null else _bigInteger,
    if (_bigInteger.bitLength <= 63) _bigInteger.longValue else 0L
  )

Then BigInt stays final, and we can have optimized logic for small values.
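To make the invariant concrete, here is a minimal standalone sketch of the same two-field encoding. The class name `SmallOrBig` and its members are hypothetical stand-ins for illustration; this is not the code from the PR, and it implements only addition.

```scala
import java.math.BigInteger

// Hypothetical stand-in for the proposed encoding, not the actual scala.math.BigInt.
// Invariant: if the value fits in a Long, big is null and small holds the value;
//            otherwise big holds the value and small is 0L.
final class SmallOrBig private (private val big: BigInteger, private val small: Long) {
  def isSmall: Boolean = big == null

  def toBigInteger: BigInteger =
    if (isSmall) BigInteger.valueOf(small) else big

  def +(that: SmallOrBig): SmallOrBig =
    if (isSmall && that.isSmall) {
      val sum = small + that.small
      // Overflow happened iff both operands have a sign different from the sum's sign.
      if (((small ^ sum) & (that.small ^ sum)) >= 0L) new SmallOrBig(null, sum)
      else SmallOrBig(toBigInteger.add(that.toBigInteger))
    } else SmallOrBig(toBigInteger.add(that.toBigInteger))

  override def equals(other: Any): Boolean = other match {
    case that: SmallOrBig =>
      if (isSmall) that.isSmall && small == that.small
      else !that.isSmall && big == that.big
    case _ => false
  }

  override def hashCode: Int =
    if (isSmall) java.lang.Long.hashCode(small) else big.hashCode

  override def toString: String =
    if (isSmall) small.toString else big.toString
}

object SmallOrBig {
  def apply(l: Long): SmallOrBig = new SmallOrBig(null, l)

  // Normalizes on construction so the invariant always holds.
  def apply(b: BigInteger): SmallOrBig =
    if (b.bitLength <= 63) new SmallOrBig(null, b.longValue)
    else new SmallOrBig(b, 0L)
}
```

In this sketch the fast path never allocates a BigInteger, and a result that overflows the Long range falls back to BigInteger arithmetic and is renormalized by the companion's `apply`.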

Compared to the subclassing proposal:

  • in the case of a long-sized BigInt, we save the price of the BigInteger instance, so the 4/8 bytes of the reference are not an issue,

  • in the case of a larger BigInt, we now store an unneeded Long field, which implies a maximal memory overhead of around 10%; that maximal overhead is computed for a BigInt that only just fails to fit in a Long.

I conjecture that BigInt has two major use cases in the Scala world:

  1. use as a generic integer where most of the instances fit in a Long, but BigInt is used to avoid problems related to overflow; that’s how we use BigInt (or the optimized SafeLong equivalent in Spire)

  2. use for cryptographic purposes.

In use case 1, we’ll have a net win. In use case 2, the overhead is likely to be small, as the numbers will be much larger than a Long.

I propose to optimize BigInt while preserving binary compatibility in 2.13, running it against the community build as an additional safety measure.

Then split the logic using subclasses in 2.14.

What do you think?

Modification has been submitted as a PR for discussion in

Wouldn’t it be better in forthcoming versions for BigInt to be dropped in favor of a scala.math.Numeric/Integral typeclass instance for java.math.BigInteger? And Fractional for BigDecimal? It would avoid wrappers.
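For readers unfamiliar with the suggestion, a typeclass instance of this kind could look roughly like the sketch below, written against the Scala 2.13 `Integral` trait. The names `BigIntegerInstances`, `BigIntegerIsIntegral` and `BigIntegerDemo` are made up for illustration and do not exist in the standard library.

```scala
import java.math.BigInteger
import scala.util.Try

// Hypothetical instance; the standard library only ships one for scala.math.BigInt.
object BigIntegerInstances {
  implicit object BigIntegerIsIntegral extends Integral[BigInteger] {
    def plus(x: BigInteger, y: BigInteger): BigInteger = x.add(y)
    def minus(x: BigInteger, y: BigInteger): BigInteger = x.subtract(y)
    def times(x: BigInteger, y: BigInteger): BigInteger = x.multiply(y)
    def quot(x: BigInteger, y: BigInteger): BigInteger = x.divide(y)
    def rem(x: BigInteger, y: BigInteger): BigInteger = x.remainder(y)
    def negate(x: BigInteger): BigInteger = x.negate
    def fromInt(x: Int): BigInteger = BigInteger.valueOf(x.toLong)
    // Returns None when the string is not a valid decimal integer.
    def parseString(str: String): Option[BigInteger] = Try(new BigInteger(str)).toOption
    def toInt(x: BigInteger): Int = x.intValue
    def toLong(x: BigInteger): Long = x.longValue
    def toFloat(x: BigInteger): Float = x.floatValue
    def toDouble(x: BigInteger): Double = x.doubleValue
    def compare(x: BigInteger, y: BigInteger): Int = x.compareTo(y)
  }
}

object BigIntegerDemo {
  import BigIntegerInstances._
  // With the instance in scope, generic numeric code accepts raw BigInteger values.
  val total: BigInteger = List(new BigInteger("1"), new BigInteger("2")).sum
}
```

Note that this removes the wrapper but does not by itself give a small-value fast path: every value is still a full BigInteger.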

There is a similar question about collections, at least mutable ones: IMO we should have typeclass instances for Java collections instead of reimplementing them.

That’s outside of the scope of what I’m doing. The changes here should be binary compatible and mostly source compatible (source incompatibility is currently limited to the .bigInteger member changing from a val field to a def method).

Is it worth spending time on this, once you start considering the refactoring I mentioned? Wouldn’t it be better to implement your optimization as a typeclass instance for Pair[BigInteger, Long]?

Did you mean something like Either[BigInteger, Long]?

I think internally it doesn’t make much difference. But yes, that’s the idea.

We will be using 2.13 for years still. I appreciate any work to optimize the current standard library.

In Scala 3 you can use an opaque type to wrap Either[BigInteger, Long], but users are not going to want to carry that type around as a core numeric type.

I see no win to abandoning BigInt vs making it perform as well as possible. E.g. in Spire they worked around the perf issues by making SafeLong. It would have been nice if years ago we just made BigInt the faster one.

Final note: I think it is not clear that typeclasses can be widely useful for perf optimization. Having many instances often leads to megamorphic dispatch on the JVM, which harms inlining. So before pinning perf hopes on typeclasses, we need evidence that this can be done. You may be motivated, but I want to encourage, not discourage, the work of Denis.

I concur with Oscar.

That said, Either[BigInteger, Long] will box the Long argument, so that’d be looking at something strictly worse than my proposal. However, the Dotty union types could be used, but then you’d lose the isInstanceOf semantics.

That said, if you want to try implementing a typeclass-based approach, at a minimum you can use the (future) benchmark that I’m writing now, and compare.

A union type BigInteger | Long would be erased to Object, so Long would still have to be boxed once.
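The boxing can be observed directly. The sketch below (the object name `BoxingDemo` is hypothetical) uses `Any` as a stand-in for the erased type, so it compiles on both Scala 2.13 and Scala 3, and inspects the runtime class of the stored value.

```scala
import java.math.BigInteger

object BoxingDemo {
  // A Long stored in a slot typed as Any is represented as a java.lang.Long at runtime.
  val erased: Any = 5L
  val erasedClass: Class[_] = erased.getClass

  // Right keeps its payload in a generic field, so the Long is boxed there too,
  // in addition to the Right instance itself.
  val viaEither: Either[BigInteger, Long] = Right(5L)
  val eitherPayloadClass: Class[_] = viaEither match {
    case r: Right[_, _] => r.productElement(0).getClass
    case _              => classOf[BigInteger]
  }
}
```

Both classes come out as `java.lang.Long`, which matches the point made above: the erased representations hold a boxed value, whereas a dedicated class can keep the `Long` in a primitive field.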

Cannot escape the boxing, right? At least Java would cache the boxed Long at the level of the JVM.
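On the caching point, a small check is sketched below (the object name `LongCacheDemo` is hypothetical). The documented guarantee of `java.lang.Long.valueOf` covers only the range -128 to 127, so the cache helps for very small values but not for most of the Long range.

```scala
object LongCacheDemo {
  // Long.valueOf is documented to always cache values from -128 to 127 inclusive.
  val a: java.lang.Long = java.lang.Long.valueOf(127L)
  val b: java.lang.Long = java.lang.Long.valueOf(127L)

  // Reference equality: both calls return the same cached instance.
  val sameInstance: Boolean = a eq b
}
```

Outside that range, a boxed value is typically a fresh allocation, so the cache does not remove the boxing cost in general.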

But I was thinking of something else: a typeclass-based approach would break user code using isInstanceOf in match patterns, etc… so we’d need strong arguments to make the change.

That is, not more than with a wrapper, anyway.

No, both approaches could easily live in harmony for as long as necessary. No need to touch BigInt. Just devise a new construct for optimized BigInteger | Long.

This construct (i.e. the one in linked pull request) is already the optimized one.

Either[Int, BigInt] produces two objects when wrapping an Int: one for scala.util.Either and a second one for java.lang.Integer. The optimized BigInt produces just one object when an Int is sufficient. A union is not compatible with pattern matching, so that’s a big drawback.

To summarize:

Union:
+ no wrapper class
+ Int boxed only once
- problems with pattern matching

Either:
+ no wrapper class
- problems with pattern matching (but less than in union case) - what if someone uses Eithers for other purposes?
- Int boxed twice

Tuple2:
+ no wrapper class
+ Int boxed once (since Tuple2 is specialized)
- problems with pattern matching (but less than in union case) - what if someone uses Tuple2 for other purposes?

Custom type (this BigInt):
+ Int boxed once
+ no problems with pattern matching (bigInt.getClass stays the same all the time and it’s distinct from all other types)
+ internals are private and they can’t be accidentally exposed using e.g. pattern matching
- adds wrapper class (but the custom logic needs to be placed somewhere anyway; why is it better to place it in a typeclass rather than in an ordinary class?)

There is scala.collection.JavaConverters, which contains thin adapters for converting between the Java and Scala APIs, so you’re free to use them already. They are even recommended in case something is missing from the Scala collection library.
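As an illustration of those adapters, here is a minimal sketch. It assumes Scala 2.13, where the same explicit `asScala`/`asJava` extension methods are available from `scala.jdk.CollectionConverters`; the object name `JavaInteropDemo` is hypothetical.

```scala
import scala.jdk.CollectionConverters._

object JavaInteropDemo {
  val javaList = new java.util.ArrayList[Int]()
  javaList.add(1)
  javaList.add(2)
  javaList.add(3)

  // asScala wraps the Java list in a thin view without copying,
  // so it can be used in a for-comprehension.
  val doubled: List[Int] = (for (x <- javaList.asScala) yield x * 2).toList

  // asJava goes the other way, again as a wrapper.
  val backToJava: java.util.List[Int] = doubled.asJava
}
```

The conversion is explicit at each call site, which is the trade-off discussed in this thread: no hidden implicit conversions, at the cost of writing `.asScala` or `.asJava`.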

IIUC mutable Scala collections work as builders for immutable Scala collections, and converting from a mutable Scala collection to an immutable Scala collection is cheaper than converting from a Java collection to an immutable Scala collection. I’m not sure though.

I know. (Sorry for diverting the topic.) I was thinking more about something like this Scala Collections Proposal · GitHub, which unfortunately did not make it (yet?) into the language.

Okay, I did not know that.

Because there will be a Numeric instance for BigInt anyway as there already is. So this is all quite redundant currently.

That article doesn’t mention using or implementing the Java collections API directly, but simplifying the Scala collections hierarchy instead. Those are two different topics. Furthermore, Scala collections were redesigned recently in Scala 2.13, and probably some ideas from that article were implemented.

Where is the redundancy? Is some code duplicated? Can you give some glaring examples?

Bridge methods are duplicated.

Right, but the design would allow Java collections as first-class citizens in for-loops, for instance, which is not currently the case.

That’s not a lot of overhead. In Scala 3 that could probably be solved succinctly by exports.

What about first-class Optional, CompletableFuture, Consumer, Supplier, etc.?

There were implicit conversions in JavaConversions but they were deprecated and only explicit JavaConverters were kept. Should that decision be reversed? I prefer explicit asScala and asJava and I am always irritated by having to guess which class to import (JavaConversions vs JavaConverters).

Can we move the discussion about general typeclass-based approaches to a new topic? BTW, I’ve added benchmarks to the PR.