Could some boxing/unboxing be avoided?

charpov · November 1, 2024, 6:50pm

While profiling code, I noticed a bunch of boxing of integers. It turns out they came from using -= on a mutable.BitSet. Replacing -= with subtractOne solved the problem.

I wonder how many times I’ve been hit because of something like that and never realized because I didn’t profile (I do like to use symbolic names with collections), and what could be done about it. The problem, of course, is that -= is final in Shrinkable and cannot be overridden in BitSet. Does it have to be final? (It’s also marked @`inline`, which didn’t seem to help here.)

In the same application, I found that some other undesirable boxing came from using indexOf on an IArray. Writing my own loop got rid of it. (I’ll admit I don’t even understand the comment `asInstanceOf` needed... in the source.) Could the IArray extensions be more careful about primitive types and avoid the problem?

I’ve had to rewrite some for-do into while in the past to avoid boxing, and I now know to be careful with those in performance-critical code, but I don’t want to second guess myself with every method on things like BitSet and IArray, which I expect to be efficient by default. Is there a way to improve things, at the cost of some ugly overrides here and there?

Any thoughts/advice?

Ichoran · November 1, 2024, 7:34pm

My advice is to ignore IArray entirely. The ecosystem is not written to take advantage of it, which makes it a headache even leaving aside the random surprising performance problems (those can be fixed–submit a bug report–but the friction of the class is too great). Just use Array. Also, if you have primitives, and you want to be fast, don’t touch the standard library.

Alternatively, microbenchmark everything. That will catch almost all substantial problems.

som-snytt · November 1, 2024, 9:11pm

Scala 2 has -Wperformance for this use case, namely, to warn about common gotchas that you’d never want to spend a minute debugging. The lint doesn’t do much at the moment, and would be too noisy for everyday use, perhaps; it hasn’t been introduced on Scala 3.

Maybe it’s true that “if you cared”, you’d have benchmark testing already, and then you’d turn on the lint as a first step in addressing a problem; or it would be used as an audit.

As a reminder, @inline is meaningful only to the Scala 2 optimizer, which must be enabled with -opt:inline:** to mean inline from everywhere.

At some point, I began using addOne and subtractOne exclusively. I think I finally tired of the ambiguity of never knowing how += is desugared; the clever syntax is nice for quick snippets or loop vars such as i += 1 where reading is unhampered.

The other antipattern is buf += (1, 2, 3) aka “multiarg infix”. That may be when I gave it up. I’d love the syntax if it expanded to buf.addOne(1) etc.

charpov · November 1, 2024, 9:15pm

That’s too bad. I guess it’s back to ArraySeq, then.

I have primitives, I want to be fast, and I’m too lazy to reimplement BitSet (or to program in C, for that matter).

dwalend · November 1, 2024, 9:38pm

There’s a bug waiting for attention on this very topic: Warn when boxing value classes · Issue #12271 · scala/bug · GitHub

Ichoran · November 1, 2024, 11:11pm

Don’t use ArraySeq either, not for map or anything like that–the primitives box. Really just use plain Array without collections operations when you really need to be fast. But microbenchmark and/or profile (with appropriate profiling tools so you’re not fooled by overhead or anything) to determine when you really need to be fast. For example, if you’re manipulating raw bytes underlying a grayscale image, use plain Array without collections operations.

charpov · November 2, 2024, 2:22pm

I assume that, eventually, Scala 3 can start using its own inline, which does exactly that. I also wonder why the JIT didn’t inline it, and then get rid of the boxing/unboxing. The mysteries of the JVM…

JD557 · November 2, 2024, 5:51pm

I would also recommend using raw Arrays/while loops on performance intensive code, especially if you plan on cross-compiling to JS/Native (some stuff that might be fast enough in the JVM might not be on other backends).

Unfortunately, I don’t have any better solutions. I also keep getting bit by this.

For example, just today I noticed that Int => AnyVal functions are not specialized, so something like Vector.tabulate(height)(y => Vector.tabulate(width)(x => x*y)) has some really weird performance characteristics (the y is boxed, but the x isn’t)