Equality and hash typeclasses

NthPortal · March 25, 2018, 1:03am

Proposal

I think we should add typeclasses for equality and hashing to the Scala ecosystem - either to the Scala Platform, or to the Scala library itself.

Motivation

Equality

Equality testing (==) uses universal equality, which enables comparing completely different types that can never be equal, and can only lead to bugs. Additionally, in order to have numbers behave someone sensibly, == implements cooperative equality; this leads to many problems, one of which is significantly reduced performance (see Can we get rid of cooperative equality?).

It would be very helpful to have a type-safe way of testing for equality, to help reduce bugs and improve code correctness. On top of this, because an equality typeclass doesn’t allow comparing different types, the cause for cooperative equality doesn’t ever arise, and it is a complication that can be skipped.

Hashing

Hashing uses AnyRef.hashCode():Int, which has a number of problems. It makes it difficult to implement hashing strategies other than Java’s default, as all the hash data about the object is already “mixed” before it is returned. Even if you create a hashing strategy which returns 256 bits, you can only get 32 bits of hash data out of any object, so the object will not be hashed as thoroughly as you wish (this is a major problem for cryptographic hashing). On top of this, if you want to hash just a single object into more than 32 bits, there is no good way to do so - you only have 32 bits of information to work with.

Instead of a method which returns a fixed amount of data for a hash, it would be better to have a typeclass which defines how the object gets mixed into some hash state. Then, you can implement your own hashing strategy, and the hash typeclass will properly mix all of the object’s data into your hash state. This allows even cryptographic hashing over objects, without resorting to roundabout methods (e.g. turning the object into JSON and hashing that).

Existing Libraries

Why do we need this, if there are already several libraries - such as scalaz and cats - with equality typeclasses?

Unfortunately, scalaz does not include a typeclass for hashing, and the one in cats only returns Ints. Additionally, both libraries include lots of other code, which you might not want to depend on if all you want is equality and hashing (arguably, cats kernel doesn’t have too much extra, but it does have stuff for monoids and semigroups and other things). cats does not have implicit operators for typeclass equality either (at least not that I could find), which is important for making type-safe equality usable.

Strawman Implementation

I’ve created a strawman implementation to test the semantics and usability. The design for hashing is significantly inspired by Rust’s std::hash.

trait Eql[T] {
  def equal(x: T, y: T): Boolean
  def notEqual(x: T, y: T): Boolean = !equal(x, y)
}

trait Hash[T] {
  def hash(value: T, state: HashState): Unit
}

trait Hasher[R] {
  def state: HashState
  def reset(): Unit
  def result(): R
  final def +=[T](t: T)(implicit hash: Hash[T]): this.type = {
    hash.hash(t, state)
    this
  }
}

trait HashState {
  def +=(byte: Byte): this.type
  def +=(int: Int): this.type
  def +=(long: Long): this.type
  def +=(bytes: Array[Byte]): this.type
}


trait HashFactory[R, H <: Hasher[R]] {
  def newHasher: H
  def hash[T: Hash](value: T): R = (newHasher += value).result()
}

implicit final class RichAnyEql[A](val self: A) extends AnyVal {
  def ===(that: A)(implicit eql: Eql[A]): Boolean = eql.equal(self, that)
  def =!=(that: A)(implicit eql: Eql[A]): Boolean = eql.notEqual(self, that)
}

The full code is on GitHub.

oscar · March 25, 2018, 2:56am

Note there is scala.math.Equiv and scala.util.hashing.Hashing.

If you ask me, we should make Hashing extend Equiv (to give it laws: equiv(a, b) implies hash(a) == hash(b)).

Note also, cats has typeclasses for this in cats kernel: Eq and Hash.

I would prefer we rehabilitate existing classes and use them to a greater extent than propose new one.

NthPortal · March 25, 2018, 4:27am

scala.util.hashing.Hashing suffers the same problem as cats.kernel.Hash - it only returns Ints. (I have never actually seen anyone use Hashing, probably because it doesn’t add much over hashCode or ##.)

scala.math.Equiv is weaker (and more general*) than equality, in that two objects can be equivalent without being equal. Additionally, scala.math.Ordering extends Equiv, and Ordering is not required to conform to an equality relation (though it should if it is being used for SortedSet/SortedMap).

This feels awkward to me. It is not clear to me why a hash typeclass needs to know about object equality. Certainly, hashing and equality typeclasses should be consistent, but neither type is the other (and it’s quite possible to create a type that implements both and isn’t consistent).

*Because it is more general, scala.math.Equiv has benefits over an equality relation as well; it is not strictly worse. For example, if comparing two library versions for binary compatibility to see if one version can replace the other, you can use Equiv to describe this binary “equivalence”. Of course, the different versions are by no means the same version. It would not be appropriate to use an equality typeclass for that. You can make the argument that an equality typeclass ought to extend Equiv, although it’s not obvious to me that there is a major benefit.

Blaisorblade · March 26, 2018, 6:59am

I think this is interesting, but a few comments.

“is-a” is not quite the point of typeclass inheritance — though you can reconcile the two things.
You agree that hashing and equality should be consistent. I know no way of stating that without one typeclass extending the other, which is what an FP designer would do.
Once you agree that all types implementing hashing should also implement equality (and not vice versa), it seems easiest to say that hashers can also compute equality. An OO would maybe not force you to do that, but doesn’t forbid it either.

Of course you can. Are you suggesting that’s a reasonable use case? (I think that’d be reasonable for equivalence, not for equality — and I think @oscar was talking about equality).

On hashing: I think the existing hashing serves its design goals; I’m happy Rust’s design can do more and more efficiently, but can you demonstrate the claimed problem actually exists and is significant enough to justify the needed work?

For instance, is the overhead of serialization (with the fastest libraries around) significant relative to the overhead of cryptographic hashing? Wouldn’t you have to solve many problems of serialization anyway, since you’re turning a graph of data into a sequence of bytes to stream into the hasher?

Also, the Scala library (and the platform?) still lacks a standard way to do type class derivation, and without derivation such a type class seems less useful — more effort to instantiate, less likely to be adopted.

NthPortal · March 26, 2018, 9:01am

That is fair. I guess my main concern was “to give it laws,” as that sounded to me at least like the hash would be checked for conformance at runtime (I’m not sure if that’s what you meant, @oscar).

The other disadvantage is that if the hash typeclass extends the equality typeclass, then it cannot be implemented by a lambda.

One problem is that a hash table (or map, set) can be made to behave pathologically using AnyRef.hashCode. For example, carefully crafted Strings can be used to have most or all hashes collide, which can lead to denial of service.

To solve this problem, there are essentially two possibilities:

create separate Hashing instance for each type, for each hashing strategy (e.g. one Hashing[String] for cryptographic hashing, one Hashing[String] for non-cryptographic hashing, etc.)
have Hashing describe only how the type is mixed into the hasher, and not the complete hashing process (this way, the same Hashing[String] can be used for both cryptographic and non-cryptographic hashing, etc.)

Option 2 reduces the number of typeclass instances which need to be created/derived, and allows the creation of alternative hashing strategies with little to no overhead. Additionally, there is already nothing like a hasher in the Scala library, so if you need to add one anyway, there’s not really any cost to having it allow arbitrary hash outputs.

I don’t honestly know the output size required for a cryptographic hash to prevent DoS in the aforementioned case. It is possible that 32 bits is sufficient, but it is also possible it is not (Rust uses 64 bits). If not, the current scala.util.hashing.Hashing will not work.

There may also be other similar problems relating to AnyRef.hashCode.

I don’t know if it is significant relative to cryptographic hashing, but it may be significant relative to non-cryptographic hashing (and they should be easily interchangeable with each other).

Additionally, if you’re using typeclass-based serialization, you’re just swapping a hash typeclass for a serialization typeclass (of some sort) - which works, but why not have a typeclass that fits better.

Very true. Theoretically, they could be easily generated for case classes, just like the equals and hashCode methods already are.

Blaisorblade · March 27, 2018, 8:16am

Typeclass laws aren’t enforced at runtime any more than laws of algebraic structures are; they can be tested (QuickCheck-style) and serve as algebraic specifications.

On the rest: most of my doubts are just about maintenance and implementation costs for what seems a pretty specialized scenario. If you volunteer for those, that’s a smaller concern, but your proposal doesn’t acknowledge or detail the costs (though to your credit you have a prototype), For instance, hashing serialization output requires less code than hashing directly (assuming you need serialization anyway).

If youre creating an ecosystem (like for Rust) the costs are different, but transitioning one (like for us) has much more friction. It seems to me your arguments up to now might be enough if you had a time machine and a chance to talk to designers of Java 1.0.

(To be continued but gotta go).

szeiger · April 3, 2018, 1:01pm

I wouldn’t want to mix up cryptographic hashes with a replacement for hashCode. The performance cost for such an abstraction would be huge and there is no benefit for standard data structures like hash tables. A possible option if you’re abandoning Java interoperability is to switch from Int to Long but that’s only useful for huge hash tables based on Long hashes. Rust can get away with these kinds of abstractions because it enforces monomorphisation and can perform aggressive optimizations across compilation units.

Ichoran · April 3, 2018, 2:56pm

Rust actually can’t even get away with these abstractions. Using cryptographic hashes when they’re not needed is a huge performance hit for the unwary, and the machinery to specify the hash you want isn’t invisible* due to a lack of implicits like in Scala. I understand why they did this, but I really am not a fan.

(* - It isn’t invisible in all cases. If you are just creating and consuming your own HashMap, it’s fine because you can create a type alias. But if you use a map from someone else, you can end up with different types of HashMap because the type is parameterized by hashing algorithm!)

Also, there’s no point making it hard to find collisions when you have only a 32 bit hashcode. You can still generate millions of collisions a day brute-forcing it (especially if you allow whichever degeneracy will give the same results for the hash lookup).

Finally, I think the equality and hashing proposals don’t need to be linked in any significant way. There is a lot to talk about in each, and the only real overlap is when considering how to implement hash maps that are typeclass-aware.

scottcarey · April 4, 2018, 5:02pm

Performance issues aside, I find some of this very interesting.

Hashing, Equality, and Serialization for pure data types all can be represented by essentially the same process – producing a sequence of values – but with different modes of consumption. Hashing mixes the sequence of values to mix a hash, Equality compares the sequence of values of two instances, and serialization writes the sequence of values.

It certainly would be nice when implementing a class to merely define what to output and in what order and get all three for free. In reality we have case classes generate the boilerplate for us which increases the bytecode but also limits megamorphic dispatch.

Also, there’s no point making it hard to find collisions when you have only a 32 bit hashcode.

Yeah, and even 64 bit ones, if the attacker knows the algorithm and has the incentive, they can purposely pre-generate collisions for use in many places. This is part of why the Java standard library modified their hash structures to escalate a hash-chain to a tree if the hash chain gets too long rather than fix the overly simple hash function on types like String.

NthPortal · April 5, 2018, 2:26am

That would be really awesome

scottcarey · April 5, 2018, 10:54am

Also, if all the components had a default Ordering, then an Ordering for the whole thing could be synthesized as well (I think we can do this today with things like Shapeless and HList but am not sure).

But this brings some flaws (some shared by case class synthesized variants today):

One does not always want the same ordering.
One does not always want the serialized form to be written in the same order as what would be used to compare or hash.
For equivalence, there is an ordering concern related to performance: it is sometimes useful to either place the ‘cheap’ comparisons first or place the fields that are more likely to differ first in order to speed it up.

Hashing does not care at all about what order is chosen, any will do. The others have some motives for preferring one order over another.

But for the majority of data types and use cases, one order could be used for all of the above.

curoli · April 5, 2018, 3:38pm

I’m not sure there is a universal way to calculate a hash code. For example, you would expect Set(1, 2) and Set(2, 1) to have the same hash code, but not List(1, 2) and List(2, 1).

scottcarey · April 5, 2018, 10:01pm

Sets and Maps deal with this by either having a predictable order in which they hash their contents, or using an unordered hash calculation.

A tree based implementation would have a predictable iteration order. Hash implementations can as well, but it usually imposes constraints that hurt performance (e.g. the table size must be a function of the # of elements with no histeresis). Hash based sets may also have already computed and cached a hash code for each element, so that they don’t have to traverse and calculate the hash of each element again.

What does Rust do for these cases? It is probably too constraining to require ordering. Both Java and Scala use an unordered hash function for hash based sets and maps. A HashState trait might have to have the ability to mix in order-independent values, e.g.

trait HashState {
  def +=(int: Int): this.type
  def ^=(int: Int): this.type
  ...
}

where the ^= variants are added in an order independent manner with respect to the existing state, so a run of such calls are order independent, with order ‘boundaries’ at each call that is a += variant.

A simple example:

IntHashState extends HashState {
  var state: Int = 0;
  def +=(int: Int) = ((state * Prime) ^ int); this
  def ^=(int: Int) = state ^ int; this
  ...
}

Ichoran · April 5, 2018, 10:18pm

It is, but that’s all Rust has for now as far as I know. You can’t (somewhat ironically) get a hash for a HashSet or HashMap.

See Implementing Hash for HashSet/HashMap? - libs - Rust Internals for a discussion. (It’s an old discussion, but HashSet still doesn’t implement Hash.)

NthPortal · April 7, 2018, 3:56am

Unfortunately, not all hash functions/algorithms readily support order-independent hashing (e.g. cryptographic hashes) - thus, I don’t think ^= methods will work well in general.

Additionally, even with the existence of ^=, it wouldn’t work anyway. If we suppose the following scenario:

case class Person(firstName: String, lastName: String)

implicit object PersonHash extends Hash[Person] {
  def hash(person: Person, state: HashState): Unit = {
    state += person.lastName
    state += person.firstName
  }
}

implicit def setHash[T: Hash]: Hash[Set[T]] =
  (set, state) => for (elem <- set) Hash[T].hash(elem, state)

The Hash for Set doesn’t actually call ^= itself - it delegates to Hash[T]! In general, a HashState mixes in complex values component by component. so it doesn’t have the information to hash in a fashion unordered with respect to elements of a Set, but ordered with respect to the components of each element. You need some way for it to collect an intermediate result for each element, and then mix those in an unordered fashion.

Honestly, the model I proposed does not support unordered hashing without significant changes, and I’m not sure what the best way to support unordred hashing is. I will certainly think about it.

NthPortal · April 7, 2018, 4:41am

A naive implementation:

trait UnorderedHash[T] {
  def hash(value: T, state: UnorderedHashState): Unit
}

trait UnorderedHasher[R] {
  def state: UnorderedHashState
  def reset(): Unit
  def result(): R
  final def +=[T](t: T)(implicit hash: UnorderedHash[T]): this.type = {
    hash.hash(t, state)
    this
  }
}

trait UnorderedHashState {
  def +=[T: Hash](t: T): this.type
  def +=[T: UnorderedHash](t: T): this.type
}

trait UnorderedHashFactory[R, H <: UnorderedHasher[R]] {
  def newHasher: H
  def hash[T: Hash](value: T): R = (newHasher += value).result()
}

jroper · October 15, 2018, 3:09am

Also, there’s no point making it hard to find collisions when you have only a 32 bit hashcode. You can still generate millions of collisions a day brute-forcing it.

This is not true. If you know the hash code, yes, you can generate millions of collisions. But the solution that most languages (not Java) have used is to use a random hash function. If using a random hash function (eg generated at start up, or when the map is instantiated), then an attacker doesn’t know the hash function, so they can’t brute force collisions. They must first brute force the hash function, and without local access, how would they do that? Perhaps they can determine whether a particular hash function is used or not using a timing attack, but if the space of possible hash functions is 32 bits or more, given the time it takes to validate a single result using typical timing attack techniques, that’s probably infeasible.

SipHash has been designed specifically for this purpose and is not known to have any weaknesses against collision attacks:

https://131002.net/siphash/

vpatryshev · October 15, 2018, 3:36am

The idea of having a hash function that is not even a function sounds very strange.

Two objects are equal, but their hashes are not equal because of randomness? It’s not a function then. Or something is missing in arguments.

jroper · October 15, 2018, 4:04am

We’re not talking about a single hash function, we’re talking about a randomly selected function, which always produces the same hash for equal objects. What is random is not the output of the function, but rather which function is used.

joshlemer · October 15, 2018, 2:03pm

I’d like to point out that if we do go with a randomly chosen hash function for Sets/Maps, it will completely eliminate any ability to use structural sharing when performing any binary operation on Sets/Maps (so, Set#{diff, union, intersect}, Map#{concat, merge}). Further, we would not even be able to avoid rehashing all the keys in at least the smaller of the two Maps/Sets, with the larger collection’s hash function.