Murmur hash conflicts when hashing many items


#1

The scala foundation library provides a default Murmur hash implementation. This implementation will output a 32 bit hash code ("Int").

We have had cases where hashes were conflicting:

def hash(u: String): Int = {
    scala.util.hashing.MurmurHash3.stringHash(u, 0xf7ca7fd2) // the default seed
  }

these strings will conflict (I can provide more instances):

-420251972 => ArrayBuffer(sku-2407126-73-QdUDWh6C0g, sku-6071698-30-rg34XL7NQg)
1501518886 => ArrayBuffer(sku-3139379-30-oduxX0Mlii, sku-4805921-34-Ut5eOisdBU)

On 10M strings of this form (“sku-xxx-yyy”) we had 11K conflicts (around 1 promil).

Googling around reveals that there are 128 bit variants but no scala implementation seemed to have any traction so we decided against using it

We resorted to expanding the range by hashing “in both directions” and creating a 64 bit hash:

def hash2(u: String): Long = {
    val a = scala.util.hashing.MurmurHash3.stringHash(u, 0xf7ca7fd2)
    val b = scala.util.hashing.MurmurHash3.stringHash(u.reverse.toString, 0xf7ca7fd2)
    val c: Long = a.toLong << 32 | (b & 0xffffffffL)
    c
  }

#2

With a 32 bit hash, each pair has about 1 in 4 billion collision chance. With 10 million strings, you have 10^14 pairs (10^3 ~ 2^10, so 10^14 ~ 2^(14 * 10/3) ~ 2^46 pairs) that means you expect about 2^46/2^32 = 2^14 = 16K.

You saw 11K, so that seems roughly correct.

No 32 bit hash can really do better here. Using more bits can work and may be faster than hashing twice (for instance a directly 64 bit or 128 bit hash).


#3

Exactly. So there should be a 64 or 128 bit variant in the library

Avner


#4

Hash collisions are inevitable, so if you’re trying to find a hash that does not give collisions for your data then you’re depending on luck. For non-cryptographic hashes it should usually be easy to develop a collisions generator. There are even collisions generators for MD5 hash, which for a long time was considered cryptographically secure.


#5

I think the one in the standard library is mainly just there as a helper for overriding the Object#hashcode() method. For that case you only need an Int.