The Scala standard library provides a default Murmur hash implementation (scala.util.hashing.MurmurHash3). This implementation outputs a 32-bit hash code (an “Int”).
We have had cases where hashes collided:
def hash(u: String): Int = {
  // 0xf7ca7fd2 is the default seed for string hashing (MurmurHash3.stringSeed)
  scala.util.hashing.MurmurHash3.stringHash(u, 0xf7ca7fd2)
}
These strings collide (I can provide more instances):
-420251972 => ArrayBuffer(sku-2407126-73-QdUDWh6C0g, sku-6071698-30-rg34XL7NQg)
1501518886 => ArrayBuffer(sku-3139379-30-oduxX0Mlii, sku-4805921-34-Ut5eOisdBU)
On 10M strings of this form (“sku-xxx-yyy”) we had 11K collisions (around 1 per mille, i.e. 0.1%).
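For context, that count is roughly what a birthday-style estimate predicts for any well-distributed 32-bit hash, so it does not by itself point to a flaw in MurmurHash3. The sketch below computes the expected number of colliding pairs, assuming the hash spreads keys uniformly over 2^bits values. The object and method names are made up for illustration.

```scala
object CollisionEstimate {
  /** Expected number of colliding pairs when n distinct keys are hashed
    * uniformly at random into 2^bits possible values (birthday approximation:
    * number of pairs divided by number of possible hash values).
    */
  def expectedCollidingPairs(n: Long, bits: Int): Double = {
    val pairs = n.toDouble * (n - 1).toDouble / 2.0 // n choose 2
    pairs / math.pow(2.0, bits.toDouble)
  }
}
```

With n = 10,000,000 and 32 bits this gives about 11,640 pairs, in line with the 11K observed. With 64 bits the same formula gives a value far below one, but only under the assumption that all 64 bits behave as if they were independent and uniform.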
Googling around reveals that there are 128-bit variants, but no Scala implementation seemed to have any traction, so we decided against using one.
We resorted to expanding the range by hashing “in both directions” (the string and its reverse) and combining the two results into a 64-bit hash:
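def hash2(u: String): Long = {
  // Upper 32 bits: hash of the string as given
  val a = scala.util.hashing.MurmurHash3.stringHash(u, 0xf7ca7fd2)
  // Lower 32 bits: hash of the reversed string
  val b = scala.util.hashing.MurmurHash3.stringHash(u.reverse, 0xf7ca7fd2)
  // Mask b so that its sign extension does not overwrite the upper half
  (a.toLong << 32) | (b & 0xffffffffL)
}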
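One way to check whether the wider hash actually helps on a given data set is to group the keys by hash value and keep only the groups with more than one key, which yields output of the same shape as the collision listing above. The following is a minimal sketch of that check. The object name and the collisions helper are hypothetical, and hash and hash2 are repeated only so that the snippet is self-contained.

```scala
import scala.util.hashing.MurmurHash3

object CollisionCheck {
  // 32-bit hash with the default string seed, as in the post.
  def hash(u: String): Int = MurmurHash3.stringHash(u, 0xf7ca7fd2)

  // 64-bit combination of the forward and reversed-string hashes, as in the post.
  def hash2(u: String): Long = {
    val a = MurmurHash3.stringHash(u, 0xf7ca7fd2)
    val b = MurmurHash3.stringHash(u.reverse, 0xf7ca7fd2)
    (a.toLong << 32) | (b & 0xffffffffL)
  }

  /** Hypothetical helper: groups the distinct keys by their hash value and
    * returns only the hash values shared by more than one key.
    */
  def collisions[H](keys: Iterable[String])(h: String => H): Map[H, Seq[String]] =
    keys.toSeq.distinct.groupBy(h).filter { case (_, ks) => ks.size > 1 }
}
```

Calling CollisionCheck.collisions(skus)(CollisionCheck.hash) and CollisionCheck.collisions(skus)(CollisionCheck.hash2) on the same collection (skus being whatever set of keys is under test) allows a direct comparison of the two. Note that this keeps every key in memory, so for 10M strings it needs a correspondingly large heap.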