fjall-rs

Which hash function makes the fastest bloom filter?

August 5, 2024
-
5 min. read
Blog post thumbnail

lsm-tree 1 uses seahash for creating hashes in its bloom filters. seahash is decently fast; but as I crunched the numbers, I noticed the overhead of CPU time spent hashing keys can be non-trivial in hot path scenarios (all blocks are cached).

Previously, we explored how hash sharing decreases the hashing phase from O(n) to O(1) (n being the number of disk segments), so while every point read’s hashing phase is just very short, even just a couple of nanoseconds can add up, increasing point read performance by a couple of hundred thousands reads per second: any point read has a base cost of somewhere around 200ns. Even squeezing out 10-20ns can increase hot read performance by about 5-10%.

Results

The benchmark code sets up a single disk segment and reads an existing key from it 25 million times.

All benchmarks ran on an i9 11900k.

ns per read

Key Sizexxhash::h3seahashcityhashmetrohashrustc_hashfasthash::spookyfxhash
1B182ns225ns200ns186ns191ns218ns192ns
8B193ns208ns197ns186ns190ns216ns198ns
36B196ns215ns200ns190ns196ns226ns206ns
73B202ns227ns218ns194ns197ns242ns206ns
147B210ns240ns232ns202ns207ns256ns227ns
1000B253ns388ns324ns257ns288ns318ns457ns

RPS

Key Sizexxhash::h3seahashcityhashmetrohashrustc_hashfasthash::spookyfxhash
1B5494505444444450000005376344523560245871555208333
8B5376344480769250761425376344526315746296295050505
36B5102040465116250000005263157510204044247784854368
73B4950495440528645871555154639507614242016804854368
147B4761904416666643103444950495483091739370074405286
1000B3952569257731930864193891050347222231545742188183

I don’t know which hashing function I will end up using in lsm-tree 2 yet. Probably xxhash, but I have not benchmarked if the hashing functions have any meaningful impact on false positive rates yet. The future will decide…

Interested in LSM-trees and Rust?

Check out fjall, an MIT-licensed LSM-based storage engine written in Rust.


Tags