GitHub - ivoras/goestimators: Fast Cardinality Estimation algorithms for Go: HyperLogLog/SuperLogLog/LogLog, Bloom

Fast Estimator Algorithms for Go: (Hyper|Super)?LogLog, Bloom filter

These algorithms are usable for:

Tracking web page hits in high-performance environments (millions or billions) of pages and users
Tracking product popularity
Tracking approximate statistics

HyperLogLog (and LogLog and SuperLogLog) for Go

The algorithms in this family implement fast approximate counting of arbitrary data (here implemented as binary keys), with a trivially small memory footprint. The algorithms are similar enough that the only point of difference is the summarisation step, where the approximation is calculated from internal data. The same internal data can be used for both LogLog, SuperLogLog and HyperLogLog approximation.

The number of LogLog buckets (which affects accuracy, but soon saturates to diminishing returns) is adjustable, with 1024 being the recommended default, and 32 being the smallest practical size. This number must be an integral power of 2.

This library was implemented from scratch, by interpreting the following papers and tutorials:

Any implementation errors are probably my own.

This library uses the Murmur3 hash function implementation from github.com/spaolacci/murmur3, for optimal hashing performance.

Example

ll, _ := NewLogLog(1024)
var buf := make([]byte, 8)
// Add 100000 random 8-byte buffers to the set
for i := 0; i < 100000; i++ {
	rand.Read(buf)
	ll.Observe(buf)  // Observes the buffer, i.e. updates internal representation from it
}
est := ll.Estimate() // Returns the estimated number of unique observed buffers
fmt.Println("Estimate of a set of 100k random entries: ", est)

This example uses the plain old LogLog estimation algorithm, implemented in the Estimate() function. All three algorithms can be used from the same observation results:

Estimate() - LogLog (fastest)
SuperEstimate() - SuperLogLog
HyperEstimate() - HyperLogLog (slowest)

Performance

Measured on an i5-5200U, the benchmark results (using 1024 buckets) are:

BenchmarkLogLog-4                        2000000               712 ns/op
BenchmarkSuperLogLog-4                   2000000               763 ns/op
BenchmarkHyperLogLog-4                    500000              2367 ns/op
BenchmarkObservationLogLog-4            20000000                61.6 ns/op

Since HyperLogLog improves accuracy only slightly compared to LogLog and SuperLogLog, users should decide for themselves if they can accept the hit in performance.

Accuracy

A typical run of the test functions (using 1024 buckets) looks like this:

LogLogRandom_1M 918170
SuperLogLogRandom_1M 932190
HyperLogLogRandom_1M 1059335

These are the estimates for observing 1M unique 8-byte buffers by the respective algorithms. So, all of them report a result within 10% accuracy. YMMV.

Bloom filter

"Bloom filters" test for the absence of an element in a large set. If a Bloom filter returns negative, the element certainly isn't in the set. If it returns positive, the element might be in the set. Memory usage for this is trivially small.

This algorithm was implemented from scratch, mostly by reading the Wikipedia article. It uses the Murmur3 function for the 64-bit hash implementation.

For performance reasons, there are only two supported Bloom filter sizes: 256-bit (which is 32 bytes), and 65536-bit (8 KiB). The 256-bit Bloom filter implements two hash functions (k=2), and the 65536-bit one implements four hash functions (k=4).

Example

const N = 100
b := NewBloom(goestimators.Size256)

// Add N random 64-bit numbers to the set
var i uint64
buf := make([]byte, 8)
for i = 0; i < N; i++ {
	uint64ToBytes(i, buf)
	b.Observe(buf)
}

// Test for existance of N+10 elements
notfound := 0
for i = 0; i < N+10; i++ {
	uint64ToBytes(i, buf)
	if !b.Check(buf) {
		notfound++
	}
}