Hash saturation calculator

Entropy for perfectly well-behaved hash of 16 bits for selected values
Entropy
In	Out	%
1	1.0000	6.2499
3	2.9999	18.7493
15	14.5472	90.9200
16	15.1728	94.8298
19	15.9076	99.4226
23	15.9944	99.9647
24	15.9972	99.9824
26	15.9993	99.9956

In the quest to "prove" the quality of the numbers output by my Arduino Random Number Generator, I've come across the concept of hash saturation while reading about the turbid random number generator which uses noise plus hash to give high quality random numbers. Implicitly I've relied on hash saturation all along, but now I have is a mathematical basis for this reliance.

Basically, hash saturation is a way to think about how the entropy in an input stream becomes entropy in the output of a hash. In the turbid model, a "well-behaved hash" is one which has an approximately equal frequency of each possible output regardless of the details of the inputs. More specifically, it is assumed that 'the probability that a given cell [i.e., output hash value] will contain exactly m snippets is given by the binomial distribution'.

Turbid then provides a formula for calculating output entropy given input entropy and hash width. Unfortunately, at first glance it's an infeasible value to compute: for his purposes, he calculates a sum over 2¹⁷⁰ elements; the summands contain terms like (2¹⁷⁰!). This can't be directly transcribed into common programming languages, so there was a bit of head scratching before I finally arrived at a program that could reproduce his table of values, and the table of values for a 16-bit hash which is relevant to my interests.

The main trick is to compute the log of the binomial PMF rather than the binomial PMF itself, so that the values don't exceed those of double-precision floating-point numbers (and use the log1p function for the (1-p)^n-k term). The second trick is to avoid calculating the huge factorials; instead of (n!)/(k! * (n-k)!), compute the binomial coefficient via the multiplicative formula. Last, notice that the product is needed for successive values of k, so generate them that way instead of performing the whole product from 1…k each time.

As far as calculating the sum over 2^N terms, you can stop as soon as one of the summands underflows (this is somewhere after the peak of the binomial PMF, N/C; in my program, this bailout is implemented by the test if si=0 and partial_sum). Since somewhere around N/C elements have to be summed, this means that it starts to become slow again somewhere around N=2²⁰*C, and the computed value loses all precision as it starts to vary on both sides(!) of log₂ C. At around the same time, the true value becomes so close to log C that it isn't worth computing, anyway.

Therefore, if my goal is to have each pair of output bytes contain at least 15.99 bits of entropy, I need to measure that there are at least 23 bits of entropy in however much data I feed to my hash. Additionally, I need to show that the CRC hash I use has the necessary equal distribution property.

Files currently attached to this page:

binomial.py

2.9kB

Hash saturation calculator

Tags: