Hash saturation calculator

Entropy for perfectly well-behaved hash of 16 bits for selected values
Entropy
In Out %
11.00006.2499
32.999918.7493
1514.547290.9200
1615.172894.8298
1915.907699.4226
2315.994499.9647
2415.997299.9824
2615.999399.9956

In the quest to "prove" the quality of the numbers output by my Arduino Random Number Generator, I've come across the concept of hash saturation while reading about the turbid random number generator which uses noise plus hash to give high quality random numbers. Implicitly I've relied on hash saturation all along, but now I have is a mathematical basis for this reliance.

Basically, hash saturation is a way to think about how the entropy in an input stream becomes entropy in the output of a hash. In the turbid model, a "well-behaved hash" is one which has an approximately equal frequency of each possible output regardless of the details of the inputs. More specifically, it is assumed that 'the probability that a given cell [i.e., output hash value] will contain exactly m snippets is given by the binomial distribution'.

Turbid then provides a formula for calculating output entropy given input entropy and hash width. Unfortunately, at first glance it's an infeasible value to compute: for his purposes, he calculates a sum over 2170 elements; the summands contain terms like (2170!). This can't be directly transcribed into common programming languages, so there was a bit of head scratching before I finally arrived at a program that could reproduce his table of values, and the table of values for a 16-bit hash which is relevant to my interests.

The main trick is to compute the log of the binomial PMF rather than the binomial PMF itself, so that the values don't exceed those of double-precision floating-point numbers (and use the log1p function for the (1-p)n-k term). The second trick is to avoid calculating the huge factorials; instead of (n!)/(k! * (n-k)!), compute the binomial coefficient via the multiplicative formula. Last, notice that the product is needed for successive values of k, so generate them that way instead of performing the whole product from 1…k each time.

As far as calculating the sum over 2N terms, you can stop as soon as one of the summands underflows (this is somewhere after the peak of the binomial PMF, N/C; in my program, this bailout is implemented by the test if si=0 and partial_sum). Since somewhere around N/C elements have to be summed, this means that it starts to become slow again somewhere around N=220*C, and the computed value loses all precision as it starts to vary on both sides(!) of log2 C. At around the same time, the true value becomes so close to log C that it isn't worth computing, anyway.

Therefore, if my goal is to have each pair of output bytes contain at least 15.99 bits of entropy, I need to measure that there are at least 23 bits of entropy in however much data I feed to my hash. Additionally, I need to show that the CRC hash I use has the necessary equal distribution property.