Entropy | ||
---|---|---|

In | Out | % |

1 | 1.0000 | 6.2499 |

3 | 2.9999 | 18.7493 |

15 | 14.5472 | 90.9200 |

16 | 15.1728 | 94.8298 |

19 | 15.9076 | 99.4226 |

23 | 15.9944 | 99.9647 |

24 | 15.9972 | 99.9824 |

26 | 15.9993 | 99.9956 |

In the quest to "prove" the quality of the numbers output by my Arduino Random Number Generator, I've come across the concept of hash saturation while reading about the turbid random number generator which uses noise plus hash to give high quality random numbers. Implicitly I've relied on hash saturation all along, but now I have is a mathematical basis for this reliance.

Basically, hash saturation is a way to think about how the entropy in an input stream becomes entropy in the output of a hash. In the turbid model, a "well-behaved hash" is one which has an approximately equal frequency of each possible output regardless of the details of the inputs. More specifically, it is assumed that 'the probability that a given cell [i.e., output hash value] will contain exactly m snippets is given by the binomial distribution'.

Turbid then provides a formula for calculating output entropy given
input entropy and hash width. Unfortunately, at first glance it's an
infeasible value to compute: for his purposes, he calculates a sum over
2^{170} elements; the summands contain terms like
(2^{170}!). This can't be directly transcribed into
common programming languages, so there was a bit of head scratching
before I finally arrived at a program that could reproduce his table of
values, and the table of values for a 16-bit hash which is relevant to
my interests.

The main trick is to compute the log of the binomial PMF rather than the
binomial PMF itself, so that the values don't exceed those of
double-precision floating-point numbers (and use the log1p function for the
(1-p)^{n-k} term). The second trick is to avoid
calculating the huge factorials; instead of (n!)/(k! * (n-k)!),
compute the binomial coefficient via the multiplicative formula. Last, notice that the product is needed for
successive values of k, so generate them that way instead of performing the
whole product from 1…k each time.

As far as calculating the sum over 2^{N} terms, you can
stop as soon as one of the summands underflows (this is somewhere after
the peak of the binomial PMF, N/C; in my program, this bailout is
implemented by the test `if si=0 and partial_sum`). Since somewhere
around N/C elements have to be summed, this means that it starts to
become slow again somewhere around N=2^{20}*C, and the computed value
loses all precision as it starts to vary on both sides(!) of log_{2}
C. At around the same time, the true value becomes so close to log C
that it isn't worth computing, anyway.

Therefore, if my goal is to have each pair of output bytes contain at least 15.99 bits of entropy, I need to measure that there are at least 23 bits of entropy in however much data I feed to my hash. Additionally, I need to show that the CRC hash I use has the necessary equal distribution property.

**Files currently attached to this page:**

binomial.py | 2.9kB |

Entry first conceived on 14 January 2013, 13:42 UTC, last modified on 22 January 2013, 20:53 UTC

Website Copyright © 2004-2017 Jeff Epler