[personal profile] ciphergoth
I keep promising to write up things about my life on here, like my astonishingly fabulous weekend with [livejournal.com profile] ergotia, [livejournal.com profile] lilithmagna, and [livejournal.com profile] calar, but the thing that usually inspires me to write is some geek nonsense, so here we go again.

Everyone knows that radioactive decay is truly random: whether or not a given nucleus will decay in the next second is entirely down to chance. If it doesn't decay this second, the chances that it'll do so in the next are exactly the same. If it survives another million years, it still has the same chance of decaying in the second following that; it doesn't "age". For the nucleus, every second is a game of Russian Roulette, and there are always the same number of bullets in the barrel. That's why you can't measure the "life" of a radioactive source - instead, we measure the "half-life", which is the time it takes before the chances it'll have decayed are 1/2 - or to put it another way, the time before half the nuclei in a given sample have decayed.
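In symbols: if p is the chance that a given nucleus decays in any one second, and T is the half-life, then

$$(1-p)^T = \frac{1}{2} \quad\Rightarrow\quad \Pr[\text{still intact after } t \text{ seconds}] = (1-p)^t = 2^{-t/T}$$

so the survival probability depends only on t, never on how long the nucleus has already been sitting around.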

So it's an appealing source of random numbers - stick a Geiger counter next to a radioactive source, and the "clicks" are randomly distributed. If the source is large enough and has a long enough half life, we can ignore its slow decay, and treat each "click" as an independent event - the probability of a click in the next millisecond is the same no matter what the history is.

Only it isn't. The problem is that the Geiger counter takes a while to recover after emitting a "click" - its sensitivity is impaired, and only gradually returns to normal. So the chances of there being a "click" in the next millisecond aren't always the same - they depend on how long ago the last click was, in a complex way that depends on the exact hardware of the particle detector you're using.

If this wasn't so, getting random numbers from a Geiger counter would be easy. For each millisecond, record whether there was a "click". Call it "heads" if there was, and "tails" if there wasn't; each millisecond would be independent of the others, so we can treat it as the toss of a coin. It's a biased coin, more likely to come down on one side than the other, but that's OK, because there is a very effective way to turn a sequence of biased but independent coin flips into a sequence of unbiased, independent coin flips: the Advanced Multi-Level Strategy or AMLS for short, developed by Yuval Peres. It doesn't even need to know the bias, and it makes very efficient use of the randomness available given enough data.
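For concreteness, here's a minimal sketch in Python of AMLS as I understand it - Peres's iterated von Neumann procedure - written as the amls() function the code further down assumes:

def amls(flips):
    # Sketch of Peres's iterated von Neumann procedure: takes a list of
    # independent, identically biased bits and returns unbiased,
    # independent bits.
    if len(flips) < 2:
        return []
    out, xors, same = [], [], []
    for a, b in zip(flips[0::2], flips[1::2]):
        a, b = int(a), int(b)
        if a != b:
            out.append(a)      # the classic von Neumann step
        else:
            same.append(a)     # the pairs von Neumann would discard
        xors.append(a ^ b)
    # Recursing on the XOR stream and on the discarded pairs recovers
    # most of the entropy plain von Neumann would waste.
    return out + amls(xors) + amls(same)

The yield approaches the entropy of the biased coin as the input gets longer, which is what makes it so much better than the plain von Neumann trick.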

However, we can't apply this strategy directly because of the "recovery time" of the Geiger counter; a different approach is needed, one that makes no assumptions about how the intervals between clicks are distributed except that they are independent.



HotBits is a website which allows you to download random bits generated by Geiger counter clicks. HotBits simply takes pairs of successive intervals and compares them: if the first is bigger, emit 0; if the second is bigger, emit 1; if they're equal, skip the pair and move on. This produces slightly under 0.5 bits per click.
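A sketch of that comparison rule (not HotBits' actual code), taking a list of measured interval lengths:

def hotbits_style(intervals):
    # Compare non-overlapping pairs of successive intervals:
    # (t0, t1), (t2, t3), ...  Equal pairs are skipped, which is why
    # the yield is slightly under 0.5 bits per click.
    bits = []
    for a, b in zip(intervals[0::2], intervals[1::2]):
        if a > b:
            bits.append(0)
        elif b > a:
            bits.append(1)
    return bits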

A better strategy would be to estimate the median interval length, and then emit Heads for each interval that falls under the estimate and Tails for each that falls over; if you estimate the median wrong there will be a bias, but you can feed the stream of coin flips to AMLS to get a guaranteed unbiased bit stream. This will produce slightly under 1 bit/click.
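A sketch of that, reusing amls() from above; I'm assuming the threshold is fixed in advance (say, estimated from an earlier block), so the flips stay independent of one another:

def threshold_bits(intervals, threshold):
    # threshold is a guess at the median interval length; a bad guess
    # only biases the coin flips, and amls() removes the bias.
    flips = [t < threshold for t in intervals]   # "Heads" for short intervals
    return amls(flips)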

At the cost of rather more complexity, the following strategy, which I developed today, does considerably better: against data such as HotBits uses, taking intervals in blocks of 8192, it produces around 5.72 bits/click. It's inspired by QuickSort and should be pretty easy to implement in C; the following code is in Python, but should be readable by any programmer:
def extract_juice(data):
    # Need a pivot plus at least two more samples to get anything out.
    if len(data) < 3:
        return []
    pivot, rest = data[0], data[1:]
    # One biased but independent coin flip per remaining sample...
    flips = [x >= pivot for x in rest]
    # ...which AMLS turns into unbiased bits.
    bits = amls(flips)
    # Recurse on the two sub-blocks, QuickSort-style.
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return bits + extract_juice(left) + extract_juice(right)
Essentially, instead of estimating the median it takes the very first sample in the block as the pivot, and produces coin flips based on that, which are run through AMLS for unbiasing. However, it then divides the block into two sub-blocks, one for those less than the pivot and one for those greater than or equal to it, and re-runs the algorithm over both sub-blocks. This produces provably unbiased bits, assuming only that the input intervals are shuffled fairly.
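A hypothetical driver for it, feeding the raw timer readings to extract_juice() a block at a time:

def bits_from_intervals(intervals, block_size=8192):
    # Process the interval stream in blocks, as in the 8192-interval
    # blocks mentioned above.
    bits = []
    for i in range(0, len(intervals), block_size):
        bits.extend(extract_juice(intervals[i:i + block_size]))
    return bits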

I've tried quite a few variants on this, but this one seems to be the winner for both simplicity and efficiency.

Date: 2002-11-25 09:25 am (UTC)
From: [identity profile] akicif.livejournal.com
Ah. Now I see what you were getting at. Back when we used to play with radioactive stuff, we were using sources with a mean time between emissions at least an order of magnitude longer than the latency of the counter. With "dilute", weak sources of long half-life material, the number of clicks per hour was more or less invariant over an academic year.

Date: 2002-11-25 10:25 am (UTC)
From: [identity profile] keirf.livejournal.com
If the geiger counter takes 1ms to recover after each click, can't we just sample across, say, 1ms intervals every 2ms, shielding the counter in the second ms? Perhaps with a spinning lead disc with slots in it?

(Just out of interest, how do we know that radioactive decay is truly random?)

Date: 2002-11-25 10:33 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
Doing that in hardware involves more hardware than most people would want to get, and a spinning lead disk would be prone to failure - it's not a "set it up and forget it" option, in contrast to HotBits which has run unattended for years.

Further, this greatly reduces the total randomness available. I discussed discarding intervals smaller than the recovery time of the Geiger counter and subtracting that time from the rest, but basically that throws away more randomness than it preserves.

Date: 2002-11-25 10:58 am (UTC)
From: [identity profile] keirf.livejournal.com
Oh, okay, I didn't initially realise that we weren't supposed to be throwing away randomness, but re-reading I've noticed the bits/click references which could be read to imply it.

However, if there is a significant time interval between clicks you could get even more bits/click (by treating it as a frequency-modulated wave). But then there might be a problem due to there being no upper bound on the wavelength. But how about forcing a click every x ms in that case? Not sure it would work...

Incidentally, couldn't discarding intervals smaller than the recovery time still cause problems because (and I'm thinking in terms of Geiger counters with high voltage across a gas that is ionized and conducts type things here) a second particle entering the detector during the recovery period would cause the recovery time to be extended, i.e. a constant stream of particles might cause the counter never to register anything past the first click?

(Even though it's not solid state I thought a spinning lead disc might be quite cute.)

Date: 2002-11-26 02:02 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
a second particle entering the detector during the recovery period would cause the recovery time to be extended, i.e. a constant stream of particles might cause the counter never to register anything past the first click?

Yes, this would certainly be the case for the HotBits data - basically there are *no* intervals longer than the recovery time!

Timing the intervals?

Date: 2002-11-25 11:40 am (UTC)
From: [identity profile] pavlos.livejournal.com
Hmmmm, yes, I gathered you might be trying to compensate for the tube recovery time.

If your goal is to extract random bits efficiently then, given that a tick is quite short compared to the quiet interval, you could time the interval with a high resolution timer. Instead of a sequence of coin flips, you would then have a sequence of random numbers with inverse-exponential distribution: There is a fixed probability that NONE of the nuclei in your sample decays in the next microsecond and hence that the interval lasts another microsecond. Your randomness-collection rate would then be limited only by the time resolution of your electronics.

Pavlos

PS. Haven't read your program yet. Need food.

PPS. It just occurred to me that this use makes a Geiger counter doubly suspicious. Imagine trying to travel with one.

Re: Timing the intervals?

Date: 2002-11-25 12:31 pm (UTC)
From: [identity profile] ciphergoth.livejournal.com
I'm assuming we're measuring the intervals with a high-resolution timer. However, it will not have an inverse-exponential distribution, because of the recovery time; it'll have some weird distribution. If you discard all intervals shorter than the recovery time, and subtract the recovery time from the rest, you get an inverse-exponential distribution, to which you can apply my first algorithm.

However, this leaves you throwing away a lot of data. Interval lengths detected by HotBits are not inverse-exponential at all, but an excellent fit to a normal distribution; this indicates that the source is so hot that the detector pretty much never fully recovers before detecting its next particle. In this situation, your strategies would be very unproductive.

So you need a strategy that doesn't care what the distribution of the intervals is, only that they are all independent. That's what the strategy above does.

I've written a C implementation now...

Re: Timing the intervals?

Date: 2002-11-25 09:58 pm (UTC)
From: [identity profile] pavlos.livejournal.com
Hmmm... Maybe this is a naive question. The part I don't understand is why comparing intervals is the thing to do. If you simply measure the length of each tick with a high-resolution timer, why can't you treat each bit of the timer value as a biased coin? Is it because there is no good strategy to remove the correlation between the bits?

If you attempt to rank the intervals by length, as you seem to be doing, the number of bits that you can get per tick before any statistical correction seems limited by the number of ticks (n) in the block. You will get a particular ranking out of n! possible ones. Allegedly ln(n!) can be approximated as (here is how I embarrass myself by making some elementary algebra error):

ln(n!) ~= n ln(n) - n

so

ln(n!) ~= n (ln(n) - 1)
lg(n!) ~= n (ln(n) - 1) / ln(2)

So you should get roughly 94k bits per block, or a yield of 11 bits per sample, and the yield should grow roughly by one bit per tick every time you double the block. That, of course, assumes an infinitely fine timer. You then have to compensate for the fact that, with a finite-resolution timer, some intervals would have the same length and the ranking would be partial.
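A quick check of that arithmetic in Python, using lgamma for lg(n!) rather than the two-term Stirling approximation:

import math

n = 8192
bits_per_block = math.lgamma(n + 1) / math.log(2)   # lg(n!)
print(round(bits_per_block))     # about 94,700
print(bits_per_block / n)        # about 11.6 bits per sample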

I've no idea how to begin to analyze the tradeoff between better yield, more computation to extract the random bits out of the ranking, and a better (higher resolution) timer that would result in fewer colliding intervals. All I'm trying to say is that I vaguely expect you need to do something like use the largest practical block in order to make efficient use of the resolution of your timer.

Pavlos

Re: Timing the intervals?

Date: 2002-11-26 01:59 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
Is it because there is no good strategy to remove the correlation between the bits?

Yes, it's for just this reason. I proposed this very strategy myself on sci.crypt before I realised this problem!

One way to remove the correlation is to have separate streams for the (n+1)th most significant bit based on the n-bit prefix - but if you start down that road you end up at this strategy.

Basically comparing intervals is the most effective way to assume nothing about their distribution!

At least for HotBits data, the tradeoff isn't as sharp as you might think, probably because the timer isn't high resolution enough. The s.d. of the integers is only 26.4, so colliding interval lengths are common. I've tried block sizes of 10 million with my C implementation, and I still can't get 7 bits/click, so I think I'm approaching a natural limit.

Re: Timing the intervals?

Date: 2002-11-26 10:55 am (UTC)
From: [identity profile] pavlos.livejournal.com
Well, if there are lots of equal intervals then the number of distinguishable rankings is much less than n! It's probably

n! / Product {m! : for every set S of equal intervals m = population (S)}

Please excuse the crappy notation. Would that be adequate explanation for the limited yield of the HotBits data? If not, I can't see what else might be the problem.
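In code, that formula is roughly (my own gloss, tallying equal intervals with a Counter):

from collections import Counter
import math

def lg_distinguishable_rankings(intervals):
    # lg( n! / product(m! for each group of m equal intervals) ) -
    # an upper bound on the extractable bits when ties are common.
    total = math.lgamma(len(intervals) + 1)
    for m in Counter(intervals).values():
        total -= math.lgamma(m + 1)
    return total / math.log(2)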

Pavlos

Re: Timing the intervals?

Date: 2002-11-26 11:16 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
Yes, I think that's why the yield on HotBits isn't higher. Increasing the s.d. by a factor of four increases the yield by 1.27 bits on a block size of 8192 - I expect the effect would be stronger at a larger block size.

Date: 2002-11-25 01:47 pm (UTC)
From: [identity profile] mortice.livejournal.com
Just to be infeasibly pedantic...

If I understand it right, then if the half-life of a source is a year, the average time between clicks of your Geiger counter will be double what it was a year earlier.

This must mean that statistically the clicks will get further apart in time, even though they occur randomly.

So, hardware-recovery-time aside, is that a factor that influences the randomness, or is it absorbed in the AMLS algorithm, or is it just too small a factor to worry about?

Date: 2002-11-25 02:59 pm (UTC)
From: [identity profile] ciphergoth.livejournal.com
Basically, it's too small to worry about. If the source [...] has a long enough half life, we can ignore its slow decay

It only matters within each "block" of intervals (perhaps a few thousand) that the algorithm processes, and each "block" is far too short a period of time for detectable change in rate.

Date: 2002-11-26 04:18 am (UTC)
From: [identity profile] giolla.livejournal.com
For the nucleus, every second is a game of Russian Roulette, and there are always the same number of bullets in the barrel.
Not a terribly good example, that, as Russian Roulette isn't really terribly random and it's quite easy to manipulate the chances of being shot in any given game.

Date: 2002-11-26 05:02 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
I think you're missing something important about analogies.

Date: 2002-11-26 06:03 am (UTC)
From: [identity profile] giolla.livejournal.com
Analogies only work if the situations are analogous; I was observing that in this case they weren't. It would make as much sense if you'd said:
For the nucleus, every second is a game of darts.
or
For the nucleus, every second is a game of dominoes.
etc.

Date: 2002-11-26 10:54 am (UTC)
From: [identity profile] mortice.livejournal.com
...quite easy to manipulate the chances of being shot in any given game

You been on the training course?

:-)

Date: 2002-11-27 02:26 am (UTC)
From: [identity profile] giolla.livejournal.com
Odd you should say that...
wasn't so much a training course as an analysis

Re:

Date: 2002-11-27 05:42 am (UTC)
From: [identity profile] mortice.livejournal.com
wasn't so much a training course as an analysis

Hmm.. just imagining a room full of thousands of people with guns to their heads, and some scientists with clipboards...

Date: 2002-11-28 01:43 am (UTC)
From: [identity profile] giolla.livejournal.com
Only one gun - educational budgets and all that - 3,120 games.

Re:

Date: 2002-11-28 05:44 am (UTC)
From: [identity profile] mortice.livejournal.com
..and how many heads?

Must have got a bit messy by the end if there was only one...

Date: 2002-11-28 05:53 am (UTC)
From: [identity profile] giolla.livejournal.com
Just a hypothetical one. Oddly, even using blank-firing rounds, volunteers were not forthcoming.

Date: 2002-11-26 06:29 am (UTC)
From: [identity profile] keirf.livejournal.com
There's something I don't get about the HotBits description - are they measuring consecutive intervals, and are they therefore comparing the first interval to the second, and the second interval to the third, and the third to the fourth, and so on? I.e. the bit sequence is:

myfun(t1/t2), myfun(t2/t3), myfun(t3/t4) ...

fun myfun(x) {
x := int(x)
if (x > 0) { return 1 } else {return 0 }
}


And so the problem is that if you have a short interval the next interval is more likely to be longer, and if you have a long interval the next is more likely to be shorter, so in the sequence a 1 is more likely to follow a 0 and vice versa (but not guaranteed to, hence the randomness)? Or to put an extreme case, if you had a sequence of 100 1s, the odds would be very high that the next element would be 0 (because all the tns would be getting longer and longer, and therefore t99 would be considerably higher than the median). Or am I totally wrong here...

(NB, I know this isn't dealing with the recovery period for the counter, but I'm still trying to get to grips with the basics.)

Date: 2002-11-26 07:03 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
No, it's

myfun(t_1 - t_0), myfun(t_3 - t_2), myfun(t_5 - t_4)...

That's why it only produces 0.5 bits per sample. This strategy copes very well with the detector recovery stuff - it just doesn't produce a lot of data...
(deleted comment)

Date: 2002-11-26 08:18 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
That might be the myfun you meant to write, but it's not the one you wrote, which compares to zero!

And note that I'm not comparing (t_0, t_1) and then (t_1, t_2) - I'm doing (t_0, t_1), (t_2, t_3). So it doesn't have the problem you state, but it only produces a bit from every other interval.

Date: 2002-11-26 08:55 am (UTC)
From: [identity profile] keirf.livejournal.com
Sorry - realised my mistake and deleted my post, but you'd already got there and answered. :-)

I've got it now (god, my brain must be slowing down). I'm overlooking the small details in programming, which probably suggests I've been in management too long.

So the image below has a top line with the actual particle emissions (and times marked in red), and a bottom line with what the geiger counter registers (and times marked in blue).



And because of this the ts are longer than they should be (just as you could get two particles being emitted simultaneously). But that doesn't matter if you compare consecutive pairs. Apologies for being so slow today.

Can you keep running the algorithm over sub-blocks of sub-blocks and so on?

Date: 2002-11-26 09:10 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
The sub-blocks need to contain at least three samples for it to produce any random bits. Apart from that, yes...

Date: 2002-11-27 11:34 am (UTC)
From: [identity profile] naturalborn.livejournal.com
You're throwing out way more information than you have to. Normalizing the input directly is a little silly; you should instead simply estimate how much entropy is being produced, then hash up all the information you get until it reaches 160 bits, and spit out the sha1 of that in one burst.
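A minimal sketch of that pool-and-hash idea (with a caller-supplied conservative entropy estimate; this isn't anyone's actual implementation):

import hashlib

class EntropyPool:
    # Accumulate raw samples, keep a conservative running estimate of
    # the entropy they carry, and emit the SHA-1 of the pool once the
    # estimate reaches 160 bits.
    def __init__(self, bits_per_sample):
        self.bits_per_sample = bits_per_sample
        self.pool = hashlib.sha1()
        self.estimate = 0.0

    def add(self, sample):
        self.pool.update(repr(sample).encode())
        self.estimate += self.bits_per_sample
        if self.estimate >= 160:
            digest = self.pool.digest()
            self.pool = hashlib.sha1(digest)   # chain the old pool into the new one
            self.estimate = 0.0
            return digest
        return None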

It's also unnecessary to get your bits directly from the source - it's much more efficient to get a seed of 'truly' random bits, then use either AES in counter mode or the Mersenne Twister (depending on your cryptographic needs, or lack thereof) to generate later random numbers.

I'm very bitter about /dev/random blocking unnecessarily. It's a good example of something really basic which the crypto community as a whole can't get its act together on.

Date: 2002-11-27 12:16 pm (UTC)
From: [identity profile] ciphergoth.livejournal.com
Thanks for mentioning this, I should have made this clear from the start. I concur completely that for most practical applications the Yarrow-like approach you refer to is more appropriate than this. (I can't see the point of MT in this context - if you don't need cryptographic quality numbers, why go to such effort to get randomness at all?)

However, while in practice I have great faith in SHA-x and AES, their security is unproven. In fact, not only can we not prove that SHA-x is secure for this purpose - we don't even know how to state in precise formal terms what we're assuming about it.

The discussion arose out of how best to gather random bits for applications with unconditional provable security, such as the One Time Pad. I would nearly always recommend against using the OTP, which in most practical situations is less secure than more usual approaches, but I was intrigued by the engineering challenge of how one might gather bits for use in one while preserving the unconditional security.

In practice there are better sources of randomness than Geiger counters full stop, but if you want the strongest theoretical guarantees then an approach like this is interesting. It might also be useful as a conservative entropy estimator.

On your other point, /dev/random wasn't written by a cryptographer. I don't know if I see the point of /dev/random blocking myself - but then there's always /dev/urandom. AFAIK the Yarrow paper is still the best reference we have on practical design of an entropy-gathering CPRNG. You can't hold the community responsible for something that really isn't our responsibility - it's Ted Ts'o's.

Date: 2002-11-27 07:28 pm (UTC)
From: [identity profile] pavlos.livejournal.com
Yes, well, I was tempted to mention: Thermal noise, say from a camera or simple amplifier, would probably be a much more plentiful source of randomness than a geiger counter, except that geiger counters are sexier (in a retro 1940's sort of way).

Pavlos

/dev/random

Date: 2009-08-04 11:56 pm (UTC)
From: [identity profile] ka9q.livejournal.com
I claim credit for inventing the Linux /dev/random algorithm. It's possible that I did this independently and after someone else, but Ted Tso, the implementer of /dev/random in Linux, got the idea from me.

I remember big arguments over how to estimate the entropy in the pool. I maintained that it didn't really matter because, if you use a good hash function (which may not be as easy as it once sounded), the generator degrades from "true" randomness to "practical" randomness -- an attacker can't predict the future sequence given reasonable CPU, which is all you can really ask for in any cipher.

It wasn't my idea to block /dev/random when the entropy falls too far; but as you say we also have /dev/urandom.

So yeah, the right way to use a Geiger counter to generate random bits is to strobe a fast clock with it, then hash the clock. And hash every other external event you can get your hands on: disk interrupts, keyboard interrupts, network interrupts, etc. Hashing any source of randomness "distills" its entropy regardless of its distribution, and if it's not really random then you've only lost a little CPU time.
From: [identity profile] silpol.livejournal.com
please open http://www.amazon.com/exec/obidos/ASIN/0201615983 at pages 155-156 ;)
From: [identity profile] ciphergoth.livejournal.com
And more than four years after your comment, I am finally able to put my hands on my copy. It mentions geiger counter data and describes the Von Neumann unbiasing strategy. I'm not sure I see why you mention it - say more?
From: [identity profile] silpol.livejournal.com
I wish I had that long memory :) I have to reach office and take my copy in hands so I could try to recall what I had meant 4 years ago :)
