Wednesday, July 29, 2009

How many errors does this book contain?


That's a question every author anxiously asks him/her/itself, as soon as the manuscript goes to press. Well, here is what happened regarding the Little Purple Book. By the way, the current list of errata appears on the Book's Page.

As errors and typos were being communicated by readers and carefully recorded, it appeared that it took till the thirtieth (30th) notification before a typo was reported twice. What does this tell us about the total number N of typos in the book?

Naturally, N is unknown. Assume that errors are reported uniformly at random. You'll recognize now a case of the well-known "birthday surprise" (also known as "birthday paradox"). Namely, on Earth (where N=365), it takes about 23 persons (on average, or with probability larger than 1/2) to obtain a sample where two people will have the same birthday. Now, our problem with estimating the number of errors in the book is of the same nature, but with a year that would have N days, instead of 365. The answer is as follows: the number of persons it takes to first obtain a colision of birthdates is on average approximately sqrt{pi N/2}. A discussion can be found on pages 114-115 of AC; this problem is a nice illustration of the power of exponential generating functions.

Let us say we observe that the first collision occurs on the rth trial. Then, we equate sqrt{\pi N/2} with r, and cleverly "deduce"(!) that N should be approximately 2r^2/pi. Fair enough, eh? With r=30, this predicts a number of errors of the order of 600 (the exact solution is 572.95...).

If we actually simulate the experiment, fixing the number N of errors, we find that this formula does predict the rough order of magnitude of N, but with a systematic multiplicative bias. What is going wrong here?

The problem is that the expectation of a square and the square of an expectation are not the same thing. Here, analysis teaches us that the expectation of the square of the time it takes to reach the first collision is asymptotic to 2N. Thus the "right" (i.e., asymptotically unbiased) estimator is in fact r^2/2. See AC, Note VI.25, p. 417, for a discussion. This leads us to infer that we should expect the actual number of errors in the Little Purple Book to be closer to 450, rather than the infamous 600. Soothing, isn't it?

Of course, there are limitations. First, the statistical procedure is not very accurate: its accuracy can be gauged by estimating the expectation of the fourth power of r, in itself a less trivial math exercise. Worse, we expect the figures, 450 and even 600, to be gross understimates since errors are not picked up uniformly at random --the ones closer to the beginning are more likely to be found than the ones in arcane corners of the book. So, a mistake per page sounds more like a reasonable possibility, and the book has 824 pages in total... Depressing!

Where does the idea of such an estimator come from? I first heard it more than twenty years ago from Gilles Brassard, while he was preparing his book with P. Bratley on algorithms. In order to determine how many balls an urn has, repeatedly pick up a ball at random (shake the urn!), mark the ball just picked up (say, with a touch of paint), and repeat until an already marked ball surfaces. Return half the square of the number of steps. Et voila!


To be honest, till to-day, I couldn't think of a real case of application of Brassard's algorithm and I was viewing it mostly as a thought experiment, a gedanken experiment, as a physicist would say. Perhaps we have here the first actual use of Brassard's algorithm. At any rate, his sampling-based algorithm has pedagogical value: it demonstrates that you can really achieve something with sublinear resources. Refer to the many probabilisic stream algorithms that are now known, the current research on property testing, and so on.