The Birthday Problem

And why it matters

Published: 2022-09-27

photorealistic render of the numbers 3,1,4, having a birthday party with hats and balloons. The numbers are upstanding and made out of cake.

It’s a well-known fact that all the best people are born in September. Yours truly is evidence of that. As my birthday approached this year, I found myself thinking on and utilizing one of the more memorable paradoxes from my probability & statistics class: The Birthday Problem.

How many people, chosen at random, would you need to put in a room for it to be likely that two persons share a birthday?

Intuition may initially lead us to consider the answer to be on the order of the number of days in the year. Maybe 200 people? 100?

Well, if the answer were that straightforward, it wouldn’t be worth discussing. Spoiler alert; the answer is 23. It seems incredulous, doesn’t it, that in a room of 23 people, there’s a more than 50% chance that two people share the same birthday? The proof is a relatively straightforward combinatorics problem, and while I’d love to document it here, the Wikipedia entry does a good job (if you’ve done any discrete math/counting). Suffice it to say that the number starts becoming more believable when we remember that we’re comparing birthdays between every possible pair. With 23 people, there are 23 × 22 / 2 = 253 possible pairings — more than half the days in the year. If anyone does want to discuss the math, let me know - I’d be happy to do so in a Twitter space or similar.

Why it matters

Why am I talking about this? Is this relevant in computer science, or am I just some nerd who thinks about weird stuff on his birthday? Yes, and yes.
The birthday problem came up a few weeks ago when I was solving an issue at work, demonstrating its usefulness.

The scenario:

I needed to generate a unique ID. That’s a common enough need. As developers, we often need a unique ID, whether it’s to make a primary key in a database, to link to a user, to create a URL shortener, to have a trace ID for logs, etc. In Golang, a common package to use is UUID. It follows specific standards and gives you a “Universally Unique Identifier”.
However, in this case, I couldn’t use that package because the UUID generated was too long for my needs.

One of the reasons UUIDs are so long — the string representation is 36 characters (or 32 in raw hex) — is that they follow a standard in generating 128-bit IDs. My use case didn’t care about any of the UUID standards. I looked at the package and decided that I could simply replicate its process with fewer bits to be under the limit. But then the question becomes: how do I compare solutions? How do I know what’s “good enough”? You may be thinking, “Why not just trim the UUID to the size you need?” But that raises similar questions about knowing how good the solution is (with a more complex answer). The critical functionality of UUIDs is their “uniqueness.” How can I be sure I won’t generate duplicates? How many IDs can I generate before a collision becomes likely? If I cut a UUID in half, does that mean my resulting ID is now half as good? Nope. The answer is much… much worse than that. But this is where the birthday problem steps in to answer all our questions.

Fundamentally, the birthday problem is about the likelihood of collisions (duplicates, in our case). It tells us how many IDs we can generate before duplicates become likely. For UUID (version 4), the number of IDs that needs to be generated to have a 50% probability of at least one collision is 2.71 quintillion (2.71 × 10¹⁸). To put that in relatable terms, the UUID package states:

Randomly generated UUIDs have 122 random bits. One’s annual risk of being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10⁻¹¹), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate.

So then we get back to my problem: How to describe to my leads/reviewers how good my solution is compared to the standard UUID package. I leveraged the birthday problem to calculate the collision probabilities with fewer bits. After I concluded that work, I realized that the Wikipedia entry had already made a probability table. I was considering using 64 bits to randomly generate IDs. In terms of uniqueness, I would need to generate 5 billion IDs for a duplicate to be likely. Or, to paraphrase the UUID package:

One’s annual risk of dying by falling out of bed is estimated to be one chance in two million, which means the probability is about 5 × 10^-7, equivalent to the odds of creating a few million IDs in a year and having one duplicate.

That was objectively more “uniqueness” than our use case would ever warrant, and I had the numbers to show it.
After generating the number ID, I also did some base conversion to further condense the string representation. UUIDs use hexadecimal, but I had no reason not to use the rest of the alphabet. So the final ID was in base 36.

I also thought of a few other approaches to generating IDs, such as being time-based. I won’t go into further detail here, but I created a library with various options. I’m happy to discuss anything there as well.

Finally, if you want to see how the birthday problem relates to hashes, cryptography, and security, look into birthday attack.

As always, I hope this read has been helpful to you — or at least enjoyable.