I was looking at the list of US cities by population on Wikipedia yesterday, because I noticed that Sunnyvale, a suburb of San Jose that I had occasion to visit, had a surprisingly large population of 140,095. There are a lot of places like this in California: despite having about 12% of the US population, it has 64 of the 275 largest cities (all those with population above 100,000), or about 23%.

And among those 275 cities there are three pairs with the same population in the 2010 Census:

• Fargo, North Dakota and Norwalk, California, both at 105,549
• Aurora, Illinois and Oxnard, California, both at 197,899

Of course census data shouldn’t actually be taken to be exact. But how many pairs like this would we expect?

The starting point here is Zipf’s law for cities, or the rank-size rule. This rule states that the nth largest city in a region will have population 1/n times that of the largest city. As it turns out, this isn’t quite true for the structure of cities in the US, but they do roughly follow a power law. If we regress log(population) against log(rank), we get the regression line

$\log(pop) = 15.6103 - 0.7287 \log(rank)$

or, if we exponentiate both sides,

$pop = 6018207 \times rank^{-.7287}$

For example, we predict that the hundredth-largest city should have population $6018207 \times 100^{-0.7287} = 209926$. The actual hundredth-largest city is Spokane, Washington, with population 208916. See below for a graph of city size vs. city rank:
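As a quick check of the arithmetic, here is a minimal sketch that evaluates the fitted line using only the two coefficients quoted above. The names `INTERCEPT`, `SLOPE`, and `predicted_population` are mine, chosen for illustration; the underlying city data is not reproduced here.

```python
import math

# Coefficients quoted in the text for the regression of
# log(population) on log(rank); natural logarithms are assumed.
INTERCEPT = 15.6103
SLOPE = -0.7287


def predicted_population(rank):
    """Population the fitted power law predicts for a city of the given rank."""
    return math.exp(INTERCEPT + SLOPE * math.log(rank))


# Exponentiating the intercept recovers the constant in front of the power law.
a = math.exp(INTERCEPT)
print("a =", round(a))
print("predicted population at rank 100 =", round(predicted_population(100)))
```

The two printed values should land close to the 6018207 and 209926 quoted above; small differences come from the intercept being rounded to four decimal places.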

Because I don’t want to write out these numbers over and over, I’m going to rewrite that as $p = a r^{-b}$, and plug in the numbers at the end. Now let’s invert this relationship. How many cities do we expect to have population greater than some constant $p$? That’s just the rank that corresponds to $p$. Solving for $r$ gives $r = (p/a)^{-1/b}$. Let’s write this as $r = f(p)$.

The expected number of cities having population exactly $p$ is then

$-f^\prime(p) = a^{1/b} {1 \over b} p^{-(1+1/b)}$

Taking the derivative here is actually the crux of the analysis, so I’ll elaborate a bit. The expected number of cities having population at least p is $f(p)$; the expected number of cities having population at least p+1 is $f(p+1)$. The expected number of cities having population exactly p, then, is $f(p)-f(p+1) = -(f(p+1) - f(p))$. But $f(p)$ varies slowly so we can approximate $f(p+1) - f(p)$ by $f^\prime(p)$. Let $g(p) = -f^\prime(p)$ for later ease of notation.

Roughly speaking, $g(p)$ is the density of cities per unit population, at p. For example, if we let p = 105,000 we get that we expect 0.0034 cities of population 105,000. Extrapolating to the range from 100,000 to 110,000, we expect 10,000 times this many cities, or 34, in that population range; there are in fact 39.
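The numbers in this paragraph can be reproduced with a short sketch. It assumes the fitted values of a and b from above; the function names `f` and `g` simply mirror the notation in the text.

```python
# Fitted power-law parameters quoted earlier in the post.
A = 6018207.0
B = 0.7287


def f(p):
    """Expected number of cities with population at least p (the inverted rank-size rule)."""
    return (p / A) ** (-1.0 / B)


def g(p):
    """Density of cities per unit population at p, i.e. -f'(p)."""
    return A ** (1.0 / B) / B * p ** (-(1.0 + 1.0 / B))


density = g(105_000)
print("g(105000) =", density)

# Extrapolate the density across the 10,000-wide band, as in the text.
print("10000 * g(105000) =", 10_000 * density)

# The same count computed without the derivative approximation.
print("f(100000) - f(110000) =", f(100_000) - f(110_000))

# The one-step difference f(p) - f(p+1) should be nearly identical to g(p).
print("f(105000) - f(105001) =", f(105_000) - f(105_001))
```

Both ways of counting the cities between 100,000 and 110,000 should come out near 34, which suggests the derivative approximation loses very little at this scale.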

So now take this expected value, and figure that the actual number of cities of population p is a Poisson random variable with mean $g(p)$. The probability that such a random variable is equal to 2 is $e^{-g(p)} g(p)^2/2$. Since $g(p)$ is very close to 0, I’ll drop the exponential term in what follows. Furthermore for ease of calculation, let’s assume these Poissons are never greater than 2. For example, the probability that a Poisson with mean 0.0034 is at least 2 is exactly

$1 - e^{-0.0034} (1 + 0.0034) \approx 5.767 \times 10^{-6}$

and I use the approximation $0.0034^2/2 = 5.78 \times 10^{-6}$. The number of pairs of cities with population greater than c and the same population is then predicted to be
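To see how little the simplification costs, here is a small sketch comparing the exact Poisson tail probability, $1 - e^{-m}(1+m)$ for mean $m$, with the $m^2/2$ approximation. The function names are illustrative, and `math.expm1` is used only to avoid cancellation error when the mean is tiny.

```python
import math


def poisson_at_least_two(mean):
    """Exact P(X >= 2) for X ~ Poisson(mean): 1 - e^(-mean) * (1 + mean)."""
    # 1 - e^(-m) - m * e^(-m), with expm1 giving an accurate 1 - e^(-m).
    return -math.expm1(-mean) - mean * math.exp(-mean)


def pair_approximation(mean):
    """Approximation used in the text: drop the exponential and ignore counts above 2."""
    return mean ** 2 / 2.0


m = 0.0034
exact = poisson_at_least_two(m)
approx = pair_approximation(m)
print("exact  =", exact)
print("approx =", approx)
print("relative error =", (approx - exact) / exact)
```

For a mean this small the two values should agree to within a fraction of a percent, so replacing the exact probability with $g(p)^2/2$ is harmless here.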

$\sum_{p \ge c} g(p)^2/2$

but I’d rather do an integral instead of a sum, so we’ll approximate this as

$\int_{c}^\infty g(p)^2/2 \: dp$.

Recalling that $g(p) = {a^{1/b} \over b} p^{-(1+1/b)}$, we get

$\int_c^\infty {a^{2/b} \over 2b^2} p^{-(2+2/b)} \: dp$

and doing the integral gives

${a^{2/b} \over 2b^2} {b \over b+2} c^{-(1+2/b)}$
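
For anyone who wants the intermediate step spelled out: the constant factor comes outside the integral, and the remaining power of $p$ integrates directly. The integral converges because the exponent $2 + 2/b$ is greater than 1 for any $b > 0$.

$\int_c^\infty p^{-(2+2/b)} \: dp = {c^{-(1+2/b)} \over 1+2/b} = {b \over b+2} c^{-(1+2/b)}$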

Plugging in the values from above, c = 100000, a = 6018207, b = 0.7287, gives 0.1924. So the expected number of such coincidences is about one-fifth; in the 2010 census it was three.
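The closed form is easy to evaluate directly, and it can be compared against the sum it replaced. The sketch below does both. The function names are mine, and the upper cutoff of 1,000,000 in the sum is an assumption made for speed: it leaves out a small tail, so the two numbers should agree closely but not exactly.

```python
# Fitted power-law parameters quoted earlier in the post.
A = 6018207.0
B = 0.7287


def g(p, a=A, b=B):
    """Density of cities per unit population at p."""
    return a ** (1.0 / b) / b * p ** (-(1.0 + 1.0 / b))


def expected_pairs(c, a=A, b=B):
    """Closed-form value of the integral of g(p)^2 / 2 from c to infinity."""
    return a ** (2.0 / b) / (2.0 * b * b) * (b / (b + 2.0)) * c ** (-(1.0 + 2.0 / b))


def expected_pairs_by_sum(c, upper=1_000_000, a=A, b=B):
    """Direct sum of g(p)^2 / 2 over integer populations c <= p < upper (truncated tail)."""
    return sum(g(p, a, b) ** 2 / 2.0 for p in range(c, upper))


print("closed form, c = 100000:", expected_pairs(100_000))
print("direct sum,  c = 100000:", expected_pairs_by_sum(100_000))
```

Passing a different threshold `c` to `expected_pairs` shows how quickly the expected number of ties grows as smaller cities are included.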

If you compare data from 2000 the first such coincidence is at rank 467 – Royal Oak, MI and Bristol, CT both had population 60,062 that year. (Note: I scanned the data by eye, so it’s possible I missed something.) You expect to start seeing coincidences this far down; plugging in c = 60000 with the 2010 coefficients gives 1.3. (Properly speaking I should use the 2000 coefficients, but I’d have to compute them first.) So 2010 is probably unusual. Still, I can’t help but suspect that the Census might be fudging the data a little bit to make these cities tie so that the lower-ranked member of each couplet doesn’t complain…
