Benford's Law

 

Introduction

What's the first digit of the Rydberg constant?

Ok - so we might be familiar with the Balmer formula for spectral lines of the hydrogen atom and know the value is:

10 973 731.568 525 m^-1

But what if we were totally ignorant?

What if we didn't - god forbid - even know the units the Rydberg constant is measured in?

What would we have guessed then?


Initial Thoughts...

Well, our initial tendencies might be to guess the distribution is uniform over each of the possible digits.

But convention dictates that numbers do not begin with the digit '0'. So we might guess the distribution is uniform over the digits 1-9.

Can we prove this?


A More Principled Approach

If we are truly ignorant, our guess at the initial digit should be independent of the units the quantity is actually measured in.

This means the probability distribution representing our beliefs in the value of the Rydberg constant - x - should have the same form when it is multiplied by a scaling factor:

P[u(x)]=P[x]|dx/du(x)|

Hence:

P[kx]=P[x]/k

By inspection:

P[x]=A/x

Or, as we might have expected:

P[ln(x)]=A
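The "by inspection" step can be checked in three lines. This is a sketch of the missing working, assuming a positive scaling factor k and writing y = ln(x) for the log variable (y is a symbol introduced here, not in the slides):

```latex
% Rescaling u = kx with k > 0, so |dx/du| = 1/k:
P[kx] = P[x]\,\left|\frac{dx}{du}\right| = \frac{P[x]}{k}

% The trial solution P[x] = A/x satisfies this condition:
P[kx] = \frac{A}{kx} = \frac{1}{k}\,\frac{A}{x} = \frac{P[x]}{k}

% Changing variables to y = \ln x, so dx/dy = x:
P[y] = P[x]\,\left|\frac{dx}{dy}\right| = \frac{A}{x}\,x = A
```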


Scale Invariant distribution

Casting mathematical rigour to the winds, we have recovered a uniform distribution, but on a logarithmic rather than a linear scale:

In the log space a change of units results in a translation, and a uniform distribution is translation invariant.
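The translation argument can be tried numerically. The sketch below draws samples that are uniform in log10(x) over a whole number of decades, then changes units by an arbitrary factor and compares first-digit frequencies. The helper names, the factor 3.7, the six decades and the sample size are all illustrative assumptions.

```python
import math
import random


def first_digit(x):
    """Leading significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def log_uniform_sample(n, decades, rng):
    """Draw n values whose log10 is uniform on [0, decades)."""
    return [10 ** rng.uniform(0, decades) for _ in range(n)]


def first_digit_frequencies(values):
    """Fraction of values starting with each digit 1..9."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [c / len(values) for c in counts[1:]]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = log_uniform_sample(100_000, 6, rng)

original = first_digit_frequencies(samples)
# A change of units multiplies every value by the same constant (3.7 is arbitrary).
rescaled = first_digit_frequencies([3.7 * x for x in samples])

for d, (a, b) in enumerate(zip(original, rescaled), start=1):
    print(f"digit {d}: original {a:.3f}, rescaled {b:.3f}")
```

Up to sampling noise, the two columns should agree: the rescaling only shifts the samples along the log axis.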

Let's look at this distribution, and the distribution of numbers starting with each digit.

 

In the picture we see a repeating pattern where:

  • the length occupied by the numbers 10-19 is equal to that of 100-199, 1000-1999 etc.
  • the length occupied by the numbers 20-29 is equal to that of 200-299, 2000-2999 etc.
  • and identically for 3/4/5/6/7/8/9
  • the length occupied by the numbers 10-19 is longer than that of 20-29
  • the length occupied by the numbers 20-29 is longer than that of 30-39
  • and so on, with the lengths shrinking steadily up to 90-99

Benford's Law

The probability that our unknown constant starts with a 1 is therefore proportional to the length occupied by such numbers on the log scale:

P[D1=1]=[ln(2)-ln(1)]/[ln(10)-ln(1)]=ln(2)/ln(10)~30%

Generalising for each of the first digits:

P[D1=d]=[ln(d+1)-ln(d)]/[ln(10)-ln(1)]=log10(1+1/d)
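As a quick check of the formula, the first-digit probabilities can be tabulated directly. A minimal sketch; the function name benford_first_digit is an illustrative choice:

```python
import math


def benford_first_digit(d):
    """P[D1 = d] = log10(1 + 1/d) for a first digit d in 1..9."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


probabilities = [benford_first_digit(d) for d in range(1, 10)]

for d, p in enumerate(probabilities, start=1):
    print(f"P[D1={d}] = {p:.3f}")

# The nine log-scale lengths tile one full decade, so they sum to 1.
print(f"total = {sum(probabilities):.3f}")
```

The probabilities fall from about 30.1% for a leading 1 to about 4.6% for a leading 9.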

Let's have a look at this distribution, and compare it with the actual distribution of physical constants:

 

Benford's law holds up.


The Full Distribution

By considering further sub-divisions of the logarithmic intervals, we can derive the full joint distribution:

P[D1=d1,D2=d2,...,Dk=dk]=log10[1+(sum_{i=1..k} di*10^(k-i))^-1]

From which we find: P[D2=2|D1=1]=0.115 and P[D2=2]=0.109

The digits are CORRELATED, which can also be seen by considering the logarithmic number line.

(This correlation falls off with distance)
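The two numbers quoted above can be reproduced from the joint formula. In this sketch the leading digits d1 d2 ... dk are read as the integer n, so the joint probability is log10(1 + 1/n); the function name benford_joint is an illustrative choice.

```python
import math


def benford_joint(digits):
    """P[D1=d1, ..., Dk=dk] = log10(1 + 1/n), where n is the integer d1 d2 ... dk."""
    if not 1 <= digits[0] <= 9:
        raise ValueError("the first digit must be between 1 and 9")
    if any(not 0 <= d <= 9 for d in digits[1:]):
        raise ValueError("later digits must be between 0 and 9")
    n = 0
    for d in digits:
        n = 10 * n + d  # build sum of d_i * 10^(k-i)
    return math.log10(1 + 1 / n)


# Conditional probability: P[D2=2 | D1=1] = P[D1=1, D2=2] / P[D1=1]
p_d2_given_d1 = benford_joint([1, 2]) / benford_joint([1])

# Marginal probability: sum the joint over every possible first digit
p_d2 = sum(benford_joint([d1, 2]) for d1 in range(1, 10))

print(f"P[D2=2|D1=1] = {p_d2_given_d1:.3f}")
print(f"P[D2=2]      = {p_d2:.3f}")
```

The conditional and marginal values differ, which is the correlation between digits.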


Empirical Observations of Benford's law

Benford's law is not restricted to physical constants and logarithmic tables. Here are some other examples:

  • file sizes on your computer
  • annual turnovers (fraud detection)
  • population data
  • the distribution of house numbers

Why does the law work in these cases?


Hill (1997)

Consider how the numbers with first digit 1 are distributed along the number line.

Ones always come first:

 

Any decaying density over these numbers will generate samples with more leading 1s than any other leading digit, at every order of magnitude.

Peaks can even be tolerated, but only if the distribution is skewed enough and covers multiple powers of ten.

This is an intuition as to why an approximate form of Benford's law will hold.

Hill's (1996) rigorous proof illustrates why Benford's law is obeyed so exactly by a large number of statistics:

If distributions are selected at random, and random samples are taken from each of these distributions, the significant digits of the combined sample will converge to the logarithmic distribution.
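A toy simulation gives a feel for this statement. It is only a sketch under loud assumptions: the three families (uniform, exponential, lognormal), their parameter ranges and the broad lognormal spread of scales are invented here for illustration and are not taken from Hill's paper. The broad spread of scales stands in for the requirement that the random choice of distributions is not biased towards any particular scale.

```python
import math
import random


def first_digit(x):
    """Leading significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def random_distribution(rng):
    """Pick a family and a scale at random (illustrative choices) and return a sampler."""
    scale = rng.lognormvariate(0.0, 3.0)  # scales spread over several decades
    family = rng.choice(["uniform", "exponential", "lognormal"])
    if family == "uniform":
        return lambda: rng.uniform(0.0, scale)
    if family == "exponential":
        return lambda: rng.expovariate(1.0 / scale)
    sigma = rng.uniform(0.1, 1.0)
    return lambda: scale * rng.lognormvariate(0.0, sigma)


def combined_first_digit_frequencies(n_distributions, n_samples, rng):
    """Pool samples from many randomly chosen distributions and count first digits."""
    counts = [0] * 10
    for _ in range(n_distributions):
        draw = random_distribution(rng)
        for _ in range(n_samples):
            x = draw()
            if x > 0:  # skip the (very unlikely) exact zero
                counts[first_digit(x)] += 1
    total = sum(counts[1:])
    return [c / total for c in counts[1:]]


rng = random.Random(1)  # fixed seed so the run is repeatable
observed = combined_first_digit_frequencies(2000, 50, rng)
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

for d, (o, b) in enumerate(zip(observed, benford), start=1):
    print(f"digit {d}: observed {o:.3f}, Benford {b:.3f}")
```

No single family here follows the logarithmic law on its own (a uniform distribution on a fixed interval certainly does not), yet the pooled sample should land close to it.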