Thinking about all of the things in the statistical world that we can estimate, the one that has always perplexed me is estimating the size of an unknown population \(N\). Usually when we compute estimates based on samples, we involve the size of the sample \(n\) somewhere, thus we take “size” for granted — the size of a sample is known. We also make inferences based on sample statistics using theory such as the Central Limit Theorem, but seem to never care about the population size \(N\), we either know it, or assume it is infinite. But, in fields like ecology and environmental studies, this attitude of gluttony is dangerous!

In the summers I spend a lot of time in the Eastern Sierra where there are deer and bears. These animals are so large that we cannot consider their population in the area to be infinite (like we may with ants or bacteria). A matter of fact, one of our local animal behavior specialists knows the exact number of bears that live in my mountain community. I always had just assumed he spent all day tracking down the local bears and marking them with radio collars. **There must be some way to estimate the population size, but how?** This will be the first aspect of the problem we will discuss. Then we will talk about how to formally derive an estimate of the population. The computation, and even the probability distributed used, is quite niche and really piqued my interest. Finally, I conclude with some wisdom on why we even need to go through this theoretical process, something I did not appreciate in school.

Two deer enjoying a nice day across the street. |
A bear fishing in Lake Mary. |
A (sedated) bear walks past me at Lake Mary. |

### The Capture-Recapture Design

One nugget of information I had learned in a course about web data mining and search taught by Professor John Cho, was a method called *capture-recapture*. Oddly enough, it was used to estimate the number of web pages on the Internet^{1}. A short while later, I came across this same mechanism in a textbook about Markov Chain Monte Carlo. The mechanism is called *capture-recapture*, but goes by many other names such as *mark-recapture* and others.

How it works for animals such as my woodland friends:

- Capture some number of bears, deer, whatever. The number itself doesn’t matter.
- Mark these animals somehow to remember that we have seen them. For animals, this means physically marking them (birds, snails, squirrels) or putting radio collars on them (bears, deer and mountain lions).
- Release them back into the wild.
- Allow enough time to pass so that the animals disperse into their natural habitat.
- Capture a sample of \(n\) animals and count the number of animals that are marked.

For estimating the number of web pages on the Internet, it might look like the following:

- Collect a random sample of web pages using a random walk (see
^{2},^{3}), or assume a week of search queries and the link clicked is a sample of the web (see^{4}). - Store the page address.
- Crawl, or collect, a sample again.
- Count the number of pages in the new sample that we have already seen.

### Estimating \(N\) with a Simple Proportion

We can simply form an 8th grade proportion of the following form, where we let \(N\) represent the unknown population size, \(n\) represent the number of animals we captured at the second sampling point, \(K\) be the total number of animals we marked and \(k\) be the number of tagged animals we counted in the recaptured sample.

\[

\frac{k}{n} = \frac{K}{N}

\]

This yields the estimator:

\[

\hat{N}_{\tiny\mbox{EST}} = \frac{Kn}{k}

\]

This estimator is actually called the Lincoln index and assumes that only one marking and one recapture takes place, and that the population is “closed.” A “closed” population is one that does not lose animals due to death or migration and does not gain any from birth or migration. In my area, the population is mostly closed throughout the year if we perform the mark and recapture with one day or so. During migration periods and mating periods, this assumption is not as reliable. Being in the mountains, targeted or clustered deaths of these larger animals is not common.

### Computing the Maximum Likelihood Estimator of \(N\)

Consider the formulation of the problem again. We have some population \(N\), and from that population we draw a sample of \(n\) objects. That sample itself is divided into two subgroups: marked objects \(k\), and unmarked objects \(n-k\). This sounds like the classic colored balls in urns problem where we want to find the probability that we select \(k\) of the \(K\) red balls when we draw a sample of \(n\) balls from an urn containing \(N\) balls. If this smells **hypergeometric** to you, you would be right.

The hypergeomtric is a discrete probability distribution with the following probability density function:

\[

P(X = k) = \frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N – K \\ n – k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}

\]

Usually when we construct a likelihood function, we take the product of the joint PDF over all the observations. Since we are estimating a population estimate, and it’s not based on a sample of observations and instead just one observation, we represent the likelihood as just the PDF. Perhaps if we did multiple recaptures, this would be different. The likelihood function thus looks something like the following.

\[

\begin{align}

\mathscr{L}(N ; K, k, n) &= \frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N – K \\ n – k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}

\end{align}

\]

We could take the log of this likelihood function like we usually do, but we will see that this is not necessary. Recall that the beauty of the log-likelihood was that it converted all of the products into sums of logs. Since we are only working with one term, we don’t need to do this.

This is where the story gets weird. Usually we take the partial derivative of the log-likelihood function with respect to the parameter of interest, set it equal to zero and then solve for the estimate. There is a wrinkle in this case because the parameter of interest \(N\) is a non-negative integer. This means we cannot find the maximum likelihood estimate using calculus by taking the derivative with respect to \(N\). Instead, we look at the *likelihood ratio* of successive hypothetical values of \(N\). That is

\[

D(N) = \frac{\mathscr{L}(N)}{\mathscr{L}(N-1)} = \frac{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N – K \\ n – k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}}{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N – K – 1 \\ n – k \end{array} \right)}{\left( \begin{array}{c} N – 1 \\ n \end{array} \right)}}

\]

The purpose of taking the partial derivative of the log-likelihood function was to determine the point where the log-likelihood (and thus the likelihood) function switches from being increasing to decreasing — an optimum. We want to find this same phenomenon with our likelihood ratio \(D\). We want to find where \(D\) switches from increasing to decreasing. That is, we want to find where \(D(N) > 1\) and \(D(N) < 1\). Note that for \(N > 1\), \(D(N) \ne 1\). I apologize for the formatting of the math; I wanted to save screen real estate. Anyway, we then have

\[

\begin{align}

D(N) &= \frac{\mathscr{L}(N)}{\mathscr{L}(N-1)} = \frac{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N – K \\ n – k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}}{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N – K – 1 \\ n – k \end{array} \right)}{\left( \begin{array}{c} N – 1 \\ n \end{array} \right)}} = \frac{\frac{\frac{K!}{(K-k)! x!} \frac{(N-K)!}{(n-k)!(N-K-n+k)!}}{\frac{N!}{(N-n)! n!}}}{\frac{\frac{K!}{(K-k)! k!} \frac{(N-K-1)!}{(n-k)!(N-K-n-1+k)!}}{\frac{(N-1)!}{(N-n-1)! n!}}} \\

& \mbox{Cancel the \({}_KC_k\) term as well as the \(n!\) terms and \((n-k)!\) terms.} \\

&= \frac{\frac{\frac{(N-K)!}{(N-K-n+k)!}}{\frac{N !}{(N-n)!}}}{\frac{\frac{(N-K-1)!}{(N-K-n -1+k)!}}{\frac{(N-1)!}{(N-n-1)!}}} = \frac{\frac{(N-K)! (N-n)!}{N! (N-K-n+k)!}}{\frac{(N-K-1)! (N-n-1)!}{(N-1)! (N-K-1-n+k)!}} = \frac{(N-K)! (N-n)! (N-1)! (N-K-1-n+k)!}{N! (N-K-n+k)! (N-K-1)! (N-n-1)!} \\

& \mbox{Apply variations of $y (y – 1)! = y!$} \\

&= \frac{(N-K)(N-K-1)! (N-n)! (N-K-1-n+k)! (N-1)!}{N(N-1)! (N-K-n+k)(N-K-n+k-1)! (N-K-1)! (N-n-1)!} \\

&= \frac{(N-K) (N-n)}{N (N-K-n+k)}

\end{align}

\]

which is a nice result, but we still are not done. We need to find where \(D(N)\) goes from being greater than 1, to being less than 1. So let’s continue.

\[

\begin{align}

D(N) = \frac{(N-K)(N-n)}{N(N-K-n+k)} &> 1 \\

(N-K)(N-n) &> N(N-K-n+k) \\

N^2 + Kn – NK – Nn &> N^2 -NK – Nn +Nk \\

N^2 – N^2 + Kn – NK + NK – Nn + Nn &> Nk \\

Kn &> Nk \\

\frac{Kn}{k} &> N \\

N &< \frac{Kn}{k}

\end{align}

\]

Thus \(D(N) > 1\) and \(\mathscr{L}(N) > \mathscr{L}(N-1)\) when \(N < \frac{Kn}{k}\) and thus \(D(N) < 1\) and \(\mathscr{L}(N) \le \mathscr{L}(N-1)\) when \(N > \frac{Kn}{k}\). So \(\hat{N} = \frac{Kn}{k}\) is the maximum likelihood estimator, right? Not quite.

According to the definition of a maximum likelihood estimator, the estimate must fall in the original parameter space, which is in the field of *integers*. The fraction \(\frac{Kn}{k}\) is likely not an integer, so it cannot be the maximum likelihood estimator. But don’t fret, this is very close to the MLE.

The correct maximum likelihood estimator is (drumroll) the *integer* part of this result.

\[

\hat{N}_{MLE} = \left[ \frac{Kn}{k} \right]

\]

Note that our original 8th grade proportion gave us the Lincoln index, which does not use the integer part. For small \(n\), the Lincoln index is biased. Shao^{5} shows that there is also a second candidate for the MLE, due to the increasing and decreasing nature of \(D(N)\) and the fact that the estimate must be an integer. This second candidate is simply \(\left[ \frac{Kn}{k} \right] + 1\), but there is no further discussion of this candidate in Shao’s work. References: ^{6} ^{7}

### So What Was the Point of Having to Compute the MLE?

When I was in graduate school, I used to ask the same thing all the time. We would spend 5-10 minutes deriving the fact that the sample mean is the maximum-likelihood estimator of the population mean under the normal distribution. I would ask myself “What is the point of this? Isn’t it obvious?”

In some ways it *is* obvious, but just like mathematicians have to prove obvious statements, so do statisticians. If somebody were to ask “how do we *know* that the 8th grade proportion yields a *good estimate*?” As a statistician, we would be expected to prove it rigorously. Consider that in our first derivation of the Lincoln index that we made a crucial assumption that the relationship between the sample markings and the population was *linear*. Aside from that, it came out of educated thin air. We also have no idea about *how good* of an estimator our result is, and because we essentially just assumed it, we cannot prove anything about it without some sort of theoretical context.

So now that we know we have to prove it rigorously, the next step is to figure out what distribution the phenomenon follows. In this case it was the hypergeometric, which is a bit niche, but still a fairly simple distribution to work with. For other problems it could be a distribution that is very messy, or a mixture model, or something requiring simulation. Once we have defined a probability distribution that makes sense for our context, we can compute all the estimators we want, whether it be by maximum likelihood, method of moments or MAP such as in a Bayesian context. Once we have constructed an estimate using these tried and true methods, we can rely on centuries worth of statistical theory to measure how optimal each estimator is — unbiasedness, consistency, efficiency, sufficiency etc. If we wanted, we could probably show that in order for our estimate of \(N\) to have low variance, we would need high \(n\) or high \(K\) or both. Let’s leave that as an exercise for the reader…

### Conclusion

In this article we take a problem that seems hopeless, or at least difficult, to solve and solve it rigorously. We start with an educated guess of an estimator by setting up a simple proportion. We then set up this basic form of the problem as a realization of the hypergeometric distribution. We then execute a niche maximum-likelihood estimation based on the likelihood ratio since the parameter we are trying to solve for must be an integer. We get the same answer as our original guess, but we now have the backup of statistical theory such that we can ask questions about *how good* of an estimator the Lincoln index really is.

#### Footnotes and References

- This course is actually the course that got me interested in pursuing additional graduate study in computer science. ↩
- Baykan, Eda, et al. “A comparison of techniques for sampling web pages.” arXiv preprint arXiv:0902.1604 (2009). ↩
- Rusmevichientong, Paat, et al. “Methods for sampling pages uniformly from the world wide web.” AAAI Fall Symposium on Using Uncertainty Within Computation. 2001. ↩
- Mauldin, Michael L. “Measuring the Size of the Web with Lycos” ↩
- Shao, J. “Mathematical Statistics, Springer Texts in Statistics.” (2003). Pg. 227 ↩
- Watkins, Joe. “Topic 15: Maximum Likelihood Estimation.” 1 Nov. 2011. Mathematics 363, University of Arizona. ↩
- Zhang, Hanwen. “A note about maximum likelihood estimator in hypergeometric distribution.” Revista Comunicaciones en Estadistica, UNIVERSIDAD SANTO TOMAS 2.2 (2009). ↩