Summary of My First Trip to Strata #strataconf

In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions, as I have a much longer post about the tutorial sessions that I am still working on and will publish at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This was my first trip to Strata, so I was eager to attend. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry-list schedule for a long day, and I also skip sessions that I did not find all that illuminating. I do not claim 100% accuracy of this text, as the days are very long and my ears and mind can only process so much data when I am context switching between listening, tweeting, emailing, etc.


In the mornings there were several short plenary talks where people from throughout the industry discussed their particular views of Data Science. This was basically a warm-up for what would become very long days. I mostly used this time to catch up on email and review stuff from the previous day. The second day apparently had a lot of gimmicky sales talks, but I was not paying attention. The most interesting talk came from Jennifer Pahlka of Code for America. I had first learned about Code for America at the Data Scientist Summit hosted by EMC in 2011. At the time it sounded like a very good idea — we have college graduates that dedicate a couple of years to teaching in inner-city schools, so it makes sense that we should have some data scientists working on “projects that make a difference in the world,” as Jennifer would say. The projects that these data scientists work on involve democratizing data and open data initiatives in local governments. A couple of projects stood out to me: one released some 800+ datasets from the City of Santa Cruz (an amazing city) on a website; another studied bail amounts and the outcomes of criminal trials. This talk was apropos considering the recent announcement of code.org, the start of a movement to begin teaching computer programming to K-12 students. [You can sign the petition and register as a volunteer here.]

Visualization Strand

I have said in the past that visualization is not my thing. I greatly appreciate interactive graphics and cool infographics that convey strong meaning to non-data scientists, but it simply is not my cup of tea yet. However, it is something that I want to invest time in. I decided to attend visualization talks up to my tolerance level (which isn’t very high)… which meant one or two.

I attended Chang She’s talk Agile Data Wrangling and Web-based Visualizations. Chang did what I usually do: pack too much into a one-hour talk… but I feel that talks like this really whet the appetite to learn more. He discussed how data science is missing a “blue button” that takes care of data management and then visualization. Using the Federal Election Commission dataset, he showed political donations by party, candidate and state as the motivating example. Chang showed several examples of using pandas (a Python data munging library) to manipulate the data and then passing that data to d3.js as JSON via a web server. I felt that this was just a basic talk on how to combine tools to munge data and then visualize it. It is far from a blue button, but it shows how important such processing pipelines are.
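
A minimal sketch of that pandas-to-JSON pipeline, assuming only pandas and the standard library; the donation records below are invented stand-ins for the FEC data:

```python
import json
import pandas as pd

# Invented donation records standing in for the FEC dataset.
donations = pd.DataFrame({
    "candidate": ["Obama", "Obama", "Romney", "Romney"],
    "state":     ["CA", "NY", "CA", "TX"],
    "amount":    [200.0, 150.0, 500.0, 300.0],
})

# Aggregate donations by candidate -- the kind of rollup you would
# hand to a d3.js bar chart.
totals = donations.groupby("candidate")["amount"].sum()

# Serialize to JSON; in the talk, a small web server returned a
# string like this to the browser, where d3.js rendered it.
payload = totals.to_json()
print(payload)
```

Here the JSON string stands in for the HTTP response a web server would send to the d3.js front end.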

Law, Ethics and Open Data Strand

One of the highly acclaimed talks of the day came from Joseph Turian of MetaOptimize, titled Sci vs. Sci: Attack Vectors for Black-Hat Data Scientists and Possible Countermeasures. Every skill has a good use and an evil use, and Data Science is no exception. We create models to combat fraud, detect spam, measure influence and much more. These “good” uses of skills are called “white hat.” On the other hand, a more evil Data Scientist can circumvent these models to allow their spam to go undetected or game an influence metric such as PageRank. For example, consider a malicious web page that contains code that simply repeats a user’s Google query endlessly. To a very stupid search engine, such a web page would game a keyword matching algorithm and the search engine that is based on it. This crap web page would appear as the first result because it appears the most relevant. This is a very elementary example, but one can imagine the nasty results that attacks on more sophisticated models could produce.
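
The keyword-stuffing attack above can be sketched in a few lines; the scoring functions and documents are invented for illustration, not any real engine's ranking:

```python
# Toy version of the "very stupid search engine": rank pages by how
# often they contain the query terms.
def raw_score(doc, query):
    words = doc.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def normalized_score(doc, query):
    # Naive countermeasure: divide by document length.
    words = doc.lower().split()
    return raw_score(doc, query) / len(words) if words else 0.0

honest = "strata conference schedule and session summaries for data science"
stuffed = " ".join(["data science"] * 50)  # page that just repeats the query

query = "data science"
# The stuffed page wins on raw keyword count...
assert raw_score(stuffed, query) > raw_score(honest, query)
# ...and length normalization alone does not save us, because the
# stuffed page consists of *nothing but* query terms.
print(normalized_score(stuffed, query), normalized_score(honest, query))
```

This is exactly why the naive keyword matcher in the example is gameable, and why real engines layer on link analysis and spam detection.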

Turian believes that most Data Scientists originally come from academia, where the skills we learned are mainly “white hat,” but that our use in industry is mainly “grey hat” (somewhere between good and not-so-good). Such “grey hat” methods may involve some sort of data privacy issue, as with ad retargeting. A “black hat” data scientist may be useful in constructing a botnet, using Markov models or other language models to generate human-looking spam text, or creating sock puppets to sway opinion in a large social network. A sock puppet is essentially a social media account that is designed to look like a genuine human but that has an ulterior motive, mainly to proliferate propaganda or false information. The use of these sock puppets is referred to as “astroturfing” — that is, a fake grassroots movement. One easy example I can think of is the thousands and thousands of Twitter accounts that are created simply to sway opinion about President Obama (search for #tcot and you are likely to find some examples, though many are also legitimate users). Turian cited one unsophisticated example of astroturfing: Newt Gingrich and his huge jump in followers over a short period of time, which was determined to be fake. In this case, it is alleged that Gingrich’s campaign paid for followers rather than creating an army of sock puppets. Some methods for locating sock puppets are the presence of reply spam (@spam), manual classification, and honeypots.

Some interesting statistics:

  1. 7% of Tweeps (Twitter users) are spam bots.
  2. 20% of us accept friend requests from people we do not know.
  3. 30% of us have been deceived by chat bots.

Note: MetaOptimize hosts an amazing machine learning Q and A site similar in function to StackExchange/StackOverflow. You can visit it here.

Data Science Strand

IPython Notebooks

The first talk of this series I attended was The IPython Notebook: a Comprehensive Tool for Data Science by Brian Granger of Cal Poly San Luis Obispo and Chronicle Labs. One of the major problems in Data Science is that “code and data do not communicate much.” That is, code is usually placed in one file and data in another, and an analysis involves a coupling of data and code that must be kept in sync throughout the process. Imagine if all of your work as a Data Scientist could be contained on your physical desktop as separate objects — this is a good analogy for IPython Notebooks. An IPython Notebook functions much like a Mathematica notebook or a Sage notebook. One can analyze data in pandas data frames, use some fancy models from SciPy or scikit-learn, and use the general Python language as well as the niceties provided by IPython, all in one place. Once the code is written, one can produce plots with matplotlib in place and then distribute the document to others. IPython Notebooks provide a living document of one’s work and offer resilience to change by keeping all of the code in one place. Additionally, the concept of cell magic allows the execution of other languages such as R, Ruby and Julia from within the IPython Notebook! Soon there may be no need to run multiple interpreters or to maintain a separate open-source notebook project for each additional language!

Here is the amazing part: by using so-called cell magic, one can push a Python object, say a pandas dataframe, directly into R, where it is converted into an R dataframe. I do not remember the specifics of why this is possible, but this is huge. This eliminates the need for packages like RPy2 for basic computations between R and Python. [Edit: RPy2 is used under the hood for this conversion. Thanks to Dirk for pointing this out.] Brian mentioned that it may also eventually be possible to allow Python objects to interact with JavaScript libraries such as d3.js for visualization using widgets.

IPython Notebooks support narrative text, headings, graphics and also mathematical typesetting via MathJax. Executing code produces JSON strings that are portable and serializable, so results can be saved without requiring code to be re-executed. The site nbviewer.ipython.org provides an online viewer for IPython Notebooks via URL, Git repository URL or Gist URL. This viewer does not require the web service to be installed locally. One current limitation of IPython Notebooks is that they only support a single user and thus cannot be hosted for, say, multiple students to log in to their own notebook sessions in a classroom.
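
As a rough sketch of why this portability works, here is a hand-built dictionary shaped like a notebook document; the field names follow the notebook JSON of the era but should be treated as illustrative, not the authoritative schema:

```python
import json

# Hand-built dictionary roughly shaped like an .ipynb file: cells
# carry both the code and its captured outputs.
notebook = {
    "metadata": {"name": "strata_demo"},
    "worksheets": [{
        "cells": [
            {"cell_type": "markdown", "source": ["# Analysis notes"]},
            {
                "cell_type": "code",
                "input": ["print(1 + 1)"],
                "outputs": [{"output_type": "stream", "text": ["2\n"]}],
            },
        ],
    }],
}

# Because code AND outputs are plain JSON, the whole document can be
# shipped around and viewed (e.g. on nbviewer) without re-executing
# anything.
serialized = json.dumps(notebook)
restored = json.loads(serialized)
print(restored["worksheets"][0]["cells"][1]["outputs"][0]["text"])
```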

Once ipython and ipython-notebook (the Ubuntu package names) are installed, one just executes the command ipython notebook in the directory of interest to start up a web server for working with IPython Notebooks.

Apparently entire textbooks are being written as IPython Notebooks for their beauty, scientific ease and portability.

Adversarial Learning

The final talk I attended was What To Do When Your Machine Learning Gets Attacked by Vishwanath Ramarao. The purpose of this talk was to discuss issues with bad guys trying to circumvent machine learning models designed to prevent abuse of a system, such as a spammer learning how to get around a spam filter over time. This spammer is called an adversary, and can be a “black hat” data scientist. Some examples of adversarial situations are login fraud (spearphishing, PR embarrassment or financial information), comment/mail spam, sign-up fraud, astroturfing, credit card fraud and click fraud. Adversarial learning is the set of techniques that classify data emitted by an adversary.

An adversarial situation arises when the adversary is able to observe the output of the learning system and can change some subset of the features used in that system so that their attempts go unpunished. The goal of adversarial learning is to make it costly for an adversary to change features. The approach towards a solution is labor intensive, but simple to explain. Ramarao essentially said that the best way to combat adversaries is to

  1. engineer features interactively and quickly. 
  2. not throw away features as we commonly do. It is possible that some features may be activated as the adversary’s methods evolve. 
  3. consider the entire transmission of an adversarial transaction — that is, do not just look at the words in a spam email but also look at the HTTP headers and other communication information passed along with the text.
  4. study anomalies (outliers and high leverage points) and not discard them. Usually such anomalies are adversaries.
  5. permit overfitting when necessary for the reason mentioned in #3.
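
Point 3 can be sketched as feature extraction over the whole transmission rather than just the body; the message, header values and feature names below are invented:

```python
from email import message_from_string

# An invented spam message with its transport headers attached.
raw = (
    "Received: from mail.example.net\r\n"
    "X-Mailer: BulkBlaster 2.0\r\n"
    "Subject: cheap meds\r\n"
    "\r\n"
    "Buy now and save!!!\r\n"
)

msg = message_from_string(raw)
body = msg.get_payload()

features = {
    # Body-only feature -- where a naive filter would stop.
    "exclamations": body.count("!"),
    # Transmission-level features drawn from the headers.
    "has_x_mailer": msg["X-Mailer"] is not None,
    "received_hops": len(msg.get_all("Received") or []),
}
print(features)
```

A classifier trained on both kinds of features forces the adversary to forge the transport path as well as the text, which is exactly the "make changing features costly" goal above.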

As a text mining enthusiast, I learned some interesting tricks for fitting machine learning models to text, none of which had anything to do with adversarial learning.

  • A homoglyph is a disguised word in which some characters are replaced with look-alike characters. For example, p0rn is a homoglyph of porn – the o in porn is replaced with a character that looks similar, the zero 0.
  • A broken word is an intended word with spaces inserted. For example, the word nigeria could be a feature for a spam detection algorithm. An adversary can bypass the filter by instead writing ni geria.
  • Hash busters are cases where new words that were not in the lexicon used to train the text model are injected into content. The count of hash busters can itself be used as a feature in a model. One common hash buster for a naive profanity filter would be the word fcuk instead of the actual word f*ck.
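
The first two tricks suggest a normalization pass before the filter runs; here is a toy sketch with an invented homoglyph table and blacklist:

```python
# Tiny, invented lookup tables -- a real system would need far more.
HOMOGLYPHS = {"0": "o", "1": "l", "3": "e", "@": "a"}
BLACKLIST = {"porn", "nigeria"}

def normalize_homoglyphs(text):
    # Map look-alike characters back to the letters they imitate.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def contains_broken_word(text, blacklist):
    # Collapse spaces so "ni geria" matches "nigeria".
    squashed = text.replace(" ", "")
    return any(word in squashed for word in blacklist)

assert normalize_homoglyphs("p0rn") == "porn"
assert contains_broken_word("greetings from ni geria", BLACKLIST)
```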

Julia


After being enlightened by this wonderful talk, I am going to write a more substantial post focusing solely on Julia, so for now I will just briefly describe some of the more easy-to-explain content. This talk was presented by Michael Bean of Forio (developers of Julia Studio). As data scientists we love dynamic environments for interactive data munging, such as R or the Python shell (with pandas or SciPy). We typically start with a high-level language such as R and then port the code to a compiled or performant language like C, C++ or Java (and maybe Python). This is a large barrier in scientific computing because it requires the data scientist to know two languages: one to experiment, and one to implement. Julia is a scientific computing language that provides the performance of a language like C++ and adds technical libraries and accessibility for scientific exploration. Bean cited benchmarks showing Julia’s performance to be similar to C++’s. Julia allows us to complete tasks faster because we remove the need for “glue” code, and Julia packages are written in Julia for performance rather than requiring C or Fortran. [R packages can be written solely in R, but for computationally intensive operations, or for packages that will sit in a bottom layer such as data structures, there is a huge performance hit.] Once one is familiar with Julia, it is easy to “hack the core,” so to speak.

Other features that impress me:

  • the user can redefine arithmetic operations and construct new data types. Julia uses multiple dispatch, a programming language feature that selects among different implementations of a function depending on the types of the arguments passed to it. For example, if A and B are of type matrix, then Julia will know that A * B is the matrix multiplication operation rather than elementwise multiplication.
  • common data structures found in computer science are supported natively such as BitArrays and SubArrays as well as types statisticians are already familiar with including Distribution and DataFrame.
  • support for list comprehensions. For example, to square every element, use [xi^2 for xi in x] instead of a for loop.
  • every package is a Git repository and thus open-source and easy to access.
  • some packages support multicore natively.
  • certain functions can have a bang (!) appended, which tells Julia to modify the object in place rather than make a copy (think of in-place sort, which is sort!).

Bean showed that the development process with Julia is shorter than with languages such as R because production-level re-implementation is not necessary. The runtime was also faster for the few examples he showed. The following is a recursive implementation of the Fibonacci numbers in both R and Julia.

R code (runtime: 192 seconds):

fib <- function(n)
{
  if (n < 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}

start <- Sys.time()
fib(36)
end <- Sys.time()
end - start

Julia code (runtime: 0.24 seconds):

fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)
@elapsed fib(36)

Connected World Strand

Bit.ly: Deriving an Interest Graph

The first talk in this strand that I attended was by Anna Smith of bit.ly titled Deriving an Interest Graph for Social Data. It should be no surprise that a URL shortening service would have a ton of data to sift through. Anna stated that a lot of her work is one-off analysis. What I liked about Anna’s talk in particular is that the visualizations she used were very basic. There was nothing fancy about what her graphics displayed — they just displayed some insights about the data and that is it. 

Bitly extracts a lot of data from each shortened URL, including keywords, topics and the probability that the click came from a human. One can derive a taxonomy and interest graph by analyzing click data among links. The idea is to look at other webpages a user went to from the page behind the shortened URL. It is hypothesized that the next page the user visits is related in content to the current page. On a domain level, a coclick graph uses domains as nodes and the number of clicks between them as edges. From this, we can derive a graph of keywords: for each pair of keywords, compute the Jaccard similarity between the sets of domain clicks associated with each keyword. The resulting coclick graph has 4.5 million keywords and 9 million edges. With some basic processing (removing non-English keywords and keywords with low click counts) and then running a clustering algorithm called DBSCAN, they were able to simplify their graph to 200,000 keyword clusters and 1 million edges.
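
The keyword-linking step can be sketched with a plain Jaccard function; the click sets below are invented:

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|, defined as 0 for two empty sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented example: the domains whose clicks carry each keyword.
domains_for = {
    "python":  {"github.com", "stackoverflow.com", "python.org"},
    "pandas":  {"github.com", "stackoverflow.com"},
    "cooking": {"foodnetwork.com"},
}

print(jaccard(domains_for["python"], domains_for["pandas"]))   # high overlap
print(jaccard(domains_for["python"], domains_for["cooking"]))  # no overlap
```

In the keyword graph, an edge between two keywords would be weighted by this similarity, and low-weight edges pruned before clustering.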

The Data Science group at bit.ly keeps an updated GitHub repository for their work here.

LinkedIn Endorsements


The last session in this strand that I attended was by Sam Shah and Pete Skomoroch from LinkedIn. This talk discussed the skills endorsement feature of LinkedIn and how they made it successful using science. Sam and Pete credit most of the success to establishing viral loops and using recommendation engines as follows: A endorses B -> B is notified -> B accepts the endorsement and endorses someone else.

Social tagging of skills also accelerated adoption. First, a user thinks about their skills and tags them on their profile. Then, a recommendation system suggests other related skills as well as some potential people for the user to endorse. But this is not the interesting part…

How does LinkedIn maintain a skill dictionary and taxonomy? This is a highly unwieldy problem due to human psychology and variations in language usage. One of the biggest issues is phrase sense disambiguation. The motivating example was the skill angel. If I list angel as a skill on my profile, am I referring to myself as an angel investor or as a spiritual being? The speakers indicated that by using the graph of all the skills listed in addition to angel, we could use agglomerative clustering and then a distance metric to determine which meaning is most likely. A related problem is de-duplication: MS Office, Microsoft Office and Office all refer to the same thing. For this particular problem, LinkedIn used crowdsourcing with Mechanical Turk tasks. An example human intelligence task was to ask a participant to find the best Wikipedia article for a particular topic, since Wikipedia tends to already have a strong army that de-duplicates content.

This is all great for users that actively use the Skills feature, but some do not. For those users, a system passes a sliding window over the profile text (n-grams) and emits possible matches based on the taxonomy, tossing out words that do not fit the inferred topics in the profile. For example, suppose my profile text says “I love working with data, Python, Java and Hadoop.” The words I, with and and will all be tossed as stopwords, leaving the keywords love, working, data, Python, Java, Hadoop. For all practical purposes, working is probably considered a stopword or low-impact word because it appears so often on LinkedIn profiles, and data is probably not an actual skill, so both of these words are removed, leaving love, Python, Java, Hadoop. Using LinkedIn’s taxonomy of skills, we would probably deduce that Python, Java and Hadoop are highly related and that love is an extreme outlier (for some people love may be an actual skill, but likely not in this context). Finally, this system would tag Python, Java and Hadoop as skills to add to the profile. For more complex (realistic) examples, LinkedIn would then apply word sense disambiguation and de-duplication. A simple Naive Bayes algorithm is used to generate the actual recommendations. In the event of a completely blank profile, recommended skills are based on title, organization and perhaps social network features.
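
A toy sketch of that tagging pass, with an invented stopword list and a tiny skill taxonomy standing in for LinkedIn's:

```python
# Invented stand-ins for LinkedIn's stopword list, low-impact word
# list, and skill taxonomy (which includes multi-word skills).
STOPWORDS = {"i", "with", "and"}
LOW_IMPACT = {"love", "working"}  # too common on profiles to signal a skill
TAXONOMY = {("python",), ("java",), ("hadoop",), ("machine", "learning")}

def ngrams(tokens, max_n=2):
    # Sliding window over the token stream, unigrams then bigrams.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def extract_skills(text):
    tokens = [t.strip(",.").lower() for t in text.split()]
    skills = []
    for gram in ngrams(tokens):
        if any(t in STOPWORDS | LOW_IMPACT for t in gram):
            continue
        # Keep only n-grams found in the taxonomy ("data" falls out here).
        if gram in TAXONOMY and gram not in skills:
            skills.append(gram)
    return [" ".join(g) for g in skills]

profile = "I love working with data, Python, Java and Hadoop."
print(extract_skills(profile))  # ['python', 'java', 'hadoop']
```

The real system would then run word sense disambiguation and de-duplication over these candidates before recommending them.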

LinkedIn can also suggest endorsements, where the system asks a user to endorse another user for particular skills that the user may know about. Some features used for this recommendation engine include people-skill combinations, school overlap, group overlap, similarity in industry, title similarity, site interactions, and co-interactions. Such a recommendation engine is basically a binary classification problem for link presence.

This talk by LinkedIn was surprisingly candid. Obviously, one cannot employ the methods they discussed because we do not have access to their data or infrastructure, thus such a talk is of no risk to intellectual property. Many companies do not get this and do not allow their employees to speak about anything involving their work.

Conclusion

I am glad I forked out the money to attend Strata and I will likely attend next year. The conference was huge and there was something for every data geek, including a ton of food. The conference overall was not as sales-y as I thought it would be, but there were definitely moments, particularly in the morning sessions and at the expo. I mainly just collected t-shirts at the expo hall, which was basically one giant “my Hadoop distribution is 100x faster than the other guys’.” There was also a really cool sensor lab set up for collecting data using Arduino sensors. There were several sensors placed throughout the conference venue and the data was visualized and placed here.

During my time at Strata so far, I have finally had the chance to meet some longtime Twitter friends and reunite with others. It was great meeting Neil Kodner and discussing our common interests as well as meeting Mathieu Bastian and discussing graph processing and the future of Gephi (I need to write a blog post about Gephi soon). I had a chance to talk to Wes McKinney over lunch as well about Python and the pandas community. On the last day, I went to an event hosted by Facebook and met several Facebook engineers and other Twitter friends including Joseph Turian, Bradford Stephens, Daniel Tunkelang and Greg Rahn. Everybody I met has now relocated to the Bay Area and I think I am going to need to follow…

Merry Christmas and Happy Holidays!

Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!

An update on me. In October, I began working at Riot Games, the developers of League of Legends. It has been an amazing experience and has occupied the majority of my free time as has my dissertation. My New Year’s resolution this year is to dust the cobwebs off this blog!

Have a safe holiday season!

Here in California, I will be having Christmas in the Sand

A New Data Toy -- Unboxing the Raspberry Pi

Last week I received two Raspberry Pis in the mail from Adafruit and just now have some time to play with them. The Raspberry Pi is a minimal computer system about the size of a credit card. In the embedded systems community the excitement is for obvious reasons, but I strongly believe that such a device can help collect and use data to help us make better decisions: not only is it a computer, but it is small and portable.

For development, a Raspberry Pi can connect to a television (or other display) via HDMI or composite video (the “yellow” plug for those still stuck in the 1900s, haha). A keyboard, mouse and other devices can be connected via two USB ports. A powered hub can provide support for even more devices. There are also various pins for connecting to a breadboard for analyzing analog signals, for a camera, or for an external (or touchscreen) display. An SD card essentially serves as the hard disk (and likely some swap space). The more recent Model B ships with 256MB of RAM. Raspberry Pi began shipping in February 2012 and these little guys have been very difficult to get a [...]

Adventures at My First JSM (Joint Statistical Meetings) #JSM2012

During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that “I am not a statistician.” Ok, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. [...]

OpenPaths and a Progressive Approach to Privacy

OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to contribute their location data for research projects and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as “a secure data locker for personal location information.” There is one aspect where OpenPaths is very different from other services like Google Latitude: only the user has access to his/her own data, and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user’s location data must ask personally via email (pictured below), and the user has the ability to deny the request. The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type etc. In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google’s servers is obviously used by other Google services without explicit user permission.

The service is also opt-in, that [...]

SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to [...]

My Interview about the Statistics Major

Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.

Q1: Why did you decide to pursue a major in statistics in college?

A: “When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make sense of what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher, and the first set of statistics I ever encountered were standardized test scores. I strove to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor [...]

“Hold Only That Pair of 2s?” Studying a Video Poker Hand with R

Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is “do you count cards?” A blank look comes over their face when I say “no.”

Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that’s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours. So it should be no surprise that I do not agree with using Poker to teach probability.  Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only [...]

Merry Christmas 2011 From Byte Mining!

To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and like with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading.

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

  • article content and template pages
  • article content with revision history (huge files)
  • article content including user pages and talk pages
  • redirect graph
  • page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
  • image metadata
  • site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

As Wikipedia readers will notice, the articles are very well formatted, and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why [...]