Welcome to my new blog, Byte Mining! Data is all around us, all the time. It flows in from places you would least expect, and more often than not, it remains in its original form, untouched by human and machine. When data simply flows in and out of our lives, we miss out on the story that it tells us, and the clues that it provides to help solve our mysteries.

We humans are becoming more and more attuned to the data that we exude and how we release it. There are two side effects to this. The first is concern for privacy. The second is that people are aware that their data is out there for consumption. In other words, people are no longer astonished when they realize how much of their data is readily available, either publicly or to particular entities. This is a weight off my back because I no longer get weird looks when I mention to someone some tidbit of information I learned about them on Facebook or MySpace. No, I am not a stalker. I simply became aware of data ubiquity much sooner than most people my age, and was able to put together a picture of a person from the details they so readily provided. Of course, once people realized how much data they were releasing, privacy controls became much more refined.

There are many hats in the data-to-information conversion business. There are the analysts who take data and, well, analyze it using statistical methods. They may be called data analysts or statisticians, depending on the extent of their analysis and their qualifications. Until recently, analysts lived in a rectangular world. All of their data consisted of columns called fields, with observations structured as rows containing (usually) the same fields, delimited by some character like the comma (,), tab or space, or my favorite, the pipe (|). Rectangular datasets do not grow on trees. Someone, or something, somewhere, put that data into a rectangle. Usually the rectangle is a restriction imposed by a database, and the rest is history. Other times, a person has converted some crazy, chaotic data into the rectangle. I will get to this in a bit. The point to take home:

Analysts take data and tell a story with it.
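To make the "rectangular world" concrete, here is a minimal Python sketch that reads a pipe-delimited block of text into rows that all share the same fields. The sample data, the field names and the helper `read_rectangle` are all made up for illustration; only the standard library `csv` module is used.

```python
import csv
import io

# Hypothetical sample: one header row naming the fields, then one
# observation per line, delimited by the pipe character.
RAW = "name|age|city\nAda|36|London\nGrace|45|New York\n"


def read_rectangle(text, delimiter="|"):
    """Return each observation as a dict keyed by the header fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = read_rectangle(RAW)
for row in rows:
    print(row)
```

Every row comes back with the same keys, which is exactly the property that makes a dataset a rectangle. Note that all values are still strings at this point; converting types is left to the analyst.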

There is also the pseudo-hat of a modeler. These people are very important, but I really don’t know where to put them. I guess you can say they are a cross between an analyst and a miner. They are sort of like philosophers, or psychics.

Modelers take a bunch of data and answer the question, “what does this say about you?”

Another hat is that of the visualizer. These folks are kind of similar to analysts, and they may be analysts, but there is one major difference. Rather than focus on describing what the data is saying in words, they create lots of pretty pictures whose goal is to excite the consumer and engage him/her with the data.

Visualizers take data and make it sing.

Finally, there is the miner hat. In my opinion, data miners make the field go round – no longer is the world a square (see what I did there?). Data miners stick their hands in the air and reach for the data as it flies by. They grab it, give it some soul and present it to the consumer. The consumer may be an analyst, a visualizer, another miner (probably a programmer) or in some cases, Joe Schmoe. Miners extract data by using APIs or scraping and then parse the output to turn it into data, using text mining and regular expression magic. The miner may then re-broadcast this data in some other form for consumption.

Miners stick their hands in the air and reach for the data as it flies by. They grab it, give it some soul and present it to the consumer.
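As a rough illustration of the "regular expression magic" a miner relies on, the Python sketch below pulls structured records out of messy text. The line format, the field names and the helper `mine` are invented for this example; real scraped or API output will look different and will need its own pattern.

```python
import re

# Hypothetical scraped text: two lines carry data, one is noise.
SCRAPED = """\
user=alice posted 3 photos on 2010-05-01
some noise the pattern should skip
user=bob posted 12 photos on 2010-05-02
"""

# Named groups become the fields of each extracted record.
PATTERN = re.compile(
    r"user=(?P<user>\w+) posted (?P<count>\d+) photos on "
    r"(?P<date>\d{4}-\d{2}-\d{2})"
)


def mine(text):
    """Keep only the lines that match PATTERN, one dict per match."""
    records = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            record = match.groupdict()
            record["count"] = int(record["count"])  # digits to integer
            records.append(record)
    return records


records = mine(SCRAPED)
for record in records:
    print(record)
```

The result is a list of records with identical fields, ready to hand to an analyst or visualizer, or to re-broadcast in some other form.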

So, which one are you? It should be no surprise that I consider myself a miner. By trade, I am also an analyst and modeler. I do not consider myself a visualizer by any means, but I definitely appreciate and keep up with those individuals. I am sure I will write about visualization every once in a while.

So where do statistics and computer science come in? Loosely speaking, I believe that a computer scientist takes bytes and turns them into data, and a statistician takes data and turns it into information. There is a lot of blurring between these two fields, however. Many computer scientists now turn a lot of data into information as well. One would find many computer scientists as miners, and maybe some as visualizers. It has been my experience that very few are pure analysts, but a good number are modelers. On the other hand, one would find statisticians in all of these roles, but more so as analysts, visualizers and modelers and less so as miners.

2 comments to Welcome!

  • Cosma Shalizi has a nice definition of data mining and its relation to machine learning and statistics:

    Data mining, more stuffily “knowledge discovery in databases”, is the art of finding and extracting useful patterns in very large collections of data. It’s not quite the same as machine learning, because, while it certainly uses ML techniques, the aim is to directly guide action (praxis!), rather than to develop a technology and theory of induction. In some ways, in fact, it’s closer to what statistics calls “exploratory data analysis”, though with certain advantages and limitations that come from having really big data to explore.


    • Thank you for your comment. True, data mining and machine learning are separate fields, and can even be distinct from one another. A lot of these fields, especially “data mining”, currently seem to have different meanings to different people and for me it is sometimes difficult to determine when something has crossed into ML, IR or data mining territory.

      The “old fashioned” definition of data mining to me is finding some pattern in a dataset, as you suggested. To statisticians, this is just “exploratory data analysis.” In an intro data mining class I took in computer science, the professor defined it exactly the same way a statistician would define EDA, while also adding machine learning algorithms. It can be quite confusing to see so many points of view. Eventually, these terms will converge on well-defined descriptions.
