<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Byte Mining</title>
	<atom:link href="http://www.bytemining.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.bytemining.com</link>
	<description>My thoughts on data mining, machine learning, programming languages, open-source software and general nerdery.</description>
	<lastBuildDate>Tue, 05 Mar 2013 03:32:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Summary of My First Trip to Strata #strataconf</title>
		<link>http://www.bytemining.com/2013/02/summary-of-my-first-trip-to-strata-strataconf/</link>
		<comments>http://www.bytemining.com/2013/02/summary-of-my-first-trip-to-strata-strataconf/#comments</comments>
		<pubDate>Thu, 28 Feb 2013 18:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1217</guid>
		<description><![CDATA[<p></p>
<p>In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.</p>
<p>The slides for most talks will be available&#160;here&#160;but not all speakers will share their slides.</p>
<p>This&#160;is/was my first trip to Strata so I was eagerly awaiting participating as an attendee. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day, and I am skipping sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when I am context [...]]]></description>
				<content:encoded><![CDATA[<p><img class="lfloat" src="http://www.bytemining.com/wp-content/uploads/2013/02/strata_franchise_santa_clara.jpg" alt="" width="417" height="158" /></p>
<p>In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. <span style="text-decoration: line-through;">I will add to this post as the conference winds down.</span></p>
<p><strong>The slides for most talks will be available&nbsp;</strong><strong><a href="http://www.strataconf.com/slides">here</a>&nbsp;but not all speakers will share their slides.</strong></p>
<p>This&nbsp;<span style="text-decoration: line-through;">is/</span>was my first trip to Strata so I was eagerly awaiting participating as an attendee. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day, and I am skipping sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when I am context switching between listening, tweeting, emailing etc.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2013/03/20130304_1458051.jpg" alt="" width="207" height="486" class="rfloat" /><br />
In the mornings there were several short plenary talks where people throughout the industry discussed their particular views of Data Science. This was basically a warm-up for what would become very long days. I mostly used this time to catch up on email and review material from the previous day. The second day apparently had a lot of gimmicky sales talks, but I was not paying attention. The most interesting talk came from&nbsp;<a href="http://codeforamerica.org/author/jen/">Jennifer Pahlka</a>&nbsp;from&nbsp;<a href="http://codeforamerica.org/">Code for America</a>. I had first learned about Code for America at the&nbsp;<a href="http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/">Data Scientist Summit hosted by EMC in 2011</a>. At the time it sounded like a very good idea &#8212; we have college graduates that dedicate a couple years to teaching in inner-city schools so it makes sense that we should have some data scientists working on &#8220;projects that make a difference in the world&#8221; as Jennifer would say. The projects that these data scientists work on involve democratizing data and open data initiatives in local governments. Two projects stood out to me. One released some 800+ datasets from the City of Santa Cruz (an amazing city) on a website. The other studied bail amounts and the outcome of a criminal trial. This talk was apropos considering the recent announcement of&nbsp;<a href="http://www.code.org/">code.org</a>, the start of a movement to begin teaching computer programming to K-12 students. [You can sign the petition and register as a volunteer&nbsp;<a href="http://www.code.org/">here</a>.]</p>
<p> <span style="font-size: 18px;"><strong>Visualization Strand</strong></span></p>
<p>I have said in the past that visualization is not my thing. I greatly appreciate interactive graphics and cool infographics that convey strong meaning to non-data scientists but it simply is not my cup of tea yet. However, it is something that I want to invest time into. I decided to attend visualization talks up to my tolerance level (which isn&#8217;t very high)&#8230; which meant&nbsp;<span style="text-decoration: line-through;">one</span>&nbsp;two.</p>
<p>I attended&nbsp;<a href="http://www.linkedin.com/pub/chang-she/9/861/b008975,d.eWU">Chang She</a>&#8216;s talk&nbsp;<a href="http://strataconf.com/strata2013/public/schedule/detail/27455">Agile Data Wrangling and Web-based Visualizations</a>. Chang did what I usually do: pack too much into a one-hour talk&#8230; but I feel that talks like this really whet the appetite to learn more. He discussed how data science is missing a &#8220;blue button&#8221; that takes care of data management and then visualization. Using the&nbsp;<a href="http://www.fec.gov/disclosurep/PDownload.do">federal election commission dataset</a>, he showed political donations by party, candidate and state as the motivating example. Chang showed several examples of using&nbsp;<a href="http://pandas.pydata.org/">pandas</a>&nbsp;(a Python data munging library) to manipulate the data and then passing that data to&nbsp;<a href="http://d3js.org/">d3.js</a>&nbsp;using a&nbsp;<a href="http://en.wikipedia.org/wiki/Json">JSON</a>&nbsp;data format with a web server. I felt that this was just a basic talk on how to combine tools to munge data and then visualize it. It is far from a blue button, but shows how important such processing pipelines are.</p>
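<p>The munge-then-serialize step of such a pipeline can be sketched in a few lines. This is only an illustration, not Chang&#8217;s code: it uses the Python standard library in place of pandas, and the donation records and the <code>total_by_party_and_state</code> helper are made up for the example.</p>

```python
import json
from collections import defaultdict

# Hypothetical donation records; the real FEC file has many more columns.
donations = [
    {"party": "DEM", "state": "CA", "amount": 250.0},
    {"party": "REP", "state": "CA", "amount": 100.0},
    {"party": "DEM", "state": "CA", "amount": 50.0},
    {"party": "REP", "state": "TX", "amount": 500.0},
]

def total_by_party_and_state(rows):
    """Sum donation amounts for each (party, state) pair."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["party"], row["state"])] += row["amount"]
    # A list of flat records is a shape that d3.js can bind to directly.
    return [
        {"party": party, "state": state, "total": total}
        for (party, state), total in sorted(totals.items())
    ]

summary = total_by_party_and_state(donations)
payload = json.dumps(summary)  # what a web server would hand to the browser
print(payload)
```

<p>With pandas the aggregation would be a group-by followed by a JSON export, but the shape of the data handed to the browser would be the same.</p>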
<p> <span style="font-size: 18px;"><strong>Law, Ethics and Open Data Strand</strong></span><br />
<img class="lfloat" src="http://www.bytemining.com/wp-content/uploads/2013/02/hat212.jpg" alt="" width="180" height="150" /><br />
One of the highly acclaimed talks of the day came from&nbsp;<a href="https://twitter.com/turian">Joseph Turian</a>&nbsp;of&nbsp;<a href="http://metaoptimize.com/">MetaOptimize</a>, titled&nbsp;<a href="http://strataconf.com/strata2013/public/schedule/detail/27213">Sci vs. Sci: Attack Vectors for Black-Hat Data Scientists and Possible Countermeasures</a>. Every skill has a good use and an evil use and Data Science is no exception. We create models to try to combat fraud, detect spam, measure influence and much more. These &#8220;good&#8221; uses of skills are called &#8220;white hat.&#8221; On the other hand, a more evil Data Scientist can&nbsp;<em>circumvent&nbsp;</em>these models to allow their spam to go undetected or game an influence metric such as PageRank. For example, consider a malicious web page that contains code that simply repeats a user&#8217;s Google query endlessly. To a very stupid search engine, such a web page would game a keyword matching algorithm and the search engine that is based on it. This crap web page would appear as the first result because it appears the most relevant. This is a very elementary example, but one can imagine how sophisticated models can produce nasty results.</p>
<p>Turian believes that most Data Scientists originally come from academia where the skills we learned are mainly &#8220;white hat&#8221;, but that our use in industry is mainly &#8220;grey hat&#8221; (somewhere between good and not-so-good). Such &#8220;grey hat&#8221; methods may involve some sort of data privacy issue such as with ad retargeting. A &#8220;black hat&#8221; data scientist may be useful in constructing a botnet, using Markov models or other language models to generate human-looking spam text, or to create sock puppets to sway opinion in a large social network. A&nbsp;<a href="http://en.wikipedia.org/wiki/Sockpuppet_(Internet)">sock puppet</a>&nbsp;is essentially a social media account that is designed to look like a genuine human but that has an ulterior motive, mainly to proliferate propaganda or false information. The use of these sock puppets is referred to as &#8220;<a href="http://en.wikipedia.org/wiki/Astroturfing">astroturfing</a>&#8221; &#8212; that is, a fake grassroots movement. One easy example I can think of is the thousands and thousands of Twitter accounts that are created simply to sway opinion about President Obama (<a href="https://twitter.com/search?q=%23tcot&amp;src=typd">search for&nbsp;#tcot</a>&nbsp;and you are likely to find some examples, though many are also legitimate users). Turian cited one unsophisticated example of astroturfing:&nbsp;<a href="http://www.pcmag.com/article2/0,2817,2390375,00.asp">Newt Gingrich and his huge jump in followers</a>&nbsp;in a short period of time, which was determined to be fake. In this case, it is alleged that Gingrich&#8217;s campaign paid for followers rather than create an army of sock puppets. Some methods for locating sock puppets are the presence of reply spam (@spam), manual classification, and&nbsp;<a href="http://en.wikipedia.org/wiki/Honeypot_(computing)">honeypots</a>.</p>
<p>Some interesting statistics:
<ol>
<li>7% of Tweeps (Twitter users) are spam bots.</li>
<li>20% of us accept friend requests from people we do not know.</li>
<li>30% of us have been deceived by chat bots.</li>
</ol>
<p><strong>Note:</strong>&nbsp;<a href="http://metaoptimize.com/qa/">MetaOptimize hosts an amazing machine learning Q and A site</a>&nbsp;similar in function to StackExchange/StackOverflow. You can visit it&nbsp;<a href="http://metaoptimize.com/qa/">here</a>.</p>
<p> <span style="font-size: 18px;"><strong>Data Science Strand</strong></span></p>
<p> <span style="font-size: 14px;"><strong>IPython Notebooks</strong></span></p>
<p>The first talk of this series I attended was&nbsp;<a href="http://strataconf.com/strata2013/public/schedule/detail/27233">The IPython Notebook: a Comprehensive Tool for Data Science</a>&nbsp;by&nbsp;<a href="https://twitter.com/ellisonbg">Brian Granger</a>&nbsp;at&nbsp;<a href="http://www.calpoly.edu">Cal Poly San Luis Obispo</a>&nbsp;and Chronicle Labs. One of the major problems in Data Science is that &#8220;code and data do not communicate much.&#8221; That is, code is usually placed in one file, and data in another file and an analysis involves the coupling of data and code that must be kept in sync throughout the process. Imagine if all of your work as a Data Scientist could be contained on your physical desktop as separate objects &#8212; this is a good analogy for&nbsp;<a href="http://ipython.org/notebook.html">IPython Notebooks</a>. An IPython Notebook functions much like a&nbsp;<a href="http://www.wolfram.com/mathematica/">Mathematica</a>&nbsp;notebook, or a&nbsp;<a href="http://www.sagenb.org/">Sage notebook</a>. One can analyze data in&nbsp;pandas&nbsp;data frames, use some fancy models from&nbsp;<a href="http://www.scipy.org">SciPy</a>&nbsp;or&nbsp;<a href="http://scikit-learn.org/stable/">scikit-learn</a>, use the general Python language as well as the niceties provided by IPython all in one place. Once the code is written, one can produce plots with&nbsp;<a href="http://matplotlib.org/">matplotlib</a>&nbsp;in place and then distribute the document to others. IPython Notebooks provide a living document of one&#8217;s work and make that work resilient to change by keeping all of the code in one place.&nbsp;<span style="text-decoration: underline;">Additionally, the concept of cell magic allows the execution of other languages such as R, Ruby and Julia from within the IPython Notebook!</span>&nbsp;Soon there may be no need to run multiple interpreters or have multiple different open-source notebook projects for each additional language!</p>
<p><span style="text-decoration: underline;">Here is the amazing part:</span>&nbsp;by using so-called cell magic, one can push a Python object, say a&nbsp;pandas&nbsp;dataframe directly into R and it is converted into an R dataframe. I do not remember the specifics of why this is possible, but this is huge.&nbsp;<del style="font-family: georgia, arial, san-serif; font-size: small; background-color: #ffffff;" datetime="2013-03-04T02:53:58+00:00">This eliminates the need for packages like&nbsp;<a style="color: #007b00; font-weight: bold; text-decoration: none;" href="http://rpy.sourceforge.net/rpy2.html">RPy2</a>&nbsp;for basic computations between R and Python.</del>&nbsp;[Edit: RPy2 is used under the hood for this conversion. Thanks to&nbsp;<a style="color: #007b00; font-weight: bold; text-decoration: none; font-family: georgia, arial, san-serif; font-size: small; background-color: #ffffff;" title="Dirk" href="http://dirk.eddelbuettel.com/">Dirk</a>&nbsp;for pointing this out.] Brian mentioned that it also may be possible to eventually allow Python objects to interact with JavaScript libraries such as d3.js for visualization using widgets.</p>
<p>IPython Notebooks support narrative text, headings, graphics and also mathematical typesetting via&nbsp;<a href="http://www.mathjax.org">MathJax</a>. Executing code produces JSON strings that are portable and serializable for saving results without requiring code to be re-executed. The site&nbsp;<a href="http://nbviewer.ipython.com">nbviewer.ipython.com</a>&nbsp;provides an online viewer for IPython notebooks via URL, Git repository URL or Gist URL. This viewer does not require the web service to be installed locally. One current limitation of IPython Notebooks is that they only support a single user and thus cannot be hosted for, say, multiple students to log in to their own notebook session in a classroom.</p>
<p>Once <code>ipython</code> and <code>ipython-notebook</code> (the Ubuntu package name) are installed, one just executes the command <code>ipython notebook</code> in the directory of interest to start up a webserver for working with IPython Notebooks.</p>
<p>Apparently entire textbooks are being written as IPython Notebooks for their beauty, scientific ease and portability.</p>
<p> <span style="font-size: 14px;"><strong>Adversarial Learning</strong></span></p>
<p>The final talk I attended was&nbsp;<a href="http://strataconf.com/strata2013/public/schedule/detail/27257">What To Do When Your Machine Learning Gets Attacked</a>&nbsp;by&nbsp;<a href="http://www.linkedin.com/pub/vishwanath-ramarao/2/672/882">Vishwanath Ramarao</a>. The purpose of this talk was to discuss issues with the bad guys trying to circumvent machine learning models designed to prevent abuse of a system, such as a spammer learning how to get around a spam filter over time. This spammer is called an adversary, and can be a &#8220;black hat&#8221; data scientist. Some examples of adversarial situations are login fraud (<a href="http://en.wikipedia.org/wiki/Spear-phishing">spear phishing</a>, PR embarrassment or financial information), comment/mail spam, sign up fraud, astroturfing, credit card fraud and click fraud. Adversarial learning is the set of techniques that classify data emitted from an adversary.</p>
<p>An adversarial situation arises when the adversary is able to observe the output of the learning system and can change some subset of the features used in that system so that their attempts go unpunished. The goal of adversarial learning is to make it costly for an adversary to change features. The approach towards a solution is labor intensive, but simple to explain. Ramarao essentially said that the best way to combat adversaries is to
<ol>
<li>engineer features interactively and quickly.&nbsp;</li>
<li>not throw away features as we commonly do. It is possible that some features may be activated as the adversary&#8217;s methods evolve.&nbsp;</li>
<li>consider the entire transmission of an adversarial transaction &#8212; that is, do not just look at the words in a spam email but also look at the HTTP headers and other communication information passed along with the text.</li>
<li>study anomalies (outliers and high leverage points) and not discard them. Usually such anomalies are adversaries.</li>
<li>permit overfitting when necessary for the reason mentioned in #3.</li>
</ol>
<p>As a text mining enthusiast, I learned some interesting tricks on fitting machine learning models to text, none of which had anything to do with adversarial learning.
<ul>
<li>A&nbsp;<em><a href="http://en.wikipedia.org/wiki/Homoglyph">homoglyph</a>&nbsp;</em>is the translation of a word by replacing some characters with a character that&nbsp;<strong>looks</strong>&nbsp;similar. For example,&nbsp;p0rn&nbsp;is a homoglyph of&nbsp;porn&nbsp;&#8211; the&nbsp;o&nbsp;in&nbsp;porn&nbsp;is replaced with a character that looks similar, the zero&nbsp;0.</li>
<li>A<em>&nbsp;broken word&nbsp;</em>is a translation of an intended word with spaces added. For example, the word&nbsp;nigeria&nbsp;could be a feature for a spam detection algorithm. An adversary can bypass the filter by instead writing&nbsp;ni geria.</li>
<li><em><a href="http://en.wikipedia.org/wiki/Hash_buster">Hash busters</a>&nbsp;</em>are cases where new words that were not in the lexicon used to train the text model are injected into content. One should use the count of the number of hash busters and use it as a feature in a model. One common hash buster for a naive profanity filter would be the word&nbsp;fcuk&nbsp;instead of the actual word&nbsp;f*ck.</li>
</ul>
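<p>To make these three tricks concrete, here is a small sketch of how they might be turned into text features. It is my own illustration rather than anything shown in the talk: the homoglyph table, the lexicon and the function names are all hypothetical, and a production system would need far larger tables.</p>

```python
import re

# Hypothetical table of look-alike characters; a real one would be far larger.
HOMOGLYPHS = str.maketrans({"0": "o", "1": "l", "3": "e", "@": "a", "$": "s"})

# Hypothetical lexicon standing in for the vocabulary seen at training time.
LEXICON = {"cheap", "porn", "from", "nigeria", "send", "money"}

def normalize_homoglyphs(text):
    """Map look-alike characters back to the letters they imitate."""
    return text.lower().translate(HOMOGLYPHS)

def rejoin_broken_words(tokens, lexicon):
    """Merge adjacent tokens when their concatenation is a known word."""
    merged, i = [], 0
    while i != len(tokens):
        pair = "".join(tokens[i:i + 2])
        if i + 1 != len(tokens) and pair in lexicon:
            merged.append(pair)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def extract_features(text, lexicon=LEXICON):
    """Return cleaned tokens plus a count of out-of-lexicon words."""
    tokens = re.findall(r"[a-z]+", normalize_homoglyphs(text))
    tokens = rejoin_broken_words(tokens, lexicon)
    hash_busters = sum(1 for tok in tokens if tok not in lexicon)
    return {"tokens": tokens, "hash_buster_count": hash_busters}

features = extract_features("Cheap p0rn from ni geria xqzvw")
print(features)
```

<p>The cleaned tokens can feed the usual bag-of-words model, while the hash buster count becomes one extra numeric feature, as suggested in the talk.</p>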
<p> <span style="font-size: 14px;"><strong>Julia</strong></span></p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2013/03/logo.png" alt="" width="209" height="139" class="lfloat" /><br />
After being enlightened by this wonderful talk, I am going to write a more substantial post focusing solely on Julia, so for now I will just briefly describe some of the more easy-to-explain content. This talk was presented by <a href="http://forio.com/simulate/mbean/">Michael Bean</a> from <a href="http://www.forio.com">Forio</a> (developers of <a href="http://forio.com/julia/">Julia Studio</a>). As data scientists we love dynamic environments for interactive data munging such as R, or the Python shell (with <code>pandas</code> or SciPy). We typically start with a high level language such as R and then port this code to a compiled or performant language like C, C++ or Java (and maybe Python). This is a large barrier in scientific computing because it requires the data scientist to know&nbsp;<em>two&nbsp;</em>languages: one to experiment, and one to implement. Julia is a scientific computing language that provides the performance of a programming language like C++ and adds technical libraries and accessibility for scientific exploration. Bean cited that Julia&#8217;s performance is similar to C++. Julia allows us to complete tasks faster because we remove the need for &#8220;glue&#8221; code and Julia packages are written in Julia for performance rather than requiring C or Fortran. [R packages can be written solely in R, but for computationally intensive operations, or for packages that will sit in a bottom layer such as data structures etc. there is a huge performance hit.] Once one is familiar with Julia, it is easy to &#8220;hack the core&#8221; so to speak.&nbsp;</p>
<p>Other features that impress me:
<ul>
<li>the user can redefine arithmetic operations and construct new data types. Julia uses <a href="http://en.wikipedia.org/wiki/Multiple_dispatch">multiple dispatch</a> which is a programming language feature that uses different implementations of functions depending on the data types passed to the function. For example, if&nbsp;<em>A&nbsp;</em>and&nbsp;<em>B</em>&nbsp;are of type matrix, then Julia will know that&nbsp;<em>A * B</em>&nbsp;is the&nbsp;<em>matrix multiplication&nbsp;</em>operation rather than elementwise multiplication.&nbsp;</li>
<li>common data structures found in computer science are supported natively such as <a href="http://en.wikipedia.org/wiki/Bit_array">BitArrays</a> and SubArrays as well as types statisticians are already familiar with including Distribution and DataFrame.</li>
<li>support for list comprehensions. For example, to square every element, use <code>[xi^2 for xi in x]</code> instead of a <code>for</code> loop.</li>
<li><em>every </em>package is a Git repository and thus open-source and easy to access.</li>
<li>some packages support multicore natively.</li>
<li>certain functions can have a bang (<code>!</code>) appended, which tells Julia to modify the object in place rather than make a copy (think in-place sort, which is&nbsp;<code>sort!</code>).</li>
</ul>
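<p>For readers who have not met multiple dispatch before, the idea can be imitated in Python with a registry keyed on the types of <em>all</em> arguments. This is only a toy sketch to show the concept and says nothing about how Julia implements it; the <code>dispatch</code> decorator and the <code>Matrix</code> class are hypothetical, and the lookup requires an exact type match.</p>

```python
# Registry mapping (function name, argument types) to an implementation.
_IMPLEMENTATIONS = {}

def dispatch(*types):
    """Register a function for one combination of argument types."""
    def register(func):
        _IMPLEMENTATIONS[(func.__name__, types)] = func
        def call(*args):
            # Pick the implementation from the types of every argument.
            key = (func.__name__, tuple(type(a) for a in args))
            return _IMPLEMENTATIONS[key](*args)
        return call
    return register

class Matrix:
    def __init__(self, rows):
        self.rows = rows

@dispatch(float, float)
def mul(a, b):
    return a * b  # plain scalar product

@dispatch(Matrix, Matrix)
def mul(a, b):
    # Matrix product: each row of a dotted with each column of b.
    cols = list(zip(*b.rows))
    return Matrix([[sum(x * y for x, y in zip(row, col)) for col in cols]
                   for row in a.rows])

@dispatch(float, Matrix)
def mul(a, b):
    return Matrix([[a * x for x in row] for row in b.rows])

identity = Matrix([[1.0, 0.0], [0.0, 1.0]])
m = Matrix([[1.0, 2.0], [3.0, 4.0]])
print(mul(2.0, 3.0))          # scalar * scalar
print(mul(m, identity).rows)  # matrix * matrix
print(mul(2.0, m).rows)       # scalar * matrix
```

<p>The in-place convention has a Python analogue as well: <code>sorted(x)</code> returns a new list, whereas <code>x.sort()</code> modifies the list in place, much like Julia&#8217;s <code>sort</code> versus <code>sort!</code>.</p>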
<p>Bean showed that the development process with Julia is shorter than with languages such as R because production-level re-implementation is not necessary. The runtime is also faster for the few examples he showed. The following is an example of the recursive implementation of generating Fibonacci numbers in both R and Julia:</p>
<table>
<tbody>
<tr>
<td><strong>R Code</strong></td>
<td><strong>Julia Code</strong></td>
</tr>
<tr>
<td>
<pre class="brush: r; title: ; notranslate">
fib &lt;- function(n)
{
  if (n &lt; 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}
 
start &lt;- Sys.time()
fib(36)
end &lt;- Sys.time()
end - start
</pre>
</td>
<td>
<pre class="brush: r; title: ; notranslate">
fib(n) = n &lt; 2 ? n : fib(n - 1) + fib(n - 2)
@elapsed fib(36)
</pre>
</td>
</tr>
<tr>
<td><strong>Runtime: </strong>192 seconds</td>
<td><strong>Runtime: </strong>0.24 second</td>
</tr>
</tbody>
</table>
<p> <span style="font-size: 18px;"><strong>Connected World Strand</strong></span></p>
<p> <span style="font-size: 14px;"><strong>Bit.ly: Deriving an Interest Graph</strong></span><br />
<img src="http://www.bytemining.com/wp-content/uploads/2013/03/bitly_logo.png" alt="" width="150" height="68" class="lfloat" /><br />
The first talk in this strand that I attended was by <a href="https://twitter.com/OMGannaks">Anna Smith</a> of <a href="http://bit.ly">bit.ly</a>&nbsp;titled <a href="http://strataconf.com/strata2013/public/schedule/detail/27466">Deriving an Interest Graph for Social Data</a>.&nbsp;It should be no surprise that a URL shortening service would have a ton of data to sift through.&nbsp;Anna stated that a lot of her work is one-off analysis. What I liked about Anna&#8217;s talk in particular is that the visualizations she used were very basic. There was nothing fancy about what her graphics displayed &#8212; they just displayed some insights about the data and that is it.&nbsp;</p>
<p>Bitly extracts a lot of data from each shortened URL including keywords, topics and the probability that the click came from a human. One can derive a taxonomy and interest graph by analyzing click data among links. The idea is to look at other webpages a user went to from the page related to the shortened URL. It is hypothesized that the next page the user visits is related in content to the current page. On a domain level, a <em>coclick</em>&nbsp;graph uses domains as nodes and the number of clicks between them as edges. From this, we can derive a graph of keywords by using the <a href="http://en.wikipedia.org/wiki/Jaccard_similarity">Jaccard similarity</a> using the number of clicks to a domain with a particular keyword for both sets. The resulting coclick graph has 4.5 million keywords and 9 million edges. By using some basic processing (removing non-English keywords and keywords with low click numbers) and then running a clustering algorithm called <a href="http://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a>, they were able to simplify their graph to 200,000 keyword clusters and 1 million edges.</p>
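<p>As a rough illustration of the keyword graph, here is a sketch that links two keywords when the sets of domains they appear on overlap enough. It is a simplification of what was described: it ignores click counts entirely, and the sample data, the <code>keyword_edges</code> helper and the threshold are hypothetical.</p>

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)

# Hypothetical data: for each keyword, the domains where it was clicked.
keyword_domains = {
    "python": {"python.org", "stackoverflow.com", "github.com"},
    "pandas": {"pydata.org", "stackoverflow.com", "github.com"},
    "recipes": {"allrecipes.com"},
}

def keyword_edges(index, threshold=0.3):
    """Link every pair of keywords whose similarity reaches the threshold."""
    names = sorted(index)
    edges = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(index[first], index[second])
            if score >= threshold:
                edges[(first, second)] = score
    return edges

edges = keyword_edges(keyword_domains)
print(edges)
```

<p>Comparing every pair like this is quadratic in the number of keywords, so at 4.5 million keywords one would need pruning or a distributed job; the sketch is only meant to show the similarity measure.</p>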
<p>The Data Science group at bit.ly keeps an updated GitHub repository for their work <a href="http://bitlyscience.github.com">here</a>.</p>
<p> <span style="font-size: 14px;"><strong>LinkedIn Endorsements</strong></span></p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2013/03/logo-linkedin.png" alt="" width="180" height="51" class="lfloat" /><br />
The last session in this strand that I attended was by <a href="https://twitter.com/sam_shah">Sam Shah</a> and <a href="https://twitter.com/peteskomoroch">Pete Skomoroch</a> from <a href="http://www.linkedin.com">LinkedIn</a>. This talk discussed the skills endorsement feature of LinkedIn and how they made it successful using science. Sam and Pete credit most of the success to establishing viral loops and using recommendation engines as follows: <em>A </em>endorses <em>B</em>&nbsp;-&gt; <em>B </em>is notified -&gt; <em>B </em>accepts the endorsement and endorses someone else.</p>
<p>Social tagging of skills also accelerated adoption. First, a user thinks about their skills and tags them on their profile. Then, a recommendation system recommends other related skills as well as some potential people for the user to endorse. But this is not the interesting part&#8230;</p>
<p>How does LinkedIn maintain a skill dictionary and taxonomy? This is a highly unwieldy problem due to human psychology and variations of language usage. One of the biggest issues is in <em><a href="http://en.wikipedia.org/wiki/Word-sense_disambiguation">phrase sense disambiguation</a></em>. The motivating example was the skill <em>angel</em>. If I list <em>angel</em>&nbsp;as a skill on my profile, am I referring to myself as an <em>angel investor </em>or as a <em>spiritual being</em>? The speakers indicated that by using the graph of all the skills listed in addition to <em>angel</em>, we could use agglomerative clustering and then a distance metric to determine which meaning is most likely. Another issue is de-duplication. An example of this is <em>MS Office, Microsoft Office, Office</em>. All of these concepts refer to the same thing. For this particular problem, LinkedIn used crowdsourcing with <a href="https://www.mturk.com/mturk/welcome">Mechanical Turk</a> tasks. An example human interaction task was to ask a participant to find the best Wikipedia article for the particular topic, since Wikipedia tends to already have a strong army that de-duplicates content.&nbsp;</p>
<p>This is all great for users that actively use the Skills feature, but some do not. For those users, a system passes a sliding window over the profile text (<a href="http://en.wikipedia.org/wiki/N-gram">n-grams</a>) and emits possible matches based on the taxonomy, tossing out words that do not fit into the inferred topics in the profile. For example, suppose my profile text says &#8220;<em>I love working with data, Python, Java and Hadoop.</em>&#8221; The words <em>I</em>, <em>with</em>&nbsp;and <em>and</em>&nbsp;will all be tossed as stopwords. Then, I have the following keywords left: <em>love, working, data, Python, Java, Hadoop</em>. For all practical purposes, <em>working </em>is probably considered a stopword or low-impact word because it appears so much on LinkedIn profiles, and <em>data </em>is probably&nbsp;not an actual skill, so both of these words are removed, leaving <em>love, Python, Java, Hadoop</em>. Using LinkedIn&#8217;s taxonomy of skills, we would probably deduce that Python, Java and Hadoop are highly related and <em>love </em>is an extreme outlier (for some people <em>love </em>may be an actual skill, but likely not in this context). Finally, this system would tag <em>Python</em>, <em>Java</em>, and <em>Hadoop</em>&nbsp;as skills to add to the profile. For more complex (realistic) examples, LinkedIn would then apply word sense disambiguation and de-duplication. A simple <a href="http://en.wikipedia.org/wiki/Naive_bayes">Naive Bayes</a> algorithm is used to generate the actual recommendations. In the event of a completely blank profile, recommended skills are based on title, organization and perhaps social network features.</p>
<p>LinkedIn can also suggest endorsements where the system asks a user to endorse another user for particular skills that the user may know about. Some features used for this recommendation engine include people-skill combinations, school overlap, group overlap, similarity in industry, title similarity, site interactions, and co-interactions. Such a recommendation engine is basically a binary classification problem for link presence.</p>
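<p>The sliding-window step on the example sentence can be sketched as follows. This is my own toy version, not LinkedIn&#8217;s system: the stopword list, the tiny skill taxonomy and the function names are hypothetical, and the topic-outlier filtering, disambiguation and Naive Bayes ranking are all left out.</p>

```python
import re

# Hypothetical stand-ins for a real stopword list and skill taxonomy.
STOPWORDS = {"i", "with", "and", "working"}
SKILL_TAXONOMY = {"python", "java", "hadoop", "machine learning"}

def ngrams(tokens, n):
    """Slide a window of n tokens over the sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def suggest_skills(profile_text, max_n=2):
    """Emit every n-gram of the profile text that matches a known skill."""
    tokens = [t for t in re.findall(r"[a-z]+", profile_text.lower())
              if t not in STOPWORDS]
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in SKILL_TAXONOMY and gram not in found:
                found.append(gram)
    return found

skills = suggest_skills("I love working with data, Python, Java and Hadoop.")
print(skills)
```

<p>Here <em>love</em> and <em>data</em> drop out simply because they are not in the toy taxonomy; in the system described above they would be rejected for not fitting the topics inferred from the rest of the profile.</p>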
<p>This talk by LinkedIn was surprisingly candid. Obviously, one cannot employ the methods they discussed because we do not have access to their data or infrastructure, thus such a talk is of no risk to intellectual property. Many companies do not get this and do not allow their employees to speak about anything involving their work.</p>
<p> <span style="font-size: 18px;"><strong>Conclusion</strong></span><br />
<img src="http://www.bytemining.com/wp-content/uploads/2013/03/20130226_1747291.jpg" alt="" width="208" height="156" class="rfloat" /><br />
I am glad I forked out the money to attend Strata and I will likely attend next year. The conference was huge and there was something for every data geek including a ton of food. The conference overall was not as sales-y as I thought it would be, but there were definitely moments particularly in the morning sessions and at the expo. I mainly just collected t-shirts at the expo hall, but it was basically just a giant &#8220;my Hadoop distribution is 100x faster than the other guys&#8217;.&#8221; There was also a really cool sensor lab setup for collection of data using <a href="http://arduino.cc">Arduino</a> sensors. There were several sensors placed throughout the conference venue and the data was visualized and placed <a href="https://plus.google.com/108178010523548991871/posts">here</a>.</p>
<p>During my time at Strata so far, I have finally had the chance to meet some longtime Twitter friends and reunite with others. It was great meeting&nbsp;<a href="https://twitter.com/neilkod">Neil Kodner</a>&nbsp;and discussing our common interests as well as meeting&nbsp;<a href="http://www.twitter.com/mathieubastian">Mathieu Bastian</a>&nbsp;and discussing graph processing and the future of Gephi (I need to write a blog post about Gephi soon). I had a chance to talk to&nbsp;<a href="http://www.twitter.com/wesmckinn">Wes McKinney</a>&nbsp;over lunch as well about Python and the pandas community. On the last day, I went to an event hosted by Facebook and met several Facebook engineers and other Twitter friends including <a href="http://twitter.com/turian">Joseph Turian</a>, <a href="http://twitter.com/LusciousPear">Bradford Stephens</a>, <a href="http://twitter.com/dtunkelang">Daniel Tunkelang</a> and <a href="http://twitter.com/gregrahn">Greg Rahn</a>. Everybody I met has now relocated to the Bay Area and I think I am going to need to follow&#8230;</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2013/03/20130304_150030.jpg" alt="" width="624" height="468" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2013/02/summary-of-my-first-trip-to-strata-strataconf/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Merry Christmas and Happy Holidays!</title>
		<link>http://www.bytemining.com/2012/12/merry-christmas-and-happy-holidays/</link>
		<comments>http://www.bytemining.com/2012/12/merry-christmas-and-happy-holidays/#comments</comments>
		<pubDate>Mon, 24 Dec 2012 19:58:02 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1208</guid>
		<description><![CDATA[Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!

An update on me. In October, I began working at Riot Games, the developers of League of Legends. It has been an amazing experience and has occupied the majority of my free time as has my dissertation.&#160;My New Year&#8217;s resolution this year is to dust the cobwebs off this blog!



Have a safe holiday season!

Here in California, I will be having Christmas in the Sand
]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><div style="text-align: center;"><span style="font-family: 'trebuchet ms', geneva; font-size: x-large; color: #ff0000;"><strong>Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!</strong></span></div>
<div></div>
<div>An update on me. In October, I began working at <a href="http://www.riotgames.com">Riot Games</a>, the developers of <a href="http://www.leagueoflegends.com">League of Legends</a>. It has been an amazing experience and has occupied the majority of my free time as has my dissertation.&nbsp;My New Year&#8217;s resolution this year is to dust the cobwebs off this blog!</div>
<div></div>
<div><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/12/merry-christmas-1353268943jWR-e1356378920640.jpg" alt="" /></div>
<div></div>
<div>Have a safe holiday season!</div>
<div></div>
<div style="text-align: center;"><span style="color: #339966;"><strong><span style="font-size: large;">Here in California, I will be having <em><a href="http://www.youtube.com/watch?v=YnvzsZCJjZ0">Christmas in the Sand</a></em></span></strong></span></div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/12/merry-christmas-and-happy-holidays/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A New Data Toy &#8212; Unboxing the Raspberry Pi</title>
		<link>http://www.bytemining.com/2012/10/a-new-data-toy-unboxing-the-raspberry-pi/</link>
		<comments>http://www.bytemining.com/2012/10/a-new-data-toy-unboxing-the-raspberry-pi/#comments</comments>
		<pubDate>Tue, 09 Oct 2012 17:30:49 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Startups]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1195</guid>
		<description><![CDATA[<p>Last week I received two Raspberry Pis in the mail from AdaFruit and just now have some time to play with them. The Raspberry Pi is a minimal computer system that is about the size of a credit card. In the embedded systems community, the excitement is for obvious reasons, but I strongly believe that such a device can help collect and use data to help us make better decisions because not only is it a computer, but it is small and portable.</p>
<p>For development, Raspberry Pi can connect to a television (or other display) via HDMI or composite video (the &#8220;yellow&#8221; plug for those still stuck in the 1900s haha). A keyboard, mouse and other devices can be connected via two USB ports. A powered hub can provide support for even more devices. There are also various pins for connecting to a breadboard for analyzing analog signals, for a camera or for an external (or touchscreen) display. An SD Card essentially serves as the hard disk and probably a portion of the RAM. The more recent Model B ships with 256MB RAM. Raspberry Pi began shipping in February 2012 and these little guys have been very difficult to get a [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2012/10/raspberry-pi-580-75.jpeg" alt="" width="128" height="152" />Last week I received two <a href="http://www.raspberrypi.org/">Raspberry Pi</a>s in the mail from <a href="http://www.adafruit.com/">AdaFruit</a> and just now have some time to play with them. The Raspberry Pi is a minimal computer system that is about the size of a credit card. In the embedded systems community, the excitement is for obvious reasons, but I strongly believe that such a device can help collect and use data to help us make better decisions because not only is it a computer, but it is small and portable.</p>
<p>For development, Raspberry Pi can connect to a television (or other display) via HDMI or composite video (the &#8220;yellow&#8221; plug for those still stuck in the 1900s haha). A keyboard, mouse and other devices can be connected via two USB ports. A <strong>powered</strong> hub can provide support for even more devices. There are also various pins for connecting to a breadboard for analyzing analog signals, for a camera or for an external (or touchscreen) display. An SD Card essentially serves as the hard disk and probably a portion of the RAM. The more recent Model B ships with 256MB RAM. Raspberry Pi began shipping in February 2012 and these little guys have been very difficult to get a hold of. I finally got tipped off as to when more became available by following the <a href="http://www.reddit.com/r/raspberrypi">Raspberry Pi subreddit</a>. Raspberry Pi was originally not designed with geeks in mind. In fact, <a href="http://www.raspberrypi.org/faqs">they were originally designed</a> to teach school children about computers and programming.&nbsp;</p>
<p>The figure below shows the size of my Raspberry Pi versus the size of the credit card I purchased it with (just kidding). The price is also small, at about $35 depending on where you buy it!</p>
<p><center><br />
<img src="http://www.bytemining.com/wp-content/uploads/2012/10/2012-10-08_19-49-33_1151.jpeg" alt="" width="452" height="503" /><br />
</center></p>
<p>So what can you do with it? I imagine&nbsp;<em>almost </em>anything a computer can do. Just remember that you are limited by a lightweight CPU, power restrictions and potential heat issues. Raspberry Pi does allow outputting high definition video.&nbsp;<strong>I have not done enough testing to check any of these, though.</strong></p>
<p>Here are some generic ideas:</p>
<ul>
<li>Realtime informational displays of data and graphics on a large display.&nbsp;
<ul>
<li>The Raspberry Pi conforms to some standard that allows it to be mounted (with assistance) to the back of an HDMI display.</li>
<li>Use the RPi as a dedicated system for pulling data from other systems, doing some lightweight processing (or pull results from another system) and then display the results.</li>
</ul>
</li>
<li>Small, portable data collection and transmission devices.
<ul>
<li>Raspberry Pi can be powered from AC of course, or over a MicroUSB to USB cable, similar to those used to charge Android devices.</li>
<li>Connect a small (or regular sized) wireless adapter, or 3G/4G dongle for data transfer.</li>
<li>Connect a Bluetooth dongle for communication with other data collection devices (think GPS receivers etc.).</li>
<li>Connect an IR receiver via USB for remote control.</li>
<li>Connect a USB battery backup for times where AC is not available (5V) such as in the field, or when an automobile does not provide power.</li>
</ul>
</li>
<li>Development of data-driven &#8220;fat&#8221; clients.
<ul>
<li>Use Raspberry Pi to make automated decisions with machine learning, using your favorite development tools and statistical libraries including R. Obviously, mileage may vary. We are not talking about 8-core Xeon CPUs here&#8230;</li>
</ul>
</li>
<li>For use as a &#8220;motherboard&#8221; (pun?) for collecting and analyzing analog signals using a separate breakout board for Raspberry Pi.</li>
</ul>
<p>It is important to understand that hardware compatibility is more hit-or-miss than it is with a standard desktop or laptop. Certain chipsets must be matched, and drivers must be compatible with the ARM architecture. To research which items to purchase, I took a look at the <a href="http://elinux.org/RPi_VerifiedPeripherals">RPi Verified Peripherals wiki page</a>.</p>
<p>A similar platform I was eyeing was the <a href="http://arduino.cc">Arduino</a>. The biggest win of the Raspberry Pi over Arduino (I believe) is that Raspberry Pi is a mini-computer that can run a standard garden-variety operating system (well, Linux), whereas Arduino is a platform for collecting and transmitting analog and digital signals using its own software.</p>
<p>So, what do <em>I </em>plan to do with my Raspberry Pis? It is kind of secret&#8230; OK, not really, but I don&#8217;t want to write about it until I have something to show! What will you do with yours?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/10/a-new-data-toy-unboxing-the-raspberry-pi/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Adventures at My First JSM (Joint Statistical Meetings) #JSM2012</title>
		<link>http://www.bytemining.com/2012/08/adventures-at-my-first-jsm-joint-statistical-meetings-jsm2012/</link>
		<comments>http://www.bytemining.com/2012/08/adventures-at-my-first-jsm-joint-statistical-meetings-jsm2012/#comments</comments>
		<pubDate>Mon, 06 Aug 2012 16:30:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1139</guid>
		<description><![CDATA[During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that “I am not a statistician.” Ok, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2012/08/P10102181.jpg" alt="" width="200" height="150" />During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that &#8220;I am not a statistician.&#8221; Ok, I am a renegade statistician, a statistician by training. <a href="http://www.amstat.org/meetings/jsm/2012/">JSM 2012</a> was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to <em>at least one </em>JSM. The conference itself was five days, but I did not think I could hang with statisticians for five days, and more importantly, taking three days off to attend the conference was more reasonable since I work in industry.</p>
<p>I arrived at the conference on the third day, July 31. Unfortunately, upon arriving at the <a href="http://manchestergrand.hyatt.com/hyatt/hotels-manchestergrand/index.jsp?null">Manchester Grand Hyatt</a>, I was informed that they had screwed up my reservation and that they had overbooked for the night. No wonder I never received a confirmation email despite calling them three times. Grmph. But my faith in humanity was restored by their compensation package: one night FREE next door at the <a href="http://www.marriott.com/hotels/travel/sandt-san-diego-marriott-marquis-and-marina/">Marriott</a>, which was closer to the conference, FREE parking for the entire stay, FREE Internet for the entire stay, and FREE breakfast at their buffet for the entire stay. I also got a free upgrade to a room with a view of the bay and the <a href="http://www.midway.org/">USS Midway</a>.</p>
<table>
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010182.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010264.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010225.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010294.jpg" alt="" width="200" height="150" /></td>
</tr>
</tbody>
</table>
<p><strong>My First Day, July 31</strong></p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2010/04/facebook-privacy1-e1344224774566.jpg" alt="" />After I got lost in the convention center for quite a while, one of the first people I saw was a friend I had met at <a href="http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/">KDD-2011</a>, who encouraged me to attend the Facebook talk, strangely titled Stat-Us. Anyway, all of the speakers in this session were Data Scientists at Facebook, including <a href="http://pleasescoopme.com/">Jonathan Chang</a>, the author of the <a href="http://cran.r-project.org/web/packages/lda/index.html">LDA package in R, called lda</a>. Most of their talks discussed cleaning the social graph using machine learning. Facebook actively removes fan pages etc. that are deemed to be duplicates using decision trees and other algorithms with features such as age of the creator, grammar and number of fans. Jonathan Chang discussed disambiguating and clustering places for the Places product. For example, users have created several places to refer to Disneyland: &#8220;disneyland&#8221;, &#8220;Disney Land&#8221;, &#8220;Happiest Place on Earth&#8221; etc. Facebook uses some NLP techniques, but even more effective are their techniques that compare the distribution and demographics of check-ins to each of the places. Of course, seasonality in check-ins is another aspect used in their model. But what about ubiquitous venues such as McDonald&#8217;s or Starbucks? By studying the radial density of these establishments, it is easier to disambiguate which location a user is discussing. This also makes it easier to correct places tagged at the wrong location. One other interesting point is sterile computing. Much of the information Facebook data scientists work with is highly personal. During some sensitive analyses, data scientists use sterile machines that are not connected to the Internet to perform their analysis.
Analysis, diagnostics and graphics to be conducted during the process must be constructed beforehand. Also, data scientists have no access to any of the raw data in this system, only the results of their analysis. All in all, I was excited. I was surprised at the level of detail they provided.</p>
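<p>As a rough sketch of what &#8220;comparing the distribution of check-ins&#8221; between candidate duplicate places might look like, the snippet below scores pairs of places with the Jensen-Shannon divergence. The choice of that measure, the age brackets and all of the counts are assumptions for illustration; the talk did not say which statistic Facebook uses.</p>

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence: 0 for identical distributions, at most 1."""
    p, q = normalize(counts_a), normalize(counts_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2


# Made-up check-in counts by age bracket: [13-17, 18-24, 25-34, 35-49, 50+]
checkins = {
    "disneyland": [120, 300, 420, 310, 90],
    "Disney Land": [14, 31, 40, 33, 8],
    "Joe's Dive Bar": [0, 55, 80, 20, 3],
}

for name in ("Disney Land", "Joe's Dive Bar"):
    score = js_divergence(checkins["disneyland"], checkins[name])
    print(name, round(score, 4))
```

<p>A low divergence is only one piece of evidence that two entries are the same place. It would be combined with the name matching, seasonality and location signals mentioned above.</p>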
<p>That afternoon I attended one of the many sessions titled Clustering. I had a difficult time choosing among <a href="http://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf">L1 Regression</a>, Prediction in Social Networks and Clustering, but since my dissertation topic involves Latent Dirichlet Allocation, a form of clustering text, I felt this was the most interesting. Most of the talks were very good. The most interesting talks in the session were about analyzing massive structured data using <a href="http://nuit-blanche.blogspot.com/2011/09/sparse-generalized-pca-or-why-you-can.html">sparse generalized PCA</a>, estimating similarity metrics to evaluate clustering algorithms, clustering <a href="http://en.wikipedia.org/wiki/Autoregressive_model">autoregressive time series</a> and using clustering to detect network intrusions.&nbsp;</p>
<p>After the Clustering talk, I got to meet John Ramey (<a href="http://www.baylor.edu">Baylor</a>,&nbsp;<a href="http://www.twitter.com/ramhiser">@ramhiser</a>), one of the speakers and also a Twitter friend. As we were talking, <a href="http://yihui.name/en/">Yihui Xie</a> (<a href="http://www.iastate.edu">Iowa State</a>, <a href="http://www.twitter.com/xieyihui">@xieyihui</a>) joined us. It wasn&#8217;t until we were all in conversation that I realized Yihui is the author of the now popular <a href="http://yihui.name/en/2011/12/knitr-elegant-flexible-and-fast-dynamic-report-generation-with-r/">knitr</a> package! Later, John and I grabbed a drink nearby and talked about R, Python, computing and academia. It was a great conversation because we both have similar interests in statistics and computer science and a similar way of using tools to solve problems. This was great to hear because lately I have been working as a data scientist in more of a software engineering capacity than a typical data scientist capacity. Since this field is still new, and since I am still new to industry, I am not sure what is &#8220;typical&#8221; yet. I also got some great advice about publications and getting back into academia. (My &#8220;life plan&#8221; is to work in industry in a research capacity and return to academia later in life.) Later in the evening I met up with some colleagues who have since graduated. I also got to meet an intern data scientist from <a href="http://www.redfin.com/home">Redfin</a>.</p>
<p><strong>My Second Day, August 1</strong></p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2012/08/logoCTB.png" alt="" width="199" height="48" />The next day I had my usual stressful dilemma of trying to pick which sessions to attend. I was faced with the choice of <a href="http://en.wikipedia.org/wiki/Non-parametric_statistics">non-parametrics</a>, high dimensional learning, cheating on tests, and networks. My original interest in the field of Statistics was a fascination with <a href="http://en.wikipedia.org/wiki/Psychometrics">Psychometrics</a>, so I attended the &#8220;Statistics in Uncovering Administrative Cheating on Tests&#8221; first. I only attended the first talk which was by <a href="http://www.ctb.com">CTB/McGraw-Hill</a>. They were on my short list of companies I wanted to work for when I started graduate school, and some of their products include familiar names like the <a href="http://www.wisegeek.com/what-is-the-california-achievement-test.htm">California Achievement Tests</a> (CAT/6) and&nbsp;<a href="http://en.wikipedia.org/wiki/California_Standardized_Testing_and_Reporting_(STAR)_Program">California Test of Basic Skills</a> (CTBS) both of which are deprecated and replaced by <a href="https://www.ctb.com/ctb.com/control/productFamilyViewAction?productFamilyId=449&amp;p=products">TerraNova</a>. The speaker addressed three issues: copied answers, fraudulent erasures and stolen test items. Detection of all of these depended on the <a href="http://www.cse.ucla.edu/products/reports/R220.pdf">three parameter logistic model</a> common in <a href="http://en.wikipedia.org/wiki/Item_response_theory">item response theory (IRT)</a>. Detecting copying simply involved doing pairwise comparisons of <a href="http://en.wikipedia.org/wiki/Multinomial_distribution">multinomial distributions</a> (the response to each item forms a multinomial random variable). 
Fraudulent erasures were more interesting, and occur when a teacher changes a student&#8217;s incorrect answer to a correct one. Empirically, students usually change an answer from an incorrect one to a correct one, but sometimes the opposite occurs. McGraw-Hill is able to detect fraudulent erasures as an unusually high number of modified answers from incorrect to correct compared to the student&#8217;s ability. Modern <a href="http://en.wikipedia.org/wiki/Optical_Mark_Reading">optical mark readers</a> (&#8220;<a href="http://www.scantron.com">Scantron</a>&#8221;s) report not only the selected answer, but also a parameter that shows other answers that were chosen and erased, and how dark the mark was. Stolen test items can be detected using the amount of time it took a student to mark an answer with respect to their ability. Stolen items will have a statistically significantly large residual where the student answered the item much faster than they should have given their ability. The speaker showed an illuminating example from the <a href="http://en.wikipedia.org/wiki/Graduate_Management_Admission_Test">GMAT</a>. After this first talk, I ran over to catch the high dimensional learning talk. The most interesting talk was from Peter Hall, where he considered using pairs and groups of interacting variables for machine learning prediction. Although this was interesting, this is very common in machine learning already. I then ran back to the networks talk. At that point, I was exhausted from running back and forth and could only try to listen.</p>
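<p>For readers who have not seen the three parameter logistic (3PL) model mentioned above, here is a minimal sketch. The item response function is the standard textbook form (without the optional 1.7 scaling constant). The z-score check is a simplified stand-in of my own for the &#8220;unusual given the student&#8217;s ability&#8221; idea, applied to the number of correct answers rather than to erasure counts; it is not CTB/McGraw-Hill&#8217;s procedure, and the item parameters are made up.</p>

```python
import math


def p_correct(theta, a, b, c):
    """3PL item response function: probability that a student of ability
    theta answers correctly (a = discrimination, b = difficulty, c = guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def score_z(theta, items, observed_correct):
    """z-score of an observed number-correct against the 3PL expectation,
    treating the items as independent Bernoulli trials."""
    probs = [p_correct(theta, a, b, c) for a, b, c in items]
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    return (observed_correct - mean) / math.sqrt(var)


# Made-up item parameters (a, b, c) for a five-item test.
ITEMS = [
    (1.2, -1.0, 0.20),
    (1.0, 0.0, 0.25),
    (0.8, 0.5, 0.20),
    (1.5, 1.0, 0.20),
    (1.1, 2.0, 0.25),
]

# A low-ability student (theta = -1) with a perfect score looks suspicious.
print(round(score_z(-1.0, ITEMS, observed_correct=5), 2))
```

<p>The same residual idea carries over to response times: replace the number correct with the time taken and flag answers that are far faster than the model expects for that ability.</p>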
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2010/11/about_logo.gif" alt="" width="175" height="65" /><img class="rfloatbox" src="http://www.bytemining.com/wp-content/uploads/2012/08/firefox.png" alt="" width="117" height="125" />Later in the morning I attended the Internet-Scale Statistical Computing talk. For some reason, the organizers chose a room that was far too small for the interest. I cannot say I am surprised. One of the talks discussed how Google scales R. Although the talk was great, we all know that Google is probably never going to open-source what they are working on, and that was clear from their responses to some of the audience questions. Google has created a package for R called flume which they say is similar to open-source projects <a href="http://www.cascading.org">Cascading</a> and Clank. They also use an abstraction they call distributed data objects which is probably some way of storing and replicating data across systems using GFS. The final talk was from <a href="http://people.mozilla.org/~sguha/">Saptarshi Guha</a> from <a href="http://www.mozilla.org">Mozilla</a>. Saptarshi developed the package <a href="http://www.datadr.org/">RHIPE</a> which is the original interface to Hadoop from R and is still in active development. Mozilla uses RHIPE to analyze Firefox crash logs and Saptarshi showed an example of using <a href="http://en.wikipedia.org/wiki/Quantile_regression">quantile regression</a> in a RHIPE context. After lunch I attended a talk on the LASSO, but I was far too exhausted to hear about theoretical methods and applications to biological data. All I could think of is that song by <a href="http://www.youtube.com/watch?v=okbDmEGrCZ8">Phoenix</a>&#8230;</p>
<blockquote><p>If there is one way to bore me, it is to make me listen to a talk about biological and medical applications of statistics.<br />
 &#8211;Ryan</p>
</blockquote>
<p><strong>My Final Day, August 2</strong></p>
<p>I attended the Visualizing Complex Models talk first thing in the morning. These talks basically discussed methods for visualizing specific model types related to the speaker&#8217;s research rather than a broad overarching discussion of complex models. Models that were discussed include the <a href="http://en.wikipedia.org/wiki/Multilevel_model">hierarchical linear model (HLM)</a>, <a href="http://en.wikipedia.org/wiki/Likert_scale">Likert scales</a>, <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear models (GLM)</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/20628070">logic forests</a> and maps. I learned about a technique called <a href="http://kooperberg.fhcrc.org/logic/">logic regression</a>, which attempts to predict a conjunction or disjunction of terms using predictors which are themselves conjunctions or disjunctions of logical atoms. I would not be surprised if this has already been done in the artificial intelligence community, but it was very cool to see it discussed from the statistics angle. The most entertaining talk came from <a href="http://faculty.nps.edu/sebuttre/">Samuel Buttrey</a> from the <a href="http://www.nps.edu/">Naval Postgraduate School</a>. He discussed a system called DaViTo, which is basically a Java dashboard graphically displaying statistics about incidents occurring in Afghanistan, including where they occurred (maps) and how they vary throughout the day and throughout the year. The beauty was that the backend which computed the statistics and drew the graphs was R. Buttrey had a knack for keeping the audience on its toes like a comedian. At one point he even rapped, which is something I have never seen at a conference. For those inquiring minds, it went something like&#8230;</p>
<blockquote><p>Then ya spot a fine woman sittin in your row<br />
 She&#8217;s dressed in yellow, she says &#8220;Hello,<br />
 come sit next to me you fine fellow.&#8221;</p>
<p>&#8211; Bust a Move, Young MC</p>
</blockquote>
<p>Um, yeah.</p>
<table>
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010205.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010214.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010206.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010208.jpg" alt="" width="200" height="150" /></td>
</tr>
</tbody>
</table>
<p><strong>Free Time</strong></p>
<p>JSM was rare in that I always felt like I was in a rush to get from one talk to another and I felt like I had very little free time. I spent most of Thursday afternoon and Friday touring San Diego. I love the beach so I spent most of my time walking through <a href="http://www.seaportvillage.com/">Seaport Village</a> and the neighboring parks. I also visited many of the tiki looking bars in the area. I also spent a few hours at the <a href="http://www.midway.org">U.S.S. Midway</a> which was a very interesting but strange experience. Apparently I am claustrophobic because not being able to see the sun while I was on the lower decks and having to crouch on the stairs was kind of scary. I also got lost in the ship&#8217;s maze of hallways and they had these dolls and animatronic figures that freaked me out. The flight deck was amazing though and I got to talk to a few of the WWII-era docents. It also provided a great view of San Diego Bay, <a href="http://www.coronado.ca.us/">Coronado Island</a> and the departures and arrivals at <a href="http://flightaware.com/live/airport/KSAN">San Diego International</a>. It was cool getting to see such a big part of U.S. history.</p>
<table>
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P10102671.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010378.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010402.jpg" alt="" width="200" height="150" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/08/P1010418.jpg" alt="" width="200" height="150" /></td>
</tr>
</tbody>
</table>
<p><strong>Conclusion and Differences from Other Conferences</strong></p>
<p>I had a great time at JSM and it was great to see what I have and have not been missing. Still, I prefer Computer Science conferences. I love mathematics, and I love getting into the hairy details&#8230; but only of my own research. I found that most JSM talks got too lost in the math and often times the speaker did not really convey the grand point.&nbsp;</p>
<p>JSM has staunch differences from other conferences I have attended logistically:</p>
<ul>
<li>No free Internet.&nbsp;This is 2012. If you wanted to pay for Internet, it was $12.95 per day in the Convention Center, $12.95 per day at the Hilton (assuming you attended both venues), and then whatever your hotel charged (the Hilton, Marriott and Hyatt all charge around $11 per day). So, unless one used their phone&#8217;s Internet, it would have cost a maximum of 5 * ($12.95 * 2 + $11) = $184.50, which is approximately two months of my home cable service. Absurd.</li>
<li>No refreshments except at the Hilton. The multiple Starbucks were a nice touch though.</li>
<li>Having the conference separated into two parts requiring a ten-minute walk between venues.&nbsp;Part of JSM was in the middle of the Convention Center, and another part was at the Hilton next door, which was about a 10-15 minute walk in the humidity. You would see a constant stream of people, like ants, moving between the Hilton and the Convention Center to catch their next talk. Many people had to leave sessions early or arrived late because of this walk. The irony was that half the Convention Center was empty.</li>
<li>Far too many sessions on biology and medicine.</li>
<li>Not enough talks on computing and big data, though this was no surprise.</li>
<li>Far too many concurrent sessions (up to 50) that were not organized into tracks or grouped by the &#8220;type&#8221; of audience they would draw. There were times when there were 5 or 6 related talks I wanted to see&#8230; and others when nothing sounded interesting.</li>
<li>Far too many special events that required a fee to attend.</li>
<li>All of the poster sessions took up a full session slot.</li>
<li>All of the vendors were in another room. I never got to even go to the Expo because it closed early, and there was no indication of where it was. That was frustrating because I love the Springer and Wiley tables.</li>
</ul>
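<p>For anyone who wants to check my Internet math from the first bullet, here is a minimal Python sketch. The rates are the ones quoted above; the function name, the constant names and the five-day count are my own assumptions for illustration. Amounts are kept in whole cents to sidestep floating-point rounding.</p>

```python
# Daily Wi-Fi rates quoted in the list above, in cents.
CONVENTION_CENTER_CENTS = 1295   # $12.95 per day at the Convention Center
HILTON_VENUE_CENTS = 1295        # $12.95 per day at the Hilton venue
HOTEL_ROOM_CENTS = 1100          # "around $11" per day at the hotel (approximate)
CONFERENCE_DAYS = 5              # assumed from the multiplier used in the post


def max_internet_cost_usd(days=CONFERENCE_DAYS):
    """Worst-case cost of paying for Internet everywhere, every day."""
    per_day_cents = (
        CONVENTION_CENTER_CENTS + HILTON_VENUE_CENTS + HOTEL_ROOM_CENTS
    )
    return days * per_day_cents / 100


if __name__ == "__main__":
    print(f"${max_internet_cost_usd():.2f}")
```

<p>Dropping either venue or shortening the stay is just a matter of changing the constants or the <code>days</code> argument.</p>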
<p>Next year <a href="http://www.amstat.org/meetings/jsm/2013/index.cfm">JSM will be held in Montreal, Quebec, Canada</a>. If I submit something and it is accepted, I may attend. Otherwise, I am happy I got to go to a JSM after all these years.</p>
<p>Finally, I&#8217;d like to thank my employer, <a href="http://www.gumgum.com">GumGum</a>, for allowing me to attend JSM 2012 during the work week.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/08/adventures-at-my-first-jsm-joint-statistical-meetings-jsm2012/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>OpenPaths and a Progressive Approach to Privacy</title>
		<link>http://www.bytemining.com/2012/07/openpaths-and-a-progressive-approach-to-privacy/</link>
		<comments>http://www.bytemining.com/2012/07/openpaths-and-a-progressive-approach-to-privacy/#comments</comments>
		<pubDate>Sun, 08 Jul 2012 18:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Geospatial]]></category>
		<category><![CDATA[Startups]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1121</guid>
		<description><![CDATA[<p></p>
<p>OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to contribute their location data for research projects and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as &#8220;a secure data locker for personal location information.&#8221; There is one aspect where OpenPaths is very different from other services like Google Latitude: Only the user has access to his/her own data and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user&#8217;s location data must ask the user personally via email (pictured below), and the user has the ability to deny the request. The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type, etc.&#160;In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google&#8217;s servers is obviously used by other Google services without explicit user permission.</p>
<p></p>
<p>The service is also opt-in, that [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://www.bytemining.com/wp-content/uploads/2012/07/openpaths6.png" alt="" width="338" height="70" /></p>
<p>OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to <a href="https://openpaths.cc/projects">contribute their location data for research projects</a> and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as &#8220;a secure data locker for personal location information.&#8221; There is one aspect where OpenPaths is very different from other services like Google Latitude: <strong>Only the user has access to his/her own data and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user&#8217;s location data must ask the user personally via email (pictured below), and the user has the ability to deny the request.</strong> The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type, etc.&nbsp;In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google&#8217;s servers is obviously used by other Google services without explicit user permission.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/07/openpaths4.png" alt="" width="640" height="352" /></p>
<p>The service is also opt-in, that is, it does not use hidden files on the user&#8217;s phone to track location &#8212; instead, the user must launch the OpenPaths mobile application to &#8220;opt-in&#8221; to location tracking. This is a double-edged sword though. I am in the minority, but I would rather have OpenPaths transparently transmit my location data (like Google Latitude) without me having to launch an app, because then the data is more complete. Despite running the application consistently for the past month, there are some unexplained gaps in the data. For example, somehow only a few days of data for the month of June are available. Fortunately (or unfortunately), my habits do not change that much. After logging in, the user has the ability to visualize their data, or download it in several formats: JSON, CSV or KML for Google Earth.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/07/openpaths21.png" alt="" width="547" height="415" /></p>
<p>The user can then view his or her location history on a map. The points are colored by time of day (morning, afternoon, night), or time of week (weekday, weekend). The gradient could and should be more fine-grained to provide a better understanding of the user&#8217;s location and habits. The map has an animation setting that shows your movement over time. Compared to other services, the data seems very coarse, even with the finest settings. Driving at 70mph, a data point seems to be transmitted every 20 miles or so, which is roughly 3-4 times per hour. All in all, the map functionality is cool, but currently leaves a lot to be desired.</p>
<p><span style="font-family: verdana, geneva; font-size: small;"><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/07/openpaths11.png" alt="" width="640" height="415" /></span></p>
<p>OpenPaths also allows data to be collected from Foursquare if you choose to do so. Unfortunately, none of my check-ins showed up on the map. Of course, <strong>the user can choose to delete their information entirely, adding to OpenPaths&#8217; idea of &#8220;opt-in&#8221; location sharing.</strong></p>
<p><strong><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/07/openpaths3.png" alt="" width="640" height="206" /><br />
 </strong></p>
<p>All in all, I am very excited about where OpenPaths can go. Currently, there seem to be a few technical issues with the Android app that prevent it from transmitting location data when the app is running in the background. I hope that OpenPaths can run more transparently and reliably on Android as development continues. The privacy policy and the user&#8217;s ownership of data are both very important, and they are a model that others should implement. The visualization aspect is not incredibly useful or flashy right now, but the user is free to download their data and create their own visualizations or analysis.</p>
<p>To download an OpenPaths app for your Apple or Android phone, <a href="https://openpaths.cc/tools">click here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/07/openpaths-and-a-progressive-approach-to-privacy/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIAM Data Mining 2012 Conference</title>
		<link>http://www.bytemining.com/2012/05/siam-data-mining-2012-conference/</link>
		<comments>http://www.bytemining.com/2012/05/siam-data-mining-2012-conference/#comments</comments>
		<pubDate>Tue, 15 May 2012 18:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1076</guid>
		<description><![CDATA[





Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!
<p></p>

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more &#8220;big data&#8221; and &#8220;data science&#8221; oriented, and I wanted to step away from the hype and just listen to talks that had more substance.
<p></p>

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get as much out of Disneyland as I could. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and seeing girls dressed up as their favorite Disney princesses, screams &#8220;fun&#8221; rather than &#8220;business&#8221;, but I managed to make time for both.

<p></p>
The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to [...]]]></description>
				<content:encoded><![CDATA[<div>
<table>
<tbody>
<tr>
<td><img class="lfloat" src="http://www.bytemining.com/wp-content/uploads/2012/05/SDM12logo.jpg" alt="" width="400" height="453" /></td>
<td>
<div><strong>Note</strong>: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!</div>
<p></p>
<div></div>
<div>From April 26-28 I had the pleasure to attend the <a href="http://www.siam.org/meetings/sdm12/">SIAM Data Mining conference in Anaheim on the Disneyland Resort</a> grounds. Aside from <a href="http://www.sigkdd.org/kdd2011/">KDD2011</a>, most of my recent conferences had been more &#8220;big data&#8221; and &#8220;data science&#8221; oriented, and I wanted to step away from the hype and just listen to talks that had more substance.</div>
<p></p>
<div></div>
<div>Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get as much out of Disneyland as I could. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and seeing girls dressed up as their favorite Disney princesses, screams &#8220;fun&#8221; rather than &#8220;business&#8221;, but I managed to make time for both.</div>
<div></div>
<p></p>
<div>The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.</div>
<div></div>
<p></p>
<div>Below is my summary of the conference. Of course, such a summary is very high level; my description may miss things, or may not be entirely correct if I misunderstood the speaker.</div>
</td>
</tr>
</tbody>
</table>
</div>
<div></div>
<div><strong>Plenary Talks</strong></div>
<div><strong><br />
</strong></div>
<div>Bharat Rao from Siemens gave the first plenary talk bright and early on the first day of the conference. I only got to see the first half as I could not wake up. His talk was about privacy-preserving data mining in medicine using matrix factorization. Although privacy has become an important issue in data mining, I do not totally buy that it is entirely necessary. The idea is that observations should not be personally identifiable. I personally do not agree that such privacy measures are necessary when only a computer system is using the data, and not an individual person. Besides, with such massive amounts of data, someone digging through gigs and gigs of personally identifiable data to find one person&#8217;s data does not seem like a viable threat. My thoughts are similar to those on the Netflix grand challenge dataset lawsuit.</div>
<p></p>
<div></div>
<div>The second plenary talk came from <a href="http://nosh.northwestern.edu/">Noshir Contractor</a>. The main point of his work seemed to be how to build effective teams using graphs and data about each of the candidates for such a team. The topic itself did not excite me, but the data his team used, and some of what they learned from it, did. The first part of the talk discussed research into NSF grants and the types of collaboration that are more likely to lead to the awarding of such grants. His group found that women were more likely to be collaborators on awarded proposals and that multidisciplinary teams were more likely to be funded. Some analogous work involved the <a href="http://dmitriwilliams.com/Farming.pdf">detection of &#8220;gold farmers&#8221; on the MMORPG Everquest 2</a>. Gold farming involves gathering virtual goods and selling them for real cash. Interestingly, Contractor&#8217;s group found that the graph signatures present in gold farming are remarkably similar to those present in drug trafficking. There were a few other interesting tidbits that the group found. They found that a great number of players only play with friends and are somewhat disconnected from the rest of the game graph. Also, male-male and female-male graph links were very common, but female-female links were uncommon. Contractor hypothesized that the male-male relationships were obvious (men are more likely to play computer games) and that women often play the game with men because it is the only way for them to get time with their significant others.</div>
<p></p>
<div></div>
<div>The Friday morning talk on <a href="http://en.wikipedia.org/wiki/Inductive_transfer">transfer learning</a> came from <a href="http://www.cse.ust.hk/~qyang/">Qiang Yang from Hong Kong University of Science and Technology</a>. Transfer learning in this context means adapting models developed in one domain to data from another domain. Transfer learning seems to be picking up steam in Machine Learning, but anybody with training in Statistics can tell you that it really is just <a href="http://en.wikipedia.org/wiki/Latent_variable">latent variable analysis</a>. Of course, transfer learning applies more to learning classifiers than building descriptive models of data. The speaker&#8217;s proposed method is called <a href="http://www.cse.ust.hk/~qyang/Docs/2009/TCA.pdf">Transfer Component Analysis (TCA)</a> which is similar to, of course, <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis (PCA)</a>. Yang found that semi-supervised TCA was useful for <a href="http://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> in a transfer learning context. A common use of transfer learning is mapping a text classifier to an image classifier where we have few labeled instances in the image domain. We can then use unlabeled source data (text) in a semi-supervised way to create a better classifier in the image domain.</div>
<p></p>
<div></div>
<div>The last plenary talk came from <a href="http://research.microsoft.com/en-us/um/people/sdumais/">Susan Dumais from Microsoft Research</a> who discussed <a href="http://research.microsoft.com/en-us/um/people/sdumais/SIAM2012-Keynote-Dumais_Share.pdf">temporal dynamics and information retrieval</a>. The talk basically discussed how to mine important concepts over time from data streams. One part of her research was discovering the staying power of certain words. Susan has noticed four distinct word behaviors based on how the density of the word&#8217;s usage changes over time: fast, hybrid, medium, and slow. Susan&#8217;s research also studies how often people <em>revisit </em>certain webpages and why. Presumably revisits are an alternative measure of influence to the in-links and out-links used in PageRank (remember, Microsoft has its own anti-Google search engine). Studying temporal behavior of web visits and keyword usage is important because current methods consider only a snapshot of the web with very little evolution. Susan stated that a great page is defined as a mixture of bags of words that are formed based on page changes. Such research is important because query relevance changes over time. For example, a query of <em>US Open </em>refers to golf at certain times of the year and tennis at others. The query <em>March Madness </em>should probably return ticket prices <em>before </em>the event, scores <em>during </em>the event, and Wikipedia or sports articles recapping the event <em>after </em>the event.</div>
<p></p>
<div></div>
<div><strong>Social Media</strong></div>
<div><strong><br />
</strong></div>
<div>Social media has a session at pretty much every academic conference these days. The speakers in this session used social media data to test their hypotheses, and such talks are always interesting. One talk discussed <a href="http://engineering.asu.edu/sites/default/files/shared/ASUCISE-2011-005.pdf">a feature selection technique for social processes</a> using data from Twitter. The method in the paper uses user-post relations (favorites, retweets, replies) and user-user relations (following, etc.). The second talk used <a href="http://www.cc.gatech.edu/~lingliu/papers/2012/TingWang-SDM2012.pdf">heat diffusion models to model the diffusion, cascading and propagation of ideas</a>. The researchers were also interested in discovering or predicting the &#8220;tipping point&#8221; (or burst of activity, in their words) of a social phenomenon. Another talk discussed <a href="http://www.cs.uiuc.edu/~hanj/pdf/sdm12_mgupta.pdf">credibility in a social network and how credible and non-credible information spreads</a>. This work particularly discussed rumors and fake events such as the untimely death of Justin Bieber. Some of the questions investigated were: how can we filter these fake events out of the timeline? How do such rumors spread? The final talk in this session was a bit of an odd duck: <a href="http://arxiv.org/pdf/1102.3340.pdf">how to build a team using social network analysis</a>. The purpose of that work was to balance skillsets in a team and enhance collaborative compatibility.</div>
<div></div>
<p></p>
<div><strong>Pattern Mining</strong></div>
<div><strong><br />
</strong></div>
<div>The Thursday afternoon session I attended had a very generic name considering all of data mining is about finding patterns. Really, it should have been called &#8220;association rule mining.&#8221; Unfortunately, this session was fairly dry and was my least favorite of the conference. The one talk that really stood out to me discussed how to <a href="http://win.ua.ac.be/~adrem/bibrem/pubs/cule12marbles.pdf">mine association rules out of long temporal events</a>. Such association rules consisted of &#8220;episodes&#8221; which were partial orders on the graph of the event. The type of association rules considered were basically motifs &#8212; subsequences of interesting events that occurred within a long event.</div>
<div></div>
<p></p>
<div><strong>Kernels and Classification</strong></div>
<div><strong><br />
</strong></div>
<div>The first two talks in this session discussed <a href="http://en.wikipedia.org/wiki/Multi-label_classification">multi-label classification</a>, which is distinct from multi-class classification. In multi-class classification, we have multiple classes and each instance can belong to one, and only one class. In multi-label classification, each instance can belong to one or more classes/labels. Multi-label classification exploits correlation information among labels whereas independent classifiers do not. The first talk discussed how to use <a href="http://siam.omnibooksonline.com/2012datamining/data/papers/132.pdf">multi-label classification when there are multiple objectives</a>. For example, when buying a cell phone, we may want to minimize price, and maximize battery life. The second talk discussed <a href="http://users.ics.aalto.fi/gonen/files/gonen_sdm12_paper.pdf">dimension reduction for multi-label classification and coupling feature selection with modeling</a>. Another talk attempted to study the <a href="http://www.di.unipi.it/~ruggieri/Papers/sdm2012.pdf">theoretical principles behind pruning and grafting in decision trees</a>. The <a href="http://www.rulequest.com/Personal/">C4.5</a> software does pruning and grafting, but its theoretical properties are not well understood. The last talk discussed <a href="http://people.ee.duke.edu/~lcarin/kpmf_sdm_final.pdf">augmenting matrix factorization with graph information and other metadata</a> prior to building a model. For example, for a movie recommendation problem, one factor would be a movie and another factor would be a user. These factors can be combined into a Bayesian model that can be scaled up better than other existing methods.</div>
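<div>To make the multi-class versus multi-label distinction concrete, here is a minimal Python sketch. It is not taken from any of the papers above; the toy documents, label names and helper functions are all hypothetical. It shows the usual binary indicator encoding for multi-label targets and a simple co-occurrence count, which is one crude view of the label correlation that multi-label methods try to exploit.</div>

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy targets for three documents.
# Multi-class: exactly one label per instance.
multiclass_targets = ["sports", "politics", "sports"]
# Multi-label: one or more labels per instance.
multilabel_targets = [
    {"sports", "business"},
    {"politics"},
    {"sports", "business", "health"},
]


def to_indicator_rows(label_sets):
    """Encode label sets as 0/1 rows, one column per distinct label."""
    labels = sorted(set().union(*label_sets))
    rows = [[int(label in s) for label in labels] for s in label_sets]
    return labels, rows


def label_cooccurrence(label_sets):
    """Count how often each pair of labels appears on the same instance."""
    counts = Counter()
    for s in label_sets:
        counts.update(combinations(sorted(s), 2))
    return counts


if __name__ == "__main__":
    labels, rows = to_indicator_rows(multilabel_targets)
    print(labels)
    for row in rows:
        print(row)
    print(label_cooccurrence(multilabel_targets).most_common())
```

<div>Training one independent binary classifier per column of the indicator rows ignores the co-occurrence counts entirely, which is the information the multi-label methods in this session were trying to use.</div>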
<div></div>
<p></p>
<div><strong>Transfer Learning</strong></div>
<div><strong><br />
</strong></div>
<div>As I mentioned earlier, the goal of transfer learning is to map a model used in one domain to another similar domain. The classic example is classifying images using models trained on text data and some labeled images &#8212; both domains are reduced to a common set of concepts. The talks in this session mainly talked about advances in latent variable analysis. I kept finding myself confused and wondering, &#8220;why is this considered groundbreaking?&#8221; The work presented in this session basically used existing models for transfer learning. The first few talks discussed using <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation (LDA)</a> to map data into concepts, and then the third talk discussed <a href="http://books.nips.cc/papers/files/nips16/NIPS2003_AA03.pdf">Hierarchical Latent Dirichlet Allocation (hLDA)</a> which could be used for taxonomies and hierarchies of concepts. Although Transfer Learning is very useful, I did not find it to be all that groundbreaking. Of course, using text and images as the source and target domains is not incredibly interesting. I think Transfer Learning could be revolutionary if it could be applied to two very different domains.</div>
<div></div>
<p></p>
<div><strong>Full Day Workshop: Text Mining</strong></div>
<div><strong><br />
</strong></div>
<div>Of course, if there is a text mining talk, I will attend it. The workshop was led by David W. Berry from University of Tennessee, Knoxville. The keynote speaker was <a href="http://www.hpl.hp.com/personal/Malu_Castellanos/">Malu Castellanos from Hewlett-Packard Labs</a>. Malu&#8217;s talk was amazing. She discussed a live customer intelligence system that is used for intent and sentiment analysis on various channels. Working with text is not easy. She began with a discussion of the many challenges in sentiment analysis including deceitful adjectives (<em>despicable </em>is negative, but <em>Despicable Me</em> is a proper noun that is not negative), dependency relations (<em>wicked</em> as slang for &#8220;good&#8221; vs. <em>wicked witch</em>), comparisons (<em>x </em>is better than <em>y</em>), spam, sarcasm, coreferences (use of the word <em>it</em>), special expressions and emoticons (LOL, <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> ), and context dependencies (<em>predictable movie </em>is negative whereas <em>predictable weather </em>may be positive). What was particularly illuminating about Malu&#8217;s talk was that she was fairly candid about how complex HP&#8217;s sentiment analysis system is. The system does not use one model for sentiment. Different models are used to handle different kinds of tweets and, based on their classifications, these tweets are ushered off to other models for further classification. For example, comparative statements are treated distinctly by the system. There may be a naive Bayes step that classifies the text as comparative or not, and then sends the tweet for further processing. She mentioned something about using special processing such as <a href="http://en.wikipedia.org/wiki/Linear_programming">linear programming</a> and <a href="http://en.wikipedia.org/wiki/Generalized_additive_model">generalized additive models (GAM)</a> to take words such as BUT, AND, etc. into account. GAMs seem rare to encounter in text mining. Some other features of the system include sentiment intensity (<em>really good</em> vs. <em>good</em>) and clustering similar words by using temporal histograms (<em>tomorrow </em>and <em>2morrow </em>have similar usage patterns).</div>
<p></p>
<div></div>
<div>The first talk was from <a href="http://research.cs.queensu.ca/~skill/">David Skillicorn</a>, who recently published a book about mining large datasets. He discussed how to pick documents out of a corpus that are the most interesting. The second talk was given by a brave undergraduate student on <a href="http://trec.nist.gov/pubs/trec20/papers/Ursinus.legal.update.pdf">query expansion</a>. He did a very good job, but what was strange about this talk was that it used&#8230; <a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing">Latent Semantic Indexing</a> (&#8230;from 1990&#8230;) rather than one of the more useful and iterative models such as LDA. This brings me to my first personal &#8220;weird moment&#8221; about this workshop. There was very little discussion about modern (post 2000) topic models. This is very strange to me. Just a few months earlier, topic models were all the rage at KDD 2011. After the lunch break, there were talks about incremental online clustering of documents and discovery of patent trolls. The final sessions of the afternoon discussed extraction of hierarchies for increasing performance of multi-labeled classifiers and automatically evaluating text summarizers. Only one of the presentations in this workshop seemed to be attached to a paper.</div>
<div></div>
<div>I do not want to be critical because I am sure a lot of work goes into planning such events. I just found this workshop to be a bit weird. A lot of the methods used in the papers were quite old fashioned for text mining (LSI, regression) and the applications were also quite old-school (patents and legal documents just scream the old-fashioned use of information retrieval&#8230; library cataloging). It also seemed like a disproportionate number of the speakers had a prior relationship with the workshop chair. I am also not used to a workshop with so few associated papers.</div>
<div>
<p class="p1"><strong>Concluding Thoughts</strong></p>
<p class="p1">This was a data mining conference so of course I enjoyed it. I must say though that the vibe was very different from some of the other conferences I have attended like KDD and <a href="http://ijcai.org/">IJCAI</a>. Most of the speakers came from overseas, and as someone with hearing loss, it was very difficult to understand many of the speakers. It also seemed like there were very few people just attending the conference. It seemed like the majority of the people at the conference were presenting, or had a poster etc. and that is different from what I am used to. Because of that, I felt like the usual community feel was a bit missing. Additionally, there was no mention of Hadoop or R, and I found that a bit concerning since every other conference I have been to has speakers that are proud to contribute to those open-source projects. And then there was the weird text mining workshop (could have just been an off-year). Could it be because SIAM is a mathematics group? Not sure. All in all, I still had a great time and learned a lot as always.</p>
<p class="p1"><em>Of course, my attendance would not have been possible without sponsorship and support from my company, <a href="http://www.gumgum.com">GumGum</a>. I attended this conference as part of my position as Data Scientist.</em></p>
<p class="p1"><strong>Disneyland</strong></p>
<p class="p1">Of course, the elephant in every room of the conference was the fact that Disneyland was only a 5-10 minute walk away. I got a 2-day park hopper pass and spent my lunch hours and evenings at both Disneyland and California Adventure. It really is the Happiest Place on Earth. Just being there, I forget about stress and the things that worry me. I had a great time walking around and watching all the kids have fun. At Disneyland I only went on a few rides: Space Mountain, Pirates of the Caribbean, Haunted Mansion, It&#8217;s a Small World and the Disneyland Railroad (not really a ride). I also got to ride the Monorail for the first time. At California Adventure I only did the California Screamin&#8217; roller coaster and Soarin&#8217; Over California, which features my hometown (the part with the orange orchards). Unfortunately, I missed Tom Sawyer Island again. I will have to go there first next time!</p>
<table>
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/05/526086_10101382148210726_2507506_66859310_1965326370_n1.jpg" alt="" width="360" height="270" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/05/545812_10101382145920316_2507506_66859290_787607657_n.jpg" alt="" width="360" height="270" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2012/05/581239_10101382146229696_2507506_66859292_1227525982_n.jpg" alt="" width="270" height="360" /></td>
</tr>
<tr>
<td><em><span style="font-size: small;">The view of California Adventure from my hotel room!</span></em></td>
<td><span style="font-size: small;"><em>A room just for kids.</em></span></td>
<td><span style="font-size: small;"><em>Surfer Goofy at the lobby entrance.</em></span></td>
</tr>
</tbody>
</table>
<p class="p1">
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/05/siam-data-mining-2012-conference/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>My Interview about the Statistics Major</title>
		<link>http://www.bytemining.com/2012/03/my-interview-about-the-statistics-major/</link>
		<comments>http://www.bytemining.com/2012/03/my-interview-about-the-statistics-major/#comments</comments>
		<pubDate>Fri, 16 Mar 2012 20:23:25 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1070</guid>
		<description><![CDATA[<p>Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.</p>

Q1: Why did you decide to pursue a major in statistics in college?
<p>A: &#8220;When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make out what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.</p>
<div></div>
<div><strong>Q1:</strong> Why did you decide to pursue a major in statistics in college?</div>
<p><strong>A: </strong>&#8220;When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make out what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor in the whole picture of a student. This niche interest led me to statistics, psychometrics in particular, and although I no longer study psychometrics, I found what I learned to be incredibly valuable.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q2: </strong>&#8220;I noticed you have bachelor&#8217;s, master&#8217;s, and doctoral degrees in statistics. How did your graduate study build on what you learned in your undergraduate program?&#8221;</p>
<p><strong>A: </strong>&#8220;For me, the undergraduate and graduate programs were night and day. The undergraduate program focused more on modeling and data analysis. The graduate program focused more on thinking about data and how to develop a scientific &#8220;common sense&#8221; about how to work with, express and make automated decisions based on data. The graduate program was much more mathematically and computationally intensive than the undergraduate major. My graduate study actually built more on my mathematics major in college because many of the concepts in graduate statistics require knowledge of linear algebra, numerical analysis and real analysis. Fortunately, our statistics major requires upper division math courses.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q3: </strong>What was the most interesting part of majoring in statistics? What did you find most challenging?</p>
<p><strong>A: </strong>&#8220;The most interesting part of majoring in statistics was seeing how many fields can grow and transform based on insights from data and statistics. In my case, I found it most interesting seeing how it integrates and interacts with computer science. Every time someone surfs through Facebook, enters a Google search, or looks at an item on Amazon, data about what they are doing are collected and algorithms process these data to enrich the experience by, say, recommending books and offering special deals on Amazon, recommending friends and showing relevant stories on Facebook, and the most groundbreaking of all: returning relevant search results.</p>
<p>The most challenging part for me was the mathematical theory. Although I loved math, I sometimes had trouble connecting the theory to the application, and statistics is such an applied field. I look at it as a rite of passage and once I saw enough theory relevant to my interests, the learning process became easier.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q4: </strong>How do you apply what you learned in your statistics education in your current line of work?</p>
<p><strong>A: </strong>&#8220;It is ironic, but it is the more basic concepts of statistics and probability that I use every day rather than the complicated models I learned. Concepts such as independence, confidence, power, accuracy etc. are important building blocks for building my own models, or for choosing an existing one from those that I learned in school.</p>
<p>I always start with some exploratory analysis such as computing some statistics and making plots that show relationships clearly. Then I set explicit guidelines for the input and output of the model I want to build and note any critical assumptions that are violated or that must be met. I then try several different methods and models and validate their results using common metrics taught in undergraduate statistics before settling on a final model configuration.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q5: </strong>What skills did you learn in the statistics major that you find useful for work and everyday life?</p>
<p><strong>A: </strong>&#8220;The training in mathematics I received as part of the statistics major taught me how to think logically, and this is very important in my work in computer science. I think patience was another very important skill I learned. I love what I do, and sometimes I take for granted that others have the same mathematical training that I do because I am so entrenched in it. Through my experience teaching as well as consulting as a student, I gained a better sense of the challenges and difficulties many people face when thinking about and interpreting statistics and how to better communicate results and ideas.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Q6: </strong>Any advice for students who are considering majoring in statistics?</p>
<p><strong>A: </strong>&#8220;My advice for students majoring in statistics is to choose an additional major or minor that uses statistics and is of interest to the student. I do not consider statistics to be a &#8220;standalone&#8221; major. When interviewing for a job, employers want to know why an interviewee is passionate about their company. For example, if interviewing for a finance company, the company wants to hear about passion for finance, and see education or experience in such fields. Another way to accomplish this instead of double majoring is to do some internships, projects or research in a field of interest.&#8221;</p>
<p>&nbsp;</p>
<p><strong>Conclusion: </strong>Finally, could you tell me a little about yourself for an intro bio we will include before the Q&amp;A interview? For instance, what university(ies) did you attend, what degree(s) have you earned, what is your current job title, where do you work and for how long&nbsp;(you can be general here, or include a link to your professional website or blog if you have one), and what are your career goals?</p>
<p><strong>A: </strong>&#8220;I attended University of California, Los Angeles (UCLA) for my B.S. (Statistics, Mathematics of Computation), two M.S. (Statistics and Computer Science) and Ph.D. (Statistics). I currently work for an Internet advertising startup in Santa Monica, CA as Chief Data Scientist/Research Engineer, and have been working in the field for three years. Whenever I get a free moment, I write about statistics, data mining and computer science topics on my blog at <a class="moz-txt-link-freetext" href="http://www.bytemining.com/">http://www.bytemining.com</a>. I plan on dedicating the rest of my life to working with and communicating about data and turning online phenomena into knowledge that can be used to progress technology and change the world!&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/03/my-interview-about-the-statistics-major/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;Hold Only That Pair of 2s?&#8221; Studying a Video Poker Hand with R</title>
		<link>http://www.bytemining.com/2012/01/hold-only-that-pair-of-2s-studying-a-video-poker-hand-with-r/</link>
		<comments>http://www.bytemining.com/2012/01/hold-only-that-pair-of-2s-studying-a-video-poker-hand-with-r/#comments</comments>
		<pubDate>Sun, 08 Jan 2012 09:32:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1013</guid>
		<description><![CDATA[<p>Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is &#8220;do you count cards?&#8221; A blank look comes over their face when I say &#8220;no.&#8221;</p>
<p>Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that&#8217;s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours.&#160;So it should be no surprise that I do not agree with using Poker to teach probability. &#160;Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is &#8220;do you count cards?&#8221; A blank look comes over their face when I say &#8220;no.&#8221;</p>
<p>Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that&#8217;s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours.&nbsp;So it should be no surprise that I do not agree with using Poker to teach probability. &nbsp;Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only time that I have used Poker in teaching (besides when required), is to cover the hypergeometric distribution and sampling without replacement.</p>
<p>Since I took Intro Probability Theory, I have always wondered what to do in the following situation. Say a pair of cruddy low cards appear on the first draw. The game only awards money for pairs of jacks or better. If all I have in the hand is a pair of low cards and no&nbsp;face cards, my decision is easy: hold the pair of low cards. But what if there is at least one face card showing (no other pairs)? Pictorially this looks like</p>
<div><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_club_2.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_club_5.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_spade_J.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_diamond_2.svg_.png" alt="" width="200" height="250" /><img style="vertical-align: middle;" src="http://www.bytemining.com/wp-content/uploads/2012/01/200px-Playing_card_diamond_10.svg_.png" alt="" width="200" height="250" /></div>
<p>The conundrum:</p>
<ol>
<li>Hold the two low cards and deal, hoping for a three of a kind, or</li>
<li>Hold the two low cards AND one of the face cards, hoping for a three of a kind, OR a pair of Jacks or Better.</li>
</ol>
<p>Under each of these decisions, which yields the highest probability of winning <em>something</em> and which one yields the highest payout? This problem can be solved exactly by using combinatorics, conditional probability and expectation, but since a video poker game is basically a simulator (though likely biased), I wrote my own simulation. <strong>For the answer, scroll to the end!</strong></p>
<p><strong>Data Structure</strong></p>
<p>In most card games, we would want to store the state of the game: the outstanding cards in the deck(s), and the hand(s) of each player. In standard video poker, there is one deck, and one player, so only the player hand needs to be recorded because every card in the deck is either in the hand, or it is not. One obvious way to represent the hand is as an array of denomination/suit tuples. Unfortunately, this data structure requires other data structures to store the possible suits, and possible denominations. It is also more tedious to detect certain kinds of wins. For this simulation, I use a 13 x 4 matrix where each row is a different denomination, and each column is one of the four suits. This matrix allows us to easily see which cards can still be dealt. Additionally, this matrix, together with a vector-based language such as R, makes&nbsp;it easy to detect wins. Such a matrix looks like the following for the hand <strong>2</strong><strong style="font-family: sans-serif; font-size: 13px; line-height: 19px; background-color: #ffffff;"><span class="spades">&spades;</span>&nbsp;<span class="clubs">5&clubs;</span>&nbsp;<span class="hearts" style="color: red;">8&hearts;</span>&nbsp;<span class="clubs">8&clubs;</span>&nbsp;<span class="diamonds" style="color: red;">A&diams;</span></strong></p>
<div><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/matrix.png" alt="" width="220" height="303" />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</div>
<div><img style="vertical-align: baseline; display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/matrix2.png" alt="" width="153" height="41" /></div>
<div>where <em>Cij&nbsp;</em>denotes a card,<em>&nbsp;i </em>is the denomination <img src='http://s.wordpress.com/latex.php?latex=i%20%5Cin%20%5C%7B%202%2C%20%5Cldots%2C%2010%5C%7D%20%5Ccup%20%5C%7BJ%2C%20Q%2C%20K%2C%20A%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i \in \{ 2, \ldots, 10\} \cup \{J, Q, K, A\}' title='i \in \{ 2, \ldots, 10\} \cup \{J, Q, K, A\}' class='latex' /> and <em>j </em>is the suit <img src='http://s.wordpress.com/latex.php?latex=j%20%5Cin%20%5C%7B%5Cheartsuit%2C%20%5Cdiamondsuit%2C%20%5Cspadesuit%2C%20%5Cclubsuit%20%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j \in \{\heartsuit, \diamondsuit, \spadesuit, \clubsuit \}' title='j \in \{\heartsuit, \diamondsuit, \spadesuit, \clubsuit \}' class='latex' /> and <em>H</em>&nbsp;is the player&#8217;s hand in question.</div>
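<p>As a rough illustration of the 13 x 4 matrix described above, here is a minimal sketch in Python (not the R used for the actual simulation). The names <code>RANKS</code>, <code>SUITS</code> and <code>hand_matrix</code> are hypothetical, and the column order of the suits is an assumption taken from the set notation above.</p>

```python
# Hypothetical names; rows are denominations 2..A, columns are suits.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["hearts", "diamonds", "spades", "clubs"]


def hand_matrix(cards):
    """Return a 13 x 4 list of lists with a 1 for every card in the hand."""
    matrix = [[0] * len(SUITS) for _ in RANKS]
    for rank, suit in cards:
        matrix[RANKS.index(rank)][SUITS.index(suit)] = 1
    return matrix


# The example hand from the text: 2 of spades, 5 of clubs, 8 of hearts,
# 8 of clubs, ace of diamonds.
hand = [("2", "spades"), ("5", "clubs"), ("8", "hearts"),
        ("8", "clubs"), ("A", "diamonds")]
A = hand_matrix(hand)

# Row sums count cards per denomination; column sums count cards per suit.
row_sums = [sum(row) for row in A]
col_sums = [sum(col) for col in zip(*A)]
```

<p>Here the row sum for the 8s is 2 (a pair) and no column sum reaches 5 (no flush), which is the kind of check the win detection relies on.</p>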
<div>
<p><strong>Detecting Wins</strong></div>
<p>Poker wins are not disjoint. A three of a kind involving Jacks is also a pair of Jacks or better, etc. When checking wins, I start with the lowest paying win, and move up to Royal Flush, only keeping track of the highest win. Thus, this algorithm detects a four-of-a-kind involving Queens as Jacks or Better, two pairs of Queens, and a three-of-a-kind of Queens, but only counts it as the highest win, the four-of-a-kind.</p>
<ol>
<li><em>Pair of Jacks or Better</em>: a pair of Jacks, Queens, Kings or Aces. In <strong>A</strong>, this is simply the condition that at least one row in rows 10 through 13 has a row sum greater than 1.</li>
<li><em>Two pair</em>: two pairs of anything. In <strong>A</strong>, this is the condition that at least two rows have a sum greater than 1.</li>
<li><em>Three of a kind</em>: three of any card. In <strong>A</strong>, this is the condition that at least one row has a sum of at least 3.</li>
<li><em>Straight</em>: all 5 cards can be permuted such that they form an ascending sequence: A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A. This case is interesting and will be discussed in a bit.</li>
<li><em>Flush</em>: all 5 cards are of the same suit. In <strong>A</strong>, this is the condition that at least one column has a sum of at least 5.</li>
<li><em>Full House</em>: one three-of-a-kind, and a pair of anything. In <strong>A</strong>, this is the condition that a row has sum 3, and another row has sum 2.</li>
<li><em>Four of a Kind</em>: 4 of any card. In <strong>A</strong>, this is the condition that a row has sum 4.</li>
<li><em>Straight Flush</em>: the 5 cards can be permuted to form an ascending sequence and are all of the same suit. In <strong>A</strong>, this is simply the condition that we have a straight and a flush in the same hand.</li>
<li><em>Royal Flush</em>: a straight flush with the Ace as the high card. In <strong>A</strong>, this is simply the condition that we have a straight flush AND the sum of row 13 is 1.</li>
</ol>
<div>Of course, this &#8220;short circuit logic&#8221; only works for a game containing 5 cards. Also, note that under my scenario (a pair of low cards is dealt first), it is never possible to have a straight, flush, royal flush, or straight flush as the highest win. Also, it is not possible to have Jacks or Better as the highest win because we already have one pair (low cards), and if we happen to draw a pair of Jacks or Better, we then have two pairs as the highest win.</div>
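<p>The win conditions listed above can be sketched with row and column sums. This is a minimal Python sketch, not the code from my simulation: <code>best_win</code> and <code>matrix_from</code> are hypothetical names, the checks run from the highest win down (equivalent to keeping only the highest win found), and it assumes standard poker rules in which the only straight using a low Ace is A, 2, 3, 4, 5 and a Royal Flush must run 10 through Ace.</p>

```python
def matrix_from(cards):
    """Build a 13 x 4 hand matrix from (rank_index, suit_index) pairs.

    rank_index 0 is the 2 and 12 is the Ace; suit_index is 0..3.
    """
    matrix = [[0] * 4 for _ in range(13)]
    for rank, suit in cards:
        matrix[rank][suit] = 1
    return matrix


def best_win(A):
    """Return the name of the highest-paying win in hand matrix A, or None."""
    rows = [sum(r) for r in A]          # cards per denomination
    cols = [sum(c) for c in zip(*A)]    # cards per suit
    pairs = sum(1 for s in rows if s == 2)
    trips = 3 in rows
    quads = 4 in rows
    flush = 5 in cols
    # Straight: five distinct denominations in a row (Ace-low handled explicitly).
    held = [i for i, s in enumerate(rows) if s == 1]
    straight = len(held) == 5 and (held[4] - held[0] == 4
                                   or held == [0, 1, 2, 3, 12])
    if straight and flush:
        # Assumption: a Royal Flush is 10-J-Q-K-A, so the lowest row is the 10.
        return "royal flush" if held[0] == 8 else "straight flush"
    if quads:
        return "four of a kind"
    if trips and pairs == 1:
        return "full house"
    if flush:
        return "flush"
    if straight:
        return "straight"
    if trips:
        return "three of a kind"
    if pairs == 2:
        return "two pair"
    if any(s == 2 for s in rows[9:]):   # rows 10 through 13: J, Q, K, A
        return "jacks or better"
    return None
```

<p>For the hand pictured earlier (a pair of 2s with a 5, a Jack and a 10), <code>best_win</code> returns <code>None</code>, since a low pair alone pays nothing.</p>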
<div><em>Detecting the Straight:&nbsp;</em>In <strong>A</strong>, we have a straight when five successive rows have sum equal to 1. We can do this iteratively, but there is a better way. Note that if all of the row sums are 0 or 1, we can treat the vector of row sums as a binary number and convert it to its integer representation. Each binary number has 13 bits. If we let 2 be the zeroth power, then straights will lead to the following binary and integer representations:</div>
<div></div>
<div><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/matrix3.png" alt="" width="614" height="125" /></div>
<div></div>
<p>
<strong>Bug alert:</strong> It just occurred to me that there are many more wrap-around straights such as <em>Q, K, A, 2, 3</em>. This will be fixed this evening.<br />
</p>
<div>From basic computer science and number theory, every natural number can be written as the sum of distinct powers of 2 and the representation of such an integer is unique. Furthermore, the sum of <em>n </em>successive powers of 2 is divisible by <img src='http://s.wordpress.com/latex.php?latex=2%5En%20-%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^n - 1' title='2^n - 1' class='latex' />. After some experimentation I came up with the following rule: if all of the row sums are 0/1 and the integer representation of this binary vector is divisible by <img src='http://s.wordpress.com/latex.php?latex=%5Cfrac%7B2%5E5-1%7D%7B2%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{2^5-1}{2}' title='\frac{2^5-1}{2}' class='latex' />, then <strong>A </strong>is a straight. The only straight that does not fit this pattern is the wrap-around straight: J, Q, K, A, 2 which can be checked manually.</div>
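<p>The binary-number idea can be sketched as follows, in Python rather than R and with the hypothetical name <code>is_straight</code>. Note that this sketch uses a slightly stricter test than divisibility: it strips the trailing zero bits and compares the result to 31 (binary 11111), which accepts only five consecutive rows. Divisibility by 31 alone would also accept a hand such as 2, 4, 5, 6, 8, whose integer is 1 + 4 + 8 + 16 + 64 = 93 = 3 x 31. The Ace-low straight A, 2, 3, 4, 5 is checked separately, and wrap-around hands are assumed not to count.</p>

```python
def is_straight(row_sums):
    """Return True if the 13 row sums (2 first, Ace last) form a straight.

    The 2 is treated as the zeroth power of two, so the Ace is bit 12.
    """
    # A straight needs five distinct denominations, so every row sum is 0 or 1.
    if any(s > 1 for s in row_sums) or sum(row_sums) != 5:
        return False
    n = sum(bit * 2 ** i for i, bit in enumerate(row_sums))
    # Ace-low straight: bits 0..3 (2, 3, 4, 5) plus bit 12 (Ace).
    if n == 2 ** 12 + 0b1111:
        return True
    # Strip trailing zeros; five consecutive set bits leave exactly 0b11111.
    while n % 2 == 0:
        n //= 2
    return n == 31
```

<p>For example, the row sums for 10, J, Q, K, A give 2^8 + 2^9 + 2^10 + 2^11 + 2^12 = 31 x 256, which reduces to 31 after the trailing zeros are removed.</p>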
<div></div>
<div><strong>The Algorithm</strong></div>
<div>
<ol>
<li>Randomly generate a hand containing a pair of low cards (2-10) and at least one face card.</li>
<li>Hold the pair of low cards. Under strategy 2, hold one (and only one) of the face cards.</li>
<li>Discard the unheld cards from the deck and draw 2 or 3 cards at random from the same deck.</li>
<li>Check for wins.</li>
<li>Increment a win counter.</li>
<li>Repeat steps 1-5 tons of times, recording the percentage of hands that yielded a win, of the <em>n </em>games/hands played.</li>
</ol>
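<p>The steps above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the R code from my simulation: the names <code>deal_qualifying_hand</code>, <code>play</code> and <code>win_rate</code> are hypothetical, J, Q, K and A are all treated as &#8220;face&#8221; cards because those are the cards that pay for Jacks or Better, qualifying hands are found by simply redealing until one appears, and a win is counted as two pair or better because nothing else is reachable once a low pair is held.</p>

```python
import random
from collections import Counter

# rank 0 is the 2, rank 12 is the Ace; suits are 0..3.
DECK = [(rank, suit) for rank in range(13) for suit in range(4)]
HIGH = range(9, 13)  # assumption: J, Q, K, A count as "face" cards here


def deal_qualifying_hand(rng):
    """Redeal until the hand is exactly one low pair plus a high card."""
    while True:
        hand = rng.sample(DECK, 5)
        counts = Counter(rank for rank, _ in hand)
        pair_ranks = [r for r, c in counts.items() if c == 2]
        singles = [r for r, c in counts.items() if c == 1]
        if (len(pair_ranks) == 1 and len(singles) == 3
                and pair_ranks[0] not in HIGH
                and any(r in HIGH for r in singles)):
            return hand, pair_ranks[0]


def play(rng, keep_high_card):
    """Play one hand; return True if it ends as two pair or better."""
    hand, pair_rank = deal_qualifying_hand(rng)
    held = [c for c in hand if c[0] == pair_rank]
    if keep_high_card:  # strategy 2: keep one (and only one) high card
        held.append(next(c for c in hand if c[0] in HIGH))
    # Discards do not return to the deck: draw from the 47 unseen cards.
    remaining = [c for c in DECK if c not in hand]
    final = held + rng.sample(remaining, 5 - len(held))
    counts = sorted(Counter(rank for rank, _ in final).values(), reverse=True)
    return counts[0] >= 3 or counts[:2] == [2, 2]


def win_rate(keep_high_card, hands=10_000, seed=0):
    """Fraction of simulated hands that win under the chosen strategy."""
    rng = random.Random(seed)
    return sum(play(rng, keep_high_card) for _ in range(hands)) / hands
```

<p>Calling <code>win_rate(False)</code> estimates strategy 1 (hold the low pair only) and <code>win_rate(True)</code> estimates strategy 2; repeating the call with different seeds gives a distribution of win percentages rather than a single number.</p>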
<p><strong>Results: Hold the Pair of Low Cards <em>Only</em></strong></p>
<p>My usual strategy is to always hold the low pair and take one face card along for the ride. That way, I hopefully match one of the two denominations I hold. My parents on the other hand, always told me to hold the low pair only, because that gives one more card (degree of freedom) for a win. It turns out they were right. Each game consisted of 1,000 hands. A percentage of these hands yields a win. This percentage is a random variable, so I ran this simulation to play 1,000 games. The table below shows the distribution of the win percentages.</p>
<p style="text-align: left;">
<div align="center"><img src="http://www.bytemining.com/wp-content/uploads/2012/01/pokertable.png" alt="" width="805" height="145" /></div>
<p><em>Note that under strategy 1 (hold low pair only), <span style="text-decoration: underline;">all</span>&nbsp;wins are more likely than under strategy 2! </em>Of course, the estimate in the last column is an average; the mean in this case. The plot below shows the distribution of win percentages for both strategies.&nbsp;</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2012/01/pokersim1.png" alt="" width="600" height="400" /></p>
<ol> </ol>
<p><strong>The Code</strong></p>
<p>The code for my simulation is below. Note that it can easily be modified for your own target hands of interest. In my simulation, certain functions were never used because certain winning hands were not possible.&nbsp;</p>
<p><script src="https://gist.github.com/1608866.js"> </script></p>
<p><strong>DISCLAIMER: </strong>I did this for fun, and it is possible that there are bugs or problems with my code, algorithm or simulation. The results seem correct because empirically I seem to do about the same using either strategy, and from a gambling perspective, an 8% discrepancy is not likely to set off bells in the head.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2012/01/hold-only-that-pair-of-2s-studying-a-video-poker-hand-with-r/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Merry Christmas 2011 From Byte Mining!</title>
		<link>http://www.bytemining.com/2011/12/merry-christmas-2011-from-byte-mining/</link>
		<comments>http://www.bytemining.com/2011/12/merry-christmas-2011-from-byte-mining/#comments</comments>
		<pubDate>Sat, 24 Dec 2011 19:28:44 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=1009</guid>
		<description><![CDATA[</p>
To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and as with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading.  


]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><div align="center"><img src="http://www.bytemining.com/wp-content/uploads/2010/12/merry_christmas.jpg" alt="" /></p>
<div align="left">To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and as with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading. <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> 
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/12/merry-christmas-2011-from-byte-mining/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9</title>
		<link>http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/</link>
		<comments>http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=947</guid>
		<description><![CDATA[<p>



Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

article content and template pages
article content with revision history (huge files)
article content including user pages and talk pages
redirect graph
page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
image metadata
site statistics




<p>The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.</p>
<p>As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:</p>
<p>There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That&#8217;s why [...]]]></description>
				<content:encoded><![CDATA[<p><table>
<tr>
<td>
<img src="http://www.bytemining.com/wp-content/uploads/2011/11/Wikipedia-logo.png" alt="" width="100" height="100" /></td>
<td>Lately I have been doing a lot of work with the <a href="http://dumps.wikimedia.org/enwiki/">Wikipedia XML dump</a> as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include
<ul>
<li>article content and template pages</li>
<li>article content with revision history (huge files)</li>
<li>article content including user pages and talk pages</li>
<li>redirect graph</li>
<li>page-to-page link lists: redirects, categories, image links, page links, interwiki etc.</li>
<li>image metadata</li>
<li>site statistics</li>
</ul>
</td>
</tr>
</table>
<p>The above resources are available not only for Wikipedia, but for other <a href="http://www.wikimedia.org">Wikimedia Foundation</a> projects such as <a href="http://www.wiktionary.org">Wiktionary</a>, <a href="http://www.wikibooks.org">Wikibooks</a> and <a href="http://www.wikiquotes.org">Wikiquotes</a>.</p>
<p>As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the <a href="http://www.mediawiki.org">MediaWiki</a> project. As <a href="http://dirkriehle.com/2011/05/01/the-parser-that-cracked-the-mediawiki-code/">Dirk Riehle</a> stated:</p>
<blockquote><p>There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That&rsquo;s why there are 30+ failed attempts at writing alternative parsers.</p>
</blockquote>
<p>For example, below is an excerpt of Wiki-syntax for a page on data mining.</p>
<pre class="brush: plain; title: ; notranslate">
'''Data mining''' (the analysis step of the '''knowledge discovery in databases''' process,&lt;ref name=&quot;Fayyad&quot;&gt; or KDD), 
a relatively young and interdisciplinary field of [[computer science]]&lt;ref name=&quot;acm&quot; /&gt;
{{cite web|url=http://www.sigkdd.org/curriculum.php |title=Data Mining Curriculum |
publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2011-10-28}}
&lt;/ref&gt;&lt;ref name=brittanica&gt;{{cite web | last = Clifton | first = Christopher | title = Encyclopedia Britannica: Definition 
of Data Mining | year = 2010 | url = http://www.britannica.com/EBchecked/topic/1056150/data-mining | 
accessdate = 2010-12-09}}&lt;/ref&gt; is the process of discovering new patterns from large [[data set]]s 
involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]] and 
[[database system]]s.&lt;ref name=&quot;acm&quot;&gt; The goal of data mining is to extract knowledge from a data set in a 
human-understandable structure&lt;ref name=&quot;acm&quot; /&gt; and involves database and [[data management]], 
[[Data Pre-processing|data preprocessing]], [[statistical model|model]] and [[Statistical inference|inference]] 
considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing 
of found structure, [[Data visualization|visualization]] and [[Online algorithm|online updating]].&lt;ref name=&quot;acm&quot; /&gt;
</pre>
<p>I was epically worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on <a href="http://en.wikipedia.org/wiki/Named-entity_recognition">named entity extraction</a> and <a href="http://en.wikipedia.org/wiki/Collocation"><em>n</em>gram extraction</a>, I wanted to extract only the plain text. If we take the above junk and extract only the plain text, we would get</p>
<pre class="brush: plain; title: ; notranslate">
Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young 
and interdisciplinary field of computer science is the process of discovering new patterns from large data sets 
involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. 
The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves 
database and data management, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of found structure, visualization and online updating.
</pre>
<p>and from this we can remove punctuation (except sentence terminators .?!), convert to lower case and perform other pre-processing text mining steps. There are many, many Wikipedia parsers of various qualities. Some do not work at all, some work only on certain articles, some have been abandoned as incomplete and some are slow as molasses.</p>
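<p>The clean-up step described above (drop punctuation except sentence terminators, then lower-case) can be sketched in a few lines of Python. This is only a minimal illustration and not part of any parser discussed here; the function name <code>preprocess</code> and the sample sentence are my own assumptions. Note that it also splits hyphenated words, which may or may not be what you want.</p>
<pre class="brush: python; title: ; notranslate">
import re

def preprocess(text):
    """Lower-case text and drop punctuation except sentence terminators."""
    text = text.lower()
    # Keep word characters, whitespace and the terminators . ? !
    text = re.sub(r"[^\w\s.?!]", " ", text)
    # Collapse the runs of whitespace left behind by the removal
    return re.sub(r"\s+", " ", text).strip()

sample = "Data mining (the analysis step of the KDD process), is useful!"
print(preprocess(sample))
</pre>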
<p>I was delighted to stumble upon <a href="http://medialab.di.unipi.it/wiki/Wikipedia_Extractor">Wikipedia Extractor</a>, a Python library developed by <a href="http://www.cli.di.unipi.it/~fuschett/">Antonio Fuschetto</a>, <a href="http://www.cli.di.unipi.it/">Multimedia Laboratory, Dipartimento di Informatica, Universit&agrave; di Pisa</a>, that extracts plain-text from the Wikipedia XML dump file. The script is heavily object-oriented, and it is very easy to modify and extend for other purposes. For me, it is the easiest parser to use and yields the best quality output although there are other options.</p>
<p><strong>Pros</strong></p>
<ul>
<li>Very easy to run; it&#8217;s just a Python script.</li>
<li>Yields high quality output; no stray wikisyntax garbage.</li>
<li>Highly object-oriented; easy to extend and embed in text mining projects.</li>
<li>Object-oriented style makes it easier to parallelize with lightweight processes (written by the user).</li>
<li>Allows specifying the maximum size of each produced file (good for sending to S3).</li>
<li>It is written in Python.</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li>Far too slow. Python profilers show major overhead involved in regex search and replace, and string replacement.</li>
<li>Is not perfect, but one of the best I have seen. For some reason, Wikilinks are converted to HTML links. Correcting this required modifying the source code.</li>
<li>Retooling the package to work with Hadoop Streaming is not too difficult, but requires some work and grokery that should be easier.</li>
</ul>
<p>Wikipedia Extractor is good for offline analysis, but users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours, on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable even on multiple cores. A Hadoop Streaming job using Wikipedia Extractor, weighed down by too much file I/O between Elastic MapReduce and S3, required 10 hours to complete on 15 c1.medium instances.</p>
<p><a href="http://blog.kenweiner.com/">Ken Weiner</a>&nbsp;(<a href="http://twitter.com/kweiner">@kweiner</a>) recently re-introduced me to the <a href="http://lintool.github.com/Cloud9/">Cloud9</a>&nbsp;package by <a href="http://www.umiacs.umd.edu/~jimmylin/">Jimmy Lin</a> (<a href="http://twitter.com/lintool">@lintool</a>) of Twitter which fills in some of these gaps. I avoided it at first because Java is not the first language I like to turn to. Cloud9 is written in Java and designed with Hadoop MapReduce in mind. There is a method within the package that explicitly extracts the body text of each Wikipedia article. This method calls the <a href="http://code.google.com/p/gwtwiki/">Bliki</a> Wikipedia parsing library. One common problem with these Wikipedia parsers is that they often leave syntax in the output. Jimmy seems to wrap Bliki with his own code to do a better job of extracting high-quality, text-only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages, and more.</p>
<p>Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer code. An example job distributed with Cloud9 which simply counts the number of pages in the corpus took approximately 15 minutes to run on 8 cores on an EC2 instance. A job that did more substantial processing required 3 hours to complete, and once the corpus was refactored as <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html">sequence files</a>, the same job took approximately 90 minutes to run.</p>
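<p>To give a feel for what a page-counting job does, here is a rough sketch of the map and reduce steps in plain Python, run locally on a toy list of pages. This is not Cloud9's code (Cloud9 is Java and runs on Hadoop); the function names, the toy pages and the rule that a page starting with <code>#REDIRECT</code> is a redirect are all assumptions made for illustration.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter

def map_page(page_text):
    """Emit one (key, 1) pair classifying a single page (hypothetical rule)."""
    is_redirect = page_text.lstrip().upper().startswith("#REDIRECT")
    yield ("redirect" if is_redirect else "article", 1)

def reduce_counts(pairs):
    """Sum the values for each key, as a reducer would."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Toy stand-ins for the wikitext of three pages
pages = [
    "'''Data mining''' is ...",
    "#REDIRECT [[Data mining]]",
    "'''Hadoop''' is ...",
]
pairs = [pair for page in pages for pair in map_page(page)]
print(reduce_counts(pairs))
</pre>
<p>In a real job the mapper is also where your own text mining code would go, with the counts playing the role of the counters mentioned above.</p>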
<p><strong>Conclusion</strong></p>
<p>I am looking forward to playing with Cloud9 some more&#8230; I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as <a href="http://aws.amazon.com">AWS</a> or for quick processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet friendly processing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>LexisNexis Open-Sources its Hadoop Alternative</title>
		<link>http://www.bytemining.com/2011/09/lexisnexis-open-sources-its-hadoop-alternative/</link>
		<comments>http://www.bytemining.com/2011/09/lexisnexis-open-sources-its-hadoop-alternative/#comments</comments>
		<pubDate>Sat, 10 Sep 2011 03:33:25 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=938</guid>
		<description><![CDATA[<p>A month ago, I wrote about alternatives to the Hadoop MapReduce platform and HPCC was included in that article. For more information, see here.</p>
<p>LexisNexis has open-sourced its alternative to Hadoop, called High Performance Computing Cluster. The code is available on GitHub. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:</p>

Thor (Thor Data Refinery Cluster) is the data processing framework. It &#8220;crunches, analyzes and indexes huge amounts of data a la Hadoop.&#8221;
Roxie (Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.

<p>The protocol that drives the whole process is the Enterprise Control Language which is said to be faster and more efficient than Hadoop&#8217;s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram from the Gigaom article from which most of this information originates.</p>
<p></p>
<p>To me, Roxie seems much more exciting because it seems to complement (or replace) several technologies currently in the space. I do not know all the details, but it seems to potentially encapsulate technologies such as HBase, Hive, RabbitMQ and MemcacheDB, technologies that are commonly used to query and [...]]]></description>
				<content:encoded><![CDATA[<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/09/update.jpg" alt="" width="180px" /><em>A month ago, I wrote about <a href="http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/">alternatives to the Hadoop MapReduce platform</a> and HPCC was included in that article. For more information, <a href="http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/">see here</a>.</em></p>
<p><a href="http://www.lexisnexis.com">LexisNexis</a> has open-sourced its alternative to <a href="http://hadoop.apache.org/">Hadoop</a>, called <a href="http://gigaom.com/cloud/lexisnexis-open-sources-its-hadoop-killer/">High Performance Computing Cluster</a>. The code is available on <a href="https://github.com/hpcc-systems">GitHub</a>. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:</p>
<ul>
<li><strong>Thor </strong>(Thor Data Refinery Cluster) is the data processing framework. It &#8220;crunches, analyzes and indexes huge amounts of data a la Hadoop.&#8221;</li>
<li><strong>Roxie </strong>(Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.</li>
</ul>
<p>The protocol that drives the whole process is the Enterprise Control Language which is said to be faster and more efficient than Hadoop&#8217;s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram <a href="http://gigaom.com/cloud/lexisnexis-open-sources-code-for-hadoop-alternative/">from the Gigaom article from which most of this information originates</a>.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2011/09/img_hpcc_arch.jpg" alt="" width="680" height="358" /></p>
<p>To me, Roxie seems much more exciting because it seems to complement (or replace) several technologies currently in the space. I do not know all the details, but it seems to potentially encapsulate technologies such as <a href="http://hbase.apache.org/">HBase</a>, <a href="http://hive.apache.org/">Hive</a>, <a href="http://www.rabbitmq.com/">RabbitMQ</a> and <a href="http://memcached.org/">MemcacheDB</a>, technologies that are commonly used to query and speed data to a web frontend.</p>
<p>My opinion on HPCC is mixed. Although Hadoop has already taken off in usage, LexisNexis is a very strong institution and could potentially convince some corporate users to use their system instead &#8212; those that do not want to use <a href="http://research.microsoft.com/en-us/projects/dryad/">Microsoft&#8217;s Dryad project</a>. I do not see HPCC being a Hadoop killer, just as I do not see <a href="http://www.spark-project.org/">Spark</a> or any other alternative to be a Hadoop killer. However, if HPCC does become a strong alternative, I sense this could be trouble for some of the newer players in the Hadoop field such as HortonWorks and MapR. I do not have much of an interest in studying business and competition, but <a href="http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/">Hadoop Summit 2011</a> showed that the Hadoop space has become crowded, and small breakthroughs such as another company developing a similar project is enough to add volatility and uncertainty for all involved.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/09/lexisnexis-open-sources-its-hadoop-alternative/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIGKDD 2011 Conference &#8212; Days 2/3/4 Summary</title>
		<link>http://www.bytemining.com/2011/08/sigkdd-2011-conference-days-234-summary-3/</link>
		<comments>http://www.bytemining.com/2011/08/sigkdd-2011-conference-days-234-summary-3/#comments</comments>
		<pubDate>Sat, 27 Aug 2011 18:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=917</guid>
		<description><![CDATA[<p></p>
<p>&#60;&#60; My review of Day 1.</p>
<p>I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.</p>
<p>Keynotes</p>
<p>KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year&#8217;s conference had a few big names.</p>

Stephen Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Stephen Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (&#8220;non-negative curvature&#8221; as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they care about is &#8220;beautiful theory&#8221; [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://www.bytemining.com/wp-content/uploads/2011/08/KDD_Banner_10_Jan.jpg" alt="" width="750" height="85" /></p>
<p><a href="http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/">&lt;&lt; My review of Day 1.</a></p>
<p>I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.</p>
<p><strong>Keynotes</strong></p>
<p>KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year&#8217;s conference had a few big names.</p>
<div><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/08/stephen_boyd.jpg" alt="" /><br />
<em><a href="http://www.stanford.edu/~boyd/">Stephen Boyd</a>, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. </em>The first keynote, by Stephen Boyd, discussed <a href="http://en.wikipedia.org/wiki/Convex_optimization">convex optimization</a>. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (&#8220;non-negative curvature&#8221; as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they care about is &#8220;beautiful theory&#8221; so his comment was humorous to me. Convex optimization is a very intuitive way to think about regression and techniques such as the lasso. Convex optimization has tons of use cases including parameter estimation (<a href="http://en.wikipedia.org/wiki/Maximum_likelihood_estimator">MLE</a>, <a href="http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">MAP</a>, <a href="http://en.wikipedia.org/wiki/Least_squares">least-squares</a>, <a href="http://en.wikipedia.org/wiki/Lasso_%28statistics%29#LASSO_method">lasso</a>, logistic SVM and modern <a href="http://pages.cs.wisc.edu/~gfung/GeneralL1/">L1 optimization</a>). Boyd showed an example of convex optimization for <a href="http://www2.cs.uregina.ca/~hamilton/courses/330/notes/io/node6.html">disk head scheduling</a>.</p>
<p>For more information about convex optimization, see the website for <a href="http://www.stanford.edu/~boyd/cvxbook/"><em>Convex Optimization </em>by Boyd and Vandenberghe</a>. The book is available for free as well as lecture slides etc. Even better, the second author is from UCLA! I did not realize that.</div>
<p><br clear="all" /></p>
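<p>As a tiny illustration of the lasso as a convex problem, consider the one-variable case: minimize 0.5 * (x - b)^2 + lam * |x|. Its minimizer has a well-known closed form called soft-thresholding. The sketch below is my own toy example, not something from Boyd's talk; the names <code>soft_threshold</code> and <code>objective</code> and the values of <code>b</code> and <code>lam</code> are assumptions. It compares the closed form against a brute-force search over a grid.</p>
<pre class="brush: python; title: ; notranslate">
import math

def soft_threshold(b, lam):
    """Minimizer of 0.5 * (x - b)**2 + lam * abs(x), a one-variable lasso."""
    return math.copysign(max(abs(b) - lam, 0.0), b)

def objective(x, b, lam):
    """The convex objective: squared error plus an L1 penalty."""
    return 0.5 * (x - b) ** 2 + lam * abs(x)

b, lam = 3.0, 1.0
x_star = soft_threshold(b, lam)

# Brute-force check: search a grid from -5 to 5 in steps of 0.01
grid = [i / 100.0 for i in range(-500, 501)]
best = min(grid, key=lambda x: objective(x, b, lam))
print(x_star, best)
</pre>
<p>Because the objective is convex, the grid search and the closed form agree on a single global minimum; a large enough <code>lam</code> pushes the solution to exactly zero, which is why the lasso produces sparse models.</p>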
<div><img class="rfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/08/norvig.jpg" alt="" /><br />
<a href="http://norvig.com/"><em>Peter Norvig</em></a>, <em>Internet Scale Data Analysis</em>. It is always great to hear from Peter Norvig. At the very least, you may have seen his name on your Artificial Intelligence introductory textbook <a href="http://aima.cs.berkeley.edu/"><em>Artificial Intelligence: A Modern Approach</em></a>. Norvig is also well known as the Director of Research at Google. He also spoke at SciPyCon 2009 and was wearing a similarly flashy shirt. Norvig discussed how to get around long latencies in a large scale system. Interestingly, his talk began with a discussion about Google&#8217;s interest in its carbon footprint because of course all of Google&#8217;s massive systems require a lot of power. The carbon output of 2500 queries is approximately equal to the carbon output in a beer. Norvig noted that most of Google&#8217;s most successful engineers are well-versed in distributed systems, and this should come as no surprise. He then introduced MapReduce and showed an example of how Google uses MapReduce to process map tiles for Google Maps. Norvig concluded by mentioning a variety of large systems used by Google including <a href="http://labs.google.com/papers/bigtable.html">BigTable</a> (column oriented store), and <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">Pregel</a> for graph processing. Pregel is vertex based, and thus programs &#8220;think like a vertex&#8221; where each vertex responds to actions transmitted over edges.</div>
<p><br clear="all"/><br />
(There was a keynote by a fellow named <a href="http://www.cbse.ucsc.edu/people/haussler">David Haussler</a> about cancer genomics. After an exhausting first two days, I skipped this talk as I needed to sleep&#8230;and I was not incredibly interested in the topic.)</p>
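<p>Coming back to Pregel's &#8220;think like a vertex&#8221; idea from Norvig's talk: a common textbook illustration is propagating the maximum value through a graph. The sketch below is a single-machine toy in Python, not Pregel's actual API; the function name <code>propagate_max</code>, the graph and the values are all assumptions. In each superstep, active vertices send their value along their edges, and a vertex that learns a larger value updates itself and stays active.</p>
<pre class="brush: python; title: ; notranslate">
def propagate_max(values, edges):
    """Vertex-centric supersteps: each vertex keeps the largest value it has seen."""
    values = dict(values)
    active = set(values)  # in the first superstep every vertex sends its value
    supersteps = 0
    while active:
        # Deliver messages from active vertices along their outgoing edges
        inbox = {v: [] for v in values}
        for v in active:
            for neighbour in edges.get(v, []):
                inbox[neighbour].append(values[v])
        # Each vertex reacts only to the messages it received
        active = set()
        for v, messages in inbox.items():
            new_value = max([values[v]] + messages)
            if new_value != values[v]:
                values[v] = new_value
                active.add(v)  # changed vertices speak again next superstep
        supersteps += 1
    return values, supersteps

# A chain a - b - c - d with edges listed in both directions
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, edges)
print(final, steps)
</pre>
<p>The computation halts when no vertex changes, at which point every vertex in a connected graph holds the global maximum.</p>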
<div><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/08/Judea_Pearl.jpg" alt="" width="233px" /><em><a href="http://bayes.cs.ucla.edu/jp_home.html">Judea Pearl</a>, The Mathematics of Causal Inference. </em>Go Bruins! Judea Pearl is a professor at the <a href="http://www.cs.ucla.edu">UCLA Department of Computer Science</a> and teaches a course on his field, Causality, each spring. His talk was essentially the same talk he gives at UCLA at the beginning of the quarter. I attempted to take his course in 2009, but quite frankly, I don&#8217;t get it and my mind cannot bend into that realm. I remember sitting in his class and wondering &#8220;what is wrong with me?&#8221; I love listening to Dr. Pearl speak only because of his sense of humor. Despite his age and the fact that he is slowing down, he had the crowd in hysterics as he struggled with the presentation technology and made intelligent jokes at every chance.</p>
<p>Pearl believes that humans do not communicate with probability, but causality (I do not agree with this entirely). I appreciated that he mentioned that it takes work to overcome the difference in thinking between probability and causality. In statistics, we use some data and a joint distribution to make inferences about some quantity or variable <em>P</em>. In causality, there is an intentional intervention that changes the joint distribution <em>P </em>into another joint distribution <em>P&#8217;</em>. Causality requires new language and mathematics (I do not see it). In order to use causality, one must introduce some untestable hypothesis. Pearl mentioned that some non-standard mathematical methods include counterfactuals and structural equation modeling. I do not know how I feel about any of this. <a href="http://bayes.cs.ucla.edu/BOOK-2K/">For more information about Pearl&#8217;s Causality, check out his book</a>.</div>
<p><br clear="all" /><br />
<strong>Data Mining Competitions</strong></p>
<p>One interesting event during KDD 2011 was the panel <em>Lessons Learned from Contests in Data Mining. </em>This panel featured Jeremy Howard (<a href="http://www.kaggle.com">Kaggle</a>), Yehuda Koren (<a href="http://www.yahoo.com">Yahoo</a>!),&nbsp; Tie-Yan Liu (<a href="http://www.microsoft.com">Microsoft</a>), and Claudia Perlich (<a href="http://media6degrees.com/">Media6Degrees</a>). Both Kaggle and Yahoo <em>run </em>data mining competitions: Kaggle has its own series of competitions and Yahoo is a major sponsor of the <a href="http://www.sigkdd.org/kddcup/">KDD Cup</a> competition. Perlich has participated and won many data mining competitions. Liu provided a different insight into data mining competitions as an industry observer. <strong>&nbsp;</strong></p>
<p>Jeremy Howard gave some insight into the history of data mining competitions. He credited KDD 97 with the formation of the first data mining competition. He announced to the crowd that companies spend 100 billion dollars every year on data mining products and services (not including in-house costs such as employment) and that there are approximately 2 million Data Scientists. The estimate of the number of Data Scientists was based on the number of times R was downloaded, and is an estimate based on David Smith&#8217;s (Revolution Computing) blog post. I love R, and every Data Scientist should use it, but there are several problems with this estimate. First, not everyone that uses R is a Data Scientist; a large portion of R users are statisticians (&#8220;beautiful theory&#8221;), teachers, miscellaneous students etc. Second, not all Data Scientists use R. Some are even more creative and write their own tools or use little-adopted software packages. There are also a lot of Data Scientists that use Python instead of R. Howard also announced that over the next year, Kaggle will be starting 1000s of &#8220;invitation only&#8221; competitions. Personally, I do not care for this type of exclusion even though their intentions are good.</p>
<p>Yehuda Koren introduced the crowd to Yahoo&#8217;s involvement in data mining competitions. Yahoo is a major force behind the KDD Cup and the <a href="http://www.heritagehealthprize.com/c/hhp">Heritage Foundation competition</a>. Yahoo also won a progress award in the Netflix challenge. Koren then described how data mining competitions help the community. Competitions raise awareness and attract research to a field, usually involve the release of a cool dataset to the community, encourage contribution and education, and provide publicity for participants and winners. Contestants are attracted to competitions for various reasons including fun, competitiveness, fame, the desire to learn more, peer pressure and of course the monetary reward. As with every competition, data mining competitions have rules and Koren stated that rules are very difficult to enforce. I believe that data mining is vague as it is, so competitions would be just as vague. It is important that the rules discourage as few participants as possible while maximizing fairness and innovation. Some such &#8220;rules&#8221; include discouraging huge ensembles (which probably overfit anyway) and limiting submission frequency, team duplication and team size (the KDD Cup winning team had 25 members). Some obvious keys to success in data mining competitions are ensembles, hard work, team size, innovation vs. fancy models, quick coding and patience.</p>
<p>I felt that Tie-Yan Liu from Microsoft sort of served as the Simon Cowell of the panel, and I feel that his role was necessary. He provided industry insight that served as a bit of a reality check as to what data mining competitions accomplish and do not accomplish. Liu questioned whether the problems being solved in data mining competitions are really important problems. Part of the problem is that many datasets are censored so as to protect privacy. Additionally, the really interesting problems cannot be opened to the public because they involve trade secrets. I consider myself an inclusive guy &#8211; I do not like the concept of winners and losers. I was elated that Liu brought up this point: &#8220;what about the losers?&#8221; Is it bad publicity to &#8220;lose&#8221; several (or all) competitions? The answer to this question varies person-to-person. I honestly believe that the goal of these competitions is of the open-source nature (fun, share, learn, solve) and not so much to cure cancer. They are great for college students and for people who are interested in data science but do not have access to great data. For the rest of us, learning on our own using interesting data is probably better.</p>
<p>Claudia Perlich (Media6Degrees) discussed her experience participating in data mining competitions. She has won several contests. She commented on the distinction between sterile/cleaned data and real data as competitions can include either type. The concept of Occam&#8217;s Razor applies to data mining competitions; Perlich won most of her competitions using a linear model, but by using more complex and creative features. Perlich emphasizes that complex features are better than complex models.</p>
<p>Considering the <a href="http://www.netflixprize.com/">Netflix Prize</a> has been one of the biggest data mining competitions, I was disappointed that they were not represented on the panel since there were several researchers from Netflix at the conference.</p>
<p><em>Rather than write a few sentences for each topic, I will just bullet the goals of the research discussed in the sessions. Descriptions with a star (*) denote my favorite papers and are cited later.</em></p>
<p><strong>Text Mining</strong></p>
<p>I attended two of the three text mining sessions. I must say that I am quite topic-modeled and LDAed out! <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_Allocation">Latent Dirichlet Allocation (LDA)</a> and several variations were part of every talk I heard. That was very exciting and reaffirms that I am in a hot field. Still, nobody has taken my dissertation topic yet (which I have remained quiet about).</p>
<ul>
<li>Using explicit user feedback to improve LDA and display topics appropriately by combining topic labels, topic n-grams and capitalization/entity detection.* This talk was presented by David Andrzejewski (@<a href="http://www.twitter.com/davidandrzej">davidandrzej</a>). I finally got to meet him and I discussed my dissertation topic with him. I am always entertained by the fact that we all look much different than our Twitter avatars portray. <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
<li>Using external metadata and topics (LDA) to predict user ratings on items using localized factor models.</li>
<li>Using preferences and relative emphasis of each factor (i.e. how important to you is free wireless Internet in a hotel room?) to predict rating scores.*</li>
<li>Determining the network process that created a piece of text: who copied from whom?</li>
<li>Using a topic model (LDA) with other features such as part-of-speech tag (noun, verb etc.), <a href="http://wordnet.princeton.edu/">WordNet</a> features, sentiment/polarity etc.*</li>
<li>Modeling how topics and interests grow over time and understanding the correlations between terms over time.*</li>
</ul>
<p><strong>Social Network Analysis and Graph Analysis</strong></p>
<p>The Social Networks session conflicted with one of the Text Mining sessions, but since I knew there would be two more, I decided to attend this one instead. I also combined the two Graph Analysis sessions into this section since they are so related. The goals of the research presented in these talks were as follows:</p>
<ul>
<li>To label venue (Foursquare venues etc.) types (restaurant, bar, park etc.) based on several attributes of the user: user&#8217;s friends, user&#8217;s weekly and daily schedule using label propagation.</li>
<li>To determine the connections/edges in a social network that are the most critical for propagation of data (an idea, tweet, viral marketing etc.)*</li>
<li>To use tagging (items on Amazon can be tagged with keywords by users) and reviews to predict the success of a new item.</li>
<li>To find a better metric for ranking search engine results by starting with a relevant subgraph rather than a random surfer model. Also models attention span of user.*</li>
<li>Classification of nodes, labeling of nodes and node link prediction using one unified algorithm (C3).*</li>
<li>Ranking using large graphs using a priori information about good/bad nodes and edges.*</li>
<li>The importance of bias in sampling from networks.*</li>
</ul>
<p><strong>User Modeling</strong></p>
<p>This session I suspect was similar to the Web User Modeling session and focused on recommendation engines and rating prediction.</p>
<ul>
<li>Using endorsements (retweets, likes, etc.) to measure user bias for real-time sentiment analysis.</li>
<li>Estimating user reputation using thumbs-up vote rates on Yahoo News comments.</li>
<li>Selecting a set of reviews that encapsulates the most information about a product with the most diverse viewpoints.</li>
</ul>
<p><strong>Frequent Sets</strong></p>
<p>I did some work with <a href="http://en.wikipedia.org/wiki/Association_rule_learning">itemset mining</a> at my last job and I was not incredibly interested in the Online Data and Streams session at the time so I attended this talk.</p>
<ul>
<li>Using background knowledge about transactions to minimize redundancy.</li>
<li>Studying the effects of order on itemset mining.</li>
<li>Mining graphs as frequent itemsets from streams.</li>
</ul>
<p><strong>Classification</strong></p>
<p>I got stuck in this session because the session I really wanted to attend, &#8220;Web User Modeling&#8221;, was full and there was nowhere to sit or stand. This session was more technical and theoretical. The only talk that I really enjoyed was about <a href="http://dl.acm.org/ft_gateway.cfm?id=2020418&amp;ftid=1012890&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979">a classifier called CHIRP. I did not follow the details, but this is a paper that I am interested in reading</a>. The authors used a classifier based on Composite Hypercubes on Iterated Random Projections to classify spaces that have complex topology (think of classifying items that appear in a bullseye/dartboard pattern).*</p>
<p><strong>Unsupervised Learning</strong></p>
<p>This talk was similar to the classification talk but more practical in my opinion.</p>
<ul>
<li>Using decision trees for density estimation classifiers.</li>
<li>Clustering cell phone user behavior using &#8220;Earth Mover&#8221; distance.</li>
<li>Clustering of multidimensional data using mixture modeling with components of different distributions and copulas.*</li>
</ul>
<p><strong>Favorite Papers</strong></p>
<p>Below is a short bibliography of my favorite papers. There were also a few at the poster session (the first four) that I include here.<strong><br />
</strong></p>
<ul>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020603&amp;ftid=1013043&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Ranking-Based Classification of Heterogeneous Information Networks</em></a>, Ming Ji, Jiawei Han, Marina Danilevsky.</li>
<li><em><a href="http://dl.acm.org/ft_gateway.cfm?id=2020561&amp;ftid=1013057&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979">Axiomatic Ranking of Network Role Similarity</a>, </em>Ruoming Jin, Victor E. Lee, Hui Hong.</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020558&amp;ftid=1013004&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Approximate Kernel k-means: Solutions to Large Scale Kernel Clustering</em></a></li>
<li><em><a href="http://dl.acm.org/ft_gateway.cfm?id=2020614&amp;ftid=1013054&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979">User-Level Sentiment Analysis Incorporating Social Networks</a>, </em>Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, Ping Li.</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020503&amp;ftid=1012957&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Latent Topic Feedback for Information Retrieval</em></a>, David Andrzejewski, Lawrence Livermore National Laboratory; David Buttler, Lawrence Livermore National Laboratory</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020505&amp;ftid=1012959&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Latent Aspect Rating Analysis without Aspect Keyword Supervision</em></a>, Hongning Wang, UIUC; Yue Lu, University of Illinois; ChengXiang Zhai, UIUC</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020484&amp;ftid=1012943&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Conditional Topical Coding: an Efficient Topic Model Conditioned on Rich Features</em></a>, Jun Zhu, Carnegie Mellon University; Ni Lao, Carnegie Mellon University; Ning Chen, Tsinghua University; Eric Xing, CMU</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020485&amp;ftid=1012944&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Tracking Trends: Incorporating Term Volume into Temporal Topic Models</em></a>, Liangjie Hong, Lehigh University; Dawei Yin, Lehigh University; Jian Guo, University of Michigan; Brian Davison, Lehigh University</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020428&amp;ftid=1012898&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Diversity in ranking via resistive graph centers</em></a>, Kumar Dubey, IBM Research; Soumen Chakrabarti, &#8220;Indian Institute of Technology, Bombay&#8221;; Chiru Bhattacharya, IISc</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020429&amp;ftid=1012899&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Collective Graph Identification</em></a>, Galileo Namata, University of Maryland; Stanley Kok, University of Maryland; Lise Getoor, &#8220;University of Maryland, College Park&#8221;</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020430&amp;ftid=1012900&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Semi-Supervised Ranking on Very Large Graph with Rich Metadata</em></a>, Bin Gao, Microsoft Research Asia; Tie-Yan Liu, Microsoft Research Asia; Wei Wei, ; Taifeng Wang, Microsoft Research; Hang Li, Microsoft</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020431&amp;ftid=1012901&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Benefits of Bias: Towards Better Characterization of Network Sampling</em></a>, Arun Maiya, UIC; Tanya Berger-Wolf, University of Illinois at Chicago</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020418&amp;ftid=1012890&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections</em></a>, Leland Wilkinson, Systat; Anushka Anand, UIC; Tuan Dang, UIC</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020492&amp;ftid=1012949&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Sparsification of Influence Networks</em></a>, Michael Mathioudakis, University of Toronto; Francesco Bonchi, Yahoo! Research; Carlos Castillo, Yahoo!; Aristides Gionis, Yahoo! Research Barcelona; Antti Ukkonen,</li>
<li><a href="http://dl.acm.org/ft_gateway.cfm?id=2020509&amp;ftid=1012962&amp;dwn=1&amp;CFID=37256003&amp;CFTOKEN=68211979"><em>Online heterogeneous mixture modeling with marginal and copula selection</em></a>, Ryohei Fujimaki, NEC Laboratories America; Yasuhiro Sogawa, ; Satosi Morinaga,</li>
</ul>
<p><strong>Wrapping Up</strong></p>
<p>I had an awesome time at KDD and wish I could go next year, but it will be held in Beijing. I got to meet a lot of different people in the field that have the same passion for data and that was really cool. I got to meet with recruiters from a few different companies and get some swag from Yahoo and Google.</p>
<p>It was awesome being around such greatness. I ran into Peter Norvig several times, ran into Judea Pearl in the restroom (I already know him), as well as <a href="http://www.cs.cmu.edu/~christos/">Christos Faloutsos</a> (I am a huge fan) and <a href="http://www.rulequest.com/Personal/">Ross Quinlan</a>. I stopped at the Springer booth and found a <a href="http://www.amazon.com/gp/product/1441965149">cool book about link prediction with Faloutsos as one of the authors</a>. I went to buy it, handed the lady my credit card, and learned that it was $206 (AFTER conference discount)! Interestingly&#8230; Amazon has the same book for $165. I will probably order it anyway.</p>
<p>Here&#8217;s hoping that KDD returns to California (or the US) real soon!
</p>
<p style="text-align: center;"><img src="http://www.bytemining.com/wp-content/uploads/2011/08/sd.jpg" alt="" /></p>
<p><a href="http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/">&lt;&lt; My review of Day 1.</a></p>
<p><strong>Candid Shots</strong></p>
<table border="0">
<tr>
<td>
<img src="http://www.bytemining.com/wp-content/uploads/2011/08/quinlan.jpg" alt="" />
</td>
<td>
<img src="http://www.bytemining.com/wp-content/uploads/2011/08/faloutsos.jpg" alt="" />
</td>
</tr>
<tr>
<td>Ross Quinlan enjoying a beer during the poster session. What a cool guy!</td>
<td>Christos Faloutsos talking with a student during the poster session.</td>
</tr>
</table>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/08/sigkdd-2011-conference-days-234-summary-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGKDD 2011 Conference &#8212; Day 1 (Graph Mining and David Blei/Topic Models)</title>
		<link>http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/</link>
		<comments>http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/#comments</comments>
		<pubDate>Mon, 22 Aug 2011 16:41:22 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=877</guid>
		<description><![CDATA[<p></p>
<p>I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That&#8217;s good targeting!</p>
<p>Mining and Learning with Graphs Workshop 2011</p>
<p>I had originally planned to attend the 2-day workshop Mining and Learning with Graphs (MLG2011) but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was Networks Spill the Beans by Lada Adamic from the University of Michigan. Adamic&#8217;s presented work involved inferring properties of content (the &#8220;what&#8221;) using network structure alone (using only the &#8220;who&#8221;: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner&#8217;s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://www.bytemining.com/wp-content/uploads/2011/08/KDD_Banner_10_Jan.jpg" alt="" width="750" height="85" /></p>
<p>I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. <a href="http://www.admeld.com">AdMeld</a> did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That&#8217;s good targeting!</p>
<p><strong>Mining and Learning with Graphs Workshop 2011</strong></p>
<p>I had originally planned to attend the 2-day workshop <a href="http://www.cs.purdue.edu/mlg2011/">Mining and Learning with Graphs (MLG2011)</a> but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was <em>Networks Spill the Beans </em>by <a href="http://www.ladamic.com/">Lada Adamic</a> from the University of Michigan. The work Adamic presented involved inferring properties of content (the &#8220;what&#8221;) using network structure alone (using only the &#8220;who&#8221;: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner&#8217;s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. For this example, Adamic used data from the virtual world <a href="http://secondlife.com/">SecondLife</a>. Certain landmarks (such as a bench) can be bookmarked by users and certain gestures (like a kiss) can be studied. This made my ears perk up. SecondLife is a treasure trove of cool data. Is there a way to access it? It looks like there might be a way to access some of it, including monetary valuation, market purchases, and several APIs for different aspects of SecondLife. I will have to look into that later though. Adamic concluded with a discussion of Twitter as a social network, but I was starting to fall asleep from my hectic and early morning. The gist of her talk, and many other talks in this field, was to combine semantic variables (NLP) with topological variables (SNA) to predict some other semantic variables.
This talk was very digestible and very interesting (despite my lack of sleep). It featured some of the worst visualizations I have ever seen (area plots representing correlations across multiple levels of an ordinal variable), but that was minor. Of course, <a href="http://www.flowingdata.com">Nathan</a> might disagree <img src='http://www.bytemining.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p>
<p>Social network analysis, and network analysis in general, is a field that I really want to sink my teeth into. The difficulty I have is that the discussion of this field seems to involve so much vernacular that is specific to the field that everything seems so much more difficult than it really is.</p>
<p>At this point I took off to lunch. Just across from the <a href="http://manchestergrand.hyatt.com/hyatt/hotels/index.jsp?null">Hyatt</a> (a beautiful hotel by the way) is <a href="http://www.seaportvillage.com/">Seaport Village</a>, a beautiful waterfront park containing nice landscaping, shops, restaurants, all with the ocean in the background. There is no beach there &#8212; the village backs right up to the water. Across the bay is some type of military complex and <a href="http://www.coronado.ca.us/">Coronado Island</a>. I had a $7 hot dog, followed by a chocolate-covered strawberry and a peanut butter cup from the nearby candy store. It was such a nice day I walked around for a while, grabbed a strawberry shake and then headed back for the next session&#8230; the one I had been waiting for!</p>
<div align="center">
<table>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_12-52-40_687.jpg" width="288px"/></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_12-38-45_574.jpg" width="288px"/></td>
</tr>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_10-33-32_593.jpg"/></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/08/2011-08-21_12-57-35_758.jpg" width="288px"/></td>
</tr>
</table>
</div>
<p><strong>Afternoon Tutorial: Probabilistic Topic Models </strong><br />
<em>David Blei, Princeton</em></p>
<p>My dissertation topic is related to <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet allocation</a> (well, topic modeling in general), so I was definitely interested to hear what the father of LDA had to say. Since this was a 3 hour tutorial, I was expecting that <a href="http://www.cs.princeton.edu/~blei">Blei</a> would start with the unigram model, and then discuss <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Analysis (LSA)</a> and <a href="http://en.wikipedia.org/wiki/PLSI">Probabilistic Latent Semantic Indexing (pLSI)</a> building up to LDA. Instead, Blei started with LDA and for good reason! In this post, I will not summarize the mechanics of Latent Dirichlet Allocation as that is another post entirely. For some introduction, see <a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">here</a>. LDA and its extensions can be used to model the evolution of topics over time, to model the connections among topics, and to predict links among objects in a network. Topic modeling is a case study in machine learning rather than a field in itself; topic modeling draws on several different concepts including Bayesian statistics, time series analysis, hierarchical models, Markov chain Monte Carlo (MCMC), Bayesian non-parametric statistics and sparsity. In LDA, a document is represented as a mixture of topics (some hypothetical quantity that captures content clustering), and a topic is a distribution over words in a vocabulary.</p>
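<p>To make the last sentence concrete, here is a minimal sketch of how a document could be generated once the topic proportions are known. The two toy topics, their word probabilities and the function name are all my own made-up assumptions, and this only covers the word-generation step, not a full LDA implementation.</p>

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
TOPICS = {
    "astronomy": {"star": 0.5, "galaxy": 0.3, "telescope": 0.2},
    "cooking": {"oven": 0.4, "recipe": 0.4, "salt": 0.2},
}


def generate_document(theta, topics, n_words, rng):
    """Generate words from per-document topic proportions `theta`.

    For each word position: draw a topic from theta, then draw a word
    from that topic's distribution over the vocabulary.
    """
    topic_names = list(theta)
    topic_weights = [theta[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


rng = random.Random(0)
# A document that is 80% astronomy and 20% cooking.
doc = generate_document({"astronomy": 0.8, "cooking": 0.2}, TOPICS, 10, rng)
print(doc)
```

<p>Fitting LDA amounts to running this story in reverse: given only the words, infer the topics and the per-document proportions.</p>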
<p>Again, this is a high-level description of what was discussed. A full mechanical analysis would require dozens of pages. LDA is just a probabilistic model. As such, there are established ways for estimating the parameters of the model as well as the topic assignments. Some of these include <a href="http://www.springerlink.com/index/N811M25287935571.pdf">mean field variational methods</a>, <a href="http://research.microsoft.com/en-us/um/people/minka/papers/ep/">expectation propagation</a>, <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampling</a>, collapsed Gibbs sampling, collapsed variational Bayes and online variational Bayes. Each of these estimation methods has its own advantages and disadvantages. Blei showed that LDA and pLSI have a lot in common. Unlike LDA, pLSI uses <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood estimation</a> (and the <a href="http://en.wikipedia.org/wiki/Expectation-maximization_algorithm">EM algorithm</a>) for parameter estimation; pLSI tends to overfit badly. The hyperparameter &alpha; adds regularization to the ϴ parameter in the LDA model. [Sorry to refer to these random parameters, but it is difficult to describe without them. See the links mentioned earlier for an overview of LDA.]</p>
<p><em>Preprocessing. </em>A lot of preprocessing must be performed before computing a topic model. First, we should <a href="http://en.wikipedia.org/wiki/Stop_words"><strong>remove stopwords</strong></a>, which are words that provide absolutely no clues to the content of the text. If we leave stopwords in the corpus when computing the model, we may end up with meaningless topics that are described with only stopwords, due to their high probability. Second, Blei mentioned that <a href="http://en.wikipedia.org/wiki/Stemming"><strong>stemming</strong></a> is a good idea, but modern stemming algorithms tend to be too aggressive. If resources allow, I think it would be useful to have humans manually strip words to their root words. <strong>Multiword phrases </strong>such as &#8220;black hole&#8221; are also an issue. With sufficient resources, one could ask human labelers to identify these phrases and recode them as a single word by replacing the space between words with an underscore. <a href="http://www.cs.umass.edu/~wallach/publications/wallach06topic.pdf">Hanna Wallach (U. Mass) has a paper</a> that describes how to identify multiword phrases by using <em>n</em>-grams. Blei has a similar paper that discusses an algorithm called <a href="http://arxiv.org/abs/0907.1013">TurboTopics</a>. He also mentioned that a standard statistical hypothesis test such as chi-squared, permutation tests, or a nested hypothesis test would also be sufficient, though inefficient. I have not thought through how this would work, however. Finally, <strong>remove rare words</strong> because they can lead to local optima in the likelihood surface, probably yielding inefficient computation.</p>
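<p>The preprocessing steps above can be sketched in a few lines of Python. The stopword list, the phrase list and the rare-word threshold below are tiny placeholders I made up for illustration; stemming is left out entirely, and real phrase detection would need something like the <em>n</em>-gram approaches mentioned above.</p>

```python
from collections import Counter

# Assumed, tiny stopword list and phrase list for illustration only.
STOPWORDS = {"the", "a", "of", "is", "in", "and"}
PHRASES = {("black", "hole")}


def preprocess(docs, min_count=2):
    """Lowercase, join known phrases, drop stopwords, then drop rare words."""
    tokenized = []
    for text in docs:
        tokens = text.lower().split()
        merged = []
        i = 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if pair in PHRASES:
                merged.append("_".join(pair))  # "black hole" -> "black_hole"
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokenized.append([t for t in merged if t not in STOPWORDS])

    # Count over the whole corpus and keep words seen at least min_count times.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


corpus = [
    "The black hole is in the galaxy",
    "A black hole and a star",
    "The galaxy of the star and quasar",
]
print(preprocess(corpus))
```

<p>In this toy corpus, &#8220;quasar&#8221; appears only once and is dropped, while &#8220;black hole&#8221; survives as the single token black_hole.</p>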
<p><em>Some hairy details. </em>One of the parameters that makes LDA useful is &alpha;. &alpha; is a hyperparameter in the LDA model that determines the sparsity of draws from the underlying <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet distribution</a>. &alpha; is typically a small number; Blei mentioned that 0.01 is a good a priori value for &alpha;. As &alpha; gets larger, the distribution of topics tends towards the uniform (each topic equally likely) distribution and as &alpha; approaches 0, we get sparser draws, meaning more peaked topic probabilities. Setting &alpha; to be ridiculously small (e.g. 0.001) may yield a single topic dominating the model. &alpha; can be chosen, or we can fit &alpha; to the data using cross-validation or some other method. He also discussed the parameter &eta;.</p>
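<p>The effect of &alpha; on sparsity is easy to see by simulation. The sketch below draws from a symmetric Dirichlet by normalizing independent Gamma draws (a standard construction) and reports how much probability the single largest topic gets on average. The number of topics, the number of draws and the function names are my own arbitrary choices.</p>

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics.

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    while True:
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(gammas)
        if total > 0.0:  # guard against underflow when alpha is tiny
            return [g / total for g in gammas]


def mean_largest_share(alpha, k=10, n_draws=500, seed=0):
    """Average probability mass of the single most probable topic."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_draws)) / n_draws


for alpha in (0.01, 1.0, 100.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

<p>With a small &alpha; the largest topic takes nearly all of the mass (sparse, peaked draws); with a large &alpha; every topic ends up close to 1/k.</p>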
<p><em>Open source software. </em>We quickly (in the blink of an eye) went through a list of some open-source LDA implementations:</p>
<ul>
<li><a href="http://www.cs.princeton.edu/~blei/lda-c/">LDA-C</a> (variational EM), <em>Blei.</em></li>
<li><a href="http://www.cs.princeton.edu/~blei/topicmodeling.html">HDP</a> (hierarchical Dirichlet processes), <em>Blei</em></li>
<li><a href="http://cran.r-project.org/web/packages/lda/">LDA</a> (R package, collapsed Gibbs), <em>Jonathan Chang, </em>Data Scientist, Facebook.</li>
<li><a href="http://alias-i.com/lingpipe/">Lingpipe</a>, <em>alias-i</em></li>
<li><a href="http://mallet.cs.umass.edu/">Mallet</a> (collapsed Gibbs), <em>UMass</em></li>
<li><a href="http://nlp.fi.muni.cz/projekty/gensim/">Gensim</a> (online and batch LDA),  <em>Radim Řehůřek</em></li>
</ul>
<p>To my delight, Blei seemed to favor the R package (although Gensim is a nice Python implementation). The R package not only contains LDA, but several other models including RTMs, MMSB and sLDA which will be discussed later. It is supposedly fast as well. The output from the R package can be visualized using the Topic Model Visualizer by Allison Chaney.</p>
<p>The beauty of LDA is that it can be embedded in many more complicated models. Some applications of these extensions include word sense, graphs and hierarchies. Before delving into specifics, there are a couple of changes to the LDA model that motivate the next topics.</p>
<ol>
<li>The probability of observing word <em>w</em> given a set of topics &beta; and a set of topic labels <em>z</em> is given by <em>P(w|&beta;,z)</em>, which is <a href="http://en.wikipedia.org/wiki/Multinomial_distribution">multinomial</a>. The distribution of <em>P(w|&beta;,z)</em> can be changed depending on what we are modeling. For example, for count data, <em>P(w|&beta;,z)</em> can be <a href="http://en.wikipedia.org/wiki/Poisson_distribution">Poisson</a>. This drastically changes the model, however. In LDA, <em>P(w|&beta;,z)</em> is multinomial, which is convenient because the Dirichlet distribution is its <a href="http://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a>.</li>
<li><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">The characteristic LDA posterior distribution can be used in more creative ways&#8230;</span></span></li>
</ol>
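<p>The convenience of pairing the Dirichlet with the multinomial is that the posterior is again a Dirichlet whose parameters are simply the prior parameters plus the observed counts. A minimal sketch, with a made-up three-word vocabulary and made-up counts:</p>

```python
def dirichlet_posterior(prior_alpha, word_counts):
    """Posterior Dirichlet parameters after observing multinomial counts.

    Conjugacy means the update is just element-wise addition.
    """
    return [a + n for a, n in zip(prior_alpha, word_counts)]


def dirichlet_mean(alpha):
    """Expected probability of each category under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical three-word vocabulary with a symmetric prior of 0.5 per word.
prior = [0.5, 0.5, 0.5]
counts = [8, 1, 0]  # observed word counts for one topic
posterior = dirichlet_posterior(prior, counts)
print(posterior)                  # [8.5, 1.5, 0.5]
print(dirichlet_mean(posterior))  # posterior mean word probabilities
```

<p>Note that the third word, which was never observed, still gets a small nonzero probability. A plain maximum likelihood estimate would assign it zero.</p>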
<p><em><a href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aoas/1183143727">Correlated Topic Model</a>. </em>In LDA, all topics are considered independent of each other, and this is usually unrealistic. CTM allows the topics to be correlated. For example, a paper classified as about calculus is more likely to also be classified as about physics, than it is to be classified as about sewing. Blei mentioned that CTM allows for better prediction, likely because it is more realistic. CTM is also more robust to overfitting. The main distinction from LDA is that <span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em>ϴ </em>follows the logistic normal distribution instead of the Dirichlet distribution.</span></span></p>
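<p>Here is a rough sketch of what drawing topic proportions from a logistic normal looks like: draw a Gaussian vector with correlated components, then push it through a softmax so it becomes a set of proportions. The mean, the lower-triangular covariance factor and the three topics are values I made up for illustration; they are not from the tutorial or the paper.</p>

```python
import math
import random


def softmax(values):
    """Map real numbers to positive proportions that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_logistic_normal(mean, chol, rng):
    """Draw topic proportions: eta ~ Normal(mean, chol * chol^T), theta = softmax(eta).

    `chol` is a lower-triangular factor of the covariance matrix.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    eta = [
        m + sum(chol[i][j] * z[j] for j in range(i + 1))
        for i, m in enumerate(mean)
    ]
    return softmax(eta)


# Hypothetical topics: calculus, physics, sewing.
# The factor below makes calculus and physics strongly positively correlated.
MEAN = [0.0, 0.0, 0.0]
CHOL = [
    [1.0, 0.0, 0.0],
    [0.9, 0.3, 0.0],
    [0.0, 0.0, 1.0],
]
rng = random.Random(0)
theta = sample_logistic_normal(MEAN, CHOL, rng)
print(theta)
```

<p>The covariance matrix is what carries the correlation between topics, which is something a Dirichlet cannot express.</p>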
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em><a href="http://dl.acm.org/citation.cfm?id=1143859">Dynamic Topic Model</a>. </em>DTM models how each individual topic changes over time. One example Blei showed involved a topic that could be labeled &#8220;technology&#8221;. In the late 1700s, this topic contained the words &#8220;coal&#8221;, &#8220;steel&#8221; (I am making it up from memory&#8230;probably badly&#8230;bear with me) and in 2011 contained the words &#8220;silicon&#8221; and &#8220;solar&#8221;. The distinction from LDA is two-fold. First, the topic at time <em>t </em>is assumed to be normally distributed, with the topic at time <em>t-1 </em>as the mean and some variance. Second, the word probabilities are proportional to the exponentiated topic parameters. That is,</span></span></p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cbeta_%7Bt%2Ck%7D%20%5Cvert%20%5Cbeta_%7Bt-1%2Ck%7D%20%5Csim%20N%28%5Cbeta_%7Bt-1%2Ck%7D%2C%20I%5Csigma%5E2%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\beta_{t,k} \vert \beta_{t-1,k} \sim N(\beta_{t-1,k}, I\sigma^2)' title='\beta_{t,k} \vert \beta_{t-1,k} \sim N(\beta_{t-1,k}, I\sigma^2)' class='latex' />
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">and </span></span><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em></em></span></span></p>
<img src='http://s.wordpress.com/latex.php?latex=P%28w%20%5Cvert%20%5Cbeta_%7Bt%2Ck%7D%29%20%5Cpropto%20%5Cexp%5Cbeta_%7Bt%2Ck%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(w \vert \beta_{t,k}) \propto \exp\beta_{t,k}' title='P(w \vert \beta_{t,k}) \propto \exp\beta_{t,k}' class='latex' />
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">instead of multinomial.</span></span></p>
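<p>The two DTM equations can be simulated directly: a topic takes a Gaussian random-walk step each time period, and the word probabilities come from exponentiating and normalizing the topic parameters. The vocabulary, the starting values and the step size below are made up, and this is only the forward simulation, not the inference procedure.</p>

```python
import math
import random


def evolve_topic(beta_prev, sigma, rng):
    """One time step: beta_t ~ Normal(beta_{t-1}, sigma^2 I)."""
    return [b + rng.gauss(0.0, sigma) for b in beta_prev]


def word_probabilities(beta):
    """P(w | beta_t) is proportional to exp(beta_t), normalized over the vocabulary."""
    peak = max(beta)
    exps = [math.exp(b - peak) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and starting parameters for a "technology" topic.
vocab = ["coal", "steel", "silicon", "solar"]
beta = [2.0, 1.5, -1.0, -1.0]
rng = random.Random(0)
for year in range(3):
    probs = word_probabilities(beta)
    print(year, dict(zip(vocab, (round(p, 3) for p in probs))))
    beta = evolve_topic(beta, sigma=0.5, rng=rng)
```

<p>Because each step only nudges the previous parameters, a topic drifts smoothly from period to period rather than jumping.</p>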
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">A limitation of DTM is that it does not handle the death of a topic gracefully. </span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><a href="http://arxiv.org/abs/1003.0783"><em>Supervised LDA</em></a>. In sLDA, we associate each document with an external variable. For example, a document may be a Yelp review containing text. The external variable associated with the Yelp review may be the number of stars in the associated rating. sLDA uses the estimated topics as regressors to predict this external variable <em>Y</em>. Various types of regression can be performed, from standard linear regression to the <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear model (GLM)</a>. The Yelp example would likely use an <a href="http://en.wikipedia.org/wiki/Ordered_logit">ordered logit</a> model for <em>Y</em>.<br />
</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em><a href="https://www.cs.princeton.edu/~blei/papers/ChangBlei2009.pdf">Relational Topic Models</a>. </em>RTM applies sLDA to every pair of documents in a corpus and attempts to use content to predict connectedness in a graph. For example, given the content on my Facebook profile, one could use sLDA to predict what kind of reaction I would have to an ad (i.e. click or no click) and this could be used for targeted ad serving, or any other type of recommendation engine. Think <a href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering</a>! RTM is also good for certain types of data that have spatial/geographic dependencies. </span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em><a href="http://www.cs.umass.edu/~wallach/workshops/nips2010css/papers/gerrish.pdf">Ideal Point Topic Models</a> </em>were barely touched upon since we were running short on time (although we voted to extend the session by 30 mins and Blei happily obliged). They seem particularly useful in political science for predicting roll call votes.</span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><em>Bayesian Non-Parametric Models</em> are a hot topic but are too complicated to describe here. In LDA, the number of topics is determined a priori and remains fixed throughout the model. In real life, topics can be &#8220;born&#8221; and can &#8220;die&#8221; off and we may not know a priori how many topics to use. One can model the latter situation as a <a href="http://en.wikipedia.org/wiki/Chinese_restaurant_process"><em>Chinese Restaurant Process</em></a> where each table is associated with a topic. Furthermore, a <em>Chinese Restaurant Franchise </em>can be used for modeling hierarchies (hLDA). In CRF, there is a corpus level restaurant where each table is a parameter and a topic (called plates). Then, each document has its own Chinese restaurant where each table is associated with a customer in the corpus level Chinese restaurant. <a href="http://www.amazon.com/Nonparametrics-Cambridge-Statistical-Probabilistic-Mathematics/dp/0521513464/ref=sr_1_2?ie=UTF8&amp;qid=1314031811&amp;sr=8-2">Blei recommended a book by Hjort</a>.</span></span></p>
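<p>To make the restaurant metaphor concrete, here is a minimal sketch of a plain Chinese Restaurant Process in Python. It is an illustration only and not code from the tutorial: the function name, the concentration parameter <code>alpha</code> and the seed are assumptions made for this example. Each new customer joins an existing table with probability proportional to the number of customers already seated there, or opens a new table with probability proportional to <code>alpha</code>, so the number of tables (topics) is not fixed a priori.</p>

```python
import random


def chinese_restaurant_process(num_customers, alpha, seed=0):
    """Seat customers one at a time; return each customer's table and the table sizes.

    alpha is a hypothetical concentration parameter (must be positive):
    larger values open new tables more often.
    """
    rng = random.Random(seed)
    table_counts = []  # table_counts[k] = number of customers at table k
    assignments = []   # assignments[i] = table chosen by customer i
    for _ in range(num_customers):
        # Existing table k has weight table_counts[k]; a new table has weight alpha.
        weights = table_counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_counts):
            table_counts.append(1)      # a new table (topic) is "born"
        else:
            table_counts[table] += 1    # join a popular table
        assignments.append(table)
    return assignments, table_counts


if __name__ == "__main__":
    assignments, table_counts = chinese_restaurant_process(100, alpha=1.0)
    print(len(table_counts), "tables for 100 customers")
    print("table sizes:", table_counts)
```

<p>Running it with a larger <code>alpha</code> tends to produce more tables, which is the knob that replaces choosing the number of topics by hand.</p>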
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><strong>Algorithms. </strong>The last few minutes were dedicated to discussing inference algorithms for LDA, particularly Gibbs sampling and variational Bayes. Gibbs sampling is very simple to implement, though Blei stated that it does not work for DTM or CTM because the assumptions of conjugacy (multinomial/Dirichlet) are violated. Variational Bayes is more difficult to implement, but handles non-conjugacy in CTM and DTM much better.</span></span></p>
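<p>As a rough illustration of why Gibbs sampling is considered simple to implement for plain LDA, here is a minimal sketch of a collapsed Gibbs sampler in Python. It is not the code presented in the tutorial. The function name, the hyperparameter defaults (<code>alpha</code>, <code>beta</code>) and the input format (each document is a list of integer word ids) are assumptions made for this example. The update relies on the multinomial/Dirichlet conjugacy mentioned above, which is exactly what CTM and DTM lack.</p>

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    z = []                                                      # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and put the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word


if __name__ == "__main__":
    toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 3, 1]]
    z, doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=5)
    print("topic counts per document:", doc_topic)
```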
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;"><strong>Plenary Sessions</strong></span></span></p>
<p><span style="font-family: symbol;"><span style="font-family: arial,helvetica,sans-serif;">The plenary sessions consisted of several thank-yous and awards. The committee provided some humor, which brought some levity to the long process of writing and submitting papers. They went over paper acceptance statistics and read some of the funnier comments that reviewers gave, one of which was something like &#8220;It is clear that the author did not read this paper before submitting it.&#8221; I don&#8217;t know how many times I have said that in various situations. The committee handed out awards for best paper and best dissertation. This year&#8217;s <a href="http://www.kdd.org/kdd2011/kddcup.shtml">KDD Cup competition was a contest similar to the Netflix challenge</a>, but involved music recommendation. The winner was <a href="http://www.ntu.edu.tw/engv4/">National Taiwan University</a>, for the fourth year in a row, I am told. The innovation award went to a researcher dear to my heart, <a href="http://en.wikipedia.org/wiki/Ross_Quinlan">Ross Quinlan</a>, who developed the <a href="http://www.rulequest.com/Personal/">C4.5 decision tree modeling software</a>.</span></span></p>
<p><em>For more information about topic modeling software, see David Blei&#8217;s website at <a href="http://www.cs.princeton.edu/~blei">http://www.cs.princeton.edu/~blei</a> which contains code for most if not all of these topic models. For the notes from the tutorial, see <a href="http://www.cs.princeton.edu/~blei/kdd-tutorial.pdf">http://www.cs.princeton.edu/~blei/kdd-tutorial.pdf</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/08/sigkdd-2011-conference-day-1-graph-mining-and-david-bleitopic-models/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Hadoop Fatigue &#8212; Alternatives to Hadoop</title>
		<link>http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/</link>
		<comments>http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/#comments</comments>
		<pubDate>Tue, 16 Aug 2011 17:30:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=794</guid>
		<description><![CDATA[<p>It&#8217;s been a while since I have posted&#8230; in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.</p>
<p></p>
<p>Hadoop has become the de facto standard in the research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it including conferences (Hadoop World, Hadoop Summit), books, training, and commercial distributions (Cloudera, Hortonworks, MapR) with support. Several projects that integrate with Hadoop have been released from the Apache incubator and are designed for certain use cases:</p>

Pig, developed at Yahoo, is a high-level scripting language for working with big data and Hive is a SQL-like query language for big data in a warehouse configuration.
HBase, modeled after Google&#8217;s BigTable, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.
ZooKeeper provides distributed coordination and Chukwa collects and monitors log data.
Mahout is a library for scalable machine learning, part of which can use Hadoop.
Cascading (Chris Wensel), Oozie (Yahoo) and Azkaban (LinkedIn) provide MapReduce job workflows and scheduling.

<p>Hadoop is meant to be modeled after Google MapReduce. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (HDFS for Hadoop) uses space across [...]]]></description>
				<content:encoded><![CDATA[<p><em>It&#8217;s been a while since I have posted&#8230; in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.</em></p>
<p><em></em><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop.png" alt="" width="335" height="93" /></p>
<p>Hadoop has become the de facto standard in the research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it including conferences (<a href="http://www.hadoopworld.com/">Hadoop World</a>, <a href="http://developer.yahoo.com/events/hadoopsummit2011/">Hadoop Summit</a>), books, training, and commercial distributions (<a href="http://www.cloudera.com">Cloudera</a>, <a href="http://www.hortonworks.com">Hortonworks</a>, <a href="http://www.mapr.com">MapR</a>) with support. Several projects that integrate with Hadoop have been released from the <a href="http://incubator.apache.org/">Apache incubator</a> and are designed for certain use cases:</p>
<ul>
<li><a href="http://pig.apache.org/">Pig</a>, developed at Yahoo, is a high-level scripting language for working with big data and <a href="http://hive.apache.org/">Hive</a> is a SQL-like query language for big data in a warehouse configuration.</li>
<li><a href="http://hbase.apache.org/">HBase</a>, modeled after Google&#8217;s BigTable, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.</li>
<li>ZooKeeper provides distributed coordination and Chukwa collects and monitors log data.</li>
<li><a href="http://mahout.apache.org/">Mahout</a> is a library for scalable machine learning, part of which can use Hadoop.</li>
<li><a href="http://www.cascading.org/">Cascading</a> (Chris Wensel), <a href="http://yahoo.github.com/oozie/">Oozie</a> (Yahoo) and <a href="http://sna-projects.com/azkaban/">Azkaban</a> (LinkedIn) provide MapReduce job workflows and scheduling.</li>
</ul>
<p>Hadoop is modeled after <a href="http://labs.google.com/papers/mapreduce.html">Google MapReduce</a>. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (<a href="http://hadoop.apache.org/hdfs/">HDFS</a> for Hadoop) uses space across a cluster to store data so that it appears to be in a contiguous volume and provides redundancy to prevent data loss. The distributed filesystem also allows data collectors to dump data into HDFS so that it is already primed for use with MapReduce. A Data Scientist or Software Engineer then writes a Hadoop MapReduce job.</p>
<p><em>As a review</em>, the Hadoop job consists of two main steps, a map step and a reduce step. There may optionally be other steps before the map phase or between the map and reduce phases. The map step reads in a bunch of data, does something to it, and emits a series of key-value pairs. One can think of the map phase as a partitioner. In text mining, the map phase is where most parsing and cleaning is performed. The output of the mappers is sorted and then fed into a series of reducers. The reduce step takes the key-value pairs and computes some aggregate (reduced) set of data such as a sum, average, etc. The <a href="http://wiki.apache.org/hadoop/WordCount">trivial word count exercise</a> starts with a map phase where text is parsed and a key-value pair is emitted: a word, followed by the number &#8220;1&#8221; indicating that the key-value pair represents 1 instance of the word. The user might also emit something to coerce Hadoop into passing data into different reducers. The words and 1s are sorted and passed to the reducers. The reducers take like key-value pairs and compute the number of times the word appears in the original input.</p>
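<p>The word count flow described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, sort and reduce steps, not Hadoop code. The function names are made up for this example, and a single call to <code>sorted</code> stands in for the sort/shuffle that Hadoop performs between mappers and reducers.</p>

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Parse the text and emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def reduce_phase(sorted_pairs):
    """Sum the 1s for each run of identical keys in the sorted input."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map, then sort (standing in for the shuffle), then reduce."""
    pairs = sorted(map_phase(lines))
    return dict(reduce_phase(pairs))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

<p>Because the reducer only ever looks at adjacent pairs with the same key, the sort step is what makes it possible to spread the reduce work across machines.</p>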
<p>After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. <strong>Before I continue, I will say that I still love Hadoop and the community.</strong></p>
<ul>
<li>Writing Hadoop jobs in Java is very time consuming because <em>everything </em>must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.</li>
<li>Documentation for the bloated Java API is sufficient, but not the most helpful.</li>
<li>HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.</li>
<li>Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!</li>
<li>Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky enough to have an error recorded! I&#8217;ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.</li>
<li>Large clusters require a dedicated team to keep them running properly, but that is not surprising.</li>
<li>Writing a Hadoop job becomes a software engineering task rather than a data analysis task.</li>
</ul>
<p>Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I&#8217;ve often said to myself &#8220;there must be a better way.&#8221; For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not essential:</p>
<p><strong>BashReduce</strong></p>
<p>Unlike Hadoop, <a href="https://github.com/erikfrey/bashreduce">BashReduce</a> is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, join etc. It supports mapping/partitioning, reducing, and merging. The developers note that BashReduce &#8220;sort of&#8221; handles task coordination and a distributed file system. In my opinion, these are strengths rather than weaknesses. There is actually no task coordination as a master process simply fires off jobs and data. There is also no distributed file system at all, but BashReduce will distribute files to worker machines. Of course, without a distributed file system there is a lack of fault-tolerance among other things.</p>
<p>Intermachine communication is facilitated with simple passwordless SSH, but there is a large cost associated with transferring files from a master machine to its workers, whereas with Hadoop the data already resides in HDFS. Additionally, partition/merge in the standard Unix tools is not optimized for this use case, so the developers had to use a few additional C programs to speed up the process.</p>
<p>Compared to Hadoop, there is less complexity and faster development. The cost is a lack of fault-tolerance and a lack of flexibility, as BashReduce only works with certain Unix commands. Unlike Hadoop, BashReduce is more of a tool than a full system for MapReduce. BashReduce was developed by <a href="http://fawx.com/">Erik Frey</a> et al. of <a href="http://www.last.fm/">last.fm</a>.</p>
<p><strong>Disco Project</strong></p>
<p><a href="http://discoproject.org/">Disco</a> was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco&#8217;s backend is written in <a href="http://www.erlang.org">Erlang</a>, a scalable functional language with built-in support for concurrency, fault tolerance and distribution &#8212; perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features.</p>
<p>Disco appears to be a fairly standard yet powerful MapReduce implementation that removes some of the painful aspects of Hadoop. Because it relies on a standard filesystem rather than one like HDFS, it likely gives up persistent fault tolerance, although Erlang may provide a &#8220;good enough&#8221; level of fault tolerance for data.</p>
<p><strong>Spark</strong></p>
<p><a href="http://www.spark-project.org/">Spark</a> is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write, and fast to run. Unlike many MapReduce systems, Spark allows <em>in-memory</em> querying of data (even distributed across machines) rather than using disk I/O. It is no surprise then that Spark out-performs Hadoop on many iterative algorithms. Spark is implemented in <a href="http://www.scala-lang.org/">Scala</a>, a functional object-oriented language that sits on top of the JVM. Similar to other languages like Python, Ruby, and Clojure, Scala has an interactive prompt and users can use Spark to query big data straight from the Scala interpreter.</p>
<p>One wrinkle is that Spark requires installing a cluster manager called <a href="http://www.mesosproject.org/">Mesos</a>. I had some difficulty installing it on Ubuntu, but the development team was an amazing help, and made a few changes to the source and now it runs well. On the downside, Mesos adds a layer of complexity that we are trying to avoid. On the upside, Mesos allows Spark to co-exist with Hadoop and it can read any data source that Hadoop supports, and it &#8220;feels&#8221; light, similar to Disco&#8217;s server UI.</p>
<p>Spark was developed by the <a href="http://amplab.cs.berkeley.edu/">UC Berkeley AMP Lab</a>. Currently, its main users are UC Berkeley researchers and <a href="http://www.conviva.com/">Conviva</a>. <a href="http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/">Hadoop Summit 2011 featured a talk on Spark by one of the developers, which I wrote about earlier this summer</a>.</p>
<p><strong>GraphLab</strong></p>
<p><a href="http://graphlab.org/">GraphLab</a> was developed at <a href="http://www.cmu.edu">Carnegie Mellon</a> and is designed for use in machine learning. GraphLab&#8217;s goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. Their website states that paradigms like MapReduce lack expressiveness while lower level tools such as MPI present overhead by requiring the researcher to write code that beats a dead horse.</p>
<p>GraphLab has its own version of the map stage, called the <em>update</em> phase. Unlike MapReduce, the update phase can both read <em>and</em> modify <em>overlapping </em>sets of data. Recall that MapReduce requires data to be <em>partitioned</em>. GraphLab accomplishes this by allowing the user to specify data as a graph where each vertex and edge in the graph is associated with memory. The update phases can be chained in such a way that one update function can recursively trigger other update functions that operate on vertices in the graph. This graph-based approach not only makes machine learning on graphs more tractable, but also improves dynamic iterative algorithms.</p>
<p>GraphLab also has its own version of the reduce stage, called the <em>sync operation. </em>The results of the sync operation are <em>global </em>and can be used by all vertices in the graph. In MapReduce, output from the reducers is local (until committed) and there is a strict data barrier among reducers. The sync operations are performed at time intervals, and there is not as strong of a tie between the update and sync phases. What I mean is that the sync intervals are not necessarily dependent on some prior update completing.</p>
<p>GraphLab&#8217;s website also contains the original <a href="http://www.select.cs.cmu.edu/publications/scripts/papers.cgi?Low+al:uai10graphlab">UAI paper</a> and <a href="http://graphlab.org/uai2010_graphlab.pptx">presentation</a>, a <a href="http://graphlab.org/abstractiononly.pdf">document better explaining the abstraction</a>, and there is even a <a href="http://groups.google.com/group/graphlabapi">Google Group for the GraphLab API</a>. To me, GraphLab seems like a very powerful generalization, and re-specification, of MapReduce.</p>
<p><strong>Storm</strong></p>
<p>Recently, Nathan Marz of BackType made waves in the Twitter big data community with a blog post titled <a href="http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce"><em>Preview of Storm: The Hadoop of Realtime Processing</em></a>. Within a day, Storm became known as &#8220;Real-time Hadoop&#8221; to the chagrin of some developers from Apache. Hadoop is a batch-processing system &#8212; that is, give it a lot of fixed data and it does something with it. Storm is real-time &#8212; it processes data in parallel as it streams.</p>
<p>Marz writes that with their previous system, much time was spent worrying about graphs of queues and workers: where to send and receive messages, deploying workers and queues, and a lack of fault tolerance. Storm abstracts all of these complications away. Storm is written in Clojure, but any programming language can be used to write programs on top of Storm. Storm is fault-tolerant, horizontally scalable, and reliable. Storm is also very fast, with ZeroMQ used as the underlying message passing system.</p>
<p>Nathan Marz is a software developer at <a href="http://www.backtype.com">BackType</a>, and made waves in 2010 with <a href="https://github.com/nathanmarz/cascalog">Cascalog</a>. Cascalog really took off after his presentation at the 2010 Hadoop Summit, and I am delighted I got to see him present it. Storm will be open-sourced soon and I hope to write more about it later.</p>
<p>I included Storm in this post based on its colloquial name &#8220;Real-time Hadoop&#8221; &#8212; it is not clear to me whether or not Storm even uses MapReduce though.</p>
<p><strong>HPCC Systems (from LexisNexis)</strong></p>
<p>Perhaps the project with the least flattering name comes from <a href="http://www.lexisnexis.com/">LexisNexis</a>, which has developed its own framework for massive data analytics. <a href="http://hpccsystems.com/">HPCC</a> attempts to make writing parallel-processing workflows easier by using Enterprise Control Language (ECL), a declarative, data-centric language. I should note that SQL, Datalog and Pig are also said to be declarative, data-centric languages. As a matter of fact, the development team has a converter for translating Pig jobs to ECL. HPCC is written in C++. Some have commented that this will make in-memory querying much faster because there are no bloated object sizes originating from the JVM. I also prefer C++ simply because it feels closer to human thought &#8212; we think in terms of objects (object-oriented) at times, and a series of steps (procedural) at other times and use both thought processes together.</p>
<p>HPCC already has its own jungle of technologies like Hadoop. HPCC has two &#8220;systems&#8221; for processing and serving data: the Thor Data Refinery Cluster, and the Roxie Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop. Roxie is similar to a data warehouse (like HBase) and supports transactions. HPCC uses a distributed file system.</p>
<p>Although the details, like the system itself, are still preliminary, HPCC certainly has the &#8220;feel&#8221; of a potentially solid alternative to Hadoop, but only time will tell.</p>
<p><strong>With all these alternatives, why use Hadoop?</strong></p>
<p>One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you log in to, your data is intact waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large datasets (usually log files) to this distributed filesystem and easily access them with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is also fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools. <strong>Spark can read data from HDFS</strong>, but if you would rather stick with Hadoop, you can try to spice it up:</p>
<p><strong>Hadoop Streaming </strong>is an easy way to avoid the monolith of Vanilla Hadoop without leaving HDFS, and allows the user to write map and reduce functions in any language that supports writing to stdout, and reading from stdin. Choosing a simple language such as Python for Streaming allows the user to focus more on writing code that processes data rather than software engineering. Once code is written, it is easy to test from the command line:</p>
<pre>cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py</pre>
<p>And, running and monitoring the job is similar to Vanilla Hadoop. Hadoop Streaming was my first introduction to Hadoop and it was quite pleasant.</p>
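<p>To give a feel for what the two scripts in that pipeline might contain, here is a minimal word count sketch for Streaming. It is an illustration under stated assumptions, not the code I actually ran. For brevity both steps live in one hypothetical file, <code>wordcount.py</code>, selected with a <code>map</code> or <code>reduce</code> argument, instead of separate <code>mapper.py</code> and <code>reducer.py</code> files. It uses the tab-separated key and value lines that Streaming expects by default, and the reducer relies on its input arriving sorted by key.</p>

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def run_mapper(lines, out):
    """Write one line per word: the word, a tab, then the count 1."""
    for line in lines:
        for word in line.split():
            out.write("%s\t1\n" % word.lower())


def run_reducer(lines, out):
    """Sum the counts per word; equal words must be adjacent (sorted input)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write("%s\t%d\n" % (word, total))


if __name__ == "__main__":
    # Hypothetical command line: "wordcount.py map" or "wordcount.py reduce".
    step = sys.argv[1] if len(sys.argv) == 2 else ""
    if step == "map":
        run_mapper(sys.stdin, sys.stdout)
    elif step == "reduce":
        run_reducer(sys.stdin, sys.stdout)
    else:
        sys.stderr.write("usage: wordcount.py map|reduce\n")
```

<p>With this layout, the local test becomes <code>cat a_bunch_of_files | ./wordcount.py map | sort | ./wordcount.py reduce</code>.</p>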
<p><strong>Or, you could use a Hadoopified project that better solves the problem. </strong>Vanilla Hadoop can do some sophisticated stuff, but it suffers from the problems I mentioned at the beginning of the post. Developers have created software that works on HDFS, but is geared toward different audiences. A Data Scientist may prefer Pig or Hive for data analysis whereas a Systems and Software Engineer may prefer a workflow solution (Oozie, Cascading etc.) and a (modern) DBA may want to use HBase. Each of these achieves different goals, but all still rely on HDFS.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/08/hadoop-fatigue-alternatives-to-hadoop/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>My Review of Hadoop Summit 2011 #hadoopsummit</title>
		<link>http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/</link>
		<comments>http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/#comments</comments>
		<pubDate>Thu, 30 Jun 2011 07:00:45 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=856</guid>
		<description><![CDATA[





<p>I woke up early and cheery Wednesday morning to attend the 2011 Hadoop Summit in Santa Clara, after a long drive from Los Angeles and the Big Data Camp that lasted until 10pm the night before. Having been to Hadoop Summit 2010, I was interested to see how much of the content in the conference had changed.</p>
<p>This year, there were approximately 1,600 participants and the summit was moved a few feet away to the Convention Center rather than the Hyatt. Still, space and seating were pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.</p>
<p>Keynotes</p>
<p>We first heard a series of keynote speeches which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year&#8217;s. One interesting application of Hadoop at Yahoo was for &#8220;retiling&#8221; the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took 5 days. Yahoo also [...]]]></description>
				<content:encoded><![CDATA[<table border=0>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop1-e1309419213413.jpg"/></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop2-e1309419340704.jpg"/></td>
</tr>
</table>
<p>I woke up early and cheery Wednesday morning to attend the <a href="http://developer.yahoo.com/events/hadoopsummit2011/">2011 Hadoop Summit</a> in Santa Clara, after a long drive from Los Angeles and the <a href="http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/">Big Data Camp</a> that lasted until 10pm the night before. Having been to <a href="http://www.bytemining.com/2010/06/my-experience-at-hadoop-summit-2010-hadoopsummit/">Hadoop Summit 2010</a>, I was interested to see how much of the content in the conference had changed.</p>
<p>This year, there were approximately 1,600 participants and the summit was moved a few feet away to the Convention Center rather than the Hyatt. Still, space and seating were pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.</p>
<p><strong>Keynotes</strong></p>
<p>We first heard a series of keynote speeches which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year&#8217;s. One interesting application of Hadoop at Yahoo was for &#8220;retiling&#8221; the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took 5 days. Yahoo also uses Hadoop for fraud detection, spam detection, search assist, geotagging data/local indexing, ad targeting, predicting supply and demand and the aggregation and categorization of news stories. Jay also mentioned that Dapper runs models on data with Hadoop for ad personalization. <strong>Jay also mentioned that Big Data conferences all over the country are selling out.</strong></p>
<p>Eric Baldeschwieler, the CEO of <a href="http://www.hortonworks.com/">Hortonworks</a>, was next. Hortonworks seems to be a new company that spun off from Yahoo. Their goal is to provide commercial support and a full Apache Hadoop platform for users. Yes, they are very similar to <a href="http://www.cloudera.com">Cloudera</a>, and yes, they are competition. (Hortonworks and MapR both did a good job of not stepping on everyone&#8217;s toes in terms of how they presented themselves.) Cloudera provides its own distribution of Hadoop, which is of course similar to the Apache version. Hortonworks&#8217; goal is to provide similar services, but with more transparency by using the Apache Hadoop distribution rather than rolling its own. Paraphrasing Eric, Hortonworks is open-source from the ground up. A bit later, Sanjay Radia, also of Hortonworks, discussed Hadoop for the enterprise. Hortonworks has contributed to, or is working on, security (preventing users from deleting others&#8217; data), <a href="http://en.wikipedia.org/wiki/Service_level_agreement">service level agreements (SLAs)</a>, predictability and a <a href="http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html">Fair-Share scheduler</a>.</p>
<p>Anant Jhingran, CTO of <a href="http://www.ibm.com">IBM</a>, discussed how Hadoop was used in <a href="http://www-03.ibm.com/innovation/us/watson/index.html">IBM Watson</a>. It seemed pretty obvious that Hadoop or some form of map-reduce was used in the system, but it did not seem to be highly publicized. Watson learned from 200 million pages of data, about 2-5TB, and required between 3000 and 4000 Watts. Anant went quickly through a cool user interface representing a Jeopardy board and stated that the user interface to an artificial intelligence application is just as important as the application itself. He also prefers the term <a href="http://www.wisegeek.com/what-is-intelligence-augmentation-ia.htm">IA (intelligence augmentation)</a> over AI, and apparently this is a common distinction. I interpret AI vs. IA to be artificial intelligence vs. knowledge discovery (data mining).</p>
<p>Karthik Ranganathan from <a href="http://www.facebook.com">Facebook</a> discussed Facebook&#8217;s messaging system, which was built on <a href="http://hbase.apache.org/">HBase</a>, <a href="http://scribefire-next/hadoop.apache.org/hdfs/">HDFS</a> and MapReduce. Facebook sees 15 billion messages per month, excluding SMS and email, approximately 14TB of data! There are also 120 billion chat messages (25TB), for a grand total of almost 300TB per month. (I may have missed something, as these numbers do not add up.) Facebook uses HBase for the bodies of small messages, metadata, and for the search index. Facebook uses HBase because of its high write throughput and easy horizontal scalability. Facebook uses another system called <a href="http://www.facebook.com/note.php?note_id=76191543919">Haystack</a> for photos, bodies of large messages and attachments. Of course, HDFS is used for fault tolerance, scalability, checksums for data integrity and its MapReduce abilities. Profiles and services are partitioned by user. Each machine has 16 cores, with 12 1TB hard disks, and 48GB RAM (24GB used for HBase). Some things that Facebook would like to contribute and improve: <a href="http://wiki.apache.org/hadoop/NameNode">NameNode</a> high availability and a second NameNode, better performance overall, and using flash memory to improve performance. Facebook often adds several columns to a table so that DevOps does not need to take the server offline to add new columns.</p>
<blockquote><p>Big Data conferences all over the country are selling out.</p></blockquote>
<p><strong>Breakout Sessions</strong></p>
<p><em>There were so many great sessions and I can only summarize the ones I attended. Check out the <a href="http://developer.yahoo.com/events/hadoopsummit2011/agenda.html">event agenda</a> for abstracts on all sessions.</em></p>
<p>First I attended <em>Web Crawl Cache &#8211; Using HBase to Manage a Copy of the Web. </em>In this talk, we learned about Yahoo&#8217;s Web Crawl Cache (WCC) that collects and organizes data from Microsoft as a result of a search deal. These snapshots of the web are not only useful for search, but also for drilling into other avenues such as local assets, influence and language corpora. WCC uses HBase for several reasons: bulk loading, efficient MapReduce jobs, random access reads, a usable consistency model, and the ease of dynamically adding columns (this seems to contradict Karthik&#8217;s claim).</p>
<p>It was very difficult to pick a session for the 1:45 to 2:15 time slot. Options included Next Generation Hadoop, Scaling out Realtime Data (Facebook) and Building Kafka (LinkedIn). I admire the work and clout that <a href="http://www.linkedin.com">LinkedIn</a> has built over the past year or two, so I attended Jay Kreps&#8217; session. LinkedIn&#8217;s data pipeline includes a lot of tracking, logging, metrics, messages and queuing. LinkedIn attempted to use messaging systems such as <a href="http://en.wikipedia.org/wiki/Java_Message_Service">JMS</a> and <a href="http://www.rabbitmq.com/">RabbitMQ</a>. Streaming data, such as search trends, click trends and invitation social networks, is prevalent at LinkedIn. <a href="http://sna-projects.com/kafka/">Kafka</a> is LinkedIn&#8217;s solution for a distributed message queue; rather than polling for data, users subscribe to a data stream and data sources publish data to it. Kafka is 7000 lines of Scala, a functional and object-oriented language on top of the Java Virtual Machine (JVM). Kafka can produce about 250,000 messages per second (50 MB) and consume 550,000 messages per second (110 MB).</p>
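<p>To make the publish/subscribe idea concrete, here is a toy, in-memory sketch in Python. It is not Kafka&#8217;s API and says nothing about how Kafka is actually built; the class name <code>TopicLog</code>, the topic and subscriber names, and the per-subscriber offset bookkeeping are all my own assumptions for illustration.</p>

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a publish/subscribe message log (hypothetical)."""

    def __init__(self):
        # topic name -> ordered, append-only list of messages
        self._topics = defaultdict(list)
        # (subscriber, topic) -> index of the next unread message
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """A data source appends a message to the end of a topic."""
        self._topics[topic].append(message)

    def consume(self, subscriber, topic, max_messages=10):
        """Return unread messages for this subscriber and advance its offset."""
        start = self._offsets[(subscriber, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(subscriber, topic)] = start + len(batch)
        return batch


log = TopicLog()
log.publish("clicks", "user1 clicked job-ad")
log.publish("clicks", "user2 clicked profile")

first = log.consume("metrics-job", "clicks")     # both messages so far
log.publish("clicks", "user3 clicked search")
second = log.consume("metrics-job", "clicks")    # only the newly published message
other = log.consume("search-trends", "clicks")   # separate subscriber, separate offset
```

<p>Each subscriber keeps its own position in the stream, so a slow consumer does not hold back a fast one, and publishers never need to know who is listening.</p>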
<p>Next I attended another talk by Hortonworks, this time on <a href="http://incubator.apache.org/hcatalog/">HCatalog</a>. HCatalog changes the way we think about data in HDFS. No longer do we need to worry about files and directories. Instead, HCatalog seems to add a layer of abstraction on top of HDFS that treats data as a set of tables. Tools such as Pig and Hive use this layer of abstraction, and currently Hive is tightly integrated with HCatalog. Hortonworks intends to add support for HBase and Streaming later this year.</p>
<p>I waited all day to see <a href="http://www.cs.berkeley.edu/~matei/">Matei Zaharia</a>&#8217;s talk on <a href="http://www.spark-project.org">Spark</a>. Zaharia is a graduate student at UC Berkeley and it was a nice change of pace to see a student present some work. Spark is a data processing platform that sits on top of the <a href="http://www.mesosproject.org/">Mesos</a> cluster management project (also produced by Berkeley). Mesos can handle 10,000s of nodes and 100s of concurrent jobs, and jobs can be isolated in Linux containers (e.g. <a href="http://www.openvz.org">OpenVZ</a>). <em>Spark aims to extend MapReduce for iterative algorithms, and interactive low latency data mining.</em> One major difference between MapReduce and Spark is that MapReduce is acyclic. That is, data flows in from a stable source, is processed, and flows out to a stable filesystem. Spark allows iterative computation on the same data, which would form a cycle if jobs were visualized. The Resilient Distributed Dataset (RDD) serves as an abstraction over raw data, and some data is kept in memory and cached for later use. This last point is very important; Spark allows data to be kept in RAM for an approximate 20x speedup over disk-based MapReduce. RDDs are immutable and created through parallel transformations such as map, filter, groupBy and reduce. RDD immutability is similar to immutable types in functional programming languages. It does not mean that the data can never change. Instead, it means that a transformation creates a new dataset with the change included, rather than modifying the original. The user can also perform <em>actions </em>on RDDs such as count, collect, etc. Some applications using Spark are traffic prediction (Berkeley), spam classification (Twitter), k-means, alternating least squares matrix factorization, and network simulation.</p>
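<p>The transformation/action split and the immutability point are easier to see in code. Below is a minimal single-machine sketch in plain Python. It is not Spark&#8217;s API; <code>ToyRDD</code> and its methods are hypothetical names I made up, and the eager <code>cache()</code> is a simplification of whatever Spark does internally.</p>

```python
import functools


class ToyRDD:
    """Hypothetical stand-in for an immutable dataset with lazy transformations."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that yields the records
        self._cache = None        # in-memory copy, filled in by cache()

    @classmethod
    def from_list(cls, records):
        data = tuple(records)     # tuple: the source records cannot be mutated
        return cls(lambda: iter(data))

    def _records(self):
        if self._cache is not None:
            return iter(self._cache)
        return self._compute()

    # Transformations: lazy, and each one returns a NEW ToyRDD.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._records()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._records() if pred(x)))

    def cache(self):
        """Materialize once in memory so later passes skip recomputation."""
        self._cache = tuple(self._compute())
        return self

    # Actions: these force evaluation and return plain values.
    def collect(self):
        return list(self._records())

    def count(self):
        return sum(1 for _ in self._records())

    def reduce(self, fn):
        return functools.reduce(fn, self._records())


numbers = ToyRDD.from_list(range(1, 11))
evens = numbers.filter(lambda x: x % 2 == 0).cache()   # reused below, so keep in RAM
squares = evens.map(lambda x: x * x)

total = squares.reduce(lambda a, b: a + b)
```

<p>Note that <code>numbers</code> is untouched after the filter and map: every transformation hands back a new dataset, which is the sense in which RDDs are immutable. An iterative algorithm would loop over the cached <code>evens</code> many times without rereading the source.</p>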
<blockquote><p>The main takeaway from Hadoop Summit 2010 was Cascalog. I predict the main takeaway from Hadoop Summit 2011 is Spark.</p></blockquote>
<p>One time at work I had a bizarre issue with corrupted data in HDFS. After that, I began blaming everything on HDFS. The next session, <em>Data Integrity and Availability of HDFS</em>, was enlightening. HDFS takes good care of Yahoo&#8217;s data. We can trust Yahoo on this: if HDFS breaks, Yahoo begins losing money, so they know what they are talking about! Yahoo&#8217;s goal is to have 60 PB online all the time. The key to HDFS reliability is <a href="http://en.wikipedia.org/wiki/Replication_%28computer_science%29">replication</a>. A replication factor of 3 (3 copies of every file? block?) is appropriate. A replication factor of 2 is also quite robust, but should only be used when there is a backup of the data because the probability of data loss is much higher. Yahoo has had issues with losing blocks (blocks are pieces of data, so lost blocks = data loss). There are a variety of reasons and most of them had nothing to do with HDFS. One cause of lost blocks is a bug in a Hadoop component like Pig, particularly a new version. In one incident, a new version of Pig opened a lot of files without closing them, and created a lot of abandoned files. In the speaker&#8217;s anecdotal case study, none of the incidents of data loss were caused by HDFS proper. Other causes of data loss encountered were exhausting disk space, users hammering HDFS, and &#8220;other.&#8221; The speaker noted that NameNode high availability (a hot topic) would have only helped in 8 of the 36 incidents studied. Some ways of preventing data loss include resource allocation, selecting good tenants of a cluster, and fixing hardware errors quickly.</p>
<blockquote><p>If your job isn&#8217;t running, it&#8217;s not likely caused by HDFS.
</p></blockquote>
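<p>To get a feel for why a replication factor of 2 is so much riskier than 3, here is a back-of-the-envelope sketch. It is my own simplification, not something from the talk: it assumes each replica is lost independently with the same probability before the block can be re-replicated, and the failure probability and block count are made-up numbers.</p>

```python
def block_loss_probability(replica_failure_prob, replication_factor):
    """Chance that ALL replicas of one block are lost, assuming independence."""
    return replica_failure_prob ** replication_factor


def expected_lost_blocks(total_blocks, replica_failure_prob, replication_factor):
    """Expected number of blocks with no surviving replica."""
    return total_blocks * block_loss_probability(replica_failure_prob,
                                                 replication_factor)


# Hypothetical inputs: 1% chance of losing any one replica, 100 million blocks.
p = 0.01
blocks = 100000000

loss_r2 = expected_lost_blocks(blocks, p, 2)   # on the order of 10,000 blocks
loss_r3 = expected_lost_blocks(blocks, p, 3)   # on the order of 100 blocks
ratio = loss_r2 / loss_r3                      # roughly 1/p
```

<p>Under these assumptions, each extra replica cuts the expected loss by a factor of about 1/p. Real failures are correlated (a bad release, a full disk, a rack outage), which is exactly why the incidents in the talk mostly had causes outside HDFS.</p>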
<p>Bill Graham of <a href="http://www.cbsinteractive.com/">CBS Interactive</a> gave an interesting talk about using Hadoop to build a graph of users and content. Surprisingly, CBSi has quite a large arsenal of MapReduce-enabled technologies: <a href="http://incubator.apache.org/chukwa/">Chukwa</a>, Pig, Hive, HBase, <a href="http://www.cascading.org">Cascading</a>, <a href="http://www.cloudera.com/blog/2009/06/introducing-sqoop/">Sqoop</a> and <a href="http://yahoo.github.com/oozie/">Oozie</a>. CBSi uses only 100 nodes with 500 TB of disk space for processing data associated with 235 million uniques (individuals, roughly). Mapping users to content should be easy, right? Well, some users have multiple identities, including anonymous identities. The goal is to create a holistic graph that &#8220;matches&#8221; all of the identities efficiently for uses such as ad targeting. CBSi&#8217;s needs in a Hadoop platform: rapid experimentation and data mining, and to power new site features and ad optimization. The main vehicle for representing data is a Pig RDF that allows for a kind of graph based join so to speak. CBSi hopes to add Oozie, <a href="http://sna-projects.com/azkaban/">Azkaban</a>, HCatalog and <a href="http://blogs.apache.org/hama/">Hama</a> (graph processing) to its arsenal.</p>
<p><a href="http://www.mapr.com">MapR</a> was a very prominent sponsor of Hadoop Summit. M. C. Srivas presented a technical discussion of MapR&#8217;s capabilities and how it differs from Apache Hadoop. MapR is a full distribution of Hadoop and is 100% compatible with the Apache distribution and projects such as Pig and Oozie. MapR is fast and boasts high availability by rethinking the NameNode. The NameNode is a bottleneck since 60% of file operations are metadata. The NameNode and its limitations limit the size of a cluster. To resolve some problems with the NameNode, MapR turns every server into a metadata server. Since metadata is seldom retrieved, it is paged to disk so more RAM can be used for MapReduce proper. MapR distributes NameNode functionality and provides full random read and write semantics as well as export to NFS. With the distributed NameNode, runaway tasks no longer take down the NameNode. MapR has some lofty performance goals. While HDFS can handle 10-50PB, MapR can handle 1010 EB (exabytes). While HDFS can handle 2000 nodes in a cluster, MapR can handle 10,000 or more. It was mentioned at BigDataCamp that MapR does not rely on HDFS at all.</p>
<p>The final session I attended was Avery Ching&#8217;s talk on <em>Giraph: Large-scale Graph Processing on Hadoop. </em>Unfortunately, Avery jumped right into the technical details of <a href="https://github.com/aching/Giraph">Giraph</a> without giving a high level overview of the problem Giraph solves. Also, his slides were in 10 point font and I could not read them. Combine this with the fact that my brain was exhausted, so I wanted to head to the bar. Vanilla Hadoop incurs too much overhead for graph data processing. Yahoo used <a href="http://www.lam-mpi.org/">MPI</a> in the past for graph data but it had no fault tolerance and was too generic. Giraph is a library for iterative graph processing. Giraph is fault-tolerant and dynamic. Giraph takes a vertex centric approach to graph data. I found this interesting because most of my work is edge centric. Overall, Giraph is similar in goal to Pregel, but available to non-Googlers and has no single point of failure (except those incurred by Hadoop).</p>
<p>Now I have to catch my breath with some wine, beer and cheese at the nice happy hour reception afterwards. It was a long day, and a great day at Hadoop Summit 2011 and I will of course be back next year. I have no clue what is in store for me next year. Will the NameNode be removed as the single point of failure? Will other open-source software start integrating Hadoop? We shall see&#8230;</p>
<p>And now it is time to head back to Los Angeles.</p>
<p>Have a happy and safe Fourth of July!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/06/my-review-of-hadoop-summit-2011-hadoopsummit/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Big Data Camp 2011 #BigDataCamp</title>
		<link>http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/</link>
		<comments>http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/#comments</comments>
		<pubDate>Wed, 29 Jun 2011 06:35:45 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=833</guid>
		<description><![CDATA[<p>It has been a while since I have been to Silicon Valley, but Hadoop Summit gave me the opportunity to go. To make the most of the long trip, I also decided to check out BigDataCamp held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for the deluge of pouring rain in the end of June. The weather is one of the things that is preventing me from moving up to Silicon Valley.</p>
<p>The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.</p>
<p>Chris Wensel, the developer of Cascading, is also the founder of Concurrent, Inc. Cascading is an alternate API for Map-Reduce written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually infers Hadoop, but Cascading can run on other platforms including EMC Greenplum and the new MapR project. RazorFish and BestBuy use Cascading for behavioral targeting. Flightcaster uses a domain specific language (DSL) [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/hadoop.png" alt="" class="lfloatbox" />It has been a while since I have been to Silicon Valley, but <a href="http://developer.yahoo.com/events/hadoopsummit2011/">Hadoop Summit</a> gave me the opportunity to go. To make the most of the long trip, I also decided to check out <a href="http://www.bigdatacamp.org">BigDataCamp</a> held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for the deluge of pouring rain in the end of June. The weather is one of the things that is preventing me from moving up to Silicon Valley.</p>
<p>The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.</p>
<p><a href="http://chris.wensel.net/">Chris Wensel</a>, the developer of <a href="http://www.cascading.org">Cascading</a>, is also the founder of <a href="http://www.concurrentinc.com">Concurrent, Inc.</a> Cascading is an alternate API for Map-Reduce written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually implies Hadoop, but Cascading can run on other platforms including EMC Greenplum and the new <a href="http://www.mapr.com">MapR</a> project. <a href="http://www.razorfish.com">RazorFish</a> and <a href="http://www.bestbuy.com">BestBuy</a> use Cascading for behavioral targeting. <a href="http://www.flightcaster.com">Flightcaster</a> uses a domain specific language (DSL) written in <a href="http://clojure.org">Clojure</a> on top of Cascading for large data processing jobs. <a href="http://www.etsy.com">Etsy</a> uses a DSL written in JRuby as a layer on top of Cascading. Of course, the big player is <a href="http://www.backtype.com">BackType</a>. <a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html">Cascalog</a> combines Cascading with the Datalog language to provide a declarative language for working with data and map-reduce. Wensel noted that one disadvantage of Pig and Hive that Cascading addresses is that Pig and Hive lack a physical planner. Workflow managers such as <a href="http://scribefire-next/yahoo.github.com/oozie">Oozie</a> and <a href="http://sna-projects.com/azkaban/">Azkaban</a> can run Cascading jobs as part of a workflow. Version 2.0 of Cascading removes Hadoop as a dependency and will allow users to run Cascading jobs on data that is in RAM rather than on disk.</p>
<p>James Falgout from <a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CB8QFjAA&amp;url=http%3A%2F%2Fwww.pervasivedatarush.com%2F&amp;rct=j&amp;q=pervasive%20datarush&amp;ei=IWMLTuHPJ5Oitgfvi817&amp;usg=AFQjCNHJxwjEHZqwPkA-LLczcc9_5Q4-mw&amp;cad=rja">Pervasive DataRush</a> presented the second lightning talk. Pervasive&#8217;s products seem to use this &#8220;dataflow&#8221; paradigm that attempts to fill in features that are missing in map-reduce. The basic description compared dataflow to the Unix shell pipeline with message passing. James showed an example dataflow that a user could configure visually. Pervasive is working on integrating dataflow with Hive.</p>
<p><a href="http://www.quest.com/newsroom/Guy-Harrison.aspx">Guy Harrison</a> from <a href="http://www.quest.com">Quest Software</a> introduced their system Toad for Cloud Databases. Toad attempts to merge data from several different data sources for analysis such as Hive, <a href="http://www.mongodb.com">MongoDB</a>, and <a href="http://cassandra.apache.org/">Cassandra</a>. Unfortunately, Guy&#8217;s thick Australian accent made his humorous talk unintelligible to me (hearing loss?).</p>
<p>Steve Wooledge from <a href="http://www.asterdata.com/">AsterData</a> (now part of <a href="http://www.teradata.com">Teradata</a>) discussed the company&#8217;s product goal of taking a standard relational database system and integrating map-reduce on top of it. Such a system is flexible and allows both SQL-like access as well as programmatic access to data. This hybrid row-oriented and column-oriented datastore can be used for path and pattern matching, text processing and graph traversal, in addition to the usual tasks. nPath is a product that enhances a system with transactional data analytics (click analysis, sessionization).</p>
<p>Andrew Yu from <a href="http://www.emc.com/">EMC</a> presented some of EMC&#8217;s data analytics products. I wrote about EMC in an earlier blog post so I will spare the details. EMC offers a data warehouse product as well as a hybrid, pre-configured system containing its Greenplum warehouse and map-reduce built-in.</p>
<p>Ben Lee from <a href="http://www.foursquare.com">Foursquare</a> discussed how big data is used at Foursquare and gave some statistics about its service. This was by far the most interesting talk to me. Foursquare offers realtime suggestions of places to visit based on the user&#8217;s history, and the user&#8217;s friends&#8217; histories based on day of week and time of day. Foursquare has 10 million users, 50 million venues, and 750 million check-ins. There are over 3 million check-ins per day. 10,000 developers use Foursquare&#8217;s API. MongoDB is the main datastore and <a href="http://www.scala-lang.org">Scala</a> is used for the front end. Back end data processing uses Hadoop (both vanilla, and Streaming) as well as <a href="http://archive.cloudera.com/cdh/3/flume-0.9.1+1/UserGuide.html">Flume</a>, Elastic MapReduce, and S3. Ben displayed an awesome visualization of check-in data; researchers took check-ins from New York City and performed sentiment analysis on the text attached to the check-in. The visualization suggested that people were the &#8220;happiest&#8221; in Manhattan.</p>
<p>Paul Baclace introduced some software called Phatvis that allows developers to visualize map-reduce jobs. It is his hope that the visualization can be used to fine tune Hadoop parameters based on evidence from prior jobs. The source can be found <a href="http://www.assembla.com/spaces/phatvis">here</a>.</p>
<p>Of course, the fun in every &#8220;unconference&#8221; is the circus known as scheduling the sessions. Some of the proposed sessions:</p>
<ul>
<li>Big Data 101 / Intro to Hadoop</li>
<li>Extract MapReduce Data into Relational Database High Performance Database</li>
<li>&#8220;ETL was Yesterday&#8221; What&#8217;s next?</li>
<li>Operations of Hadoop Clusters</li>
<li>SQL / NoSQL Why not Both? (Aster)</li>
<li>Geodata</li>
<li>Big Data Retention / Compression</li>
<li>Business Intelligence and Hadoop</li>
<li>Data Management Lifecycle</li>
<li>Distributions of Hadoop</li>
<li>Hadoop for Bioinformatics and Healthcare</li>
</ul>
<p>The topics did not seem exciting this time, and seemed to have a lot of overlap with presentations at Hadoop Summit, but I found two (we could only attend two) that stood out.</p>
<p><strong>Session 1: Operating a Hadoop Cluster</strong></p>
<p>Thank goodness managing a Hadoop cluster is not in my job description (only small clusters I use for research). Charles Wimmer, the lead of the Operations track for Hadoop Summit, led this discussion, and much of it dovetailed off of incidents that occurred at Yahoo. A popular topic of discussion was backup. We agreed that there is no such thing as &#8220;backing up&#8221; a Hadoop cluster. Any data that is important should be replicated, preferably 3 times, or transmitted in parallel over a pipe to multiple data centers. One strict limitation of replication is that if some new release of Hadoop, or some new Hadoop distribution, contains a bug that corrupts the data, all replicas may also be corrupted.</p>
<p>Discussion then turned to hardware. Yahoo uses high-density storage nodes with 6 drives each containing 2-3TB of space. Charles mentioned that a common problem with Hadoop is that it is difficult to keep the CPUs busy, especially in a server with 8 Nehalem processors (8 CPUs or 8 cores?). The major reason for this is that the main bottleneck in map-reduce jobs is the network I/O required in the shuffle phase as data comes out of the mappers. The map phase is the most CPU bound phase. Wimmer, and several others, made one thing clear: <strong>use SATA, not SAS. </strong>Apparently SATA and SAS drives have similar read performance (I believe I misheard that) for practical purposes. The original Google map-reduce was based on commodity hardware and quantity is more important than quality (within reason). For this reason, SATA provides a lot more space for your data. The same amount of space is an order of magnitude more expensive for SAS drives.</p>
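<p>For readers who have not looked inside a map-reduce job, here is a tiny single-process sketch of the three phases using word count. The function names and the sample input are my own; in a real cluster the step labeled shuffle is where every mapper sends its pairs across the network to the reducer responsible for each key, which is why that phase tends to be network bound.</p>

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word. This is the CPU-heavy part."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values by key. On a cluster this moves data over the network."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse the list of values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

<p>Here everything lives in one Python process, so the shuffle is just a dictionary. The sketch only shows where the grouping happens, not how Hadoop partitions, sorts or spills the data.</p>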
<p>The next topic of discussion was the NameNode as the single point of failure. Apparently the <a href="http://www.mapr.com">MapR</a> system does not use the HDFS, and recovering from a lost NameNode is not as severe as it is for Hadoop. Hadoop 0.20.2 also supposedly introduces sharding, called NameNode federation, where the namespace is divided over several NameNodes.</p>
<p>Hadoop has some issues with certain types of scalability, particularly with the JobTracker. When a large job with a large number of mappers and reducers finishes quickly, the TaskTrackers send an influx of messages to the JobTracker and it gets overwhelmed. To prevent users from thrashing a cluster, use a capacity scheduler to put hard caps on queues. There was also some high level discussion of QoS-like functionality among users and sophisticated monitoring of jobs. Map-Reduce NextGen improves scalability by allocating a JobTracker to each individual job whose purpose is solely to monitor resource allocation. The biggest feature Charles would like to see is high availability NameNodes.</p>
<p>Yahoo boasts an impressive 22 clusters each containing between 400 and 4200 nodes. A fellow from AOL indicated that AOL has a cluster of size close to 1000. Is AOL coming back from the dead?</p>
<p>Thank goodness managing a Hadoop cluster is not in my job description&#8230;</p>
<p><strong>Session 2: Geodata</strong></p>
<p>I do not get the opportunity to work with geographical data often, so I was curious to see what these folks had to say. The discussion was led by a fellow named Brian from <a href="http://www.osgeo.org">OSGeo</a>. The largest point that I took away from this talk was that not an incredible amount of thought has been dedicated to Big Geodata, particularly how to store and process it. <a href="http://www.postgresql.com">PostgreSQL</a> and <a href="http://postgis.refractions.net/">PostGIS</a> are a few ways to store and analyze manageable amounts of data, but not large data. MongoDB is one solution but has its issues. A fellow from Foursquare mentioned that MongoDB cannot shard across geographic data, but I could not hear precisely what he said to that effect. The biggest challenge seems to be a lack of an indexer capable of indexing a large amount of geospatial data aside from the standard RTree implementation. I believe that geodata as well as streaming data and multimedia are some of the biggest unsolved problems in Big Data.</p>
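<p>As a point of comparison with the RTree, one simple alternative that came to my mind is a fixed grid: bucket each point by the cell it falls in, and answer &#8220;what is near me?&#8221; by looking only at the cell and its eight neighbors. The sketch below is my own illustration and was not presented in the session; <code>GridIndex</code>, the cell size and the sample coordinates are all made up.</p>

```python
import math
from collections import defaultdict


class GridIndex:
    """Hypothetical fixed-grid spatial index over (latitude, longitude) points."""

    def __init__(self, cell_size_degrees=0.01):
        self.cell_size = cell_size_degrees
        self._cells = defaultdict(list)   # (row, col) -> list of (name, lat, lon)

    def _cell(self, lat, lon):
        # Integer cell coordinates; floor keeps negative longitudes consistent.
        return (math.floor(lat / self.cell_size),
                math.floor(lon / self.cell_size))

    def insert(self, name, lat, lon):
        self._cells[self._cell(lat, lon)].append((name, lat, lon))

    def nearby(self, lat, lon):
        """Names of all points in the query cell and its 8 neighboring cells."""
        row, col = self._cell(lat, lon)
        found = []
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                found.extend(self._cells.get((row + d_row, col + d_col), []))
        return [name for name, _, _ in found]


index = GridIndex(cell_size_degrees=0.01)
index.insert("coffee shop", 40.7411, -73.9897)
index.insert("bakery", 40.7415, -73.9901)
index.insert("museum", 40.7794, -73.9632)

close_by = sorted(index.nearby(40.7412, -73.9899))
```

<p>The cell coordinates are plain integers, so in principle they could double as a partitioning key. The obvious weaknesses are skew (dense cities overload a few cells) and the fact that a fixed cell size in degrees does not correspond to a fixed distance on the ground.</p>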
<p>Anyways, on to Hadoop Summit!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/06/big-data-camp-2011-bigdatacamp/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Google &#8212; Is Search-by-Multimedia on the Way?</title>
		<link>http://www.bytemining.com/2011/06/google-is-search-by-multimedia-on-the-way/</link>
		<comments>http://www.bytemining.com/2011/06/google-is-search-by-multimedia-on-the-way/#comments</comments>
		<pubDate>Tue, 21 Jun 2011 17:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=819</guid>
		<description><![CDATA[<p>Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. Shazam allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:</p>

Music identification (&#8220;solved&#8221; &#8211; Shazam)
Music personalization and recommendation (&#8220;solved&#8221; &#8211; Pandora)
Identification of the source of a sound (e.g. a species of bird, a musical instrument, an inanimate object)
MP3 and media file search
Finding material that violates copyright

<p>As our motivating example, consider we find some really cool graphic on the web and we want to know where it likely originated (i.e. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. <a href="http://www.shazam.com">Shazam</a> allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:</p>
<ul>
<li>Music identification (&#8220;solved&#8221; &#8211; Shazam)</li>
<li>Music personalization and recommendation (&#8220;solved&#8221; &#8211; <a href="http://www.pandora.com">Pandora</a>)</li>
<li>Identification of the source of a sound (e.g. a species of bird, a musical instrument, an inanimate object)</li>
<li>MP3 and media file search</li>
<li>Finding material that violates copyright</li>
</ul>
<p>As our motivating example, consider we find some really cool graphic on the web and we want to know where it likely originated (i.e. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing [the campaign logo] someone&#8217;s picture), and semantically similar images (different photos of the same object or person). Wouldn&#8217;t this be cool? A billion-dollar idea, right?</p>
<p>Well, Google apparently beat me (and millions of others I&#8217;m sure) to it with its search-by-image feature on Google Images. I uploaded a photo of myself to see what I would get. We see my school website (where the image originated), as well as several other sites that use my <a href="http://www.gravatar.com">Gravatar</a>. Not too bad.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_res.png" alt="" /></p>
<p>On the results page, users can also provide some type of labeled data to Google. I am not exactly sure what it is used for yet, but note the text in the search bar: &#8220;Describe this image.&#8221; Upon entering my name, Google found another photo that looks almost identical to the first one &#8212; a <em>variation</em>.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_describe.png" alt="" /></p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim1.png" alt="" /></p>
<p>Below are the &#8220;visually related&#8221; images that were presented to me (before I labeled my photo in the search bar):</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim2.png" alt="" /></p>
<p>I see Steve Jobs (I am honored), but 7 out of 16 images are women, and of the men, we look nothing alike. I know, I know, &#8220;visually related&#8221; refers to similarity in pixels between images, but I expected more. In these images, we see a lot of red and blue hues.</p>
<p>Let&#8217;s try something that will generate many more hits: a popular meme&#8230;</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_res2.png" alt="" /></p>
<p>The image I uploaded was originally posted on Amazon S3, and is linked to by the above two web pages. <strong>Google does a much better job when using a URL rather than uploading an image for obvious reasons.</strong> More interestingly, the &#8220;visually similar&#8221; images show variations and modifications of the same image, based on pixel similarity.</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim31.png" alt="" /></p>
<p>And we also see web pages containing a copy of the image (not linked to the original S3 file):</p>
<p><img src="http://www.bytemining.com/wp-content/uploads/2011/06/visual_search_sim4.png" alt="" /></p>
<p><strong>But This Isn&#8217;t Good Enough Yet!</strong></p>
<p>Google &#8220;Search-by-Image&#8221; is an awesome first step, and I look forward to seeing more as it is undoubtedly coming. For search-by-image (or search-by-multimedia) to be useful, it must also take &#8220;semantic&#8221; or conceptual knowledge into account, just as text search does. That is, if I upload a photo of myself, I should get back other photos of myself from various (hopefully authorized) sources. Or, if we upload a photo of the Eiffel Tower, we should get back integrated search results containing other images of the Eiffel Tower as well as text results with information about the Eiffel Tower, and perhaps a tourist&#8217;s video or documentary.</p>
<p>One may at first believe that the <a href="http://knowyourmeme.com/memes/o-rly">O RLY</a> search used some semantic knowledge; however, all of the images share a large number of pixels and these images are likely just &#8220;visually similar&#8221; as stated. Using semantic knowledge, one may see results of other famous owls used in memes in addition to the variations and modifications of the O RLY owl.</p>
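<p>To make &#8220;visually similar&#8221; a little more concrete, here is a minimal sketch of one simple pixel-level comparison, an &#8220;average hash&#8221; compared by Hamming distance. This is only an illustration and almost certainly not what Google does; the function names and the tiny 4x4 grayscale grids are made up for the example.</p>

```python
# Hypothetical sketch of pixel-level similarity (NOT Google's method).
# Images are tiny made-up grayscale grids: 0 = black, 255 = white.

def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


original = [
    [200, 210, 40, 30],
    [205, 220, 35, 25],
    [50, 45, 230, 240],
    [40, 30, 235, 245],
]

# The same picture, uniformly brightened: a "variation" of the original.
brightened = [[min(255, value + 10) for value in row] for row in original]

# An unrelated checkerboard with a completely different layout.
unrelated = [
    [10, 250, 10, 250],
    [250, 10, 250, 10],
    [10, 250, 10, 250],
    [250, 10, 250, 10],
]

near = hamming_distance(average_hash(original), average_hash(brightened))
far = hamming_distance(average_hash(original), average_hash(unrelated))
print("variation distance:", near)
print("unrelated distance:", far)
```

<p>A small distance means the two images share most of their light/dark structure, which is roughly how a variation of the O RLY owl could match without the system knowing anything about owls. Nothing here captures semantics.</p>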
<p>All of the data collected by such a system would also provide a hell of a corpus for image and multimedia classification. Researchers could construct classifiers for detecting spammy multimedia, knockoff multimedia (second, third generation grain in images, waveform distortion in audio), pornographic content, as well as augmenting labeled and unlabeled multimedia with metadata. For example, suppose I take a picture of what I think is a rhododendron (inside joke for readers). With such a large corpus, I can upload the photo and have Google (or some other AwesomeSearch) retag the image as that of a hydrangea instead.</p>
<p>Uses of search-by-multimedia with semantic knowledge:</p>
<ul>
<li>Cross referencing objects or people on various different sites.</li>
<li>Product search when textual information (or QR code) is not known</li>
<li>Catching criminals</li>
<li>Cataloging media</li>
<li>Methods for multimedia spam detection</li>
<li>Geolocation without use of GPS or WiFi, and location search</li>
<li>Augmentation of metadata and tagging of objects, people, etc.</li>
<li>Detecting adult, inappropriate or illegal content.</li>
<li>Identification of actions from images, video or audio and retrieval of related information</li>
</ul>
<p>Of course, search-by-multimedia poses the same challenges that we face in big data today:</p>
<ul>
<li>choosing and boosting the proper features</li>
<li>collecting a significant and correctly labeled corpus</li>
<li>fast processing of large datasets with new and existing machine learning algorithms</li>
<li>efficient indexing and retrieval algorithms to match queries with probable results</li>
<li>these things are easier said than done, but a lot of fun.</li>
</ul>
<p>Search-by-multimedia is a very interesting concept and is exciting to think about. In this age of big data and technology, anything is possible. I look forward to the day when anything on the Internet can be found, no matter its content or medium.</p>
<p><em>To check out Google&#8217;s search-by-image, <a href="http://images.google.com">click here</a> and then click on the camera icon.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/06/google-is-search-by-multimedia-on-the-way/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Want to Build a Research Server?</title>
		<link>http://www.bytemining.com/2011/05/want-to-build-a-research-server-6/</link>
		<comments>http://www.bytemining.com/2011/05/want-to-build-a-research-server-6/#comments</comments>
		<pubDate>Tue, 31 May 2011 17:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=763</guid>
		<description><![CDATA[<p>I am usually pretty reserved with cash, but after working full-time for six months, I finally decided to spend some of my money on building a new research development server. This process was long overdue and the reason it took me so long to commit to this project was all of the new technology developed since building my last server. This &#8220;new technology&#8221; can be pretty confusing unless one specializes in computer architecture. I want to share what I have learned throughout this process, while giving some background. These are only my opinions, and I may be wrong on some things as I am not a hardware expert. I encourage you to read and learn more on your own.</p>
<p>The CPU/Processor</p>
<p>If you are reading this article, I probably do not need to explain what the CPU/processor does. For high performance computing, you will want to get a CPU that is very &#8220;fast&#8221; and also has multiple cores. The definition of the word &#8220;fast&#8221; is in the eyes of the beholder and typically refers to more than just clock speed (GHz). In the constant war between AMD and Intel, I stick with Intel. AMD processors are powerful, but they seem to have [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>I am usually pretty reserved with cash, but after working full-time for six months, I finally decided to spend some of my money on building a new research development server. This process was long overdue and the reason it took me so long to commit to this project was all of the new technology developed since building my last server. This &#8220;new technology&#8221; can be pretty confusing unless one specializes in computer architecture. I want to share what I have learned throughout this process, while giving some background. These are only my opinions, and I may be wrong on some things as I am not a hardware expert. I encourage you to read and learn more on your own.</p>
<p><strong>The CPU/Processor</strong></p>
<p>If you are reading this article, I probably do not need to explain what the CPU/processor does. For high performance computing, you will want to get a CPU that is very &#8220;fast&#8221; and also has multiple cores. The definition of the word &#8220;fast&#8221; is in the eye of the beholder and typically refers to more than just clock speed (GHz). In the constant war between AMD and Intel, I stick with Intel. AMD processors are powerful, but they seem to have more of a market with gamers. Intel is my preference, but I have not yet run into anyone who feels strongly about AMD for high-performance computing (HPC). There are two main processor lines under Intel: standard, and <a href="http://en.wikipedia.org/wiki/Intel_Xeon">Xeon</a>. Standard processors are your run of the mill CPUs that are found in consumer desktop machines. Xeon processors are designed for non-consumer server, workstation and embedded systems use. I do not consider researchers &#8220;consumers&#8221;; we are producers, so the Xeon family is better suited to our needs. On the other hand, you may find that a standard CPU will fit your needs for your particular research or use case. <a href="http://en.wikipedia.org/wiki/Xeon">Xeon processors typically have more cache and more multiprocessing capabilities</a>&#8230;and they are a lot more expensive. <em>For high-performance computing, I strongly suggest Intel Xeon</em>.</p>
<p>After months of research, I have concluded that multiple Intel Xeon processors are better than one <a href="http://www.intel.com/products/processor/corei7/index.htm">Intel Core i7</a>. As of the time of this writing, it seems that i7 processors cannot be doubled (or tripled etc.) up like Xeons can. Like AMD processors, the i7 seems to be favored by gamers and those needing a richer multimedia experience.</p>
<p>In 2011, most CPUs in new systems have multiple <a href="http://en.wikipedia.org/wiki/Multi-core_processor"><em>cores</em></a>. Each core can essentially run one process at a time, so a system with <em>n</em> cores can run <em>n</em> processes simultaneously. Many CPUs are <a href="http://en.wikipedia.org/wiki/Hyperthreading">hyperthreading</a> enabled, meaning that each core can actually run 2 threads simultaneously, bringing the total number of threads to 2<em>n</em>. But can&#8217;t the system already run multiple processes concurrently? We can run Firefox, TweetDeck, Thunderbird etc. concurrently, right? In practice, it <em>seems</em> that even a single core is processing multiple threads simultaneously. If we could slow down time to the micro level, we would see that the core works on one process at a time, then does a <em><a href="http://en.wikipedia.org/wiki/Context_switch">context switch</a></em> to another process. This rapid switching gives the illusion that the CPU is running multiple processes simultaneously.</p>
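<p>The <em>n</em> versus 2<em>n</em> arithmetic can be sketched in a few lines of Python. The helper <code>hardware_threads</code> is a made-up name for illustration, and the figure of six cores per chip is just an example value; <code>os.cpu_count()</code> is the standard library call that reports how many logical CPUs the operating system sees (it may return <code>None</code> if that cannot be determined).</p>

```python
import os


def hardware_threads(physical_cores, hyperthreading=True):
    """Threads that can truly run at once: n cores, or 2n with hyperthreading."""
    return physical_cores * 2 if hyperthreading else physical_cores


# Example figures only: one six-core chip, then two such chips on one board.
per_chip = hardware_threads(6)
dual_socket = 2 * per_chip
print("one chip:", per_chip, "threads; two chips:", dual_socket, "threads")

# What the operating system reports on the machine running this script.
# With hyperthreading enabled this is typically the logical (2n) count.
print("logical CPUs seen by the OS:", os.cpu_count())
```

<p>Anything beyond that count is handled by context switching, which is why a machine can still appear to run far more programs than it has cores.</p>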
<p>While Intel makes great products, its inventory is a nightmare to navigate. There are several things that you must know to ballpark a particular CPU model.</p>
<ul>
<li>the <em>model number</em> (the most reliable!)</li>
<li>the <em>brand name</em> specifies a group of CPU models satisfying similar use cases (Core [i3/i5/i7/i9], Core 2 Duo, Quad Core, Pentium, Xeon).</li>
<li>the <em>architecture/subarchitecture</em> &#8212; specifies a <em>type </em>of processor within a brand, each containing many series (Nehalem, Westmere, Sandy Bridge are common ones these days)</li>
<li>the <em>chipset</em> (not commonly referred to, examples: Tylersburg, Cougar Point, Panther Point)</li>
<li>the <em>platform</em>, which refers to a set of models (e.g. Harpertown, Jasper Forest, Gainestown, Prescott, Gulftown). Models within a platform are typically only differentiated by clock speed (GHz).</li>
<li>the <em>socket type</em> specifies the shape and size of the CPU. The CPU and the motherboard must have the same socket type (e.g. LGA1366, Socket 775)</li>
</ul>
<p>As if this is not confusing enough, each Intel Xeon model number is prefixed with a letter for different use cases. The letter distinguishes CPUs with differing <a href="http://en.wikipedia.org/wiki/CPU_power_dissipation">thermal design power (TDP)</a>. (<a href="http://www.tomshardware.com/forum/281376-28-what-difference-xeon-series-processors">source</a>)</p>
<ul>
<li><strong>W </strong>stands for &#8220;Workstation&#8221; and is meant to be installed in pairs. This designation does not seem very common anymore. They typically run the fastest (clock speed) and the hottest. They require significant cooling.</li>
<li><strong>E </strong>is &#8220;mainstream (rack mount)&#8221; and the standard model of CPU. Although it is &#8220;standard,&#8221; there is nothing wrong with it performance-wise, but it will run hot even when idle.</li>
<li><strong>X </strong>stands for &#8220;performance.&#8221; These CPUs are similar to E but provide extra overclocking capabilities and have lower idle power draw.</li>
<li><strong>L </strong>stands for &#8220;power optimized.&#8221; These are low voltage CPUs (60W or less) that are typically only used in data centers or rack servers. They typically do not come in the higher clock speeds.</li>
</ul>
<p>For the Intel Xeon, model numbers indicate what configuration it is compatible with on the motherboard (<a href="http://techtips.salon.com/features-intel-xeon-processor-26.html">source</a>):</p>
<ul>
<li>3xxx Xeons are designed to be used by themselves, as the only CPU on the motherboard.</li>
<li>5xxx Xeons are designed to be used in pairs; two CPUs on the motherboard.</li>
<li>7xxx Xeons are designed to be used in pairs, or in larger groups.</li>
</ul>
<p>The 2 CPUs that I purchased are model <a href="http://ark.intel.com/Product.aspx?id=48768">Intel Xeon E5645</a>. The Intel Xeon E5645 is part of the Gulftown platform of the Xeon family. It uses the Westmere subarchitecture, which is the 32 nm shrink of the Nehalem architecture spec, and connects to the system bus using socket LGA1366. (To make it more confusing, this is the same architecture used for the i7-9xx series.) The E means that it is a &#8220;mainstream&#8221; CPU. Since it is a 5000 model, it is installed with another identical CPU on the same board.</p>
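<p>The decoding I just did by hand can be sketched as a small lookup. The tables below simply restate the prefix letters and the 3xxx/5xxx/7xxx rule described above; the function name <code>describe_xeon</code> and its return format are made up for illustration, and real Intel model numbers have more cases than this covers.</p>

```python
# Meanings copied from the two lists in this post; everything else
# (names, return format) is hypothetical and for illustration only.
PREFIX_MEANING = {
    "W": "workstation",
    "E": "mainstream (rack mount)",
    "X": "performance",
    "L": "power optimized",
}

SERIES_CONFIGURATION = {
    "3": "single CPU on the motherboard",
    "5": "pair of CPUs on the motherboard",
    "7": "pairs or larger groups of CPUs",
}


def describe_xeon(model):
    """Split a model such as 'E5645' into its prefix letter and series digit."""
    model = model.strip().upper()
    prefix, digits = model[:1], model[1:]
    known = (
        prefix in PREFIX_MEANING
        and len(digits) == 4
        and digits.isdigit()
        and digits[0] in SERIES_CONFIGURATION
    )
    if not known:
        raise ValueError("model number not covered by this sketch: " + model)
    return {
        "prefix": PREFIX_MEANING[prefix],
        "series": digits[0] + "xxx",
        "configuration": SERIES_CONFIGURATION[digits[0]],
    }


print(describe_xeon("E5645"))
```

<p>For my E5645 this reports a mainstream CPU from the 5xxx series, meant to be installed as a pair.</p>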
<p>The <strong>number of cores is important</strong>. Most chips in current desktops contain 2 or 4 cores. Higher end systems and servers may have 6, 8 or 10 cores per chip. Xeons with 8 and 10 cores per unit debuted in Q2 of 2011 and are very expensive (about $2000 for 8 cores). They also require a brand new socket type (LGA1567), which means a new, expensive motherboard. A CPU with more cores allows an application to perform <em>several units of work at the same time; </em>these processors allow higher throughput.</p>
<p>The <strong>clock speed (GHz)</strong> used to be the deciding factor for most people, until clock speeds stopped climbing the way <a href="http://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s Law</a> had led many to expect. A higher clock speed may allow a single process to complete <em>faster</em>. Since games typically use a limited number of threads and require quick performance, a single i7 is a good choice. The i7 has multiple cores, and also has a very high clock speed.</p>
<p>The <strong>cache size and speed </strong>are also important. The cache allows very high speed access to memory locations that are frequently accessed by copying the data from RAM into the <a href="http://en.wikipedia.org/wiki/Cpu_cache">CPU cache</a>. Modern systems typically have three levels of cache: L1, L2 and L3. L1 cache is said to be the &#8220;closest&#8221; to the CPU, meaning the CPU queries the L1 cache first when performing a memory access. The L1 cache is the smallest. The L2 and L3 caches are accessed next in order, and L3 cache is larger than L2 cache. <em>Very simply put, CPUs with larger caches (especially L1) are better.</em></p>
<p>Newer processors report CPU throughput as <a href="http://en.wikipedia.org/wiki/GT/s">gigatransfers per second (GT/s)</a> which, like GHz, quantifies some measure of &#8220;speed.&#8221; Using GT/s, one can compute the number of bits the CPU can transfer per second as</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Cmbox%7BData%20Transfer%20Rate%7D%20%3D%20%5Cmbox%7BChannel%20Width%2C%20bits%2Ftransfer%7D%20%5Ctimes%20%5Cmbox%7Btransfers%2Fsec%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mbox{Data Transfer Rate} = \mbox{Channel Width, bits/transfer} \times \mbox{transfers/sec}' title='\mbox{Data Transfer Rate} = \mbox{Channel Width, bits/transfer} \times \mbox{transfers/sec}' class='latex' />
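<p>As a quick worked example of the formula, here is a sketch in Python. The numbers are assumptions chosen for illustration, a 16-bit channel width and a rate of 5.86 GT/s; check the spec sheet for your own CPU before relying on them.</p>

```python
def data_transfer_rate_bits(channel_width_bits, transfers_per_sec):
    """Data transfer rate = channel width (bits/transfer) x transfers/sec."""
    return channel_width_bits * transfers_per_sec


# Assumed example values, not taken from any particular spec sheet.
channel_width_bits = 16
transfers_per_sec = 5.86e9  # 5.86 GT/s

bits_per_sec = data_transfer_rate_bits(channel_width_bits, transfers_per_sec)
gigabytes_per_sec = bits_per_sec / 8 / 1e9  # 8 bits per byte, 10^9 bytes per GB

print("bits per second:", bits_per_sec)
print("gigabytes per second:", round(gigabytes_per_sec, 2))
```

<p>With these assumed values the link moves roughly 94 gigabits, or a little under 12 gigabytes, per second.</p>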
<p>Think of the cores vs. clock speed decision as a highway. Suppose the clock speed indicates the maximum speed limit on a single lane highway. A faster CPU corresponds to a single lane highway with a high speed limit. You will get to your destination faster. On the other hand, consider a one-lane vs. a two-lane highway, both with identical speed limits. If one lane is too busy for you, take the other lane. An increase in the number of cores increases the number of choices of lanes you can transition to. On the single-lane highway, you would need to slow down and wait for the cars in front of you to move forward. By switching lanes, you may get to your destination faster, or you may not, but more driving is completed overall.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/05/want-to-build-a-research-server-6/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Review of 2011 Data Scientist Summit</title>
		<link>http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/</link>
		<comments>http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/#comments</comments>
		<pubDate>Fri, 13 May 2011 23:34:53 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=661</guid>
		<description><![CDATA[<p></p>
<p>Some time over the past 6 weeks I randomly saw a tweet announcing the &#8220;Data Scientist Summit&#8221; and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the 5.5 hour voyage to Las Vegas.</p>








<p>The Pre-Party</p>
<p>The Venetian and all close hotels were booked, so I ended up at the Aria;  a new experience. The hotel is beautiful and very ritzy. I had heard  that the rooms were very technologically advanced but I wasn&#8217;t prepared  for the recorded welcome message, music and automatic shades opening  upon entry to the room. The Aria is a geek&#8217;s paradise. Everything is  computerized. Key cards are &#8220;waved&#8221; rather than swiped, lights are  turned on/off and dimmed by use case (&#8220;sleep&#8221;, &#8220;read&#8221; etc.), rather than  manually. There are no paper &#8220;Do Not Disturb&#8221; signs; rather, a switch  on the wall (or via TV) toggles an indicator light outside the door. And  the best part&#8230; Internet is FREE!</p>








The rhododendrons hydrangeas are real!
Work [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/emc_logo-e1305344955560.jpg" alt="" /></p>
<p>Some time over the past 6 weeks I randomly saw a tweet announcing the &#8220;<a href="http://www.datascientistsummit.com">Data Scientist Summit</a>&#8221; and shortly below it I saw that it would be held in Las Vegas at the <a href="http://www.venetian.com">Venetian</a>. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the <a href="http://maps.google.com/maps?f=d&amp;source=s_d&amp;saddr=Thousand+oaks&amp;daddr=34.29293,-118.85138+to:34.81727,-118.17031+to:las+vegas&amp;hl=en&amp;geocode=FcFmCQIdpq7q-CmRiChwViXogDEmLsxHAXoujQ%3BFcJECwIdzHjq-CnfPIC-8C3ogDEO4O_jNQ4Qww%3BFfZEEwIdOt30-ClvctUz-0bCgDHSpZcovRQoFw%3BFdYQJwIdMJoi-SnRffWkgre-gDGjebPV5tXMOg&amp;mra=dpe&amp;mrsp=1&amp;sz=9&amp;via=1,2&amp;sll=34.18227,-118.484802&amp;sspn=0.790695,2.90863&amp;ie=UTF8&amp;ll=34.445424,-118.210144&amp;spn=0.788221,2.90863&amp;t=h&amp;z=9">5.5 hour voyage to Las Vegas</a>.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/sign.jpg" alt="" width="250px" height="307px" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/logo.jpg" alt="" width="544px" height="307px" /></td>
</tr>
</tbody>
</table>
<p><strong>The Pre-Party</strong></p>
<p>The Venetian and all close hotels were booked, so I ended up at the <a href="http://www.arialasvegas.com/">Aria</a>;  a new experience. The hotel is beautiful and very ritzy. I had heard  that the rooms were very technologically advanced but I wasn&#8217;t prepared  for the recorded welcome message, music and automatic shades opening  upon entry to the room. The Aria is a geek&#8217;s paradise. <em>Everything </em>is  computerized. Key cards are &#8220;waved&#8221; rather than swiped, lights are  turned on/off and dimmed by use case (&#8220;sleep&#8221;, &#8220;read&#8221; etc.), rather than  manually. There are no paper &#8220;Do Not Disturb&#8221; signs; rather, a switch  on the wall (or via TV) toggles an indicator light outside the door. And  the best part&#8230; <em>Internet is FREE!</em></p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/rhodo-e1305328956151.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/geek1-e1305329382179.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/geek2-e1305329434813.jpg" alt="" /></td>
</tr>
<tr>
<td><span style="font-size: x-small;">The <span style="text-decoration: line-through;">rhododendrons</span> hydrangeas are real!</span></td>
<td><span style="font-size: x-small;">Work desk panel contains Ethernet, power, USB, VGA, audio.</span></td>
<td><span style="font-size: x-small;">Cables, provided you want to pay the minibar charge.</span></td>
</tr>
</tbody>
</table>
<p><strong>Data Scientist Summit, Day 1</strong></p>
<p>I arrived at the conference room and quickly took my seat. Seated in the close vicinity were several familiar faces. I also finally got a chance to meet <a href="http://www.drewconway.com/zia/">Drew Conway</a> (<a href="http://twitter.com/drewconway">@drewconway</a>) and David Smith (<a href="http://twitter.com/revodavid">@revodavid</a>), both of whom happened to sit in the row in front of me. The keynote by <a href="http://www.itleadershipacademy.com/Thornton_May.html">Thornton May</a> provided a lot of humor that kicked off a very energetic event. In the second session, we heard from data scientists and teams from <a href="http://bloom.io">Bloom Studios</a>, <a href="http://www.23andme.com">23andMe</a>, <a href="http://www.kaggle.com/">Kaggle</a> and <a href="http://www.google.com">Google</a>. I was happy to see somebody from Google present, as they never seem to attend these types of events (neither does Facebook).</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/Screen-shot-2011-05-13-at-4.39.59-PM-e1305330697869.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/23andme-e1305330883395.jpg" alt="" width="79" height="54" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/Kaggle_logo-e1305330739136.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/ps_logo2-e1305330788421.png" alt="" /></td>
</tr>
</tbody>
</table>
<p>There has been a lot of buzz about 23andMe and Kaggle in the past few months. It is hard to keep up with all of the buzz, so it was great to hear from the companies themselves. <a href="http://www.23andme.com">23andMe</a> provides users with a kit containing a test tube into which the user spits. The kit is then sent back to 23andMe labs, which analyze something like 500,000 to a million different markers (I am not a biologist) and can report which markers are present, such as a predisposition to diabetes or cancer. In 2011, it costs about $5,000 to do this analysis whereas 10 to 20 years ago the figure was in the millions. 23andMe goes a step further. They understand that genetics have a strong association with particular conditions, but that they are not necessarily causal. For example, someone with a predisposition to diabetes will not necessarily develop the disease. 23andMe wants to integrate other data into their models to help <em>predict</em> how likely a patient is to develop a certain condition, given their genetics.</p>
<p><a href="http://www.kaggle.com">Kaggle</a> is a community-based platform for individuals and organizations to submit datasets and open them up to the Data Science community for analysis&#8230;as a competition. I love the geekiness of this endeavor, and it continues where the <a href="http://www.netflixprize.com/">Netflix Prize</a> left off. Kaggle has some awesome prizes for winning the competition such as $3M for the <a href="http://www.heritagehealthprize.com/c/hhp">Heritage Health Prize</a>. There are other freebies as well, such as <a href="http://info.revolutionanalytics.com/Kaggle.html">Revolution R Enterprise free for competitors</a>.</p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/ucsbwave-e1305331068331.jpg" alt="" /> As a disclaimer, I am not a huge visualization guy. I see its importance and usefulness in educating end-users about statistical results, and there are quite a few infographics that are exciting to me. However, there are many times when a boring ol&#8217; boxplot works better than a <a href="http://www.processing.org">Processing</a> applet. So, it takes quite a bit to get me excited about cutting-edge graphics. The <em>Immersive Data Visualization </em>session by <a href="http://www.create.ucsb.edu/~musjkm/">Dr. JoAnn Kuchera-Morin</a> from <a href="http://www.ucsb.edu/">UC Santa Barbara</a> did exactly that. They have created a large metal sphere, called <a href="http://www.allosphere.ucsb.edu/">AlloSphere</a>, containing a bridge in the center where researchers/analysts stand. Their data is projected, for the eye, throughout the ball in a 3D, or 3D-like world. Of course, data can be represented several ways to the eye: color, size, shape, texture, etc. AlloSphere also represents data using the other senses, particularly sound. In her presentation, JoAnn took us on a 3D tour of her colleague&#8217;s brain (<a href="http://en.wikipedia.org/wiki/Fmri">fMRI</a>). Of course, we could &#8220;see&#8221; the inside of the brain, but we could also <em>hear </em>the blood pressure change in different parts of the brain, indicating differing activities. There were some other demonstrations of studies from physics, but I cannot comment on those because I lost interest (physics has always been my worst subject). I attended UC Santa Barbara for one year after high school, so I am particularly proud of what they have done.<br />
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/cal-e1305331523970.jpg" alt="" width="66" height="52" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/deloitte_logo-e1305331761441.jpg" alt="" width="129" height="24" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/oreilly-e1305331685411.jpg" alt="" /></td>
</tr>
</tbody>
</table>
<p>Of all the presentations on the first day, <em>Data Scientist DNA </em>was my favorite. In this panel, <a href="http://www.kaggle.com/pages/team">Anthony Goldbloom</a> of Kaggle, Joe Hellerstein from <a href="http://www.berkeley.edu">UC Berkeley</a>, David Steier from <a href="http://www.deloitte.com">Deloitte</a> and <a href="http://www.oreillynet.com/pub/au/2717">Roger Magoulas</a> from <a href="http://www.oreillynet.com">O&#8217;Reilly Media</a> discussed what makes a good Data Scientist or &#8220;data ninja&#8221; as stated in the program. All were in agreement that candidates should have an understanding of Probability and Statistics, although someone on the panel suggested that a &#8220;basic&#8221; background was all that was needed; I disagree with that. A Data Scientist should also be a proficient programmer in some language, either compiled or interpreted, and understand at least one statistical package. More importantly, the panel stressed that above and beyond knowledge, it is imperative that a Data Scientist be willing to learn new tools, technologies and languages on the job. Dr. Hellerstein suggested some general guidelines on classes students should take: Statistics (I argue for a full year of upper division statistics, and graduate study), Operating Systems, Database Systems and Distributed Computing. My favorite quote from the panel came from David Steier, &#8220;you don&#8217;t just hire a Data Scientist by themselves, you hire them onto a team.&#8221; I could not agree more. Finally, the moderator of the panel suggested that Roger Magoulas may have been the one to coin the term &#8220;big data&#8221; in 2005, but a Twitter follower found evidence that <a href="http://t.co/EqXnTRC">the term has been used since as early as 2000</a> (Thanks Amund! <a href="http://twitter.com/atveit">@atveit</a>).</p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/Code-for-America-e1305331916178.jpg" alt="" /> The last session of the day, titled <em>Imagining and Enabling a Better World</em>, was given by <a href="http://codeforamerica.org/author/jen/">Jennifer Pahlka</a> from <a href="http://codeforamerica.org">Code for America</a>. Pahlka started her talk by stating that the millennial generation is the most &#8220;pro-government&#8221; generation of the modern day. Regardless of politics, millennials see potential in the government and believe that it can be used for good. Jennifer compared <em>Code for America </em>to <em>Teach for America </em>for Data Scientists. The goal of Code for America is to put together very bright minds to tackle local, state and federal government issues using data. Pahlka brilliantly stated, &#8220;we don&#8217;t need guns, we need geeks. We are trying to create a geek army.&#8221;</p>
<p>During the end of day cocktail reception, I scored two posters of data visualizations: &#8220;super powers&#8221; and &#8220;game controllers over the years.&#8221; The other two posters offered were &#8220;beers&#8221; and &#8220;rappers.&#8221; I also had a chance to quickly meet <a href="http://oreilly.com/oreilly/tim_bio.html">Tim O&#8217;Reilly</a>, Founder and CEO of O&#8217;Reilly Media, whose books are my favorite for learning programming languages and technologies (the animal books).</p>
<p><strong>Data Scientist Summit, Day 2</strong></p>
<p>Personally, I enjoyed the second day more than the first day but that may have been due to the fact that I got sleep the night before. </p>
<p><img class="lfloatbox" src="http://www.bytemining.com/wp-content/uploads/2011/05/wefeelfine-e1305332076350.gif" alt="" /> It seemed that the highlight of the morning was the talk by <a href="http://www.number27.org/">Jonathan Harris</a> titled <em>The Art and Science of Storytelling</em>. He introduced his project &#8220;<a href="http://www.wefeelfine.org/">We Feel Fine</a>&#8221; which is a conglomeration of emotions. His project aims to capture the status of the human condition. This was more of the touchy-feely kind of presentation which is different from most of the Data Science talks. He showed beautiful user interfaces and great examples of fluid user experience. Some statistics that caught my eye regard human emotion over time. It seemed that people experienced loneliness earlier in the week than later in the week. Joy and sadness were approximately inversely related throughout the week and hours of the day, but I cannot remember the direction of the trends. The most interesting graphics involved the difference between &#8220;feeling fat&#8221; and &#8220;being fat.&#8221; States like California and New York were hot spots for &#8220;feeling fat&#8221;, but they are actually some of the skinniest states. Instead, the region between the Gulf of Mexico and the Great Lakes was actually the fattest, but did not feel that way. A graphic for &#8220;I feel sick&#8221; showed a hot spot in Nevada which I thought was very interesting (nuclear fallout? alcohol poisoning in Vegas?). The interesting part of this discussion was that it showed the vast geography of the field called Data Science. Some Data Scientists are more of the visualization and human connection variety, and others (where I consider myself) are more of the classic geeks that like to write code and dig into the data to get a noteworthy result. Well, I guess there isn&#8217;t much difference between both camps after all. As Jonathan would probably say, Data Science is about storytelling.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/linkedin-e1305333038887.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/mechanicalturk-e1305333062307.gif" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/factual-e1305330766477.png" alt="" width="54" height="56" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/booz_allen_hamilton-e1305332849283.jpg" alt="" width="138" height="16" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/karmasphere.jpg" alt="" /></td>
</tr>
</tbody>
</table>
<p>The next few sessions got a bit blurry (as is Data Science); they talked about various interconnected topics. <a href="http://www.datawrangling.com/about/">Pete Skomoroch</a> from <a href="http://www.linkedin.com">LinkedIn</a>, Sharon Franks Chiarella from <a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a>, <a href="http://www.crunchbase.com/person/gil-elbaz">Gil Elbaz</a> from <a href="http://www.factual.com">Factual</a> and <a href="http://www.oreillynet.com/pub/au/2972">Toby Segaran</a> from Google discussed the fact that you can&#8217;t turn data into a story without joining the data with, well, other data. Another major topic discussed was how to get labeled data, and this is where Mechanical Turk stands out as a data resource. The next talk was humorously titled <em>Hadoop &#8211; The Data Scientist&#8217;s Dream. </em>I know some people that would gouge their eyes out when seeing that title. Really, Map-Reduce is the Data Scientist&#8217;s dream, but yeah, yeah, I know, Hadoop is the first widely accepted implementation. Paul Brown from <a href="http://www.boozallen.com/">Booz Allen Hamilton</a> and Martin Hall from <a href="http://www.karmasphere.com">Karmasphere</a> discussed how Hadoop is typically being used in production and briefly mentioned how Hadoop&#8217;s cousins make the Hadoop ecosystem more powerful.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/SAS_logo-e1305333095334.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/informatica_logo-e1305333117665.jpg" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/cloudscale.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/revolution-e1305333148659.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/zementis-e1305333177380.gif" alt="" /></td>
</tr>
</tbody>
</table>
<p>The last session in this trifecta was titled <em>The Data Scientist&#8217;s Toolset &#8211; The Recipes that Win</em>. Representatives from various companies were panelists: <a href="http://www.sas.com">SAS</a>, <a href="http://www.informatica.com/Pages/index.aspx">Informatica</a>, <a href="http://www.cloudscale.com/">Cloudscale</a>, <a href="http://www.revolutionanalytics.com/">Revolution Analytics</a>, and <a href="http://www.zementis.com">Zementis</a>. I felt that this discussion was lacking. I believe the strength of the Data Science community stems from open-source technology, and except for Revolution Analytics, none of the companies have a strong reputation in the open-source community yet. Discussion seemed to focus too much on enterprise analytics (SQL, SAS, <a href="http://www.greenplum.com/">Greenplum</a> etc.) and Hadoop, and not enough on analysis and visualization. All in all, this panel was a bit too &#8220;enterprisey&#8221; for me. Some Twitterers felt that they were pushing their products too much. This was surprising because I felt the exact opposite, unless they were picking up on the &#8220;enterprisey&#8221; vibe. The panelists were asked what one tool for data science they would choose if they were on a desert island. The panelists responded with the following tools, &#8220;Perl, C++, Java, R [sic, thanks David], SQL and Python.&#8221; I was disappointed that SQL was mentioned without a countermention for NoSQL because not all data fits in a nice rectangle called a table. By itself, SQL is very limited. Python and R I definitely agree with. Perl is dated, but still has a use in the Data Scientist&#8217;s toolbox if the user is not familiar with Python, and doesn&#8217;t want to be. I was baffled by the C++ response and the lack of overlap in the other responses. But these are my opinions only.</p>
<p>The Summit Spotlight, <em>Secrets of Attribution &#8211; The Stories Beyond the Last-Click </em>discussed how researchers are trying to use data to &#8220;give credit&#8221; to not only the site that referred the user to a resource via a click, but all of the sites in the path that led to that click, the so-called &#8220;conversion path&#8221; in SEO land. The final session, <em>Building Data Science Firepower &#8211; Taking the Leap</em> was very similar to the <em>Data Scientist DNA </em>talk but added in some food for thought. There are two philosophies for hiring and working with Data Scientists. The first is to hire a strong data science team, and the second is to enhance each team with Data Scientists.</p>
<p><strong>2011 EMC Data Hero Awards</strong></p>
<p>At the end of the summit, the <a href="http://www.greenplum.com/media-center/big-data-use-cases/data-hero-awards">recipients of the EMC Data Hero Awards were announced</a>. I missed some of the honorable mentions, but here goes:</p>
<ul>
<li><em>Consumer Services, </em> LinkedIn.</li>
<li><em>Energy, </em>Silver Spring Networks.</li>
<li><em>Health Care, </em>Jeffrey Brenner, The Camden Coalition.</li>
<li><em>Life Sciences, </em>The Broad Institute of MIT and Harvard.</li>
<li><em>Media</em>, CMU Create Lab.</li>
<li><em>Public Services, </em>Global Virus Forecasting Initiative.</li>
<li><em>Technology Application, </em>IBM Watson Computing System.</li>
<li><em>Technology IT Infrastructure, </em>Apache Foundation, Hadoop.</li>
<li><em>Visionary, </em>Vivek Kundra, CIO, Data.gov.</li>
</ul>
<p>Vivek Kundra was not present at the summit, but recorded a message to the attendees, which was really cool. He stated that in 2009, there were only 47 government datasets publicly available; in 2011, there are close to 400,000 datasets available to the public.</p>
<table border="0">
<tbody>
<tr>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/logo_zynga-e1305335191505.png" alt="" /></td>
<td><img src="http://www.bytemining.com/wp-content/uploads/2011/05/tableauSoftware01-e1305335214926.jpg" alt="" /></td>
</tr>
</tbody>
</table>
<p>There were some interesting honorable mentions. <a href="http://www.zynga.com">Zynga</a> received an honorable mention for Consumer Services. As a player of <a href="http://www.farmville.com">Farmville</a> and <a href="http://www.cityville.com">Cityville</a>, I can see the plethora of data that Zynga must work with. Additionally, Zynga has some very creative ways of advertising for brands such as McDonald&#8217;s and Tostitos (with Farmville items for both companies), 7-11&#8217;s new Slurpee (<a href="http://blog.games.com/2011/01/07/play-cityville-and-zynga-will-unlock-goji-berries-in-farmville/">seeds for the Goji Berry</a>), and <a href="http://www.zynga.com/ladygaga/">GagaVille</a>. Zynga also participates in community service: &#8220;<a href="http://www.zynga.com/about/article.php?a=20100512">Sweet Seeds for Haiti</a>&#8221; (pay to plant special seeds, with proceeds to Haiti), just to name one.</p>
<p><a href="http://www.tableausoftware.com/">Tableau Software</a> also received an honorable mention for the Media category. Tableau develops data visualization software, and is picking up huge steam in the data viz community.</p>
<p>The conference ended with an awesome video created by EMC called &#8220;I Am a Data Scientist&#8221; featuring several <a href="http://www.emc.com">EMC</a> Data Scientists, most of whom I happened to have lunch with!</p>
<p><strong>Overall Impression</strong></p>
<p>All in all, the Data Scientist Summit was an eye-opening and empowering event, and it was only planned in six weeks. There was a great sense of community and collaboration among those in attendance. I work as a Data Scientist professionally because I love it. The one fact that I tend to overlook is that Data Scientists are in high demand and short supply. I was reminded of how important our work as Data Scientists is.</p>
<p>This was the <em>first annual </em>Data Scientist Summit, and I will no doubt be back. With that said, discussion of technical topics had a bit of an introductory flavor to it, which made the discussion of the technology seem dated. For example, &#8220;Vanilla&#8221; Hadoop was introduced as a tool for processing vast amounts of data. I would expect that most Data Scientists have worked with Hadoop, or at least know what it is. Hadoop is somewhat old news in terms of &#8220;cutting-edge technology.&#8221; Tools like <a href="http://pig.apache.org/">Pig</a>, <a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html">Cascalog</a>, <a href="http://hbase.apache.org/">HBase</a>, <a href="http://hive.apache.org/">Hive</a>, <a href="http://www.cascading.org/">Cascading</a>, etc. would have been better discussion topics. I was also disappointed with how little coverage of data mining tools there was (except for <a href="http://hadoop.apache.org/">Hadoop</a>, <a href="http://en.wikipedia.org/wiki/NoSQL">NoSQL</a>, and enterprise databases). It seemed as if <a href="http://www.r-project.org">R</a> had gone M.I.A. and I was surprised that there was so little discussion of visualization tools like Tableau, Processing, <a href="http://gephi.org/">Gephi</a>, <a href="http://mbostock.github.com/d3/">D3</a>, <a href="http://polymaps.org/">Polymaps</a>, etc.</p>
<p>The Data Scientist Summit set a very solid foundation for the future. I felt like the modus operandi was &#8220;here is why Data Science is cool&#8221; and &#8220;here is why others should be interested.&#8221; Although this is not a groundbreaking discussion, it sets the stage for future conferences and solidification of the community. The people that probably got the most value out of the <em>technical</em> discussions were people looking to switch careers, or enter Data Science.</p>
<p>Without a doubt I will be at next year&#8217;s Data Scientist Summit!</p>
<p><em>My thoughts and opinion on this blog do not reflect those of my employer, the Rubicon Project.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/05/review-of-2011-data-scientist-summit/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>EC2 Trials and Tribulations, Part 1 (Web Crawling)</title>
		<link>http://www.bytemining.com/2011/05/ec2-trials-and-tribulations-part-1-web-crawling/</link>
		<comments>http://www.bytemining.com/2011/05/ec2-trials-and-tribulations-part-1-web-crawling/#comments</comments>
		<pubDate>Wed, 11 May 2011 17:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=632</guid>
		<description><![CDATA[<p></p>
<p>Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user &#8220;boots&#8221; up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.</p>
<p>Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and &#8220;gotchas&#8221; that are important to keep in mind when using EC2, and in this post, with [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://www.bytemining.com/wp-content/uploads/2011/05/logo_aws.gif" alt="" /></p>
<p><a href="http://aws.amazon.com/ec2/">Elastic Compute Cloud (EC2)</a> is a service provided by <a href="http://aws.amazon.com/">Amazon Web Services</a> that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user &#8220;boots&#8221; up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.</p>
<p>Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with <a href="http://twill.idyll.org/">twill</a>. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and &#8220;gotchas&#8221; that are important to keep in mind when using EC2, and in this post, with using parallelism in your code to accomplish large tasks.</p>
<p><strong>Monitor your Instances</strong></p>
<p>Monitoring your instances has two important benefits. First, it lets you make sure that you are not maxing out resources on the machine. Second, EC2 is &#8220;elastic&#8221;: with some clever programming, you can boot up more machines if you notice resources becoming scarce on your current machines, and then decommission them later when they are not needed. I did not do this at first, and I ran into several issues.</p>
<p><em>Disk Space. </em>The concept of a &#8220;disk&#8221; is very confusing in EC2. The <a href="http://en.wikipedia.org/wiki/Amazon_Machine_Image">AMI</a> forms a disk, sort of. Above and beyond the OS and any other software and packages you may install as part of the AMI, you can use whatever free space is remaining to store output files. The total disk space used by the AMI seems to be configured at the moment the AMI is constructed. <strong>Thus, it is not a good idea to store files in the instance. </strong>I did this. Fortunately, I found out before it was too late that my &#8220;disk&#8221; was filling up. I wrote a <a href="http://en.wikipedia.org/wiki/Cron">cron job</a> to copy all of my output files to <span style="font-family: courier new,courier;">/mnt</span> every five minutes. <strong>Use <span style="font-family: courier new,courier;">/mnt</span> to store your files as it has lots and lots of space; HOWEVER, if you terminate your instance, the files are gone. The same is true of files stored within the instance itself.</strong> Once your job completes, upload your files to <a href="http://aws.amazon.com/s3/">S3</a>. <span style="font-family: courier new,courier;"><a href="http://s3tools.org/s3cmd">s3cmd</a></span> allows access to S3 from the command line, and <a href="https://github.com/pcorliss/s3cmd-modification">with the modification here</a>, you can upload and download files in parallel (a life saver for big batches). Another option is to create an <a href="http://aws.amazon.com/ebs/">EBS volume</a>, mount it, and write files directly to the EBS volume. EBS space is much more expensive than S3 space.</p>
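<p>As a rough illustration of watching for a filling disk, here is a minimal Python sketch built on the standard library&#8217;s <span style="font-family: courier new,courier;">shutil.disk_usage</span>. The function names and the 90% threshold are my own assumptions for the example, not part of the crawler described above.</p>
<pre class="brush: python; title: ; notranslate">
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def low_on_space(path="/", threshold_percent=90.0):
    """True when usage has reached the (hypothetical) warning threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # Check both the root "disk" and /mnt, and complain if either is nearly full.
    for mount in ("/", "/mnt"):
        try:
            print(mount, "%.1f%% used" % disk_usage_percent(mount),
                  "LOW" if low_on_space(mount) else "ok")
        except OSError:
            print(mount, "not available on this machine")
</pre>
<p>Something like this could run from the same cron job that copies the output files, so that a warning shows up in the logs well before the disk is actually full.</p>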
<p><em>Memory. </em>On my first attempt, I maxed out memory to the point that the OS killed 6 of my 8 processes. This dealt a huge blow to the performance of my crawler and meant that the extra money I spent on an extra large instance was wasted. Monitor your job&#8217;s memory using <span style="font-family: courier new,courier;">top</span>. If memory usage seems to grow too fast for your liking, consider using a <a href="http://stackoverflow.com/questions/110259/python-memory-profiler">memory profiler</a> to make sure that there are no memory leaks in your code. I have found that long running Python processes eat up a lot of RAM, even if no data structures are explicitly growing.</p>
<p>Additionally, maxing out RAM means that the disk will begin to swap. This is devastating to performance because this extra grinding of the disk decreases the total I/O throughput your job can handle. This is crucial for crawlers as files need to be written to disk quickly.</p>
<p>If after profiling you find that your job is still using too much RAM, consider caching, or using a high memory EC2 instance.</p>
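<p>Besides watching <span style="font-family: courier new,courier;">top</span>, a process can report its own peak memory. The following is a minimal sketch using the standard <span style="font-family: courier new,courier;">resource</span> module (Unix only); the function name is hypothetical, and the unit handling reflects the fact that Linux reports <span style="font-family: courier new,courier;">ru_maxrss</span> in kilobytes while Mac OS X reports bytes.</p>
<pre class="brush: python; title: ; notranslate">
import resource
import sys


def peak_memory_mb():
    """Peak resident set size of the current process, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; Mac OS X reports bytes.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


if __name__ == "__main__":
    print("peak memory so far: %.1f MB" % peak_memory_mb())
</pre>
<p>Writing this number to the log every few hundred pages makes slow memory growth visible long before the OS starts killing processes.</p>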
<p><em>I/O Throughput. </em>How fast your job consumes and produces data is a good way to determine if something is going amiss in your job, or with the other resources you are using. When I started my crawling job, I was crawling <em>n</em> pages per hour, but after twelve hours, the throughput decreased exponentially until it got so slow that I had to add more instances. <strong>One way to monitor throughput is to save the results of <span style="font-family: courier new,courier;">ls -latr --full-time</span> to disk and extract the date/time of each file. Using a tool like R, you can quickly plot your I/O throughput over time using <span style="font-family: courier new,courier;">aggregate()</span>. </strong>A decrease in I/O throughput can be the result of many things: 1) swapping from exhausting RAM, 2) low disk space, 3) network congestion within AWS, 4) poor resource performance (if crawling, the resource would be the website being crawled), 5) hammering an external resource and/or HTTP throttling, 6) congestion in the Internet. <strong>For crawling, you may want to consider using several smaller instances rather than fewer larger instances. This way, you will be accessing the resource from many IPs and the effect of being throttled should be lessened. Additionally, use instances that have &#8220;High&#8221; I/O performance; some are rated &#8220;Moderate&#8221; or &#8220;Low.&#8221;</strong></p>
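<p>The same idea can be sketched in Python instead of R: read each output file&#8217;s modification time and count files per hour. This is only an illustration; the function name and the directory passed to it are assumptions, and it presumes one output file per crawled page.</p>
<pre class="brush: python; title: ; notranslate">
import os
from collections import Counter
from datetime import datetime


def files_per_hour(directory):
    """Count the files in `directory`, grouped by the hour they were last modified."""
    counts = Counter()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                modified = datetime.fromtimestamp(entry.stat().st_mtime)
                counts[modified.strftime("%Y-%m-%d %H:00")] += 1
    # Sorted by hour, ready to print or plot.
    return sorted(counts.items())


if __name__ == "__main__":
    for hour, count in files_per_hour("."):
        print(hour, count)
</pre>
<p>A steadily shrinking count from one hour to the next is the signal to start checking the list of causes above.</p>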
<p><em>CPU. </em>A general rule of thumb is that you can run <em>n </em>processes in parallel, for <em>n </em>cores. Additionally, if each core supports <a href="http://en.wikipedia.org/wiki/Hyper-threading">hyperthreading</a>, then the number of processes you can run is approximately <em>2n</em>. If you run more than the suggested number, the price of context switching can slow down your performance. If you find the need to routinely exceed this guideline, use an instance with more cores.</p>
<p>When running parallel code, routinely do a <span style="font-family: courier new,courier;">ps aux | grep processName</span> to make sure the correct number of processes is running. If any were killed, this will be noted in the system log, typically <span style="font-family: courier new,courier;">/var/log/syslog</span> or <span style="font-family: courier new,courier;">/var/log/messages</span> depending on the distribution, with a reason.</p>
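<p>This check is easy to automate. The sketch below shells out to <span style="font-family: courier new,courier;">ps</span> and counts the command lines that mention a given script name; the function names and the script name <span style="font-family: courier new,courier;">crawler.py</span> are hypothetical. Unlike the <span style="font-family: courier new,courier;">grep</span> pipeline, it does not count the search itself as a match.</p>
<pre class="brush: python; title: ; notranslate">
import subprocess


def count_matching(ps_output, name):
    """Count the lines of `ps` output whose command line mentions `name`."""
    return sum(1 for line in ps_output.splitlines() if name in line)


def running_workers(name):
    """Run `ps` and count the processes whose command line contains `name`."""
    result = subprocess.run(
        ["ps", "-e", "-o", "args="],  # one command line per process, no header
        capture_output=True, text=True, check=True,
    )
    return count_matching(result.stdout, name)


def missing_workers(name, expected):
    """How many of the expected worker processes are no longer running."""
    return max(0, expected - running_workers(name))


if __name__ == "__main__":
    # Hypothetical: eight crawler processes were started on this instance.
    print("missing workers:", missing_workers("crawler.py", 8))
</pre>
<p>Run from cron, a nonzero result is a cue to look in the system log for the reason a process died.</p>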
<p><em>Financial metrics. </em>Are you getting your money&#8217;s worth? Are you really using all of the cores you are paying for? Are you really using all of the memory you are paying for? This is up to you and your budget to dictate. But do not get carried away and assume that you must stay with the same instance size. Most AMIs can run on different instance types (except that 64-bit AMIs are restricted to m1.large and bigger).</p>
<p><strong>Quarantine Essential Services</strong></p>
<p>My crawler used <a href="http://redis.io/">Redis</a> as a work queue. Each process could easily write new thread IDs and page numbers to the queue as they are discovered, and read thread IDs and page numbers from the queue as each process is ready to crawl a page. One problem that I faced was that I coupled the crawling operation and queue management in the same script, and ran the Redis server on an instance where a crawler was also running. This coupling posed two challenges. First, it can harbor nasty bugs. Whenever a process was created on the master Redis node, my code would wipe the Redis queue clean to prepare it for crawling (bug!). <strong>Flushing the queue and the initial population of the queue should have taken place in separate scripts, apart from the crawler itself. </strong>Due to my major bug, I wiped the entire queue clean in the middle of the crawl. Fortunately, I followed the advice in the next section.</p>
<p>Second, I had to be careful that my processes did not exceed RAM limitations. Because Redis is mainly an in-memory key-value store, it itself can hog up most of the RAM in the instance. <strong>For this reason, it is best to quarantine essential services such as queues to their own instances.</strong></p>
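<p>One way to keep queue administration away from the crawler is sketched below. It is not the code I ran: the key name, the function names and the <span style="font-family: courier new,courier;">confirm</span> guard are assumptions for illustration. The <span style="font-family: courier new,courier;">client</span> argument is expected to behave like a <span style="font-family: courier new,courier;">redis-py</span> connection, of which only <span style="font-family: courier new,courier;">delete</span>, <span style="font-family: courier new,courier;">rpush</span> and <span style="font-family: courier new,courier;">lpop</span> are used.</p>
<pre class="brush: python; title: ; notranslate">
QUEUE_KEY = "crawl:queue"  # hypothetical key name


def reset_queue(client, confirm=False):
    """Delete the work queue, but only when the caller explicitly confirms."""
    if not confirm:
        raise RuntimeError("refusing to flush the queue without confirm=True")
    client.delete(QUEUE_KEY)


def populate_queue(client, work_items):
    """Append (thread_id, page) pairs to the queue as 'thread_id:page' strings."""
    count = 0
    for thread_id, page in work_items:
        client.rpush(QUEUE_KEY, "%s:%s" % (thread_id, page))
        count += 1
    return count


def next_work_item(client):
    """Pop the next item for a crawler process; None when the queue is empty."""
    raw = client.lpop(QUEUE_KEY)
    if raw is None:
        return None
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    thread_id, page = raw.rsplit(":", 1)
    return thread_id, int(page)
</pre>
<p>With this split, <span style="font-family: courier new,courier;">reset_queue</span> and <span style="font-family: courier new,courier;">populate_queue</span> would live in an administrative script that is run by hand once, while the crawler processes only ever call <span style="font-family: courier new,courier;">next_work_item</span> and so cannot wipe the queue by accident.</p>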
<p><strong>Document Everything</strong></p>
<p>Log <em>everything</em>. Log every resource you are going to use (URLs for a crawl) and log everything that was done and any problems that arise. Using the directory structure (ls) as well as a log of what work was already performed, I was able to reconstruct and repopulate the work queue and essentially start where I left off. For my crawling operation, I wrote the following events to logs, each with a timestamp.</p>
<ul>
<li>Starting the crawl.</li>
<li>Logging in to the site being crawled.</li>
<li>Clearing and populating the queue.</li>
<li>Visiting a thread&#8217;s first page.</li>
<li>Discovering the number of pages of posts in the thread/inserting to the queue.</li>
<li>HTTP redirects, when a thread has been moved.</li>
<li>Visiting a thread ID that does not exist.</li>
<li>Inadvertent logouts, marking work to be redone.</li>
<li>Queuing inconsistencies.</li>
</ul>
<p>An <em>activity log </em>verbosely documented everything that occurred without logging actual data. An <em>inventory manifest </em>indicated which URLs/forum posts had valid content and how many pages of content were associated with them. A standard <em>directory listing </em>indicated what work had been done. By cross referencing the manifest and the directory listing, it is easy to see which posts had not yet been processed. A <em>system log </em>prepared by the operating system also documents critical failures for you, such as lack of disk space or processes being killed.</p>
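<p>The cross-referencing step can be sketched in a few lines of Python. The manifest format (thread ID mapped to page count) and the <span style="font-family: courier new,courier;">threadid-page.html</span> file naming scheme are assumptions made for this example, not necessarily what my crawler used.</p>
<pre class="brush: python; title: ; notranslate">
import os


def expected_files(manifest):
    """manifest maps thread ID to page count; return every expected file name."""
    names = set()
    for thread_id, pages in manifest.items():
        for page in range(1, pages + 1):
            names.add("%s-%d.html" % (thread_id, page))  # hypothetical naming scheme
    return names


def remaining_work(manifest, output_dir):
    """File names the manifest says should exist but the directory does not contain."""
    done = set(os.listdir(output_dir))
    return sorted(expected_files(manifest) - done)
</pre>
<p>Whatever <span style="font-family: courier new,courier;">remaining_work</span> returns is exactly the work that needs to go back into the queue after a crash.</p>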
<p>When writing your logs, use the advice in the next section!</p>
<p><strong>Take Care Not to Clobber Files and Objects</strong></p>
<p>It&#8217;s been said over and over again. Each process should hold as much of its own real estate as possible. When two or more processes write to the same object, corruption can occur unless there is a locking mechanism in place. If two processes write to the same file at the same time, you will notice garbled entries in your logs. This did not affect my crawled data because each file was written by a single process. The same can be true for reading data as well. When spawning multiple processes, I shared the same Redis connection with all of the processes. If two processes read from the queue at the same time, one process would get the correct data (a thread ID and page number) and the other would get &#8220;OK&#8221;, which was the result of the first process&#8217; fetch operation. This is mostly my fault, but partially <a href="https://github.com/andymccurdy/redis-py"><span style="font-family: courier new,courier;">redis-py</span></a>&#8217;s fault for filling some buffer between Python and Redis with meaningless information (&#8220;OK&#8221;).</p>
<p>Each process should write its own log files. When opening a file, you can use the following:</p>
<pre class="brush: python; title: ; notranslate">
import os

# Put the process ID in the file name so that no two processes share a log file.
OUT = open(&quot;mylogfile-%d.log&quot; % os.getpid(), &quot;w&quot;)
...
OUT.close()
</pre>
<p><strong>Crawler Specific: Set an Upper Bound</strong></p>
<p>Crawling is fun, but you must practice moderation or it is easy to attempt to boil the ocean. When I first started, I would run a crawl, have it crash, and then deem the data out of date and start over from the beginning and crawl until there was nothing possible left to crawl. It is good to set an upper bound: &#8220;I will crawl 10 days&#8217; worth of data&#8221;, or &#8220;I will only use threads created prior to May 1, 2011.&#8221;</p>
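<p>An upper bound like the second one is simple to enforce in code before anything reaches the work queue. This is a minimal sketch; the function names and the shape of the input (pairs of thread ID and creation date) are assumptions for illustration.</p>
<pre class="brush: python; title: ; notranslate">
from datetime import date

CUTOFF = date(2011, 5, 1)  # "threads created prior to May 1, 2011"


def within_bound(created, cutoff=CUTOFF):
    """True when a thread was created strictly before the cutoff date."""
    return cutoff > created


def select_threads(threads, cutoff=CUTOFF, limit=None):
    """Keep the IDs of in-bound threads, optionally capped at `limit` threads.

    `threads` is an iterable of (thread_id, created_date) pairs.
    """
    kept = [tid for tid, created in threads if within_bound(created, cutoff)]
    return kept if limit is None else kept[:limit]
</pre>
<p>Fixing the cutoff once, at the start, also means a crashed crawl can be resumed against the same bound instead of being declared out of date and restarted.</p>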
<p>One of the keys to success with EC2 is to get over the penny pinching. If you have a project, just take the plunge and do it on EC2 (if required). The amount you spend on the first few projects will save you more on future projects.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/05/ec2-trials-and-tribulations-part-1-web-crawling/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Location Tracking on Android, too!</title>
		<link>http://www.bytemining.com/2011/04/location-tracking-on-android-too/</link>
		<comments>http://www.bytemining.com/2011/04/location-tracking-on-android-too/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 19:37:26 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Databases and Datastores]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=622</guid>
		<description><![CDATA[<p>This week it was revealed that the iPhone stores users&#8217; locations, and this immediately caused a huge firestorm of commentary by tech geeks, panic among privacy advocates, and delight to data geeks like myself. Even better/worse, it seems that the iPhone caches location traces long-term, possibly back to the date the phone was activated.</p>
<p>I ditched my iPhone this past December (good riddance) in favor of the Droid X (Android). I figured, on such an open source OS, Google must be doing the same thing. After surfing through Hacker News, it turns out I was right.</p>
<p>Compared to the iPhone though, getting the data on an Android phone is not simple.</p>

The data is stored in two files, cache.cell and cache.wifi in the directory /data/data/com.google.android.location/files.
First, the user cannot browse this directory by attaching it to a computer. I installed an SSH daemon QuickSSHD to allow remote access into my phone.&#160;
Second, it is not possible to access this directory without getting a Permission denied error, even if logged in as &#8220;root&#8221; as Google has not made this directory readable.
Finally, for those (myself) that are still determined to crack this nut, you will need to root your phone. This makes the &#8220;root&#8221; user a real [...]]]></description>
				<content:encoded><![CDATA[<p>This week <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">it was revealed that the iPhone stores users&#8217; locations</a>, and this immediately caused a huge firestorm of commentary by tech geeks, panic among privacy advocates, and delight to data geeks like myself. Even better/worse, it seems that the iPhone caches location traces long-term, possibly back to the date the phone was activated.</p>
<p>I ditched my iPhone this past December (good riddance) in favor of the Droid X (Android). I figured, on such an open source OS, Google must be doing the same thing. After surfing through Hacker News, it turns out I was right.</p>
<p>Compared to the iPhone though, getting the data on an Android phone is not simple.</p>
<ul>
<li>The data is stored in two files, <tt>cache.cell</tt> and <tt>cache.wifi</tt> in the directory <tt>/data/data/com.google.android.location/files.</tt></li>
<li>First, the user cannot browse this directory by attaching it to a computer. I installed an SSH daemon <a href="http://teslacoilsw.com/quicksshd">QuickSSHD</a> to allow remote access into my phone.&nbsp;</li>
<li>Second, it is not possible to access this directory without getting a <tt>Permission denied</tt> error, even if logged in as &#8220;root&#8221; as Google has not made this directory readable.</li>
<li>Finally, for those (myself) that are still determined to crack this nut, you will need to root your phone. This makes the &#8220;root&#8221; user a real superuser that has near complete control over the phone.</li>
</ul>
<p>Once I downloaded the files to my Mac (via <tt>scp</tt>), I downloaded this handy-dandy parser from <a href="http://twitter.com/#!/packetlss"><tt>packetlss</tt></a> called <a href="https://github.com/packetlss/android-locdump"><tt>android-locdump</tt></a> and converted the <tt>cache.cell</tt> and <tt>cache.wifi</tt> files into GPX files by passing the <tt>--gpx</tt> flag. You can also leave off the <tt>--gpx</tt> flag and parse the output yourself.</p>
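<p>Since GPX is plain XML, the coordinates can also be pulled out with a few lines of Python. This is only a sketch: it assumes each fix is stored as a <tt>wpt</tt> or <tt>trkpt</tt> element carrying <tt>lat</tt> and <tt>lon</tt> attributes (as in standard GPX), the helper name <tt>gpx_points</tt> is made up for illustration, and it has not been checked against the exact output of <tt>android-locdump</tt>.</p>

```python
import sys
import xml.etree.ElementTree as ET


def gpx_points(root):
    """Return a list of (lat, lon) floats for every wpt/trkpt element."""
    points = []
    for elem in root.iter():
        # Tags look like "{namespace}wpt"; drop the namespace if present.
        name = elem.tag.rsplit("}", 1)[-1]
        if name in ("wpt", "trkpt"):
            points.append((float(elem.get("lat")), float(elem.get("lon"))))
    return points


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python gpx_points.py some_file.gpx
    for lat, lon in gpx_points(ET.parse(sys.argv[1]).getroot()):
        print("%.6f,%.6f" % (lat, lon))
```

<p>Run against a GPX file, this prints one <tt>latitude,longitude</tt> pair per line.</p>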
<p>Then I used <a href="http://www.gpsbabel.org/">GPSBabel</a> to convert the GPX files to CSV files and loaded them into R. While this was great for a static view, the lack of interactive zooming makes working with this type of data more difficult. I then used some code from the <a href="http://cran.r-project.org/web/packages/RgoogleMaps/index.html"><tt>RgoogleMaps</tt></a> <a href="http://cran.r-project.org/web/packages/RgoogleMaps/vignettes/RgoogleMaps-intro.pdf">package vignette</a>, as adapted by <a href="http://malecki.blogspot.com/2011/04/quick-iphone-location-data.html">Michael Malecki</a>. [Drew Conway has developed <a href="https://github.com/drewconway/stalkR">stalkR</a> for analyzing iPhone and iPad location data in R.]</p>
<pre class="brush: r; title: ; notranslate">
library(RgoogleMaps)
Df &lt;- read.csv(&quot;CSV file&quot;, header=FALSE)
names(Df) &lt;- c(&quot;Latitude&quot;, &quot;Longitude&quot;, &quot;Key&quot;)
bb &lt;- qbbox(lat=range(Df$Latitude), lon=range(Df$Longitude))
m &lt;- c(mean(Df$Latitude), mean(Df$Longitude))
zoom &lt;- min(MaxZoom(latrange=bb$latR,lonrange=bb$lonR))
Map &lt;- GetMap.bbox(bb$lonR, bb$latR, zoom=zoom, maptype=&quot;mobile&quot;,
NEWMAP=TRUE, destfile=&quot;tempmap.jpg&quot;, RETURNIMAGE=TRUE, GRAYSCALE=TRUE)
tmp &lt;- PlotOnStaticMap(lat=Df$Latitude, lon=Df$Longitude,
cex=.7,pch=20,col=&quot;red&quot;, MyMap=Map, NEWMAP=FALSE)
</pre>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2011/04/wla.png" alt="" width="600" height="450" /></p>
<p>The map clusters my activity into a few familiar categories: work, school (Math Sciences Building actually), home, and my parents&#8217;. Android also picked up a dinner outing in Santa Monica, and a trip to the Shopzilla office for the <a href="http://www.meetup.com/LA-HUG/">Los Angeles Hadoop User Group</a> meetup, but little else.</p>
<p><strong>What I Found</strong></p>
<p>The <tt>cache.cell</tt> file uses cell tower triangulation to locate the user. In addition to this imprecise measure, Android&#8217;s location tracker has several limitations:</p>
<ol>
<li>It seems that location is recorded infrequently. I had expected to see trails of activity corresponding to walking or driving. All of my activity is clustered in areas where I am most likely stopped (on campus, at work, at home, in Santa Monica, and at the intersection of Gayley and Wilshire which has an excruciatingly painful wait). <strong>The iPhone location history seems to be much more complete/useful.</strong></li>
<li>According to the old Android source, only the last 50 cell locations, and last 200 WiFi locations are recorded (boring). My phone seemed to record more than 50 cell locations (approximately 200), but this is small.</li>
<li>I couldn&#8217;t even convert the <tt>cache.wifi</tt> file because it was apparently empty. This file is apparently cleared when WiFi is disabled.</li>
</ol>
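<p>The clustering described in the first item can be checked numerically by counting fixes per grid cell. The sketch below rounds each coordinate to three decimal places (very roughly a 100 m cell in latitude); the function name <tt>busiest_cells</tt> and the sample coordinates are made up for illustration and are not data from my phone.</p>

```python
from collections import Counter


def busiest_cells(points, decimals=3, top=5):
    """Count fixes per grid cell, where a cell is a rounded (lat, lon) pair."""
    cells = Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )
    return cells.most_common(top)


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    fixes = [(34.0001, -118.0001), (34.0002, -118.0002), (34.5004, -118.5004)]
    for cell, count in busiest_cells(fixes):
        print("%s %d" % (cell, count))
```

<p>A handful of cells holding nearly all of the fixes is what &#8220;recorded mostly while stopped&#8221; looks like in the data.</p>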
<p>I also found that I need to get out more.</p>
<p><strong>Why Would Apple do Such a Thing?</strong></p>
<p>Earlier iPhone models (up to 2010 apparently) used <a href="http://www.skyhookwireless.com/">Skyhook</a> for their geo-location database. Skyhook employees basically drive cars wired with WiFi sensors and GPS and do what is called &#8220;<a href="http://en.wikipedia.org/wiki/Wardriving">wardriving</a>.&#8221; They drive around cities recording information about the access points they encounter and where they encounter them. When a user logs onto the web via one of those access points, Skyhook customer sites can cross-reference the access point with its physical location. As of August 2010, Apple dropped Skyhook. Why?</p>
<p>I suspect Apple is using this data to build its own geo-location database, yet there is no evidence that the files on the iPhone are actually being transmitted to Apple. If it is true that the location database is actually transmitted to the user&#8217;s computer, it&#8217;s possible that Apple uses this data from Safari to enable geo-location features in it.</p>
<p>The investigative side of me says that this could be useful in a missing persons case if the phone is dropped.</p>
<p><strong>Android or iPhone?</strong></p>
<p>Apple and Google pursued different approaches in caching users&#8217; locations. Apple used a standard database file stored on the phone. Although this file is hidden in the phone, it seems to be transmitted to the user&#8217;s computer. The user can then open the file and see what Apple is storing about them. Heck, they could even modify it to privatize it. The iPhone updates this information very frequently, and keeps it around for a very long time. The file is there, the user knows it is there, and the user can see what is in the file. Unfortunately, this also means that people will overreact.</p>
<p>Google, on the other hand, hid the file deep in the filesystem such that a terminal connection is necessary to reach it, and &#8220;rooting&#8221; the phone is necessary to see its content. The user has no idea that this file exists, and cannot see what Google is storing about them. This is a bit shady. On the other hand, the information that Google is collecting is very minimal and has questionable use. Data is not updated often, and is not held on disk for very long. It is also possible to clear at least the WiFi location cache file by turning WiFi off and on.</p>
<p>So, what do you think about all of this?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/04/location-tracking-on-android-too/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Instructions for Installing 64bit SciPy, Python 2.7.1 on MacOS X 10.6</title>
		<link>http://www.bytemining.com/2011/03/instructions-for-installing-64bit-scipy-python-2-7-1-on-macos-x-10-6/</link>
		<comments>http://www.bytemining.com/2011/03/instructions-for-installing-64bit-scipy-python-2-7-1-on-macos-x-10-6/#comments</comments>
		<pubDate>Mon, 28 Mar 2011 19:00:00 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=586</guid>
		<description><![CDATA[<p>NumPy and SciPy are packages for numerical computation and scientific computing in Python.</p>
<p>One wrinkle with NumPy/SciPy that needs to be ironed out is the difficulty of installation on certain OSes, and particularly, architectures. The SciPy SuperPack has done a good job of taking care of this issue, but it has not yet been updated for 2.7.1 and manually hacking away at its script has not worked for me.</p>
<p>I cannot take credit for the instructions in this article. A brave warrior, Jeremy Conlin, somehow managed to figure out how to install 64-bit NumPy and SciPy, with 64-bit Python 2.7.1 on Snow Leopard; he posted the directions to the SciPy User mailing list on February 24. I followed the directions, and miraculously they worked. I am reproducing them here for Google bait.</p>
<p>Install Python 2.7.1</p>
<p>1. Download the universal Mac 2.7.1 installer here (Python 2.7.1 Mac OS X 64-bit/32-bit x86-64/i386 Installer). Typically, Python will be installed to /Library/Frameworks/Python.framework/Versions/2.7/, but may be in other locations.</p>
<p>2. Verify that your new version of Python is 64-bit enabled. Note: Python installations typically do not get toggled as the default Python, so find the location of the 2.7.1 Python executable. On my machine, it is /Library/Frameworks/Python.framework/Versions/2.7/bin/python. python2.7 should also work.</p>
<p>Load [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>NumPy and SciPy are packages for numerical computation and scientific computing in Python.</p>
<p>One wrinkle with NumPy/SciPy that needs to be ironed out is the difficulty of installation on certain OSes, and particularly, architectures. The <a href="http://stronginference.com/scipy-superpack/">SciPy SuperPack</a> has done a good job of taking care of this issue, but it has not yet been updated for 2.7.1 and manually hacking away at its script has not worked for me.</p>
<p>I cannot take credit for the instructions in this article. A brave warrior, Jeremy Conlin, somehow managed to figure out how to install 64-bit NumPy and SciPy, with 64-bit Python 2.7.1 on Snow Leopard; he posted the <a href="http://mail.scipy.org/pipermail/scipy-user/2011-February/028567.html">directions to the SciPy User mailing list on February 24</a>. I followed the directions, and miraculously they worked. I am reproducing them here for Google bait.</p>
<p><strong>Install Python 2.7.1</strong></p>
<p>1. Download the universal Mac 2.7.1 installer <a href="http://www.python.org/ftp/python/2.7.1/python-2.7.1-macosx10.6.dmg">here</a> (Python 2.7.1 Mac OS X 64-bit/32-bit x86-64/i386 Installer). Typically, Python will be installed to <tt>/Library/Frameworks/Python.framework/Versions/2.7/</tt>, but may be in other locations.</p>
<p>2. Verify that your new version of Python is 64-bit enabled. <strong>Note: Python installations typically do not get toggled as the default Python, so find the location of the 2.7.1 Python executable. </strong>On my machine, it is <tt>/Library/Frameworks/Python.framework/Versions/2.7/bin/python</tt>. <tt>python2.7</tt> should also work.</p>
<p>Load Python 2.7.1 and execute the code below. If you get <tt>64</tt>, then you are ready to proceed.</p>
<pre class="brush: python; title: ; notranslate">
 import sys
 from math import log
 log(sys.maxsize, 2) + 1
 </pre>
<p>Another way is to execute the following. If you get 1099511627776 (and NOT 1099511627776L) you are in good shape.</p>
<pre class="brush: python; title: ; notranslate">
 2**40
 </pre>
<p>(This tip comes from <a href="http://asmeurersympy.wordpress.com/2009/11/13/how-to-get-both-32-bit/">Aaron Meurer&#8217;s SymPy blog</a>)</p>
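<p>A third check, not part of the original directions, asks the interpreter for the size of a C pointer using the standard library&#8217;s <tt>struct</tt> module. The function name <tt>pointer_bits</tt> is just for illustration; a 64-bit build should report <tt>64</tt>.</p>

```python
import struct
import sys


def pointer_bits():
    """Size of a C pointer in this interpreter, in bits (64 on a 64-bit build)."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("pointer size: %d bits" % pointer_bits())
    print("sys.maxsize is 2**63 - 1: %s" % (sys.maxsize == 2 ** 63 - 1))
```

<p>This avoids the floating-point logarithm in the first check and works the same way from a script or the interactive prompt.</p>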
<p><strong>Install gfortran</strong></p>
<p>1. Download gfortran-4.2.3 <a href="http://r.research.att.com/gfortran-4.2.3.dmg">here</a> and double-click the file to mount the disk image.<strong>&nbsp;</strong></p>
<p>2. Create a temporary directory (I call it <tt>tmp</tt> here) and run the following commands.</p>
<pre class="brush: bash; title: ; notranslate">
 cd tmp
 mkdir gfortran
 cd gfortran
 pax -zrvf /Volumes/GNU\ Fortran\ 4.2.3/gfortran.pkg/Contents/Archive.pax.gz .
 cp -r usr/local/* /usr/local
 </pre>
<p><strong>Drop Down to the Root Shell</strong></p>
<p><strong>NOTE: </strong>From this point forward, I dropped to the root command prompt (<tt>sudo su -</tt>) so that I had full control over the environment. This is mainly due to the way the shell deals with <tt>PYTHONPATH</tt> and the complications that can stem from it when installing with root privileges.</p>
<p>Locate the Python 2.7.1 distribution and find a directory that ends in <tt>site-packages</tt>. For example, for me this is <tt>/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/</tt>.</p>
<p>Assuming Bash is your shell of choice, enter the following</p>
<p><tt>export PYTHONPATH=/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/</tt></p>
<p>Or, for tcsh,</p>
<p><tt>setenv PYTHONPATH /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/</tt></p>
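<p>To see whether the interpreter actually picks up that directory, something like the following can be run with <tt>python2.7</tt>. It is a sketch only; <tt>site_packages_on_path</tt> is a made-up helper name, and it simply lists the entries of <tt>sys.path</tt> whose last component is <tt>site-packages</tt>.</p>

```python
import os
import sys


def site_packages_on_path(path_entries=None):
    """Return the entries of sys.path (or a given list) ending in site-packages."""
    entries = sys.path if path_entries is None else path_entries
    return [
        entry for entry in entries
        if os.path.basename(os.path.normpath(entry)) == "site-packages"
    ]


if __name__ == "__main__":
    print("PYTHONPATH = %s" % os.environ.get("PYTHONPATH", "(not set)"))
    for entry in site_packages_on_path():
        print(entry)
```

<p>If the directory exported above does not show up in the output, the variable was probably set in a different shell than the one running Python.</p>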
<p><strong>Install distribute-0.6.15</strong></p>
<p>The next few steps follow the typical Python package installation method. I use <tt>python2.7</tt> rather than <tt>python</tt> just to make sure that my system is using 2.7.1 and not the built-in 2.6.1.</p>
<p>Distribute adds tools for installing Python modules, including <tt>easy_install</tt>.</p>
<pre class="brush: bash; title: ; notranslate">
 curl -O http://pypi.python.org/packages/source/d/distribute/distribute-0.6.15.tar.gz
 tar -xzvf distribute-0.6.15.tar.gz
 cd distribute-0.6.15
 python2.7 setup.py install
 </pre>
<p> <strong>Install nose</strong></p>
<p>Nose is an alternate test discovery and running process for unittest, similar to py.test.</p>
<p>Installation requires <tt>easy_install</tt>. I am not a fan of <tt>easy_install</tt> in the slightest. There are two important things to remember here:</p>
<ol>
<li>PYTHONPATH must be set correctly, or easy_install will complain.</li>
<li>If you have easy_install for another version of Python, <strong>make sure you use the version for 2.7.1 and not some previous version </strong>or easy_install will complain.</li>
</ol>
<p>On my system, easy_install for 2.7.1 is located at <tt>/Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install</tt> so I use:</p>
<pre class="brush: bash; title: ; notranslate">
 /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install nose
 </pre>
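<p>To confirm that <tt>easy_install</tt> put a package where this interpreter can find it, a small check like the one below can be run with <tt>python2.7</tt>. It is a sketch; <tt>installed_version</tt> is a hypothetical helper, and it relies only on <tt>importlib</tt> from the standard library.</p>

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__, "unknown" if it has none,
    or None if this interpreter cannot import the module."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Any failure to import means the package is not usable here.
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    for name in ("setuptools", "nose"):
        print("%s: %s" % (name, installed_version(name)))
```

<p>A result of <tt>None</tt> usually means the package went into the <tt>site-packages</tt> of a different Python.</p>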
<p> <strong>Install NumPy</strong></p>
<p>Finally, we can install NumPy. Version 1.5.1 is the first version that apparently works with Python 3.</p>
<pre class="brush: bash; title: ; notranslate">
 curl -O http://downloads.sourceforge.net/project/numpy/NumPy/1.5.1/numpy-1.5.1.tar.gz
 tar xzvf numpy-1.5.1.tar.gz
 cd numpy-1.5.1
 sudo python2.7 setup.py install
 </pre>
<p> Then, open Python and execute the following to import the library, and test it. On my system, all tests passed.</p>
<pre class="brush: python; title: ; notranslate">
 import numpy
 numpy.test()
 </pre>
<p> <strong>Install SciPy</strong></p>
<p>Next, install SciPy. Version 0.9 is the first version that is compatible with Python 3.</p>
<pre class="brush: bash; title: ; notranslate">
 curl -O http://downloads.sourceforge.net/project/scipy/scipy/0.9.0/scipy-0.9.0.tar.gz
 tar xzvf scipy-0.9.0.tar.gz
 cd scipy-0.9.0
 sudo python2.7 setup.py install
 </pre>
<p> Then, open Python and execute the following to import the library, and test it. On my system, all but 14 tests passed.</p>
<pre class="brush: python; title: ; notranslate">
 import scipy
 scipy.test()
 </pre>
<p><strong>Install readline</strong></p>
<p>readline provides functions for completion and reading/writing of history files from the Python interpreter (up/down arrow).</p>
<pre class="brush: bash; title: ; notranslate">
 /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install readline
 </pre>
<p> <strong>Install IPython</strong></p>
<p>IPython provides an enhanced command line interface to the Python interpreter. It is much more pleasant to work with than the standard command line interface.</p>
<pre class="brush: bash; title: ; notranslate">
 /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install ipython
 </pre>
<p><strong>Install wxPython</strong></p>
<p>wxPython is a graphical user interface toolkit for Python. Up until recently, this was the blocker in the process &#8212; there was no 64-bit version of wxPython for Mac. Download the DMG <a href="http://downloads.sourceforge.net/project/wxpython/wxPython/2.9.1.1/wxPython2.9-osx-2.9.1.1-cocoa-py2.7.dmg">here</a> and double click the installer. Versions for Cocoa and Carbon are provided, but Python 2.7.1 apparently requires the Cocoa version.</p>
<p><strong>Install matplotlib<br />
 </strong></p>
<p>matplotlib provides the plotting interface which is crucial to SciPy. Installing matplotlib is always the most complicated part of the process.</p>
<pre class="brush: bash; title: ; notranslate">
 wget http://downloads.sourceforge.net/project/matplotlib/matplotlib/matplotlib-1.0.1/matplotlib-1.0.1.tar.gz
 tar xzvf matplotlib-1.0.1.tar.gz
 cd matplotlib-1.0.1
 </pre>
<p>Then, open the file <tt>make.osx</tt>, find lines that look like below, and make the appropriate changes</p>
<pre class="brush: bash; title: ; notranslate">
 PYVERSION=2.7
 ZLIBVERSION=1.2.5
 PNGVERSION=1.4.5
 FREETYPEVERSION=2.4.4
 </pre>
<p>Finally, delete line 63. Line 63 looks like:</p>
<pre class="brush: bash; title: ; notranslate">
 cp .libs/libpng.a . &amp;&amp;\
 </pre>
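<p>The two edits to <tt>make.osx</tt> can also be scripted. The sketch below is an assumption-laden illustration, not part of the original directions: <tt>patch_makefile</tt> is a made-up name, it finds the <tt>libpng.a</tt> copy line by its content rather than by line number, and it should be tried on a backup copy of the file first.</p>

```python
import re

# Versions quoted in the steps above.
VERSIONS = {
    "PYVERSION": "2.7",
    "ZLIBVERSION": "1.2.5",
    "PNGVERSION": "1.4.5",
    "FREETYPEVERSION": "2.4.4",
}


def patch_makefile(text, versions=None):
    """Set NAME=value lines and drop the 'cp .libs/libpng.a .' line."""
    versions = VERSIONS if versions is None else versions
    patched = []
    for line in text.splitlines():
        if line.strip().startswith("cp .libs/libpng.a ."):
            continue  # the line the directions say to delete
        match = re.match(r"^(\w+)\s*=", line)
        if match and match.group(1) in versions:
            line = "%s=%s" % (match.group(1), versions[match.group(1)])
        patched.append(line)
    return "\n".join(patched) + "\n"


if __name__ == "__main__":
    # Invented input lines, for illustration only.
    print(patch_makefile("PYVERSION=2.6\nPNGVERSION=1.2.39\n"))
```

<p>To apply it, read <tt>make.osx</tt> into a string, pass it through the function, and write the result back.</p>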
<p><strong>Within the <tt>matplotlib-1.0.1</tt> directory, download the following archives. Do not extract them!</strong></p>
<ol>
<li><a href="http://zlib.net/zlib-1.2.5.tar.gz">zlib 1.2.5</a></li>
<li><a href="http://downloads.sourceforge.net/project/libpng/libpng14/older-releases/1.4.5/libpng-1.4.5.tar.gz">libpng 1.4.5</a></li>
<li><a href="http://download.savannah.gnu.org/releases/freetype/freetype-2.4.4.tar.bz2">freetype 2.4.4</a></li>
</ol>
<p>Or execute the following commands <strong>within </strong>the matplotlib-1.0.1 directory.</p>
<pre class="brush: bash; title: ; notranslate">
 wget http://zlib.net/zlib-1.2.5.tar.gz
 wget http://downloads.sourceforge.net/project/libpng/libpng14/older-releases/1.4.5/libpng-1.4.5.tar.gz
 wget http://download.savannah.gnu.org/releases/freetype/freetype-2.4.4.tar.bz2
 </pre>
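<p>Before running the build, it may be worth confirming that all three archives really are sitting in the current directory. A minimal sketch, with the file names taken from the download links above and <tt>missing_archives</tt> as a made-up helper name:</p>

```python
import os

ARCHIVES = ("zlib-1.2.5.tar.gz", "libpng-1.4.5.tar.gz", "freetype-2.4.4.tar.bz2")


def missing_archives(directory, names=ARCHIVES):
    """Return the archive names that are not present as files in directory."""
    return [
        name for name in names
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    missing = missing_archives(".")
    print("missing: %s" % (", ".join(missing) or "none"))
```

<p>Run it from within the <tt>matplotlib-1.0.1</tt> directory; anything listed as missing needs to be downloaded again.</p>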
<p>Finally, install <tt>matplotlib-1.0.1</tt> using the following command</p>
<pre class="brush: bash; title: ; notranslate">
 PREFIX=$PYTHONROOT make -f make.osx deps mpl_install
 </pre>
<p>where PYTHONROOT is the directory containing the package tree. On my system, it is <tt>/Library/Frameworks/Python.framework/Versions/2.7/</tt>.</p>
<p>Finally, enter Python and test matplotlib.</p>
<pre class="brush: python; title: ; notranslate">
 import matplotlib.pyplot as pyplot
 pyplot.plot([1,2,3])
 pyplot.show()
 </pre>
<p>If installation was successful, a plot like the following should appear.</p>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="http://www.bytemining.com/wp-content/uploads/2011/03/matplotlib.png" alt="" width="400" height="300" /></p>
<p>Hopefully this is helpful to someone, and here&#8217;s to hoping a SciPy Superpack for Python 2.7.1 will be released soon!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/03/instructions-for-installing-64bit-scipy-python-2-7-1-on-macos-x-10-6/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>My First Few Days with RStudio</title>
		<link>http://www.bytemining.com/2011/03/my-first-few-days-with-rstudio/</link>
		<comments>http://www.bytemining.com/2011/03/my-first-few-days-with-rstudio/#comments</comments>
		<pubDate>Wed, 09 Mar 2011 06:08:24 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=532</guid>
		<description><![CDATA[<p>As most readers are probably aware, the free IDE for R, called RStudio, was recently released for general use and it immediately made huge waves within the R community. IDE stands for Integrated Development Environment. IDEs typically provide a rich set of tools for developing in some target language. For standard programming languages like C++ (VisualStudio) and Java (Eclipse or NetBeans), IDEs contain:</p>

an editor tailored to the target language. The editor typically has tab/auto-complete for variable names, functions and class methods and properties and also features syntax highlighting.
a multiple document interface (MDI) where there may be several documents opened in different tabs.
a window that interacts with the compiler, or a panel containing the console to the language, a la MATLAB, and even vanilla R&#8217;s GUI.

a debugger
a file browser and language reference.

<p>RStudio plays to this analogy very well, and makes modifications where appropriate. RStudio provides many features that are lacking in the standard R GUI, and improves on features that do not work properly in the Windows R GUI. Over the past few days, I have been doing all of my R analysis within RStudio, shortly with the Desktop version, and mostly with the Server version. I will discuss mostly the server version [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>As most readers are probably aware, the free IDE for <a href="http://www.r-project.org">R</a>, called <a href="http://www.rstudio.org">RStudio</a>, was recently released for general use and it immediately made huge waves within the R community. IDE stands for <a href="http://en.wikipedia.org/wiki/Integrated_development_environment">Integrated Development Environment</a>. IDEs typically provide a rich set of tools for developing in some target language. For standard programming languages like C++ (VisualStudio) and Java (Eclipse or NetBeans), IDEs contain:</p>
<ul>
<li>an editor tailored to the target language. The editor typically has tab/auto-complete for variable names, functions and class methods and properties and also features <a href="http://en.wikipedia.org/wiki/Syntax_highlighting">syntax highlighting</a>.</li>
<li>a <a href="http://en.wikipedia.org/wiki/Multiple_document_interface">multiple document interface (MDI)</a> where there may be several documents opened in different tabs.</li>
<li>a window that interacts with the compiler, or a panel containing the console to the language, a la MATLAB, and even vanilla R&#8217;s GUI.
</li>
<li>a <a href="http://en.wikipedia.org/wiki/Debugger">debugger</a></li>
<li>a file browser and language reference.</li>
</ul>
<p>RStudio plays to this analogy very well, and makes modifications where appropriate. RStudio provides many features that are lacking in the standard R GUI, and improves on features that do not work properly in the Windows R GUI. Over the past few days, I have been doing all of my R analysis within RStudio, briefly with the <a href="http://www.rstudio.org/download/desktop">Desktop version</a>, and mostly with the <a href="http://www.rstudio.org/download/server">Server version</a>. I will discuss mostly the server version since that is what I have been using. It is identical (AFAIK) to the desktop version, so you are not missing anything by using either version.</p>
<h3>RStudio Server</h3>
<div class="smallbq">The biggest win for me with RStudio is the Server edition. I can  access my work on any system that can communicate with  the server. The  interface always looks the same, and all I need is a  web browser to  access it. Before RStudio Server Edition, I had to run  two versions of R: R GUI on my local machine for graphics and  presentation, and a headless R on a research server for processing,  where the server contained my data and the rest of my workflow. <strong>I no longer need to run multiple versions of R in my workflow.</strong></div>
<p></p>
<p>First, installation is miraculously easy<em>. </em>I only had a few very minor glitches to deal with. Armed with <tt>sudo</tt> access to a machine on a research cluster at work, I was able to simply download the RPM and install it using the <a href="http://www.rstudio.org/download/server">instructions provided on the web site</a>. Then, all I had to do was fire up a browser and go to</p>
<p><tt>http://servername.com:8787</tt></p>
<p>and I was asked for my login credentials. But I couldn&#8217;t get in. This server authenticates using LDAP, but all I had to do was replace the contents of <tt>/etc/pam.d/rstudio</tt> with the contents of <tt>/etc/pam.d/login</tt> and I was able to login. But then there was an &#8220;unknown error.&#8221; Oh, the version of R that was installed was too old (2.8). I just did a <tt>yum upgrade R</tt>, and RStudio logged me in with no problems. What showed up on my browser screen was beautiful! It looked <em>identical </em>to the desktop version of RStudio.</p>
<div align="center"><b><img src="http://www.bytemining.com/wp-content/uploads/2011/03/rstudio.png" alt="" height="65%" width="65%" /><br />
</b></div>
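<p>If the login page does not load at all, it can help to first confirm that something is listening on port 8787 before digging into PAM or the R version. The following is a generic TCP check, not anything specific to RStudio; <tt>port_is_open</tt> is a made-up helper name and the host is whatever server you installed on.</p>

```python
import socket
import sys


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    connection.close()
    return True


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python port_check.py servername.com
    print(port_is_open(sys.argv[1], 8787))
```

<p>A result of <tt>False</tt> points at the service or a firewall rather than at authentication.</p>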
<p>
Once logged in, I somehow have access to ALL of my files on the <em>remote server</em>. I can load my data (typically produced by <a href="http://hadoop.apache.org">Hadoop</a>) already residing on the server, and I can save output, graphs, data and even the R session itself <em>on the remote server! </em>All while just clicking buttons. No commands to remember, no screwed up PDF files, and most importantly&#8230;. no <tt>scp</tt>ing files back and forth from the server just to create a plot (X worked well, but had limitations)!</p>
<p><b>Things I Love about RStudio</b></p>
<p>I will have to go panel by panel, but even then I will have missed cool features. I also will not discuss features that are already present in the MacOS X R GUI and are repeated and beautified in RStudio:<br />
<em>The R command prompt still looks the same. </em>At first, my reaction was &#8220;Damn, what am I supposed to do?&#8221; But when the GUI finished loading, the familiar R command prompt appeared in all its 1970ish glory. I immediately started typing commands and seeing fields in the other panes populate and change to display different usages. It left me with an &#8220;oh, I see&#8221; feeling. </p>
<p><em>Saves R sessions correctly</em>, and when I return to RStudio, ALL of my work is there! I could never get the save session/image function to work in R GUI. I gave up several years ago. In RStudio, it works properly, but you don&#8217;t even need it because&#8230; when you leave RStudio and then return, everything is there! The workspace (variables, functions, data, etc), the scripts you were working on, the plots, even the last dang help screen you looked at!</p>
<p><em>The Stop Execution button in the console actually works. </em>When executing a long running computation in R GUI (that&#8217;s the first mistake), it is sometimes necessary to cancel the computation either because I made an error, or because the computation is killing my system&#8217;s performance. In R GUI, particularly on MacOS X, the Stop Execution button did absolutely nothing, because there was typically a spinning beachball preventing me from clicking it. Hitting ESC also did not work. In RStudio, clicking Stop actually seems to break out of the madness.</p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/workspace.png" alt="" class="rfloat" height="137" width="371" /><em>Workspace panel.</em> The workspace panel displays the variables, functions, data frames and other objects that reside in the current workspace, a la MATLAB. From this panel, one can also switch or save workspaces. The user can also import a dataset from a text file using a trivial wizard (a la SPSS, etc.), or from a web URL. The user can also clear the workspace. A frequently overlooked command to do the same from the command line is rm(list=ls()), but that is no longer necessary to remember! </p>
<p>Clicking on a data frame object in the workspace pane, causes it to be displayed in a nice tabular format. It can also be printed to a local printer, or opened in a new window.</p>
<p>Clicking on a numerical value allows the user to change it by opening an in-place edit box. Clicking on other objects like lists, vectors and functions opens an edit window displaying the definition that created it.
</div>
<p><br style="clear: both;" /></p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/files.png" alt="" class="lfloat" height="140" width="459" /><br />
<em>Files panel.</em> There is nothing really exciting to see here, <strong>except that by clicking the Upload button, I can upload files directly to the <em>remote server </em>just by selecting the file, <em>without </em>having to SCP!</strong>
</div>
<p><br style="clear: both;" /></p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/script.png" alt="" class="rfloat" height="201" width="456" /><em>Scripting panel.</em> This is the second best feature of RStudio and has the same feeling as the stock script editor that ships with R. The largest difference is that the editor in RStudio is stable. On MacOS X, the editor tends to garble 2-3 rows of code together on every single scroll. This editor does a better job of indentation than R GUI. When opening a function, R GUI tends to indent the body properly, but insert a closing } prematurely. RStudio&#8217;s editor also features auto-completion, a feature present in the command-line of R GUI and R, but not in the editor of R GUI. The user can also save their script <em>on the remote server</em>, print code to a local printer and search. Similar to MATLAB, the user can select one or more lines of code and run them by clicking the &#8220;Run Line(s)&#8221; button, rather than having to copy and paste lines. &#8220;Run All&#8221; is a point-and-click replacement for source. </p>
<p>The &#8220;Source on Save&#8221; function is interesting. If enabled, RStudio will run/source the script each time the script is saved. Honestly, I do not find this feature to be all that useful unless in the middle of debugging, and dangerous if not debugging. Suppose after a long 10-fold-cross-validation computation there is an error that we want to fix. We fix the error and save the script. Do we really want to run the computation again? If R were a compiled language, then yes. Since R is not a compiled language, this feature is not entirely useful in concept.</p>
<p>The &#8220;magic wand&#8221; icon contains what I suspect to be a growing collection of coding tools. Currently, the user can comment and uncomment a bunch of lines at once. This is particularly useful since, for some reason, there is no multiline comment flag in R. The user can also select a series of lines and wrap a function around them. This feature could be dangerous for those not familiar with coding but provides a very nice way to put a bunch of code into a function as an afterthought.
</div>
<p><br style="clear: both;" /></p>
<div style="vertical-align: middle;">
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/plots.png" alt="" class="lfloat" height="300px" width="433px" /><br />
<em>Plot panel. </em>By far my favorite part of RStudio is the plot panel! All plots are saved in this panel, and the user can move back and forth among plots that were already constructed. <strong>The Export button allows exporting a plot at user-defined dimensions and saving it to the local machine as a PNG, or even copying it to the local machine&#8217;s clipboard! </strong>Of course, the PDF button produces a PDF file of the plot that can be saved on the local machine. If the plots are all too much, we can click &#8220;Clear All&#8221; and start again with a clean slate.</p>
<p>But is it possible to create plots of larger size? I am sure it is, but I did not spend much time looking.
</p></div>
<p><br style="clear: both;" /></p>
<p><em>LaTeX and Sweave documents.</em> From the File menu the user can create new documents including <a href="http://www.latex-project.org/">LaTeX</a> and <a href="http://www.statistik.lmu.de/%7Eleisch/Sweave/">Sweave</a>. Unfortunately, I cannot experiment more with these features because there is something amiss in my configuration. For students and researchers, having Sweave and LaTeX integrated with RStudio is a huge, huge, huge advantage. No longer must we copy/paste among different programs. <strong>To make the integration complete, <a href="http://www.bibtex.org/">BibTeX</a>, <a href="http://asymptote.sourceforge.net/">Asymptote</a>/<a href="http://www.texample.net/tikz/">TikZ</a>/<a href="http://www.gnuplot.info/">gnuplot</a> whatever should be easily included by the user.</strong><br />
<i><br />
At any point, if the user interface shows stale data, there is a Reload button to help you out by refreshing the entire RStudio interface.</i><br />
<b><br />
Things that Need Improvement</p>
<p></b>I do not really have any complaints about RStudio, quite the opposite actually. However, there are some things that do not seem to work. I should note, however, that I have not spent much (well, any) time debugging them. The developers are probably already working on some of them. Some of them are probably problems in my configuration and others are probably settings that I need to tweak.<br />
<b><br />
</b><em>No auto-completion of parentheses or quotation marks. </em>This is a bummer, but not a deal breaker. On the other hand, as you type closing marks, RStudio highlights the matching mark.<br />
<b><br />
</b><em>The dataset view needs work. </em>Columns can&#8217;t be resized. Other natural functionalities that seem to be missing are column renaming (a call to <tt>names</tt>), sorting or ordering values by a column, and data manipulation (I didn&#8217;t say that). These missing features are a tad disappointing, but a hell of a lot better than displaying in the terminal.</p>
<p><em>Install packages in the packages panel does not work </em>on our server&#8217;s configuration.<br />
<b><br />
</b><em>LaTeX cannot be found</em>. Upon attempting to create a new LaTeX or Sweave document, I got a friendly notice (instead of a bizarre error message) saying that LaTeX is not installed. The problem is, it is installed and there does not seem to be anywhere in the GUI to configure its location. <strong>Additionally, some LaTeX templates would be useful.</strong></p>
<p><b><br />
</b></p>
<h3>In Conclusion&#8230;</h3>
<p><strong>My Workflow Before and After RStudio</strong></p>
<div>
Before RStudio<br />
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/before.png" alt="" class="lfloat" /><br />
<br style="clear: both;" /><br />
After RStudio<br />
<br style="clear: both;" /><br />
<img src="http://www.bytemining.com/wp-content/uploads/2011/03/after.png" alt="" class="lfloat" />
</div>
<p><br style="clear: both;" /></p>
<div class="smallbq">All in all, the biggest win for me with RStudio is the Server edition. I can access my work on any system that can communicate with the server. The interface always looks the same, and all I need is a web browser to access it. I no longer need to run multiple versions of R in my workflow.</div>
<p>The developers of this open source project seemed to get it right on the first try. How the hell is that possible??? So has anyone switched from the big R to the big blue ball?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/03/my-first-few-days-with-rstudio/feed/</wfw:commentRss>
		<slash:comments>29</slash:comments>
		</item>
		<item>
		<title>Web Mining Pitfalls</title>
		<link>http://www.bytemining.com/2011/02/web-mining-pitfalls/</link>
		<comments>http://www.bytemining.com/2011/02/web-mining-pitfalls/#comments</comments>
		<pubDate>Thu, 03 Feb 2011 18:00:15 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Mining]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=510</guid>
		<description><![CDATA[<p>Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.</p>
<p>Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the RSS Advisory Board developed a convention for the formatting of web pages so that browsers can automatically discover the links to the site&#8217;s RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.</p>
<p>Always Have a Plan B, C, D, &#8230;</p>
<p>One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.</p>
<p>Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the <a href="http://www.rssboard.org">RSS Advisory Board</a> developed a <a href="http://www.rssboard.org/rss-autodiscovery">convention for the formatting of web pages so that browsers can automatically discover the links to the site&#8217;s RSS feeds</a>. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.</p>
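<p>The autodiscovery convention boils down to a <tt>link</tt> element in the page head with <tt>rel="alternate"</tt> and <tt>type="application/rss+xml"</tt>. Here is a minimal sketch of reading it with Python 3&#8217;s standard <tt>html.parser</tt> module. The names <tt>FeedLinkFinder</tt> and <tt>find_feeds</tt> and the sample page are made up for illustration, and this is not necessarily the code behind the numbers quoted here.</p>

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of RSS autodiscovery <link> elements."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        # Attribute values can be None (e.g. <link rel>), so guard with "or".
        rel = (attrs.get("rel") or "").lower().split()
        mime = (attrs.get("type") or "").lower()
        if "alternate" in rel and mime == "application/rss+xml" and attrs.get("href"):
            self.feeds.append(attrs["href"])


def find_feeds(html):
    """Return the feed URLs advertised by a page, in document order."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/feed/">
</head><body></body></html>
"""

print(find_feeds(SAMPLE_PAGE))
```

<p>A page that does not follow the convention simply yields an empty list, which is the cue to fall back to another strategy.</p>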
<p><strong>Always Have a Plan B, C, D, &#8230;</strong></p>
<p>One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery of bias. Could there be something special about these sites that do not implement RSS autodiscovery? Clearly, there are exceptions to my Plan A. So time to move to Plan B. I found that some of the sites in this 5% used FeedBurner to index their feeds, so Plan B was to use a regular expression to extract FeedBurner URLs. This only added another 1% (actually less than that) to my coverage.</p>
<pre class="brush: python; title: ; notranslate">((?:http|feed)://feeds.feedburner.com/.*?)</pre>
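<p>To show how a pattern like this might be applied to a downloaded page, here is a small Python 3 sketch. A lazy <tt>.*?</tt> at the very end of a pattern is free to match nothing, so the sketch ends the URL at the first quote, whitespace or angle bracket instead. That delimiter choice, the escaped dots and the name <tt>feedburner_urls</tt> are my assumptions for illustration.</p>

```python
import re

# Assumption: a feed URL ends at the first quote, whitespace, or angle bracket.
FEEDBURNER_RE = re.compile(r"""((?:http|feed)://feeds\.feedburner\.com/[^"'\s<>]+)""")


def feedburner_urls(html):
    """Return FeedBurner URLs found in html, in order, without duplicates."""
    seen = []
    for url in FEEDBURNER_RE.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


# Hypothetical markup used only to exercise the sketch.
page = '<a href="http://feeds.feedburner.com/ExampleFeed">Subscribe</a>'
print(feedburner_urls(page))
```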
<p>Next, Plan C took the domain name, simply slapped <tt>/feed</tt> onto it and hoped it stuck. I called this process &#8220;feed probing&#8221; and it added the remaining 3% that I was looking for. If Plans A, B and C all failed to find a suitable RSS feed, all hope was lost and I just skipped the site (1% error).</p>
<p><em>On the other hand, there are times when it is the HTTP client or server that cannot be trusted&#8230;</em></p>
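<p>Building the URL to probe is the easy part. A rough Python 3 sketch, assuming the probe keeps only the scheme and host of the site and that <tt>/feed</tt> is the only suffix tried (the function name <tt>probe_url</tt> is hypothetical):</p>

```python
from urllib.parse import urlsplit, urlunsplit


def probe_url(site_url, suffix="/feed"):
    """Build the URL to probe: scheme and host of site_url plus suffix."""
    parts = urlsplit(site_url)
    # Drop the original path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, suffix, "", ""))


print(probe_url("http://www.example.com/about/team.html?x=1"))
```

<p>Whether the probe &#8220;sticks&#8221; still has to be checked by fetching the resulting URL and confirming that what comes back is actually a feed.</p>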
<p><strong>Common Python Exceptions in Web Mining</strong></p>
<p>It is all too common to encounter an exception while web mining or crawling. Code must handle these errors gracefully by catching exceptions or failing without aborting. One method that works well is to provide a resume mechanism that restarts execution where the code left off, rather than having to start a multiple hour/day/week job over again! Below is a taxonomy of common problems (and their Python exceptions):</p>
<p><em>HTTP Errors. </em>These occur frequently. Some are recoverable, and others are worth just throwing out a record over. The most common ones are below, but for more information, refer to <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">RFC 2616</a>.</p>
<ul>
<li>404 Not Found: the web page you tried to download could not be found. Skip the record.</li>
<li>400 Bad Request: the server has deemed the client&#8217;s HTTP request as malformed. Either retry, or double check your code!
</li>
<li>401 Unauthorized: authentication is required before proceeding. Either skip the record, or add authentication to your code.</li>
<li>403 Forbidden: you are trying to access something you are not allowed to access. This is a common HTTP error thrown when your program is being <a href="http://en.wikipedia.org/wiki/Rate_limiting"><em>rate limited</em></a>.</li>
<li>500 Internal Server Error: something on the server end is wrong. Either try again immediately, or wait a while and retry the request.</li>
<li>3xx Redirect: the web page you are trying to access has moved somewhere else. These were rare in my research but are common in practice. I choose to skip these sites. You may wish not to.
</li>
</ul>
<p>In Python, these can be caught as <tt>urllib2.HTTPError</tt>. It is also possible to specify actions based on the specific HTTP error code returned:</p>
<pre class="brush: python; title: ; notranslate">
import urllib2

try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    # e.code holds the numeric HTTP status returned by the server
    if e.code == 404:
        print &quot;Not Found&quot;
    elif e.code == 500:
        print &quot;Internal Server Error&quot;
</pre>
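<p>The status codes in the list above can also be folded into a small lookup table, so the crawl loop only has to ask what to do next. This is one possible sketch, written for Python 3. The names <tt>ACTIONS</tt> and <tt>action_for</tt> are made up, and treating 403 as a signal to slow down and skipping any unlisted code are assumptions of the sketch, not rules from RFC 2616.</p>

```python
SKIP, RETRY, BACK_OFF = "skip", "retry", "back off"

ACTIONS = {
    400: RETRY,     # malformed request: retry (or double check the code)
    401: SKIP,      # authentication required
    403: BACK_OFF,  # assumption: treat as rate limiting and slow down
    404: SKIP,      # page not found
    500: RETRY,     # server trouble: try again, possibly after a wait
}


def action_for(status):
    """Map an HTTP status code to what the crawler should do next."""
    if 300 <= status < 400:
        return SKIP  # redirects are skipped, as in the list above
    # Assumption: anything not listed is not worth another attempt.
    return ACTIONS.get(status, SKIP)


for code in (404, 500, 403, 301):
    print(code, action_for(code))
```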
<p><em>Server Errors &#8220;URLError&#8221;. </em>These occur frequently as well and seem to denote some sort of server or connection trouble, such as &#8220;Connection refused&#8221; or a site that does not exist. Usually, these are resolved by retrying the fetch. <i>In Python, it is very important to note that <tt>HTTPError</tt> is a subclass of <tt>URLError</tt>, so when handling both exceptions distinctly, <tt>HTTPError</tt> must be caught first.<b></p>
<p></b></i>
<pre class="brush: python; title: ; notranslate">
try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    ...
except urllib2.URLError, f:
    print f.reason
</pre>
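<p>The ordering rule is easy to check without touching the network. The sketch below uses Python 3, where <tt>urllib2</tt>&#8217;s exceptions live in <tt>urllib.error</tt>. The two helper functions and the hand-built exception objects are hypothetical and exist only to show which handler wins.</p>

```python
import urllib.error


def classify(exc):
    """Name the handler picked when HTTPError is listed first."""
    try:
        raise exc
    except urllib.error.HTTPError:
        return "HTTPError"
    except urllib.error.URLError:
        return "URLError"


def classify_wrong_order(exc):
    """Same handlers in the wrong order: the parent class swallows everything."""
    try:
        raise exc
    except urllib.error.URLError:
        return "URLError"
    except urllib.error.HTTPError:  # never reached
        return "HTTPError"


# Hand-built exceptions, so no request is made.
not_found = urllib.error.HTTPError("http://example.com/", 404, "Not Found", None, None)
refused = urllib.error.URLError("Connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(classify(not_found), classify(refused))
print(classify_wrong_order(not_found))
```

<p>With the handlers in the wrong order, a 404 is reported as a plain <tt>URLError</tt> and the status code is never inspected.</p>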
<p><b><br />
</b><em>Other Bizarreness. </em>The web is very chaotic. Sometimes weird stuff happens. The rare, elusive <tt>httplib.BadStatusLine</tt> exception technically means that the server returned an error code that the client does not understand, but it can also be thrown <a href="http://www.voidspace.org.uk/python/articles/urllib2.shtml#badstatusline-and-httpexception">when the page being fetched is <em>blank</em></a>. On a recent project, I ran into a new one: <tt>httplib.IncompleteRead</tt> which has little documentation. <em>Both of these issues can usually be resolved by retrying the fetch.</em> Both of these pesky errors (and more) can be handled by simply catching their parent exception: <tt>httplib.HTTPException</tt>.</p>
<pre class="brush: python; title: ; notranslate">
import httplib
import urllib2

try:
    content = urllib2.urlopen(url).read()
except httplib.HTTPException:
    # you've encountered a rare beast. You win a prize!
    pass  # an except block needs at least one statement
</pre>
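<p>The parent/child relationship can be confirmed directly. In Python 3 the module is called <tt>http.client</tt> rather than <tt>httplib</tt>. The sketch below also shows why these exceptions need their own handler: <tt>HTTPException</tt> does not descend from <tt>URLError</tt>. The helper name <tt>is_rare_beast</tt> is made up.</p>

```python
import http.client
import urllib.error

# Both "rare beasts" share http.client.HTTPException as a parent.
for exc_type in (http.client.BadStatusLine, http.client.IncompleteRead):
    print(exc_type.__name__, issubclass(exc_type, http.client.HTTPException))

# HTTPException is a separate family, so catching URLError alone misses it.
print(issubclass(http.client.HTTPException, urllib.error.URLError))


def is_rare_beast(exc):
    """True for any exception from the http.client family."""
    return isinstance(exc, http.client.HTTPException)
```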
<p><b>Everything Deserves a Second Chance</p>
<p></b>One common reaction to any error is to just throw the record out. <tt>URLError</tt>s are so common that it is probably unwise to do that if you are using the data for something. Typically, these errors go away if you try again. I use the following loop to catch errors and react appropriately.</p>
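<pre class="brush: python; title: ; notranslate">
import httplib
import urllib2

HTTP_RETRIES = 3  # assumed value; pick whatever suits the crawl

for url in urls:  # urls: hypothetical list of pages to fetch
    temp = None
    attempt = 0
    while attempt &lt; HTTP_RETRIES:
        attempt += 1
        try:
            temp = urllib2.urlopen(url).read()
            break       # success
        except urllib2.HTTPError:
            break       # HTTP error (404, 403, ...): give up on this URL
        except urllib2.URLError:
            continue    # connection trouble: retry
        except httplib.HTTPException:
            continue    # BadStatusLine, IncompleteRead, ...: retry
    if temp is None:
        continue        # nothing was fetched: skip this record
    # ... process temp here ...
</pre>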
<p>This code attempts to fetch the URL a maximum of <tt>HTTP_RETRIES</tt> times. If the fetch is successful, Python breaks out of the loop. If a <tt>URLError</tt> or <tt>HTTPException</tt> occurs, we move on to another attempt of the fetch. If we encounter an HTTP error (not found, restricted, etc.), we give up. Depending on the error, we can modify the code to retry on certain errors, and abort on others, but for my purposes, I do not care.</p>
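<p>For anyone who does care about retrying on some HTTP errors and aborting on others, here is one way the loop might be adapted in Python 3. It is a sketch under assumptions: the set <tt>RETRY_STATUSES</tt>, the function names and the injectable <tt>fetch</tt> argument (which makes the logic testable without a network) are all my own inventions.</p>

```python
import http.client
import urllib.error
import urllib.request

RETRY_STATUSES = {500, 503}  # assumption: statuses that deserve another try


def default_fetch(url):
    """Fetch url and return its body as bytes."""
    with urllib.request.urlopen(url) as handle:
        return handle.read()


def fetch_with_retries(url, retries=3, fetch=default_fetch):
    """Return the page body, or None if the URL should be skipped."""
    for _ in range(retries):
        try:
            return fetch(url)
        except urllib.error.HTTPError as e:
            # Must come before URLError, its parent class.
            if e.code not in RETRY_STATUSES:
                return None  # e.g. 404: give up immediately
        except (urllib.error.URLError, http.client.HTTPException):
            pass  # connection trouble or a "rare beast": try again
    return None  # every attempt failed
```

<p>A caller can then write <tt>body = fetch_with_retries(url)</tt> and skip the record whenever the result is <tt>None</tt>.</p>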
<p><b>The Comatose Crawler</p>
<p></b>If you have ever done a large scale crawl on a web site, you are bound to encounter a state where your crawler becomes comatose &#8211; it is running, maybe using system resources, but is not outputting anything or reporting progress. It looks like an infinite no-op loop. I have encountered this problem since I started doing web mining in 2006 and did not, until just this past weekend, realize exactly why it was happening and how to prevent it.</p>
<p>Your crawler has sunk in a swamp, and is essentially trapped. For whatever reason, the HTTP server your code is communicating with maintains an open connection, but sends no data. I suppose this could be a deadlock-type situation where the HTTP server is waiting for an additional request (?), and the crawler is waiting for output from the HTTP server. <em>It was my misunderstanding that the HTTP protocol had a built-in timeout, and I was relying on it.</em> This is apparently not the case. There is a simple way to avoid this swamp, by setting a timeout on the socket sending the HTTP request:</p>
<pre class="brush: python; title: ; notranslate">
import socket
import urllib2
...
HTTP_TIMEOUT = 5  # seconds
# applies to every socket created from here on, including urllib2's
socket.setdefaulttimeout(HTTP_TIMEOUT)
...
handle = urllib2.urlopen(&quot;http://www.google.com&quot;)
content = handle.read()
...
</pre>
<p>If a request to the socket goes unanswered for <tt>HTTP_TIMEOUT</tt> seconds, Python throws a <tt>urllib2.URLError</tt> exception that can be caught. In my code, I just skip these troublemakers.</p>
<pre>Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in ?
  File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib64/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.4/urllib2.py", line 376, in _open
    '_open', req)
  File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.4/urllib2.py", line 1021, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.4/urllib2.py", line 996, in do_open
    raise URLError(err)
urllib2.URLError: &lt;urlopen error timed out&gt;
</pre>
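<p>In Python 3, <tt>urlopen</tt> also accepts a per-call <tt>timeout</tt> argument, which avoids changing the default for every socket in the process. A minimal sketch, assuming that a failed or stalled fetch should simply be skipped (the function name <tt>fetch_or_none</tt> is hypothetical). A stalled read may surface as <tt>socket.timeout</tt> rather than being wrapped in <tt>URLError</tt>, so the sketch catches both.</p>

```python
import socket
import urllib.error
import urllib.request

HTTP_TIMEOUT = 5  # seconds, as in the snippet above


def fetch_or_none(url, timeout=HTTP_TIMEOUT):
    """Return the body of url, or None if the fetch fails or stalls."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as handle:
            return handle.read()
    except (urllib.error.URLError, socket.timeout):
        # Refused, unreachable, HTTP error, or silent for too long: skip it.
        return None
```

<p>A crawler that calls <tt>fetch_or_none</tt> in its main loop cannot sink into the swamp: the worst case per URL is bounded by the timeout.</p>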
<p>With enough experience, dedication, blood, sweat, tears, and caffeine, data mining the jungle known as the World Wide Web becomes both simple and fun. Happy web mining!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/02/web-mining-pitfalls/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>40 Fascinating Blogs for the Ultimate Statistics Geek!</title>
		<link>http://www.bytemining.com/2011/01/40-fascinating-blogs-for-the-ultimate-statistics-geek/</link>
		<comments>http://www.bytemining.com/2011/01/40-fascinating-blogs-for-the-ultimate-statistics-geek/#comments</comments>
		<pubDate>Thu, 20 Jan 2011 08:00:28 +0000</pubDate>
		<dc:creator>Ryan Rosario</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics and Statistical Computing]]></category>

		<guid isPermaLink="false">http://www.bytemining.com/?p=503</guid>
		<description><![CDATA[<p>I am happy to report that ByteMining is listed on &#8220;40 Fascinating Blogs for the Ultimate Statistics Geek&#8221;!</p>
<p>Some of the ones that I frequently read, or are written by Twitter friends/followers (in no particular order):</p>

R-bloggers, an aggregate site containing blog posts tagged as posts about R. High quality content. 
Statistical modeling, causal inference and social science. This one is a no-brainer, as it is the blog for Andrew Gelman&#8217;s group.
FlowingData by Nathan Yau (@flowingdata), fellow Statistics Ph.D. student at UCLA. Focuses on the data and information visualization side of Data Science.
dataists by Hilary Mason (@hmason, bit.ly), Vince Buffalo (@vsbuffalo, UC Davis),
Drew Conway (@drewconway, NYU), Mike Dewar (@mikedewar, Columbia),
John Myles White (@johnmyleswhite, Princeton) and others.
A new blog on several aspects of Data Science including Data Mining, visualization and uses of Statistics in current events. Heavy use of R and ggplot2.
Revolutions by Revolution Analytics provides a variety of content around R, Data Science and Statistics in general.
FiveThirtyEight by Nate Silver shares sophisticated modeling and analysis of elections and government happenings. It is in a different realm, as it attracts political news junkies (and the occasional extremist) rather than just Statisticians.
LoveStats by Annie Pettit, Ph.D. (@LoveStats) discusses Statistics as used in Social [...]]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>I am happy to report that ByteMining is listed on &#8220;<a href="http://www.bschool.com/blog/2011/40-fascinating-blogs-for-the-ultimate-statistics-geek/">40 Fascinating Blogs for the Ultimate Statistics Geek</a>&#8221;!</p>
<p>Some of the ones that I frequently read, or are written by Twitter friends/followers (in no particular order):</p>
<ul>
<li><a href="http://www.r-bloggers.com/">R-bloggers</a>, an aggregate site containing blog posts tagged as posts about R. High quality content. </li>
<li><a href="http://www.stat.columbia.edu/%7Egelman/blog/">Statistical modeling, causal inference and social science</a>. This one is a no-brainer, as it is the blog for <a href="http://www.stat.columbia.edu/%7Egelman/">Andrew Gelman</a>&#8217;s group.</li>
<li><a href="http://www.flowingdata.com">FlowingData</a> by Nathan Yau (<a href="http://twitter.com/flowingdata">@flowingdata</a>), fellow Statistics Ph.D. student at UCLA. Focuses on the data and information visualization side of Data Science.</li>
<li><a href="http://www.dataists.com">dataists</a> by Hilary Mason (<a href="http://twitter.com/hmason">@hmason</a>, <a href="http://bit.ly">bit.ly</a>), <a href="http://www.vincebuffalo.com/">Vince Buffalo</a> (<a href="http://twitter.com/vsbuffalo">@vsbuffalo</a>, UC Davis),<br />
<a href="http://www.drewconway.com/zia/">Drew Conway</a> (<a href="http://twitter.com/drewconway">@drewconway</a>, NYU), <a href="http://www.mikedewar.org">Mike Dewar</a> (<a href="http://twitter.com/mikedewar">@mikedewar</a>, Columbia),<br />
<a href="http://www.johnmyleswhite.com/">John Myles White</a> (<a href="http://twitter.com/johnmyleswhite">@johnmyleswhite</a>, Princeton) and others.<br />
A new blog on several aspects of Data Science including Data Mining, visualization and uses of Statistics in current events. Heavy use of R and <a href="http://had.co.nz/ggplot2/">ggplot2</a>.</li>
<li><a href="http://blog.revolutionanalytics.com/">Revolutions</a> by <a href="http://www.revolutionanalytics.com/">Revolution Analytics</a> provides a variety of content around R, Data Science and Statistics in general.</li>
<li><a href="http://fivethirtyeight.blogs.nytimes.com/">FiveThirtyEight</a> by <a href="http://en.wikipedia.org/wiki/Nate_Silver">Nate Silver</a> shares sophisticated modeling and analysis of elections and government happenings. It is in a different realm, as it attracts political news junkies (and the occasional extremist) rather than just Statisticians.</li>
<li><a href="http://lovestats.wordpress.com/">LoveStats</a> by Annie Pettit, Ph.D. (<a href="http://twitter.com/LoveStats">@LoveStats</a>) discusses Statistics as used in Social Media and Market Research.</li>
<li><a href="http://www.johndcook.com/blog/">The Endeavor</a>, by <a href="http://www.johndcook.com/">John D. Cook</a> (<a href="http://twitter.com/johndcook">@johndcook</a>) is more about Mathematics than Statistics, but he posts great stuff including math trivia, hobbyist Mathematics, and philosophy.</li>
</ul>
<p>You can see the full list of all 40 blogs <a href="http://www.bschool.com/blog/2011/40-fascinating-blogs-for-the-ultimate-statistics-geek/">here</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.bytemining.com/2011/01/40-fascinating-blogs-for-the-ultimate-statistics-geek/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
